WO2019236560A1 - Pair-wise or n-way learning framework for error and quality estimation - Google Patents

Info

Publication number
WO2019236560A1
Authority
WO
WIPO (PCT)
Prior art keywords
human
error
preference
data
image
Prior art date
Application number
PCT/US2019/035363
Other languages
French (fr)
Inventor
Pradeep Sen
Yasamin Mostofi
Ekta Prashnani
Hong Cai
Original Assignee
The Regents Of The University Of California
Application filed by The Regents Of The University Of California filed Critical The Regents Of The University Of California
Publication of WO2019236560A1 publication Critical patent/WO2019236560A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • Fields of the invention include computer vision, image analysis, computer auditory analysis and speech recognition, error detection, data quality evaluation, recommendation systems, machine learning and artificial intelligence.
  • Example applications of the invention include image compression/coding, restoration, adaptive reconstruction, machine vision systems, image processing pipelines (such as in cameras and image modification/editing software), image rendering systems, advertisement/media/shopping recommendation systems, and gaming systems.
  • a fundamental hurdle for these and other similar artificial intelligence systems is the automatic computation of the fundamental perceptual quality score of a distorted image, good, or service (represented by a data object), which is required in order to rank them for the user. It is also difficult to automatically compute the perceptual similarity (or distance/error) of an image, good, or service with respect to a corresponding reference in a way that models human observers’ perception, which is necessary, for example, when recommending products that are similar to a given one.
  • FR-IQA: full-reference image-quality assessment
  • the accuracy of such prior learning-based methods depends on the size and quality of the datasets they are trained on, and existing IQA datasets are small and noisy. For instance, many datasets are labeled using a mean opinion score (MOS), where each user gives the distorted image a subjective quality rating (that can correspond, for example, to a score of between one and ten, where one is the worst quality score and ten is a perfect image score). These individual scores are then averaged in an attempt to reduce noise. Unfortunately, creating a good IQA dataset in this fashion is difficult because humans cannot assign quality or error labels to a distorted image consistently, even when comparing to a reference image that is designated as a perfect image.
  • TID2008 and TID2013 leverage the fact that it is much easier for people to select which image from a distorted pair is closer to the reference than to have people assign them quality scores.
  • To translate user preferences into quality scores, such techniques then select a set of distorted images and apply a Swiss tournament to assign scores to each.
  • a fundamental problem with this approach is that the same distorted image can have varying scores in different sets.
  • the number of images and distortion types in all of these datasets is very limited.
  • the largest dataset in published methods that is known to the present inventors (TID2013) has only 25 images and 24 distortions. Therefore, machine learning-based methods trained on these datasets have limited generalizability to new distortions, which makes such machine learning approaches fall short of the human perceptual ability to compare images, even for simple comparisons.
  • PEAQ: Perceptual Evaluation of Audio Quality
  • ITU: International Telecommunication Union
  • PEAQ is a computational model that takes as input a reference signal and a degraded signal, and outputs a score ranging from 1 to 5 (e.g., 5: excellent, 1: bad), by utilizing psychoacoustic models and cognitive models.
  • Some researchers have proposed modifications and improvements to the basic PEAQ model.
  • Others propose audio (particularly, speech) quality assessment methods without a reference signal by utilizing spectral analysis or probabilistic modeling.
  • machine learning-based methods have also been proposed, where quality metrics are trained to predict users’ Mean Opinion Scores (MOS).
  • Atcheson US Patent No. 5,583,763 provides a system for predicting user preference. The determination is made based on the user's prior indicated preferences.
  • the user designates his or her preferred selections as entries in a user's preference list. Entries in the user's list are compared with entries in the other users' lists. When a significant number of matches have been found between two lists, the unmatched entries of the other user's preference list are extracted. The unmatched entries are further processed. Those unmatched entries with a high correlation to the user's preference list are presented to the user as selections in which the user is likely to be interested.
  • This system seeks to present a set of new items to a particular user based on the user’s past preferences. The weights assigned to the new items (to be presented to the current user) are based on the preferences for these new items by those existing users who have similar past preferences as the current user.
  • Eriksson WO 2014137381 discloses a system for determining a pre-determined number of top-ranked items by accepting a set of unranked items, the pre-determined number, and a random selection of pairwise comparisons; creating a graph structure using the set of unranked items and the random selection of pairwise comparisons, wherein the graph structure includes vertices corresponding to the items and edges corresponding to a pairwise ranking; and performing a depth-first search for each item that is an element of the set of unranked items for paths along the edges through the graph that are not greater than a length equal to the pre-determined number.
  • This system does not compute the scores for every item but rather identifies a fixed number of top-k items based on a few random pairwise comparisons.
  • the system is limited to a prescribed set of unranked items and only identifies a fixed number of top-ranked items by randomly doing pairwise comparisons of a few items from the set. It does not assign an accurate preference-based score to each item in the set.
  • Shen Patent US 10002415B2 develops a learning system to predict ratings of images to quantify aesthetic properties of images.
  • the training uses a dataset where a user annotates an image with their opinion of the image.
  • This form of human annotation is unreliable due to high variability of human opinions. Consequently, the Shen trained system is more prone to errors compared to the present invention, which uses a pairwise or N-way learning framework and a pairwise-comparison-based dataset or an N-way comparison-based dataset.
  • a preferred embodiment is a machine-implemented method for automatically determining a likely human preference or quality for a data object or the perceived similarity or distance between a data object and a reference. Comparison sets of data objects are provided to devices and human direct or indirect responses to the data objects are collected. The data objects are labelled with human preference or emotional labels and a pairwise or n-way comparison learning data set is built. A machine learning component is trained to predict a human preference or emotional response to data objects. The machine learning component can then evaluate and provide a predicted human perceptual quality or emotional response to a data object to evaluate, or a data object and a reference object and can provide the human-perceived similarity/difference between the data object to evaluate and the reference.
  • FIGs. 1A and 1B are schematic diagrams of a preferred machine learning system of the invention configured to be trained with a pairwise or N-way dataset of human preference;
  • FIG. 2A is a flowchart illustrating steps of a preferred training method for a machine learning system of the invention
  • FIG. 2B is a flowchart illustrating steps of a preferred operation of a machine learning system of the invention
  • FIG. 3 illustrates an assumption for a preferred pairwise-learning dataset, in that data objects are capable of being ordered on a scale
  • FIG. 4 is a schematic diagram of a preferred pairwise-learning framework of the invention.
  • FIGs. 5A and 5B show the preferred error-estimation function f
  • Preferred embodiment machine learning methods and systems automatically compute visual similarity between two images or videos (or other data objects) in a manner that is consistent with human visual perception (or other human preference).
  • machine learning is leveraged to enable a computer system to automatically qualitatively estimate how similar (or different) two images are from each other (or which of two or more data objects is preferred) in a manner that agrees with human observers.
  • the similarity/difference can be in terms of visual content (e.g., two images are similar if subjects cannot see any visible difference between the two), or in terms of emotional response (e.g., two images are similar if the subject has the same emotional response to them), depending on the application.
  • a photograph of a meadow and a photograph of the sunset would not be visually similar in terms of content (i.e., subjects could easily tell differences between the two images), but they might be similar in terms of how they make the subject feel.
  • other data objects e.g., sound files for music
  • Preferred methods train a system with a ranking for data objects based on pairwise or N-way comparisons.
  • the system’s end goal during operation is to compute a quality score for a data object or compute the error (or similarity) of a data object with respect to a reference.
  • the system is trained and then operates to determine the quality of a single, stand-alone, data object or the error of a data object with respect to a reference.
  • the training is performed using human pairwise or N-way preferences on data objects, and in the process quality or error predictor subnetworks also get trained.
  • evaluation stage: the stage that is used during system operation
  • these indirectly-trained quality or error predictor sub-networks are used. Training these error or quality predictors permits the system to operate.
  • Using the PLF or NLF is an indirect but effective way to train these predictors because directly training them would require labeling data objects with human perception of quality or error on single data objects, which is difficult to obtain.
  • Once the system is trained and can be used to determine the qualities (or errors) for individual objects, the objects can then be easily ranked using these scores if desired.
  • the ability of the system to rank the objects in the end is a side-effect of the system being able to determine qualities (or errors) for individual objects.
  • the approach is highly customizable in recommendation systems as it can work with many types of data objects, by collecting human pairwise or N-way data and training the system with that data.
  • the only role humans play is to respond to prompts from a system to enable the system to create the labeled dataset (e.g., with preference labels), which could be done before training of the machine learning component begins or during training of the machine learning component.
  • the machine learning training process itself is automatic and does not require human involvement.
  • a preferred embodiment is a machine learning system that can be leveraged to enable a computer system to automatically provide a qualitative evaluation of an image or video or other data object consistent with human observations, including a measure of one or more fundamental qualities of the image/video or other data object (i.e., how “good” is it?) as well as other emotional observations/responses, e.g., humor (“how funny is it?”), sadness, scariness, attractiveness, age-appropriateness.
  • Other data objects include media (sounds, music, films, film scenes), goods (clothing, electronics, hotels), services, and other types of data objects.
  • the qualitative estimate output by the computer system can be formatted to be processed by other algorithms via a score or another metric that provides a range of values or a set of semantically-meaningful ratings (e.g., “excellent,” “very good,” “good,” “bad,” or “terrible”).
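As one illustration of formatting a numeric output as semantically-meaningful ratings, a quality score could be bucketed into the labels mentioned above. This is a minimal sketch only: the 0-to-1 score range, the function name, and the even bucketing are hypothetical assumptions, not specified by the invention.

```python
# Hypothetical helper mapping a numeric quality score (assumed to lie in
# [0, 1]) onto the semantic ratings named above; the even bucketing is an
# illustrative assumption, not part of the invention.
def semantic_rating(score: float) -> str:
    labels = ["terrible", "bad", "good", "very good", "excellent"]
    idx = min(int(score * len(labels)), len(labels) - 1)  # clamp top bucket
    return labels[idx]

print(semantic_rating(0.95))  # excellent
```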
  • Preferred systems of the invention include an interface for receiving the relevant data objects.
  • a data object can simply be the digital representation of the appropriate media files.
  • the data object can either be a digital representation of the object (e.g., the MPEG file for a film, the PDF of a book, reviews of a hotel from a travel website, to name a few) or simply a unique identifier (e.g., a name, a title, a catalog number, a URL, a street address) that uniquely distinguishes each data object from the others.
  • the data object is provided to a machine learning component that has been trained with a “pairwise-learning framework” (PLF) or an n-way learning framework (NLF).
  • PLF: pairwise-learning framework
  • NLF: n-way learning framework
  • a human-labeled dataset is provided, which is acquired before PLF or NLF is deployed to train the system.
  • the dataset for this training is acquired by capturing human perception and judgement through pairwise or N-way comparisons of presented data objects.
  • a property of the present PLF or NLF training is that although the system is trained using such a dataset consisting of human judgements or emotional responses on pairs or N-numbered sets of data objects, the system learns to predict fundamental quality or error values for individual data objects during evaluation/operation; even though human judgements of quality/error/similarity were never captured for single data objects in the training dataset.
  • the machine learning component for PLF or NLF training can be selected from neural networks, deep neural networks, multi-layer perceptrons (MLP), convolutional networks (CNNs), deep CNNs, recurrent neural networks, autoencoder neural networks, long short term memory (LSTM) networks, generative adversarial networks (GANs), support vector machines, or random forests, as examples.
  • Preferred methods of the invention provide a data-driven, machine-learning-based PLF or NLF training that trains a machine learning model to predict the perceived similarity (or dissimilarity) between images, or alternatively can be used to estimate the human-perceived fundamental quality of a single image (or other objects of human perception).
  • the PLF or NLF training approach requires a data set of human observations that is generated from presenting two or more data objects for comparison (e.g., two images) and asking a population of humans to indicate which of the two or more input data objects is preferred.
  • the invention leverages the inventors’ determination that people are better at deciding which of two or more data objects is preferred, which provides a more reliable measure than having people assign absolute quality or similarity (distance) values to each, even when people have multiple objects presented to them when assigning absolute values.
  • the human subjects are asked through a computer system which of the two or more objects appears to have more or most of the desired perceptual quality (e.g., which one is better, funnier, scarier, sadder, etc.), or to rank more than two objects in order of preference.
  • the human subjects are asked through a computer system which of the two or more data objects is more or most similar (or different) to the given reference, or to rank the similarity of multiple data objects to the reference in order.
  • the system prompts can ask the users “which object, A or B, is more similar to the reference?” or “which object, A, B, or C is most similar to the reference and which object, A, B, or C is least similar to the reference?”
  • subjects can be prompted by the system to answer “which object, A or B, makes you feel more similar to the reference?” or “which object, A, B, or C makes you feel most similar to the reference and which object, A, B, or C makes you feel the least similar to the reference?”
  • the first prompts measure content similarity/distance
  • the second prompts measure emotional similarity/distance.
  • most applications seek to compute only one measure or the other.
  • the total number of possible pairwise or N-way comparisons between any two or more objects in a dataset can be much larger than the total number of objects in the dataset itself. Therefore, in cases where the number of objects is very large (e.g., the full video library of an online movie streaming service), acquiring human preference responses for all pairs or N-sets of objects can be time-consuming and/or expensive.
  • the system can instead acquire human preferences for a subset of the total possible pairs or N-sets of objects. Estimation can be used for filling in missing human preference labels (pairs of objects for which humans have not provided preference information) before the machine-learning training process.
  • the prompting component 12 can include a maximum likelihood estimation function that fills in missing labels to create a full labeled dataset to train the error estimation components 14a and 14b.
  • the prompting component selects a smaller subset of pairs (e.g. 3,000) for which to seek human preference.
  • the prompting component 12 then, using these human-labeled pairs, estimates the true preference labels for the remaining pairs using a maximum likelihood approach and creates the full training dataset which contains pairs (or N-sets) of objects and their respective preference labels for each pair (or N-set).
  • a preferred approach uses a “maximum likelihood” estimation of the missing pairwise-preference probability labels [K. Tsukida and M. R. Gupta. How to analyze paired comparison data. Technical report, University of Washington, Seattle, Department of Electrical Engineering, 2011] using the measured responses.
  • Such a technique utilizes a limited budget but maximizes the number of object pairs or N-sets that can be labeled with human responses.
  • Tsukida and Gupta teach maximum likelihood estimation of missing pairwise preferences and provide a method to, given a set of data objects with a subset of all possible pairings labeled with human preferences, use MLE to obtain the human preference for the remaining data object pairs that are not labeled with human preference.
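The fill-in step described above can be sketched as follows, assuming a Bradley-Terry model of preference (consistent with the probability estimator described later in this document). The item counts, function names, and gradient-ascent settings are illustrative assumptions, not taken from Tsukida and Gupta or from the invention.

```python
import math

# Illustrative sketch: estimate missing pairwise-preference probabilities
# by maximum likelihood under a Bradley-Terry model. Latent scores s are
# fitted to the observed pairs; unlabeled pairs are then filled in from
# the score differences. All data and settings here are hypothetical.

def fit_bradley_terry(n_items, wins, n_iters=2000, lr=0.1):
    """wins[(i, j)] = number of times humans preferred item i over item j."""
    s = [0.0] * n_items  # latent quality scores, one per data object
    for _ in range(n_iters):
        grad = [0.0] * n_items
        for (i, j), w in wins.items():
            p = 1.0 / (1.0 + math.exp(-(s[i] - s[j])))  # P(i preferred over j)
            grad[i] += w * (1.0 - p)   # gradient of the log-likelihood
            grad[j] -= w * (1.0 - p)
        s = [si + lr * g for si, g in zip(s, grad)]
        mean = sum(s) / n_items        # scores are shift-invariant: center them
        s = [si - mean for si in s]
    return s

def preference_probability(s, i, j):
    """Estimated P(humans prefer item i over item j) for an unlabeled pair."""
    return 1.0 / (1.0 + math.exp(-(s[i] - s[j])))

# Example: 4 objects, human labels collected for only 3 of the 6 possible pairs.
wins = {(0, 1): 8, (1, 0): 2,   # object 0 preferred over 1 in 8 of 10 trials
        (1, 2): 7, (2, 1): 3,
        (2, 3): 9, (3, 2): 1}
scores = fit_bradley_terry(4, wins)
# Fill in a missing pair, e.g. (0, 3), from the fitted scores:
p03 = preference_probability(scores, 0, 3)
```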
  • the PLF or NLF trained system is then trained via machine learning with the dataset.
  • Although the PLF or NLF trained system is trained using pairs or N-sets of data objects, it can then be used to evaluate the error or quality of a single data object during the evaluation stage. Specifically, by predicting the percentage of people that selected one data object over another, the system learns to predict individual scores for each data object that have fundamental semantic meaning given the application.
  • An example application of the invention is an automated system for automatically measuring the “scariness rating” of a movie (i.e., how scary is it?). Hand-coded algorithms are ill-suited for this purpose since it is hard to quantify how to measure “scariness.” Furthermore, creating a proper dataset for machine learning is difficult because people cannot assign scariness ratings to films on a consistent basis (e.g., rating the scariness of a given film from 0 to 100 is not easy). Instead, in an example, a dataset is created by showing human subjects data objects concerning pairs of movies (e.g., by displaying the titles or movie posters) and asking them to select which movie was scarier.
  • the system can be trained to predict the percentage of people who selected one movie over another given the unique identifier or digital files for each film as input. In doing so, the machine-learning component of the PLF trained system learns to assign accurate scariness “scores” to each film so that the probability of preferring it would match the user study.
  • the machine-learning component of the PLF trained system can evaluate a new film (given by its unique identifier or digital file) and automatically assign an accurate scariness rating that is consistent with the human population of the user study. This scariness rating can then be used to rank movies, e.g., “List the scariest (or least-scary) movies” for display, to create automatically-created playlists, or in parental-control systems, to name a few applications.
  • This example embodiment can be changed to measure the funniness, “goodness,” and other perceptual qualities of films by changing the question that is asked of the users during data acquisition. Likewise, it can also be applied to other media, goods, and services by changing the data objects referenced during the user study. For example, a database of individual television commercials can be created, and users can be shown pairs of commercials and asked which one was better, more interesting, more attention-getting, etc. Once the dataset is created, the system can be trained by providing the information on the commercials (e.g., the digital video file) and learn to predict the probability of human preference of one commercial over the other. In the process of doing so, the system learns to output quality values for the commercials.
  • Such a system can be used to automatically “judge” commercials under development (e.g., to see which ad appeals more to a certain demographic, or which political ad is more convincing, etc.), questions which are typically answered using human focus groups which can be expensive and time-consuming.
  • This approach can be applied for evaluating movies, video games, books, clothing, appliances, etc.
  • a system of the invention can suggest similar songs for an online streaming music service. Subjects are presented with two or more songs (e.g., by either playing clips of the songs, the full songs, or simply showing the titles, to name a few possibilities) as well as a reference song, and asked to select which of the two songs makes them feel the same way as (or closer to) the reference song, or which makes them feel most and which least the same way as the reference. After gathering a set of choices in response to prompts and determining the probability of preference for each training pair, the system can then be trained to predict the emotional similarity between songs given, e.g., the audio file for each song. Such a system can be used to find songs that have the same emotional connection with the listener as a given song (“Play a song like X.”). This approach can also be extended for estimating emotional similarities/distances for films, books, games, etc.
  • the PLF or NLF training/testing is preferably automated, and can be conducted locally or remotely via a network.
  • pairs or N-sets of images are selected from an image data set and provided to a device over the network, such as a workstation, smart TV, laptop, tablet or smart phone.
  • a user of the device selects the preferred image, or the most and least preferred image.
  • the acquisition of human preference between pairs of images or N-sets of images can leverage applications and social networks, and can be part of a contest or a game to attract interest.
  • the selection of images can also be part of a verification step, e.g., to access particular content.
  • the data acquisition can also be in a controlled environment, with volunteers or paid human testers, or by subscribers to the service.
  • an online film streaming service can choose to query users who have watched a particular pair or N-set of movies to get the data for comparisons, or an online book seller might ask people who have read the two books.
  • the selection process by a system prompting a population of human subjects builds a dataset that can be used to train the machine learning approaches to estimate individual quality ratings or error (similarity) scores for each image or other data objects using the present PLF or NLF methods.
  • the PLF or NLF trained system can determine error scores (or similarity scores or quality labels) for single data objects after being trained on the pairwise or N-set preference labels.
  • the inputs to the system are a reference image and a pair or more of distorted images derived from the reference (the distorted image(s) can also be different images from the reference, in another variation).
  • the output of the system during training stage is the probability that humans would prefer one distorted image of the pair over the other(s).
  • the error estimation component trained with the PLF or NLF can then predict visual error of a given distorted image.
  • In preferred training to estimate similarity (or errors/distances), a system of the invention first creates a large dataset by showing multiple users, through local or remote devices, many image triplets (sets of two distorted images and the corresponding reference) and obtaining selection data regarding the human choice of the closest image between the two.
  • the system stores the percentage of people who, for example, chose A over B, which is effectively the probability of selecting A.
  • the system designates these acquired probabilities as ground-truth labels for each triplet, which are then used to train the learning model using back-propagation.
  • the machine-learning component is thereby taught to predict the probability of human preference given the input images.
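The training loop described in the preceding bullets can be sketched as follows, using a toy linear error estimator with shared weights in place of the invention's neural networks. All function names, features, and preference labels here are illustrative assumptions.

```python
import math

# Toy pairwise-learning step: a shared "error estimator" f scores each
# distorted input against the reference; the score difference is passed
# through a sigmoid to predict P(humans prefer A over B), and the shared
# weights are updated by gradient descent on a cross-entropy loss against
# the ground-truth human preference probability. The linear estimator and
# the hand-picked features are illustrative stand-ins for a deep network.

def error_score(w, features):
    """Shared error estimator: a higher score means larger perceived error."""
    return sum(wi * xi for wi, xi in zip(w, features))

def train_pairwise(pairs, n_features, epochs=500, lr=0.05):
    w = [0.0] * n_features
    for _ in range(epochs):
        for feats_a, feats_b, p_human in pairs:
            e_a, e_b = error_score(w, feats_a), error_score(w, feats_b)
            # P(prefer A) is high when A's error is lower than B's
            p_pred = 1.0 / (1.0 + math.exp(-(e_b - e_a)))
            # Gradient of cross-entropy w.r.t. the logit (e_b - e_a)
            g = p_pred - p_human
            w = [wi - lr * g * (xb - xa)
                 for wi, xa, xb in zip(w, feats_a, feats_b)]
    return w

# Each triplet-derived pair: (features of distorted image A vs. reference,
# features of distorted image B vs. reference, fraction of humans who chose A)
pairs = [([1.0, 0.2], [3.0, 0.9], 0.9),   # A is much less distorted
         ([2.0, 0.5], [1.5, 0.1], 0.3)]   # B is less distorted
w = train_pairwise(pairs, 2)
# After training, error_score(w, feats) alone scores a single image.
```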
  • The PLF or NLF trained architecture allows the machine learning component to predict single-image visual error/quality values after the training/optimization process is complete. Extensive verification through experiments has confirmed that a PLF-dataset-trained network outperforms the state-of-the-art.
  • Preferred implementations compute the similarity/error/distances of images, which includes segregation of error estimation blocks into feature extraction and score computation sub-blocks. Important general system components and method steps are realized via error estimation steps that feed a probability estimation analysis component.
  • the method includes providing comparison sets of data objects to devices and collecting human direct or indirect responses to the data objects or to references of the data objects. The method includes labelling the data objects with human preference or emotional labels and building a pairwise or n-way comparison learning data set.
  • a machine learning component is trained with the learning data set to predict a human preference or emotional response to data objects, wherein the learning comprises estimating probability of human preference or emotion, determining an error of the estimated probability of human preference or emotion, and updating parameters of a learning component to reduce the error of the estimated probability of human preference or emotion.
  • the machine learning component receives a data object or a reference thereto to evaluate and provides a predicted human perceptual quality or emotional response to the data object to evaluate, or receives a data object or a reference thereto together with a reference object and provides the human-perceived similarity/difference between the data object to evaluate and the reference.
  • the machine learning component can have plural machine learning components, with each of the plural machine learning components receiving a data object and computing the perceptual quality of the given data object, which is then provided to a probability component that determines a likelihood of preference between separate data objects provided to the plural machine learning components.
  • the human direct response can be a selection of which object exhibits a desired perceptual quality more strongly, wherein the perceptual quality can be funnier, scarier, sadder, more serious, more appropriate for a given subset of the population, or better.
  • the machine learning component can include plural machine learning components, with each of the plural machine learning components receiving a data object and a reference object and computing the perceptual similarity or difference of the given object with respect to the reference, which is then provided to a probability component that determines a likelihood of preference between separate data objects provided to the plural machine learning components.
  • the data objects can represent one of sounds, songs, films, tv shows, advertisements, movies, videos, books, clothing, toys, or electronics.
  • a preferred system for automatically determining a likely human preference or quality for a data object, or the perceived similarity or distance between a data object and a reference, includes a dataset including a set of pairwise labels for data objects indicating human selections of comparisons between pairs or N-sets of data objects obtained from a prompting component.
  • a machine learning component is trained on the data set via a probability estimation component, the machine learning component having an input for receiving data objects during testing and having code for updating parameters during training based upon an error distance between predictions of the probability estimation component and human probability of preference.
  • the machine learning component comprises one of a neural network, deep neural network, multi-layer perceptron, convolutional network, deep convolutional network, recurrent neural network, autoencoder neural networks, long short-term memory network, generative adversarial networks, support vector machine, or random forest.
  • the dataset can include labels generated from human selections and additional labels estimated from human selections.
  • training includes obtaining human ratings in response to prompts from a system with a prompting component 12 (e.g., computer networks, handheld devices, etc.) that provides a group of test subjects with a reference object and a series of A images and B images.
  • Error estimation components 14a and 14b conduct error estimations (which can be accomplished via parallel processing) and estimate error (one for error A and one for error B), taking as input one of the two images and the reference image.
  • the error estimation components 14a and 14b are shown separately to convey that A images and B images both undergo the error / quality estimation process (in an identical manner) to output the respective error / quality which is received by the probability estimator 16.
  • when the aim is to estimate quality, a single data object is input to a single quality estimation component 14a or 14b, whose aim is to predict the quality of that single input data object.
  • when the aim is to compute the error, a single data object and the reference data object are input to quality estimation component 14a or 14b.
  • the quality/error-estimation components 14a and 14b are preferably composed of a single type of learning model (i.e., the same architecture in both), and can be integrated into a single machine-learning model.
  • a probability-estimation component 16 is used during training only to obtain predicted probability of preference between the input data objects, as the system uses the actual probability of preference labels from humans to train the system.
  • the actual architecture for the error-estimation component can be selected from neural networks, deep neural networks, multi-layer perceptrons (MLP), convolutional networks (CNNs), deep CNNs, recurrent neural networks, autoencoder neural networks, long short-term memory (LSTM) networks, generative adversarial networks (GANs), support vector machines, or random forests, as examples.
  • the probability-estimation component 16 acquires “scores” computed by the quality/error-estimation components 14a and 14b and uses them to estimate the probability of preference of A over B.
  • the probability is estimated by subtracting the two inputs and passing the result through a sigmoid function. This gives the scores of A and B a meaningful interpretation that is based on the Bradley-Terry statistical model of human preference.
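As a minimal sketch of this subtract-and-sigmoid step (assuming error scores where lower means closer to the reference; the function name is illustrative, not from the patent):

```python
import math

def preference_probability(s_a: float, s_b: float) -> float:
    """Bradley-Terry-style probability of preferring A over B.

    s_a and s_b are error scores (lower = closer to the reference),
    so A is preferred when its error is smaller: p = sigmoid(s_b - s_a).
    """
    return 1.0 / (1.0 + math.exp(-(s_b - s_a)))

# Equal errors -> no preference either way.
print(preference_probability(2.0, 2.0))  # 0.5
# A has much lower error -> A is strongly preferred.
print(preference_probability(0.0, 5.0))  # ~0.993
```

The complementary probabilities sum to one, which is what allows a single vote fraction per pair to serve as the label.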
  • the sigmoid function is replaced by another function (e.g., a line or a polynomial).
  • the entire probability-estimation component 16 is a single machine-learning component that automatically computes probability (so no subtraction or sigmoid).
  • the machine-learning model provides the probability estimation component.
  • the entire system is trained end-to-end to match the human probability labels.
  • the error-estimation machine learning components 14a and 14b estimate the implicit error (or distance) between image A or B and the reference, and the subsequent component 16 translates these errors into a probability of preference.
  • a machine-learning model (component 14a or 14b) trained on the probability of preference labels using PLF or NLF training can therefore learn to estimate the implicit error scores.
  • error estimation component 14a (or 14b) is the only machine learning model.
  • the probability estimation component 16 is also a trainable component trained using machine learning. However, in the evaluation stage, when trying to compute the error (or quality) of an object, the probability estimation component 16 does not need to be used.
  • one of the quality/error-estimation components 14a and 14b can be used in stand-alone fashion, where it would take in a distorted image and the original reference and output its estimated error (distance).
  • machine learning components 14a and 14b implement the same mathematical function in the example implementation, so it does not matter which of 14a or 14b is used during operation in this case (i.e., they are identical).
  • all of the error/quality estimators 14a, 14b and the probability-estimation component 16 get trained through the human pairwise or N-way preference dataset.
  • the probability-estimation component 16 can also be fixed.
  • the prompting component 12 can instead train the error estimation components 14a and 14b on a comparison of other data objects A or B. Examples are discussed above, and human preference can be applied to other data objects (sounds, songs, films, books, etc.) as the basis for training. In these cases, the system would learn to measure the similarity (or distance) between objects in the corresponding domain. For example, a PLF or NLF system trained on movies learns to compute similarities and distances for movies and could answer questions such as “What are the movies most similar to film X?” by simply ranking the films by their similarity score to film X in the correct order.
  • [0044] While images are shown being compared with a reference image in FIG. 1A, the prompting component 12 can instead train the error estimation PLF or NLF components 14a and 14b on a comparison between images A or B (or objects A or B) without using a reference object, as shown in FIG. 1B.
  • human ratings are obtained in response to a series of A and B images provided by the prompting component and are used to estimate a perceptual quality (e.g., goodness, funniness, scariness, etc.) of the objects, as opposed to the similarity/error/distance, which is a relative measure that requires a reference.
  • the corresponding pairwise or N-set learning framework 14a and 14b and 16 then uses relative comparisons between objects. For an NLF training, there would be N copies of component 14.
  • for example, with 10-set inputs, there would be 10 copies of component 14 (i.e., 14a - 14j). For each input image A to the estimator 14a, a corresponding image B is fed to the error/quality estimator 14b at the same time. This provides information concerning the relative preference of A over B for each input data object pair.
  • the error/quality estimator machine-learning 14a and 14b now estimates quality.
  • the quality-estimation machine learning components 14a and 14b therefore estimate the quality of A and the quality of B, and the probability estimation component 16 determines a probability of human preference for A over B.
  • the prompting component 12 shows human groups pairs of images and asks each person to select the one that demonstrates a stronger quality.
  • the prompting component preferably builds the PLF or NLF dataset in advance of training the system components 14a, 14b. There is usually no “prompting” of humans during training when the object pairs have been labeled by the prompting component 12 in advance of training.
  • the probability estimation component 16 plays a key role during training but not during later operation (also called testing). During training, the probability estimator 16 takes in as input the quality or error scores output by error/quality estimation components 14a and 14b (which at the start of training are very inaccurate) and uses them to predict the probability of preference between the input pair or n-set of data objects. The learning process aims to match this predicted probability of preference to the actual human preference amongst the input data objects.
  • the error/quality estimator 14a, 14b will get trained to predict error or quality scores accurately.
  • the system can learn the implicit quality scores.
  • the aim is to predict the error or quality of a given single data object using PLF or NLF trained error/quality estimator 14a or 14b. Testing showed that the present system provides better performance than prior systems.
  • the “quality” can encompass a variety of human preferences or emotional responses beyond simply measuring generic quality or likeability.
  • the system can present user devices radically different images intended to create a specific emotion, such as sadness or shock. Radically different images for such an emotional response can be, e.g., a car crash scene, a baby, or an injured bird, and the measurement will gauge the relative emotional response between the different images even though the images include radically different objects/scenes. Additional example dimensions can be artistic styles, color schemes, story content, etc. Similarly, with music objects, radically different styles of music samples can be presented, and human subjects queried through their devices about which sample causes more sadness, humor, happiness, etc.
  • a specific example embodiment is illustrated with respect to testing of a PLF system, discussing the development of the data set created in a training phase via the prompting component 12 of FIG. 1A, with the data set using a reference image.
  • the prompting component collected labels for image pairs according to the percentage of persons who preferred an image A over an image B as being more consistent with a reference image. A value of 50% indicates that both images are equally “distant” from the reference, while a value of 70% for image A (and 30% for image B) indicates that image A is substantially closer to the reference.
  • these pairwise probability preferences are used as ground-truth labels, which provides a larger and more robust dataset than previous image quality assessment (IQA) methods.
  • the PLF dataset trains error/quality machine learning components 14a, 14b by comparing the machine-predicted probability of preference (the output of probability estimation component 16, which receives the output of 14a and 14b) to the actual human probability of preference and generating the training signal for 14a, 14b and 16.
  • the principal steps executed by preferred software code for a specific implementation of the preferred method for predicting error between a data object A and a reference data object R are shown in FIGs. 2A and 2B.
  • the code of FIG. 2A is training code, which is used to train the computational model (14a and 14b) starting from randomly initialized parameters of the learning component.
  • the training code receives 20 input image pairs A,B and a reference R (during the task of image error prediction).
  • the code selects 22 an image pair and reference 24 (A, B and R) for training evaluation.
  • feature extraction 26 is conducted to obtain the relevant features 28 for A, B and R using DCNNs.
  • the features of A and R are received by a subsequent fully-connected neural network (FCNN) to compute 30 the final error 32 between A and R. Similar processing is performed for features of B and R to compute the final error between B and R. These computed errors are then received by the probability of preference prediction component 16 to compute 34 the machine-predicted probability of preference 36 between A and B.
  • the learning components 14a and 14b in the model thus far are the feature extraction DCNN and the final error computation FCNN.
  • a more flexible model design could include the probability of preference prediction component as a learning component as well.
  • the machine-predicted probability of preference is compared to the actual human preference 38 between A and B to compute 40 the prediction error 42.
  • when the prediction error is high, as determined by computing the loss (or error) 42 between the predicted probability of preference and the true labeled probability of preference, this indicates that the parameters of the learning components are not correct, and a gradient computation 44 generates a training signal 46 to update 48 the parameters of the learning components.
  • the preferred method of FIG. 2A uses the gradient-descent method (i.e., backpropagation, as it is known in the art) to update parameters and minimize this loss or prediction error 42. This process is repeated until the prediction error is low for all image pairs present in the training dataset.
  • the error can ideally go as low as 0, which implies that the machine has learned to perfectly predict human preference. In practice, the training continues until there is no more change in the prediction error and the error is low (e.g., 10^-4).
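The convergence behavior described above can be illustrated with a minimal numerical sketch that omits the DCNN entirely: each image's error score is treated as a free parameter and fitted to pairwise preference labels by gradient descent on the squared prediction error. The toy data and names below are illustrative, not the actual implementation:

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical hidden per-image error scores (lower = closer to reference).
true_scores = {"A": 0.5, "B": 1.5, "C": 3.0}
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
# Human-style labels: probability of preferring the first image of each pair.
labels = {(i, j): sigmoid(true_scores[j] - true_scores[i]) for i, j in pairs}

# Learnable scores start from random values, like the network parameters.
rng = random.Random(0)
scores = {k: rng.random() for k in true_scores}
lr = 0.5
for _ in range(5000):
    for (i, j), p_true in labels.items():
        x = scores[j] - scores[i]
        p_pred = sigmoid(x)
        # Gradient of the squared error (p_pred - p_true)^2 w.r.t. x.
        grad = 2.0 * (p_pred - p_true) * p_pred * (1.0 - p_pred)
        scores[i] += lr * grad  # x depends on scores[i] with a minus sign
        scores[j] -= lr * grad

# Only score differences are constrained, so compare gaps, not raw values.
print(round(scores["B"] - scores["A"], 2))  # ~1.0
print(round(scores["C"] - scores["A"], 2))  # ~2.5
```

Because only score differences enter the sigmoid, the fitted scores are recovered up to an additive constant, which is why the sketch checks gaps between scores rather than the raw values.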
  • the second form of code is the testing code, with operations and information illustrated in FIG. 2B.
  • a test image A and reference R 50 are provided to the trained CNN and FCNN components (14a and 14b), which then predict the error between the image and the reference image.
  • the CNN extracts 52 features 54 of an image and its reference and feeds them to the FCNN, which predicts 56 the final error 58 between the image and the reference.
  • a system of the invention constructs a dataset that focuses exclusively on the probability of pairwise or N-way preference, which can be obtained via prompting with pairwise or N-set data, as discussed above.
  • the prompting component provides test subjects with two or more distorted versions (A and B) of reference image R, and the subjects are prompted to select the one(s) that looks more similar to R.
  • the system then stores the percentage of people who selected image A over B as the ground-truth label for this pair, which is the probability of preference of A over B (p_AB).
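A minimal sketch of this label computation (the function name is hypothetical):

```python
def preference_label(votes):
    """Fraction of subjects who picked image A over B for a reference R.

    `votes` is a list of 'A' or 'B' selections collected by the prompting
    component; the returned fraction is the ground-truth label p_AB.
    """
    if not votes:
        raise ValueError("need at least one human selection")
    return votes.count("A") / len(votes)

print(preference_label(["A", "A", "B", "A", "B"]))  # 0.6
```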
  • the data set permits distorted images to be placed on a scale based on their underlying perceptual-error scores (e.g., s_A, s_B, s_C) with respect to the reference.
  • the reference is assigned to have 0 error.
  • because each reference image has its own quality axis, comparing a distorted version of one reference to that of another does not make logical sense.
  • this alignment along a scale is an assumption based on which the machine learning components 14a/14b are designed (not a feature of the dataset).
  • data objects are assumed to be capable of being lined up on a scale in order of their “goodness” (represented by error or quality). Training for quality or error values for a data object can work for any class of data objects which can be “ranked” in a unique manner.
  • the variable h defines the function of the probability component 16 and f defines the function of the error/quality estimation components 14a/14b.
  • Eq. 2 represents a theoretical explanation of the training process which is used to build a pair-wise learning dataset. If the training data is fitted correctly and is sufficient in terms of images and distortions, Eq. 2 will train f to estimate the underlying perceptual-error scores for every image so that their relative spacing on the image quality scale will match their pairwise probabilities (enforced by Eq. 1), with images that are closer to the reference having smaller numbers. An ideal “fitting” would lead to zero error in Eq. 2 for a very large dataset; low error is discussed above.
  • the dataset size can depend on the number of parameters of a neural network model and the number of target image distortions. These underlying perceptual errors are estimated up to an additive constant, as only the relative distances between images are constrained by Eq. 1. This constant can be accounted for by setting the error of the reference with itself to 0.
  • FIG. 4 shows a preferred pairwise-learning framework that consists of error-estimation components 14a, 14b and probability-estimation function 16.
  • each of the two error estimation components (14a and 14b) performs the same function.
  • each error estimation component (14a or 14b) has only two feature extraction blocks, one for input image A (60a) and one for the reference (60b).
  • the reference feature extractor is also used by 14b, since the reference image is also received by 14b.
  • the error-estimation function f has weight-shared feature-extraction (FE) networks 60a, 60b, 60c that take in reference R and a distorted input (A or B), and score-computation (SC) networks 62a, 62b that use the extracted features from each image to compute the perceptual-error score.
  • 14a contains two FE networks (18a and 18b) and one SC network (20a).
  • 14b contains two FE networks (18b and 18c) and one SC network (20b).
  • the FE block for R is shared between f(A_i, R_i, θ) and f(B_i, R_i, θ).
  • the computed perceptual-error scores for A and B (s_A and s_B) are then passed to the probability-estimation function h, which implements the Bradley-Terry (BT) model (Eq. 1) and outputs the probability of preferring A over B.
  • the inputs to the FIG. 4 system are sets of three images (A, B, and R) and the output is the probability of preferring A over B with respect to R.
  • the learning blocks 60a (f(A_i, R_i, θ)) and 60b (f(B_i, R_i, θ)) compute the perceptual error of each image.
  • the “learning blocks” that are being referred to are the error estimation components (14a and 14b).
  • component 14a is composed of 3 learning sub-blocks: 60a and 60b, which are used to extract useful features, and 62a, which estimates the perceptual error after receiving the extracted features from 60a and 60b. “Feature” does not necessarily relate to image analysis only.
  • Sound data objects can also have“features.”
  • frequency composition of sound can be a feature.
  • feature extraction is not a part of dataset building; rather, it is a sub-component of the error or quality estimation block which aids the learning process by extracting meaningful information from the data.
  • the estimated errors s_A and s_B are then subtracted and fed through a sigmoid 20a and 20b that implements the BT model in Eq. 1 (function h) to predict the probability of preferring A over B.
  • feeding the outputs of 62a and 62b to the sigmoid is the “estimated probability computation.” Unlike the score computation, which outputs a score, this outputs a probability of preference.
  • this probability is used to train the networks 14a, 14b using probability of preference labels, as discussed above with reference to FIG. 2A.
  • the entire system can then be trained by backpropagating the squared L2 error between the predicted probabilities and the ground-truth human preference labels to minimize Eq. 2.
  • an expressive computational model for f and a sufficiently large dataset with a rich variety of images and distortion types are needed for accurate operation.
  • the invention provides a DCNN-based architecture that is an expressive model for f.
  • a method for construction of a large-scale image distortion dataset with probabilistic pairwise human comparison labels is also provided.
  • the pairwise- or N-set learning framework is general and can be used to train prior learning models for error computation by simply replacing f(A_i, R_i, θ) and f(B_i, R_i, θ).
  • the error-estimation function f can be used by itself to compute the perceptual error of individual images with respect to a reference.
  • the error-estimation block f consists of two kinds of subnetworks (subnets, for short). There are three identical, weight-shared feature-extraction (FE) subnets (one for each input image), and two weight-shared score-computation (SC) subnets that compute the perceptual-error scores for A and B. Together, two FE and one SC subnets comprise the error-estimation function f. Errors are computed on a patch-wise basis by feeding corresponding patches from A, B, and R through the FE and SC subnets and aggregating them to obtain the overall errors, s_A and s_B.
  • [0063] FIGs. 5A and 5B show the preferred error-estimation function f.
  • the number after "CONV" indicates the number of feature maps.
  • Each layer has 3 x 3 filters and a non-linear ReLU, with 2 x 2 max-pooling after every even layer.
  • FIG. 5B shows a score-computation (SC) subnet that uses two fully-connected (FC) networks 72 (each with 1 hidden layer with 512 neurons) to compute patch-wise weights and errors, followed by a weighted averaging function 74 over all patches to compute the final image score s_A (or s_B).
  • the corresponding feature maps from the FE CONV layers at different depths are flattened and concatenated into feature vectors x_A^m, x_B^m, and x_R^m.
  • Using features from multiple layers has two advantages: 1) multiple CONV layers contain features from different scales of the input image, thereby leveraging both high-level and low-level features for error score computation, and 2) skip connections enable better gradient backpropagation through the network.
  • each SC subnet consists of two fully-connected (FC) networks.
  • the first FC network takes in the multi-layer feature difference x_A^m - x_R^m and predicts the patchwise error (s_A^m). These are aggregated using weighted averaging to compute the overall image error (s_A), where the weight for each patch (w_A^m) is computed using the second FC network.
  • this network uses the feature difference from the last CONV layer of the FE subnet as input (y_A^m - y_R^m), since the weight for a patch is akin to the higher-level patch saliency captured by deeper CONV layers.
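The patch-wise aggregation described above amounts to a weighted average of per-patch errors; a minimal sketch with illustrative names:

```python
def weighted_image_score(patch_errors, patch_weights):
    """Aggregate per-patch errors into a single image score.

    patch_errors[m] is the error predicted for patch m by the first FC
    network; patch_weights[m] is the saliency-like weight from the second.
    The result is the weighted average over all patches (s_A or s_B).
    """
    if len(patch_errors) != len(patch_weights) or not patch_errors:
        raise ValueError("need matching, non-empty error/weight lists")
    total_w = sum(patch_weights)
    return sum(e * w for e, w in zip(patch_errors, patch_weights)) / total_w

# Patches with higher weight dominate the final score.
print(weighted_image_score([1.0, 3.0], [1.0, 1.0]))  # 2.0
print(weighted_image_score([1.0, 3.0], [3.0, 1.0]))  # 1.5
```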
  • Training benefits from a random patch-sampling strategy, which prevents over-fitting and improves learning.
  • the system randomly samples 36 patches of size 64 x 64 from training images, which are of size 256 x 256.
  • the density of this patch sampling used to test the system ensures that any pixel in the input image is included in at least one patch with a high probability (0.900). This is in contrast with earlier approaches [S. Bosse, D. Maniry, K. Müller, T. Wiegand, and W. Samek].
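The sampling strategy itself can be sketched as follows (the function name and signature are illustrative, not from the patent):

```python
import random

def sample_patches(img_size=256, patch_size=64, n_patches=36, rng=None):
    """Top-left corners of randomly sampled square patches.

    Mirrors the described strategy: 36 random 64x64 patches drawn from a
    256x256 training image, each patch lying fully inside the image.
    """
    rng = rng or random.Random(0)
    hi = img_size - patch_size  # largest valid top-left coordinate
    return [(rng.randint(0, hi), rng.randint(0, hi)) for _ in range(n_patches)]

corners = sample_patches()
print(len(corners))  # 36 corners, each within [0, 192] in both axes
```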
  • the dataset contained 200 unique reference images (160 reference images are used for training and 40 for testing), which are selected from the Waterloo Exploration Database [K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, “Waterloo exploration database: New challenges for image quality assessment models,” IEEE Transactions on Image Processing, 26(2):1004-1016, 2017] because of its high-quality images.
  • the selected reference images are representative of a wide variety of real-world content.
  • the image size in the data set during experiments was 256 x 256, which is a popular size in computer vision and image processing applications. This size also enables crowd-sourced workers to evaluate the images without scrolling the screen.
  • the experimental dataset included a total of 75 distortions, with a total of 44 distortions in the training set, and 31 in the test set which are distinct from the training set.
  • the image distortions span the following categories: 1) common image artifacts (e.g., additive Gaussian noise, speckle noise); 2) distortions that capture important aspects of the HVS (e.g., non-eccentricity, contrast sensitivity); and 3) complex artifacts from computer vision and image processing algorithms (e.g., deblurring, denoising, super-resolution, compression, geometric transformations, color transformations, and reconstruction).
  • while recent IQA datasets cover some of the distortions in categories 1 and 2, they do not contain many distortions from category 3, even though these are important to computer vision and image processing.
  • a training set of 160 reference images and 44 distortions was used.
  • Each training example is a pairwise comparison consisting of a reference image R, two distorted versions A and B, along with a label p which is the estimated probabilistic human preference based on collected human data.
  • for each reference image R, two kinds of pairwise comparisons were conducted: inter-type and intra-type.
  • in an inter-type comparison, A and B are generated by applying two different types of distortions to R.
  • there are 4 groups of inter-type comparisons, each containing 15 distorted images generated using 15 randomly-sampled distortions.
  • in an intra-type comparison, A and B are generated by applying the same distortion to R with different parameters.
  • the test set was constructed with 40 reference images and 31 distortions, which are representative of a variety of image contents and visual effects. None of the test set images and distortions were in the training set. For each reference image, there were 15 distorted images with randomly-sampled distortions (sampled to ensure that the test set has both inter- and intra-type comparisons). Probabilistic labels are assigned to the exhaustive pairwise comparisons of the 15 distorted images for each reference. The test set then contained a total of 4,200 distorted image pairs (105 per reference image).
  • the optimization problem can be solved via gradient descent.
  • the invention was tested for performance compared to popular and state-of-the-art IQA methods.
  • An example implementation of the FIG. 1A approach outperformed all state-of-the-art methods.
  • two factors contribute to the robustness: first, the accurate human response collection scheme, which uses probability of preference as labels instead of unreliable absolute image quality scores as done in prior IQA methods, and second, the size of the dataset with many different and complex image distortions not created by the prior IQA methods.
  • the second improvement in performance comes from the novel learning framework which we designed to predict probabilistic preference labels during training instead of the noise-prone absolute image quality labels which results in a more accurate modeling of the human perception.
  • the invention can be used in a wide range of applications and industries.
  • the specific implementation described in the example above can be used for image/video compression algorithms (processing the image to reduce its storage/transmission while keeping it as similar as possible to the original), image searching (finding an image in the database that is most similar to an image provided), computer graphics rendering (computing samples of the image until it is within a certain visual distance X from a ground-truth reference), reconstruction algorithms (reconstructing/fixing a degraded image so that it is as close as possible to the original), and similar applications.
  • An implementation that computes quality can be used for similar applications, as well as specific tasks such as sorting pictures in an album based on quality, finding the best images on the internet, adjusting camera parameters (or applying post-process filters) to take the best possible image for the user, automatically selecting the best images for human inspection to weed out bad ones.
  • the PLF or NLF can also, for example, present human subjects data through devices to compare films or TV shows, and then the machine learning components can be trained to provide ratings for films or shows.
  • the invention can also be applied to a smaller subset of users to achieve more customized preference results.
  • the user study would be performed as normal but rather than lumping all of the data in one big set for training, subsets of users could be constructed based on user attributes such as personal preferences, previous user history, or user self-evaluation, to name a few.
  • the training process is then performed for each subset of users separately, which would allow the quality or similarity ratings for a specific object to vary between groups. In this way, a group of users who like horror movies would see a scary movie be given a “good” rating, while a group of users who do not like them would not.
  • Preferred systems collect data sets with human preference labels.
  • the system collects human labels based upon input through a user device and the labels indicate, for example, the fraction of queried human population that preferred one object over another based on the given application task.
  • this task is to determine which object was more similar to a given reference.
  • the user is asked to select the object that displayed that quality more strongly. In both cases, selecting between two choices suffers from less subjectivity than other measures of perceptual decision-making.
  • the number of objects shown to the user at a time could be extended beyond just two or three objects at a time, to help accelerate the acquisition process.
  • the subject could be shown multiple book titles (e.g., five) as an N-set and asked to rank them in order from least sad to most sad.
  • the user would effectively indicate the pairwise comparisons between all the books in the list and thus would produce more pairwise labels with fewer user questions.
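Expanding one such N-way ranking into its implied pairwise labels can be sketched as follows (object names are illustrative):

```python
from itertools import combinations

def ranking_to_pairwise(ranking):
    """Expand one N-way ranking into its implied pairwise preferences.

    `ranking` lists object ids from most-preferred to least-preferred;
    each returned (a, b) pair means the subject preferred a over b.
    """
    return [(a, b) for a, b in combinations(ranking, 2)]

# One 5-book ranking yields 5*4/2 = 10 pairwise labels.
pairs = ranking_to_pairwise(["b3", "b1", "b5", "b2", "b4"])
print(len(pairs))  # 10
```

This is why N-set prompting accelerates acquisition: a single user response produces N(N-1)/2 pairwise labels instead of one.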
  • the experimental data set in the example embodiment discussed above was created with the Amazon Mechanical Turk network, which presented image triplets to user devices (a triplet consists of two images and a reference) in order to create a dataset to train for estimating image similarity/error. Users were paid to label triplets by selecting which presented image they thought was closer to the reference. In the case of training for a perceptual quality, users would only be shown two images (i.e., a doublet). Another approach is to provide the data stimulus (e.g., triplet or doublet images) online on a server and obtain data via a crowdsourcing approach.
  • the system can collect other parameters, such as data relating to the time it took to make a selection, and gaze tracking (where did they look to see similarities/differences).
  • the system can also collect internal data that has been used for other purposes, such as a browsing and/or using history as an indication of preference to create the dataset. For example, if a retail shopping site history indicates that a user viewed two products and choose A over B, the PLF or NLF trained system can create scores for each product, either globally for all users or specifically tailored to individuals.


Abstract

A machine-implemented method for automatically determining a likely human preference or quality for a data object or the perceived similarity or distance between a data object and a reference. Comparison sets of data objects are provided to devices and human direct or indirect responses to the data objects are collected. The data objects are labelled with human preference or emotional labels and a pairwise or n-way comparison learning data set is built. A machine learning component is trained to predict a human preference or emotional response to data objects. The machine learning component can then evaluate and provide a predicted human perceptual quality or emotional response to a data object to evaluate.

Description

PAIR-WISE OR N-WAY LEARNING FRAMEWORK
FOR ERROR AND QUALITY ESTIMATION
PRIORITY CLAIM AND REFERENCE TO RELATED APPLICATION
[001] This application claims priority under 35 U.S.C. § 119 and all applicable statutes and treaties from prior provisional application serial number 62/680,393, which was filed June 4, 2018.
STATEMENT OF GOVERNMENT ASSISTANCE
[002] This invention was made with Government support under Grant Nos.
1619376 and 1321168 awarded by the National Science Foundation. The Government has certain rights in this invention.
FIELD
[003] Fields of the invention include computer vision, image analysis, computer auditory analysis and speech recognition, error detection, data quality evaluation, recommendation systems, machine learning and artificial intelligence. Example applications of the invention include image compression/coding, restoration, adaptive reconstruction, machine vision systems, image processing pipelines (such as in cameras and image modification/editing software), image rendering systems, advertisement/media/shopping recommendation systems, and gaming systems.
BACKGROUND
[004] Many areas of computer science and artificial intelligence seek to model human opinion or perception. For example, computer vision seeks to model human visual perception, one aspect of which focuses on mimicking human decision-making about the quality or error of an image or video (or a part of an image or video). Similarly, computer auditory analysis seeks to replicate human judgment regarding auditory data. Recommendation systems attempt to model human preference for media (such as films, music, books, etc.), products (such as clothing, toys, electronics, games, hotels, etc.), or services (house cleaning, plumbing, medical, etc.) to suggest ones that a customer might like (i.e., which ones are the best? which ones will I like the most?) or which are similar to something they are familiar with (i.e., which book is most similar to book X?). A fundamental hurdle for these and other similar artificial intelligence systems is the automatic computation of the fundamental perceptual quality score of a distorted image, good, or service (represented by a data object), which is required in order to rank them for the user. It is also difficult to automatically compute the perceptual similarity (or distance/error) of an image, good, or service with respect to a corresponding reference in a way that models human observers’ perception, which is necessary, for example, when recommending products that are similar to a given one.
[005] Because of the importance and potential impact of accurately modeling human preference and perception for many applications, these problems have received considerable attention in the past. For example, in the case where the data objects are images, an area known as full-reference image-quality assessment (FR-IQA) has been studied as a way to measure similarity or distances between an image and a reference. See, e.g., L. Zhang, L. Zhang, X. Mou, and D. Zhang, “A comprehensive evaluation of full reference image quality assessment algorithms,” In Proceedings of the IEEE International Conference on Image Processing (2012). Such algorithms can be used to control image/video compression schemes in order to produce more compact/efficient image/video representations that are visually similar to the original. To compute such a metric, many past efforts simply compute mathematical distances between the images based on norms such as L1 or L2, but these are known to be perceptually inaccurate. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, 13(4):600-612 (2004). Others have proposed metrics that try to exploit known aspects of the human visual system (HVS) such as contrast sensitivity, high-level structural acuity, and masking, or use other statistics/features. However, such hand-coded models are fundamentally limited by the difficulty of accurately modeling the complexity of the HVS and therefore do not work well in practice.
[006] Seeking improvement, others have proposed FR-IQA methods that employ machine learning to learn more sophisticated models [P. Gastaldo, R. Zunino, and J. Redi. Supporting visual quality assessment with machine learning. EURASIP Journal on Image and Video Processing, 2013(1):1-15, 2013]. Many learning-based methods use hand-crafted image features, and recently published methods apply deep learning to FR-IQA to learn features automatically [V. V. Lukin, N. N. Ponomarenko, O. I. Ieremeiev, K. O. Egiazarian, and J. Astola. Combining full-reference image visual quality metrics by neural network. In SPIE/IS&T Electronic Imaging, pages 93940K-93940K, 2015]. The accuracy of such prior learning-based methods depends on the size and quality of the datasets they are trained on, and existing IQA datasets are small and noisy. For instance, many datasets are labeled using a mean opinion score (MOS), where each user gives the distorted image a subjective quality rating (which can correspond, for example, to a score of between one and ten, where one is the worst quality score and ten is a perfect image score). These individual scores are then averaged in an attempt to reduce noise. Unfortunately, creating a good IQA dataset in this fashion is difficult because humans cannot assign quality or error labels to a distorted image consistently, even when comparing to a reference image that is designated as a perfect image.
[007] Other datasets (e.g., TID2008 and TID2013) leverage the fact that it is much easier for people to select which image from a distorted pair is closer to the reference than it is to assign quality scores. To translate user preferences into quality scores, such techniques then select a set of distorted images and apply a Swiss tournament to assign scores to each. A fundamental problem with this approach is that the same distorted image can have varying scores in different sets. Moreover, the number of images and distortion types in all of these datasets is very limited. The largest dataset in published methods that is known to the present inventors (TID2013) has only 25 images and 24 distortions. Therefore, machine learning-based methods trained on these datasets have limited generalizability to new distortions, which makes such machine learning approaches fall short of the human perceptual ability to compare images, even for simple comparisons.
[008] In another area of image/video analysis, algorithms have been developed to automatically determine the quality of a given image. In other words, given an input image, these algorithms act as a quality metric that tries to estimate the fundamental quality of the image taking into account human preferences and perception (i.e., how good is the image?). These algorithms, known as no-reference image-quality assessment (NR-IQA), can be used to automatically adjust consumer photographs to make them look better by optimizing the image to improve the calculated quality score, to automatically create photo-albums, collages, or other media by using the score to select the best photographs, or to reduce redundant images in a photo collection by saving only the ones with the best scores.
[009] For NR-IQA, significant effort has been devoted to assessing the quality of images that are degraded by specific types of distortions, such as blurring, blocking, and noise, or specific types of image processing operations, such as JPEG and JPEG2000 compression. Some methods have been proposed to cope with a combination of distortions that are known in advance. Although these approaches are valuable for analytical purposes, their use in real-world applications is limited, since in many practical applications the type of distortion is not known beforehand and, sometimes, the images may be corrupted by distortions that have not previously been seen or defined.
[0010] Others have focused on general-purpose NR-IQA methods that are not dedicated to any specific kind of image distortion. For instance, several methods have been developed based on the idea that there should be certain statistical properties associated with good-quality images. Others then further exploit the flexibility of machine learning-based approaches to obtain a mapping from hand-designed image features or statistics to quality scores by utilizing human-labeled datasets. These methods, however, are fundamentally limited, as it is very challenging to manually design universally-applicable features that are consistent with human perception, due to the complexity of the human visual system. More recently, several methods apply deep learning to learn relevant image features automatically. As discussed earlier, the accuracy of these learning-based approaches depends on the size and quality of the datasets they are trained on, and existing IQA datasets are small and noisy.
[0011] Beyond image/video analysis, similar research on quality assessment has been done into human perception of auditory signals (e.g., speech). A standard quality assessment framework is the Perceptual Evaluation of Audio Quality (PEAQ) proposed by the International Telecommunication Union (ITU). PEAQ is a computational model that takes as input a reference signal and a degraded signal, and outputs a score ranging from 1 to 5 (e.g., 5: excellent, 1: bad), by utilizing psychoacoustic and cognitive models. Some researchers have proposed modifications and improvements to the basic PEAQ model. Others propose audio (particularly, speech) quality assessment methods without a reference signal by utilizing spectral analysis or probabilistic modeling. Recently, machine learning-based methods have also been proposed, where quality metrics are trained to predict users’ Mean Opinion Scores (MOS). However, these scores have the same problems as those for images because they can be inconsistent and subjective.
[0012] Atcheson, US Patent No. 5,583,763, provides a system for predicting user preference. The determination is made based on the user's prior indicated preferences. The user designates his or her preferred selections as entries in a user's preference list. Entries in the user's list are compared with entries in the other users' lists. When a significant number of matches have been found between two lists, the unmatched entries of the other user's preference list are extracted. The unmatched entries are further processed. Those unmatched entries with a high correlation to the user's preference list are presented to the user as selections in which the user is likely to be interested. This system seeks to present a set of new items to a particular user based on the user's past preferences. The weights assigned to the new items (to be presented to the current user) are based on the preferences for these new items by those existing users who have similar past preferences to the current user.
[0013] Eriksson, WO 2014/137381, discloses a system for determining a predetermined number of top-ranked items by accepting a set of unranked items, the predetermined number, and a random selection of pairwise comparisons to create a graph structure using the set of unranked items and the random selection of pairwise comparisons, wherein the graph structure includes vertices corresponding to the items and edges corresponding to a pairwise ranking, and performing a depth-first search for each item that is an element of the set of unranked items for paths along the edges through the graph that are not greater than a length equal to the predetermined number. This system does not compute scores for every item but rather identifies a fixed number of top-k items based on a few random pairwise comparisons. The system is limited to a prescribed set of unranked items and only identifies a fixed number of top-ranked items by randomly doing pairwise comparisons of a few items from the set. It does not assign an accurate preference-based score to each item in the set.
[0014] Shen, US Patent No. 10,002,415 B2, develops a learning system to predict ratings of images to quantify aesthetic properties of the images. The training uses a dataset where a user annotates an image with their opinion of the image. This form of human annotation is unreliable due to the high variability of human opinions. Consequently, the Shen trained system is more prone to errors compared to the present invention, which uses a pairwise or N-way learning framework and a pairwise-comparison-based dataset or an N-way comparison-based dataset.
SUMMARY OF THE INVENTION
[0015] A preferred embodiment is a machine-implemented method for automatically determining a likely human preference or quality for a data object or the perceived similarity or distance between a data object and a reference. Comparison sets of data objects are provided to devices and human direct or indirect responses to the data objects are collected. The data objects are labelled with human preference or emotional labels and a pairwise or n-way comparison learning data set is built. A machine learning component is trained to predict a human preference or emotional response to data objects. The machine learning component can then evaluate and provide a predicted human perceptual quality or emotional response to a data object to evaluate, or a data object and a reference object and can provide the human-perceived similarity/difference between the data object to evaluate and the reference.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] FIGs. 1A and 1B are schematic diagrams of a preferred machine learning system of the invention configured to be trained with a pairwise or N-way dataset of human preference;
[0017] FIG. 2A is a flowchart illustrating steps of a preferred training method for a machine learning system of the invention; FIG. 2B is a flowchart illustrating steps of a preferred operation of a machine learning system of the invention;
[0018] FIG. 3 illustrates an assumption for a preferred pairwise-learning dataset, in that data objects are capable of being ordered on a scale;
[0019] FIG. 4 is a schematic diagram of a preferred pairwise-learning framework of the invention;
[0020] FIGs. 5A and 5B show the preferred error-estimation function f.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0021] Preferred embodiment machine learning methods and systems automatically compute visual similarity between two images or videos (or other data objects) in a manner that is consistent with human visual perception (or other human preference). In other words, machine learning is leveraged to enable a computer system to automatically qualitatively estimate how similar (or different) two images are from each other (or which of two or more data objects is preferred) in a manner that agrees with human observers. In the example of an image data object, the similarity/difference can be in terms of visual content (e.g., two images are similar if subjects cannot see any visible difference between the two), or in terms of emotional response (e.g., two images are similar if the subject has the same emotional response to them), depending on the application. For example, a photograph of a meadow and a photograph of a sunset would not be visually similar in terms of content (i.e., subjects could easily tell differences between the two images), but they might be similar in terms of how they make the subject feel. Likewise, other data objects, e.g., sound files for music, can be evaluated in terms of style, e.g., jazz or rock, in terms of emotion, e.g., upbeat or sad, or in terms of similarity, e.g., whether two sound clips sound identical to humans.
[0022] Preferred methods train a system with a ranking for data objects based on pairwise or N-way comparisons. The system's end goal during operation is to compute a quality score for a data object or to compute the error (or similarity) of a data object with respect to a reference. The system is trained and then operates to determine the quality of a single, stand-alone data object or the error of a data object with respect to a reference. The training is performed using human pairwise or N-way preferences on data objects, and in the process quality or error predictor sub-networks also get trained. During the evaluation stage (the stage that is used during system operation), these indirectly trained quality or error predictor sub-networks are used. Training these error or quality predictors permits the system to operate. Using the PLF or NLF is an indirect but effective way to train these predictors, because directly training them would require labeling single data objects with human perceptions of quality or error, and such labels are difficult to obtain. Once the system is trained and can be used to determine the qualities (or errors) of individual objects, the objects can then be easily ranked using these scores if desired. In other words, the ability of the system to rank the objects in the end is a side effect of the system being able to determine qualities (or errors) for individual objects. The approach is highly customizable in recommendation systems as it can work with many types of data objects, by collecting human pairwise or N-way data and training the system with that data. The only role humans play is to respond to prompts from a system so that the system can create the labeled dataset (e.g., with preference labels), which could be done before training of the machine learning component begins or during training. The machine learning training process itself is automatic and does not require human involvement.
[0023] A preferred embodiment is a machine learning system that can be leveraged to enable a computer system to automatically provide a qualitative evaluation of an image or video or other data object consistent with human observations, including a measure of one or more fundamental qualities of the image/video or other data object (i.e., how “good” is it?) as well as other emotional observations/responses, e.g., humor (“how funny is it?”), sadness, scariness, attractiveness, age-appropriateness. Other data objects include media (sounds, music, films, film scenes), goods (clothing, electronics, hotels), services, and other types of data objects. The qualitative estimate output by the computer system can be formatted to be processed by other algorithms via a score or another metric that provides a range of values or a set of semantically-meaningful ratings (e.g., “excellent,” “very good,” “good,” “bad,” or “terrible”).
[0024] Preferred systems of the invention include an interface for receiving the relevant data objects. For images, videos, or sound, a data object can simply be the digital representation of the appropriate media file. For other objects, such as arbitrary media files (e.g., films, books, etc.), or other goods and services, the data object can either be a digital representation of the object (e.g., the MPEG file for a film, the PDF of a book, reviews of a hotel from a travel website, to name a few) or simply a unique identifier (e.g., a name, a title, a catalog number, a URL, a street address) that uniquely distinguishes each data object from the others. The data object is provided to a machine learning component that has been trained with a “pairwise-learning framework” (PLF) or an “N-way learning framework” (NLF). During training, a human-labeled dataset is provided, which is acquired before the PLF or NLF is deployed to train the system. The dataset for this training is acquired by capturing human perception and judgement through pairwise or N-way comparisons of presented data objects. A property of the present PLF or NLF training is that although the system is trained using such a dataset consisting of human judgements or emotional responses on pairs or N-numbered sets of data objects, the system learns to predict fundamental quality or error values for individual data objects during evaluation/operation, even though human judgements of quality/error/similarity were never captured for single data objects in the training dataset. The machine learning component for PLF or NLF training can be selected from neural networks, deep neural networks, multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), deep CNNs, recurrent neural networks, autoencoder neural networks, long short-term memory (LSTM) networks, generative adversarial networks (GANs), support vector machines, or random forests, as examples.
[0025] Preferred methods of the invention provide data-driven, machine-learning-based PLF or NLF training that trains a machine learning model to predict the perceived similarity (or dissimilarity) between images, or alternatively can be used to estimate the human-perceived fundamental quality of a single image (or other objects of human perception). The PLF or NLF training approach requires a data set of human observations that is generated by presenting two or more data objects for comparison (e.g., two images) and asking a population of humans to indicate which of the two or more input data objects is preferred. The invention leverages the inventors' determination that people are better at deciding which of two or more data objects is preferred, which provides a more reliable measure than is obtained by having people assign absolute quality or similarity (distance) values to each, even when people have multiple objects presented to them when assigning absolute values.
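The link between latent per-object scores and pairwise preference data can be sketched with one common modeling assumption (a Bradley-Terry/Thurstone-style logistic model; this is an illustrative sketch, not a limitation of the framework, and the function name is invented here): the probability that a subject prefers object A over object B is a logistic function of the difference of the two objects' quality scores.

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Modeled probability that a human prefers object A over object B,
    as a logistic function of the latent quality-score difference."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))
```

Equal scores give 0.5 (no preference), and the two orderings of a pair sum to one, which matches interpreting the human label as the fraction of subjects who chose A.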
[0026] In the case of trying to estimate the fundamental perceptual quality of the object, the human subjects are asked through a computer system which of the two or more objects appears to have more or most of the desired perceptual quality (e.g., which one is better, funnier, scarier, sadder, etc.), or to rank more than two objects in order of preference. Likewise, when trying to estimate the similarity (or difference) between an object and a reference, the human subjects are asked through a computer system which of the two or more data objects is more or most similar to (or different from) the given reference, or to rank the similarity of multiple data objects to the reference in order. In the case of trying to estimate visual content similarity, the system prompts can ask the users “which object, A or B, is more similar to the reference?” or “which object, A, B, or C, is most similar to the reference and which object, A, B, or C, is least similar to the reference?” In the case of trying to estimate emotional similarity, subjects can be prompted by the system to answer “which object, A or B, makes you feel more similar to the reference?” or “which object, A, B, or C, makes you feel most similar to the reference and which object, A, B, or C, makes you feel the least similar to the reference?” Thus, the first prompts measure content similarity/distance, while the second prompts measure emotional similarity/distance. Usually, most applications seek to compute only one measure or the other.
[0027] The total number of possible pairwise or N-way comparisons between any two or more objects in a dataset can be much larger than the total number of objects in the dataset itself. Therefore, in cases where the number of objects is very large (e.g., the full video library of an online movie streaming service), acquiring human preference responses for all pairs or N-sets of objects can be time-consuming and/or expensive. To make the acquisition of human labels on pairwise or N-way comparisons less expensive and more practical in these situations, the system can instead acquire human preferences for a subset of the total possible pairs or N-sets of objects. Estimation can be used for filling in missing human preference labels (for pairs of objects for which humans have not provided preference information) before the machine-learning training process. One type of estimation is maximum-likelihood estimation of the missing labels. The prompting component 12 can include a maximum likelihood estimation function that fills in missing labels to create a full labeled dataset to train the error estimation components 14a and 14b. As an example, assume that there are 100 data objects, and a total of 4,950 ways to select data object pairs. Instead of selecting all possible pairs and prompting for human preference for all of them (which would take a long time and be expensive), the prompting component selects a smaller subset of pairs (e.g., 3,000) for which to seek human preference. The prompting component 12 then, using these human-labeled pairs, estimates the true preference labels for the remaining pairs using a maximum likelihood approach and creates the full training dataset which contains pairs (or N-sets) of objects and their respective preference labels for each pair (or N-set). A preferred approach uses a “maximum likelihood” estimation of the missing pairwise-preference probability labels [K. Tsukida and M. R. Gupta. How to analyze paired comparison data. Technical report, University of Washington, Seattle, Department of Electrical Engineering, 2011] using the measured responses. Such a technique utilizes a limited budget but maximizes the number of object pairs or N-sets that can be labeled with human responses. Specifically, Tsukida and Gupta teach an MLE estimation of missing pairwise labels and provide a method to, given a set of data objects with a subset of all possible pairings labeled with human preferences, use MLE to obtain the human preference for the remaining data object pairs that are not labeled with human preference.
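One concrete way such fill-in can work (a sketch only: the Tsukida-Gupta report covers several estimators, and the function names here are invented for illustration) is to fit a Bradley-Terry model to the observed win counts with the standard minorization-maximization updates; the fitted strengths then give a preference probability for every pair, including pairs never shown to human subjects.

```python
def bradley_terry_mle(wins, iters=500):
    """Fit Bradley-Terry preference strengths to a (possibly sparse)
    pairwise win-count matrix via minorization-maximization updates.
    wins[i][j] = number of subjects who preferred object i over object j;
    unobserved pairs simply contribute zero comparisons."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        for i in range(n):
            denom = 0.0
            for j in range(n):
                total = wins[i][j] + wins[j][i]
                if i != j and total > 0:
                    denom += total / (p[i] + p[j])
            if denom > 0:
                p[i] = sum(wins[i]) / denom
        s = sum(p)
        p = [x / s for x in p]  # fix the arbitrary overall scale
    return p

def estimated_preference(p, i, j):
    """Predicted probability that object i is preferred over object j,
    available even for pairs never labeled by human subjects."""
    return p[i] / (p[i] + p[j])
```

For example, if objects 0 and 2 were never directly compared but 0 usually beat 1 and 1 usually beat 2, the fitted strengths predict a preference of 0 over 2, filling in the missing label.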
[0028] Once the dataset has been labeled with human responses, the PLF or NLF trained system is then trained via machine learning with the dataset. Notably, although the PLF or NLF trained system is trained using pairs or N-sets of data objects, it can then be used to evaluate the error or quality of a single data object during the evaluation stage. Specifically, by predicting the percentage of people that selected one data object over another, the system learns to predict individual scores for each data object that have fundamental semantic meaning given the application.
[0029] An example application of the invention is an automated system for automatically measuring the “scariness rating” of a movie (i.e., how scary is it?). Hand-coded algorithms are ill-suited for this purpose since it is hard to quantify how to measure “scariness.” Furthermore, creating a proper dataset for machine-learning is difficult because people cannot assign scariness ratings to films on a consistent basis (e.g., rating the scariness of a given film from 0 to 100 is not easy). Instead, in an example, a dataset is created by showing human subjects data objects concerning pairs of movies (e.g., by displaying the titles or movie posters) and asking them to select which movie was scarier. After gathering this information from many users and many pairs of movies, the system can be trained to predict the percentage of people who selected one movie over another given the unique identifier or digital files for each film as input. In doing so, the machine-learning component of the PLF trained system learns to assign accurate scariness “scores” to each film so that the probability of preferring it matches the user study. Once this training process is complete, the machine-learning component of the PLF trained system can evaluate a new film (given by its unique identifier or digital file) and automatically assign an accurate scariness rating that is consistent with the human population of the user study. This scariness rating can then be used to rank movies, e.g., “list the scariest (or least-scary) movies” for display, to create automatically-generated playlists, or in parental-control systems, to name a few applications.
[0030] This example embodiment can be changed to measure the funniness, “goodness,” and other perceptual qualities of films by changing the question that is asked of the users during data acquisition. Likewise, it can also be applied to other media, goods, and services by changing the data objects referenced during the user study. For example, a database of individual television commercials can be created, and users can be shown pairs of commercials and asked which one was better, more interesting, more attention-getting, etc. Once the dataset is created, the system can be trained by providing the information on the commercials (e.g., the digital video file) and learn to predict the probability of human preference for one commercial over the other. In the process of doing so, the system learns to output quality values for the commercials. Such a system can be used to automatically “judge” commercials under development (e.g., to see which ad appeals more to a certain demographic, or which political ad is more convincing, etc.), questions which are typically answered using human focus groups, which can be expensive and time-consuming. This approach can be applied for evaluating movies, video games, books, clothing, appliances, etc.
[0031] In another application, a system of the invention can suggest similar songs for an online streaming music service. Subjects are presented with two or more songs (e.g., by playing clips of the songs or the full songs, or by simply showing the titles, to name a few possibilities) as well as a reference song, and asked to select which of the two songs makes them feel the same way as (or closer to) the reference song, or which makes them feel the most and which the least the same way as the reference. After gathering a set of choices in response to prompts and determining the probability of preference for each training pair, the system can then be trained to predict the emotional similarity between songs given, e.g., the audio file for each song. Such a system can be used to find songs that have the same emotional connection with the listener as a given song (“Play a song like X.”). This approach can also be extended for estimating emotional similarities/distances for films, books, games, etc.
[0032] The PLF or NLF training/testing is preferably automated, and can be conducted locally or remotely via a network. For example, in a remote network example and for the application of image analysis, pairs or N-sets of images are selected from an image data set and provided to a device over the network, such as a workstation, smart TV, laptop, tablet or smart phone. A user of the device then selects the preferred image, or the most and least preferred images. The acquisition of human preference between pairs of images or N-sets of images can leverage applications and social networks, and can be part of a contest or a game to attract interest. The selection of images can also be part of a verification step, e.g., to access particular content. The data acquisition can also be in a controlled environment, with volunteers or paid human testers, or by subscribers to the service. For example, an online film streaming service can choose to query users who have watched a particular pair or N-set of movies to get the data for comparisons, or an online book seller might ask people who have read the two books.
[0033] The selection process by a system prompting a population of human subjects builds a dataset that can be used to train the machine learning approaches to estimate individual quality ratings or error (similarity) scores for each image or other data objects using the present PLF or NLF methods. The PLF or NLF trained system can determine error scores (or similarity scores or quality labels) for single data objects after being trained on the pairwise or N-set preference labels. In a preferred embodiment for predicting perceived visual error (similarity) in images/video, during training, the inputs to the system are a reference image and a pair or more of distorted images derived from the reference (the distorted image(s) can also be different images from the reference, in another variation). The output of the system during training stage is the probability that humans would prefer one distorted image of the pair over the other(s). During evaluation stage, the error estimation component trained with the PLF or NLF can then predict visual error of a given distorted image.
[0034] In preferred training to estimate similarity (or error/distances), a system of the invention first creates a large dataset by showing multiple users, through local or remote devices, many image triplets (sets of two distorted images and the corresponding reference) and obtaining selection data regarding the human choice of the closest image between the two. The system stores the percentage of people who, for example, chose A over B, which is effectively the probability of selecting A. The system designates these acquired probabilities as ground-truth labels for each triplet, which are then used to train the learning model using back-propagation. The machine-learning component is thereby taught to predict the probability of human preference given the input images.
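The back-propagation training on triplet labels can be sketched with a toy stand-in for the error-estimation network (plain-Python gradient descent on free per-image scores rather than a neural network; the pairs, probabilities, learning rate, and variable names are all invented for illustration). The sketch demonstrates the property stated in the text: fitting only pairwise preference probabilities still yields an individual error score per image.

```python
import math

# Toy triplet-derived labels: for each (a, b, p_ab), p_ab is the measured
# fraction of subjects who judged distorted image a closer to the
# reference than distorted image b.
pairs = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.7), (0, 3, 0.95)]

errors = [0.0, 0.0, 0.0, 0.0]  # per-image error scores, learned indirectly
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for _ in range(5000):
    for a, b, p_ab in pairs:
        # Predicted probability that a is preferred: the image with the
        # lower error score should win the comparison.
        p_hat = sigmoid(errors[b] - errors[a])
        # Gradient of the binary cross-entropy loss with respect to the
        # score difference (errors[b] - errors[a]) is (p_hat - p_ab).
        g = p_hat - p_ab
        errors[a] += lr * g  # preferred more than predicted -> lower error
        errors[b] -= lr * g
```

After training, the learned scores order the four images from least to most distorted, even though no single-image label was ever provided, mirroring how the trained error-estimation component can score a single image at evaluation time.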
[0035] Despite training on human preference of image pairs or N-sets, the PLF- or NLF-trained architecture allows the machine learning component to predict single-image visual error/quality values after the training/optimization process is complete. Extensive experimental verification has confirmed that a network trained on a PLF dataset outperforms the state of the art.
[0036] Preferred implementations compute the similarity/error/distances of images, which includes segregation of error estimation blocks into feature extraction and score computation sub-blocks. Important general system components and method steps are realized via error estimation steps that feed a probability estimation analysis component.

[0037] In a preferred machine-implemented method for automatically determining a likely human preference or quality for a data object or the perceived similarity or distance between a data object and a reference, the method includes providing comparison sets of data objects to devices and collecting human direct or indirect responses to the data objects or to references of the data objects. The method includes labelling the data objects with human preference or emotional labels and building a pairwise or n-way comparison learning data set. A machine learning component is trained with the learning data set to predict a human preference or emotional response to data objects, wherein the learning comprises estimating a probability of human preference or emotion, determining an error of the estimated probability of human preference or emotion, and updating parameters of a learning component to reduce the error of the estimated probability. The machine learning component then receives either a data object (or a reference thereto) to evaluate and provides a predicted human perceptual quality or emotional response to that data object, or a data object (or a reference thereto) together with a reference object and provides the human-perceived similarity/difference between the data object to evaluate and the reference.
The machine learning component can have plural machine learning components, with each of the plural machine learning components receiving a data object and computing the perceptual quality of the given data object, which is then provided to a probability component that determines a likelihood of preference between separate data objects provided to the plural machine learning components. The human direct response can be a selection of which object exhibits a desired perceptual quality more strongly, wherein the perceptual quality can be funnier, scarier, sadder, more serious, more appropriate for a given subset of the population, or better. The machine learning component can include plural machine learning components, with each of the plural machine learning components receiving a data object and a reference object and computing the perceptual similarity or difference of the given object with respect to the reference, which is then provided to a probability component that determines a likelihood of preference between separate data objects provided to the plural machine learning components. The data objects can represent one of sounds, songs, films, TV shows, advertisements, movies, videos, books, clothing, toys, or electronics.
[0038] A preferred system for automatically determining a likely human preference or quality for a data object or the perceived similarity or distance between a data object and a reference includes a dataset including a set of pairwise labels for data objects indicating human selections of comparisons between pairs or N-sets of data objects obtained from a prompting component. A machine learning component is trained on the data set via a probability estimation component, the machine learning component having an input for receiving data objects during testing and having code for updating parameters during training based upon an error distance between predictions of the probability estimation component and the human probability of preference. The machine learning component comprises one of a neural network, deep neural network, multi-layer perceptron, convolutional network, deep convolutional network, recurrent neural network, autoencoder neural network, long short-term memory network, generative adversarial network, support vector machine, or random forest. The dataset can include labels generated from human selections and additional labels estimated from human selections.
[0039] Preferred embodiments of the invention will now be discussed with respect to the drawings and experiments used to demonstrate the invention. The drawings may include schematic representations, which will be understood by artisans in view of the general knowledge in the art and the description that follows.

[0040] In a preferred embodiment, schematically represented in FIG. 1A, training includes obtaining human ratings in response to prompts from a system with a prompting component 12 (e.g., computer networks, handheld devices, etc.) that provides a group of test subjects with a reference object and a series of A images and B images. Error estimation components 14a and 14b conduct error estimations (which can be accomplished via parallel processing) and estimate error (one for error A and one for error B), taking as input one of the two images and the reference image. The error estimation components 14a and 14b are shown separately to convey that A images and B images both undergo the error/quality estimation process (in an identical manner) to output the respective error/quality, which is received by the probability estimator 16. In a preferred implementation, if the aim is to estimate quality, then a single data object is input to a single quality estimation component 14a or 14b whose aim is to predict the quality of that single input data object. And, if the aim is to compute the error, then a single data object and the reference data object are input to quality estimation component 14a or 14b. The quality/error-estimation components 14a and 14b are preferably composed of a single type of learning model (i.e., the same architecture in both), and can be integrated into a single machine-learning model. When the same model is used, only one quality/error-estimation component 14a or 14b needs to be used during operation after training because quality/error-estimation components 14a and 14b implement the same function.
A probability-estimation component 16 is used during training only to obtain predicted probability of preference between the input data objects, as the system uses the actual probability of preference labels from humans to train the system. The actual architecture for the error-estimation component can be selected from neural networks, deep neural networks, multi-layer perceptrons (MLP), convolutional networks (CNNs), deep CNNs, recurrent neural networks, autoencoder neural networks, long short-term memory (LSTM) networks, generative adversarial networks (GANs), support vector machines, or random forests, as examples.
[0041] The probability-estimation component 16 acquires "scores" computed by the quality/error-estimation components 14a and 14b and uses them to estimate the probability of preference of A over B. In a preferred implementation, the probability is estimated by subtracting the two inputs and passing the result through a sigmoid function. This technique assigns the scores of A and B a meaningful value that is based on the Bradley-Terry statistical model of human preference. In a variation, the sigmoid function is replaced by another function (e.g., a line or a polynomial). In one implementation, the entire probability-estimation component 16 is a single machine-learning component that automatically computes the probability (so no subtraction or sigmoid). Therefore, instead of using the Bradley-Terry model to estimate probabilities from the estimated scores, the machine-learning model provides the probability estimation. In such an implementation, there are three machine-learning components: a quality/similarity/error estimation machine learning component 14a to estimate the quality/similarity/error of object A, a quality/similarity/error estimation machine learning component 14b to estimate the quality/similarity/error of object B, and a machine learning component 16 to estimate the probability of preference. The entire system is trained end-to-end to match the human probability labels.
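The subtract-and-sigmoid computation performed by the probability-estimation component 16 can be sketched as follows. This is a minimal illustration; the function name is ours, and the sign convention (lower score preferred) matches the negated-exponent Bradley-Terry form discussed later in the specification.

```python
import math

def bradley_terry_preference(s_a, s_b):
    """Probability of preferring A over B given perceptual-error scores.
    The exponent is negated relative to the standard BT model, so a
    LOWER score (image closer to the reference) yields a HIGHER
    probability of being preferred."""
    return 1.0 / (1.0 + math.exp(s_a - s_b))

# Equal scores -> 50/50 preference, the expected result when humans
# cannot distinguish the two distorted images.
assert abs(bradley_terry_preference(1.0, 1.0) - 0.5) < 1e-12

# A has a lower perceptual error than B, so A is preferred.
print(bradley_terry_preference(0.5, 2.0))  # ~0.82
```

The 50% case illustrates the behavior noted elsewhere in the specification: when a large pool of humans is evenly split, the model assigns the two images similar scores.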
[0042] In the preferred embodiment during training, the error-estimation machine learning components 14a and 14b estimate the implicit error (or distance) between image A or B and the reference, and the subsequent component 16 translates these errors into a probability of preference. A machine-learning model (component 14a or 14b) trained on the probability of preference labels using PLF or NLF training can therefore learn to estimate the implicit error scores. In the preferred embodiment, error estimation component 14a (or 14b) is the only machine learning model. In another variation, the probability estimation component 16 is also a trainable component trained using machine learning. However, in the evaluation stage, when trying to compute the error (or quality) of an object, the probability estimation component 16 does not need to be used. Note that in cases when humans do not agree which image is better (say, when the images are so similar that it is a 50-50 guess after sampling a large pool), the 50% probability of preference would translate to similar scores for each image, which is the expected result. Once trained, one of the quality/error-estimation components 14a and 14b can be used in stand-alone fashion, where it would take in a distorted image and the original reference and would output its estimated error (distance). During operation, machine learning components 14a and 14b implement the same mathematical function in the example implementation, so it does not matter which of 14a or 14b is used during operation in this case (i.e., they are identical). In preferred embodiments, all of the error/quality estimators 14a, 14b and the probability-estimation component 16 are trained through the human pairwise or N-way preference dataset. However, the probability-estimation component 16 can also be fixed.
[0043] While images are shown being compared with a reference image in FIG. 1A, the prompting component 12 can instead train the error estimation components 14a and 14b on a comparison of other data objects A or B. Examples are discussed above, and human preference can be applied to other data objects (sounds, songs, films, books, etc.) as the basis for training. In these cases, the system would learn to measure the similarity (or distance) between objects in the corresponding domain. For example, a PLF or NLF system trained on movies learns to compute similarities and distances for movies and could answer questions such as "What are the movies most similar to film X?" by simply ranking the films in order of their similarity score to film X.

[0044] While images are shown being compared with a reference image in FIG. 1A, the prompting component 12 can instead train the error estimation PLF or NLF components 14a and 14b on a comparison between images A or B (or objects A or B) without using a reference object, as shown in FIG. 1B. In this approach, human ratings are obtained in response to a series of A and B images provided by the prompting component and are used to estimate a perceptual quality (e.g., goodness, funniness, scariness, etc.) of the objects, as opposed to the similarity/error/distance, which is a relative measure that requires a reference. The corresponding pairwise or N-set learning framework 14a and 14b and 16 then uses relative comparisons between objects. For NLF training, there would be N copies of component 14. For example, with 10-set inputs, there would be 10 copies of component 14 (i.e., 14a-14j). For each input image A to the estimator 14a, a corresponding image B is fed to the error/quality estimator 14b at the same time. This provides information concerning the relative preference of A over B for each input data object pair.
[0045] In FIG. 1B, the error/quality estimator machine-learning components 14a and 14b now estimate quality. The quality-estimation machine learning components 14a and 14b therefore estimate the quality of A and the quality of B, and the probability estimation component 16 determines a probability of human preference for A over B. The prompting component 12 shows human groups pairs of images and asks each person to select the one that demonstrates a stronger quality. The prompting component preferably builds the PLF or NLF dataset in advance of training the system components 14a, 14b. There is usually no "prompting" of humans during training when the object pairs have been labeled by the prompting component 12 in advance of training. For example, for a system capable of estimating humor in an image, human subjects would be shown pairs of images and asked to select the funnier image. After being trained with pairs like this, the system is able to take in a single image and automatically compute an accurate "funniness" rating that is consistent with the human subjects. The probability estimation component 16 plays a key role during training but not during later operation (also called testing). During training, the probability estimator 16 takes in as input the quality or error scores output by error/quality estimation components 14a and 14b (which at the start of training are very inaccurate) and uses them to predict the probability of preference between the input pair or N-set of data objects. The learning process aims to match this predicted probability of preference to the actual human preference amongst the input data objects. Therefore, in this training process, the error/quality estimators 14a, 14b will be trained to predict error or quality scores accurately.
In other words, by training the machine-learning model 14a, 14b on the probability of preference labels by passing their outputs through the probability of preference prediction block 16, the system can learn the implicit quality scores. During testing, the aim is to predict the error or quality of a given single data object using PLF or NLF trained error/quality estimator 14a or 14b. Testing showed that the present system provides better performance than prior systems.
[0046] The "quality" can encompass a variety of human preferences or emotional responses beyond simply measuring generic quality or likeability. In one example, the system can present user devices with radically different images intended to create a specific emotion, such as sadness or shock. Radically different images for such an emotional response can be, e.g., a car crash scene versus a baby or an injured bird, and the measurement will gauge the relative emotional response between the different images even though the images include radically different objects/scenes. Additional example qualities can relate to artistic styles, color schemes, story content, etc. Similarly, with music objects, radically different styles of music samples can be presented, and human subjects queried through their devices about which sample causes more sadness, humor, happiness, etc.
[0047] A specific example embodiment will be illustrated with respect to testing of a PLF system, discussing the development of the data set built in a training phase via the prompting component 12 of FIG. 1A, with the data set using a reference image. The prompting component collected labels for image pairs according to the percentage of persons who preferred an image A over an image B as being more consistent with a reference image. A value of 50% indicates that both images are equally "distant" from the reference, while a split of 70% for image A and 30% for image B indicates that image A is substantially closer to the reference.
[0048] These pairwise preference probabilities are used as ground-truth labels, which provides a larger and more robust dataset than those of previous image quality assessment (IQA) methods. The PLF dataset trains the error/quality machine learning components 14a, 14b by comparing the machine-predicted probability of preference (the output of probability estimation component 16, which receives the outputs of 14a and 14b) to the actual human probability of preference and generating the training signal for 14a, 14b and 16.
[0049] The choice for the error-estimation function is flexible, and in the experimental system a new deep convolutional neural network (DCNN) was developed. The errors of A and B are then used to compute the predicted probability of preference for the image pair in the probability estimation component 16, which is trained using the pairwise probabilities. After training, the learned error-estimation function can be used on a single image A and a reference R to compute the perceptual error of A with respect to R. This allows automatic quantification of the perceived error of a distorted image with respect to a reference, even though the system was never explicitly trained with hand-labeled, perceptual-error scores.

[0050] The principal steps executed by preferred software code for a specific implementation of the preferred method for predicting error between a data object A and a reference data object R are shown in FIGs. 2A and 2B. The code of FIG. 2A is training code, which is used to train the computational model (14a and 14b) starting from randomly initialized parameters of the learning component. The training code receives 20 input image pairs A, B and a reference R (during the task of image error prediction). The code selects 22 an image pair and reference 24 (A, B and R) for training evaluation. Feature extraction 26 is conducted to obtain the relevant features 28 for A, B and R using DCNNs. The features of A and R are received by a subsequent fully-connected neural network (FCNN) to compute 30 the final error 32 between A and R. Similar processing is performed for the features of B and R to compute the final error between B and R. These computed errors are then received by the probability of preference prediction component 16 to compute 34 the machine-predicted probability of preference 36 between A and B. The learning components 14a and 14b in the model thus far are the feature extraction DCNN and the final error computation FCNN.
A more flexible model design could include the probability of preference prediction component as a learning component as well. The machine-predicted probability of preference is compared to the actual human preference 38 between A and B to compute 40 the prediction error 42. The loss (or error) between the predicted probability of preference and the true labeled probability of preference 42 is used in a gradient computation 44; a high prediction error indicates that the parameters of the learning components are not yet correct, and a training signal 46 is generated to update 48 the parameters of the learning components. The preferred method of FIG. 2A uses the gradient-descent method (i.e., backpropagation, as it is known in the art) to update parameters and minimize this loss or prediction error 42. This process is repeated until the prediction error is low for all image pairs present in the training dataset. The error can ideally go as low as 0, which implies that the machine has learned to perfectly predict human preference. In practice, the training continues until there is no more change in the prediction error and the error is low (e.g., 10^-4). The second form of code is the testing code, with operations and information illustrated in FIG. 2B. A test image A and reference R 50 are provided to the trained CNN and FCNN components (14a and 14b), which then predict the error between the image and the reference image. Specifically, the CNN extracts 52 features 54 of the image and its reference and feeds them to the FCNN, which predicts 56 the final error 58 between the image and the reference.
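The training loop of FIG. 2A can be illustrated with a deliberately tiny stand-in model: a single parameter w takes the place of the DCNN/FCNN, the error score is modeled as s = w * d for a scalar distortion magnitude d, and the preference probability uses the negated-exponent sigmoid. The distortion magnitudes and human labels below are invented for the sketch.

```python
import math

def sigmoid_neg(z):
    # Bradley-Terry with negated exponent: p = 1 / (1 + e^z), z = sA - sB
    return 1.0 / (1.0 + math.exp(z))

# Toy training data: (dA, dB, human probability of preferring A).
# dA and dB stand in for distortion magnitudes of images A and B; labels
# are invented, roughly encoding "less distortion is preferred".
pairs = [(0.2, 0.8, 0.9), (0.5, 0.5, 0.5), (0.9, 0.1, 0.1)]

w = 0.0   # single parameter of the toy error model s = w * d
lr = 1.0  # learning rate
for _ in range(2000):
    for d_a, d_b, p_true in pairs:
        z = w * (d_a - d_b)          # sA - sB
        p_hat = sigmoid_neg(z)
        # backpropagate the squared prediction error through the sigmoid:
        # dL/dw = 2(p_hat - p) * dp_hat/dz * dz/dw, dp_hat/dz = -p_hat(1-p_hat)
        grad = 2 * (p_hat - p_true) * (-p_hat * (1 - p_hat)) * (d_a - d_b)
        w -= lr * grad

# After training, the model prefers the less-distorted image.
print(sigmoid_neg(w * (0.2 - 0.8)))
```

As in the full system, the parameters are updated by gradient descent until the predicted preference probabilities match the human labels; only the error model itself is needed at evaluation time.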
[0051] Unlike previous IQA datasets, a system of the invention constructs a dataset that focuses exclusively on the probability of pairwise or N-way preference, which can be obtained via prompting with pairwise or N-set data, as discussed above. With prompting pairs or N-sets, the prompting component provides test subjects with two or more distorted versions (A and B) of a reference image R, and the subjects are prompted to select the one(s) that looks more similar to R. The system then stores the percentage of people who selected image A over B as the ground-truth label for this pair, which is the probability of preference of A over B (pAB). This approach is robust, and does not suffer from set-dependency or scalability issues like Swiss tournaments because the images (or other data objects) are never labelled with quality scores.
[0052] As seen in FIG. 3, the data set permits distorted images to be placed on a scale based on their underlying perceptual-error scores (e.g., sA, sB, sC) with respect to the reference. The reference is assigned an error of 0. The probability of preferring distorted image A over B can be computed by applying a function h to their errors, e.g., pAB = h(sA, sB). All distorted images can be mapped to a 1-D "perceptual-error" axis (similar to IQA), with the reference at the origin and distorted versions placed at varying distances from the origin based on their perceptual error (images that are more perceptually similar to the reference are closer, others farther away). Note that since each reference image has its own quality axis, comparing a distorted version of one reference to that of another does not make logical sense. This alignment along a scale is an assumption on which the machine learning components 14a/14b are designed (not a feature of the dataset). Specifically, in the preferred training method, data objects are assumed to be capable of being lined up on a scale in order of their "goodness" (represented by error or quality). Training for quality or error values for a data object can work for any class of data objects which can be "ranked" in a unique manner.
[0053] Given this axis, a function h can be defined which takes the perceptual-error scores of A and B (denoted by sA and sB, respectively), and computes the probability of preferring A over B: pAB = h(sA, sB). The function h defines the probability component 16, and f defines the error/quality estimation components 14a/14b. One suitable function, applied experimentally, is the Bradley-Terry (BT) sigmoid model [R. A. Bradley and M. E. Terry, "Rank analysis of incomplete block designs: I. The method of paired comparisons," Biometrika, 39(3/4):324-345, 1952]. This function has successfully modeled human responses for pairwise comparisons in other applications, for example, when exploring a large collection of font faces based on human rankings for attributes such as "dramatic" or "legibility" [P. O'Donovan, J. Libeks, A. Agarwala, and A. Hertzmann, "Exploratory font selection using crowdsourced attributes," ACM Transactions on Graphics, 33(4):92, 2014], or when organizing photo collections based on human preferences [H. Chang, F. Yu, J. Wang, D. Ashley, and A. Finkelstein, "Automatic triage for a photo series," ACM Transactions on Graphics, 35(4):148, 2016].
[0054]     pAB = h(sA, sB) = 1 / (1 + e^(sA - sB))     (Eq. 1)
[0055] Unlike the standard BT model, the exponent here is negated so that lower scores are assigned to images visually closer to the reference. Given this assignment, the goal is then to learn a function f that maps a distorted image to its perceptual error with respect to the reference, constrained by the observed probabilities of preference. A preferred general optimization framework to train f is:
[0056]     min over θ of (1/T) Σi=1..T ( h( f(Ai, Ri; θ), f(Bi, Ri; θ) ) - pAB,i )^2     (Eq. 2)
[0057] where θ denotes the parameters of the image error-estimation function f, pAB,i is the ground-truth probability of preference based on human responses, and T is the total number of training pairs. Eq. 2 represents a theoretical explanation of the training process used with a pair-wise learning dataset. If the training data is fitted correctly and is sufficient in terms of images and distortions, Eq. 2 will train f to estimate the underlying perceptual-error scores for every image so that their relative spacing on the image quality scale will match their pairwise probabilities (enforced by Eq. 1), with images that are closer to the reference having smaller scores. An ideal "fitting" would lead to zero error in Eq. 2 for a very large dataset; low error is discussed above. The dataset size can depend on the number of parameters of the neural network model and the number of target image distortions. These underlying perceptual errors are estimated up to an additive constant, as only the relative distances between images are constrained by Eq. 1. This constant can be accounted for by setting the error of the reference with itself to 0.
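Because Eq. 1 only constrains score differences, the perceptual-error scores are recoverable only up to an additive constant, which is resolved by pinning the reference score to 0. A minimal sketch of inverting Eq. 1 (the preference probabilities below are invented for illustration):

```python
import math

def score_gap(p_ab):
    """Invert Eq. 1: from the probability of preferring A over B,
    recover the score difference sA - sB. Only relative distances are
    constrained, hence the additive-constant ambiguity."""
    return math.log((1.0 - p_ab) / p_ab)

# Hypothetical human labels: R preferred over A 73% of the time,
# A preferred over B 62% of the time.
s_r = 0.0                    # pin the reference to the origin
s_a = s_r - score_gap(0.73)  # sR - sA = log((1-0.73)/0.73), so sA ~ 0.99
s_b = s_a - score_gap(0.62)  # sA - sB = log((1-0.62)/0.62), so sB ~ 1.48
print(s_a, s_b)
```

With the reference fixed at the origin, both distorted images land at positive distances, the less-preferred image B farther out, exactly the 1-D perceptual-error axis of FIG. 3.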
[0058] FIG. 4 shows a preferred pairwise-learning framework that consists of error-estimation components 14a, 14b and a probability-estimation function 16. The two error estimation components (14a and 14b) perform the same function. Each error estimation component (14a or 14b) has only two feature extraction blocks, one for the input image (A or B) and one for the reference. The reference feature extractor is also used by 14b, since the reference image is also received by 14b. In the FIG. 4 implementation, the error-estimation function f has weight-shared feature-extraction (FE) networks 60a, 60b, 60c that take in reference R and a distorted input (A or B), and a score-computation (SC) network 62a, 62b that uses the extracted features from each image to compute the perceptual-error score. There is one SC network in each error-estimation function f. In terms of the components of 14a and 14b: 14a contains two FE networks (18a and 18b) and one SC network (20a); similarly, 14b contains two FE networks (18b and 18c) and one SC network (20b). Note that the FE block for R is shared between f(Ai, Ri; θ) and f(Bi, Ri; θ). The computed perceptual-error scores for A and B (sA and sB) are then passed to the probability-estimation function h, which implements the Bradley-Terry (BT) model (Eq. 1) and outputs the probability of preferring A over B.
[0059] The inputs to the FIG. 4 system are sets of three images (A, B, and R) and the output is the probability of preferring A over B with respect to R. The "learning blocks" being referred to are the error estimation components 14a and 14b, which compute the perceptual error of each image. Component 14a is composed of three learning sub-blocks: 60a and 60b, which are used to extract useful features, and 62a, which estimates the perceptual error after receiving the extracted features from 60a and 60b. "Feature" does not necessarily relate to image analysis only. Sound data objects (and other data objects as well) can also have "features"; for example, the frequency composition of a sound can be a feature. Feature extraction is not part of dataset building; rather, it is a sub-component of the error or quality estimation block which aids the learning process by extracting meaningful information from the data. The estimated errors sA and sB are then subtracted and fed through a sigmoid 20a and 20b that implements the BT model in Eq. 1 (function h) to predict the probability of preferring A over B. Feeding the outputs of 62a and 62b to the sigmoid is the "estimated probability computation." Unlike the score computation, which outputs a score, this outputs a probability of preference. This probability is used to train the networks 14a, 14b using probability of preference labels as discussed above with reference to FIG. 2A. The entire system can then be trained by backpropagating the squared L2 error between the predicted probabilities and the ground-truth human preference labels to minimize Eq. 2.
[0060] An expressive computational model for f and a sufficiently large dataset with a rich variety of images and distortion types are needed for accurate operation. The invention provides a DCNN-based architecture that is an expressive model for f. A method for construction of a large-scale image distortion dataset with probabilistic pairwise human comparison labels is also provided.
[0061] Artisans will appreciate that the pairwise- or N-set learning framework is general and can be used to train a prior learning model for error computation by simply replacing f(Ai, Ri; θ) and f(Bi, Ri; θ). When the prior learning model is trained, the error-estimation function f can be used by itself to compute the perceptual error of individual images with respect to a reference.
[0062] In a preferred computation model, the error-estimation block f consists of two kinds of subnetworks (subnets, for short). There are three identical, weight-shared feature-extraction (FE) subnets (one for each input image), and two weight-shared score-computation (SC) subnets that compute the perceptual-error scores for A and B. Together, two FE subnets and one SC subnet comprise the error-estimation function f. Errors are computed on a patch-wise basis by feeding corresponding patches from A, B, and R through the FE and SC subnets, and aggregating them to obtain the overall errors, sA and sB.

[0063] FIGs. 5A and 5B show the preferred error-estimation function f. The feature-extraction (FE) subnet of f has 11 convolutional (CONV) layers 70 with skip connections to compute the features for an input patch Am. The number after "CONV" indicates the number of feature maps. Each layer has 3 x 3 filters and a non-linear ReLU, with 2 x 2 max-pooling after every even layer. FIG. 5B shows a score-computation (SC) subnet that uses two fully-connected (FC) networks 72 (each with 1 hidden layer with 512 neurons) to compute patch-wise weights and errors, followed by a weighted averaging function 74 over all patches to compute the final image score sA (or sB).
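The spatial resolution of the FE subnet's feature maps can be walked through as a sanity check, under the assumption (ours; padding is not specified above) that the 3 x 3 convolutions preserve spatial size and only the 2 x 2 max-pooling after every even layer reduces it:

```python
# Hypothetical shape walk-through of the 11-layer FE subnet: 2x2
# max-pooling after every even layer halves the resolution; the 3x3
# convolutions are assumed to be padded so they preserve size.
def fe_spatial_sizes(size=64, n_layers=11):
    """Return the feature-map side length after each of n_layers."""
    sizes = []
    for layer in range(1, n_layers + 1):
        if layer % 2 == 0:   # max-pool after every even layer
            size //= 2
        sizes.append(size)
    return sizes

# For a 64x64 input patch, five poolings (after layers 2, 4, 6, 8, 10)
# take the maps down to 2x2 by the final layers.
print(fe_spatial_sizes())
```

This also shows why features are taken from multiple depths: early layers retain fine 64x64 detail while the deepest layers summarize the whole patch in a 2 x 2 map.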
[0064] For each set of input patches (Am, Bm, and Rm, where m is the patch index), the corresponding feature maps from the FE CONV layers at different depths are flattened and concatenated into feature vectors xA^m, xB^m, and xR^m. Using features from multiple layers has two advantages: 1) multiple CONV layers contain features from different scales of the input image, thereby leveraging both high-level and low-level features for error score computation, and 2) skip connections enable better gradient backpropagation through the network.
[0065] Once these feature vectors are computed by the FE subnet, the differences between the corresponding feature vectors of the distorted and reference patches are fed into the SC subnet (FIG. 5B). Each SC subnet consists of two fully-connected (FC) networks. The first FC network takes in the multi-layer feature difference xA^m - xR^m and predicts the patchwise error (sA^m). These are aggregated using weighted averaging to compute the overall image error (sA), where the weight for each patch wA^m is computed using the second FC network. This network uses the feature difference from the last CONV layer of the FE subnet as input (yA^m - yR^m), since the weight for a patch is akin to the higher-level patch saliency captured by deeper CONV layers.
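The weighted-average aggregation of patch-wise errors into a single image score can be sketched as follows. In the actual system both the errors and the weights are produced by the two FC networks; the numbers here are invented for illustration.

```python
# Hypothetical sketch: combining per-patch errors into one image score
# using per-patch weights (a weighted average).
def aggregate_score(patch_errors, patch_weights):
    """Weighted average of per-patch errors -> overall image error sA."""
    total_w = sum(patch_weights)
    return sum(e * w for e, w in zip(patch_errors, patch_weights)) / total_w

errors  = [0.2, 0.9, 0.4]   # per-patch perceptual errors (invented)
weights = [1.0, 3.0, 1.0]   # per-patch saliency-like weights (invented)
print(round(aggregate_score(errors, weights), 2))  # 0.66
```

The middle patch's higher weight pulls the image score toward its larger error, mimicking the saliency-driven weighting described above.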
[0066] Feeding the feature differences to the SC subnet ensures that when estimating the perceptual error for a reference image (i.e., A = R), the SC block would receive xR^m - xR^m = 0 as input. The system would therefore output a constant value which is invariant to the reference image, caused by the bias terms in the fully-connected networks in the SC subnet. By subtracting this constant from the predicted error, the system ensures that the "origin" of the quality axis is always positioned at 0 for each reference image.
[0067] Training benefits from a random patch-sampling strategy, which prevents over-fitting and improves learning. At every training iteration, the system randomly samples 36 patches of size 64 x 64 from training images, which are of size 256 x 256. The density of this patch sampling used to test the system ensures that any pixel in the input image is included in at least one patch with high probability (0.900). This is in contrast with earlier approaches [S. Bosse, D. Maniry, K. Müller, T. Wiegand, and W. Samek, "Deep neural networks for no-reference and full-reference image quality assessment," IEEE Transactions on Image Processing, 27(1):206-219], where patches are sampled sparsely and there is only a 0.154 probability that a specific pixel in an image will be in one of the sampled patches, which makes it harder to learn a good perceptual-error metric. At test time, the system randomly sampled 1,024 patches for each image to compute the perceptual error.
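The coverage argument above can be checked with a small Monte-Carlo sketch. This is an illustration under stated assumptions, not the patented code: it probes only a central pixel (for which coverage is higher than for edge pixels, so the overall figure of 0.900 quoted above is lower), and the 5-patch "sparse" setting is a hypothetical contrast case, not the cited prior work's actual sampling scheme.

```python
import random

def coverage_probability(img=256, patch=64, n_patches=36, trials=2000, seed=0):
    """Monte-Carlo estimate of the probability that a given central pixel is
    covered by at least one of n_patches uniformly placed patch positions."""
    rng = random.Random(seed)
    max_tl = img - patch            # top-left coordinates range over 0..192
    px, py = img // 2, img // 2     # probe a central pixel
    hits = 0
    for _ in range(trials):
        for _ in range(n_patches):
            x = rng.randint(0, max_tl)
            y = rng.randint(0, max_tl)
            if x <= px < x + patch and y <= py < y + patch:
                hits += 1
                break
    return hits / trials

p_dense = coverage_probability(n_patches=36)   # dense sampling (this work)
p_sparse = coverage_probability(n_patches=5)   # hypothetical sparse contrast
```

With 36 patches per 256 x 256 image, a central pixel is covered with high probability; with far fewer patches, coverage drops sharply, consistent with the contrast drawn above.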
[0068] To test the system, a large-scale dataset labeled with pairwise probabilities of preference was developed; the dataset includes a wide variety of image distortions. A test set with a large number of images and distortion types that do not overlap with the training set was also developed, allowing a rigorous evaluation of the generalizability of IQA algorithms. Table 1 compares the present dataset with the four largest IQA datasets known to the inventors. As shown in Table 1, the present dataset is larger in every category than the other four combined.
[0069] [Table 1 (graphic): size comparison of the present dataset with the four largest existing IQA datasets across all categories.]
[0070] The dataset contained 200 unique reference images (160 reference images are used for training and 40 for testing), which are selected from the Waterloo Exploration Database [K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, "Waterloo exploration database: New challenges for image quality assessment models," IEEE Transactions on Image Processing, 22(2):1004-1016, 2017] because of its high-quality images. The selected reference images are representative of a wide variety of real-world content. The image size in the dataset during experiments was 256 x 256, which is a popular size in computer vision and image processing applications. This size also enables crowd-sourced workers to evaluate the images without scrolling the screen. Methods and systems of the invention are not limited to any particular size, because the system samples patches from the images.

[0071] The experimental dataset included a total of 75 distortions: 44 distortions in the training set and 31 in the test set, which are distinct from the training set. The image distortions span the following categories: 1) common image artifacts (e.g., additive Gaussian noise, speckle noise); 2) distortions that capture important aspects of the HVS (e.g., non-eccentricity, contrast sensitivity); and 3) complex artifacts from computer vision and image processing algorithms (e.g., deblurring, denoising, super-resolution, compression, geometric transformations, color transformations, and reconstruction). Although recent IQA datasets cover some of the distortions in categories 1 and 2, they do not contain many distortions from category 3, even though these are important to computer vision and image processing.
[0072] A training set of 160 reference images and 44 distortions was used. Each training example is a pairwise comparison consisting of a reference image R, two distorted versions A and B, along with a label p̂AB,
which is the estimated probabilistic human preference based on collected human data. For each reference image R, two kinds of pairwise comparisons were conducted: inter-type and intra-type. In an inter-type comparison, A and B are generated by applying two different types of distortions to R. For each reference image, there are 4 groups of inter-type comparisons, each containing 15 distorted images generated using 15 randomly-sampled distortions. In an intra-type comparison, A and B are generated by applying the same distortion to R with different parameters. For each reference image, there are 21 groups of intra-type comparisons, each containing 3 distorted images generated using the same distortion with different parameter settings. The exhaustive pairwise comparisons within each group (both inter-type and intra-type) and the corresponding human labels p̂AB are then used as the training data. Overall, there are a total of 77,280 pairwise comparisons for training (67,200 inter-type and 10,080 intra-type). Inter-distortion comparisons allow the system to capture human preference across different distortion types and are more challenging than the intra-distortion comparisons due to the larger variety of pairwise combinations and the difficulty of comparing images with different distortion types. A larger proportion of our dataset is therefore preferably devoted to inter-distortion comparisons.
[0073] A test set was constructed with 40 reference images and 31 distortions, which are representative of a variety of image contents and visual effects. None of the test-set images or distortions appear in the training set. For each reference image, there were 15 distorted images with randomly-sampled distortions (sampled to ensure that the test set has both inter- and intra-type comparisons). Probabilistic labels are assigned to the exhaustive pairwise comparisons of the 15 distorted images for each reference. The test set then contained a total of 4,200 distorted image pairs (105 per reference image).
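The comparison counts stated in paragraphs [0072] and [0073] follow from simple combinatorics, which can be verified directly:

```python
from math import comb

refs_train, refs_test = 160, 40

# Training: per reference image, 4 inter-type groups of 15 distorted images
# and 21 intra-type groups of 3 distorted images, compared exhaustively
# within each group.
inter = refs_train * 4 * comb(15, 2)    # 160 * 4 * 105 = 67,200
intra = refs_train * 21 * comb(3, 2)    # 160 * 21 * 3  = 10,080
train_pairs = inter + intra             # 77,280

# Test: per reference image, exhaustive pairs over 15 distorted images.
test_pairs = refs_test * comb(15, 2)    # 40 * 105 = 4,200

print(inter, intra, train_pairs, test_pairs)
```

The totals match the 77,280 training comparisons (67,200 inter-type and 10,080 intra-type) and 4,200 test pairs (105 per reference) reported above.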
[0074] In the experiments to collect human responses for system testing, human participants were presented with A, B, and R image sequences and prompted to choose which of A and B is more similar to R. Directly collecting a sufficient number of responses per pair to accurately estimate pAB could be prohibitively time-consuming and expensive. The dataset sizes were determined based on a number of factors, such as variety of image content, variety of distortions, and annotation costs. Statistical estimation of the number of responses needed to make the estimate of pAB accurate can simplify the data collection, and an estimator can then extrapolate from the direct responses. In a preferred example, a maximum likelihood (ML) estimator is used to accurately label a larger set of pairs based on a smaller set of directly acquired labels.
[0075] Each human response to a comparison was modeled as a Bernoulli random variable v with success probability pAB, which is the probability of a person preferring A over B. Given n human responses vi, i = 1, ..., n, we can estimate pAB by the sample mean p̂AB = (1/n) Σi vi. The number of responses n should be selected such that Pr(|p̂AB − pAB| ≤ η) ≥ Ptarget for a desired target probability Ptarget and a tolerance η. Choosing n = 40 and η = 0.15 provides a reasonable Ptarget ≥ 0.94. This permits collection of 40 responses for each pairwise comparison in the training and test sets, which can require tens of thousands of human responses for training. The human responses are relevant only during training; after training, the deployed system requires no human responses. To demonstrate the validity of the method outside of the training set, the performance of the trained model was evaluated on a sample test set: human preferences were also collected on the test set, and the error estimation of the model on the test set matches the human preference.
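The claim that n = 40 responses with tolerance η = 0.15 achieves Ptarget ≥ 0.94 can be checked exactly with the binomial distribution. The sketch below computes the probability that the empirical preference rate falls within η of the true pAB; the hardest case for a Bernoulli proportion is pAB near 0.5:

```python
from math import comb

def coverage(n, p, eta):
    """Exact P(|X/n - p| <= eta) for X ~ Binomial(n, p): the probability
    that the empirical preference rate lands within eta of the true p.
    The small slack term avoids floating-point trouble at the boundary."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1)
               if abs(k - n * p) <= n * eta + 1e-9)

# Worst case p = 0.5 with the stated n = 40, eta = 0.15.
c = coverage(40, 0.5, 0.15)
print(round(c, 3))
```

At p = 0.5 this coverage comfortably exceeds the 0.94 target quoted above, confirming that 40 responses per comparison suffice for the stated tolerance.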
[0076] Application of an estimator to limit the needed direct data collection is therefore preferred. However, direct responses will not be expensive in many real-world applications of the invention; many applications have a built-in facility to collect human responses. Almost any business whose users access services or content via a user interface can collect choices by presenting options to users and building a dataset from direct user selections over time. For applications where direct responses are expensive to collect, the use of an estimator reduces the time and expense required to build the dataset needed for system training.
[0077] To estimate pAB for all possible pairs of N images (e.g., N = 15 in each inter-type group) with perceptual-error scores s = [s1, ..., sN], denote the human responses by a count matrix C = {cij}, where cij is the number of times image i is preferred over image j. The scores can then be obtained by solving an ML estimation problem:

ŝ = argmax over s of Σi,j cij log σ(sj − si),

where σ(·) is the sigmoid function of the BT model discussed above, so that σ(sj − si) is the modeled probability of image i (the lower-error image) being preferred over image j. The optimization problem can be solved via gradient descent.
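The ML estimation step above can be sketched as a simple gradient-ascent solver. This is an illustrative implementation under the stated BT assumption (lower score = lower perceptual error = more often preferred), not the patented code; the synthetic count matrix below is a hypothetical noise-free example used only to show that the ordering is recovered:

```python
import numpy as np

def bt_scores_from_counts(C, iters=4000, lr=0.005):
    """Recover per-image error scores s from a count matrix C, where
    C[i, j] = number of times image i was preferred over image j, under
    the BT model P(i preferred over j) = sigmoid(s_j - s_i).
    Gradient ascent on the log-likelihood; scores are identifiable only
    up to an additive constant, so they are re-centered each step."""
    N = C.shape[0]
    s = np.zeros(N)
    for _ in range(iters):
        q = 1.0 / (1.0 + np.exp(-(s[None, :] - s[:, None])))  # q[i,j]=P(i>j)
        g = C * (1.0 - q)
        grad = g.sum(axis=0) - g.sum(axis=1)   # d(log-likelihood)/ds
        s += lr * grad
        s -= s.mean()                           # fix the additive gauge
    return s

# Synthetic check: 4 images with known scores, ~40 comparisons per pair.
true = np.array([-1.0, -0.3, 0.4, 0.9])
q_true = 1.0 / (1.0 + np.exp(-(true[None, :] - true[:, None])))
C = np.round(40 * q_true)          # idealized (noise-free) preference counts
np.fill_diagonal(C, 0)
est = bt_scores_from_counts(C)     # should reproduce the ordering of `true`
```

With idealized counts, the estimated scores reproduce the underlying ordering, which is the property the sufficiency analysis in the next paragraph relies on.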
[0078] A sufficient number of pairwise comparisons must be queried so that the optimal solution recovers the underlying true scores. It is sufficient to query a subset of all the possible comparisons as long as each image appears in at least k comparisons (k < N − 1) presented to the humans, and k can be determined empirically. Empirical analysis revealed that k = 10 is sufficient, as the binary error rate over the estimated part of the subset becomes 0.0006. ML estimation considerably reduces the number of pairs that need labeling, e.g., for one image dataset from 81,480 to 62,280 (a 23.56% reduction).
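The figures in this paragraph are mutually consistent with the dataset sizes given earlier, as a quick arithmetic check shows:

```python
# 77,280 training comparisons plus 4,200 test comparisons give the 81,480
# total quoted above; directly labeling only 62,280 of them corresponds to
# the stated ~23.56% reduction from ML estimation.
total = 77280 + 4200
labeled = 62280
reduction = (total - labeled) / total
print(total, round(100 * reduction, 2))
```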
[0079] The invention was tested for performance against popular and state-of-the-art IQA methods. An example implementation of the FIG. 1A approach outperformed all state-of-the-art methods. We attribute the improved performance of the presented systems and methods to the robust dataset and the novel learning framework. In the dataset, two factors contribute to the robustness: first, the accurate human-response collection scheme, which uses probability of preference as labels instead of the unreliable absolute image-quality scores used by prior IQA methods, and second, the size of the dataset, with many different and complex image distortions not covered by the prior IQA datasets. The second improvement in performance comes from the novel learning framework, which is designed to predict probabilistic preference labels during training instead of the noise-prone absolute image-quality labels, resulting in a more accurate modeling of human perception.
[0080] The invention can be used in a wide range of applications and industries. The specific implementation described in the example above can be used for image/video compression algorithms (processing the image to reduce its storage/transmission cost while keeping it as similar as possible to the original), image searching (finding the image in a database that is most similar to a provided image), computer graphics rendering (computing samples of the image until it is within a certain visual distance X from a ground-truth reference), reconstruction algorithms (reconstructing/fixing a degraded image so that it is as close as possible to the original), and similar applications. An implementation that computes quality can be used for similar applications, as well as for specific tasks such as sorting pictures in an album based on quality, finding the best images on the internet, adjusting camera parameters (or applying post-process filters) to take the best possible image for the user, and automatically selecting the best images for human inspection to weed out bad ones. The PLF or NLF can also, for example, present data to human subjects through devices to compare films or TV shows, and the machine learning components can then be trained to provide ratings for films or shows.
[0081] The invention can also be applied to smaller subsets of users to achieve more customized preference results. To do this, the user study would be performed as normal, but rather than lumping all of the data into one big set for training, subsets of users could be constructed based on user attributes such as personal preferences, previous user history, or user self-evaluation, to name a few. The training process is then performed for each subset of users separately, which would allow the quality or similarity ratings for a specific object to vary between groups. In this way, a group of users who like horror movies would see a scary movie given a "good" rating, while a group of users who do not like them would not.
[0082] Preferred systems collect data sets with human preference labels. The system collects human labels based upon input through a user device, and the labels indicate, for example, the fraction of the queried human population that preferred one object over another for the given application task. For learning to estimate the similarity between objects, this task is to determine which object was more similar to a given reference. For learning to measure an intrinsic perceived quality, the user is asked to select the object that displayed that quality more strongly. In both cases, selecting between two choices suffers from less subjectivity than other measures of perceptual decision-making.

[0083] The number of objects shown to the user at a time could be extended beyond just two or three, to help accelerate the acquisition process. For example, rather than being shown two book titles and asked to indicate which story was sadder, the subject could be shown multiple book titles (e.g., five), i.e., an N-set, and asked to rank them in order from least sad to most sad. By doing so, the user would effectively indicate the pairwise comparisons between all the books in the list and thus would produce more pairwise labels with fewer user questions.
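The N-set acceleration described in paragraph [0083] can be sketched directly: a single ranking of N objects implies N(N−1)/2 pairwise labels. The book titles below are hypothetical placeholders:

```python
from itertools import combinations

def ranking_to_pairwise_labels(ranked):
    """Expand one user ranking (e.g., least sad -> most sad) into the
    pairwise labels it implies: each earlier item is labeled as ranked
    below each later item on the attribute in question."""
    return [(a, b) for a, b in combinations(ranked, 2)]

# One ranking of five (hypothetical) books yields C(5, 2) = 10 pairwise labels.
pairs = ranking_to_pairwise_labels(["B3", "B1", "B5", "B2", "B4"])
print(len(pairs))
```

A single five-item question thus produces ten pairwise training labels, versus one label per two-item question.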
[0084] The experimental dataset in the example embodiment discussed above was created with the Amazon Mechanical Turk network, which presented image triplets to user devices (a triplet consists of two images and a reference) in order to create a dataset for training to estimate image similarity/error. Users were paid to label triplets by selecting, from the presented images, the one they thought was closer to the reference. In the case of training for a perceptual quality, users would only be shown two images (i.e., a doublet). Another approach is to provide data stimuli (e.g., triplet or doublet images) online on a server and obtain data via a crowdsourcing approach. In addition to the direct human response, the system can collect other parameters, such as data relating to the time it took to make a selection and gaze tracking (where the user looked to see similarities/differences). The system can also collect internal data that has been used for other purposes, such as a browsing and/or usage history, as an indication of preference to create the dataset. For example, if a retail shopping site history indicates that a user viewed two products and chose A over B, the PLF- or NLF-trained system can create scores for each product, either globally for all users or specifically tailored to individuals.
[0085] While specific embodiments of the present invention have been shown and described, it should be understood that other modifications, substitutions and alternatives are apparent to one of ordinary skill in the art. Such modifications, substitutions and alternatives can be made without departing from the spirit and scope of the invention, which should be determined from the appended claims.
[0086] Various features of the invention are set forth in the appended claims.

Claims

1. A machine-implemented method for automatically determining a likely human preference or quality for a data object or the perceived similarity or distance between a data object and a reference, the method comprising steps of:
providing comparison sets of data objects to devices and collecting human direct or indirect responses to the data objects or to references of the data objects;
labelling the data objects with human preference or emotional labels and building a pairwise or n-way comparison learning data set;
training a machine learning component with the learning data set to predict a human preference or emotional response to data objects, wherein the learning comprises estimating probability of human preference or emotion, determining an error of the estimated probability of human preference or emotion, and updating parameters of a learning component to reduce the error of the estimated probability of human preference or emotion; and
receiving, by the machine learning component, a data object or a reference thereto to evaluate and providing a predicted human perceptual quality or emotional response to the data object to evaluate, or a data object or a reference thereto and a reference object and providing the human-perceived similarity/difference between the data object to evaluate and the reference.
2. The method of claim 1, wherein the machine learning component comprises plural machine learning components, with each of the plural machine learning components receiving a data object and computing the perceptual quality of the given data object, which is then provided to a probability component that determines a likelihood of preference between separate data objects provided to the plural machine learning components.
3. The method of the previous claims, wherein the human direct response is a selection of which object exhibits a desired perceptual quality more strongly, wherein the perceptual quality can be funnier, scarier, sadder, more serious, more appropriate for a given subset of the population, or better.
4. The method of the previous claims, wherein the human direct response is a selection of which data object is better.
5. The method of claim 1, wherein the machine learning component comprises plural machine learning components, with each of the plural machine learning components receiving a data object and a reference object and computing the perceptual similarity or difference of the given object with respect to the reference, which is then provided to a probability component that determines a likelihood of preference between separate data objects provided to the plural machine learning components.
6. The method of claim 5, wherein the users are asked to identify the object that is closer or more similar to the reference object in order to train a system to measure similarity.
7. The method of claim 5, wherein the users are asked to identify the object that is farther or more dissimilar to the reference object in order to train a system to measure distance.
8. The method of the previous claims, wherein the data objects are images.
9. The method of the previous claims, wherein the data object represents one of sounds, songs, films, tv shows, advertisements, movies, videos, books, clothing, toys, or electronics.
10. A system for automatically determining a likely human preference or quality for a data object or the perceived similarity or distance between a data object and a reference, the system comprising:
a dataset including a set of pairwise labels for data objects indicating human selections of comparisons between pairs of N-sets of data objects obtained from a prompting component; and
a machine learning component trained on the data set via a probability estimation component, the machine learning component having an input for receiving data objects during testing and having code for updating parameters during training based upon an error distance between predictions of the probability estimation component and human probability of preference.
11. The system of claim 10, wherein the machine learning component comprises one of a neural network, deep neural network, multi-layer perceptron, convolutional network, deep convolutional network, recurrent neural network, autoencoder neural networks, long short-term memory network, generative adversarial networks, support vector machine, or random forest.
12. The system of claim 10, wherein the dataset comprises labels generated from human selections and additional labels estimated from human selections.
PCT/US2019/035363 2018-06-04 2019-06-04 Pair-wise or n-way learning framework for error and quality estimation WO2019236560A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862680393P 2018-06-04 2018-06-04
US62/680,393 2018-06-04

Publications (1)

Publication Number Publication Date
WO2019236560A1 true WO2019236560A1 (en) 2019-12-12

Family

ID=68769923

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/035363 WO2019236560A1 (en) 2018-06-04 2019-06-04 Pair-wise or n-way learning framework for error and quality estimation

Country Status (1)

Country Link
WO (1) WO2019236560A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968677A (en) * 2020-08-21 2020-11-20 南京工程学院 Voice quality self-evaluation method for fitting-free hearing aid
CN113448955A (en) * 2021-08-30 2021-09-28 上海观安信息技术股份有限公司 Data set quality evaluation method and device, computer equipment and storage medium
CN113554570A (en) * 2021-08-04 2021-10-26 西安交通大学 Double-domain CT image ring artifact removing method based on deep learning
CN114723522A (en) * 2022-03-31 2022-07-08 合肥工业大学 Comment text-oriented graph neural network recommendation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120259240A1 (en) * 2011-04-08 2012-10-11 Nviso Sarl Method and System for Assessing and Measuring Emotional Intensity to a Stimulus
US20150220844A1 (en) * 2011-10-20 2015-08-06 Gil Thieberger Estimating affective response to a token instance of interest utilizing attention levels received from an external source
US20180025368A1 (en) * 2014-08-21 2018-01-25 Affectomatics Ltd. Crowd-based ranking of types of food using measurements of affective response




Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19815788; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 19815788; Country of ref document: EP; Kind code of ref document: A1)