CN116324764A - Determining visual topics in a collection of media items - Google Patents

Determining visual topics in a collection of media items

Publication number
CN116324764A
Authority
CN
China
Prior art keywords
media items
clusters
subset
media
cluster
Prior art date
Legal status
Pending
Application number
CN202180063447.1A
Other languages
Chinese (zh)
Inventor
克里斯蒂娜·博尔
伊万·奥罗佩萨
丽莉·贝格
特雷西·古
伊桑·施赖伯
张善丰
霍华德·周
大卫·亨登
李�真
彭福堂
特雷莎·科
杰森·张
Current Assignee
Google LLC
Original Assignee
Google LLC
Priority date
Filing date
Publication date
Priority claimed from US 17/509,767 (US 12,008,057 B2)
Application filed by Google LLC
Publication of CN116324764A

Classifications

    • G06F16/55: Information retrieval of still image data; Clustering; Classification
    • G06F16/56: Information retrieval of still image data having vectorial format
    • G06F16/583: Information retrieval of still image data; Retrieval characterised by using metadata automatically derived from the content
    • G06F16/75: Information retrieval of video data; Clustering; Classification
    • G06F16/783: Information retrieval of video data; Retrieval characterised by using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The media application determines clusters of media items based on pixels of images or videos from a collection of media items such that the media items in each cluster have visual similarity, wherein the collection of media items is associated with a user account. The media application selects a subset of the clusters of media items based on the media items in each cluster having visual similarity within a range of threshold similarity values. The media application causes a user interface to be displayed that includes the subset of the clusters of media items.

Description

Determining visual topics in a collection of media items
Cross Reference to Related Applications
The present application claims priority from U.S. patent application Ser. No. 17/509,767, entitled "Determining a Visual Theme in a Collection of Media Items", filed on October 25, 2021, which claims priority from both U.S. provisional patent application Ser. No. 63/187,390, entitled "Determining a Visual Theme from Pixels in a Collection of Media Items", filed on May 11, 2021, and U.S. provisional patent application Ser. No. 63/189,658, entitled "Determining a Visual Theme from Pixels in a Collection of Media Items", filed on May 17, 2021, each of which is incorporated herein by reference in its entirety.
Background
Users of devices such as smartphones or other digital cameras capture and store a large number of photographs and videos in their image libraries. Users utilize such libraries to view their photos and videos to recall various events, such as birthdays, weddings, holidays, travel, etc. The user may have a large library of images, with thousands of images taken over a long period of time.
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Disclosure of Invention
A computer-implemented method comprising: generating a vector representation of media items from a collection of media items associated with a user account using a trained machine learning model; determining clusters of media items based on the vector representations of media items such that the media items in each cluster have visual similarity, wherein vector distances between the vector representations of pairs of media items indicate the visual similarity of the media items, and wherein the clusters are selected such that the vector distances between each pair of media items within the clusters are outside a range of threshold visual similarity values; and causing a user interface to be displayed that includes a subset of the cluster of media items.
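For illustration only, the following Python sketch mirrors this flow under stated assumptions: a hypothetical embed_media_item function stands in for the trained machine learning model, and the greedy grouping strategy and the numeric thresholds are placeholders rather than features recited in the method itself.

    import numpy as np

    def embed_media_item(item) -> np.ndarray:
        """Hypothetical stand-in for the trained model: maps a media item's
        pixels to a fixed-length vector representation (embedding)."""
        raise NotImplementedError

    def cluster_by_visual_similarity(items, min_dist=0.05, max_dist=0.3):
        """Greedily group items whose pairwise embedding distances stay inside
        the threshold range, so that each cluster shares a visual theme."""
        embeddings = [embed_media_item(item) for item in items]
        clusters = []
        for idx, emb in enumerate(embeddings):
            placed = False
            for cluster in clusters:
                # The distance to every existing member must fall inside the range.
                if all(min_dist <= np.linalg.norm(emb - embeddings[j]) <= max_dist
                       for j in cluster):
                    cluster.append(idx)
                    placed = True
                    break
            if not placed:
                clusters.append([idx])
        # Only clusters with more than one member are candidate visual themes.
        return [c for c in clusters if len(c) > 1]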
In some embodiments, each media item has an associated timestamp, the media items captured within the predetermined period of time are associated with episodes (episode), and selecting the subset of clusters of media items is based on the corresponding associated timestamps such that corresponding media items in the subset of clusters of media items satisfy a temporal diversity criterion that excludes more than a predetermined number of corresponding media items from a particular episode. In some embodiments, the method further comprises excluding media items associated with a category in the list of prohibited categories from the collection of media items before selecting the subset of the clusters of media items. In some embodiments, the method further comprises excluding media items corresponding to categories in the list of prohibited categories prior to determining the clusters of media items. In some embodiments, each media item is associated with a location and, in response to the subset of clusters of media items comprising more than a predetermined number of media items, selecting the subset of clusters of media items is based on the locations such that the subset of clusters meets a location diversity criterion. In some embodiments, the clusters of media items are further determined based on corresponding media items associated with tags having semantic similarity. In some embodiments, the method further includes scoring each media item in the subset of clusters of media items based on analyzing a likelihood that a user associated with the user account performs a positive action with respect to the media item, and selecting a media item in the subset of clusters of media items based on the corresponding score satisfying a threshold score. In some embodiments, the method further includes receiving feedback from the user regarding the subset of clusters and modifying a corresponding score for the subset of clusters of media items based on the feedback. In some embodiments, the feedback comprises an explicit action indicated by removing one or more of the media items in the subset of clusters of media items from the user interface, or an implicit action indicated by viewing or sharing one or more of the corresponding media items in the subset of clusters of media items. In some embodiments, the method further includes receiving aggregate feedback from users for the subset of the clusters of media items, providing the aggregate feedback to the trained machine learning model, wherein parameters of the trained machine learning model are updated, and modifying the clusters of media items based on the updated parameters of the trained machine learning model. In some embodiments, the method further includes selecting a particular media item from the subset of clusters of media items as the cover photograph of each cluster in the subset based on the particular media item including a maximum number of objects corresponding to the visual similarity. In some embodiments, the method further includes adding a title to each cluster of the subset of clusters of media items based on the type of visual similarity and a common phrase. In some embodiments, the user interface is displayed at predetermined intervals.
In some embodiments, the method further includes providing a notification to a user associated with the user account that a subset of the clusters of media items are available, wherein the notification includes a corresponding title for each of the subset of clusters of media items. In some embodiments: the method also includes determining computations to be performed on the respective devices to optimize the computations and implementing a trained machine learning model on the plurality of devices based on the computations to be performed on the respective devices.
In some embodiments, a method comprises: receiving media items from a collection of media items associated with a user account as input to a trained machine learning model; generating, with the trained machine learning model, output image embeddings of clusters of media items, wherein the media items in each cluster have visual similarity and media items with visual similarity are closer to each other in vector space than dissimilar media items, such that partitioning the vector space generates the clusters of media items; selecting a subset of the clusters of media items based on corresponding media items in each cluster having visual similarity within a range of threshold visual similarity values; and causing a user interface to be displayed that includes the subset of the clusters of media items.
In some embodiments, the functional image is removed from the collection of media items before the collection of media items is provided to the trained machine learning model. In some embodiments, the trained machine learning model is trained with feedback from the user that includes a reaction to the set of media items or a modification to the title of the set of media items.
Embodiments may also include a system comprising one or more processors and a memory storing instructions for execution by the one or more processors, the instructions comprising: determining clusters of media items based on pixels of images or videos from a collection of media items such that the media items in each cluster have visual similarity, wherein the collection of media items is associated with a user account; selecting a subset of clusters of media items based on corresponding media items in each cluster having visual similarity within a range of threshold visual similarity values; and causing a user interface to be displayed that includes a subset of the cluster of media items. In some embodiments, each media item has an associated timestamp; the media items captured within the predetermined period of time are associated with episodes; and selecting the subset of clusters of media items is based on the corresponding associated timestamps such that corresponding media items in the subset of clusters of media items satisfy a temporal diversity criterion that excludes more than a predetermined number of corresponding media items from a particular episode.
Embodiments may also include a non-transitory computer-readable medium comprising instructions stored thereon that, when executed by one or more computers, cause the one or more computers to perform operations comprising: determining clusters of media items based on pixels of images or videos from a collection of media items such that the media items in each cluster have visual similarity, wherein the collection of media items is associated with a user account; selecting a subset of clusters of media items based on corresponding media items in each cluster having visual similarity within a range of threshold visual similarity values; and causing a user interface to be displayed that includes a subset of the cluster of media items.
The present specification advantageously describes a way to identify clusters of similar images (or other media items) using a machine learning model without manually identifying the images or manually providing categories of the images (or other media items). In this way, an improved method for classifying images or other media items may be provided, which may, for example, classify them into clusters that reflect underlying trends in the data more reliably than predefined classifications or categories. Additionally, the machine learning model may advantageously reduce power consumption and improve efficiency by using a static training set and updating the machine learning model in response to an update size being less than a threshold size.
Drawings
FIG. 1 is a block diagram of an example network environment, according to some embodiments described herein.
FIG. 2 is a block diagram of an example computing device according to some embodiments described herein.
Fig. 3A-3B illustrate different example sets of media items that each match a particular visual theme, according to some embodiments. Fig. 3A illustrates a first set of media items matching a first visual theme of objects having a curved shape, a second visual theme of three images of the same still scene, and a third visual theme of a cat in a stuffed toy shark in different poses. Fig. 3B illustrates a fourth visual theme in which the same object (a backpack) is seen in each image, taken at different locations at different times, according to some embodiments described herein.
Fig. 4 includes an example of a visual theme of natural images of different mountains with both temporal diversity and positional diversity, according to some embodiments.
FIG. 5 includes an example of a user interface including a cluster with a visual theme, according to some embodiments described herein.
FIG. 6 is a flowchart illustrating an example method for displaying a subset of clusters of media items, according to some embodiments described herein.
FIG. 7 is a flowchart illustrating an example method for generating an embedding of clusters of media items and selecting a subset of clusters of media items using a machine learning model, according to some embodiments described herein.
Detailed Description
Network environment 100
Fig. 1 illustrates a block diagram of an example environment 100. In some embodiments, environment 100 includes media server 101, user device 115a, user device 115n, and network 105. The users 125a, 125n may be associated with respective user devices 115a, 115 n. In some embodiments, environment 100 may include other servers or devices not shown in fig. 1, or may not include media server 101. In fig. 1 and the remaining figures, the letter following the reference numeral, e.g. "115a", indicates a reference to an element having that particular reference numeral. Reference numerals without a subsequent letter, such as "115", herein represent a general reference to an embodiment of the element with that reference numeral.
The media server 101 may include a processor, memory, and network communication hardware. In some embodiments, the media server 101 is a hardware server. The media server 101 is communicatively coupled to a network 105 via a signal line 102. The signal line 102 may be a wired connection, such as Ethernet, coaxial cable, fiber optic cable, etc., or a wireless connection, such as Wi-Fi, Bluetooth, or other wireless technology. In some embodiments, the media server 101 transmits data to and receives data from one or more of the user devices 115a, 115n via the network 105. The media server 101 may include a media application 103a and a database 199.
The media application 103a may include code and routines operable to determine, with user permission and based on pixels of images or videos from a collection of media items, clusters of media items such that the media items in each cluster have visual similarity, wherein the collection of media items is associated with a user account. For example, one cluster may have objects with similar shapes and colors, another cluster may have parks with similar environmental attributes, and another cluster may have images of pets in different situations. The media application 103a selects a subset of the clusters of media items based on corresponding media items in each cluster that have visual similarity within a range of threshold visual similarity values. The media application 103a causes a user interface to be displayed that includes the subset of the clusters of media items.
In some embodiments, media application 103a may be implemented using hardware including a Central Processing Unit (CPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), any other type of processor, or a combination thereof. In some embodiments, the media application 103a may be implemented using a combination of hardware and software.
Database 199 may store a collection of media items associated with user accounts, a training set for the machine learning model, user actions associated with media items (viewing, sharing, commenting, etc.), and so on. Database 199 may store media items that are indexed and associated with the identity of a user 125 of a user device 115. Database 199 may also store social network data associated with the user 125, user preferences of the user 125, and the like.
The user device 115 may be a computing device that includes memory and a hardware processor. For example, the user device 115 may include a desktop computer, a mobile device, a tablet computer, a mobile phone, a wearable device, a head-mounted display, a mobile email device, a portable gaming device, a portable music player, a reader device, or another electronic device capable of accessing the network 105.
In the illustrated embodiment, user device 115a is coupled to network 105 via signal line 108 and user device 115n is coupled to network 105 via signal line 110. The media application 103 may be stored as a media application 103b on the user device 115a or as a media application 103c on the user device 115n. The signal lines 108 and 110 may be wired connections, such as Ethernet, coaxial cable, fiber optic cable, etc., or wireless connections, such as Wi-Fi, Bluetooth, or other wireless technology. The user devices 115a, 115n are accessed by users 125a, 125n, respectively. The user devices 115a, 115n in fig. 1 are used as an example. Although fig. 1 illustrates two user devices 115a and 115n, the present disclosure is applicable to a system architecture having one or more user devices 115.
In some embodiments, the user account includes a collection of media items. For example, users capture images and video from their cameras (e.g., smartphones or other cameras), upload images from Digital Single Lens Reflex (DSLR) cameras, add media captured by another user that is shared with them to their collection of media items, and so forth. The media application 103 determines clusters of media items based on pixels of images or videos from a collection of media items such that the media items in each cluster have visual similarity. For example, fig. 3A illustrates a first visual theme 300 of an image having visual similarity to a brown object having a curved shape. Specifically, the first object was an iced beverage in a glass, the second object was a heart-shaped latte in a coffee cup, and the third object was a bowl made with different shades of brown wood. Other examples may include mountains, natural arches, ocean waves with humans, parallel lines extending to the horizon (e.g., train tracks, roads, etc.), changes over time (plant growth, sun movement, drawing in progress), etc.
A cluster of media items may include images from the same episode, such as when a user takes multiple images of the same work of art but at different angles. For example, fig. 3A includes a second example 325 with three images that are the same still picture taken in a different manner such that leaves on the tree become progressively more distinguishable in the three images.
The media application 103 selects a subset of the clusters of media items based on the corresponding media items in each cluster having visual similarity within a range of threshold visual similarity values. The range of threshold visual similarity values may span from extremely similar media items to media items that are more similar than items that have only a distant relationship. For example, the subject matter of the first example 300 in fig. 3A is a brown circular object. This may be in the middle of the range of threshold similarity values. In contrast, the third example 350 in fig. 3A is a cluster of media items with the theme of a cat in a stuffed toy shark photographed at different time periods. This is a more visually similar theme. A fourth example 375 in fig. 3B includes the theme of an orange backpack used by a person on different trips. Another example that may be closer to the extremely similar end of the range is a cluster of media items that are pink flowers with slightly different shapes.
When media items are not sufficiently visually similar, it may be difficult to discern a topic in the media items, and as a result, they may appear more like a collection of random media items rather than content that the user is interested in viewing. In some embodiments, the media application 103 limits the number of media items to keep the visual theme more consistent and to keep the collection from looking like, for example, a grouping of all the cat images available in a user's library.
The media application 103 may cause a user interface to be displayed that includes a subset of the clusters of media items. In some embodiments, the media application 103 displays a user interface comprising a subset of the clusters of media items at predetermined intervals. For example, the media application 103 may display a user interface with a subset of clusters daily, weekly, monthly, etc. The media application 103 may modify the frequency of the subset of the display clusters based on the feedback. For example, if the user views a subset of the clusters each time they are available, the media application 103 may maintain the frequency of display, but if the user views a subset of the clusters less frequently, the media application 103 may decrease the frequency of display.
The media application 103 may also provide a notification to a user associated with the user account that a subset of the clusters is available, with a corresponding title for the subset of the clusters. For example, the media application 103 may provide daily notifications, weekly notifications, monthly notifications, etc. to the user. In some embodiments, the user interface includes options for limiting the frequency of notifications and/or the display of a subset of the clusters of media items.
Computing device example 200
FIG. 2 is a block diagram of an example computing device 200 that may be used to implement one or more features described herein. Computing device 200 is any suitable computer system, server, or other electronic or hardware device. In one example, the computing device 200 is a user device 115 for implementing the media application 103. In another example, computing device 200 is media server 101. In yet another example, the media application 103 is partially on the user device 115 and partially on the media server 101.
One or more of the methods described herein can be run in the following: a standalone program capable of executing on any type of computing device, a program running on a web browser, a mobile application ("app") running on a mobile computing device (e.g., a cell phone, a smartphone, a smart display, a tablet, a wearable device (watch, arm band, jewelry, headwear, virtual reality goggles or glasses, augmented reality goggles or glasses, a head-mounted display, etc.), a notebook computer, etc.). In a primary example, all of the computations are performed in a mobile application on a mobile computing device. However, it is also possible to use a client/server architecture, e.g., a mobile computing device sends user input data to a server device and receives final output data from the server for output (e.g., for display). In another example, the computing can be split between the mobile computing device and one or more server devices.
In some embodiments, computing device 200 includes a processor 235, memory 237, I/O interface 239, display 241, camera 243, and storage device 245. Processor 235 may be coupled to bus 218 via signal line 222, memory 237 may be coupled to bus 218 via signal line 224, I/O interface 239 may be coupled to bus 218 via signal line 226, display 241 may be coupled to bus 218 via signal line 228, camera 243 may be coupled to bus 218 via signal line 230, and storage device 245 may be coupled to bus 218 via signal line 232.
The processor 235 can be one or more processors and/or processing circuits that execute program code and control the basic operation of the computing device 200. A "processor" includes any suitable hardware system, mechanism, or component that processes data, signals, or other information. A processor may include a system with a general purpose Central Processing Unit (CPU) having one or more cores (e.g., in a single-core, dual-core, or multi-core configuration), multiple processing units (e.g., in a multi-processor configuration), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), a Complex Programmable Logic Device (CPLD), dedicated circuitry for implementing functions, dedicated processor for implementing neural network model-based processing, neural circuitry, a processor optimized for matrix computation (e.g., matrix multiplication), or other system. In some embodiments, the processor 235 may include one or more coprocessors that implement neural network processing. In some embodiments, the processor 235 may be a processor that processes data to produce a probabilistic output, e.g., the output produced by the processor 235 may be inaccurate or may be accurate within the range of expected outputs. The processing need not be limited to a particular geographic location or have time constraints. For example, a processor may perform its functions in real-time, offline, batch mode, etc. Portions of the processing may be performed by different (or the same) processing systems at different times and at different locations. The computer may be any processor in communication with the memory.
Memory 237 is typically provided in computing device 200 for access by processor 235 and may be any suitable processor-readable storage medium, such as Random Access Memory (RAM), read Only Memory (ROM), electrically erasable read only memory (EEPROM), flash memory, etc., adapted to store instructions for execution by the processor or set of processors, and located separately from and/or integrated with processor 235. The memory 237 is capable of storing software operated on the computing device 200 by the processor 235, including the media application 103.
Memory 237 may include an operating system 262, other applications 264, and application data 266. Other applications 264 can include, for example, camera applications, image library applications, image management applications, image gallery applications, media display applications, communication applications, web hosting engines or applications, map applications, media sharing applications, and the like. One or more of the methods disclosed herein can operate in a variety of environments and platforms, e.g., as a stand-alone computer program capable of running on any type of computing device, as a web application with web pages, as a mobile application ("app") running on a mobile computing device, or the like.
Application data 266 may be data generated by other applications 264 or hardware of computing device 200. For example, the application data 266 may include images captured by the camera 243, user actions recognized by other applications 264 (e.g., social networking applications), and so forth.
The I/O interface 239 can provide functionality to enable the computing device 200 to interface with other systems and devices. The devices of the interface can be included as part of the computing device 200 or can be separate and in communication with the computing device 200. For example, network communication devices, storage devices (e.g., memory 237 and/or database 199), and input/output devices can communicate via I/O interface 239. In some embodiments, I/O interface 239 can be connected to devices such as input devices (keyboard, pointing device, touch screen, microphone, camera, scanner, sensor, etc.) and/or output devices (display device, speaker device, printer, display, etc.). For example, when a user provides touch input, I/O interface 239 transmits data to media application 103.
Some examples of interface devices that can connect to I/O interface 239 can include a display 241 that can be used to display content, e.g., user interfaces for images, video, and/or output applications as described herein, and receive touch (or gesture) input from a user. For example, the display 241 may be utilized to display a user interface including a subset of the clusters of media items. The display 241 can include any suitable display device, such as a Liquid Crystal Display (LCD), Light Emitting Diode (LED) or plasma display screen, Cathode Ray Tube (CRT), television, monitor, touch screen, three-dimensional display screen, or other visual display device. For example, the display 241 can be a flat display screen provided on a mobile device, multiple display screens embedded in a glasses form factor or headphone device, or a monitor screen of a computer device.
Camera 243 may be any type of image capturing device capable of capturing images and/or video. In some embodiments, camera 243 captures images or video that I/O interface 239 transmits to media application 103.
The storage device 245 stores data related to the media application 103. For example, the storage device 245 may store a collection of media items associated with a user account, a subset of a cluster of media, a training set of machine learning models, and so forth. In embodiments where media application 103 is part of media server 101, storage device 245 is the same as database 199 in FIG. 1.
Example media application 103
Fig. 2 illustrates an example media application 103 that includes a filtering module 202, a clustering module 204, a machine learning module 205, a selection module 206, and a user interface module 208. In some embodiments, the media application 103 uses a clustering module 204 or a machine learning module 205.
The filter module 202 excludes media items from the collection of media items that correspond to categories in the list of prohibited categories. In some embodiments, the filtering module 202 includes a set of instructions executable by the processor 235 to exclude media items corresponding to categories in the list of prohibited categories. In some embodiments, the filter module 202 is stored in the memory 237 of the computing device 200 and can be accessible and executable by the processor 235.
In some embodiments, the filtering module 202 excludes media from the collection of media items before the clustering module 204 performs clustering. In an alternative embodiment, the filtering module 202 excludes media from the collection of media items after the clustering module 204 performs clustering. For example, the filtering module 202 excludes media items associated with visual similarity to categories from the list of prohibited categories. The prohibited category list may include media items that are captured not for their photographic value but as functional images, such as receipts, files, parking meters, screenshots, etc.
In some embodiments where the media application 103 includes a machine learning module 205, the filtering module 202 removes functional images from the collection of media items before the collection of media items is provided to the machine learning model. For example, the filtering module 202 removes receipts, descriptions, documents, and screenshots before the collection of media items is provided to the machine learning model.
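For illustration, a minimal sketch of such pre-filtering, assuming each media item is a dictionary carrying category labels produced elsewhere (e.g., by an image classifier); the item structure and the category names are assumptions:

    PROHIBITED_CATEGORIES = {"receipt", "document", "screenshot", "parking_meter"}

    def exclude_prohibited(media_items):
        """Drop functional images (receipts, documents, screenshots, etc.) from the
        collection before it is clustered or provided to the machine learning model."""
        return [item for item in media_items
                if not PROHIBITED_CATEGORIES.intersection(item.get("categories", []))]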
The clustering module 204 determines clusters of media items based on pixels of images or videos from a collection of media items such that the media items in each cluster have visual similarity. In some embodiments, the cluster module 204 includes a set of instructions executable by the processor 235 to generate a cluster of media items. In some embodiments, the cluster module 204 is stored in the memory 237 of the computing device 200 and is accessible and executable by the processor 235.
In some embodiments, the clustering module 204 accesses a collection of media items associated with a user account, such as a library associated with the user. In instances where the filtering module 202 excludes media items, the clustering module 204 accesses the collection of media items without the media items corresponding to the list of prohibited categories. The clustering module 204 may determine clusters of media items based on pixels of images or videos from the collection of media items such that the media items in each cluster have visual similarity. In some embodiments, the clustering module 204 uses an N-dimensional Gaussian diversity function to determine visual similarity.
In some embodiments, the machine learning module 205 includes a machine learning model that is trained to generate output image embeddings of clusters of media such that media items in each cluster have visual similarity. In some embodiments, the machine learning module 205 includes a set of instructions executable by the processor 235 to generate an image embedding. In some embodiments, the machine learning module 205 is stored in the memory 237 of the computing device 200 and is accessible and executable by the processor 235.
In some embodiments, the machine learning module 205 may use vectors (embedding) in the multidimensional feature space to determine visual similarity in the cluster. Images with similar features may have similar feature vectors, e.g., the vector distance between feature vectors of such images may be smaller than the vector distance between dissimilar images. The feature space may be a function of various factors of the image, such as the depicted subject matter (detected objects in the image), composition of the image, color information, image orientation, image metadata, specific objects identified in the image (e.g., user-approved, known faces), and so forth.
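As a small illustration of comparing feature vectors, cosine distance can serve as the vector distance; the specific metric is an assumption, since the description only requires that visually similar images yield smaller distances:

    import numpy as np

    def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
        """Smaller values indicate that two media items are more visually similar."""
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Two sunset photos should yield a smaller distance than a sunset vs. a receipt,
    # e.g. cosine_distance(embed(sunset_1), embed(sunset_2)) is smaller than
    # cosine_distance(embed(sunset_1), embed(receipt)), where embed() denotes a
    # hypothetical embedding function.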
In some embodiments, training may be performed using supervised learning. In some embodiments, the machine learning module 205 includes a set of instructions executable by the processor 235. In some embodiments, the machine learning module 205 is stored in the memory 237 of the computing device 200 and is accessible and executable by the processor 235.
In some embodiments, the machine learning module 205 may use training data (obtained where licensed for training purposes) to generate a trained model, in particular, a machine learning model. For example, the training data may include ground truth data in the form of clusters of media that are associated with descriptions of the visual similarity of the clusters. In some embodiments, the description of visual similarity may include feedback from users as to whether the clusters are related and include an explicit topic. In some embodiments, the description of visual similarity may be added automatically by image analysis. Training data may be obtained from any source, such as, for example, a data store specially labeled for training, data licensed for use as training data for machine learning, and the like.
In some embodiments, the training data may include synthetic data generated for training purposes, such as data that is not based on activity in the context being trained, e.g., data generated from simulated or computer generated images/videos, etc. In some embodiments, the machine learning module 205 uses weights taken from another application and not edited/passed. For example, in these embodiments, the trained model may be generated, for example, on a different device and provided as part of the media application 103. In various embodiments, the trained model may be provided as a data file that includes model structures or forms (e.g., that define the number and types of neural network nodes, connectivity between nodes, and organization of nodes into multiple layers), and associated weights. The machine learning module 205 may read the data file of the trained model and implement a neural network with node connectivity, layers, and weights based on the model structure or form specified in the trained model.
The machine learning module 205 generates a trained model referred to herein as a machine learning model. In some embodiments, the machine learning module 205 is configured to apply a machine learning model to data, such as application data 266 (e.g., input media), to identify one or more features in an input media item and to generate feature vectors (embeddings) representative of the media item. In some embodiments, the machine learning module 205 may include software code to be executed by the processor 235. In some embodiments, the machine learning module 205 may specify a circuit configuration (e.g., for a programmable processor, for a Field Programmable Gate Array (FPGA), etc.) that enables the processor 235 to apply a machine learning model. In some embodiments, the machine learning module 205 may include software instructions, hardware instructions, or a combination. In some embodiments, the machine learning module 205 may provide an Application Programming Interface (API) that can be used by the operating system 262 and/or other applications 264 to invoke the machine learning module 205, for example, to apply a machine learning model to the application data 266 to output image embeddings of clusters of media. In some embodiments, media items that match visual similarity are closer to each other in vector space than dissimilar images, such that partitioning the vector space generates clusters of media items.
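One conventional way to partition the vector space of output embeddings into clusters is agglomerative clustering with a distance threshold; the sketch below uses scikit-learn, which is an assumption and is not named in this description:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def partition_embeddings(embeddings: np.ndarray, distance_threshold: float = 0.3):
        """Group media-item embeddings so that items close together in vector space
        end up in the same cluster (older scikit-learn versions use affinity=)."""
        clustering = AgglomerativeClustering(
            n_clusters=None,                      # let the distance threshold decide
            distance_threshold=distance_threshold,
            metric="cosine",
            linkage="average",
        )
        labels = clustering.fit_predict(embeddings)
        clusters = {}
        for idx, label in enumerate(labels):
            clusters.setdefault(label, []).append(idx)
        return list(clusters.values())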
In some embodiments, the machine learning model is a classifier that takes as input a collection of media items. Examples of classifiers include neural networks, support vector machines, k-nearest neighbors, logistic regression, naive bayes, decision trees, perceptrons, etc.
In some embodiments, the machine learning model may include one or more model forms or structures. For example, the model form or structure can include any type of neural network, such as a linear network, a deep neural network that implements multiple layers (e.g., a "hidden layer" between an input layer and an output layer, where each layer is a linear network), a Convolutional Neural Network (CNN) (e.g., a network that splits or divides input data into multiple portions or tiles, processes each tile separately using one or more neural network layers, and aggregates the results of the processing of each tile), a sequence-to-sequence neural network (e.g., a network that receives sequence data such as words in sentences, frames in videos, etc., as input, and produces a sequence of results as output), and the like.
The model form or structure may specify connectivity between individual nodes and organize the nodes into tiers. For example, a node of a first layer (e.g., an input layer) may receive data as input data or application data 266. Such data can include, for example, one or more pixels per node, e.g., when a machine learning model is used for analysis, e.g., an input image, such as a first image associated with a user account. Depending on the connectivity specified in the model form or structure, the subsequent middle tier may receive as input the output of the nodes of the previous tier. These layers may also be referred to as hidden layers. The last layer (e.g., the output layer) produces the output of the machine learning model. For example, the output may be image embedding of a cluster of media. In some embodiments, the model form or structure also specifies the number and/or type of nodes in each layer.
The features output by the machine learning module 205 may include a topic (e.g., a sunset versus a particular person); colors present in the image (green hills versus blue lakes); color balance; light source, angle, and intensity; the position of objects in the image (e.g., following the rule of thirds); the position of objects relative to each other (e.g., depth of field); the shooting position; focus (foreground versus background); or shadows. While the above features are human-understandable, it will be appreciated that the feature output may be an embedding or representation of the image as mathematical values that are not human-interpretable (e.g., no individual feature value may correspond to a particular feature such as colors present, object locations, etc.); however, the trained model is robust such that similar images output similar features, and images with significant dissimilarities have correspondingly dissimilar features.
In some embodiments, the model form is a CNN having network layers, where each network layer extracts image features at a different level of abstraction. A CNN that identifies features in images may be used for image classification. The model architecture may include a combination and ordering of layers consisting of multidimensional convolution, average pooling, max pooling, activation functions, normalization, regularization, and other layers and modules used in practice for deep neural networks.
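For illustration, a toy version of such a CNN embedding backbone is sketched below in PyTorch; the framework, the layer sizes, and the embedding dimension are assumptions, while the layer types (convolution, pooling, activation) follow the description:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EmbeddingCNN(nn.Module):
        """Maps an RGB image tensor to an L2-normalized embedding vector."""

        def __init__(self, embedding_dim: int = 128):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),          # global average pooling
            )
            self.fc = nn.Linear(128, embedding_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.features(x).flatten(1)
            return F.normalize(self.fc(h), dim=1)  # unit-length embeddings

    # Example: EmbeddingCNN()(torch.randn(1, 3, 224, 224)) has shape (1, 128).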
In various embodiments, the machine learning model may include one or more models. The one or more models may include a plurality of nodes arranged in layers according to a model structure or form. In some embodiments, a node may be a computational node without memory, e.g., configured to process an input unit to produce an output unit. The computation performed by a node may include, for example, multiplying each of a plurality of node inputs by a weight, obtaining a weighted sum, and adjusting the weighted sum with a bias or intercept value to produce the node output. For example, based on feedback, the machine learning module 205 may adjust the respective weights, whereby one or more parameters of the machine learning model are automatically updated.
In some embodiments, the computation performed by the node may further include applying a step/activate function to the adjusted weighted sum. In some embodiments, the step/activate function may be a non-linear function. In various embodiments, such computation may include operations such as matrix multiplication. In some embodiments, the computation of multiple nodes may be performed in parallel, such as multiple processor cores using a multi-core processor, a separate processing unit using a Graphics Processing Unit (GPU), or dedicated neural circuitry. In some embodiments, a node may include memory, for example, may be capable of storing and using one or more earlier inputs in processing subsequent inputs. For example, the nodes having memory may include Long Short Term Memory (LSTM) nodes. LSTM nodes may use memory to maintain states that allow the node to behave as a Finite State Machine (FSM). Models with such nodes may be used to process sequential data, such as words in sentences or paragraphs, a series of images, frames in video, speech or other audio, and so forth. For example, a heuristic-based model used in the gating model may store one or more previously generated features corresponding to a previous image.
In some embodiments, the machine learning model may include embeddings or weights for individual nodes. For example, a machine learning model may be initialized as a plurality of nodes organized into layers specified by the model form or structure. At initialization, a corresponding weight may be applied to the connections between each pair of nodes connected in the model form, e.g., nodes in successive layers of a neural network. For example, the corresponding weights may be randomly assigned or initialized to default values. The machine learning model may then be trained, for example, using a training set of clusters of media, to produce results. In some embodiments, a subset of the total architecture may be reused as a transfer learning approach from other machine learning applications to take advantage of pre-trained weights.
For example, training may include applying supervised learning techniques. In supervised learning, training data can include a plurality of inputs (e.g., media items from a collection of media items associated with a user account) and a corresponding expected output for each input (e.g., image embedding of a cluster of media). Based on a comparison of the output of the machine learning model with the expected output, the value of the weight is automatically adjusted, for example, in a manner that increases the probability that the machine learning model will produce the expected output when provided with similar inputs.
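A minimal sketch of one such supervised update step, in which the expected output for each input is a target embedding and the weights are adjusted to reduce the difference; PyTorch and the mean-squared-error comparison are assumptions:

    import torch
    import torch.nn.functional as F

    def supervised_step(model, optimizer, images, expected_embeddings):
        """One training step: compare the model output with the expected output and
        adjust the weights so similar inputs are more likely to produce it."""
        optimizer.zero_grad()
        predicted = model(images)
        loss = F.mse_loss(predicted, expected_embeddings)
        loss.backward()          # compute gradients of the comparison
        optimizer.step()         # automatically adjust the weights
        return loss.item()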
In some embodiments, training may include applying unsupervised learning techniques. In unsupervised learning, only input data (e.g., media items from a collection of media items associated with a user account) may be provided and a machine learning model may be trained to differentiate the data, e.g., to cluster features of an image into multiple groups.
In various embodiments, the trained model includes a set of weights corresponding to the model structure. In embodiments omitting the training set, the machine learning module 205 may generate a machine learning model based on prior training, for example, by a developer of the machine learning module 205, by a third party, or the like. In some embodiments, the machine learning model may include a fixed set of weights, e.g., downloaded from a server that provides the weights.
In some embodiments, the machine learning module 205 may be implemented in an offline manner. Implementing the machine learning module 205 may include using a static training set that is not updated as the underlying data changes. This advantageously results in an increase in the efficiency of the processing performed by the computing device 200 and a decrease in the power consumption of the computing device 200. In these embodiments, a machine learning model may be generated in a first stage and provided as part of the machine learning module 205. In some embodiments, small updates to the machine learning model may be implemented in an online manner, wherein updates to the training data are included as part of the training of the machine learning model. A small update is an update having a size less than a size threshold; the size of the update corresponds to the number of variables in the machine learning model that are affected by the update. In such embodiments, an application (e.g., operating system 262, one or more other applications 264, etc.) that invokes the machine learning module 205 may use the image embeddings of media items to identify visually similar clusters. The machine learning module 205 may also generate a system log periodically, e.g., hourly, monthly, quarterly, etc., which may be used to update the machine learning model, e.g., to update the embeddings of the machine learning model.
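A minimal sketch of that size gate, where an update is applied online only if it affects fewer model variables than a threshold; the names and the threshold value are illustrative:

    UPDATE_SIZE_THRESHOLD = 1_000  # maximum number of affected model variables

    def apply_update(model_weights: dict, update: dict) -> bool:
        """Apply a small update online; defer large updates to offline retraining."""
        if len(update) >= UPDATE_SIZE_THRESHOLD:
            return False  # too large: handle in the next offline training run
        model_weights.update(update)  # update only the affected variables
        return True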
In some embodiments, the machine learning module 205 may be implemented in a manner that can be adapted to the particular configuration of the computing device 200 on which the machine learning module 205 is implemented. For example, the machine learning module 205 may determine a computational graph that utilizes available computational resources, such as the processor 235. For example, if the machine learning module 205 is implemented as a distributed application on multiple devices, such as where the media server 101 includes multiple instances of the media server 101, the machine learning module 205 may determine the computations to be performed on the respective devices in a manner that optimizes the computations. In another example, the machine learning module 205 may determine that the processor 235 includes GPUs having a particular number (e.g., 1000) of GPU cores and implement the machine learning module 205 accordingly (e.g., as 1000 separate processes or threads).
In some embodiments, the machine learning module 205 may implement an ensemble of trained models. For example, the machine learning model may include a plurality of trained models, each of the plurality of trained models adapted for the same input data. In these embodiments, the machine learning module 205 may select a particular trained model, for example, based on available computing resources, success rates of a priori reasoning, and the like.
In some embodiments, the machine learning module 205 may execute a plurality of trained models. In these embodiments, the machine learning module 205 may combine the outputs from applying the individual models, for example, using voting techniques that score the individual outputs from applying each trained model, or by selecting one or more particular outputs. In some embodiments, such selectors are part of the model itself and act as a connection layer between the trained models. Further, in these embodiments, the machine learning module 205 may apply a time threshold to apply the individual trained model (e.g., 0.5 ms) and utilize only those individual outputs that are available within the time threshold. The output that is not received within the time threshold may not be utilized, e.g., discarded. Such an approach may be suitable, for example, when there are specified time constraints while the machine learning module 205 is invoked, for example, by the operating system 262 or one or more applications 264. In this manner, the maximum time it takes for the machine learning module 205 to perform tasks, e.g., to identify one or more features in an input media item and generate a feature vector (embedding) representing the media item, can be bounded, which increases the responsiveness of the media application 103 and results in the machine learning module 205 providing real-time assurance for best effort classification.
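For illustration, a sketch of running an ensemble under a time budget and combining the available outputs by simple voting; it assumes each trained model is a callable returning a hashable label, and the helper names and budget are not taken from the description:

    import concurrent.futures
    from collections import Counter

    def ensemble_predict(models, media_item, time_budget_s=0.0005):
        """Run each trained model in parallel, keep only outputs available within
        the time budget (e.g., 0.5 ms), and combine them by majority vote."""
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(models))
        futures = [pool.submit(model, media_item) for model in models]
        done, _ = concurrent.futures.wait(futures, timeout=time_budget_s)
        pool.shutdown(wait=False, cancel_futures=True)  # discard late outputs
        outputs = [f.result() for f in done]
        if not outputs:
            return None  # no model finished within the budget
        return Counter(outputs).most_common(1)[0][0]  # simple voting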
In some embodiments, the machine learning module 205 receives feedback. For example, the machine learning module 205 may receive feedback from a user or group of users via the user interface module 208. If the single user provides feedback, the machine learning module 205 provides feedback to the machine learning model, which uses the feedback to update parameters of the machine learning model to modify the output image embedding of the cluster of media items. In the case where a set of users provide feedback, the machine learning module 205 provides aggregated feedback to the machine learning model, which uses the aggregated feedback to update parameters of the machine learning model to modify the output image embeddings of the clusters of media items. For example, the aggregated feedback may include a subset of clusters of media items and how the user reacts to the subset of clusters of media by: viewing only one image and rejecting viewing the remaining media, viewing all corresponding media items in the subset, sharing the corresponding media items, providing an indication of approval or disapproval of the corresponding media items (e.g., approval/disapproval, like, +1, etc.), deleting/adding individual media items from a subset of the cluster of media items, modifying titles, etc. The machine learning module 205 may modify the clusters of media based on the parameters of the updated machine learning model.
In some embodiments, the machine learning model is trained with feedback from the user, wherein the feedback includes a reaction to a subset of clusters and a modification of a title of one of the clusters in the subset. The machine learning module 205 provides feedback to the machine learning model to modify parameters to exclude clusters of media items that have some type of visual similarity (e.g., individual images of waves on the ocean are visually similar but are not types of media that the user may view compared to images of surfers on the waves at different times and/or different locations).
The selection module 206 selects a subset of the clusters of media items based on the visual similarity determined by the cluster module 204. In some embodiments, the selection module 206 includes a set of instructions executable by the processor 235 to select a subset of the cluster of media items. In some embodiments, the selection module 206 is stored in the memory 237 of the computing device 200 and can be accessible and executable by the processor 235.
In some embodiments, the selection module 206 selects a subset of the clusters of media items, wherein the media items have visual similarity within a range of threshold visual similarity values. For example, the range may be between 0.05 and 0.3 in the range of 0-4. Other ranges and dimensions are also possible. A subset of clusters of media items within a range of threshold visual similarity values may be considered to have visual topics that are identified as related and cohesive.
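A sketch of that selection step, keeping only clusters whose mean pairwise embedding distance falls inside the configured range; the helper names are illustrative and the 0.05 to 0.3 defaults mirror the example range above:

    import itertools
    import numpy as np

    def select_theme_clusters(clusters, embeddings, low=0.05, high=0.3):
        """Keep clusters whose members are cohesive: the mean pairwise distance
        between their embeddings must fall within the threshold similarity range."""
        selected = []
        for cluster in clusters:
            pairs = list(itertools.combinations(cluster, 2))
            if not pairs:
                continue
            mean_dist = float(np.mean(
                [np.linalg.norm(embeddings[i] - embeddings[j]) for i, j in pairs]))
            if low <= mean_dist <= high:
                selected.append(cluster)
        return selected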
In some embodiments, where the clusters of media items exceed a predetermined number (e.g., more than 15 media items), the selection module 206 may impose additional restrictions during selection of the subset of clusters of media items. For example, the selection module 206 may apply time diversity by identifying a timestamp associated with each media item, identifying a plurality of media items associated with the same episode (e.g., media items associated with the same time period and the same location) based on the timestamps, and selecting a subset of clusters of media items based on the associated timestamps such that the subset of clusters of media items meets a time diversity criterion that excludes more than a predetermined number of media items from a particular episode (i.e., selects a subset of clusters of media items based on the associated timestamps such that no more than a particular number (e.g., three) of media items are associated with the same episode). This avoids that clusters of media items are too similar and may repeat as the user takes multiple images of the object in the same time period and at the same location. This also avoids the situation where the user takes the same image and edits it, for example, for distribution on a different photo sharing application. The selection module 206 may use temporal diversity to select a subset of clusters that display the progress of the object over a span of time. For example, the clusters may include different images of children at different time periods to display different images of children growing up or plants going from seedlings to flowering shrubs.
In some embodiments, the selection module 206 applies location diversity to the subset of the clusters of media items. For example, where the number of media items available for clustering exceeds a predetermined number (e.g., more than 10 media items), the selection module 206 may identify a location associated with each media item and select the subset of the clusters of media items based on the locations such that the subset meets a location diversity criterion. FIG. 4 includes an example 400 of a visual theme of nature images of different mountains that has both temporal diversity, because the images were captured in different months and years, and location diversity, because the images were captured in different places. Although the images differ in both respects, the visual theme surfaces the underlying similarity between them.
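A similarly minimal sketch of a location diversity criterion might require a cluster to span several distinct locations, with no single location dominating; the thresholds and field names below are assumptions for illustration.

```python
from collections import Counter

def meets_location_diversity(cluster_items, min_locations=3, max_share=0.5):
    """Hypothetical criterion: the cluster must span several locations and
    no single location may dominate the cluster."""
    locations = Counter(item["location"] for item in cluster_items)
    if len(locations) < min_locations:
        return False
    most_common_count = locations.most_common(1)[0][1]
    return most_common_count / len(cluster_items) <= max_share

cluster = [
    {"id": 1, "location": "Mount Shasta"},
    {"id": 2, "location": "Mount Rainier"},
    {"id": 3, "location": "Denali"},
    {"id": 4, "location": "Mount Shasta"},
]
print(meets_location_diversity(cluster))  # True: 3 locations, max share 0.5
```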
In some embodiments, the selection module 206 applies a semantic theme to the subset of the clusters of media items. The selection module 206 may identify tags associated with the images and group subsets of the clusters of media items based on the corresponding media items having the same or similar tags. For example, the selection module 206 may use a tag identifying that an image depicts a dog in order to select a subset of a cluster of media items showing the dog from puppy to adult. In some embodiments, the media application 103 combines a semantic theme of the Golden Gate Bridge with a visual theme of other bridges that are visually similar to the golden hue of the Golden Gate Bridge.
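One hypothetical way to group clusters by shared semantic tags is to merge clusters whose tag sets overlap sufficiently, as in the sketch below; the tag representation and the overlap threshold are assumptions, not the described implementation.

```python
def group_by_shared_tags(clusters, min_overlap=2):
    """Merge clusters whose media items share enough semantic tags.

    `clusters` maps a cluster id to a set of tags derived from its media
    items; the names and the overlap threshold are illustrative assumptions.
    """
    merged = []
    for cid, tags in clusters.items():
        for group in merged:
            if len(group["tags"] & tags) >= min_overlap:
                group["members"].append(cid)
                group["tags"] |= tags
                break
        else:
            merged.append({"members": [cid], "tags": set(tags)})
    return merged

clusters = {
    "c1": {"dog", "puppy", "grass"},
    "c2": {"dog", "puppy", "beach"},
    "c3": {"bridge", "sunset"},
}
print([g["members"] for g in group_by_shared_tags(clusters)])
# [['c1', 'c2'], ['c3']]
```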
In some embodiments, the selection module 206 scores each media item in the subset of the clusters of media items based on analyzing a likelihood that a user associated with the user account will perform a positive action with respect to the media item. Positive actions may include viewing the subset, sharing the subset, ordering prints from the subset, and so on. The selection module 206 may score a media item higher when its topic is more likely to be interesting, for example, when it includes babies, people the user knows, or places the user has visited. Conversely, the selection module 206 may determine that the user is unlikely to perform a positive action for certain objects, e.g., static objects such as a bunk bed. In some embodiments, the selection module 206 scores the subset of the clusters of media items based on personalized information about the user or based on aggregated information about how users typically react to such media. In some embodiments, the selection module 206 lowers the score of media items with quality issues, such as being too blurry, because such issues reduce the likelihood that the user associated with the user account will perform a positive action with respect to the media item.
The selection module 206 may select media items for the subset of the clusters of media items if their corresponding scores meet a threshold score. In some embodiments, the threshold score is a static value that is the same for all users. In some embodiments, the threshold score is user-specific. In some embodiments, the threshold score is specified by the user.
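Combining the scoring and threshold filtering described in the two preceding paragraphs, a toy sketch might look like the following. The feature names, weights, and threshold are invented for illustration and are not the scoring model described in this disclosure.

```python
def score_media_item(item):
    """Toy scoring: boost interesting subjects, penalize poor quality.
    Feature names and weights are assumptions, not the described model."""
    score = 0.0
    if item.get("contains_known_person"):
        score += 0.4
    if item.get("contains_baby"):
        score += 0.3
    if item.get("visited_place"):
        score += 0.2
    if item.get("is_blurry"):
        score -= 0.5
    return score

def select_by_threshold(items, threshold=0.3):
    """Keep media items whose score meets the threshold score."""
    return [item for item in items if score_media_item(item) >= threshold]

items = [
    {"id": "a", "contains_known_person": True, "is_blurry": False},
    {"id": "b", "is_blurry": True, "visited_place": True},
]
print([i["id"] for i in select_by_threshold(items)])  # ['a']
```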
Once the selection module 206 determines the subset of clusters, the selection module 206 may instruct the user interface module 208 to cause a user interface including the subset of clusters to be displayed. In some embodiments, the user may provide feedback related to a subset of the clusters. For example, the user may view the subset, provide an indication of approval of the subset, share the subset, order printed pictures from the subset, and so forth.
In some embodiments, the selection module 206 receives feedback and modifies the corresponding scores of the subset of the clusters of media items based on the feedback. For example, the feedback may include an explicit action, as indicated by removing a subset of the clusters from the user interface, or an implicit action, as indicated by one or more of viewing the subset of the clusters, viewing corresponding media items in the subset, or sharing corresponding media items from the subset. In some embodiments, the selection module 206 may identify a pattern in the feedback. For example, if positive feedback occurs when an object in a cluster is of a certain type (a baby, a family member, a tree, etc.), the selection module 206 may modify the scores such that the subset of the clusters includes objects of a similar type. In another example, the pattern may indicate that the user prefers themes with less visual similarity over themes with more visual similarity, and the selection module 206 may modify the scores such that themes with less visual similarity are selected more often.
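A small sketch of adjusting a score from explicit and implicit feedback events is shown below; the event names and step sizes are assumptions chosen only to illustrate the direction of the adjustment.

```python
def adjust_score(base_score, feedback_events):
    """Nudge a cluster's score up or down from feedback events; the event
    names and step sizes are illustrative assumptions."""
    adjustments = {
        "removed_from_ui": -0.3,   # explicit negative feedback
        "viewed": 0.05,            # implicit positive feedback
        "viewed_all_items": 0.1,
        "shared": 0.2,
    }
    score = base_score
    for event in feedback_events:
        score += adjustments.get(event, 0.0)
    return max(0.0, min(1.0, score))

print(round(adjust_score(0.5, ["viewed", "shared"]), 2))   # 0.75
print(round(adjust_score(0.5, ["removed_from_ui"]), 2))    # 0.2
```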
In some embodiments, the selection module 206 may receive feedback from a group of users using the media application 103 and aggregate the feedback. For example, the selection module 206 may create aggregated feedback from the user for a subset of the clusters of media, and the selection module 206 modifies the score based on the aggregated feedback.
The user interface module 208 generates a user interface. In some embodiments, the user interface module 208 includes a set of instructions executable by the processor 235 to generate the user interface. In some embodiments, the user interface module 208 is stored in the memory 237 of the computing device 200 and is accessible to and executable by the processor 235.
The user interface module 208 causes a user interface to be displayed that includes the subset of the clusters of media items. FIG. 5 includes an example 500 of a user interface, according to some embodiments described herein, that includes a cluster 505 with a visual theme. In this example, cluster 505 is displayed at the top of the user interface along with a Recent Highlights set and a 1 Year Ago set of images. The user interface 500 also includes images taken Yesterday (March 9) in San Francisco.
In some embodiments, the user interface module 208 generates a user interface for viewing, editing, and sharing media that also suggests a subset of the clusters. For example, the user interface may include a cluster at the top of the user interface, as shown in FIG. 5, and then when the user selects an image, the user interface includes options for editing or sharing the image.
In some embodiments, in response to a user selecting a cluster in the user interface, the user interface module 208 displays corresponding media items from the cluster at predetermined intervals. For example, the user interface module 208 may display each media item for two seconds, three seconds, etc.
In some embodiments, the user interface module 208 presents the subset of the clusters with cover photos. The cover photo may be the most recent photo, the highest-scoring photo, etc. In some embodiments, the user interface module 208 selects a particular media item from each cluster in the subset as the cover photo for that cluster based on the particular media item including the maximum number of objects corresponding to the visual similarity. For example, a cluster may have a visual theme of a group of people skiing, and the user interface module 208 may select as the cover photo the image from the cluster that depicts the largest number of people skiing. In another example, where a cluster has a visual theme of people doing outdoor activities in the water, the user interface module 208 may determine that an image of a person surfing is a better cover photo than images of a person at the water's edge rather than in the water (e.g., building a sand fort) or engaged in less active outdoor pursuits (e.g., sunbathing at the water's edge). The user interface module 208 may also select the cover photo from the cluster based on it having the highest visual quality (e.g., sharp, high resolution, no blurring, good exposure, etc.).
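The cover photo heuristic described above (most theme-matching objects, tie-broken by visual quality) could be sketched as follows; the item fields `objects` and `quality` are illustrative assumptions.

```python
def pick_cover_photo(cluster_items, theme_label):
    """Choose a cover photo: prefer the item depicting the most objects that
    match the cluster's visual theme, break ties by visual quality."""
    def key(item):
        matching = sum(1 for obj in item["objects"] if obj == theme_label)
        return (matching, item["quality"])
    return max(cluster_items, key=key)

cluster = [
    {"id": "p1", "objects": ["skier", "skier", "tree"], "quality": 0.7},
    {"id": "p2", "objects": ["skier"], "quality": 0.9},
]
print(pick_cover_photo(cluster, "skier")["id"])  # p1: most skiers depicted
```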
In some embodiments, the user interface module 208 adds a title to each cluster in the subset of clusters of media items based on the type of visual theme and/or a template phrase. For example, a title may refer to an action occurring in the images (e.g., surf's up for an ocean cluster, into the blue for a sky cluster, on the road for a road cluster, stairway to heaven for images in a church); a food metaphor (e.g., mixed nuts, snack, mixed bag, good bag, wine flight, cheese string, sample, test troves, overlooked treasure, wave a drink); photo clues (e.g., photo detective, photo mystery, photo puzzle); a creative combination; a connection or correlation (e.g., connections, photos of a feather, photo club, photo weave, pattern, co cadence, cause and effect, slot machine, one of these things is like the others, parallel); a pattern-related title (e.g., beta patterns, pattern hunters, title patterns, pattern portals, connect the dots, picture patterns, photo patterns); a synonym of pattern, such as theme (e.g., photo story, a story in photos, photo tales, a tale of two photos, photo theme, lucky theme), collection (e.g., photo set, surprise set), or match (e.g., memory match); an onomatopoeic word (e.g., zig-zag, boom, ka-pow, zap photo, photo zap, photo stop); a verb phrase (e.g., look what we found in the couch cushions, look what appeared, help us sleuth, will it blend, time flies, some things never change); or, where the selection module has a higher confidence score that the inferred connection is correct, a title referencing the connection (e.g., magic pattern, I'm feeling lucky). In some embodiments, the template phrase may be an interesting or commonly used phrase that is more conversational and engaging than a literal title such as "Birthdays 1997-2001". In some embodiments, the user interface module 208 may also pair a common title, such as "Look what we found", with a subtitle that identifies the theme, such as "Your orange backpack brought you far", as in the fourth example 375 in FIG. 3B.
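A toy sketch of template-based title generation is shown below, reusing a few of the phrases listed above; the mapping from theme type to phrase and the optional subtitle handling are assumptions for illustration.

```python
import random

# A few of the template phrases mentioned above, keyed by a coarse theme
# type; the mapping itself is an illustrative assumption.
TITLE_TEMPLATES = {
    "ocean": ["Surf's up"],
    "sky": ["Into the blue"],
    "road": ["On the road"],
    "generic": ["Look what we found", "Photo patterns", "Connect the dots"],
}

def make_title(theme_type, subtitle=None):
    """Pick a template phrase for the theme and optionally attach a subtitle
    that names the inferred connection."""
    phrases = TITLE_TEMPLATES.get(theme_type, TITLE_TEMPLATES["generic"])
    title = random.choice(phrases)
    return f"{title}: {subtitle}" if subtitle else title

print(make_title("ocean"))  # Surf's up
print(make_title("generic", "Your orange backpack brought you far"))
# e.g., "Connect the dots: Your orange backpack brought you far"
```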
In some embodiments, the user interface module 208 provides a notification to a user associated with the user account that the subset of the clusters is available for viewing. The user interface module 208 may provide notifications periodically, such as daily, weekly, monthly, etc. In some embodiments, if the user stops viewing notifications that are provided daily (or weekly, monthly, etc.), the user interface module 208 may generate notifications less frequently. The user interface module 208 may additionally provide the notification with a corresponding title for the subset of the clusters.
Example flowcharts
FIG. 6 is a flowchart illustrating an example method 600 for displaying a subset of clusters of media items, according to some embodiments. The method illustrated in flowchart 600 may be performed by computing device 200 in fig. 2.
The method 600 may begin at block 602. In block 602, a request to access a collection of media items associated with a user account is generated. In some embodiments, the request is generated by the user interface module 208. Block 602 may be followed by block 604.
At block 604, the permission interface element is caused to be displayed. For example, the user interface module 208 may display a user interface that includes a permission interface element that requests a user to provide permissions to access a collection of media items. Block 604 may be followed by block 606.
At block 606, it is determined whether the user grants permission for accessing the collection of media items. In some embodiments, block 606 is performed by user interface module 208. If the user does not provide permission, the method ends. If the user provides permission, block 606 may be followed by block 608.
At block 608, clusters of media items are determined based on pixels of images or videos from a collection of media items such that the media items in each cluster have visual similarity, wherein the collection of media items is associated with a user account. In some embodiments, block 608 is performed by the cluster module 204. Block 608 may be followed by block 610.
In block 610, a subset of clusters of media items is selected based on corresponding media items in each cluster having visual similarity within a range of threshold visual similarity values. In some embodiments, block 610 is performed by selection module 206. Block 610 may be followed by block 612.
In block 612, a user interface including the subset of the clusters of media items is caused to be displayed. In some embodiments, block 612 is performed by the user interface module 208.
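Putting blocks 602 through 612 together, a highly simplified, hypothetical end-to-end sketch of method 600 might look like the following; the injected callables stand in for the user interface, cluster, and selection modules and are not actual APIs of the described system.

```python
def method_600(user, media_collection,
               request_permission_fn, cluster_fn, select_fn, display_fn):
    """Simplified flow of FIG. 6: request access, check permission, cluster by
    visual similarity, select a cohesive subset, and display it. All callables
    are injected stand-ins for the modules described above."""
    if not request_permission_fn(user, media_collection):   # blocks 602-606
        return None
    clusters = cluster_fn(media_collection)                  # block 608
    subset = select_fn(clusters, low=0.05, high=0.3)         # block 610
    display_fn(user, subset)                                 # block 612
    return subset

# Toy usage with trivial stand-in callables.
result = method_600(
    user="alice",
    media_collection=["img1.jpg", "img2.jpg"],
    request_permission_fn=lambda u, c: True,
    cluster_fn=lambda c: {"cluster_a": c},
    select_fn=lambda clusters, low, high: clusters,
    display_fn=lambda u, s: print(f"showing {list(s)} to {u}"),
)
```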
FIG. 7 is a flowchart illustrating an example method 700 for generating an embedding of clusters of media items and selecting a subset of clusters of media items using a machine learning model, according to some embodiments. The method illustrated in flowchart 700 may be performed by computing device 200 in fig. 2.
The method 700 may begin at block 702. In block 702, a request to access a collection of media items associated with a user account is generated. In some embodiments, the request is generated by the user interface module 208. Block 702 may be followed by block 704.
At block 704, the permission interface element is caused to be displayed. For example, the user interface module 208 may display a user interface that includes a permission interface element that requests a user to provide permissions to access a collection of media items. Block 704 may be followed by block 706.
At block 706, a determination is made as to whether the user grants permission for accessing the collection of media items. In some embodiments, block 706 is performed by user interface module 208. If the user does not provide permission, the method ends. If the user provides permission, block 706 may be followed by block 708.
At block 708, the trained machine learning model receives as input media items in a collection of media items associated with a user account. In some embodiments, block 708 is performed by the machine learning module 205. Block 708 may be followed by block 710.
In block 710, the trained machine learning model generates output image embeddings for clusters of media items, wherein the media items in each cluster have visual similarity and media items that match the visual similarity are closer to each other in vector space than dissimilar media items, such that partitioning the vector space generates the clusters of media items. In some embodiments, block 710 is performed by the machine learning module 205. Block 710 may be followed by block 712.
In block 712, a subset of clusters of media items is selected based on corresponding media items in each cluster having visual similarity within a range of threshold visual similarity values. In some embodiments, block 712 is performed by the machine learning module 205. Block 712 may be followed by block 714.
In block 714, a user interface comprising a subset of the cluster of media items is caused to be displayed. In some embodiments, block 714 is performed by user interface module 208.
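Blocks 708 through 714 can likewise be sketched end to end. The snippet below takes image embeddings as input and uses k-means purely as a stand-in for partitioning the vector space; the disclosure does not specify a particular clustering algorithm, so this choice, the similarity window, and the cluster count are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def method_700(image_embeddings, n_clusters=3, low=0.05, high=0.3):
    """Simplified sketch of FIG. 7: embeddings produced by a trained model
    (blocks 708-710) are partitioned into clusters, and clusters whose spread
    falls inside a similarity window are kept (block 712)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(image_embeddings)
    subset = {}
    for label in range(n_clusters):
        members = image_embeddings[labels == label]
        # Mean distance to the centroid as a rough "visual similarity" spread.
        spread = float(np.mean(np.linalg.norm(members - members.mean(axis=0), axis=1)))
        if low <= spread <= high:
            subset[label] = members
    return subset  # block 714 would render these clusters in a user interface

# Toy usage with three well-separated synthetic embedding groups.
rng = np.random.default_rng(1)
embeddings = np.vstack([rng.normal(loc, 0.05, size=(10, 8)) for loc in (0.0, 1.0, 2.0)])
print(sorted(method_700(embeddings).keys()))  # expected: [0, 1, 2]
```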
In addition to the above description, controls may be provided to the user that allow the user to select whether and when the systems, programs, or features described herein may collect user information (e.g., information about the user's media items, such as photos or videos, the user's interactions with a media application displaying the media items, the user's social network, social actions or activities, profession, preferences of the user, such as viewing preferences based on the creation of images, settings for hiding certain people or pets, or user interface preferences, or the user's current location), and whether content or communications are sent from a server to the user. In addition, certain data may be processed in one or more ways prior to storage or use so that personally identifiable information is removed. For example, the identity of the user may be processed such that no personally identifiable information of the user can be determined, or the geographic location of the user may be generalized where location information is obtained (such as to a city, ZIP code, or state level) such that a particular location of the user cannot be determined. Thus, the user can control what information is collected about the user, how that information is used, and what information is provided to the user.
In the above description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the specification. It will be apparent, however, to one skilled in the art that the present disclosure can be practiced without these specific details. In some instances, structures and devices are shown in block diagram form in order to avoid obscuring the description. For example, embodiments can be described above primarily with reference to user interfaces and specific hardware. However, embodiments can be applied to any type of computing device capable of receiving data and commands, as well as any peripheral device providing services.
Reference in the specification to "some embodiments" or "some examples" means that a particular feature, structure, or characteristic described in connection with the embodiments or examples can be included in at least one implementation of the description. The appearances of the phrase "in some embodiments" in various places in the specification are not necessarily all referring to the same embodiment.
Some portions of the detailed descriptions above are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic data capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these data as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the description, discussions utilizing terms including "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Embodiments of the present specification can also relate to a processor for performing one or more steps of the above-described methods. The processor may be a special purpose processor selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory computer readable storage medium, including, but not limited to, any type of disk including optical disks, ROMs, CD-ROMs, magnetic disks, RAM, EPROM, EEPROM, magnetic or optical cards, flash memory including a USB key having non-volatile memory, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The description can take the form of some entirely hardware embodiments, some entirely software embodiments or some embodiments containing both hardware and software elements. In some embodiments, the specification is implemented in software, including but not limited to firmware, resident software, microcode, etc.
Furthermore, the description can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
A data processing system suitable for storing or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Claims (20)

1. A computer-implemented method, comprising:
generating a vector representation of media items from a collection of media items associated with a user account using a trained machine learning model;
determining clusters of media items based on the vector representations of media items such that the media items in each cluster have visual similarity, wherein vector distances between vector representations of pairs of media items are indicative of the visual similarity of the media items, and wherein the clusters are selected such that vector distances between each pair of media items within the clusters are outside a range of threshold visual similarity values;
selecting a subset of the clusters of media items based on corresponding media items in each cluster having visual similarity within a range of threshold visual similarity values; and
causing a user interface to be displayed, the user interface comprising the subset of the clusters of media items.
2. The method according to claim 1, wherein:
each media item has an associated timestamp;
the media items captured within the predetermined period of time are associated with episodes; and
selecting the subset of the clusters of media items is based on corresponding associated timestamps such that corresponding media items in the subset of the clusters of media items satisfy a temporal diversity criterion that excludes more than a predetermined number of corresponding media items from a particular episode.
3. The method of claim 1, further comprising: media items associated with a category in the list of prohibited categories are excluded from the collection of media items prior to selecting the subset of the cluster of media items.
4. The method of claim 1, further comprising: media items corresponding to categories in the list of prohibited categories are excluded prior to determining the cluster of media items.
5. The method according to claim 1, wherein:
each media item is associated with a location; and
in response to the subset of the clusters of media items including more than a predetermined number of media items, selecting the subset of the clusters of media items is based on location such that the subset of the clusters of media items meets a location diversity criterion.
6. The method of claim 1, wherein the cluster of media items is further determined based on the corresponding media items being associated with tags having semantic similarity.
7. The method of claim 1, further comprising:
scoring each media item in the subset of the cluster of media items based on analyzing a likelihood that a user associated with the user account performs a positive action with respect to the media item; and
the media items in the subset of the clusters of media items are selected based on the corresponding scores satisfying a threshold score.
8. The method of claim 7, further comprising:
receiving feedback from a user, the feedback being about one or more of the media items in the subset of the cluster of media items; and
based on the feedback, corresponding scores of the one or more media items in the subset of the cluster of media items are modified.
9. The method of claim 8, wherein the feedback comprises an explicit action indicated by removal of the one or more media items in the subset of the clusters of media items from the user interface, or an implicit action indicated by one or more of: viewing corresponding media items in the subset of the clusters of media items or sharing corresponding media items in the subset of the clusters of media items.
10. The method of claim 1, further comprising:
receiving aggregate feedback from the user for an aggregate subset of the cluster of media items;
providing the aggregate feedback to the trained machine learning model, wherein parameters of the trained machine learning model are updated; and
the clusters of media items are modified based on updating the parameters of the trained machine learning model.
11. The method of claim 1, further comprising: based on a particular media item from each of the subset of clusters of media items including a maximum number of objects corresponding to the visual similarity, the particular media item is selected as a cover photograph for each of the subset of clusters of media items.
12. The method of claim 1, further comprising: a title is added to each of the subset of clusters of media items based on the type of visual similarity and the common phrase.
13. The method of claim 1, wherein the subset of the clusters of media items are displayed in the user interface at predetermined intervals.
14. The method of claim 1, further comprising: a notification is provided to a user associated with the user account that the subset of clusters of media items are available, wherein the notification includes a corresponding title for each of the subset of clusters of media items.
15. The method of claim 1, further comprising:
determining computations to be performed on the respective devices to optimize the computations; and
the trained machine learning model is implemented on a plurality of devices based on the computations to be performed on the respective devices.
16. A computer-implemented method, comprising:
receiving media items from a collection of media items associated with a user account as input to a trained machine learning model;
generating output image embeddings of clusters of media items with the trained machine learning model, wherein media items in each cluster have visual similarity and media items having the visual similarity are closer to each other in vector space than dissimilar media items such that partitioning the vector space generates the clusters of media items;
selecting a subset of the clusters of media items based on corresponding media items in each cluster having visual similarity within a range of threshold visual similarity values; and
causing a user interface to be displayed, the user interface comprising the subset of the clusters of media items.
17. The method of claim 16, wherein a functional image is removed from the collection of media items before the collection of media items is provided to the trained machine learning model.
18. The method of claim 16, wherein the trained machine learning model is trained with feedback from a user, the feedback comprising a reaction to a set of media items or a modification to a title of the set of media items.
19. A system, comprising:
a processor; and
a memory coupled to the processor, the memory having instructions stored thereon that, when executed by the processor, cause the processor to perform operations comprising:
determining clusters of media items based on pixels of images or videos from a collection of media items such that the media items in each cluster have visual similarity, wherein the collection of media items is associated with a user account;
selecting a subset of the clusters of media items based on corresponding media items in each cluster having visual similarity within a range of threshold visual similarity values; and
causing a user interface to be displayed, the user interface comprising the subset of the clusters of media items.
20. The system of claim 19, wherein:
each media item has an associated timestamp;
the media items captured within the predetermined period of time are associated with episodes; and
selecting the subset of the clusters of media items is based on corresponding associated timestamps such that corresponding media items in the subset of the clusters of media items satisfy a temporal diversity criterion that excludes more than a predetermined number of corresponding media items from a particular episode.

Applications Claiming Priority (7)

Application Number | Priority Date | Filing Date | Title
US 63/187,390 (US202163187390P) | 2021-05-11 | 2021-05-11 |
US 63/189,658 (US202163189658P) | 2021-05-17 | 2021-05-17 |
US 17/509,767 (US 12008057 B2) | 2021-05-11 | 2021-10-25 | Determining a visual theme in a collection of media items
PCT/US2021/063116 (WO 2022/240444 A1) | 2021-05-11 | 2021-12-13 | Determining a visual theme in a collection of media items

Publications (1)

Publication Number | Publication Date
CN 116324764 A | 2023-06-23

Family

ID=79687071

Family Applications (1)

Application Number | Title
CN202180063447.1A | Determining visual topics in a collection of media items

Country Status (5)

Country | Publication
EP | EP4204993A1
JP | JP2023548906A
KR | KR20230073327A
CN | CN116324764A
WO | WO2022240444A1

Family Cites Families (1)

* Cited by examiner, † Cited by third party

US 9740963 B2 * (SRI International), priority 2014-08-05, published 2017-08-22: Multi-dimensional realization of visual content of an image collection

Also Published As

Publication Number | Publication Date
EP 4204993 A1 | 2023-07-05
WO 2022/240444 A1 | 2022-11-17
JP 2023548906 A | 2023-11-21
KR 20230073327 A | 2023-05-25


Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination