WO2021226607A1

WO2021226607A1 - Systems and methods for video recognition

Info

Publication number: WO2021226607A1
Application number: PCT/US2021/036569
Authority: WO
Inventors: Jenhao Hsiao
Original assignee: Innopeak Technology, Inc.
Priority date: 2021-06-09
Filing date: 2021-06-09
Publication date: 2021-11-11

Abstract

Systems and methods of the present disclosure provides a plug-and-play recognition system can be trained to predict a concept for an input data object. The predicted concept can come from a dynamic list of concepts. The list of concepts can be modified for different applications without retraining the recognition system. The plug-and-play recognition system can produce embeddings representative of the input data object and embeddings representative of the concepts in the list of concepts, Based on the embeddings representative of the input data object and the embeddings representative of the concepts in the list of concepts, a prediction can be generated that is indicative of a likelihood that the input data object corresponds with a concept in the list of concepts.

Description

SYSTEMS AND METHODS FOR VIDEO RECOGNITION

Description of Related Art

[0001] In general, artificial intelligence applies advanced analysis and logic-based techniques to interpret various situations, to support and to automate decision making, and to take actions accordingly. In the field of artificial intelligence, machine learning involves training a computing system to learn and act without explicit programming. There are various applications where machine learning can be applied. For example, a classifier relies on machine learning to sort data into classes or categories. The classifier can be a machine learning model that has been trained to sort the data into the classes or categories. The classifier can be further trained as more data is provided to the classifier. In this way, the classifier can continue to improve and provide more accurate results. Classifiers and other machine learning applications have wide spread use in various technologies. Thus, improvements to machine learning represent improvements to other technologies as well.

Brief Description of the Drawings

[0002] The present disclosure, in accordance with one or more various embodiments, is described in detail with reference to the following figures. The figures are provided for purposes of illustration only and merely depict typical or exemplary embodiments.

[0003] FIG. 1 illustrates an example embedding space, according to various embodiments of the present disclosure.

[0004] FIG. 2 illustrates an example video recognition architecture, according to various embodiments of the present disclosure. [0005] FIG. 3 illustrates an example video encoder, according to various embodiments of the present disclosure.

[0006] FIGS. 4A-4B illustrate example video predictions, according to various embodiments of the present disclosure.

[0007] FIG. 5 illustrates example video predictions, according to various embodiments of the present disclosure.

[0008] FIG. 6 illustrates a computing component that includes one or more hardware processors and machine-readable storage media storing a set of machine-readable/machine- executable instructions that, when executed, cause the one or more hardware processors to perform an illustrative method for rendering effects during gameplay, according to various embodiments of the present disclosure.

[0009] FIG. 7 illustrates a block diagram of an example computer system in which various embodiments of the present disclosure may be implemented.

[0010] The figures are not exhaustive and do not limit the present disclosure to the precise form disclosed.

Detailed Description

[0011] Machine learning is involved in various technological applications. Various approaches to machine learning have been developed, and various approaches to machine learning continue to be developed. In general, incorporating machine learning into a specific application involves training a machine learning model to perform a function for the that specific application. Training the machine learning model to perform the function can require a large amounts of training data and require substantial investment of resources to supervise the training. Once the machine learning model has been trained to perform the function, the machine learning model cannot readily be modified to perform a different function. In cases where the machine learning model needs to be modified to perform a different function, retraining the machine learning model is the typical approach. As training the machine learning model can require a large amount of training data and require substantial investment of resources to supervise the training, retraining the machine learning model is wasteful and inefficient. These problems related to training and retraining a machine learning model are exacerbated in various technological applications where flexibility is important. Thus, there is a need for technological improvements to address these and other technological problems related to machine learning.

[0012] Accordingly, the present disclosure provides solutions that address the technological challenges described above. In various embodiments, a plug-and-play recognition system can be trained to predict a concept or a category for an input data object (e.g., video). The predicted concept can come from a dynamic list of concepts (e.g., text tags). The list of concepts can be modified for different applications without retraining the recognition system. In various embodiments, the plug-and-play recognition system can include encoders (e.g., video encoders, text encoders) that produce embeddings (e.g., object embeddings) representative of the input data object and embeddings representative of the concepts in the list of concepts. Based on the embeddings representative of the input data object and the embeddings representative of the concepts in the list of concepts, a prediction can be generated that is indicative of a likelihood that the input data object corresponds with a concept in the list of concepts. For example, a first video and a first list of text tags can be provided to the plug-and-play recognition system. A video encoder of the plug-and-play recognition system can generate a first video embedding based on the first video. A text encoder of the plug-and-play recognition system can generate first text embeddings based on the first list of text tags. Based on the first video embedding of the first video and the first text embeddings of the first list of text tags, a first prediction can be generated. The first prediction can indicate which, if any, of the concepts described in the first list of text tags is depicted in the first video. A second video and a second list of text tags can be provided to the plug-and-play recognition system. The second list of text tags can contain different text tags from the first list of text tags. The video encoder of the plug-and-play recognition system can generate a second video embedding based on the second video. The text encoder of the plug-and-play recognition system can generate second text embeddings based on the second list of text tags. Based on the second video embedding of the second video and the second text embeddings of the second list of text tags, a second prediction can be generated. The second prediction can indicate which, if any, of the concepts described in the second list of text tags is depicted in the second video. As illustrated in this example, the plug-and-play recognition system can generate a first prediction of whether concepts in a first list of text tags is depicted in a first video and generate a second prediction of whether concepts in a second list of text tags different from the first list of text tags is depicted in a second video without retraining the plug-and-play recognition system to account for the different concepts in the two lists of text tags. Thus, the plug-and-play recognition system provides flexible capabilities without retraining. These flexible capabilities, and other features of the present disclosure, provide solutions that address the various technological problems described herein and other technological problems. The features of these solutions are discussed in further detail herein.

[0013] Before describing embodiments of the present disclosure in detail, it is useful to describe an example embedding space which may be relied upon by the various embodiments of the present disclosure. FIG. 1 illustrates an example embedding space 100, according to various embodiments of the present disclosure. As illustrated in FIG. 1, the example embedding space 100 includes an embedding A 102, an embedding B 104, an embedding C 106, and an embedding D 108. In general, an embedding is a numerical representation (e.g., vector) of a data object (e.g., video, text tag). The embedding can be mapped to an embedding space along with other embeddings. The location of the embedding in the embedding space relative to the other embeddings can represent various interrelationships between the embeddings. For example, embeddings mapped to an embedding space can represent similarities or dissimilarities between the data objects the embeddings represent. Embeddings that are relatively closer to each other in the embedding space can be more similar (e.g., have more similar features) than embeddings that are relatively farther away from each other. Embeddings that are relatively farther away from each other in the embedding space can be more dissimilar (e.g., have more dissimilar features) than embeddings that are relatively closer to each other. For example, in the example embedding space 100, embedding A 102 is relatively closer to embedding B 104 than embedding B 104 is to embedding C 106. The relative closeness of embedding A 102 to embedding B 104 in the embedding space compared to embedding B 104 to embedding C 106 can indicate that the data object represented by embedding A 102 and the data object represented by embedding B 104 are more similar, or have more similar features, than the data object represented by embedding B 104 and the data object represented by embedding C 106. Also in the example embedding space 100, embedding B 104 is relatively farther from embedding C 106 than embedding C 106 is from embedding D 108. The relative farness of embedding B 104 from embedding C 106 can indicate that the data object represented by embedding B 104 and the data object represented by embedding C 106 are more dissimilar, or have more dissimilar features, than the data object represented by embedding C 106 and the data object represented by embedding D 108.

[0014] An embedding can be generated based on machine learning. For example, a machine learning model (e.g., encoder) can be trained to generate an embedding based on an input data object. The machine learning model can be trained based on training data that includes similar data objects and dissimilar data objects. The machine learning model can be trained to generate embeddings for the similar data objects that, when mapped in an embedding space, are relatively closer to each other than embeddings for dissimilar data objects. The machine learning model can be trained to generate embeddings for the dissimilar data objects that, when mapped in the embedding space, are relatively farther from each other than embeddings for similar data objects. The embeddings generated by the machine learning model can be evaluated and errors can be backpropagated to adjust parameters of the machine learning model. Once the machine learning model is trained, the trained machine learning model can be applied to an input data object to generate an embedding that represents the input data object. The embedding can be mapped to an embedding space, such as the example embedding space 100. [0015] The examples provided herein are for illustrative purposes. While the example embedding space 100 shows a three-dimensional space, an embedding space can be an n-dimensional space. Further, while the example embedding space 100 shows four embeddings mapped to the embedding space 100, any number of embeddings can be mapped to an embedding space. In some cases, an embedding space can be represented as a table. Furthermore, while the examples described herein with regard to the embedding space 100 reference determining similarity and dissimilarity between data objects based on their relative proximity in an embedding space, various other semantic relationships can be represented by embeddings mapped to an embedding space. Many variations are possible.

[0016] FIG. 2 illustrates an example video recognition architecture 200, according to various embodiments of the present disclosure. The example video recognition architecture 100 can generate a prediction as to a concept (e.g., action) being depicted in a video. As illustrated in FIG. 2, the example video recognition architecture 200 includes a video encoder 204. In various embodiments, the video encoder 204 can be a bi-modal video encoder. The bi-modal video encoder can include a machine learning model, such as a visual convolutional neural network (CNN) or a 3D CNN, that evaluates visual signals associated with a video. The bi-modal video encoder can also include a machine learning model, such as an audio CNN, a 2D CNN, or a 3D CNN, that evaluates audio signals associated with the video. The example video recognition architecture 100 can include a text encoder 212. The text encoder 212 can include a machine learning model, such as a neural network language model (NNLM) or recurrent NNLM (RNNLM), that evaluates text tags. As illustrated in FIG. 1, videos 202 are provided to the video encoder 204. The video encoder 204 generates video embeddings 206 that represent the videos 202. Text tags 214 are provided to the text encoder 212. The text encoder 212 generates text embeddings 210 that represent the text tags 214. The video embeddings 206 and the text embeddings 210 can be evaluated in an embedding space 208. Based on the evaluation of the video embeddings 206 and the text embeddings 210 in the embedding space 208, a prediction 216 can be generated. The prediction 216 can include scores (e.g., percentages, fractions) for each video of the videos 202 that indicate a likelihood or confidence that the video depicts a concept described by each respective text tag of the text tags 204. For example, a video can be provided to the example video recognition architecture 100 to be evaluated as to whether the video depicts certain actions. The certain actions can be provided to the example video recognition architecture 100 as text tags. Based on the video and the text tags, a video embedding and text embeddings can be generated. The video embedding and the text embeddings can be evaluated in an embedding space. Based on the evaluation of the video embedding and the text embedding, a prediction can be generated that indicates a likelihood, for each action provided as text tags, that the video depicts the action. As illustrated in this example, a plug-and-play recognition system, such as the example video recognition architecture 100, can evaluate a data object, such as a video, under a dynamic range of categories or concepts, such as actions.

[0017] In various embodiments, an embedding space in a plug-and-play recognition system, such as the embedding space 208 can be a learned visual-semantic embedding space. The visual-semantic embedding space can be learned by jointly training a video encoder and text encoder of the plug-and-play recognition system. Training the plug-and-play recognition system can be based on training data that includes video-text pairs, which includes pairs of videos and text tags. For a given sample of video-text pairs, the plug-and-play recognition system can be trained to predict which of the video-text pairs occurred. Occurrence of the video-text pair indicates that the video of the video-text pair depicted the concept described by the text tag of the video-text pair. In the sample of video-text pairs, positive training data can include the video-text pairs that occur, or the video-text pairs where the video depicted the concept described by the text tag. Negative training data can include the video-text pairs that did not occur, or the video-text pairs where the video did not depict the concept described by the text tag. In some cases, the plug-and-play recognition system is trained to maximize the cosine similarity of video embeddings and text embeddings corresponding to the video-text pairs that occurred. In various embodiments, training the plug-and-play recognition system involves application of a loss function to the training data. In the training of the plug-and-play recognition system, the loss function can be described as:

where sim(z, t ) is a dot product between a l₂ normalized video embedding z and text embedding t is a temperature parameter. The above Ly corresponds with the loss function for a positive video-text pair (I, j).

[0018] In various embodiments, the text encoder 212 can apply a variety of machine learning techniques to generate a text embedding based on a text tag. For example, the text encoder 212 can use a continuous bag-of-words (CBOW) model or a continuous skip-gram model. A CBOW model can be based on a neural network language model (NNLM) that includes input, projection, hidden, and output layers. In the NNLM, text is encoded at the input layers, which are then projected to the projection layers. The hidden layers provide additional input to the projection layers. The result is mapped to the output layers. The CBOW model can operate without the hidden layers of the NNLM and can share the projection layers for all text provided by the input layers. In this way, the CBOW model is trained based on an evaluation of text where proceeding text and preceding text are evaluated in the same space. In a skip-gram model, a log-linear classifier can be implemented with a continuous projection layer. In this way, training the skip-gram model involves predicting text within a certain range proceeding and preceding current text. In various embodiments, the text encoder 212 can use, for example, a continuous skip-gram model to generate an embedding based on a text tag. In some cases, the embedding can be mapped to an embedding space with other embeddings based on other text tags. The embeddings that are relatively closer together represent words that are relatively more similar than words represented by embeddings that are relatively farther apart. The examples described herein are for illustrative purposes, and many variations are possible.

[0019] FIG. 3 illustrates an example video encoder 300, according to various embodiments of the present disclosure. In various embodiments, the example video encoder BOO can be implemented as the video encoder 204 of FIG. 2. As illustrated in FIG. 3, the example video encoder 300 can be a bi-modal video encoder. Video clips 302 can be provided to a visual convolutional neural network (CNN) 304. The visual CNN 304 evaluates visual signals in the video clips 302 to generate visual clip descriptors 306. In various embodiments, the visual CNN 304 can be implemented based on a 3D CNN. A 3D CNN utilizes 2D spatial convolutions to encode spatial information in a video clip and ID temporal convolutions to encode temporal information in the video clip. The spatial information in the video clip can include image data from the video frames of the video clip. The temporal information in the video clip can include the order in which the video frames are arranged in the video clip. The 3D CNN can be trained to evaluate a video clip based on the image data from the video frames of the video clip in the order in which the video frames are played in the video clip. The 3D CNN can generate a visual clip descriptor based on the video clip. In some cases, the visual clip descriptor can be used to identify, for example, an action depicted in the video clip based on the spatial information and temporal information associated with the video clip.

[0020] As illustrated in FIG. 3, the video clips 302 can be provided to an audio convolutional neural network (CNN) 314. The audio CNN 316 evaluates audio signals in the video clips 302 to generate audio clip descriptors 316. In various embodiments, the audio CNN 314 can extract features from the audio signals in the video clips 302 by converting them to a sample rate on which the audio CNN 314 was trained. The audio CNN 314 can extract log mel spectrograms from the audio signals. Based on the extracted features and the extracted log mel spectrograms, the audio CNN can perform an audio classification task to identify, for example, an action depicted in a video clip from the audio signals in the video clip.

[0021] As illustrated in FIG. 3, the visual clip descriptors 306 are provided to a visual inter-clip fusion 308. The visual inter-clip fusion 308 uses convolution to model inter-clip relationships between the visual clip descriptors 306. As duration of different actions can be variant and complex, as well as span across multiple video clips, the visual inter-clip fusion 308 can capture inter-clip dependencies for short-range dependencies and long-range dependencies by aggregating information from other video clips. The visual cross-modal fusion 310 captures information from the visual clip descriptors 306 and the audio clip descriptors 316. The visual cross-modal fusion 310 can capture inter-clip dependencies between the visual clip descriptors 306 and the audio clip descriptors 316. On the other side, the audio clip descriptors 316 are provided to an audio inter-clip fusion 318. The audio inter clip fusion 318 uses convolution to model inter-clip relationships between the audio clip descriptors 316. The audio inter-clip fusion 318 can capture inter-clip dependencies for short- range dependencies and long-range dependencies by aggregating information from audio signals of other video clips. The audio cross-modal fusion 320 captures information from the audio clip descriptors 316 and the visual clip descriptors 306. The audio cross-modal fusion 320 can capture inter-clip dependencies between the audio clip descriptors 316 and the visual clip descriptors 306.

[0022] In various embodiments, inter-clip relationships can be fused by a bi directional attention layer. The fusion of the inter-clip relationships can be expressed as:

where S is a source vector and T is a target vector. The vectors can be, for example, visual clip descriptors and audio clip descriptors, and the vectors can be from different time segments. W_q is a linear transform matrix for query transformations. V\ is a linear transform matrix for key vector transformations. W_v is a linear transform matrix for value vector transformations. (W_qS)(W_kT)^T models the bi-directional relationship between the source vector and the target vector. d is a normalization factor. Based on the fusion of inter-clip relationships described above, the visual inter-clip fusion 308 can be modeled as:

where V^self is the fusion of two visual clip descriptors. Based on the fusion of inter-clip relationships described above, the audio inter-clip fusion 318 can be modeled as:

where A^self is the fusion of two audio clip descriptors. The visual cross-modal fusion 310 can be modeled as:

where V^fuse is the fusion of a visual inter-clip fusion of visual clip descriptors (V^self) and an audio inter-clip fusion of audio clip descriptors (A^self). The audio cross modal fusion 320 can be modeled as:

where A^fuse is the fusion of an audio inter-clip fusion of audio clip descriptors (A^self) and a visual inter-clip fusion of visual clip descriptors (V^self). Based on the fusions described above, the visual cross-modal fusion 310 generates fused visual clip descriptors 312. Based on the fusions described above, the audio cross-modal fusion 320 generates fused audio clip descriptors 322. Adaptive pooling 324 is applied to the fused visual clip descriptors 312 and the fused audio clip descriptors 322. In some cases, directly taking an average of fused clip descriptors can decrease overall accuracy due to video clips that are less relevant than other video clips in a determination of an action depicted in a video. Adaptive pooling 324 can adaptively pool fused clip descriptors based on their significance or relevance to a video-level action recognition decision. In this case, a gating module r can be applied. The gating module can be expressed as:

where r(X) is the gating module. X is the respective vector pairs of corresponding fused video clip descriptors (V^fuse) and fused audio clip descriptors (A^fuse). X in this equation can be provided as:

X

where V^fuse is a fused video clip descriptor and A^fuse is a corresponding audio clip descriptor. A video embedding 326 can be generated based on the adaptive pooling 324. The video embedding 326 can be generated by: r i \

where z is the video embedding. X, is a vector pair of corresponding fused video clip descriptors and fused audio clip descriptors. n(X) is the gating module.

[0023] FIGS. 4A-4B illustrate example video predictions, according to various embodiments of the present disclosure. The example video predictions can be associated with one or more functionalities performed by, for example, the video recognition architecture 200 of FIG. 2. As illustrated in the example video predictions, a plug-and-play recognition system can evaluate a video under different lists of text tags. The evaluation can be performed without retraining the plug-and-play recognition system based on the modified list. The examples provided herein are illustrative rather than limiting. There can be additional, fewer, or alternative steps performed in similar or alternative orders, or in parallel, based on the various features and embodiments discussed herein unless otherwise stated.

[0024] FIG. 4A illustrates an example video prediction 400, according to various embodiments of the present disclosure. As illustrated in FIG. 4A, a video 402 can be provided to a video encoder 404. Text tags 410 can be provided to a text encoder 408. In this example, the text tags 410 can include "ski", "ping pong", "boxing", "hamburger", "sing", "tree", "taekwondo", "swim", and "jump". Many variations are possible. While not shown, the video encoder 404 can generate a video embedding based on the video 402. The text encoder 408 can generate text embeddings based on the text tags 410. The video embedding and the text embeddings can be mapped to an embedding space (not shown). Based on the relationship between the video embedding and the text embeddings in the embedding space, a prediction 406 can be generated. The prediction 406, in this example, shows relative likelihoods or confidences that the video 402 depicts a concept (e.g., actions) described by the text tags 410. In this example, the prediction 406 can indicate that "ski" is the most likely concept of the concepts described by the text tags 410 to be depicted in the video 402. The second most likely concept of the concepts described by the text tags 410 to be depicted in the video 402 can be "jump". The other concepts described by the text tags 410 are associated with low likelihoods or low confidences indicating that it is unlikely that the video 402 depicts the other concepts described by the text tags 410.

[0025] FIG. 4B illustrates an example video prediction 450, according to various embodiments of the present disclosures. As illustrated in FIG. 4B, a video 452 can be provided to a video encoder 454. The video 452 and the video encoder 454 in this example can be the video 402 and the video encoder 404 of FIG. 4A. Text tags 460 can be provided to a text encoder 458. The text encoder in this example can be the text encoder 408 of FIG. 4A. The text tags 460 can include "spraying", "skiing", "baseball", "swim", "jog", "unboxing", and "tai chi". Many variations are possible. While not shown, the video encoder 454 can generate a video embedding based on the video 452. The text encoder 458 can generate text embeddings based on the text tags 460. The video embedding and the text embeddings can be mapped to an embedding space (not shown). Based on the relationship between the video embedding and the text embeddings in the embedding space, a prediction 456 can be generated. The prediction 456, in this example, shows relative likelihoods or confidences that the video 452 depicts a concept (e.g., actions) described by the text tags 460. In this example, the prediction 456 can indicate that "skiing" is the most likely concept of the concepts described by the text tags 460 to be depicted in the video 402. The other concepts described by the text tags 460 are associated with low likelihoods or low confidences indicating that it is unlikely that the video 452 depicts the other concepts described by the text tags 460. As illustrated in this example, a plug-and-play recognition system can determine which concept out of a dynamic list of concepts is depicted in a video. The plug-and-play recognition system can evaluate the video under different lists of concepts without retraining.

[0026] FIG. 5 illustrates example video predictions 500, according to various embodiments of the present disclosure. The example video predictions 500 can be associated with one or more functionalities performed by, for example, the video recognition architecture 200 of FIG. 2. As illustrated by the example video predictions 500, a plug-and- play recognition system can evaluate different videos under a list of text tags. The examples provided herein are illustrative rather than limiting. There can be additional, fewer, or alternative steps performed in similar or alternative orders, or in parallel, based on the various features and embodiments discussed herein unless otherwise stated.

[0027] As illustrated in FIG. 5, a list of text tags 502 can be provided to a plug-and- play recognition system. The list of text tags 502 can include, for example, "ping pong", "boxing", "hamburger", "sing", "ski", "kick", "taekwondo", "swim", and "jump". Many variations are possible. Video A 504 can be provided to the plug-and-play recognition system. Based on video A 504 and the list of text tags 502, the plug-and-play recognition system can generate prediction A 506. Prediction A 506 can indicate that video A 504 is most likely to depict the concept "boxing" out of the concepts described by the list of text tags 502. Prediction A 506 can indicate that "taekwondo" is the second most likely concept out of the concepts described by the list of text tags 502 to be depicted in video A 504. Prediction A 506 can also indicate that the other concepts described by the text tags 502 are associated with low likelihoods or low confidences indicating that it is unlikely that video A 504 depicts the other concepts described by the text tags 502. Video B 508 can be provided to the plug-and- play recognition system. Based on video B 508 and the list of text tags 502, the plug-and-play recognition system can generate prediction B 510. Prediction B 510 can indicate that video B

508 is most likely to depict the concept "ping pong" out of the concepts described by the list of text tags 502. Prediction B 510 can also indicate that the other concepts described by the text tags 502 are associated with low likelihoods or low confidences indicating that it is unlikely that video B 508 depicts the other concepts described by the text tags 502. Video C 512 can be provided to the plug-and-play recognition system. Based on video C 512 and the list of text tags 502, the plug-and-play recognition system can generate prediction C 514. Prediction C 514 can indicate that video C 512 is most likely to depict the concept "sing" out of the concepts described by the list of text tags 502. Prediction C 514 can also indicate that the other concepts described by the text tags 502 are associated with low likelihoods or low confidences indicating that it is unlikely that video B 508 depicts the other concepts described by the text tags 502. Video D 516 can be provided to the plug-and-play recognition system. Based on video D 516 and the list of text tags 502, the plug-and-play recognition system can generate prediction D 518. Prediction D 518 can indicate that video D 516 is most likely to depict the concept "taekwondo" out of the concepts described by the list of text tags 502. Prediction D 518 can indicate that "boxing" is the second most likely concept out of the concepts described by the list of text tags 502 to be depicted in video D 516. Prediction D 518 can indicate that "jump" is the third most likely concept out of the concepts described by the list of text tags 502 to be depicted in video D 516. Prediction D 518 can also indicate that the other concepts described by the text tags 502 are associated with low likelihoods or low confidences indicating that it is unlikely that video D 516 depicts the other concepts described by the text tags 502.

[0028] FIG. 6 illustrates a computing component 600 that includes one or more hardware processors 602 and machine-readable storage media 604 storing a set of machine- readable/machine-executable instructions that, when executed, cause the one or more hardware processors 602 to perform an illustrative method for predicting a concept for an input data object, according to various embodiments of the present disclosure. The computing component 600 may be, for example, the computing system 700 of FIG. 7. The hardware processors 602 may include, for example, the processor(s) 704 of FIG. 7 or any other processing unit described herein. The machine-readable storage media 604 may include the main memory 706, the read-only memory (ROM) 708, the storage 710 of FIG. 7, and/or any other suitable machine-readable storage media described herein.

[0029] At block 606, the hardware processor(s) 602 may execute the machine- readable/machine-executable instructions stored in the machine-readable storage media 604 to receive a data object and a list of text tags. In various embodiments, the data object can be a video or other media content item. The list of text tags can include a list of words describing various concepts or actions.

[0030] At block 608, the hardware processor(s) 602 may execute the machine- readable/machine-executable instructions stored in the machine-readable storage media 604 to generate an embedding (e.g., object embedding) based on the data object. In various embodiments, the data object can be a video, and the embedding can be a video embedding generated by a video encoder. The video encoder can be a bi-modal video encoder.

[0031] At block 610, the hardware processor(s) 602 may execute the machine- readable/machine-executable instructions stored in the machine-readable storage media 604 to generate text embeddings based on the list of text tags. In various embodiments, the text embeddings can be generated by a text encoder. Each text embedding can correspond with a text tag in the list of text tags.

[0032] At block 612, the hardware processor(s) 602 may execute the machine- readable/machine-executable instructions stored in the machine-readable storage media 604 to determine a likelihood that the data object depicts a concept described in the list of text tags based on the embedding and the text embeddings. In various embodiments, the determination can be based on a mapping of the embedding and the text embeddings to an embedding space. The relative positions of the embedding and the text embeddings in the embedding space can be indicative of the likelihood that the data object corresponding to the embedding depicts a concept described by the text tags corresponding to the text embeddings.

[0033] FIG. 7 illustrates a block diagram of an example computer system 700 in which various embodiments of the present disclosure may be implemented. The computer system 700 can include a bus 702 or other communication mechanism for communicating information, one or more hardware processors 704 coupled with the bus 702 for processing information. The hardware processor(s) 704 may be, for example, one or more general purpose microprocessors. The computer system 700 may be an embodiment of an access point controller module, access point, or similar device.

[0034] The computer system 700 can also include a main memory 706, such as a random access memory (RAM), cache and/or other dynamic storage devices, coupled to the bus 702 for storing information and instructions to be executed by the hardware processor(s) 704. The main memory 706 may also be used for storing temporary variables or other intermediate information during execution of instructions by the hardware processor(s) 704. Such instructions, when stored in a storage media accessible to the hardware processor(s) 704, render the computer system 700 into a special-purpose machine that can be customized to perform the operations specified in the instructions.

[0035] The computer system 700 can further include a read only memory (ROM) 708 or other static storage device coupled to the bus 702 for storing static information and instructions for the hardware processor(s) 704. A storage device 710, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., can be provided and coupled to the bus 702 for storing information and instructions.

[0036] Computer system 700 can further include at least one network interface 712, such as a network interface controller module (NIC), network adapter, or the like, or a combination thereof, coupled to the bus 702 for connecting the computer system 700 to at least one network.

[0037] In general, the word "component," "modules," "engine," "system," "database," and the like, as used herein, can refer to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++. A software component or module may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software components may be callable from other components or from themselves, and/or may be invoked in response to detected events or interrupts. Software components configured for execution on computing devices, such as the computing system 700, may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution). Such software code may be stored, partially or fully, on a memory device of an executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware components may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors.

[0038] The computer system 700 may implement the techniques or technology described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system 700 that causes or programs the computer system 700 to be a special-purpose machine. According to one or more embodiments, the techniques described herein are performed by the computer system 700 in response to the hardware processor(s) 704 executing one or more sequences of one or more instructions contained in the main memory 706. Such instructions may be read into the main memory 706 from another storage medium, such as the storage device 710. Execution of the sequences of instructions contained in the main memory 706 can cause the hardware processor(s) 704 to perform process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

[0039] The term "non-transitory media," and similar terms, as used herein refers to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media.

The non-volatile media can include, for example, optical or magnetic disks, such as the storage device 710. The volatile media can include dynamic memory, such as the main memory 706. Common forms of the non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD- ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, an NVRAM, any other memory chip or cartridge, and networked versions of the same.

[0040] Non-transitory media is distinct from but may be used in conjunction with transmission media. The transmission media can participate in transferring information between the non-transitory media. For example, the transmission media can include coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 702. The transmission media can also take a form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

[0041] The computer system 700 also includes a network interface 718 coupled to bus 702. Network interface 718 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks. For example, network interface 718 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, network interface 718 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or WAN component to communicated with a WAN). Wireless links may also be implemented. In any such implementation, network interface 718 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

[0042] A network link typically provides data communication through one or more networks to other data devices. For example, a network link may provide a connection through local network to a host computer or to data equipment operated by an Internet Service Provider (ISP). The ISP in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet." Local network and Internet both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link and through network interface 718, which carry the digital data to and from computer system 700, are example forms of transmission media.

[0043] The computer system 700 can send messages and receive data, including program code, through the network(s), network link and network interface 718. In the Internet example, a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the network interface 718.

[0044] The received code may be executed by processor 704 as it is received, and/or stored in storage device 710, or other non-volatile storage for later execution.

[0045] Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code components executed by one or more computer systems or computer processors comprising computer hardware. The one or more computer systems or computer processors may also operate to support performance of the relevant operations in a "cloud computing" environment or as a "software as a service" (SaaS). The processes and algorithms may be implemented partially or wholly in application-specific circuitry. The various features and processes described above may be used independently of one another, or may be combined in various ways. Different combinations and sub-combinations are intended to fall within the scope of this disclosure, and certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate, or may be performed in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed example embodiments. The performance of certain of the operations or processes may be distributed among computer systems or computers processors, not only residing within a single machine, but deployed across a number of machines. [0046] As used herein, a circuit might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a circuit. In implementation, the various circuits described herein might be implemented as discrete circuits or the functions and features described can be shared in part or in total among one or more circuits. Even though various features or elements of functionality may be individually described or claimed as separate circuits, these features and functionality can be shared among one or more common circuits, and such description shall not require or imply that separate circuits are required to implement such features or functionality. Where a circuit is implemented in whole or in part using software, such software can be implemented to operate with a computing or processing system capable of carrying out the functionality described with respect thereto, such as computer system 700.

[0047] As used herein, the term "or" may be construed in either an inclusive or exclusive sense. Moreover, the description of resources, operations, or structures in the singular shall not be read to exclude the plural. Conditional language, such as, among others, "can," "could," "might," or "may," unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps.

[0048] Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. Adjectives such as "conventional," "traditional," "normal," "standard," "known," and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. The presence of broadening words and phrases such as "one or more," "at least," "but not limited to" or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent.

Claims

Claims What is claimed is:

1. A computer-implemented method comprising: receiving, by a recognition system, a data object and a list of text tags; generating, by the recognition system, an object embedding based on the data object; generating, by the recognition system, text embeddings based on the list of text tags; and determining, by the recognition system, a likelihood that the data object depicts a concept described in the list of text tags based on the object embedding and the text embeddings.

2. The computer-implemented method of claim 1, wherein the data object is a video and the object embedding is a video embedding generated by a video encoder based on the video.

3. The computer-implemented method of claim 2, wherein the video encoder is a bi- modal video encoder that includes a visual convolutional neural network (CNN) and an audio CNN).

4. The computer-implemented method of claim 3, wherein the bi-modal video encoder further includes a bi-directional attention layer that fuses visual descriptors generated by the visual CNN with audio descriptors generated by the audio CNN.

5. The computer-implemented method of claim 1, wherein the text embeddings are generated by a text encoder based on the text tags.

6. The computer-implemented method of claim 1, further comprising: mapping, by the recognition system, the object embedding and the text embeddings to an embedding space, wherein the likelihood that the data object depicts the concept described in the list of text tag is further based on a relative position of the object embedding to the text embedding corresponding to the text tag that describes the concept.

7. The computer-implemented method of claim 1, further comprising: generating, by the recognition system, scores for each text tag in the list of text tags, wherein the scores indicate the likelihood for each text tag that a concept described by the text tag is depicted in the data object.

8. The computer-implemented method of claim 1, further comprising: determining, by the recognition system, a second likelihood that the data object depicts a second concept described in a second list of text tags based on the object embedding and second text embeddings generated based on the second list of text tags, wherein the second likelihood is determined without retraining the recognition system for the second list of text tags.

9. A computing system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the computing system to perform: receiving a data object and a list of text tags; generating an object embedding based on the data object; generating text embeddings based on the list of text tags; and determining a likelihood that the data object depicts a concept described in the list of text tags based on the object embedding and the text embeddings.

10. The computing system of claim 9, wherein the data object is a video and the object embedding is a video embedding generated by a video encoder based on the video.

11. The computing system of claim 10, wherein the video encoder is a bi-modal video encoder that includes a visual convolutional neural network (CNN) and an audio CNN).

12. The computing system of claim 11, wherein the bi-modal video encoder further includes a bi-directional attention layer that fuses visual descriptors generated by the visual CNN with audio descriptors generated by the audio CNN.

13. The computing system of claim 9, wherein the instructions further cause the computing system to perform: mapping the object embedding and the text embeddings to an embedding space, wherein the likelihood that the data object depicts the concept described in the list of text tag is further based on a relative position of the object embedding to the text embedding corresponding to the text tag that describes the concept.

14. The computing system of claim 9, wherein the instructions further cause the computing system to perform: determining a second likelihood that the data object depicts a second concept described in a second list of text tags based on the object embedding and second text embeddings generated based on the second list of text tags, wherein the second likelihood is determined without retraining the recognition system for the second list of text tags.

15. A non-transitory storage medium of a computing system storing instructions that, when executed by at least one processor of the computing system, cause the computing system to perform: receiving a data object and a list of text tags; generating an object embedding based on the data object; generating text embeddings based on the list of text tags; and determining a likelihood that the data object depicts a concept described in the list of text tags based on the object embedding and the text embeddings.

16. The non-transitory storage medium of claim 15, wherein the data object is a video and the object embedding is a video embedding generated by a video encoder based on the video.

17. The non-transitory storage medium of claim 16, wherein the video encoder is a bi- modal video encoder that includes a visual convolutional neural network (CNN) and an audio CNN).

18. The non-transitory storage medium of claim 17, wherein the bi-modal video encoder further includes a bi-directional attention layer that fuses visual descriptors generated by the visual CNN with audio descriptors generated by the audio CNN.

19. The non-transitory storage medium of claim 15, wherein the instructions further cause the computing system to perform: mapping the object embedding and the text embeddings to an embedding space, wherein the likelihood that the data object depicts the concept described in the list of text tag is further based on a relative position of the object embedding to the text embedding corresponding to the text tag that describes the concept.

20. The non-transitory storage medium of claim 15, wherein the instructions further cause the computing system to perform: determining a second likelihood that the data object depicts a second concept described in a second list of text tags based on the object embedding and second text embeddings generated based on the second list of text tags, wherein the second likelihood is determined without retraining the recognition system for the second list of text tags.