WO2024005784A1 - Text-to-video retrieval using shifted self-attention windows - Google Patents

Text-to-video retrieval using shifted self-attention windows

Info

Publication number
WO2024005784A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
feature vectors
attention
feature vector
layer
Prior art date
Application number
PCT/US2022/035244
Other languages
English (en)
Inventor
Yikang Li
Jenhao Hsiao
Chiuman HO
Original Assignee
Innopeak Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2022/035244
Publication of WO2024005784A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content

Definitions

  • This application relates generally to video search including, but not limited to, methods, systems, and non-transitory computer-readable media for searching video clips to identify a video clip in response to a textual query.
  • the textual query and each image frame of a video clip are converted to a text feature vector and a respective visual feature vector using encoders of a contrastive language-image pre-training (CLIP) network in a semantic text-image space.
  • a local self-attention mechanism applies a modulated one-dimensional (1D) shifted window (Swin) transformer to correlate visual feature vectors generated from adjacent image frames to generate a video feature vector of the corresponding video clip.
  • Video feature vectors of different video clips in a dataset can be predetermined and stored with the video clips, making it efficient to apply the local self-attention mechanism in user applications involving text-to-video retrieval.
  • information is preserved in the semantic text-image space to guarantee an accuracy level for text-to-video retrieval, while the local self-attention mechanism facilitates information matching between different modalities (i.e., between video and text information).
  • a video extraction method is implemented in an electronic device.
  • the method includes obtaining a textual query and generating a textual feature vector in a semantic space.
  • the method further includes obtaining a video clip including a sequence of image frames and generating a plurality of first visual feature vectors from a subset of image frames of the video clip. Each visual feature vector corresponds to a respective image frame.
  • the method further includes iteratively correlating the plurality of first visual feature vectors based on at least one shifted window scheme to generate a plurality of second visual feature vectors and generating a video feature vector from the plurality of second visual feature vectors.
  • the method further includes retrieving the video clip in response to the textual query based on a video-query similarity level of the textual feature vector and the video feature vector.
  • the at least one shifted window scheme includes a first shifted window scheme having a first local attention window centered at a corresponding feature vector and a second local attention window started at the corresponding feature vector, and the first and second local attention windows are configured to be applied alternatingly to a sequence of self-attention layers.
  • the at least one shifted window scheme includes a plurality of shifted window schemes configured to correlate the plurality of first visual feature vectors to produce the plurality of second visual feature vectors successively via a plurality of frame attention blocks.
  • each frame attention block corresponds to a respective shifted window scheme, and includes a patch merging module configured to reduce the feature vectors in number by a scaling factor and increase the feature elements of each feature vector in number by the scaling factor.
  • each frame attention block corresponds to a respective shifted window scheme that includes a first shifted window scheme having a first local attention window centered at a corresponding feature vector and a second local attention window started at the corresponding feature vector.
  • the first and second local attention windows are configured to be applied alternatingly to successive self-attention layers.
  • some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the one or more processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example of a data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating an electronic system for data processing, in accordance with some embodiments.
  • Figure 3 is an example of a data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an exemplary neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments.
  • Figure 4B is an example of a node in the neural network, in accordance with some embodiments.
  • FIG. 5 is a flow diagram of an exemplary process of identifying a video clip in response to a textual query using a single frame attention block, in accordance with some embodiments.
  • Figure 6 is a temporal diagram of a shifted window scheme having two local attention windows that are applied alternatingly to a sequence of self-attention layers of a 1D Swin block, in accordance with some embodiments.
  • Figure 7 is a flow diagram of an exemplary process of identifying a video clip in response to a textual query using a sequence of frame attention blocks, in accordance with some embodiments.
  • Figure 8 is a flow diagram of an exemplary process of training a sequence of frame attention blocks for text-to-video extraction, in accordance with some embodiments.
  • Figure 9 is a flowchart of an exemplary video extraction method, in accordance with some embodiments.
  • a video search engine is configured to search through a plurality of video clips and identify one or more video clips in response to a textual query.
  • This video search engine is applied with a user application (e.g., a multimedia album) of an electronic device to extract a video clip matching a textual query efficiently and accurately.
  • the electronic device obtains a textual query and a video clip.
  • the electronic device generates a textual feature vector and a plurality of first visual feature vectors, e.g., from extraction by a textual encoder and an image encoder, in a semantic space.
  • Each of the plurality of first visual feature vectors corresponds to a respective one of a subset of image frames selected from the video clip.
  • the first visual feature vectors are temporally ordered and iteratively correlated based on at least one shifted window scheme to generate a plurality of second visual feature vectors. Specifically, the first visual feature vectors are correlated successively using one or more frame attention blocks, each of which has a sequence of self-attention layers and follows a respective shifted window scheme.
  • the video feature vector is generated from the plurality of second visual feature vectors and compared with the textual query.
  • the video clip is retrieved based on a video-query similarity level of the textual feature vector and the video feature vector.
  • FIG. 1 is an example of a data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the client device(s) 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted displays (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by another client device 104 or the server(s) 102.
  • the server(s) 102 can provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the server(s) 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the server(s) 102. Further, in some embodiments, the server(s) 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share the information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.
  • the server(s) 102, client device(s) 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the communication network(s) 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the communication network(s) 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the communication network(s) 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the communication network(s) 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
  • both model training and data processing can be implemented locally in each individual client device 104 (e.g., the mobile phone 104C and HMD 104D).
  • the client device 104 obtains the training data from the server(s) 102 or storage 106 and applies the training data to train the data processing models.
  • both model training and data processing can be implemented remotely in a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A and HMD 104D).
  • the server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the recognized poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending the content data to the server 102A.
  • data processing is implemented locally in a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely in a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.
  • FIG. 2 is a block diagram illustrating an electronic system 200 for data processing, in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104 (e.g., a mobile device 104B or 104C in Figure 1), a storage 106, or a combination thereof.
  • the electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the electronic system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras 260 (e.g., a depth camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable the presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receivers, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, multimedia photo album applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video clips in a multimedia album application) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, searching the content data in response to textual queries, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224 (e.g., to search a plurality of video clips in response to a textual query in a multimedia photo album);
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
    o Training data 238 for training one or more data processing models 240, where the training data 238 include a plurality of video-text training pairs;
    o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques; and
    o Content data and results 242 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where in some embodiments, the
  • the database(s) 230 may be stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the database(s) 230 may be distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored in the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206 optionally, stores a subset of the modules and data structures identified above.
  • memory 206 optionally, stores additional modules and data structures not described above.
  • FIG. 3 is another example of a data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • the model training module 226 and the data processing module 228 are both located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to the type of content data to be processed.
  • the training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 that is applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of the following: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing module 228.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an exemplary neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments.
  • Figure 4B is an example of a node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the node input(s).
  • a weight w associated with each link 412 is applied to the node output.
  • the node input(s) can be combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the node input(s).
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the layer(s) may include a single layer acting as both an input layer and an output layer.
  • the layer(s) may include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406.
  • each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the hidden layer(s) of the CNN can be convolutional layers convolving with multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • a recurrent neural network is applied in the data processing model 240 to process content data (particularly, textual and audio data).
  • Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for handwriting or speech recognition.
  • the training process is a process for calibrating all of the weights w_i for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers.
  • In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
  • FIG. 5 is a flow diagram of an exemplary process 500 of identifying a video clip 502 in response to a textual query 504 using a single frame attention block 506, in accordance with some embodiments.
  • An electronic device has a user application 224 (e.g., a multimedia photo album) storing a plurality of video clips 502 captured by a camera 260 or downloaded from a different source (e.g., another client device 104 or a server 102).
  • the user application 224 displays a graphical user interface (GUI) for receiving a user input of the textual query 504 (e.g., “dog resting on floor”).
  • the user application 224 searches through the video clips 502 to identify one or more target video clips 502T in response to the textual query 504.
  • the user application 224 reviews all of the video clips 502 to identify all of the video clips 502T that match the textual query 504.
  • the user application 224 stops reviewing the video clips 502 when one of the video clips 502T that match the textual query 504 is identified.
  • the electronic device includes an image encoder 508 and a text encoder 510.
  • the text encoder 510 is configured to generate a textual feature vector 512 from the textual query 504.
  • the image encoder 508 is configured to generate a first visual feature vector 514 for each of a subset of image frames 515 of an input video clip 502.
  • the electronic device includes a contrastive language-image pre-training (CLIP) network having the image encoder 508 and text encoder 510.
  • the textual feature vector 512 and first visual feature vector 514 are formed in a semantic space and have the same number of feature elements (e.g., 512 feature elements).
  • the input video clip 502 includes a sequence of M image frames from which N image frames are selected, where N is equal to or less than M (i.e., N ≤ M), and each of the N image frames corresponds to a respective first visual feature vector 514.
  • N is equal to M, and the entire sequence of M image frames is selected.
  • the N image frames are selected from the sequence of M image frames uniformly. For example, one image frame is selected from every 3 successive image frames. In another example, 3 images are selected from every 4 successive image frames.
  • the frame attention block 506 iteratively correlates a plurality of first visual feature vectors 514 of the subset of image frames 515 of the input video clip 502 based on at least one shifted window (Swin) scheme to generate a plurality of second visual feature vectors 516, which are combined to generate a video feature vector 518.
  • the video feature vector 518 is generated by applying an average pooling operation 522 on the second visual feature vectors 516 along a temporal axis, e.g., taking an average of the second visual feature vectors 516.
  • the textual feature vector 512 has 512 feature elements
  • each first visual feature vector 514 has 512 feature elements.
  • the second feature vectors 516 include N second feature vectors each of which has 512 feature elements.
  • the N second feature vectors 516 are combined to generate the video feature vector 518 having 512 feature elements, thus allowing the video feature vector 518 to match and be compared with the textual feature vector 512.
  • the textual feature vector 512 and the video feature vector 518 are compared to determine a video-query similarity level 520.
  • the electronic device determines whether to retrieve the video clip 502 in response to the textual query 504 based on the video-query similarity level 520 of the textual feature vector 512 and the video feature vector 518, e.g., in accordance with a determination that the video-query similarity level satisfies a video retrieval condition.
  • the video-query similarity level 520 includes a cosine similarity value, and the video retrieval condition requires that the cosine similarity value be greater than a similarity threshold.
  • the cosine similarity value is determined based on the textual feature vector 512 and the video feature vector 518. In accordance with a determination that the cosine similarity value is greater than (>) the similarity threshold, the electronic device retrieves the video clip 502 in response to the textual query 504. In accordance with a determination that the cosine similarity value is equal to or less than (≤) the similarity threshold, the electronic device determines that the video clip 502 does not match the textual query 504. Alternatively, in some embodiments, in accordance with a determination that the cosine similarity value is greater than or equal to (≥) the similarity threshold, the electronic device retrieves the video clip 502 in response to the textual query 504. In accordance with a determination that the cosine similarity value is less than (<) the similarity threshold, the electronic device determines that the video clip 502 does not match the textual query 504.
  • the electronic device determines that the sequence of image frames of the video clip 502 includes a total number (M) of image frames. Based on the total number (M), the electronic device selects an image number (i.e., N) from a set of positive integer numbers (e.g., 16, 48). The subset of image frames 515 includes N image frames and is selected from the video clip 502. Further, in some embodiments, the process 500 of identifying the video clip 502 is applied to process a video clip 502 having a length in a certain length range.
  • the electronic device determines that the video clip 502 does not match the textual query 504 due to an improper video length.
  • In some embodiments, the first positive integer number (e.g., 16) is selected as the number (i.e., N) of image frames. The first positive integer number is less than the second positive integer number.
  • In some embodiments, the second positive integer number (e.g., 48) is selected as the number (i.e., N) of image frames. For example, if the video clip 502 has 32 image frames, 16 image frames are selected to generate the video feature vector 518. If the video clip 502 has 50 image frames, 48 image frames are selected to generate the video feature vector 518.
  • the single frame attention block 506 includes a 1D Swin (one-dimensional shifted window transformer) block 524 coupled to the image encoder 508.
  • the 1D Swin block 524 has a sequence of self-attention layers configured to correlate the first visual feature vectors 514 successively using a shifted window scheme to generate the second feature vectors 516.
  • the single frame attention block 506 includes a linear embedding module 526 coupled to the image encoder 508 and the 1D Swin block 524.
  • the linear embedding module 526 converts the first feature vectors 514 from the semantic space to a plurality of intermediate feature vectors 528 and provides the intermediate feature vectors 528 to the self-attention layers of the 1D Swin block 524.
  • the self-attention layers of the 1D Swin block 524 correlate the intermediate feature vectors 528 successively using a shifted window scheme to generate the second feature vectors 516.
  • the first visual feature vectors 514 extracted for the subset of image frames are represented as $X_i^v = F_v(I_i)$, $i \in [1, N]$, where $F_v$ is an encoding operation implemented by a CLIP-based image encoder 508 and $I_i$ is the i-th selected image frame.
  • the dimension of each first visual feature vector 514 is (1, 512) and the video feature vector 518 is the average value of 50 visual feature vectors outputted by the frame attention block 506 (i.e., 50 equals the number of image tokens, 49, plus a classification token (e.g., a [CLS] token)).
  • the dimension (1, 512) is the embedding dimension of $X_i^v$, where $i \in [1, N]$ and N is the number of image frames in the subset of image frames 515.
  • the textual feature vector 512 is extracted by the text encoder 510 (e.g., a CLIP-based text encoder) and is represented as $X^t = F_t(Q)$, where $F_t$ is an encoding operation implemented by a CLIP-based textual encoder and $Q$ is the textual query 504.
  • the dimension of the textual feature vector 512 is (1, 512).
  • the textual feature vector 512 is the output of a mean pooling layer over 77 tokens, which is a fixed length of an input sentence in the textual query 504.
  • the dimension (1, 512) is the embedding dimension of each textual token.
  • Dimensions of the video clip 502 and the textual query 504 for each batch are (B, N, 512) and (B, 1, 512), respectively, where B is the batch size.
  • FIG. 6 is a temporal diagram of a shifted window scheme 600 having two local attention windows 602 and 604 that are applied alternatingly to a sequence of self-attention layers 610 of a 1D Swin block 524, in accordance with some embodiments.
  • a camera 260 of an electronic device captures a video clip 502 including a plurality of image frames according to a frame rate (e.g., 30 frames per second (FPS)) within a duration of time T.
  • the electronic device selects a subset of image frames 515 from the video clip 502, e.g., substantially uniformly.
  • An image encoder 508 generates a plurality of first visual feature vectors 514 corresponding to the subset of image frames 515.
  • the plurality of first visual feature vectors 514 are successively processed by a sequence of self-attention layers 610 using a plurality of local attention windows.
  • the first visual feature vectors 514 are converted by a first self-attention layer 610-1 to a plurality of first layer feature vectors 606 using a first local attention window 602.
  • the first layer feature vectors 606 are converted by a second self-attention layer 610-2 to a plurality of second layer feature vectors 608 using a second local attention window 604.
  • Each following odd-numbered self-attention layer 610 (e.g., layer 610-3) converts a plurality of input layer feature vectors to a plurality of output layer feature vectors using the first local attention window 602, and each following even-numbered self-attention layer 610 converts a plurality of input layer feature vectors to a plurality of output layer feature vectors using the second local attention window 604.
  • the first local attention window 602 is centered at a corresponding feature vector, and the second local attention window 604 is started at the corresponding feature vector.
  • the first local attention window 602 is applied and includes two first visual feature vectors preceding a centering first visual feature vector 514, and two first visual feature vectors following the centering first visual feature vector 514.
  • a first visual feature vector 514-3 corresponds to the first local attention window 602-1, and first visual feature vectors 514-1 to 514-5 are combined in a weighted sum to generate a corresponding first layer feature vector 606-3.
  • a first visual feature vector 514-13 corresponds to the first local attention window 602-2, and first visual feature vectors 514-11 to 514-15 are combined in a weighted sum to generate a corresponding first layer feature vector 606-13.
  • the first local attention window 602 of the first visual feature vector 514-1 includes two first visual feature vectors 514-(N-1) and 514-N, and first visual feature vectors 514-1 to 514-3, 514-(N-1) and 514-N are combined in a weighted sum to generate a corresponding first layer feature vector 606-1.
  • the first local attention window 602 of the first visual feature vector 514-2 includes the first visual feature vector 514-N, and first visual feature vectors 514-1 to 514-4 and 514-N are combined in a weighted sum to generate a corresponding first layer feature vector 606-2.
  • the first local attention window 602 of the first visual feature vector 514-N includes the first visual feature vectors 514-1 and 514-2, and first visual feature vectors 514-1, 514-2, 514-(N-2), 514-(N-1) and 514-N are combined in a weighted sum to generate a corresponding first layer feature vector 606-N.
  • the first local attention window 602 of the first visual feature vector 514-(N-1) includes the first visual feature vector 514-1, and first visual feature vectors 514-(N-3) to 514-N and 514-1 are combined in a weighted sum to generate a corresponding first layer feature vector 606-(N-1).
  • the second local attention window 604 is applied and includes a starting first layer feature vector 606 and four first layer feature vectors following the starting first layer feature vector 606.
  • a first layer feature vector 606-3 corresponds to the second local attention window 604-1, and first layer feature vectors 606-3 to 606-7 are combined in a weighted sum to generate a corresponding second layer feature vector 608-3.
  • a first layer feature vector 606-13 corresponds to the second local attention window 604-2, and first layer feature vectors 606-13 to 606-17 are combined in a weighted sum to generate a corresponding second layer feature vector 608-13.
  • the second local attention window 604 of the first layer feature vector 606-N includes the first layer feature vectors 606-1 to 606-4, and first layer feature vectors 606-1 to 606-4 and 606-N are combined in a weighted sum to generate a corresponding second layer feature vector 608-N.
  • the second local attention window 604 of the first layer feature vector 606-(N-1) includes the first layer feature vectors 606-1 to 606-3, and first layer feature vectors 606-(N-1), 606-N, and 606-1 to 606-3 are combined in a weighted sum to generate a corresponding second layer feature vector 608-(N-1).
  • Each of the first and second local attention windows 602 and 604 includes a predefined number of feature vectors, e.g., 5 first visual feature vectors 514, 5 first layer feature vectors 606.
  • the predefined number corresponds to a self-attention length, and is not limited to 5. The larger the predefined number, the longer the self-attention length. Other examples of the predefined number include, but are not limited to, 6, 7, 8, 9, and 10.
  • the first local attention window 602 includes an even number (e.g., 8) of feature vectors
  • the first local attention window 602 of each feature vector includes fewer feature vectors preceding the feature vector of interest than those feature vectors following the feature vector of interest.
  • the first local attention window 602 of the first visual feature vector 514-13 includes 3 first visual feature vectors 514-10 to 514-12, the first visual feature vector 514-13, and 4 first visual feature vectors 514-14 to 514-17.
  • the sequence of self-attention layers 610 includes a number of self-attention layers 610 (e.g., 6 self-attention layers 610).
  • the sequence of self-attention layers 610 includes a last self-attention layer.
  • the last self-attention layer generates a plurality of last layer feature vectors based on the first or second local attention window 602 or 604.
  • the last layer feature vectors form the second visual feature vectors 516 outputted by the single frame attention block 506.
  • the first and second local attention windows 602 and 604 are alternatingly applied to the sequence of self-attention layers 610.
  • Each following odd- numbered self-attention layer 610 (e.g., layer 610-3) converts a plurality of input layer feature vectors to a plurality of output layer feature vectors using the first local attention window 602, and each following even-numbered self-attention layer 610 converts a plurality of input layer feature vectors to a plurality of output layer feature vectors using the second local attention window 604.
  • the second and first local attention windows 604 and 602 are alternatingly applied to the sequence of self-attention layers 610.
  • Each odd-numbered self-attention layer 610 (e.g., layers 610-1 and 610-3) converts a plurality of input layer feature vectors to a plurality of output layer feature vectors using the second local attention window 604, and each even-numbered self-attention layer 610 (e.g., layer 610-2) converts a plurality of input layer feature vectors to a plurality of output layer feature vectors using the first local attention window 602.
  • the shifted window scheme 600 includes a sequence of three or more local attention windows (e.g., three local attention windows).
  • the sequence of three or more local attention windows can be successively and alternatingly applied to the sequence of self-attention layers 610. For example, when three local attention windows are applied to nine self-attention layers 610, a first local attention window is applied to the first, fourth, and seventh self-attention layers 610; a second local attention window is applied to the second, fifth, and eighth self-attention layers 610; and a third local attention window is applied to the third, sixth, and ninth self-attention layers 610.
  • each of the first, the second, or the third local attention window includes six feature vectors
  • the second local attention window is shifted by two feature vectors with respect to the first local attention window
  • the third local attention window is further shifted by two feature vectors with respect to the second local attention window.
  • each local attention window 602 or 604 reduces a computational load and releases correlation information with fewer dependent elements.
  • each local attention window 602 or 604 includes L (e.g., 5) feature vectors.
  • partitions are evenly distributed along a time axis based on the first local attention window 602.
  • Local self-attention is computed within each window 602, with no interaction between features across different windows 602. Then, all windows are shifted to the right, rolling over to the beginning of the video clip 502.
  • the distance of the window shift is equal to L/2 (e.g., 2) and the essential computation of the shifted window partitioning of consecutive 1D Swin blocks 524 is presented as:

    $\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}$
    $z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}$
    $\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}$
    $z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$

    where MLP is a multilayer perceptron module, (S)W-MSA is a (shifted) window based multi-head self-attention module, and LN denotes a LayerNorm layer.
  • the $\hat{z}^{l}$ and $z^{l}$ represent the output features of the (shifted) window based multi-head self-attention module and the MLP module of the self-attention layer 610 (layer l), respectively.
  • FIG. 7 is a flow diagram of an exemplary process 700 of identifying a video clip 502 in response to a textual query 504 using a sequence of frame attention blocks 506, in accordance with some embodiments.
  • An electronic device includes an image encoder 508 and a text encoder 510.
  • the text encoder 510 generates a textual feature vector 512 from the textual query 504.
  • the image encoder 508 generates a first visual feature vector 514 for each of a subset of image frames 515 of an input video clip 502.
  • the sequence of frame attention blocks 506 applies a plurality of shifted window schemes configured to correlate the first visual feature vectors 514 successively to generate a plurality of second visual feature vectors 516, which are combined to generate a video feature vector 518.
  • the textual feature vector 512 and the video feature vector 518 are compared to determine a video-query similarity level 520.
  • the electronic device determines whether to retrieve the video clip 502 in response to the textual query based on the video-query similarity level 520 of the textual feature vector 512 and the video feature vector 518, e.g., in accordance with a determination that the video-query similarity level 520 satisfies a video retrieval condition.
  • the electronic device includes a patch partition module 702 configured to adjust the size of each of the first visual feature vectors 514.
  • the first frame attention block 506A is coupled to the patch partition module 702 and includes a linear embedding module 526A configured to convert the first feature vectors 514 from a semantic space to a plurality of intermediate feature vectors 528A in a different space while keeping the size of the first feature vectors 514.
  • Each of the remaining frame attention blocks 506B-506D includes a patch merging module 526 configured to reduce the number of feature vectors by a scaling factor (e.g., 2) and increase the number of feature elements of each feature vector by the scaling factor.
  • the first visual feature vectors 514 include N feature vectors, each of which has 64 feature elements.
  • the first frame attention block 506A generates N block feature vectors (704A) each of which has 64 feature elements.
  • the remaining frame attention blocks 506B-506D respectively generate N/2 block feature vectors (704B) each of which has 128 feature elements, N/4 block feature vectors (704C) each of which has 256 feature elements, and N/8 block feature vectors (704D) each of which has 512 feature elements.
  • the second visual feature vectors 516 are the N/8 block feature vectors 704D, each of which has 512 feature elements.
  • the patch partition module 702 adjusts the size of each of the first visual feature vectors 514 to an adjusted number.
  • the textual feature vector 512 includes a first number of textual feature elements.
  • the sequence of frame attention blocks 506 includes a certain number of stages determined based on the adjusted number and the first number. For example, the patch partition module 702 adjusts each first visual feature vector 514 to 32 feature elements, and the textual feature vector 512 includes 512 textual feature elements.
  • the sequence of frame attention blocks 506 can have 5 stages to generate the second visual feature vectors 516 that match the textual feature vector 512 in length.
  • the first frame attention block 506A is consistent with the remaining frame attention blocks 506B-506D.
  • Each frame attention block 506 includes a patch merging module 526 configured to reduce the feature vectors in number by a respective scaling factor (e.g., 2) and increase the feature elements of each feature vector in number by the respective scaling factor. Stated another way, the patch merging module 526 reduces the resolution (i.e., the input feature length) seen by the self-attention layers of each frame attention block 506. A non-limiting sketch of one possible patch merging stage is provided below, following the items on the frame attention blocks.
  • the first visual feature vectors 514 include N feature vectors each of which has 32 feature elements.
  • the first frame attention block 506A generates N/2 block feature vectors 704 each of which has 64 feature elements.
  • the remaining frame attention blocks 506B-506D respectively generate N/4 block feature vectors 704 each of which has 128 feature elements, N/8 block feature vectors 704 each of which has 256 feature elements, and N/16 block feature vectors 704 each of which has 512 feature elements.
  • the second visual feature vectors 516 are the N/16 block feature vectors 704, each of which has 512 feature elements.
  • each frame attention block 506 merges a plurality of input visual feature vectors into a plurality of intermediate feature vectors 528 having a reduced number of feature vectors and an increased number of feature elements of each feature vector.
  • the input visual feature vectors include the first visual feature vectors 514 that are optionally processed by the patch partition module 702.
  • Each frame attention block 506 includes a sequence of self-attention layers 610 coupled to the patch merging module 526.
  • Each self-attention layer 610 combines a subset of former layer feature vectors within a respective local attention window of each former layer feature vector to generate a respective current layer feature vector.
  • the former layer feature vectors of a first self-attention layer 610-1 include the intermediate feature vectors.
  • the last self-attention layer 610 of each frame attention block 506A, 506B, 506C, or 506D generates the block feature vectors 704A, 704B, 704C, or 704D, respectively, and the current layer feature vectors of the last self-attention layer of a last frame attention block (e.g., block 506D) include the second visual feature vectors 516.
  • the block feature vectors 704A, 704B, or 704C are inputted to the frame attention block 506B, 506C or 506D, respectively.
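A minimal sketch of one possible patch merging stage is shown below. It assumes PyTorch; the class name, the linear projection applied after concatenating neighboring feature vectors, and the example shapes are assumptions. The sketch only illustrates the behavior described above for the patch merging modules 526: halving the number of feature vectors and doubling the number of feature elements of each feature vector.

```python
# Illustrative sketch only; the linear projection and shapes are assumptions.
import torch
import torch.nn as nn


class PatchMerging1D(nn.Module):
    """Reduces the number of feature vectors by `scale` and increases the
    number of feature elements of each feature vector by `scale`."""

    def __init__(self, dim: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.norm = nn.LayerNorm(scale * dim)
        self.proj = nn.Linear(scale * dim, scale * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, dim) -> (batch, n // scale, scale * dim)
        b, n, d = x.shape
        merged = x.reshape(b, n // self.scale, self.scale * d)  # concatenate neighbors
        return self.proj(self.norm(merged))


# Example hierarchy matching the N -> N/2 -> N/4 -> N/8 and 64 -> 128 -> 256 -> 512
# progression described above (N = 32 here, purely for illustration).
x = torch.randn(1, 32, 64)
for stage_dim in (64, 128, 256):
    x = PatchMerging1D(dim=stage_dim)(x)
print(x.shape)  # torch.Size([1, 4, 512])
```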
  • Each frame attention block 506 corresponds to a respective shifted window scheme (e.g., scheme 600 in Figure 6).
  • the shifted window schemes are identical to each other and, for example, each uses a first local attention window 602 centered at a corresponding feature vector and a second local attention window 604 starting at the corresponding feature vector. Both the first and second local attention windows 602 and 604 include the same number of feature vectors.
  • the first and second local attention windows 602 and 604 are applied alternatingly to successive self-attention layers 610. More details are explained above with reference to Figure 6.
  • alternatively, any two of the shifted window schemes are distinct from or identical to each other.
  • Each shifted window scheme of a respective frame attention block 506 includes a respective number of local attention windows that are alternatingly applied to the sequence of self-attention layers in the respective frame attention block 506.
  • the local attention windows of each frame attention block 506 (e.g., the first and second local attention windows 602 and 604) have the same number of feature vectors while being shifted by different numbers of feature vectors.
  • alternatively, the local attention windows of each frame attention block 506 have different numbers of feature vectors and are shifted by different numbers of feature vectors.
  • the video feature vector 518 ($X^{v}$) is represented as follows:

$$X^{v} = \operatorname{AvgPool}\big(\big[x^{v}_{0},\, x^{v}_{1},\, \ldots,\, x^{v}_{j}\big]\big),$$

where $j \in [0, N/8]$ indexes the second visual feature vectors 516 ($x^{v}_{j}$) processed by the 1D Swin block 524D, N/8 is the length of the second visual feature vectors 516, AvgPool() denotes an average pooling operation along the temporal axis, and $X^{v}$ is the final video feature vector 518.
  • the text encoder 510 optionally includes a CLIP text encoder and encodes the corresponding textual query 504 into the same semantic space.
  • the textual feature vector 512 is extracted as $X^{t}$.
  • a loss is computed based upon the video feature vector 518 ($X^{v}$) and the textual feature vector 512 ($X^{t}$). A non-limiting sketch of the pooling and similarity computation is provided immediately below.
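The following sketch illustrates, under stated assumptions, how the second visual feature vectors 516 can be average pooled along the temporal axis into the video feature vector 518 and compared with the textual feature vector 512 by cosine similarity. Function names and shapes are illustrative, not part of this disclosure.

```python
# Illustrative sketch only; function names and shapes are assumptions.
import torch
import torch.nn.functional as F


def video_feature(second_visual: torch.Tensor) -> torch.Tensor:
    # second_visual: (num_vectors, dim), e.g. N/8 block feature vectors of 512
    # elements; averaging along the temporal axis yields X^v of shape (dim,).
    return second_visual.mean(dim=0)


def video_query_similarity(x_v: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between the video feature vector X^v and the textual
    # feature vector X^t produced by the text encoder.
    return F.cosine_similarity(x_v, x_t, dim=0)


# Usage with illustrative shapes (N/8 = 6 vectors of 512 elements).
x_v = video_feature(torch.randn(6, 512))
x_t = torch.randn(512)
score = video_query_similarity(x_v, x_t)  # scalar video-query similarity level
```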
  • FIG. 8 is a flow diagram of an exemplary process 800 of training a sequence of frame attention blocks 506 for text-to-video extraction, in accordance with some embodiments.
  • a plurality of video-text training pairs 880 can be provided to train a neural network model applied to implement exemplary processes of identifying a video clip 502 in response to a textual query 504 using one or more frame attention blocks 506.
  • Each video-text training pair 880 includes a textual query 804 and a video clip 802 that matches the textual query 804.
  • the text encoder 510 generates a textual feature vector 812 from the textual query 804.
  • the image encoder 508 generates a first visual feature vector 814 for each of a subset of image frames 815 of the video clip 802.
  • the sequence of frame attention blocks 506 applies a plurality of shifted window schemes configured to correlate the first visual feature vectors 814 successively to generate a plurality of second visual feature vectors 816, which are combined to generate a video feature vector 818.
  • the textual feature vector 812 and the video feature vector 818 are compared to determine a loss (e.g., a symmetric cross-entropy loss).
  • the neural network model applied to implement the above-described processes 500 and 700 is trained end-to-end to minimize the loss.
  • the image encoder 508 and the text encoder 510 belong to a CLIP network and are pre-trained using a symmetric cross-entropy loss.
  • the neural network model applied to implement the processes 500 and 700 includes the patch partition module 702, a linear embedding module 526A, and 1D Swin block(s) of the one or more frame attention blocks 506, which are trained jointly and end-to-end based on the loss 820.
  • An example of the loss 820 includes a symmetric cross-entropy loss, which is represented over a batch of video-text training pairs 880 as follows:

$$\mathcal{L} = -\frac{1}{2B}\sum_{i=1}^{B}\left(\log\frac{\exp\!\left(S_{i,i}/T\right)}{\sum_{j=1}^{B}\exp\!\left(S_{i,j}/T\right)} + \log\frac{\exp\!\left(S_{i,i}/T\right)}{\sum_{j=1}^{B}\exp\!\left(S_{j,i}/T\right)}\right),$$

where B is the batch size, T is a temperature hyper-parameter, and S is a similarity matrix of the video-text training pairs 880. A non-limiting sketch of such a loss is provided below.
  • the similarity matrix S of a (video, text) training pair is represented as:

$$S_{i,j} = \frac{X^{v}_{i} \cdot X^{t}_{j}}{\left\|X^{v}_{i}\right\|\,\left\|X^{t}_{j}\right\|},$$

where $X^{v}_{i}$ is the video feature vector 818 of the $i$-th video clip 802 in the batch and $X^{t}_{j}$ is the textual feature vector 812 of the $j$-th textual query 804 in the batch.
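Below is a minimal sketch of a symmetric cross-entropy loss of the kind described above. It assumes PyTorch, a batch of B video-text training pairs whose feature vectors have already been extracted, and a cosine-similarity matrix scaled by a temperature; the temperature value and the function name are assumptions.

```python
# Illustrative sketch only; the temperature value and names are assumptions.
import torch
import torch.nn.functional as F


def symmetric_cross_entropy(x_v: torch.Tensor, x_t: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    # x_v, x_t: (B, dim) video and text feature vectors of the training pairs.
    x_v = F.normalize(x_v, dim=-1)
    x_t = F.normalize(x_t, dim=-1)
    s = x_v @ x_t.t() / temperature            # similarity matrix S / T, shape (B, B)
    labels = torch.arange(s.size(0), device=s.device)
    loss_v2t = F.cross_entropy(s, labels)      # match each video to its own text
    loss_t2v = F.cross_entropy(s.t(), labels)  # match each text to its own video
    return 0.5 * (loss_v2t + loss_t2v)


# Usage with an illustrative batch of B = 8 pairs and 512-element features.
x_v = torch.randn(8, 512, requires_grad=True)
x_t = torch.randn(8, 512, requires_grad=True)
loss = symmetric_cross_entropy(x_v, x_t)
loss.backward()  # gradients flow end-to-end into the trainable modules
```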
  • Figure 9 is a flowchart of an exemplary video extraction method 900, in accordance with some embodiments.
  • the method 900 is described as being implemented by an electronic device (e.g., a mobile device 104C, an HMD 104D).
  • Method 900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic system.
  • Each of the operations shown in Figure 9 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the electronic system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 900 may be combined and/or the order of some operations may be changed.
  • the electronic device obtains (902) a textual query 504 and generates (904) a textual feature vector 512 in a semantic space.
  • the electronic device obtains (906) a video clip 502 including a sequence of image frames and generates (908) a plurality of first visual feature vectors from a subset of image frames of the video clip 502. Each visual feature vector corresponds to a respective image frame.
  • the electronic device iteratively correlates (910) the first visual feature vectors 514 based on at least one shifted window scheme to generate a plurality of second visual feature vectors 516.
  • a video feature vector 518 is generated (912) from the second visual feature vectors 516.
  • the video feature vector 518 is generated by applying an average pooling operation on the second visual feature vectors 516 along a temporal axis.
  • the electronic device retrieves (914) the video clip 502 in response to the textual query 504 based on a video-query similarity level 520 of the textual feature vector 512 and the video feature vector 518.
  • the electronic device determines that the sequence of image frames includes a total number of image frames, selects an image number from a set of positive integer numbers (e.g., including 16 and 48) based on the total number, and selects, from the video clip 502, the subset of image frames 515 including the image number (N) of frames.
  • the subset of image frames 515 is selected uniformly from the video clip 502.
  • the electronic device selects the first positive integer number as the number of image frames.
  • the first positive integer number is less than the second positive integer number.
  • the electronic device selects the second positive integer number as the number of image frames. For instance, when the video clip 502 has 16-47 image frames, 16 image frames are selected from the video clip 502 substantially uniformly, and when the video clip 502 has more than 48 image frames, 48 image frames are uniformly selected from the video clip 502.
  • the electronic device determines that the video clip 502 does not match the textual query 504 due to an improper video length.
  • the textual feature vector 512 includes a first number of textual feature elements.
  • the video feature vector 518 includes a second number of video feature elements, and the first number is equal to the second number.
  • Each of the first visual feature vectors 514 includes a third number of visual feature elements, and the third number is equal to or less than the second number.
  • Each of the second visual feature vectors 516 includes a fourth number of visual feature elements. The fourth number is equal to the second number.
  • the at least one shifted window scheme includes (916) a plurality of shifted window schemes configured to correlate the first visual feature vectors 514 to produce the second visual feature vectors 516 successively via a plurality of frame attention blocks 506 (e.g., 6 frame attention blocks).
  • each frame attention block 506 corresponds (918) to a respective shifted window scheme and includes a patch merging module configured to reduce the number of feature vectors by a scaling factor (e.g., 2) and increase the number of feature elements of each feature vector by the scaling factor.
  • the electronic device patch merges a plurality of input visual feature vectors into a plurality of intermediate feature vectors having the reduced number of feature vectors and the increased number of feature elements of each feature vector.
  • the input visual feature vectors include the first visual feature vectors 514
  • each frame attention block 506 includes a sequence of self-attention layers 610 coupled to the patch merging module 526.
  • a subset of former layer feature vectors can be combined within a respective local attention window of each former layer feature vector to generate a respective current layer feature vector.
  • the former layer feature vectors of a first self-attention layer 610-1 include the intermediate feature vectors
  • the current layer feature vectors of a last self-attention layer of a last frame attention block 506D include the second visual feature vectors 516.
  • each frame attention block 506 corresponds (920) to a respective shifted window scheme that includes a first shifted window scheme 600 having a first local attention window 602 centered at a corresponding feature vector and a second local attention window 604 started at the corresponding feature vector.
  • the first and second local attention windows 602 and 604 are configured to be applied alternatingly to successive self-attention layers 610.
  • each frame attention block 506 corresponds to a respective shifted window scheme that includes a first shifted window scheme having a first local attention window 602 started at a corresponding feature vector and a second local attention window 604 centered at the corresponding feature vector.
  • the first and second local attention windows 602 and 604 are configured to be applied alternatingly to successive self-attention layers 610.
  • the at least one shifted window scheme includes (922) a first shifted window scheme having a first local attention window 602 centered at a corresponding feature vector and a second local attention window 604 started at the corresponding feature vector, and the first and second local attention windows 602 and 604 are configured to be applied alternatingly to a sequence of self-attention layers 610. Further, in some embodiments, for a first layer of the sequence of self-attention layers 610, the electronic device combines a subset of visual feature vectors within the first local attention window 602 of each of the first visual feature vectors 514 to generate a plurality of respective first layer feature vectors 606.
  • the electronic device For a second layer coupled to the first layer, the electronic device combines a subset of first layer feature vectors within the second local attention window 604 of each first layer feature vector 606 to generate a plurality of respective second layer feature vectors 608.
  • the sequence of self-attention layers 610 includes a last self-attention layer configured to generate a plurality of last layer feature vectors forming the second visual feature vectors 516.
  • the sequence of self-attention layers includes 6 layers, and each of the first and second local attention windows covers 5 frames.
  • for the first self-attention layer 610-1, the visual feature vectors 514 of the subset of image frames covered by the first local attention window 602 are combined in a respective weighted sum to generate a plurality of respective first layer feature vectors 606.
  • the first layer feature vectors 606 within the second local attention window 604 are combined in a respective weighted sum to generate a plurality of respective second layer feature vectors 608.
  • the video clip 502 is retrieved by comparing the textual feature vector 512 and the video feature vector 518 to determine the video-query similarity level 520 (e.g., a cosine similarity level) and in accordance with a determination that the video-query similarity level 520 satisfies a video retrieval condition, retrieving the video clip 502 in response to the textual query 504.
  • the video-query similarity level 520 includes a cosine similarity value, and the video retrieval condition requires that the cosine similarity value be greater than a similarity threshold.
  • the electronic device determines the cosine similarity value based on the textual feature vector 512 and the video feature vector 518.
  • in accordance with a determination that the cosine similarity value is greater than the similarity threshold, the video clip 502 is retrieved in response to the textual query 504. In accordance with a determination that the cosine similarity value is equal to or less than the similarity threshold, it is determined that the video clip 502 does not match the textual query 504. A non-limiting sketch of the frame selection and of this retrieval decision is provided at the end of this list.
  • a contrastive language-image pre-training (CLIP) network includes a text encoder 510 configured to generate the textual feature vector 512 from the textual query 504 and an image encoder 508 configured to generate each of the first visual feature vectors 514 based on a respective image frame.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.
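The sketch below illustrates the frame selection described above (16 or 48 substantially uniformly spaced frames, following the example values) and the retrieval decision based on a cosine-similarity threshold. The threshold value, function names, and the error raised for an improper video length are assumptions, not requirements of this disclosure.

```python
# Illustrative sketch only; threshold, names, and error handling are assumptions.
import torch


def select_frames(total_frames: int, small: int = 16, large: int = 48) -> torch.Tensor:
    """Uniformly pick indices of the subset of image frames from a video clip."""
    if total_frames < small:
        # The clip is treated as having an improper video length.
        raise ValueError("video clip too short for retrieval")
    count = small if total_frames < large else large
    # linspace spreads `count` indices substantially uniformly over the clip.
    return torch.linspace(0, total_frames - 1, steps=count).long()


def should_retrieve(similarity: float, threshold: float = 0.3) -> bool:
    """Retrieve the video clip only when the video-query similarity level
    satisfies the video retrieval condition (cosine similarity > threshold)."""
    return similarity > threshold


# Usage: a clip of 120 frames yields 48 uniformly spaced frame indices.
indices = select_frames(120)
print(indices.shape, should_retrieve(0.42))  # torch.Size([48]) True
```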


Abstract

The present application relates to retrieving video content in response to a textual query. An electronic device obtains the textual query and generates a textual feature vector in a semantic space. The electronic device obtains a video clip including a sequence of image frames and generates a plurality of first visual feature vectors from a subset of image frames of the video clip. Each visual feature vector corresponds to a respective image frame. The plurality of first visual feature vectors is iteratively correlated based on at least one shifted window scheme to generate a plurality of second visual feature vectors, and a video feature vector is generated from the plurality of second visual feature vectors. The video clip is retrieved in response to the textual query based on a video-query similarity level of the textual feature vector and the video feature vector.
PCT/US2022/035244 2022-06-28 2022-06-28 Récupération de texte à vidéo à l'aide de fenêtres à auto-attention décalées WO2024005784A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2022/035244 WO2024005784A1 (fr) 2022-06-28 2022-06-28 Récupération de texte à vidéo à l'aide de fenêtres à auto-attention décalées


Publications (1)

Publication Number Publication Date
WO2024005784A1 true WO2024005784A1 (fr) 2024-01-04

Family

ID=89380936

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/035244 WO2024005784A1 (fr) 2022-06-28 2022-06-28 Récupération de texte à vidéo à l'aide de fenêtres à auto-attention décalées

Country Status (1)

Country Link
WO (1) WO2024005784A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070041667A1 (en) * 2000-09-14 2007-02-22 Cox Ingemar J Using features extracted from an audio and/or video work to obtain information about the work
US20070255755A1 (en) * 2006-05-01 2007-11-01 Yahoo! Inc. Video search engine using joint categorization of video clips and queries based on multiple modalities
US20090157697A1 (en) * 2004-06-07 2009-06-18 Sling Media Inc. Systems and methods for creating variable length clips from a media stream
US20120121194A1 (en) * 2010-11-11 2012-05-17 Google Inc. Vector transformation for indexing, similarity search and classification
US8874584B1 (en) * 2010-02-24 2014-10-28 Hrl Laboratories, Llc Hierarchical video search and recognition system



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22949618

Country of ref document: EP

Kind code of ref document: A1