WO2021184026A1 - Audio-visual fusion with cross-modal attention for video action recognition

Audio-visual fusion with cross-modal attention for video action recognition

Info

Publication number
WO2021184026A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
visual
features
fused
self
Prior art date
Application number
PCT/US2021/026444
Other languages
English (en)
Inventor
Jenhao Hsiao
Jiawei Chen
Original Assignee
Innopeak Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Innopeak Technology, Inc.
Priority to PCT/US2021/026444
Publication of WO2021184026A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components, by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters, with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]

Definitions

  • This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for identifying one or more actions in video content.
  • Personal photo albums and online content sharing platforms contain a large amount of multimedia content items that are often associated with content labels describing one or more actions in the content items.
  • The content items can be classified, retrieved, searched, sorted, or recommended efficiently using these content labels, thereby facilitating understanding and organization of video content in many applications and making the video content more accessible to users.
  • Despite the importance of content labels, current practice of detecting actions in video content for content label generation is not efficient. Specifically, many existing methods utilize only visual data and class labels during both training and inference stages of action detection.
  • Recent improvement of label accuracy in video action recognition relies on the introduction of three-dimensional convolutional neural networks (3D CNNs), which extend conventional two-dimensional models to the spatio-temporal domain and are applied in an end-to-end manner.
  • The 3D CNN is trained using single video clips, and gradients are updated based on each video clip.
  • Irrelevant video segments can lead to a gradient in the wrong direction and negatively impact performance of the 3D CNN model.
  • Such clip-level prediction limits the improvement of the accuracy of action detection and label creation that has been made available by the 3D CNN model. It would be beneficial to have a more efficient video action recognition mechanism than the current practice.
  • Various implementations of this application include a multimodal data processing method for determining video actions contained in video content including visual content (e.g., image frames) and audio content.
  • This multimodal data processing method utilizes deep learning models to determine (e.g., identify, predict, output) actions present in the video content.
  • The deep learning models rely on supervised training in which annotated video and audio segments are used as inputs, and include cross-modal audio-visual fusion layers that improve fusion of information obtained from different modalities (e.g., visual information and audio information from video content).
  • Short-range and long-range video segments can be linked by a bi-directional layer during training, allowing the deep learning models to achieve better accuracy for video action recognition compared to unimodal models and other multimodal models that use conventional fusion methods, e.g., concatenation of outputs from unimodal models without cross-modal fusion.
  • A method is implemented at a computer system for labeling video content.
  • The method includes obtaining video content that includes visual content and audio content. The visual content includes a plurality of visual segments, and the audio content includes a plurality of audio segments.
  • The method also includes generating a plurality of self-attended visual features for the visual segments of the visual content and generating a plurality of self-attended audio features for the audio segments of the audio content.
  • The method further includes fusing the self-attended visual features of the visual segments with the self-attended audio features of the audio segments to generate a plurality of fused visual features, and fusing the self-attended audio features of the audio segments with the self-attended visual features of the visual segments to generate a plurality of fused audio features.
  • The method also includes combining the fused visual features and the fused audio features to generate a cross-modal visual-audio feature based on a respective weight associated with each of the fused visual and audio features.
  • The method further includes determining a video-level content label based on the cross-modal visual-audio feature.
  • Some implementations include a computer system that includes one or more processors and memory having instructions stored thereon, which, when executed by the one or more processors, cause the processors to perform any of the above methods.
  • Some implementations include a non-transitory computer-readable medium having instructions stored thereon, which, when executed by one or more processors, cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network based (NN-based) data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments.
  • Figure 4B is an example node in the neural network (NN), in accordance with some embodiments.
  • Figure 5A illustrates a video content item including visual and audio content, in accordance with some embodiments.
  • Figure 5B illustrates a process of determining video action based on visual content and audio content of a video content item, in accordance with some embodiments.
  • Figure 6 is a flowchart of a method for labeling video content, in accordance with some embodiments.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • The collected data or user inputs can be processed locally (e.g., for training and/or for prediction) at the client device 104 and/or remotely by the server(s) 102.
  • The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • The data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • Storage 106 may store video content (including visual and audio content) for training a machine learning model (e.g., a deep learning network) and/or video content obtained by a user to which a trained machine learning model can be applied to determine one or more actions associated with the video content.
  • The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be, or are preferably not, completed locally by the client devices 104.
  • In some embodiments, the client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • In some embodiments, the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and shares information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera remotely and in real time.
  • The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • The one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • Data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C).
  • The client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • The client device 104C obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally.
  • In some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A).
  • The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models.
  • The client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application).
  • The client device 104A itself implements no or little data processing on the content data prior to sending the content data to the server 102A.
  • In some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B.
  • The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models.
  • The trained data processing models are optionally stored in the server 102B or storage 106.
  • The client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.
  • FIG. 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • The data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof.
  • The data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • The data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • In some embodiments, the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on electronic devices.
  • The data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • In some embodiments, the client device 104 includes a location detection device, such as a GPS (Global Positioning System) receiver or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices, and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices.
  • Memory 206, optionally, includes one or more storage devices remotely located from the one or more processing units 202.
  • Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium.
  • In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video data, visual data, audio data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
    o Training data 238 for training one or more data processing models 240;
    o Data processing model(s) 240 for processing content data (e.g., video data, visual data, audio data) using deep learning techniques; and
    o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102.
  • In some embodiments, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • In some embodiments, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • Optionally, memory 206 stores a subset of the modules and data structures identified above.
  • Optionally, memory 206 stores additional modules and data structures not described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video data, visual data, audio data), in accordance with some embodiments.
  • The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • In some embodiments, both the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • In these embodiments, the training data source 304 is optionally a server 102 or storage 106.
  • In some embodiments, both the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • In these embodiments, the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • In some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and a client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • The data processing model 240 is trained according to a type of the content data to be processed.
  • The training data 306 is consistent with the type of the content data, and a data pre-processing module 308 consistent with the type of the content data is applied to process the training data 306.
  • An image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image and crop each training image to a predefined image size.
  • An audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., convert each training sequence to the frequency domain using a Fourier transform.
  • The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • The loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • The modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data is not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data is partially labelled.
  • The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video data, visual data (e.g., image data), audio data, textual data, and other types of data.
  • For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert it to the frequency domain using a Fourier transform.
  • In some embodiments, the content data includes two or more types, e.g., video data and audio data.
  • The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • The processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments.
  • Figure 4B is an example node 420 in the neural network 400, in accordance with some embodiments.
  • The data processing model 240 is established based on the neural network 400.
  • A corresponding model-based processing module 316 applies the data processing model 240, including the neural network 400, to process content data that has been converted to a predefined content format.
  • The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • A weight w associated with each link 412 is applied to the node output.
  • The one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • The propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
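  • A minimal sketch of such a node computation, reading the propagation function as a non-linear activation applied to the weighted combination of the node inputs (the ReLU choice, the bias value, and the example numbers are assumptions of this sketch, not taken from the figures):

```python
import torch

def node_output(inputs: torch.Tensor, weights: torch.Tensor, bias: float = 0.0) -> torch.Tensor:
    # Linear weighted combination of the node inputs (w1*x1 + w2*x2 + ...), plus a bias term b,
    # passed through a non-linear activation (ReLU chosen here as one possible activation).
    z = torch.dot(weights, inputs) + bias
    return torch.relu(z)

# Example: a node 420 with four inputs combined using weights w1..w4.
x = torch.tensor([0.5, -1.0, 2.0, 0.1])
w = torch.tensor([0.2, 0.4, -0.3, 0.8])
print(node_output(x, w, bias=0.1))
```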
  • The collection of nodes 420 is organized into one or more layers in the neural network 400.
  • In some embodiments, the one or more layers include a single layer acting as both an input layer and an output layer.
  • In some embodiments, the one or more layers include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer.
  • In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for downsampling or pooling the nodes 420 between these two layers.
  • Max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data).
  • The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolutional layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • In some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, visual data and audio data).
  • Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • In an example, each node 420 of the RNN has a time-varying real-valued activation.
  • Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor.
  • In some embodiments, the RNN can be used for handwriting recognition.
  • The training process is a process for calibrating all of the weights w_i for each layer of the learning model using a training data set that is provided to the input layer 402.
  • The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • In forward propagation, the set of weights for different layers is applied to the input data and intermediate results from the previous layers.
  • In backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • The activation function is optionally linear, a rectified linear unit, a sigmoid, a hyperbolic tangent, or of another type.
  • In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • The result of the training includes the network bias parameter b for each layer.
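  • A minimal sketch of this forward/backward training loop on a toy fully connected network (the layer sizes, learning rate, iteration count, and random data are placeholders, not values from this disclosure):

```python
import torch
from torch import nn

# Toy network: input layer -> one hidden layer -> output layer (sizes are placeholders).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 16)           # a toy batch of training data items
targets = torch.randint(0, 4, (8,))   # ground-truth class indices for the batch

for _ in range(100):                  # repeated until a predefined convergence condition is met
    logits = model(inputs)            # forward propagation through all layers (weights and biases b)
    loss = loss_fn(logits, targets)   # margin of error of the output
    optimizer.zero_grad()
    loss.backward()                   # backward propagation of the error
    optimizer.step()                  # adjust the weights to decrease the error
```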
  • FIG. 5A illustrates a video content item 500 including visual content 502 and audio content 504, in accordance with some embodiments.
  • The visual content 502 has a plurality of image frames 510 (e.g., frames 510-1 through 510-p) that are grouped to form visual segments 512.
  • For example, image frames 510-1, 510-2, and 510-3 are grouped together to form a visual segment 512-1, and image frames 510-4, 510-5, and 510-6 are grouped to form a visual segment 512-2.
  • In an example, each visual segment 512 includes 30 image frames 510 and corresponds to video content of a time duration of 1 second.
  • When the video content item 500 has a video rate of 30 frames per second, the video content item 500 is divided into a sequence of 1-second visual segments 512.
  • The audio content 504 includes a plurality of digital audio samples and is synchronized with the visual content 502.
  • The audio content 504 is divided into a plurality of audio segments 518.
  • For example, a first subset of consecutive audio samples forms a first audio segment 518-1, and a second subset of consecutive audio samples immediately follows the first subset of consecutive audio samples and forms a second audio segment 518-2.
  • In some embodiments, each audio segment 518 corresponds to the video content of the same time duration as a respective visual segment 512 and, therefore, is synchronized with the respective visual segment 512.
  • For example, each audio segment 518 contains audio samples corresponding to a time duration of 1 second and 30 image frames recorded at a video rate of 30 frames per second.
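  • A minimal sketch of this 1-second segmentation, assuming decoded image frames and audio samples are already available as arrays (the 16 kHz audio sample rate, the array shapes, and the function name are assumptions of this sketch):

```python
import numpy as np

def segment_video(frames: np.ndarray, samples: np.ndarray,
                  fps: int = 30, sample_rate: int = 16000, seconds_per_segment: int = 1):
    """Group decoded image frames and audio samples into synchronized fixed-length segments."""
    frames_per_seg = fps * seconds_per_segment            # e.g., 30 frames per 1-second segment
    samples_per_seg = sample_rate * seconds_per_segment   # audio samples covering the same second
    n_segments = min(len(frames) // frames_per_seg, len(samples) // samples_per_seg)
    visual_segments = [frames[i * frames_per_seg:(i + 1) * frames_per_seg]
                       for i in range(n_segments)]
    audio_segments = [samples[i * samples_per_seg:(i + 1) * samples_per_seg]
                      for i in range(n_segments)]
    return visual_segments, audio_segments

# Example: 10 seconds of 30 fps video with 16 kHz audio (shapes are illustrative).
frames = np.zeros((300, 224, 224, 3), dtype=np.uint8)
samples = np.zeros(160000, dtype=np.float32)
visual_segments, audio_segments = segment_video(frames, samples)
print(len(visual_segments), len(audio_segments))   # 10 10
```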
  • In some embodiments, each audio segment 518 corresponds to a second time duration, and each visual segment 512 corresponds to a first time duration that is not equal to the second time duration. For example, the second time duration (e.g., 3 seconds) is longer than the first time duration (e.g., 1 second), or the second time duration (e.g., 0.1 second) is shorter than the first time duration; in some embodiments, a third time duration (e.g., 3 seconds) is also applied.
  • The visual segments 512 and audio segments 518 of the video content item 500 are processed separately to generate self-attended visual and audio features.
  • The self-attended visual and audio features are fused to generate cross-modal fused visual and audio features focusing on visual and audio aspects, respectively.
  • Such cross-modal fused visual and audio features are then provided to a trained deep learning model as input data to determine the one or more actions associated with the video content item 500 and/or label the video content item 500 based on the one or more actions.
  • Figure 5B illustrates a process 590 of determining video action based on visual content 502 and audio content 504 of a video content item 500, in accordance with some embodiments.
  • Visual segments 512 and audio segments 518 of a video content item 500 are provided as inputs to a visual neural network 530 (e.g., a 3D CNN) and an audio neural network 532 (e.g., a CNN such as VGGish, containing a set of 2D or 3D convolutional layers), respectively.
  • The visual neural network 530 extracts a set of visual features 520 from each visual segment 512 of the video content item 500, and the audio neural network 532 extracts a set of audio features 522 from each audio segment 518 of the video content item 500.
  • Each of the visual neural network 530 and the audio neural network 532 includes a respective set of convolution layers (either 3D layers or 2D layers).
  • The visual neural network 530 receives the visual segments 512 and outputs visual features 520 that can be represented as $V \in \mathbb{R}^{n \times d_v}$, where n is the number of visual features and $d_v$ is the feature dimension of the visual features.
  • The audio neural network 532 receives the audio segments 518 and outputs audio features 522 that can be represented as $A \in \mathbb{R}^{m \times d_a}$, where m is the number of audio features and $d_a$ is the feature dimension of the audio features.
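  • A sketch of the two unimodal feature extractors producing V (n x d_v) and A (m x d_a); the single-convolution backbones, the dimensions d_v = 512 and d_a = 128, and the input shapes below are simplified stand-ins for the actual networks 530 and 532, not the disclosed architectures:

```python
import torch
from torch import nn

class VisualBackbone(nn.Module):
    """Simplified stand-in for the visual neural network 530: (n, 3, T, H, W) -> V of shape (n, d_v)."""
    def __init__(self, d_v: int = 512):
        super().__init__()
        self.conv = nn.Conv3d(3, 32, kernel_size=3, padding=1)   # one 3D convolution instead of a full 3D CNN
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.proj = nn.Linear(32, d_v)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv(segments))
        return self.proj(self.pool(x).flatten(1))                # one d_v-dimensional feature per visual segment

class AudioBackbone(nn.Module):
    """Simplified stand-in for the audio neural network 532: (m, 1, F, T) spectrograms -> A of shape (m, d_a)."""
    def __init__(self, d_a: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)   # one 2D convolution instead of a VGGish stack
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(32, d_a)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv(segments))
        return self.proj(self.pool(x).flatten(1))                # one d_a-dimensional feature per audio segment

# Example shapes: n = m = 8 synchronized segments (toy frame and spectrogram sizes).
V = VisualBackbone()(torch.randn(8, 3, 8, 56, 56))   # V: (8, 512)
A = AudioBackbone()(torch.randn(8, 1, 64, 96))       # A: (8, 128)
```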
  • The visual content 502 and the audio content 504 are initially separated from one another and analyzed independently of one another, but are fused together at a later stage (e.g., via the cross-modal fusion layers 550 and 552).
  • The visual neural network 530 generates a first segment descriptor (also called a first clip descriptor) based on the visual features 520 for each visual segment 512 of the video content item 500, and the audio neural network 532 generates a second segment descriptor (also called a second clip descriptor) based on the audio features 522 for each audio segment 518 of the video content item 500.
  • The first segment descriptors of the visual segments 512 and the second segment descriptors of the audio segments 518 are generated independently from each other.
  • The segment descriptors of each visual or audio segment 512 or 518 are implicit and local because they are correlated to a limited local event, such as a single visual segment 512 or audio segment 518.
  • Durations of different actions can be different. Some actions can span multiple visual segments 512.
  • An inter-segment fusion layer is used to capture inter-segment dependencies, including both short-range dependencies and long-range dependencies for the visual segments 512 and audio segments 518.
  • A visual inter-segment fusion layer 540 is used to strengthen the local segment descriptors (e.g., visual features 520) for each specific visual segment 512 (e.g., based on a specific target position within the visual segments 512 of the video content item 500).
  • An audio inter-segment fusion layer 542 is used to strengthen the local segment descriptors (e.g., audio features 522) for each specific audio segment 518 (e.g., based on a specific target position within the audio segments 518 of the video content item 500).
  • Each inter-segment fusion layer 540 or 542 is a bi-directional fusion layer that aggregates information from other segments (e.g., other positions), and inter-segment relationships can be fused by a bi-directional attention layer that links (e.g., associates) different segments 512 or 518.
  • A bi-directional attention layer B(S, T) can be expressed as:

    $$B(S, T) = \mathrm{softmax}\!\left(\frac{(W_q S)(W_k T)^{\top}}{\sqrt{d}}\right)(W_v T) \qquad (1)$$

    where S and T are a source vector and a target vector, respectively, and $W_q$, $W_k$, and $W_v$ are linear transform matrices for query, key, and value vector transformation, respectively. The normalization factor is $\sqrt{d}$, where d is the transformed feature dimension. The term $(W_q S)(W_k T)^{\top}$ models the bi-directional relationship between the source (S) and the target (T) (e.g., between a first set of visual features 520 of a visual segment 512-1 and a second set of visual features 520 of a visual segment 512-2 following the visual segment 512-1).
  • The inter-segment relationship among different visual segments 512 is determined by applying the visual inter-segment fusion layer 540 to the visual features 520.
  • Specifically, the bi-directional attention layer in equation (1) is applied to the visual features 520 and can be expressed as B(V, V). The inter-segment relationship among the visual segments 512 results in a self-attended visual vector V_self that includes a plurality of self-attended visual features.
  • Similarly, the inter-segment relationship among different audio segments 518 is determined by applying the audio inter-segment fusion layer 542 to the audio features 522.
  • The bi-directional attention layer in equation (1) is applied to the audio features 522 and can be expressed as B(A, A). As such, the inter-segment relationship among the audio segments 518 results in a self-attended audio vector A_self that includes a plurality of self-attended audio features.
  • In some embodiments, a plurality of bi-directional attention layers are stacked to provide a deeper integration of inter-segment dependencies (e.g., inter-segment relationships).
  • A corresponding bi-directional segment descriptor B_n is obtained by stacking n of the bi-directional attention layers of equation (1) (equation (2)), where n is the number of inter-segment fusion layers.
  • The bi-directional segment descriptor B_n results in a plurality of self-attended visual features when S and T represent the visual segments 512, and a plurality of self-attended audio features when S and T represent the audio segments 518.
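  • One plausible reading of the stacked descriptor B_n in equation (2), implemented as repeated application of the single layer sketched above (the exact composition used in the disclosure is not reproduced here; this recursion, and the assumption that source and target share dimension d, are choices of this sketch):

```python
from torch import nn

class StackedBiDirectionalAttention(nn.Module):
    """Sketch of B_n: n stacked bi-directional attention layers (reuses BiDirectionalAttention above)."""
    def __init__(self, d: int, n: int = 2):
        super().__init__()
        self.layers = nn.ModuleList([BiDirectionalAttention(d, d, d) for _ in range(n)])

    def forward(self, S, T):
        out = S
        for layer in self.layers:    # each layer re-attends the running source representation to the target
            out = layer(out, T)
        return out
```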
  • The self-attended visual vector V_self and the self-attended audio vector A_self are integrated via cross-attention layers, e.g., a visual cross-modal fusion layer 550 and an audio cross-modal fusion layer 552.
  • The visual cross-modal fusion layer 550 is applied to fuse the self-attended visual vector V_self with the self-attended audio vector A_self and generate a fused visual vector V_fuse with an emphasis on the visual vector V_self.
  • The self-attended visual vector V_self is modified by the visual cross-modal fusion layer 550 based on the self-attended audio vector A_self. In other words, the fused visual vector V_fuse is generated by applying the bi-directional segment descriptor B_n, expressed in equation (2), to the self-attended visual vector V_self and the self-attended audio vector A_self.
  • Similarly, the audio cross-modal fusion layer 552 is applied to fuse the self-attended audio vector A_self with the self-attended visual vector V_self and generate a fused audio vector A_fuse with an emphasis on the self-attended audio vector A_self.
  • The self-attended audio vector A_self is modified by the audio cross-modal fusion layer 552 based on the self-attended visual vector V_self. In other words, the fused audio vector A_fuse is generated by applying the bi-directional segment descriptor B_n, expressed in equation (2), to the self-attended audio vector A_self and the self-attended visual vector V_self.
  • The fused visual vector and the fused audio vector can thus be expressed as V_fuse = B_n(V_self, A_self) and A_fuse = B_n(A_self, V_self), respectively.
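  • Continuing the sketches above, the two cross-modal fusion layers can then be expressed as two applications of the stacked bi-directional attention with the modalities swapped (the shared dimension of 512 and the segment count are placeholders, and the toy vectors stand in for the outputs of layers 540 and 542):

```python
import torch

# Toy self-attended vectors from the inter-segment fusion layers 540 and 542 (values and shapes assumed).
V_self = torch.randn(8, 512)   # 8 visual segments, d = 512
A_self = torch.randn(8, 512)   # 8 audio segments projected to the same dimension

visual_cross_modal = StackedBiDirectionalAttention(d=512, n=2)   # sketch of layer 550
audio_cross_modal = StackedBiDirectionalAttention(d=512, n=2)    # sketch of layer 552

V_fuse = visual_cross_modal(V_self, A_self)   # visual emphasis, modified by the audio modality
A_fuse = audio_cross_modal(A_self, V_self)    # audio emphasis, modified by the visual modality
```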
  • Visual features of the fused visual vector V_fuse are pooled to generate a pooled visual feature V_pool including a plurality of pooled visual features, and audio features of the fused audio vector A_fuse are pooled to generate a pooled audio feature A_pool including a plurality of pooled audio features.
  • The pooled vector V_pool or A_pool is generated by adaptive pooling, e.g., by applying adaptive pooling layers 560 and 562 to the fused vectors V_fuse and A_fuse, respectively.
  • In some embodiments, a gating module r is used in adaptive pooling.
  • The gating module r is expressed as:

    $$r(X) = \sigma_{\mathrm{sigmoid}}\big(W_2\, \sigma_{\mathrm{ReLU}}(W_1 X)\big) \qquad (3)$$

    where X is a vector combining the corresponding fused visual or audio features, $W_1$ and $W_2$ denote linear transform matrices, $\sigma_{\mathrm{ReLU}}$ is a rectified linear operation, and $\sigma_{\mathrm{sigmoid}}$ is a sigmoid function.
  • The pooled audio feature A_pool and the pooled visual feature V_pool are defined as weighted combinations of the fused audio and visual features, with the respective weights produced by the gating modules, e.g., $V_{pool} = \sum_i r_1(V_{fuse,i})\, V_{fuse,i}$ and $A_{pool} = \sum_i r_2(A_{fuse,i})\, A_{fuse,i}$.
  • A first gating module r_1 is used in applying the first adaptive pooling layer 560 to the fused visual vector V_fuse, and a second gating module r_2 is used in applying the second adaptive pooling layer 562 to the fused audio vector A_fuse.
  • In some embodiments, the first gating module r_1 is different from the second gating module r_2; in other embodiments, the first gating module r_1 is the same as the second gating module r_2.
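  • A sketch of the gating module r and the adaptive pooling it drives (the per-feature weighted sum, the hidden size of 128, and the example shapes are assumptions consistent with, but not copied from, the description above):

```python
import torch
from torch import nn

class GatedAdaptivePool(nn.Module):
    """Sketch of an adaptive pooling layer (560/562): weight each fused feature with a gate r (equation (3))."""
    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.W1 = nn.Linear(d, hidden)    # first linear transform of the gating module
        self.W2 = nn.Linear(hidden, 1)    # second linear transform, one weight per fused feature

    def forward(self, fused: torch.Tensor) -> torch.Tensor:              # fused: (num_segments, d)
        gates = torch.sigmoid(self.W2(torch.relu(self.W1(fused))))       # r(X) = sigmoid(W2 relu(W1 X))
        return (gates * fused).sum(dim=0)                                # weighted combination -> pooled feature (d,)

# Example: separate gating modules r_1 and r_2 for the visual and audio branches.
V_pool = GatedAdaptivePool(d=512)(torch.randn(8, 512))
A_pool = GatedAdaptivePool(d=512)(torch.randn(8, 512))
```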
  • The pooled visual feature V_pool and the pooled audio feature A_pool are concatenated to form the cross-modal visual-audio feature 570, which is processed by a fully connected neural network 580 to determine the video-level content label. The fully connected neural network 580 is trained using Stochastic Gradient Descent (SGD) with a standard categorical cross-entropy loss.
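  • A sketch of the single-hidden-layer classifier 580 and one step of its SGD training with categorical cross-entropy (the layer sizes, learning rate, label count, and random data are placeholders):

```python
import torch
from torch import nn

num_labels = 10                          # number of predefined content labels (placeholder)
classifier = nn.Sequential(              # fully connected network 580 with a single hidden layer
    nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, num_labels))
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()          # standard categorical cross-entropy

features = torch.randn(16, 1024)         # a toy batch of cross-modal visual-audio features 570
labels = torch.randint(0, num_labels, (16,))   # ground-truth video-level content labels

logits = classifier(features)            # one training step; in practice this repeats over the training set
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```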
  • The accuracy of identifying and predicting actions in video data is improved over other methods by using the cross-modal visual-audio feature 570 that is generated via the multimodal audio-visual fusion process 590 described herein.
  • Table 1 provides a comparison of the accuracies of different methods.
  • A unimodal method that only takes audio features (e.g., audio features 522) into consideration has the lowest accuracy at 8.29%. Such a low accuracy is expected because an audio-only unimodal method cannot handle video-recognition tasks.
  • The accuracy is significantly improved to 66.4% when a unimodal method is applied based on only visual features (e.g., visual features 520).
  • However, the prediction accuracy can drop to 64.5% when a conventional audio-video fusion method (e.g., a naive audio-video fusion method) is used. Such a conventional method implements multimodal fusion but does not consider any inter-segment dependencies within the video content item 500.
  • Specifically, the conventional method skips the visual and audio cross-modal fusion layers 550 and 552 and directly concatenates the self-attended visual and audio features V_self and A_self into the cross-modal visual-audio feature 570.
  • The reduction in prediction accuracy of the conventional audio-video fusion method relative to the visual-only method highlights the difficulty of training a multimodal model for video action recognition and the significance of implementing the multimodal fusion process 590 shown in Figure 5B.
  • The cross-modal audio-visual fusion process 590 has the highest prediction accuracy of 70.6%. This process takes into consideration bi-directional inter-segment dependencies by generating self-attended visual features V_self and self-attended audio features A_self that are defined based on the bi-directional layer B(S, T) shown in equations (1) and (2), and applies a novel cross-modal fusion technique (e.g., the visual cross-modal fusion layer 550 and the audio cross-modal fusion layer 552) to generate fused features (e.g., fused visual features V_fuse and fused audio features A_fuse).
  • As a result, the cross-modal audio-visual fusion process 590 provides better accuracy in identifying and predicting actions in a video content item, and this improved prediction accuracy shows the effectiveness of the cross-modal audio-visual fusion process 590 in integrating multiple modalities (e.g., visual modality and audio modality) of the video content item 500.
  • FIG. 6 is a flowchart of a method 600 for labeling video content (e.g., determining a content label for the video content), in accordance with some embodiments.
  • The method 600 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof). An example of the client device 104 is a mobile phone.
  • The method 600 is applied to identify an action (e.g., a video action) in a video content item 500 based on visual content 502 and audio content 504.
  • The video content item 500 may be, for example, captured by a surveillance camera or a personal device, and streamed to a server 102 (e.g., for storage at storage 106 or a database associated with the server 102) to be labelled.
  • In some embodiments, a deep learning model used to implement the method 600 is trained at the server 102 and provided to a client device 104 that applies the deep learning model to label one or more video content items 500 obtained or captured by the client device 104.
  • Method 600 is, optionally, governed by instructions that are stored in a non- transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 6 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • The computer system obtains (610) a video content item 500 that includes visual content 502 and audio content 504. The visual content 502 includes a plurality of visual segments 512, and the audio content 504 includes a plurality of audio segments 518.
  • The computer system generates (620) a plurality of self-attended visual features V_self for the visual content 502 and generates (630) a plurality of self-attended audio features A_self for the audio content 504.
  • The computer system further fuses (640) the self-attended visual features V_self of the visual segments 512 with the self-attended audio features A_self of the audio segments 518 to generate a plurality of fused visual features V_fuse, and fuses (650) the self-attended audio features A_self of the audio segments 518 with the self-attended visual features V_self of the visual segments 512 to generate a plurality of fused audio features A_fuse.
  • The computer system also combines (660) the fused visual features V_fuse and the fused audio features A_fuse to generate a cross-modal visual-audio feature 570 based on a respective weight associated with each of the fused visual features V_fuse and fused audio features A_fuse.
  • In some embodiments, the fused visual features V_fuse are equally weighted (e.g., each of the fused visual features V_fuse is equally weighted relative to the others), and the fused audio features A_fuse are equally weighted (e.g., each of the fused audio features A_fuse is equally weighted relative to the others).
  • The computer system determines (670) a video-level content label based on the cross-modal visual-audio feature 570.
  • In some embodiments, the video-level content label is one of a plurality of predefined content labels (e.g., train passing by, people singing, dog barking, person dancing), and determining the video-level content label includes selecting one of the plurality of predefined content labels, as illustrated in the sketch below.
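  • Putting steps 660 and 670 together, a minimal sketch of the label determination (the concatenation order, the feature sizes, the example label list, and the untrained classifier are illustrative only):

```python
import torch
from torch import nn

predefined_labels = ["train passing by", "people singing", "dog barking", "person dancing"]
classifier = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                           nn.Linear(512, len(predefined_labels)))   # network 580 (untrained here)

V_pool = torch.randn(512)   # pooled visual feature from adaptive pooling layer 560 (toy values)
A_pool = torch.randn(512)   # pooled audio feature from adaptive pooling layer 562 (toy values)

cross_modal_feature = torch.cat([V_pool, A_pool])          # step 660: cross-modal visual-audio feature 570
logits = classifier(cross_modal_feature.unsqueeze(0))      # step 670: score the predefined content labels
video_level_label = predefined_labels[logits.argmax(dim=-1).item()]
```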
  • The computer system extracts a plurality of first visual features 520 locally for each of the visual segments 512 of the visual content 502, e.g., using a visual neural network 530, and uses one or more first bi-directional attention layers B(S, T) (shown in equation (1)) of a visual inter-segment fusion layer 540 to fuse the plurality of first visual features 520 of the visual segments 512 into the plurality of self-attended visual features V_self.
  • In some embodiments, the one or more first bi-directional attention layers of the visual inter-segment fusion layer 540 include a plurality of first bi-directional attention layers that are stacked to provide a multi-level integration of inter-segment visual dependency.
  • In some embodiments, one of the one or more first bi-directional attention layers is represented as follows:

    $$B(V, V) = \mathrm{softmax}\!\left(\frac{(W_{Vq} V)(W_{Vk} V)^{\top}}{\sqrt{d_V}}\right)(W_{Vv} V)$$

    where V is a vector corresponding to the first visual features 520 of the visual segments 512 of the visual content 502, $W_{Vq}$, $W_{Vk}$, and $W_{Vv}$ denote linear transform matrices for video query, key, and value vector transformation, and $\sqrt{d_V}$ is a video normalization factor.
  • In order to generate (630) the plurality of self-attended audio features A_self, the computer system extracts a plurality of first audio features 522 locally for each of the audio segments 518 of the audio content 504, and fuses the plurality of first audio features 522 of the audio segments 518 into the plurality of self-attended audio features A_self using one or more second bi-directional attention layers B(S, T) (shown in equation (1)) of the audio inter-segment fusion layer 542.
  • In some embodiments, the one or more second bi-directional attention layers of the audio inter-segment fusion layer 542 include a plurality of second bi-directional attention layers that are stacked to provide a multi-level integration of inter-segment audio dependency.
  • In some embodiments, each of the one or more second bi-directional attention layers is represented as follows:

    $$B(A, A) = \mathrm{softmax}\!\left(\frac{(W_{Aq} A)(W_{Ak} A)^{\top}}{\sqrt{d_A}}\right)(W_{Av} A)$$

    where A is a vector corresponding to the first audio features 522 of the audio segments 518 of the audio content 504, $W_{Aq}$, $W_{Ak}$, and $W_{Av}$ denote linear transform matrices for audio query, key, and value vector transformation, and $\sqrt{d_A}$ is an audio normalization factor.
  • The self-attended visual features V_self are fused with the self-attended audio features A_self to generate the fused visual features V_fuse using a first cross-attention layer B(V_s, A_s), and the self-attended audio features A_self are fused with the self-attended visual features V_self to generate the fused audio features A_fuse using a second cross-attention layer B(A_s, V_s).
  • In some embodiments, the first cross-attention layer and the second cross-attention layer are represented as follows:

    $$V_{fuse} = \mathrm{softmax}\!\left(\frac{(W'_{Vq} V_s)(W'_{Vk} A_s)^{\top}}{\sqrt{d'_V}}\right)(W'_{Vv} A_s), \qquad A_{fuse} = \mathrm{softmax}\!\left(\frac{(W'_{Aq} A_s)(W'_{Ak} V_s)^{\top}}{\sqrt{d'_A}}\right)(W'_{Av} V_s)$$

    where V_fuse is the fused visual features, A_fuse is the fused audio features, V_s is a vector corresponding to the self-attended visual features V_self, A_s is a vector corresponding to the self-attended audio features A_self, $W'_{Vq}$, $W'_{Vk}$, and $W'_{Vv}$ denote linear transform matrices for query, key, and value vector transformation associated with the fused visual features, $W'_{Aq}$, $W'_{Ak}$, and $W'_{Av}$ denote linear transform matrices for query, key, and value vector transformation associated with the fused audio features, and $\sqrt{d'_V}$ and $\sqrt{d'_A}$ are normalization factors associated with the fused visual and audio features, respectively.
  • the computer system applies a first adaptive pooling layer 560 to combine the fused visual features V fuse .
  • Applying the first adaptive pooling layer 560 to the fused visual features V fuse includes determining the respective weight associated with each fused visual feature V fuse using a first gating module / ⁇ /, and combining the fused visual features V fuse using the respective weight associated with each fused visual feature V fuse .
  • the computer system also applies a second adaptive pooling layer 562 to combine the fused audio features A fuse .
  • Applying the second adaptive pooling layer 562 includes determining the respective weight associated with each fused audio feature A fuse using a second gating module r 2 , and combining the fused audio features A fuse using the respective weight associated with each fused audio feature A fuse .
  • each of the first gating module r 1 and the second gating module r 2 is represented as equation (3).
  • in order to combine the fused visual features V fuse and the fused audio features A fuse, the computer system concatenates the fused visual features V fuse that are combined using the first adaptive pooling layer 560 and the fused audio features A fuse that are combined using the second adaptive pooling layer 562.
  • the computer system generates the cross-modal visual-audio feature 570 by concatenating the pooled visual features V pool and the pooled audio features A pool.
  • the computer system determines a fully connected neural network 580 having a single hidden layer, and the fully connected neural network processes the cross-modal visual-audio feature 570 to generate the video-level content label.
  • the fully connected neural network 580 is trained based on Stochastic Gradient Descent (SGD) with a standard categorical cross-entropy loss (a minimal training sketch is provided after this list).
  • the method 600 is implemented at an electronic device locally (e.g., the computer system is a local computer system), and the video content item (e.g., the video content item 500) is captured by the electronic device and stored in a local photo album.
  • the video content item 500 may be captured by a smart phone
  • the video content item 500 is stored in a media album (e.g., photo album and/or video album) on the smart phone
  • the smart phone includes one or more programs configured to execute method 600 and provide a video-level label for the video content item 500 (e.g., determine or identify a video action of the video content item 500).
  • the computer system receives, from a server, information of a subset of a plurality of neural network modules consisting of one or more visual neural networks 530, one or more audio neural networks 532, one or more first bi-directional attention layers (e.g., visual inter-segment fusion layer 540), one or more second bi-directional attention layers (e.g., audio inter-segment fusion layer 542), a first cross-attention layer (e.g., visual cross-modal fusion layer 550), a second cross-attention layer (e.g., audio cross-modal fusion layer 552), a first adaptive pooling layer 560, a second adaptive pooling layer 562, and a fully connected neural network 580.
  • the subset of the plurality of neural network modules is trained remotely at the server.
  • the plurality of neural network modules are trained remotely by the server in an end-to-end manner.
  • Information of the neural network modules is used by the server or provided to the computer system (e.g., a mobile phone) to label video content according to the method 600.
  • the mobile phone can label video content locally or receive video content that has been labeled by the server.
  • the computer system includes a user application 224 configured to display a user interface for receiving and presenting the video content item 500.
  • the video content item 500 is captured by a surveillance camera and streamed to a server 102.
  • the method 600 is implemented at the server 102 to identify the video action of the video content item 500 and label the video content item 500 accordingly.
  • the user interface is enabled on a mobile device of a user for obtaining, receiving, and/or presenting the video content item 500, e.g., based on the video-level content label.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.
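For illustration only, the following PyTorch sketch outlines the fusion pipeline summarized in the list above. It assumes the per-segment visual and audio features have already been extracted (e.g., by the visual neural network 530 and the audio neural network 532) as tensors of shape (segments, dim); the hidden dimension, the number of stacked layers, the sqrt(d) scaling used as the normalization factor, and the softmax gating standing in for equation (3) are assumptions, not the exact configuration of the disclosed system.

```python
# Minimal illustrative sketch (PyTorch); dimensions, layer counts, and the
# gating form are assumptions rather than the disclosed configuration.
import torch
import torch.nn as nn


class BiDirectionalAttention(nn.Module):
    """B(S, T): queries come from S, keys/values come from T."""

    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # query transform
        self.w_k = nn.Linear(dim, dim, bias=False)  # key transform
        self.w_v = nn.Linear(dim, dim, bias=False)  # value transform
        self.norm = dim ** 0.5                      # normalization factor (assumed sqrt(d))

    def forward(self, s, t):
        # s: (num_s_segments, dim), t: (num_t_segments, dim)
        attn = torch.softmax(self.w_q(s) @ self.w_k(t).T / self.norm, dim=-1)
        return attn @ self.w_v(t)


class GatedPooling(nn.Module):
    """Adaptive pooling: a gating module assigns a respective weight per feature."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 1)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=0)  # (num_segments, 1)
        return (weights * x).sum(dim=0)               # weighted combination -> (dim,)


class AudioVisualFusion(nn.Module):
    def __init__(self, dim=512, num_self_layers=2, num_classes=400):
        super().__init__()
        # Stacked bi-directional attention layers for inter-segment fusion.
        self.visual_self = nn.ModuleList([BiDirectionalAttention(dim) for _ in range(num_self_layers)])
        self.audio_self = nn.ModuleList([BiDirectionalAttention(dim) for _ in range(num_self_layers)])
        # Cross-modal fusion layers B(Vs, As) and B(As, Vs).
        self.visual_cross = BiDirectionalAttention(dim)
        self.audio_cross = BiDirectionalAttention(dim)
        # Adaptive pooling layers with gating modules.
        self.visual_pool = GatedPooling(dim)
        self.audio_pool = GatedPooling(dim)
        # Fully connected network with a single hidden layer.
        self.classifier = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

    def forward(self, v_feats, a_feats):
        v_self, a_self = v_feats, a_feats
        for layer in self.visual_self:
            v_self = layer(v_self, v_self)          # inter-segment visual dependency
        for layer in self.audio_self:
            a_self = layer(a_self, a_self)          # inter-segment audio dependency
        v_fuse = self.visual_cross(v_self, a_self)  # V_fuse = B(Vs, As)
        a_fuse = self.audio_cross(a_self, v_self)   # A_fuse = B(As, Vs)
        cross_modal = torch.cat([self.visual_pool(v_fuse), self.audio_pool(a_fuse)], dim=-1)
        return self.classifier(cross_modal)         # video-level class logits
```

For example, `AudioVisualFusion()(torch.randn(8, 512), torch.randn(8, 512))` maps eight 512-dimensional visual segment features and eight audio segment features to a vector of video-level class logits.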
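Building on the model sketch above, the following is a minimal training sketch. Only the optimizer (SGD) and the standard categorical cross-entropy loss come from the description; `train_loader`, the learning rate, the momentum, and the number of classes are hypothetical.

```python
# Illustrative training loop; hyperparameters and the data loader are assumed.
model = AudioVisualFusion(dim=512, num_classes=400)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()  # standard categorical cross-entropy

for v_feats, a_feats, label in train_loader:       # hypothetical per-video loader
    logits = model(v_feats, a_feats).unsqueeze(0)  # (1, num_classes)
    loss = loss_fn(logits, label.view(1))          # video-level content label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```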

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

An electronic device obtains a video content item that includes visual content and audio content. The visual content includes a plurality of visual segments, and the audio content includes a plurality of audio segments. A plurality of self-attended visual features are generated for the visual segments of the visual content, and a plurality of self-attended audio features are generated for the audio segments of the audio content. The self-attended visual features are fused with the self-attended audio features to generate a plurality of fused visual features, and the self-attended audio features are fused with the self-attended visual features to generate a plurality of fused audio features. The fused visual features and the fused audio features are combined to generate a cross-modal visual-audio feature based on a respective weight associated with each of the fused visual and audio features. A video-level content label is determined based on the cross-modal visual-audio feature.
PCT/US2021/026444 2021-04-08 2021-04-08 Audio-visual fusion with cross-modal attention for video action recognition WO2021184026A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/026444 WO2021184026A1 (fr) 2021-04-08 2021-04-08 Audio-visual fusion with cross-modal attention for video action recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/026444 WO2021184026A1 (fr) 2021-04-08 2021-04-08 Audio-visual fusion with cross-modal attention for video action recognition

Publications (1)

Publication Number Publication Date
WO2021184026A1 true WO2021184026A1 (fr) 2021-09-16

Family

ID=77672088

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/026444 WO2021184026A1 (fr) 2021-04-08 2021-04-08 Audio-visual fusion with cross-modal attention for video action recognition

Country Status (1)

Country Link
WO (1) WO2021184026A1 (fr)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339355A (zh) * 2021-12-31 2022-04-12 思必驰科技股份有限公司 Event detection model training method and system, electronic device, and storage medium
CN114495938A (zh) * 2021-12-04 2022-05-13 腾讯科技(深圳)有限公司 Audio recognition method and apparatus, computer device, and storage medium
CN114519880A (zh) * 2022-02-09 2022-05-20 复旦大学 Active speaker recognition method based on cross-modal self-supervised learning
CN115223086A (zh) * 2022-09-20 2022-10-21 之江实验室 Cross-modal action localization method and system based on interactive attention guidance and correction
CN115796244A (zh) * 2022-12-20 2023-03-14 广东石油化工学院 CFF-based parameter identification method for a hyper-nonlinear input-output system
CN116246213A (zh) * 2023-05-08 2023-06-09 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, and medium
CN116932731A (zh) * 2023-09-18 2023-10-24 上海帜讯信息技术股份有限公司 Multimodal knowledge question answering method and system for 5G messages

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160004911A1 (en) * 2012-04-23 2016-01-07 Sri International Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics
US20190384981A1 (en) * 2018-06-15 2019-12-19 Adobe Inc. Utilizing a trained multi-modal combination model for content and text-based evaluation and distribution of digital video content to client devices

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160004911A1 (en) * 2012-04-23 2016-01-07 Sri International Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics
US20190384981A1 (en) * 2018-06-15 2019-12-19 Adobe Inc. Utilizing a trained multi-modal combination model for content and text-based evaluation and distribution of digital video content to client devices

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEE ET AL.: "Audio-Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model", APPL. SCI., vol. 10, no. 20, 17 October 2020 (2020-10-17), pages 7263, XP055854369, Retrieved from the Internet <URL:https://www.mdpi.com/2076-3417/10/20/7263> [retrieved on 20210602] *
TIAN YAPENG; SHI JING; LI BOCHEN; DUAN ZHIYAO; XU CHENLIANG: "Audio-Visual Event Localization in Unconstrained Videos", COMPUTER VISION - ECCV 2018, LECTURE NOTES IN COMPUTER SCIENCE, vol. 11206, 9 October 2018, SPRINGER, Berlin, Heidelberg, pages 252 - 268, XP047489179, DOI: 10.1007/978-3-030-01216-8_16 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495938B (zh) * 2021-12-04 2024-03-08 腾讯科技(深圳)有限公司 Audio recognition method and apparatus, computer device, and storage medium
CN114495938A (zh) * 2021-12-04 2022-05-13 腾讯科技(深圳)有限公司 Audio recognition method and apparatus, computer device, and storage medium
CN114339355B (zh) * 2021-12-31 2023-02-21 思必驰科技股份有限公司 Event detection model training method and system, electronic device, and storage medium
CN114339355A (zh) * 2021-12-31 2022-04-12 思必驰科技股份有限公司 Event detection model training method and system, electronic device, and storage medium
CN114519880A (zh) * 2022-02-09 2022-05-20 复旦大学 Active speaker recognition method based on cross-modal self-supervised learning
CN114519880B (zh) * 2022-02-09 2024-04-05 复旦大学 Active speaker recognition method based on cross-modal self-supervised learning
CN115223086A (zh) * 2022-09-20 2022-10-21 之江实验室 Cross-modal action localization method and system based on interactive attention guidance and correction
CN115223086B (zh) * 2022-09-20 2022-12-06 之江实验室 Cross-modal action localization method and system based on interactive attention guidance and correction
CN115796244A (zh) * 2022-12-20 2023-03-14 广东石油化工学院 CFF-based parameter identification method for a hyper-nonlinear input-output system
CN115796244B (zh) * 2022-12-20 2023-07-21 广东石油化工学院 CFF-based parameter identification method for a hyper-nonlinear input-output system
CN116246213B (zh) * 2023-05-08 2023-07-28 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, and medium
CN116246213A (zh) * 2023-05-08 2023-06-09 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, and medium
CN116932731A (zh) * 2023-09-18 2023-10-24 上海帜讯信息技术股份有限公司 Multimodal knowledge question answering method and system for 5G messages
CN116932731B (zh) * 2023-09-18 2024-01-30 上海帜讯信息技术股份有限公司 Multimodal knowledge question answering method and system for 5G messages

Similar Documents

Publication Publication Date Title
WO2021184026A1 (fr) Audio-visual fusion with cross-modal attention for video action recognition
WO2019217100A1 (fr) Joint neural network for speaker recognition
WO2021081562A2 (fr) Multi-head text recognition model for multilingual optical character recognition
WO2018105194A1 (fr) Method and system for generating a multi-level relevance label
CN111062871A (zh) Image processing method and apparatus, computer device, and readable storage medium
US20240037948A1 (en) Method for video moment retrieval, computer system, non-transitory computer-readable medium
US20100158356A1 (en) System and method for improved classification
WO2023101679A1 (fr) Text-image cross-modal retrieval based on virtual word expansion
US20230360362A1 (en) Classifying image styles of images based on procedural style embeddings
WO2021092631A9 (fr) Weakly-supervised text-based video moment retrieval
JP2008257460A (ja) Information processing apparatus, information processing method, and program
CN113434716B (zh) Cross-modal information retrieval method and apparatus
CN113806588B (zh) Method and apparatus for searching videos
WO2021077140A2 (fr) Systems and methods of prior knowledge transfer for image inpainting
WO2021092600A2 (fr) Pose-over-parts network for multi-person pose estimation
WO2023102223A1 (fr) Cross-coupled multi-task learning for depth mapping and semantic segmentation
WO2021195643A1 (fr) Compression of convolutional neural networks by pruning
US11900067B1 (en) Multi-modal machine learning architectures integrating language models and computer vision systems
WO2024027347A9 (fr) Content recognition method and apparatus, device, storage medium, and computer program product
CN108596068B (zh) Action recognition method and apparatus
WO2023091131A1 (fr) Methods and systems for retrieving images based on semantic plane features
WO2023277877A1 (fr) 3D semantic plane detection and reconstruction
WO2023277888A1 (fr) Hand tracking from multiple perspectives
WO2022250689A1 (fr) Progressive video action recognition using scene attributes
WO2021195644A1 (fr) Global filter pruning of neural networks using high-rank feature maps

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21768910

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21768910

Country of ref document: EP

Kind code of ref document: A1