WO2022250689A1 - Progressive video action recognition using scene attributes
- Publication number: WO2022250689A1
- Application number: PCT/US2021/034779
- Authority: WIPO (PCT)
- Prior art keywords: video, action, neural network, attribute, clip
Classifications
- G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- H04N7/12 - Systems in which the television signal is transmitted via one channel or a plurality of parallel channels, the bandwidth of each channel being less than the bandwidth of the television signal
Description
- This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for identifying one or more actions in video content.
- One of the main challenges in action recognition is to determine whether an action belongs to a specific action category. While promising results have been shown using data from controlled environments, action recognition in real videos is more difficult and complicated due to complex background settings and action alternations.
- Various implementations of this application rely on deep learning models to determine video actions contained in video content using activity classification labels (e.g., visual attributes) that can be learned from video and non-video content.
- The deep learning models include an attribute neural network configured to identify the visual attributes in one or more scenes of the video content and an action neural network configured to determine the video actions present in the video content based on the visual attributes that have been identified.
- The attribute neural network may be trained in a supervised manner.
- Sequences of training video segments are annotated and used as inputs to train the attribute neural network to generate clip descriptors including the visual attributes corresponding to individual video segments.
- Each individual clip descriptor is encoded with information regarding a position of a respective video segment in the video content.
- The position-encoded clip descriptors of the video segments are cross-linked by a bidirectional self-attention model, thereby achieving a better accuracy level of video action recognition compared to counterpart models that use gradients of individual video clips.
- The attribute and action neural networks may be trained and utilized jointly in an end-to-end manner. Alternatively, in some embodiments, the attribute and action neural networks are trained separately and utilized jointly to identify a video action in the video content.
- A method is implemented at a computer system for recognizing video action.
- The method includes obtaining video content that includes a plurality of image frames and grouping the plurality of image frames into a plurality of successive video segments.
- The method further includes generating a plurality of clip descriptors for the video segments of the video content using an attribute neural network.
- Each video segment corresponds to a respective clip descriptor that includes (i) a first subset of feature elements that indicate one or more visual concepts of the respective video segment and (ii) a second subset of feature elements associated with a plurality of visual features extracted from the respective video segment.
- The method further includes fusing the plurality of clip descriptors of the video segments to one another (e.g., to each other) to form a video descriptor using an action neural network.
- The method further includes determining a video action classification of the video content from the video descriptor using an action classification layer (an illustrative sketch of this pipeline is provided after this summary).
- Some implementations include a computer system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
- Some implementations include a non-transitory computer-readable medium having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
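- By way of illustration only, the following minimal Python sketch shows how the above method (grouping frames into segments, generating clip descriptors, fusing them, and classifying the action) could be wired together; the module names (AttributeNet, ActionNet), layer sizes, and the use of PyTorch are assumptions for this sketch, not the claimed implementation.

```python
# Hypothetical end-to-end sketch of the described pipeline (not the patented code).
import torch
import torch.nn as nn

class AttributeNet(nn.Module):
    """Maps each 16-frame clip to a clip descriptor (visual-concept + visual-feature elements)."""
    def __init__(self, descriptor_dim=512):
        super().__init__()
        # A tiny 3D-CNN stand-in for the clip feature extraction model.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(16, descriptor_dim),
        )

    def forward(self, clips):              # clips: (num_clips, 3, 16, H, W)
        return self.backbone(clips)        # (num_clips, descriptor_dim)

class ActionNet(nn.Module):
    """Fuses clip descriptors into one video descriptor and classifies the action."""
    def __init__(self, descriptor_dim=512, num_actions=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=descriptor_dim, nhead=8, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=2)   # bidirectional self-attention
        self.classifier = nn.Linear(descriptor_dim, num_actions)  # action classification layer

    def forward(self, clip_descriptors):   # (1, num_clips, descriptor_dim)
        # (positional encoding omitted here; see the sketch accompanying equations (1)-(2) below)
        fused = self.fuser(clip_descriptors)
        video_descriptor = fused.mean(dim=1)      # simple pooling into a single video descriptor
        return self.classifier(video_descriptor)  # action logits

# Usage: 8 clips of 16 RGB frames at 112x112.
frames = torch.randn(128, 3, 112, 112)                               # the whole video
clips = frames.reshape(8, 16, 3, 112, 112).permute(0, 2, 1, 3, 4)    # group frames into clips
descriptors = AttributeNet()(clips)                                  # one descriptor per clip
logits = ActionNet()(descriptors.unsqueeze(0))                       # fuse and classify
print(logits.shape)                                                  # torch.Size([1, 10])
```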
- Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
- Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
- Figure 3 is an example data processing environment for training and applying a neural network based (NN-based) data processing model for processing visual and/or audio data, in accordance with some embodiments.
- Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments.
- Figure 4B is an example node in the neural network (NN), in accordance with some embodiments.
- Figure 5 is a diagram of video content that includes a plurality of video frames that are grouped into video segments, in accordance with some embodiments.
- Figure 6A is a block diagram of a deep learning model that has separate neural networks for attribute determination and action recognition and is modified for training, in accordance with some embodiments.
- Figure 6B is a block diagram of an action neural network of the deep learning model shown in Figure 6A, in accordance with some embodiments.
- Figure 7 is a block diagram of a deep learning model that has separate neural networks for inferring visual concepts (e.g., visual attributes) and video action of video content, in accordance with some embodiments.
- Figure 8 is a flowchart of a method for recognizing video action in video content, in accordance with some embodiments.
- FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
- The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera).
- Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
- the collected data or user inputs can be processed locally (e.g., for training and/or for prediction) at the client device 104 and/or remotely by the server(s) 102.
- the one or more servers 102 provides system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
- the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
- storage 106 may store video content for training a machine learning model (e.g., deep learning network) and/or video content obtained by a user to which a trained machine learning model can be applied to determine one or more actions associated with the video content.
- the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
- The client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
- the client devices 104 include a networked surveillance camera and a mobile phone 104C.
- The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera remotely and in real time.
- the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
- the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
- the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
- a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
- The one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
- At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
- Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
- data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
- both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C).
- the client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
- The client device 104C obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally.
- both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A).
- the server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
- the client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application).
- The client device 104A itself implements no or little data processing on the content data prior to sending them to the server 102A.
- data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B.
- the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
- the trained data processing models are optionally stored in the server 102B or storage 106.
- the client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.
- FIG. 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
- the data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof.
- the data processing system 200 typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
- the data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice- command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
- the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
- the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
- the data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
- the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.
- Memory 206 includes high-speed random access memory, such as DRAM,
- SRAM, DDR RAM, or other random access solid state memory devices and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices.
- Memory 206 optionally, includes one or more storage devices remotely located from one or more processing units 202.
- Memory 206, or alternatively the non-volatile memory within memory 206 includes a non-transitory computer readable storage medium.
- memory 206, or the non- transitory computer readable storage medium of memory 206 stores the following programs, modules, and data structures, or a subset or superset thereof:
- Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
- Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
- User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
- Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
- Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
- One or more user applications 224 for execution by the data processing system 200 e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices;
- Model training module 226 for receiving training data (e.g., training data 240) and establishing a data processing model (e.g., data processing module 228) for processing content data (e.g., video data, visual data, audio data) to be collected or obtained by a client device 104;
- Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
- One or more databases 230 for storing at least data including one or more of:
  o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
  o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
  o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
  o Training data 238 for training one or more data processing models 240;
  o Data processing model(s) 240 for processing content data (e.g., video data, visual data, audio data) using deep learning techniques; and
  o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102.
- the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
- the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
- more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
- Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
- The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments.
- memory 206 optionally, stores a subset of the modules and data structures identified above.
- memory 206 optionally, stores additional modules and data structures not described above.
- FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video data, visual data, audio data), in accordance with some embodiments.
- the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
- Both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
- the training data source 304 is optionally a server 102 or storage 106.
- both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
- the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
- the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
- the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
- The data processing model 240 is trained according to a type of the content data to be processed.
- The training data 306 is consistent with the type of the content data, and a data pre-processing module 308 is applied to process the training data 306 consistent with the type of the content data.
- a video pre-processing module 308 is configured to process video training data 306 to a predefined image format, e.g., group frames (e.g., video frames, visual frames) of the video content into video segments.
- the video pre-processing module 308 may also extract a region of interest (ROI) in each frame or separate a frame into foreground and background components, and crop each frame to a predefined image size.
- The model training engine 310 receives pre-processed training data provided by the data pre-processing module(s) 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
- the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
- The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
- the modified data processing model 240 is provided to the data processing module 228 to process the content data.
- the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
- The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
- The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316.
- Examples of the content data include one or more of: video data, visual data (e.g., image data), audio data, textual data, and other types of data. For example, each video is pre-processed to group frames in the video into video segments.
- The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
- the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
- the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
- Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments
- Figure 4B is an example node 420 in the neural network 400, in accordance with some embodiments.
- the data processing model 240 is established based on the neural network 400.
- A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
- the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
- a weight w associated with each link 412 is applied to the node output.
- The one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
- the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
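- For illustration, a minimal sketch of such a propagation function is shown below; the sigmoid activation and the example inputs and weights are assumptions chosen only for this sketch.

```python
# A single node's propagation function: a non-linear activation applied to a
# linear weighted combination of the node inputs (toy values only).
import math

def node_output(inputs, weights, bias=0.0):
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))   # sigmoid chosen as an example activation

# Four node inputs combined with weights w1..w4.
print(node_output([0.5, -1.0, 2.0, 0.1], [0.2, 0.4, 0.1, 0.8]))
```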
- the collection of nodes 420 is organized into one or more layers in the neural network 400.
- the one or more layers includes a single layer acting as both an input layer and an output layer.
- the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input layer 402 and the output layer 406.
- A deep neural network has more than one hidden layer 404 between the input layer 402 and the output layer 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer.
- a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
- one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
- max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
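- A toy example of max pooling is sketched below; the 2x2 pooling window, the feature-map values, and the use of PyTorch are illustrative assumptions only.

```python
# Max pooling keeps the maximum value of each receptive area, as described above.
import torch
import torch.nn as nn

feature_map = torch.tensor([[[[1., 3., 2., 4.],
                              [5., 6., 1., 2.],
                              [7., 2., 9., 0.],
                              [3., 8., 4., 6.]]]])      # shape (N=1, C=1, H=4, W=4)
pooled = nn.MaxPool2d(kernel_size=2)(feature_map)       # maximum of each 2x2 block
print(pooled)                                           # tensor([[[[6., 4.], [8., 9.]]]])
```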
- A convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data).
- the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
- the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
- Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
- Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
- the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
- A recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, visual data and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
- each node 420 of the RNN has a time-varying real-valued activation.
- Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
- The RNN can be used for handwriting recognition or speech recognition.
- The training process is a process for calibrating all of the weights wi for each layer of the learning model using a training data set which is provided in the input layer 402.
- the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
- In forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers.
- In backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
- the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
- a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
- The network bias b provides a perturbation that helps the neural network 400 avoid overfitting the training data.
- the result of the training includes the network bias parameter b for each layer.
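- The following minimal training-loop sketch illustrates the forward/backward procedure described above on a toy network; the layer sizes, learning rate, loss, and data are assumptions chosen only for illustration.

```python
# Toy training loop: repeated forward propagation, error measurement, and backward
# propagation that adjusts the per-layer weights (and bias terms b) to decrease the error.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))  # weights and biases per layer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 8)                 # toy training batch
targets = torch.randint(0, 3, (32,))        # toy ground-truth labels

for step in range(100):                     # repeat until a convergence condition is met
    logits = model(inputs)                  # forward propagation through all layers
    loss = loss_fn(logits, targets)         # margin of error of the output
    optimizer.zero_grad()
    loss.backward()                         # backward propagation of the error
    optimizer.step()                        # adjust weights and biases to decrease the error
```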
- FIG. 5 is a diagram of the video content 500 that includes a plurality of video frames 512 that are grouped into video segments 502 (also called video clips 502), in accordance with some embodiments.
- The video frames 512 of the video content 500 are grouped into n number of video segments 502 (e.g., video segments 502-1 through 502-n).
- the video frames 512 are grouped such that each video segment 502 includes a respective number of successive video frames 512.
- the respective number is optionally constant or distinct for each video segment 502.
- Two closest video segments 502 are optionally separated by no video frames 512 or by a limited number of frames 512.
- Each video segment 502 has 16 successive video frames 512 (e.g., video frames 512-1 through 512-16).
- The video content 500 includes at least 16 times n number of video frames 512, and each video segment 502 corresponds to a predetermined time duration that is equal to 16 times the time duration of each video frame 512.
- For example, when the frame time of each video frame 512 is equal to 16.7 milliseconds, the predetermined time duration corresponding to each video segment 502 is approximately equal to 267 milliseconds.
- the video frames 512 of the video content 500 are processed (e.g., preprocessed) to generate the video segments 502, and the video segments 502 are provided as inputs to a trained deep learning model as input data to determine the one or more actions associated with the video content 500 and/or label the video content 500 based on the one or more actions.
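- A minimal sketch of this grouping step is shown below; the frame count, frame size, and the NumPy-based implementation are assumptions for illustration.

```python
# Group decoded frames into 16-frame segments; at 16.7 ms per frame each segment
# spans roughly 16 * 16.7 ms = 267 ms, matching the example above.
import numpy as np

frames = np.zeros((128, 224, 224, 3), dtype=np.uint8)   # 128 decoded video frames (H, W, RGB)
frames_per_segment = 16
n_segments = len(frames) // frames_per_segment
segments = frames[: n_segments * frames_per_segment].reshape(
    n_segments, frames_per_segment, 224, 224, 3)

frame_time_ms = 16.7
print(n_segments, frames_per_segment * frame_time_ms)   # 8 segments, 267.2 ms each
```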
- visual concepts are generated with feature elements for each of the video segments 502. Examples of the visual concepts include, but are not limited to, an object, a visual attribute, and a scene associated with each video segment 502.
- the visual concepts are further processed with the feature elements to determine a video action classification of the video content 500 using an action neural network. By these means, the visual concepts provide additional information that can guide operation of the action neural network, thereby enhancing an accuracy level of the resulting video action classification.
- Figure 6A is a block diagram of a deep learning model 600 that has separate neural networks for attribute determination and action recognition and is modified for training, in accordance with some embodiments.
- the deep learning model 600 includes an attribute network in-training 604 and an action neural network in-training 606.
- Figure 6B is a block diagram of an action neural network 606 of the deep learning model 600 shown in Figure 6A, in accordance with some embodiments.
- attribute prediction models 624, an attribute loss 625, and an action loss 627 are added in the deep learning model 600 to optimize remaining portions of the deep learning model 600 for a subsequent inference stage.
- Each training data content for training the neural networks includes the video content 500 having a plurality of successive video frames 512.
- the video frames of the video content 500 are grouped into video segments 602 (e.g., video segments 602-1 through 602-m) before being provided to the attribute network in-training 604 as inputs.
- The video frames of the video content 500 are grouped into m number of video segments before being provided to the attribute network in-training 604 as training data.
- The attribute network in-training 604 includes one or more three-dimensional (3D) convolutional neural networks (3D CNNs).
- the attribute network in-training 604 includes a clip feature extraction model 630 that receives the video segments 602 and generates (e.g., outputs, determines, provides) a respective clip descriptor 622 (e.g., feature vector) for each received video segment 602.
- a clip feature extraction model in-training 630 outputs m number of clip descriptors 622.
- each of the clip descriptors 622 (e.g., clip descriptors 622-1 through 622-m) includes a first subset of feature elements indicating one or more visual concepts of the respective video segment 602 and a second subset of feature elements associated with a plurality of visual features extracted from the respective video segment 602.
- Each clip descriptor 622 is provided as an input to a respective attribute prediction model 624 to determine attributes of the respective video segment 602.
- the respective attribute prediction model 624 includes a fully connected layer (not shown in Figure 6A) configured to determine one or more visual concepts of the respective video segment 602 from at least the first subset of feature elements of the respective clip descriptor 622.
- the one or more visual concepts of the respective video segment 602 are generated from both the first and second subsets of feature elements of the respective clip descriptor 622.
- Examples of the visual concepts include, but are not limited to, an object, a visual attribute, and a scene associated with each video segment 502. More specifically, examples of the visual concepts include, but are not limited to, “wave”, “board”, “man”, “ball”, and “ocean”.
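- A hypothetical sketch of such an attribute prediction model is shown below; the descriptor size, the attribute vocabulary, the sigmoid thresholding, and the use of PyTorch are assumptions for illustration.

```python
# An attribute prediction head: a fully connected layer mapping a clip descriptor to
# per-attribute scores for visual concepts such as "wave", "board", or "ocean".
import torch
import torch.nn as nn

attribute_vocab = ["wave", "board", "man", "ball", "ocean"]
attribute_head = nn.Linear(512, len(attribute_vocab))        # fully connected prediction layer

clip_descriptor = torch.randn(1, 512)                        # one clip descriptor
scores = torch.sigmoid(attribute_head(clip_descriptor))      # independent score per visual concept
predicted = [a for a, s in zip(attribute_vocab, scores[0]) if s > 0.5]
print(predicted)
```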
- the visual concepts that are output by the attribute prediction models 624 are analyzed using an attribute loss function 625.
- the attribute loss function 625 is provided to the clip feature extraction model in-training 630 as part of a feedback mechanism to train the clip feature extraction model in-training 630 for identifying the one or more visual concepts (e.g., objects, visual attributes, scene) for each received video segment 602.
- the video segments 602 of the video content 500 are received with ground truth attributes.
- the one or more visual concepts outputted by the attribute prediction models 624 are compared with the ground truth attributes via the attribute loss function 625 to determine whether the attribute network in-training 604 is optimized.
- the attribute network in-training 604 is modified until the attribute loss function 625 satisfies a predefined training criterion.
- the action network in-training 606 is a neural network (NN) that includes a positional encoder 634 and the bidirectional self-attention model in-training 636.
- a positional encoder 634 receives the clip descriptors 622 and encodes each received clip descriptor 622 with information regarding its temporal position relative to the other received clip descriptors 622 to form a position encoded clip descriptor 626 (e.g., position encoded clip descriptors 626-1 through 626-m).
- a position encoded clip descriptor 626-2 includes information from the clip descriptor 622-2 and information indicating a temporal position of the video segment 602-2 to which the encoded clip descriptor 626-2 corresponds.
- the information indicates that the video segment 602-2 is temporally positioned between the video segments 602-1 and 602-3, and the position encoded clip descriptor 626-2 corresponds to the clip descriptor 622-2 that is positioned between the clip descriptors 622-1 and 622-3.
- the positional encoder 634 encodes information regarding a relative or absolute temporal position of the corresponding video segment 602 in the video content 500.
- A positional encoding is defined as:
- $PE(pos, 2i) = \sin(pos / 10000^{2i/D})$ (1)
- $PE(pos, 2i + 1) = \cos(pos / 10000^{2i/D})$ (2)
- where pos is the temporal position of the corresponding video segment 602, i denotes the i-th element in the pos-th clip descriptor 622, and D is a dimension of a clip descriptor 622.
- Each clip descriptor 622 has the same dimension as the respective PE.
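- A minimal sketch of this positional encoding, assuming a NumPy implementation and illustrative dimensions, is shown below.

```python
# Sinusoidal positional encoding per equations (1)-(2), added element-wise to each clip descriptor.
import numpy as np

def positional_encoding(num_clips, dim):
    pe = np.zeros((num_clips, dim))
    pos = np.arange(num_clips)[:, None]                      # temporal position of each segment
    i = np.arange(0, dim, 2)[None, :]                        # even feature indices
    pe[:, 0::2] = np.sin(pos / 10000 ** (i / dim))           # PE(pos, 2i)
    pe[:, 1::2] = np.cos(pos / 10000 ** (i / dim))           # PE(pos, 2i+1)
    return pe

clip_descriptors = np.random.randn(8, 512)                   # m = 8 clip descriptors, D = 512
position_encoded = clip_descriptors + positional_encoding(8, 512)
print(position_encoded.shape)                                # (8, 512)
```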
- the bidirectional self-attention model in-training 636 receives the position encoded clip descriptors 626 (e.g., position encoded clip descriptors 626-1 through 626-m) as inputs and bidirectionally fuses the position encoded clip descriptors 626 to form a video descriptor 628.
- the bidirectional self-attention model in-training 636 includes an input layer 640-1, which receives the position encoded clip descriptors 626, and an output layer 640-q, which generates a video descriptor 628 based on the fused position encoded clip descriptors 626.
- The bidirectional self-attention model in-training 636 optionally includes one or more intermediate layers (e.g., layers 640-2 through 640-(q-1)) configured to perform at least a portion of the bidirectional fusing of the encoded clip descriptors 626.
- The bidirectional self-attention model in-training 636 may include at least two layers, i.e., at least the input layer 640-1 and the output layer 640-q.
- the bidirectional self-attention model in-training 636 includes q-2 number of intermediate layers (e.g., has 0, 1 or more intermediate layers).
- each of the output layer 640-q and the one or more intermediate layers (if any) includes a fully connected neural network layer.
- the bidirectional self-attention model in-training 636 utilizes the temporal information encoded in the position encoded clip descriptors 626 by fusing the position encoded clip descriptors 626 (e.g., order-encoded vectors (o), defined in equation (3)) in both directions.
- The bidirectional self-attention model in-training 636 is formulated as $y_i = PF\left(\frac{1}{N(o)} \sum_{j} \exp\left(q(o_i)^{\top} k(o_j)\right) f(o_j)\right)$ (4), where i indicates the index of the target output temporal position and j denotes all possible combinations. N(o) is a normalization term, and f(o_j) is a linear projection inside the self-attention mechanism.
- The functions f(o_j), q(o_i), and k(o_j) are learnable functions that are trained to project feature embedding vectors to a space where the attention mechanism works efficiently.
- The outputs of the functions f(o_j), q(o_i), and k(o_j) are defined as value, query, and key, respectively.
- The function PF in equation (4) includes a position-wise feed-forward network that is applied to all positions separately and identically.
- The position-wise feed-forward network (PF) is defined as:
- $PF(x) = W_2\,\sigma_{GELU}(W_1 x + b_1) + b_2$ (6)
- where $\sigma_{GELU}$ is the Gaussian Error Linear Unit (GELU) activation function.
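- The following sketch illustrates one bidirectional self-attention layer with a GELU position-wise feed-forward network; the head count, hidden size, residual/normalization details, and the mean pooling into a video descriptor are assumptions for illustration, not the claimed configuration.

```python
# One self-attention fusion layer: every clip attends to every other clip in both directions,
# followed by a position-wise feed-forward network PF(x) = W2 * GELU(W1 x + b1) + b2.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, dim=512, hidden=2048, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # query/key/value projections
        self.pf = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, o):                        # o: position-encoded clip descriptors (B, m, dim)
        attended, _ = self.attn(o, o, o)         # bidirectional self-attention over all clips
        x = self.norm1(o + attended)
        return self.norm2(x + self.pf(x))        # PF applied to all positions separately and identically

encoded_clips = torch.randn(1, 8, 512)           # m = 8 position-encoded clip descriptors
fused = FusionLayer()(encoded_clips)
video_descriptor = fused.mean(dim=1)             # one possible reduction to a single video descriptor
print(video_descriptor.shape)                    # torch.Size([1, 512])
```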
- the action classification layer 638 receives the video descriptor 628 from the output layer 640-q and determines (e.g., generates) the video action classification 629 of the video content based on the video descriptor 628.
- The action classification layer 638 includes a fully-connected neural network layer configured to determine the video action classification 629 from the video descriptor 628.
- The attribute network in-training 604 and the action network in-training 606 are trained to determine (1) the subspace where the attention mechanism works efficiently, and (2) how to attend the temporal features of the contextual clip descriptors 622 with visual attribute information properly (e.g., which classification embedding to use).
- The video action classification 629 is analyzed to determine an action loss function 627 that is provided to the bidirectional self-attention model in-training 636 as part of a feedback mechanism to train the bidirectional self-attention model in-training 636 for identifying a video action classification 629 for the video content 500.
- the attribute network in-training 604 and the action network in-training 606 are trained separately.
- the attribute network in-training 604 is trained using the attribute loss function 625 as feedback
- the action network in training 606 is trained using the action loss function 627 as feedback.
- the attribute network in-training 604 and action network in-training 606 can be trained end-to- end.
- the attribute loss function 625 is used as feedback for training the attribute network in-training 604
- the action loss function 627 is used as feedback for training both the attribute network in-training 604 and the action network in-training 606.
- The attribute loss function 625 is used to train the clip feature extraction model 630 and the attribute prediction model 624.
- The action loss function 627 is provided for training the entire deep learning model 600, which includes the clip feature extraction model 630, the attribute prediction model 624, the bidirectional self-attention model in-training 636, and the action classification layer 638.
- an overall loss function is a combination of the attribute loss function 625 and the action loss function 627. The overall loss function is used for training the entire deep learning model 600.
- training the deep learning model 600 includes applying Stochastic Gradient Descent (SGD) with standard categorical cross-entropy loss to the action loss function 627.
- training the deep learning model 600 includes applying a binary cross-entropy to the attribute loss function 625.
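- An illustrative combination of these two training signals is sketched below; the equal weighting of the attribute and action terms and the toy tensor shapes are assumptions, not values prescribed by this disclosure.

```python
# Binary cross-entropy for the multi-label attribute predictions, categorical cross-entropy
# for the action class, combined and optimized with SGD.
import torch
import torch.nn as nn

attribute_logits = torch.randn(8, 5, requires_grad=True)   # per-clip attribute scores (m = 8 clips)
attribute_truth = torch.randint(0, 2, (8, 5)).float()      # ground-truth attribute annotations
action_logits = torch.randn(1, 10, requires_grad=True)     # video-level action scores
action_truth = torch.tensor([3])                           # ground-truth action class

attribute_loss = nn.BCEWithLogitsLoss()(attribute_logits, attribute_truth)   # attribute loss
action_loss = nn.CrossEntropyLoss()(action_logits, action_truth)             # action loss
overall_loss = attribute_loss + action_loss                                  # combined objective

optimizer = torch.optim.SGD([attribute_logits, action_logits], lr=0.01)
optimizer.zero_grad()
overall_loss.backward()
optimizer.step()
```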
- a respective video segment 602 of the video content 500 is associated with (e.g., corresponds to, includes) annotation(s) regarding attribute(s) associated with the video segment 602.
- video segment 602-1 may include annotation(s) to indicate attribute(s) of video segment 602-1, such as “wave,” “board,” “person,” “ocean.”
- an attribute prediction 624-1 which is generated by the attribute network in training 604 in response to receiving a video segment 602-1, can be compared to annotation(s) associated with the video segment 602-1 to determine the attribute loss function 625.
- the video content 500 includes annotation(s) regarding action(s) associated with the video content 500.
- the video content 500 may include annotation(s) to indicate actions in the video content 500, such as “surfing.”
- the video action classification 629 which is generated by action network in-training 606 in response to receiving the clip descriptors 622 that correspond to the video content 500, is compared to annotation(s) associated with the video content 500 to determine the action loss function 627.
- training the deep learning model 600 includes comparing the one or more visual concepts generated by the attribute prediction models 624 with one or more ground truth visual concepts that are included as annotation(s) to the video content 500 (e.g., to video frames of the video content 500, to the video segments 602 of the video content 500). In some embodiments, training the deep learning model 600 includes comparing the video action classification 629 with one or more ground truth actions that are included as an annotation to the video content 500.
- the attribute network in-training 604 is trained to form a trained attribute network (e.g., 704 in Figure 7) for generating the clip descriptors 622 in response to receiving the video segments 602.
- the clip feature extraction model in-training 630 is finalized to a trained clip feature extraction model (e.g., 730 in Figure 7).
- the action network in-training 606 is trained to form a trained action network (e.g., 706 in Figure 7) for generating the video action classification 729 of the video content 500 in response to receiving clip descriptors 622 that correspond to the video segments 602 of the video content 500.
- The bidirectional self-attention model in-training 636 is finalized to a trained bidirectional self-attention model (e.g., 736 in Figure 7), and the action classification layer in-training 638 is finalized to a trained action classification layer (e.g., 738 in Figure 7).
- the deep learning model 600 is trained in a server 102, and a corresponding trained deep learning model (e.g., 700 in Figure 7) is provided to and implemented on a client device 104 (e.g., a mobile phone).
- FIG. 7 is a block diagram of a deep learning model 700 that has separate neural networks for inferring visual concepts (e.g., visual attributes) and video action of the video content 500, in accordance with some embodiments.
- the attribute prediction models 624 are removed, if the visual concepts do not need to be outputted.
- the video content 500 includes a plurality of video frames 512.
- the video frames of the video content 500 are grouped into video segments 702 (e.g., video segments 702-1 through 702-p) before being provided to trained neural networks (e.g., trained attribute network 704 and trained action network 706) for video action prediction.
- The video frames of the video content 500 are grouped into p number of video segments.
- Each video segment 702 optionally includes a predefined number of video frames (e.g., 16 successive video frames).
- each of the clip feature extraction model 730, the bidirectional self-attention model 736, and the action classification layer 738 in the deep learning model 700 is established based on a respective neural network. Further, in some situations, the bidirectional self-attention model 736 and the action classification layer 738 are integrated in the same neural network.
- Each video segment 702 (e.g., video segments 702-1 through 702-p) is provided to and processed by trained clip feature extraction model 730 of trained attribute network 704.
- Trained clip feature extraction model 730 generates a respective clip descriptor 722 for each received video segment 702. For example, in response to receiving p number of video segments 702 (e.g., video segments 702-1 through 702-p), trained clip feature extraction model 730 outputs p number of clip descriptors 722 (e.g., clip descriptors 722-1 through 722-p).
- the positional encoder 634 receives each clip descriptor 722 with information regarding a temporal position of the respective video segment 702 to which the respective clip descriptor 722 corresponds, encodes the respective clip descriptor 722 with the information regarding the temporal position (e.g., using equations (l)-(3)), and generates a respective position encoded clip descriptor 726.
- The trained bidirectional self-attention model 736 receives the position encoded clip descriptors 726 corresponding to the video segments 702 (e.g., position encoded clip descriptors 726-1 through 726-p), and fuses the position encoded clip descriptors 726 to each other to generate a video descriptor 728.
- The trained bidirectional self-attention model 736 provides the video descriptor 728 as an input to a trained action classification layer 738 and generates a video action classification 729 for the video content 500.
- FIG. 8 is a flowchart of a method 800 for recognizing video action in the video content 500, in accordance with some embodiments.
- the method 800 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof).
- An example of the client device 104 is a mobile phone.
- the method 800 is applied to identify an action (e.g., video action) in the video content 500 based on video segments 702 of the video content 500.
- the video content item 500 may be, for example, captured by a first electronic device (e.g., a surveillance camera or a personal device), and streamed to a server 102 (e.g., for storage at storage 106 or a database associated with the server 102) to be labelled.
- the server identifies the video action in the video content item 500 and provides the video content item 500 with the video action to one or more second electronic devices that are distinct from or include the first electronic device.
- a deep learning model used to implement the method 800 is trained at the server 102 and provided to a client device 104 that applies the deep learning model locally to provide an action classification for one or more video content items 500 obtained or captured by the client device 104.
- Method 800 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
- Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2).
- the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
- the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
- the computer system obtains (810) video content item 500 that includes a plurality of image frames and groups (820) the plurality of image frames into a plurality of successive video segments 702 (e.g., video segments 702-1 through 702-p).
- the computer system generates (830) a plurality of clip descriptors 722 for the video segments 702 of the video content 500 using an attribute neural network (e.g., trained attribute network 704).
- Each video segment 702 corresponds to a respective clip descriptor 722 that includes a first subset of feature elements indicating one or more visual concepts of the respective video segment 702 and a second subset of feature elements associated with a plurality of visual features extracted from the respective video segment 702.
- each visual concept is one of an object, a visual attribute, and a scene associated with the respective video segment 702.
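One way to picture such a clip descriptor is as a concatenation of attribute scores and backbone features, as in the hedged sketch below; the split of 64 concept elements and 448 feature elements is an illustrative assumption.

```python
import torch

attribute_scores = torch.sigmoid(torch.randn(64))   # first subset: visual-concept scores (e.g., "beach", "board")
visual_features = torch.randn(448)                  # second subset: visual features extracted from the segment
clip_descriptor = torch.cat([attribute_scores, visual_features])   # full descriptor of dimension D = 512
```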
- the computer system further fuses (840) the plurality of clip descriptors 722 of the video content 500 to form a video descriptor 728 using an action neural network (e.g., trained action network 706), and determines (850) a video action classification 729 of the video content 500 from the video descriptor 728 using a trained action classification layer 738.
- the action neural network 706 optionally includes the action classification layer 738.
- the computer system encodes each of the plurality of clip descriptors 722 with respective positional information to generate a respective position encoded clip descriptor 726.
- the respective positional information indicates a temporal position of the corresponding video segment 702 in the video content 500.
- the position encoded clip descriptors 726 of the video segments 702 are fused to form the video descriptor 728.
- PE(pos, 2i + 1) = cos(pos / 10000^(2i/D)), where
- pos is the temporal position of the corresponding video segment,
- i denotes the i-th feature element in the pos-th clip descriptor 722, and
- D is a dimension of the respective clip descriptor 722.
- the computer system adaptively superimposes the positional encoding term (PE) for each feature element of the respective clip descriptor 722.
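A compact sketch of this positional encoding is given below; the cosine term follows the equation above, while the sine term for even-indexed elements is an assumption based on the standard sinusoidal formulation.

```python
import torch

def positional_encoding(pos: int, dim: int) -> torch.Tensor:
    """Return the PE vector for the clip at temporal position `pos` with descriptor dimension `dim`."""
    i = torch.arange(dim)
    angle = pos / torch.pow(10000.0, (2 * (i // 2)).float() / dim)
    return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))

clip_descriptor = torch.randn(512)                              # stand-in clip descriptor
encoded_clip = clip_descriptor + positional_encoding(3, 512)    # superimpose PE element-wise
```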
- the action neural network (e.g., trained action network 706) includes a semantic attention model (e.g., a trained bidirectional self-attention model 736 in Figure 7) that combines the plurality of visual features and the one or more visual concepts bidirectionally.
- the video action classification corresponds to an entry in a visual action dictionary (y_act, shown in Equation (7)).
- the visual action dictionary includes “surfing” and “boating”, and the video action classification corresponds to a vector having a number of classification elements.
- two distinct classification elements corresponding to “surfing” and “boating” are equal to 1, while remaining classification elements of the vector are equal to 0.
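The multi-hot encoding described in this example can be written out directly, as in the short illustration below; the five-entry dictionary is an assumption made only for readability.

```python
import torch

action_dictionary = ["surfing", "boating", "running", "cooking", "climbing"]
classification = torch.zeros(len(action_dictionary))
for action in ("surfing", "boating"):                 # actions present in the video content
    classification[action_dictionary.index(action)] = 1.0
# classification == tensor([1., 1., 0., 0., 0.])
```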
- the attribute neural network (e.g., trained attribute network 704) includes a plurality of three-dimensional (3D) convolutional neural networks, and each 3D convolutional neural network is configured to generate a respective clip descriptor 722 for a respective video segment 702 of the video content 500.
- the computer system generates a plurality of test clip descriptors 622 from the training video content 500 using the attribute neural network (e.g., attribute network in-training 604), applies an attribute prediction layer (e.g., a fully connected layer of each attribute prediction model 624) to the test clip descriptors 622, and extracts one or more test visual concepts from the test clip descriptors 622.
- the computer system trains the attribute neural network 604 using an attribute loss function 625 that compares the one or more test visual concepts with one or more ground truth visual concepts provided with the training video content 500.
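A minimal sketch of this attribute-loss step is shown below, assuming a fully connected prediction layer, 512-dimensional clip descriptors, and 64 multi-label visual concepts; binary cross-entropy is used here as suggested later in the text.

```python
import torch
from torch import nn

attribute_head = nn.Linear(512, 64)                   # stand-in attribute prediction layer
clip_descriptors = torch.randn(8, 512)                # batch of test clip descriptors from training video content
gt_concepts = torch.randint(0, 2, (8, 64)).float()    # ground-truth visual concepts (multi-label)

attribute_loss = nn.BCEWithLogitsLoss()(attribute_head(clip_descriptors), gt_concepts)
```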
- the computer system generates one or more test action classifications (e.g., video action classification 629) from the training video content 500 using the attribute neural network (e.g., attribute network in-training 604) and the action neural network (e.g., action network in-training 606) including an action classification layer (e.g., action classification layer in-training 638).
- the computer system trains at least a subset of the attribute neural network 604 and action neural network 606 using an overall loss function (e.g., action loss function 627) that compares the one or more test action classifications with one or more ground truth actions.
- the overall loss function is a combination of an action prediction loss (e.g., action loss function 627) and a visual attribute loss (e.g., attribute loss function 625).
- training at least a subset of the attribute neural network 604 and action neural network 606 further includes applying Stochastic Gradient Descent (SGD) with standard categorical cross-entropy loss to the action prediction loss 627 and applying a binary cross-entropy to the visual attribute loss 625.
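The combined objective and optimizer step can be sketched as follows; the toy single-layer model, the 0.5 weighting between the two losses, and the learning rate are illustrative assumptions rather than values from the patent.

```python
import torch
from torch import nn

model = nn.Linear(512, 400 + 64)                      # toy stand-in producing action and attribute logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

features = torch.randn(8, 512)                        # batch of fused training features
gt_action = torch.randint(0, 400, (8,))               # one ground-truth action class per sample
gt_attributes = torch.randint(0, 2, (8, 64)).float()  # ground-truth visual attributes (multi-label)

out = model(features)
action_loss = nn.CrossEntropyLoss()(out[:, :400], gt_action)          # categorical cross-entropy
attribute_loss = nn.BCEWithLogitsLoss()(out[:, 400:], gt_attributes)  # binary cross-entropy
overall_loss = action_loss + 0.5 * attribute_loss

optimizer.zero_grad()
overall_loss.backward()
optimizer.step()
```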
- the attribute neural network 704 and action neural network 706 are trained on a remote server and provided to an electronic device that is distinct from the remote server and configured to determine the video action classification 729 of the video content 500.
- the attribute neural network 704 and action neural network 706 are trained on a remote server.
- the remote server receives the video content 500 from a first electronic device that is distinct from the remote server, and the remote server provides the video action classification 729 of the video content 500 to a second electronic device that optionally includes or does not include the first electronic device.
- the attribute neural network 704 and action neural network 706 are trained locally on an electronic device, and the video action classification 729 of the video content 500 is determined by the electronic device.
- the video content 500 includes a first number of image frames, and every second number of image frames is grouped into a respective video segment 702.
- the video descriptor 728 includes a first video descriptor of the video content 500
- the video action classification 729 includes a first video action classification for the video content 500.
- the first video action classification 729 corresponds to a first subset of successive video segments 702 including a third number of successive video segments, and the clip descriptors 722 of the third number of successive video segments are fused to form the first video descriptor 728 for determining the first video action classification 729 of the video content 500.
- the clip descriptors 622 are forced to include visual concepts (e.g., visual attributes) and feature elements.
- these visual concepts are mixed with the feature elements using neural networks to provide additional information on action classification, thereby enhancing the accuracy level of the action classification.
- the plurality of clip descriptors of successive video segments are fused in two directions, and these video segments are cross-referenced to provide additional information for each other. By these means, the accuracy level of the action classification can be enhanced.
- It should be noted that the particular order in which the operations in Figure 8 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed.
- One of ordinary skill in the art would recognize various ways to label video content as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 6A, 6B, and 7 are also applicable in an analogous manner to method 800 described above with respect to Figure 8. For brevity, these details are not repeated here.
- the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
- stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
Abstract
According to the invention, a computer system obtains video content that includes a plurality of image frames and groups the plurality of image frames into a plurality of successive video segments. The device generates a plurality of clip descriptors for the video segments of the video content using an attribute network. Each video segment corresponds to a respective clip descriptor that includes: (i) a first subset of feature elements indicating one or more visual concepts of the respective video segment and (ii) a second subset of feature elements associated with a plurality of visual features extracted from the respective video segment. The plurality of clip descriptors of the video segments are fused with each other to form a video descriptor using an action neural network, and a video action of the video content is determined from the video descriptor using an action classification layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2021/034779 WO2022250689A1 (fr) | 2021-05-28 | 2021-05-28 | Reconnaissance d'action vidéo progressive à l'aide d'attributs de scène |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2021/034779 WO2022250689A1 (fr) | 2021-05-28 | 2021-05-28 | Reconnaissance d'action vidéo progressive à l'aide d'attributs de scène |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022250689A1 true WO2022250689A1 (fr) | 2022-12-01 |
Family
ID=84229051
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2021/034779 WO2022250689A1 (fr) | 2021-05-28 | 2021-05-28 | Reconnaissance d'action vidéo progressive à l'aide d'attributs de scène |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2022250689A1 (fr) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070263900A1 (en) * | 2004-08-14 | 2007-11-15 | Swarup Medasani | Behavior recognition using cognitive swarms and fuzzy graphs |
US20200162799A1 (en) * | 2018-03-15 | 2020-05-21 | International Business Machines Corporation | Auto-curation and personalization of sports highlights |
US20210142440A1 (en) * | 2019-11-07 | 2021-05-13 | Hyperconnect, Inc. | Image conversion apparatus and method, and computer-readable recording medium |
2021
- 2021-05-28 WO PCT/US2021/034779 patent/WO2022250689A1/fr active Application Filing
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070263900A1 (en) * | 2004-08-14 | 2007-11-15 | Swarup Medasani | Behavior recognition using cognitive swarms and fuzzy graphs |
US20200162799A1 (en) * | 2018-03-15 | 2020-05-21 | International Business Machines Corporation | Auto-curation and personalization of sports highlights |
US20210142440A1 (en) * | 2019-11-07 | 2021-05-13 | Hyperconnect, Inc. | Image conversion apparatus and method, and computer-readable recording medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021184026A1 (fr) | Fusion audiovisuelle avec attention intermodale pour la reconnaissance d'actions vidéo | |
CN111062871B (zh) | 一种图像处理方法、装置、计算机设备及可读存储介质 | |
WO2021081562A2 (fr) | Modèle de reconnaissance de texte multi-tête pour la reconnaissance optique de caractères multilingue | |
WO2023101679A1 (fr) | Récupération inter-modale d'image de texte sur la base d'une expansion de mots virtuels | |
US20240037948A1 (en) | Method for video moment retrieval, computer system, non-transitory computer-readable medium | |
KR101887637B1 (ko) | 로봇 시스템 | |
WO2021092631A2 (fr) | Récupération de moment vidéo à base de texte faiblement supervisé | |
WO2021077140A2 (fr) | Systèmes et procédés de transfert de connaissance préalable pour la retouche d'image | |
CN113434716B (zh) | 一种跨模态信息检索方法和装置 | |
CN113255625B (zh) | 一种视频检测方法、装置、电子设备和存储介质 | |
CN111046757B (zh) | 人脸画像生成模型的训练方法、装置及相关设备 | |
WO2016142285A1 (fr) | Procédé et appareil de recherche d'images à l'aide d'opérateurs d'analyse dispersants | |
WO2021092600A2 (fr) | Réseau pose-over-parts pour estimation de pose multi-personnes | |
US20240296697A1 (en) | Multiple Perspective Hand Tracking | |
WO2023018423A1 (fr) | Incorporation binaire sémantique d'apprentissage pour des représentations vidéo | |
WO2023091131A1 (fr) | Procédés et systèmes pour récupérer des images sur la base de caractéristiques de plan sémantique | |
WO2022250689A1 (fr) | Reconnaissance d'action vidéo progressive à l'aide d'attributs de scène | |
CN116932788A (zh) | 封面图像提取方法、装置、设备及计算机存储介质 | |
CN112214626B (zh) | 图像识别方法、装置、可读存储介质及电子设备 | |
WO2023277877A1 (fr) | Détection et reconstruction de plan sémantique 3d | |
WO2023022709A1 (fr) | Enregistrement de mouvement humain sans marqueur portatif en temps réel et rendu d'avatar dans une plateforme mobile | |
WO2024005784A1 (fr) | Récupération de texte à vidéo à l'aide de fenêtres à auto-attention décalées | |
WO2023063944A1 (fr) | Reconnaissance de gestes de la main en deux étapes | |
WO2023091129A1 (fr) | Localisation de caméra sur la base d'un plan | |
US20230274403A1 (en) | Depth-based see-through prevention in image fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21943270 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21943270 Country of ref document: EP Kind code of ref document: A1 |