WO2023036159A1 - Methods and devices for audio visual event localization based on dual perspective networks - Google Patents

Methods and devices for audio visual event localization based on dual perspective networks Download PDF

Info

Publication number
WO2023036159A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
video
sequence
node
visual
Prior art date
Application number
PCT/CN2022/117415
Other languages
French (fr)
Inventor
Varshanth RAO
Md Ibrahim KHALIL
Peng Dai
Juwei Lu
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023036159A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning

Definitions

  • the present disclosure pertains to the field of video processing using artificial intelligence and in particular to computer vision methods for labelling videos and their constituent segments.
  • An aspect of the disclosure provides a method for dual perspective processing.
  • Such a method is executed by a processor and includes extracting a video sequence from a visual stream and extracting an audio sequence from an audio stream associated with the visual stream, converting the video sequence into a video graph using a sequence to graph perspective change and converting the audio sequence into an audio graph using the sequence to graph perspective change.
  • Such a method further includes processing the video graph and the audio graph together using a relational graph neural network (RGNN) , the RGNN creating a processed video graph and a processed audio graph.
  • Such a method further includes converting the processed video graph and the processed audio graph into a processed video sequence and a processed audio sequence using a graph to sequence perspective change.
  • Such a method further includes processing the processed video sequence using a sequence processor and processing the processed audio sequence using the sequence processor.
  • the RGNN is a relational graph convolutional transformer (RGCT) .
  • a technical benefit can be that dual perspective processing resolves long term dependencies within a single modality stream using sequential perspective processing and resolves short-term dependencies while achieving feature fusion between modalities within a local temporal neighbourhood using graph perspective processing.
  • dual perspective networks can perform event/activity localization, video-level classification, and clip-level classification.
  • a technical benefit can be that self-supervised video representation techniques including audio-visual temporal synchronization, audio-visual correspondence, temporal order verification, segment/clip hallucination, and pace prediction can be performed using a graph perspective by morphing the adjacency matrix.
  • a technical benefit can be that audio-visual event localization can be performed using only visual features and audio features where these features are extracted with respect to speed and representation ability.
  • a technical benefit can be that dual perspective networks can be compressed to cater to device constraints.
  • Another aspect of the disclosure provides a method for a relational graph convolutional transformer (RGCT) .
  • Such a method is executed by a processor and includes aggregating neighbor nodes of a node of a relational graph, the neighbor nodes are all nodes in a relational neighborhood of the node.
  • Such a method further includes concatenating a query of the node over different types of relations that include the query.
  • Such a method further includes developing a key transformation related to the aggregated neighbor nodes.
  • Such a method further includes developing a value transformation related to the aggregated neighbor nodes.
  • Such a method further includes developing an attention map using the concatenated query of the node and a transpose of the key transformation, and applying the attention map to the value transformation to derive a relational node update.
  • Such a method further includes transforming a representation of the node resulting from an average of the relational node update.
  • a technical benefit can be that the RGCT is used to refine node features using relation-wise polymorphic representations of itself by querying relation-wise neighbourhoods when resolving graph specific problems.
  • a technical benefit can be that localized video segments can be searched to find an action of interest.
  • Another aspect of the disclosure provides a method for a replicate and link data augmentation technique.
  • Such a method is executed by a processor and includes decomposing a first video into any combination of: one or more visual frames; one or more audio frames; one or more audio features which have been statistically precomputed or the output of a neural network; one or more visual features which have been statistically precomputed or the output of the neural network.
  • the decomposition includes extracting first video states from the first video.
  • Such a method further includes extracting a state sequence from a second video, that state sequence comprised of extracted second video states.
  • Such a method further includes randomly sampling frames from the plurality of specific state databases, the frames having been stored when the first video states are equivalent to the second video states. Such a method further includes stitching the stored frames together to create a replica.
  • a technical benefit can be that the performance of an artificial intelligence (AI) model is improved by training the AI model using a dataset of videos produced using replica augmentation so that the AI model learns the underlying semantic concept while ignoring external noise caused by coincidental correlations.
  • a technical benefit can be that replica augmentation can be used to transform a smaller dataset of videos into a significantly larger dataset of videos to introduce diversity in all modalities while preserving the semantic concept.
  • a technical benefit can be that replica augmentation can also include link augmentation to expand the video representation of a graph perspective at run-time.
  • FIG. 1 is a schematic diagram illustrating a sequential processing perspective, in accordance with an embodiment of this present disclosure.
  • FIG. 2a is a schematic diagram illustrating the modal segments and video stream of a graph perspective, in accordance with an embodiment of this present disclosure.
  • FIG. 2b is a schematic diagram illustrating the modal segments and audio stream of a graph perspective, according to an embodiment of this disclosure.
  • FIG. 3 is a schematic diagram illustrating an example modal segments and their relationship, in accordance with an embodiment of this present disclosure.
  • FIG. 4 is a schematic diagram illustrating an adjacency matrix, in accordance with an embodiment of this present disclosure.
  • FIG. 5 is a schematic diagram illustrating a visual node update, in accordance with an embodiment of this present disclosure.
  • FIG. 6 is a schematic diagram illustrating an audio node update, in accordance with an embodiment of this present disclosure.
  • FIG. 7a is a schematic diagram illustrating alternating dual perspective processing, in accordance with an embodiment of this present disclosure.
  • FIG. 7b is a schematic diagram illustrating parallel dual perspective processing, in accordance with an embodiment of this present disclosure.
  • FIG. 8 is a schematic diagram illustrating a relational graph convolution transformer, in accordance with an embodiment of this present disclosure.
  • FIG. 9 is a schematic diagram illustrating a relational graph convolution transformer attention map, in accordance with an embodiment of this present disclosure.
  • FIG. 10 is a schematic diagram illustrating a state and composition/extraction table, in accordance with an embodiment of this present disclosure.
  • FIG. 11 is a schematic diagram illustrating an activation map, in accordance with an embodiment of this present disclosure.
  • FIG. 12a is a block diagram illustrating replica augmentation, in accordance with an embodiment of this present disclosure.
  • FIG. 12b is a schematic diagram illustrating an example original sample video, in accordance with an embodiment of this present disclosure.
  • FIG. 12c is a schematic diagram illustrating a replica video, in accordance with an embodiment of this disclosure.
  • FIG. 13 is a schematic diagram illustrating link augmentation, in accordance with an embodiment of this disclosure.
  • FIG. 14 is a block diagram of a system architecture, in accordance with embodiments of the present disclosure.
  • FIG. 15 is a block diagram of a convolutional neural network model, in accordance with embodiments of this disclosure.
  • Various embodiments of this disclosure use methods to process video in order to localize and label events within the video. This labeling allows retrieval of one or more video segments similar to a searched keyword. This disclosure provides these methods.
  • Methods of this disclosure can use various deep learning based computer vision techniques. These methods can include convolutional neural networks (CNN) , pretrained convolutional neural networks (VGGish and VGG-19) , message passing networks (MPN) , graph neural networks (GNN) such as graph convolutional network (GCN) , relational graph convolutional networks (RGCN) , and graph attention networks (GAN) .
  • These deep learning based computer vision techniques can be used to construct a novel viewpoint for a neural network to use in order to process video modalities.
  • These video modalities can include segments of audio, visual, optical flow, and text.
  • a CNN can be a deep neural network that can include a convolutional structure.
  • the CNN can include a feature extractor that can consist of a convolutional layer and a sub-sampling layer.
  • the feature extractor may be considered to be a filter.
  • a convolution process may be considered as performing convolution on an input image by using a trainable filter to produce a convolutional feature map.
  • the convolutional layer may indicate a neural cell layer at which convolution processing can be performed on an input signal in the CNN.
  • in the convolutional layer, each neural cell can be connected only to some of the neural cells in neighboring layers.
  • One convolutional layer usually can include several feature maps and each of these feature maps may be formed by some neural cells that can be arranged in a rectangle.
  • Neural cells producing the same feature map can share one or more weights. These shared weights can be referred to as a convolutional kernel by a person skilled in the art.
  • the shared weights can be understood as being unrelated to the manner and position in which image information is extracted.
  • a hidden principle can be that statistical information learned from one part of an image may also be used in another part. Therefore, the same image information obtained through learning can be used at all positions on the image.
  • a plurality of convolutional kernels can be used at a same convolutional layer to extract different image information. Generally, a larger quantity of convolutional kernels can indicate that richer image information can be reflected by a convolution operation.
  • a convolutional kernel can be initialized in a form of a matrix of a random size.
  • a proper weight can be obtained by performing learning on the convolutional kernel.
  • a direct advantage that can be brought by the shared weight is that a connection between layers of the CNN can be reduced and the risk of overfitting can be lowered.
  • In the process of training a deep neural network to produce a predicted value as close as possible to a desired value, the predicted value of the current network and the desired target value can be compared, and the weight vector of each layer of the neural network can be updated based on the difference between the predicted value and the desired target value.
  • An initialization process can be performed before the first update.
  • This initialization process can include a parameter that can be preconfigured for each layer of the deep neural network.
  • a weight vector can be adjusted to reduce the difference between the predicted value and the desired target value. This adjustment can be performed multiple times until the neural network can predict the desired target value.
  • This adjustment process is known to those skilled in the art as training a deep neural network using a process of minimizing loss.
  • the loss function and the objective function are mathematical expressions that can be used to determine the difference between the predicted value and the target value.
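  • As a hedged illustration of this training process, the following sketch (in PyTorch; the framework, layer sizes, and choice of mean squared error loss are assumptions for illustration and not part of the disclosure) compares a predicted value to a desired target value and updates the weight vectors to minimize the loss.

```python
import torch
import torch.nn as nn

# A hypothetical two-layer network standing in for the deep neural network being trained.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.MSELoss()                                    # loss/objective function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # updates the weight vectors of each layer

x = torch.randn(8, 16)                                    # a batch of input features (assumed shape)
target = torch.randn(8, 4)                                # the desired target values

for step in range(100):                                   # the adjustment is performed multiple times
    prediction = model(x)                                 # predicted value of the current network
    loss = loss_fn(prediction, target)                    # difference between prediction and target
    optimizer.zero_grad()
    loss.backward()                                       # gradients w.r.t. each layer's weight vector
    optimizer.step()                                      # adjust the weights to reduce the loss
```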
  • Embodiments of this disclosure can construct novel viewpoints in the form of sequential processing perspectives and also graph perspectives. These graph perspectives can include nodes that can represent segments in each modality and edges that can represent temporal and cross-modal relationships.
  • a dual perspective network can process a video by viewing the video from two different perspectives.
  • the DPN can process the video by treating the video as a sequential stream of data and also by treating the video as an interactive graph where the graph can consist of temporally directed and one or more cross-modal relationships.
  • dual perspective processing can be defined as processing a video both as data represented as a sequential stream and data represented as a graph. Processing both perspectives can result in the DPN refining individual and also joint modal features.
  • Embodiments of this disclosure can use DPNs for several techniques where the dual perspective processing can be the backbone for video understanding tasks. These tasks can include video classification, action recognition, action localization, and self-supervised video representation techniques that can include audio-visual temporal synchronization, audio-visual correspondence, temporal order verification, segment/clip hallucination, and pace prediction.
  • the adjacency matrix can be manipulated so that perspective 2 (graph perspective) methods can be used. As a non-limiting example, the adjacency matrix can be manipulated by changing the connection of A 1 from pointing to V 1 so that A 1 points to V 2 .
  • Embodiments of this disclosure can implement self-supervised learning.
  • Self-supervised learning may not include labels nor classes.
  • a representation of the video can be created for some tasks.
  • the representation can be created by unsynchronizing the audio and visual streams.
  • a neural network can use these synchronized and unsynchronized streams to learn parameters that can distinguish where the unsynchronizing operation has taken place, which then allows the neural network to create representations by learning how to tune its parameters.
  • This learning is also known by those skilled in the art as adjusting weights.
  • the neural network can learn how to form an understanding of which visual segment corresponds to a given audio segment. Hence, a relation that has never been learned before, or could not have been learned before, can be enabled.
  • a proxy task can be created and this proxy task can create a problem for which an associated label is known.
  • the sequence can be unsynchronized or frames can be jumbled and the neural network can be requested to predict if the sequence is in a correct order or if the sequence is synchronized.
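  • As a hedged sketch of such a proxy task, the helper below (its name, the use of per-segment feature tensors, and the binary synchronized/unsynchronized label are assumptions for illustration) rolls the audio segments against the visual segments so that the label is known by construction.

```python
import random
import torch

def make_sync_proxy_sample(audio_feats, visual_feats):
    """audio_feats, visual_feats: (T, d) per-segment features from the same video.
    Returns a (possibly unsynchronized) pair and a label that is known by construction."""
    if random.random() < 0.5:
        label = 1                                             # synchronized: streams left untouched
    else:
        label = 0                                             # unsynchronized: audio shifted against visual
        shift = random.randint(1, audio_feats.shape[0] - 1)
        audio_feats = torch.roll(audio_feats, shifts=shift, dims=0)
    return audio_feats, visual_feats, label

# A classifier trained to predict `label` must learn which visual segment
# corresponds to a given audio segment, which is the purpose of the proxy task.
```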
  • Embodiments of this disclosure can also construct a GNN to express a node’s representation with respect to each type of relationship in which it is involved. Some embodiments can also refine these representations based on how each relationship affects one another.
  • Relational graph convolutional transformers (RGCTs) can be used to refine node representations according to their temporal and cross-modal neighbors in the graph. RGCTs can perform this refinement by aggregating node-level (a node is also commonly known as a segment) representations from a particular neighborhood of nodes to collect the temporal and cross-modality associations of the nodes in the neighborhood. RGCTs can refine the temporal and cross-modal representations of a node because a node can be characterized by its association with its neighbor nodes.
  • RGCTs can also refine a node representation of one relation type with respect to the other relations using a scaled dot product attention mechanism.
  • the scaled dot product attention mechanism can allow node refinement based only on a subset of important relations of the R relation types that can be associated to the node being refined.
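  • A minimal sketch of the scaled dot product attention mechanism referred to here, with shapes and names chosen as illustrative assumptions (the full relation-wise formulation is described with FIG. 8):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(queries, keys, values):
    """queries, keys, values: (R, d) tensors, one row per relation type."""
    d = queries.shape[-1]
    scores = queries @ keys.transpose(-2, -1) / d ** 0.5   # (R, R): each query against each key
    attn = F.softmax(scores, dim=-1)                       # weights in (0, 1); larger inputs get larger weights
    return attn @ values                                   # (R, d) attended representation

# Toy usage with R = 4 relation types and feature length d = 8 (assumed sizes).
out = scaled_dot_product_attention(torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8))
```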
  • Embodiments of this disclosure can synthesize a replication data augmentation technique.
  • Replication can be used to transform a small dataset of videos into a significantly larger dataset.
  • Replication can also introduce diversity while preserving the semantic concept of the original event-level segment sequence.
  • the replicate and link data augmentation technique used by embodiments of this disclosure can yield a large amount of temporally un-constrained natural videos (replicas) based on the combination of state-based clips of existing videos.
  • These existing videos can have a fixed background (BG) -foreground (FG) segment pattern and the replicas can also comprise the same segment pattern as the existing videos.
  • FIG. 1 illustrates the audio 170 and visual 130 streams of a sequential processing perspective of a video.
  • Visual stream 130 of the video can have a total length of 10 seconds and since visual stream 130 can include 10 segments, each visual segment 140, visual segment 142, visual segment 144, visual segment 146, visual segment 148, visual segment 150, visual segment 152, visual segment 154, visual segment 156, and visual segment 158 can be one second in duration.
  • These 10 visual segments can be processed as a sequence of frames using traditional spatio-temporal processing techniques.
  • Audio stream 170 of the video can be related to visual stream 130.
  • Audio stream 170 can be comprised of 10 one-second audio segments: audio segment 180, audio segment 182, audio segment 184, audio segment 186, audio segment 188, audio segment 190, audio segment 192, audio segment 194, audio segment 196, and audio segment 198.
  • Each of the 10 visual segments can be processed sequentially by a visual processor.
  • the sequence of processing illustrated by FIG. 1 is: frame 140 is processed by the visual processor at 110a, frame 142 at 110b, frame 144 at 110c, frame 146 at 110d, frame 148 at 110e, frame 150 at 110f, frame 152 at 110g, frame 154 at 110h, frame 156 at 110i, and frame 158 at 110j.
  • Each of the 10 audio segments can be processed sequentially by an audio processor.
  • the sequence of processing illustrated by FIG. 1 is: frame 180 is processed by the audio processor at 160a, frame 182 at 160b, frame 184 at 160c, frame 186 at 160d, frame 188 at 160e, frame 190 at 160f, frame 192 at 160g, frame 194 at 160h, frame 196 at 160i, and frame 198 at 160j.
  • FIG. 2a illustrates the visual stream 130 of a graph perspective.
  • Some embodiments of this disclosure can fix the minimum temporal length of processing (unit temporal length) into segments.
  • Some embodiments of this disclosure can aggregate video data in a modality into variable sized segments. The aggregation can be based on similarity between initial features or other modality particular criteria.
  • Frames 140, 142, 144, 146, 148, 150, 152, 154, 156, and 158 of visual stream 130 can be represented as nodes 208 in a graph.
  • the features of these nodes can be derived by segment-wise processing of an individual modality.
  • the edges of the graph can be defined as relationships between the segments represented by the nodes.
  • this feature wise processing can derive node V 1 210 to represent visual frame 140.
  • This processing can also derive node 212 to represent 142, node 214 to represent 144, node 216 to represent 146, node 218 to represent 148, node 220 to represent 150, node 222 to represent 152, node 224 to represent 154, node 226 to represent 156, and node 228 to represent 158.
  • the edges between nodes 210 and 212 can be represented by visual temporal forward connection 230a and by visual temporal backward connection 240a.
  • the edges between nodes 212 and 214 can be represented by visual temporal forward connection 230b and by visual temporal backward connection 240b.
  • the edges between nodes 214 and 216 can be represented by visual temporal forward connection 230c and by visual temporal backward connection 240c.
  • the edges between nodes 216 and 218 can be represented by visual temporal forward connection 230d and by visual temporal backward connection 240d.
  • the edges between nodes 218 and 220 can be represented by visual temporal forward connection 230e and by visual temporal backward connection 240e.
  • the edges between nodes 220 and 222 can be represented by visual temporal forward connection 230f and by visual temporal backward connection 240f.
  • the edges between nodes 222 and 224 can be represented by visual temporal forward connection 230g and by visual temporal backward connection 240g.
  • edges between nodes 224 and 226 can be represented by visual temporal forward connection 230h and by visual temporal backward connection 240h.
  • the edges between nodes 226 and 228 can be represented by visual temporal forward connection 230i and by visual temporal backward connection 240i.
  • FIG. 2b illustrates the audio stream 170 of a graph perspective.
  • the audio frames 180, 182, 184, 186, 188, 190, 192, 194, 196, and 198 of audio stream 170 can be represented by nodes 248.
  • the features of nodes 248 can be derived by segment-wise processing of an individual modality.
  • the edges of the graph can be defined as relationships between the segments represented by the nodes.
  • this feature wise processing can derive node A 1 250 to represent audio frame 180.
  • This processing can also derive node 252 to represent 182, node 254 to represent 184, node 256 to represent 186, node 258 to represent 188, node 260 to represent 190, node 262 to represent 192, node 264 to represent 194, node 266 to represent 196, and node 268 to represent 198.
  • the edges between nodes 250 and 252 can be represented by audio temporal forward connection 270a and by audio temporal backward connection 280a.
  • the edges between nodes 252 and 254 can be represented by audio temporal forward connection 270b and by audio temporal backward connection 280b.
  • the edges between nodes 254 and 256 can be represented by audio temporal forward connection 270c and by audio temporal backward connection 280c.
  • the edges between nodes 256 and 258 can be represented by audio temporal forward connection 270d and by audio temporal backward connection 280d.
  • the edges between nodes 258 and 260 can be represented by audio temporal forward connection 270e and by audio temporal backward connection 280e.
  • the edges between nodes 260 and 262 can be represented by audio temporal forward connection 270f and by audio temporal backward connection 280f.
  • the edges between nodes 262 and 264 can be represented by audio temporal forward connection 270g and by audio temporal backward connection 280g.
  • edges between nodes 264 and 266 can be represented by audio temporal forward connection 270h and by audio temporal backward connection 280h.
  • the edges between nodes 266 and 268 can be represented by audio temporal forward connection 270i and by audio temporal backward connection 280i.
  • FIG. 3 illustrates an example of an embodiment of this disclosure with three visual nodes and three associated audio nodes and their connections.
  • the types of connections can be visual temporal forward, visual temporal backward, audio temporal forward, audio temporal backward, visual to audio, and audio to visual.
  • Embodiments of this disclosure can include the cross-modal type connections between audio nodes and the video nodes. These cross-modal type connections can be in addition to the temporal forward connections and temporal backward connections between nodes of the same type.
  • visual node V 1 210 can be connected to visual node V 2 212 by visual temporal forward node 340 and also by visual temporal backward node 240a.
  • Visual node V 2 212 can be connected to visual node V 3 214 by visual temporal forward node 345 and also by visual temporal backward node 240b.
  • FIG. 3 also illustrates audio node A 1 250 which can be connected to audio node A 2 252 by audio temporal forward node 330 and also by audio temporal backward node 280a. Audio node A 2 252 can be connected to audio node A 3 254 by audio temporal forward node 335 and also by audio temporal backward node 280b.
  • Inter-modal connections are also illustrated by FIG. 3.
  • visual node V 1 210 can be connected to audio node A 1 250 by visual to audio connection 320 and audio to visual connection 310.
  • Visual node V 2 212 can be connected to audio node A 2 252 by visual to audio connection 325 and audio to visual connection 315.
  • Visual node V 3 214 can be connected to audio node A 3 254 by visual to audio connection 328 and audio to visual connection 318.
  • FIG. 4 illustrates a non-limiting example of an adjacency matrix.
  • Each entry in this matrix can record the relationship between the source and destination node and therefore can represent an edge in the graph.
  • Each edge type can semantically designate the temporal direction and also the cross-modal relationships between nodes.
  • the source nodes illustrated by FIG. 4 can be represented by the columns and the destination nodes can be represented by the rows.
  • the source nodes are A 1 505, A 2 510, A 3 515, V 1 520, V 2 525 and V 3 530 and the destination nodes are A 1 405, A 2 410, A 3 415, V 1 420, V 2 425 and V 3 430.
  • the edge between source node A 1 505 and destination node A 2 410 can be ATB 610.
  • ATB can stand for audio temporal backward connection.
  • source node A 3 515 and destination node A 2 410 can have the edge ATF 710.
  • ATF can stand for audio temporal forward connection.
  • Another non-limiting example illustrated by FIG. 4 is that source node V 2 525 and destination node A 2 410 can have edge A2V 810.
  • A2V can stand for audio to visual connection.
  • a blank entry in this matrix can indicate that there is no edge (also known as a predefined relation) between the source and destination nodes.
  • Adjacency matrix data can be processed by GNNs, GCNs, RGCNs, and GANs.
  • the features of a GNN’s nodes can be updated based on a function and/or transformation of the node and also a function and/or transformation of the neighbors of the node by processing the adjacency matrix data.
  • updating a node’s representation in a GNN can be performed by processing the data represented by the adjacency matrix between the node and all of the neighbor nodes connected to the node.
  • these functions and/or transformations can be linear or affine transformations and in other embodiments they can be neural networks.
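  • A minimal sketch of such an adjacency-driven node update, under assumptions not stated in the disclosure (mean aggregation per relation type, one linear transformation per relation, and the three audio plus three visual nodes of FIG. 3 and FIG. 4):

```python
import torch
import torch.nn as nn

RELATIONS = ["ATF", "ATB", "VTF", "VTB", "A2V", "V2A"]   # edge types named in FIG. 4
N, d = 6, 16                                             # nodes A1, A2, A3, V1, V2, V3; feature length

# adj[r, dst, src] = 1 if the FIG. 4 entry at (destination row, source column) holds relation r.
adj = torch.zeros(len(RELATIONS), N, N)
adj[RELATIONS.index("ATB"), 1, 0] = 1                    # destination A2, source A1: ATB 610
adj[RELATIONS.index("ATF"), 1, 2] = 1                    # destination A2, source A3: ATF 710
adj[RELATIONS.index("A2V"), 1, 4] = 1                    # destination A2, source V2: A2V 810
# ... remaining entries omitted for brevity

x = torch.randn(N, d)                                        # current node features
W_rel = nn.ModuleList([nn.Linear(d, d) for _ in RELATIONS])  # one transformation per relation
W_self = nn.Linear(d, d)                                     # transformation of the node itself

def relational_update(x):
    out = W_self(x)
    for r, W in enumerate(W_rel):
        degree = adj[r].sum(dim=1, keepdim=True).clamp(min=1)
        neighbors = adj[r] @ x / degree                  # mean of the neighbors under relation r
        out = out + W(neighbors)                         # add the transformed neighbor information
    return out

x_updated = relational_update(x)                         # every node refreshed from its neighborhood
```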
  • FIG. 5 illustrates an update of visual node V 2 212.
  • V 2 can be updated based on its temporal backward counterpart 240b with node V 3 214, V 2 's temporal forward counterpart 230a with node V 1 210, and V 2 's connection 315 with audio node A 2 252.
  • FIG. 6 illustrates an update of audio node A 2 252.
  • A 2 252 can be updated based on its temporal backward counterpart 280b with node A 3 254, A 2 's temporal forward counterpart 270a with node A 1 250, and A 2 's cross-modal connection 325 with video node V 2 212.
  • dual perspective processing can be used to process video data using both perspective 1 and perspective 2
  • dual perspective processing can benefit from the advantages provided by processing perspective 1 and by processing perspective 2, and also enables refinement of features for each segment.
  • an advantage of processing perspective 1 can be that long term dependencies can be resolved within a single modality stream, so that processing a stream of data local to its modality can lead to learning patterns without corruption from cross-modal data.
  • Cross-modal corruption can occur due to cross-modal processing modules that may not interact with each other in terms of long-term dependencies that would have been available in the previous sequential processing step.
  • the advantages of processing perspective 2 can be that updates can occur naturally due to the neighborhood relationship of the nodes. As a non-limiting example, when a node update is performed, the temporal neighborhood can be aggregated and information from other related modalities can be learned. Another advantage of perspective 2 is that short-term dependencies can be resolved while achieving feature fusion between modalities within a local temporal neighborhood.
  • FIG. 7a illustrates an embodiment of this disclosure that implements alternating dual perspective processing of a video. Since any form of processing within the same perspective can be engulfed as a single process within that perspective, alternating dual perspective processing can involve processing one or more video segments extracted into perspective 1 or perspective 2, processing the output representations in the other perspective, and then processing those output representations in the first perspective again. As a non-limiting example, if perspective 1 is processed followed by perspective 2 followed by perspective 1, one or more video segments can be extracted into perspective 1 and processed, the output of perspective 1 processing can be input into perspective 2 and processed, the output of perspective 2 processing can be input into perspective 1 and processed, and so on.
  • DPNs can perform tasks that can include event or activity localization, video-level classification, and clip-level classification. These tasks can be performed by additional neural networks or feature aggregation modules that can be attached to the end of the DPN.
  • Embodiments of this disclosure can utilize DPNs to perform dual perspective processing by alternating between a graph perspective 208 and 248 and a sequential perspective 740 and 745.
  • FIG. 7a illustrates an embodiment of this disclosure where visual stream 130 and audio stream 170 of the video first being extracted 710 into audio sequence 715 and also video sequence 720.
  • sequence to graph perspective change 725 can convert audio sequence 715 into audio graph 248 and sequence to graph perspective change 725 can convert video sequence 720 into video graph 208.
  • temporally ordered audio and visual segments can be converted into unordered graph nodes with temporally directed and cross-modal edges.
  • the sequence to graph perspective change can create a graph with 20 nodes: 10 nodes correspond to the audio component of the sequence and 10 nodes correspond to the visual component of the sequence.
  • node indexing can be used to keep track of the original temporal order of the sequence prior to performing a sequence to graph perspective change so that the original temporal order can be reinstated when performing a graph to sequence perspective change.
  • Audio graph 248 and video graph 208 can be processed by a relational GNN 730, which can be implemented using a relational graph convolutional transformer (RGCT) , and the result of this processing by RGCT 730 can be converted into audio sequence 740 and video sequence 745 by graph to sequence perspective change 735. Audio sequence 740 and video sequence 745 can then be processed individually as modality streams using spatial and/or temporal processing methods by sequence processor 750.
  • Graph to sequence perspective change can be the conversion of the unordered graph nodes back to the individual audio and visual sequences. Node indexing performed prior to performing a sequence to graph perspective change can be used to reconstruct the ordered audio and visual sequence.
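  • A hedged sketch of these two perspective changes (the data layout and function names are assumptions): node indices recorded during the sequence to graph perspective change are used to reinstate the original temporal order during the graph to sequence perspective change.

```python
import torch

def sequence_to_graph(audio_seq, visual_seq):
    """audio_seq, visual_seq: (T, d) temporally ordered segment features of one video."""
    T = audio_seq.shape[0]
    nodes = torch.cat([audio_seq, visual_seq], dim=0)     # 2T unordered graph nodes
    index = {"audio": list(range(0, T)),                  # node indexing: remember each node's
             "visual": list(range(T, 2 * T))}             # modality and temporal position
    return nodes, index

def graph_to_sequence(nodes, index):
    audio_seq = nodes[index["audio"]]                     # reinstate the original temporal order
    visual_seq = nodes[index["visual"]]
    return audio_seq, visual_seq

# Example: 10 one-second segments per modality give a graph of 20 nodes.
a, v = torch.randn(10, 128), torch.randn(10, 128)
nodes, index = sequence_to_graph(a, v)
a_back, v_back = graph_to_sequence(nodes, index)
```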
  • This process of alternating between processing perspective 1 and perspective 2 can be performed N 752 times, and after being processed N times the result is further processed by video classification block 755 and event/activity localization block 765.
  • Video classification block 755 can generate video class prediction 760 based on the result of the DPN processing.
  • Event/activity localization block 765 can generate a fixed background (BG) –foreground (FG) segment pattern 770 based on the result of the DPN processing.
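  • Putting these pieces together, the following is a hedged sketch of the alternating loop of FIG. 7a, reusing the sequence_to_graph and graph_to_sequence helpers sketched above; the stand-in modules for the RGCT and the sequence processor are placeholders, not the disclosed networks.

```python
import torch
import torch.nn as nn

def alternating_dpn(audio_seq, visual_seq, graph_processor, sequence_processor, N=2):
    """Alternate graph-perspective and sequence-perspective processing N times (FIG. 7a)."""
    for _ in range(N):
        nodes, index = sequence_to_graph(audio_seq, visual_seq)    # perspective change 725
        nodes = graph_processor(nodes)                             # relational GNN / RGCT 730
        audio_seq, visual_seq = graph_to_sequence(nodes, index)    # perspective change 735
        audio_seq = sequence_processor(audio_seq)                  # sequence processor 750
        visual_seq = sequence_processor(visual_seq)
    return audio_seq, visual_seq                                   # fed to blocks 755 and 765

# Linear layers stand in for the RGCT and the spatio-temporal sequence processor.
d = 128
graph_stub, seq_stub = nn.Linear(d, d), nn.Linear(d, d)
a_out, v_out = alternating_dpn(torch.randn(10, d), torch.randn(10, d), graph_stub, seq_stub)
```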
  • inputs can include visual features related to video segment 130, which can be extracted using a suitable artificial intelligence computer vision based model such as a VGG19 neural network.
  • Further inputs can include audio signals related to video segment 130, which can be extracted utilizing a suitable artificial intelligence audio-based model such as a VGGish network.
  • Outputs can be event distribution at the segment level over all events from both modalities to allow for separate audio/visual event localization. Further outputs can be from a gated/aggregated/combined output between both modalities and can be used for audio-visual event localization.
  • Embodiments of this disclosure can implement a DPN in audio-visual event (AVE) localization that can be used for strongly supervised AVE localization and/or weakly supervised AVE localization.
  • In strongly supervised AVE localization, segment level labels are predicted given that the supervision can be ground truth segment level labels.
  • In weakly supervised AVE localization, segment level labels are predicted given that the supervision can be ground truth video level labels.
  • FIG. 7b illustrates an embodiment of this disclosure that performs parallel dual perspective processing.
  • Parallel dual perspective processing can utilize many of the same functions that can be used in alternating dual perspective processing.
  • sequence to graph perspective change 725 can convert audio sequence 715 into audio graph 248 and sequence to graph perspective change 725 can convert video sequence 720 into video graph 208.
  • Sequence to sequence perspective change 737 can convert audio sequence 715 into audio sequence 740 and sequence to sequence perspective change 737 can convert video sequence 720 into video sequence 745.
  • a reason for using sequence to sequence perspective change 737 to convert audio sequence 715 to audio sequence 740 and to convert video sequence 720 into video sequence 745 can be that, in the graph representation of a video, all audio and video segments are represented as individual nodes.
  • temporal sequencing can be lost since nodes can be unordered. That being said, when nodes are connected using temporally directed connections, the sequencing can be retained by use of node chaining.
  • a video with ten segments can have 20 interconnected nodes in the graph. Therefore, in the sequential stream representation of the graph, these 20 nodes can be indexed into both audio and visual segments and temporally sequenced.
  • sequence to sequence perspective change 737 can format data into video sequence 745 and audio sequence 740 so that these sequences can be in the correct format for processing by processor 750.
  • audio graph 248 and video graph 208 can be processed by RGCT 730 and audio sequence 740 and video sequence 745 can be processed by using spatial and/or temporal processing methods by processor 750.
  • Optional merge and split 775 can then optionally interchange information using a merge operation.
  • This merge operation can be implemented using a statistical function or a neural network. If Optional merge and split 775 performed a merge, an optional split operation can be carried out using statistical functions or a neural network. This merge allows parallel perspective processing to be carried out N 752 times.
  • FIG. 8 illustrates an embodiment of this disclosure for a relational graph convolution transformer (RGCT) .
  • This RGCT can consider each node 805 to assume a polymorphic role dependent on the relation type. Therefore, each node 805 can be transformed based on relation type.
  • the aggregation of the relation-wise neighborhood nodes of each of the nodes 805 can be transformed into key and value transformations while considering each node's polymorphic query transformation in the relation space.
  • node 805 can have a given relational neighborhood. Since polymorphism can be the ability of an entity to exist across multiple formations and interpretations of itself, a node can be projected using a query transformation for relation type 1, which can be temporally forward, temporally backward, audio to visual, or visual to audio. Therefore, a node can be projected into R different types of query transformations, and each of the node's neighbor aggregations of different types can be projected using different key and value transformations.
  • NA 1 855 can represent the aggregation of all temporal forward connections to neighbor nodes of node 805
  • NA 2 860 can represent the aggregation of all temporal backward connections to neighbor nodes of node 805
  • NA 3 865 can represent the aggregation of all visual to audio connections to neighbor nodes of node 805.
  • These NA 1 855, NA 2 860, and NA 3 865 can be neighborhood aggregation results that can be computed using aggregation mechanisms such as Equation 1.
  • NA 1 855, NA 2 860, and NA 3 865 can be projected to keys and values per relation type via learnable parameters W k 810, 820, 830 and W v 814, 824, 834.
  • FIG. 8 illustrates node 805 respectively projected to query via learnable parameters.
  • Equation 1 can express an aggregation function that can perform an aggregation of node 805’s neighborhood nodes by averaging all of the nodes which can exist in that relational neighborhood.
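  • A short sketch of this relation-wise neighborhood aggregation (Equation 1), assuming the averaging formulation described above:

```python
import torch

def neighborhood_aggregation(x, adj_r):
    """x: (N, d) node features; adj_r: (N, N) adjacency for a single relation type r.
    Row i of the result is NA_r for node i: the average of its neighbors under relation r."""
    degree = adj_r.sum(dim=1, keepdim=True).clamp(min=1)
    return adj_r @ x / degree
```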
  • Equation 2 can express a query of node 805 concatenated over different types of relations, which can include query of relation type 1 840, query of relation type 2 845, and query of relation type 3 850. Therefore, equation 2 can represent R query representations of feature length d.
  • In Equation 2, a concatenation operator joins the relation-wise queries, and Q r is the query of relation r.
  • Equation 3 can express a key transformation of the neighborhood aggregation that is a concatenation of all keys of the different relations (an Rxd matrix) .
  • NA r (Ref) is the neighborhood aggregation of relation type r.
  • K r is the key of relation r.
  • Equation 4 can express the value.
  • V r is the value of relation r.
  • Equation 5 can express the values in the relational graph convolution transformer attention map illustrated by FIG. 9. It should be appreciated that the Softmax function can bound the attention values between 0 and 1 and can allocate larger values to larger inputs.
  • the query term, Q (Ref) can produce a Rxd matrix. Since the key is a Rxd matrix that can be transposed and multiplied with the Rxd query matrix, an RxR attention map matrix can result.
  • K T (NA (Ref) ) is a key transpose.
  • the relational graph convolution transformer attention map illustrated by FIG. 9 can be targeted to capture how each node 805’s neighborhood pertaining to a relation type can influence node 805’s relation wise polymorphic representations. It should be appreciated that the Softmax function of equation 5 can be applied row wise and concatenated over relation types. Concatenation over relation type can be achieved by stacking the query transformations one on top of the other and therefore concatenating R query transformations can result in a Rxd matrix.
  • Equation 6 can be used to apply the RxR matrix attention map to the Rxd matrix value 816, 826, 836 to, as a result of matrix multiplication, result in an Rxd matrix.
  • Att NA (Ref) = Att Map (Ref) · V (NA (Ref) )     (6)
  • Att NA can be the attended neighbourhood aggregation over all relation types.
  • V (NA (Ref) ) is the value of the neighborhood aggregation.
  • Equation 7 can be used to transform, using a neural network, a "d" length representation resulting from an average of Att NA (Ref) over R.
  • Ref new = ψ (W ψ, φ (W φ, Ref old ) + Avg r (Att NA (Ref old ) ) )     (7)
  • φ (W φ, Ref old ) can be the transformation of the old representation of node 805 parameterized by W φ,
  • ψ (W ψ, Ref intermed ) can be the transformation of the intermediate representation of node 805 parameterized by W ψ.
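  • The following is a hedged end-to-end sketch of Equations 2 through 7 for a single node; the per-relation linear projections, the scaling by the square root of d, and the exact form of the final update are reconstructions from the surrounding description rather than a verbatim implementation of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RGCTNodeUpdate(nn.Module):
    """Single-node sketch of Equations 2-7 with R relation types and feature length d."""
    def __init__(self, R, d):
        super().__init__()
        self.W_q = nn.ModuleList([nn.Linear(d, d) for _ in range(R)])  # per-relation queries
        self.W_k = nn.ModuleList([nn.Linear(d, d) for _ in range(R)])  # per-relation keys
        self.W_v = nn.ModuleList([nn.Linear(d, d) for _ in range(R)])  # per-relation values
        self.W_phi = nn.Linear(d, d)   # transformation of the old node representation
        self.W_psi = nn.Linear(d, d)   # transformation of the intermediate representation (Eq. 7)
        self.d = d

    def forward(self, ref, NA):
        # ref: (d,) feature of node 805; NA: (R, d) neighborhood aggregations from Equation 1.
        Q = torch.stack([W(ref) for W in self.W_q])               # (R, d) concatenated queries (Eq. 2)
        K = torch.stack([W(n) for W, n in zip(self.W_k, NA)])     # (R, d) keys (Eq. 3)
        V = torch.stack([W(n) for W, n in zip(self.W_v, NA)])     # (R, d) values (Eq. 4)
        att_map = F.softmax(Q @ K.t() / self.d ** 0.5, dim=-1)    # (R, R) attention map (Eq. 5)
        att_na = att_map @ V                                      # (R, d) attended aggregation (Eq. 6)
        return self.W_psi(self.W_phi(ref) + att_na.mean(dim=0))   # refined node representation (Eq. 7)

# Toy usage with R = 3 relation types (as in FIG. 8) and d = 16.
update = RGCTNodeUpdate(R=3, d=16)
new_ref = update(torch.randn(16), torch.randn(3, 16))
```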
  • Embodiments of this disclosure can implement a replicate and link video augmentation technique. It should be appreciated that when a node update is performed, the neighbourhood of node V 2 212 may only have one node for each of the relation types. This limits V 2 212’s learning capacity because neighbourhood aggregation may not be performed since there is nothing to aggregate with only one node. Therefore, diversity can be introduced by linking graphs from different videos together to enrich the representation of node V 2 212’s update. Neighbor aggregation can perform feature interpolation, but interpolation can only be performed if V 2 212 has two or more neighbours. Also feature interpolation can be different than using only one neighbour (V 1 210 as a non-limiting example) .
  • This difference can occur when performing feature interpolation so that node V 2 212 effectively can "see" a node whose aggregate representation was not directly in the training set. Seeing a node that is not directly in the training set can be valuable because, if the node is not in the training set, node V 2 212 is seeing nodes from a different video in the feature space, and the more nodes that can be "seen" in the feature space, the greater the improvement in the operation of the GNN.
  • the GNN can be improved because when the GNN creates boundaries and segregates features, it may need to know if it can go in between two features. Therefore, if interpolation has been performed between two, three, or four different neighbours, the GNN can perform a more optimized boundary creation.
  • FIG. 10 illustrates a table of states and composition/extraction. These states can be decomposed videos that can be decomposed using the ground truth annotation for existing training videos. States, or event sequences, can be one of three types: event initiation or start, event continuation, and event termination or end. Each state can represent a sub-sequence of foreground (FG) and/or background (BG) events which can be searched, and segments of the sub-sequence can be extracted. As a non-limiting example, search for and extract the sub-sequence <BG, FG> 1005 as START_2 1010. Similarly, FIG. 10 provides examples of search and extract sub-sequences for N segment sized videos where the extracted sub-sequence is bolded and underlined.
  • FIG. 11 illustrates a non-limiting example of an activation map related to the states provided in FIG. 10. From FIG. 10, sub-sequences of one or more audio frames and/or one or more visual frames and/or one or more audio features and/or one or more visual features, which have been statistically precomputed or are the output of a neural network, extracted from each video can be stored in state specific databases 1210, 1213, 1215, 1218 of FIG. 12a.
  • Diversity can be achieved by creating a replica of the original sample video.
  • the original sample video can have a particular event sequence of FG and BG.
  • the replica can also be created with the same event sequence as the event sequence of the original sample video.
  • the replica should not be nonsense; it should have some semantic meaning. Therefore, the replica can be created by dividing sample videos in the training set and decomposing them into state features.
  • FIG. 12a illustrates the process of generating a replica.
  • FIG. 12b illustrates the original sample video’s visual stream 130, divided into visual frames 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, associated audio stream 170 divided into audio frames 180, 182, 184, 186, 188, 190, 192, 194, 196, 198 and the event sequence 770 and events 1222, 1224, 1226, 1228, 1230, 1232, 1234, 1236, 1238, 1240.
  • An event boundary can occur when there is a transition between consecutive events.
  • event 1222 is BG and event 1224 is FG indicating that a transition has occurred between video frames 140 and 142.
  • event 1224 is FG and event 1226 is also FG, which can mean that two continuous FG events can indicate that an event is already in progress.
  • events 1224, 1226, and 1228 are all FG; because there are three continuous FGs, this can be a stronger indication that the event occurring in the middle of the three FGs is happening. However, since there are three FG events, it is not known if the event has finished.
  • a BG event is a temporal region that is not of interest, whereas a FG region is a temporal segment which can be important. This is because, from an action localization perspective, a BG event is an event where the action does not happen and an FG event is where the event has happened in terms of the modality of interest. It should also be appreciated by those skilled in the art that a FG event can be an event which is both audible and visible and should therefore exist in both a frame of visual stream 130 and an associated frame of audio stream 170.
  • FIG. 12c illustrates the replica video’s visual stream 1254, audio stream 170 and event sequence 1280.
  • the events 1222 and 1282 are equal, 1224 equals 1284, 1226 equals 1286, 1228 equals 1288, 1230 equals 1290, 1232 equals 1292, 1234 equals 1294, 1236 equals 1296, 1238 equals 1298, and 1240 equals 1299. Therefore, since the event sequence 770 of the original video equals the event sequence 1280 of the replica, the replica can have semantic meaning. Also, diversity has been achieved since the replica's third visual frame 1262 and audio frame 1270 are different from the original video's third visual frame 144 and audio frame 184. Diversity is further achieved because:
  • Replica’s fourth visual frame 1264 and audio frame 1272 are different than the original video’s fourth visual frame 146 and audio frame 186.
  • Replica’s eighth visual frame 1266 and audio frame 1274 are different than the original video’s eighth visual frame 154 and audio frame 194.
  • Replica’s ninth visual frame 1268 and audio frame 1276 are different than the original video’s ninth visual frame 156 and audio frame 196.
  • Embodiments of the disclosure can create a replica using states and composition/extraction illustrated by the entries of FIG. 10’s table because videos can be decomposed into state database states for a given pattern.
  • Decomposition can begin by extracting one or more FG and BG audio frames and visual frames and/or one or more audio features and/or one or more visual features which have been statistically precomputed or the output of a neural network for a particular class by extracting states illustrated by FIG. 10 for videos of a certain type (as a non-limiting example this type can be of a male speaking) and storing these extracted states in a particular state database 1210, 1213, 1215, 1218.
  • for state START_1 1020 the corresponding composition/extraction is <FG, Next N-1> 1025.
  • Embodiments of this disclosure extract the underlined FG or BG composition/extraction field and search based on the non-underlined field(s).
  • the non-limiting example of state START_1 1020 can result in searching the FG and BG columns of the START_1 1020 state in activation map FIG. 11’s table for a single FG 1110 that can be extracted. The remaining N-1 columns of FIG. 11’s START_1 1020 state are not searched.
  • CONTINUE_1 1030 can extract FG 1135 and 1140.
  • state CONTINUE_2 1040 of FIG. 10 with composition/extraction ⁇ FG, FG , FG , FG> 1045 can result in searching the FG and BG columns of the CONTINUE_2 1040 state of FIG. 11 for four consecutive FG and extract the middle two of the four.
  • CONTINUE_2 1040 can extract the pair of FG 1145 and FG 1150, and the pair of FG 1150 and FG 1155.
  • state END_1 1050 of FIG. 10 with composition/extraction <First N-1, FG> 1055 can result in searching the FG column of the END_1 1050 state of FIG. 11 for a single FG that can be extracted.
  • state END_2 1060 of FIG. 10 with composition/extraction 1065 can result in searching the FG and BG columns of END_2 1060 of FIG. 11 for a FG 1165 and BG 1170 which are extracted.
  • state BG_1 1070 with composition/extraction ⁇ BG > 1075 can result in searching the FG and BG columns of BG_1 1070 of FIG. 11 for BG 1175 and BG above 1185 which is extracted.
  • BG_2 1080 with composition/extraction ⁇ BG , BG > 1085 can result in searching the FG and BG columns of BG_2 1080 of FIG. 11 for two continuous BG 1180 and 1185 which are both extracted.
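  • A hedged sketch of this search-and-extract step for a few of the states of FIG. 10 (the pattern dictionary is an assumption that mirrors the described sub-sequences; the indices mark which positions of a matched window are extracted):

```python
# Each state maps to (search pattern, indices within the matched window to extract),
# mirroring the documented examples from FIG. 10 / FIG. 11 (hypothetical data layout).
STATE_PATTERNS = {
    "CONTINUE_2": (["FG", "FG", "FG", "FG"], [1, 2]),  # four consecutive FG: extract the middle two
    "END_2":      (["FG", "BG"], [0, 1]),              # an FG followed by a BG: extract both
    "BG_2":       (["BG", "BG"], [0, 1]),              # two continuous BG: extract both
}

def extract_states(event_sequence, segments):
    """event_sequence: list of "FG"/"BG" labels per segment; segments: the segment data.
    Returns {state: [extracted segment tuples]} ready to be stored in state databases."""
    extracted = {state: [] for state in STATE_PATTERNS}
    for state, (pattern, keep) in STATE_PATTERNS.items():
        width = len(pattern)
        for i in range(len(event_sequence) - width + 1):
            if event_sequence[i:i + width] == pattern:
                extracted[state].append(tuple(segments[i + j] for j in keep))
    return extracted

# Toy event sequence (assumed for illustration, not the sequence 770 of FIG. 12b).
events = ["BG", "FG", "FG", "FG", "FG", "BG", "BG", "FG", "FG", "BG"]
states = extract_states(events, segments=list(range(10)))
```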
  • a given video can consist of an event sequence and a state sequence can be extracted pertaining to this specific event sequence.
  • visual stream 130 and audio stream 170 are of a male speaking.
  • the databases of a male speaking 1210, 1213, 1215, and 1218 are queried and features pertaining to a particular state of the male speaking are randomly sampled. These random samples can be stitched together to create replica visual stream 1254 and audio stream 1258.
  • the replica event sequence 1280 includes the same events in the same order as the original event sequence 770.
  • First frame: visual frame 140 and audio frame 180 are the same in both the original and the replica.
  • Second frame: visual frame 142 and audio frame 182 are the same in both the original and the replica.
  • Third frame: visual frame 144 and audio frame 184 of the original and visual frame 1262 and audio frame 1270 of the replica are all of a male speaking.
  • Fourth frame: visual frame 146 and audio frame 186 of the original and visual frame 1264 and audio frame 1272 of the replica are all of a male speaking.
  • Sixth frame: visual frame 150 and audio frame 190 are the same in both the original and the replica.
  • Seventh frame: visual frame 152 and audio frame 192 are the same in both the original and the replica.
  • Tenth frame: visual frame 158 and audio frame 198 are the same in both the original and the replica.
  • Because the replica can be generated randomly, in totality the replica can be new and may not include all the same frames as the original.
  • the replica's state sequence can preserve the semantics of the original video's state sequence and can as a result increase diversity.
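  • A hedged sketch of replica generation as in FIG. 12a (the database layout and function names are assumptions): the original video's state sequence is walked, a random sample is drawn from the matching state database of the same class, and the samples are stitched into the replica.

```python
import random

def generate_replica(state_sequence, state_databases):
    """state_sequence: the ordered states extracted from the original video.
    state_databases: {state_name: [candidate clips of the same class, e.g. "male speaking"]}.
    Returns a replica whose event/state sequence matches the original but whose content is resampled."""
    replica_segments = []
    for state in state_sequence:
        candidates = state_databases[state]      # e.g. the databases 1210, 1213, 1215, 1218
        clip = random.choice(candidates)         # sample randomly from the state specific database
        replica_segments.extend(clip)            # stitch the sampled segments together
    return replica_segments

# Toy usage with hypothetical clips represented as lists of segment identifiers.
dbs = {"START_2": [["bg0", "fg1"]], "CONTINUE_2": [["fg2", "fg3"]], "END_2": [["fg4", "bg5"]]}
replica = generate_replica(["START_2", "CONTINUE_2", "END_2"], dbs)
```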
  • FIG. 13 illustrates the link augmentation based on the diversity resulting from the replica.
  • Link augmentation can be performed to update the nodes of the original graph to include the interconnectivity of nodes of the replica because both the original and the replica can have the same event sequence.
  • Node V 2 212 was previously linked to V 1 210 and V 3 214 and also to A 2 252.
  • Node V 2 212 can be linked to replica nodes A 2 1340, V 1 1310, and V 3 1330.
  • the nodes of the original sample graph and the replica graph can be linked based on the temporal direction and cross modal relationships.
  • Replica nodes A 2 1340, V 1 1310, and V 3 1330 are linked to replica node V 2 1320, and original nodes V 1 210, V 3 214, and A 2 252 can also be linked to replica node V 2 1320 to increase diversity during training.
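  • A minimal sketch of link augmentation as in FIG. 13, assuming the graph is stored as a set of (source, relation, destination) edge triples: every edge whose destination is an original node is mirrored onto the corresponding replica node, and vice versa.

```python
def link_augment(edges, original_to_replica):
    """edges: set of (source, relation, destination) triples from the original and replica graphs.
    original_to_replica: hypothetical mapping such as {"V2": "V2_rep", "V1": "V1_rep", "A2": "A2_rep"}."""
    replica_to_original = {rep: orig for orig, rep in original_to_replica.items()}
    augmented = set(edges)
    for src, rel, dst in edges:
        if dst in original_to_replica:                       # link original neighbors to the replica node
            augmented.add((src, rel, original_to_replica[dst]))
        if dst in replica_to_original:                       # link replica neighbors to the original node
            augmented.add((src, rel, replica_to_original[dst]))
    return augmented

# Toy usage mirroring FIG. 13: neighbors of original V2 gain links to replica V2, and vice versa.
edges = {("V1", "VTF", "V2"), ("V1_rep", "VTF", "V2_rep")}
mapping = {"V1": "V1_rep", "V2": "V2_rep", "V3": "V3_rep", "A2": "A2_rep"}
augmented = link_augment(edges, mapping)
```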
  • Figure 14 illustrates an embodiment of the disclosure that can provide a system architecture 1400.
  • a data collection device 1460 can be configured to collect training data and store this training data in database 1430.
  • the training data in this embodiment of this application can include extracted states in a particular state database.
  • a training device 1420 can generate a target model/rule 1401 based on the training data maintained in database 1430.
  • Training device 1420 can obtain the target model/rule 1401 which can be based on the training data.
  • the target model/rule 1401 can be used to implement a DPN.
  • Training data that can be maintained in database 1430 may not necessarily be collected by the data collection device 1460 but may be obtained through reception from another device.
  • the training device 1420 may not necessarily perform the training with the target model/rule 1401 fully based on the training data maintained by database 1430 but may perform model training on training data that can be obtained from a cloud end or another location.
  • the foregoing description shall not be construed as a limitation for this embodiment of this application.
  • Target module/rule 1401 can be obtained through training via training device 1420.
  • Training device 1420 can be applied to different systems or devices.
  • training device 1420 can be applied to an execution device 1410.
  • Execution device 1410 can be a terminal, as a non-limiting example a mobile terminal, a tablet computer, a notebook computer, an AR/VR device, or an in-vehicle terminal, or can be a server, a cloud end, or the like.
  • Execution device 1410 can be provided with an I/O interface 1412 which can be configured to perform data interaction with an external device. A user can input data to the I/O interface 1412 via customer device 1440.
  • a preprocessing module 1413 can be configured to perform preprocessing that can be based on the input data received from I/O interface 1412.
  • a preprocessing module 1414 can be configured to perform preprocessing based on the input data received from the I/O interface 1412.
  • Embodiments of the present disclosure can include a related processing process in which the execution device 1410 can perform preprocessing of the input data, or the computation module 1411 of execution device 1410 can perform computation, and execution device 1410 may invoke data, code, or the like from a data storage system 1450 to perform corresponding processing, or may store in the data storage system 1450 data, one or more instructions, or the like obtained through corresponding processing.
  • I/O interface 1412 can return a processing result to customer device 1440.
  • training device 1420 may generate a corresponding target model/rule 1401 for different targets or different tasks that can be based on different training data.
  • Corresponding target model/rule 1401 can be used to implement the foregoing target or accomplish the foregoing task.
  • Embodiments of FIG. 14 can enable a user to manually specify input data.
  • the user can perform an operation on a screen provided by the I/O interface 1412.
  • Embodiments of FIG. 14 can enable customer device 1440 to automatically send input data to I/O interface 1412. If the customer device 1440 needs to automatically send input data, authorization from the user can be obtained. The user can specify a corresponding permission using customer device 1440. The user may view, using customer device 1440, the result that can be output by execution device 1410. A specific presentation form may be display content, voice, action, and the like.
  • customer device 1440 may be used as a data collector to collect, as new sampling data, the input data that is input to the I/O interface 1412 and the output result that can be output by the I/O interface 1412. New sampling data can be stored by database 1430. The data may not be collected by customer device 1440; instead, I/O interface 1412 can directly store, as new sampling data, the input data that is input to I/O interface 1412 and the output result that is output from I/O interface 1412 in database 1430.
  • FIG. 14 is a schematic diagram of a system architecture according to an embodiment of the present disclosure. Position relationships between the device, component, module, and the like that are shown in FIG. 14 do not constitute any limitation.
  • Figure 15 illustrates an embodiment of this disclosure that can include a CNN 1500 which may include an input layer 1510, a convolutional layer/pooling layer 1520 (the pooling layer can be optional) , and a neural network layer 1530 (a minimal sketch of such a layer stack is provided after this list) .
  • Convolutional layer/pooling layer 1520 as illustrated by FIG. 15 may include, as a non-limiting example, layers 1521 to 1526.
  • As a non-limiting example, layer 1521 can be a convolutional layer, layer 1522 a pooling layer, layer 1523 a convolutional layer, layer 1524 a pooling layer, layer 1525 a convolutional layer, and layer 1526 a pooling layer.
  • Alternatively, layers 1521 and 1522 can be convolutional layers, layer 1523 a pooling layer, layers 1524 and 1525 convolutional layers, and layer 1526 a pooling layer.
  • an output from a convolutional layer may be used as an input to a following pooling layer or may be used as an input to another convolutional layer to continue convolution operation.
  • the convolutional layer 1521 may include a plurality of convolutional operators.
  • the convolutional operator can also be referred to as a kernel.
  • a role of the convolutional operator in image processing can be equivalent to a filter that extracts specific information from an input image matrix.
  • the convolutional operator may be a weight matrix that can be predefined. In a process of performing a convolution operation on an image, the weight matrix can be slid across the input image one pixel after another (or two pixels after two pixels, depending on the value of the stride) in a horizontal direction to extract a specific feature from the image.
  • a size of the weight matrix can be related to the size of the image.
  • a depth dimension (depth dimension) of the weight matrix can be the same as a depth dimension of the input image.
  • In the convolution operation process, the weight matrix can extend through the entire depth of the input image. Therefore, after convolution is performed with a single weight matrix, a convolutional output with a single depth dimension can be produced.
  • a single weight matrix may not be used in all cases; instead, a plurality of weight matrices with the same dimensions (row x column) can be used, in other words, a plurality of same-sized matrices.
  • Outputs of the weight matrices can be stacked to form the depth dimension of the convolutional image. It can be understood that the dimension herein can be determined by the foregoing “plurality” .
  • weight matrices may be used to extract different features from the image. For example, one weight matrix can be used to extract image edge information. Another weight matrix can be used to extract a specific color from the image. Still another weight matrix can be used to blur unneeded noise in the image.
  • the plurality of weight matrices can have a same size (row x column) . Feature graphs that can be obtained after extraction has been performed by the plurality of weight matrices with the same dimension also can have a same size and the plurality of extracted feature graphs with the same size can be combined to form an output of the convolution operation.
  • Weight values in the weight matrices can be obtained through a large amount of training in an actual application.
  • the weight matrices formed by the weight values can be obtained through training that may be used to extract information from an input image so that the convolutional neural network 1500 can perform accurate prediction.
  • an initial convolutional layer (such as 1521) can extract a relatively large quantity of common features.
  • the common feature may also be referred to as a low-level feature.
  • a feature extracted by a deeper convolutional layer (such as 1526) can become more complex and as a non-limiting example, a feature with a high-level semantics or the like.
  • a feature with higher-level semantics can be applicable to a to-be-resolved problem.
  • a pooling layer usually periodically follows a convolutional layer.
  • one pooling layer may follow one convolutional layer, or one or more pooling layers may follow a plurality of convolutional layers.
  • a purpose of the pooling layer can be to reduce the space size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator to perform sampling on the input image to obtain an image of a relatively small size.
  • the average pooling operator may compute the average of the pixel values within a specific range of the image to generate the average pooling result.
  • the maximum pooling operator may obtain, as a maximum pooling result, a pixel with a largest value within the specific range.
  • an operator at the pooling layer also can be related to the size of the image.
  • the size of the image output after processing by a pooling layer may be smaller than a size of the image input to the pooling layer.
  • Each pixel in the image output by the pooling layer indicates an average value or a maximum value of a subarea corresponding to the image input to the pooling layer.
  • after the convolutional layer/pooling layer 1520, the convolutional neural network 1500 can still be incapable of outputting the desired output information.
  • the convolutional layer/pooling layer 1520 can extract features and reduce the parameters brought by the input image.
  • to generate the desired output information, the convolutional neural network 1500 can generate an output of a quantity of one or a group of desired categories by using the neural network layer 1530.
  • the neural network layer 1530 may include a plurality of hidden layers (such as 1531, 1532, to 153n in FIG. 15) and an output layer 1540.
  • a parameter included in the plurality of hidden layers may be obtained by performing pre-training based on related training data of a specific task type.
  • the task type may include image recognition, image classification, or the like.
  • the output layer 1540 can follow the plurality of hidden layers in the neural network layers 1530.
  • the output layer 1540 can be a final layer in the entire convolutional neural network 1500.
  • the output layer 1540 can include a loss function similar to categorical cross-entropy that is specifically used to calculate a prediction error.
  • the convolutional neural network 1500 shown in FIG. 15 is merely used as an example of a convolutional neural network. In actual application, the convolutional neural network may exist in a form of another network model.
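As a non-limiting example, the following minimal Python (PyTorch-style) sketch illustrates a layer stack in the spirit of FIG. 15: a convolutional layer/pooling layer block in the style of layers 1521 to 1526, followed by hidden layers and an output layer trained with a categorical cross-entropy style loss. The channel counts, kernel sizes, input resolution, and number of classes are illustrative assumptions and are not taken from this disclosure.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Convolutional layer / pooling layer block (compare layers 1521-1526).
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer (compare 1521)
            nn.MaxPool2d(2),                              # pooling layer (compare 1522)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # convolutional layer (compare 1523)
            nn.MaxPool2d(2),                              # pooling layer (compare 1524)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # convolutional layer (compare 1525)
            nn.MaxPool2d(2),                              # pooling layer (compare 1526)
        )
        # Hidden layers (compare 1531...153n) and output layer (compare 1540).
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

if __name__ == "__main__":
    model = SmallCNN()
    images = torch.randn(2, 3, 224, 224)   # a batch of two RGB images
    labels = torch.tensor([1, 3])
    logits = model(images)
    # The output layer is trained with a categorical cross-entropy style loss.
    loss = nn.CrossEntropyLoss()(logits, labels)
    print(logits.shape, loss.item())
```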

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a dual perspective network (DPN) method that relates to the artificial intelligence field, and specifically, to the computer vision field. The DPN can process a video by viewing the video from two different perspectives in order to label event localization within the video. This labeling allows retrieval of one or more video segments similar to that of a keyword being searched.

Description

METHODS AND DEVICES FOR AUDIO VISUAL EVENT LOCALIZATION BASED ON DUAL PERSPECTIVE NETWORKS
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to, and the benefit of, U.S. Provisional Patent Application serial no. 63/241,346, filed September 7, 2021, the contents of which are incorporated herein by reference in their entirety.
FIELD OF THE INVENTION
The present disclosure pertains to the field of video processing using artificial intelligence and in particular to computer vision methods for labelling videos and its constituent segments.
BACKGROUND
Current online and offline video platforms use common full-text based search engines to allow users to search for videos using keywords. However, full-text based search engines cannot search for actions or events within a video unless the actions or events depicted by the video have been labeled.
Accordingly, there is a need for one or more methods that at least partially addresses one or more limitations of the prior art.
The foregoing background information is provided to reveal information believed by the applicant to be of possible relevance to the present disclosure. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art against the present disclosure.
SUMMARY
An aspect of the disclosure provides a method for dual perspective processing. Such a method is executed by a processor and includes extracting a video sequence from a visual stream and extracting an audio sequence from an audio stream associated with the visual stream, converting the video sequence into a video graph using a sequence to  graph perspective change and converting the audio sequence into an audio graph using the sequence to graph perspective change. Such a method further includes processing the video and audio graph together using a relational graph neural network (RGNN) , the RGNN also creating a processed video graph. Such a method further includes converting the processed video and audio graph into a processed video and audio sequence using a graph to sequence perspective change. Such a method further includes processing the processed video sequence using a sequence processor and processing the processed audio sequence using the sequence processor. In some embodiments, the RGNN is a relational graph convolutional transformer (RGCT) . In some embodiments, a technical benefit can be that dual perspective processing resolves long term dependencies within a single modality stream using sequential perspective processing and resolves short-term dependences while achieving feature fusing between modalities within a local temporal neighbourhood using graph perspective processing. In some embodiments, a technical benefit can be that dual perspective networks can perform event/activity localization, video-level classification, clip-level classification. In some embodiments, a technical benefit can be that self-supervised video representation techniques including audio-visual temporal synchronization, audio-visual correspondence, temporal order verification, segment/clip hallucination, pace prediction can be performed using a graph perspective by morphing the adjacency matrix. In some embodiments, a technical benefit can be that audio-visual event localization can be performed using only visual features and audio features where these features are extracted with respect to speed and representation ability. In some embodiments, a technical benefit can be that dual perspective networks can be compressed to cater to device constraints.
Another aspect of the disclosure provides a method for a relational graph convolutional transformer (RGCT) . Such a method is executed by a processor and includes aggregating neighbor nodes of a node of a relational graph, the neighbor nodes are all nodes in a relational neighborhood of the node. Such a method further includes concatenating a query of the node over different types of relations that include the query. Such a method further includes developing a key transformation related to the aggregated neighbor nodes. Such a method further includes developing a value transformation related to the aggregated neighbor nodes. Such a method further includes developing an attention map using the concatenated query of the node and a transpose of  the key transformation and the value, developing the attention map to derive a relational node update. Such a method further includes transforming a representation of the node resulting from an average of the relational node update. In some embodiments, a technical benefit can be that the RGCT is used to refine node features using relation-wise polymorphic representations of itself by querying relation-wise neighbourhoods when resolving graph specific problems. In some embodiments, a technical benefit can be that localized video segments can be searched to find an action of interest.
Another aspect of the disclosure provides a method for a replicate and link data augmentation technique. Such a method is executed by a processor and includes decomposing a first video into any combination of: one or more visual frames; one or more audio frames; one or more audio features which have been statistically precomputed or the output of a neural network; one or more visual features which have been statistically precomputed or the output of the neural network. In such a method, the any combination of visual frames and audio frames and audio features and visual features stored in a plurality of specific state databases, the decomposition includes extracting first video states from the first video. Such a method further includes extracting a state sequence from a second video, that state sequence comprised of extracted second video states. Such a method further includes sampling randomly the frames from the plurality of specific state databases, the frames stored when the first video states are equivalent to the second video states. Such a method further includes stitching the stored frames together to create a replica. In some embodiments, a technical benefit can be that the performance of an artificial intelligence (AI) model is improved by training the AI model using a dataset of videos produced using replica augmentation so that the AI model learns the underlying semantic concept while ignoring external noise caused by coincidental correlations. In some embodiments, a technical benefit can be that replica augmentation can be used to transform a smaller dataset of videos into a significantly larger dataset of videos to introduce diversity in all modalities while preserving the semantic concept. In some embodiments, a technical benefit can be that replica augmentation can also include link augmentation to expand the video representation of a graph perspective at run-time.
BRIEF DESCRIPTION OF THE FIGURES
Further features and advantages of the present disclosure will be apparent from the following detailed description, taken in combination with the appended drawings, in which:
FIG. 1 is a schematic diagram illustrating a sequential processing perspective, in accordance with an embodiment of this present disclosure.
FIG. 2a is a schematic diagram illustrating the modal segments and video stream of a graph perspective, in accordance with an embodiment of this present disclosure.
FIG. 2b is a schematic diagram illustrating the modal segments and audio stream of a graph perspective, according to an embodiment of this disclosure.
FIG. 3 is a schematic diagram illustrating an example modal segments and their relationship, in accordance with an embodiment of this present disclosure.
FIG. 4 is a schematic diagram illustrating an adjacency matrix, in accordance with an embodiment of this present disclosure.
FIG. 5 is a schematic diagram illustrating a visual node update, in accordance with an embodiment of this present disclosure.
FIG. 6 is a schematic diagram illustrating an audio node update, in accordance with an embodiment of this present disclosure.
FIG. 7a is a schematic diagram illustrating alternating dual perspective processing, in accordance with an embodiment of this present disclosure.
FIG. 7b is a schematic diagram illustrating parallel dual perspective processing, in accordance with an embodiment of this present disclosure.
FIG. 8 is a schematic diagram illustrating a relational graph convolution transformer, in accordance with an embodiment of this present disclosure.
FIG. 9 is a schematic diagram illustrating a relational graph convolution transformer attention map, in accordance with an embodiment of this present disclosure.
FIG. 10 is a schematic diagram illustrating a state and composition/extraction table, in accordance with an embodiment of this present disclosure.
FIG. 11 is a schematic diagram illustrating an activation map, in accordance with an embodiment of this present disclosure.
FIG. 12a is a block diagram illustrating replica augmentation, in accordance with an embodiment of this present disclosure.
FIG. 12b is a schematic diagram illustrating an example original sample video, in accordance with an embodiment of this present disclosure.
FIG. 12c is a schematic diagram illustrating a replica video, in accordance with an embodiment of this disclosure.
FIG. 13 is a schematic diagram illustrating link augmentation, in accordance with an embodiment of this disclosure.
FIG. 14 is a block diagram of a system architecture, in accordance with embodiments of the present disclosure.
FIG. 15 is a block diagram of a convolutional neural network model, in accordance with embodiments of this disclosure.
Throughout the appended drawings, like features are identified by like reference numerals.
DETAILED DESCRIPTION
Various embodiments of this disclosure use methods to process video in order to label event localization within the video. This labeling allows retrieval of one or more video segments similar to that of a searched keyword. This disclosure provides these methods.
Methods of this disclosure can use various deep learning based computer vision techniques. These methods can include convolutional neural networks (CNN) , pretrained convolutional neural networks (VGGish and VGG-19) , message passing networks (MPN) , graph neural networks (GNN) such as graph convolutional network  (GCN) , relational graph convolutional networks (RGCN) , and graph attention networks (GAN) .
These deep learning based computer vision techniques can be used to construct a novel viewpoint for a neural network to use in order to process video modalities. These video modalities can include segments of audio, visual, optical flow, and text.
A CNN can be a deep neural network that can include a convolutional structure. The CNN can include a feature extractor that can consist of a convolutional layer and a sub-sampling layer. The feature extractor may be considered to be a filter. A convolution process may be considered as performing convolution on an input image by using a trainable filter to produce a convolutional feature map. The convolutional layer may indicate a neural cell layer at which convolution processing can be performed on an input signal in the CNN. The convolutional layer can include one neural cell that can be connected only to neural cells in some neighboring layers. One convolutional layer usually can include several feature maps and each of these feature maps may be formed by some neural cells that can be arranged in a rectangle. Neural cells producing the same feature map can share one or more weights. These shared weights can be referred to as a convolutional kernel by a person skilled in the art. The shared weight can be understood as being unrelated to a manner and a position of image information extraction. A hidden principle can be that statistical information of a part may also be used in another part. Therefore, in all positions on the image, we can use the same image information obtained through learning. A plurality of convolutional kernels can be used at a same convolutional layer to extract different image information. Generally, a larger quantity of convolutional kernels can indicate that richer image information can be reflected by a convolution operation. A convolutional kernel can be initialized in a form of a matrix of a random size. In a training process of the CNN, a proper weight can be obtained by performing learning on the convolutional kernel. In addition, a direct advantage that can be brought by the shared weight is that a connection between layers of the CNN can be reduced and the risk of overfitting can be lowered.
In the process of training a deep neural network, to enable the deep neural network to produce a predicted value that can be as close as possible to a desired value, a predicted value of the current network and a desired target value can be compared, and a weight vector of each layer of the neural network can be updated based on the difference between the predicted value and the desired target value. An initialization process can be performed before the first update. This initialization process can include a parameter that can be preconfigured for each layer of the deep neural network. As a non-limiting example, if the predicted value of a network is excessively high, a weight vector can be adjusted to reduce the predicted value. This adjustment can be performed multiple times until the neural network can predict the desired target value. This adjustment process is known to those skilled in the art as training a deep neural network using a process of minimizing loss. The loss function and the objective function are mathematical equations that can be used to determine the difference between the predicted value and the target value.
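As a non-limiting example, the following minimal sketch illustrates the weight-update process described above: the predicted value is compared with the desired target value and the weights are adjusted repeatedly to reduce the loss. The single linear layer, synthetic data, learning rate, and loss function are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                        # preconfigured (initialized) parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()                         # measures predicted vs. target difference

x = torch.randn(64, 4)
target = x.sum(dim=1, keepdim=True)            # desired target values

for step in range(100):                        # repeat until predictions approach targets
    predicted = model(x)
    loss = loss_fn(predicted, target)
    optimizer.zero_grad()
    loss.backward()                            # gradient of the difference
    optimizer.step()                           # update the weight vector of each layer
print(round(loss.item(), 4))
```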
Embodiments of this disclosure can construct novel viewpoints in the form of sequential processing perspectives and also graph perspectives. These graph perspectives can include nodes that can represent segments in each modality and edges that can represent temporal and cross-modal relationships. A dual perspective network (DPN) can process a video by viewing the video from two different perspectives. As a non-limiting example, the DPN can process the video by treating the video as a sequential stream of data and also by treating the video as an interactive graph where the graph can consist of temporally directed and one or more cross-modal relationships. As a result, dual perspective processing can be defined as processing a video both as data represented as a sequential stream and data represented as a graph. Processing both perspectives can result in the DPN refining individual and also joint modal features.
Embodiments of this disclosure can use DPNs for several techniques where the dual perspective processing can be the backbone for video understanding tasks. These tasks can include video classification, action recognition, action localization, and self-supervised video representation techniques that can include audio-visual temporal synchronization, audio-visual correspondence, temporal order verification, segment/clip hallucination, and pace prediction. In some embodiments the adjacency matrix can be manipulated so that perspective 2 (graph perspective) methods can be used. As a non-limiting example, the adjacency matrix can be manipulated by changing the connection of A 1 from pointing to V 1 so that A 1 points to V 2.
Embodiments of this disclosure can implement self-supervised learning. Self-supervised learning may not include labels nor classes. However, a representation of the video can be created for some tasks. This representation can be created by unsynchronizing the audio and visual streams. A neural network can use these synchronized and unsynchronized streams to learn parameters that can be used to distinguish where the unsynchronizing operation has taken place and then allow the neural network to create representations where it can learn how to tune parameters. This learning is also known by those skilled in the art as adjusting weights. In other words, the neural network can learn how to form an understanding of which visual segment corresponds to a given audio segment. Hence, a relation that has never been learned before, or could not have been learned before, can be enabled. In order to perform this process, a proxy task can be created and this proxy task can create a problem for which an associated label is known. Next, the sequence can be unsynchronized or frames can be jumbled and the neural network can be requested to predict if the sequence is in a correct order or if the sequence is synchronized.
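As a non-limiting example, the following sketch illustrates how such a proxy task could be constructed: the audio segment sequence is randomly unsynchronized relative to the visual sequence, and the known label (synchronized or not) is generated for a network to predict. The feature shapes and the use of a circular shift are illustrative assumptions.

```python
import random
import torch

def make_sync_example(visual_feats, audio_feats, p_shift=0.5):
    """Return (visual, audio, label); label 1 = synchronized, 0 = unsynchronized."""
    if random.random() < p_shift:
        shift = random.randint(1, audio_feats.shape[0] - 1)
        audio_feats = torch.roll(audio_feats, shifts=shift, dims=0)  # break alignment
        return visual_feats, audio_feats, 0
    return visual_feats, audio_feats, 1

visual = torch.randn(10, 512)   # 10 one-second visual segment features
audio = torch.randn(10, 128)    # 10 one-second audio segment features
v, a, label = make_sync_example(visual, audio)
print(label)                    # the known proxy-task label
```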
Embodiments of this disclosure can also construct a GNN to express a node’s representation with respect to each type of relationship in which it is involved. Some embodiments can also refine these representations based on how each relationship affects one another.
Relational graph convolutional transformers (RGCTs) can be used to refine node representations according to their temporal and cross-modal neighbors in the graph. These RGCTs can perform this refinement by aggregating node (also commonly known as segment) level representations from a particular neighborhood of nodes to collect the temporal and cross-modality associations of the nodes in the neighborhood. RGCTs can refine the temporal and cross-modal representations of a node because a node can be characterized by its association with its associated neighbor nodes.
RGCTs can also redefine a relation type of a node representation with other relations using a scaled dot product attention mechanism. The scaled dot product attention mechanism can allow node refinement based only on a subset of important relations of the R relation types that can be associated to the node being refined.
Embodiments of this disclosure can synthesize a replication data augmentation technique. Replication can be used to transform a small dataset of videos into a  significantly larger dataset. Replication can also introduce diversity while preserving the semantic concept of the original event-level segment sequence.
The replicate and link data augmentation technique used by embodiments of this disclosure can yield a large amount of temporally un-constrained natural videos (replicas) based on the combination of state-based clips of existing videos. These existing videos can have a fixed background (BG) –foreground (FG) segment pattern and the replicas can also comprise the same segment pattern as the existing videos.
FIG. 1 illustrates the audio 170 and visual 130 streams of a sequential processing perspective of a video. Visual stream 130 of the video can have a total length of 10 seconds and since visual stream 130 can include 10 segments, each visual segment 140, visual segment 142, visual segment 144, visual segment 146, visual segment 148, visual segment 150, visual segment 152, visual segment 154, visual segment 156, and visual segment 158 can be one second in duration. These 10 visual segments can be processed as a sequence of frames using traditional spatio-temporal processing techniques.
Audio stream 170 of the video can be related to visual stream 130. Audio stream 170 can be comprised of 10 one second audio segments -audio segment 180, audio segment 182, audio segment 184, audio segment 186, audio segment 188, audio segment 190, audio segment 192, audio segment 194, audio segment 196 and audio segment 198.
Each of the 10 visual segments can be processed sequentially by a visual processor. The sequence of processing illustrated by FIG. 1 is: frame 140 is processed by the visual processor at 110a, frame 142 at 110b, frame 144 at 110c, frame 146 at 110d, frame 148 at 110e, frame 150 at 110f, frame 152 at 110g, frame 154 at 110h, frame 156 at 110i, and frame 158 at 110j.
Each of the 10 audio segments can be processed sequentially by an audio processor. The sequence of processing illustrated by FIG. 1 is frame 180 is processed by the audio processor at 160a, frame 182 at 160b, frame 184 at 160c, frame 186 at 160d, frame 188 at 160e, frame 190 at 160f, frame 192 at 160g, frame 194 at 160h, frame 196 at 160i, and frame 198 at 160j.
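As a non-limiting example, the following sketch illustrates the sequential perspective of FIG. 1: each stream is split into one-second segments and each segment is processed in temporal order. The placeholder processors and the toy frame and sample rates are illustrative assumptions that stand in for the visual processors 110a-110j and audio processors 160a-160j.

```python
from typing import List

def split_into_segments(stream: List[float], segment_len: int) -> List[List[float]]:
    return [stream[i:i + segment_len] for i in range(0, len(stream), segment_len)]

def visual_processor(segment):   # placeholder for spatio-temporal processing
    return sum(segment) / len(segment)

def audio_processor(segment):    # placeholder for audio processing
    return max(segment)

visual_stream = list(range(10 * 25))   # e.g. 10 s at a toy rate of 25 frames/s
audio_stream = list(range(10 * 16))    # e.g. 10 s at a toy rate of 16 samples/s

visual_out = [visual_processor(s) for s in split_into_segments(visual_stream, 25)]
audio_out = [audio_processor(s) for s in split_into_segments(audio_stream, 16)]
print(len(visual_out), len(audio_out))  # 10 segment-level outputs per modality
```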
FIG. 2a illustrates the visual stream 130 of a graph perspective. Some embodiments of this disclosure can fix the minimum temporal length of processing (unit temporal length) into segments. Some embodiments of this disclosure can aggregate video data in a modality into variable sized segments. The aggregation can be based on similarity between initial features or other modality particular criteria.
Frames 140, 142, 144, 146, 148, 150, 152, 154, 156, and 158 of visual stream 130 can be represented as nodes 208 in a graph. The features of these nodes can be derived by segment-wise processing of an individual modality. The edges of the graph can be defined as relationships between the segments represented by the nodes.
As illustrated by FIG. 2a, this feature wise processing can derive node V 1 210 to represent visual frame 140. This processing can also derive node 212 to represent 142, node 214 to represent 144, node 216 to represent 146, node 218 to represent 148, node 220 to represent 150, node 222 to represent 152, node 224 to represent 154, node 226 to represent 156, and node 228 to represent 158. The edges between  nodes  210 and 212 can be represented by visual temporal forward connection 230a and by visual temporal backward connection 240a. The edges between  nodes  212 and 214 can be represented by visual temporal forward connection 230b and by visual temporal backward connection 240b. The edges between nodes 214 and 216 can be represented by visual temporal forward connection 230c and by visual temporal backward connection 240c. The edges between nodes 216 and 218 can be represented by visual temporal forward connection 230d and by visual temporal backward connection 240d. The edges between nodes 218 and 220 can be represented by visual temporal forward connection 230e and by visual temporal backward connection 240e. The edges between nodes 220 and 222 can be represented by visual temporal forward connection 230f and by visual temporal backward connection 240f. The edges between nodes 222 and 224 can be represented by visual temporal forward connection 230g and by visual temporal backward connection 240g. The edges between nodes 224 and 226 can be represented by visual temporal forward connection 230h and by visual temporal backward connection 240h. The edges between  nodes  226 and 228 can be represented by visual temporal forward connection 230i and by visual temporal backward connection 240i.
FIG. 2b illustrates the audio stream 170 of a graph perspective. As illustrated by FIG. 2b, the audio frames 180, 182, 184, 186, 188, 190, 192, 194, 196, and 198 of audio stream 170 can be represented by nodes 248. As with nodes 208, the features of nodes 248 can be derived by segment-wise processing of an individual modality. Again, the edges of the graph can be defined as relationships between the segments represented by the nodes.
As illustrated by FIG. 2b, this feature wise processing can derive node A 1 250 to represent audio frame 180. This processing can also derive node 252 to represent 182, node 254 to represent 184, node 256 to represent 186, node 258 to represent 188, node 260 to represent 190, node 262 to represent 192, node 264 to represent 194, node 266 to represent 196, and node 268 to represent 198. The edges between  nodes  250 and 252 can be represented by audio temporal forward connection 270a and by audio temporal backward connection 280a. The edges between  nodes  252 and 254 can be represented by audio temporal forward connection 270b and by audio temporal backward connection 280b. The edges between  nodes  254 and 256 can be represented by audio temporal forward connection 270c and by audio temporal backward connection 280c. The edges between  nodes  256 and 258 can be represented by audio temporal forward connection 270d and by audio temporal backward connection 280d. The edges between  nodes  258 and 260 can be represented by audio temporal forward connection 270e and by audio temporal backward connection 280e. The edges between  nodes  260 and 262 can be represented by audio temporal forward connection 270f and by audio temporal backward connection 280f. The edges between  nodes  262 and 264 can be represented by audio temporal forward connection 270g and by audio temporal backward connection 280g. The edges between  nodes  264 and 266 can be represented by audio temporal forward connection 270h and by audio temporal backward connection 280h. The edges between  nodes  266 and 268 can be represented by audio temporal forward connection 270i and by audio temporal backward connection 280i.
FIG. 3 illustrates an example of an embodiment of this disclosure with three visual nodes and three associated audio nodes and their connections. The types of connections can be visual temporal forward, visual temporal backward, audio temporal forward, audio temporal backward, visual to audio, and audio to visual.
Embodiments of this disclosure can include the cross-modal type connections between audio nodes and the video nodes. These cross-modal type connections can be in addition to the temporal forward connections and temporal backward connections between nodes of the same type.
As illustrated by FIG. 3, visual node V 1 210 can be connected to visual node V 2 212 by visual temporal forward connection 340 and also by visual temporal backward connection 240a. Visual node V 2 212 can be connected to visual node V 3 214 by visual temporal forward connection 345 and also by visual temporal backward connection 240c.
FIG. 3 also illustrates audio node A 1 250 which can be connected to audio node A 2 252 by audio temporal forward connection 330 and also by audio temporal backward connection 280a. Audio node A 2 252 can be connected to audio node A 3 254 by audio temporal forward connection 335 and also by audio temporal backward connection 280b.
Inter-modal connections are also illustrated by FIG. 3. As FIG. 3 illustrates, visual node V 1 210 can be connected to audio node A 1 250 by visual to audio connection 320 and audio to visual connection 310. Visual node V 2 212 can be connected to audio node A 2 252 by visual to audio connection 325 and audio to visual connection 315. Visual node V 3 214 can be connected to audio node A 3 254 by visual to audio connection 328 and audio to visual connection 318.
FIG. 4 illustrates a non-limiting example of an adjacency matrix. Each entry in this matrix can record the relationship between the source and destination node and therefore can represent an edge in the graph. Each edge type can semantically designate the temporal direction and also the cross-modal relationships between nodes.
The source nodes illustrated by FIG. 4 can be represented by the columns and the destination nodes can be represented by the rows. As illustrated by FIG. 4, the source nodes are A 1 505, A 2 510, A 3 515, V 1 520, V 2 525 and V 3 530 and the destination nodes are A 1 405, A 2 410, A 3 415, V 1 420, V 2 425 and V 3 430. As a non-limiting example, the edge between source node A 1 505 and destination node A 2 410 can be ATB 610. ATB can stand for audio temporal backward connection. Another non-limiting example is that source node A 3 515 and destination node A 2 410 can have the edge ATF 710. ATF can stand for audio temporal forward connection. Another non-limiting example illustrated by FIG. 4 is that source node V 2 525 and destination node A 2 410 can have edge A2V 810. A2V can stand for audio to visual connection. A blank entry in this matrix can indicate that there is no edge (also known as a predefined relation) between the source and destination nodes.
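As a non-limiting example, the following sketch builds a relation-typed adjacency matrix for three audio nodes and three visual nodes in the spirit of FIG. 4, with columns as source nodes and rows as destination nodes. The mapping of the forward/backward labels onto particular cells and the helper function names are illustrative assumptions and may differ from the exact convention of FIG. 4.

```python
nodes = ["A1", "A2", "A3", "V1", "V2", "V3"]
idx = {n: i for i, n in enumerate(nodes)}
adj = [["" for _ in nodes] for _ in nodes]     # blank entry = no predefined relation

def add_edge(src, dst, relation):
    adj[idx[dst]][idx[src]] = relation         # row = destination, column = source

for i in range(1, 3):                          # temporal edges along each modality
    add_edge(f"A{i}", f"A{i+1}", "ATF")        # audio temporal forward
    add_edge(f"A{i+1}", f"A{i}", "ATB")        # audio temporal backward
    add_edge(f"V{i}", f"V{i+1}", "VTF")        # visual temporal forward
    add_edge(f"V{i+1}", f"V{i}", "VTB")        # visual temporal backward

for i in range(1, 4):                          # cross-modal edges per time step
    add_edge(f"A{i}", f"V{i}", "A2V")          # audio to visual
    add_edge(f"V{i}", f"A{i}", "V2A")          # visual to audio

for name, row in zip(nodes, adj):
    print(name, row)
```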
Adjacency matrix data can be processed by GNNs, GCNs, RGCNs, and GANs. The features of a GNN’s nodes can be updated based on a function and/or transformation of the node and also a function and/or transformation of the neighbors of the node by processing the adjacency matrix data. As a non-limiting example, updating a node’s representation in a GNN can be performed by processing the data represented by the adjacency matrix between the node and all of the neighbor nodes connected to the node. In some embodiments of this disclosure these functions and/or transformations can be linear or affine transformations and in other embodiments they can be neural networks.
FIG. 5 illustrates an update of visual node V 2 212. As FIG. 5 illustrates, V 2 can be updated based on its temporal backward counterpart 240b with node V 3 214, V 2's temporal forward counterpart 230a with node V 1 210, and V 2's connection 315 with audio node A 2 252.
FIG. 6 illustrates an update of audio node A 2 252. As FIG. 6 illustrates, A 2 252 can be updated based on its temporal backward counterpart 280b with node A 3 254, A 2's temporal forward counterpart 270a with node A 1 250, and A 2's connection 325 with visual node V 2 212.
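As a non-limiting example, the following sketch illustrates the kind of node update shown in FIG. 5 and FIG. 6: a node is refined from its temporal forward neighbor, its temporal backward neighbor, and its cross-modal counterpart. The use of plain linear transformations averaged together is an illustrative assumption, not the exact update rule of this disclosure.

```python
import torch
import torch.nn as nn

d = 8
transforms = nn.ModuleDict({
    "temporal_forward": nn.Linear(d, d),
    "temporal_backward": nn.Linear(d, d),
    "cross_modal": nn.Linear(d, d),
    "self": nn.Linear(d, d),
})

def update_node(node, neighbors_by_relation):
    messages = [transforms["self"](node)]
    for relation, neighbor in neighbors_by_relation.items():
        messages.append(transforms[relation](neighbor))
    return torch.stack(messages).mean(dim=0)       # aggregate self + neighbor messages

v1, v2, v3, a2 = (torch.randn(d) for _ in range(4))
v2_new = update_node(v2, {"temporal_forward": v1,   # V1 -> V2
                          "temporal_backward": v3,  # V3 -> V2
                          "cross_modal": a2})       # A2 -> V2
print(v2_new.shape)
```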
Since dual perspective processing can be used to process video data using both perspective 1 and perspective 2, dual perspective processing can benefit from the advantages provided by processing perspective 1 and also the advantages provided by processing perspective 2 and also enables refinement of features for each segment.
The advantages of processing perspective 1 can be that long term dependencies can be resolved within a single modality stream so that processing a stream of data local to its modality can lead to learning patterns without corruption from cross-modal data. Cross-modal corruption can occur due to cross-modal processing modules that may not interact with each other in terms of long-term dependencies that would have been available in the previous sequential processing step.
The advantages of processing perspective 2 can be that updates can occur naturally due to the neighborhood relationship of the nodes. As a non-limiting example, when a node update is performed, the temporal neighborhood can be aggregated and information from other related modalities can be learned. Another advantage of perspective 2 is that short-term dependencies can be resolved while achieving feature fusion between modalities within a local temporal neighborhood.
FIG. 7a illustrates an embodiment of this disclosure that implements alternating dual perspective processing of a video. Since any form of processing within the same perspective can be engulfed as a single process within that perspective, alternating dual perspective processing can involve processing one or more video segments extracted into perspective 1 or perspective 2, followed by processing the output representations in the other perspective, followed by processing the output representations in the first perspective. As a non-limiting example, if perspective 1 is processed followed by perspective 2 followed by perspective 1, one or more video segments can be extracted into perspective 1 and processed, followed by inputting the output of perspective 1 processing into perspective 2 and processing it, followed by inputting the output of perspective 2 processing into perspective 1 and processing it, and so on.
Neural networks that can perform dual perspective processing can be termed dual perspective networks (DPNs) . DPNs can perform tasks that can include event or activity localization, video-level classification, and clip-level classification. These tasks can be performed by additional neural networks or feature aggregation modules that can be attached to the end of the DPN. Embodiments of this disclosure can utilize DPNs to perform dual perspective processing by alternating between a graph perspective 208 and 248 and a sequential perspective 740 and 745.
FIG. 7a illustrates an embodiment of this disclosure where visual stream 130 and audio stream 170 of the video are first extracted 710 into audio sequence 715 and video sequence 720. As illustrated by FIG. 7a, sequence to graph perspective change 725 can convert audio sequence 715 into audio graph 248 and sequence to graph perspective change 725 can convert video sequence 720 into video graph 208. When performing a sequence to graph perspective change, temporally ordered audio and visual segments can be converted into unordered graph nodes with temporally directed and cross-modal edges. As a non-limiting example, if a sequence has 10 segments, then the sequence to graph perspective change can create a graph with 20 nodes: 10 nodes correspond to the audio component of the sequence and 10 nodes correspond to the visual component of the sequence.
It should be appreciated that node indexing can be used to keep track of the original temporal order of the sequence prior to performing a sequence to graph perspective change so that the original temporal order can be reinstated when performing a graph to sequence perspective change.
Audio graph 248 and video graph 208 can be processed by a relational GNN 730, which can be implemented using a relational graph convolutional transformer (RGCT) , and the result of this processing by RGCT 730 can be converted into audio sequence 740 and also video sequence 745 by graph to sequence perspective change 735. Then audio sequence 740 and video sequence 745 can be processed as modality streams individually using spatial and/or temporal processing methods using sequence processor 750.
Graph to sequence perspective change can be the conversion of the unordered graph nodes back to the individual audio and visual sequences. Node indexing performed prior to performing a sequence to graph perspective change can be used to reconstruct the ordered audio and visual sequence.
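As a non-limiting example, the following sketch illustrates the sequence to graph and graph to sequence perspective changes with node indexing: ten audio segments and ten visual segments become 20 unordered graph nodes, each tagged with its modality and original index so the temporal order can be reinstated afterwards. The feature sizes are illustrative assumptions.

```python
import torch

def sequence_to_graph(video_seq, audio_seq):
    nodes, index = [], []
    for t, feat in enumerate(video_seq):
        nodes.append(feat)
        index.append(("visual", t))
    for t, feat in enumerate(audio_seq):
        nodes.append(feat)
        index.append(("audio", t))
    return torch.stack(nodes), index            # 20 nodes for a 10-segment video

def graph_to_sequence(node_feats, index):
    video = [None] * sum(1 for m, _ in index if m == "visual")
    audio = [None] * sum(1 for m, _ in index if m == "audio")
    for feat, (modality, t) in zip(node_feats, index):
        (video if modality == "visual" else audio)[t] = feat
    return torch.stack(video), torch.stack(audio)

video_seq = torch.randn(10, 16)
audio_seq = torch.randn(10, 16)
nodes, index = sequence_to_graph(video_seq, audio_seq)
video_back, audio_back = graph_to_sequence(nodes, index)
print(nodes.shape, torch.equal(video_back, video_seq), torch.equal(audio_back, audio_seq))
```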
This process of alternating between processing perspective 1 and perspective 2 can be performed N 752 times, and after being processed N times, the result is further processed by video classification block 755 and event/activity localization block 765.
Video classification block 755 can generate video class prediction 760 based on the result of the DPN processing.
Event/activity localization block 765 can generate a fixed background (BG) –foreground (FG) segment pattern 770 based on the result of the DPN processing. As a non-limiting example of network flow for audio-visual event localization, inputs can include visual features related to visual stream 130, which can be extracted using a suitable artificial intelligence computer vision based model such as a VGG19 neural network. Further inputs can include audio signals related to the video, which can be extracted utilizing a suitable artificial intelligence audio-based model such as a VGGish network. Outputs can be event distributions at the segment level over all events from both modalities to allow for separate audio/visual event localization. Further outputs can be from a gated/aggregated/combined output between both modalities and can be used for audio-visual event localization.
Embodiments of this disclosure can implement a DPN in audio-visual event (AVE) localization that can be used for strongly supervised AVE localization and/or weakly supervised AVE localization. When used for strongly supervised AVE localization, segment level labels are predicted given that supervision can be ground truth segment level labels. When used for weakly supervised AVE localization, segment labels are predicted given that supervision can be ground truth video level labels.
FIG. 7b illustrates an embodiment of this disclosure that performs parallel dual perspective processing. Parallel dual perspective processing can utilize many of the same functions that can be used in alternating dual perspective processing. However, as FIG. 7b illustrates, sequence to graph perspective change 725 can convert audio sequence 715 into audio graph 248 and sequence to graph perspective change 725 can convert video sequence 720 into video graph 208. Sequence to sequence perspective change 737 can convert audio sequence 715 into audio sequence 740 and sequence to sequence perspective change 737 can convert video sequence 720 into video sequence 745. A reason for using sequence to sequence perspective change 737 to convert audio sequence 715 to audio sequence 740 and to convert video sequence 720 into video sequence 745 can be that, in the graph representation of a video, all audio and video segments can be represented as individual nodes. However, the notion of temporal sequencing can be lost since nodes can be unordered. That being said, when nodes are connected using temporally directed connections, the sequencing can be retained by use of node chaining. As a non-limiting example, a video with ten segments can have 20 interconnected nodes in the graph. Therefore, in the sequential stream representation of the graph, these 20 nodes can be indexed into both audio and visual segments and temporally sequenced. As a result, sequence to sequence perspective change 737 can format data into video sequence 745 and audio sequence 740 so that these sequences can be in the correct format for processing by processor 750.
Next, audio graph 248 and video graph 208 can be processed by RGCT 730 and audio sequence 740 and video sequence 745 can be processed by using spatial and/or temporal processing methods by processor 750.
Optional merge and split 775 can then optionally interchange information using a merge operation. This merge operation can be implemented using a statistical function or a neural network. If Optional merge and split 775 performed a merge, an optional  split operation can be carried out using statistical functions or a neural network. This merge allows parallel perspective processing to be carried out N 752 times.
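As a non-limiting example, the following sketch illustrates one possible instantiation of optional merge and split 775 using a simple statistical function (averaging) for the merge and a copy for the split; a neural network could be used instead, and the feature shapes are illustrative assumptions.

```python
import torch

def merge(graph_feats: torch.Tensor, seq_feats: torch.Tensor) -> torch.Tensor:
    return 0.5 * (graph_feats + seq_feats)     # statistical merge (element-wise average)

def split(merged: torch.Tensor):
    return merged.clone(), merged.clone()      # hand the merged features back to both branches

graph_branch = torch.randn(20, 32)             # 20 node features from the graph branch
seq_branch = torch.randn(20, 32)               # 20 segment features from the sequence branch
merged = merge(graph_branch, seq_branch)
graph_branch, seq_branch = split(merged)
print(graph_branch.shape, seq_branch.shape)
```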
FIG. 8 illustrates an embodiment of this disclosure for a relational graph convolution transformer (RGCT) . This RGCT can consider each node 805 to assume a polymorphic role dependent on the relation type. Therefore, each node 805 can be transformed based on relation type. The aggregation of the relation-wise neighborhood nodes of each node 805 can be transformed into key and value transformations while considering each node's polymorphic query transformation in the relation space.
As FIG. 8 illustrates, node 805 can have a given relational neighborhood. Since polymorphism can be the ability of an entity to exist across multiple formations and interpretations of itself, a node can be projected using a query of relation type 1, which can be temporal forward, temporal backward, audio to visual, or visual to audio. Therefore, a node can be projected to R different types of transformations and can project each of the node's neighbor aggregations of different types using different key and value transformations. As a result, NA 1 855 can represent the aggregation of all temporal forward connections to neighbor nodes of node 805, NA 2 860 can represent the aggregation of all temporal backward connections to neighbor nodes of node 805, and NA 3 865 can represent the aggregation of all visual to audio connections to neighbor nodes of node 805. These NA 1 855, NA 2 860, and NA 3 865 can be neighborhood aggregation results that can be computed using aggregation mechanisms such as Equation 1. Also, NA 1 855, NA 2 860, and NA 3 865 can be projected to keys and values per relation type via learnable parameters W k 810, 820, 830 and W v 814, 824, 834.
As a result, FIG. 8 illustrates node 805 respectively projected to query via learnable parameters.
Equation 1 can express an aggregation function that can perform an aggregation of node 805’s neighborhood nodes by averaging all of the nodes which can exist in that relational neighborhood.
NA_r(n) = \frac{1}{|\mathcal{N}_r(n)|} \sum_{n_i^r \in \mathcal{N}_r(n)} n_i^r     (1)
Where:
n_i^r is the reference node under relation r, with n_i^r \in \mathcal{N}_r(n), in the neighborhood aggregation of node n.
Equation 2 can express a query of node 805 concatenated over different types of relations that can include query of relation type 1 840, query of relation type 2 845, query of relation type 3 850. Therefore equation 2 can represent R query representations of the feature length size d.
Q(\mathrm{Ref}) = \Lambda_{r=1}^{R} \, Q_r(\mathrm{Ref}), \quad Q_r(\mathrm{Ref}) = W_q^r \, \mathrm{Ref}     (2)
Where:
1. Λ is a concatenation operator.
2. W_q^r is a query vector used to transform a node n_i into a relational polymorph of type r using an FC layer.
3. Q_r is a query of relation r.
Equation 3 can express a key transformation of the neighborhood aggregation that is a concatenation of all keys of the different relations (an Rxd matrix) .
K(NA(\mathrm{Ref})) = \Lambda_{r=1}^{R} \, K_r, \quad K_r = W_k^r \, NA_r(\mathrm{Ref})     (3)
Where:
1. NA_r(Ref) is the neighborhood aggregation of relation type r.
2. W_k^r is the key weight of relation r.
3. K_r is the key of relation r.
Equation 4 can express the value.
V(NA(\mathrm{Ref})) = \Lambda_{r=1}^{R} \, V_r, \quad V_r = W_v^r \, NA_r(\mathrm{Ref})     (4)
Where:
1. W_v^r is the value weight of relation r.
2. V_r is the value of relation r.
Equation 5 can express the values in the relational graph convolution transformer attention map illustrated by FIG. 9. It should be appreciated that the Softmax function can bound the attention values between 0 and 1 and can allocate larger values to larger inputs. The query term, Q (Ref) can produce a Rxd matrix. Since the key is a Rxd matrix that can be transposed and multiplied with the Rxd query matrix, an RxR attention map matrix can result.
\mathrm{Att\_Map}(\mathrm{Ref}) = \mathrm{Softmax}\!\left( \frac{Q(\mathrm{Ref}) \, K^{T}(NA(\mathrm{Ref}))}{\sqrt{d}} \right)     (5)
Where:
1. K^T(NA(Ref)) is the key transpose.
2. \sqrt{d} is a normalization factor.
3. Q(Ref) is the query.
The relational graph convolution transformer attention map illustrated by FIG. 9 can be targeted to capture how each node 805’s neighborhood pertaining to a relation type can influence node 805’s relation wise polymorphic representations. It should be appreciated that the Softmax function of equation 5 can be applied row wise and  concatenated over relation types. Concatenation over relation type can be achieved by stacking the query transformations one on top of the other and therefore concatenating R query transformations can result in a Rxd matrix.
Since each key 812, 822, 832 and value 816, 826, 836 are also Rxd matrix, applying the Softmax of equation 5 with multiplication of query (Q (Ref) ) and key transpose (K T) can result in an attention map that is a RxR matrix.
Equation 6 can be used to apply the RxR attention map matrix to the Rxd value matrix 816, 826, 836, which, as a result of matrix multiplication, yields an Rxd matrix.
\mathrm{Att\_NA}(\mathrm{Ref}) = \mathrm{Att\_Map}(\mathrm{Ref}) \, V(NA(\mathrm{Ref}))     (6)
Where:
1. Att NA can be the attended neighbourhood aggregation over all relation types.
2. V (NA (Ref) ) is the value of the neighborhood aggregation.
Equation 7 can be used to transform, using a neural network, a "d" -sized representation resulting from an average of Att_NA (Ref) over R.
\mathrm{Ref}_{new} = \theta\!\left( W_\theta, \; \Phi(W_\Phi, \mathrm{Ref}_{old}) + \mathrm{Avg}_r\!\left( \mathrm{Att\_NA}(\mathrm{Ref}_{old}) \right) \right)     (7)
Where:
1. Φ (W_Φ, Ref_old) can be the transformation of the old representation of node 805 parameterized by W_Φ.
2. θ (W_θ, Ref_intermed) can be the transformation of the intermediate representation of node 805 parameterized by W_θ.
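As a non-limiting example, the following sketch illustrates how Equations (1) to (7) could be combined into an RGCT node refinement step: per-relation neighborhood averaging, relation-wise query/key/value projections, an RxR scaled dot-product attention map, and a final transformation of the old representation plus the averaged attended aggregation. The dimensions and the use of plain fully connected layers are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class RGCT(nn.Module):
    def __init__(self, d: int, num_relations: int):
        super().__init__()
        self.d, self.R = d, num_relations
        self.W_q = nn.ModuleList([nn.Linear(d, d) for _ in range(num_relations)])
        self.W_k = nn.ModuleList([nn.Linear(d, d) for _ in range(num_relations)])
        self.W_v = nn.ModuleList([nn.Linear(d, d) for _ in range(num_relations)])
        self.phi = nn.Linear(d, d)     # transform of the old node representation
        self.theta = nn.Linear(d, d)   # transform of the intermediate representation

    def forward(self, ref: torch.Tensor, neighbors_per_relation):
        # Eq. (1): average the neighbor nodes of each relational neighborhood.
        na = torch.stack([nbrs.mean(dim=0) for nbrs in neighbors_per_relation])  # R x d
        # Eqs. (2)-(4): relation-wise polymorphic queries, keys, and values.
        q = torch.stack([self.W_q[r](ref) for r in range(self.R)])               # R x d
        k = torch.stack([self.W_k[r](na[r]) for r in range(self.R)])             # R x d
        v = torch.stack([self.W_v[r](na[r]) for r in range(self.R)])             # R x d
        # Eq. (5): RxR attention map over relation types.
        att_map = torch.softmax(q @ k.T / math.sqrt(self.d), dim=-1)             # R x R
        # Eq. (6): attended neighborhood aggregation.
        att_na = att_map @ v                                                     # R x d
        # Eq. (7): refine the node from the old transform plus the averaged attention.
        return self.theta(self.phi(ref) + att_na.mean(dim=0))

d, R = 16, 3
rgct = RGCT(d, R)
ref = torch.randn(d)                                    # node 805's current features
neighbors = [torch.randn(n, d) for n in (1, 2, 3)]      # neighbors per relation type
print(rgct(ref, neighbors).shape)                       # torch.Size([16])
```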
Embodiments of this disclosure can implement a replicate and link video augmentation technique. It should be appreciated that when a node update is performed, the neighbourhood of node V 2 212 may only have one node for each of the relation types. This limits V 2 212's learning capacity because neighbourhood aggregation may not be performed since there is nothing to aggregate with only one node. Therefore, diversity can be introduced by linking graphs from different videos together to enrich the representation of node V 2 212's update. Neighbor aggregation can perform feature interpolation, but interpolation can only be performed if V 2 212 has two or more neighbours. Also, feature interpolation can be different than using only one neighbour (V 1 210 as a non-limiting example) . This difference can occur when performing feature interpolation so that node V 2 212 effectively can "see" a node whose aggregate representation was not directly in the training set. Seeing a node that is not directly in the training set can be valuable because, if the node is not in the training set, node V 2 212 is seeing nodes from a different video in the feature space, and the more nodes that can be "seen" in the feature space, the greater the improvement in the operation of the GNN. The GNN can be improved because when the GNN creates boundaries and segregates features, it may need to know if it can go in between two features. Therefore, if interpolation has been performed between two, three, or four different neighbours, the GNN can perform a more optimized boundary creation.
FIG. 10 illustrates a table of states and composition/extraction. These states can be obtained by decomposing existing training videos using their ground truth annotations. States, or event sequences, can be one of three types: event initiation or start, event continuation, and event termination or end. Each state can represent a sub-sequence of foreground (FG) and/or background (BG) events which can be searched and segments of the sub-sequence that can be extracted. As a non-limiting example, the sub-sequence <BG, FG> 1005 can be searched for and extracted as START_2 1010. Similarly, FIG. 10 provides examples of search and extract sub-sequences for N segment sized videos where the extracted sub-sequence is bolded and underlined.
FIG. 11 illustrates a non-limiting example of an activation map related to the states provided in FIG. 10. From FIG. 10, sub-sequences of one or more audio frames and/or one or more visual frames and/or one or more audio features and/or one or more visual features, which have been statistically precomputed or are the output of a neural network, extracted from each video can be stored in state specific databases 1210, 1213, 1215, 1218 of FIG. 12a.
Diversity can be achieved by creating a replica of the original sample video. The original sample video can have a particular event sequence of FG and BG. The replica can also be created with the same event sequence as the event sequence of the original sample video. However, the replica should not be nonsense; it should have some semantic meaning. Therefore, the replica can be created by dividing sample videos in the training set and decomposing them into state features.
A video's sequence can be described in terms of sub-sequences and these sub-sequences can be characterized according to their semantic position along the video's event progression. As a non-limiting example, FIG. 12a illustrates the process of generating a replica. FIG. 12b illustrates the original sample video's visual stream 130, divided into visual frames 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, associated audio stream 170 divided into audio frames 180, 182, 184, 186, 188, 190, 192, 194, 196, 198 and the event sequence 770 and events 1222, 1224, 1226, 1228, 1230, 1232, 1234, 1236, 1238, 1240. An event transition can occur when consecutive event labels differ. As a non-limiting example, event 1222 is BG and event 1224 is FG, indicating that a transition has occurred between video frames 140 and 142. As a further non-limiting example, event 1224 is FG and event 1226 is also FG, and two continuous FG events can indicate that an event is already in progress. A further non-limiting example is event 1224 is FG, event 1226 is FG, and event 1228 is FG, which, because there are three continuous FGs, can be a stronger indication that the event occurring in the middle of the three FGs is happening. However, since there are three FG events, it is not known if the event has finished. It should be appreciated by those skilled in the art that a BG event is a temporal region that is not of interest but that a FG region is a temporal segment which can be important. This is because, from an action localization perspective, a BG event is an event where the action does not happen and an FG event is where the event has happened in terms of the modality of interest. It should also be appreciated by those skilled in the art that a FG event can be an event which is both audible and visible and should therefore exist in both a frame of visual stream 130 and an associated frame of audio stream 170.
FIG. 12c illustrates the replica video's visual stream 1254, audio stream 1258 and event sequence 1280. As FIG. 12b and FIG. 12c illustrate, event 1222 equals 1282, 1224 equals 1284, 1226 equals 1286, 1228 equals 1288, 1230 equals 1290, 1232 equals 1292, 1234 equals 1294, 1236 equals 1296, 1238 equals 1298, and 1240 equals 1299. Therefore, since the event sequence 770 of the original video equals the event sequence 1280 of the replica, the replica can have semantic meaning. Also, diversity has been achieved since the replica's third visual frame 1262 and audio frame 1270 are different than the original video's third visual frame 144 and audio frame 184. Diversity is further achieved because:
1. The replica's fourth visual frame 1264 and audio frame 1272 are different from the original video's fourth visual frame 146 and audio frame 186.
2. The replica's eighth visual frame 1266 and audio frame 1274 are different from the original video's eighth visual frame 154 and audio frame 194.
3. The replica's ninth visual frame 1268 and audio frame 1276 are different from the original video's ninth visual frame 156 and audio frame 196.
Embodiments of the disclosure can create a replica using the states and composition/extraction entries of FIG. 10's table, because videos can be decomposed into the state database states for a given pattern.
Decomposition can begin by extracting, for a particular class, one or more FG and BG audio frames and visual frames, and/or one or more audio features and/or visual features which have been statistically precomputed or are the output of a neural network. This is done by extracting the states illustrated by FIG. 10 from videos of a certain type (as a non-limiting example, videos of a male speaking) and storing these extracted states in a particular state database 1210, 1213, 1215, 1218.
As a non-limiting example illustrated by FIG. 10, for state START_1 1020 the corresponding composition/extraction is <FG, Next N-1> 1025. Embodiments of this disclosure extract the underlined FG or BG entries of the composition/extraction field and search based on the non-underlined entries. Continuing with the states of FIG. 10:
1. State START_1 1020 can result in searching the FG and BG columns of the START_1 1020 state in the activation map of FIG. 11 for a single FG 1110, which can be extracted. The remaining N-1 columns of FIG. 11's START_1 1020 state are not searched.
2. State START_2 1010, based on the composition/extraction <BG, FG>, can result in searching the FG and BG columns of the START_2 1010 state of FIG. 11 for a sequence of BG 1115 followed by FG 1120 and a sequence of BG 1125 followed by FG 1130. The set of BG 1115 and FG 1120 and the set of BG 1125 and FG 1130 can be extracted as pairs.
3. State CONTINUE_1 1030, with composition/extraction <FG, FG, FG> 1035, can result in searching the FG and BG columns of the CONTINUE_1 1030 state of FIG. 11 for three continuous FGs and extracting the middle FG of the three. Hence, as FIG. 11 illustrates, CONTINUE_1 1030 can extract FG 1135 and FG 1140.
4. State CONTINUE_2 1040, with composition/extraction <FG, FG, FG, FG> 1045, can result in searching the FG and BG columns of the CONTINUE_2 1040 state of FIG. 11 for four consecutive FGs and extracting the middle two of the four. Hence, as FIG. 11 illustrates, CONTINUE_2 1040 can extract the pair of FG 1145 and FG 1150, and the pair of FG 1150 and FG 1155.
5. State END_1 1050, with composition/extraction <First N-1, FG> 1055, can result in searching the FG column of the END_1 1050 state of FIG. 11 for FG 1160 in the last, Nth column.
6. State END_2 1060, with composition/extraction 1065, can result in searching the FG and BG columns of the END_2 1060 state of FIG. 11 for an FG 1165 and BG 1170, which are extracted.
7. State BG_1 1070, with composition/extraction <BG> 1075, can result in searching the FG and BG columns of the BG_1 1070 state of FIG. 11 for BG 1175 and BG 1185, which are extracted.
8. State BG_2 1080, with composition/extraction <BG, BG> 1085, can result in searching the FG and BG columns of the BG_2 1080 state of FIG. 11 for two continuous BGs 1180 and 1185, which are both extracted.
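A compact sketch of this search-and-extract procedure follows. It reuses the hypothetical STATE_PATTERNS encoding introduced earlier and is a simplified illustration; in particular, anchoring START_1 at the first segment and END_1 at the last segment is an assumption.

```python
def decompose_into_states(events, segments, state_patterns):
    """Search a video's per-segment labels for each state pattern and collect
    the designated segments (frames or precomputed features) into per-state
    pools, mirroring the search/extract walk-through above."""
    pools = {name: [] for name in state_patterns}
    n = len(events)
    for name, spec in state_patterns.items():
        pattern, keep = spec["pattern"], spec["extract"]
        positions = range(n - len(pattern) + 1)
        if spec.get("at_start"):
            positions = [0]
        if spec.get("at_end"):
            positions = [n - len(pattern)]
        for i in positions:
            if list(events[i:i + len(pattern)]) == pattern:
                pools[name].append(tuple(segments[i + k] for k in keep))
    return pools
```

The per-state pools produced this way would then be stored, per class (for example, a male speaking), in the state specific databases 1210, 1213, 1215, 1218.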
A given video can consist of an event sequence, and a state sequence can be extracted pertaining to this specific event sequence. As a non-limiting example, visual stream 130 and audio stream 170 are of a male speaking. The databases of a male speaking 1210, 1213, 1215, and 1218 are queried, and features pertaining to a particular state of the male speaking are randomly sampled. These random samples can be stitched together to create the replica visual stream 1254 and audio stream 1258.
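The sampling and stitching step can be sketched as follows. This is a minimal illustration that assumes the original video's state sequence has already been extracted and that each state database holds tuples of segments for the relevant class; overlaps between neighbouring state extractions are ignored for simplicity.

```python
import random

def generate_replica(state_sequence, state_databases):
    """Build a replica by sampling, for each state in the original video's
    state sequence, a stored sub-sequence of the same state (and class) from
    the state-specific databases, then stitching the samples together."""
    replica_segments = []
    for state in state_sequence:
        sample = random.choice(state_databases[state])  # random, same-state sample
        replica_segments.extend(sample)                 # stitch onto the replica
    return replica_segments
```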
The replica event sequence 1280 includes the same events in the same order as the original event sequence 770. The visual frames and audio frames of the original video and the replica compare as follows:
1. First frame: visual frame 140 and audio frame 180 are the same in both the original and the replica.
2. Second frame: visual frame 142 and audio frame 182 are the same in both the original and the replica.
3. Third frame: visual frame 144 and audio frame 184 of the original and visual frame 1262 and audio frame 1270 of the replica are all of a male speaking.
4. Fourth frame: visual frame 146 and audio frame 186 of the original and visual frame 1264 and audio frame 1272 of the replica are all of a male speaking.
5. Fifth frame: visual frame 148 and audio frame 188 are the same in both the original and the replica.
6. Sixth frame: visual frame 150 and audio frame 190 are the same in both the original and the replica.
7. Seventh frame: visual frame 152 and audio frame 192 are the same in both the original and the replica.
8. Eighth frame: visual frame 154 and audio frame 194 of the original and visual frame 1266 and audio frame 1274 of the replica are all of a male speaking.
9. Ninth frame: visual frame 156 and audio frame 196 of the original and visual frame 1268 and audio frame 1276 of the replica are all of a male speaking.
10. Tenth frame: visual frame 158 and audio frame 198 are the same in both the original and the replica.
Since the replica can be generated randomly, in totality the replica can be new and may not include all the same frames as the original. However, the replica's state sequence can preserve the semantics of the original video's state sequence and can as a result increase diversity.
FIG. 13 illustrates link augmentation based on the diversity resulting from the replica. Link augmentation can be performed to update the nodes of the original graph to include the interconnectivity of the nodes of the replica, because both the original and the replica can have the same event sequence. Node V2 212 was previously linked to V1 210 and V. 214 and also to A2 252. Node V2 212 can additionally be linked to replica nodes A2 1340, V1 1310, and V. 1330. The nodes of the original sample graph and the replica graph can be linked based on the temporal direction and cross modal relationships. Furthermore, since replica nodes A2 1340, V1 1310, and V. 1330 are linked to replica node V2 1320, original nodes V1 210, V. 214 and A2 252 can also be linked to replica node V2 1320 to increase diversity during training.
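A sketch of this link augmentation on an edge list follows. The (source, destination, relation) representation and the node-correspondence mapping are assumptions made for illustration; in practice the edges would carry the temporal-direction and cross-modal relation types described above.

```python
def augment_links(original_edges, replica_edges, to_replica):
    """Replicate-and-link: for every edge of the original graph, add an edge
    from the original endpoint to the replica counterpart of the other
    endpoint, so each original node also "sees" the replica's neighbourhood
    (and vice versa)."""
    augmented = set(original_edges) | set(replica_edges)
    for (u, v, rel) in original_edges:
        if v in to_replica:
            augmented.add((u, to_replica[v], rel))  # original node -> replica neighbour
        if u in to_replica:
            augmented.add((to_replica[u], v, rel))  # replica node -> original neighbour
    return augmented
```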
FIG. 14 illustrates an embodiment of the disclosure that can provide a system architecture 1400. As shown in the system architecture 1400, a data collection device 1460 can be configured to collect training data and store this training data in database 1430. The training data in this embodiment of this application can include extracted states in a particular state database. A training device 1420 can generate a target model/rule 1401 based on the training data maintained in database 1430. Training device 1420 can obtain the target model/rule 1401 based on the training data. The target model/rule 1401 can be used to implement a DPN. Training data maintained in database 1430 may not necessarily be collected by the data collection device 1460 but may be obtained from another device. In addition, it should be appreciated that the training device 1420 may not necessarily train the target model/rule 1401 fully based on the training data maintained by database 1430 but may perform model training on training data obtained from a cloud end or another location. The foregoing description shall not be construed as a limitation of this embodiment of this application.
The target model/rule 1401 can be obtained through training via training device 1420. Training device 1420 can be applied to different systems or devices. As a non-limiting example, training device 1420 can be applied to an execution device 1410. Execution device 1410 can be a terminal such as, as non-limiting examples, a mobile terminal, a tablet computer, a notebook computer, an AR/VR device, or an in-vehicle terminal, or can be a server, a cloud end, or the like. Execution device 1410 can be provided with an I/O interface 1412 which can be configured to perform data interaction with an external device. A user can input data to the I/O interface 1412 via customer device 1440.
A preprocessing module 1413 can be configured to perform preprocessing based on the input data received from I/O interface 1412.
A preprocessing module 1414 can be configured to perform preprocessing based on the input data received from the I/O interface 1412.
Embodiments of the present disclosure can include a related processing process in which the execution device 1410 can perform preprocessing of the input data, or the computation module 1411 of execution device 1410 can perform computation. Execution device 1410 may invoke data, code, or the like from a data storage system 1450 to perform corresponding processing, or may store in the data storage system 1450 data, one or more instructions, or the like obtained through corresponding processing.
I/O interface 1412 can return a processing result to customer device 1440.
It should be appreciated that training device 1420 may generate a corresponding target model/rule 1401 for different targets or different tasks based on different training data. The corresponding target model/rule 1401 can be used to implement the foregoing target or accomplish the foregoing task.
Embodiments of FIG. 14 can enable a user to manually specify input data. The user can perform an operation on a screen provided by the I/O interface 1412.
Embodiments of FIG. 14 can enable customer device 1440 to automatically send input data to I/O interface 1412. If the customer device 1440 needs to automatically send input data, authorization from the user can be obtained, and the user can specify a corresponding permission using customer device 1440. The user may view, using customer device 1440, the result output by execution device 1410. A specific presentation form may be display content, voice, action, and the like. In addition, customer device 1440 may be used as a data collector to collect, as new sampling data, the input data that is input to the I/O interface 1412 and the output result that is output by the I/O interface 1412. The new sampling data can be stored in database 1430. Alternatively, the data may not be collected by customer device 1440; instead, I/O interface 1412 can directly store, as new sampling data in database 1430, the input data that is input to I/O interface 1412 and the output result that is output from I/O interface 1412.
It should be appreciated that FIG. 14 is a schematic diagram of a system architecture according to an embodiment of the present disclosure. Position relationships between the device, component, module, and the like that are shown in FIG. 14 do not constitute any limitation.
FIG. 15 illustrates an embodiment of this disclosure that can include a CNN 1500, which may include an input layer 1510, a convolutional layer/pooling layer 1520 (the pooling layer can be optional), and a neural network layer 1530.
The convolutional layer/pooling layer 1520 illustrated by FIG. 15 may include, as a non-limiting example, layers 1521 to 1526. In one implementation, layer 1521 can be a convolutional layer, layer 1522 a pooling layer, layer 1523 a convolutional layer, layer 1524 a pooling layer, layer 1525 a convolutional layer, and layer 1526 a pooling layer. In other implementations, layers 1521 and 1522 can be convolutional layers, layer 1523 a pooling layer, layers 1524 and 1525 convolutional layers, and layer 1526 a pooling layer. In other words, an output from a convolutional layer may be used as an input to a following pooling layer, or may be used as an input to another convolutional layer to continue the convolution operation.
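As a purely illustrative sketch (in PyTorch, which the disclosure does not mandate; the channel counts and kernel sizes are assumptions), the two interleavings described above could look as follows:

```python
import torch.nn as nn

# One possible interleaving of layers 1521-1526 (conv, pool, conv, pool, conv, pool).
stack_a = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 1521: convolutional layer
    nn.MaxPool2d(2),                              # 1522: pooling layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 1523: convolutional layer
    nn.MaxPool2d(2),                              # 1524: pooling layer
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 1525: convolutional layer
    nn.MaxPool2d(2),                              # 1526: pooling layer
)

# The alternative interleaving, where convolutional layers feed further convolution
# before pooling.
stack_b = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 1521: convolutional layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 1522: convolutional layer
    nn.MaxPool2d(2),                              # 1523: pooling layer
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 1524: convolutional layer
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # 1525: convolutional layer
    nn.MaxPool2d(2),                              # 1526: pooling layer
)
```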
The convolutional layer 1521 may include a plurality of convolutional operators. A convolutional operator can also be referred to as a kernel. The role of the convolutional operator in image processing can be equivalent to a filter that extracts specific information from an input image matrix. The convolutional operator may be a predefined weight matrix. In the process of performing a convolution operation on an image, the weight matrix can be slid over the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of a stride) in a horizontal direction to extract a specific feature from the image. The size of the weight matrix can be related to the size of the image. It should be noted that the depth dimension of the weight matrix can be the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends through the entire depth of the input image. Therefore, after convolution is performed with a single weight matrix, a convolutional output with a single depth dimension is produced. However, a single weight matrix is generally not used; instead, a plurality of weight matrices with the same dimensions (rows x columns), in other words a plurality of same-type matrices, can be used. The outputs of the weight matrices can be stacked to form the depth dimension of the convolutional image, and the depth can be understood to be determined by the foregoing "plurality". Different weight matrices may be used to extract different features from the image. For example, one weight matrix can be used to extract image edge information, another weight matrix can be used to extract a specific color from the image, and still another weight matrix can be used to blur unneeded noise from the image. The plurality of weight matrices have the same size (rows x columns), so the feature maps obtained after extraction by these weight matrices also have the same size, and the extracted feature maps of the same size can be combined to form the output of the convolution operation.
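The depth behaviour described above (each kernel spans the full input depth, and the outputs of the kernels are stacked to form the output depth) can be shown with a naive NumPy sketch; it is illustrative only and makes no claim about the kernels actually used.

```python
import numpy as np

def conv2d_multi_kernel(image, kernels, stride=1):
    """Naive convolution: `image` is H x W x C, `kernels` is a list of
    k x k x C weight matrices (each spans the full input depth). The outputs
    of the individual kernels are stacked, so the depth of the result equals
    the number of kernels."""
    k = kernels[0].shape[0]
    h = (image.shape[0] - k) // stride + 1
    w = (image.shape[1] - k) // stride + 1
    out = np.zeros((h, w, len(kernels)))
    for d, kern in enumerate(kernels):
        for i in range(h):
            for j in range(w):
                patch = image[i * stride:i * stride + k, j * stride:j * stride + k, :]
                out[i, j, d] = np.sum(patch * kern)
    return out
```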
In an actual application, the weight values in the weight matrices can be obtained through a large amount of training. The weight matrices formed by the trained weight values may be used to extract information from an input image, so that the convolutional neural network 1500 can perform accurate prediction.
When the convolutional neural network 1500 has a plurality of convolutional layers, an initial convolutional layer (such as layer 1521) can extract a relatively large quantity of common features. A common feature may also be referred to as a low-level feature. As the depth of the convolutional neural network 1500 increases, a feature extracted by a deeper convolutional layer (such as layer 1526) becomes more complex, such as, as a non-limiting example, a feature with high-level semantics. A feature with higher-level semantics can be more applicable to the to-be-resolved problem.
Because the quantity of training parameters often needs to be reduced, a pooling layer usually periodically follows a convolutional layer. To be specific, at the layers 1521 to 1526 shown in 1520 in FIG. 15, one pooling layer may follow one convolutional layer, or one or more pooling layers may follow a plurality of convolutional layers. In an image processing process, the purpose of the pooling layer can be to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator to perform sampling on the input image to obtain an image of a relatively small size. The average pooling operator may calculate the pixel values in the image within a specific range to generate an average value as an average pooling result. The maximum pooling operator may obtain, as a maximum pooling result, the pixel with the largest value within the specific range. In addition, just as the size of the weight matrix in the convolutional layer can be related to the size of the image, the operator at the pooling layer can also be related to the size of the image. The size of the image output after processing by a pooling layer may be smaller than the size of the image input to the pooling layer. Each pixel in the image output by the pooling layer indicates an average value or a maximum value of a corresponding sub-area of the image input to the pooling layer.
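A minimal NumPy sketch of the two pooling operators, assuming non-overlapping windows on a single-channel image, is shown below; the window size is an assumption.

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Average or maximum pooling over non-overlapping `size` x `size`
    windows of a single-channel image; the output is spatially smaller, and
    each output pixel summarises one sub-area of the input."""
    h, w = image.shape[0] // size, image.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = image[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out
```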
After the image is processed by the convolutional layer/pooling layer 1520, the convolutional neural network 1500 may still be incapable of outputting the desired output information. As described above, the convolutional layer/pooling layer 1520 can extract features and reduce the parameters brought by the input image. However, to generate the final output information (desired category information or other related information), the convolutional neural network 1500 can generate an output for one desired category or a group of desired categories by using the neural network layer 1530. Therefore, the neural network layer 1530 may include a plurality of hidden layers (such as layers 1531, 1532, to 153n in FIG. 15) and an output layer 1540. Parameters included in the plurality of hidden layers may be obtained by performing pre-training based on related training data of a specific task type. For example, the task type may include image recognition, image classification, or the like.
The output layer 1540 can follow the plurality of hidden layers in the neural network layer 1530. In other words, the output layer 1540 can be the final layer in the entire convolutional neural network 1500. The output layer 1540 can include a loss function, similar to categorical cross-entropy, that is specifically used to calculate a prediction error. Once forward propagation (propagation in the direction from 1510 to 1540 in FIG. 15) is complete in the entire convolutional neural network 1500, back propagation (propagation in the direction from 1540 to 1510 in FIG. 15) starts to update the weight values and offsets of the foregoing layers, so as to reduce the loss of the convolutional neural network 1500 and the error between an ideal result and the result output by the convolutional neural network 1500 by using the output layer.
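As an illustrative sketch (again in PyTorch, with the layer sizes, number of categories, and optimizer chosen purely as assumptions), one forward/backward pass through hidden layers and an output layer with a cross-entropy loss could look as follows:

```python
import torch
import torch.nn as nn

# Hypothetical head: hidden layers 1531..153n followed by output layer 1540.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256), nn.ReLU(),   # hidden layers (sizes assumed)
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),                      # output layer: 10 assumed categories
)
criterion = nn.CrossEntropyLoss()            # loss similar to categorical cross-entropy
optimizer = torch.optim.SGD(head.parameters(), lr=0.01)

features = torch.randn(4, 64, 8, 8)          # stand-in for the conv/pool output
labels = torch.randint(0, 10, (4,))

logits = head(features)                      # forward propagation (1510 -> 1540)
loss = criterion(logits, labels)             # prediction error at the output layer
optimizer.zero_grad()
loss.backward()                              # back propagation (1540 -> 1510)
optimizer.step()                             # update weights to reduce the loss
```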
It should be noted that the convolutional neural network 1500 shown in FIG. 15 is merely used as an example of a convolutional neural network. In actual application, the convolutional neural network may exist in a form of another network model.
Additional details are included in the enclosed appendix, which forms part of this disclosure and is incorporated by reference in its entirety.
Although the present disclosure has been described with reference to specific features and embodiments thereof, it is evident that various modifications and combinations can be made thereto without departing from the disclosure. The specification and drawings are, accordingly, to be regarded simply as an illustration of the invention as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present disclosure.

Claims (4)

  1. A method for dual perspective processing of a video, the method comprising:
    extracting a video sequence from a visual stream of the video and extracting an audio sequence from an audio stream of the video associated with the visual stream;
    converting the video sequence into a video graph using a sequence to graph perspective change and converting the audio sequence into an audio graph using the sequence to graph perspective change;
    processing the video and audio graph together using a relational graph neural network (RGNN) , the RGNN also creating a processed video graph;
    converting the processed video and audio graph into a processed video and audio sequence using a graph to sequence perspective change; and
    processing the processed video sequence using a sequence processor and processing the processed audio sequence using the sequence processor.
  2. The method as claimed in claim 1 in which the RGNN is a relational graph convolutional transformer (RGCT) .
  3. A method for constructing a relational graph convolutional transformer (RGCT) , the method comprising:
    aggregating neighbor nodes of a node of a relational graph, wherein the neighbor nodes are all nodes in a relational neighborhood of the node;
    concatenating a query of the node over different types of relations that include the query;
    developing a key transformation related to the aggregated neighbor nodes;
    developing a value transformation related to the aggregated neighbor nodes;
    developing an attention map using the concatenated query of the node, a transpose of the key transformation, and the value transformation, the attention map being developed to derive a relational node update; and
    transforming a representation of the node resulting from an average of the relational node update.
  4. A method for performing a replicate and link data augmentation, the method comprising:
    decomposing a first video into any combination of:
    one or more visual frames;
    one or more audio frames;
    one or more audio features which have been statistically precomputed or the output of a neural network;
    one or more visual features which have been statistically precomputed or the output of the neural network;
    wherein the any combination of visual frames, audio frames, audio features and visual features is stored in a plurality of specific state databases, and the decomposition includes extracting first video states from the first video;
    extracting a state sequence from a second video, the state sequence comprised of extracted second video states;
    sampling randomly the frames from the plurality of specific state databases, the frames stored when the first video states are equivalent to the second video states; and
    stitching the stored frames together to create a replica.
PCT/CN2022/117415 2021-09-07 2022-09-07 Methods and devices for audio visual event localization based on dual perspective networks WO2023036159A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163241346P 2021-09-07 2021-09-07
US63/241,346 2021-09-07

Publications (1)

Publication Number Publication Date
WO2023036159A1 true WO2023036159A1 (en) 2023-03-16

Family

ID=85506079

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/117415 WO2023036159A1 (en) 2021-09-07 2022-09-07 Methods and devices for audio visual event localization based on dual perspective networks

Country Status (1)

Country Link
WO (1) WO2023036159A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210045710A1 (en) * 2018-04-30 2021-02-18 Atherosys, Inc. Method and apparatus for the automatic detection of atheromas in peripheral arteries
CN111723239A (en) * 2020-05-11 2020-09-29 华中科技大学 Multi-mode-based video annotation method
CN112613442A (en) * 2020-12-29 2021-04-06 苏州元启创人工智能科技有限公司 Video sequence emotion recognition method based on principle angle detection and optical flow conversion
CN112861945A (en) * 2021-01-28 2021-05-28 清华大学 Multi-mode fusion lie detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MENGMENG XU; CHEN ZHAO; DAVID S. ROJAS; ALI THABET; BERNARD GHANEM: "G-TAD: Sub-Graph Localization for Temporal Action Detection", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 26 November 2019 (2019-11-26), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081636354 *
XU HAOMING XUHAOMING.CS@GMAIL.COM; ZENG RUNHAO RUNHAOZENG.CS@GMAIL.COM; WU QINGYAO QYW@SCUT.EDU.CN; TAN MINGKUI MINGKUITAN@SCUT.ED: "Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization", PROCEEDINGS OF THE 13TH ACM SIGPLAN INTERNATIONAL SYMPOSIUM ON HASKELL, ACMPUB27, NEW YORK, NY, USA, 12 October 2020 (2020-10-12) - 27 August 2020 (2020-08-27), New York, NY, USA , pages 3893 - 3901, XP058730975, ISBN: 978-1-4503-8050-8, DOI: 10.1145/3394171.3413581 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934754A (en) * 2023-09-18 2023-10-24 四川大学华西第二医院 Liver image identification method and device based on graph neural network
CN116934754B (en) * 2023-09-18 2023-12-01 四川大学华西第二医院 Liver image identification method and device based on graph neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22866622

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE