CN117058601A - Video space-time positioning network and method of cross-modal network based on Gaussian kernel - Google Patents

Publication number
CN117058601A
Authority
CN
China
Prior art keywords
network
feature
video
modal
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311326092.8A
Other languages
Chinese (zh)
Inventor
周潘
熊泽雨
朱佳昊
彭洋
徐子川
袁增辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN202311326092.8A
Publication of CN117058601A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G06V10/245 Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video space-time positioning network of a cross-modal network based on a Gaussian kernel, comprising: a hybrid convolutional network for extracting a sequence feature M_ser and a temporal residual feature M_3 from the cross-modal interaction feature F_f2 through a plurality of second convolutional networks connected in series and in parallel, and for fusing the interaction feature F_f2, the sequence feature M_ser and the temporal residual feature M_3 to obtain a mixed feature M_mix; and a positioning module for spatially positioning a preset text through a reconstruction of the mixed feature M_mix and a Gaussian kernel heat map, and for temporally positioning the preset text through the mixed feature M_mix and a third convolutional network. By deploying a hybrid network of serial and parallel connections, the invention learns spatial non-local information and temporal modeling of the multi-modal representation; it then uses the Gaussian heat map and confidence scores to achieve anchor-free video space-time positioning, improving positioning accuracy and precision.

Description

Video space-time positioning network and method of cross-modal network based on Gaussian kernel
Technical Field
The invention belongs to the technical field of video semantics and deep learning, and particularly relates to a video space-time positioning network and a video space-time positioning method of a cross-modal network based on a Gaussian kernel.
Background
Video localization with natural language is a fundamental but challenging problem with great potential for applications in visual-language understanding. In general, it can be divided into three categories: spatial localization, temporal localization and spatio-temporal localization. Among these, spatio-temporal video grounding (STVG) is more challenging, because it requires not only modeling complex multi-modal interactions for semantic alignment, but also retrieving both the spatial location and the temporal duration of the target activity. As shown in FIG. 1, STVG aims to locate the spatio-temporal position of the queried object from a given textual description.
Most previous spatial or temporal video localization techniques address the problem either by directly detecting the foreground objects in each video frame for object-correlation learning, or by regressing the boundaries.
Recent work on the STVG task solves the localization problem in a more general way and is able to locate the spatio-temporal tube of the queried object in untrimmed video. However, all of these localization methods suffer from the following drawbacks: (1) They depend heavily on the quality of the detection model. Furthermore, they typically pre-generate proposal regions from detected anchor boxes, which makes the computation time-consuming. (2) They typically process video frames individually, without regard to the temporal correlation between consecutive frames.
Disclosure of Invention
In order to solve the problems that space-time video positioning depends on a detection model, is computationally time-consuming, and does not consider the temporal correlation between consecutive frames, a first aspect of the present invention provides a video space-time positioning network of a cross-modal network based on a Gaussian kernel, comprising: an acquisition module for acquiring a preset text and a video frame to be positioned; an encoding module for encoding the text and the video frame respectively through a pre-trained first convolutional network to obtain a video feature v and a text feature S, and for performing cross-modal interaction on the video feature v and the text feature S to obtain a cross-modal interaction feature F_f2; a hybrid convolutional network for extracting a sequence feature M_ser and a temporal residual feature M_3 from the cross-modal interaction feature F_f2 through a plurality of second convolutional networks connected in series and in parallel, and for fusing the interaction feature F_f2, the sequence feature M_ser and the temporal residual feature M_3 to obtain a mixed feature M_mix; and a positioning module for spatially positioning the preset text through a reconstruction of the mixed feature M_mix and a Gaussian kernel heat map, and for temporally positioning the preset text through the mixed feature M_mix and a third convolutional network.
In some embodiments of the invention, the hybrid convolutional network comprises: a serial connection network for extracting the sequence feature M_ser from the cross-modal interaction feature F_f2; a parallel connection network for extracting the temporal residual feature M_3 from the cross-modal interaction feature F_f2; and a mixing module for fusing the interaction feature F_f2, the sequence feature M_ser and the temporal residual feature M_3 to obtain the mixed feature M_mix.
In some embodiments of the invention, the positioning module comprises: a spatial positioning head for spatially positioning the preset text through a reconstruction of the mixed feature M_mix and the Gaussian kernel heat map; and a time positioning head for temporally positioning the preset text through the mixed feature M_mix and the third convolutional network.
Further, the spatial positioning head includes: an up-sampling unit for up-sampling the mixed feature M_mix to obtain a feature map M_up; a Gaussian kernel unit for treating the video frames as a series of heat maps with the query object as the heat-source center and describing the probability distribution of the object position with a Gaussian kernel; a point positioning unit for learning a predicted heat map under the supervision of the Gaussian kernel heat map and locating the key point of the preset text; and a size regression unit for performing size regression on the feature map M_up to define the size of the object corresponding to the preset text.
Further, the time positioning head includes: an input layer for placing the mixed feature M_mix into preset temporal position intervals to obtain one or more first time boundaries; an embedding layer for extracting features of different temporal lengths from the first time boundaries and enhancing their temporal relationships through a self-attention mechanism to obtain one or more second time boundaries; a confidence unit for evaluating the confidence of the one or more second time boundaries and selecting the second time boundary with the highest confidence; and a boundary regression unit for regressing the selected second time boundary and adjusting its offset.
In the above embodiment, the first convolutional network is an I3D network.
In the above embodiment, the third convolutional network includes at least one 3D convolution layer, one Bi-GRU and one 1D convolution layer.
A second aspect of the invention provides a video space-time positioning method of a cross-modal network based on a Gaussian kernel, comprising the following steps: acquiring a preset text and a video frame to be positioned; encoding the text and the video frame respectively through a pre-trained first convolutional network to obtain a video feature v and a text feature S; performing cross-modal interaction on the video feature v and the text feature S to obtain a cross-modal interaction feature F_f2; extracting a sequence feature M_ser and a temporal residual feature M_3 from the cross-modal interaction feature F_f2 through a plurality of second convolutional networks connected in series and in parallel; fusing the interaction feature F_f2, the sequence feature M_ser and the temporal residual feature M_3 to obtain a mixed feature M_mix; spatially positioning the preset text through a reconstruction of the mixed feature M_mix and a Gaussian kernel heat map; and temporally positioning the preset text through the mixed feature M_mix and a third convolutional network.
In a third aspect of the present invention, there is provided an electronic device comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video space-time positioning method of the cross-modal network based on a Gaussian kernel provided in the second aspect of the invention.
In a fourth aspect of the present invention, a computer-readable medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the video space-time positioning method of the cross-modal network based on a Gaussian kernel provided in the second aspect of the invention.
The beneficial effects of the invention are as follows:
The invention proposes GKCMN (Gaussian kernel-based cross-modal network), the first anchor-free model for spatio-temporal video grounding, which uses Gaussian kernels to highlight the regions most relevant to the target semantics. A hybrid convolutional network is also designed to capture temporal and spatial information. Evaluation on the VidSTG dataset demonstrates the accuracy, precision and effectiveness of the GKCMN model.
Drawings
FIG. 1 is a schematic illustration of the video spatio-temporal localization task;
FIG. 2 is a schematic diagram of the basic structure of a video space-time positioning network of a cross-modal network based on a Gaussian kernel in some embodiments of the invention;
FIG. 3 is an overall framework diagram of the GKCMN network in some embodiments of the invention;
FIG. 4 is a schematic diagram of the Gaussian-kernel-based spatial localization principle in some embodiments of the invention;
FIG. 5 is a schematic diagram of temporal localization in some embodiments of the invention;
FIG. 6 is a flow diagram of a video space-time positioning method of a cross-modal network based on a Gaussian kernel in some embodiments of the invention;
FIG. 7 is a schematic structural diagram of an electronic device according to some embodiments of the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings. The examples are given only to illustrate the invention and are not to be construed as limiting its scope.
Referring to FIG. 1, a schematic illustration of the video spatio-temporal localization task is shown. The text query in the figure is: "A child is taking a bath with a pair of dark glasses". The task is to find, in a video lasting 30.9 s, the child wearing sunglasses who is taking a bath; the query object is the child wearing sunglasses, and the result is the interval from 6.23 s to 30.69 s of the video, in which the query object matches the textual description. The textual description is not limited to a sentence; it may also be a phrase or a combination of one or more words. The 6.23 s to 30.69 s video clip may also be referred to as a "spatio-temporal tube". Without loss of generality, the STVG task is defined as follows: given an untrimmed video V and a sentence S, retrieve a spatio-temporal tube U in V that corresponds to the semantics of S.
Referring to FIG. 2 or FIG. 3, a first aspect of the invention provides a video space-time positioning network 1 of a cross-modal network based on a Gaussian kernel, comprising: an acquisition module 11 for acquiring a preset text and a video frame to be positioned; an encoding module 12 for encoding the text and the video frame respectively through a pre-trained first convolutional network to obtain a video feature v and a text feature S, and for performing cross-modal interaction on the video feature v and the text feature S to obtain a cross-modal interaction feature F_f2; a hybrid convolutional network 13 for extracting a sequence feature M_ser and a temporal residual feature M_3 from the cross-modal interaction feature F_f2 through a plurality of second convolutional networks connected in series and in parallel, and for fusing the interaction feature F_f2, the sequence feature M_ser and the temporal residual feature M_3 to obtain a mixed feature M_mix; and a positioning module 14 for spatially positioning the preset text through a reconstruction of the mixed feature M_mix and a Gaussian kernel heat map, and for temporally positioning the preset text through the mixed feature M_mix and a third convolutional network.
It should be noted that 1D (one-dimensional), 2D (two-dimensional) and 3D (three-dimensional) convolution layers, attention layers, Bi-GRU layers and FC layers may each form part or all of a convolutional network. This does not affect how those skilled in the art divide or describe the network's functions, and the remaining parts of the network can be completed with existing techniques from the deep-learning field to realize the corresponding feature extraction, feature fusion, temporal modeling or residual addition. Features related to an image are typically multi-dimensional vectors or tensors and can also be regarded as feature maps; features related to text are typically word-embedding vectors (word vectors for short) and can also be regarded as multi-dimensional vectors or tensors. The fusion of video features, text features and temporal features can therefore be understood as a multi-modal or cross-modal interaction process. The Gaussian kernel-based cross-modal network may also be referred to as a Gaussian kernel-based cross-modal network system or a Gaussian kernel-based cross-modal model.
In some embodiments of the present invention, the encoding module 12 is alternatively referred to as an encoder; encoding module 12 includes a video encoder, a sentence encoder (text encoder), and a cross-modality interaction unit. Wherein:
Video encoder: to encode the video frames, we extract the final convolutional-layer features of a pre-trained I3D network as the video feature V, where T denotes the number of frames, v_t denotes the 2D feature of the t-th frame, D is the visual feature dimension, H and W are the height and width of the input frame, and r_h and r_w are scaling factors.
Sentence encoder: for sentence encoding, we first extract word-level features through GloVe embedding and then use a self-attention module to capture the dependencies between words. Bi-GRUs are further used to learn their sequence features, and the final sentence feature is denoted S, whose n-th element is the feature of the n-th word; D is the word feature dimension.
Cross-modal interaction unit: first, the text tensor is repeated into a tensor S_r with the same shape as the visual tensor, and the multi-modal matrix is obtained by fusing V with S_r. Next, the cross-modal interaction feature F_f2 is obtained by:
(1),
where g(·) is a nonlinear activation function, and W_v1, W_s1, W_f2 and W_v2 are learnable parameters.
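Equation (1) is not reproduced in this text, so the following NumPy sketch only illustrates the tile-and-fuse idea described above and is not the patented formula. The mean pooling over words, the element-wise product, the `tanh` activation and the projection matrices `W_v` and `W_s` are all assumptions made for illustration.

```python
import numpy as np

def cross_modal_fuse(V, S, W_v, W_s, g=np.tanh):
    """Tile a sentence vector over the visual grid and fuse the two modalities.

    V: (T, h, w, D_v) visual features; S: (N, D_s) word features.
    W_v: (D_v, D) and W_s: (D_s, D) are hypothetical projection matrices.
    """
    T, h, w, _ = V.shape
    s = S.mean(axis=0)                                # assumed pooling over words
    S_r = np.broadcast_to(s, (T, h, w, s.shape[0]))   # repeat to the visual shape
    # Assumed fusion: project both modalities, multiply element-wise, activate.
    return g((V @ W_v) * (S_r @ W_s))

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 7, 7, 32))     # 4 frames on a 7x7 grid, 32 channels
S = rng.normal(size=(20, 300))         # 20 words, 300-dimensional embeddings
F = cross_modal_fuse(V, S, rng.normal(size=(32, 16)), rng.normal(size=(300, 16)))
print(F.shape)
```

The output keeps the spatio-temporal layout of the visual tensor, which is what allows the later convolutional stages to operate on the fused feature.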
It will be appreciated, with reference to FIG. 5, that the encoding module encodes the video clips by extracting their spatio-temporal features from the last convolution layer of I3D. For sentence-query encoding, word-level features are first extracted individually by GloVe, and their sequence information is then embedded by a Bi-GRU. Visual and sentence semantics are then aligned by the cross-modal interaction unit. In the subsequent stages, the fused feature F_f2 is used as the input from which the spatio-temporal relationships are extracted.
Referring to FIG. 3, in some embodiments of the invention, the hybrid convolutional network 13 comprises: a serial connection network for extracting the sequence feature M_ser from the cross-modal interaction feature F_f2; a parallel connection network for extracting the temporal residual feature M_3 from the cross-modal interaction feature F_f2; and a mixing module for fusing the interaction feature F_f2, the sequence feature M_ser and the temporal residual feature M_3 to obtain the mixed feature M_mix. That is, the serial and parallel connection networks build the spatio-temporal relationships along the depth and the width of the network, respectively.
Serial connection network: any 3D signal F_f2 is first reshaped to 2D, and the spatial features of each frame are learned by a 2D convolution block. Next, the temporal relationships between activities are learned by a 3D convolution. The output M_ser of the serial connection network can thus be expressed as:
(2),
where the operator in equation (2) denotes the convolution operation and K_i denotes the i-dimensional convolution kernel.
Parallel connection network: similarly, static and temporal information is learned and fused by 2D and 3D CNN blocks using parallel structures.
(3),
Hybrid convolutional network: the hybrid convolutional network combines the serial and parallel connections with a residual structure. As shown in FIG. 3, the original feature map F_f2, the sequence feature M_ser and the temporal residual feature M_3 are fused through a triple parallel structure:
(4),
That is, the sequence feature M_ser and the temporal residual feature M_3 are fused with F_f2 to obtain the mixed feature M_mix.
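Equations (2) to (4) are not reproduced in this text. As a rough sketch of the data flow only, the NumPy code below replaces the learned 2D and 3D convolution blocks with fixed averaging kernels and assumes that the branches are combined by plain addition; the helper names `box_conv` and `hybrid_features` are hypothetical.

```python
import numpy as np

def box_conv(x, axes, k=3):
    """Zero-padded k-wide averaging along each listed axis (a fixed-kernel convolution)."""
    for ax in axes:
        n = x.shape[ax]
        pad = [(0, 0)] * x.ndim
        pad[ax] = (k // 2, k // 2)
        xp = np.pad(x, pad)
        x = sum(np.take(xp, np.arange(i, i + n), axis=ax) for i in range(k)) / k
    return x

def hybrid_features(F):
    """F: (T, H, W, D) fused features -> (M_ser, M_3, M_mix), all of the same shape."""
    spatial = box_conv(F, axes=(1, 2))           # per-frame stand-in for the 2D block
    temporal = box_conv(F, axes=(0, 1, 2))       # stand-in for the 3D block
    M_ser = box_conv(spatial, axes=(0, 1, 2))    # serial: 2D block, then 3D block
    M_3 = spatial + temporal                     # parallel: both blocks applied to F
    M_mix = F + M_ser + M_3                      # assumed residual-style fusion
    return M_ser, M_3, M_mix

F = np.ones((5, 5, 5, 2))
M_ser, M_3, M_mix = hybrid_features(F)
print(M_mix.shape)
```

The serial branch stacks the two blocks (depth), while the parallel branch applies them side by side to the same input (width); in a real model both would be learned layers.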
Referring to FIG. 4, in some embodiments of the invention, the positioning module 14 includes: a spatial positioning head for spatially positioning the preset text through a reconstruction of the mixed feature M_mix and the Gaussian kernel heat map; and a time positioning head for temporally positioning the preset text through the mixed feature M_mix and the third convolutional network.
Specifically, a spatial positioning head based on a Gaussian kernel is constructed to predict the bounding box of the query object. First, M_mix is up-sampled to obtain the feature map M_up used for spatial localization, where L is the feature-map size.
Further, the spatial positioning head includes: an up-sampling unit for up-sampling the mixed feature M_mix to obtain the feature map M_up;
a Gaussian kernel unit for treating the video frames as a series of heat maps with the query object as the heat-source center and describing the probability distribution of the object position with a Gaussian kernel, as shown in FIG. 4.
Given the annotated boxes, we first map them linearly to the feature-map scale L. For each rescaled box, we find its center coordinates (x_t, y_t). The Gaussian kernel heat map is then given by:
(5),
where σ determines the size of the gaussian kernel.
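Since equation (5) is not reproduced in this text, the sketch below assumes the common isotropic form exp(-((x - x_t)^2 + (y - y_t)^2) / (2σ^2)); the function name, the box format `(x1, y1, x2, y2)` and the value of `sigma` are illustrative assumptions.

```python
import numpy as np

def gaussian_heatmap(box, frame_hw, L=16, sigma=2.0):
    """Build an (L, L) heat map peaked at the centre of an annotated box.

    box = (x1, y1, x2, y2) in input-frame pixels; frame_hw = (H, W).
    """
    H, W = frame_hw
    x1, y1, x2, y2 = box
    # Linearly rescale the box centre to the L x L feature map.
    xc = (x1 + x2) / 2 * L / W
    yc = (y1 + y2) / 2 * L / H
    ys, xs = np.mgrid[0:L, 0:L]
    # Assumed isotropic Gaussian kernel; sigma controls its spatial extent.
    return np.exp(-((xs - xc) ** 2 + (ys - yc) ** 2) / (2 * sigma ** 2))

heat = gaussian_heatmap((56, 84, 112, 140), (224, 224))
peak = np.unravel_index(heat.argmax(), heat.shape)
print(peak)  # (row, column) of the heat-source centre
```

For the 224×224 frame above, the box centre (84, 112) maps to column 6 and row 8 of the 16×16 grid, which is where the heat map reaches its maximum of 1.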
A point positioning unit for learning a predicted heat map under the supervision of the Gaussian kernel heat map and locating the key point of the preset text. In this step, the goal is to learn a predicted heat map supervised by the Gaussian-kernel-based heat map.
The peak of the Gaussian distribution at the key point is treated as a positive sample and the other pixels as negative samples. We modify the focal loss function to:
(6)
where M_foc denotes the number of annotated boxes, α and β are the hyper-parameters of the focal loss, and γ is the hyper-parameter we introduce, which determines the number of positive samples.
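Equation (6) is not reproduced in this text. The sketch below shows one plausible reading, not the patented formula: a penalty-reduced focal loss in which pixels whose target heat value is at least γ are treated as positives. The default values of `alpha` and `beta`, the thresholding rule and the function name are assumptions.

```python
import numpy as np

def thresholded_focal_loss(pred, heat, n_boxes=1, alpha=2.0, beta=4.0,
                           gamma=0.8, eps=1e-6):
    """Focal loss between a predicted heat map and a Gaussian target heat map."""
    pred = np.clip(pred, eps, 1 - eps)
    pos = heat >= gamma                      # assumed rule: gamma selects positives
    pos_term = ((1 - pred) ** alpha * np.log(pred))[pos].sum()
    # Negatives near the peak are down-weighted by (1 - heat) ** beta.
    neg_term = ((1 - heat) ** beta * pred ** alpha * np.log(1 - pred))[~pos].sum()
    return -(pos_term + neg_term) / n_boxes  # normalised by the number of boxes

ys, xs = np.mgrid[0:16, 0:16]
heat = np.exp(-((xs - 6) ** 2 + (ys - 8) ** 2) / (2 * 2.0 ** 2))
good = thresholded_focal_loss(heat, heat)        # prediction equal to the target
bad = thresholded_focal_loss(1 - heat, heat)     # prediction inverted
print(good < bad)
```

Raising γ shrinks the set of positive pixels towards the single peak, while lowering it admits more of the Gaussian's neighbourhood as positives.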
A size regression unit for performing size regression on the feature map M_up to define the size of the object corresponding to the preset text. The object size is defined by a size regression head. Each pixel in the annotated box is considered a regression sample. Given the predicted distances and the ground-truth values {s_t}, we decode the predicted boxes and the corresponding annotated boxes {b_t}, and then use GIoU as the bounding-box regression loss:
(7),
where M_iou denotes the number of regression samples, i.e. the number of pixels (t, x, y) in the annotated regions a_t.
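Equation (7) is not reproduced in this text. The sketch below uses the standard definition of generalized IoU and averages 1 - GIoU over the regression samples; decoding a box from per-pixel distances to its four sides is an assumption about the "predicted distance" mentioned above, and all function names are hypothetical.

```python
def decode_box(x, y, dist):
    """Assumed decoding: dist = (left, top, right, bottom) distances from pixel (x, y)."""
    l, t, r, b = dist
    return (x - l, y - t, x + r, y + b)

def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2) with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

def giou_loss(pred_boxes, gt_boxes):
    """Mean of (1 - GIoU) over all regression samples."""
    return sum(1.0 - giou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)

pred = [decode_box(5, 5, (2, 1, 3, 4))]      # -> (3, 4, 8, 9)
print(giou_loss(pred, [(3, 4, 8, 9)]))       # identical boxes give zero loss
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes because the enclosing-box term still changes as the boxes move apart.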
Further, the time positioning head includes: an input layer for placing the mixed feature M_mix into preset temporal position intervals to obtain one or more first time boundaries;
an embedding layer for extracting features of different temporal lengths from the first time boundaries and enhancing their temporal relationships through a self-attention mechanism to obtain one or more second time boundaries. Specifically, three different 3D convolution layers with kernel sizes of 1, 3 and 5 are deployed to learn features of different temporal lengths. A 3D convolution layer and an average pooling layer are then placed after these layers. Next, a self-attention module is used to enhance the internal relationships along the time series.
A confidence unit for evaluating the confidence of the one or more second time boundaries and selecting the second time boundary with the highest confidence; it is implemented with a Bi-GRU and a 1D convolution layer. We estimate the IoU i ∈ [0, 1] between each generated temporal tube and the corresponding ground truth, and this IoU value serves as the confidence score. A threshold c is then defined, and the score of every tube with i < c is set to zero. We use the smooth l1 loss for confidence evaluation, given by:
(8),
where L_1 is the smooth l1 loss, N_c is the number of tubes, and the predicted term is the IoU score predicted for the n-th tube.
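Equation (8) is not reproduced in this text. The sketch below follows the verbal description: the target of each candidate segment is its temporal IoU with the ground truth, targets below the threshold c are set to zero, and the predicted scores are compared with the targets using the smooth l1 loss. The function names and the transition point of 1.0 in `smooth_l1` are assumptions.

```python
def temporal_iou(seg, gt):
    """IoU of two time intervals given as (start, end)."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    span = max(seg[1], gt[1]) - min(seg[0], gt[0])
    # When the intervals overlap, their span equals their union.
    return inter / span if inter > 0 else 0.0

def smooth_l1(x, y):
    d = abs(x - y)
    return 0.5 * d * d if d < 1.0 else d - 0.5

def confidence_loss(pred_scores, segments, gt, c=0.3):
    """Return the mean smooth-l1 loss and the thresholded IoU targets."""
    targets = []
    for seg in segments:
        iou = temporal_iou(seg, gt)
        targets.append(iou if iou >= c else 0.0)   # scores below c are zeroed
    loss = sum(smooth_l1(p, t) for p, t in zip(pred_scores, targets)) / len(segments)
    return loss, targets

loss, targets = confidence_loss([0.5, 0.0], [(0.0, 5.0), (8.0, 12.0)], gt=(0.0, 10.0))
print(loss, targets)
```

In this example the second segment overlaps the ground truth by only 2 of 12 seconds, so its target falls below c = 0.3 and is set to zero.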
A boundary regression unit for regressing the second time boundary and adjusting its offset. As with the confidence unit, the boundary regression head is implemented with a Bi-GRU and a 1D convolution layer. Each candidate temporal segment has an offset δ_k = (δ_s, δ_e). From the ground-truth boundary (t_s, t_e) and the predicted boundary, we obtain the start offset δ_s and the end offset δ_e. We then compute the smooth l1 distance:
(9),
Thus, the total loss is the weighted sum of the four loss functions described above:
(10),
where α_1, α_2, α_3 and α_4 are the balance parameters of the losses.
To test the performance of the video space-time positioning network of the Gaussian kernel-based cross-modal network disclosed in the above embodiments, it is evaluated in terms of the dataset, the evaluation metrics, the training process and the results. Specifically:
Dataset: we evaluated our method on VidSTG, a large-scale spatio-temporal video grounding dataset containing 5563, 618 and 743 untrimmed videos in the training, validation and test sets, respectively.
Evaluation metrics: following previous work, we evaluate temporal localization performance with m_tIoU (m_t), and apply m_vIoU (m_v) and vIoU@R (v@R) as the evaluation criteria for spatio-temporal accuracy.
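The metric definitions are not spelled out in this text. The sketch below assumes the definitions commonly used with VidSTG: tIoU is the ratio of shared frames to the union of predicted and ground-truth frames, vIoU sums the per-frame box IoU over the shared frames and divides by the number of frames in the union, and vIoU@R is the fraction of samples whose vIoU exceeds R. Representing a tube as a dict from frame index to box is an illustrative choice.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def t_iou(pred, gt):
    """pred, gt: tubes as dicts mapping frame index -> box."""
    p, g = set(pred), set(gt)
    return len(p & g) / len(p | g)

def v_iou(pred, gt):
    p, g = set(pred), set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in p & g) / len(p | g)

def v_iou_at(v_ious, R):
    """Fraction of samples whose vIoU is greater than R."""
    return sum(v > R for v in v_ious) / len(v_ious)

gt = {t: (0, 0, 10, 10) for t in range(2, 6)}      # frames 2..5
pred = {t: (0, 0, 10, 10) for t in range(4, 8)}    # frames 4..7
print(t_iou(pred, gt), v_iou(pred, gt))
```

m_tIoU and m_vIoU are then the means of these per-sample values over the test set.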
Training process: for video preprocessing, we first resize every frame of the input video to 224×224 pixels and feed each group of 8 frames, with a stride of 4, into a pre-trained I3D model, taking the features of its last convolution layer, which have a spatial size of 7×7. The feature dimension d is 32, and the length of the video feature sequence is set to 200. For language encoding, we set the maximum number of words to 20 and use GloVe embeddings to map each word to a 300-dimensional feature vector. For the model settings, we set the number of attention layers to 2, the confidence threshold c to 0.3, γ to 0.8, δ to 0.9, and the feature-map scale L to 16. We use the Adam optimizer with a learning rate of 0.003.
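As a small illustration of these settings, the sketch below collects the quoted hyper-parameters in a dictionary and computes the clip start indices, assuming that "8 frames with step size 4" means a sliding window of length 8 advanced by 4 frames. The key names and the helper `clip_starts` are hypothetical.

```python
def clip_starts(num_frames, clip_len=8, stride=4):
    """Start indices of the sliding windows fed to the video encoder."""
    return list(range(0, num_frames - clip_len + 1, stride))

# Hyper-parameters quoted in the text; the key names are illustrative only.
CONFIG = {
    "frame_size": (224, 224), "clip_len": 8, "stride": 4,
    "feature_grid": (7, 7), "feature_dim": 32, "video_seq_len": 200,
    "max_words": 20, "word_dim": 300, "attention_layers": 2,
    "confidence_threshold": 0.3, "gamma": 0.8, "delta": 0.9,
    "heatmap_scale": 16, "optimizer": "Adam", "learning_rate": 0.003,
}

starts = clip_starts(32, CONFIG["clip_len"], CONFIG["stride"])
# Spatial down-scaling between the input frame and the encoder output grid.
scale = CONFIG["frame_size"][0] // CONFIG["feature_grid"][0]
print(starts, scale)
```

Under this reading, a 32-frame video yields seven overlapping clips, and the 224-pixel frame is reduced by a factor of 32 to the 7×7 feature grid.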
Table 1 Performance on the VidSTG dataset compared with previous methods
Table 2 Evaluation results with the temporal ground truth given
Table 3 Ablation results on the VidSTG dataset
Training results: as our anchor-free baseline method (BM), we train the model with the hybrid convolutional network and the positioning heads removed. Furthermore, we compare GKCMN with the state-of-the-art anchor-based method STGRN. Other anchor-based methods, such as GroundeR (G), STPR (S) and WSSTG (W), require a temporal locator, such as TALL (T) or L-Net (L), to ground the spatio-temporal tube in untrimmed video. We therefore combine them into six additional methods for comparison.
Table 1 shows the overall experimental results. For temporal grounding, GKCMN outperforms the anchor-based methods STGRN, TALL and L-Net in terms of m_tIoU, indicating that hybrid spatio-temporal modeling is critical for capturing temporal features. For spatio-temporal grounding, our model performs comparably to the state-of-the-art anchor-based approach and even exceeds STGRN at v@0.5. Among anchor-free methods, GKCMN improves significantly over the anchor-free BM.
To eliminate the influence of temporal grounding on the overall spatio-temporal grounding, we performed separate experiments in which the temporal ground truth is given, as shown in Table 2. Here we can clearly observe that the proposed GKCMN is superior to the other anchor-based methods and achieves results comparable to STGRN. Notably, our Gaussian kernel design greatly improves the accuracy of spatial localization compared to BM.
Ablation study: we validated the contribution of each part of the proposed GKCMN using a center-based approach. More specifically, we modified the complete model into five settings: w/o SC, w/o PC, w/o MN, w/o TA and w/o SK, which respectively denote removing the serial connection, the parallel connection, the complete hybrid convolutional network and the self-attention module, and replacing the Gaussian kernel with single-point localization. The ablation results are shown in Table 3. Each ablated model has lower accuracy than the complete model, indicating that each of the above components makes a positive contribution.
Example 2
Referring to Fig. 6, a second aspect of the present invention provides a video spatiotemporal localization method of a cross-modal network based on a Gaussian kernel, including:
S100, acquiring a preset text and a video frame to be localized;
S200, encoding the text and the video frame respectively through a pre-trained first convolutional network to obtain a video feature v and a text feature S, and performing cross-modal interaction on the video feature v and the text feature S to obtain a cross-modal interaction feature F_f2;
S300, extracting a sequence feature M_ser and a temporal residual feature M_3 from the cross-modal interaction feature F_f2 through a plurality of second convolutional networks connected in series and in parallel, and fusing the interaction feature F_f2, the sequence feature M_ser, and the temporal residual feature M_3 to obtain a mixed feature M_mix;
S400, spatially localizing the preset text through reconstruction of the mixed feature M_mix and a Gaussian kernel heat map, and temporally localizing the preset text through the mixed feature M_mix and a third convolutional network.
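For the spatial localization in S400, a box can be read off once a heat map and a size map have been predicted for a frame. The following is a minimal sketch, not the claimed implementation: it assumes the predicted heat map is a 2D grid of scores, the size map stores a (width, height) pair per cell, and `stride` maps grid cells back to pixel coordinates. The name `decode_box` and all three assumptions are hypothetical.

```python
def decode_box(heatmap, size_map, stride=1.0):
    """Return ((x1, y1, x2, y2), score) for the strongest key point in one frame.

    heatmap:  2D list of scores, indexed [y][x].
    size_map: 2D list of (width, height) pairs, indexed [y][x] (assumed layout).
    stride:   assumed scale factor from heat map cells to pixels.
    """
    # Locate the key point: the cell with the highest heat map response.
    peak_y, peak_x = max(
        ((y, x) for y in range(len(heatmap)) for x in range(len(heatmap[0]))),
        key=lambda p: heatmap[p[0]][p[1]],
    )
    # Read the regressed object size at the key point.
    w, h = size_map[peak_y][peak_x]
    cx, cy = peak_x * stride, peak_y * stride
    box = (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
    return box, heatmap[peak_y][peak_x]
```

Applying this per frame inside the predicted time boundary yields one box per frame, which together form the spatio-temporal tube for the query.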
Example 3
Referring to Fig. 7, a third aspect of the present invention provides an electronic device, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video spatiotemporal localization method of the cross-modal network based on the Gaussian kernel according to the second aspect.
The electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with programs stored in a Read Only Memory (ROM) 502 or loaded from a storage 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following devices may be connected to the I/O interface 505 in general: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, a hard disk; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 7 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 7 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In embodiments of the present disclosure, a computer-readable signal medium, by contrast, may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device, or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more computer programs which, when executed by the electronic device, cause the electronic device to perform the video spatiotemporal localization method described above.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++, and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention to the precise form disclosed; any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope of protection.

Claims (10)

1. A Gaussian kernel-based cross-modal network video spatiotemporal localization network, comprising:
an acquisition module for acquiring a preset text and a video frame to be localized;
an encoding module for encoding the text and the video frame respectively through a pre-trained first convolutional network to obtain a video feature v and a text feature S, and performing cross-modal interaction on the video feature v and the text feature S to obtain a cross-modal interaction feature F_f2;
a hybrid convolutional network for extracting a sequence feature M_ser and a temporal residual feature M_3 from the cross-modal interaction feature F_f2 through a plurality of second convolutional networks connected in series and in parallel, and fusing the interaction feature F_f2, the sequence feature M_ser, and the temporal residual feature M_3 to obtain a mixed feature M_mix;
a localization module for spatially localizing the preset text through reconstruction of the mixed feature M_mix and a Gaussian kernel heat map, and temporally localizing the preset text through the mixed feature M_mix and a third convolutional network.
2. The Gaussian kernel-based cross-modal network video spatiotemporal localization network of claim 1, wherein the hybrid convolutional network comprises:
a serial connection network for extracting the sequence feature M_ser from the cross-modal interaction feature F_f2;
a parallel connection network for extracting the temporal residual feature M_3 from the cross-modal interaction feature F_f2;
a mixing module for fusing the interaction feature F_f2, the sequence feature M_ser, and the temporal residual feature M_3 to obtain the mixed feature M_mix.
3. The Gaussian kernel-based cross-modal network video spatiotemporal localization network of claim 1, wherein the localization module comprises:
a spatial localization head for spatially localizing the preset text through reconstruction of the mixed feature M_mix and the Gaussian kernel heat map;
a time localization head for temporally localizing the preset text through the mixed feature M_mix and the third convolutional network.
4. The Gaussian kernel-based cross-modal network video spatiotemporal localization network of claim 3, wherein the spatial localization head comprises:
an up-sampling unit for up-sampling the mixed feature M_mix to obtain a feature map M_up;
a Gaussian kernel unit for treating the video frame as a series of heat maps with the query object as the heat source center, and describing the probability distribution of the object position using a Gaussian kernel;
a point localization unit for learning a predicted heat map under Gaussian kernel heat map supervision and localizing the key point of the preset text;
a size regression unit for performing size regression on the feature map M_up to define the size of the object corresponding to the preset text.
5. The Gaussian kernel-based cross-modal network video spatiotemporal localization network of claim 4, wherein the time localization head comprises:
an input layer for placing the mixed feature M_mix into preset time position intervals to obtain one or more first time boundaries;
an embedding layer for extracting features of different durations from the time boundaries and enhancing the temporal order of these features through a self-attention mechanism to obtain one or more second time boundaries;
a confidence unit for evaluating the confidence of the one or more second time boundaries and selecting the second time boundary with the highest confidence;
a boundary regression unit for regressing the selected second time boundary and adjusting its offset.
6. The Gaussian kernel-based cross-modal network video spatiotemporal localization network of any of claims 1 to 5, wherein the first convolutional network is an I3D network.
7. The Gaussian kernel-based cross-modal network video spatiotemporal localization network of any of claims 1 to 5, wherein the third convolutional network comprises at least one 3-dimensional convolution layer, one Bi-GRU, and a 1-dimensional convolution layer.
8. A video spatiotemporal localization method of a cross-modal network based on a Gaussian kernel, comprising the following steps:
acquiring a preset text and a video frame to be localized;
encoding the text and the video frame respectively through a pre-trained first convolutional network to obtain a video feature v and a text feature S, and performing cross-modal interaction on the video feature v and the text feature S to obtain a cross-modal interaction feature F_f2;
extracting a sequence feature M_ser and a temporal residual feature M_3 from the cross-modal interaction feature F_f2 through a plurality of second convolutional networks connected in series and in parallel, and fusing the interaction feature F_f2, the sequence feature M_ser, and the temporal residual feature M_3 to obtain a mixed feature M_mix;
spatially localizing the preset text through reconstruction of the mixed feature M_mix and a Gaussian kernel heat map, and temporally localizing the preset text through the mixed feature M_mix and a third convolutional network.
9. An electronic device, comprising: one or more processors; and storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the video spatiotemporal localization method of a cross-modal network based on a Gaussian kernel as set forth in claim 8.
10. A computer readable medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the video spatiotemporal localization method of a cross-modal network based on a Gaussian kernel of claim 8.
CN202311326092.8A 2023-10-13 2023-10-13 Video space-time positioning network and method of cross-modal network based on Gaussian kernel Pending CN117058601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311326092.8A CN117058601A (en) 2023-10-13 2023-10-13 Video space-time positioning network and method of cross-modal network based on Gaussian kernel

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311326092.8A CN117058601A (en) 2023-10-13 2023-10-13 Video space-time positioning network and method of cross-modal network based on Gaussian kernel

Publications (1)

Publication Number Publication Date
CN117058601A true CN117058601A (en) 2023-11-14

Family

ID=88663139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311326092.8A Pending CN117058601A (en) 2023-10-13 2023-10-13 Video space-time positioning network and method of cross-modal network based on Gaussian kernel

Country Status (1)

Country Link
CN (1) CN117058601A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210027066A1 (en) * 2019-07-24 2021-01-28 Honda Motor Co., Ltd. System and method for providing unsupervised domain adaptation for spatio-temporal action localization
CN112650886A (en) * 2020-12-28 2021-04-13 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN113849668A (en) * 2021-09-18 2021-12-28 北京航空航天大学 End-to-end video spatiotemporal visual positioning system based on visual language Transformer
CN114627402A (en) * 2021-12-30 2022-06-14 湖南大学 Cross-modal video time positioning method and system based on space-time diagram

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZEYU XIONG, DAIZONG LIU, PAN ZHOU: "Gaussian Kernel-Based Cross Modal Network for Spatio-Temporal Video Grounding", 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), pages 2481 - 2485 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20231114)