CN113849668A - End-to-end video spatiotemporal visual positioning system based on visual language Transformer - Google Patents

End-to-end video spatiotemporal visual positioning system based on visual language Transformer

Info

Publication number
CN113849668A
CN113849668A (Application CN202111100948.0A)
Authority
CN
China
Prior art keywords
visual
module
text
space
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111100948.0A
Other languages
Chinese (zh)
Inventor
于茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202111100948.0A priority Critical patent/CN113849668A/en
Publication of CN113849668A publication Critical patent/CN113849668A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end video space-time visual positioning system based on a visual language Transformer, which comprises a visual information coding module, a character embedding module, a space-time visual positioning module and a space-time trajectory generation module; the visual information coding module and the character embedding module are connected with the space-time visual positioning module; the space-time visual positioning module is connected with the space-time trajectory generation module; the visual information coding module acquires visual features from the video frames; the character embedding module extracts text encodings from the query text; the space-time visual positioning module learns interaction features between the visual features and the text encodings and performs spatial and temporal positioning of the detection target to obtain detection frame information and start and end time information; the space-time trajectory generation module generates the space-time trajectory prediction result. With this system, visual positioning in time and space is completed simultaneously, and better feature representations can be learned so as to achieve a better positioning effect.

Description

End-to-end video spatiotemporal visual positioning system based on visual language Transformer
Technical Field
The invention relates to the technical field of multimedia, in particular to an end-to-end video spatiotemporal visual positioning system based on a visual language Transformer.
Background
Video spatiotemporal visual localization is a new and extremely challenging visual-language task. Given an untrimmed video and a language description of a detection target, the task generates a spatiotemporal trajectory tube (a series of visual localization boxes) that localizes the described target in the video. Unlike the existing image visual grounding task, spatiotemporal visual localization requires the detection target to be localized both temporally and spatially. In addition, how to efficiently exploit visual and language information for cross-modal learning is the key to accurately localizing the detection target. Scenes in which different people perform similar actions in the same scene are particularly challenging.
Spatial localization in images/videos is a visual grounding task closely related to this task. Most existing work first generates a set of candidate target detection boxes using a pre-trained object detector. These methods have certain limitations: 1) the spatial detection capability is limited by the quality of the candidate detection boxes; 2) it is difficult for a pre-trained object detector to generate detection boxes for unseen classes; 3) pre-training the detector is costly. At present, no video grounding work removes the pre-trained detector.
In addition, the video spatiotemporal visual positioning task needs to localize the detection target in both the temporal and the spatial dimension, so the existing methods are two-stage methods: temporal visual grounding is completed first to determine the start and end times of the detection target, and spatial visual grounding is then completed on this basis. However, the two-stage approach splits the overall framework into two independent networks that each complete their own subtask.
Therefore, how to provide an end-to-end visual language Transformer based system that completes the video visual positioning task without a pre-trained target detector is a problem to be solved in the field.
Disclosure of Invention
In view of this, the invention provides an end-to-end video spatiotemporal visual positioning system based on a visual language Transformer, and better feature representation can be learned by completing visual positioning in time and space at the same time, so as to achieve a better positioning effect.
In order to achieve the purpose, the invention adopts the following technical scheme:
an end-to-end video spatiotemporal visual positioning system based on a visual language Transformer comprises a visual information coding module, a character embedding module, a spatiotemporal visual positioning module and a spatiotemporal trajectory generating module; the visual information coding module and the character embedding module are connected with the space-time visual positioning module; the space-time visual positioning module is connected with the space-time trajectory generation module; the visual information coding module is used for acquiring the visual characteristics of the detection target from the video frame; the character embedding module is used for extracting text codes of detection targets from the query text; the space-time visual positioning module is used for learning the interactive characteristics between the visual characteristics and the text codes and carrying out space positioning and time positioning on a detection target to obtain detection frame information and time starting and ending information; and the space-time trajectory generation module is used for combining the generated detection frame information on a time domain and a space domain to obtain a space-time trajectory block containing a detection target.
Furthermore, the space-time visual positioning module comprises a cross-modal feature learning module and a space-time analysis core module; the cross-modal feature learning module acquires text codes and visual features and generates text-guided visual features and visual-guided text features; and the space-time analysis core module is used for positioning the generated text-guided visual features in time and space.
Furthermore, special text marks [ 'GLS' ] and [ 'SEP' ] are added into the query text and are respectively placed at the head and the tail of the query text, and the text embedding module obtains a text code with the special mark according to the query text with the special mark; and the cross-modal learning module acquires a text code with a special text mark to obtain global text characteristics.
Further, the cross-modal feature learning module comprises a visual branch module and a text branch module, and the visual branch module and the text branch module perform interactive learning to obtain text-guided visual features and visually-guided text features.
Further, a space-time combination decomposition module is constructed in the visual branch module to retain the spatial information.
Furthermore, the time-space combination decomposition module comprises a time sequence pooling module, a space pooling module, a combination module, a multi-head attention module, a decomposition module, a copying module and a normalization module; the time sequence pooling module is used for acquiring visual features to generate T multiplied by C preliminary time sequence features, wherein T represents the number of video frames, C represents the number of feature map channels, H represents the height, and W represents the width; the space pooling module is used for acquiring visual features to generate primary spatial features with the shape of HW multiplied by C, and the combination module is used for connecting the primary time sequence features and the primary spatial features on feature dimensions to form combined visual features with the size of (T + HW) multiplied by C; the multi-head attention module is used for performing attention operation according to the combined visual features and the text features to generate preliminary text-guided visual features; the decomposition module is used for generating a time sequence feature of the text guide and a space feature of the text guide according to the visual feature of the preliminary text guide; the copying module is used for copying the time sequence characteristics of the text guide for HW times and copying the space characteristics of the text guide for T times to obtain copying time sequence characteristics and copying space characteristics with the size of T multiplied by HW multiplied by C; the normalization module is used for normalizing the result obtained by adding the copying time sequence characteristic, the copying space characteristic and the visual input characteristic to generate an intermediate visual characteristic; the output of the last layer is a text-guided visual feature.
Furthermore, the space-time analysis core module comprises a space visual positioning branch module and a time visual positioning branch module; the space vision positioning module generates the center and the size of the detection box according to the visual features guided by the text; the temporal visual positioning branching module generates start and stop prediction scores based on text-guided visual features.
Further, the spatial visual positioning branch comprises deconvolution layers, a first detection network head and a second detection network head; the deconvolution layers are three layers, which deconvolve the text-guided visual features and perform spatial up-sampling to obtain up-sampled features; the first detection network head and the second detection network head are each composed of a 3×3 convolutional layer and a 1×1 convolutional layer; the first network head takes the up-sampled features to generate a heat map, the position with the highest probability value in the heat map is the center of the detection box, and the second network head obtains the size of the detection box by regression on the up-sampled features.
Furthermore, the temporal visual positioning branch module comprises a spatial pooling layer, a time-sequence convolution layer, a multi-layer perceptron layer and a calculation activation layer; the spatial pooling layer applies spatial average pooling to the text-guided visual features to obtain text-guided global visual features; the text-guided global visual features are convolved by two parallel time-sequence convolution blocks to obtain a starting visual feature and an ending visual feature respectively;
the multi-layer perceptron layer maps the global text features to text features in a C-dimensional common feature space; and the calculation activation layer performs correlation calculation between the starting and ending visual features and the text features in the common feature space and applies an activation function to obtain a starting prediction score and an ending prediction score.
Further, the spatiotemporal visual positioning module is trained with a loss function combining three focal losses and one L1 loss.
The invention has the following beneficial effects:
according to the technical scheme, compared with the prior art, the invention discloses and provides an end-to-end video space-time visual positioning system based on the visual language Transformer, so that the invention provides an end-to-end video space-time visual positioning unified frame without using a prediction training target detector, and simultaneously realizes visual positioning on time and space, thereby achieving a better positioning effect, solving the problem that the existing space detection capability is limited by the quality of a target detection candidate frame, and further avoiding a high-cost pre-training process of the detector; a space-time visual positioning module is constructed, and spatial information is reserved in a cross-modal feature learning module.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a diagram illustrating a structure of an end-to-end video spatiotemporal visual positioning system based on a visual language Transformer according to the present invention;
FIG. 2 is a diagram of a cross-modal learning module architecture;
FIG. 3 is a block diagram of a spatiotemporal decomposition module;
FIG. 4 is a block diagram of a spatiotemporal visual orientation module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention discloses an end-to-end video spatiotemporal visual positioning system (STVGBert) based on a visual language Transformer, which includes a visual information encoding module (Image Encoder), a word embedding module (Word Embedding), a spatiotemporal visual positioning module (STVGBert-core), and a spatiotemporal trajectory generation module (Tube Generation); the visual information encoding module and the word embedding module are connected with the spatiotemporal visual positioning module; the spatiotemporal visual positioning module is connected with the spatiotemporal trajectory generation module; the visual information encoding module acquires visual features from the video frames; the word embedding module extracts text encodings from the query text; the spatiotemporal visual positioning module learns the interaction between the visual features and the text encodings and generates detection box information and start and end time information; the spatiotemporal trajectory generation module obtains a spatiotemporal trajectory tube (Object Tube B) containing the detection target according to the generated detection box information and the start and end time information.
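The following is a minimal sketch of how the four modules described above could be wired together. The class and argument names are hypothetical; the patent does not specify an implementation, and each sub-module is assumed to be supplied separately.

```python
# Hypothetical wiring of the four modules (Image Encoder, Word Embedding,
# STVGBert-core, Tube Generation); a sketch, not the patent's implementation.
import torch.nn as nn

class STVGBertPipeline(nn.Module):
    def __init__(self, image_encoder, word_embedding, stvg_core, tube_generator):
        super().__init__()
        self.image_encoder = image_encoder    # video frames -> visual features
        self.word_embedding = word_embedding  # query text  -> text input tokens
        self.stvg_core = stvg_core            # cross-modal grounding core
        self.tube_generator = tube_generator  # boxes + start/end scores -> tube

    def forward(self, frames, query):
        visual_feat = self.image_encoder(frames)     # (T, HW, C)
        text_tokens = self.word_embedding(query)     # (L, C_text)
        boxes, p_start, p_end = self.stvg_core(visual_feat, text_tokens)
        return self.tube_generator(boxes, p_start, p_end)   # object tube B
```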
In another embodiment, an untrimmed video containing K × T frames is divided into K non-overlapping video slices, each containing T video frames. The video is defined as V = {V_1, V_2, …, V_K}, where V_k represents the k-th video slice. The visual information encoding module extracts visual features from the video frames and the word embedding module extracts text encodings from the query words. The visual information encoding module uses ResNet-101 as the visual encoder: its 4 residual stages convert each video frame into a feature map of shape HW × C, where H, W and C respectively denote the height, width and number of channels of the feature map. The visual encoder stacks the per-frame features to form the slice feature F_v ∈ R^{T×HW×C}, which is fed into the spatiotemporal visual positioning module as the visual feature.
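A sketch of this encoding step is given below, assuming a recent torchvision release for the ResNet-101 backbone; the exact feature stage used and the input resolution are assumptions.

```python
# Sketch of the visual information encoding step: ResNet-101 feature maps,
# flattened per frame to HW x C and stacked over the T frames of a slice.
import torch
import torchvision

backbone = torchvision.models.resnet101(weights=None)
# Keep the four residual stages; drop global pooling and the classifier head.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

def encode_slice(frames):
    """frames: (T, 3, H0, W0) video slice -> (T, H*W, C) slice feature."""
    with torch.no_grad():
        fmap = feature_extractor(frames)       # (T, C, H, W), C = 2048
    return fmap.flatten(2).transpose(1, 2)     # (T, H*W, C)

slice_feats = encode_slice(torch.randn(8, 3, 224, 224))  # e.g. T = 8 frames
print(slice_feats.shape)  # torch.Size([8, 49, 2048])
```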
In another embodiment, the query words are given special text marks: the marks ['GLS'] and ['SEP'] are placed at the head and the tail of the query text respectively. The word embedding module maps each word of the query to a vocabulary vector, and each vocabulary vector is regarded as a text input token (Input Textual Tokens); the vocabulary vector carrying the special text mark is the global text input token.
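A minimal sketch of this embedding step follows. The vocabulary, tokenisation and embedding dimension are illustrative assumptions; only the placement of the ['GLS'] and ['SEP'] marks reflects the text above.

```python
# Word embedding with the special text marks placed at head and tail.
import torch
import torch.nn as nn

class WordEmbedding(nn.Module):
    def __init__(self, vocab, dim=768):
        super().__init__()
        self.vocab = {w: i for i, w in enumerate(vocab)}
        self.embed = nn.Embedding(len(vocab), dim)

    def forward(self, query):
        # [GLS] at the head, [SEP] at the tail of the query text.
        words = ["[GLS]"] + query.lower().split() + ["[SEP]"]
        ids = torch.tensor([self.vocab[w] for w in words])
        return self.embed(ids)   # (L, dim): one text input token per word

vocab = ["[GLS]", "[SEP]", "the", "man", "in", "red", "rides", "a", "bike"]
tokens = WordEmbedding(vocab)("the man in red rides a bike")
print(tokens.shape)   # torch.Size([9, 768]); index 0 is the global text token
```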
In another embodiment, as shown in fig. 2 and fig. 3, the spatiotemporal visual localization module includes a cross-modal learning module (ST-ViLBERT) that interactively learns and associates the visual and language features. The cross-modal learning module is composed of a visual branch module (Visual Branch) and a text branch module (Textual Branch); each branch adopts a multi-layer Transformer encoding-layer structure and learns interaction features through the multi-head attention layer (Multi-head Attention) of the Transformer encoding layer. The learning scheme is as follows: in the visual branch, the text features output by the previous layer serve as the keys and values in the multi-head attention calculation, and in the text branch, the visual features output by the previous layer serve as the keys and values, so that the text-guided visual features (Text-guided Visual Feature) and the visually-guided text features are finally generated.
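The sketch below illustrates one such co-attention layer with the keys and values swapped between the two branches. The dimensions, number of heads and use of a single residual + LayerNorm step are assumptions; it is not the patent's exact layer.

```python
# One cross-modal co-attention layer: the visual branch attends to text
# keys/values, the text branch attends to visual keys/values.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: (B, Nv, dim) visual tokens; txt: (B, Nt, dim) text tokens
        vis_out, _ = self.vis_attn(query=vis, key=txt, value=txt)
        txt_out, _ = self.txt_attn(query=txt, key=vis, value=vis)
        return self.vis_norm(vis + vis_out), self.txt_norm(txt + txt_out)

layer = CoAttentionLayer()
v, t = torch.randn(1, 57, 768), torch.randn(1, 9, 768)
text_guided_visual, visually_guided_text = layer(v, t)
```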
In another embodiment, a space-time combination decomposition (STCD) module is constructed to replace the multi-head attention layer and the addition-and-normalization layer (Add & Norm) in the visual branch module, so that spatial information is preserved during the learning process of the cross-modal feature module. In the STCD module, the visual features first undergo temporal pooling (Temporal Pooling) and spatial pooling (Spatial Pooling) respectively, yielding a preliminary temporal feature (Preliminary Temporal Feature) of shape T × C and a preliminary spatial feature (Preliminary Spatial Feature) of shape HW × C, and the combination module (Combination) connects the two features along the feature dimension to form a combined visual feature (Combined Visual Feature) of size (T + HW) × C. The combined visual feature is then passed in vector form into the text branch, where it serves as the keys and values of the multi-head attention block; meanwhile, the combined visual feature also enters the multi-head attention block of the visual branch, where an attention operation is performed with the text input (Textual Input) to generate a preliminary text-guided visual feature of size (T + HW) × C. The preliminary text-guided visual feature is decomposed by the decomposition module (Decomposition) into a text-guided temporal feature (size T × C) and a text-guided spatial feature (size HW × C). These two features are replicated HW and T times, respectively, by the replication module (Replication) to match the size of the visual features, forming a replicated temporal feature and a replicated spatial feature. Finally, the replicated temporal feature, the replicated spatial feature and the visual feature are added and normalized (Norm), generating an intermediate visual feature of size T × HW × C.
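The following is a simplified single-layer sketch of the STCD idea just described. The use of mean pooling, a single attention block and the chosen dimensions are assumptions, not the patent's exact implementation.

```python
# Space-time combination decomposition (STCD), simplified: pool, combine,
# attend to text, decompose, replicate, add & normalize.
import torch
import torch.nn as nn

class STCD(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: (B, T, HW, C) visual features; txt: (B, L, C) text features
        B, T, HW, C = vis.shape
        temporal = vis.mean(dim=2)                        # (B, T, C)  temporal pooling
        spatial = vis.mean(dim=1)                         # (B, HW, C) spatial pooling
        combined = torch.cat([temporal, spatial], dim=1)  # (B, T+HW, C) combination
        # Text features serve as keys and values in the attention operation.
        guided, _ = self.attn(query=combined, key=txt, value=txt)
        t_feat, s_feat = guided[:, :T], guided[:, T:]     # decomposition
        # Replicate: temporal feature HW times, spatial feature T times.
        t_rep = t_feat.unsqueeze(2).expand(B, T, HW, C)
        s_rep = s_feat.unsqueeze(1).expand(B, T, HW, C)
        return self.norm(vis + t_rep + s_rep)             # intermediate visual feature

out = STCD()(torch.randn(1, 8, 49, 768), torch.randn(1, 9, 768))
print(out.shape)  # torch.Size([1, 8, 49, 768])
```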
The Transformer encoding layer is formed by connecting, in sequence, a multi-head attention layer, an addition-and-normalization layer (Add & Norm), a feed-forward neural network layer (Feed Forward) and a second addition-and-normalization layer.
In addition, the cross-modal feature learning module is composed of a plurality of layers, each layer being a Transformer-structured layer that takes visual features as input and outputs new visual features. The visual features generated by the intermediate layers are the intermediate visual features, and the outputs of the last layer of the visual branch and of the text branch are taken respectively as the text-guided visual features F_tv ∈ R^{T×HW×C} and the visually-guided text features.
In another embodiment, as shown in FIG. 4, the spatiotemporal visual localization module further comprises a spatiotemporal analysis core module; the space-time analysis core module comprises a space visual positioning branch and a time visual positioning branch.
In the spatial visual positioning branch, the text-guided visual features first undergo spatial up-sampling with a factor of 8 through three deconvolution layers to generate up-sampled features, and two parallel detection network heads take the up-sampled features to generate the detection box centers (BB Centers) and the detection box sizes (Bounding Box Sizes, BB Sizes) respectively.
Each of the two parallel detection network heads is composed of a 3 × 3 convolutional layer for feature extraction and a 1 × 1 convolutional layer for dimension reduction, and both take the up-sampled features as input. The first detection network head outputs a heat map A ∈ R^{8H×8W}, in which each value represents the probability that the corresponding position is the center of the detection box containing the detection target; the spatial position with the highest probability value is selected as the center of the detection box. The second detection network head regresses the size of the detection box, from which the top-left and bottom-right coordinates of the detection box are computed.
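A sketch of this spatial branch is shown below. The channel widths, activations and deconvolution kernel settings are assumptions; only the 8× up-sampling, the two heads (3×3 conv followed by 1×1 conv) and the arg-max centre selection follow the text above.

```python
# Spatial grounding branch: three deconvolution layers (8x up-sampling),
# then a centre heat-map head and a box-size head.
import torch
import torch.nn as nn

def deconv_block(c_in, c_out):
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.ReLU(inplace=True))

def head(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(256, c_out, kernel_size=1))

class SpatialBranch(nn.Module):
    def __init__(self, c_in=768):
        super().__init__()
        self.upsample = nn.Sequential(            # H x W -> 8H x 8W
            deconv_block(c_in, 256), deconv_block(256, 256), deconv_block(256, 256))
        self.center_head = head(256, 1)           # heat map A in R^{8H x 8W}
        self.size_head = head(256, 2)             # box width and height

    def forward(self, ftv):
        # ftv: (T, C, H, W) text-guided visual features, already reshaped per frame
        up = self.upsample(ftv)
        heatmap = torch.sigmoid(self.center_head(up)).squeeze(1)  # (T, 8H, 8W)
        sizes = self.size_head(up)                                # (T, 2, 8H, 8W)
        # The box centre is the location with the highest heat-map probability.
        T, Hh, Wh = heatmap.shape
        idx = heatmap.view(T, -1).argmax(dim=1)
        centers = torch.stack([idx % Wh, idx // Wh], dim=1)       # (T, 2) as (x, y)
        return heatmap, sizes, centers

hm, sz, ct = SpatialBranch()(torch.randn(8, 768, 7, 7))
```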
In the temporal visual positioning branch, the start time and the end time at which the detection target appears in the video are predicted based on the text-guided visual features and the visually-guided text features. The text-guided visual features are averaged by spatial pooling to obtain the text-guided global visual features, which are then convolved by two parallel time-sequence convolution blocks to obtain the starting visual features and the ending visual features respectively.
Wherein each time-series convolution block is composed of 3 layers of 1-dimensional convolution layers with a convolution kernel size of 3. The starting visual feature and the ending visual feature are both T × C in size.
In another embodiment, the visually-guided text feature that the cross-modal feature learning module generates for the global text input token (the token carrying the special text mark) is taken as the global text feature. The global text feature is mapped by a multi-layer perceptron (MLP) into an intermediate text feature in the C-dimensional common feature space, and the calculation activation module (Correlation & Sigmoid) performs a correlation calculation between each starting or ending visual feature and the intermediate text feature in the common feature space and applies a Sigmoid activation function to obtain the starting and ending prediction scores, where the prediction score represents the probability that each frame in a video slice is the starting frame or the ending frame.
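The sketch below covers this temporal branch: spatial average pooling, two parallel temporal convolution blocks (three 1-D convolutions with kernel size 3 each), an MLP for the global text feature, and correlation followed by a Sigmoid. The hidden sizes and padding choice are assumptions.

```python
# Temporal grounding branch: per-frame start/end scores from text-guided
# visual features and the global text feature.
import torch
import torch.nn as nn

def temporal_block(c):
    return nn.Sequential(
        nn.Conv1d(c, c, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv1d(c, c, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv1d(c, c, 3, padding=1))

class TemporalBranch(nn.Module):
    def __init__(self, c=768):
        super().__init__()
        self.start_conv = temporal_block(c)
        self.end_conv = temporal_block(c)
        self.text_mlp = nn.Sequential(nn.Linear(c, c), nn.ReLU(inplace=True), nn.Linear(c, c))

    def forward(self, ftv, global_text):
        # ftv: (T, HW, C) text-guided visual features; global_text: (C,)
        g = ftv.mean(dim=1)                                           # (T, C) spatial average pooling
        f_start = self.start_conv(g.t().unsqueeze(0)).squeeze(0).t()  # (T, C) starting visual feature
        f_end = self.end_conv(g.t().unsqueeze(0)).squeeze(0).t()      # (T, C) ending visual feature
        q = self.text_mlp(global_text)                                # (C,) text feature in common space
        # Correlation with the text feature, then Sigmoid -> per-frame scores.
        p_start = torch.sigmoid(f_start @ q)                          # (T,)
        p_end = torch.sigmoid(f_end @ q)                              # (T,)
        return p_start, p_end

ps, pe = TemporalBranch()(torch.randn(8, 49, 768), torch.randn(768))
```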
In another embodiment, the spatiotemporal trajectory generation module combines the detection boxes and the start and end prediction scores in the time domain to construct an initial target spatiotemporal trajectory tube. The start time and the end time of the spatiotemporal trajectory tube are selected according to the maximum starting prediction score and the maximum ending prediction score, all target detection boxes before the start time and after the end time are removed, and the resulting temporal boundary together with the remaining target detection boxes forms the spatiotemporal trajectory prediction result, where the temporal boundary consists of the predicted start time and the predicted end time.
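A simplified reading of this tube-generation step is sketched below; the fallback when the predicted end precedes the predicted start is an assumption not stated in the text.

```python
# Tube generation: select the temporal boundary from the score maxima and
# keep only the per-frame boxes inside it.
import torch

def generate_tube(boxes, p_start, p_end):
    """boxes: (T, 4) per-frame detection boxes; p_start, p_end: (T,) scores."""
    t_s = int(p_start.argmax())
    t_e = int(p_end.argmax())
    if t_e < t_s:              # assumed fallback: collapse to a single frame
        t_e = t_s
    # Remove all boxes before the start time and after the end time.
    return (t_s, t_e), boxes[t_s:t_e + 1]

(t_s, t_e), tube = generate_tube(torch.rand(8, 4), torch.rand(8), torch.rand(8))
```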
In another embodiment, the spatiotemporal visual positioning module is trained with three focal losses and one L1 loss as the loss function. The specific method comprises the following steps:
A series of video slices is randomly sampled, each video slice comprising T consecutive video frames. Video slices that contain at least one annotated video frame with its corresponding target detection box are selected as training samples. For the i-th annotated video frame in each video slice, the center, width and height of the annotated target detection box are denoted (x_c^i, y_c^i), w^i and h^i respectively, and the annotated start and end times are denoted t_s and t_e. Based on the annotated target detection box, a Gaussian kernel is used to generate a center heat map A*_i ∈ R^{8H×8W}, where A*_i(x, y) denotes the value at spatial coordinate (x, y) and the bandwidth parameter σ is adaptively determined by the size of the detection target. Similarly, two 1-dimensional temporal heat maps p*_s, p*_e ∈ R^T are generated for the start time and the end time at which the detection target appears.
During training, for each video slice the loss function is defined as
L = λ_1·L_c + λ_2·L_s + λ_3·L_e + λ_4·L_size,
where A_i ∈ R^{8H×8W} is the predicted heat map, p_s, p_e ∈ R^T are the predicted start and end scores for the appearance of the target, and the width and height of the predicted target detection box, centered at the predicted position in the i-th video frame, are regressed by the network. L_c, L_s and L_e are focal losses that supervise the center of the target detection box and the predicted start and end of the temporal boundary, respectively. L_size is the L1 loss used for the regressed size of the target detection box. The loss weights are set to λ_1 = 0.1 and λ_2 = λ_3 = λ_4 = 1.
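A sketch of this combined objective follows. The exact focal-loss variant is not reproduced in the text; a CenterNet-style penalty-reduced focal loss is assumed here, and the function names are hypothetical.

```python
# Combined training loss: three focal losses (centre, start, end) and one
# L1 loss (box size), with the weights given above.
import torch

def focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """pred, gt: same shape; gt is a Gaussian-smoothed heat map in [0, 1]."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

def total_loss(heatmap, gt_heatmap, p_s, gt_s, p_e, gt_e, size_pred, size_gt,
               lambdas=(0.1, 1.0, 1.0, 1.0)):
    l_c = focal_loss(heatmap, gt_heatmap)                       # centre heat map
    l_s = focal_loss(p_s, gt_s)                                 # start-time scores
    l_e = focal_loss(p_e, gt_e)                                 # end-time scores
    l_size = torch.nn.functional.l1_loss(size_pred, size_gt)    # box size regression
    return lambdas[0] * l_c + lambdas[1] * l_s + lambdas[2] * l_e + lambdas[3] * l_size
```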
The technical effects achieved by the present invention are specifically described below:
TABLE 1 Results of the performance of different methods on the VidSTG dataset (the table is rendered as an image in the original publication; its values are not reproduced in the text)
Method      m_vIoU(%)   vIoU@0.3(%)   vIoU@0.5(%)
STVGT       18.15       26.81         9.48
STVGBert    20.42       29.37         11.31
TABLE 2 Results of the performance of different methods on the HC-STVG dataset
The results for these two datasets are shown in Tables 1 and 2. From the results, we observe the following. 1) On both datasets, the proposed method has a clear advantage over the current state-of-the-art methods on all evaluation metrics. 2) On the VidSTG dataset, the previous works GroundeR+{·}, STPR+{·} and WSSTG+{·} first use TALL or L-Net to perform temporal visual grounding, and then perform spatial visual grounding to obtain the final result. In contrast, our model generates the target detection boxes and the temporal boundary at the same time to form the spatiotemporal trajectory tube, and its performance is significantly better than that of the two-stage methods, which demonstrates the effectiveness of the proposed end-to-end method STVGBert. 3) In Table 2, both STGVT and our STVGBert apply a visual language Transformer to perform cross-modal feature learning, but the video spatiotemporal grounding method proposed by the invention has obvious advantages over STGVT. In addition, STGVT requires a pre-trained object detector to generate a set of target detection boxes, whereas the proposed system STVGBert can directly process the input video.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An end-to-end video space-time visual positioning system based on a visual language Transformer is characterized by comprising a visual information coding module, a character embedding module, a space-time visual positioning module and a space-time track generating module; the visual information coding module and the character embedding module are connected with the space-time visual positioning module; the space-time visual positioning module is connected with the space-time trajectory generation module; the visual information coding module is used for acquiring the visual characteristics of the detection target from the video frame; the character embedding module is used for extracting text codes of detection targets from the query text; the space-time visual positioning module is used for learning the interactive characteristics between the visual characteristics and the text codes and carrying out space positioning and time positioning on a detection target to obtain detection frame information and time starting and ending information; and the space-time trajectory generation module is used for combining the generated detection frame information on a time domain and a space domain to obtain a space-time trajectory block containing a detection target.
2. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 1, wherein the spatiotemporal visual positioning module comprises a cross-modal feature learning module and a spatiotemporal analysis core module; the cross-modal feature learning module acquires text codes and visual features and generates text-guided visual features and visual-guided text features; and the space-time analysis core module is used for positioning the generated text-guided visual features in time and space.
3. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 2, wherein the query text is added with special text labels [ 'GLS' ] and [ 'SEP' ], and the special text labels are respectively placed at the beginning and the end of the query text, and the text embedding module obtains a text code with a special label according to the query text with the special label; and the cross-modal learning module acquires a text code with a special text mark to obtain global text characteristics.
4. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 2, wherein the cross-modal feature learning module comprises a visual branch module and a text branch module, and the visual branch module and the text branch module perform interactive learning to obtain text-guided visual features and visually-guided text features.
5. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 4, wherein the spatiotemporal combination decomposition module constructed in the visual branch module retains spatial information.
6. The visual language Transformer-based video spatiotemporal visual positioning system of claim 5, wherein the spatiotemporal combination decomposition module comprises a temporal pooling module, a spatial pooling module, a combination module, a multi-headed attention module, a decomposition module, a replication module, and a normalization module;
the time sequence pooling module is used for acquiring visual features to generate T multiplied by C preliminary time sequence features, wherein T represents the number of video frames, C represents the number of feature map channels, H represents the height, and W represents the width;
the spatial pooling module is used to collect visual features to generate preliminary spatial features in the shape of HW x C,
the combination module is used for connecting the preliminary time sequence characteristics and the preliminary space characteristics on characteristic dimensions to form combined visual characteristics with the size of (T + HW) multiplied by C;
the multi-head attention module is used for performing attention operation according to the combined visual features and the text features to generate preliminary text-guided visual features;
the decomposition module is used for generating a time sequence feature of the text guide and a space feature of the text guide according to the visual feature of the preliminary text guide;
the copying module is used for copying the time sequence characteristics of the text guide for HW times and copying the space characteristics of the text guide for T times to obtain copying time sequence characteristics and copying space characteristics with the size of T multiplied by HW multiplied by C;
the normalization module is used for normalizing the result obtained by adding the copying time sequence characteristic, the copying space characteristic and the visual input characteristic to generate an intermediate visual characteristic; the output of the last layer is a text-guided visual feature.
7. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 3, wherein the spatiotemporal analysis core module comprises a spatial visual positioning branch module and a temporal visual positioning branch module; the space vision positioning module generates the center and the size of the detection box according to the visual features guided by the text; the temporal visual positioning branching module generates start and stop prediction scores based on text-guided visual features.
8. A visual language Transformer based end-to-end video spatiotemporal visual positioning system according to claim 7, characterized in that said spatial visual positioning branch comprises deconvolution layers, a first detection network head and a second detection network head; the deconvolution layers are three layers, which deconvolve the text-guided visual features and perform spatial up-sampling to obtain up-sampled features; the first detection network head and the second detection network head are each composed of a 3×3 convolutional layer and a 1×1 convolutional layer; the first network head takes the up-sampled features to generate a heat map, the position with the highest probability value in the heat map is the center of the detection box, and the second network head obtains the size of the detection box by regression on the up-sampled features.
9. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 8, wherein the temporal visual positioning branch module comprises a spatial pooling layer, a time-sequence convolution layer, a multi-layer perceptron layer and a calculation activation layer; the spatial pooling layer applies spatial average pooling to the text-guided visual features to obtain text-guided global visual features; the text-guided global visual features are convolved by two parallel time-sequence convolution blocks to obtain a starting visual feature and an ending visual feature respectively; the multi-layer perceptron layer maps the global text features to text features in a C-dimensional common feature space; and the calculation activation layer performs correlation calculation between the starting and ending visual features and the text features in the common feature space and applies an activation function to obtain a starting prediction score and an ending prediction score.
10. A visual language Transformer based end-to-end video spatiotemporal visual localization system according to claim 1, characterized in that said spatiotemporal visual localization module is trained with a loss function combining three focal losses and one L1 loss.
CN202111100948.0A 2021-09-18 2021-09-18 End-to-end video spatiotemporal visual positioning system based on visual language Transformer Pending CN113849668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111100948.0A CN113849668A (en) 2021-09-18 2021-09-18 End-to-end video spatiotemporal visual positioning system based on visual language Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111100948.0A CN113849668A (en) 2021-09-18 2021-09-18 End-to-end video spatiotemporal visual positioning system based on visual language Transformer

Publications (1)

Publication Number Publication Date
CN113849668A true CN113849668A (en) 2021-12-28

Family

ID=78974636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111100948.0A Pending CN113849668A (en) 2021-09-18 2021-09-18 End-to-end video spatiotemporal visual positioning system based on visual language Transformer

Country Status (1)

Country Link
CN (1) CN113849668A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067371A (en) * 2022-01-18 2022-02-18 之江实验室 Cross-modal pedestrian trajectory generation type prediction framework, method and device
CN115495677A (en) * 2022-11-21 2022-12-20 阿里巴巴(中国)有限公司 Method and storage medium for spatio-temporal localization of video
CN116304560A (en) * 2023-01-17 2023-06-23 北京信息科技大学 Track characterization model training method, characterization method and device based on multi-scale enhanced contrast learning
CN117058601A (en) * 2023-10-13 2023-11-14 华中科技大学 Video space-time positioning network and method of cross-modal network based on Gaussian kernel

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067371A (en) * 2022-01-18 2022-02-18 之江实验室 Cross-modal pedestrian trajectory generation type prediction framework, method and device
CN115495677A (en) * 2022-11-21 2022-12-20 阿里巴巴(中国)有限公司 Method and storage medium for spatio-temporal localization of video
CN116304560A (en) * 2023-01-17 2023-06-23 北京信息科技大学 Track characterization model training method, characterization method and device based on multi-scale enhanced contrast learning
CN116304560B (en) * 2023-01-17 2023-11-24 北京信息科技大学 Track characterization model training method, characterization method and device
CN117058601A (en) * 2023-10-13 2023-11-14 华中科技大学 Video space-time positioning network and method of cross-modal network based on Gaussian kernel

Similar Documents

Publication Publication Date Title
CN113849668A (en) End-to-end video spatiotemporal visual positioning system based on visual language Transformer
Senocak et al. Learning to localize sound source in visual scenes
Plummer et al. Conditional image-text embedding networks
Chen et al. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning
Messina et al. Transformer reasoning network for image-text matching and retrieval
Ge et al. An attention mechanism based convolutional LSTM network for video action recognition
Mun et al. Text-guided attention model for image captioning
Fung et al. End-to-end low-resource lip-reading with maxout CNN and LSTM
CN112559698B (en) Method and system for improving video question-answering precision based on multi-mode fusion model
US11288438B2 (en) Bi-directional spatial-temporal reasoning for video-grounded dialogues
CN111651635B (en) Video retrieval method based on natural language description
CN113239820A (en) Pedestrian attribute identification method and system based on attribute positioning and association
Estevam et al. Dense video captioning using unsupervised semantic information
Mazaheri et al. Video fill in the blank using lr/rl lstms with spatial-temporal attentions
Xue et al. LCSNet: End-to-end lipreading with channel-aware feature selection
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Vahdati et al. Facial beauty prediction from facial parts using multi-task and multi-stream convolutional neural networks
Bashmal et al. Language Integration in Remote Sensing: Tasks, datasets, and future directions
Wang et al. How to make a BLT sandwich? learning to reason towards understanding web instructional videos
CN114596432A (en) Visual tracking method and system based on corresponding template features of foreground region
Xian et al. CLIP Driven Few-shot Panoptic Segmentation
Zhao et al. Research on human behavior recognition in video based on 3DCCA
CN110211146B (en) Video foreground segmentation method and device for cross-view simulation
Wu et al. Question-driven multiple attention (dqma) model for visual question answer

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination