CN113849668A - End-to-end video spatiotemporal visual positioning system based on visual language Transformer - Google Patents

End-to-end video spatiotemporal visual positioning system based on visual language Transformer

Info

Publication number
CN113849668A
CN113849668A (Application CN202111100948.0A)
Authority
CN
China
Prior art keywords
visual
module
text
space
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111100948.0A
Other languages
Chinese (zh)
Inventor
于茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202111100948.0A priority Critical patent/CN113849668A/en
Publication of CN113849668A publication Critical patent/CN113849668A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end video space-time visual positioning system based on a visual language Transformer, which comprises a visual information coding module, a character embedding module, a space-time visual positioning module and a space-time trajectory generation module; the visual information coding module and the character embedding module are connected with the space-time visual positioning module; the space-time visual positioning module is connected with the space-time trajectory generation module; the visual information coding module acquires visual features from the video frames; the character embedding module extracts text encodings from the query text; the space-time visual positioning module learns interaction features between the visual features and the text encodings and performs spatial and temporal positioning of the detection target to obtain detection frame information and start and end time information; the space-time trajectory generation module generates the space-time trajectory prediction result. With this system, visual positioning in time and space is completed simultaneously, and better feature representations can be learned so as to achieve a better positioning effect.

Description

End-to-end video spatiotemporal visual positioning system based on visual language Transformer
Technical Field
The invention relates to the technical field of multimedia, in particular to an end-to-end video spatiotemporal visual positioning system based on a visual language Transformer.
Background
Video spatiotemporal visual localization is a new and extremely challenging visual-language task. Given an untrimmed video and a language description of a detection target, the task generates a spatiotemporal trajectory tube (a series of visual localization boxes) that localizes the described target in the video. Unlike the existing image visual grounding task, spatiotemporal visual localization requires the detection target to be localized both temporally and spatially. In addition, how to efficiently exploit visual and language information for cross-modal learning is the key to accurately localizing the detection target. Scenes in which different people perform similar actions in the same scene are particularly challenging.
Spatial localization in images/videos is a visual grounding task closely related to this task. Most existing work first generates a set of candidate target detection boxes using a pre-trained object detector. These methods have certain limitations: 1) the spatial detection capability is limited by the quality of the candidate detection boxes; 2) it is difficult for a pre-trained object detector to generate detection boxes for unseen classes; 3) pre-training the detector is costly. At present, no video grounding work removes the pre-trained detector.
In addition, the video spatiotemporal visual positioning task needs to localize the detection target in both the temporal and the spatial dimension, so the existing methods are two-stage methods: temporal visual grounding is completed first to determine the start and end times of the detection target, and spatial visual grounding is then completed on this basis. However, the two-stage approach splits the overall framework into two independent networks that each complete their own subtask.
Therefore, how to provide an end-to-end visual language Transformer based system that completes the video visual positioning task without a pre-trained target detector is a problem to be solved in the field.
Disclosure of Invention
In view of this, the invention provides an end-to-end video spatiotemporal visual positioning system based on a visual language Transformer, and better feature representation can be learned by completing visual positioning in time and space at the same time, so as to achieve a better positioning effect.
In order to achieve the purpose, the invention adopts the following technical scheme:
an end-to-end video spatiotemporal visual positioning system based on a visual language Transformer comprises a visual information coding module, a character embedding module, a spatiotemporal visual positioning module and a spatiotemporal trajectory generating module; the visual information coding module and the character embedding module are connected with the space-time visual positioning module; the space-time visual positioning module is connected with the space-time trajectory generation module; the visual information coding module is used for acquiring the visual characteristics of the detection target from the video frame; the character embedding module is used for extracting text codes of detection targets from the query text; the space-time visual positioning module is used for learning the interactive characteristics between the visual characteristics and the text codes and carrying out space positioning and time positioning on a detection target to obtain detection frame information and time starting and ending information; and the space-time trajectory generation module is used for combining the generated detection frame information on a time domain and a space domain to obtain a space-time trajectory block containing a detection target.
Furthermore, the space-time visual positioning module comprises a cross-modal feature learning module and a space-time analysis core module; the cross-modal feature learning module acquires text codes and visual features and generates text-guided visual features and visual-guided text features; and the space-time analysis core module is used for positioning the generated text-guided visual features in time and space.
Furthermore, special text marks [ 'GLS' ] and [ 'SEP' ] are added into the query text and are respectively placed at the head and the tail of the query text, and the text embedding module obtains a text code with the special mark according to the query text with the special mark; and the cross-modal learning module acquires a text code with a special text mark to obtain global text characteristics.
Further, the cross-modal feature learning module comprises a visual branch module and a text branch module, and the visual branch module and the text branch module perform interactive learning to obtain text-guided visual features and visually-guided text features.
Further, a space-time combination decomposition module is constructed in the visual branch module to retain the spatial information.
Furthermore, the time-space combination decomposition module comprises a time sequence pooling module, a space pooling module, a combination module, a multi-head attention module, a decomposition module, a copying module and a normalization module; the time sequence pooling module is used for acquiring visual features to generate T multiplied by C preliminary time sequence features, wherein T represents the number of video frames, C represents the number of feature map channels, H represents the height, and W represents the width; the space pooling module is used for acquiring visual features to generate primary spatial features with the shape of HW multiplied by C, and the combination module is used for connecting the primary time sequence features and the primary spatial features on feature dimensions to form combined visual features with the size of (T + HW) multiplied by C; the multi-head attention module is used for performing attention operation according to the combined visual features and the text features to generate preliminary text-guided visual features; the decomposition module is used for generating a time sequence feature of the text guide and a space feature of the text guide according to the visual feature of the preliminary text guide; the copying module is used for copying the time sequence characteristics of the text guide for HW times and copying the space characteristics of the text guide for T times to obtain copying time sequence characteristics and copying space characteristics with the size of T multiplied by HW multiplied by C; the normalization module is used for normalizing the result obtained by adding the copying time sequence characteristic, the copying space characteristic and the visual input characteristic to generate an intermediate visual characteristic; the output of the last layer is a text-guided visual feature.
Furthermore, the space-time analysis core module comprises a space visual positioning branch module and a time visual positioning branch module; the space vision positioning module generates the center and the size of the detection box according to the visual features guided by the text; the temporal visual positioning branching module generates start and stop prediction scores based on text-guided visual features.
Further, the spatial visual positioning branch comprises deconvolution layers, a first detection network head and a second detection network head; the deconvolution layers are three layers, which deconvolve the text-guided visual features and perform spatial up-sampling to obtain up-sampled features; the first detection network head and the second detection network head are each composed of a 3×3 convolutional layer and a 1×1 convolutional layer; the first network head takes the up-sampled features to generate a heat map, the position with the highest probability value in the heat map is the center of the detection box, and the second network head obtains the size of the detection box by regression on the up-sampled features.
Furthermore, the temporal visual positioning branch module comprises a spatial pooling layer, a time-sequence convolution layer, a multi-layer perceptron layer and a calculation activation layer; the spatial pooling layer applies spatial average pooling to the text-guided visual features to obtain text-guided global visual features; the text-guided global visual features are convolved by two parallel time-sequence convolution blocks to obtain a starting visual feature and an ending visual feature respectively;
the multi-layer perceptron layer maps the global text features to text features in a C-dimensional common feature space; and the calculation activation layer performs correlation calculation between the starting and ending visual features and the text features in the common feature space and applies an activation function to obtain a starting prediction score and an ending prediction score.
Further, the spatiotemporal visual positioning module is trained with a loss function combining three focal losses and one L1 loss.
The invention has the following beneficial effects:
according to the technical scheme, compared with the prior art, the invention discloses and provides an end-to-end video space-time visual positioning system based on the visual language Transformer, so that the invention provides an end-to-end video space-time visual positioning unified frame without using a prediction training target detector, and simultaneously realizes visual positioning on time and space, thereby achieving a better positioning effect, solving the problem that the existing space detection capability is limited by the quality of a target detection candidate frame, and further avoiding a high-cost pre-training process of the detector; a space-time visual positioning module is constructed, and spatial information is reserved in a cross-modal feature learning module.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a diagram illustrating a structure of an end-to-end video spatiotemporal visual positioning system based on a visual language Transformer according to the present invention;
FIG. 2 is a diagram of a cross-modal learning module architecture;
FIG. 3 is a block diagram of a spatiotemporal decomposition module;
FIG. 4 is a block diagram of a spatiotemporal visual orientation module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention discloses an end-to-end video spatiotemporal visual positioning system (STVGBert) based on a visual language Transformer, which includes a visual information encoding module (Image Encoder), a word embedding module (Word Embedding), a spatiotemporal visual positioning module (STVGBert-core), and a spatiotemporal trajectory generation module (Tube Generation); the visual information encoding module and the word embedding module are connected with the spatiotemporal visual positioning module; the spatiotemporal visual positioning module is connected with the spatiotemporal trajectory generation module; the visual information encoding module acquires visual features from the video frames; the word embedding module extracts text encodings from the query text; the spatiotemporal visual positioning module learns the interaction between the visual features and the text encodings and generates detection box information and start and end time information; the spatiotemporal trajectory generation module obtains a spatiotemporal trajectory tube (Object Tube B) containing the detection target according to the generated detection box information and the start and end time information.
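The following is a minimal sketch of how the four modules described above could be wired together. The class and argument names are hypothetical; the patent does not specify an implementation, and each sub-module is assumed to be supplied separately.

```python
# Hypothetical wiring of the four modules (Image Encoder, Word Embedding,
# STVGBert-core, Tube Generation); a sketch, not the patent's implementation.
import torch.nn as nn

class STVGBertPipeline(nn.Module):
    def __init__(self, image_encoder, word_embedding, stvg_core, tube_generator):
        super().__init__()
        self.image_encoder = image_encoder    # video frames -> visual features
        self.word_embedding = word_embedding  # query text  -> text input tokens
        self.stvg_core = stvg_core            # cross-modal grounding core
        self.tube_generator = tube_generator  # boxes + start/end scores -> tube

    def forward(self, frames, query):
        visual_feat = self.image_encoder(frames)     # (T, HW, C)
        text_tokens = self.word_embedding(query)     # (L, C_text)
        boxes, p_start, p_end = self.stvg_core(visual_feat, text_tokens)
        return self.tube_generator(boxes, p_start, p_end)   # object tube B
```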
In another embodiment, an untrimmed video containing K × T frames is divided into K non-overlapping video slices, each containing T video frames. The video is defined as V = {V_1, V_2, …, V_K}, where V_k represents the k-th video slice. The visual information encoding module extracts visual features from the video frames and the word embedding module extracts text encodings from the query words. The visual information encoding module uses ResNet-101 as the visual encoder: its 4 residual stages convert each video frame into a feature map of shape HW × C, where H, W and C respectively denote the height, width and number of channels of the feature map. The visual encoder stacks the per-frame features to form the slice feature F_v ∈ R^{T×HW×C}, which is fed into the spatiotemporal visual positioning module as the visual feature.
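A sketch of this encoding step is given below, assuming a recent torchvision release for the ResNet-101 backbone; the exact feature stage used and the input resolution are assumptions.

```python
# Sketch of the visual information encoding step: ResNet-101 feature maps,
# flattened per frame to HW x C and stacked over the T frames of a slice.
import torch
import torchvision

backbone = torchvision.models.resnet101(weights=None)
# Keep the four residual stages; drop global pooling and the classifier head.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

def encode_slice(frames):
    """frames: (T, 3, H0, W0) video slice -> (T, H*W, C) slice feature."""
    with torch.no_grad():
        fmap = feature_extractor(frames)       # (T, C, H, W), C = 2048
    return fmap.flatten(2).transpose(1, 2)     # (T, H*W, C)

slice_feats = encode_slice(torch.randn(8, 3, 224, 224))  # e.g. T = 8 frames
print(slice_feats.shape)  # torch.Size([8, 49, 2048])
```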
In another embodiment, the query words are given special text marks: the marks ['GLS'] and ['SEP'] are placed at the head and the tail of the query text respectively. The word embedding module maps each word of the query to a vocabulary vector, and each vocabulary vector is regarded as a text input token (Input Textual Tokens); the vocabulary vector carrying the special text mark is the global text input token.
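A minimal sketch of this embedding step follows. The vocabulary, tokenisation and embedding dimension are illustrative assumptions; only the placement of the ['GLS'] and ['SEP'] marks reflects the text above.

```python
# Word embedding with the special text marks placed at head and tail.
import torch
import torch.nn as nn

class WordEmbedding(nn.Module):
    def __init__(self, vocab, dim=768):
        super().__init__()
        self.vocab = {w: i for i, w in enumerate(vocab)}
        self.embed = nn.Embedding(len(vocab), dim)

    def forward(self, query):
        # [GLS] at the head, [SEP] at the tail of the query text.
        words = ["[GLS]"] + query.lower().split() + ["[SEP]"]
        ids = torch.tensor([self.vocab[w] for w in words])
        return self.embed(ids)   # (L, dim): one text input token per word

vocab = ["[GLS]", "[SEP]", "the", "man", "in", "red", "rides", "a", "bike"]
tokens = WordEmbedding(vocab)("the man in red rides a bike")
print(tokens.shape)   # torch.Size([9, 768]); index 0 is the global text token
```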
In another embodiment, as shown in fig. 2 and fig. 3, the spatiotemporal visual localization module includes a cross-modal learning module (ST-ViLBERT) that interactively learns and associates the visual and language features. The cross-modal learning module is composed of a visual branch module (Visual Branch) and a text branch module (Textual Branch); each branch adopts a multi-layer Transformer encoding-layer structure and learns interaction features through the multi-head attention layer (Multi-head Attention) of the Transformer encoding layer. The learning scheme is as follows: in the visual branch, the text features output by the previous layer serve as the keys and values in the multi-head attention calculation, and in the text branch, the visual features output by the previous layer serve as the keys and values, so that the text-guided visual features (Text-guided Visual Feature) and the visually-guided text features are finally generated.
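The sketch below illustrates one such co-attention layer with the keys and values swapped between the two branches. The dimensions, number of heads and use of a single residual + LayerNorm step are assumptions; it is not the patent's exact layer.

```python
# One cross-modal co-attention layer: the visual branch attends to text
# keys/values, the text branch attends to visual keys/values.
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: (B, Nv, dim) visual tokens; txt: (B, Nt, dim) text tokens
        vis_out, _ = self.vis_attn(query=vis, key=txt, value=txt)
        txt_out, _ = self.txt_attn(query=txt, key=vis, value=vis)
        return self.vis_norm(vis + vis_out), self.txt_norm(txt + txt_out)

layer = CoAttentionLayer()
v, t = torch.randn(1, 57, 768), torch.randn(1, 9, 768)
text_guided_visual, visually_guided_text = layer(v, t)
```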
In another embodiment, a space-time combination decomposition (STCD) module is constructed to replace the multi-head attention layer and the addition-and-normalization layer (Add & Norm) in the visual branch module, so that spatial information is preserved during the learning process of the cross-modal feature module. In the STCD module, the visual features first undergo temporal pooling (Temporal Pooling) and spatial pooling (Spatial Pooling) respectively, yielding a preliminary temporal feature (Preliminary Temporal Feature) of shape T × C and a preliminary spatial feature (Preliminary Spatial Feature) of shape HW × C, and the combination module (Combination) connects the two features along the feature dimension to form a combined visual feature (Combined Visual Feature) of size (T + HW) × C. The combined visual feature is then passed in vector form into the text branch, where it serves as the keys and values of the multi-head attention block; meanwhile, the combined visual feature also enters the multi-head attention block of the visual branch, where an attention operation is performed with the text input (Textual Input) to generate a preliminary text-guided visual feature of size (T + HW) × C. The preliminary text-guided visual feature is decomposed by the decomposition module (Decomposition) into a text-guided temporal feature (size T × C) and a text-guided spatial feature (size HW × C). These two features are replicated HW and T times, respectively, by the replication module (Replication) to match the size of the visual features, forming a replicated temporal feature and a replicated spatial feature. Finally, the replicated temporal feature, the replicated spatial feature and the visual feature are added and normalized (Norm), generating an intermediate visual feature of size T × HW × C.
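The following is a simplified single-layer sketch of the STCD idea just described. The use of mean pooling, a single attention block and the chosen dimensions are assumptions, not the patent's exact implementation.

```python
# Space-time combination decomposition (STCD), simplified: pool, combine,
# attend to text, decompose, replicate, add & normalize.
import torch
import torch.nn as nn

class STCD(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: (B, T, HW, C) visual features; txt: (B, L, C) text features
        B, T, HW, C = vis.shape
        temporal = vis.mean(dim=2)                        # (B, T, C)  temporal pooling
        spatial = vis.mean(dim=1)                         # (B, HW, C) spatial pooling
        combined = torch.cat([temporal, spatial], dim=1)  # (B, T+HW, C) combination
        # Text features serve as keys and values in the attention operation.
        guided, _ = self.attn(query=combined, key=txt, value=txt)
        t_feat, s_feat = guided[:, :T], guided[:, T:]     # decomposition
        # Replicate: temporal feature HW times, spatial feature T times.
        t_rep = t_feat.unsqueeze(2).expand(B, T, HW, C)
        s_rep = s_feat.unsqueeze(1).expand(B, T, HW, C)
        return self.norm(vis + t_rep + s_rep)             # intermediate visual feature

out = STCD()(torch.randn(1, 8, 49, 768), torch.randn(1, 9, 768))
print(out.shape)  # torch.Size([1, 8, 49, 768])
```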
The Transformer encoding layer is formed by connecting, in sequence, a multi-head attention layer, an addition-and-normalization layer (Add & Norm), a feed-forward neural network layer (Feed Forward) and a second addition-and-normalization layer.
In addition, the cross-modal feature learning module is composed of a plurality of layers, each layer being a Transformer-structured layer that takes visual features as input and outputs new visual features. The visual features generated by the intermediate layers are the intermediate visual features, and the outputs of the last layer of the visual branch and of the text branch are taken respectively as the text-guided visual features F_tv ∈ R^{T×HW×C} and the visually-guided text features.
In another embodiment, as shown in FIG. 4, the spatiotemporal visual localization module further comprises a spatiotemporal analysis core module; the space-time analysis core module comprises a space visual positioning branch and a time visual positioning branch.
In the spatial visual positioning branch, the text-guided visual features first undergo spatial up-sampling with a factor of 8 through three deconvolution layers to generate up-sampled features, and two parallel detection network heads take the up-sampled features to generate the detection box centers (BB Centers) and the detection box sizes (Bounding Box Sizes, BB Sizes) respectively.
Each of the two parallel detection network heads is composed of a 3 × 3 convolutional layer for feature extraction and a 1 × 1 convolutional layer for dimension reduction, and both take the up-sampled features as input. The first detection network head outputs a heat map A ∈ R^{8H×8W}, in which each value represents the probability that the corresponding position is the center of the detection box containing the detection target; the spatial position with the highest probability value is selected as the center of the detection box. The second detection network head regresses the size of the detection box, from which the top-left and bottom-right coordinates of the detection box are computed.
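A sketch of this spatial branch is shown below. The channel widths, activations and deconvolution kernel settings are assumptions; only the 8× up-sampling, the two heads (3×3 conv followed by 1×1 conv) and the arg-max centre selection follow the text above.

```python
# Spatial grounding branch: three deconvolution layers (8x up-sampling),
# then a centre heat-map head and a box-size head.
import torch
import torch.nn as nn

def deconv_block(c_in, c_out):
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.ReLU(inplace=True))

def head(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(256, c_out, kernel_size=1))

class SpatialBranch(nn.Module):
    def __init__(self, c_in=768):
        super().__init__()
        self.upsample = nn.Sequential(            # H x W -> 8H x 8W
            deconv_block(c_in, 256), deconv_block(256, 256), deconv_block(256, 256))
        self.center_head = head(256, 1)           # heat map A in R^{8H x 8W}
        self.size_head = head(256, 2)             # box width and height

    def forward(self, ftv):
        # ftv: (T, C, H, W) text-guided visual features, already reshaped per frame
        up = self.upsample(ftv)
        heatmap = torch.sigmoid(self.center_head(up)).squeeze(1)  # (T, 8H, 8W)
        sizes = self.size_head(up)                                # (T, 2, 8H, 8W)
        # The box centre is the location with the highest heat-map probability.
        T, Hh, Wh = heatmap.shape
        idx = heatmap.view(T, -1).argmax(dim=1)
        centers = torch.stack([idx % Wh, idx // Wh], dim=1)       # (T, 2) as (x, y)
        return heatmap, sizes, centers

hm, sz, ct = SpatialBranch()(torch.randn(8, 768, 7, 7))
```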
In the temporal visual positioning branch, the start time and the end time at which the detection target appears in the video are predicted based on the text-guided visual features and the visually-guided text features. The text-guided visual features are averaged by spatial pooling to obtain the text-guided global visual features, which are then convolved by two parallel time-sequence convolution blocks to obtain the starting visual features and the ending visual features respectively.
Wherein each time-series convolution block is composed of 3 layers of 1-dimensional convolution layers with a convolution kernel size of 3. The starting visual feature and the ending visual feature are both T × C in size.
In another embodiment, the visually-guided text feature that the cross-modal feature learning module generates for the global text input token (the token carrying the special text mark) is taken as the global text feature. The global text feature is mapped by a multi-layer perceptron (MLP) into an intermediate text feature in the C-dimensional common feature space, and the calculation activation module (Correlation & Sigmoid) performs a correlation calculation between each starting or ending visual feature and the intermediate text feature in the common feature space and applies a Sigmoid activation function to obtain the starting and ending prediction scores, where the prediction score represents the probability that each frame in a video slice is the starting frame or the ending frame.
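The sketch below covers this temporal branch: spatial average pooling, two parallel temporal convolution blocks (three 1-D convolutions with kernel size 3 each), an MLP for the global text feature, and correlation followed by a Sigmoid. The hidden sizes and padding choice are assumptions.

```python
# Temporal grounding branch: per-frame start/end scores from text-guided
# visual features and the global text feature.
import torch
import torch.nn as nn

def temporal_block(c):
    return nn.Sequential(
        nn.Conv1d(c, c, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv1d(c, c, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv1d(c, c, 3, padding=1))

class TemporalBranch(nn.Module):
    def __init__(self, c=768):
        super().__init__()
        self.start_conv = temporal_block(c)
        self.end_conv = temporal_block(c)
        self.text_mlp = nn.Sequential(nn.Linear(c, c), nn.ReLU(inplace=True), nn.Linear(c, c))

    def forward(self, ftv, global_text):
        # ftv: (T, HW, C) text-guided visual features; global_text: (C,)
        g = ftv.mean(dim=1)                                           # (T, C) spatial average pooling
        f_start = self.start_conv(g.t().unsqueeze(0)).squeeze(0).t()  # (T, C) starting visual feature
        f_end = self.end_conv(g.t().unsqueeze(0)).squeeze(0).t()      # (T, C) ending visual feature
        q = self.text_mlp(global_text)                                # (C,) text feature in common space
        # Correlation with the text feature, then Sigmoid -> per-frame scores.
        p_start = torch.sigmoid(f_start @ q)                          # (T,)
        p_end = torch.sigmoid(f_end @ q)                              # (T,)
        return p_start, p_end

ps, pe = TemporalBranch()(torch.randn(8, 49, 768), torch.randn(768))
```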
In another embodiment, the spatiotemporal trajectory generation module combines the detection boxes and the start and end prediction scores in the time domain to construct an initial target spatiotemporal trajectory tube. The start time and the end time of the spatiotemporal trajectory tube are selected according to the maximum starting prediction score and the maximum ending prediction score, all target detection boxes before the start time and after the end time are removed, and the resulting temporal boundary together with the remaining target detection boxes forms the spatiotemporal trajectory prediction result, where the temporal boundary consists of the predicted start time and the predicted end time.
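A simplified reading of this tube-generation step is sketched below; the fallback when the predicted end precedes the predicted start is an assumption not stated in the text.

```python
# Tube generation: select the temporal boundary from the score maxima and
# keep only the per-frame boxes inside it.
import torch

def generate_tube(boxes, p_start, p_end):
    """boxes: (T, 4) per-frame detection boxes; p_start, p_end: (T,) scores."""
    t_s = int(p_start.argmax())
    t_e = int(p_end.argmax())
    if t_e < t_s:              # assumed fallback: collapse to a single frame
        t_e = t_s
    # Remove all boxes before the start time and after the end time.
    return (t_s, t_e), boxes[t_s:t_e + 1]

(t_s, t_e), tube = generate_tube(torch.rand(8, 4), torch.rand(8), torch.rand(8))
```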
In another embodiment, the spatiotemporal visual positioning module is trained with three focal losses and one L1 loss as the loss function. The specific method comprises the following steps:
A series of video slices is randomly sampled, each video slice comprising T consecutive video frames. Video slices that contain at least one annotated video frame with its corresponding target detection box are selected as training samples. For the i-th annotated video frame in each video slice, the center, width and height of the annotated target detection box are denoted (x_c^i, y_c^i), w^i and h^i respectively, and the annotated start and end times are denoted t_s and t_e. Based on the annotated target detection box, a Gaussian kernel is used to generate a center heat map A*_i ∈ R^{8H×8W}, where A*_i(x, y) denotes the value at spatial coordinate (x, y) and the bandwidth parameter σ is adaptively determined by the size of the detection target. Similarly, two 1-dimensional temporal heat maps p*_s, p*_e ∈ R^T are generated for the start time and the end time at which the detection target appears.
During training, for each video slice the loss function is defined as
L = λ_1·L_c + λ_2·L_s + λ_3·L_e + λ_4·L_size,
where A_i ∈ R^{8H×8W} is the predicted heat map, p_s, p_e ∈ R^T are the predicted start and end scores for the appearance of the target, and the width and height of the predicted target detection box, centered at the predicted position in the i-th video frame, are regressed by the network. L_c, L_s and L_e are focal losses that supervise the center of the target detection box and the predicted start and end of the temporal boundary, respectively. L_size is the L1 loss used for the regressed size of the target detection box. The loss weights are set to λ_1 = 0.1 and λ_2 = λ_3 = λ_4 = 1.
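A sketch of this combined objective follows. The exact focal-loss variant is not reproduced in the text; a CenterNet-style penalty-reduced focal loss is assumed here, and the function names are hypothetical.

```python
# Combined training loss: three focal losses (centre, start, end) and one
# L1 loss (box size), with the weights given above.
import torch

def focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """pred, gt: same shape; gt is a Gaussian-smoothed heat map in [0, 1]."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

def total_loss(heatmap, gt_heatmap, p_s, gt_s, p_e, gt_e, size_pred, size_gt,
               lambdas=(0.1, 1.0, 1.0, 1.0)):
    l_c = focal_loss(heatmap, gt_heatmap)                       # centre heat map
    l_s = focal_loss(p_s, gt_s)                                 # start-time scores
    l_e = focal_loss(p_e, gt_e)                                 # end-time scores
    l_size = torch.nn.functional.l1_loss(size_pred, size_gt)    # box size regression
    return lambdas[0] * l_c + lambdas[1] * l_s + lambdas[2] * l_e + lambdas[3] * l_size
```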
The technical effects achieved by the present invention are specifically described below:
TABLE 1 Results of the performance of different methods on the VidSTG dataset (the table is rendered as an image in the original publication; its values are not reproduced in the text)
Method      m_vIoU(%)   vIoU@0.3(%)   vIoU@0.5(%)
STVGT       18.15       26.81         9.48
STVGBert    20.42       29.37         11.31
TABLE 2 Results of the performance of different methods on the HC-STVG dataset
The results for these two datasets are shown in Tables 1 and 2. From the results, we observe the following. 1) On both datasets, the proposed method has a clear advantage over the current state-of-the-art methods on all evaluation metrics. 2) On the VidSTG dataset, the previous works GroundeR+{·}, STPR+{·} and WSSTG+{·} first use TALL or L-Net to perform temporal visual grounding, and then perform spatial visual grounding to obtain the final result. In contrast, our model generates the target detection boxes and the temporal boundary at the same time to form the spatiotemporal trajectory tube, and its performance is significantly better than that of the two-stage methods, which demonstrates the effectiveness of the proposed end-to-end method STVGBert. 3) In Table 2, both STGVT and our STVGBert apply a visual language Transformer to perform cross-modal feature learning, but the video spatiotemporal grounding method proposed by the invention has obvious advantages over STGVT. In addition, STGVT requires a pre-trained object detector to generate a set of target detection boxes, whereas the proposed system STVGBert can directly process the input video.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An end-to-end video space-time visual positioning system based on a visual language Transformer is characterized by comprising a visual information coding module, a character embedding module, a space-time visual positioning module and a space-time track generating module; the visual information coding module and the character embedding module are connected with the space-time visual positioning module; the space-time visual positioning module is connected with the space-time trajectory generation module; the visual information coding module is used for acquiring the visual characteristics of the detection target from the video frame; the character embedding module is used for extracting text codes of detection targets from the query text; the space-time visual positioning module is used for learning the interactive characteristics between the visual characteristics and the text codes and carrying out space positioning and time positioning on a detection target to obtain detection frame information and time starting and ending information; and the space-time trajectory generation module is used for combining the generated detection frame information on a time domain and a space domain to obtain a space-time trajectory block containing a detection target.
2. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 1, wherein the spatiotemporal visual positioning module comprises a cross-modal feature learning module and a spatiotemporal analysis core module; the cross-modal feature learning module acquires text codes and visual features and generates text-guided visual features and visual-guided text features; and the space-time analysis core module is used for positioning the generated text-guided visual features in time and space.
3. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 2, wherein the query text is added with special text labels [ 'GLS' ] and [ 'SEP' ], and the special text labels are respectively placed at the beginning and the end of the query text, and the text embedding module obtains a text code with a special label according to the query text with the special label; and the cross-modal learning module acquires a text code with a special text mark to obtain global text characteristics.
4. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 2, wherein the cross-modal feature learning module comprises a visual branch module and a text branch module, and the visual branch module and the text branch module perform interactive learning to obtain text-guided visual features and visually-guided text features.
5. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 4, wherein the spatiotemporal combination decomposition module constructed in the visual branch module retains spatial information.
6. The visual language Transformer-based video spatiotemporal visual positioning system of claim 5, wherein the spatiotemporal combination decomposition module comprises a temporal pooling module, a spatial pooling module, a combination module, a multi-headed attention module, a decomposition module, a replication module, and a normalization module;
the time sequence pooling module is used for acquiring visual features to generate T multiplied by C preliminary time sequence features, wherein T represents the number of video frames, C represents the number of feature map channels, H represents the height, and W represents the width;
the spatial pooling module is used to collect visual features to generate preliminary spatial features in the shape of HW x C,
the combination module is used for connecting the preliminary time sequence characteristics and the preliminary space characteristics on characteristic dimensions to form combined visual characteristics with the size of (T + HW) multiplied by C;
the multi-head attention module is used for performing attention operation according to the combined visual features and the text features to generate preliminary text-guided visual features;
the decomposition module is used for generating a time sequence feature of the text guide and a space feature of the text guide according to the visual feature of the preliminary text guide;
the copying module is used for copying the time sequence characteristics of the text guide for HW times and copying the space characteristics of the text guide for T times to obtain copying time sequence characteristics and copying space characteristics with the size of T multiplied by HW multiplied by C;
the normalization module is used for normalizing the result obtained by adding the copying time sequence characteristic, the copying space characteristic and the visual input characteristic to generate an intermediate visual characteristic; the output of the last layer is a text-guided visual feature.
7. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 3, wherein the spatiotemporal analysis core module comprises a spatial visual positioning branch module and a temporal visual positioning branch module; the space vision positioning module generates the center and the size of the detection box according to the visual features guided by the text; the temporal visual positioning branching module generates start and stop prediction scores based on text-guided visual features.
8. A visual language Transformer based end-to-end video spatiotemporal visual positioning system according to claim 7, characterized in that said spatial visual positioning branch comprises deconvolution layers, a first detection network head and a second detection network head; the deconvolution layers are three layers, which deconvolve the text-guided visual features and perform spatial up-sampling to obtain up-sampled features; the first detection network head and the second detection network head are each composed of a 3×3 convolutional layer and a 1×1 convolutional layer; the first network head takes the up-sampled features to generate a heat map, the position with the highest probability value in the heat map is the center of the detection box, and the second network head obtains the size of the detection box by regression on the up-sampled features.
9. The visual language Transformer-based end-to-end video spatiotemporal visual positioning system according to claim 8, wherein the temporal visual positioning branch module comprises a spatial pooling layer, a time-sequence convolution layer, a multi-layer perceptron layer and a calculation activation layer; the spatial pooling layer applies spatial average pooling to the text-guided visual features to obtain text-guided global visual features; the text-guided global visual features are convolved by two parallel time-sequence convolution blocks to obtain a starting visual feature and an ending visual feature respectively; the multi-layer perceptron layer maps the global text features to text features in a C-dimensional common feature space; and the calculation activation layer performs correlation calculation between the starting and ending visual features and the text features in the common feature space and applies an activation function to obtain a starting prediction score and an ending prediction score.
10. A visual language Transformer based end-to-end video spatiotemporal visual localization system according to claim 1, characterized in that said spatiotemporal visual localization module is trained with a loss function combining three focal losses and one L1 loss.
CN202111100948.0A 2021-09-18 2021-09-18 End-to-end video spatiotemporal visual positioning system based on visual language Transformer Pending CN113849668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111100948.0A CN113849668A (en) 2021-09-18 2021-09-18 End-to-end video spatiotemporal visual positioning system based on visual language Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111100948.0A CN113849668A (en) 2021-09-18 2021-09-18 End-to-end video spatiotemporal visual positioning system based on visual language Transformer

Publications (1)

Publication Number Publication Date
CN113849668A true CN113849668A (en) 2021-12-28

Family

ID=78974636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111100948.0A Pending CN113849668A (en) 2021-09-18 2021-09-18 End-to-end video spatiotemporal visual positioning system based on visual language Transformer

Country Status (1)

Country Link
CN (1) CN113849668A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067371A (en) * 2022-01-18 2022-02-18 之江实验室 Cross-modal pedestrian trajectory generation type prediction framework, method and device
CN115495677A (en) * 2022-11-21 2022-12-20 阿里巴巴(中国)有限公司 Method and storage medium for spatio-temporal localization of video
CN116304560A (en) * 2023-01-17 2023-06-23 北京信息科技大学 Track characterization model training method, characterization method and device based on multi-scale enhanced contrast learning
CN117058601A (en) * 2023-10-13 2023-11-14 华中科技大学 Video space-time positioning network and method of cross-modal network based on Gaussian kernel

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067371A (en) * 2022-01-18 2022-02-18 之江实验室 Cross-modal pedestrian trajectory generation type prediction framework, method and device
CN115495677A (en) * 2022-11-21 2022-12-20 阿里巴巴(中国)有限公司 Method and storage medium for spatio-temporal localization of video
CN116304560A (en) * 2023-01-17 2023-06-23 北京信息科技大学 Track characterization model training method, characterization method and device based on multi-scale enhanced contrast learning
CN116304560B (en) * 2023-01-17 2023-11-24 北京信息科技大学 Track characterization model training method, characterization method and device
CN117058601A (en) * 2023-10-13 2023-11-14 华中科技大学 Video space-time positioning network and method of cross-modal network based on Gaussian kernel

Similar Documents

Publication Publication Date Title
CN113849668A (en) End-to-end video spatiotemporal visual positioning system based on visual language Transformer
Senocak et al. Learning to localize sound source in visual scenes
Plummer et al. Conditional image-text embedding networks
Chen et al. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning
Messina et al. Transformer reasoning network for image-text matching and retrieval
Ge et al. An attention mechanism based convolutional LSTM network for video action recognition
Mun et al. Text-guided attention model for image captioning
Fung et al. End-to-end low-resource lip-reading with maxout CNN and LSTM
CN112559698B (en) Method and system for improving video question-answering precision based on multi-mode fusion model
US11288438B2 (en) Bi-directional spatial-temporal reasoning for video-grounded dialogues
CN111651635B (en) Video retrieval method based on natural language description
CN113239820A (en) Pedestrian attribute identification method and system based on attribute positioning and association
Estevam et al. Dense video captioning using unsupervised semantic information
Mazaheri et al. Video fill in the blank using lr/rl lstms with spatial-temporal attentions
Xue et al. LCSNet: End-to-end lipreading with channel-aware feature selection
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
Vahdati et al. Facial beauty prediction from facial parts using multi-task and multi-stream convolutional neural networks
Bashmal et al. Language Integration in Remote Sensing: Tasks, datasets, and future directions
Wang et al. How to make a BLT sandwich? learning to reason towards understanding web instructional videos
CN114596432A (en) Visual tracking method and system based on corresponding template features of foreground region
Xian et al. CLIP Driven Few-shot Panoptic Segmentation
Zhao et al. Research on human behavior recognition in video based on 3DCCA
CN110211146B (en) Video foreground segmentation method and device for cross-view simulation
Wu et al. Question-driven multiple attention (dqma) model for visual question answer

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination