CN114743139A - Video scene retrieval method and device, electronic equipment and readable storage medium - Google Patents


Publication number
CN114743139A
Authority
CN
China
Prior art keywords: video sequence, frame, region, image, current video
Legal status: Pending
Application number
CN202210339794.9A
Other languages
Chinese (zh)
Inventor
陈禹行
殷佳豪
刘志励
范圣印
李雪
Current Assignee
Beijing Yihang Yuanzhi Technology Co Ltd
Original Assignee
Beijing Yihang Yuanzhi Technology Co Ltd
Application filed by Beijing Yihang Yuanzhi Technology Co Ltd filed Critical Beijing Yihang Yuanzhi Technology Co Ltd
Priority to CN202210339794.9A
Publication of CN114743139A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/787Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The application relates to a video scene retrieval method and apparatus, an electronic device, and a readable storage medium, and belongs to the field of computer technology. The method includes: obtaining a current video sequence that comprises multiple frames of images; extracting, from the multiple frames, a dense deep learning feature map corresponding to each frame; performing time domain feature fusion based on the dense deep learning feature maps of the frames to obtain respective fused features; performing spatio-temporal feature aggregation on the fused features of the frames to obtain a global feature descriptor corresponding to the current video sequence; and retrieving a first preset number of video sequences from a global database based on that global feature descriptor. The video scene retrieval method and apparatus, electronic device, and readable storage medium can improve the accuracy of video sequence retrieval and thus improve user experience.

Description

Video scene retrieval method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video scene retrieval method and apparatus, an electronic device, and a readable storage medium.
Background
In recent years, applications such as autonomous memory parking, intelligent logistics carts, restaurant meal-delivery robots, and autonomously cruising drones have emerged, and for all of them it is very important to recognize a scene that has been visited before. When such a task is executed for the first time (for example, parking the car into a space), a correct motion path is planned manually in advance and a scene map is established; when the task is later executed autonomously, the intelligent robot or autonomous vehicle perceives its position in the scene map from the currently observed scene, and then either autonomously tracks the pre-planned path or navigates with autonomous obstacle avoidance according to the scene map. The accuracy of scene re-identification is therefore crucial to the operation of the subsequent localization, tracking, and navigation algorithm modules.
In the above application scenarios, a long time span may elapse between building the scene map and executing the autonomous navigation task, so the environment around the scene can change considerably: the map may be built in the morning while autonomous navigation takes place at night; the map may be built on a sunny day while navigation happens on a rainy, foggy, or snowy day, and the two may even fall in different seasons, so the appearance of the scene observed at the two times changes greatly. In addition, the scenes of these applications are often quite complex: during autonomous navigation the images are disturbed by dynamic objects such as pedestrians and vehicles, which further increases the appearance difference between the two observations, and the dynamic objects may even partially occlude the scene. Meanwhile, open scenes and repeatedly appearing objects with identical texture pose another major challenge, for example open parking lots, different garages with similar design styles, and nearly identical lamp posts and fences along roads.
In the course of research, the inventors found that the above situations can lower the accuracy of scene re-identification and thus degrade the user experience.
Disclosure of Invention
The present application aims to provide a video scene retrieval method, apparatus, electronic device and readable storage medium, which are used to solve at least one of the above technical problems.
The above object of the present invention is achieved by the following technical solutions:
in a first aspect, a video scene retrieval method is provided, including:
acquiring a current video sequence, wherein the current video sequence comprises a plurality of frames of images;
respectively extracting dense depth learning feature maps corresponding to the frames of images from the multiple frames of images;
respectively performing time domain feature fusion on the dense deep learning feature maps respectively corresponding to the frame images to obtain respective fused features;
performing space-time feature aggregation processing on the basis of the fused features corresponding to the images of each frame respectively to obtain a global feature descriptor corresponding to the current video sequence;
and retrieving from a global database based on the global feature descriptors corresponding to the current video sequence to obtain a first preset number of video sequences.
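For orientation, the following is a minimal Python/NumPy sketch (not the patent's implementation) of the coarse retrieval flow of the first aspect; `extract_features`, `fuse_temporal`, `aggregate`, and `global_database` are illustrative placeholders rather than names from the disclosure:

```python
import numpy as np

def retrieve_video_scene(frames, extract_features, fuse_temporal, aggregate,
                         global_database, top_k=10):
    """Coarse retrieval: frames -> global descriptor -> top-k nearest stored sequences.

    `extract_features`, `fuse_temporal`, and `aggregate` stand in for the dense-feature
    network, the attention-based temporal fusion, and the spatio-temporal aggregation
    described above; `global_database` maps a sequence id to its stored descriptor.
    """
    dense_maps = [extract_features(f) for f in frames]          # one dense map per frame
    fused = [fuse_temporal(m, dense_maps) for m in dense_maps]  # time domain feature fusion
    query_desc = aggregate(fused)                               # global feature descriptor

    # Rank stored sequences by Euclidean distance between global descriptors.
    scored = [(seq_id, np.linalg.norm(query_desc - desc))
              for seq_id, desc in global_database.items()]
    scored.sort(key=lambda x: x[1])
    return [seq_id for seq_id, _ in scored[:top_k]]   # first preset number of sequences
```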
In a possible implementation manner, the performing time domain feature fusion based on the dense deep learning feature maps respectively corresponding to the frames of images to obtain respective fused features includes: performing time domain feature fusion through a self-attention mechanism based on the dense deep learning feature maps respectively corresponding to the frames of images, to obtain the respective fused features.
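One way to realize such attention-based temporal fusion is a non-local-style self-attention block over the stacked per-frame feature maps. The PyTorch sketch below is an assumption about the structure (layer widths, residual connection, and scaling are illustrative), not the exact network of this application:

```python
import torch
import torch.nn as nn

class TemporalSelfAttentionFusion(nn.Module):
    """Fuses features across the frames of one sequence with dot-product attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.key   = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -- T frames of one video sequence.
        b, c, t, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)    # (B, THW, C/2)
        k = self.key(x).flatten(2)                       # (B, C/2, THW)
        v = self.value(x).flatten(2).transpose(1, 2)     # (B, THW, C)
        attn = torch.softmax(q @ k / (c // 2) ** 0.5, dim=-1)   # attention over all frame positions
        fused = (attn @ v).transpose(1, 2).reshape(b, c, t, h, w)
        return x + fused    # residual connection, as in non-local blocks
```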
In another possible implementation manner, the performing spatio-temporal feature aggregation processing based on the fused features respectively corresponding to each frame of image to obtain a global feature descriptor corresponding to the current video sequence includes:
splicing the time domain characteristic graphs corresponding to the frames of images to obtain spliced characteristic graphs;
performing point-by-point convolution processing on the spliced feature map to obtain a convolution processing result;
carrying out normalization processing on the convolution processing result to obtain a result after the normalization processing;
and determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map.
In another possible implementation manner, the feature map after the stitching process includes a plurality of feature points;
the determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map includes:
clustering the plurality of feature points to obtain at least one clustering center;
determining the distance between each characteristic point and each clustering center, and determining the distance information corresponding to each clustering center, wherein the distance information corresponding to any clustering center is the distance between each characteristic point and any clustering center;
determining global representations respectively corresponding to the clustering clusters based on the distance information corresponding to each clustering center and the result after the normalization processing;
performing regularization processing on the global representations corresponding to the cluster clusters respectively;
splicing all the global representations after the regularization treatment;
and performing regularization processing on the global representation after the splicing processing to obtain a global feature descriptor corresponding to the current video sequence.
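The concatenation, point-wise convolution, normalization, and cluster-residual steps above resemble a NetVLAD-style soft-assignment aggregation extended over the frames of a sequence. The PyTorch sketch below illustrates that reading; the number of clusters, the soft-assignment form, and the normalization order are assumptions, not the patent's exact definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalVLADSketch(nn.Module):
    """Aggregates the fused per-frame feature maps of a sequence into one global descriptor."""
    def __init__(self, channels: int, num_clusters: int = 64):
        super().__init__()
        self.assign = nn.Conv2d(channels, num_clusters, kernel_size=1)   # point-wise convolution
        self.centers = nn.Parameter(torch.randn(num_clusters, channels)) # cluster centers

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) -- fused feature maps of the T frames, stacked in time.
        t, c, h, w = feats.shape
        soft = F.softmax(self.assign(feats), dim=1)       # (T, K, H, W) soft assignment
        x = feats.flatten(2)                               # (T, C, N) feature points
        a = soft.flatten(2)                                # (T, K, N)
        # residual of every feature point to every cluster center, weighted by its assignment
        resid = x.unsqueeze(1) - self.centers.view(1, -1, c, 1)   # (T, K, C, N)
        vlad = (a.unsqueeze(2) * resid).sum(dim=(0, 3))           # (K, C): sum over frames and points
        vlad = F.normalize(vlad, dim=1)             # per-cluster regularization
        vlad = F.normalize(vlad.flatten(), dim=0)   # regularization after concatenation
        return vlad    # 1-D global feature descriptor of length K*C
```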
In another possible implementation manner, after the dense deep learning feature maps respectively corresponding to the frames of images are extracted from the multiple frames of images, the method further includes:
respectively extracting the regional features of the dense depth learning feature maps corresponding to the frame images to obtain the multi-scale regional features corresponding to the frame images;
performing region matching based on the respective corresponding multi-scale region characteristics to obtain a space-time characteristic descriptor corresponding to the current video sequence;
and performing region matching on the video sequences with the first preset number based on the space-time feature descriptors corresponding to the current video sequences to obtain video sequences with a second preset number.
In another possible implementation manner, performing region feature extraction based on a dense depth learning feature map corresponding to any frame of image to obtain a multi-scale region feature corresponding to any frame of image includes:
determining a weighted residual feature map based on a dense depth learning feature map corresponding to any frame of image;
dividing the weighted residual characteristic map into a plurality of area blocks;
and determining the area characteristic representation corresponding to each area block so as to obtain the multi-scale area characteristic corresponding to any frame of image.
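As an illustration of the block-splitting step, the sketch below divides a weighted residual feature map into non-overlapping square blocks at several sizes and pools each block into a region descriptor; the block sizes and the use of average pooling plus L2 normalization are assumptions:

```python
import torch
import torch.nn.functional as F

def region_block_descriptors(residual_map: torch.Tensor, block_sizes=(2, 4, 8)):
    """Splits a weighted residual feature map into square blocks at several scales and
    pools each block into one region descriptor (illustrative sketch)."""
    # residual_map: (C, H, W)
    descriptors = {}
    for p in block_sizes:
        # avg_pool2d over non-overlapping p x p blocks gives one vector per region block
        pooled = F.avg_pool2d(residual_map.unsqueeze(0), kernel_size=p, stride=p)
        regions = pooled.squeeze(0).flatten(1).transpose(0, 1)    # (num_blocks, C)
        descriptors[p] = F.normalize(regions, dim=1)              # one L2-normalized vector per block
    return descriptors   # block size -> (num_blocks, C) multi-scale region features
```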
In another possible implementation manner, the determining a weighted residual feature map based on the dense depth learning feature map corresponding to any frame of image includes:
performing point-by-point convolution processing on the dense depth learning feature map corresponding to any frame of image to obtain a convolution result;
carrying out normalization processing on the convolution result to obtain a normalization result;
and determining the weighted residual error feature map based on the normalization result and the distance information corresponding to each cluster center.
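A hedged sketch of one plausible reading of the weighted residual feature map: the point-wise convolution plus normalization gives per-pixel soft assignments to the cluster centers, which weight the residuals to those centers while preserving the spatial layout. `assign_conv` is an assumed 1x1 convolution with K output channels:

```python
import torch
import torch.nn.functional as F

def weighted_residual_map(feat: torch.Tensor, assign_conv: torch.nn.Conv2d,
                          centers: torch.Tensor) -> torch.Tensor:
    """Weighted residual map for one frame, keeping the spatial layout so it can
    later be split into region blocks (an assumed interpretation, not the patent's
    exact construction)."""
    # feat: (C, H, W); centers: (K, C); assign_conv: Conv2d(C, K, kernel_size=1)
    soft = F.softmax(assign_conv(feat.unsqueeze(0)), dim=1).squeeze(0)   # (K, H, W)
    resid = feat.unsqueeze(0) - centers[:, :, None, None]                # (K, C, H, W)
    weighted = (soft.unsqueeze(1) * resid).sum(dim=0)                    # (C, H, W)
    return weighted
```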
In another possible implementation, the multi-scale region features are characterized by a region descriptor;
the obtaining of the space-time feature descriptor corresponding to the current video sequence by performing region matching based on the respective corresponding multi-scale region features comprises:
carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and the region descriptors corresponding to other frames of images in the current video sequence respectively to obtain a region matching result corresponding to the current video sequence;
selecting a region descriptor meeting a preset condition from a region matching result corresponding to the current video sequence as a region descriptor corresponding to the current video sequence;
and respectively carrying out region feature matching on the region descriptors corresponding to the current video sequence and each video sequence stored in a global database to obtain region matching results of the current video sequence and each video sequence.
In another possible implementation manner, performing region feature matching on a region descriptor corresponding to any frame of image in the current video sequence and a region descriptor corresponding to any other frame of image in the current video sequence to obtain a corresponding matching result, includes:
determining the distance between the region descriptor corresponding to any frame of image and each region descriptor in each region of other frame of images in the current video sequence;
performing region feature matching between the region descriptors corresponding to any frame of image in the current video sequence and the region descriptors corresponding to any other frame of image in the current video sequence by the following formula to obtain a corresponding matching result:
[Formula (shown as an image in the original): the condition on the distance matrix D that defines the matching set Pmn between frame Tm and frame Tn]
where the element Dij of the matrix denotes the distance between the i-th region descriptor in frame Tm and the j-th region descriptor in frame Tn; the matrix D denotes the distances between all region descriptors in frame Tm and all region descriptors in frame Tn of the video sequence, frame Tm denotes any one frame of image, and Tn denotes any other frame of image in the current video sequence; Dij_k denotes the element with the smallest distance value in the j-th column of the matrix D, Di_kj denotes the element with the smallest distance value in the i-th row of the matrix D, and t denotes a threshold parameter; the matching items (i, j) that satisfy the condition form the matching set Pmn between frame Tm and frame Tn.
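A NumPy sketch of such a frame-to-frame region matching step is given below; keeping mutual nearest neighbours whose distance is below the threshold t is an assumed reading of the equation image, not the exact condition from the disclosure:

```python
import numpy as np

def match_regions(desc_m: np.ndarray, desc_n: np.ndarray, t: float):
    """Matches region descriptors of frame Tm against frame Tn.

    D[i, j] is the distance between the i-th descriptor of Tm and the j-th of Tn.
    Pairs kept here are mutual nearest neighbours with distance below threshold t
    (an assumption), returned together with the full distance matrix D.
    """
    D = np.linalg.norm(desc_m[:, None, :] - desc_n[None, :, :], axis=-1)   # (Nm, Nn)
    P_mn = []
    for i in range(D.shape[0]):
        j = int(np.argmin(D[i]))                       # best column for row i
        if np.argmin(D[:, j]) == i and D[i, j] < t:    # mutual nearest neighbour + threshold
            P_mn.append((i, j))
    return P_mn, D
```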
In another possible implementation manner, selecting an area descriptor of a first preset condition from the area matching result corresponding to any one of the areas, as the area descriptor corresponding to any one of the areas, includes:
determining the average value of the distances meeting the preset conditions;
and determining the area descriptor corresponding to any area based on the average value of the distances meeting the first preset condition.
In another possible implementation manner, the determining, based on the average value of the distances satisfying the first preset condition, the area descriptor corresponding to any one of the areas includes:
determining a region descriptor corresponding to any region by the following formula based on the average value of the distances meeting the first preset condition:
[Formula (shown as an image in the original): selection of the region descriptor x' based on the average matching distance]
where x is a region in the group Si, Px is the set of matching items corresponding to region x in the frame matching sets containing region x, Dx is the set of all Dij extracted from the matching item set Px, the averaged quantity (shown as an image in the original) is the mean of the Dij elements in the set, and x' denotes the region descriptors determined for all regions in the set Si.
In another possible implementation manner, performing region feature matching on the region descriptor corresponding to the current video sequence and any video sequence includes:
and respectively carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner, performing region feature matching on a region descriptor corresponding to each frame of image with a region descriptor corresponding to any frame of image includes:
determining a distance vector corresponding to each frame of image based on a region descriptor corresponding to each region in each frame of image and a region descriptor corresponding to each region in any frame of image, wherein the distance vector corresponding to each frame of image comprises a plurality of elements, and any element is a distance between the region descriptor corresponding to any region in each frame of image and the region descriptor corresponding to any region in any frame of image.
In another possible implementation manner, the performing region matching on the first preset number of video sequences based on the spatio-temporal feature descriptors corresponding to the current video sequence to obtain a second preset number of video sequences includes:
determining a spatial consistency score corresponding to the current video sequence and each video sequence respectively based on the region matching result corresponding to the current video sequence and each video sequence respectively, wherein each video sequence belongs to the video sequences with the first preset number;
reordering the video sequences with the first preset number based on the spatial consistency scores respectively corresponding to the current video sequence and each video sequence;
and extracting a second preset number of video sequences from the sorted first preset number of video sequences.
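A minimal sketch of this re-ranking step, assuming a callable `vss_score` that returns the sequence-level spatial consistency score of a candidate against the query (both names are illustrative):

```python
def rerank_by_spatial_consistency(candidates, vss_score, second_preset_number):
    """Re-orders the coarse retrieval candidates by their spatial consistency score
    with the current video sequence and keeps the best ones.

    `candidates` is the first preset number of sequences returned by the global
    descriptor search; `vss_score(seq)` is assumed to return the VSS between the
    current video sequence and `seq`.
    """
    reranked = sorted(candidates, key=vss_score, reverse=True)   # higher score = better match
    return reranked[:second_preset_number]
```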
In another possible implementation manner, determining a spatial consistency score corresponding to the current video sequence and any video sequence includes:
determining a spatial consistency score between each frame of image in a current video sequence and each frame of image in any video sequence;
determining the weight information of each frame of image in the current video sequence;
and determining a spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner, the determining a spatial congruency score between each frame of image in the current video sequence and any frame of image includes:
determining region matching space consistency scores of various sizes;
determining weight information corresponding to the areas of all sizes;
and determining the spatial consistency score between each frame of image and any frame of image in the current video sequence based on the region matching spatial consistency scores of the sizes and the weight information respectively corresponding to the regions of the sizes.
In another possible implementation, determining a region matching spatial consistency score of any size includes:
determining a region matching spatial consistency score for any size by the following formula:
[Formula (shown as an image in the original): definition of the size-p region-matching spatial consistency score SSp]
where SSp denotes the region-matching spatial consistency score for size p, np denotes the number of region blocks of size p extracted from the frame image, Pp is the set of matched region features of size p, and (rp, cp) are the matching offsets stored in Pp; the two averaged quantities (shown as images in the original) denote the average column offset and the average row offset in the set Pp, respectively; i and j index the traversal of the set Pp, dist(·) is a distance function, and max(·) is a maximum function;
the determining of the spatial consistency score between each frame of image in the current video sequence and any frame of image, based on the region-matching spatial consistency scores of the respective sizes and the weight information corresponding to the regions of the respective sizes, is performed by the following formula:
[Formula (shown as an image in the original): weighted combination of the per-size scores into the frame-level spatial consistency score SS]
where SS denotes the spatial consistency score between each frame of image in the current video sequence and any frame of image, i traverses the set of scales, ns is the number of scales, wi is the weight information corresponding to size i, and wi ∈ [0, 1].
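The sketch below illustrates one hedged reading of these two formulas: a per-size score that rewards matches whose offsets agree with the average offset, combined across sizes with weights in [0, 1]. The exact functional form in the equation images may differ:

```python
import numpy as np

def frame_spatial_consistency(matches_by_scale, weights):
    """Frame-to-frame spatial consistency from multi-scale region matches.

    `matches_by_scale[p]` holds the matching offsets (r, c) of the size-p region
    matches; the per-scale score here rewards offsets close to their mean (an
    assumed reading, not the exact definition), and scales are combined with
    weights w_p in [0, 1].
    """
    total = 0.0
    for p, offsets in matches_by_scale.items():
        if not offsets:
            continue
        offs = np.asarray(offsets, dtype=float)             # (n_p, 2) row/column offsets
        mean_off = offs.mean(axis=0)                        # average offset of the matches
        spread = np.linalg.norm(offs - mean_off, axis=1)    # deviation from the average offset
        ss_p = float(np.mean(1.0 / (1.0 + spread)))         # high when offsets are consistent
        total += weights.get(p, 0.0) * ss_p
    return total
```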
In another possible implementation manner, the determining, based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence, a spatial consistency score corresponding to the current video sequence and any video sequence includes:
based on the weight information of each frame of image in the current video sequence and the spatial consistency scores between each frame of image in the current video sequence and each frame of image in any video sequence, determining the spatial consistency score corresponding to the current video sequence and any video sequence by the following formula:
[Formula (shown as an image in the original): definition of the sequence-level spatial consistency score VSS]
where VSS denotes the spatial consistency score between the current video sequence and any video sequence Vref, Vref belongs to the first preset number of video sequences, m denotes a frame in the current video sequence, k denotes a frame of Vref, and the weight term (shown as an image in the original) denotes the weight information of frame m.
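A hedged sketch of the sequence-level score: each query frame contributes its best frame-to-frame consistency against the reference sequence, weighted by that frame's weight (the exact combination in the equation image may differ; all callables here are assumed placeholders):

```python
def sequence_spatial_consistency(query_frames, ref_frames, frame_weight, frame_ss):
    """Sequence-level spatial consistency score (VSS) between the current video
    sequence and one reference sequence Vref.

    `frame_ss(m, k)` is assumed to return the frame-to-frame spatial consistency
    score between query frame m and reference frame k, and `frame_weight(m)` the
    weight information of frame m.
    """
    vss = 0.0
    for m in query_frames:
        best = max(frame_ss(m, k) for k in ref_frames)   # best match over the reference frames
        vss += frame_weight(m) * best
    return vss
```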
In a second aspect, a video scene retrieval apparatus is provided, including:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a current video sequence which comprises a plurality of frames of images;
the feature map extraction module is used for respectively extracting dense depth learning feature maps corresponding to the frames of images from the multiple frames of images;
the time domain feature fusion module is used for respectively carrying out time domain feature fusion on the basis of the dense deep learning feature maps respectively corresponding to the frames of images to obtain respective fused features;
the temporal-spatial feature aggregation processing module is used for performing temporal-spatial feature aggregation processing on the basis of the fused features corresponding to the frames of images respectively to obtain a global feature descriptor corresponding to the current video sequence;
and the first retrieval module is used for retrieving from a global database based on the global feature descriptors corresponding to the current video sequence to obtain a first preset number of video sequences.
In a possible implementation manner, the time domain feature fusion module is specifically configured to, when performing time domain feature fusion on the dense depth learning feature maps respectively corresponding to the frame images to obtain respective fused features:
and respectively corresponding to the dense deep learning feature maps based on the frame images, and performing time domain feature fusion through an attention mechanism to obtain respective fused features.
In another possible implementation manner, the spatio-temporal feature aggregation processing module is specifically configured to, when performing spatio-temporal feature aggregation processing based on the fused features respectively corresponding to each frame image to obtain a global feature descriptor corresponding to a current video sequence:
splicing the time domain characteristic graphs corresponding to the frames of images to obtain spliced characteristic graphs;
performing point-by-point convolution processing on the spliced feature map to obtain a convolution processing result;
carrying out normalization processing on the convolution processing result to obtain a result after the normalization processing;
and determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map.
In another possible implementation manner, the feature map after the stitching process includes a plurality of feature points;
the spatio-temporal feature aggregation processing module is specifically configured to, when determining the global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map:
clustering the plurality of feature points to obtain at least one clustering center;
determining the distance between each characteristic point and each clustering center, and determining the distance information corresponding to each clustering center, wherein the distance information corresponding to any clustering center is the distance between each characteristic point and any clustering center;
determining global representations respectively corresponding to the clustering clusters based on the distance information corresponding to each clustering center and the result after the normalization processing;
performing regularization processing on the global representations corresponding to the cluster clusters respectively;
splicing all the global representations after the regularization treatment;
and carrying out regularization processing on the global representation after splicing processing to obtain a global feature descriptor corresponding to the current video sequence.
In another possible implementation manner, the apparatus further includes: a multi-scale region feature extraction module, a spatio-temporal region feature matching module and a second retrieval module, wherein,
the multi-scale region extraction module is used for respectively extracting region features of the dense depth learning feature maps corresponding to the frames of images to obtain the respective corresponding multi-scale region features;
the space-time region feature matching module is used for carrying out region matching based on the respective corresponding multi-scale region features to obtain a space-time feature descriptor corresponding to the current video sequence;
and the second retrieval module is used for carrying out region matching on the video sequences with the first preset number based on the space-time feature descriptors corresponding to the current video sequences to obtain the video sequences with the second preset number.
In another possible implementation manner, the multi-scale region feature extraction module, when performing region feature extraction based on a dense depth learning feature map corresponding to any frame image to obtain a multi-scale region feature corresponding to any frame image, is specifically configured to:
determining a weighted residual feature map based on a dense depth learning feature map corresponding to any frame of image;
dividing the weighted residual error feature map into a plurality of area blocks;
and determining the area characteristic representation corresponding to each area block so as to obtain the multi-scale area characteristic corresponding to any frame of image.
In another possible implementation manner, when determining the weighted residual feature map based on the dense depth learning feature map corresponding to any frame of image, the multi-scale region feature extraction module is specifically configured to:
performing point-by-point convolution processing on the dense depth learning feature map corresponding to any frame of image to obtain a convolution result;
carrying out normalization processing on the convolution result to obtain a normalization result;
and determining the weighted residual error feature map based on the normalization result and the distance information corresponding to each cluster center.
In another possible implementation, the multi-scale region features are characterized by a region descriptor;
the spatio-temporal region feature matching module is specifically configured to, when performing region matching based on the respective corresponding multi-scale region features to obtain a spatio-temporal feature descriptor corresponding to the current video sequence:
carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and the region descriptors corresponding to other frames of images in the current video sequence respectively to obtain a region matching result corresponding to the current video sequence;
selecting a region descriptor meeting a preset condition from a region matching result corresponding to the current video sequence as a region descriptor corresponding to the current video sequence;
and respectively carrying out region feature matching on the region descriptors corresponding to the current video sequence and each video sequence stored in a global database to obtain region matching results of the current video sequence and each video sequence.
In another possible implementation manner, the spatio-temporal region feature matching module is specifically configured to, when performing region feature matching on a region descriptor corresponding to any frame image in the current video sequence and a region descriptor corresponding to any other frame image in the current video sequence to obtain a corresponding matching result:
determining the distance between the region descriptor corresponding to any frame of image and each region descriptor in each region of other frame of images in the current video sequence;
performing region feature matching between the region descriptors corresponding to any frame of image in the current video sequence and the region descriptors corresponding to any other frame of image in the current video sequence by the following formula to obtain a corresponding matching result:
[Formula (shown as an image in the original): the condition on the distance matrix D that defines the matching set Pmn between frame Tm and frame Tn]
where the element Dij of the matrix denotes the distance between the i-th region descriptor in frame Tm and the j-th region descriptor in frame Tn; the matrix D denotes the distances between all region descriptors in frame Tm and all region descriptors in frame Tn of the video sequence, frame Tm denotes any one frame of image, and Tn denotes any other frame of image in the current video sequence; Dij_k denotes the element with the smallest distance value in the j-th column of the matrix D, Di_kj denotes the element with the smallest distance value in the i-th row of the matrix D, and t denotes a threshold parameter; the matching items (i, j) that satisfy the condition form the matching set Pmn between frame Tm and frame Tn.
In another possible implementation manner, when the spatio-temporal region feature matching module selects a region descriptor with a preset condition from the region matching result corresponding to any one region, and the region descriptor is used as the region descriptor corresponding to any one region, the spatio-temporal region feature matching module is specifically configured to:
determining the average value of the distances meeting the preset conditions;
and determining the area descriptor corresponding to any area based on the average value of the distances meeting the preset condition.
In another possible implementation manner, when the spatio-temporal region feature matching module determines the region descriptor corresponding to any one region based on the average value of the distances satisfying the first preset condition, the spatio-temporal region feature matching module is specifically configured to:
based on the average value of the distances meeting the first preset condition, determining an area descriptor corresponding to any one area through the following formula:
[Formula (shown as an image in the original): selection of the region descriptor x' based on the average matching distance]
where x is a region in the group Si, Px is the set of matching items corresponding to region x in the frame matching sets containing region x, Dx is the set of all Dij extracted from the matching item set Px, the averaged quantity (shown as an image in the original) is the mean of the Dij elements in the set, and x' denotes the region descriptors determined for all regions in the set Si.
In another possible implementation manner, when the spatio-temporal region feature matching module performs region feature matching on the region descriptor corresponding to the current video sequence and any video sequence, the spatio-temporal region feature matching module is specifically configured to:
and respectively carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner, when the spatio-temporal region feature matching module performs region feature matching on the region descriptor corresponding to each frame of image and the region descriptor corresponding to any frame of image, the spatio-temporal region feature matching module is specifically configured to:
determining a distance vector corresponding to each frame of image based on a region descriptor corresponding to each region in each frame of image and a region descriptor corresponding to each region in any frame of image, wherein the distance vector corresponding to each frame of image comprises a plurality of elements, and any element is the distance between the region descriptor corresponding to any region in each frame of image and the region descriptor corresponding to any region in any frame of image.
In another possible implementation manner, when the second retrieval module performs region matching on the video sequences of the first preset number based on the spatio-temporal feature descriptors corresponding to the current video sequence to obtain video sequences of a second preset number, the second retrieval module is specifically configured to:
determining a spatial consistency score corresponding to the current video sequence and each video sequence respectively based on the region matching result corresponding to the current video sequence and each video sequence respectively, wherein each video sequence belongs to the video sequences with the first preset number;
reordering the video sequences with the first preset number based on the spatial consistency scores respectively corresponding to the current video sequence and each video sequence;
and extracting a second preset number of video sequences from the sorted first preset number of video sequences.
In another possible implementation manner, when determining a spatial congruency score corresponding to the current video sequence and any video sequence, the second retrieving module is specifically configured to:
determining a spatial consistency score between each frame of image in a current video sequence and each frame of image in any video sequence;
determining the weight information of each frame of image in the current video sequence;
and determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner, when determining the spatial consistency score between each frame of image in the current video sequence and any frame of image, the second retrieving module is specifically configured to:
determining region matching space consistency scores of various sizes;
determining weight information corresponding to the regions of each size respectively;
and determining a spatial consistency score between each frame of image and any frame of image in the current video sequence based on the region matching spatial consistency scores of the sizes and the weight information respectively corresponding to the regions of the sizes.
In another possible implementation manner, when determining that the region of any size matches the spatial consistency score, the second retrieval module is specifically configured to:
determining a region matching spatial consistency score for any size by the following formula:
[Formula (shown as an image in the original): definition of the size-p region-matching spatial consistency score SSp]
where SSp denotes the region-matching spatial consistency score for size p, np denotes the number of region blocks of size p extracted from the frame image, Pp is the set of matched region features of size p, and (rp, cp) are the matching offsets stored in Pp; the two averaged quantities (shown as images in the original) denote the average column offset and the average row offset in the set Pp, respectively; i and j index the traversal of the set Pp, dist(·) is a distance function, and max(·) is a maximum function;
the second retrieval module, when determining the spatial consistency score between each frame of image in the current video sequence and any frame of image based on the region-matching spatial consistency scores of the respective sizes and the weight information corresponding to the regions of the respective sizes, is specifically configured to use the following formula:
[Formula (shown as an image in the original): weighted combination of the per-size scores into the frame-level spatial consistency score SS]
where SS denotes the spatial consistency score between each frame of image in the current video sequence and any frame of image, i traverses the set of scales, ns is the number of scales, wi is the weight information corresponding to size i, and wi ∈ [0, 1].
In another possible implementation manner, when determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence, the second retrieval module is specifically configured to:
based on the weight information of each frame of image in the current video sequence and the spatial consistency scores between each frame of image in the current video sequence and each frame of image in any video sequence, determining the spatial consistency score corresponding to the current video sequence and any video sequence by the following formula:
[Formula (shown as an image in the original): definition of the sequence-level spatial consistency score VSS]
where VSS denotes the spatial consistency score between the current video sequence and any video sequence Vref, Vref belongs to the first preset number of video sequences, m denotes a frame in the current video sequence, k denotes a frame of Vref, and the weight term (shown as an image in the original) denotes the weight information of frame m.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the operations corresponding to the video scene retrieval method shown in any possible implementation manner of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the video scene retrieval method according to any one of the possible implementations of the first aspect.
In summary, the present application includes at least one of the following beneficial technical effects:
compared with the related technology, in the method, time domain feature fusion is carried out based on dense deep learning feature maps corresponding to each frame of image in a current video sequence, space-time feature aggregation processing is carried out according to the fused features, and a global feature descriptor corresponding to the current video sequence is obtained, namely the space-time feature of the current video sequence can be reflected in the global feature descriptor corresponding to the current video sequence, so that retrieval is carried out from a global database based on the global feature descriptor corresponding to the current video sequence, the influence of the change of the surrounding environment of a scene, local shielding and the like on scene re-identification can be reduced, the accuracy of the retrieved video sequence can be improved, and user experience can be improved.
Drawings
Fig. 1 is a schematic flowchart of a video scene retrieval method in an embodiment of the present application;
FIG. 2 is a schematic diagram of a time domain feature fusion network structure based on a self-attention mechanism in an embodiment of the present application;
FIG. 3 is a schematic diagram of a network model architecture of a TemporalVLAD according to an embodiment of the present application;
FIG. 4 is an exemplary diagram of video scene retrieval in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an apparatus for video scene retrieval in an embodiment of the present application;
fig. 6 is a schematic device structure diagram of an electronic apparatus in an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to the attached drawings.
The present embodiment is intended only to explain the present application and does not limit it. After reading this specification, those skilled in the art may modify the embodiment as needed without making an inventive contribution, and all such modifications are protected by patent law within the scope of the claims of the present application.
The embodiment of the application provides a video scene retrieval method. The main goal of visual scene retrieval is to find, based on the current observation information, the observation information (images or video sequences) captured at the same geographic position when the scene map was established.
the vision-based scene retrieval is mainly distinguished from the general image retrieval/video retrieval by three points:
1. The main criterion by which general image/video retrieval measures similarity is whether two images contain the same object category or have similar appearance, whereas the main criterion in visual scene retrieval is whether two images were taken at the same geographic position: even if external factors such as weather or season make their appearances very different, the similarity should still be high as long as their positions are close enough;
2. General image/video retrieval mainly focuses on the foreground objects in an image, whereas vision-based scene retrieval mainly focuses on the background regions of the image;
3. General image/video retrieval can usually be performed offline, whereas vision-based scene retrieval is often applied in settings with strong real-time requirements, such as relocalization and loop closure detection in SLAM. Therefore, besides requiring low algorithmic complexity, scene retrieval needs an efficient global representation of the observation information (image or video sequence) that is easy to compute and store, for example by converting the observation into a vector or a matrix.
In the related art, most visual scene re-identification techniques compute similarity based on a single frame image, for example Bag of Words (BoW), Fisher Vectors (FV), and the Vector of Locally Aggregated Descriptors (VLAD); however, the accuracy of such single-frame scene retrieval drops noticeably when there is a viewpoint change between the two observations. In addition, existing video-based scene retrieval methods in the related art mainly aggregate information on top of the global representations of single frames, ignore the spatio-temporal information of the video sequence, and are therefore still limited by single-frame retrieval accuracy.
Aiming at the problem of low recall rate when the visual angle is changed in image scene retrieval, the embodiment of the application provides a method for extracting space-time hierarchical features from a short video sequence. In the embodiment of the present application, video scene retrieval is performed from coarse to fine by using the characteristics of space-time hierarchy, which is described in the following embodiments:
(1) First, coarse-grained fast video scene retrieval is performed: for each frame of image in a video sequence, dense deep learning feature points are extracted by a neural network; a self-attention mechanism is then used to aggregate information and optimize the descriptors of co-visible feature points across different frames; finally, a TemporalVLAD layer clusters the feature points over the multiple frames in the time domain, retaining the observation information unique to each frame while removing redundant inter-frame observations, thereby generating a high-dimensional vector as the global representation of the video sequence, and scene retrieval is performed using the distances between these global representations. The coarse-grained fast video scene retrieval branch has the advantages of high retrieval speed and high storage efficiency;
(2) then fine-grained optimization sorting is carried out, region matching is carried out by using the region block features on the feature map, and an image pyramid is constructed to extract multi-scale region features; for the regional matching result, the relative offset between all the matching pairs is used to define the image similarity, so as to optimize the search sorting result of the coarse-grained branch. The fine-grained optimization sorting branch has the advantages of high retrieval recall rate, high view angle change/local shielding robustness and the like;
(3) The method reduces the amount of computation and the algorithm complexity, and has the advantages of high real-time performance, low latency, and the like.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship, unless otherwise specified.
The embodiments of the present application will be described in further detail with reference to the drawings attached hereto.
The embodiment of the application provides a video scene retrieval method, which can be executed by an electronic device, wherein the electronic device can be a server or a terminal device, wherein the server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service. The terminal device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like, but is not limited thereto, and the terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited thereto.
It should be noted that the electronic device for executing the video scene retrieval method may further include: unmanned vehicles and intelligent robots.
Further, as shown in fig. 1, the method may include:
step 101, obtaining a current video sequence.
For the embodiment of the present application, the current video sequence is a video sequence to be subjected to scene retrieval. In this embodiment of the present application, the current video sequence may be acquired by a camera disposed in an unmanned vehicle or an intelligent robot, or may be acquired from other devices, which is not limited in this embodiment of the present application.
Specifically, in the embodiment of the present application, a current video sequence includes a plurality of frames of images.
And S102, extracting dense depth learning feature maps corresponding to the frames of images from the multiple frames of images respectively.
Specifically, in the embodiment of the present application, extracting the dense depth learning feature maps corresponding to the respective frame images from the multiple frame images may specifically be implemented by using a feature extraction network, and may also extract the dense depth learning feature maps corresponding to the respective frame images from the respective frame images by using other manners.
And S103, respectively performing time domain feature fusion on the basis of the dense depth learning feature maps respectively corresponding to the frame images to obtain respective fused features.
And step S104, performing space-time feature aggregation processing based on the fused features respectively corresponding to the frame images to obtain a global feature descriptor corresponding to the current video sequence.
And S105, retrieving from a global database based on the global feature descriptors corresponding to the current video sequence to obtain a first preset number of video sequences.
For the embodiment of the present application, the global database stores a global feature descriptor and a regional feature descriptor, and in the embodiment of the present application, the global feature descriptor and the regional feature descriptor stored in the global database are respectively corresponding to global feature descriptors and regional feature descriptors of different video sequences.
It should be noted that the different video sequences may include video sequences corresponding to different geographic positions. The same geographic position may correspond to one video sequence or to multiple video sequences, and one video sequence may of course also correspond to at least two geographic positions, which is not limited in this embodiment of the application. In addition, the video sequence construction manner used when building the scene map and when performing scene retrieval may be the same or different; because the video sequences used for map building can be constructed offline, key frames located at approximately the same place can be clustered according to their geographic position relationship (such as translation distance, rotation angle, and GPS coordinate distance) to construct the video sequences.
Compared with the related technology, in the embodiment of the application, time domain feature fusion is carried out based on dense deep learning feature maps corresponding to each frame of image in a current video sequence, space-time feature aggregation processing is carried out according to the fused features, and a global feature descriptor corresponding to the current video sequence is obtained, namely the space-time features of the current video sequence can be reflected in the global feature descriptor corresponding to the current video sequence, so that retrieval is carried out from a global database based on the global feature descriptor corresponding to the current video sequence, the influence of changes of the surrounding environment of a scene, local shielding and the like on scene re-identification can be reduced, the accuracy of the retrieved video sequence can be improved, and user experience can be improved.
Further, after the current video sequence is acquired, in order to avoid wasting computing resources when the overlap between the regions observed by the video frames in the current video sequence is too high, step S102 may further include: extracting key frames from the current video sequence. In this embodiment of the application, the key frame extraction criteria may be based on the overlap percentage of the observation regions, the number of matched feature points, the geographic position relationship between the two frames (such as translation distance, rotation angle, and GPS coordinate distance), and the like, or may reuse existing key frame extraction strategies from Simultaneous Localization And Mapping (SLAM) frameworks, such as the feature-point-based extraction strategy in ORB-SLAM (Oriented FAST and Rotated BRIEF SLAM) and the optical-flow-based extraction strategy in DSO-SLAM (Direct Sparse Odometry SLAM), and the like.
Further, in order to further reduce the waste of computing resources, after extracting key frames from the current video sequence, M frames can be selected from the extracted key frames to form a new video sequence. In the embodiment of the present application, the M frames selected from the extracted key frames may be consecutive M key frames, or key frames selected at equal intervals, or a common-view relationship between key frames may be determined by feature point matching and an optical flow method, so as to dynamically select key frames with a longer time interval. Further, if a plurality of new video sequences need to be constructed, the plurality of constructed new video sequences may be of equal length or different lengths, and M may be between [3, 15 ].
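A small sketch of picking the M frames as described above, either as consecutive key frames or at equal intervals (the co-visibility-based dynamic selection is not shown; the default M is an arbitrary choice within [3, 15]):

```python
def select_sequence_frames(keyframes, M=5, mode="uniform"):
    """Picks M frames from the extracted key frames to build a new video sequence.

    "uniform" takes equally spaced key frames, "consecutive" takes the most recent M;
    M is assumed to lie in [3, 15] as suggested in the text.
    """
    if len(keyframes) <= M:
        return list(keyframes)
    if mode == "consecutive":
        return list(keyframes[-M:])
    # equally spaced indices over the key frame list
    idx = [round(i * (len(keyframes) - 1) / (M - 1)) for i in range(M)]
    return [keyframes[i] for i in idx]
```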
Further, after extracting M frames from the key frame, in step S102, extracting the dense depth learning feature map corresponding to each frame image from the multi-frame image respectively, which may specifically include: and respectively extracting dense depth learning feature maps corresponding to the frames of images from the M frames of images.
Specifically, in the embodiment of the present application, dense deep learning feature maps corresponding to each frame of image are respectively extracted from M frames of images through a feature extraction network.
Specifically, in order to improve the recall rate of scene retrieval when external factors such as viewing angle, illumination and scene appearance change, and to better fuse local information within a single frame image, a neural network is used as the feature extraction network in the embodiment of the present application. The feature extraction network includes, but is not limited to, common deep learning backbone networks such as VGG, U-Net, ResNet, RegNet, AlexNet, GoogLeNet and MobileNet.
Further, before each frame image is input to the feature extraction network for feature extraction, each frame image needs to be resized to the same size, and then the frames of the same size are passed through the feature extraction network. In addition, in order to use the feature extraction network as the public network shared by the subsequent coarse-grained fast retrieval branch and fine-grained optimized sorting branch, the multiple frames of one video sequence need to be moved to the batch dimension before being input to the feature extraction network; the interior of a video sequence is never randomly shuffled, and only the order of different video sequences, each treated as a whole, is shuffled. In addition, when multiple Graphics Processing Units (GPUs) are used, it must also be ensured that the images of the same video sequence are dispatched to the same GPU for processing. In this embodiment of the present application, performing feature extraction on the frame images of the same size through the feature extraction network may specifically include: performing feature extraction on the frame images of the same size through the feature extraction network, and/or resizing the frame images of the same size to different sizes through an image pyramid operation before feature extraction; the corresponding layers of the image pyramids of different frame images have the same image size.
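The batching convention described above (frames of one sequence moved to the batch dimension and never shuffled internally) might look as follows in a PyTorch-style sketch; the VGG16 backbone, the target size and the tensor layout are assumptions for illustration, and the `weights` argument follows recent torchvision versions.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Hypothetical backbone: any of the networks listed above could be used;
# VGG16's convolutional part is taken here purely as an example.
backbone = models.vgg16(weights=None).features.eval()

def extract_dense_features(video_sequences, size=(480, 640)):
    """video_sequences: tensor of shape (B, T, 3, H, W) -- B video sequences of
    T frames each. Frames of one sequence stay together and are only moved to
    the batch dimension, so the internal frame order is never shuffled."""
    B, T, C, H, W = video_sequences.shape
    frames = video_sequences.reshape(B * T, C, H, W)       # move frames to the batch dim
    frames = F.interpolate(frames, size=size, mode='bilinear', align_corners=False)
    with torch.no_grad():
        feats = backbone(frames)                           # (B*T, C', H', W') dense maps
    Cp, Hp, Wp = feats.shape[1:]
    return feats.reshape(B, T, Cp, Hp, Wp)                 # restore the sequence layout
```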
Further, in order to make the model more robust to the change of the view angle and reduce the influence of local occlusion, and meanwhile, in order to make the model pay more attention to the stable features observed for many times in the video sequence and reduce the interference of dynamic objects, the embodiment of the present application performs temporal fusion on the extracted feature maps through a self-attention mechanism to update the features observed repeatedly in the video sequence.
Specifically, in step S103, time domain feature fusion is performed based on the dense deep learning feature maps corresponding to each frame of image, so as to obtain the respective fused features, which may specifically include: performing time domain feature fusion through a self-attention mechanism based on the dense deep learning feature maps corresponding to each frame of image, so as to obtain the respective fused features. In the embodiment of the present application, the time domain feature fusion network based on the self-attention mechanism is implemented by using a 3D Non-local network, and the features in the time domain are fused as shown in formula (1):

$$y_i = \frac{1}{C(x)} \sum_{j} f(x_i, x_j)\, g(x_j) \qquad (1)$$

where x is the input feature map, i and j are different coordinates on the feature map, x_i and x_j are the values of the feature map at those points, the function f(·,·) measures the similarity between two points (e.g., Gaussian similarity or dot-product similarity), the function g(·) computes the feature value of the feature map at position j, C(x) is a normalization factor, and y_i is the value of the output feature map at coordinate i.
Specifically, fig. 2 shows the detailed structure of the time domain feature fusion network based on the self-attention mechanism used in the embodiment of the present application. The input feature map X (T×H×W×1024) is first linearly mapped: 1×1×1 convolutions are used to compress the channel number, producing the three features θ(X) (T×H×W×512), φ(X) (T×H×W×512) and g(X) (T×H×W×512). All dimensions of these three features except the channel dimension are then flattened together. To compute the autocorrelation between features, a matrix dot product is performed between θ(X) and φ(X), yielding the relationship of each pixel in each frame to all pixels in the other frames; the autocorrelation result is normalized with Softmax to obtain values in [0,1], which serve as the self-attention weights. The self-attention weights are multiplied with the feature matrix g(X), the result is up-sampled, and finally a residual operation with the original input feature map X is performed, giving the output Z (T×H×W×1024) of the time domain feature fusion network.
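A minimal PyTorch sketch of such a 3D Non-local block is given below, using the channels-first (N, C, T, H, W) layout rather than the T×H×W×C layout of the description; the θ/φ/g naming and the final 1×1×1 convolution standing in for the up-sampling step are assumptions.

```python
import torch
import torch.nn as nn

class NonLocal3D(nn.Module):
    """A minimal 3D Non-local block in the spirit of formula (1); channel
    sizes follow the 1024 -> 512 -> 1024 layout described above."""
    def __init__(self, channels=1024, inner=512):
        super().__init__()
        self.theta = nn.Conv3d(channels, inner, kernel_size=1)
        self.phi   = nn.Conv3d(channels, inner, kernel_size=1)
        self.g     = nn.Conv3d(channels, inner, kernel_size=1)
        self.out   = nn.Conv3d(inner, channels, kernel_size=1)  # maps back to C channels

    def forward(self, x):
        n, c, t, h, w = x.shape
        q = self.theta(x).reshape(n, -1, t * h * w)          # (N, C', THW)
        k = self.phi(x).reshape(n, -1, t * h * w)
        v = self.g(x).reshape(n, -1, t * h * w)
        # pixel-to-pixel similarity across all frames, Softmax-normalised to [0,1]
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (N, THW, THW)
        y = (v @ attn.transpose(1, 2)).reshape(n, -1, t, h, w)
        return x + self.out(y)                               # residual connection with the input X
```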
It should be noted that besides the 3D Non-local network, other network variants based on the self-attention mechanism may also be used for temporal feature fusion and fall within the scope of the embodiments of the present application, including but not limited to Transformer networks, temporal Non-local networks, graph neural networks (GNNs) based on the self-attention mechanism, and so on.
Further, all features in the same video sequence are aggregated so that the unique observation information of each frame is retained and the redundant inter-frame observation information is removed, generating one high-dimensional vector as the global representation of the video sequence; representing one video sequence with one vector facilitates both fast scene retrieval and more efficient storage.
Specifically, in step S104, performing spatio-temporal feature aggregation based on the fused features respectively corresponding to each frame of image to obtain a global feature descriptor corresponding to the current video sequence, which may specifically include: step S1041 (not shown), step S1042 (not shown), step S1043 (not shown), and step S1044 (not shown), wherein,
and S1041, splicing the time domain feature maps corresponding to the frames of images respectively to obtain spliced feature maps.
Specifically, in the embodiment of the present application, time domain feature maps corresponding to each frame of image in the same video sequence may be stitched along a long side to obtain a stitched feature map, or may be stitched along a short side to obtain a stitched feature map. In the embodiment of the present application, a time domain feature map corresponding to each frame image is spliced along a long side to obtain a spliced feature map.
And step S1042, performing point-by-point convolution processing on the spliced feature map to obtain a convolution processing result.
And step S1043, carrying out normalization processing on the convolution processing result to obtain a result after normalization processing.
Specifically, in this embodiment of the present application, the normalizing the convolution processing result may specifically include: and carrying out index normalization processing on the convolution processing result through a normalization index function. In the embodiment of the present application, the normalized exponential function, or Softmax function, is a generalization of the logistic function, which can "compress" a K-dimensional vector z containing any real number into another K-dimensional real vector σ (z), so that each element ranges between (0,1), and the sum of all elements is 1.
And S1044, determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map.
Specifically, in the embodiment of the present application, the feature map after the stitching process includes a plurality of feature points; in step S1044, based on the result after the normalization processing and the feature map after the splicing, determining a global feature descriptor corresponding to the current video sequence may specifically include: clustering the plurality of feature points to obtain at least one clustering center; determining the distance between each feature point and each clustering center, and determining the corresponding distance information of each clustering center; determining the global representation corresponding to each cluster based on the distance information corresponding to each cluster center and the result after normalization processing; performing regularization treatment on the global representations corresponding to the cluster clusters respectively; splicing all the global representations after the regularization treatment; and carrying out regularization processing on the global representation after splicing processing to obtain a global feature descriptor corresponding to the current video sequence.
And the distance information corresponding to any clustering center is the distance between each characteristic point and any clustering center.
Specifically, in the embodiment of the present application, spatio-temporal feature aggregation is performed by using TemporalVLAD as the spatio-temporal feature aggregation network, whose network model architecture is shown in fig. 3. The feature maps after time domain fusion are spliced in the time domain; the splicing result is then subjected to point-by-point convolution and exponential normalization in turn to obtain the normalization result; the time domain splicing result and the normalization result are processed by a residual calculation module; and intra-cluster regularization and global regularization are performed in turn to obtain the global descriptor of the video sequence. Specifically, among the feature maps output by the self-attention-based time domain feature fusion network, the feature maps belonging to the same video sequence are spliced along their long sides to obtain a spliced feature map F; F is convolved point by point with a 1×1 convolution kernel, and the result is exponentially normalized with Softmax to obtain a result a. The spliced feature map F (regarded as dense feature points) is taken, all feature points of all feature maps are collected, and unsupervised clustering is performed with the KMeans++ algorithm to obtain K cluster centers; the residual calculation module then computes the distance between each point of the feature map F and each of the K clusters, and a weighted summation is performed using the result a output by the exponential normalization unit as the weights, yielding K vectors that correspond to the global representations of the K clusters. The vector of each cluster is then regularized, the K cluster vectors are spliced together, and global regularization is applied, yielding a high-dimensional vector that serves as the global descriptor of the entire video sequence.

Here K lies in [16,128], and the regularization operations include, but are not limited to, L1 regularization, L2 regularization, and the like.
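The aggregation step can be sketched in NetVLAD-style PyTorch code as below; this is a hedged approximation under the assumption that the cluster centres are learnable parameters (initialised in practice from KMeans++), and the class name `TemporalVLADSketch` is illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalVLADSketch(nn.Module):
    """Frames of one sequence are stitched spatially, soft-assigned to K
    cluster centres via a 1x1 convolution + Softmax, and the weighted
    residuals to the centres are summed, intra-normalised and globally
    normalised into one global descriptor."""
    def __init__(self, channels=1024, K=64):
        super().__init__()
        self.assign = nn.Conv2d(channels, K, kernel_size=1)   # point-wise convolution
        self.centers = nn.Parameter(torch.randn(K, channels)) # stand-in for KMeans++ centres

    def forward(self, seq_feats):
        # seq_feats: (T, C, H, W) time-domain-fused maps of one video sequence
        T, C, H, W = seq_feats.shape
        f = torch.cat(list(seq_feats), dim=1)          # stitch the T maps along one side -> (C, T*H, W)
        f = f.unsqueeze(0)                             # (1, C, T*H, W)
        a = torch.softmax(self.assign(f), dim=1)       # (1, K, T*H, W) soft assignment weights
        x = f.flatten(2)                               # (1, C, N) with N = T*H*W
        a = a.flatten(2)                               # (1, K, N)
        # residuals of every feature point to every cluster centre, weighted by a
        residual = x.unsqueeze(1) - self.centers.view(1, -1, C, 1)    # (1, K, C, N)
        vlad = (residual * a.unsqueeze(2)).sum(dim=-1)                # (1, K, C)
        vlad = F.normalize(vlad, dim=2)                # intra-cluster regularisation
        vlad = F.normalize(vlad.flatten(1), dim=1)     # global regularisation
        return vlad.squeeze(0)                         # global descriptor of the sequence, length K*C
```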
Further, after performing space-time aggregation processing based on the above embodiment to obtain a global descriptor of the entire video sequence, the global descriptor of the currently observed video sequence is used to perform coarse-grained fast retrieval in the global database, so as to reduce the number of video sequences that need to be retrieved accurately, and reduce the time consumption of fine-grained optimization sorting branch calculation.
Specifically, in the embodiment of the present application, the distances between the global descriptor of the current video sequence and the global descriptors of the video sequences stored in the global database are calculated in turn to determine the similarity between the current video sequence and each stored video sequence, and the TopK1 video sequences most similar to the currently observed scene are then retrieved from the global database by the coarse-grained fast retrieval. The distance between the global descriptor of the current video sequence and the global descriptor of any video sequence stored in the global database can be characterized by the Manhattan distance, Euclidean distance, Minkowski distance and the like; the smaller the distance between the global descriptors, the higher the similarity.
It should be noted that, in the embodiment of the present application, the value of TopK1 may be input by a user or preset, and is not limited in the embodiment of the present application; for example, TopK1 lies in [20,100].
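The coarse-grained retrieval itself reduces to a nearest-neighbour search over the stored global descriptors; a minimal sketch using the Euclidean distance (any of the listed distances would do) could look like this, where `k1` stands for TopK1.

```python
import torch

def coarse_retrieve(query_desc, db_descs, k1=50):
    """query_desc: (D,) global descriptor of the current video sequence.
    db_descs: (N, D) global descriptors stored in the global database.
    Returns the indices and distances of the K1 most similar sequences."""
    dists = torch.cdist(query_desc.unsqueeze(0), db_descs).squeeze(0)   # (N,) Euclidean distances
    topk = torch.topk(dists, k=min(k1, db_descs.shape[0]), largest=False)
    return topk.indices, topk.values
```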
It should be noted that, according to the above embodiment: the global database stores global descriptors corresponding to the plurality of video sequences, wherein the global descriptors corresponding to the plurality of video sequences stored in the global database are the same as the global descriptors corresponding to the current video sequence determined based on the current video sequence in the above embodiment, and details are not repeated here.
Further, in order to further improve the recall rate of retrieval and to improve robustness to view angle changes and local occlusion, fine-grained optimized sorting is performed in the embodiment of the present application to optimize the retrieval ranking result of the coarse-grained branch. Further, after the dense deep learning feature maps corresponding to the respective frames of images are extracted from the multiple frames of images in step S102, the method may further include: step S106 (not shown in the figure), step S107 (not shown in the figure), and step S108 (not shown in the figure). Step S106 and step S107 may be executed before steps S103 to S105, after steps S103 to S105, or simultaneously with at least one of steps S103 to S105; any possible execution order falls within the protection scope of the embodiment of the present application and is not limited here. Steps S106 to S108 are detailed in the following embodiments:
and S106, respectively extracting the regional characteristics of the dense depth learning characteristic graphs corresponding to the frame images to obtain the corresponding multi-scale regional characteristics.
Specifically, in the embodiment of the present application, the dense depth learning feature maps respectively corresponding to each frame image may be subjected to region feature extraction by using a multi-scale region feature extraction model, or may not be subjected to region feature extraction by using a region feature extraction model.
In the embodiment of the application, region feature extraction is performed on the dense deep learning feature map corresponding to each frame of image through a region feature extraction model. Similar to the design of the spatio-temporal feature aggregation network, the multi-scale region feature extraction model first performs point-by-point convolution and exponential normalization on the feature maps output by the feature extraction network; the obtained result is used to weight the residuals between the original feature map and the K cluster centers, finally yielding the weighted residual feature map R. The difference is that the multi-scale region feature extraction model does not perform time domain feature splicing, nor does it perform summation and the subsequent regularization operations on the weighted residual feature map R.
Specifically, performing region feature extraction based on a dense depth learning feature map corresponding to any frame of image to obtain a multi-scale region feature corresponding to any frame of image, which may specifically include: determining a weighted residual error feature map based on a dense depth learning feature map corresponding to any frame of image; dividing the weighted residual error characteristic diagram into a plurality of area blocks; and determining the area characteristic representation corresponding to each area block so as to obtain the multi-scale area characteristic corresponding to any frame of image.
Specifically, determining a weighted residual feature map based on a dense depth learning feature map corresponding to any frame of image may specifically include: performing point-by-point convolution processing on the dense depth learning feature map corresponding to any frame of image to obtain a convolution result; carrying out normalization processing on the convolution result to obtain a normalization result; and determining a weighted residual feature map based on the normalization result and the dense depth learning feature map corresponding to any frame of image.
For the embodiment of the present application, the way of performing point-by-point convolution and normalization processing on the dense-depth learning feature map corresponding to any frame of image is specifically described in the embodiment of spatio-temporal feature aggregation, and details are not described in this embodiment of the present application.
Further, determining a weighted residual feature map based on the normalization result and the dense depth learning feature map corresponding to any frame of image may specifically include: obtaining K clustering centers by a clustering algorithm for all feature points in a dense deep learning feature map corresponding to any frame of image extracted by the feature extraction model; determining the distance between each characteristic point and each clustering center, and determining the corresponding distance information of each clustering center; and determining a weighted residual characteristic diagram R based on the distance information corresponding to each cluster center and the result after the normalization processing.
And the distance information corresponding to any clustering center is the distance between each characteristic point and any clustering center.
Further, after obtaining the weighted residual feature map R, dividing the region blocks on the weighted residual feature map R by using a sliding window with a size of p × p, and regularizing a mean value of residuals in each region block to be a descriptor of the region, where the regularizing operation includes, but is not limited to, L1 regularization, L2 regularization, and the like. In order to enhance the robustness of the region descriptor to the change of the viewing angle, the size p of the sliding window may be changed, or the sliding windows with different sizes may be used to divide the region blocks, and each region block generates a region descriptor correspondingly. Since the region descriptors are generated by mean calculations, the region descriptor dimensions of different regions are the same.
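A compact sketch of this multi-scale region descriptor extraction is shown below, assuming non-overlapping p×p blocks (a stride smaller than p would give overlapping sliding windows) and L2 regularisation of each block mean; the window sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def region_descriptors(R, window_sizes=(2, 4, 8)):
    """R: (C, H, W) weighted residual feature map of one frame.
    For every window size p, the map is split into p x p blocks, the residuals
    inside each block are averaged, and the mean is L2-regularised to give one
    region descriptor per block; descriptors of all regions share dimension C."""
    descs = {}
    for p in window_sizes:
        pooled = F.avg_pool2d(R.unsqueeze(0), kernel_size=p, stride=p)  # (1, C, H//p, W//p)
        d = pooled.squeeze(0).flatten(1).t()          # (num_blocks, C), one row per region block
        descs[p] = F.normalize(d, dim=1)              # L2 regularisation of every region descriptor
    return descs
```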
It should be noted that, in the above manner, the multi-scale region feature is extracted for each frame of image in the video sequence, and the region descriptor is used as a way of representing and storing the region feature.
Further, in the embodiment of the present application, the multi-scale regional feature extraction module extracts regional features from each frame of image in the video sequence, which may also be considered as fusion of dense feature points extracted by the foregoing feature extraction network in a spatial domain, so as to perform optimized ordering on the results of the fast search branches by using the regional features subsequently.
Further, after extracting the multi-scale region features corresponding to each frame of image, to further enhance the robustness of the region features to the view angle change, the multi-scale region features corresponding to each frame of image are continuously subjected to region matching, which is described in detail in the following embodiments.
And S107, carrying out region matching based on the respective corresponding multi-scale region characteristics to obtain a space-time characteristic descriptor corresponding to the current video sequence.
Wherein the multi-scale region features are characterized by a region descriptor.
Specifically, in the embodiment of the present application, performing region matching based on respective corresponding multi-scale region features includes: the region matching between frames of the same video sequence aims to update the original region descriptor using robust region descriptors at different viewing angles (see the following steps S1071-S1072), and then perform region matching in the time domain of different video sequences (see the following step S1073), that is, in step S107, perform region matching based on respective corresponding multi-scale region features to obtain a spatio-temporal feature descriptor corresponding to the current video sequence, which may specifically include: step S1071 (not shown), step S1072 (not shown), and step S1073 (not shown), wherein,
step S1071, the area descriptor corresponding to each frame of image in the current video sequence is matched with the area descriptor corresponding to each other frame of image in the current video sequence for area feature, and the area matching result corresponding to the current video sequence is obtained.
For the embodiment of the present application, the region descriptor corresponding to each frame of image in the current video sequence is subjected to region feature matching with the region descriptors corresponding to the other frames of images in the current video sequence. This may specifically be based on bidirectional matching and a ratio test, or on other region matching manners, including but not limited to K-nearest-neighbor matching, greedy nearest neighbor (Greedy-NN) matching, K-d tree (K-dimensional tree) matching, and the like. In the embodiment of the present application, the method based on bidirectional matching and a ratio test is described as an example.
Specifically, performing region feature matching on the region descriptor corresponding to each frame of image and the region descriptor corresponding to any frame of image may specifically include: and determining the distance vector corresponding to each frame of image based on the region descriptor corresponding to each region in each frame of image and the region descriptor corresponding to each region in any frame of image. The distance vector corresponding to each frame of image comprises a plurality of elements, and any element is the distance between the area descriptor corresponding to any area in each frame of image and the area descriptor corresponding to any area in any frame of image.
Specifically, in the embodiment of the present application, the specific manner of region feature matching is described by taking frames Tm and Tn as an example, where m, n ∈ [1, M] and m ≠ n. The distances between all region descriptors of frame Tm and all region descriptors of frame Tn in the video sequence are calculated to form a distance matrix D, where the element D_ij represents the distance between the i-th region descriptor in frame Tm and the j-th region descriptor in frame Tn. The distance includes, but is not limited to, the Manhattan distance, Euclidean distance, Minkowski distance, and the like.
Further, the area descriptor corresponding to each frame of image in the current video sequence and the area descriptors corresponding to other frames of images in the current video sequence can be calculated and obtained through the method, and area feature matching is carried out on the area descriptors, so that an area matching result corresponding to the current video sequence is obtained.
Step S1072, selecting a region descriptor satisfying a preset condition from the region matching result corresponding to the current video sequence as the region descriptor corresponding to the current video sequence.
Specifically, selecting the region descriptors satisfying the preset condition from the distance vectors corresponding to each frame of image may specifically include determining the matches satisfying the preset condition by the following formula (2):

$$P_{mn} = \{(i,j) \mid D_{ij} \le t \cdot D_{i^k j},\ D_{ij} \le t \cdot D_{i j^k}\} \qquad (2)$$

where $D_{i^k j}$ denotes the element of the distance matrix D with the smallest distance value in the j-th column (excluding D_ij itself), $D_{i j^k}$ denotes the element with the smallest distance value in the i-th row (excluding D_ij itself), and t is a threshold parameter. The matching items (i, j) satisfying the condition form the matching set P_mn between frame Tm and frame Tn, and the corresponding distance values D_ij are also stored in the matching set. The threshold t lies in [0.5, 0.9].
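Under the reading of formula (2) given above (mutual nearest neighbour plus a ratio test against the second-best distance), the matching of two frames' region descriptors might be sketched as follows; the Euclidean distance and the threshold value are illustrative assumptions.

```python
import numpy as np

def match_regions(desc_m, desc_n, t=0.7):
    """desc_m: (A, C) region descriptors of frame Tm; desc_n: (B, C) of frame Tn.
    A pair (i, j) is kept when D_ij is a mutual minimum of its row and column
    and passes the ratio test against the second-best distance in its row."""
    D = np.linalg.norm(desc_m[:, None, :] - desc_n[None, :, :], axis=-1)  # (A, B) distance matrix
    matches = {}
    for i in range(D.shape[0]):
        order = np.argsort(D[i])
        j = order[0]
        # mutual nearest neighbour check (bidirectional matching)
        if np.argmin(D[:, j]) != i:
            continue
        # ratio test: best distance must be clearly below the second best
        if len(order) > 1 and D[i, j] > t * D[i, order[1]]:
            continue
        matches[(i, j)] = D[i, j]
    return matches            # matching set P_mn with the distances D_ij stored alongside
```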
After all (m, n) combinations within the same video sequence have been exhausted, region i of a frame is matched with regions j of several other frames, recorded as the set S_i = {i, j_1, j_2, …, j_L}; region j in turn matches regions j' in several other frames, and such j' are recursively added into the set S_i. A region descriptor satisfying the following formula (3) is then selected to replace the original descriptors of all regions in the set S_i, making them more robust to viewing angle changes:

$$x' = \arg\min_{x \in S_i} \bar{D}_x \qquad (3)$$

where x is a region in the set S_i, P_x is the set of matching items corresponding to region x in the frame matching sets in which region x participates, D_x is the set of all distances D_ij extracted from the matching item set P_x, and $\bar{D}_x$ is the average of the elements in the set D_x. The descriptor of the region x' with the smallest $\bar{D}_x$ in S_i is selected as the new region descriptor of all regions in the set S_i. Since region x' has the smallest average distance to the other matching regions in the video sequence, the features of region x' are considered more robust under different viewing angles.
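The descriptor update of formula (3) then amounts to picking, within each co-observed region set S_i, the region with the smallest average matching distance; a small sketch under that reading (the data structures are assumed, not prescribed):

```python
import numpy as np

def select_robust_descriptor(S_i, match_dists, descriptors):
    """S_i: list of region ids matched across frames of the same sequence.
    match_dists: dict mapping region id x -> list of distances D_ij of the
    matches that region x participates in (the set D_x above).
    descriptors: dict mapping region id -> its current region descriptor.
    The region with the smallest average matching distance becomes the new
    descriptor of every region in S_i, as in formula (3)."""
    x_best = min(S_i, key=lambda x: np.mean(match_dists[x]))
    return {x: descriptors[x_best] for x in S_i}
```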
And step S1073, respectively carrying out region feature matching on the region descriptors corresponding to the current video sequence and each video sequence stored in the global database to obtain region matching results of the current video sequence and each video sequence.
Specifically, in this embodiment of the present application, performing region feature matching between the region descriptors corresponding to the current video sequence and any video sequence may specifically include: performing region feature matching between the region descriptor corresponding to each frame of image in the current video sequence and each frame of image in that video sequence. In the embodiment of the present application, this matching is based on bidirectional matching and a ratio test, and other region matching manners may also be used, including but not limited to K-nearest-neighbor matching, greedy nearest neighbor (Greedy-NN) matching, K-d tree (K-dimensional tree) matching, and the like. The method based on bidirectional matching and a ratio test is described as an example.
Specifically, in the embodiment of the present application, performing region feature matching on a region descriptor corresponding to each frame of image and a region descriptor corresponding to any frame of image may specifically include: and determining the distance vector corresponding to each frame of image based on the region descriptor corresponding to each region in each frame of image and the region descriptor corresponding to each region in any frame of image. The distance vector corresponding to each frame of image comprises a plurality of elements, and any element is the distance between the area descriptor corresponding to any area in each frame of image and the area descriptor corresponding to any area in any frame of image.
Specifically, during video scene retrieval, for the current video sequence Vqry and a video sequence Vref stored in the database (global database) constructed when building the scene map, the region matching set P between each frame image in Vqry and each frame image in Vref is calculated in turn according to the above-described region matching manner based on bidirectional matching and a ratio test, and the coordinates (r, c) of each pair of matched regions in the residual feature map R described above are recorded at the same time.
Further, the current video sequence Vqry and each video sequence stored in the database (global database) when the scene map is established are subjected to region feature matching in the above manner, which is not described again in detail. It should be noted that the global database also stores the region descriptors corresponding to each video sequence Vref, and in the embodiment of the present application, the determination manner of the region descriptors corresponding to each video sequence Vref is the same as the determination manner of the region descriptor corresponding to the current video sequence Vqry, and details are not described here again.
And S108, performing region matching on the video sequences with the first preset number based on the space-time feature descriptors corresponding to the current video sequences to obtain the video sequences with the second preset number.
For the embodiment of the present application, the spatio-temporal region descriptors extracted from the currently observed video sequence are used to optimize the ranking of the above TopK1 video sequences, and the TopK2 video sequences after the optimized ranking are selected as the final scene retrieval result. In the embodiment of the present application, the value of TopK2 may be input by a user or preset, and is not limited here; for example, TopK2 lies in [1,10].
Specifically, in this embodiment of the present application, in step S108, performing region matching on the first preset number of video sequences based on the spatio-temporal feature descriptors corresponding to the current video sequence to obtain the second preset number of video sequences may specifically include: determining the spatial consistency scores of the current video sequence with respect to each video sequence based on the region matching results of the current video sequence with each video sequence; reordering the first preset number of video sequences based on these spatial consistency scores; and extracting the second preset number of video sequences from the sorted first preset number of video sequences. Each of the above video sequences belongs to the first preset number of video sequences. That is, in the embodiment of the present application, the spatial consistency scores between every two frames are calculated in turn from the region matching results, and the TopK1 video sequences are reordered according to the overall spatial consistency score of each video sequence. Specifically, determining the spatial consistency score of the current video sequence with respect to any video sequence includes: determining the spatial consistency score between each frame of image in the current video sequence and each frame of image in that video sequence; determining the weight information of each frame of image in the current video sequence; and determining the spatial consistency score of the current video sequence with respect to that video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency scores between each frame of image in the current video sequence and each frame of image in that video sequence. In the embodiment of the present application, from the spatial consistency score SS of every pair of frames between the two video sequences, the spatial consistency score of the two video sequences is finally calculated as shown in formula (4) below:
$$VSS = \sum_{m} w_m \cdot \max_{k} SS(m, k) \qquad (4)$$

where VSS represents the spatial consistency score of the two video sequences; m and k are frames of the currently observed video sequence V_qry and the retrieved video sequence V_ref respectively, with V_ref belonging to the TopK1 video sequences obtained by the fast retrieval branch; and w_m is the weight of observation frame m, with w_m ∈ (0,1]. The selection strategy of w_m is as follows: one frame of the observed video sequence (e.g., the first frame, the middle frame, or the last frame) is selected as the reference frame with a weight of 1, and the weights of the other frames decay exponentially with their distance from the reference frame.
Further, determining a spatial consistency score between each frame of image in the current video sequence and any frame of image may specifically include: determining region matching space consistency scores of various sizes; determining weight information corresponding to the regions of each size respectively; and determining the spatial consistency score between each frame of image and any frame of image in the current video sequence based on the region matching spatial consistency scores of the sizes and the weight information respectively corresponding to the regions of the sizes.
Specifically, the overall spatial consistency score of the region matching between two frames is calculated by formula (5):

$$SS = \sum_{i=1}^{n_s} w_i \cdot SS_{p_i} \qquad (5)$$

where SS represents the overall spatial consistency score of the region matching between the two frames; i traverses the scale set; n_s is the number of scales; and w_i is the scale weight, one weight per scale, with w_i ∈ [0,1].
Specifically, the region matching spatial consistency score formula (6) for the scale p is as follows:
$$SS_p = \frac{1}{n_p}\sum_{i \in P_p}\left(1 - \frac{\mathrm{dist}\big((r_p^i, c_p^i), (\bar{r}_p, \bar{c}_p)\big)}{\max_{j \in P_p} \mathrm{dist}\big((r_p^j, c_p^j), (\bar{r}_p, \bar{c}_p)\big)}\right) \qquad (6)$$

where SS_p represents the region matching spatial consistency score for scale p; n_p is the number of region blocks of scale p extracted from one frame of image by the multi-scale region feature extraction module; P_p is the region matching set of the region features of scale p; (r_p, c_p) are the matching offsets stored in P_p, i.e., the spatial position offsets of the matched regions computed by the spatio-temporal region feature matching module; $\bar{r}_p$ and $\bar{c}_p$ are the average row offset and average column offset in the set P_p respectively; i and j number the traversals of the set P_p; dist(·) is a distance function, including but not limited to the Manhattan distance, Euclidean distance, Minkowski distance, and the like; and max(·) is the maximum function.
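Putting formulas (4) to (6) together, the fine-grained scoring might be sketched as below; note that the exact forms of formulas (4) and (6) are reconstructed here from the surrounding variable definitions, so this is one plausible reading rather than the definitive computation of the embodiment.

```python
import numpy as np

def scale_consistency(offsets, n_p):
    """offsets: (L, 2) array of (row, col) offsets of the L region matches of
    one scale p between two frames; n_p: number of scale-p blocks per frame.
    Each match is scored by how close its offset is to the mean offset,
    normalised by the largest deviation (one reading of formula (6))."""
    if len(offsets) == 0:
        return 0.0
    mean = offsets.mean(axis=0)
    dev = np.linalg.norm(offsets - mean, axis=1)        # distance to the mean offset
    worst = max(dev.max(), 1e-6)
    return float(np.sum(1.0 - dev / worst) / n_p)

def frame_consistency(per_scale_offsets, n_blocks, scale_weights):
    # formula (5): weighted sum of the per-scale scores SS_p
    return sum(w * scale_consistency(off, n)
               for off, n, w in zip(per_scale_offsets, n_blocks, scale_weights))

def sequence_consistency(ss_matrix, frame_weights):
    """ss_matrix[m][k]: SS between frame m of the query sequence and frame k of
    the retrieved sequence; frame_weights[m]: w_m (reference frame weight 1,
    others decayed exponentially). Formula (4) is read here as a weighted sum
    over query frames of the best-matching retrieved frame."""
    ss = np.asarray(ss_matrix)
    return float(np.sum(np.asarray(frame_weights) * ss.max(axis=1)))
```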
Further, the video scene retrieval method is introduced below by a specific example. As shown in fig. 4, the current video stream sequence is obtained; then, based on the key frame extraction and feature map extraction processes described above, the dense deep learning feature maps corresponding to the current video stream sequence are obtained; and then the coarse-grained branch and the fine-grained branch are executed.
the specific execution flow of the coarse-grained branch is as follows: determining a global feature descriptor corresponding to the current video sequence based on a dense deep learning feature map corresponding to the current video stream sequence, and then retrieving from a database constructed during mapping based on the global feature descriptor corresponding to the current video sequence to obtain a TopK1 retrieval result;
The specific execution flow of the fine-grained branch is as follows: the corresponding region descriptors are obtained based on the dense deep learning feature maps corresponding to the current video stream sequence; the region descriptors are then updated; region matching is then performed between the updated region descriptors and the region descriptors of each video sequence in the database constructed during mapping; the spatial consistency score between the current video sequence and each video sequence in the TopK1 retrieval result is calculated based on the matching results; and the video sequences in the TopK1 retrieval result are optimally reordered based on their spatial consistency scores to obtain the final TopK2 retrieval result.
The foregoing embodiments describe a video scene retrieval method from the perspective of a method flow, and the following embodiments describe a video scene retrieval device from the perspective of a virtual module, which are described in detail in the following embodiments.
An embodiment of the present application provides a video scene retrieval device, as shown in fig. 5, the video scene retrieval device 50 may include: an obtaining module 51, a feature map extracting module 52, a time domain feature fusing module 53, a spatio-temporal feature aggregation processing module 54 and a first retrieving module 55, wherein,
an obtaining module 51, configured to obtain a current video sequence, where the current video sequence includes multiple frames of images;
the feature map extraction module 52 is configured to extract dense depth learning feature maps corresponding to the frames of images from the multiple frames of images respectively;
a time domain feature fusion module 53, configured to perform time domain feature fusion on the basis of the dense deep learning feature maps corresponding to each frame of image, respectively, to obtain respective fused features;
a spatio-temporal feature aggregation processing module 54, configured to perform spatio-temporal feature aggregation processing based on the fused features corresponding to each frame of image, respectively, to obtain a global feature descriptor corresponding to the current video sequence;
the first retrieving module 55 is configured to retrieve from the global database based on the global feature descriptor corresponding to the current video sequence to obtain a first preset number of video sequences.
In a possible implementation manner of the embodiment of the present application, the time domain feature fusion module 53 is specifically configured to, when performing time domain feature fusion on the dense-depth learning feature maps respectively corresponding to the frame images to obtain respective fused features: and respectively corresponding to the dense deep learning feature maps based on each frame of image, and performing time domain feature fusion through an attention mechanism to obtain respective fused features.
In another possible implementation manner of the embodiment of the present application, the spatio-temporal feature aggregation processing module 54 is specifically configured to, when performing spatio-temporal feature aggregation processing based on the fused features corresponding to each frame image respectively to obtain a global feature descriptor corresponding to a current video sequence: splicing the time domain characteristic graphs corresponding to the frames of images to obtain spliced characteristic graphs; performing point-by-point convolution processing on the spliced feature map to obtain a convolution processing result; carrying out normalization processing on the convolution processing result to obtain a result after the normalization processing; and determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map.
In another possible implementation manner of the embodiment of the application, the feature map after the splicing processing includes a plurality of feature points; the spatio-temporal feature aggregation processing module 54 is specifically configured to, when determining the global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map: clustering the plurality of feature points to obtain at least one clustering center; determining the distance between each characteristic point and each clustering center, and determining the distance information corresponding to each clustering center, wherein the distance information corresponding to any clustering center is the distance between each characteristic point and any clustering center; determining global representations respectively corresponding to the clustering clusters based on the distance information corresponding to each clustering center and the result after normalization processing; performing regularization treatment on the global representations corresponding to the cluster clusters respectively; splicing all the global representations after the regularization treatment; and carrying out regularization processing on the global representation after splicing processing to obtain a global feature descriptor corresponding to the current video sequence.
In another possible implementation manner of the embodiment of the present application, the apparatus 50 further includes: a multi-scale region feature extraction module, a spatio-temporal region feature matching module and a second retrieval module, wherein,
the multi-scale region extraction module is used for respectively extracting region features of the dense depth learning feature maps corresponding to the frames of images to obtain the multi-scale region features corresponding to the frames of images;
the space-time region feature matching module is used for carrying out region matching based on the respective corresponding multi-scale region features to obtain space-time feature descriptors corresponding to the current video sequence;
and the second retrieval module is used for carrying out region matching on the video sequences with the first preset number based on the space-time feature descriptors corresponding to the current video sequences to obtain the video sequences with the second preset number.
For the embodiment of the present application, the first retrieving module 55 and the second retrieving module may be the same retrieving module or different retrieving modules, and are not limited in the embodiment of the present application.
In another possible implementation manner of the embodiment of the application, the multi-scale region feature extraction module is specifically configured to, when performing region feature extraction based on a dense depth learning feature map corresponding to any frame of image to obtain a multi-scale region feature corresponding to any frame of image: determining a weighted residual error feature map based on a dense depth learning feature map corresponding to any frame of image; dividing the weighted residual characteristic diagram into a plurality of area blocks; and determining the area characteristic representation corresponding to each area block so as to obtain the multi-scale area characteristic corresponding to any frame of image.
In another possible implementation manner of the embodiment of the present application, when determining the weighted residual feature map based on the dense depth learning feature map corresponding to any frame of image, the multi-scale region feature extraction module is specifically configured to: performing point-by-point convolution processing on the dense depth learning characteristic image corresponding to any frame of image to obtain a convolution result; carrying out normalization processing on the convolution result to obtain a normalization result; and determining a weighted residual error feature map based on the normalization result and the corresponding distance information of each cluster center.
In another possible implementation manner of the embodiment of the application, the multi-scale region features are characterized by a region descriptor; the spatio-temporal region feature matching module is specifically configured to, when performing region matching based on respective corresponding multi-scale region features to obtain a spatio-temporal feature descriptor corresponding to the current video sequence: carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and the region descriptors corresponding to other frames of images in the current video sequence respectively to obtain a region matching result corresponding to the current video sequence; selecting a region descriptor meeting a preset condition from a region matching result corresponding to the current video sequence as a region descriptor corresponding to the current video sequence; and respectively carrying out region feature matching on the region descriptors corresponding to the current video sequence and each video sequence stored in the global database to obtain region matching results of the current video sequence and each video sequence.
In another possible implementation manner of the embodiment of the present application, the spatio-temporal region feature matching module is specifically configured to, when performing region feature matching on a region descriptor corresponding to any frame image in the current video sequence and a region descriptor corresponding to any other frame image in the current video sequence to obtain a corresponding matching result:
determining the distance between the region descriptor corresponding to any frame of image and each region descriptor in each region in any other frame of image in the current video sequence;
carrying out region feature matching on the region descriptor corresponding to any frame image in the current video sequence and the region descriptor corresponding to any other frame image in the current video sequence by the following formula to obtain a corresponding matching result:

$$P_{mn} = \{(i,j) \mid D_{ij} \le t \cdot D_{i^k j},\ D_{ij} \le t \cdot D_{i j^k}\}$$

where the element D_ij of the matrix represents the distance between the i-th region descriptor in frame Tm and the j-th region descriptor in frame Tn, and the matrix D is used to characterize the distances between all region descriptors in frame Tm and all region descriptors in frame Tn of the video sequence; frame Tm characterizes any frame image, and frame Tn characterizes any other frame image in the current video sequence; $D_{i^k j}$ characterizes the element with the smallest distance value in the j-th column of the matrix D (excluding D_ij itself), $D_{i j^k}$ characterizes the element with the smallest distance value in the i-th row of the matrix D (excluding D_ij itself), and t characterizes the threshold parameter; the matching items (i, j) satisfying the condition form the matching set P_mn between frame Tm and frame Tn.
In another possible implementation manner of the embodiment of the present application, when the spatio-temporal region feature matching module selects an area descriptor with a preset condition from an area matching result corresponding to any area, and the area descriptor is used as an area descriptor corresponding to any area, the spatio-temporal region feature matching module is specifically configured to:
determining an average value of distances meeting preset conditions;
and determining the area descriptor corresponding to any area based on the average value of the distances meeting the preset condition.
In another possible implementation manner of the embodiment of the present application, when the space-time region feature matching module determines, based on an average value of distances satisfying a first preset condition, a region descriptor corresponding to any one region, the space-time region feature matching module is specifically configured to:
determining a region descriptor corresponding to any region by the following formula based on the average value of the distances meeting the first preset condition:
$$x' = \arg\min_{x \in S_i} \bar{D}_x$$

where x is a region in the set S_i, P_x is the set of matching items corresponding to region x in the frame matching sets in which region x participates, D_x is the set of all distances D_ij extracted from the matching item set P_x, $\bar{D}_x$ is the average of the elements in the set D_x, and x' is used to characterize the region whose descriptor becomes the new region descriptor of all regions determined in the set S_i.
In another possible implementation manner of the embodiment of the present application, when performing region feature matching on a region descriptor corresponding to a current video sequence and any video sequence, the spatio-temporal region feature matching module is specifically configured to: and respectively carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner of the embodiment of the present application, when performing region feature matching on the region descriptor corresponding to each frame of image and the region descriptor corresponding to any frame of image, the spatio-temporal region feature matching module is specifically configured to: determining a distance vector corresponding to each frame of image based on a region descriptor corresponding to each region in each frame of image and a region descriptor corresponding to each region in any frame of image, wherein the distance vector corresponding to each frame of image comprises a plurality of elements, and any element is the distance between the region descriptor corresponding to any region in each frame of image and the region descriptor corresponding to any region in any frame of image.
In another possible implementation manner of the embodiment of the present application, the second retrieval module, when performing region matching on the video sequences of the first preset number based on the spatio-temporal feature descriptors corresponding to the current video sequences to obtain video sequences of the second preset number, is specifically configured to: determining a spatial consistency score corresponding to the current video sequence and each video sequence respectively based on the region matching result corresponding to the current video sequence and each video sequence respectively, wherein each video sequence belongs to a first preset number of video sequences; reordering the video sequences with the first preset number based on the spatial consistency scores respectively corresponding to the current video sequence and each video sequence; and extracting a second preset number of video sequences from the sorted first preset number of video sequences.
In another possible implementation manner of the embodiment of the present application, when determining a spatial consistency score corresponding to a current video sequence and any video sequence, the second retrieval module is specifically configured to: determining the space consistency scores between each frame of image in the current video sequence and each frame of image in any video sequence; determining weight information of each frame of image in a current video sequence; and determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner of the embodiment of the present application, when determining a spatial consistency score between each frame of image and any frame of image in a current video sequence, the second retrieval module is specifically configured to: determining a region matching space consistency score for each size; determining weight information corresponding to the regions of each size respectively; and determining the spatial consistency score between each frame of image and any frame of image in the current video sequence based on the region matching spatial consistency scores of the sizes and the weight information respectively corresponding to the regions of the sizes.
In another possible implementation manner of the embodiment of the present application, when determining the consistency score of the region matching space of any size, the second retrieval module is specifically configured to:
determining a region matching spatial consistency score for any size by the following formula:
$$SS_p = \frac{1}{n_p}\sum_{i \in P_p}\left(1 - \frac{\mathrm{dist}\big((r_p^i, c_p^i), (\bar{r}_p, \bar{c}_p)\big)}{\max_{j \in P_p} \mathrm{dist}\big((r_p^j, c_p^j), (\bar{r}_p, \bar{c}_p)\big)}\right)$$

where SS_p characterizes the region matching spatial consistency score of size p, n_p characterizes the number of extracted region blocks of scale p in a frame image, P_p is the region matching set of the region features of scale p, (r_p, c_p) are the matching offsets stored in P_p, $\bar{r}_p$ and $\bar{c}_p$ respectively characterize the average row offset and average column offset in the set P_p, i and j characterize the numbering of the traversals of the set P_p, dist(·) is a distance function, and max(·) is the maximum function;

the second retrieval module is specifically configured to, when determining the spatial consistency score between each frame of image in the current video sequence and any frame of image based on the region matching spatial consistency scores of each size and the weight information corresponding to the regions of each size, determine the score according to the following formula:

$$SS = \sum_{i=1}^{n_s} w_i \cdot SS_{p_i}$$

where SS characterizes the spatial consistency score between each frame of image in the current video sequence and any frame of image, i traverses the scale set, n_s is the number of scales, and w_i is the weight information corresponding to size i, with w_i ∈ [0,1].
In another possible implementation manner of the embodiment of the present application, when determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence, the second retrieval module is specifically configured to:
based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence, determining the spatial consistency score corresponding to the current video sequence and any video sequence by the following formula:
[formula rendered as an image in the original publication]
wherein VSS represents the spatial consistency score corresponding to the current video sequence and any video sequence, V_ref belongs to the first preset number of video sequences, m characterizes a frame in the current video sequence, k characterizes a frame of V_ref, and the symbol rendered as an image in the original characterizes the weight information of frame m.
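How the frame-level scores are folded into VSS is likewise hidden behind an image; the definitions only state that per-frame weight information and the pairwise scores SS(m, k) are involved. The sketch below assumes each frame m of the current video sequence contributes the best score it reaches over the frames k of V_ref, weighted by its frame weight; that aggregation rule is an assumption.

```python
import numpy as np

def sequence_consistency_score(frame_scores, frame_weights):
    """frame_scores[m, k]: SS between query frame m and reference frame k of V_ref.
    frame_weights[m]:    weight information of query frame m.
    Assumed rule: weighted sum over query frames of their best reference match."""
    frame_scores = np.asarray(frame_scores, dtype=np.float32)
    frame_weights = np.asarray(frame_weights, dtype=np.float32)
    best_per_frame = frame_scores.max(axis=1)     # best reference frame for each m
    return float((frame_weights * best_per_frame).sum())
```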
Compared with the related art, in the embodiment of the present application, time domain feature fusion is performed based on the dense deep learning feature maps respectively corresponding to each frame of image in the current video sequence, and space-time feature aggregation processing is then performed on the fused features to obtain the global feature descriptor corresponding to the current video sequence. Because this descriptor reflects the space-time characteristics of the current video sequence, retrieval from the global database based on it reduces the influence of changes in the surrounding environment of a scene, local occlusion and the like on scene re-identification, improves the accuracy of the retrieved video sequences, and thereby improves user experience.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In an embodiment of the present application, an electronic device is provided. As shown in fig. 6, the electronic device 600 includes a processor 601 and a memory 603, wherein the processor 601 is coupled to the memory 603, for example via a bus 602. Optionally, the electronic device 600 may further include a transceiver 604. It should be noted that in practical applications the transceiver 604 is not limited to one, and the structure of the electronic device 600 constitutes no limitation on the embodiments of the present application.
The processor 601 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 601 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 602 may include a path that transfers information between the above components. The bus 602 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 602 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The memory 603 may be a ROM (Read Only Memory) or other type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 603 is used for storing application program code for executing the solutions of the present application, and execution is controlled by the processor 601. The processor 601 is configured to execute the application program code stored in the memory 603 to implement what is illustrated in the foregoing method embodiments.
The electronic device includes, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g., in-vehicle navigation terminals), fixed terminals such as digital TVs and desktop computers, and servers. The electronic device shown in fig. 6 is only an example and should not impose any limitation on the functions and the scope of use of the embodiments of the present application.
The embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program runs on a computer, the computer is enabled to execute the corresponding content in the foregoing method embodiments. Compared with the related art, the beneficial effects are the same as those of the method embodiments: because the global feature descriptor obtained through time domain feature fusion and space-time feature aggregation reflects the space-time characteristics of the current video sequence, retrieval from the global database based on this descriptor reduces the influence of changes in the surrounding environment of a scene, local occlusion and the like on scene re-identification, improves the accuracy of the retrieved video sequences, and thereby improves user experience.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.
The above embodiments are only used to describe the technical solutions of the present application in detail; they are intended to help understand the method and its core idea and should not be construed as limiting the present application. Those skilled in the art should also appreciate that various modifications and substitutions can be made without departing from the scope of the present disclosure.

Claims (20)

1. A method for retrieving a video scene, comprising:
acquiring a current video sequence, wherein the current video sequence comprises a plurality of frames of images;
respectively extracting dense deep learning feature maps corresponding to the frames of images from the multiple frames of images;
respectively performing time domain feature fusion on the dense deep learning feature maps respectively corresponding to the frame images to obtain respective fused features;
performing space-time feature aggregation processing on the basis of the fused features corresponding to the images of each frame respectively to obtain a global feature descriptor corresponding to the current video sequence;
and retrieving from a global database based on the global feature descriptors corresponding to the current video sequence to obtain a first preset number of video sequences.
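Claim 1 describes a five-step pipeline. The sketch below shows one way the steps could be wired together; the three callables standing in for feature extraction, time domain fusion and space-time aggregation, as well as the use of cosine similarity for the database search, are assumptions for illustration only.

```python
import numpy as np

def retrieve_similar_sequences(frames, extract_features, temporal_fuse,
                               aggregate, database, first_preset_number=10):
    """frames:   list of images forming the current video sequence.
    database: dict mapping sequence id -> stored global feature descriptor.
    extract_features / temporal_fuse / aggregate are placeholders for the
    dense feature extraction, time domain fusion and space-time aggregation."""
    feature_maps = [extract_features(f) for f in frames]   # dense deep learning feature maps
    fused = temporal_fuse(feature_maps)                    # fused features per frame
    query = aggregate(fused)                               # global feature descriptor
    query = query / (np.linalg.norm(query) + 1e-12)

    scored = []
    for seq_id, descriptor in database.items():
        d = descriptor / (np.linalg.norm(descriptor) + 1e-12)
        scored.append((float(query @ d), seq_id))          # cosine similarity (assumed)
    scored.sort(reverse=True)
    return [seq_id for _, seq_id in scored[:first_preset_number]]
```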
2. The method according to claim 1, wherein the performing time domain feature fusion respectively based on the dense deep learning feature maps respectively corresponding to the frame images to obtain respective fused features comprises:
performing time domain feature fusion through an attention mechanism based on the dense deep learning feature maps respectively corresponding to the frame images, to obtain respective fused features.
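Claim 2 only states that the fusion is performed through an attention mechanism. One common realisation, treating the features of each spatial location across the T frames as a short token sequence and applying standard scaled dot-product self-attention, is sketched below in PyTorch; the self-attention form, the head count and the layer sizes are assumptions, not the claimed design.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Fuses T dense feature maps of shape (T, C, H, W) along the time axis."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feature_maps: torch.Tensor) -> torch.Tensor:
        t, c, h, w = feature_maps.shape
        # one token per frame at every spatial position: (H*W, T, C)
        tokens = feature_maps.permute(2, 3, 0, 1).reshape(h * w, t, c)
        fused, _ = self.attn(tokens, tokens, tokens)          # temporal self-attention
        return fused.reshape(h, w, t, c).permute(2, 3, 0, 1)  # back to (T, C, H, W)

# usage sketch: fused = TemporalAttentionFusion(256)(torch.randn(5, 256, 30, 40))
```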
3. The method according to claim 1 or 2, wherein the performing spatio-temporal feature aggregation processing based on the fused features respectively corresponding to the frame images to obtain a global feature descriptor corresponding to the current video sequence comprises:
splicing the time domain feature maps corresponding to the frames of images to obtain a spliced feature map;
performing point-by-point convolution processing on the spliced feature map to obtain a convolution processing result;
carrying out normalization processing on the convolution processing result to obtain a result after the normalization processing;
and determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map.
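Claim 3's chain (splice the per-frame time domain feature maps, run a point-by-point 1×1 convolution, normalise the result and combine it with the spliced map) can be read as producing a soft assignment over the spliced features. The PyTorch sketch below follows that reading; concatenation along the width axis, softmax as the normalisation and the number of assignment channels K are assumptions.

```python
import torch
import torch.nn as nn

class SoftAssignment(nn.Module):
    """Point-by-point convolution plus normalisation over a spliced feature map."""
    def __init__(self, channels: int, num_clusters: int):
        super().__init__()
        self.pointwise = nn.Conv2d(channels, num_clusters, kernel_size=1)

    def forward(self, frame_features):
        # splice the T fused maps of shape (C, H, W) along the width axis (assumed)
        spliced = torch.cat(list(frame_features), dim=-1).unsqueeze(0)  # (1, C, H, T*W)
        conv = self.pointwise(spliced)                                  # (1, K, H, T*W)
        assignment = torch.softmax(conv, dim=1)                         # normalisation (softmax assumed)
        return assignment, spliced          # both feed the descriptor step of claim 4
```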
4. The method according to claim 3, wherein the feature map after the splicing process comprises a plurality of feature points;
the determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map includes:
clustering the plurality of feature points to obtain at least one clustering center;
determining the distance between each feature point and each clustering center, and determining the distance information corresponding to each clustering center, wherein the distance information corresponding to any clustering center is the distance between each feature point and any clustering center;
determining global representations respectively corresponding to the clustering clusters based on the distance information corresponding to each clustering center and the result after the normalization processing;
performing regularization processing on the global representations corresponding to the cluster clusters respectively;
splicing all the global representations after the regularization treatment;
and carrying out regularization processing on the global representation after splicing processing to obtain a global feature descriptor corresponding to the current video sequence.
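Claim 4 (cluster the feature points of the spliced map, collect distance information to each cluster centre, build one global representation per cluster, regularise each, splice them and regularise again) matches a VLAD-style aggregation. A numpy/scikit-learn sketch under that reading; the residual-sum form of the per-cluster representation and L2 normalisation as the "regularisation" are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_style_descriptor(points, soft_assignment, num_clusters=8):
    """points:          (N, C) feature points of the spliced feature map.
    soft_assignment: (N, K) normalised weights from the point-by-point convolution,
                     with K equal to num_clusters."""
    centers = KMeans(n_clusters=num_clusters, n_init=10).fit(points).cluster_centers_
    reps = []
    for k in range(num_clusters):
        residual = points - centers[k]                        # distance information to centre k
        rep = (soft_assignment[:, k:k + 1] * residual).sum(axis=0)
        rep = rep / (np.linalg.norm(rep) + 1e-12)             # per-cluster regularisation (L2 assumed)
        reps.append(rep)
    descriptor = np.concatenate(reps)                         # splice all global representations
    return descriptor / (np.linalg.norm(descriptor) + 1e-12)  # final regularisation
```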
5. The method according to claim 4, wherein the extracting dense deep learning feature maps respectively corresponding to the frames of images from the plurality of frames of images further comprises:
respectively extracting the region features of the dense deep learning feature maps corresponding to the frame images to obtain the multi-scale region features corresponding to the frame images;
performing region matching based on the respective corresponding multi-scale region features to obtain a space-time feature descriptor corresponding to the current video sequence;
and performing region matching on the video sequences with the first preset number based on the space-time feature descriptors corresponding to the current video sequences to obtain video sequences with a second preset number.
6. The method according to claim 5, wherein performing region feature extraction based on a dense deep learning feature map corresponding to any frame image to obtain a multi-scale region feature corresponding to any frame image comprises:
determining a weighted residual feature map based on the dense deep learning feature map corresponding to any frame of image;
dividing the weighted residual feature map into a plurality of region blocks;
and determining the region feature representation corresponding to each region block to obtain the multi-scale region feature corresponding to any frame of image.
7. The method of claim 6, wherein determining a weighted residual feature map based on the dense deep learning feature map corresponding to any frame of image comprises:
performing point-by-point convolution processing on the dense deep learning feature map corresponding to any frame of image to obtain a convolution result;
carrying out normalization processing on the convolution result to obtain a normalization result;
and determining the weighted residual error feature map based on the normalization result and the distance information corresponding to each cluster center.
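Claims 6 and 7 build the multi-scale region features in two moves: a weighted residual feature map (point-by-point convolution, normalisation, residuals to the cluster centres) and a split of that map into region blocks of several sizes, each summarised by a region descriptor. The numpy sketch below is one reading of those steps; picking the most relevant centre per location, average pooling inside a block and the particular block sizes are assumptions.

```python
import numpy as np

def weighted_residual_map(feature_map, centers, assignment):
    """feature_map: (H, W, C); centers: (K, C); assignment: (H, W, K) normalised weights.
    Assumed rule: keep, per location, the residual to its most relevant centre,
    weighted by the normalised assignment of that centre."""
    k_best = assignment.argmax(axis=-1)                          # (H, W)
    residual = feature_map - centers[k_best]                     # residual to the chosen centre
    weight = np.take_along_axis(assignment, k_best[..., None], axis=-1)
    return weight * residual                                     # (H, W, C)

def multi_scale_region_features(wr_map, block_sizes=(2, 4, 8)):
    """Split the weighted residual map into square region blocks of several sizes
    and describe each block by its mean vector (pooling choice is assumed)."""
    h, w, _ = wr_map.shape
    features = {}
    for s in block_sizes:
        blocks = [wr_map[r:r + s, c:c + s].mean(axis=(0, 1))
                  for r in range(0, h - s + 1, s)
                  for c in range(0, w - s + 1, s)]
        features[s] = np.stack(blocks)                           # (n_p, C) region descriptors of size s
    return features
```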
8. The method of any one of claims 5-7, wherein the multi-scale region features are characterized by a region descriptor;
the obtaining of the space-time feature descriptor corresponding to the current video sequence by performing region matching based on the respective corresponding multi-scale region features comprises:
carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and the region descriptors corresponding to other frames of images in the current video sequence respectively to obtain a region matching result corresponding to the current video sequence;
selecting a region descriptor meeting a preset condition from a region matching result corresponding to the current video sequence as a region descriptor corresponding to the current video sequence;
and respectively carrying out region feature matching on the region descriptors corresponding to the current video sequence and each video sequence stored in a global database to obtain region matching results of the current video sequence and each video sequence.
9. The method of claim 8, wherein performing region feature matching on the region descriptor corresponding to any frame of image in the current video sequence and the region descriptor corresponding to any other frame of image in the current video sequence to obtain a corresponding matching result comprises:
determining the distance between the region descriptor corresponding to any frame of image and each region descriptor in each region in any other frame of image in the current video sequence;
carrying out region feature matching on the region descriptor corresponding to any frame image in the current video sequence and the region descriptor corresponding to any other frame image in the current video sequence by the following formula to obtain a corresponding matching result:
[formula rendered as an image in the original publication]
wherein the element D_ij in the matrix characterizes the distance between the i-th region descriptor in the Tm frame and the j-th region descriptor in the Tn frame, the matrix D characterizes the distances between all region descriptors in the Tm frame and all region descriptors in the Tn frame of the video sequence, the Tm frame characterizes any one frame image, and the Tn frame characterizes any other frame image in the current video sequence; D_ij^k is the element with the smallest distance value in the j-th column of the matrix D, D_i^k_j is the element with the smallest distance value in the i-th row of the matrix D, t characterizes a threshold parameter, and the matching items (i, j) meeting the conditions form the matching set P_mn between the Tm frame and the Tn frame.
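The matching condition itself is an image in the original, but the surrounding definitions (column-wise minimum, row-wise minimum, a threshold parameter t) point to a mutual-nearest-neighbour test on the distance matrix D. The numpy sketch below assumes exactly that, with t read as an absolute distance threshold; whether t is instead a ratio test cannot be recovered from this text.

```python
import numpy as np

def match_regions(desc_tm, desc_tn, t=0.7):
    """desc_tm: (Nm, C) region descriptors of the Tm frame.
    desc_tn: (Nn, C) region descriptors of the Tn frame.
    Returns the matching set P_mn of index pairs (i, j)."""
    # D[i, j] = distance between the i-th descriptor of Tm and the j-th of Tn
    D = np.linalg.norm(desc_tm[:, None, :] - desc_tn[None, :, :], axis=-1)
    best_j = D.argmin(axis=1)          # row-wise minimum: best j for every i
    best_i = D.argmin(axis=0)          # column-wise minimum: best i for every j
    return [(i, j) for i, j in enumerate(best_j)
            if best_i[j] == i and D[i, j] < t]   # mutual minimum + threshold (assumed)
```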
10. The method according to claim 9, wherein selecting a region descriptor meeting a preset condition from the region matching result corresponding to any one of the regions as the region descriptor corresponding to any one of the regions comprises:
determining the average value of the distances meeting the preset condition;
and determining the region descriptor corresponding to any one of the regions based on the average value of the distances meeting the preset condition.
11. The method according to claim 10, wherein the determining a region descriptor corresponding to any one of the regions based on the average value of the distances meeting the preset condition comprises:
determining, based on the average value of the distances meeting the preset condition, the region descriptor corresponding to any one of the regions through the following formula:
[formula rendered as an image in the original publication]
wherein x is a region in the group S_i, P_x is the matching item set corresponding to the region x in the frame matching set where the region x is located, D_x is the set of all D_ij extracted from the matching item set P_x, the symbol rendered as an image in the original is the average of the elements in the set D_x, and x' characterizes the region descriptor determined from all the regions in the set S_i.
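The selection rule of claim 11 is another lost image; the definitions only say that it depends on the average of the D_ij values drawn from each region's matching set P_x. One plausible reading, keeping within each group S_i the region whose matches have the smallest average distance, is sketched below; that argmin choice is an assumption.

```python
import numpy as np

def select_representative(regions, match_distances):
    """regions:         region descriptors x belonging to one group S_i.
    match_distances: one 1-D array per region, holding the D_ij values
                     extracted from that region's matching set P_x.
    Assumed rule: the representative x' is the region with the lowest
    mean matching distance."""
    means = [np.mean(d) if len(d) else np.inf for d in match_distances]
    return regions[int(np.argmin(means))]
```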
12. The method of claim 11, wherein performing region feature matching on the region descriptor corresponding to the current video sequence with any video sequence comprises:
and respectively carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and each frame of image in any video sequence.
13. The method according to any one of claims 9 to 12, wherein performing region feature matching on the region descriptor corresponding to each frame of image with the region descriptor corresponding to any one frame of image comprises:
determining a distance vector corresponding to each frame of image based on a region descriptor corresponding to each region in each frame of image and a region descriptor corresponding to each region in any frame of image, wherein the distance vector corresponding to each frame of image comprises a plurality of elements, and any element is the distance between the region descriptor corresponding to any region in each frame of image and the region descriptor corresponding to any region in any frame of image.
14. The method of claim 13, wherein the performing region matching on the first preset number of video sequences based on the spatio-temporal feature descriptors corresponding to the current video sequence to obtain a second preset number of video sequences comprises:
determining a spatial consistency score corresponding to the current video sequence and each video sequence respectively based on the region matching result corresponding to the current video sequence and each video sequence respectively, wherein each video sequence belongs to the video sequences with the first preset number;
reordering the video sequences with the first preset number based on the spatial consistency scores respectively corresponding to the current video sequence and each video sequence;
and extracting a second preset number of video sequences from the sorted first preset number of video sequences.
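Claim 14 re-ranks the first preset number of candidates by their spatial consistency scores and keeps the top of the new order. A short sketch of that step; sorting in descending score order is the natural reading of "reordering ... based on the spatial consistency scores".

```python
def rerank_by_spatial_consistency(candidates, vss_scores, second_preset_number=5):
    """candidates: the first preset number of retrieved sequence ids.
    vss_scores: mapping from sequence id to its VSS score against the
                current video sequence."""
    reordered = sorted(candidates, key=lambda s: vss_scores[s], reverse=True)
    return reordered[:second_preset_number]
```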
15. The method of claim 14, wherein determining a spatial consistency score for a current video sequence corresponding to any video sequence based on a region matching result of the current video sequence corresponding to the any video sequence comprises:
determining a spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence respectively based on the region matching result of the current video sequence corresponding to any video sequence respectively;
determining the weight information of each frame of image in the current video sequence;
and determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence.
16. The method of claim 15, wherein determining a spatial consistency score between each frame of image in the current video sequence and any frame of image comprises:
determining a region matching spatial consistency score for each size;
determining weight information corresponding to the regions of each size respectively;
and determining the spatial consistency score between each frame of image and any frame of image in the current video sequence based on the region matching spatial consistency scores of the sizes and the weight information respectively corresponding to the regions of the sizes.
17. The method according to claim 16, wherein the determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence comprises:
based on the weight information of each frame of image in the current video sequence and the spatial consistency scores between each frame of image in the current video sequence and each frame of image in any video sequence, determining the spatial consistency score corresponding to the current video sequence and any video sequence by the following formula:
[formula rendered as an image in the original publication]
wherein VSS represents the spatial consistency score corresponding to the current video sequence and any video sequence, V_ref belongs to the first preset number of video sequences, m represents a frame in the current video sequence, k represents a frame of V_ref, the symbol rendered as an image in the original characterizes the weight information of m, and SS characterizes the spatial consistency score between each frame of image in the current video sequence and any frame of image.
18. A video scene retrieval apparatus, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a current video sequence which comprises a plurality of frames of images;
the feature map extraction module is used for respectively extracting dense deep learning feature maps corresponding to the frames of images from the multiple frames of images;
the time domain feature fusion module is used for respectively carrying out time domain feature fusion on the basis of the dense deep learning feature maps respectively corresponding to the frames of images to obtain respective fused features;
the temporal-spatial feature aggregation processing module is used for performing temporal-spatial feature aggregation processing on the basis of the fused features corresponding to the frames of images respectively to obtain a global feature descriptor corresponding to the current video sequence;
and the first retrieval module is used for retrieving from a global database based on the global feature descriptors corresponding to the current video sequence to obtain a first preset number of video sequences.
19. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the video scene retrieval method according to any one of claims 1 to 17.
20. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a method for video scene retrieval according to any one of claims 1 to 17.
CN202210339794.9A 2022-04-01 2022-04-01 Video scene retrieval method and device, electronic equipment and readable storage medium Pending CN114743139A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210339794.9A CN114743139A (en) 2022-04-01 2022-04-01 Video scene retrieval method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210339794.9A CN114743139A (en) 2022-04-01 2022-04-01 Video scene retrieval method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114743139A true CN114743139A (en) 2022-07-12

Family

ID=82278364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210339794.9A Pending CN114743139A (en) 2022-04-01 2022-04-01 Video scene retrieval method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114743139A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641499A (en) * 2022-10-19 2023-01-24 感知天下(北京)信息科技有限公司 Photographing real-time positioning method and device based on street view feature library and storage medium
CN115641499B (en) * 2022-10-19 2023-07-18 感知天下(北京)信息科技有限公司 Photographing real-time positioning method, device and storage medium based on street view feature library
CN116129330A (en) * 2023-03-14 2023-05-16 阿里巴巴(中国)有限公司 Video-based image processing, behavior recognition, segmentation and detection methods and equipment
CN116129330B (en) * 2023-03-14 2023-11-28 阿里巴巴(中国)有限公司 Video-based image processing, behavior recognition, segmentation and detection methods and equipment


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination