CN114743139A - Video scene retrieval method and device, electronic equipment and readable storage medium - Google Patents


Publication number
CN114743139A
Authority
CN
China
Prior art keywords: video sequence, frame, region, image, current video
Legal status: Pending
Application number
CN202210339794.9A
Other languages
Chinese (zh)
Inventor
陈禹行
殷佳豪
刘志励
范圣印
李雪
Current Assignee
Beijing Yihang Yuanzhi Technology Co Ltd
Original Assignee
Beijing Yihang Yuanzhi Technology Co Ltd
Application filed by Beijing Yihang Yuanzhi Technology Co Ltd filed Critical Beijing Yihang Yuanzhi Technology Co Ltd
Priority to CN202210339794.9A
Publication of CN114743139A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/787Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The application relates to a video scene retrieval method and apparatus, an electronic device, and a readable storage medium, and belongs to the field of computer technology. The method includes: obtaining a current video sequence that comprises multiple frames of images; extracting, from the multiple frames, a dense deep learning feature map corresponding to each frame; performing time domain feature fusion based on the dense deep learning feature maps of the frames to obtain respective fused features; performing spatio-temporal feature aggregation on the fused features of the frames to obtain a global feature descriptor corresponding to the current video sequence; and retrieving a first preset number of video sequences from a global database based on that global feature descriptor. The video scene retrieval method and apparatus, electronic device, and readable storage medium can improve the accuracy of video sequence retrieval and thus improve user experience.

Description

Video scene retrieval method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video scene retrieval method and apparatus, an electronic device, and a readable storage medium.
Background
In recent years, applications such as autonomous memory parking, intelligent logistics carts, restaurant meal-delivery robots, and autonomously cruising drones have emerged, and for all of them it is very important to recognize a scene that has been visited before. When such a task is executed for the first time (for example, parking the car into a space), a correct motion path is planned manually in advance and a scene map is established; when the task is later executed autonomously, the intelligent robot or autonomous vehicle perceives its position in the scene map from the currently observed scene, and then either autonomously tracks the pre-planned path or navigates with autonomous obstacle avoidance according to the scene map. The accuracy of scene re-identification is therefore crucial to the operation of the subsequent localization, tracking, and navigation algorithm modules.
In the above application scenarios, a long time span may elapse between building the scene map and executing the autonomous navigation task, so the environment around the scene can change considerably: the map may be built in the morning while autonomous navigation takes place at night; the map may be built on a sunny day while navigation happens on a rainy, foggy, or snowy day, and the two may even fall in different seasons, so the appearance of the scene observed at the two times changes greatly. In addition, the scenes of these applications are often quite complex: during autonomous navigation the images are disturbed by dynamic objects such as pedestrians and vehicles, which further increases the appearance difference between the two observations, and the dynamic objects may even partially occlude the scene. Meanwhile, open scenes and repeatedly appearing objects with identical texture pose another major challenge, for example open parking lots, different garages with similar design styles, and nearly identical lamp posts and fences along roads.
In the course of research, the inventors found that the above situations can lower the accuracy of scene re-identification and thus degrade the user experience.
Disclosure of Invention
The present application aims to provide a video scene retrieval method, apparatus, electronic device and readable storage medium, which are used to solve at least one of the above technical problems.
The above object of the present invention is achieved by the following technical solutions:
in a first aspect, a video scene retrieval method is provided, including:
acquiring a current video sequence, wherein the current video sequence comprises a plurality of frames of images;
respectively extracting dense depth learning feature maps corresponding to the frames of images from the multiple frames of images;
respectively performing time domain feature fusion on the dense deep learning feature maps respectively corresponding to the frame images to obtain respective fused features;
performing space-time feature aggregation processing on the basis of the fused features corresponding to the images of each frame respectively to obtain a global feature descriptor corresponding to the current video sequence;
and retrieving from a global database based on the global feature descriptors corresponding to the current video sequence to obtain a first preset number of video sequences.
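For orientation, the following is a minimal Python/NumPy sketch (not the patent's implementation) of the coarse retrieval flow of the first aspect; `extract_features`, `fuse_temporal`, `aggregate`, and `global_database` are illustrative placeholders rather than names from the disclosure:

```python
import numpy as np

def retrieve_video_scene(frames, extract_features, fuse_temporal, aggregate,
                         global_database, top_k=10):
    """Coarse retrieval: frames -> global descriptor -> top-k nearest stored sequences.

    `extract_features`, `fuse_temporal`, and `aggregate` stand in for the dense-feature
    network, the attention-based temporal fusion, and the spatio-temporal aggregation
    described above; `global_database` maps a sequence id to its stored descriptor.
    """
    dense_maps = [extract_features(f) for f in frames]          # one dense map per frame
    fused = [fuse_temporal(m, dense_maps) for m in dense_maps]  # time domain feature fusion
    query_desc = aggregate(fused)                               # global feature descriptor

    # Rank stored sequences by Euclidean distance between global descriptors.
    scored = [(seq_id, np.linalg.norm(query_desc - desc))
              for seq_id, desc in global_database.items()]
    scored.sort(key=lambda x: x[1])
    return [seq_id for seq_id, _ in scored[:top_k]]   # first preset number of sequences
```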
In a possible implementation manner, the performing time domain feature fusion based on the dense deep learning feature maps respectively corresponding to the frames of images to obtain respective fused features includes: performing time domain feature fusion through a self-attention mechanism based on the dense deep learning feature maps respectively corresponding to the frames of images, to obtain the respective fused features.
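One way to realize such attention-based temporal fusion is a non-local-style self-attention block over the stacked per-frame feature maps. The PyTorch sketch below is an assumption about the structure (layer widths, residual connection, and scaling are illustrative), not the exact network of this application:

```python
import torch
import torch.nn as nn

class TemporalSelfAttentionFusion(nn.Module):
    """Fuses features across the frames of one sequence with dot-product attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.key   = nn.Conv3d(channels, channels // 2, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -- T frames of one video sequence.
        b, c, t, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)    # (B, THW, C/2)
        k = self.key(x).flatten(2)                       # (B, C/2, THW)
        v = self.value(x).flatten(2).transpose(1, 2)     # (B, THW, C)
        attn = torch.softmax(q @ k / (c // 2) ** 0.5, dim=-1)   # attention over all frame positions
        fused = (attn @ v).transpose(1, 2).reshape(b, c, t, h, w)
        return x + fused    # residual connection, as in non-local blocks
```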
In another possible implementation manner, the performing spatio-temporal feature aggregation processing based on the fused features respectively corresponding to each frame of image to obtain a global feature descriptor corresponding to the current video sequence includes:
splicing the time domain characteristic graphs corresponding to the frames of images to obtain spliced characteristic graphs;
performing point-by-point convolution processing on the spliced feature map to obtain a convolution processing result;
carrying out normalization processing on the convolution processing result to obtain a result after the normalization processing;
and determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map.
In another possible implementation manner, the feature map after the stitching process includes a plurality of feature points;
the determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map includes:
clustering the plurality of feature points to obtain at least one clustering center;
determining the distance between each characteristic point and each clustering center, and determining the distance information corresponding to each clustering center, wherein the distance information corresponding to any clustering center is the distance between each characteristic point and any clustering center;
determining global representations respectively corresponding to the clustering clusters based on the distance information corresponding to each clustering center and the result after the normalization processing;
performing regularization processing on the global representations corresponding to the cluster clusters respectively;
splicing all the global representations after the regularization treatment;
and performing regularization processing on the global representation after the splicing processing to obtain a global feature descriptor corresponding to the current video sequence.
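The concatenation, point-wise convolution, normalization, and cluster-residual steps above resemble a NetVLAD-style soft-assignment aggregation extended over the frames of a sequence. The PyTorch sketch below illustrates that reading; the number of clusters, the soft-assignment form, and the normalization order are assumptions, not the patent's exact definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalVLADSketch(nn.Module):
    """Aggregates the fused per-frame feature maps of a sequence into one global descriptor."""
    def __init__(self, channels: int, num_clusters: int = 64):
        super().__init__()
        self.assign = nn.Conv2d(channels, num_clusters, kernel_size=1)   # point-wise convolution
        self.centers = nn.Parameter(torch.randn(num_clusters, channels)) # cluster centers

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) -- fused feature maps of the T frames, stacked in time.
        t, c, h, w = feats.shape
        soft = F.softmax(self.assign(feats), dim=1)       # (T, K, H, W) soft assignment
        x = feats.flatten(2)                               # (T, C, N) feature points
        a = soft.flatten(2)                                # (T, K, N)
        # residual of every feature point to every cluster center, weighted by its assignment
        resid = x.unsqueeze(1) - self.centers.view(1, -1, c, 1)   # (T, K, C, N)
        vlad = (a.unsqueeze(2) * resid).sum(dim=(0, 3))           # (K, C): sum over frames and points
        vlad = F.normalize(vlad, dim=1)             # per-cluster regularization
        vlad = F.normalize(vlad.flatten(), dim=0)   # regularization after concatenation
        return vlad    # 1-D global feature descriptor of length K*C
```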
In another possible implementation manner, after the dense deep learning feature maps respectively corresponding to the frames of images are extracted from the multiple frames of images, the method further includes:
respectively extracting the regional features of the dense depth learning feature maps corresponding to the frame images to obtain the multi-scale regional features corresponding to the frame images;
performing region matching based on the respective corresponding multi-scale region characteristics to obtain a space-time characteristic descriptor corresponding to the current video sequence;
and performing region matching on the video sequences with the first preset number based on the space-time feature descriptors corresponding to the current video sequences to obtain video sequences with a second preset number.
In another possible implementation manner, performing region feature extraction based on a dense depth learning feature map corresponding to any frame of image to obtain a multi-scale region feature corresponding to any frame of image includes:
determining a weighted residual feature map based on a dense depth learning feature map corresponding to any frame of image;
dividing the weighted residual characteristic map into a plurality of area blocks;
and determining the area characteristic representation corresponding to each area block so as to obtain the multi-scale area characteristic corresponding to any frame of image.
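As an illustration of the block-splitting step, the sketch below divides a weighted residual feature map into non-overlapping square blocks at several sizes and pools each block into a region descriptor; the block sizes and the use of average pooling plus L2 normalization are assumptions:

```python
import torch
import torch.nn.functional as F

def region_block_descriptors(residual_map: torch.Tensor, block_sizes=(2, 4, 8)):
    """Splits a weighted residual feature map into square blocks at several scales and
    pools each block into one region descriptor (illustrative sketch)."""
    # residual_map: (C, H, W)
    descriptors = {}
    for p in block_sizes:
        # avg_pool2d over non-overlapping p x p blocks gives one vector per region block
        pooled = F.avg_pool2d(residual_map.unsqueeze(0), kernel_size=p, stride=p)
        regions = pooled.squeeze(0).flatten(1).transpose(0, 1)    # (num_blocks, C)
        descriptors[p] = F.normalize(regions, dim=1)              # one L2-normalized vector per block
    return descriptors   # block size -> (num_blocks, C) multi-scale region features
```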
In another possible implementation manner, the determining a weighted residual feature map based on the dense depth learning feature map corresponding to any frame of image includes:
performing point-by-point convolution processing on the dense depth learning feature map corresponding to any frame of image to obtain a convolution result;
carrying out normalization processing on the convolution result to obtain a normalization result;
and determining the weighted residual error feature map based on the normalization result and the distance information corresponding to each cluster center.
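A hedged sketch of one plausible reading of the weighted residual feature map: the point-wise convolution plus normalization gives per-pixel soft assignments to the cluster centers, which weight the residuals to those centers while preserving the spatial layout. `assign_conv` is an assumed 1x1 convolution with K output channels:

```python
import torch
import torch.nn.functional as F

def weighted_residual_map(feat: torch.Tensor, assign_conv: torch.nn.Conv2d,
                          centers: torch.Tensor) -> torch.Tensor:
    """Weighted residual map for one frame, keeping the spatial layout so it can
    later be split into region blocks (an assumed interpretation, not the patent's
    exact construction)."""
    # feat: (C, H, W); centers: (K, C); assign_conv: Conv2d(C, K, kernel_size=1)
    soft = F.softmax(assign_conv(feat.unsqueeze(0)), dim=1).squeeze(0)   # (K, H, W)
    resid = feat.unsqueeze(0) - centers[:, :, None, None]                # (K, C, H, W)
    weighted = (soft.unsqueeze(1) * resid).sum(dim=0)                    # (C, H, W)
    return weighted
```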
In another possible implementation, the multi-scale region features are characterized by a region descriptor;
the obtaining of the space-time feature descriptor corresponding to the current video sequence by performing region matching based on the respective corresponding multi-scale region features comprises:
carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and the region descriptors corresponding to other frames of images in the current video sequence respectively to obtain a region matching result corresponding to the current video sequence;
selecting a region descriptor meeting a preset condition from a region matching result corresponding to the current video sequence as a region descriptor corresponding to the current video sequence;
and respectively carrying out region feature matching on the region descriptors corresponding to the current video sequence and each video sequence stored in a global database to obtain region matching results of the current video sequence and each video sequence.
In another possible implementation manner, performing region feature matching on a region descriptor corresponding to any frame of image in the current video sequence and a region descriptor corresponding to any other frame of image in the current video sequence to obtain a corresponding matching result, includes:
determining the distance between the region descriptor corresponding to any frame of image and each region descriptor in each region of other frame of images in the current video sequence;
performing region feature matching between the region descriptors corresponding to any frame of image in the current video sequence and the region descriptors corresponding to any other frame of image in the current video sequence by the following formula to obtain a corresponding matching result:
[Formula (shown as an image in the original): the condition on the distance matrix D that defines the matching set Pmn between frame Tm and frame Tn]
where the element Dij of the matrix denotes the distance between the i-th region descriptor in frame Tm and the j-th region descriptor in frame Tn; the matrix D denotes the distances between all region descriptors in frame Tm and all region descriptors in frame Tn of the video sequence, frame Tm denotes any one frame of image, and Tn denotes any other frame of image in the current video sequence; Dij_k denotes the element with the smallest distance value in the j-th column of the matrix D, Di_kj denotes the element with the smallest distance value in the i-th row of the matrix D, and t denotes a threshold parameter; the matching items (i, j) that satisfy the condition form the matching set Pmn between frame Tm and frame Tn.
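A NumPy sketch of such a frame-to-frame region matching step is given below; keeping mutual nearest neighbours whose distance is below the threshold t is an assumed reading of the equation image, not the exact condition from the disclosure:

```python
import numpy as np

def match_regions(desc_m: np.ndarray, desc_n: np.ndarray, t: float):
    """Matches region descriptors of frame Tm against frame Tn.

    D[i, j] is the distance between the i-th descriptor of Tm and the j-th of Tn.
    Pairs kept here are mutual nearest neighbours with distance below threshold t
    (an assumption), returned together with the full distance matrix D.
    """
    D = np.linalg.norm(desc_m[:, None, :] - desc_n[None, :, :], axis=-1)   # (Nm, Nn)
    P_mn = []
    for i in range(D.shape[0]):
        j = int(np.argmin(D[i]))                       # best column for row i
        if np.argmin(D[:, j]) == i and D[i, j] < t:    # mutual nearest neighbour + threshold
            P_mn.append((i, j))
    return P_mn, D
```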
In another possible implementation manner, selecting an area descriptor of a first preset condition from the area matching result corresponding to any one of the areas, as the area descriptor corresponding to any one of the areas, includes:
determining the average value of the distances meeting the preset conditions;
and determining the area descriptor corresponding to any area based on the average value of the distances meeting the first preset condition.
In another possible implementation manner, the determining, based on the average value of the distances satisfying the first preset condition, the area descriptor corresponding to any one of the areas includes:
determining a region descriptor corresponding to any region by the following formula based on the average value of the distances meeting the first preset condition:
[Formula (shown as an image in the original): selection of the region descriptor x' based on the average matching distance]
where x is a region in the group Si, Px is the set of matching items corresponding to region x in the frame matching sets containing region x, Dx is the set of all Dij extracted from the matching item set Px, the averaged quantity (shown as an image in the original) is the mean of the Dij elements in the set, and x' denotes the region descriptors determined for all regions in the set Si.
In another possible implementation manner, performing region feature matching on the region descriptor corresponding to the current video sequence and any video sequence includes:
and respectively carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner, performing region feature matching on a region descriptor corresponding to each frame of image with a region descriptor corresponding to any frame of image includes:
determining a distance vector corresponding to each frame of image based on a region descriptor corresponding to each region in each frame of image and a region descriptor corresponding to each region in any frame of image, wherein the distance vector corresponding to each frame of image comprises a plurality of elements, and any element is a distance between the region descriptor corresponding to any region in each frame of image and the region descriptor corresponding to any region in any frame of image.
In another possible implementation manner, the performing region matching on the first preset number of video sequences based on the spatio-temporal feature descriptors corresponding to the current video sequence to obtain a second preset number of video sequences includes:
determining a spatial consistency score corresponding to the current video sequence and each video sequence respectively based on the region matching result corresponding to the current video sequence and each video sequence respectively, wherein each video sequence belongs to the video sequences with the first preset number;
reordering the video sequences with the first preset number based on the spatial consistency scores respectively corresponding to the current video sequence and each video sequence;
and extracting a second preset number of video sequences from the sorted first preset number of video sequences.
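A minimal sketch of this re-ranking step, assuming a callable `vss_score` that returns the sequence-level spatial consistency score of a candidate against the query (both names are illustrative):

```python
def rerank_by_spatial_consistency(candidates, vss_score, second_preset_number):
    """Re-orders the coarse retrieval candidates by their spatial consistency score
    with the current video sequence and keeps the best ones.

    `candidates` is the first preset number of sequences returned by the global
    descriptor search; `vss_score(seq)` is assumed to return the VSS between the
    current video sequence and `seq`.
    """
    reranked = sorted(candidates, key=vss_score, reverse=True)   # higher score = better match
    return reranked[:second_preset_number]
```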
In another possible implementation manner, determining a spatial consistency score corresponding to the current video sequence and any video sequence includes:
determining a spatial consistency score between each frame of image in a current video sequence and each frame of image in any video sequence;
determining the weight information of each frame of image in the current video sequence;
and determining a spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner, the determining a spatial congruency score between each frame of image in the current video sequence and any frame of image includes:
determining region matching space consistency scores of various sizes;
determining weight information corresponding to the areas of all sizes;
and determining the spatial consistency score between each frame of image and any frame of image in the current video sequence based on the region matching spatial consistency scores of the sizes and the weight information respectively corresponding to the regions of the sizes.
In another possible implementation, determining a region matching spatial consistency score of any size includes:
determining a region matching spatial consistency score for any size by the following formula:
[Formula (shown as an image in the original): definition of the size-p region-matching spatial consistency score SSp]
where SSp denotes the region-matching spatial consistency score for size p, np denotes the number of region blocks of size p extracted from the frame image, Pp is the set of matched region features of size p, and (rp, cp) are the matching offsets stored in Pp; the two averaged quantities (shown as images in the original) denote the average column offset and the average row offset in the set Pp, respectively; i and j index the traversal of the set Pp, dist(·) is a distance function, and max(·) is a maximum function;
the determining of the spatial consistency score between each frame of image in the current video sequence and any frame of image, based on the region-matching spatial consistency scores of the respective sizes and the weight information corresponding to the regions of the respective sizes, is performed by the following formula:
[Formula (shown as an image in the original): weighted combination of the per-size scores into the frame-level spatial consistency score SS]
where SS denotes the spatial consistency score between each frame of image in the current video sequence and any frame of image, i traverses the set of scales, ns is the number of scales, wi is the weight information corresponding to size i, and wi ∈ [0, 1].
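The sketch below illustrates one hedged reading of these two formulas: a per-size score that rewards matches whose offsets agree with the average offset, combined across sizes with weights in [0, 1]. The exact functional form in the equation images may differ:

```python
import numpy as np

def frame_spatial_consistency(matches_by_scale, weights):
    """Frame-to-frame spatial consistency from multi-scale region matches.

    `matches_by_scale[p]` holds the matching offsets (r, c) of the size-p region
    matches; the per-scale score here rewards offsets close to their mean (an
    assumed reading, not the exact definition), and scales are combined with
    weights w_p in [0, 1].
    """
    total = 0.0
    for p, offsets in matches_by_scale.items():
        if not offsets:
            continue
        offs = np.asarray(offsets, dtype=float)             # (n_p, 2) row/column offsets
        mean_off = offs.mean(axis=0)                        # average offset of the matches
        spread = np.linalg.norm(offs - mean_off, axis=1)    # deviation from the average offset
        ss_p = float(np.mean(1.0 / (1.0 + spread)))         # high when offsets are consistent
        total += weights.get(p, 0.0) * ss_p
    return total
```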
In another possible implementation manner, the determining, based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence, a spatial consistency score corresponding to the current video sequence and any video sequence includes:
based on the weight information of each frame of image in the current video sequence and the spatial consistency scores between each frame of image in the current video sequence and each frame of image in any video sequence, determining the spatial consistency score corresponding to the current video sequence and any video sequence by the following formula:
[Formula (shown as an image in the original): definition of the sequence-level spatial consistency score VSS]
where VSS denotes the spatial consistency score between the current video sequence and any video sequence Vref, Vref belongs to the first preset number of video sequences, m denotes a frame in the current video sequence, k denotes a frame of Vref, and the weight term (shown as an image in the original) denotes the weight information of frame m.
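A hedged sketch of the sequence-level score: each query frame contributes its best frame-to-frame consistency against the reference sequence, weighted by that frame's weight (the exact combination in the equation image may differ; all callables here are assumed placeholders):

```python
def sequence_spatial_consistency(query_frames, ref_frames, frame_weight, frame_ss):
    """Sequence-level spatial consistency score (VSS) between the current video
    sequence and one reference sequence Vref.

    `frame_ss(m, k)` is assumed to return the frame-to-frame spatial consistency
    score between query frame m and reference frame k, and `frame_weight(m)` the
    weight information of frame m.
    """
    vss = 0.0
    for m in query_frames:
        best = max(frame_ss(m, k) for k in ref_frames)   # best match over the reference frames
        vss += frame_weight(m) * best
    return vss
```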
In a second aspect, a video scene retrieval apparatus is provided, including:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a current video sequence which comprises a plurality of frames of images;
the feature map extraction module is used for respectively extracting dense depth learning feature maps corresponding to the frames of images from the multiple frames of images;
the time domain feature fusion module is used for respectively carrying out time domain feature fusion on the basis of the dense deep learning feature maps respectively corresponding to the frames of images to obtain respective fused features;
the temporal-spatial feature aggregation processing module is used for performing temporal-spatial feature aggregation processing on the basis of the fused features corresponding to the frames of images respectively to obtain a global feature descriptor corresponding to the current video sequence;
and the first retrieval module is used for retrieving from a global database based on the global feature descriptors corresponding to the current video sequence to obtain a first preset number of video sequences.
In a possible implementation manner, the time domain feature fusion module is specifically configured to, when performing time domain feature fusion on the dense depth learning feature maps respectively corresponding to the frame images to obtain respective fused features:
and respectively corresponding to the dense deep learning feature maps based on the frame images, and performing time domain feature fusion through an attention mechanism to obtain respective fused features.
In another possible implementation manner, the spatio-temporal feature aggregation processing module is specifically configured to, when performing spatio-temporal feature aggregation processing based on the fused features respectively corresponding to each frame image to obtain a global feature descriptor corresponding to a current video sequence:
splicing the time domain characteristic graphs corresponding to the frames of images to obtain spliced characteristic graphs;
performing point-by-point convolution processing on the spliced feature map to obtain a convolution processing result;
carrying out normalization processing on the convolution processing result to obtain a result after the normalization processing;
and determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map.
In another possible implementation manner, the feature map after the stitching process includes a plurality of feature points;
the spatio-temporal feature aggregation processing module is specifically configured to, when determining the global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map:
clustering the plurality of feature points to obtain at least one clustering center;
determining the distance between each characteristic point and each clustering center, and determining the distance information corresponding to each clustering center, wherein the distance information corresponding to any clustering center is the distance between each characteristic point and any clustering center;
determining global representations respectively corresponding to the clustering clusters based on the distance information corresponding to each clustering center and the result after the normalization processing;
performing regularization processing on the global representations corresponding to the cluster clusters respectively;
splicing all the global representations after the regularization treatment;
and carrying out regularization processing on the global representation after splicing processing to obtain a global feature descriptor corresponding to the current video sequence.
In another possible implementation manner, the apparatus further includes: a multi-scale region feature extraction module, a spatio-temporal region feature matching module and a second retrieval module, wherein,
the multi-scale region extraction module is used for respectively extracting region features of the dense depth learning feature maps corresponding to the frames of images to obtain the respective corresponding multi-scale region features;
the space-time region feature matching module is used for carrying out region matching based on the respective corresponding multi-scale region features to obtain a space-time feature descriptor corresponding to the current video sequence;
and the second retrieval module is used for carrying out region matching on the video sequences with the first preset number based on the space-time feature descriptors corresponding to the current video sequences to obtain the video sequences with the second preset number.
In another possible implementation manner, the multi-scale region feature extraction module, when performing region feature extraction based on a dense depth learning feature map corresponding to any frame image to obtain a multi-scale region feature corresponding to any frame image, is specifically configured to:
determining a weighted residual feature map based on a dense depth learning feature map corresponding to any frame of image;
dividing the weighted residual error feature map into a plurality of area blocks;
and determining the area characteristic representation corresponding to each area block so as to obtain the multi-scale area characteristic corresponding to any frame of image.
In another possible implementation manner, when determining the weighted residual feature map based on the dense depth learning feature map corresponding to any frame of image, the multi-scale region feature extraction module is specifically configured to:
performing point-by-point convolution processing on the dense depth learning feature map corresponding to any frame of image to obtain a convolution result;
carrying out normalization processing on the convolution result to obtain a normalization result;
and determining the weighted residual error feature map based on the normalization result and the distance information corresponding to each cluster center.
In another possible implementation, the multi-scale region features are characterized by a region descriptor;
the spatio-temporal region feature matching module is specifically configured to, when performing region matching based on the respective corresponding multi-scale region features to obtain a spatio-temporal feature descriptor corresponding to the current video sequence:
carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and the region descriptors corresponding to other frames of images in the current video sequence respectively to obtain a region matching result corresponding to the current video sequence;
selecting a region descriptor meeting a preset condition from a region matching result corresponding to the current video sequence as a region descriptor corresponding to the current video sequence;
and respectively carrying out region feature matching on the region descriptors corresponding to the current video sequence and each video sequence stored in a global database to obtain region matching results of the current video sequence and each video sequence.
In another possible implementation manner, the spatio-temporal region feature matching module is specifically configured to, when performing region feature matching on a region descriptor corresponding to any frame image in the current video sequence and a region descriptor corresponding to any other frame image in the current video sequence to obtain a corresponding matching result:
determining the distance between the region descriptor corresponding to any frame of image and each region descriptor in each region of other frame of images in the current video sequence;
performing region feature matching between the region descriptors corresponding to any frame of image in the current video sequence and the region descriptors corresponding to any other frame of image in the current video sequence by the following formula to obtain a corresponding matching result:
[Formula (shown as an image in the original): the condition on the distance matrix D that defines the matching set Pmn between frame Tm and frame Tn]
where the element Dij of the matrix denotes the distance between the i-th region descriptor in frame Tm and the j-th region descriptor in frame Tn; the matrix D denotes the distances between all region descriptors in frame Tm and all region descriptors in frame Tn of the video sequence, frame Tm denotes any one frame of image, and Tn denotes any other frame of image in the current video sequence; Dij_k denotes the element with the smallest distance value in the j-th column of the matrix D, Di_kj denotes the element with the smallest distance value in the i-th row of the matrix D, and t denotes a threshold parameter; the matching items (i, j) that satisfy the condition form the matching set Pmn between frame Tm and frame Tn.
In another possible implementation manner, when the spatio-temporal region feature matching module selects a region descriptor with a preset condition from the region matching result corresponding to any one region, and the region descriptor is used as the region descriptor corresponding to any one region, the spatio-temporal region feature matching module is specifically configured to:
determining the average value of the distances meeting the preset conditions;
and determining the area descriptor corresponding to any area based on the average value of the distances meeting the preset condition.
In another possible implementation manner, when the spatio-temporal region feature matching module determines the region descriptor corresponding to any one region based on the average value of the distances satisfying the first preset condition, the spatio-temporal region feature matching module is specifically configured to:
based on the average value of the distances meeting the first preset condition, determining an area descriptor corresponding to any one area through the following formula:
[Formula (shown as an image in the original): selection of the region descriptor x' based on the average matching distance]
where x is a region in the group Si, Px is the set of matching items corresponding to region x in the frame matching sets containing region x, Dx is the set of all Dij extracted from the matching item set Px, the averaged quantity (shown as an image in the original) is the mean of the Dij elements in the set, and x' denotes the region descriptors determined for all regions in the set Si.
In another possible implementation manner, when the spatio-temporal region feature matching module performs region feature matching on the region descriptor corresponding to the current video sequence and any video sequence, the spatio-temporal region feature matching module is specifically configured to:
and respectively carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner, when the spatio-temporal region feature matching module performs region feature matching on the region descriptor corresponding to each frame of image and the region descriptor corresponding to any frame of image, the spatio-temporal region feature matching module is specifically configured to:
determining a distance vector corresponding to each frame of image based on a region descriptor corresponding to each region in each frame of image and a region descriptor corresponding to each region in any frame of image, wherein the distance vector corresponding to each frame of image comprises a plurality of elements, and any element is the distance between the region descriptor corresponding to any region in each frame of image and the region descriptor corresponding to any region in any frame of image.
In another possible implementation manner, when the second retrieval module performs region matching on the video sequences of the first preset number based on the spatio-temporal feature descriptors corresponding to the current video sequence to obtain video sequences of a second preset number, the second retrieval module is specifically configured to:
determining a spatial consistency score corresponding to the current video sequence and each video sequence respectively based on the region matching result corresponding to the current video sequence and each video sequence respectively, wherein each video sequence belongs to the video sequences with the first preset number;
reordering the video sequences with the first preset number based on the spatial consistency scores respectively corresponding to the current video sequence and each video sequence;
and extracting a second preset number of video sequences from the sorted first preset number of video sequences.
In another possible implementation manner, when determining a spatial congruency score corresponding to the current video sequence and any video sequence, the second retrieving module is specifically configured to:
determining a spatial consistency score between each frame of image in a current video sequence and each frame of image in any video sequence;
determining the weight information of each frame of image in the current video sequence;
and determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner, when determining the spatial consistency score between each frame of image in the current video sequence and any frame of image, the second retrieving module is specifically configured to:
determining region matching space consistency scores of various sizes;
determining weight information corresponding to the regions of each size respectively;
and determining a spatial consistency score between each frame of image and any frame of image in the current video sequence based on the region matching spatial consistency scores of the sizes and the weight information respectively corresponding to the regions of the sizes.
In another possible implementation manner, when determining that the region of any size matches the spatial consistency score, the second retrieval module is specifically configured to:
determining a region matching spatial consistency score for any size by the following formula:
[Formula (shown as an image in the original): definition of the size-p region-matching spatial consistency score SSp]
where SSp denotes the region-matching spatial consistency score for size p, np denotes the number of region blocks of size p extracted from the frame image, Pp is the set of matched region features of size p, and (rp, cp) are the matching offsets stored in Pp; the two averaged quantities (shown as images in the original) denote the average column offset and the average row offset in the set Pp, respectively; i and j index the traversal of the set Pp, dist(·) is a distance function, and max(·) is a maximum function;
the second retrieval module, when determining the spatial consistency score between each frame of image in the current video sequence and any frame of image based on the region-matching spatial consistency scores of the respective sizes and the weight information corresponding to the regions of the respective sizes, is specifically configured to use the following formula:
[Formula (shown as an image in the original): weighted combination of the per-size scores into the frame-level spatial consistency score SS]
where SS denotes the spatial consistency score between each frame of image in the current video sequence and any frame of image, i traverses the set of scales, ns is the number of scales, wi is the weight information corresponding to size i, and wi ∈ [0, 1].
In another possible implementation manner, when determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence, the second retrieval module is specifically configured to:
based on the weight information of each frame of image in the current video sequence and the spatial consistency scores between each frame of image in the current video sequence and each frame of image in any video sequence, determining the spatial consistency score corresponding to the current video sequence and any video sequence by the following formula:
[Formula (shown as an image in the original): definition of the sequence-level spatial consistency score VSS]
where VSS denotes the spatial consistency score between the current video sequence and any video sequence Vref, Vref belongs to the first preset number of video sequences, m denotes a frame in the current video sequence, k denotes a frame of Vref, and the weight term (shown as an image in the original) denotes the weight information of frame m.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the operations corresponding to the video scene retrieval method shown in any possible implementation manner of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the video scene retrieval method according to any one of the possible implementations of the first aspect.
In summary, the present application includes at least one of the following beneficial technical effects:
compared with the related technology, in the method, time domain feature fusion is carried out based on dense deep learning feature maps corresponding to each frame of image in a current video sequence, space-time feature aggregation processing is carried out according to the fused features, and a global feature descriptor corresponding to the current video sequence is obtained, namely the space-time feature of the current video sequence can be reflected in the global feature descriptor corresponding to the current video sequence, so that retrieval is carried out from a global database based on the global feature descriptor corresponding to the current video sequence, the influence of the change of the surrounding environment of a scene, local shielding and the like on scene re-identification can be reduced, the accuracy of the retrieved video sequence can be improved, and user experience can be improved.
Drawings
Fig. 1 is a schematic flowchart of a video scene retrieval method in an embodiment of the present application;
FIG. 2 is a schematic diagram of a time domain feature fusion network structure based on a self-attention mechanism in an embodiment of the present application;
FIG. 3 is a schematic diagram of a network model architecture of a TemporalVLAD according to an embodiment of the present application;
FIG. 4 is an exemplary diagram of video scene retrieval in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an apparatus for video scene retrieval in an embodiment of the present application;
fig. 6 is a schematic device structure diagram of an electronic apparatus in an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to the attached drawings.
The present embodiment is intended only to explain the present application and does not limit it. After reading this specification, those skilled in the art may modify the embodiment as needed without making an inventive contribution, and all such modifications are protected by patent law within the scope of the claims of the present application.
The embodiment of the application provides a video scene retrieval method. The main goal of visual scene retrieval is to find, based on the current observation information, the observation information (images or video sequences) captured at the same geographic position when the scene map was established.
the vision-based scene retrieval is mainly distinguished from the general image retrieval/video retrieval by three points:
1. The main criterion by which general image/video retrieval measures similarity is whether two images contain the same object category or have similar appearance, whereas the main criterion in visual scene retrieval is whether two images were taken at the same geographic position: even if external factors such as weather or season make their appearances very different, the similarity should still be high as long as their positions are close enough;
2. General image/video retrieval mainly focuses on the foreground objects in an image, whereas vision-based scene retrieval mainly focuses on the background regions of the image;
3. General image/video retrieval can usually be performed offline, whereas vision-based scene retrieval is often applied in settings with strong real-time requirements, such as relocalization and loop closure detection in SLAM. Therefore, besides requiring low algorithmic complexity, scene retrieval needs an efficient global representation of the observation information (image or video sequence) that is easy to compute and store, for example by converting the observation into a vector or a matrix.
In the related art, most visual scene re-identification techniques compute similarity based on a single frame image, for example Bag of Words (BoW), Fisher Vectors (FV), and the Vector of Locally Aggregated Descriptors (VLAD); however, the accuracy of such single-frame scene retrieval drops noticeably when there is a viewpoint change between the two observations. In addition, existing video-based scene retrieval methods in the related art mainly aggregate information on top of the global representations of single frames, ignore the spatio-temporal information of the video sequence, and are therefore still limited by single-frame retrieval accuracy.
Aiming at the problem of low recall rate when the visual angle is changed in image scene retrieval, the embodiment of the application provides a method for extracting space-time hierarchical features from a short video sequence. In the embodiment of the present application, video scene retrieval is performed from coarse to fine by using the characteristics of space-time hierarchy, which is described in the following embodiments:
(1) First, coarse-grained fast video scene retrieval is performed: for each frame of image in a video sequence, dense deep learning feature points are extracted by a neural network; a self-attention mechanism is then used to aggregate information and optimize the descriptors of co-visible feature points across different frames; finally, a TemporalVLAD layer clusters the feature points over the multiple frames in the time domain, retaining the observation information unique to each frame while removing redundant inter-frame observations, thereby generating a high-dimensional vector as the global representation of the video sequence, and scene retrieval is performed using the distances between these global representations. The coarse-grained fast video scene retrieval branch has the advantages of high retrieval speed and high storage efficiency;
(2) then fine-grained optimization sorting is carried out, region matching is carried out by using the region block features on the feature map, and an image pyramid is constructed to extract multi-scale region features; for the regional matching result, the relative offset between all the matching pairs is used to define the image similarity, so as to optimize the search sorting result of the coarse-grained branch. The fine-grained optimization sorting branch has the advantages of high retrieval recall rate, high view angle change/local shielding robustness and the like;
(3) The method reduces the amount of computation and the algorithm complexity, and has the advantages of high real-time performance, low latency, and the like.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship, unless otherwise specified.
The embodiments of the present application will be described in further detail with reference to the drawings attached hereto.
The embodiment of the application provides a video scene retrieval method, which can be executed by an electronic device, wherein the electronic device can be a server or a terminal device, wherein the server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service. The terminal device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like, but is not limited thereto, and the terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited thereto.
It should be noted that the electronic device for executing the video scene retrieval method may further include: unmanned vehicles and intelligent robots.
Further, as shown in fig. 1, the method may include:
step 101, obtaining a current video sequence.
For the embodiment of the present application, the current video sequence is a video sequence to be subjected to scene retrieval. In this embodiment of the present application, the current video sequence may be acquired by a camera disposed in an unmanned vehicle or an intelligent robot, or may be acquired from other devices, which is not limited in this embodiment of the present application.
Specifically, in the embodiment of the present application, a current video sequence includes a plurality of frames of images.
And S102, extracting dense depth learning feature maps corresponding to the frames of images from the multiple frames of images respectively.
Specifically, in the embodiment of the present application, extracting the dense depth learning feature maps corresponding to the respective frame images from the multiple frame images may specifically be implemented by using a feature extraction network, and may also extract the dense depth learning feature maps corresponding to the respective frame images from the respective frame images by using other manners.
And S103, respectively performing time domain feature fusion on the basis of the dense depth learning feature maps respectively corresponding to the frame images to obtain respective fused features.
And step S104, performing space-time feature aggregation processing based on the fused features respectively corresponding to the frame images to obtain a global feature descriptor corresponding to the current video sequence.
And S105, retrieving from a global database based on the global feature descriptors corresponding to the current video sequence to obtain a first preset number of video sequences.
For the embodiment of the present application, the global database stores a global feature descriptor and a regional feature descriptor, and in the embodiment of the present application, the global feature descriptor and the regional feature descriptor stored in the global database are respectively corresponding to global feature descriptors and regional feature descriptors of different video sequences.
It should be noted that the different video sequences may include video sequences corresponding to different geographic positions. The same geographic position may correspond to one video sequence or to multiple video sequences, and one video sequence may of course also correspond to at least two geographic positions, which is not limited in this embodiment of the application. In addition, the video sequence construction manner used when building the scene map and when performing scene retrieval may be the same or different; because the video sequences used for map building can be constructed offline, key frames located at approximately the same place can be clustered according to their geographic position relationship (such as translation distance, rotation angle, and GPS coordinate distance) to construct the video sequences.
Compared with the related technology, in the embodiment of the application, time domain feature fusion is carried out based on dense deep learning feature maps corresponding to each frame of image in a current video sequence, space-time feature aggregation processing is carried out according to the fused features, and a global feature descriptor corresponding to the current video sequence is obtained, namely the space-time features of the current video sequence can be reflected in the global feature descriptor corresponding to the current video sequence, so that retrieval is carried out from a global database based on the global feature descriptor corresponding to the current video sequence, the influence of changes of the surrounding environment of a scene, local shielding and the like on scene re-identification can be reduced, the accuracy of the retrieved video sequence can be improved, and user experience can be improved.
Further, after the current video sequence is acquired, in order to avoid wasting computing resources when the overlap between the regions observed by the video frames in the current video sequence is too high, step S102 may further include: extracting key frames from the current video sequence. In this embodiment of the application, the key frame extraction criteria may be based on the overlap percentage of the observation regions, the number of matched feature points, the geographic position relationship between the two frames (such as translation distance, rotation angle, and GPS coordinate distance), and the like, or may reuse existing key frame extraction strategies from Simultaneous Localization And Mapping (SLAM) frameworks, such as the feature-point-based extraction strategy in ORB-SLAM (Oriented FAST and Rotated BRIEF SLAM) and the optical-flow-based extraction strategy in DSO-SLAM (Direct Sparse Odometry SLAM), and the like.
Further, in order to further reduce the waste of computing resources, after extracting key frames from the current video sequence, M frames can be selected from the extracted key frames to form a new video sequence. In the embodiment of the present application, the M frames selected from the extracted key frames may be consecutive M key frames, or key frames selected at equal intervals, or a common-view relationship between key frames may be determined by feature point matching and an optical flow method, so as to dynamically select key frames with a longer time interval. Further, if a plurality of new video sequences need to be constructed, the plurality of constructed new video sequences may be of equal length or different lengths, and M may be between [3, 15 ].
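A small sketch of picking the M frames as described above, either as consecutive key frames or at equal intervals (the co-visibility-based dynamic selection is not shown; the default M is an arbitrary choice within [3, 15]):

```python
def select_sequence_frames(keyframes, M=5, mode="uniform"):
    """Picks M frames from the extracted key frames to build a new video sequence.

    "uniform" takes equally spaced key frames, "consecutive" takes the most recent M;
    M is assumed to lie in [3, 15] as suggested in the text.
    """
    if len(keyframes) <= M:
        return list(keyframes)
    if mode == "consecutive":
        return list(keyframes[-M:])
    # equally spaced indices over the key frame list
    idx = [round(i * (len(keyframes) - 1) / (M - 1)) for i in range(M)]
    return [keyframes[i] for i in idx]
```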
Further, after extracting M frames from the key frame, in step S102, extracting the dense depth learning feature map corresponding to each frame image from the multi-frame image respectively, which may specifically include: and respectively extracting dense depth learning feature maps corresponding to the frames of images from the M frames of images.
Specifically, in the embodiment of the present application, dense deep learning feature maps corresponding to each frame of image are respectively extracted from M frames of images through a feature extraction network.
Specifically, in order to improve the recall rate of scene retrieval when external factors such as viewing angle, illumination and scene appearance change, and to better fuse local information within a single frame image, a neural network is used as the feature extraction network in the embodiment of the present application. The feature extraction network includes, but is not limited to, common deep learning backbone networks such as VGG, U-Net, ResNet, RegNet, AlexNet, GoogLeNet and MobileNet.
Further, before each frame image is input to the feature extraction network for feature extraction, each frame image needs to be resized to the same size, and then the frames of the same size are passed through the feature extraction network. In addition, in order to use the feature extraction network as the public network shared by the subsequent coarse-grained fast retrieval branch and fine-grained optimized sorting branch, the multiple frames of one video sequence need to be moved to the batch dimension before being input to the feature extraction network; the interior of a video sequence is never randomly shuffled, and only the order of different video sequences, each treated as a whole, is shuffled. In addition, when multiple Graphics Processing Units (GPUs) are used, it must also be ensured that the images of the same video sequence are dispatched to the same GPU for processing. In this embodiment of the present application, performing feature extraction on the frame images of the same size through the feature extraction network may specifically include: performing feature extraction on the frame images of the same size through the feature extraction network, and/or resizing the frame images of the same size to different sizes through an image pyramid operation before feature extraction; the corresponding layers of the image pyramids of different frame images have the same image size.
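The batching convention described above (frames of one sequence moved to the batch dimension and never shuffled internally) might look as follows in a PyTorch-style sketch; the VGG16 backbone, the target size and the tensor layout are assumptions for illustration, and the `weights` argument follows recent torchvision versions.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Hypothetical backbone: any of the networks listed above could be used;
# VGG16's convolutional part is taken here purely as an example.
backbone = models.vgg16(weights=None).features.eval()

def extract_dense_features(video_sequences, size=(480, 640)):
    """video_sequences: tensor of shape (B, T, 3, H, W) -- B video sequences of
    T frames each. Frames of one sequence stay together and are only moved to
    the batch dimension, so the internal frame order is never shuffled."""
    B, T, C, H, W = video_sequences.shape
    frames = video_sequences.reshape(B * T, C, H, W)       # move frames to the batch dim
    frames = F.interpolate(frames, size=size, mode='bilinear', align_corners=False)
    with torch.no_grad():
        feats = backbone(frames)                           # (B*T, C', H', W') dense maps
    Cp, Hp, Wp = feats.shape[1:]
    return feats.reshape(B, T, Cp, Hp, Wp)                 # restore the sequence layout
```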
Further, in order to make the model more robust to the change of the view angle and reduce the influence of local occlusion, and meanwhile, in order to make the model pay more attention to the stable features observed for many times in the video sequence and reduce the interference of dynamic objects, the embodiment of the present application performs temporal fusion on the extracted feature maps through a self-attention mechanism to update the features observed repeatedly in the video sequence.
Specifically, in step S103, time domain feature fusion is performed based on the dense deep learning feature maps corresponding to each frame of image, so as to obtain the respective fused features, which may specifically include: performing time domain feature fusion through a self-attention mechanism based on the dense deep learning feature maps corresponding to each frame of image, so as to obtain the respective fused features. In the embodiment of the present application, the time domain feature fusion network based on the self-attention mechanism is implemented by using a 3D Non-local network, and the features in the time domain are fused as shown in formula (1):

$$y_i = \frac{1}{C(x)} \sum_{j} f(x_i, x_j)\, g(x_j) \qquad (1)$$

where x is the input feature map, i and j are different coordinates on the feature map, x_i and x_j are the values of the feature map at those points, the function f(·,·) measures the similarity between two points (e.g., Gaussian similarity or dot-product similarity), the function g(·) computes the feature value of the feature map at position j, C(x) is a normalization factor, and y_i is the value of the output feature map at coordinate i.
Specifically, fig. 2 shows the detailed structure of the time domain feature fusion network based on the self-attention mechanism used in the embodiment of the present application. The input feature map X (T×H×W×1024) is first linearly mapped: 1×1×1 convolutions are used to compress the channel number, producing the three features θ(X) (T×H×W×512), φ(X) (T×H×W×512) and g(X) (T×H×W×512). All dimensions of these three features except the channel dimension are then flattened together. To compute the autocorrelation between features, a matrix dot product is performed between θ(X) and φ(X), yielding the relationship of each pixel in each frame to all pixels in the other frames; the autocorrelation result is normalized with Softmax to obtain values in [0,1], which serve as the self-attention weights. The self-attention weights are multiplied with the feature matrix g(X), the result is up-sampled, and finally a residual operation with the original input feature map X is performed, giving the output Z (T×H×W×1024) of the time domain feature fusion network.
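A minimal PyTorch sketch of such a 3D Non-local block is given below, using the channels-first (N, C, T, H, W) layout rather than the T×H×W×C layout of the description; the θ/φ/g naming and the final 1×1×1 convolution standing in for the up-sampling step are assumptions.

```python
import torch
import torch.nn as nn

class NonLocal3D(nn.Module):
    """A minimal 3D Non-local block in the spirit of formula (1); channel
    sizes follow the 1024 -> 512 -> 1024 layout described above."""
    def __init__(self, channels=1024, inner=512):
        super().__init__()
        self.theta = nn.Conv3d(channels, inner, kernel_size=1)
        self.phi   = nn.Conv3d(channels, inner, kernel_size=1)
        self.g     = nn.Conv3d(channels, inner, kernel_size=1)
        self.out   = nn.Conv3d(inner, channels, kernel_size=1)  # maps back to C channels

    def forward(self, x):
        n, c, t, h, w = x.shape
        q = self.theta(x).reshape(n, -1, t * h * w)          # (N, C', THW)
        k = self.phi(x).reshape(n, -1, t * h * w)
        v = self.g(x).reshape(n, -1, t * h * w)
        # pixel-to-pixel similarity across all frames, Softmax-normalised to [0,1]
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (N, THW, THW)
        y = (v @ attn.transpose(1, 2)).reshape(n, -1, t, h, w)
        return x + self.out(y)                               # residual connection with the input X
```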
It should be noted that besides the 3D Non-local network, other network variants based on the self-attention mechanism may also be used for temporal feature fusion and fall within the scope of the embodiments of the present application, including but not limited to Transformer networks, temporal Non-local networks, graph neural networks (GNNs) based on the self-attention mechanism, and so on.
Further, all features in the same video sequence are aggregated so that the unique observation information of each frame is retained and the redundant inter-frame observation information is removed, generating one high-dimensional vector as the global representation of the video sequence; representing one video sequence with one vector facilitates both fast scene retrieval and more efficient storage.
Specifically, in step S104, performing spatio-temporal feature aggregation based on the fused features respectively corresponding to each frame of image to obtain a global feature descriptor corresponding to the current video sequence, which may specifically include: step S1041 (not shown), step S1042 (not shown), step S1043 (not shown), and step S1044 (not shown), wherein,
and S1041, splicing the time domain feature maps corresponding to the frames of images respectively to obtain spliced feature maps.
Specifically, in the embodiment of the present application, time domain feature maps corresponding to each frame of image in the same video sequence may be stitched along a long side to obtain a stitched feature map, or may be stitched along a short side to obtain a stitched feature map. In the embodiment of the present application, a time domain feature map corresponding to each frame image is spliced along a long side to obtain a spliced feature map.
And step S1042, performing point-by-point convolution processing on the spliced feature map to obtain a convolution processing result.
And step S1043, carrying out normalization processing on the convolution processing result to obtain a result after normalization processing.
Specifically, in this embodiment of the present application, the normalizing the convolution processing result may specifically include: and carrying out index normalization processing on the convolution processing result through a normalization index function. In the embodiment of the present application, the normalized exponential function, or Softmax function, is a generalization of the logistic function, which can "compress" a K-dimensional vector z containing any real number into another K-dimensional real vector σ (z), so that each element ranges between (0,1), and the sum of all elements is 1.
And S1044, determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map.
Specifically, in the embodiment of the present application, the feature map after the stitching process includes a plurality of feature points; in step S1044, based on the result after the normalization processing and the feature map after the splicing, determining a global feature descriptor corresponding to the current video sequence may specifically include: clustering the plurality of feature points to obtain at least one clustering center; determining the distance between each feature point and each clustering center, and determining the corresponding distance information of each clustering center; determining the global representation corresponding to each cluster based on the distance information corresponding to each cluster center and the result after normalization processing; performing regularization treatment on the global representations corresponding to the cluster clusters respectively; splicing all the global representations after the regularization treatment; and carrying out regularization processing on the global representation after splicing processing to obtain a global feature descriptor corresponding to the current video sequence.
And the distance information corresponding to any clustering center is the distance between each characteristic point and any clustering center.
Specifically, in the embodiment of the present application, spatio-temporal feature aggregation is performed by using TemporalVLAD as the spatio-temporal feature aggregation network, whose network model architecture is shown in fig. 3. The feature maps after time domain fusion are spliced in the time domain; the splicing result is then subjected to point-by-point convolution and exponential normalization in turn to obtain the normalization result; the time domain splicing result and the normalization result are processed by a residual calculation module; and intra-cluster regularization and global regularization are performed in turn to obtain the global descriptor of the video sequence. Specifically, among the feature maps output by the self-attention-based time domain feature fusion network, the feature maps belonging to the same video sequence are spliced along their long sides to obtain a spliced feature map F; F is convolved point by point with a 1×1 convolution kernel, and the result is exponentially normalized with Softmax to obtain a result a. The spliced feature map F (regarded as dense feature points) is taken, all feature points of all feature maps are collected, and unsupervised clustering is performed with the KMeans++ algorithm to obtain K cluster centers; the residual calculation module then computes the distance between each point of the feature map F and each of the K clusters, and a weighted summation is performed using the result a output by the exponential normalization unit as the weights, yielding K vectors that correspond to the global representations of the K clusters. The vector of each cluster is then regularized, the K cluster vectors are spliced together, and global regularization is applied, yielding a high-dimensional vector that serves as the global descriptor of the entire video sequence.

Here K lies in [16,128], and the regularization operations include, but are not limited to, L1 regularization, L2 regularization, and the like.
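The aggregation step can be sketched in NetVLAD-style PyTorch code as below; this is a hedged approximation under the assumption that the cluster centres are learnable parameters (initialised in practice from KMeans++), and the class name `TemporalVLADSketch` is illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalVLADSketch(nn.Module):
    """Frames of one sequence are stitched spatially, soft-assigned to K
    cluster centres via a 1x1 convolution + Softmax, and the weighted
    residuals to the centres are summed, intra-normalised and globally
    normalised into one global descriptor."""
    def __init__(self, channels=1024, K=64):
        super().__init__()
        self.assign = nn.Conv2d(channels, K, kernel_size=1)   # point-wise convolution
        self.centers = nn.Parameter(torch.randn(K, channels)) # stand-in for KMeans++ centres

    def forward(self, seq_feats):
        # seq_feats: (T, C, H, W) time-domain-fused maps of one video sequence
        T, C, H, W = seq_feats.shape
        f = torch.cat(list(seq_feats), dim=1)          # stitch the T maps along one side -> (C, T*H, W)
        f = f.unsqueeze(0)                             # (1, C, T*H, W)
        a = torch.softmax(self.assign(f), dim=1)       # (1, K, T*H, W) soft assignment weights
        x = f.flatten(2)                               # (1, C, N) with N = T*H*W
        a = a.flatten(2)                               # (1, K, N)
        # residuals of every feature point to every cluster centre, weighted by a
        residual = x.unsqueeze(1) - self.centers.view(1, -1, C, 1)    # (1, K, C, N)
        vlad = (residual * a.unsqueeze(2)).sum(dim=-1)                # (1, K, C)
        vlad = F.normalize(vlad, dim=2)                # intra-cluster regularisation
        vlad = F.normalize(vlad.flatten(1), dim=1)     # global regularisation
        return vlad.squeeze(0)                         # global descriptor of the sequence, length K*C
```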
Further, after performing space-time aggregation processing based on the above embodiment to obtain a global descriptor of the entire video sequence, the global descriptor of the currently observed video sequence is used to perform coarse-grained fast retrieval in the global database, so as to reduce the number of video sequences that need to be retrieved accurately, and reduce the time consumption of fine-grained optimization sorting branch calculation.
Specifically, in the embodiment of the present application, the distances between the global descriptor of the current video sequence and the global descriptors of the video sequences stored in the global database are calculated in turn to determine the similarity between the current video sequence and each stored video sequence, and the TopK1 video sequences most similar to the currently observed scene are then retrieved from the global database by the coarse-grained fast retrieval. The distance between the global descriptor of the current video sequence and the global descriptor of any video sequence stored in the global database can be characterized by the Manhattan distance, Euclidean distance, Minkowski distance and the like; the smaller the distance between the global descriptors, the higher the similarity.
It should be noted that, in the embodiment of the present application, the value of TopK1 may be input by a user or preset, and is not limited in the embodiment of the present application; for example, TopK1 lies in [20,100].
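The coarse-grained retrieval itself reduces to a nearest-neighbour search over the stored global descriptors; a minimal sketch using the Euclidean distance (any of the listed distances would do) could look like this, where `k1` stands for TopK1.

```python
import torch

def coarse_retrieve(query_desc, db_descs, k1=50):
    """query_desc: (D,) global descriptor of the current video sequence.
    db_descs: (N, D) global descriptors stored in the global database.
    Returns the indices and distances of the K1 most similar sequences."""
    dists = torch.cdist(query_desc.unsqueeze(0), db_descs).squeeze(0)   # (N,) Euclidean distances
    topk = torch.topk(dists, k=min(k1, db_descs.shape[0]), largest=False)
    return topk.indices, topk.values
```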
It should be noted that, according to the above embodiment: the global database stores global descriptors corresponding to the plurality of video sequences, wherein the global descriptors corresponding to the plurality of video sequences stored in the global database are the same as the global descriptors corresponding to the current video sequence determined based on the current video sequence in the above embodiment, and details are not repeated here.
Further, in order to further improve the recall rate of retrieval and to improve robustness to view angle changes and local occlusion, fine-grained optimized sorting is performed in the embodiment of the present application to optimize the retrieval ranking result of the coarse-grained branch. Further, after the dense deep learning feature maps corresponding to the respective frames of images are extracted from the multiple frames of images in step S102, the method may further include: step S106 (not shown in the figure), step S107 (not shown in the figure), and step S108 (not shown in the figure). Step S106 and step S107 may be executed before steps S103 to S105, after steps S103 to S105, or simultaneously with at least one of steps S103 to S105; any possible execution order falls within the protection scope of the embodiment of the present application and is not limited here. Steps S106 to S108 are detailed in the following embodiments:
and S106, respectively extracting the regional characteristics of the dense depth learning characteristic graphs corresponding to the frame images to obtain the corresponding multi-scale regional characteristics.
Specifically, in the embodiment of the present application, the dense depth learning feature maps respectively corresponding to each frame image may be subjected to region feature extraction by using a multi-scale region feature extraction model, or may not be subjected to region feature extraction by using a region feature extraction model.
In the embodiment of the application, region feature extraction is performed on the dense deep learning feature map corresponding to each frame of image through a region feature extraction model. Similar to the design of the spatio-temporal feature aggregation network, the multi-scale region feature extraction model first performs point-by-point convolution and exponential normalization on the feature maps output by the feature extraction network; the obtained result is used to weight the residuals between the original feature map and the K cluster centers, finally yielding the weighted residual feature map R. The difference is that the multi-scale region feature extraction model does not perform time domain feature splicing, nor does it perform summation and the subsequent regularization operations on the weighted residual feature map R.
Specifically, performing region feature extraction based on a dense depth learning feature map corresponding to any frame of image to obtain a multi-scale region feature corresponding to any frame of image, which may specifically include: determining a weighted residual error feature map based on a dense depth learning feature map corresponding to any frame of image; dividing the weighted residual error characteristic diagram into a plurality of area blocks; and determining the area characteristic representation corresponding to each area block so as to obtain the multi-scale area characteristic corresponding to any frame of image.
Specifically, determining a weighted residual feature map based on a dense depth learning feature map corresponding to any frame of image may specifically include: performing point-by-point convolution processing on the dense depth learning feature map corresponding to any frame of image to obtain a convolution result; carrying out normalization processing on the convolution result to obtain a normalization result; and determining a weighted residual feature map based on the normalization result and the dense depth learning feature map corresponding to any frame of image.
For the embodiment of the present application, the way of performing point-by-point convolution and normalization processing on the dense-depth learning feature map corresponding to any frame of image is specifically described in the embodiment of spatio-temporal feature aggregation, and details are not described in this embodiment of the present application.
Further, determining a weighted residual feature map based on the normalization result and the dense depth learning feature map corresponding to any frame of image may specifically include: obtaining K clustering centers by a clustering algorithm for all feature points in a dense deep learning feature map corresponding to any frame of image extracted by the feature extraction model; determining the distance between each characteristic point and each clustering center, and determining the corresponding distance information of each clustering center; and determining a weighted residual characteristic diagram R based on the distance information corresponding to each cluster center and the result after the normalization processing.
And the distance information corresponding to any clustering center is the distance between each characteristic point and any clustering center.
Further, after obtaining the weighted residual feature map R, dividing the region blocks on the weighted residual feature map R by using a sliding window with a size of p × p, and regularizing a mean value of residuals in each region block to be a descriptor of the region, where the regularizing operation includes, but is not limited to, L1 regularization, L2 regularization, and the like. In order to enhance the robustness of the region descriptor to the change of the viewing angle, the size p of the sliding window may be changed, or the sliding windows with different sizes may be used to divide the region blocks, and each region block generates a region descriptor correspondingly. Since the region descriptors are generated by mean calculations, the region descriptor dimensions of different regions are the same.
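A compact sketch of this multi-scale region descriptor extraction is shown below, assuming non-overlapping p×p blocks (a stride smaller than p would give overlapping sliding windows) and L2 regularisation of each block mean; the window sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def region_descriptors(R, window_sizes=(2, 4, 8)):
    """R: (C, H, W) weighted residual feature map of one frame.
    For every window size p, the map is split into p x p blocks, the residuals
    inside each block are averaged, and the mean is L2-regularised to give one
    region descriptor per block; descriptors of all regions share dimension C."""
    descs = {}
    for p in window_sizes:
        pooled = F.avg_pool2d(R.unsqueeze(0), kernel_size=p, stride=p)  # (1, C, H//p, W//p)
        d = pooled.squeeze(0).flatten(1).t()          # (num_blocks, C), one row per region block
        descs[p] = F.normalize(d, dim=1)              # L2 regularisation of every region descriptor
    return descs
```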
It should be noted that, in the above manner, the multi-scale region feature is extracted for each frame of image in the video sequence, and the region descriptor is used as a way of representing and storing the region feature.
Further, in the embodiment of the present application, the multi-scale regional feature extraction module extracts regional features from each frame of image in the video sequence, which may also be considered as fusion of dense feature points extracted by the foregoing feature extraction network in a spatial domain, so as to perform optimized ordering on the results of the fast search branches by using the regional features subsequently.
Further, after extracting the multi-scale region features corresponding to each frame of image, to further enhance the robustness of the region features to the view angle change, the multi-scale region features corresponding to each frame of image are continuously subjected to region matching, which is described in detail in the following embodiments.
And S107, carrying out region matching based on the respective corresponding multi-scale region characteristics to obtain a space-time characteristic descriptor corresponding to the current video sequence.
Wherein the multi-scale region features are characterized by a region descriptor.
Specifically, in the embodiment of the present application, performing region matching based on respective corresponding multi-scale region features includes: the region matching between frames of the same video sequence aims to update the original region descriptor using robust region descriptors at different viewing angles (see the following steps S1071-S1072), and then perform region matching in the time domain of different video sequences (see the following step S1073), that is, in step S107, perform region matching based on respective corresponding multi-scale region features to obtain a spatio-temporal feature descriptor corresponding to the current video sequence, which may specifically include: step S1071 (not shown), step S1072 (not shown), and step S1073 (not shown), wherein,
step S1071, the area descriptor corresponding to each frame of image in the current video sequence is matched with the area descriptor corresponding to each other frame of image in the current video sequence for area feature, and the area matching result corresponding to the current video sequence is obtained.
For the embodiment of the present application, the region descriptor corresponding to each frame of image in the current video sequence is subjected to region feature matching with the region descriptors corresponding to the other frames of images in the current video sequence. This may specifically be based on bidirectional matching and a ratio test, or on other region matching manners, including but not limited to K-nearest-neighbor matching, greedy nearest neighbor (Greedy-NN) matching, K-d tree (K-dimensional tree) matching, and the like. In the embodiment of the present application, the method based on bidirectional matching and a ratio test is described as an example.
Specifically, performing region feature matching on the region descriptor corresponding to each frame of image and the region descriptor corresponding to any frame of image may specifically include: and determining the distance vector corresponding to each frame of image based on the region descriptor corresponding to each region in each frame of image and the region descriptor corresponding to each region in any frame of image. The distance vector corresponding to each frame of image comprises a plurality of elements, and any element is the distance between the area descriptor corresponding to any area in each frame of image and the area descriptor corresponding to any area in any frame of image.
Specifically, in the embodiment of the present application, the specific manner of region feature matching is described by taking frames Tm and Tn as an example, where m, n ∈ [1, M] and m ≠ n. The distances between all region descriptors of frame Tm and all region descriptors of frame Tn in the video sequence are calculated to form a distance matrix D, where the element D_ij represents the distance between the i-th region descriptor in frame Tm and the j-th region descriptor in frame Tn. The distance includes, but is not limited to, the Manhattan distance, Euclidean distance, Minkowski distance, and the like.
Further, the area descriptor corresponding to each frame of image in the current video sequence and the area descriptors corresponding to other frames of images in the current video sequence can be calculated and obtained through the method, and area feature matching is carried out on the area descriptors, so that an area matching result corresponding to the current video sequence is obtained.
Step S1072, selecting a region descriptor satisfying a preset condition from the region matching result corresponding to the current video sequence as the region descriptor corresponding to the current video sequence.
Specifically, selecting the region descriptors satisfying the preset condition from the distance vectors corresponding to each frame of image may specifically include determining the matches satisfying the preset condition by the following formula (2):

$$P_{mn} = \{(i,j) \mid D_{ij} \le t \cdot D_{i^k j},\ D_{ij} \le t \cdot D_{i j^k}\} \qquad (2)$$

where $D_{i^k j}$ denotes the element of the distance matrix D with the smallest distance value in the j-th column (excluding D_ij itself), $D_{i j^k}$ denotes the element with the smallest distance value in the i-th row (excluding D_ij itself), and t is a threshold parameter. The matching items (i, j) satisfying the condition form the matching set P_mn between frame Tm and frame Tn, and the corresponding distance values D_ij are also stored in the matching set. The threshold t lies in [0.5, 0.9].
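Under the reading of formula (2) given above (mutual nearest neighbour plus a ratio test against the second-best distance), the matching of two frames' region descriptors might be sketched as follows; the Euclidean distance and the threshold value are illustrative assumptions.

```python
import numpy as np

def match_regions(desc_m, desc_n, t=0.7):
    """desc_m: (A, C) region descriptors of frame Tm; desc_n: (B, C) of frame Tn.
    A pair (i, j) is kept when D_ij is a mutual minimum of its row and column
    and passes the ratio test against the second-best distance in its row."""
    D = np.linalg.norm(desc_m[:, None, :] - desc_n[None, :, :], axis=-1)  # (A, B) distance matrix
    matches = {}
    for i in range(D.shape[0]):
        order = np.argsort(D[i])
        j = order[0]
        # mutual nearest neighbour check (bidirectional matching)
        if np.argmin(D[:, j]) != i:
            continue
        # ratio test: best distance must be clearly below the second best
        if len(order) > 1 and D[i, j] > t * D[i, order[1]]:
            continue
        matches[(i, j)] = D[i, j]
    return matches            # matching set P_mn with the distances D_ij stored alongside
```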
After all (m, n) combinations within the same video sequence have been exhausted, region i of a frame is matched with regions j of several other frames, recorded as the set S_i = {i, j_1, j_2, …, j_L}; region j in turn matches regions j' in several other frames, and such j' are recursively added into the set S_i. A region descriptor satisfying the following formula (3) is then selected to replace the original descriptors of all regions in the set S_i, making them more robust to viewing angle changes:

$$x' = \arg\min_{x \in S_i} \bar{D}_x \qquad (3)$$

where x is a region in the set S_i, P_x is the set of matching items corresponding to region x in the frame matching sets in which region x participates, D_x is the set of all distances D_ij extracted from the matching item set P_x, and $\bar{D}_x$ is the average of the elements in the set D_x. The descriptor of the region x' with the smallest $\bar{D}_x$ in S_i is selected as the new region descriptor of all regions in the set S_i. Since region x' has the smallest average distance to the other matching regions in the video sequence, the features of region x' are considered more robust under different viewing angles.
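The descriptor update of formula (3) then amounts to picking, within each co-observed region set S_i, the region with the smallest average matching distance; a small sketch under that reading (the data structures are assumed, not prescribed):

```python
import numpy as np

def select_robust_descriptor(S_i, match_dists, descriptors):
    """S_i: list of region ids matched across frames of the same sequence.
    match_dists: dict mapping region id x -> list of distances D_ij of the
    matches that region x participates in (the set D_x above).
    descriptors: dict mapping region id -> its current region descriptor.
    The region with the smallest average matching distance becomes the new
    descriptor of every region in S_i, as in formula (3)."""
    x_best = min(S_i, key=lambda x: np.mean(match_dists[x]))
    return {x: descriptors[x_best] for x in S_i}
```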
And step S1073, respectively carrying out region feature matching on the region descriptors corresponding to the current video sequence and each video sequence stored in the global database to obtain region matching results of the current video sequence and each video sequence.
Specifically, in this embodiment of the present application, performing region feature matching between the region descriptors corresponding to the current video sequence and any video sequence may specifically include: performing region feature matching between the region descriptor corresponding to each frame of image in the current video sequence and each frame of image in that video sequence. In the embodiment of the present application, this matching is based on bidirectional matching and a ratio test, and other region matching manners may also be used, including but not limited to K-nearest-neighbor matching, greedy nearest neighbor (Greedy-NN) matching, K-d tree (K-dimensional tree) matching, and the like. The method based on bidirectional matching and a ratio test is described as an example.
Specifically, in the embodiment of the present application, performing region feature matching on a region descriptor corresponding to each frame of image and a region descriptor corresponding to any frame of image may specifically include: and determining the distance vector corresponding to each frame of image based on the region descriptor corresponding to each region in each frame of image and the region descriptor corresponding to each region in any frame of image. The distance vector corresponding to each frame of image comprises a plurality of elements, and any element is the distance between the area descriptor corresponding to any area in each frame of image and the area descriptor corresponding to any area in any frame of image.
Specifically, during video scene retrieval, for the current video sequence Vqry and a video sequence Vref stored in the database (global database) constructed when building the scene map, the region matching set P between each frame image in Vqry and each frame image in Vref is calculated in turn according to the above-described region matching manner based on bidirectional matching and a ratio test, and the coordinates (r, c) of each pair of matched regions in the residual feature map R described above are recorded at the same time.
Further, the current video sequence Vqry and each video sequence stored in the database (global database) when the scene map is established are subjected to region feature matching in the above manner, which is not described again in detail. It should be noted that the global database also stores the region descriptors corresponding to each video sequence Vref, and in the embodiment of the present application, the determination manner of the region descriptors corresponding to each video sequence Vref is the same as the determination manner of the region descriptor corresponding to the current video sequence Vqry, and details are not described here again.
And S108, performing region matching on the video sequences with the first preset number based on the space-time feature descriptors corresponding to the current video sequences to obtain the video sequences with the second preset number.
For the embodiment of the present application, the spatio-temporal region descriptors extracted from the currently observed video sequence are used to optimize the ranking of the above TopK1 video sequences, and the TopK2 video sequences after the optimized ranking are selected as the final scene retrieval result. In the embodiment of the present application, the value of TopK2 may be input by a user or preset, and is not limited here; for example, TopK2 lies in [1,10].
Specifically, in this embodiment of the present application, in step S108, performing region matching on the first preset number of video sequences based on the spatio-temporal feature descriptors corresponding to the current video sequence to obtain the second preset number of video sequences may specifically include: determining the spatial consistency scores of the current video sequence with respect to each video sequence based on the region matching results of the current video sequence with each video sequence; reordering the first preset number of video sequences based on these spatial consistency scores; and extracting the second preset number of video sequences from the sorted first preset number of video sequences. Each of the above video sequences belongs to the first preset number of video sequences. That is, in the embodiment of the present application, the spatial consistency scores between every two frames are calculated in turn from the region matching results, and the TopK1 video sequences are reordered according to the overall spatial consistency score of each video sequence. Specifically, determining the spatial consistency score of the current video sequence with respect to any video sequence includes: determining the spatial consistency score between each frame of image in the current video sequence and each frame of image in that video sequence; determining the weight information of each frame of image in the current video sequence; and determining the spatial consistency score of the current video sequence with respect to that video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency scores between each frame of image in the current video sequence and each frame of image in that video sequence. In the embodiment of the present application, from the spatial consistency score SS of every pair of frames between the two video sequences, the spatial consistency score of the two video sequences is finally calculated as shown in formula (4) below:
$$VSS = \sum_{m} w_m \cdot \max_{k} SS(m, k) \qquad (4)$$

where VSS represents the spatial consistency score of the two video sequences; m and k are frames of the currently observed video sequence V_qry and the retrieved video sequence V_ref respectively, with V_ref belonging to the TopK1 video sequences obtained by the fast retrieval branch; and w_m is the weight of observation frame m, with w_m ∈ (0,1]. The selection strategy of w_m is as follows: one frame of the observed video sequence (e.g., the first frame, the middle frame, or the last frame) is selected as the reference frame with a weight of 1, and the weights of the other frames decay exponentially with their distance from the reference frame.
Further, determining a spatial consistency score between each frame of image in the current video sequence and any frame of image may specifically include: determining region matching space consistency scores of various sizes; determining weight information corresponding to the regions of each size respectively; and determining the spatial consistency score between each frame of image and any frame of image in the current video sequence based on the region matching spatial consistency scores of the sizes and the weight information respectively corresponding to the regions of the sizes.
Specifically, the overall spatial consistency score of the region matching between two frames is calculated by formula (5):

$$SS = \sum_{i=1}^{n_s} w_i \cdot SS_{p_i} \qquad (5)$$

where SS represents the overall spatial consistency score of the region matching between the two frames; i traverses the scale set; n_s is the number of scales; and w_i is the scale weight, one weight per scale, with w_i ∈ [0,1].
Specifically, the region matching spatial consistency score formula (6) for the scale p is as follows:
$$SS_p = \frac{1}{n_p}\sum_{i \in P_p}\left(1 - \frac{\mathrm{dist}\big((r_p^i, c_p^i), (\bar{r}_p, \bar{c}_p)\big)}{\max_{j \in P_p} \mathrm{dist}\big((r_p^j, c_p^j), (\bar{r}_p, \bar{c}_p)\big)}\right) \qquad (6)$$

where SS_p represents the region matching spatial consistency score for scale p; n_p is the number of region blocks of scale p extracted from one frame of image by the multi-scale region feature extraction module; P_p is the region matching set of the region features of scale p; (r_p, c_p) are the matching offsets stored in P_p, i.e., the spatial position offsets of the matched regions computed by the spatio-temporal region feature matching module; $\bar{r}_p$ and $\bar{c}_p$ are the average row offset and average column offset in the set P_p respectively; i and j number the traversals of the set P_p; dist(·) is a distance function, including but not limited to the Manhattan distance, Euclidean distance, Minkowski distance, and the like; and max(·) is the maximum function.
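Putting formulas (4) to (6) together, the fine-grained scoring might be sketched as below; note that the exact forms of formulas (4) and (6) are reconstructed here from the surrounding variable definitions, so this is one plausible reading rather than the definitive computation of the embodiment.

```python
import numpy as np

def scale_consistency(offsets, n_p):
    """offsets: (L, 2) array of (row, col) offsets of the L region matches of
    one scale p between two frames; n_p: number of scale-p blocks per frame.
    Each match is scored by how close its offset is to the mean offset,
    normalised by the largest deviation (one reading of formula (6))."""
    if len(offsets) == 0:
        return 0.0
    mean = offsets.mean(axis=0)
    dev = np.linalg.norm(offsets - mean, axis=1)        # distance to the mean offset
    worst = max(dev.max(), 1e-6)
    return float(np.sum(1.0 - dev / worst) / n_p)

def frame_consistency(per_scale_offsets, n_blocks, scale_weights):
    # formula (5): weighted sum of the per-scale scores SS_p
    return sum(w * scale_consistency(off, n)
               for off, n, w in zip(per_scale_offsets, n_blocks, scale_weights))

def sequence_consistency(ss_matrix, frame_weights):
    """ss_matrix[m][k]: SS between frame m of the query sequence and frame k of
    the retrieved sequence; frame_weights[m]: w_m (reference frame weight 1,
    others decayed exponentially). Formula (4) is read here as a weighted sum
    over query frames of the best-matching retrieved frame."""
    ss = np.asarray(ss_matrix)
    return float(np.sum(np.asarray(frame_weights) * ss.max(axis=1)))
```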
Further, the video scene retrieval method is introduced below by a specific example. As shown in fig. 4, the current video stream sequence is obtained; then, based on the key frame extraction and feature map extraction processes described above, the dense deep learning feature maps corresponding to the current video stream sequence are obtained; and then the coarse-grained branch and the fine-grained branch are executed.
the specific execution flow of the coarse-grained branch is as follows: determining a global feature descriptor corresponding to the current video sequence based on a dense deep learning feature map corresponding to the current video stream sequence, and then retrieving from a database constructed during mapping based on the global feature descriptor corresponding to the current video sequence to obtain a TopK1 retrieval result;
The specific execution flow of the fine-grained branch is as follows: the corresponding region descriptors are obtained based on the dense deep learning feature maps corresponding to the current video stream sequence; the region descriptors are then updated; region matching is then performed between the updated region descriptors and the region descriptors of each video sequence in the database constructed during mapping; the spatial consistency score between the current video sequence and each video sequence in the TopK1 retrieval result is calculated based on the matching results; and the video sequences in the TopK1 retrieval result are optimally reordered based on their spatial consistency scores to obtain the final TopK2 retrieval result.
The foregoing embodiments describe a video scene retrieval method from the perspective of a method flow, and the following embodiments describe a video scene retrieval device from the perspective of a virtual module, which are described in detail in the following embodiments.
An embodiment of the present application provides a video scene retrieval device, as shown in fig. 5, the video scene retrieval device 50 may include: an obtaining module 51, a feature map extracting module 52, a time domain feature fusing module 53, a spatio-temporal feature aggregation processing module 54 and a first retrieving module 55, wherein,
an obtaining module 51, configured to obtain a current video sequence, where the current video sequence includes multiple frames of images;
the feature map extraction module 52 is configured to extract dense depth learning feature maps corresponding to the frames of images from the multiple frames of images respectively;
a time domain feature fusion module 53, configured to perform time domain feature fusion on the basis of the dense deep learning feature maps corresponding to each frame of image, respectively, to obtain respective fused features;
a spatio-temporal feature aggregation processing module 54, configured to perform spatio-temporal feature aggregation processing based on the fused features corresponding to each frame of image, respectively, to obtain a global feature descriptor corresponding to the current video sequence;
the first retrieving module 55 is configured to retrieve from the global database based on the global feature descriptor corresponding to the current video sequence to obtain a first preset number of video sequences.
In a possible implementation manner of the embodiment of the present application, the time domain feature fusion module 53 is specifically configured to, when performing time domain feature fusion on the dense-depth learning feature maps respectively corresponding to the frame images to obtain respective fused features: and respectively corresponding to the dense deep learning feature maps based on each frame of image, and performing time domain feature fusion through an attention mechanism to obtain respective fused features.
In another possible implementation manner of the embodiment of the present application, the spatio-temporal feature aggregation processing module 54 is specifically configured to, when performing spatio-temporal feature aggregation processing based on the fused features corresponding to each frame image respectively to obtain a global feature descriptor corresponding to a current video sequence: splicing the time domain characteristic graphs corresponding to the frames of images to obtain spliced characteristic graphs; performing point-by-point convolution processing on the spliced feature map to obtain a convolution processing result; carrying out normalization processing on the convolution processing result to obtain a result after the normalization processing; and determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map.
In another possible implementation manner of the embodiment of the application, the feature map after the splicing processing includes a plurality of feature points; the spatio-temporal feature aggregation processing module 54 is specifically configured to, when determining the global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map: clustering the plurality of feature points to obtain at least one clustering center; determining the distance between each characteristic point and each clustering center, and determining the distance information corresponding to each clustering center, wherein the distance information corresponding to any clustering center is the distance between each characteristic point and any clustering center; determining global representations respectively corresponding to the clustering clusters based on the distance information corresponding to each clustering center and the result after normalization processing; performing regularization treatment on the global representations corresponding to the cluster clusters respectively; splicing all the global representations after the regularization treatment; and carrying out regularization processing on the global representation after splicing processing to obtain a global feature descriptor corresponding to the current video sequence.
In another possible implementation manner of the embodiment of the present application, the apparatus 50 further includes: a multi-scale region feature extraction module, a spatio-temporal region feature matching module and a second retrieval module, wherein,
the multi-scale region extraction module is used for respectively extracting region features of the dense depth learning feature maps corresponding to the frames of images to obtain the multi-scale region features corresponding to the frames of images;
the space-time region feature matching module is used for carrying out region matching based on the respective corresponding multi-scale region features to obtain space-time feature descriptors corresponding to the current video sequence;
and the second retrieval module is used for carrying out region matching on the video sequences with the first preset number based on the space-time feature descriptors corresponding to the current video sequences to obtain the video sequences with the second preset number.
For the embodiment of the present application, the first retrieving module 55 and the second retrieving module may be the same retrieving module or different retrieving modules, and are not limited in the embodiment of the present application.
In another possible implementation manner of the embodiment of the application, the multi-scale region feature extraction module is specifically configured to, when performing region feature extraction based on a dense depth learning feature map corresponding to any frame of image to obtain a multi-scale region feature corresponding to any frame of image: determining a weighted residual error feature map based on a dense depth learning feature map corresponding to any frame of image; dividing the weighted residual characteristic diagram into a plurality of area blocks; and determining the area characteristic representation corresponding to each area block so as to obtain the multi-scale area characteristic corresponding to any frame of image.
In another possible implementation manner of the embodiment of the present application, when determining the weighted residual feature map based on the dense depth learning feature map corresponding to any frame of image, the multi-scale region feature extraction module is specifically configured to: performing point-by-point convolution processing on the dense depth learning characteristic image corresponding to any frame of image to obtain a convolution result; carrying out normalization processing on the convolution result to obtain a normalization result; and determining a weighted residual error feature map based on the normalization result and the corresponding distance information of each cluster center.
In another possible implementation manner of the embodiment of the application, the multi-scale region features are characterized by a region descriptor; the spatio-temporal region feature matching module is specifically configured to, when performing region matching based on respective corresponding multi-scale region features to obtain a spatio-temporal feature descriptor corresponding to the current video sequence: carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and the region descriptors corresponding to other frames of images in the current video sequence respectively to obtain a region matching result corresponding to the current video sequence; selecting a region descriptor meeting a preset condition from a region matching result corresponding to the current video sequence as a region descriptor corresponding to the current video sequence; and respectively carrying out region feature matching on the region descriptors corresponding to the current video sequence and each video sequence stored in the global database to obtain region matching results of the current video sequence and each video sequence.
In another possible implementation manner of the embodiment of the present application, the spatio-temporal region feature matching module is specifically configured to, when performing region feature matching on a region descriptor corresponding to any frame image in the current video sequence and a region descriptor corresponding to any other frame image in the current video sequence to obtain a corresponding matching result:
determining the distance between the region descriptor corresponding to any frame of image and each region descriptor in each region in any other frame of image in the current video sequence;
carrying out region feature matching on the region descriptor corresponding to any frame image in the current video sequence and the region descriptor corresponding to any other frame image in the current video sequence by the following formula to obtain a corresponding matching result:

$$P_{mn} = \{(i,j) \mid D_{ij} \le t \cdot D_{i^k j},\ D_{ij} \le t \cdot D_{i j^k}\}$$

where the element D_ij of the matrix represents the distance between the i-th region descriptor in frame Tm and the j-th region descriptor in frame Tn, and the matrix D is used to characterize the distances between all region descriptors in frame Tm and all region descriptors in frame Tn of the video sequence; frame Tm characterizes any frame image, and frame Tn characterizes any other frame image in the current video sequence; $D_{i^k j}$ characterizes the element with the smallest distance value in the j-th column of the matrix D (excluding D_ij itself), $D_{i j^k}$ characterizes the element with the smallest distance value in the i-th row of the matrix D (excluding D_ij itself), and t characterizes the threshold parameter; the matching items (i, j) satisfying the condition form the matching set P_mn between frame Tm and frame Tn.
In another possible implementation manner of the embodiment of the present application, when the spatio-temporal region feature matching module selects an area descriptor with a preset condition from an area matching result corresponding to any area, and the area descriptor is used as an area descriptor corresponding to any area, the spatio-temporal region feature matching module is specifically configured to:
determining an average value of distances meeting preset conditions;
and determining the area descriptor corresponding to any area based on the average value of the distances meeting the preset condition.
In another possible implementation manner of the embodiment of the present application, when the space-time region feature matching module determines, based on an average value of distances satisfying a first preset condition, a region descriptor corresponding to any one region, the space-time region feature matching module is specifically configured to:
determining a region descriptor corresponding to any region by the following formula based on the average value of the distances meeting the first preset condition:
$$x' = \arg\min_{x \in S_i} \bar{D}_x$$

where x is a region in the set S_i, P_x is the set of matching items corresponding to region x in the frame matching sets in which region x participates, D_x is the set of all distances D_ij extracted from the matching item set P_x, $\bar{D}_x$ is the average of the elements in the set D_x, and x' is used to characterize the region whose descriptor becomes the new region descriptor of all regions determined in the set S_i.
In another possible implementation manner of the embodiment of the present application, when performing region feature matching on a region descriptor corresponding to a current video sequence and any video sequence, the spatio-temporal region feature matching module is specifically configured to: and respectively carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner of the embodiment of the present application, when performing region feature matching on the region descriptor corresponding to each frame of image and the region descriptor corresponding to any frame of image, the spatio-temporal region feature matching module is specifically configured to: determining a distance vector corresponding to each frame of image based on a region descriptor corresponding to each region in each frame of image and a region descriptor corresponding to each region in any frame of image, wherein the distance vector corresponding to each frame of image comprises a plurality of elements, and any element is the distance between the region descriptor corresponding to any region in each frame of image and the region descriptor corresponding to any region in any frame of image.
In another possible implementation manner of the embodiment of the present application, the second retrieval module, when performing region matching on the video sequences of the first preset number based on the spatio-temporal feature descriptors corresponding to the current video sequences to obtain video sequences of the second preset number, is specifically configured to: determining a spatial consistency score corresponding to the current video sequence and each video sequence respectively based on the region matching result corresponding to the current video sequence and each video sequence respectively, wherein each video sequence belongs to a first preset number of video sequences; reordering the video sequences with the first preset number based on the spatial consistency scores respectively corresponding to the current video sequence and each video sequence; and extracting a second preset number of video sequences from the sorted first preset number of video sequences.
In another possible implementation manner of the embodiment of the present application, when determining a spatial consistency score corresponding to a current video sequence and any video sequence, the second retrieval module is specifically configured to: determining the space consistency scores between each frame of image in the current video sequence and each frame of image in any video sequence; determining weight information of each frame of image in a current video sequence; and determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence.
In another possible implementation manner of the embodiment of the present application, when determining a spatial consistency score between each frame of image and any frame of image in a current video sequence, the second retrieval module is specifically configured to: determining a region matching space consistency score for each size; determining weight information corresponding to the regions of each size respectively; and determining the spatial consistency score between each frame of image and any frame of image in the current video sequence based on the region matching spatial consistency scores of the sizes and the weight information respectively corresponding to the regions of the sizes.
In another possible implementation manner of the embodiment of the present application, when determining the consistency score of the region matching space of any size, the second retrieval module is specifically configured to:
determining a region matching spatial consistency score for any size by the following formula:
$$SS_p = \frac{1}{n_p}\sum_{i \in P_p}\left(1 - \frac{\mathrm{dist}\big((r_p^i, c_p^i), (\bar{r}_p, \bar{c}_p)\big)}{\max_{j \in P_p} \mathrm{dist}\big((r_p^j, c_p^j), (\bar{r}_p, \bar{c}_p)\big)}\right)$$

where SS_p characterizes the region matching spatial consistency score of size p, n_p characterizes the number of extracted region blocks of scale p in a frame image, P_p is the region matching set of the region features of scale p, (r_p, c_p) are the matching offsets stored in P_p, $\bar{r}_p$ and $\bar{c}_p$ respectively characterize the average row offset and average column offset in the set P_p, i and j characterize the numbering of the traversals of the set P_p, dist(·) is a distance function, and max(·) is the maximum function;

the second retrieval module is specifically configured to, when determining the spatial consistency score between each frame of image in the current video sequence and any frame of image based on the region matching spatial consistency scores of each size and the weight information corresponding to the regions of each size, determine the score according to the following formula:

$$SS = \sum_{i=1}^{n_s} w_i \cdot SS_{p_i}$$

where SS characterizes the spatial consistency score between each frame of image in the current video sequence and any frame of image, i traverses the scale set, n_s is the number of scales, and w_i is the weight information corresponding to size i, with w_i ∈ [0,1].
In another possible implementation manner of the embodiment of the present application, when determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence, the second retrieval module is specifically configured to:
based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence, determining the spatial consistency score corresponding to the current video sequence and any video sequence by the following formula:
[formula rendered as an image in the original publication]
wherein VSS represents the spatial consistency score corresponding to the current video sequence and any video sequence, V_ref belongs to the first preset number of video sequences, m characterizes a frame in the current video sequence, k characterizes a frame of V_ref, and the symbol rendered as an image in the original characterizes the weight information of frame m.
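How the frame-level scores are folded into VSS is likewise hidden behind an image; the definitions only state that per-frame weight information and the pairwise scores SS(m, k) are involved. The sketch below assumes each frame m of the current video sequence contributes the best score it reaches over the frames k of V_ref, weighted by its frame weight; that aggregation rule is an assumption.

```python
import numpy as np

def sequence_consistency_score(frame_scores, frame_weights):
    """frame_scores[m, k]: SS between query frame m and reference frame k of V_ref.
    frame_weights[m]:    weight information of query frame m.
    Assumed rule: weighted sum over query frames of their best reference match."""
    frame_scores = np.asarray(frame_scores, dtype=np.float32)
    frame_weights = np.asarray(frame_weights, dtype=np.float32)
    best_per_frame = frame_scores.max(axis=1)     # best reference frame for each m
    return float((frame_weights * best_per_frame).sum())
```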
Compared with the related art, in the embodiment of the present application, time domain feature fusion is performed based on the dense deep learning feature maps respectively corresponding to each frame of image in the current video sequence, and space-time feature aggregation processing is then performed on the fused features to obtain the global feature descriptor corresponding to the current video sequence. Because this descriptor reflects the space-time characteristics of the current video sequence, retrieval from the global database based on it reduces the influence of changes in the surrounding environment of a scene, local occlusion and the like on scene re-identification, improves the accuracy of the retrieved video sequences, and thereby improves user experience.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In an embodiment of the present application, an electronic device is provided. As shown in fig. 6, the electronic device 600 includes a processor 601 and a memory 603, wherein the processor 601 is coupled to the memory 603, for example via a bus 602. Optionally, the electronic device 600 may further include a transceiver 604. It should be noted that in practical applications the transceiver 604 is not limited to one, and the structure of the electronic device 600 constitutes no limitation on the embodiments of the present application.
The processor 601 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 601 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 602 may include a path that transfers information between the above components. The bus 602 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 602 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The memory 603 may be a ROM (Read Only Memory) or other type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 603 is used for storing application program code for executing the solutions of the present application, and execution is controlled by the processor 601. The processor 601 is configured to execute the application program code stored in the memory 603 to implement what is illustrated in the foregoing method embodiments.
The electronic device includes, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g., in-vehicle navigation terminals), fixed terminals such as digital TVs and desktop computers, and servers. The electronic device shown in fig. 6 is only an example and should not impose any limitation on the functions and the scope of use of the embodiments of the present application.
The embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program runs on a computer, the computer is enabled to execute the corresponding content in the foregoing method embodiments. Compared with the related art, the beneficial effects are the same as those of the method embodiments: because the global feature descriptor obtained through time domain feature fusion and space-time feature aggregation reflects the space-time characteristics of the current video sequence, retrieval from the global database based on this descriptor reduces the influence of changes in the surrounding environment of a scene, local occlusion and the like on scene re-identification, improves the accuracy of the retrieved video sequences, and thereby improves user experience.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.
The above embodiments are only used to describe the technical solutions of the present application in detail; they are intended to help understand the method and its core idea and should not be construed as limiting the present application. Those skilled in the art should also appreciate that various modifications and substitutions can be made without departing from the scope of the present disclosure.

Claims (20)

1. A method for retrieving a video scene, comprising:
acquiring a current video sequence, wherein the current video sequence comprises a plurality of frames of images;
respectively extracting dense deep learning feature maps corresponding to the frames of images from the multiple frames of images;
respectively performing time domain feature fusion on the dense deep learning feature maps respectively corresponding to the frame images to obtain respective fused features;
performing space-time feature aggregation processing on the basis of the fused features corresponding to the images of each frame respectively to obtain a global feature descriptor corresponding to the current video sequence;
and retrieving from a global database based on the global feature descriptors corresponding to the current video sequence to obtain a first preset number of video sequences.
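Claim 1 describes a five-step pipeline. The sketch below shows one way the steps could be wired together; the three callables standing in for feature extraction, time domain fusion and space-time aggregation, as well as the use of cosine similarity for the database search, are assumptions for illustration only.

```python
import numpy as np

def retrieve_similar_sequences(frames, extract_features, temporal_fuse,
                               aggregate, database, first_preset_number=10):
    """frames:   list of images forming the current video sequence.
    database: dict mapping sequence id -> stored global feature descriptor.
    extract_features / temporal_fuse / aggregate are placeholders for the
    dense feature extraction, time domain fusion and space-time aggregation."""
    feature_maps = [extract_features(f) for f in frames]   # dense deep learning feature maps
    fused = temporal_fuse(feature_maps)                    # fused features per frame
    query = aggregate(fused)                               # global feature descriptor
    query = query / (np.linalg.norm(query) + 1e-12)

    scored = []
    for seq_id, descriptor in database.items():
        d = descriptor / (np.linalg.norm(descriptor) + 1e-12)
        scored.append((float(query @ d), seq_id))          # cosine similarity (assumed)
    scored.sort(reverse=True)
    return [seq_id for _, seq_id in scored[:first_preset_number]]
```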
2. The method according to claim 1, wherein the performing time domain feature fusion respectively based on the dense deep learning feature maps respectively corresponding to the frame images to obtain respective fused features comprises:
performing time domain feature fusion through an attention mechanism based on the dense deep learning feature maps respectively corresponding to the frame images, to obtain respective fused features.
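Claim 2 only states that the fusion is performed through an attention mechanism. One common realisation, treating the features of each spatial location across the T frames as a short token sequence and applying standard scaled dot-product self-attention, is sketched below in PyTorch; the self-attention form, the head count and the layer sizes are assumptions, not the claimed design.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Fuses T dense feature maps of shape (T, C, H, W) along the time axis."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feature_maps: torch.Tensor) -> torch.Tensor:
        t, c, h, w = feature_maps.shape
        # one token per frame at every spatial position: (H*W, T, C)
        tokens = feature_maps.permute(2, 3, 0, 1).reshape(h * w, t, c)
        fused, _ = self.attn(tokens, tokens, tokens)          # temporal self-attention
        return fused.reshape(h, w, t, c).permute(2, 3, 0, 1)  # back to (T, C, H, W)

# usage sketch: fused = TemporalAttentionFusion(256)(torch.randn(5, 256, 30, 40))
```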
3. The method according to claim 1 or 2, wherein the performing spatio-temporal feature aggregation processing based on the fused features respectively corresponding to the frame images to obtain a global feature descriptor corresponding to the current video sequence comprises:
splicing the time domain feature maps corresponding to the frames of images to obtain a spliced feature map;
performing point-by-point convolution processing on the spliced feature map to obtain a convolution processing result;
carrying out normalization processing on the convolution processing result to obtain a result after the normalization processing;
and determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map.
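Claim 3's chain (splice the per-frame time domain feature maps, run a point-by-point 1×1 convolution, normalise the result and combine it with the spliced map) can be read as producing a soft assignment over the spliced features. The PyTorch sketch below follows that reading; concatenation along the width axis, softmax as the normalisation and the number of assignment channels K are assumptions.

```python
import torch
import torch.nn as nn

class SoftAssignment(nn.Module):
    """Point-by-point convolution plus normalisation over a spliced feature map."""
    def __init__(self, channels: int, num_clusters: int):
        super().__init__()
        self.pointwise = nn.Conv2d(channels, num_clusters, kernel_size=1)

    def forward(self, frame_features):
        # splice the T fused maps of shape (C, H, W) along the width axis (assumed)
        spliced = torch.cat(list(frame_features), dim=-1).unsqueeze(0)  # (1, C, H, T*W)
        conv = self.pointwise(spliced)                                  # (1, K, H, T*W)
        assignment = torch.softmax(conv, dim=1)                         # normalisation (softmax assumed)
        return assignment, spliced          # both feed the descriptor step of claim 4
```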
4. The method according to claim 3, wherein the feature map after the splicing process comprises a plurality of feature points;
the determining a global feature descriptor corresponding to the current video sequence based on the result after the normalization processing and the spliced feature map includes:
clustering the plurality of feature points to obtain at least one clustering center;
determining the distance between each feature point and each clustering center, and determining the distance information corresponding to each clustering center, wherein the distance information corresponding to any clustering center is the distance between each feature point and any clustering center;
determining global representations respectively corresponding to the clustering clusters based on the distance information corresponding to each clustering center and the result after the normalization processing;
performing regularization processing on the global representations corresponding to the cluster clusters respectively;
splicing all the global representations after the regularization treatment;
and carrying out regularization processing on the global representation after splicing processing to obtain a global feature descriptor corresponding to the current video sequence.
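Claim 4 (cluster the feature points of the spliced map, collect distance information to each cluster centre, build one global representation per cluster, regularise each, splice them and regularise again) matches a VLAD-style aggregation. A numpy/scikit-learn sketch under that reading; the residual-sum form of the per-cluster representation and L2 normalisation as the "regularisation" are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_style_descriptor(points, soft_assignment, num_clusters=8):
    """points:          (N, C) feature points of the spliced feature map.
    soft_assignment: (N, K) normalised weights from the point-by-point convolution,
                     with K equal to num_clusters."""
    centers = KMeans(n_clusters=num_clusters, n_init=10).fit(points).cluster_centers_
    reps = []
    for k in range(num_clusters):
        residual = points - centers[k]                        # distance information to centre k
        rep = (soft_assignment[:, k:k + 1] * residual).sum(axis=0)
        rep = rep / (np.linalg.norm(rep) + 1e-12)             # per-cluster regularisation (L2 assumed)
        reps.append(rep)
    descriptor = np.concatenate(reps)                         # splice all global representations
    return descriptor / (np.linalg.norm(descriptor) + 1e-12)  # final regularisation
```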
5. The method according to claim 4, wherein the extracting dense deep learning feature maps respectively corresponding to the frames of images from the plurality of frames of images further comprises:
respectively extracting the region features of the dense deep learning feature maps corresponding to the frame images to obtain the multi-scale region features corresponding to the frame images;
performing region matching based on the respective corresponding multi-scale region features to obtain a space-time feature descriptor corresponding to the current video sequence;
and performing region matching on the video sequences with the first preset number based on the space-time feature descriptors corresponding to the current video sequences to obtain video sequences with a second preset number.
6. The method according to claim 5, wherein performing region feature extraction based on a dense deep learning feature map corresponding to any frame image to obtain a multi-scale region feature corresponding to any frame image comprises:
determining a weighted residual feature map based on the dense deep learning feature map corresponding to any frame of image;
dividing the weighted residual feature map into a plurality of region blocks;
and determining the region feature representation corresponding to each region block to obtain the multi-scale region feature corresponding to any frame of image.
7. The method of claim 6, wherein determining a weighted residual feature map based on the dense deep learning feature map corresponding to any frame of image comprises:
performing point-by-point convolution processing on the dense deep learning feature map corresponding to any frame of image to obtain a convolution result;
carrying out normalization processing on the convolution result to obtain a normalization result;
and determining the weighted residual error feature map based on the normalization result and the distance information corresponding to each cluster center.
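Claims 6 and 7 build the multi-scale region features in two moves: a weighted residual feature map (point-by-point convolution, normalisation, residuals to the cluster centres) and a split of that map into region blocks of several sizes, each summarised by a region descriptor. The numpy sketch below is one reading of those steps; picking the most relevant centre per location, average pooling inside a block and the particular block sizes are assumptions.

```python
import numpy as np

def weighted_residual_map(feature_map, centers, assignment):
    """feature_map: (H, W, C); centers: (K, C); assignment: (H, W, K) normalised weights.
    Assumed rule: keep, per location, the residual to its most relevant centre,
    weighted by the normalised assignment of that centre."""
    k_best = assignment.argmax(axis=-1)                          # (H, W)
    residual = feature_map - centers[k_best]                     # residual to the chosen centre
    weight = np.take_along_axis(assignment, k_best[..., None], axis=-1)
    return weight * residual                                     # (H, W, C)

def multi_scale_region_features(wr_map, block_sizes=(2, 4, 8)):
    """Split the weighted residual map into square region blocks of several sizes
    and describe each block by its mean vector (pooling choice is assumed)."""
    h, w, _ = wr_map.shape
    features = {}
    for s in block_sizes:
        blocks = [wr_map[r:r + s, c:c + s].mean(axis=(0, 1))
                  for r in range(0, h - s + 1, s)
                  for c in range(0, w - s + 1, s)]
        features[s] = np.stack(blocks)                           # (n_p, C) region descriptors of size s
    return features
```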
8. The method of any one of claims 5-7, wherein the multi-scale region features are characterized by a region descriptor;
the obtaining of the space-time feature descriptor corresponding to the current video sequence by performing region matching based on the respective corresponding multi-scale region features comprises:
carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and the region descriptors corresponding to other frames of images in the current video sequence respectively to obtain a region matching result corresponding to the current video sequence;
selecting a region descriptor meeting a preset condition from a region matching result corresponding to the current video sequence as a region descriptor corresponding to the current video sequence;
and respectively carrying out region feature matching on the region descriptors corresponding to the current video sequence and each video sequence stored in a global database to obtain region matching results of the current video sequence and each video sequence.
9. The method of claim 8, wherein performing region feature matching on the region descriptor corresponding to any frame of image in the current video sequence and the region descriptor corresponding to any other frame of image in the current video sequence to obtain a corresponding matching result comprises:
determining the distance between the region descriptor corresponding to any frame of image and each region descriptor in each region in any other frame of image in the current video sequence;
carrying out region feature matching on the region descriptor corresponding to any frame image in the current video sequence and the region descriptor corresponding to any other frame image in the current video sequence by the following formula to obtain a corresponding matching result:
[formula rendered as an image in the original publication]
wherein the element D_ij in the matrix characterizes the distance between the i-th region descriptor in the Tm frame and the j-th region descriptor in the Tn frame, the matrix D characterizes the distances between all region descriptors in the Tm frame and all region descriptors in the Tn frame of the video sequence, the Tm frame characterizes any one frame image, and the Tn frame characterizes any other frame image in the current video sequence; D_ij^k is the element with the smallest distance value in the j-th column of the matrix D, D_i^k_j is the element with the smallest distance value in the i-th row of the matrix D, t characterizes a threshold parameter, and the matching items (i, j) meeting the conditions form the matching set P_mn between the Tm frame and the Tn frame.
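The matching condition itself is an image in the original, but the surrounding definitions (column-wise minimum, row-wise minimum, a threshold parameter t) point to a mutual-nearest-neighbour test on the distance matrix D. The numpy sketch below assumes exactly that, with t read as an absolute distance threshold; whether t is instead a ratio test cannot be recovered from this text.

```python
import numpy as np

def match_regions(desc_tm, desc_tn, t=0.7):
    """desc_tm: (Nm, C) region descriptors of the Tm frame.
    desc_tn: (Nn, C) region descriptors of the Tn frame.
    Returns the matching set P_mn of index pairs (i, j)."""
    # D[i, j] = distance between the i-th descriptor of Tm and the j-th of Tn
    D = np.linalg.norm(desc_tm[:, None, :] - desc_tn[None, :, :], axis=-1)
    best_j = D.argmin(axis=1)          # row-wise minimum: best j for every i
    best_i = D.argmin(axis=0)          # column-wise minimum: best i for every j
    return [(i, j) for i, j in enumerate(best_j)
            if best_i[j] == i and D[i, j] < t]   # mutual minimum + threshold (assumed)
```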
10. The method according to claim 9, wherein selecting a region descriptor meeting a preset condition from the region matching result corresponding to any one of the regions as the region descriptor corresponding to any one of the regions comprises:
determining the average value of the distances meeting the preset condition;
and determining the region descriptor corresponding to any one of the regions based on the average value of the distances meeting the preset condition.
11. The method according to claim 10, wherein the determining a region descriptor corresponding to any one of the regions based on the average value of the distances meeting the preset condition comprises:
determining, based on the average value of the distances meeting the preset condition, the region descriptor corresponding to any one of the regions through the following formula:
[formula rendered as an image in the original publication]
wherein x is a region in the group S_i, P_x is the matching item set corresponding to the region x in the frame matching set where the region x is located, D_x is the set of all D_ij extracted from the matching item set P_x, the symbol rendered as an image in the original is the average of the elements in the set D_x, and x' characterizes the region descriptor determined from all the regions in the set S_i.
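The selection rule of claim 11 is another lost image; the definitions only say that it depends on the average of the D_ij values drawn from each region's matching set P_x. One plausible reading, keeping within each group S_i the region whose matches have the smallest average distance, is sketched below; that argmin choice is an assumption.

```python
import numpy as np

def select_representative(regions, match_distances):
    """regions:         region descriptors x belonging to one group S_i.
    match_distances: one 1-D array per region, holding the D_ij values
                     extracted from that region's matching set P_x.
    Assumed rule: the representative x' is the region with the lowest
    mean matching distance."""
    means = [np.mean(d) if len(d) else np.inf for d in match_distances]
    return regions[int(np.argmin(means))]
```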
12. The method of claim 11, wherein performing region feature matching on the region descriptor corresponding to the current video sequence with any video sequence comprises:
and respectively carrying out region feature matching on the region descriptor corresponding to each frame of image in the current video sequence and each frame of image in any video sequence.
13. The method according to any one of claims 9 to 12, wherein performing region feature matching on the region descriptor corresponding to each frame of image with the region descriptor corresponding to any one frame of image comprises:
determining a distance vector corresponding to each frame of image based on a region descriptor corresponding to each region in each frame of image and a region descriptor corresponding to each region in any frame of image, wherein the distance vector corresponding to each frame of image comprises a plurality of elements, and any element is the distance between the region descriptor corresponding to any region in each frame of image and the region descriptor corresponding to any region in any frame of image.
14. The method of claim 13, wherein the performing region matching on the first preset number of video sequences based on the spatio-temporal feature descriptors corresponding to the current video sequence to obtain a second preset number of video sequences comprises:
determining a spatial consistency score corresponding to the current video sequence and each video sequence respectively based on the region matching result corresponding to the current video sequence and each video sequence respectively, wherein each video sequence belongs to the video sequences with the first preset number;
reordering the video sequences with the first preset number based on the spatial consistency scores respectively corresponding to the current video sequence and each video sequence;
and extracting a second preset number of video sequences from the sorted first preset number of video sequences.
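Claim 14 re-ranks the first preset number of candidates by their spatial consistency scores and keeps the top of the new order. A short sketch of that step; sorting in descending score order is the natural reading of "reordering ... based on the spatial consistency scores".

```python
def rerank_by_spatial_consistency(candidates, vss_scores, second_preset_number=5):
    """candidates: the first preset number of retrieved sequence ids.
    vss_scores: mapping from sequence id to its VSS score against the
                current video sequence."""
    reordered = sorted(candidates, key=lambda s: vss_scores[s], reverse=True)
    return reordered[:second_preset_number]
```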
15. The method of claim 14, wherein determining a spatial consistency score for a current video sequence corresponding to any video sequence based on a region matching result of the current video sequence corresponding to the any video sequence comprises:
determining a spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence respectively based on the region matching result of the current video sequence corresponding to any video sequence respectively;
determining the weight information of each frame of image in the current video sequence;
and determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence.
16. The method of claim 15, wherein determining a spatial consistency score between each frame of image in the current video sequence and any frame of image comprises:
determining a region matching spatial consistency score for each size;
determining weight information corresponding to the regions of each size respectively;
and determining the spatial consistency score between each frame of image and any frame of image in the current video sequence based on the region matching spatial consistency scores of the sizes and the weight information respectively corresponding to the regions of the sizes.
17. The method according to claim 16, wherein the determining the spatial consistency score corresponding to the current video sequence and any video sequence based on the weight information of each frame of image in the current video sequence and the spatial consistency score between each frame of image in the current video sequence and each frame of image in any video sequence comprises:
based on the weight information of each frame of image in the current video sequence and the spatial consistency scores between each frame of image in the current video sequence and each frame of image in any video sequence, determining the spatial consistency score corresponding to the current video sequence and any video sequence by the following formula:
[formula rendered as an image in the original publication]
wherein VSS represents the spatial consistency score corresponding to the current video sequence and any video sequence, V_ref belongs to the first preset number of video sequences, m represents a frame in the current video sequence, k represents a frame of V_ref, the symbol rendered as an image in the original characterizes the weight information of m, and SS characterizes the spatial consistency score between each frame of image in the current video sequence and any frame of image.
18. A video scene retrieval apparatus, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a current video sequence which comprises a plurality of frames of images;
the feature map extraction module is used for respectively extracting dense deep learning feature maps corresponding to the frames of images from the multiple frames of images;
the time domain feature fusion module is used for respectively carrying out time domain feature fusion on the basis of the dense deep learning feature maps respectively corresponding to the frames of images to obtain respective fused features;
the temporal-spatial feature aggregation processing module is used for performing temporal-spatial feature aggregation processing on the basis of the fused features corresponding to the frames of images respectively to obtain a global feature descriptor corresponding to the current video sequence;
and the first retrieval module is used for retrieving from a global database based on the global feature descriptors corresponding to the current video sequence to obtain a first preset number of video sequences.
19. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the video scene retrieval method according to any one of claims 1 to 17.
20. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a method for video scene retrieval according to any one of claims 1 to 17.
CN202210339794.9A 2022-04-01 2022-04-01 Video scene retrieval method and device, electronic equipment and readable storage medium Pending CN114743139A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210339794.9A CN114743139A (en) 2022-04-01 2022-04-01 Video scene retrieval method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210339794.9A CN114743139A (en) 2022-04-01 2022-04-01 Video scene retrieval method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114743139A true CN114743139A (en) 2022-07-12

Family

ID=82278364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210339794.9A Pending CN114743139A (en) 2022-04-01 2022-04-01 Video scene retrieval method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114743139A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641499A (en) * 2022-10-19 2023-01-24 感知天下(北京)信息科技有限公司 Photographing real-time positioning method and device based on street view feature library and storage medium
CN115641499B (en) * 2022-10-19 2023-07-18 感知天下(北京)信息科技有限公司 Photographing real-time positioning method, device and storage medium based on street view feature library
CN116129330A (en) * 2023-03-14 2023-05-16 阿里巴巴(中国)有限公司 Video-based image processing, behavior recognition, segmentation and detection methods and equipment
CN116129330B (en) * 2023-03-14 2023-11-28 阿里巴巴(中国)有限公司 Video-based image processing, behavior recognition, segmentation and detection methods and equipment


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination