CN107169106B - Video retrieval method, device, storage medium and processor - Google Patents


Info

Publication number
CN107169106B
CN107169106B (application CN201710351135.6A)
Authority
CN
China
Prior art keywords
target
image
video
feature
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710351135.6A
Other languages
Chinese (zh)
Other versions
CN107169106A (en)
Inventor
周文明
王志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Thinkjoy Information Technology Co ltd
Original Assignee
Zhuhai Thinkjoy Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Thinkjoy Information Technology Co ltd filed Critical Zhuhai Thinkjoy Information Technology Co ltd
Priority to CN201710351135.6A
Publication of CN107169106A
Application granted
Publication of CN107169106B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor, of video data
    • G06F16/73: Querying
    • G06F16/732: Query formulation
    • G06F16/7335: Graphical querying, e.g. query-by-region, query-by-sketch, query-by-trajectory, GUIs for designating a person/face/object as a query predicate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor, of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval using metadata automatically derived from the content
    • G06F16/7837: Retrieval using objects detected or recognised in the video content
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a video retrieval method, a video retrieval device, a storage medium and a processor. The method comprises the following steps: acquiring a target retrieval picture and a plurality of video images; preprocessing the plurality of video images to obtain at least one first target video image; processing the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image; processing all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image; clustering the first features and the second features according to a preset algorithm to obtain a retrieval model; performing matting processing on the target retrieval picture to obtain a target area image; and retrieving the target area image to obtain a retrieval result. The application addresses the technical problems of low video retrieval precision and low retrieval efficiency in the prior art.

Description

Video retrieval method, device, storage medium and processor
Technical Field
The present application relates to the field of digital intelligence, and in particular, to a video retrieval method, apparatus, storage medium, and processor.
Background
With the construction and popularization of projects such as safe cities and smart communities, video security monitoring equipment has gradually been installed in every corner of cities, recording and collecting video image data continuously, around the clock. For traffic and community monitoring systems of enormous scale, emerging intelligent video analysis based on computer vision makes automatic analysis and target identification of massive video possible. Surveillance video is used mainly for community and public security maintenance, and plays a vital role in safeguarding social security through real-time evidence collection and after-the-fact retrieval. However, video is unstructured data of huge volume with sparse effective information, and its structured storage still poses many problems. Real-time, fast retrieval of video data likewise faces many challenges: manual retrieval is unsuitable for practical application because of its heavy workload, the large number of retrieval targets, the ease of missing targets, and its low efficiency. Against this background, video retrieval technology in the prior art mainly takes the following two forms:
The first is semantic-based video retrieval. This approach matches queries against keywords, which may be titles, topics, characters, video events and the like, drawn from semantic description data that is added manually or generated automatically for each video. In security monitoring applications, however, the accuracy of semantic-based retrieval depends on a large amount of semantic description information, and when little description is available for a single specific target the retrieval effect is quite limited. For example, when looking for a target person in a huge volume of public security video, if the description of that person amounts only to "a person wearing a blue coat and black trousers", the person's deeper characteristics cannot be described, the search lacks specificity, and the returned results are cluttered.
The second is content-based video retrieval. This approach generally relies on traditional image processing, extracting low-level information from video frames such as color, texture, edges and feature points, and using the similarity between videos as the retrieval basis. Compared with semantic retrieval, content-based retrieval makes effective use of the low-level features in image and video data, improving retrieval efficiency. However, most current content-based image retrieval techniques still depend on traditional image features, whose descriptive power is limited; the feature vectors used for retrieval are high-dimensional, similarity computation is slow, and real-time retrieval is hard to achieve.
In summary, existing video retrieval technology suffers from poor search specificity, low precision, low efficiency and poor real-time performance; that is, the prior art has the technical problems of low video retrieval precision and low retrieval efficiency.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides a video retrieval method, a device, a storage medium and a processor, which are used for at least solving the technical problems of low video retrieval precision and low retrieval efficiency in the prior art.
According to an aspect of an embodiment of the present application, there is provided a video retrieval method, including: acquiring a target retrieval picture and a plurality of video images; preprocessing the plurality of video images to obtain at least one first target video image; performing target detection processing and target tracking processing on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; performing feature extraction processing on all target image sequences of each first target video image according to a second preset model to obtain first features and second features of each first target video image, wherein the first features are binary hash features of the first target video images, and the second features are original features of the first target video images; clustering the first feature and the second feature according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; carrying out matting processing on the target search image to obtain a target area image; and searching the target area image according to the search model to obtain a search result.
Further, the searching the target area image according to the searching model to obtain a searching result includes: acquiring a third feature and a fourth feature of the target area image, wherein the third feature is a binary hash feature of the target area image, and the fourth feature is an original feature of the target area image; calculating a hamming distance between the third feature and the first feature of each of the first target video images to obtain at least one second target video image; calculating the Euclidean distance between the fourth feature and the second feature of each second target video image in the at least one second target video image to obtain a target image frame, wherein the similarity between the target image frame and the target retrieval image is greater than a preset similarity threshold; acquiring a frame ID of the target image frame; and searching the video image corresponding to the frame ID in the plurality of video images to obtain the search result.
Further, after performing feature extraction processing on all target image sequences of each of the first target video images according to a second preset model, the method further includes: the at least one first target video image, the sequence of target images, the first feature and the second feature are stored in a database.
Further, the preset approximate nearest neighbor algorithm is a locality-sensitive hashing (LSH) algorithm.
Further, the preprocessing the plurality of video images to obtain at least one first target video image includes: and sequentially performing length normalization processing and decoding processing on each of the plurality of video images to obtain the first target video image.
Further, the method further comprises the steps of: training the first preset model and the second preset model according to a random gradient descent algorithm until the first preset model and the second preset model reach a convergence state.
According to another aspect of the embodiment of the present application, there is also provided a video retrieval apparatus including: an acquisition unit configured to acquire a target retrieval picture and a plurality of video images; the first processing unit is used for preprocessing the plurality of video images to obtain at least one first target video image; the second processing unit is used for carrying out target detection processing and target tracking processing on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; the third processing unit is configured to perform feature extraction processing on all the target image sequences of each of the first target video images according to a second preset model to obtain a first feature and a second feature of each of the first target video images, where the first feature is a binary hash feature of the first target video image, and the second feature is an original feature of the first target video image; the fourth processing unit is used for carrying out clustering processing on the first characteristic and the second characteristic according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; a fifth processing unit, configured to perform matting processing on the target search image to obtain a target area image; and the searching unit is used for searching the target area image according to the searching model to obtain a searching result.
Further, the search unit includes: a first obtaining subunit, configured to obtain a third feature and a fourth feature of the target area image, where the third feature is a binary hash feature of the target area image, and the fourth feature is an original feature of the target area image; a first computing subunit, configured to calculate a hamming distance between the third feature and the first feature of each of the first target video images, so as to obtain at least one second target video image; a second calculating subunit, configured to calculate a euclidean distance between the fourth feature and the second feature of each of the at least one second target video image, so as to obtain a target image frame, where a similarity between the target image frame and the target search image is greater than a preset similarity threshold; a second acquisition subunit configured to acquire a frame ID of the target image frame; and a search subunit configured to search the plurality of video images for the video image corresponding to the frame ID, and obtain the search result.
According to still another aspect of the embodiments of the present application, there is further provided a storage medium including a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to execute the video retrieval method.
According to still another aspect of the embodiment of the present application, there is further provided a processor, where the processor is configured to execute a program, and the video searching method is executed when the program runs.
In the embodiment of the application, the following approach is adopted: acquiring a target retrieval picture and a plurality of video images; preprocessing the plurality of video images to obtain at least one first target video image; performing target detection processing and target tracking processing on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image; performing feature extraction processing on all target image sequences of each first target video image according to a second preset model to obtain the first feature and the second feature of each first target video image, where the first feature is the binary hash feature and the second feature is the original feature of the first target video image; clustering the first features and the second features according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; performing matting processing on the target retrieval picture to obtain a target area image; and retrieving the target area image according to the retrieval model to obtain a retrieval result. This improves video retrieval precision and efficiency and reduces retrieval time and labor cost, thereby solving the technical problems of low video retrieval precision and low retrieval efficiency in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of an alternative video retrieval method according to an embodiment of the application;
FIG. 2 is a flow chart of another alternative video retrieval method according to an embodiment of the application;
FIG. 3 is a schematic diagram of an alternative video retrieval device according to an embodiment of the present application;
fig. 4 is a schematic structural view of another alternative video retrieval device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present application, an embodiment of a video retrieval method is provided. It should be noted that the steps shown in the flowchart of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.
Fig. 1 is a flow chart of an alternative video retrieval method according to an embodiment of the present application, as shown in fig. 1, the method includes the steps of:
step S102, obtaining a target retrieval picture and a plurality of video images;
step S104, preprocessing a plurality of video images to obtain at least one first target video image;
step S106, performing target detection processing and target tracking processing on at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image;
step S108, carrying out feature extraction processing on all target image sequences of each first target video image according to a second preset model to obtain first features and second features of each first target video image, wherein the first features are binary hash features of the first target video image, and the second features are original features of the first target video image;
step S110, clustering the first feature and the second feature according to a preset approximate nearest neighbor algorithm to obtain a retrieval model;
step S112, carrying out matting processing on the target retrieval image to obtain a target area image;
step S114, searching the target area image according to the search model to obtain a search result.
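The indexing portion of the flow (steps S102 through S110) can be sketched as a minimal pipeline. All function bodies below are hypothetical stand-ins for the preset models described in the text, not the patent's actual implementations:

```python
# Hypothetical stand-ins for the processing stages of steps S102-S110.
def preprocess(video_path):
    """S104: length normalization and decoding (stubbed)."""
    return video_path  # a decoded, size-normalized frame in practice

def detect_and_track(frame):
    """S106: first preset model, yielding target image sequences (stubbed)."""
    return [[frame]]  # one image sequence per detected target

def extract_features(sequence):
    """S108: second preset model, yielding (original, binary hash) features."""
    original = [float(len(item)) for item in sequence]  # toy stand-in feature
    binary = [1 if x > 0 else 0 for x in original]
    return original, binary

def build_index(video_paths):
    """Assemble the per-target records from which the retrieval model is built."""
    index = []
    for path in video_paths:
        frame = preprocess(path)
        for seq in detect_and_track(frame):
            original, binary = extract_features(seq)
            index.append({"video": path, "feat": original, "hash": binary})
    return index
```

Step S110 would then cluster the `hash` and `feat` entries of this index into the retrieval model.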
In the embodiment of the application, the following modes are adopted: acquiring a target retrieval picture and a plurality of video images; preprocessing a plurality of video images to obtain at least one first target video image; performing target detection processing and target tracking processing on at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image; performing feature extraction processing on all target image sequences of each first target video image according to a second preset model to obtain first features and second features of each first target video image, wherein the first features are binary hash features of the first target video image, and the second features are original features of the first target video image; clustering the first features and the second features according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; carrying out matting processing on the target retrieval image to obtain a target area image; the method and the device achieve the aim of obtaining the search result by searching the target area image according to the search model, thereby achieving the technical effects of improving the search precision and the search efficiency of the video, reducing the search time cost and the labor cost, and further solving the technical problems of lower search precision and lower search efficiency of the video in the prior art.
Alternatively, the plurality of video images may be understood as massive video images, and the target search picture is input by the user, where it is noted that the target search picture may or may not be included in the plurality of video images.
Optionally, steps S102 to S110 may be performed by processing the massive video images and extracting features from each video image (through target detection, target tracking and feature extraction). The features include an original feature (higher-dimensional) and a binary hash feature (lower-dimensional, consisting only of the digits 0 and 1); the original features and binary hash features of the video images are then stored and clustered to construct the retrieval service model.
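To illustrate the relationship between the two feature types, a real-valued original feature can be compressed into a binary hash by thresholding each component. This is a simplified sketch with an assumed threshold of zero; in the patent the mapping is learned by the network's hash layer:

```python
def binarize(feature, threshold=0.0):
    """Map each real-valued component to 0 or 1 by thresholding (assumed scheme)."""
    return [1 if x > threshold else 0 for x in feature]

original_feature = [0.8, -1.2, 0.05, -0.3, 2.1]  # toy "second feature" (original)
hash_feature = binarize(original_feature)        # compact "first feature" (binary hash)
```

The short binary code supports cheap Hamming comparisons, while the longer original feature is kept for precise distance computation later.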
Optionally, in the case that the user inputs a single picture as the target search picture, the step S112 may be executed to pre-process the single picture input by the user, remove information irrelevant to the target area image in the picture, and extract the target area image alone.
Optionally, the first preset model may include two sub-models, which are a target detection sub-model based on deep learning and a target tracking sub-model based on deep learning, respectively; the second preset model may be a deep learning based target feature extraction model.
Optionally, fig. 2 is a flowchart of another optional video retrieval method according to an embodiment of the present application, as shown in fig. 2, in step S114, retrieving a target area image according to a retrieval model, where obtaining a retrieval result includes:
step S202, obtaining a third feature and a fourth feature of the target area image, wherein the third feature is a binarized hash feature of the target area image, and the fourth feature is an original feature of the target area image;
step S204, calculating the Hamming distance between the third feature and the first feature of each first target video image to obtain at least one second target video image;
step S206, calculating the Euclidean distance between the fourth feature and the second feature of each second target video image in at least one second target video image to obtain a target image frame, wherein the similarity between the target image frame and the target retrieval image is larger than a preset similarity threshold;
step S208, obtaining the frame ID of the target image frame;
step S210, searching video images corresponding to the frame ID in the plurality of video images to obtain a search result.
Optionally, step S202 is performed, whereby the higher-dimensional original feature and the lower-dimensional binary hash feature of the target area image may be obtained.
Optionally, step S204 is performed, whereby the Hamming distance between the binarized feature of the user input image and the binarized features of the massive video data may be calculated, narrowing the search range and yielding a reduced-range subset of the massive video data features. The Hamming distance characterizes the similarity between features: the greater the Hamming distance, the lower the similarity. For example, suppose the massive database contains hundreds of thousands of video images and the user inputs a picture of a Husky; after computing the Hamming distances, the range may be narrowed to perhaps ten thousand video images, all of which may contain dogs.
Optionally, executing steps S206 to S210 may calculate the Euclidean distance between the original feature of the user input image and the original features of the narrowed-down massive video data, so as to obtain the top N image frames with the highest similarity to the input image. Related information, such as the corresponding video identifier and the frame number where the image is located, is then looked up in the massive video data according to the image frame ID, finally yielding the video retrieval result. Continuing the example above, by calculating Euclidean distances, one thousand video images containing only Huskies can be obtained from the ten thousand video images containing dogs. Thus, calculating the Hamming distance and then the Euclidean distance progressively narrows the search range.
Optionally, based on the above, the position of the corresponding bucket is first obtained from the binary hash feature of the target retrieval picture via the normal-distribution-based bucketing, the corresponding set of binary vectors is fetched from redis according to the bucket label, and the binary hash features with the highest similarity are obtained through Hamming distance comparison and sorting, completing the preliminary retrieval. A further precise search can then be performed by calculating Euclidean distances over the original features of the target retrieval picture. Finally, the top N image frames with the highest similarity are obtained by comparison and sorting, and the corresponding information, such as the video identifier and the frame number of the image, is looked up according to the image frame ID to produce the video retrieval result. Here N is set to 10, i.e. the retrieval returns the 10 most similar video sequences.
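The coarse-to-fine retrieval of steps S202 to S210 can be sketched as follows: a Hamming filter on the binary hashes narrows the candidate set, and a Euclidean re-ranking on the original features then selects the top-N frames. The Hamming radius and the record layout are illustrative assumptions:

```python
def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance between two real-valued feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def search(query_hash, query_feat, records, hamming_radius=1, top_n=10):
    # Stage 1: cheap Hamming filtering on the binary hashes narrows the range.
    candidates = [r for r in records
                  if hamming(query_hash, r["hash"]) <= hamming_radius]
    # Stage 2: exact Euclidean re-ranking on the original features.
    candidates.sort(key=lambda r: euclidean(query_feat, r["feat"]))
    return [r["frame_id"] for r in candidates[:top_n]]
```

A record here is a dict with hypothetical `frame_id`, `hash` and `feat` keys; the returned frame IDs would then be mapped back to video identifiers and frame numbers as in step S210.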
Optionally, after step S108 is completed, that is, after the feature extraction processing has been performed on all target image sequences of each first target video image according to the second preset model, the method may further include:
step S10, at least one first target video image, a target image sequence, a first feature and a second feature are stored in a database in a structured manner. The database can be a Mongodb database or a Poseidon database, and can be used as a search database, and when video image search is carried out, the target characteristics are required to be compared with the data in the database, so that a search result is obtained.
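A structured per-target record of the kind described above might look like the following; the field names are illustrative assumptions, not the patent's actual schema, but a document store such as MongoDB would accept such a record directly:

```python
import json

record = {
    "video_id": "cam01",                 # source video the target came from
    "frame_id": 1024,                    # frame in which the target appears
    "sequence": ["t0.jpg", "t1.jpg"],    # target image sequence from tracking
    "feat": [0.8, -1.2, 0.05],           # original feature (high-dimensional in practice)
    "hash": [1, 0, 1],                   # binary hash feature
}
serialized = json.dumps(record)          # ready for insertion into a document store
```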
Optionally, the preset approximate nearest neighbor algorithm is a locality-sensitive hashing algorithm. Specifically, the structured information of the video files is clustered based on an ANN (Approximate Nearest Neighbor) algorithm. Bucketing is performed on the binary hashes based on a normal-distribution partition, and the bucketed binary vector data is stored in the in-memory store redis, thereby constructing the retrieval service.
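Locality-sensitive hashing can be sketched with the random-hyperplane scheme: vectors that fall on the same side of every hyperplane share a bucket key, and the bucket map below stands in for the redis store. The hyperplane construction and all parameters are illustrative assumptions, since the patent does not fully specify its bucketing:

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals (one per output bit)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_planes)]

def bucket_key(vec, planes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

planes = make_hyperplanes(num_planes=4, dim=3)
buckets = {}  # stand-in for the redis store of per-bucket binary vector sets
for vec_id, vec in enumerate([[1.0, 2.0, 0.5], [2.0, 4.0, 1.0], [-3.0, 0.1, -2.0]]):
    buckets.setdefault(bucket_key(vec, planes), []).append(vec_id)
```

At query time, only the vectors in the query's own bucket need Hamming comparison, which is what makes the preliminary retrieval fast.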
Optionally, performing step S104, that is, preprocessing the plurality of video images to obtain at least one first target video image includes:
step S20, sequentially performing a length normalization process and a decoding process on each of the plurality of video images, to obtain a first target video image.
Specifically, length normalization cuts the continuous video stream into fixed-length segments, which facilitates later analysis and storage. During decoding, the video file may be decoded with OpenCV and a size-scaling normalization operation performed on each frame. Scaling uses a bilinear interpolation algorithm, and the scaled size is 1920×1080.
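Bilinear interpolation, mentioned above for frame scaling, computes each output pixel as a weighted average of the four nearest input pixels. A minimal grayscale sketch follows; the actual pipeline would scale full-color frames to 1920×1080 with OpenCV rather than this pure-Python illustration:

```python
def bilinear_sample(img, x, y):
    """Sample a grayscale image (list of rows) at fractional coordinates (x, y)."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(img[0]) - 1)
    y1 = min(y0 + 1, len(img) - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bot * fy

def resize(img, out_w, out_h):
    """Resize by bilinear interpolation, mapping output pixels to input coordinates."""
    in_h, in_w = len(img), len(img[0])
    return [[bilinear_sample(img, x * (in_w - 1) / max(out_w - 1, 1),
                             y * (in_h - 1) / max(out_h - 1, 1))
             for x in range(out_w)] for y in range(out_h)]
```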
Optionally, the method may further include: step S30, training the first preset model and the second preset model according to a random gradient descent algorithm until the first preset model and the second preset model reach a convergence state.
Specifically, the first preset model may be trained as follows. First, the image data set and its corresponding category label information are divided into two parts, one serving as a training sample set and the other as a test sample set; each sample in both sets comprises an image and its corresponding category label. Two sub-models of the first preset model are then constructed: a deep-learning-based target detection sub-model using the classical YOLO architecture, and a deep-learning-based target tracking sub-model using an RNN architecture. Finally, the target detection and target tracking sub-models are trained on the training sample set by stochastic gradient descent (SGD), with the learning rate set to 0.01.
Specifically, the second preset model may be trained as follows. First, the image data set and its corresponding category label information are divided into a training sample set and a test sample set, each sample comprising an image and its corresponding category label. A deep convolutional neural network architecture is then constructed, comprising a convolutional sub-network, a hash layer and a loss layer: the convolutional sub-network (a VGG architecture) learns the original features of an image; the hash layer compresses and reduces the dimensionality of the original features and converts them into binary codes, yielding the binary hash features of the input image; and the loss layer measures the Softmax classification error. The original feature has 4096 dimensions and the binarized hash feature has 128 dimensions. Finally, the second preset model is trained on the training sample set by stochastic gradient descent (SGD) to obtain the deep-learning-based target feature extraction model, with the learning rate set to 0.01.
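The SGD update used to train both preset models can be shown on a toy objective. The learning rate of 0.01 matches the text; the quadratic loss below is an illustrative stand-in for the networks' real losses:

```python
def sgd_step(params, grads, lr=0.01):
    """One stochastic gradient descent update with the step size from the text."""
    return [p - lr * g for p, g in zip(params, grads)]

# Minimize the toy loss (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
for _ in range(500):
    w = sgd_step(w, [2 * (w[0] - 3)])
```

Each step moves the parameter against the gradient, so `w` converges toward the minimizer at 3; training the real models repeats the same update over mini-batch gradients until convergence.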
In the embodiment of the present application, the following approach is adopted: acquiring a target retrieval picture and a plurality of video images; preprocessing the plurality of video images to obtain at least one first target video image; performing target detection processing and target tracking processing on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image; performing feature extraction processing on all target image sequences of each first target video image according to a second preset model to obtain a first feature and a second feature of each first target video image, where the first feature is the binarized hash feature and the second feature is the original feature of the first target video image; clustering the first features and the second features according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; performing matting processing on the target retrieval picture to obtain a target area image; and retrieving the target area image according to the retrieval model to obtain a retrieval result. This improves the retrieval precision and retrieval efficiency of video, reduces retrieval time cost and labor cost, and thereby solves the technical problem of low video retrieval precision and low retrieval efficiency in the prior art.
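The approximate-nearest-neighbor clustering step can be sketched as locality-sensitive hash bucketing: frames whose binary codes share a hash key land in the same bucket, so a query is only compared against its own bucket rather than the whole database. The 4-bit prefix key and the toy 8-bit codes below are illustrative assumptions, not the patent's actual bucketing function.

```python
from collections import defaultdict

# Sketch of building the retrieval model by bucketing binarized hash
# features. Here the bucket key is simply the first 4 bits of the code;
# a real LSH scheme would use one or more learned or random hash keys.

def bucket_key(code, prefix_bits=4):
    return tuple(code[:prefix_bits])

def build_index(frames):
    """frames: {frame_id: binary code} -> {bucket key: [frame ids]}."""
    index = defaultdict(list)
    for frame_id, code in frames.items():
        index[bucket_key(code)].append(frame_id)
    return index

index = build_index({
    "f1": [0, 1, 1, 0, 1, 0, 0, 1],
    "f2": [0, 1, 1, 0, 0, 1, 1, 0],   # same 4-bit prefix as f1
    "f3": [1, 0, 0, 1, 1, 1, 0, 0],
})
candidates = index[bucket_key([0, 1, 1, 0, 1, 1, 1, 1])]  # query code
```

Only the candidates in the matching bucket then proceed to the distance comparisons described in the retrieval steps.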
Example 2
According to another aspect of the embodiment of the present application, there is also provided a video retrieval apparatus, as shown in fig. 3, including: an acquisition unit 301, a first processing unit 303, a second processing unit 305, a third processing unit 307, a fourth processing unit 309, a fifth processing unit 311, and a retrieval unit 313.
The acquiring unit 301 is configured to acquire a target retrieval picture and a plurality of video images; the first processing unit 303 is configured to preprocess the plurality of video images to obtain at least one first target video image; the second processing unit 305 is configured to perform target detection processing and target tracking processing on the at least one first target video image according to a first preset model, so as to obtain all target image sequences of each first target video image; the third processing unit 307 is configured to perform feature extraction processing on all target image sequences of each first target video image according to a second preset model, so as to obtain a first feature and a second feature of each first target video image, where the first feature is the binarized hash feature and the second feature is the original feature of the first target video image; the fourth processing unit 309 is configured to cluster the first features and the second features according to a preset approximate nearest neighbor algorithm to obtain a retrieval model; the fifth processing unit 311 is configured to perform matting processing on the target retrieval picture to obtain a target area image; and the retrieval unit 313 is configured to retrieve the target area image according to the retrieval model to obtain a retrieval result.
Alternatively, as shown in fig. 4, the retrieving unit 313 may include: a first acquisition subunit 401, a first calculation subunit 403, a second calculation subunit 405, a second acquisition subunit 407, and a retrieval subunit 409.
The first obtaining subunit 401 is configured to obtain a third feature and a fourth feature of the target area image, where the third feature is a binary hash feature of the target area image, and the fourth feature is an original feature of the target area image; a first calculating subunit 403, configured to calculate a hamming distance between the third feature and the first feature of each first target video image, so as to obtain at least one second target video image; a second calculating subunit 405, configured to calculate a euclidean distance between the fourth feature and the second feature of each of the at least one second target video image, to obtain a target image frame, where a similarity between the target image frame and the target search image is greater than a preset similarity threshold; a second acquisition subunit 407 for acquiring a frame ID of the target image frame; the retrieving subunit 409 is configured to retrieve, from among the plurality of video images, a video image corresponding to the frame ID, and obtain a retrieval result.
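The two-stage retrieval performed by these subunits can be sketched as a coarse Hamming-distance pass over the binary codes followed by a Euclidean re-ranking over the original features. The toy codes, features, and the Hamming radius of 1 are illustrative assumptions; the patent's features are 128-bit codes and 4096-dimensional vectors.

```python
# Sketch of the two-stage retrieval: Hamming distance on binary hash
# codes selects candidates (subunit 403), then Euclidean distance on
# original features re-ranks them (subunit 405).

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

database = {
    "f1": {"code": [0, 1, 1, 0], "feat": [0.9, 0.1, 0.0]},
    "f2": {"code": [0, 1, 0, 0], "feat": [0.2, 0.8, 0.1]},
    "f3": {"code": [1, 0, 0, 1], "feat": [0.0, 0.1, 0.9]},
}
query = {"code": [0, 1, 1, 0], "feat": [0.85, 0.15, 0.05]}

# Stage 1: keep frames whose Hamming distance is within a small radius.
coarse = [fid for fid, e in database.items()
          if hamming(query["code"], e["code"]) <= 1]
# Stage 2: re-rank candidates by Euclidean distance on original features.
result = sorted(coarse,
                key=lambda fid: euclidean(query["feat"],
                                          database[fid]["feat"]))
```

The frame IDs at the head of `result` correspond to the target image frames whose video images are then looked up by frame ID to produce the retrieval result.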
Example 3
According to still another aspect of the embodiments of the present application, there is further provided a storage medium comprising a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to perform the video retrieval method of Embodiment 1 of the present application.
According to still another aspect of the embodiments of the present application, there is further provided a processor configured to run a program, wherein the program, when run, performs the video retrieval method of Embodiment 1 of the present application.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present application, each embodiment is described with its own emphasis; for portions not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of the units is merely a logical function division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the couplings, direct couplings, or communication connections shown or discussed may be implemented through certain interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application, which are intended to be comprehended within the scope of the present application.

Claims (9)

1. A video retrieval method, characterized in that, first, the position of the corresponding sub-bucket is obtained through a standard front distribution icon according to the binarized hash feature of a target retrieval picture; a corresponding binary vector set is obtained from redis according to the sub-bucket identifier; and the binarized hash features with high similarity are obtained through Hamming distance comparison and sorting, completing the preliminary retrieval; finally, the first N image frames with the highest similarity are obtained through comparison and sorting, and the corresponding video identifier and frame number information of each image are looked up according to the image frame ID, thereby obtaining the video retrieval result; wherein N is set to 10, that is, the first 10 video sequences with the highest similarity are retrieved and returned;
the method specifically comprises the following steps:
acquiring a target retrieval picture and a plurality of video images;
preprocessing the plurality of video images to obtain at least one first target video image;
performing target detection processing and target tracking processing on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image;
performing feature extraction processing on all target image sequences of each first target video image according to a second preset model to obtain first features and second features of each first target video image, wherein the first features are binary hash features of the first target video images, and the second features are original features of the first target video images;
clustering the first features and the second features according to a preset approximate nearest neighbor algorithm to obtain a retrieval model;
carrying out matting processing on the target retrieval image to obtain a target area image;
searching the target area image according to the search model to obtain a search result;
the target area image is searched according to the search model, and the search result is obtained comprises the following steps:
acquiring third and fourth features of the target area image, wherein the third feature is a binary hash feature of the target area image, and the fourth feature is an original feature of the target area image;
calculating a hamming distance between the third feature and the first feature of each first target video image to obtain at least one second target video image;
calculating the Euclidean distance between the fourth feature and the second feature of each second target video image in the at least one second target video image to obtain a target image frame, wherein the similarity between the target image frame and the target retrieval image is larger than a preset similarity threshold;
acquiring a frame ID of the target image frame;
and searching the video images corresponding to the frame IDs in the plurality of video images to obtain the search result.
2. The method according to claim 1, wherein after performing feature extraction processing on all target image sequences of each of the first target video images according to a second preset model, the method further comprises:
the at least one first target video image, the sequence of target images, the first feature and the second feature are stored in a database.
3. The method of claim 1, wherein the preset approximate nearest neighbor algorithm is a locality-sensitive hashing algorithm.
4. The method of claim 1, wherein preprocessing the plurality of video images to obtain at least one first target video image comprises:
and sequentially performing length normalization processing and decoding processing on each video image in the plurality of video images to obtain the first target video image.
5. The method according to claim 1, wherein the method further comprises:
training the first preset model and the second preset model according to a random gradient descent algorithm until the first preset model and the second preset model reach a convergence state.
6. A video retrieval device for performing the video retrieval method according to claim 1, characterized by comprising: an acquisition unit configured to acquire a target retrieval picture and a plurality of video images;
the first processing unit is used for preprocessing the plurality of video images to obtain at least one first target video image;
the second processing unit is used for carrying out target detection processing and target tracking processing on the at least one first target video image according to a first preset model to obtain all target image sequences of each first target video image in the at least one first target video image;
the third processing unit is used for carrying out feature extraction processing on all target image sequences of each first target video image according to a second preset model to obtain first features and second features of each first target video image, wherein the first features are binarized hash features of the first target video image, and the second features are original features of the first target video image;
the fourth processing unit is used for carrying out clustering processing on the first features and the second features according to a preset approximate nearest neighbor algorithm to obtain a retrieval model;
a fifth processing unit, configured to perform matting processing on the target search image to obtain a target area image;
and the retrieval unit is used for retrieving the target area image according to the retrieval model to obtain a retrieval result.
7. The apparatus of claim 6, wherein the retrieval unit comprises:
a first obtaining subunit, configured to obtain a third feature and a fourth feature of the target area image, where the third feature is a binary hash feature of the target area image, and the fourth feature is an original feature of the target area image;
a first computing subunit, configured to calculate a hamming distance between the third feature and the first feature of each of the first target video images, to obtain at least one second target video image;
a second calculating subunit, configured to calculate a euclidean distance between the fourth feature and the second feature of each of the at least one second target video image, to obtain a target image frame, where a similarity between the target image frame and the target search image is greater than a preset similarity threshold;
a second acquisition subunit configured to acquire a frame ID of the target image frame;
and the searching subunit is used for searching the video images corresponding to the frame ID in the plurality of video images to obtain the searching result.
8. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the video retrieval method of any one of claims 1 to 5.
9. A processor for running a program, wherein the program when run performs the video retrieval method of any one of claims 1 to 5.
CN201710351135.6A 2017-05-18 2017-05-18 Video retrieval method, device, storage medium and processor Active CN107169106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710351135.6A CN107169106B (en) 2017-05-18 2017-05-18 Video retrieval method, device, storage medium and processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710351135.6A CN107169106B (en) 2017-05-18 2017-05-18 Video retrieval method, device, storage medium and processor

Publications (2)

Publication Number Publication Date
CN107169106A CN107169106A (en) 2017-09-15
CN107169106B true CN107169106B (en) 2023-08-18

Family

ID=59816651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710351135.6A Active CN107169106B (en) 2017-05-18 2017-05-18 Video retrieval method, device, storage medium and processor

Country Status (1)

Country Link
CN (1) CN107169106B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107705259A (en) * 2017-09-24 2018-02-16 合肥麟图信息科技有限公司 A kind of data enhancement methods and device under mobile terminal preview, screening-mode
CN107844753A (en) * 2017-10-20 2018-03-27 珠海习悦信息技术有限公司 Pedestrian in video image recognition methods, device, storage medium and processor again
CN109697451B (en) * 2017-10-23 2022-01-07 北京京东尚科信息技术有限公司 Similar image clustering method and device, storage medium and electronic equipment
CN108573032A (en) * 2018-03-27 2018-09-25 麒麟合盛网络技术股份有限公司 Video recommendation method and device
CN109086866B (en) * 2018-07-02 2021-07-30 重庆大学 Partial binary convolution method suitable for embedded equipment
CN108932509A (en) * 2018-08-16 2018-12-04 新智数字科技有限公司 A kind of across scene objects search methods and device based on video tracking
CN110929058B (en) * 2018-08-30 2023-01-31 北京蓝灯鱼智能科技有限公司 Trademark picture retrieval method and device, storage medium and electronic device
CN110162665B (en) * 2018-12-28 2023-06-16 腾讯科技(深圳)有限公司 Video searching method, computer device and storage medium
CN109871763B (en) * 2019-01-16 2020-11-06 清华大学 Specific target tracking method based on YOLO
CN113255828B (en) * 2021-06-17 2021-10-15 长沙海信智能系统研究院有限公司 Feature retrieval method, device, equipment and computer storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2412471A1 (en) * 2002-12-17 2004-06-17 Concordia University A framework and a system for semantic content extraction in video sequences
CN103150375A (en) * 2013-03-11 2013-06-12 浙江捷尚视觉科技有限公司 Quick video retrieval system and quick video retrieval method for video detection
CN104182959A (en) * 2013-05-22 2014-12-03 浙江大华技术股份有限公司 Target searching method and target searching device
CN105808732A (en) * 2016-03-10 2016-07-27 北京大学 Integration target attribute identification and precise retrieval method based on depth measurement learning
CN106033426A (en) * 2015-03-11 2016-10-19 中国科学院西安光学精密机械研究所 A latent semantic min-Hash-based image retrieval method
CN106227851A (en) * 2016-07-29 2016-12-14 汤平 Based on the image search method searched for by depth of seam division that degree of depth convolutional neural networks is end-to-end
CN106407352A (en) * 2016-09-06 2017-02-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 Traffic image retrieval method based on depth learning
CN106682233A (en) * 2017-01-16 2017-05-17 华侨大学 Method for Hash image retrieval based on deep learning and local feature fusion


Also Published As

Publication number Publication date
CN107169106A (en) 2017-09-15

Similar Documents

Publication Publication Date Title
CN107169106B (en) Video retrieval method, device, storage medium and processor
TWI623842B (en) Image search and method and device for acquiring image text information
CN104679818B (en) A kind of video key frame extracting method and system
US9373040B2 (en) Image matching using motion manifolds
CN106599226A (en) Content recommendation method and content recommendation system
CN107818307B (en) Multi-label video event detection method based on LSTM network
CN102165464A (en) Method and system for automated annotation of persons in video content
CN103150375A (en) Quick video retrieval system and quick video retrieval method for video detection
CN104239420A (en) Video fingerprinting-based video similarity matching method
CN102890700A (en) Method for retrieving similar video clips based on sports competition videos
CN107103615A (en) A kind of monitor video target lock-on tracing system and track lock method
CN101872415A (en) Video copying detection method being suitable for IPTV
CN107229710A (en) A kind of video analysis method accorded with based on local feature description
CN105589974A (en) Surveillance video retrieval method and system based on Hadoop platform
CN112258254B (en) Internet advertisement risk monitoring method and system based on big data architecture
CN104317946A (en) Multi-key image-based image content retrieval method
Parihar et al. Multiview video summarization using video partitioning and clustering
CN103187083B (en) A kind of storage means based on time domain video fusion and system thereof
CN109241315B (en) Rapid face retrieval method based on deep learning
Lv et al. Efficient large scale near-duplicate video detection base on spark
Hezel et al. Video search with sub-image keyword transfer using existing image archives
Tseytlin et al. Content based video retrieval system for distorted video queries
Hu et al. STRNN: End-to-end deep learning framework for video partial copy detection
CN113010731B (en) Multimodal video retrieval system
CN112069331A (en) Data processing method, data retrieval method, data processing device, data retrieval device, data processing equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant