Disclosure of Invention
The invention aims to provide a multimodal video retrieval system that improves the efficiency and accuracy of video retrieval.
According to a first aspect of the present invention, there is provided a multimodal video retrieval system, including an information interaction interface, a pre-constructed REID pedestrian recognition model, a pre-constructed multimodal video database, a processor, and a memory storing a computer program, where the video database includes a plurality of video segment data records, each video segment data record includes a video segment and fields for time information, location information, a REID feature vector, a face feature vector, and a gait feature vector, the REID feature vector is obtained based on the REID pedestrian recognition model, each video segment contains images of the same person continuously collected by the same camera, and continuous collection means that the collection time interval is smaller than a preset time interval threshold; when the computer program is executed by the processor, the following steps are implemented:
step S1, acquiring an image of an object to be retrieved, and inputting the image of the object to be retrieved into the REID pedestrian recognition model to acquire a REID feature vector as the REID feature vector to be detected;
step S2, traversing all REID feature vectors in the video database according to the REID feature vector to be detected, and acquiring the video segments corresponding to all REID feature vectors whose similarity with the REID feature vector to be detected is greater than a preset REID feature similarity threshold, to form a first video segment set;
step S3, acquiring a face feature vector corresponding to each video segment in the current first video segment set as a target face feature vector;
step S4, traversing all face feature vectors in the video database according to each target face feature vector, acquiring the video segments corresponding to all face feature vectors whose similarity with the target face feature vector is greater than a preset face feature similarity threshold, obtaining a second video segment set corresponding to each target face feature vector, and merging all the second video segment sets with the first video segment set to obtain a third video segment set;
step S5, acquiring a gait feature vector corresponding to each video segment in the current third video segment set as a target gait feature vector;
step S6, traversing all gait feature vectors in the video database according to each target gait feature vector, acquiring the video segments corresponding to all gait feature vectors whose similarity with the target gait feature vector is greater than a preset gait feature similarity threshold, obtaining a fourth video segment set corresponding to each target gait feature vector, and merging all the fourth video segment sets with the third video segment set to obtain a fifth video segment set;
step S7, acquiring a to-be-processed image corresponding to each video segment in the current fifth video segment set, and inputting each to-be-processed image into the REID pedestrian identification model to obtain a corresponding REID feature vector as a target REID feature vector;
step S8, traversing all REID feature vectors in the video database according to each target REID feature vector, acquiring the video segments corresponding to all REID feature vectors whose similarity with the target REID feature vector is greater than the preset REID feature similarity threshold to form a sixth video segment set corresponding to each target REID feature vector, and merging all the sixth video segment sets with the fifth video segment set to obtain a new first video segment set;
step S9, judging whether all the second video segment sets, all the fourth video segment sets, and all the sixth video segment sets are empty sets, that is, whether the current iteration found no new video segments; if so, determining the current first video segment set as the target video segment set and ending the process; otherwise, returning to step S3.
Compared with the prior art, the invention has obvious advantages and beneficial effects. By means of the above technical scheme, the multimodal video retrieval system provided by the invention achieves considerable technical progress and practicability, has wide industrial utilization value, and at least has the following advantage:
the invention performs multimodal video retrieval based on the REID feature vector, the face feature vector, and the gait feature vector, thereby improving the efficiency and the accuracy of video retrieval.
The foregoing description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood and implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Detailed Description
To further illustrate the technical means adopted by the present invention to achieve the predetermined objects, and their effects, a specific embodiment of the multimodal video retrieval system according to the present invention is described in detail below with reference to the accompanying drawings and preferred embodiments.
The embodiment of the invention provides a multimodal video retrieval system, which comprises an information interaction interface, a pre-constructed REID (pedestrian re-identification) pedestrian recognition model, a pre-constructed multimodal video database, a processor, and a memory storing a computer program. The video database comprises a plurality of video segment data records, and each video segment data record comprises a video segment and fields for time information, position information, a REID feature vector, a face feature vector, and a gait feature vector. The REID feature vector is obtained based on the REID pedestrian recognition model. Each video segment contains images of the same person continuously acquired by the same camera, where continuous acquisition means that the acquisition time interval is smaller than a preset time interval threshold; the time interval threshold may be set, for example, to the time a person needs to walk across the farthest extent of the camera's field of view. When executed by the processor, the computer program implements the following steps:
step S1, acquiring an image of an object to be retrieved, and inputting the image of the object to be retrieved into the REID pedestrian recognition model to acquire a REID feature vector as the REID feature vector to be detected;
it is understood that the image of the object to be retrieved is an image containing the target object, and may be extracted from a video known to contain the target object, or directly obtained from a set of images known to contain the target object.
Step S2, traversing all REID feature vectors in the video database according to the REID feature vector to be detected, and acquiring the video segments corresponding to all REID feature vectors whose similarity with the REID feature vector to be detected is greater than a preset REID feature similarity threshold, to form a first video segment set;
step S3, acquiring a face feature vector corresponding to each video segment in the current first video segment set as a target face feature vector;
step S4, traversing all face feature vectors in the video database according to each target face feature vector, acquiring the video segments corresponding to all face feature vectors whose similarity with the target face feature vector is greater than a preset face feature similarity threshold, obtaining a second video segment set corresponding to each target face feature vector, and merging all the second video segment sets with the first video segment set to obtain a third video segment set;
step S5, acquiring a gait feature vector corresponding to each video segment in the current third video segment set as a target gait feature vector;
step S6, traversing all gait feature vectors in the video database according to each target gait feature vector, acquiring the video segments corresponding to all gait feature vectors whose similarity with the target gait feature vector is greater than a preset gait feature similarity threshold, obtaining a fourth video segment set corresponding to each target gait feature vector, and merging all the fourth video segment sets with the third video segment set to obtain a fifth video segment set;
step S7, acquiring a to-be-processed image corresponding to each video segment in the current fifth video segment set, and inputting each to-be-processed image into the REID pedestrian identification model to obtain a corresponding REID feature vector as a target REID feature vector;
step S8, traversing all REID feature vectors in the video database according to each target REID feature vector, acquiring the video segments corresponding to all REID feature vectors whose similarity with the target REID feature vector is greater than the preset REID feature similarity threshold to form a sixth video segment set corresponding to each target REID feature vector, and merging all the sixth video segment sets with the fifth video segment set to obtain a new first video segment set;
step S9, judging whether all the second video segment sets, all the fourth video segment sets, and all the sixth video segment sets are empty sets, that is, whether the current iteration found no new video segments; if so, determining the current first video segment set as the target video segment set and ending the process; otherwise, returning to step S3.
The embodiment of the invention performs retrieval iteratively on the basis of the multimodal REID, face, and gait feature vectors, so that the target video segment set can be retrieved from the video library as comprehensively as possible. Because retrieval is based on feature vector similarity comparison, manual operation cost is greatly reduced, and retrieval speed and efficiency are high. The system of the invention can be physically implemented as one server or as a server group comprising a plurality of servers; those skilled in the art will appreciate that parameters such as the model and specification of the server do not affect the scope of the present invention.
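The iterative loop of steps S1 to S9 can be sketched in simplified form as follows. This is a minimal illustration under assumptions, not the patented implementation: the database is a set of in-memory records, similarity is plain cosine similarity, and the names `Segment`, `search`, and `multimodal_retrieve` are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    seg_id: int
    reid: tuple
    face: tuple
    gait: tuple

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(db, query_vec, field, threshold):
    # Video segments whose vector in `field` is similar enough to the query.
    return {s for s in db if cos_sim(getattr(s, field), query_vec) > threshold}

def multimodal_retrieve(db, probe_reid, thresholds):
    # Steps S1-S2: seed the result set with REID matches of the probe image.
    result = search(db, probe_reid, "reid", thresholds["reid"])
    while True:
        grew = False
        # Steps S3-S8: expand via the face, gait, and REID vectors of the set.
        for field in ("face", "gait", "reid"):
            new = set()
            for seg in list(result):
                new |= search(db, getattr(seg, field), field, thresholds[field])
            if new - result:
                result |= new
                grew = True
        # Step S9: stop once a full pass adds no new video segments.
        if not grew:
            return result
```

The loop illustrates why convergence is guaranteed: the result set only grows, the database is finite, and the process halts as soon as one full face/gait/REID pass adds nothing.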
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently, or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. It can be understood that, for convenience of description and clarity of the technical solution, the embodiment of the present invention adopts a cyclic retrieval order based on the REID feature vector, then the face feature vector, then the gait feature vector; however, this order may be adjusted, and the modalities may be interleaved. For example, the cyclic retrieval may proceed in the order REID feature vector, gait feature vector, face feature vector, or may alternate as REID feature vector, gait feature vector, face feature vector, gait feature vector, REID feature vector, and so on, as long as the termination condition of convergence across all of the REID, gait, and face feature vectors is finally satisfied.
As an embodiment, when executed by the processor, the computer program further implements a step of obtaining the video segments used to construct the video database, which specifically includes:
step S11, acquiring video data of each preset camera;
it is understood that the preset camera refers to a camera in the electronic fence where the target object to be retrieved may appear.
Step S12, acquiring the corresponding video segments from the video data of each preset camera by using the DeepSORT algorithm.
The DeepSORT algorithm can extract, from the video, the frames that each contain only a single person; the frames containing the same person, with acquisition time intervals smaller than the preset time interval threshold, are then combined in time order to form a video segment.
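The combination rule in step S12 can be sketched as follows. This is a simplified, hypothetical illustration: each frame is represented only by a timestamp and a person ID (assumed to come from DeepSORT tracking of one camera's video), and `build_segments` is not a name from the patent.

```python
def build_segments(frames, gap_threshold):
    """Group single-person frames from one camera into video segments.

    `frames` is a time-sorted list of (timestamp, person_id) pairs; a new
    segment starts whenever the gap between consecutive frames of the same
    person exceeds `gap_threshold` (the preset time interval threshold).
    """
    segments = {}  # person_id -> list of segments (each a list of timestamps)
    for ts, pid in frames:
        segs = segments.setdefault(pid, [])
        if segs and ts - segs[-1][-1] <= gap_threshold:
            segs[-1].append(ts)   # continuous acquisition: extend segment
        else:
            segs.append([ts])     # gap too large: start a new segment
    return segments
```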
As an embodiment, in order to avoid introducing noise and to improve the accuracy of the system, at least one of the positions between step S2 and step S3, between step S4 and step S5, and between step S6 and step S7 may include:
step S10, presenting each video segment in the current video segment set on the information interaction interface, receiving a deletion instruction input by a user for one or more video segments, and deleting the corresponding video segment from the current video segment set;
wherein if the step S10 is executed between the step S2 and the step S3, the current video segment set is the first video segment set; if said step S10 is performed between said step S4 and step S5, then said current video segment set is said third video segment set; if the step S10 is performed between the step S6 and the step S7, then the current video segment set is the fifth video segment set.
It is understood that, as a preferred embodiment, step S10 may be added between step S2 and step S3, between step S4 and step S5, and between step S6 and step S7, and verification is performed manually, so that manual and automatic processes are combined, and accuracy of system retrieval is further improved.
As an embodiment, the computer program, when executed by a processor, further performs the steps of:
step S100, training to obtain the REID pedestrian recognition model, which specifically comprises the following steps:
step S101, inputting video frame pictures with known person IDs, taken from video segments collected by the cameras in the preset electronic fence, as sample pictures into the neural network in a preset REID pedestrian recognition model framework;
step S102, the neural network extracts the contour features, color features, and texture features corresponding to each sample picture and generates the corresponding REID feature vector based on them, and the REID pedestrian recognition model framework predicts a predicted ID for each sample picture based on its REID feature vector;
step S103, adjusting the model parameters of the REID pedestrian recognition model framework based on the known ID and the predicted ID of each sample picture until the framework converges, so as to obtain the REID pedestrian recognition model.
It can be understood that, during training, the model output is the predicted ID, while the contour, color, and texture features corresponding to each sample picture are generated as intermediate results; once the model is trained, the corresponding neural network is also trained.
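The training loop of steps S101 to S103 can be illustrated with a deliberately tiny stand-in: instead of a CNN, `extract_features` passes through a small numeric vector, and a linear softmax classifier plays the role of the ID-prediction head whose parameters are adjusted from the known and predicted IDs. The names, and the choice of a cross-entropy loss, are assumptions made for illustration only.

```python
import math

def extract_features(image):
    # Stand-in for the neural network's contour/color/texture embedding;
    # here `image` is already a small numeric vector.
    return image

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_reid(samples, num_ids, dim, epochs=200, lr=0.5):
    # Steps S101-S103: predict an ID for each sample picture, compare it
    # with the known ID, and adjust the parameters until the IDs fit.
    W = [[0.0] * dim for _ in range(num_ids)]
    for _ in range(epochs):
        for image, known_id in samples:
            f = extract_features(image)
            logits = [sum(w * x for w, x in zip(row, f)) for row in W]
            p = softmax(logits)
            for i in range(num_ids):
                # Gradient of the cross-entropy loss w.r.t. the logits.
                g = p[i] - (1.0 if i == known_id else 0.0)
                for j in range(dim):
                    W[i][j] -= lr * g * f[j]
    return W

def predict_id(W, image):
    f = extract_features(image)
    logits = [sum(w * x for w, x in zip(row, f)) for row in W]
    return logits.index(max(logits))
```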
Video data of the same person is easy to associate within video collected by the same camera over a short time (an acquisition interval less than or equal to the preset time interval threshold), but is difficult to associate accurately across video collected by different cameras, or within video from the same camera when the acquisition interval is long (greater than the preset time interval threshold). The embodiment of the invention, based on the REID feature vector, can accurately and quickly associate the same person, exhibiting the same contour, color, and texture features, across different cameras within the same time period.
As an embodiment, the computer program, when executed by a processor, further performs the steps of:
step S110, obtaining REID feature vectors corresponding to each video segment based on the REID pedestrian recognition model, specifically including:
step S111, acquiring, for each video frame picture in each video segment, the proportion of the picture occupied by the person contour;
step S112, taking the video frame picture in each video segment in which the person contour occupies the largest proportion of the picture as the first image to be processed corresponding to that video segment;
step S113, inputting each first image to be processed into the REID pedestrian recognition model, and outputting REID feature vectors generated by the neural network in the REID pedestrian recognition model, so as to obtain REID feature vectors corresponding to each video segment.
The REID feature vector corresponding to each video segment constructed through steps S110 to S113 is used to construct the corresponding REID feature vector field in the video database.
Using the video frame picture in which the person contour occupies the largest proportion of the picture as the first image to be processed gives the extraction of contour, color, and texture features the most image information to work with; this improves the accuracy and reliability of the resulting REID feature vector and, in turn, the reliability and accuracy of video retrieval.
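The frame selection of steps S111 and S112 reduces to an argmax over contour-to-picture area ratios; a minimal sketch, with a hypothetical tuple layout for each frame, is:

```python
def pick_first_image(segment_frames):
    # segment_frames: (frame, contour_area, frame_area) per video frame
    # picture of one video segment. Step S112: choose the frame whose person
    # contour fills the largest proportion of the picture.
    return max(segment_frames, key=lambda t: t[1] / t[2])[0]
```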
It can be understood that, in step S1, the REID feature vector of the image of the object to be retrieved is obtained from the REID pedestrian recognition model in the same manner as in step S113, and the details are not repeated here.
As an embodiment, the computer program, when executed by a processor, further performs the steps of:
step S200, acquiring a face feature vector corresponding to each video segment, specifically including:
step S201, acquiring, one by one, the face recognition confidence of each video frame image in each video segment, comparing it with a preset face recognition confidence threshold, and, if the confidence is greater than or equal to the threshold, taking the current video frame image as the second image to be processed corresponding to that video segment;
step S202, extracting a face feature vector based on the second image to be processed as a face feature vector corresponding to the video segment.
The face feature vector corresponding to each video segment constructed through steps S200 to S202 is used to construct a corresponding face feature vector field in the video database.
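Step S201's frame-by-frame confidence check can be sketched as follows, assuming an upstream face detector has already attached a confidence score to each frame; the function name and data layout are illustrative only.

```python
def pick_face_frame(frames, conf_threshold):
    # Step S201: scan the frames of one video segment in order and return
    # the first frame whose face recognition confidence meets the threshold.
    # `frames` is a list of (frame, confidence) pairs.
    for frame, conf in frames:
        if conf >= conf_threshold:
            return frame
    return None  # no frame in this segment contains a confident enough face
```

Returning `None` covers segments in which no face is clearly visible; such segments would simply contribute no face feature vector to the database record.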
It should be noted that, in the video database, the gait feature vector corresponding to the video segment may be acquired by using an existing gait feature vector acquisition algorithm based on data of each video segment, and is not described herein.
As an embodiment, in step S2, step S4, step S6, and step S8, traversing all feature vectors to be detected in the video database according to each target feature vector, and acquiring the video segments corresponding to all feature vectors to be detected whose similarity with the target feature vector is greater than the corresponding preset similarity threshold, includes:
step S211, computing the cosine similarity between each feature vector to be detected in the video database and the target feature vector, and ranking the results in descending order of cosine similarity to obtain a cosine similarity ranking;
the cosine similarity algorithm is an existing algorithm and is not described herein.
step S212, adjusting and reordering the cosine similarity ranking through a re-ranking algorithm to obtain a target similarity ranking;
the ranking algorithm is the existing algorithm, description is not expanded here, the ranking effect can be improved based on the ranking algorithm, and accuracy and robustness of a calculation result are improved.
step S213, acquiring the video segments corresponding to all feature vectors to be detected in the target similarity ranking whose target similarity is greater than the corresponding preset similarity threshold.
When steps S211 to S213 correspond to step S2, the feature vector to be detected is a REID feature vector; when they correspond to step S4, it is a face feature vector; when they correspond to step S6, it is a gait feature vector; and when they correspond to step S8, it is again a REID feature vector.
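Steps S211 to S213 can be sketched as follows. The re-ranking step is left as an optional callable because the patent treats it as an existing algorithm; the names `rank_and_filter` and `db_vectors` are hypothetical.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_and_filter(db_vectors, target, threshold, rerank=None):
    # Step S211: rank database vectors by cosine similarity to the target,
    # in descending order. `db_vectors` maps segment IDs to vectors.
    ranked = sorted(((cos_sim(v, target), sid) for sid, v in db_vectors.items()),
                    reverse=True)
    # Step S212: optionally adjust the ranking with a re-ranking algorithm.
    if rerank is not None:
        ranked = rerank(ranked)
    # Step S213: keep only the segments above the similarity threshold.
    return [sid for sim, sid in ranked if sim > threshold]
```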
The method according to the embodiment of the invention is suitable for any scenario in which all video segments related to a target object are to be retrieved from the video database based on a known image of the target object. As an embodiment, when only the target video segment corresponding to the target object at a specific location and time is needed, the process can end as soon as that target video segment is acquired; in this case, when the computer program is executed by the processor, the following steps are further implemented:
step S300, receiving target time and target position information input by a user;
step S301, judging, in real time, whether any video segment set obtained so far contains a video segment corresponding to the target time and target position information; if so, determining that video segment as the target video segment and ending the process.
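Steps S300 and S301 amount to a match over the time and position fields of the retrieved video segment data records; a minimal sketch, with hypothetical field names and an assumed time tolerance parameter, is:

```python
def find_target_segment(segments, target_time, target_location, time_tol=0):
    # Step S301: return the ID of the first video segment whose time (within
    # a tolerance) and position match the user's request, or None if the
    # sets obtained so far contain no such segment.
    for seg in segments:
        if (abs(seg["time"] - target_time) <= time_tol
                and seg["location"] == target_location):
            return seg["id"]
    return None
```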
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.