CN113010731B - Multimodal video retrieval system

Multimodal video retrieval system

Info

Publication number
CN113010731B
CN113010731B
Authority
CN
China
Prior art keywords
video segment, video, REID, feature vector, target
Prior art date
Legal status: Active
Application number
CN202110197952.7A
Other languages
Chinese (zh)
Other versions
CN113010731A (en)
Inventor
董霖
俞锋锋
吕繁荣
陈津来
姚建明
Current Assignee
Hangzhou Xihu Data Intelligence Research Institute
Original Assignee
Hangzhou Xihu Data Intelligence Research Institute
Priority date
Filing date
Publication date
Application filed by Hangzhou Xihu Data Intelligence Research Institute filed Critical Hangzhou Xihu Data Intelligence Research Institute
Priority to CN202110197952.7A
Publication of CN113010731A
Application granted
Publication of CN113010731B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval of video data
    • G06F 16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 - Retrieval using metadata automatically derived from the content
    • G06F 16/7837 - Retrieval using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G06F 16/784 - Retrieval using metadata automatically derived from the content, the detected or recognised objects being people
    • G06F 16/73 - Querying


Abstract

The invention relates to a multimodal video retrieval system comprising an information interaction interface, a pre-constructed REID pedestrian recognition model, a pre-constructed multimodal video database, a processor, and a memory storing a computer program. The video database comprises a plurality of video segment data records; each record comprises a video segment, time information, position information, a REID feature vector, a face feature vector and a gait feature vector field, the REID feature vectors being obtained based on the REID pedestrian recognition model. The invention improves the efficiency and accuracy of video retrieval.

Description

Multimodal video retrieval system
Technical Field
The invention relates to the technical field of computers, and in particular to a multimodal video retrieval system.
Background
With the development of multimedia and network technology, a great deal of multimedia information is generated, in the form of pictures, sounds or videos. The number of videos in particular is growing rapidly, because video data is vivid, carries a large amount of information and is easily perceived by people, so video is now very widely used. Cameras are installed in many geographic areas to collect video information within those areas; once an emergency occurs, the historical camera video records need to be retrieved, and the activity track of a specific target needs to be searched and analyzed.
However, faced with such huge historical video records, the traditional retrieval mode of manual screening takes a long time, occupies huge resources and has low retrieval efficiency; because of the huge video volume, it also cannot ensure the comprehensiveness and accuracy of the retrieval result. Therefore, how to improve the efficiency and accuracy of video retrieval is an urgent technical problem to be solved.
Disclosure of Invention
The invention aims to provide a multimodal video retrieval system that improves the efficiency and accuracy of video retrieval.
According to a first aspect of the present invention, there is provided a multimodal video retrieval system, including an information interaction interface, a pre-constructed REID pedestrian recognition model, a pre-constructed multimodal video database, a processor, and a memory storing a computer program. The video database includes a plurality of video segment data records, each comprising a video segment, time information, location information, a REID feature vector, a face feature vector and a gait feature vector field, the REID feature vectors being obtained based on the REID pedestrian recognition model. Each video segment includes information of the same person continuously collected by the same camera, continuous collection meaning that the collection time interval is smaller than a preset time interval threshold. When the computer program is executed by the processor, the following steps are implemented:
step S1, acquiring an image of an object to be retrieved, and inputting the image of the object to be retrieved into the REID pedestrian recognition model to obtain a REID feature vector as the REID feature vector to be detected;
step S2, traversing all REID feature vectors in the video database according to the REID feature vector to be detected, acquiring the video segments corresponding to all REID feature vectors whose similarity to the REID feature vector to be detected is greater than a preset REID feature similarity threshold, and forming a first video segment set;
step S3, acquiring the face feature vector corresponding to each video segment in the current first video segment set as a target face feature vector;
step S4, traversing all face feature vectors in the video database according to each target face feature vector, acquiring the video segments corresponding to all face feature vectors whose similarity to the target face feature vector is greater than a preset face feature similarity threshold to obtain a second video segment set corresponding to each target face feature vector, and merging all the second video segment sets with the first video segment set to obtain a third video segment set;
step S5, acquiring the gait feature vector corresponding to each video segment in the current third video segment set as a target gait feature vector;
step S6, traversing all gait feature vectors in the video database according to each target gait feature vector, acquiring the video segments corresponding to all gait feature vectors whose similarity to the target gait feature vector is greater than a preset gait feature similarity threshold to obtain a fourth video segment set corresponding to each target gait feature vector, and merging all the fourth video segment sets with the third video segment set to obtain a fifth video segment set;
step S7, acquiring the to-be-processed image corresponding to each video segment in the current fifth video segment set, and inputting each to-be-processed image into the REID pedestrian recognition model to obtain the corresponding REID feature vector as a target REID feature vector;
step S8, traversing all REID feature vectors in the video database according to each target REID feature vector, acquiring the video segments corresponding to all REID feature vectors whose similarity to the target REID feature vector is greater than the preset REID feature similarity threshold to form a sixth video segment set corresponding to each target REID feature vector, and merging all the sixth video segment sets with the fifth video segment set to obtain a new first video segment set;
step S9, judging whether all the second video segment sets, all the fourth video segment sets and all the sixth video segment sets are empty sets; if so, determining the current first video segment set as the target video segment set and ending the process; otherwise, returning to step S3.
Compared with the prior art, the invention has obvious advantages and beneficial effects. By means of the above technical scheme, the multimodal video retrieval system provided by the invention achieves considerable technical progress and practicability, has wide industrial utilization value, and offers at least the following advantages:
The invention carries out multimodal video retrieval based on the REID feature vector, the face feature vector and the gait feature vector, thereby improving the efficiency and accuracy of video retrieval.
The foregoing is only an overview of the technical solutions of the present invention. In order to make the technical means of the present invention clearer, so that the invention may be implemented in accordance with the content of the description, and to make the above and other objects, features and advantages of the present invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic diagram of a multimodal video retrieval system according to an embodiment of the present invention.
Detailed Description
To further illustrate the technical means and effects adopted by the present invention to achieve the predetermined objects, a specific embodiment of the multimodal video retrieval system according to the present invention, and its effects, are described in detail below with reference to the accompanying drawings and preferred embodiments.
The embodiment of the invention provides a multimodal video retrieval system, which comprises an information interaction interface, a pre-constructed REID pedestrian recognition model, a pre-constructed multimodal video database, a processor, and a memory storing a computer program. The video database comprises a plurality of video segment data records; each video segment data record comprises a video segment, time information, position information, a REID (pedestrian re-identification) feature vector, a face feature vector and a gait feature vector field, the REID feature vector being obtained based on the REID pedestrian recognition model. Each video segment comprises information of the same person continuously acquired by the same camera, continuous acquisition meaning that the acquisition time interval is smaller than a preset time interval threshold; the time interval threshold may be set to the time required for a person to walk the farthest distance within the camera's image acquisition range. When executed by the processor, the computer program implements the following steps:
step S1, acquiring an image of an object to be retrieved, and inputting the image of the object to be retrieved into the REID pedestrian recognition model to obtain a REID feature vector as the REID feature vector to be detected;
It can be understood that the image of the object to be retrieved is an image containing the target object; it may be extracted from a video known to contain the target object, or obtained directly from a set of images known to contain the target object.
step S2, traversing all REID feature vectors in the video database according to the REID feature vector to be detected, acquiring the video segments corresponding to all REID feature vectors whose similarity to the REID feature vector to be detected is greater than a preset REID feature similarity threshold, and forming a first video segment set;
step S3, acquiring the face feature vector corresponding to each video segment in the current first video segment set as a target face feature vector;
step S4, traversing all face feature vectors in the video database according to each target face feature vector, acquiring the video segments corresponding to all face feature vectors whose similarity to the target face feature vector is greater than a preset face feature similarity threshold to obtain a second video segment set corresponding to each target face feature vector, and merging all the second video segment sets with the first video segment set to obtain a third video segment set;
step S5, acquiring the gait feature vector corresponding to each video segment in the current third video segment set as a target gait feature vector;
step S6, traversing all gait feature vectors in the video database according to each target gait feature vector, acquiring the video segments corresponding to all gait feature vectors whose similarity to the target gait feature vector is greater than a preset gait feature similarity threshold to obtain a fourth video segment set corresponding to each target gait feature vector, and merging all the fourth video segment sets with the third video segment set to obtain a fifth video segment set;
step S7, acquiring the to-be-processed image corresponding to each video segment in the current fifth video segment set, and inputting each to-be-processed image into the REID pedestrian recognition model to obtain the corresponding REID feature vector as a target REID feature vector;
step S8, traversing all REID feature vectors in the video database according to each target REID feature vector, acquiring the video segments corresponding to all REID feature vectors whose similarity to the target REID feature vector is greater than the preset REID feature similarity threshold to form a sixth video segment set corresponding to each target REID feature vector, and merging all the sixth video segment sets with the fifth video segment set to obtain a new first video segment set;
step S9, judging whether all the second video segment sets, all the fourth video segment sets and all the sixth video segment sets are empty sets; if so, determining the current first video segment set as the target video segment set and ending the process; otherwise, returning to step S3.
The embodiment of the invention takes the multimodal features (the REID feature vector, the face feature vector and the gait feature vector) as the basis and carries out the retrieval iteratively, so the target video segment set can be retrieved from the video library as comprehensively as possible; because the retrieval is based on feature vector similarity comparison, the manual operation cost is greatly reduced, and the retrieval speed and retrieval efficiency are high. According to the invention, the system can be physically implemented as one server or as a server group comprising a plurality of servers. Those skilled in the art will appreciate that parameters such as the model and specification of the server do not affect the scope of the present invention.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. It can be understood that, for convenience of description and clarity of the technical solution, the embodiment of the present invention adopts a cyclic retrieval in the order REID feature vector, face feature vector, gait feature vector; however, this order may be adjusted, and the modalities may be interleaved. For example, the cyclic retrieval may proceed in the order REID feature vector, gait feature vector, face feature vector, or in an interleaved order such as REID feature vector, gait feature vector, face feature vector, gait feature vector, REID feature vector, and so on, as long as the retrieval finally satisfies the convergence condition for all of the REID feature vector, gait feature vector and face feature vector.
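For clarity, the loop of steps S1 to S9 can be summarized in code. The following is a minimal sketch only, not the patented implementation: the helpers search(field, query_vec, threshold) (returning the set of segments whose stored vector for the given field has similarity above the threshold), vector_of(seg, field) and reid_model are hypothetical, and the empty-set test of step S9 is read as "no modality contributed a segment not already in the result set".

    def retrieve(query_image, reid_model, search, vector_of, thresholds=None):
        # Hypothetical helpers: search(field, vec, thr) returns the segments whose
        # stored vector for `field` has similarity > thr to vec; vector_of(seg, field)
        # reads a stored vector; reid_model maps an image to a REID feature vector.
        thresholds = thresholds or {"reid": 0.8, "face": 0.8, "gait": 0.8}
        # Steps S1-S2: seed the first video segment set from the query image.
        result = search("reid", reid_model(query_image), thresholds["reid"])
        while True:
            newly_added = set()
            # Steps S3-S8: expand by the face, gait and REID vectors of every
            # segment currently in the result set, merging each round of hits.
            for field in ("face", "gait", "reid"):
                hits = set()
                for seg in result:
                    hits |= search(field, vector_of(seg, field), thresholds[field])
                newly_added |= hits - result
                result |= hits
            # Step S9: converged once no modality contributed a new segment.
            if not newly_added:
                return result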
As an embodiment, when the computer program is executed by a processor, step 10 of obtaining the video segments for constructing the video database is further implemented, which specifically includes:
step S11, acquiring video data of each preset camera;
it is understood that the preset camera refers to a camera in the electronic fence where the target object to be retrieved may appear.
step S12, acquiring the corresponding video segments from the video data of each preset camera by adopting a deepsort algorithm.
The deepsort algorithm can extract, from each video, the video frames that contain only single-person information, and then combine the video frames containing the same person information, whose collection time intervals are smaller than the preset time interval threshold, in time order to form the video segments.
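As an illustration only, the grouping of tracked frames into video segments could look like the following sketch; it assumes the tracker (for example a DeepSORT implementation) has already produced, per camera, (timestamp, track_id, frame) tuples for frames containing a single person, and gap_threshold_s stands for the preset time interval threshold.

    from collections import defaultdict

    def build_segments(tracked_frames, gap_threshold_s=5.0):
        # tracked_frames: iterable of (timestamp, track_id, frame) for one camera.
        by_track = defaultdict(list)
        for ts, track_id, frame in sorted(tracked_frames, key=lambda t: t[0]):
            by_track[track_id].append((ts, frame))
        segments = []
        for track_id, frames in by_track.items():
            current = [frames[0]]
            for prev, cur in zip(frames, frames[1:]):
                if cur[0] - prev[0] < gap_threshold_s:
                    current.append(cur)      # continuous collection: same segment
                else:
                    segments.append((track_id, current))
                    current = [cur]          # interval too large: start a new segment
            segments.append((track_id, current))
        return segments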
In order to avoid introducing noise and to improve the accuracy of the system, as an embodiment, at least one of the positions between step S2 and step S3, between step S4 and step S5, and between step S6 and step S7 may include:
step S10, presenting each video segment in the current video segment set on the information interaction interface, receiving a deletion instruction input by a user for one or more video segments, and deleting the corresponding video segment from the current video segment set;
wherein if the step S10 is executed between the step S2 and the step S3, the current video segment set is the first video segment set; if said step S10 is performed between said step S4 and step S5, then said current video segment set is said third video segment set; if the step S10 is performed between the step S6 and the step S7, then the current video segment set is the fifth video segment set.
It is understood that, as a preferred embodiment, step S10 may be added between step S2 and step S3, between step S4 and step S5, and between step S6 and step S7; manual verification is thereby combined with the automatic process, further improving the accuracy of system retrieval.
As an embodiment, the computer program, when executed by a processor, further performs the following steps:
step S100, training to obtain the REID pedestrian recognition model, which specifically comprises the following steps:
step S101, inputting video frame pictures with known person IDs, taken from video segments collected by cameras in the preset geo-fence, as sample pictures into the neural network in a preset REID pedestrian recognition model framework;
step S102, the neural network extracts the contour feature, color feature and texture feature corresponding to each sample picture and generates the corresponding REID feature vector based on them, and the REID pedestrian recognition model framework predicts a predicted ID for each sample picture based on its REID feature vector;
step S103, adjusting the model parameters of the REID pedestrian recognition model framework based on the known IDs and the predicted IDs of the sample pictures until the REID pedestrian recognition model framework converges, to obtain the REID pedestrian recognition model.
It can be understood that, during model training, the model output is the predicted ID, while the contour feature, color feature and texture feature corresponding to each sample picture are generated as intermediate results; once the model is trained, the corresponding neural network is trained as well.
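A minimal PyTorch-style sketch of steps S101 to S103 follows; the tiny backbone, the feature dimension and the training-step details are illustrative assumptions standing in for the neural network that extracts the contour, color and texture features.

    import torch.nn as nn

    class ReidModel(nn.Module):
        def __init__(self, num_ids, feat_dim=256):
            super().__init__()
            self.backbone = nn.Sequential(        # feature extractor (step S102)
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim),
            )
            self.classifier = nn.Linear(feat_dim, num_ids)  # ID prediction head

        def forward(self, x):
            feat = self.backbone(x)               # REID feature vector
            return feat, self.classifier(feat)    # feature + predicted-ID logits

    def train_step(model, optimizer, images, known_ids):
        # One parameter update (step S103): the predicted IDs are compared with
        # the known IDs and the loss drives the framework toward convergence.
        _, logits = model(images)
        loss = nn.functional.cross_entropy(logits, known_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()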
Video data of the same person is easily associated within video data collected by the same camera over a short time (an interval less than or equal to the preset time interval threshold); however, it is difficult to accurately associate video data of the same person across different cameras, or within video data collected by the same camera at a long collection interval (greater than the preset time interval threshold). Based on the REID feature vector, the embodiment of the invention can accurately and quickly associate the same person, having the same contour, color and texture features, across different cameras in the same time period.
As an embodiment, the computer program, when executed by a processor, further performs the following steps:
step S110, obtaining the REID feature vector corresponding to each video segment based on the REID pedestrian recognition model, which specifically includes:
step S111, acquiring, for each video segment, the proportion of each video frame picture occupied by the person's outline;
step S112, taking the video frame picture in which the person's outline occupies the largest proportion as the first to-be-processed image corresponding to the video segment;
step S113, inputting each first to-be-processed image into the REID pedestrian recognition model, and outputting the REID feature vector generated by the neural network in the REID pedestrian recognition model, so as to obtain the REID feature vector corresponding to each video segment.
The REID feature vector corresponding to each video segment constructed through steps S110 to S113 is used to construct the corresponding REID feature vector field in the video database.
Using the video frame picture in which the person's outline occupies the largest proportion as the first to-be-processed image corresponding to the video segment improves the accuracy and reliability of the image from which the contour feature, color feature and texture feature are extracted, thereby improving the accuracy and reliability of the resulting REID feature vector and, in turn, the reliability and accuracy of video retrieval.
It can be understood that, in step S1, the REID feature vector of the image of the object to be retrieved is obtained from the REID pedestrian recognition model in the same manner as in step S113, and the details are not repeated here.
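A sketch of steps S111 to S113, assuming a hypothetical person_mask(frame) that returns a binary mask of the person's outline and a trained reid_model callable that maps an image to its feature vector:

    import numpy as np

    def reid_vector_for_segment(frames, person_mask, reid_model):
        # Step S111: proportion of each frame's area covered by the person's outline.
        ratios = [float(person_mask(frame).mean()) for frame in frames]
        # Step S112: the frame with the largest proportion becomes the first
        # to-be-processed image for this segment.
        best_frame = frames[int(np.argmax(ratios))]
        # Step S113: the feature vector produced for that frame is stored as the
        # segment's REID feature vector field.
        return reid_model(best_frame)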
As an embodiment, the computer program, when executed by a processor, further performs the following steps:
step S200, acquiring the face feature vector corresponding to each video segment, which specifically includes:
step S201, acquiring the face recognition confidence of the video frame images in each video segment one by one, comparing each confidence with a preset face recognition confidence threshold, and, if the confidence is greater than or equal to the threshold, taking the current video frame image as the second to-be-processed image corresponding to the video segment;
step S202, extracting a face feature vector based on the second image to be processed as a face feature vector corresponding to the video segment.
The face feature vector corresponding to each video segment constructed through steps S200 to S202 is used to construct a corresponding face feature vector field in the video database.
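Steps S201 and S202 could be sketched as follows, assuming a hypothetical detect_face(frame) returning a (confidence, aligned_face) pair and a hypothetical face_embed embedding network:

    def face_vector_for_segment(frames, detect_face, face_embed, conf_threshold=0.9):
        for frame in frames:
            confidence, aligned_face = detect_face(frame)
            # Step S201: the first frame whose face recognition confidence reaches
            # the preset threshold becomes the second to-be-processed image.
            if confidence >= conf_threshold:
                # Step S202: extract the face feature vector from that image.
                return face_embed(aligned_face)
        return None  # no frame in the segment shows a sufficiently clear face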
It should be noted that, in the video database, the gait feature vector corresponding to the video segment may be acquired by using an existing gait feature vector acquisition algorithm based on data of each video segment, and is not described herein.
As an embodiment, in step S2, step S4, step S6 and step S8, traversing all to-be-detected feature vectors in the video database according to each target to-be-detected feature vector, and acquiring the video segments corresponding to all to-be-detected feature vectors whose similarity to the target to-be-detected feature vector is greater than a preset to-be-detected feature vector similarity threshold, includes:
step S211, obtaining the cosine similarity between each to-be-detected feature vector in the video database and the target to-be-detected feature vector, and obtaining a cosine similarity ranking in descending order of cosine similarity;
The cosine similarity algorithm is an existing algorithm and is not described here.
step S212, adjusting and reordering the cosine similarity ranking through a ranking algorithm to obtain a target similarity ranking;
The ranking algorithm is an existing algorithm and is not expanded upon here; reordering based on it improves the ranking, which improves the accuracy and robustness of the calculation result.
step S213, acquiring the video segments corresponding to all to-be-detected feature vectors whose target similarity in the target similarity ranking is greater than the preset to-be-detected feature vector similarity threshold;
when steps S211 to S213 correspond to step S2, the to-be-detected feature vector is an REID feature vector to be detected; when step S211 to step S213 correspond to step S4, the feature vector to be measured is a face feature vector; when step S211-step S213 correspond to step S6, the feature vector to be measured is a gait feature vector; when steps S211 to S213 correspond to step S8, the feature vector to be measured is a REID feature vector.
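Steps S211 to S213 amount to a cosine similarity search followed by an optional reordering pass; a minimal NumPy sketch is given below, with the ranking step left as a pluggable hook because the patent only states that an existing ranking algorithm is used.

    import numpy as np

    def similar_segments(target_vec, db_vecs, db_segments, threshold, rerank=None):
        v = target_vec / np.linalg.norm(target_vec)
        m = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
        sims = m @ v                                  # step S211: cosine similarities
        order = np.argsort(-sims)                     # descending similarity ranking
        ranked = [(db_segments[i], float(sims[i])) for i in order]
        if rerank is not None:                        # step S212: optional reordering
            ranked = rerank(ranked)
        # Step S213: keep segments above the preset similarity threshold.
        return [seg for seg, sim in ranked if sim > threshold]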
The method according to the embodiment of the invention is suitable for all scenarios in which the video segment sets related to a target object are retrieved from the video database based on an image of the known target object. As an embodiment, when only the target video segment corresponding to the target object at a specific time and location is needed, the process can end as soon as that target video segment is acquired; when the computer program is executed by the processor, the following steps are further implemented:
step S300, receiving target time and target position information input by a user;
step S301, judging in real time whether a video segment corresponding to the target time and target position information exists in each obtained video segment set; if so, determining that video segment as the target video segment and ending the process.
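A sketch of steps S300 and S301, assuming each video segment record carries time and location fields and that a caller-supplied matches predicate encodes the tolerance for "corresponding to" the queried time and position:

    def find_target_segment(segment_sets, target_time, target_location, matches):
        for segment_set in segment_sets:   # checked in real time as sets are produced
            for seg in segment_set:
                if matches(seg.time, target_time) and matches(seg.location, target_location):
                    return seg             # target video segment found; end the process
        return None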
Although the present invention has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A multimodal video retrieval system, characterized in that
the system comprises an information interaction interface, a pre-constructed REID pedestrian recognition model, a pre-constructed multimodal video database, a processor and a memory, wherein the memory is used for storing a computer program, the video database comprises a plurality of video segment data records, the video segment data records comprise video segments, time information, position information, REID feature vectors, face feature vectors and gait feature vector fields, the REID feature vectors are obtained based on the REID pedestrian recognition model, each video segment comprises information of the same person continuously collected by the same camera, continuous collection meaning that the collection time interval is smaller than a preset time interval threshold, and when the computer program is executed by the processor, the following steps are realized:
step S1, acquiring an image of an object to be retrieved, and inputting the image of the object to be retrieved into the REID pedestrian recognition model to obtain a REID feature vector as the REID feature vector to be detected;
step S2, traversing all REID feature vectors in the video database according to the REID feature vector to be detected, acquiring the video segments corresponding to all REID feature vectors whose similarity to the REID feature vector to be detected is greater than a preset REID feature similarity threshold, and forming a first video segment set;
step S3, acquiring the face feature vector corresponding to each video segment in the current first video segment set as a target face feature vector;
step S4, traversing all face feature vectors in the video database according to each target face feature vector, acquiring the video segments corresponding to all face feature vectors whose similarity to the target face feature vector is greater than a preset face feature similarity threshold to obtain a second video segment set corresponding to each target face feature vector, and merging all the second video segment sets with the first video segment set to obtain a third video segment set;
step S5, acquiring the gait feature vector corresponding to each video segment in the current third video segment set as a target gait feature vector;
step S6, traversing all gait feature vectors in the video database according to each target gait feature vector, acquiring the video segments corresponding to all gait feature vectors whose similarity to the target gait feature vector is greater than a preset gait feature similarity threshold to obtain a fourth video segment set corresponding to each target gait feature vector, and merging all the fourth video segment sets with the third video segment set to obtain a fifth video segment set;
step S7, acquiring the to-be-processed image corresponding to each video segment in the current fifth video segment set, and inputting the to-be-processed image corresponding to each video segment into the REID pedestrian recognition model to obtain the corresponding REID feature vector as a target REID feature vector;
step S8, traversing all REID feature vectors in the video database according to each target REID feature vector, acquiring the video segments corresponding to all REID feature vectors whose similarity to the target REID feature vector is greater than the preset REID feature similarity threshold to form a sixth video segment set corresponding to each target REID feature vector, and merging all the sixth video segment sets with the fifth video segment set to obtain a new first video segment set;
step S9, judging whether all the second video segment sets, all the fourth video segment sets and all the sixth video segment sets are empty sets; if so, determining the current first video segment set as the target video segment set and ending the process; otherwise, returning to step S3.
2. The system of claim 1,
wherein, when the computer program is executed by a processor, step 10 of obtaining video segments for constructing the video database is further implemented, specifically including:
step S11, acquiring video data of each preset camera;
step S12, acquiring the corresponding video segments from the video data of each preset camera by adopting a deepsort algorithm.
3. The system of claim 1,
wherein at least one of the positions between step S2 and step S3, between step S4 and step S5, and between step S6 and step S7 includes:
step S10, presenting each video segment in the current video segment set on the information interaction interface, receiving a deletion instruction input by a user for one or more video segments, and deleting the corresponding video segment from the current video segment set;
wherein if the step S10 is executed between the step S2 and the step S3, the current video segment set is the first video segment set; if said step S10 is performed between said step S4 and step S5, then said current video segment set is said third video segment set; if the step S10 is performed between the step S6 and the step S7, then the current video segment set is the fifth video segment set.
4. The system of claim 1 or 3,
wherein the computer program, when executed by a processor, further performs the steps of:
step S100, training to obtain the REID pedestrian recognition model, which specifically comprises the following steps:
step S101, inputting video frame pictures with known person IDs, taken from video segments collected by cameras in the preset geo-fence, as sample pictures into the neural network in a preset REID pedestrian recognition model framework;
step S102, the neural network extracts the contour feature, color feature and texture feature corresponding to each sample picture and generates the corresponding REID feature vector based on them, and the REID pedestrian recognition model framework predicts a predicted ID for each sample picture based on its REID feature vector;
step S103, adjusting the model parameters of the REID pedestrian recognition model framework based on the known IDs and the predicted IDs of the sample pictures until the REID pedestrian recognition model framework converges, to obtain the REID pedestrian recognition model.
5. The system of claim 4,
wherein the computer program, when executed by a processor, further performs the steps of:
step S110, obtaining the REID feature vector corresponding to each video segment based on the REID pedestrian recognition model, specifically including:
step S111, acquiring, for each video segment, the proportion of each video frame picture occupied by the person's outline;
step S112, taking the video frame picture in which the person's outline occupies the largest proportion as the first to-be-processed image corresponding to the video segment;
step S113, inputting each first to-be-processed image into the REID pedestrian recognition model, and outputting the REID feature vector generated by the neural network in the REID pedestrian recognition model, so as to obtain the REID feature vector corresponding to each video segment.
6. The system of claim 1 or 3,
wherein the computer program, when executed by a processor, further performs the steps of:
step S200, acquiring the face feature vector corresponding to each video segment, specifically including:
step S201, acquiring the face recognition confidence of the video frame images in each video segment one by one, comparing each confidence with a preset face recognition confidence threshold, and, if the confidence is greater than or equal to the threshold, taking the current video frame image as the second to-be-processed image corresponding to the video segment;
step S202, extracting a face feature vector based on the second image to be processed as a face feature vector corresponding to the video segment.
7. The system of claim 1,
wherein, in step S2, step S4, step S6 and step S8, traversing all to-be-detected feature vectors in the video database according to each target to-be-detected feature vector, and acquiring the video segments corresponding to all to-be-detected feature vectors whose similarity to the target to-be-detected feature vector is greater than a preset to-be-detected feature vector similarity threshold, includes:
step S211, obtaining the cosine similarity between each to-be-detected feature vector in the video database and the target to-be-detected feature vector, and obtaining a cosine similarity ranking in descending order of cosine similarity;
step S212, adjusting and reordering the cosine similarity ranking through a ranking algorithm to obtain a target similarity ranking;
step S213, acquiring the video segments corresponding to all to-be-detected feature vectors whose target similarity in the target similarity ranking is greater than the preset to-be-detected feature vector similarity threshold;
when steps S211 to S213 correspond to step S2, the to-be-detected feature vector is an REID feature vector to be detected; when step S211 to step S213 correspond to step S4, the feature vector to be measured is a face feature vector; when step S211-step S213 correspond to step S6, the feature vector to be measured is a gait feature vector; when steps S211 to S213 correspond to step S8, the feature vector to be measured is a REID feature vector.
8. The system according to any one of claims 1 to 3,
wherein the computer program, when executed by a processor, further performs the steps of:
step S300, receiving target time and target position information input by a user;
step S301, judging in real time whether a video segment corresponding to the target time and target position information exists in each obtained video segment set; if so, determining that video segment as the target video segment and ending the process.
CN202110197952.7A 2021-02-22 2021-02-22 Multimodal video retrieval system Active CN113010731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110197952.7A 2021-02-22 2021-02-22 Multimodal video retrieval system


Publications (2)

Publication Number Publication Date
CN113010731A CN113010731A (en) 2021-06-22
CN113010731B (en) 2022-05-20

Family

ID=76406138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110197952.7A Active CN113010731B (en) 2021-02-22 2021-02-22 Multimodal video retrieval system

Country Status (1)

Country Link
CN (1) CN113010731B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468141B (en) * 2021-06-30 2023-09-22 杭州云深科技有限公司 Data processing system for generating APK primary key
CN114385859B (en) * 2021-12-29 2024-07-16 北京理工大学 Multi-mode retrieval method for video content


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3111455C (en) * 2018-09-12 2023-05-09 Avigilon Corporation System and method for improving speed of similarity based searches

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180079894A (en) * 2017-01-03 2018-07-11 한국전자통신연구원 System and method for providing face recognition information and server using the method
CN107590452A (en) * 2017-09-04 2018-01-16 武汉神目信息技术有限公司 A kind of personal identification method and device based on gait and face fusion
CN110750671A (en) * 2019-09-05 2020-02-04 浙江省北大信息技术高等研究院 Pedestrian retrieval method and device based on massive unstructured features
CN110674350A (en) * 2019-09-23 2020-01-10 网易(杭州)网络有限公司 Video character retrieval method, medium, device and computing equipment
CN111507232A (en) * 2020-04-10 2020-08-07 三一重工股份有限公司 Multi-mode multi-strategy fused stranger identification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multimodal Processing Technology for Video Analysis; Liu Meng (刘萌); China Doctoral Dissertations Full-text Database; 2019-09-15 (No. 9); pp. 30-80 *

Also Published As

Publication number Publication date
CN113010731A (en) 2021-06-22

Similar Documents

Publication Publication Date Title
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
Mou et al. Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network
CN111222500B (en) Label extraction method and device
WO2020177673A1 (en) Video sequence selection method, computer device and storage medium
CN107169106B (en) Video retrieval method, device, storage medium and processor
CN113010731B (en) Multimodal video retrieval system
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN113392866A (en) Image processing method and device based on artificial intelligence and storage medium
CN110941978B (en) Face clustering method and device for unidentified personnel and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN114550053A (en) Traffic accident responsibility determination method, device, computer equipment and storage medium
CN118072252B (en) Pedestrian re-recognition model training method suitable for arbitrary multi-mode data combination
CN115115855A (en) Training method, device, equipment and medium for image encoder
Qin et al. Depth estimation by parameter transfer with a lightweight model for single still images
Wang et al. Non-local attention association scheme for online multi-object tracking
CN116958267B (en) Pose processing method and device, electronic equipment and storage medium
CN117036392A (en) Image detection method and related device
CN112132175A (en) Object classification method and device, electronic equipment and storage medium
US11816148B1 (en) Sampling technique for data clustering
Tao et al. Learning modal and spatial features with lightweight 3D convolution for RGB guided depth completion
CN115690554A (en) Target identification method, system, electronic device and storage medium
CN118302801A (en) Video screening using machine-learned video screening model trained using self-supervised training
CN115995079A (en) Image semantic similarity analysis method and homosemantic image retrieval method
Prabakaran et al. Key frame extraction analysis based on optimized convolution neural network (ocnn) using intensity feature selection (ifs)
CN111814805A (en) Feature extraction network training method and related method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221216

Address after: Room 117, South Building, No. 55, Youcheqiao, Shuangqiao Village, Sandun Town, Xihu District, Hangzhou City, Zhejiang Province, 310030

Patentee after: Zhejiang Daily Interactive Research Institute Co.,Ltd.

Address before: Room 4303, building 4, Fudi Pioneer Park, 9 xidoumen Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Patentee before: Hangzhou Xihu data Intelligence Research Institute

TR01 Transfer of patent right

Effective date of registration: 20231109

Address after: Room 4303, building 4, Fudi Pioneer Park, 9 xidoumen Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Patentee after: Hangzhou Xihu data Intelligence Research Institute

Address before: Room 117, South Building, No. 55, Youcheqiao, Shuangqiao Village, Sandun Town, Xihu District, Hangzhou City, Zhejiang Province, 310030

Patentee before: Zhejiang Daily Interactive Research Institute Co.,Ltd.