CN114299428A - Cross-media video character recognition method and system - Google Patents

Cross-media video character recognition method and system

Info

Publication number
CN114299428A
CN114299428A
Authority
CN
China
Prior art keywords
video
face
video character
frame
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111598585.8A
Other languages
Chinese (zh)
Inventor
王晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Space Shichuang Chongqing Technology Co ltd
Original Assignee
Space Shichuang Chongqing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Space Shichuang Chongqing Technology Co ltd
Priority to CN202111598585.8A
Publication of CN114299428A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image recognition, and in particular discloses a cross-media video character recognition method and system. The method comprises the following steps: S1, acquiring a frontal picture of the video character to be recognized; S2, cropping the face from the frontal picture to generate a face picture; S3, converting the face picture into a feature vector; S4, extracting frames at intervals from a film featuring the video character, detecting faces in each frame and cropping them out; S5, converting each cropped face into a feature vector, comparing it with the feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm; and S7, judging whether tracking of the marked video character is finished, and if so, outputting the segment containing the video character. With the technical scheme of the invention, a video containing the target character can be generated automatically.

Description

Cross-media video character recognition method and system
Technical Field
The invention relates to the technical field of image recognition, and in particular to a cross-media video character recognition method and a cross-media video character recognition system.
Background
Since the last century, a huge volume of high-quality film and television resources has accumulated, covering movies, variety programs, television series and the like. With the development of filming technology and equipment, a large number of videos now approach or exceed one hour in duration. In recent years, as the pace of life has quickened, users prefer to spend their time on more compact short videos, and short-video sharing platforms have grown increasingly popular. Many creators on the Internet have also begun to use short videos to drive traffic to long-form film and television resources such as movies and television series.
For example, retrieving the segments in which a video character appears in the films featuring that character can help video producers cut clips, allow later audiences to select only the scenes in which a particular character appears, and improve the viewing experience.
To determine the segments in which a given video character appears in a film, the conventional scheme relies on dedicated operators performing manual marking. This not only consumes considerable effort and time but is also costly, and it can hardly meet the demand of processing today's massive video resources.
Therefore, a cross-media video character recognition method and system capable of automatically generating videos containing a target character are needed.
Disclosure of Invention
The invention provides a cross-media video character recognition method and system capable of automatically generating a video containing a target character.
In order to solve the technical problem, the present application provides the following technical solutions:
a cross-media video character recognition method comprises the following steps:
S1, acquiring a frontal picture of the video character to be recognized;
S2, cropping the face from the frontal picture with a trained MTCNN network to generate a picture containing the video character's face;
S3, converting the picture containing the video character's face into a feature vector using a FaceNet network;
S4, extracting frames at intervals from a film featuring the video character, detecting faces in each frame with the MTCNN network, and cropping them out;
S5, converting each cropped face into a feature vector using the FaceNet network, comparing it with the video character's feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; if not, returning to step S4;
S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm;
S7, judging whether tracking of the marked video character is finished; if so, outputting the segment containing the video character, then judging whether frame extraction for the video character's film is finished; if not, executing step S4, and if so, ending the operation.
The principle and beneficial effects of this basic scheme are as follows:
In this scheme, the face is cropped from the frontal picture by the MTCNN network, which makes it convenient to process the face independently afterwards and reduces the amount of data handled in subsequent processing. A FaceNet network converts the picture containing the video character's face into a feature vector, which is compared with the feature vectors extracted from faces in the video frames; this accurately determines whether a person appearing in a video frame is the video character to be recognized. If so, the video character is tracked and the segment containing the character is output; if not, faces continue to be extracted from subsequent video frames for comparison.
In conclusion, this scheme can automatically mark and output the segments in which a video character appears in the films featuring that character, solving the problems of manual marking being labor-intensive, time-consuming and costly.
Further, in step S1, a picture containing the video character's face may instead be obtained from a preset face library of video characters, in which case the process jumps directly to step S3.
When the material to be recognized is already a face picture, step S2 is skipped, which simplifies the process and saves time.
Further, in step S6, the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
Further, in step S6, when tracking and recognizing frames in which the character's head is rotated or occluded, the position at the current time is predicted from the position of the video character's head at the previous time by Kalman filtering; the predicted position is then matched against the current frame by the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if the matching fails, they are different.
Because the face is incomplete in frames in which the character's head is rotated or occluded, direct comparison-based recognition has a low accuracy rate; predicting and matching the head position allows the character to be followed through such frames instead.
Further, in step S4, the interval is 24 frames.
A film typically runs at 24 frames per second, so an interval of 24 frames corresponds, in the case of a film, to an interval of one second.
A cross-media video character recognition system comprising:
an input module, used for receiving a frontal picture of the video character to be recognized;
a face cropping module, which stores a trained MTCNN network and is used for feeding the frontal picture of the video character into the MTCNN network and cropping the face from it to generate a picture containing the video character's face;
a feature conversion module, which stores a FaceNet network and is used for converting the picture containing the video character's face into a feature vector using the FaceNet network;
an extraction module, used for extracting frames at intervals from a film featuring the video character;
the face cropping module being further used for feeding the extracted frames into the MTCNN network, detecting the faces appearing in each frame and cropping them out, and the feature conversion module being further used for converting each cropped face into a feature vector using the FaceNet network;
an analysis module, used for comparing the feature vector of each cropped face with the video character's feature vector, judging whether a similarity threshold is met, and if so, marking the video character in the frame;
a tracking module, used for performing multi-angle tracking of the marked video character using the Deep SORT algorithm when the similarity threshold is met;
and an output module, used for judging whether tracking of the marked video character is finished and, if so, outputting the segment containing the video character.
In this system, the face is cropped from the frontal picture by the MTCNN network, which makes it convenient to process the face independently afterwards and reduces the amount of data handled in subsequent processing. The FaceNet network converts the picture containing the video character's face into a feature vector, which is compared with the feature vectors extracted from faces in the video frames; this accurately determines whether a person appearing in a video frame is the video character to be recognized. If so, the video character is tracked and the segment containing the character is output; if not, faces continue to be extracted from subsequent video frames for comparison.
In conclusion, this scheme can automatically mark and output the segments in which a video character appears in the films featuring that character, solving the problems of manual marking being labor-intensive, time-consuming and costly.
Further, the input module is further used for obtaining a picture containing the video character's face from a preset face library of video characters.
Further, the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
Further, when tracking and recognizing frames in which the character's head is rotated or occluded, the tracking module predicts the position at the current time from the position of the video character's head at the previous time by Kalman filtering, and matches the predicted position against the current frame by the Hungarian algorithm; if the matching succeeds, the video character in the current frame is the same as in the previous frame, and if the matching fails, they are different.
Because the face is incomplete in frames in which the character's head is rotated or occluded, direct comparison-based recognition has a low accuracy rate, so prediction and matching are used instead.
Further, the interval is 24 frames.
Drawings
Fig. 1 is a flowchart of a cross-media video character recognition method according to an embodiment.
Detailed Description
The following provides further detail by way of specific embodiments:
example one
As shown in fig. 1, the cross-media video character recognition method of this embodiment comprises the following steps:
S1, acquiring a frontal picture of the video character to be recognized and jumping to step S2, or acquiring a picture containing the video character's face from a preset face library of video characters and jumping to step S3;
S2, cropping the face from the frontal picture with the trained MTCNN network to generate a picture containing the video character's face. Specifically, the positions of the eyes, mouth and nose can be located in the frontal picture through the face alignment function of the MTCNN network, the position of the face determined from them, and the face then cropped out.
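As an illustration of this step, the following minimal sketch uses the open-source facenet-pytorch package, whose MTCNN implementation performs this detect-align-crop sequence; the package choice, file path, crop size and margin are assumptions for illustration, not details fixed by this embodiment.

```python
# Sketch of step S2, assuming the facenet-pytorch package; the image
# path, crop size and margin are illustrative, not from the patent.
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=160, margin=20)    # aligned 160x160 face crops

frontal = Image.open("person_frontal.jpg")  # hypothetical input picture
face_tensor = mtcnn(frontal)                # cropped, aligned face (or None)
if face_tensor is None:
    raise ValueError("no face detected in the frontal picture")
```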
S3, converting the picture containing the video character's face into a feature vector using a FaceNet network.
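Continuing the sketch above, the feature vector could be produced with facenet-pytorch's InceptionResnetV1 backbone, assumed here as a stand-in for the embodiment's FaceNet network; the pretrained weight set is likewise an assumption.

```python
# Sketch of step S3: convert the cropped face into a 512-dimensional
# feature vector. The pretrained weights are an assumption.
import torch
from facenet_pytorch import InceptionResnetV1

facenet = InceptionResnetV1(pretrained="vggface2").eval()

with torch.no_grad():
    ref_embedding = facenet(face_tensor.unsqueeze(0))  # shape: (1, 512)
```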
S4, extracting frames at intervals from a film featuring the video character, detecting faces in each frame with the MTCNN network, and cropping them out; in this embodiment, the inter-frame interval is 24 frames.
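One possible realization of this interval sampling, sketched with OpenCV; the file name is illustrative.

```python
# Sketch of step S4's frame sampling: keep every 24th frame.
import cv2

cap = cv2.VideoCapture("film.mp4")  # hypothetical film file
INTERVAL = 24
sampled = []                        # (frame index, frame image) pairs
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % INTERVAL == 0:
        sampled.append((idx, frame))
    idx += 1
cap.release()
```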
S5, converting each cropped face into a feature vector using the FaceNet network, comparing it with the video character's feature vector from step S3, and judging whether the similarity threshold is met; if so, marking the video character in the frame and executing step S6; if not, returning to step S4. In this embodiment, the similarity threshold is 0.9.
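The embodiment fixes the threshold at 0.9 but does not name the similarity metric; the sketch below assumes cosine similarity between the FaceNet embeddings.

```python
# Sketch of the step-S5 decision; cosine similarity is an assumption.
import torch.nn.functional as F

def is_target_person(candidate_embedding, ref_embedding, threshold=0.9):
    similarity = F.cosine_similarity(candidate_embedding, ref_embedding).item()
    return similarity >= threshold
```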
S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm, where the multi-angle tracking includes recognizing frames in which the character's head is rotated or occluded.
Specifically, the position at the current time is predicted from the position of the video character's head at the previous time by Kalman filtering; the predicted position is then matched against the current frame by the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if the matching fails, they are different.
S7, judging whether tracking of the marked video character is finished; if so, outputting the segment containing the video character, then judging whether frame extraction for the video character's film is finished; if not, executing step S4, and if so, ending the operation.
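Once tracking of the marked character ends, the matched frame range can be turned into timestamps and cut out of the source film, for example with ffmpeg; the sketch below assumes the embodiment's 24 fps figure and illustrative file names.

```python
# Sketch of step S7's output: cut the tracked frame range into a clip.
import subprocess

def export_segment(src, dst, start_frame, end_frame, fps=24.0):
    start, end = start_frame / fps, end_frame / fps
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ss", f"{start:.3f}", "-to", f"{end:.3f}",
         "-c", "copy", dst],
        check=True,
    )

export_segment("film.mp4", "character_segment.mp4", 2400, 3600)  # 100 s to 150 s
```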
Based on the above cross-media video character recognition method, this embodiment also provides a cross-media video character recognition system comprising an input module, a face cropping module, a feature conversion module, an extraction module, an analysis module, a tracking module and an output module.
The input module is used for receiving a frontal picture of the video character to be recognized, or for obtaining a picture containing the video character's face from a preset face library of video characters.
The face cropping module stores the trained MTCNN network and is used for feeding the frontal picture of the video character into the MTCNN network and cropping the face from it to generate a picture containing the video character's face. Specifically, the positions of the eyes, mouth and nose can be located in the frontal picture through the face alignment function of the MTCNN network, the position of the face determined from them, and the face then cropped out.
The feature conversion module stores a FaceNet network and is used for converting the picture containing the video character's face into a feature vector using the FaceNet network.
The extraction module is used for extracting frames at intervals from a film featuring the video character; in this embodiment, the inter-frame interval is 24 frames.
The face cropping module is also used for feeding the extracted frames into the MTCNN network, detecting the faces appearing in each frame and cropping them out.
The feature conversion module is also used for converting each cropped face into a feature vector using the FaceNet network.
The analysis module is used for comparing the feature vector of each cropped face with the video character's feature vector and judging whether the similarity threshold is met; if not, a new frame is extracted by the extraction module for processing; if so, the video character is marked in the frame. In this embodiment, the similarity threshold is 0.9.
The tracking module is used for performing multi-angle tracking of the marked video character using the Deep SORT algorithm when the similarity threshold is met. The multi-angle tracking includes recognizing frames in which the character's head is rotated or occluded; specifically, the position at the current time is predicted from the position of the video character's head at the previous time by Kalman filtering, and the predicted position is matched against the current frame by the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if the matching fails, they are different.
The output module is used for judging whether tracking of the marked video character is finished and, if so, outputting the segment containing the video character.
The scheme of this embodiment can automatically mark and output the segments in which a video character appears in a film featuring that character, solving the problems of manual marking being labor-intensive, time-consuming and costly.
Example two
This embodiment differs from the first in that its cross-media video character recognition method can also be used to recognize characters in a single picture. The single picture is processed as one frame, the multi-angle tracking step is skipped, and the positions and names of the recognized characters are finally output. Specifically, each position is represented by a rectangular box framing the face, in the format [x1, y1, x2, y2], where (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corners of the rectangle.
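A minimal sketch of this annotated output, drawn with OpenCV; the names and boxes are illustrative inputs rather than values produced by the patent's pipeline.

```python
# Sketch of the single-picture output: draw each [x1, y1, x2, y2] box
# and the recognized character's name above it.
import cv2

def annotate(image, results):
    # results: list of (name, [x1, y1, x2, y2]) pairs
    for name, (x1, y1, x2, y2) in results:
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, name, (x1, max(15, y1 - 5)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return image
```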
The scheme of this embodiment can recognize the video characters appearing in a given scene and mark their positions, solving the problems of manual labeling being labor-intensive, time-consuming and costly.
The above are merely embodiments of the present invention, and the invention is not limited to the field of these embodiments. The common general knowledge of known specific structures and characteristics in these schemes is not described here at length; those skilled in the art know the common technical knowledge of the field as of the application or priority date, can access all prior art in the field, and have the ability to apply conventional experimental means, so they can perfect and implement this scheme in light of the teaching provided in the present application, and some typical known structures or known methods should not become obstacles to their implementation of the invention. It should be noted that those skilled in the art can make several changes and modifications without departing from the structure of the invention; these should also be regarded as falling within the protection scope of the invention and will not affect the effect of its implementation or the practicability of the patent. The scope of protection of the present application shall be determined by the contents of the claims, and the description of the embodiments in the specification may be used to interpret the contents of the claims.

Claims (10)

1. A cross-media video character recognition method, characterized by comprising the following steps:
S1, acquiring a frontal picture of the video character to be recognized;
S2, cropping the face from the frontal picture with a trained MTCNN network to generate a picture containing the video character's face;
S3, converting the picture containing the video character's face into a feature vector using a FaceNet network;
S4, extracting frames at intervals from a film featuring the video character, detecting faces in each frame with the MTCNN network, and cropping them out;
S5, converting each cropped face into a feature vector using the FaceNet network, comparing it with the video character's feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; if not, returning to step S4;
S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm;
S7, judging whether tracking of the marked video character is finished; if so, outputting the segment containing the video character, then judging whether frame extraction for the video character's film is finished; if not, executing step S4, and if so, ending the operation.
2. The cross-media video character recognition method of claim 1, wherein: in step S1, a picture containing the video character's face may instead be obtained from a preset face library of video characters, and the process jumps to step S3.
3. The cross-media video character recognition method of claim 2, wherein: in step S6, the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
4. The cross-media video character recognition method of claim 3, wherein: in step S6, when tracking and recognizing frames in which the character's head is rotated or occluded, the position at the current time is predicted from the position of the video character's head at the previous time by Kalman filtering; the predicted position is matched against the current frame by the Hungarian algorithm; if the matching succeeds, the video character in the current frame is the same as in the previous frame, and if the matching fails, they are different.
5. The cross-media video character recognition method of claim 1, wherein: in step S4, the interval is 24 frames.
6. A cross-media video character recognition system, characterized by comprising:
an input module, used for receiving a frontal picture of the video character to be recognized;
a face cropping module, which stores a trained MTCNN network and is used for feeding the frontal picture of the video character into the MTCNN network and cropping the face from it to generate a picture containing the video character's face;
a feature conversion module, which stores a FaceNet network and is used for converting the picture containing the video character's face into a feature vector using the FaceNet network;
an extraction module, used for extracting frames at intervals from a film featuring the video character;
the face cropping module being further used for feeding the extracted frames into the MTCNN network, detecting the faces appearing in each frame and cropping them out, and the feature conversion module being further used for converting each cropped face into a feature vector using the FaceNet network;
an analysis module, used for comparing the feature vector of each cropped face with the video character's feature vector, judging whether a similarity threshold is met, and if so, marking the video character in the frame;
a tracking module, used for performing multi-angle tracking of the marked video character using the Deep SORT algorithm when the similarity threshold is met;
and an output module, used for judging whether tracking of the marked video character is finished and, if so, outputting the segment containing the video character.
7. The cross-media video character recognition system of claim 6, wherein: the input module is further used for obtaining a picture containing the video character's face from a preset face library of video characters.
8. The cross-media video character recognition system of claim 7, wherein: the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
9. The cross-media video character recognition system of claim 8, wherein: when tracking and recognizing frames in which the character's head is rotated or occluded, the tracking module predicts the position at the current time from the position of the video character's head at the previous time by Kalman filtering, and matches the predicted position against the current frame by the Hungarian algorithm; if the matching succeeds, the video character in the current frame is the same as in the previous frame, and if the matching fails, they are different.
10. The cross-media video character recognition system of claim 6, wherein: the interval is 24 frames.
CN202111598585.8A 2021-12-24 2021-12-24 Cross-media video character recognition method and system Pending CN114299428A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111598585.8A CN114299428A (en) 2021-12-24 2021-12-24 Cross-media video character recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111598585.8A CN114299428A (en) 2021-12-24 2021-12-24 Cross-media video character recognition method and system

Publications (1)

Publication Number Publication Date
CN114299428A true CN114299428A (en) 2022-04-08

Family

ID=80969863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111598585.8A Pending CN114299428A (en) 2021-12-24 2021-12-24 Cross-media video character recognition method and system

Country Status (1)

Country Link
CN (1) CN114299428A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106534967A (en) * 2016-10-25 2017-03-22 司马大大(北京)智能系统有限公司 Video editing method and device
CN109325964A (en) * 2018-08-17 2019-02-12 深圳市中电数通智慧安全科技股份有限公司 A kind of face tracking methods, device and terminal
CN111126152A (en) * 2019-11-25 2020-05-08 国网信通亿力科技有限责任公司 Video-based multi-target pedestrian detection and tracking method
CN112926410A (en) * 2021-02-03 2021-06-08 深圳市维海德技术股份有限公司 Target tracking method and device, storage medium and intelligent video system
CN113688680A (en) * 2021-07-22 2021-11-23 电子科技大学 Intelligent identification and tracking system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李朝阳 (Li Chaoyang): "Face Recognition in Online Video Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, 15 February 2021 (2021-02-15), pages 1 - 5 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114885210A (en) * 2022-04-22 2022-08-09 海信集团控股股份有限公司 Course video processing method, server and display equipment
CN114885210B (en) * 2022-04-22 2023-11-28 海信集团控股股份有限公司 Tutorial video processing method, server and display device
CN115056223A (en) * 2022-06-15 2022-09-16 谙迈科技(宁波)有限公司 Intelligent mechanical arm control method based on tracking visual recognition algorithm

Similar Documents

Publication Publication Date Title
US10304458B1 (en) Systems and methods for transcribing videos using speaker identification
US8384791B2 (en) Video camera for face detection
US7515739B2 (en) Face detection
US7421149B2 (en) Object detection
US20060104487A1 (en) Face detection and tracking
CN109766883B (en) Method for rapidly extracting network video subtitles based on deep neural network
WO2004051656A1 (en) Media handling system
US20060198554A1 (en) Face detection
WO2009143279A1 (en) Automatic tracking of people and bodies in video
CN113052169A (en) Video subtitle recognition method, device, medium, and electronic device
CN110121105B (en) Clip video generation method and device
KR20050057586A (en) Enhanced commercial detection through fusion of video and audio signatures
GB2414616A (en) Comparing test image with a set of reference images
US10897658B1 (en) Techniques for annotating media content
TWI601425B (en) A method for tracing an object by linking video sequences
GB2557316A (en) Methods, devices and computer programs for distance metric generation, error detection and correction in trajectories for mono-camera tracking
CN107835397A (en) A kind of method of more camera lens audio video synchronizations
CN114299428A (en) Cross-media video character recognition method and system
CN113436231A (en) Pedestrian trajectory generation method, device, equipment and storage medium
CN117854507A (en) Speech recognition method, device, electronic equipment and storage medium
US20220207851A1 (en) System and method for automatic video reconstruction with dynamic point of interest
CN116017088A (en) Video subtitle processing method, device, electronic equipment and storage medium
Desurmont et al. Performance evaluation of frequent events detection systems
CN114339455B (en) Automatic short video trailer generation method and system based on audio features
Kokaram et al. Content controlled image representation for sports streaming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination