CN114299428A - Cross-media video character recognition method and system - Google Patents
Cross-media video character recognition method and system
- Publication number
- CN114299428A (application CN202111598585.8A)
- Authority
- CN
- China
- Prior art keywords
- video
- face
- video character
- frame
- tracking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to the technical field of image recognition, and in particular discloses a cross-media video character recognition method and system. The method comprises the following steps: S1, acquiring a frontal picture of the video character to be recognized; S2, cropping the face from the frontal picture to generate a face picture; S3, converting the face picture into a feature vector; S4, extracting frames at intervals from a film in which the video character appears, detecting the faces in each frame, and cropping them out; S5, converting each cropped face into a feature vector, comparing it with the video character's feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm; and S7, judging whether tracking of the marked video character is finished, and if so, outputting the segment containing the video character. With the technical scheme of the invention, a video containing the target character can be generated automatically.
Description
Technical Field
The invention relates to the technical field of image recognition, and in particular to a cross-media video character recognition method and a cross-media video character recognition system.
Background
Since the last century, a huge stock of high-quality film and television resources has accumulated, covering movies, variety shows, television series, and other formats. With the development of filming technology and equipment, many of these videos approach or exceed one hour in length. In recent years, as the pace of life has quickened, users prefer to spend their time on more compact short videos, and short-video sharing platforms have grown increasingly popular. Large numbers of creators on the internet have also begun to use short videos to drive traffic to longer film and television resources such as movies and television series.
For example, retrieving the segments in which a video character appears in the films he or she has acted in can help video producers edit clips, and lets audiences conveniently select only the scenes in which a particular character appears, improving the viewing experience.
To determine the segments in which a particular video character appears in a film, the conventional approach relies on dedicated operators marking them by hand. This is not only labor-intensive and time-consuming but also costly, and it cannot keep pace with today's massive video resources.
Therefore, a cross-media video character recognition method and system that can automatically generate a video containing a target character is needed.
Disclosure of Invention
The invention provides a cross-media video character recognition method and system that can automatically generate a video containing a target character.
To solve this technical problem, the present application provides the following technical solution:
A cross-media video character recognition method comprises the following steps:
S1, acquiring a frontal picture of the video character to be recognized;
S2, cropping the face from the frontal picture with a trained MTCNN network to generate a picture containing the video character's face;
S3, converting the picture containing the video character's face into a feature vector using a FaceNet network;
S4, extracting frames at intervals from a film in which the video character appears, detecting the faces in each frame with the MTCNN network, and cropping them out;
S5, converting each cropped face into a feature vector with the FaceNet network, comparing it with the video character's feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; if not, returning to step S4;
S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm;
and S7, judging whether tracking of the marked video character is finished; if so, outputting the segment containing the video character and judging whether frame extraction for the film is finished; if not, executing step S4; if so, ending the operation.
The principle and beneficial effects of this basic scheme are as follows:
In this scheme, cropping the face from the frontal picture with the MTCNN network allows the face to be processed independently in subsequent steps and reduces the amount of data to be processed. The FaceNet network converts the picture containing the video character's face into a feature vector, which is then compared with the feature vector extracted from each face in a video frame; this accurately determines whether the person in the frame is the video character to be recognized. If so, the video character is tracked and the segment containing the character is output; if not, faces continue to be extracted from subsequent video frames for comparison.
In summary, this scheme can mark and output, in a fully automatic way, the segments in which a video character appears in a film, solving the labor, time, and cost problems of manual marking.
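The embedding-and-compare pipeline of steps S2-S5 can be illustrated with a short sketch. This is a minimal, non-authoritative example assuming the facenet-pytorch package (its MTCNN detector and InceptionResnetV1 FaceNet model); the cosine-similarity measure and all helper names are illustrative assumptions, not details prescribed by the patent.

```python
# A minimal sketch of steps S2-S5, assuming the facenet-pytorch package;
# the 0.9 similarity threshold mirrors the embodiment below.
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                              # face detection and cropping
facenet = InceptionResnetV1(pretrained='vggface2').eval()  # 512-d face embeddings

def embed_face(image: Image.Image):
    """Crop the most confident face with MTCNN and embed it with FaceNet."""
    face = mtcnn(image)   # aligned 3x160x160 tensor, or None if no face found
    if face is None:
        return None
    with torch.no_grad():
        return facenet(face.unsqueeze(0))[0]  # 512-d feature vector

def is_same_person(ref_vec, vec, threshold=0.9):
    """Compare a cropped-frame embedding against the reference from step S3."""
    sim = torch.nn.functional.cosine_similarity(ref_vec, vec, dim=0)
    return sim.item() >= threshold
```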
Further, in step S1, a picture containing the video character's face may instead be obtained from a preset face library of the video character, in which case the process jumps directly to step S3.
When the material to be recognized is already a face picture, skipping step S2 simplifies the process and saves time.
Further, in step S6, the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
Further, in step S6, when tracking and recognizing frames in which the character's head is rotated or occluded, the position at the current time is predicted from the position of the video character's head at the previous time through Kalman filtering; the predicted position is then matched against the current frame using the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if it fails, they are different.
Because the face in a frame where the character's head is rotated or occluded is incomplete, direct comparison and recognition would have a low accuracy rate; prediction and matching avoid this problem (a minimal sketch follows).
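The sketch below illustrates this predict-then-match step. It assumes each tracked character already has a per-track Kalman filter supplying the predicted boxes (the filter itself is not shown), and it uses SciPy's Hungarian solver; the IoU cost follows the usual Deep SORT convention and is an assumption here, not a detail fixed by the patent.

```python
# A sketch of Hungarian matching between Kalman-predicted boxes and the
# detections in the current frame; boxes are [x1, y1, x2, y2].
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(predicted_boxes, detected_boxes, min_iou=0.3):
    """Assign predictions to detections; a successful match means the same person."""
    cost = np.array([[1.0 - iou(p, d) for d in detected_boxes]
                     for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # Keep only pairs that overlap enough; unmatched tracks/detections differ.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - min_iou]
```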
Further, in step S4, the interval is 24 frames.
A film typically runs at 24 frames per second, so an interval of 24 frames corresponds to sampling one frame per second in the case of a film.
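The interval sampling of step S4 might look like the following sketch, assuming OpenCV; the function name and the generator design are illustrative choices.

```python
# A minimal sketch of step S4's interval sampling, assuming OpenCV;
# a stride of 24 samples one frame per second for a 24 fps film.
import cv2

def sample_frames(video_path: str, stride: int = 24):
    """Yield (frame_index, frame) for every stride-th frame of the film."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of film
            break
        if index % stride == 0:
            yield index, frame
        index += 1
    cap.release()
```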
A cross-media video character recognition system comprises:
the input module, used for receiving a frontal picture of the video character to be recognized;
the face cropping module, in which a trained MTCNN network is prestored, used for inputting the frontal picture of the video character into the MTCNN network and cropping the face from the frontal picture through the MTCNN network to generate a picture containing the video character's face;
the feature conversion module, in which a FaceNet network is prestored, used for converting the picture containing the video character's face into a feature vector using the FaceNet network;
the extraction module, used for extracting frames at intervals from a film in which the video character appears;
the face cropping module is also used for inputting each extracted frame into the MTCNN network, detecting the faces appearing in the frame through the MTCNN network, and cropping them out; the feature conversion module is also used for converting each cropped face into a feature vector using the FaceNet network;
the analysis module, used for comparing the feature vector of each cropped face with the feature vector of the video character, judging whether a similarity threshold is met, and if so, marking the video character in the frame;
the tracking module, used for performing multi-angle tracking of the marked video character using the Deep SORT algorithm when the similarity threshold is met;
and the output module, used for judging whether tracking of the marked video character is finished, and if so, outputting the segment containing the video character.
In this scheme, cropping the face from the frontal picture with the MTCNN network allows the face to be processed independently in subsequent steps and reduces the amount of data to be processed. The FaceNet network converts the picture containing the video character's face into a feature vector, which is then compared with the feature vector extracted from each face in a video frame; this accurately determines whether the person in the frame is the video character to be recognized. If so, the video character is tracked and the segment containing the character is output; if not, faces continue to be extracted from subsequent video frames for comparison.
In summary, this scheme can mark and output, in a fully automatic way, the segments in which a video character appears in a film, solving the labor, time, and cost problems of manual marking.
Further, the input module is also used for acquiring a picture containing the video character's face from a preset face library of the video character.
Further, the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
Further, when tracking and recognizing frames in which the character's head is rotated or occluded, the tracking module is used for predicting the position at the current time from the position of the video character's head at the previous time through Kalman filtering, and matching the predicted position against the current frame using the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if it fails, they are different.
Because the face in a frame where the character's head is rotated or occluded is incomplete, direct comparison and recognition would have a low accuracy rate; prediction and matching avoid this problem.
Further, the interval is 24 frames.
Drawings
Fig. 1 is a flowchart of a cross-media video character recognition method according to an embodiment.
Detailed Description
The invention is described in further detail below by way of specific embodiments:
example one
As shown in fig. 1, the cross-media video character recognition method of this embodiment includes the following steps:
S1, acquiring a frontal picture of the video character to be recognized and jumping to step S2, or acquiring a picture containing the video character's face from a preset face library of the video character and jumping to step S3;
S2, cropping the face from the frontal picture with the trained MTCNN network to generate a picture containing the video character's face. Specifically, the positions of the eyes, mouth, and nose can be located in the frontal picture through the face alignment function of the MTCNN network; the position of the face is thereby determined, and the face is then cropped out;
S3, converting the picture containing the video character's face into a feature vector using the FaceNet network;
S4, extracting frames at intervals from a film in which the video character appears, detecting the faces in each frame with the MTCNN network, and cropping them out; in this embodiment, the inter-frame interval is 24 frames;
S5, converting each cropped face into a feature vector with the FaceNet network, comparing it with the video character's feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; if not, returning to step S4. In this embodiment, the similarity threshold is 0.9;
S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm, where the multi-angle tracking includes recognizing frames in which the character's head is rotated or occluded;
specifically, the position at the current time is predicted from the position of the video character's head at the previous time through Kalman filtering; the predicted position is then matched against the current frame using the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if it fails, they are different;
and S7, judging whether tracking of the marked video character is finished; if so, outputting the segment containing the video character (a minimal export sketch follows these steps) and judging whether frame extraction for the film is finished; if not, executing step S4; if so, ending the operation.
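The segment output of step S7 might be implemented as in the sketch below, assuming the tracker reports the first and last frame indices over which the marked character was followed; cv2.VideoWriter and the mp4v codec are illustrative choices, not requirements of the scheme.

```python
# A minimal sketch of step S7's segment export, assuming OpenCV;
# start_frame/end_frame come from the tracking step and are inclusive.
import cv2

def export_segment(video_path, start_frame, end_frame, out_path):
    """Write the frames in [start_frame, end_frame] to a new clip."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, size)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)  # seek to the segment start
    for _ in range(start_frame, end_frame + 1):
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(frame)
    cap.release()
    writer.release()
```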
Based on the above cross-media video character recognition method, this embodiment also provides a cross-media video character recognition system, which comprises an input module, a face cropping module, a feature conversion module, an extraction module, an analysis module, a tracking module, and an output module.
The input module is used for receiving a frontal picture of the video character to be recognized, or for acquiring a picture containing the video character's face from a preset face library of the video character.
The face cropping module, in which the trained MTCNN network is prestored, is used for inputting the frontal picture of the video character into the MTCNN network and cropping the face from the frontal picture through the MTCNN network to generate a picture containing the video character's face. Specifically, the positions of the eyes, mouth, and nose can be located in the frontal picture through the face alignment function of the MTCNN network; the position of the face is thereby determined, and the face is then cropped out.
The feature conversion module, in which the FaceNet network is prestored, is used for converting the picture containing the video character's face into a feature vector using the FaceNet network.
The extraction module is used for extracting frames at intervals from a film in which the video character appears; in this embodiment, the inter-frame interval is 24 frames.
The face cropping module is also used for inputting each extracted frame into the MTCNN network, detecting the faces appearing in the frame through the MTCNN network, and cropping them out.
The feature conversion module is also used for converting each cropped face into a feature vector using the FaceNet network.
The analysis module is used for comparing the feature vector of each cropped face with the feature vector of the video character and judging whether a similarity threshold is met; if not, a new frame is extracted by the extraction module for processing; if so, the video character in the frame is marked. In this embodiment, the similarity threshold is 0.9.
The tracking module is used for performing multi-angle tracking of the marked video character using the Deep SORT algorithm when the similarity threshold is met; the multi-angle tracking includes recognizing frames in which the character's head is rotated or occluded. Specifically, the position at the current time is predicted from the position of the video character's head at the previous time through Kalman filtering; the predicted position is then matched against the current frame using the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if it fails, they are different.
The output module is used for judging whether tracking of the marked video character is finished, and if so, outputting the segment containing the video character.
The scheme of this embodiment can mark and output, in an automatic way, the segments in which a video character appears in a film, solving the labor, time, and cost problems of manual marking.
Example two
This embodiment differs from Example one in that the cross-media video character recognition method can also be used to recognize characters in a single picture. The single picture is processed as one frame, the multi-angle tracking step is skipped, and the positions and names of the recognized characters are output. Specifically, each position is represented by a rectangular box framing a face, in the format [x1, y1, x2, y2], where (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corners of the rectangle, as in the sketch below.
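A minimal sketch of this single-picture output, assuming OpenCV; the drawing color and the annotate helper are illustrative assumptions, while the box format matches the [x1, y1, x2, y2] convention above.

```python
# A sketch of marking recognized characters in a single picture; each box
# is [x1, y1, x2, y2] and names come from the matched face library.
import cv2

def annotate(image, boxes, names):
    """Draw each recognized face box and the person's name on the image."""
    for (x1, y1, x2, y2), name in zip(boxes, names):
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, name, (x1, max(y1 - 8, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return image
```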
The scheme of this embodiment can recognize the video characters appearing in a given scene and mark their positions, solving the labor, time, and cost problems of manual labeling.
The above are merely examples of the present invention, and the invention is not limited to the field of these embodiments. Common general knowledge of the specific structures and characteristics involved is not described here at length; a person skilled in the art knows the common technical knowledge and prior art in this field before the application date or priority date, can apply the conventional experimental means of that time, and can perfect and implement this scheme in light of the teaching of the present application, so that some well-known structures or methods should not become a barrier to its implementation. It should be noted that those skilled in the art may make several changes and modifications without departing from the structure of the invention; these should also be regarded as falling within the protection scope of the invention, and they will not affect the effect of its implementation or the practicability of the patent. The scope of protection of this application is determined by the content of the claims, and the detailed description in the specification serves to interpret the content of the claims.
Claims (10)
1. A cross-media video character recognition method, characterized by comprising the following steps:
S1, acquiring a frontal picture of the video character to be recognized;
S2, cropping the face from the frontal picture with a trained MTCNN network to generate a picture containing the video character's face;
S3, converting the picture containing the video character's face into a feature vector using a FaceNet network;
S4, extracting frames at intervals from a film in which the video character appears, detecting the faces in each frame with the MTCNN network, and cropping them out;
S5, converting each cropped face into a feature vector with the FaceNet network, comparing it with the video character's feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; if not, returning to step S4;
S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm;
and S7, judging whether tracking of the marked video character is finished; if so, outputting the segment containing the video character and judging whether frame extraction for the film is finished; if not, executing step S4; if so, ending the operation.
2. The cross-media video character recognition method of claim 1, wherein: in step S1, a picture containing the video character's face may instead be obtained from a preset face library of the video character, and the process jumps to step S3.
3. The cross-media video character recognition method of claim 2, wherein: in step S6, the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
4. The cross-media video character recognition method of claim 3, wherein: in step S6, when tracking and recognizing frames in which the character's head is rotated or occluded, the position at the current time is predicted from the position of the video character's head at the previous time through Kalman filtering; the predicted position is then matched against the current frame using the Hungarian algorithm; if the matching succeeds, the video character in the current frame is the same as in the previous frame, and if it fails, they are different.
5. The cross-media video character recognition method of claim 1, wherein: in step S4, the interval is 24 frames.
6. A cross-media video character recognition system, characterized by comprising:
an input module, used for receiving a frontal picture of the video character to be recognized;
a face cropping module, in which a trained MTCNN network is prestored, used for inputting the frontal picture of the video character into the MTCNN network and cropping the face from the frontal picture through the MTCNN network to generate a picture containing the video character's face;
a feature conversion module, in which a FaceNet network is prestored, used for converting the picture containing the video character's face into a feature vector using the FaceNet network;
an extraction module, used for extracting frames at intervals from a film in which the video character appears;
the face cropping module being also used for inputting each extracted frame into the MTCNN network, detecting the faces appearing in the frame through the MTCNN network, and cropping them out; the feature conversion module being also used for converting each cropped face into a feature vector using the FaceNet network;
an analysis module, used for comparing the feature vector of each cropped face with the feature vector of the video character, judging whether a similarity threshold is met, and if so, marking the video character in the frame;
a tracking module, used for performing multi-angle tracking of the marked video character using the Deep SORT algorithm when the similarity threshold is met;
and an output module, used for judging whether tracking of the marked video character is finished, and if so, outputting the segment containing the video character.
7. The cross-media video character recognition system of claim 6, wherein: the input module is also used for acquiring a picture containing the video character's face from a preset face library of the video character.
8. The cross-media video character recognition system of claim 7, wherein: the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
9. The cross-media video character recognition system of claim 8, wherein: when tracking and recognizing frames in which the character's head is rotated or occluded, the tracking module is used for predicting the position at the current time from the position of the video character's head at the previous time through Kalman filtering, and matching the predicted position against the current frame using the Hungarian algorithm; if the matching succeeds, the video character in the current frame is the same as in the previous frame, and if it fails, they are different.
10. The cross-media video character recognition system of claim 6, wherein: the interval is 24 frames.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111598585.8A CN114299428A (en) | 2021-12-24 | 2021-12-24 | Cross-media video character recognition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111598585.8A CN114299428A (en) | 2021-12-24 | 2021-12-24 | Cross-media video character recognition method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114299428A true CN114299428A (en) | 2022-04-08 |
Family
ID=80969863
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111598585.8A Pending CN114299428A (en) | 2021-12-24 | 2021-12-24 | Cross-media video character recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114299428A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114885210A (en) * | 2022-04-22 | 2022-08-09 | 海信集团控股股份有限公司 | Course video processing method, server and display equipment |
CN115056223A (en) * | 2022-06-15 | 2022-09-16 | 谙迈科技(宁波)有限公司 | Intelligent mechanical arm control method based on tracking visual recognition algorithm |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106534967A (en) * | 2016-10-25 | 2017-03-22 | 司马大大(北京)智能系统有限公司 | Video editing method and device |
CN109325964A (en) * | 2018-08-17 | 2019-02-12 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of face tracking methods, device and terminal |
CN111126152A (en) * | 2019-11-25 | 2020-05-08 | 国网信通亿力科技有限责任公司 | Video-based multi-target pedestrian detection and tracking method |
CN112926410A (en) * | 2021-02-03 | 2021-06-08 | 深圳市维海德技术股份有限公司 | Target tracking method and device, storage medium and intelligent video system |
CN113688680A (en) * | 2021-07-22 | 2021-11-23 | 电子科技大学 | Intelligent identification and tracking system |
- 2021-12-24: application CN202111598585.8A filed; publication CN114299428A, status Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106534967A (en) * | 2016-10-25 | 2017-03-22 | 司马大大(北京)智能系统有限公司 | Video editing method and device |
CN109325964A (en) * | 2018-08-17 | 2019-02-12 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of face tracking methods, device and terminal |
CN111126152A (en) * | 2019-11-25 | 2020-05-08 | 国网信通亿力科技有限责任公司 | Video-based multi-target pedestrian detection and tracking method |
CN112926410A (en) * | 2021-02-03 | 2021-06-08 | 深圳市维海德技术股份有限公司 | Target tracking method and device, storage medium and intelligent video system |
CN113688680A (en) * | 2021-07-22 | 2021-11-23 | 电子科技大学 | Intelligent identification and tracking system |
Non-Patent Citations (1)
Title |
---|
Li Zhaoyang, "Deep-Learning-Based Face Recognition in Online Video" (基于深度学习的网络视频人脸识别), China Masters' Theses Full-text Database, Information Science and Technology, 15 February 2021 (2021-02-15), pages 1-5 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114885210A (en) * | 2022-04-22 | 2022-08-09 | 海信集团控股股份有限公司 | Course video processing method, server and display equipment |
CN114885210B (en) * | 2022-04-22 | 2023-11-28 | 海信集团控股股份有限公司 | Tutorial video processing method, server and display device |
CN115056223A (en) * | 2022-06-15 | 2022-09-16 | 谙迈科技(宁波)有限公司 | Intelligent mechanical arm control method based on tracking visual recognition algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10304458B1 (en) | Systems and methods for transcribing videos using speaker identification | |
US8384791B2 (en) | Video camera for face detection | |
US7515739B2 (en) | Face detection | |
US7421149B2 (en) | Object detection | |
US20060104487A1 (en) | Face detection and tracking | |
CN109766883B (en) | Method for rapidly extracting network video subtitles based on deep neural network | |
WO2004051656A1 (en) | Media handling system | |
US20060198554A1 (en) | Face detection | |
WO2009143279A1 (en) | Automatic tracking of people and bodies in video | |
CN113052169A (en) | Video subtitle recognition method, device, medium, and electronic device | |
CN110121105B (en) | Clip video generation method and device | |
KR20050057586A (en) | Enhanced commercial detection through fusion of video and audio signatures | |
GB2414616A (en) | Comparing test image with a set of reference images | |
US10897658B1 (en) | Techniques for annotating media content | |
TWI601425B (en) | A method for tracing an object by linking video sequences | |
GB2557316A (en) | Methods, devices and computer programs for distance metric generation, error detection and correction in trajectories for mono-camera tracking | |
CN107835397A (en) | A kind of method of more camera lens audio video synchronizations | |
CN114299428A (en) | Cross-media video character recognition method and system | |
CN113436231A (en) | Pedestrian trajectory generation method, device, equipment and storage medium | |
CN117854507A (en) | Speech recognition method, device, electronic equipment and storage medium | |
US20220207851A1 (en) | System and method for automatic video reconstruction with dynamic point of interest | |
CN116017088A (en) | Video subtitle processing method, device, electronic equipment and storage medium | |
Desurmont et al. | Performance evaluation of frequent events detection systems | |
CN114339455B (en) | Automatic short video trailer generation method and system based on audio features | |
Kokaram et al. | Content controlled image representation for sports streaming |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||