CN114299428A - Cross-media video character recognition method and system - Google Patents
Cross-media video character recognition method and system
- Publication number
- CN114299428A (application CN202111598585.8A)
- Authority
- CN
- China
- Prior art keywords
- video
- face
- video character
- frame
- tracking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to the technical field of image recognition, and in particular discloses a cross-media video character recognition method and system. The method comprises the following steps: S1, acquiring a frontal picture of the video character to be recognized; S2, cropping the face from the frontal picture to generate a face picture; S3, converting the face picture into a feature vector; S4, extracting frames at intervals from a film in which the video character appears, detecting the faces in each frame, and cropping them out; S5, converting each cropped face into a feature vector, comparing it with the video character's feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm; and S7, judging whether tracking of the marked video character is finished, and if so, outputting the segment containing the video character. With the technical scheme of the invention, a video containing the target character can be generated automatically.
Description
Technical Field
The invention relates to the technical field of image recognition, and in particular to a cross-media video character recognition method and a cross-media video character recognition system.
Background
Since the last century, a huge stock of high-quality film and television resources has accumulated, covering movies, variety shows, television series, and other formats. With the development of filming technology and equipment, many of these videos approach or exceed one hour in length. In recent years, as the pace of life has quickened, users prefer to spend their time on more compact short videos, and short-video sharing platforms have grown increasingly popular. Large numbers of creators on the internet have also begun to use short videos to drive traffic to longer film and television resources such as movies and television series.
For example, retrieving the segments in which a video character appears in the films he or she has acted in can help video producers edit clips, and lets audiences conveniently select only the scenes in which a particular character appears, improving the viewing experience.
To determine the segments in which a particular video character appears in a film, the conventional approach relies on dedicated operators marking them by hand. This is not only labor-intensive and time-consuming but also costly, and it cannot keep pace with today's massive video resources.
Therefore, a cross-media video character recognition method and system that can automatically generate a video containing a target character is needed.
Disclosure of Invention
The invention provides a cross-media video character recognition method and system that can automatically generate a video containing a target character.
To solve this technical problem, the present application provides the following technical solution:
A cross-media video character recognition method comprises the following steps:
S1, acquiring a frontal picture of the video character to be recognized;
S2, cropping the face from the frontal picture with a trained MTCNN network to generate a picture containing the video character's face;
S3, converting the picture containing the video character's face into a feature vector using a FaceNet network;
S4, extracting frames at intervals from a film in which the video character appears, detecting the faces in each frame with the MTCNN network, and cropping them out;
S5, converting each cropped face into a feature vector with the FaceNet network, comparing it with the video character's feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; if not, returning to step S4;
S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm;
and S7, judging whether tracking of the marked video character is finished; if so, outputting the segment containing the video character and judging whether frame extraction for the film is finished; if not, executing step S4; if so, ending the operation.
The principle and beneficial effects of this basic scheme are as follows:
In this scheme, cropping the face from the frontal picture with the MTCNN network allows the face to be processed independently in subsequent steps and reduces the amount of data to be processed. The FaceNet network converts the picture containing the video character's face into a feature vector, which is then compared with the feature vector extracted from each face in a video frame; this accurately determines whether the person in the frame is the video character to be recognized. If so, the video character is tracked and the segment containing the character is output; if not, faces continue to be extracted from subsequent video frames for comparison.
In summary, this scheme can mark and output, in a fully automatic way, the segments in which a video character appears in a film, solving the labor, time, and cost problems of manual marking.
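The embedding-and-compare pipeline of steps S2-S5 can be illustrated with a short sketch. This is a minimal, non-authoritative example assuming the facenet-pytorch package (its MTCNN detector and InceptionResnetV1 FaceNet model); the cosine-similarity measure and all helper names are illustrative assumptions, not details prescribed by the patent.

```python
# A minimal sketch of steps S2-S5, assuming the facenet-pytorch package;
# the 0.9 similarity threshold mirrors the embodiment below.
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                              # face detection and cropping
facenet = InceptionResnetV1(pretrained='vggface2').eval()  # 512-d face embeddings

def embed_face(image: Image.Image):
    """Crop the most confident face with MTCNN and embed it with FaceNet."""
    face = mtcnn(image)   # aligned 3x160x160 tensor, or None if no face found
    if face is None:
        return None
    with torch.no_grad():
        return facenet(face.unsqueeze(0))[0]  # 512-d feature vector

def is_same_person(ref_vec, vec, threshold=0.9):
    """Compare a cropped-frame embedding against the reference from step S3."""
    sim = torch.nn.functional.cosine_similarity(ref_vec, vec, dim=0)
    return sim.item() >= threshold
```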
Further, in step S1, a picture containing the video character's face may instead be obtained from a preset face library of the video character, in which case the process jumps directly to step S3.
When the material to be recognized is already a face picture, skipping step S2 simplifies the process and saves time.
Further, in step S6, the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
Further, in step S6, when tracking and recognizing frames in which the character's head is rotated or occluded, the position at the current time is predicted from the position of the video character's head at the previous time through Kalman filtering; the predicted position is then matched against the current frame using the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if it fails, they are different.
Because the face in a frame where the character's head is rotated or occluded is incomplete, direct comparison and recognition would have a low accuracy rate; prediction and matching avoid this problem (a minimal sketch follows).
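The sketch below illustrates this predict-then-match step. It assumes each tracked character already has a per-track Kalman filter supplying the predicted boxes (the filter itself is not shown), and it uses SciPy's Hungarian solver; the IoU cost follows the usual Deep SORT convention and is an assumption here, not a detail fixed by the patent.

```python
# A sketch of Hungarian matching between Kalman-predicted boxes and the
# detections in the current frame; boxes are [x1, y1, x2, y2].
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(predicted_boxes, detected_boxes, min_iou=0.3):
    """Assign predictions to detections; a successful match means the same person."""
    cost = np.array([[1.0 - iou(p, d) for d in detected_boxes]
                     for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # Keep only pairs that overlap enough; unmatched tracks/detections differ.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - min_iou]
```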
Further, in step S4, the interval is 24 frames.
A film typically runs at 24 frames per second, so an interval of 24 frames corresponds to sampling one frame per second in the case of a film.
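The interval sampling of step S4 might look like the following sketch, assuming OpenCV; the function name and the generator design are illustrative choices.

```python
# A minimal sketch of step S4's interval sampling, assuming OpenCV;
# a stride of 24 samples one frame per second for a 24 fps film.
import cv2

def sample_frames(video_path: str, stride: int = 24):
    """Yield (frame_index, frame) for every stride-th frame of the film."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of film
            break
        if index % stride == 0:
            yield index, frame
        index += 1
    cap.release()
```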
A cross-media video character recognition system comprises:
the input module, used for receiving a frontal picture of the video character to be recognized;
the face cropping module, in which a trained MTCNN network is prestored, used for inputting the frontal picture of the video character into the MTCNN network and cropping the face from the frontal picture through the MTCNN network to generate a picture containing the video character's face;
the feature conversion module, in which a FaceNet network is prestored, used for converting the picture containing the video character's face into a feature vector using the FaceNet network;
the extraction module, used for extracting frames at intervals from a film in which the video character appears;
the face cropping module is also used for inputting each extracted frame into the MTCNN network, detecting the faces appearing in the frame through the MTCNN network, and cropping them out; the feature conversion module is also used for converting each cropped face into a feature vector using the FaceNet network;
the analysis module, used for comparing the feature vector of each cropped face with the feature vector of the video character, judging whether a similarity threshold is met, and if so, marking the video character in the frame;
the tracking module, used for performing multi-angle tracking of the marked video character using the Deep SORT algorithm when the similarity threshold is met;
and the output module, used for judging whether tracking of the marked video character is finished, and if so, outputting the segment containing the video character.
In this scheme, cropping the face from the frontal picture with the MTCNN network allows the face to be processed independently in subsequent steps and reduces the amount of data to be processed. The FaceNet network converts the picture containing the video character's face into a feature vector, which is then compared with the feature vector extracted from each face in a video frame; this accurately determines whether the person in the frame is the video character to be recognized. If so, the video character is tracked and the segment containing the character is output; if not, faces continue to be extracted from subsequent video frames for comparison.
In summary, this scheme can mark and output, in a fully automatic way, the segments in which a video character appears in a film, solving the labor, time, and cost problems of manual marking.
Further, the input module is also used for acquiring a picture containing the video character's face from a preset face library of the video character.
Further, the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
Further, when tracking and recognizing frames in which the character's head is rotated or occluded, the tracking module is used for predicting the position at the current time from the position of the video character's head at the previous time through Kalman filtering, and matching the predicted position against the current frame using the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if it fails, they are different.
Because the face in a frame where the character's head is rotated or occluded is incomplete, direct comparison and recognition would have a low accuracy rate; prediction and matching avoid this problem.
Further, the interval is 24 frames.
Drawings
Fig. 1 is a flowchart of a cross-media video character recognition method according to an embodiment.
Detailed Description
The invention is described in further detail below by way of specific embodiments:
example one
As shown in fig. 1, the cross-media video character recognition method of this embodiment includes the following steps:
S1, acquiring a frontal picture of the video character to be recognized and jumping to step S2, or acquiring a picture containing the video character's face from a preset face library of the video character and jumping to step S3;
S2, cropping the face from the frontal picture with the trained MTCNN network to generate a picture containing the video character's face. Specifically, the positions of the eyes, mouth, and nose can be located in the frontal picture through the face alignment function of the MTCNN network; the position of the face is thereby determined, and the face is then cropped out;
S3, converting the picture containing the video character's face into a feature vector using the FaceNet network;
S4, extracting frames at intervals from a film in which the video character appears, detecting the faces in each frame with the MTCNN network, and cropping them out; in this embodiment, the inter-frame interval is 24 frames;
S5, converting each cropped face into a feature vector with the FaceNet network, comparing it with the video character's feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; if not, returning to step S4. In this embodiment, the similarity threshold is 0.9;
S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm, where the multi-angle tracking includes recognizing frames in which the character's head is rotated or occluded;
specifically, the position at the current time is predicted from the position of the video character's head at the previous time through Kalman filtering; the predicted position is then matched against the current frame using the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if it fails, they are different;
and S7, judging whether tracking of the marked video character is finished; if so, outputting the segment containing the video character (a minimal export sketch follows these steps) and judging whether frame extraction for the film is finished; if not, executing step S4; if so, ending the operation.
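The segment output of step S7 might be implemented as in the sketch below, assuming the tracker reports the first and last frame indices over which the marked character was followed; cv2.VideoWriter and the mp4v codec are illustrative choices, not requirements of the scheme.

```python
# A minimal sketch of step S7's segment export, assuming OpenCV;
# start_frame/end_frame come from the tracking step and are inclusive.
import cv2

def export_segment(video_path, start_frame, end_frame, out_path):
    """Write the frames in [start_frame, end_frame] to a new clip."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, size)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)  # seek to the segment start
    for _ in range(start_frame, end_frame + 1):
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(frame)
    cap.release()
    writer.release()
```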
Based on the above cross-media video character recognition method, this embodiment also provides a cross-media video character recognition system, which comprises an input module, a face cropping module, a feature conversion module, an extraction module, an analysis module, a tracking module, and an output module.
The input module is used for receiving a frontal picture of the video character to be recognized, or for acquiring a picture containing the video character's face from a preset face library of the video character.
The face cropping module, in which the trained MTCNN network is prestored, is used for inputting the frontal picture of the video character into the MTCNN network and cropping the face from the frontal picture through the MTCNN network to generate a picture containing the video character's face. Specifically, the positions of the eyes, mouth, and nose can be located in the frontal picture through the face alignment function of the MTCNN network; the position of the face is thereby determined, and the face is then cropped out.
The feature conversion module, in which the FaceNet network is prestored, is used for converting the picture containing the video character's face into a feature vector using the FaceNet network.
The extraction module is used for extracting frames at intervals from a film in which the video character appears; in this embodiment, the inter-frame interval is 24 frames.
The face cropping module is also used for inputting each extracted frame into the MTCNN network, detecting the faces appearing in the frame through the MTCNN network, and cropping them out.
The feature conversion module is also used for converting each cropped face into a feature vector using the FaceNet network.
The analysis module is used for comparing the feature vector of each cropped face with the feature vector of the video character and judging whether a similarity threshold is met; if not, a new frame is extracted by the extraction module for processing; if so, the video character in the frame is marked. In this embodiment, the similarity threshold is 0.9.
The tracking module is used for performing multi-angle tracking of the marked video character using the Deep SORT algorithm when the similarity threshold is met; the multi-angle tracking includes recognizing frames in which the character's head is rotated or occluded. Specifically, the position at the current time is predicted from the position of the video character's head at the previous time through Kalman filtering; the predicted position is then matched against the current frame using the Hungarian algorithm. If the matching succeeds, the video character in the current frame is the same as in the previous frame; if it fails, they are different.
The output module is used for judging whether tracking of the marked video character is finished, and if so, outputting the segment containing the video character.
The scheme of this embodiment can mark and output, in an automatic way, the segments in which a video character appears in a film, solving the labor, time, and cost problems of manual marking.
Example two
This embodiment differs from Example one in that the cross-media video character recognition method can also be used to recognize characters in a single picture. The single picture is processed as one frame, the multi-angle tracking step is skipped, and the positions and names of the recognized characters are output. Specifically, each position is represented by a rectangular box framing a face, in the format [x1, y1, x2, y2], where (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corners of the rectangle, as in the sketch below.
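A minimal sketch of this single-picture output, assuming OpenCV; the drawing color and the annotate helper are illustrative assumptions, while the box format matches the [x1, y1, x2, y2] convention above.

```python
# A sketch of marking recognized characters in a single picture; each box
# is [x1, y1, x2, y2] and names come from the matched face library.
import cv2

def annotate(image, boxes, names):
    """Draw each recognized face box and the person's name on the image."""
    for (x1, y1, x2, y2), name in zip(boxes, names):
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, name, (x1, max(y1 - 8, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return image
```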
The scheme of this embodiment can recognize the video characters appearing in a given scene and mark their positions, solving the labor, time, and cost problems of manual labeling.
The above are merely examples of the present invention, and the invention is not limited to the field of these embodiments. Common general knowledge of the specific structures and characteristics involved is not described here at length; a person skilled in the art knows the common technical knowledge and prior art in this field before the application date or priority date, can apply the conventional experimental means of that time, and can perfect and implement this scheme in light of the teaching of the present application, so that some well-known structures or methods should not become a barrier to its implementation. It should be noted that those skilled in the art may make several changes and modifications without departing from the structure of the invention; these should also be regarded as falling within the protection scope of the invention, and they will not affect the effect of its implementation or the practicability of the patent. The scope of protection of this application is determined by the content of the claims, and the detailed description in the specification serves to interpret the content of the claims.
Claims (10)
1. A cross-media video character recognition method, characterized by comprising the following steps:
S1, acquiring a frontal picture of the video character to be recognized;
S2, cropping the face from the frontal picture with a trained MTCNN network to generate a picture containing the video character's face;
S3, converting the picture containing the video character's face into a feature vector using a FaceNet network;
S4, extracting frames at intervals from a film in which the video character appears, detecting the faces in each frame with the MTCNN network, and cropping them out;
S5, converting each cropped face into a feature vector with the FaceNet network, comparing it with the video character's feature vector from step S3, and judging whether a similarity threshold is met; if so, marking the video character in the frame and executing step S6; if not, returning to step S4;
S6, performing multi-angle tracking of the marked video character using the Deep SORT algorithm;
and S7, judging whether tracking of the marked video character is finished; if so, outputting the segment containing the video character and judging whether frame extraction for the film is finished; if not, executing step S4; if so, ending the operation.
2. The cross-media video character recognition method of claim 1, wherein: in step S1, a picture containing the video character's face may instead be obtained from a preset face library of the video character, and the process jumps to step S3.
3. The cross-media video character recognition method of claim 2, wherein: in step S6, the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
4. The cross-media video character recognition method of claim 3, wherein: in step S6, when tracking and recognizing frames in which the character's head is rotated or occluded, the position at the current time is predicted from the position of the video character's head at the previous time through Kalman filtering; the predicted position is then matched against the current frame using the Hungarian algorithm; if the matching succeeds, the video character in the current frame is the same as in the previous frame, and if it fails, they are different.
5. The cross-media video character recognition method of claim 1, wherein: in step S4, the interval is 24 frames.
6. A cross-media video character recognition system, characterized by comprising:
an input module, used for receiving a frontal picture of the video character to be recognized;
a face cropping module, in which a trained MTCNN network is prestored, used for inputting the frontal picture of the video character into the MTCNN network and cropping the face from the frontal picture through the MTCNN network to generate a picture containing the video character's face;
a feature conversion module, in which a FaceNet network is prestored, used for converting the picture containing the video character's face into a feature vector using the FaceNet network;
an extraction module, used for extracting frames at intervals from a film in which the video character appears;
the face cropping module being also used for inputting each extracted frame into the MTCNN network, detecting the faces appearing in the frame through the MTCNN network, and cropping them out; the feature conversion module being also used for converting each cropped face into a feature vector using the FaceNet network;
an analysis module, used for comparing the feature vector of each cropped face with the feature vector of the video character, judging whether a similarity threshold is met, and if so, marking the video character in the frame;
a tracking module, used for performing multi-angle tracking of the marked video character using the Deep SORT algorithm when the similarity threshold is met;
and an output module, used for judging whether tracking of the marked video character is finished, and if so, outputting the segment containing the video character.
7. The cross-media video character recognition system of claim 6, wherein: the input module is also used for acquiring a picture containing the video character's face from a preset face library of the video character.
8. The cross-media video character recognition system of claim 7, wherein: the multi-angle tracking includes tracking and recognizing frames in which the character's head is rotated or occluded.
9. The cross-media video character recognition system of claim 8, wherein: when tracking and recognizing frames in which the character's head is rotated or occluded, the tracking module is used for predicting the position at the current time from the position of the video character's head at the previous time through Kalman filtering, and matching the predicted position against the current frame using the Hungarian algorithm; if the matching succeeds, the video character in the current frame is the same as in the previous frame, and if it fails, they are different.
10. The cross-media video character recognition system of claim 6, wherein: the interval is 24 frames.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111598585.8A CN114299428A (en) | 2021-12-24 | 2021-12-24 | Cross-media video character recognition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111598585.8A CN114299428A (en) | 2021-12-24 | 2021-12-24 | Cross-media video character recognition method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114299428A true CN114299428A (en) | 2022-04-08 |
Family
ID=80969863
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111598585.8A Pending CN114299428A (en) | 2021-12-24 | 2021-12-24 | Cross-media video character recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114299428A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114885210A (en) * | 2022-04-22 | 2022-08-09 | 海信集团控股股份有限公司 | Course video processing method, server and display equipment |
CN115056223A (en) * | 2022-06-15 | 2022-09-16 | 谙迈科技(宁波)有限公司 | Intelligent mechanical arm control method based on tracking visual recognition algorithm |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106534967A (en) * | 2016-10-25 | 2017-03-22 | 司马大大(北京)智能系统有限公司 | Video editing method and device |
CN109325964A (en) * | 2018-08-17 | 2019-02-12 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of face tracking methods, device and terminal |
CN111126152A (en) * | 2019-11-25 | 2020-05-08 | 国网信通亿力科技有限责任公司 | Video-based multi-target pedestrian detection and tracking method |
CN112926410A (en) * | 2021-02-03 | 2021-06-08 | 深圳市维海德技术股份有限公司 | Target tracking method and device, storage medium and intelligent video system |
CN113688680A (en) * | 2021-07-22 | 2021-11-23 | 电子科技大学 | Intelligent identification and tracking system |
- 2021-12-24: application CN202111598585.8A filed; publication CN114299428A, status Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106534967A (en) * | 2016-10-25 | 2017-03-22 | 司马大大(北京)智能系统有限公司 | Video editing method and device |
CN109325964A (en) * | 2018-08-17 | 2019-02-12 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of face tracking methods, device and terminal |
CN111126152A (en) * | 2019-11-25 | 2020-05-08 | 国网信通亿力科技有限责任公司 | Video-based multi-target pedestrian detection and tracking method |
CN112926410A (en) * | 2021-02-03 | 2021-06-08 | 深圳市维海德技术股份有限公司 | Target tracking method and device, storage medium and intelligent video system |
CN113688680A (en) * | 2021-07-22 | 2021-11-23 | 电子科技大学 | Intelligent identification and tracking system |
Non-Patent Citations (1)
Title |
---|
Li Zhaoyang, "Deep-Learning-Based Face Recognition in Online Video" (基于深度学习的网络视频人脸识别), China Masters' Theses Full-text Database, Information Science and Technology, 15 February 2021 (2021-02-15), pages 1-5 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114885210A (en) * | 2022-04-22 | 2022-08-09 | 海信集团控股股份有限公司 | Course video processing method, server and display equipment |
CN114885210B (en) * | 2022-04-22 | 2023-11-28 | 海信集团控股股份有限公司 | Tutorial video processing method, server and display device |
CN115056223A (en) * | 2022-06-15 | 2022-09-16 | 谙迈科技(宁波)有限公司 | Intelligent mechanical arm control method based on tracking visual recognition algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10304458B1 (en) | Systems and methods for transcribing videos using speaker identification | |
US8384791B2 (en) | Video camera for face detection | |
US7515739B2 (en) | Face detection | |
US7421149B2 (en) | Object detection | |
US20060104487A1 (en) | Face detection and tracking | |
CN109766883B (en) | Method for rapidly extracting network video subtitles based on deep neural network | |
WO2004051656A1 (en) | Media handling system | |
US20060198554A1 (en) | Face detection | |
WO2009143279A1 (en) | Automatic tracking of people and bodies in video | |
CN113052169A (en) | Video subtitle recognition method, device, medium, and electronic device | |
CN110121105B (en) | Clip video generation method and device | |
KR20050057586A (en) | Enhanced commercial detection through fusion of video and audio signatures | |
GB2414616A (en) | Comparing test image with a set of reference images | |
US10897658B1 (en) | Techniques for annotating media content | |
TWI601425B (en) | A method for tracing an object by linking video sequences | |
GB2557316A (en) | Methods, devices and computer programs for distance metric generation, error detection and correction in trajectories for mono-camera tracking | |
CN107835397A (en) | A kind of method of more camera lens audio video synchronizations | |
CN114299428A (en) | Cross-media video character recognition method and system | |
CN113436231A (en) | Pedestrian trajectory generation method, device, equipment and storage medium | |
CN117854507A (en) | Speech recognition method, device, electronic equipment and storage medium | |
US20220207851A1 (en) | System and method for automatic video reconstruction with dynamic point of interest | |
CN116017088A (en) | Video subtitle processing method, device, electronic equipment and storage medium | |
Desurmont et al. | Performance evaluation of frequent events detection systems | |
CN114339455B (en) | Automatic short video trailer generation method and system based on audio features | |
Kokaram et al. | Content controlled image representation for sports streaming |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||