CN109063611B - Face recognition result processing method and device based on video semantics - Google Patents

Face recognition result processing method and device based on video semantics

Info

Publication number
CN109063611B
CN109063611B (application CN201810797921.3A)
Authority
CN
China
Prior art keywords
video
face
segments
video segments
gray level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810797921.3A
Other languages
Chinese (zh)
Other versions
CN109063611A (en)
Inventor
Shen Can (沈灿)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moviebook Technology Corp. Ltd.
Original Assignee
Beijing Moviebook Technology Corp. Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moviebook Technology Corp. Ltd.
Priority to CN201810797921.3A
Publication of CN109063611A
Application granted
Publication of CN109063611B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Abstract

The application discloses a method and an apparatus for processing face recognition results based on video semantics. The method comprises the following steps: performing face detection and tracking on a video to obtain a plurality of video segments, where every video frame in a segment contains the face of the same person; recognizing the face in each video segment to obtain the name of the person in that segment; and, where the interval between two consecutive video segments is less than or equal to a first threshold, merging the two segments if the person names in both are the same and the similarity of the segments is greater than or equal to a second threshold. The method performs face detection, tracking and recognition on a given video and merges segments belonging to the same person by analyzing segment intervals, person names and similarity, thereby avoiding fragmentation of the segmentation result and improving the accuracy of the recognition result.

Description

Face recognition result processing method and device based on video semantics
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a method and an apparatus for processing a face recognition result based on video semantics.
Background
With the development of multimedia technology, digital video has become an important medium for recording, transmitting and exchanging information. The growth of broadband has brought the internet into the era of online video: established applications such as long-form episodes and variety shows have grown rapidly, newer formats such as short video and live streaming have emerged in the last two years, and business applications built around video have multiplied. Video-based face processing differs from traditional still-image detection in that it can detect a target, obtain the target's appearance information within individual frames, and obtain its motion information across frames. However, in videos, and especially in films and variety shows, faces may twist frequently and expressions may be exaggerated and change rapidly. Under such conditions face tracking breaks off, so that a single person within a continuous time period is recognized as a number of discontinuous segments, which affects subsequent video analysis and processing; for example, video segmentation or clipping based on such a recognition result may lead users to consider the processing result inaccurate, degrading the user experience.
Disclosure of Invention
It is an object of the present application to overcome the above problems, or at least to partially solve or mitigate them.
According to one aspect of the application, a face recognition result processing method based on video semantics is provided, comprising the following steps:
a face detection step: detecting and tracking faces in a video to obtain a plurality of video segments, wherein every video frame in a video segment contains the face of the same person;
a face recognition step: recognizing the face in each video segment to obtain the name of the person in the video segment;
a video segment merging step: where the interval between two consecutive video segments is less than or equal to a first threshold, merging the two video segments if the person names in the two segments are the same and the similarity of the segments is greater than or equal to a second threshold.
With this method, face detection, tracking and recognition are performed on a given video, and even where tracking breaks off, video segments of the same person can be merged by analyzing segment intervals, person names and similarity. This compensates for shortcomings of the detection, tracking and recognition process, avoids fragmentation of the segmentation result, and improves the accuracy of the recognition result.
Optionally, the face detection step includes: performing face detection on each video frame of the video with a classifier, tracking each detected face, and taking consecutive video frames that contain the face of the same person as one video segment, thereby dividing the video into a plurality of video segments.
This step enables the video to be initially segmented quickly by person through the detection and tracking of faces.
Optionally, the face recognition step includes:
a face screenshot selection step: for each video segment, selecting a face screenshot from the video segment;
a recognition step: performing face recognition on the face screenshot with a neural network to obtain the name of the person in the video segment.
Through these steps, the identity of the person in each video segment can be obtained, which facilitates the subsequent merging of video segments; using a neural network to recognize the face screenshot gives high recognition accuracy and fast processing.
Optionally, in the video segment merging step, calculating the similarity of the video segments includes:
reducing the last video frame of the preceding video segment and the first video frame of the succeeding video segment each to a first number of pixels, quantizing the gray level of each pixel, comparing each quantized pixel's gray level with the average gray level of its video frame, and recording a 1 where the gray level is greater than or equal to the average and a 0 where it is below, thereby obtaining a fingerprint sequence for each video frame; the number of positions at which the two fingerprint sequences hold equal values is taken as the similarity of the video segments.
This step simplifies the gray levels of the video, reducing the subsequent amount of computation, and normalizes them so that the video frames can be compared under a unified standard, improving the accuracy of the frame similarity calculation.
Optionally, after the video segment merging step, the method further includes:
a result output step: repeating the video segment merging step until no more video segments can be merged, to obtain the final video segmentation result and person name recognition result.
According to another aspect of the present application, there is also provided a face recognition result processing apparatus based on video semantics, including:
a face detection module configured to detect and track faces in a video to obtain a plurality of video segments, wherein every video frame in a video segment contains the face of the same person;
a face recognition module configured to recognize the face in each video segment to obtain the name of the person in the video segment; and
a video segment merging module configured to, where the interval between two consecutive video segments is less than or equal to a first threshold, merge the two video segments if the person names in the two segments are the same and the similarity of the segments is greater than or equal to a second threshold.
With this apparatus, face detection, tracking and recognition can be performed on a given video, and even where tracking breaks off, video segments of the same person are merged by analyzing segment intervals, person names and similarity. This compensates for shortcomings of the detection, tracking and recognition process, avoids fragmentation of the segmentation result, and improves the accuracy of the recognition result.
Optionally, the face detection module is further configured to: perform face detection on each video frame of the video with a classifier, track each detected face, and take consecutive video frames that contain the face of the same person as one video segment, thereby dividing the video into a plurality of video segments.
Optionally, the face recognition module includes:
a face screenshot selection module configured to select, for each video segment, a face screenshot from the video segment; and
a recognition module configured to perform face recognition on the face screenshot with a neural network to obtain the name of the person in the video segment.
According to another aspect of the present application, there is also provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method as described above when executing the computer program.
According to another aspect of the application, there is also provided a computer-readable storage medium, preferably a non-volatile readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method as described above.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a video semantic based face recognition result processing method in accordance with the present application;
FIG. 2 is a schematic flow chart diagram illustrating another embodiment of a video semantic-based face recognition result processing method according to the present application;
FIG. 3 is a schematic block diagram of one embodiment of a video semantic based face recognition result processing apparatus according to the present application;
FIG. 4 is a schematic block diagram of another embodiment of a video semantic based face recognition result processing apparatus according to the present application;
FIG. 5 is a block diagram of one embodiment of a computing device of the present application;
FIG. 6 is a block diagram of one embodiment of a computer-readable storage medium of the present application.
Detailed Description
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
One embodiment of the application provides a face recognition result processing method based on video semantics. Fig. 1 is a schematic flow chart of an embodiment of the method according to the present application. The method can comprise the following steps:
S100, a face detection step: detecting and tracking faces in a video to obtain a plurality of video segments, wherein every video frame in a video segment contains the face of the same person;
S200, a face recognition step: recognizing the face in each video segment to obtain the name of the person in the video segment;
S300, a video segment merging step: where the interval between two consecutive video segments is less than or equal to a first threshold, merging the two video segments if the person names in the two segments are the same and the similarity of the segments is greater than or equal to a second threshold.
With this method, face detection, tracking and recognition are performed on a given video, and even where tracking breaks off, video segments of the same person can be merged by analyzing segment intervals, person names and similarity. This compensates for shortcomings of the detection, tracking and recognition process, avoids fragmentation of the segmentation result, and improves the accuracy of the recognition result.
In an alternative embodiment, the face detection step S100 includes: performing face detection on each video frame of the video with a classifier, tracking each detected face, and taking consecutive video frames that contain the face of the same person as one video segment, thereby dividing the video into a plurality of video segments.
This step enables the video to be initially segmented quickly by person through the detection and tracking of faces.
Optionally, the classifier may be a nearest neighbor classifier or a linear classifier. For example, face detection may be implemented by detecting facial key points with an AdaBoost (adaptive boosting) classifier.
Face tracking may be implemented using the mean shift algorithm. For example, face tracking may proceed as follows: in a given frame, a rectangle A1 containing the tracked target is selected according to the facial key points detected by the AdaBoost classifier; in the next frame, a candidate target rectangle A2 is detected with the AdaBoost classifier, background judgment is performed on every pixel of rectangle A2, and an indicator function BI2 is recorded as 1 for pixels judged to be background and as 0 otherwise; the probability density of rectangle A2 is calculated from the indicator function BI2, and the positional distance between candidate rectangle A2 and rectangle A1 is calculated from the position of A2; if the probability density is greater than or equal to a set probability density threshold and the distance is greater than or equal to a set distance threshold, the frame and its predecessor are determined to contain the face of the same person, and the two video frames are assigned to the same video segment.
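As a rough illustration, the following minimal single-target sketch segments a video with OpenCV's Haar cascade, an AdaBoost-based detector; the simple center-distance association stands in for the full mean-shift criterion described above, and MAX_SHIFT is an illustrative assumption, not a value from this application:

    import cv2

    MAX_SHIFT = 50  # assumed max center displacement (pixels) to keep a track alive

    def segment_video(path):
        # Haar cascade face detector (AdaBoost-based), shipped with OpenCV
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        cap = cv2.VideoCapture(path)
        segments, current, prev_center, idx = [], None, None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            if len(faces) > 0:
                x, y, w, h = faces[0]  # single-target case for brevity
                center = (x + w / 2.0, y + h / 2.0)
                near = (prev_center is not None and
                        abs(center[0] - prev_center[0]) +
                        abs(center[1] - prev_center[1]) <= MAX_SHIFT)
                if current is not None and near:
                    current[1] = idx              # extend the running segment
                else:
                    if current is not None:
                        segments.append(tuple(current))
                    current = [idx, idx]          # start a new segment
                prev_center = center
            else:
                if current is not None:
                    segments.append(tuple(current))
                current, prev_center = None, None
            idx += 1
        if current is not None:
            segments.append(tuple(current))
        cap.release()
        return segments  # list of (start_frame, end_frame) pairs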
By detecting and tracking faces in the video in this way, a plurality of video segments can be obtained, and the start frame and end frame of each segment are recorded. When a video frame contains more than two persons, the method can also perform multi-target detection and tracking. For example, if a video contains two faces, then among the segments obtained by detection and tracking, the segments containing the first face and those containing the second face may overlap in time, but each segment is labeled with the result of only one face.
For example, step S100 may produce the following result for a video:
a first video segment, with start and end times of 0 to 5 seconds, labeled as the first face;
a second video segment, with start and end times of 6 to 8 seconds, labeled as the second face;
a third video segment, with start and end times of 7 to 10 seconds, labeled as the first face;
a fourth video segment, with start and end times of 13 to 15 seconds, labeled as the second face; and so on.
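A minimal sketch of how such a result might be held in memory; the field names are illustrative assumptions rather than anything prescribed by the application:

    # Hypothetical representation of the example above; times in seconds,
    # "face" is the track label assigned in step S100.
    segments = [
        {"start": 0,  "end": 5,  "face": "face_1"},
        {"start": 6,  "end": 8,  "face": "face_2"},
        {"start": 7,  "end": 10, "face": "face_1"},  # overlaps the second segment in time
        {"start": 13, "end": 15, "face": "face_2"},
    ]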
In an alternative embodiment, the face recognition step S200 includes:
a face screenshot selection step: for each video segment, selecting a face screenshot from the video segment;
a recognition step: performing face recognition on the face screenshot with a neural network to obtain the name of the person in the video segment.
Through these steps, the identity of the person in each video segment can be obtained, which facilitates the subsequent merging of video segments; using a neural network to recognize the face screenshot gives high recognition accuracy and fast processing.
Optionally, in the face screenshot selection step, the highest-quality face screenshot in each video segment may be selected. Highest quality means that the face in the screenshot is frontal, the lighting is good, and the expression is normal rather than exaggerated (such as crying or laughing). The judgment can be made by defining a quality function, which can be implemented by quantizing these criteria into classifier parameters. The selected face screenshot is then input into a trained neural network for recognition, yielding the person name corresponding to each video segment.
In the recognition step, face recognition can be performed on the face screenshot with a neural network to obtain the name of the person in the video segment. Optionally, the neural network may be a VGG network. For the face screenshot of a video segment, a person name and a confidence value are determined by a trained VGG network model to obtain a first identity information set, which contains at least one person name together with its confidence value. In the training stage, face image data of more than 1000 persons is used as training data, with no fewer than 100 images per person, covering various angles from frontal to profile. The trained VGG network model should achieve a mean average precision (mAP) on the test set of target video screenshots greater than a set threshold, e.g. 0.94. It should be understood that a model such as VGG can be trained for this purpose, and that an existing face recognition tool can also be used for recognition.
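As a sketch of this recognition step, the following assumes a torchvision VGG-16 whose final layer has been fine-tuned over a set of known identities; the checkpoint file face_vgg.pth, the names list, and NUM_IDENTITIES are hypothetical placeholders, not artifacts of the application:

    import torch
    import torch.nn.functional as F
    from PIL import Image
    from torchvision import models, transforms

    NUM_IDENTITIES = 1000  # assumption: one output class per known person

    model = models.vgg16()
    model.classifier[6] = torch.nn.Linear(4096, NUM_IDENTITIES)
    model.load_state_dict(torch.load("face_vgg.pth"))  # hypothetical fine-tuned weights
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),  # VGG's expected input size
        transforms.ToTensor(),
    ])

    def recognize(face_crop_path, names):
        """Return (person_name, confidence) for one face screenshot."""
        x = preprocess(Image.open(face_crop_path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            probs = F.softmax(model(x), dim=1)[0]
        conf, idx = probs.max(dim=0)
        return names[idx.item()], conf.item()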
Optionally, in the video segment merging step S300, two consecutive video segments are first selected; if the interval between them is greater than the first threshold, no processing is performed, i.e. the originally segmented video segments are kept unchanged. Optionally, the first threshold is 2 seconds: according to the subjective perception of a viewer, a person's presence in a video still feels continuous across a break of up to two seconds, so setting the first threshold to 2 seconds is a preferred scheme.
If the interval between the two segments is less than or equal to the first threshold, the person names recognized in the two segments are compared; if the names differ, no processing is performed.
If the person names are the same, the last video frame of the preceding segment and the first video frame of the succeeding segment are taken out and their similarity is compared. If the frames are not similar, no processing is performed; if they are similar, the video from the start frame of the preceding segment to the end frame of the succeeding segment is taken as one video segment.
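The three-way test just described might look as follows; the thresholds follow the values given in this section (a 2-second gap, and 59 of 64 matching fingerprint positions per the fingerprint embodiment below), while the segment fields and the pluggable similarity function are illustrative assumptions:

    FIRST_THRESHOLD = 2.0    # seconds; maximum gap between the two segments
    SECOND_THRESHOLD = 59    # matching fingerprint positions out of 64

    def try_merge(a, b, similarity):
        """Merge consecutive segments a and b, or return None to keep both."""
        if b["start"] - a["end"] > FIRST_THRESHOLD:
            return None                                   # too far apart
        if a["name"] != b["name"]:
            return None                                   # different persons
        if similarity(a["last_frame"], b["first_frame"]) < SECOND_THRESHOLD:
            return None                                   # boundary frames dissimilar
        return {"start": a["start"], "end": b["end"], "name": a["name"],
                "first_frame": a["first_frame"], "last_frame": b["last_frame"]}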
Regarding segment similarity, in an alternative embodiment the step of calculating the similarity of the video segments includes: obtaining a first set of gray values from the pixels of the last video frame of the preceding segment and a second set of gray values from the pixels of the first video frame of the succeeding segment, comparing the corresponding element values of the two sets in sequence, treating elements whose difference satisfies a constraint condition as identical, and taking the number of identical elements between the two sets as the similarity.
With these steps, frame similarity is computed from the gray levels of the video; the algorithm is simple, fast and accurate.
In this step, the gray values of the pixels of the last frame of the preceding segment are combined in order into the first gray value set, and the gray values of the pixels of the first frame of the succeeding segment into the second gray value set. Corresponding element values of the two sets are compared in sequence, and elements whose difference satisfies the constraint condition, for example a difference of at most 10, are treated as identical; the number of identical elements between the two sets is counted and used as the frame similarity.
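A minimal sketch of this first similarity variant, assuming both boundary frames are grayscale arrays of equal shape and using the tolerance of 10 gray levels mentioned above:

    import numpy as np

    def gray_set_similarity(frame_a, frame_b, tolerance=10):
        """Count pixel positions whose gray values differ by at most `tolerance`."""
        a = frame_a.astype(np.int32).ravel()
        b = frame_b.astype(np.int32).ravel()
        return int(np.count_nonzero(np.abs(a - b) <= tolerance))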
In an alternative embodiment, the step of calculating the similarity of the video segments includes: reducing the last video frame of the preceding video segment and the first video frame of the succeeding video segment each to a first number of pixels, quantizing the gray level of each pixel, comparing each quantized pixel's gray level with the average gray level of its video frame, and recording a 1 where the gray level is greater than or equal to the average and a 0 where it is below, thereby obtaining a fingerprint sequence for each video frame; the number of positions at which the two fingerprint sequences hold equal values is taken as the similarity.
This step simplifies the gray levels of the video, reducing the subsequent amount of computation, and normalizes them so that the video frames can be compared under a unified standard, improving the accuracy of the frame similarity calculation.
Optionally, in this step, both video frames are reduced to 8 × 8, giving 64 pixels each, and every pixel is quantized to 64 gray levels. For each video frame, the average of its 64 quantized gray values is taken as the frame's gray average; each pixel's gray value is compared with this average, recorded as 1 if greater than or equal to it and as 0 if below it, and the comparison results are combined into a sequence of 64 numbers, the fingerprint sequence of the video frame. The fingerprint sequences of the two frames are then compared: if the number of differing positions does not exceed 5, the two frames are considered similar.
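A sketch of this fingerprint embodiment, essentially an average-hash comparison, assuming OpenCV and NumPy; frames count as similar when at least 59 of the 64 positions agree, i.e. at most 5 differ:

    import cv2
    import numpy as np

    def fingerprint(frame):
        """64-element 0/1 fingerprint of one video frame."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        small = cv2.resize(gray, (8, 8))   # reduce to the "first number": 64 pixels
        quant = small // 4                 # 256 gray levels -> 64 levels
        return (quant >= quant.mean()).astype(np.uint8).ravel()  # 1 if >= mean, else 0

    def fingerprint_similarity(frame_a, frame_b):
        """Number of positions (0..64) at which the two fingerprints agree."""
        return int(np.sum(fingerprint(frame_a) == fingerprint(frame_b)))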
Fig. 2 is a schematic flow chart diagram of another embodiment of a face recognition result processing method based on video semantics according to the present application. In an alternative embodiment, after the video segment merging step, the method further comprises:
s400, result output step: and repeating the video segment merging step until the video segments cannot be merged to obtain a final video segmentation result and a character name recognition result.
And after circularly and repeatedly analyzing all the remaining identification result segments, sequencing the obtained video segments and the person name identification results according to time to obtain a final identification result.
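Putting the pieces together, the repeat-until-stable loop could be sketched as below, reusing the hypothetical try_merge() from the merging step; it sweeps the time-ordered segment list until a full pass produces no merge:

    def merge_until_stable(segments, similarity):
        """Repeat the merging step until no two segments can be merged."""
        segments = sorted(segments, key=lambda s: s["start"])
        changed = True
        while changed:
            changed = False
            i = 0
            while i < len(segments) - 1:
                merged = try_merge(segments[i], segments[i + 1], similarity)
                if merged is not None:
                    segments[i:i + 2] = [merged]   # replace the pair with the merge
                    changed = True
                else:
                    i += 1
        return segments                            # final, time-ordered result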
One embodiment of the application also provides a face recognition result processing device based on video semantics. Fig. 3 is a schematic block diagram of an embodiment of a face recognition result processing apparatus based on video semantics according to the present application. The apparatus may include:
a face detection module 100 configured to detect and track faces in a video to obtain a plurality of video segments, wherein every video frame in a video segment contains the face of the same person;
a face recognition module 200 configured to recognize the face in each video segment to obtain the name of the person in the video segment;
and a video segment merging module 300 configured to, where the interval between two consecutive video segments is less than or equal to a first threshold, merge the two video segments if the person names in the two segments are the same and the similarity of the segments is greater than or equal to a second threshold.
With this apparatus, face detection, tracking and recognition can be performed on a given video, and even where tracking breaks off, video segments of the same person are merged by analyzing segment intervals, person names and similarity. This compensates for shortcomings of the detection, tracking and recognition process, avoids fragmentation of the segmentation result, and improves the accuracy of the recognition result.
Optionally, the face detection module 100 is further configured to: perform face detection on each video frame of the video with a classifier, track each detected face, and take consecutive video frames that contain the face of the same person as one video segment, thereby dividing the video into a plurality of video segments.
Optionally, the face recognition module 200 includes:
a face screenshot selection module configured to select, for each video segment, a face screenshot from the video segment;
and a recognition module configured to perform face recognition on the face screenshot with a neural network to obtain the name of the person in the video segment.
Through the face recognition module, the identity of the person in each video segment can be obtained, which facilitates the subsequent merging of video segments; using a neural network to recognize the face screenshot gives high recognition accuracy and fast processing.
Optionally, the apparatus further includes a similarity calculation module. In one embodiment, the similarity calculation module is configured to obtain a first gray value set and a second gray value set from the pixel gray values of the last video frame of the preceding segment and the first video frame of the succeeding segment respectively, compare the corresponding element values of the two sets in sequence, treat elements whose difference satisfies a constraint condition as identical, and use the number of identical elements between the two sets as the similarity.
In another embodiment, the similarity calculation module is configured to reduce the last video frame of the preceding segment and the first video frame of the succeeding segment each to a first number of pixels, quantize the gray level of each pixel, compare each quantized pixel's gray level with the average gray level of its video frame, record a 1 where the gray level is at least the average and a 0 otherwise, thereby obtaining a fingerprint sequence for each video frame, and use the number of positions at which the two fingerprint sequences hold equal values as the similarity.
Fig. 4 is a schematic block diagram of another embodiment of a face recognition result processing apparatus based on video semantics according to the present application. In an alternative embodiment, the apparatus may further comprise:
a result output module 400 configured to repeat the video segment merging until no more video segments can be merged, yielding the final video segmentation result and person name recognition result.
Embodiments of the present application also provide a computing device; referring to Fig. 5, it comprises a memory 1120, a processor 1110 and a computer program stored in the memory 1120 and executable by the processor 1110, the computer program being held in a space 1130 for program code in the memory 1120 and implementing, when executed by the processor 1110, any of the method steps 1131 according to the present application.
Embodiments of the present application also provide a computer-readable storage medium. Referring to Fig. 6, the computer-readable storage medium comprises a storage unit for program code, provided with a program 1131' that performs the steps of the method according to the present application when executed by a processor.
Embodiments of the present application also provide a computer program product containing instructions comprising computer readable code which, when executed by a computing device, causes the computing device to perform the method as described above.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed by a computer, the procedures or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server or data center to another via wired (e.g. coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g. infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that incorporates one or more available media. The available medium may be a magnetic medium (e.g. floppy disk, hard disk, magnetic tape), an optical medium (e.g. DVD), or a semiconductor medium (e.g. solid state disk (SSD)).
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program, and that the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium such as a random access memory, read-only memory, flash memory, hard disk, solid state disk, magnetic tape, floppy disk, optical disk, or any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A face recognition result processing method based on video semantics, comprising the following steps:
a face detection step: detecting and tracking faces in a video to obtain a plurality of video segments, wherein every video frame in a video segment contains the face of the same person;
a face recognition step: for each video segment, selecting the highest-quality face screenshot in the video segment, and performing face recognition on the face screenshot with a VGG (Visual Geometry Group) network to obtain the name of the person in the video segment; and
a video segment merging step: where the interval between two consecutive video segments is less than or equal to a first threshold, merging the two video segments if the person names in the two segments are the same and the similarity of the segments is greater than or equal to a second threshold;
wherein calculating the similarity of the video segments comprises the following steps:
reducing the last video frame of the preceding video segment and the first video frame of the succeeding video segment each to a first number of pixels, quantizing the gray level of each pixel, comparing each quantized pixel's gray level with the average gray level of its video frame, and recording a 1 where the gray level is greater than or equal to the average and a 0 where it is below, thereby obtaining a fingerprint sequence for each video frame; the number of positions at which the two fingerprint sequences hold equal values is taken as the similarity of the video segments.
2. The method of claim 1, wherein the face detection step comprises: performing face detection on each video frame of the video with a classifier, tracking each detected face, and taking consecutive video frames that contain the face of the same person as one video segment, thereby dividing the video into a plurality of video segments.
3. The method of claim 1, wherein after the video segment merging step, the method further comprises:
a result output step: repeating the video segment merging step until no more video segments can be merged, to obtain the final video segmentation result and person name recognition result.
4. A face recognition result processing device based on video semantics, comprising:
a face detection module configured to detect and track faces in a video to obtain a plurality of video segments, wherein every video frame in a video segment contains the face of the same person;
a face recognition module configured to select, for each video segment, the highest-quality face screenshot in the video segment, and to perform face recognition on the face screenshot with a VGG (Visual Geometry Group) network to obtain the name of the person in the video segment; and
a video segment merging module configured to, where the interval between two consecutive video segments is less than or equal to a first threshold, merge the two video segments if the person names in the two segments are the same and the similarity of the segments is greater than or equal to a second threshold;
wherein the similarity of the video segments is calculated by the following steps:
reducing the last video frame of the preceding video segment and the first video frame of the succeeding video segment each to a first number of pixels, quantizing the gray level of each pixel, comparing each quantized pixel's gray level with the average gray level of its video frame, and recording a 1 where the gray level is greater than or equal to the average and a 0 where it is below, thereby obtaining a fingerprint sequence for each video frame; the number of positions at which the two fingerprint sequences hold equal values is taken as the similarity of the video segments.
5. The apparatus of claim 4, wherein the face detection module is further configured to: perform face detection on each video frame of the video with a classifier, track each detected face, and take consecutive video frames that contain the face of the same person as one video segment, thereby dividing the video into a plurality of video segments.
6. A computing device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of any of claims 1 to 3 when executing the computer program.
7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 3.
CN201810797921.3A 2018-07-19 2018-07-19 Face recognition result processing method and device based on video semantics Active CN109063611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810797921.3A CN109063611B (en) 2018-07-19 2018-07-19 Face recognition result processing method and device based on video semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810797921.3A CN109063611B (en) 2018-07-19 2018-07-19 Face recognition result processing method and device based on video semantics

Publications (2)

Publication Number Publication Date
CN109063611A 2018-12-21
CN109063611B 2021-01-05

Family

ID=64817435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810797921.3A Active CN109063611B (en) 2018-07-19 2018-07-19 Face recognition result processing method and device based on video semantics

Country Status (1)

Country Link
CN (1) CN109063611B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111385641B (en) * 2018-12-29 2022-04-22 深圳Tcl新技术有限公司 Video processing method, smart television and storage medium
CN109922373B (en) * 2019-03-14 2021-09-28 上海极链网络科技有限公司 Video processing method, device and storage medium
CN110119711B (en) * 2019-05-14 2021-06-11 北京奇艺世纪科技有限公司 Method and device for acquiring character segments of video data and electronic equipment
CN110868632B (en) * 2019-10-29 2022-09-09 腾讯科技(深圳)有限公司 Video processing method and device, storage medium and electronic equipment
CN113810782B (en) * 2020-06-12 2022-09-27 阿里巴巴集团控股有限公司 Video processing method and device, server and electronic device
CN112861981B (en) * 2021-02-22 2023-06-20 每日互动股份有限公司 Data set labeling method, electronic equipment and medium
CN117615084B (en) * 2024-01-22 2024-03-29 南京爱照飞打影像科技有限公司 Video synthesis method and computer readable storage medium
CN117676245A (en) * 2024-01-31 2024-03-08 深圳市积加创新技术有限公司 Context video generation method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635843A (en) * 2008-07-23 2010-01-27 北京大学 Method and system for extracting, seeking and comparing visual patterns based on frame-to-frame variation characteristics
CN103530652A (en) * 2013-10-23 2014-01-22 北京中视广信科技有限公司 Face clustering based video categorization method and retrieval method as well as systems thereof
CN104796781A (en) * 2015-03-31 2015-07-22 小米科技有限责任公司 Video clip extraction method and device
CN105718871A (en) * 2016-01-18 2016-06-29 成都索贝数码科技股份有限公司 Video host identification method based on statistics
CN105740758A (en) * 2015-12-31 2016-07-06 上海极链网络科技有限公司 Internet video face recognition method based on deep learning
CN106484837A (en) * 2016-09-30 2017-03-08 腾讯科技(北京)有限公司 The detection method of similar video file and device
CN108229322A (en) * 2017-11-30 2018-06-29 北京市商汤科技开发有限公司 Face identification method, device, electronic equipment and storage medium based on video

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839221B2 (en) * 2016-12-21 2020-11-17 Facebook, Inc. Systems and methods for compiled video generation


Also Published As

Publication number Publication date
CN109063611A 2018-12-21

Similar Documents

Publication Publication Date Title
CN109063611B (en) Face recognition result processing method and device based on video semantics
CN109359636B (en) Video classification method, device and server
CN108875676B (en) Living body detection method, device and system
US8358837B2 (en) Apparatus and methods for detecting adult videos
CN110309795B (en) Video detection method, device, electronic equipment and storage medium
Zhang et al. Efficient video frame insertion and deletion detection based on inconsistency of correlations between local binary pattern coded frames
CN111061915B (en) Video character relation identification method
US20210390316A1 (en) Method for identifying a video frame of interest in a video sequence, method for generating highlights, associated systems
JP2004199669A (en) Face detection
CN111918130A (en) Video cover determining method and device, electronic equipment and storage medium
CN111414868B (en) Method for determining time sequence action segment, method and device for detecting action
EP3438883B1 (en) Method and apparatus for detecting a common section in moving pictures
CN113283368B (en) Model training method, face attribute analysis method, device and medium
CN111738120B (en) Character recognition method, character recognition device, electronic equipment and storage medium
Heng et al. How to assess the quality of compressed surveillance videos using face recognition
CN114519863A (en) Human body weight recognition method, human body weight recognition apparatus, computer device, and medium
CN116261009B (en) Video detection method, device, equipment and medium for intelligently converting video audience
CN116340551A (en) Similar content determining method and device
US20210374419A1 (en) Semi-Supervised Action-Actor Detection from Tracking Data in Sport
Gomez-Nieto et al. Quality aware features for performance prediction and time reduction in video object tracking
JP2018137639A (en) Moving image processing system, encoder and program, decoder and program
CN113472834A (en) Object pushing method and device
CN111860070A (en) Method and device for identifying changed object
JP2019169843A (en) Video recording device, video recording method and program
CN113810751B (en) Video processing method and device, electronic device and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Method and Device for Processing Face Recognition Results Based on Video Semantics

Effective date of registration: 20230713

Granted publication date: 20210105

Pledgee: Bank of Jiangsu Limited by Share Ltd. Beijing branch

Pledgor: BEIJING MOVIEBOOK SCIENCE AND TECHNOLOGY Co.,Ltd.

Registration number: Y2023110000278