CN110163096B - Person identification method, person identification device, electronic equipment and computer readable medium
- Publication number: CN110163096B (application CN201910304196.6A)
- Authority
- CN
- China
- Prior art keywords
- detection frame
- face
- clothes
- target
- frame
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The embodiments of the application disclose a person identification method, a person identification apparatus, an electronic device, and a computer-readable medium. An embodiment of the method comprises: extracting frames from a target video, and detecting and recognizing the faces and clothes in each extracted frame to generate face detection frames, clothes detection frames, a face recognition result corresponding to each face detection frame, and a clothes feature corresponding to each clothes detection frame; detecting the human head in each frame to generate human head detection frames; determining the pairing of the human head detection frame and the clothes detection frame in each frame; clustering the clothes features corresponding to the clothes detection frames in each pairing to obtain a clustering result; and determining the person corresponding to each pairing based on the face detection frames, the face recognition results, and the clustering result. This embodiment improves the recall rate of person identification in video.
Description
Technical Field
The embodiments of the application relate to the field of computer technology, and in particular to a person identification method, a person identification apparatus, an electronic device, and a computer-readable medium.
Background
With the development of computer technology, person recognition is increasingly widely applied. For example, the persons in a video can be identified to obtain image data labeled with person identities, to track persons, and so on.
Related methods for identifying persons in a video usually apply existing face recognition technology to the faces in video frames and then determine person identities from the face recognition results. However, because the postures and angles of persons in a video vary, many video frames show no frontal face, or show a frontal face that is unclear. In such frames the identity cannot be determined directly by face recognition, so the recall rate of person identification in video is low.
Disclosure of Invention
The embodiments of the application provide a person identification method, a person identification apparatus, an electronic device, and a computer-readable medium, aiming to solve the technical problem in the prior art that the recall rate of identifying persons in a video is low when person identities are determined solely from face recognition results.
In a first aspect, an embodiment of the present application provides a person identification method, the method comprising: extracting frames from a target video, detecting and recognizing the faces and clothes in each extracted frame, and generating face detection frames, clothes detection frames, a face recognition result corresponding to each face detection frame, and a clothes feature corresponding to each clothes detection frame; detecting the human head in each frame to generate human head detection frames; determining the pairing of the human head detection frame and the clothes detection frame in each frame; clustering the clothes features corresponding to the clothes detection frames in each pairing to obtain a clustering result; and determining the person corresponding to each pairing based on the generated face detection frames, human head detection frames, face recognition results, and clustering result.
In some embodiments, detecting and recognizing the faces and clothes in the extracted frames to generate the face detection frames, clothes detection frames, face recognition results, and clothes features includes: performing face detection and recognition on each extracted frame using a pre-trained face detection and recognition model to generate the face detection frames and the face recognition result corresponding to each face detection frame; and performing clothes detection and recognition on each frame using a pre-trained clothes detection and recognition model to generate the clothes detection frames and the clothes feature corresponding to each clothes detection frame.
In some embodiments, the face detection and recognition model includes a face detection model and a face recognition model, wherein the face detection model is used for detecting a face region in an image and generating a face detection frame, and the face recognition model is used for performing face recognition on an image region surrounded by the face detection frame and generating a face recognition result.
In some embodiments, the clothes detection and identification model comprises a clothes detection model and a clothes identification model, wherein the clothes detection model is used for detecting clothes areas in the image and generating a clothes detection frame, and the clothes identification model is used for extracting clothes features of the image areas surrounded by the clothes detection frame.
In some embodiments, determining a pairing of the head detection box and the clothing detection box in each frame comprises: for the human head detection frame in each frame, the human head detection frame and the clothes detection frame simultaneously meeting the following conditions are taken as a pair: the distance between the center of the clothes detection frame and the center of the human head detection frame is smaller than or equal to a first preset value, and the center of the clothes detection frame is located below the center of the human head detection frame.
In some embodiments, determining the person corresponding to each pairing based on the generated face detection frames, human head detection frames, face recognition results, and clustering result includes: determining each face detection frame whose face recognition result indicates a target face as a target face detection frame, where a target face is the face of a target person and the target persons include the actors in the cast list of the target video; and for each target face detection frame, performing the following steps: determining the target human head detection frame corresponding to the target face detection frame, namely a human head detection frame whose intersection-over-union with the target face detection frame is greater than or equal to a second preset value; and taking the pairing containing the target human head detection frame as a target pairing, and taking the target person corresponding to the target face detection frame as the person corresponding to that target pairing.
In some embodiments, determining the person corresponding to each pairing based on the generated face detection frames, human head detection frames, face recognition results, and clustering result further includes: taking the category, in the clustering result, of the clothes feature corresponding to the clothes detection frame in each pairing as the category of that pairing; for each target pairing, taking its category as a target category and establishing a mapping between the target person corresponding to the target pairing and the target category; and for each category in the clustering result, determining the target person having the most mappings to that category and determining that target person as the person corresponding to the other pairings in the category, where the other pairings are the pairings other than target pairings.
In some embodiments, after determining the person corresponding to each pairing based on the face detection frames, face recognition results, and clustering result, the method further includes: performing person labeling on the human head detection frame in each pairing based on the person corresponding to that pairing.
In a second aspect, an embodiment of the present application provides a person identification apparatus, including: the detection and identification unit is configured to extract frames of the target video, detect and identify faces and clothes in each extracted frame and generate a face detection frame, a clothes detection frame, a face identification result corresponding to the face detection frame and clothes characteristics corresponding to the clothes detection frame; a detection unit configured to detect a head in each frame, generating a head detection frame; a pairing unit configured to determine a pairing of the head detection frame and the clothing detection frame in each frame; the clustering unit is configured to cluster the clothes characteristics corresponding to the clothes detection frames in each pair to obtain a clustering result; a determination unit configured to determine a person corresponding to each pair based on the generated face detection frame, head detection frame, face recognition result, and clustering result.
In some embodiments, the detection recognition unit comprises: the first detection and recognition module is configured to perform face detection and recognition on each extracted frame by using a pre-trained face detection and recognition model to generate a face detection frame and a face recognition result corresponding to the face detection frame; and the second detection and identification module is configured to detect and identify clothes of each frame by using a pre-trained clothes detection and identification model, and generate a clothes detection frame and clothes characteristics corresponding to the clothes detection frame.
In some embodiments, the face detection and recognition model includes a face detection model and a face recognition model, wherein the face detection model is used for detecting a face region in an image and generating a face detection frame, and the face recognition model is used for performing face recognition on an image region surrounded by the face detection frame and generating a face recognition result.
In some embodiments, the clothes detection and identification model comprises a clothes detection model and a clothes identification model, wherein the clothes detection model is used for detecting clothes areas in the image and generating a clothes detection frame, and the clothes identification model is used for extracting clothes features of the image areas surrounded by the clothes detection frame.
In some embodiments, the pairing unit is further configured to: for the human head detection frame in each frame, the human head detection frame and the clothes detection frame simultaneously meeting the following conditions are taken as a pair: the distance between the center of the clothes detection frame and the center of the human head detection frame is smaller than or equal to a first preset value, and the center of the clothes detection frame is located below the center of the human head detection frame.
In some embodiments, the determining unit comprises: a first determination module configured to determine each face detection frame whose face recognition result indicates a target face as a target face detection frame, where a target face is the face of a target person and the target persons include the actors in the cast list of the target video; and an execution module configured to perform, for each target face detection frame, the following steps: determining the target human head detection frame corresponding to the target face detection frame, namely a human head detection frame whose intersection-over-union with the target face detection frame is greater than or equal to a second preset value; and taking the pairing containing the target human head detection frame as a target pairing, and taking the target person corresponding to the target face detection frame as the person corresponding to that target pairing.
In some embodiments, the determining unit further comprises: a category division module configured to take the category, in the clustering result, of the clothes feature corresponding to the clothes detection frame in each pairing as the category of that pairing; an establishing module configured to, for each target pairing, take its category as a target category and establish a mapping between the target person corresponding to the target pairing and the target category; and a second determination module configured to, for each category in the clustering result, determine the target person having the most mappings to that category and determine that target person as the person corresponding to the other pairings in the category, where the other pairings are the pairings other than target pairings.
In some embodiments, the apparatus further comprises: a labeling unit configured to perform person labeling on the human head detection frame in each pairing based on the person corresponding to that pairing.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; storage means having one or more programs stored thereon which, when executed by one or more processors, cause the one or more processors to implement a method as in any one of the embodiments of the first aspect described above.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the method according to any one of the embodiments of the first aspect.
The person identification method, the person identification device, the electronic device and the computer readable medium provided by the embodiment of the application perform frame extraction on the target video so as to detect and identify the face and the clothes in each extracted frame, thereby generating a face detection frame, a clothes detection frame, a face identification result corresponding to the face detection frame and a clothes feature corresponding to the clothes detection frame. And then, detecting the head in each frame to generate a head detection frame, thereby determining the pairing of the head detection frame and the clothes detection frame in each frame. After the clothes features corresponding to the clothes detection frames in each pair are clustered, the person corresponding to each pair can be determined based on the face detection frames, the face recognition result and the clustering result. Therefore, the identity of the person in the video can be determined by combining the detection and the identification of the face, the detection and the identification of the clothes and the detection of the head. Because the person in the video has a certain relation with the clothing worn, the person identity can be determined through the detection of the head of the person and the detection and identification of the clothing for the frame without the front face of the person or the frame with the unclear front face, so that the recall rate of the person identification in the video is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow diagram of one embodiment of a person identification method according to the present application;
FIG. 2 is a flow diagram of yet another embodiment of a person identification method according to the present application;
FIG. 3 is a schematic diagram of one embodiment of a person identification device according to the present application;
FIG. 4 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Referring to fig. 1, a flow 100 of one embodiment of a person identification method according to the present application is shown. The person identification method comprises the following steps:
Step 101: extracting frames from the target video, and detecting and recognizing the faces and clothes in each extracted frame to generate the face detection frames, clothes detection frames, corresponding face recognition results, and corresponding clothes features.
In the present embodiment, the execution body of the person identification method (e.g., an electronic device such as a server) may first acquire the target video. The target video may be any video currently to be processed. In practice, a video may be described in frames: a frame is the smallest visual unit constituting a video, each frame is a static image, and temporally successive sequences of frames are composited together to form the video. The target video may be transmitted by a terminal device or pre-stored in the execution body, which is not limited herein.
After acquiring the target video, the execution body may extract frames from it using various existing frame extraction methods or tools (e.g., the open-source video processing tool FFmpeg). Here, all frames of the target video may be extracted, or frames may be extracted at a preset sampling rate (i.e., FPS, Frames Per Second). The frame extraction method is not limited herein.
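As a concrete illustration, sampled frame extraction can be sketched as follows. This is a minimal sketch assuming the OpenCV library and a sampling rate of one frame per second; both the tool choice and the rate are illustrative assumptions rather than part of the claimed method.

```python
import cv2  # assumption: OpenCV; the patent itself only names FFmpeg as one possible tool

def extract_frames(video_path, sample_fps=1.0):
    """Yield (index, frame) pairs sampled from the target video at
    roughly `sample_fps` frames per second (an illustrative rate)."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS metadata is missing
    step = max(int(round(native_fps / sample_fps)), 1)  # keep every `step`-th frame
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index, frame
        index += 1
    cap.release()
```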
After the frames are extracted, the execution body can detect and recognize the faces and clothes in each extracted frame to generate the face detection frames, clothes detection frames, the face recognition result corresponding to each face detection frame, and the clothes feature corresponding to each clothes detection frame. Specifically, this may be performed according to the following steps:
First, a face detection (Face Detection) technique in the related art may be used to perform face detection on each extracted frame to generate the face detection frames. In practice, face detection refers to the process of searching any given image using a certain strategy to determine whether it contains a face and, if so, returning the position, size, and pose of the face.
Then, the image area corresponding to each generated Face detection frame may be subjected to Face Recognition by using the existing Face Recognition (Face Recognition) technology, so as to obtain a Face Recognition result corresponding to each Face detection frame. In practice, the face recognition technology is a biometric technology for performing identity recognition based on facial feature information of a person. When the face recognition is performed on each image region, the probability that the face in the image region belongs to each known face can be obtained. When the probability of belonging to a certain known face is greater than a set threshold, the face in the image area can be considered as the known face. If the probability values obtained for a certain image region are all smaller than the threshold, it can be considered that no person is identified.
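The thresholding logic just described can be sketched as follows; the dict-based interface and the 0.8 threshold are illustrative assumptions, not values fixed by the method.

```python
def recognize_face(known_face_probs, threshold=0.8):
    """Return the best-matching known identity, or None when no
    probability clears the threshold (i.e. no person is identified).
    `known_face_probs` maps each known identity to the probability
    that the detected face belongs to it (a hypothetical interface)."""
    if not known_face_probs:
        return None
    name, prob = max(known_face_probs.items(), key=lambda kv: kv[1])
    return name if prob > threshold else None
```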
Thereafter, the clothing in each frame can be detected and identified using existing Object Detection (Object Detection) techniques. In practice, the task of object detection is to frame the position of an object (here, clothing) in an image (i.e., a clothing detection frame), and perform identification of the object at that position (here, identification of clothing). In the process of clothes identification, clothes features can be extracted. In practice, the clothing features may be represented by feature vectors. The clothing feature may be information for characterizing various elements of the clothing (e.g., color, style, etc.).
Step 102: detecting the human head in each frame to generate human head detection frames.
In this embodiment, the execution body may detect the human head in each extracted frame, and generate a human head detection frame. Here, the detection of the human head may also be performed based on object detection techniques.
As an example, the execution subject may perform detection of a human head in an image using a human head detection model trained in advance. The human head detection model can be obtained by performing supervised training on the existing convolutional neural network by using a machine learning method. Various existing structures can be used for the convolutional neural network, such as DenseBox, VGGNet, ResNet, SegNet, and the like. And the execution main body inputs each frame to the human head detection model in sequence, so that the human head detection frame in the frame can be obtained. The area surrounded by the human head detection frame contains the human head.
Step 103: determining the pairing of the human head detection frame and the clothes detection frame in each frame.
In this embodiment, the execution body may pair the head detection frame and the clothes detection frame in each of the extracted frames, so that each pair includes one head detection frame and one clothes detection frame, and the head detection frame and the clothes detection frame in each pair correspond to the same person.
In practice, because the head and the body of the person are connected, the pairing of the head detection frame and the clothes detection frame can be determined by judging whether the relative positions of the head detection frame and the clothes detection frame meet the preset condition, so that the person can be determined to be worn correspondingly. As an example, when the distance between a certain head detection frame and a certain clothes detection frame is smaller than a preset value, the head detection frame and the clothes detection frame may be regarded as a pair. As another example, if the distance between a certain head detection frame and a certain clothes detection frame is smaller than a preset value, and the ratio of the area of the certain head detection frame to the area of the certain clothes detection frame is within a preset ratio range, the certain head detection frame and the certain clothes detection frame may be regarded as a pair.
In some optional implementations of this embodiment, for the head detection box in each frame, the executing body may pair the head detection box with a clothing detection box that simultaneously satisfies the following conditions: the distance between the center of the clothes detection frame and the center of the human head detection frame is smaller than or equal to a first preset value, and the center of the clothes detection frame is located below the center of the human head detection frame. In practice, the human head detection frame and the clothes detection frame are rectangular frames. At this time, the intersection point of the two diagonal lines of the rectangular frame is the center of the rectangle.
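A minimal sketch of this pairing condition, assuming detection frames are given as (x1, y1, x2, y2) tuples in image coordinates (where y grows downward, so "below" means a larger y):

```python
import math

def box_center(box):
    """Center of a rectangular detection frame (the intersection of its
    two diagonals) for a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def is_pair(head_box, clothes_box, first_preset_value):
    """True when both conditions hold: the distance between the two centers
    is at most the first preset value, and the clothes center lies below
    the head center."""
    hx, hy = box_center(head_box)
    cx, cy = box_center(clothes_box)
    close_enough = math.hypot(cx - hx, cy - hy) <= first_preset_value
    return close_enough and cy > hy
```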
Step 104: clustering the clothes features corresponding to the clothes detection frames in each pairing to obtain a clustering result.
In this embodiment, the executing body may perform clustering on the clothes features corresponding to the clothes detection boxes in each pair to obtain a clustering result. In practice, the executing agent may perform clustering by using various existing clustering algorithms. For example, a K-means algorithm (K-means Clustering algorithm), a Clara algorithm (Clustering method in LARge Applications), or the like may be employed. The clustering algorithm used herein is not limited.
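As an illustration, clustering with K-means might look like the sketch below; scikit-learn and a fixed number of clusters are assumptions for the example, since the method does not prescribe a particular clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_clothes_features(features, n_clusters):
    """Cluster one clothes feature vector per pairing; the returned
    labels[i] is then the category of pairing i in the clustering result."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.asarray(features))
```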
It should be noted that each pair includes one clothing detection frame, and each clothing detection frame has a corresponding clothing feature, so that the clustering result of the clothing features can be regarded as the clustering result of each pair.
Step 105: determining the person corresponding to each pairing based on the generated face detection frames, human head detection frames, face recognition results, and clustering result.
In this embodiment, the executing entity may determine the person corresponding to each pair based on the generated face detection frame, the generated head detection frame, the generated face recognition result, and the generated clustering result. Specifically, the executing body may execute the following steps:
First, the clothes corresponding to each recognized person can be determined based on the generated face detection frames, human head detection frames, and face recognition results.
Here, since the face recognition result indicates a person's identity and the same person usually does not change clothes within a video, the execution body can determine the clothes corresponding to each recognized person. To begin with, the face detection frames in which a person was recognized can be selected. As described in step 101, face recognition on an image region yields the probability that the face belongs to each known face; when the probability for some known face exceeds the set threshold, the face in the region is taken to be that known face, and when all probabilities fall below the threshold, no person is recognized. The execution body can therefore select the face detection frames for which a person was recognized.
Then, since the positions of a person's face detection frame and human head detection frame are usually close to each other, the execution body may first determine the pairing corresponding to the selected face detection frame based on the positional relationship between the selected face detection frame and the human head detection frame in each pairing.
Optionally, the execution main body may determine whether the face detection frame and the head detection frame correspond to the same person by determining whether the face detection frame is included in the head detection frame. If the face detection frame is included in the human head detection frame, it can be considered that the face detection frame and the human head detection frame correspond to the same person. Otherwise, the two persons may not be considered to correspond to the same person.
Optionally, the execution body may determine whether a face detection frame and a human head detection frame correspond to the same person by determining whether their Intersection over Union (IoU) is greater than a preset threshold. If the IoU is greater than the threshold, the two frames can be considered to correspond to the same person; otherwise, they are not considered to correspond to the same person.
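For reference, the IoU of two axis-aligned detection frames can be computed as in this sketch (boxes again assumed to be (x1, y1, x2, y2) tuples):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(min(ax2, bx2) - max(ax1, bx1), 0.0)  # overlap width
    ih = max(min(ay2, by2) - max(ay1, by1), 0.0)  # overlap height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```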
Then, since the pair includes the clothes detection frames and the category of the clothes feature of each clothes detection frame is known, the execution subject can determine the category of the clothes corresponding to the selected face detection frame. Different categories may be used to indicate different clothing, and thus, the clothing worn by each identified person may be determined.
Second, a person corresponding to each pair can be determined based on the clustering result obtained in step 104 and the corresponding relationship between each person and clothes.
Here, since the type of clothes corresponding to the face detection frame in which the person is recognized is known (that is, the clothes worn by each person is known), the person corresponding to each type can be specified, and the person corresponding to the pair in each type can be obtained.
Optionally, if there is only one person corresponding to a certain category, each pair corresponding to the category may be determined, and the person corresponding to each pair is determined as the person. As an example, if a certain category (e.g., a "black dress" category) corresponds to 10 pairs (i.e., clothing features corresponding to clothing detection frames in 10 pairs are clustered into a "black dress" category), and it is known that the person corresponding to the "black dress" category has only person a (i.e., only person a wears a black dress), then the person corresponding to the 10 pairs can be determined as a.
Optionally, if at least two persons correspond to a certain category, the recognized person that appears most frequently among the pairings of that category may be determined as the person corresponding to the other pairings in the category. As an example, suppose a certain category (e.g., a "white short-sleeve shirt" category) corresponds to 10 pairings, and the persons corresponding to the category include person B and person C. If, among these 10 pairings, 3 were identified as corresponding to person B by the first step and 1 as corresponding to person C, the remaining 6 pairings can be identified as corresponding to person B.
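The per-category majority vote described in these two cases can be sketched as follows; the dict-based inputs are hypothetical, chosen only to make the example self-contained.

```python
from collections import Counter

def propagate_identities(pair_category, pair_person):
    """For each category, assign to its unidentified pairings the person most
    frequently recognized among that category's pairings.
    `pair_category[i]` is the cluster label of pairing i; `pair_person[i]` is
    the recognized person for pairing i, or None when no face was identified."""
    votes = {}  # category -> Counter of recognized persons
    for i, cat in pair_category.items():
        if pair_person.get(i) is not None:
            votes.setdefault(cat, Counter())[pair_person[i]] += 1
    result = dict(pair_person)
    for i, cat in pair_category.items():
        if result.get(i) is None and votes.get(cat):
            result[i] = votes[cat].most_common(1)[0][0]
    return result
```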
In some optional implementation manners of this embodiment, after the executing entity determines the person corresponding to each pair, the executing entity may perform person labeling on the head detection frame in each pair based on the person corresponding to each pair. Thus, more annotation data can be obtained. The labeling data can be further applied to model training, video collection of a certain person and the like.
In the method provided by the above embodiment of the present application, frames are extracted from a target video, so as to detect and identify faces and clothes in each extracted frame, thereby generating a face detection frame, a clothes detection frame, a face identification result corresponding to the face detection frame, and a clothes feature corresponding to the clothes detection frame. And then detecting the human head in each frame to generate a human head detection frame, so as to determine the pairing of the human head detection frame and the clothes detection frame in each frame. After the clothes features corresponding to the clothes detection frames in each pair are clustered, the person corresponding to each pair can be determined based on the face detection frame, the face recognition result and the clustering result. Therefore, the identity of the person in the video can be determined by combining the detection and the identification of the face, the detection and the identification of the clothes and the detection of the head. Through the detection and recognition of the human face, the identity of a person in a conventional case (such as a case where a front face appears) can be determined. Because the person in the video has a certain relation with the clothing worn, the person identity can be determined through the detection of the head of the person and the detection and identification of the clothing for the frame without the front face of the person or the frame with the unclear front face, so that the recall rate of the person identification in the video is improved.
With further reference to fig. 2, a flow 200 of yet another embodiment of a person identification method is shown. The process 200 of the person identification method includes the following steps:
Step 201: performing face detection and recognition on each extracted frame using a pre-trained face detection and recognition model to generate the face detection frames and corresponding face recognition results.
In this embodiment, the execution body may input each frame into the pre-trained face detection and recognition model in sequence to obtain the face detection frames in each frame and the face recognition result corresponding to each face detection frame. Here, the face detection and recognition model can detect and recognize the faces in an image. In practice, the model may consist of one model or of two or more models, which is not limited herein. When it consists of one model, it may be obtained by supervised training of an existing network structure such as RCNN (Regions with Convolutional Neural Network features) or Fast-RCNN.
In some optional implementations of the present embodiment, the face detection recognition model may include a face detection model and a face recognition model. The face detection model may be used to detect a face region in an image (here, each extracted frame) and generate a face detection frame. The face recognition model can be used for carrying out face recognition on an image area surrounded by the generated face detection frame and generating a face recognition result. In practice, the face detection model and the face recognition model may be obtained by performing supervised training on an existing Convolutional Neural Network (CNN) by using a machine learning method. Various existing structures can be used for the convolutional neural network, such as DenseBox, VGGNet, ResNet, SegNet, and the like.
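As a sketch of how the two models may be chained, where `face_detector` and `face_recognizer` are assumed callables standing in for the trained models rather than any specific library API:

```python
def detect_and_recognize_faces(frame, face_detector, face_recognizer):
    """Run the face detection model to obtain face detection frames, then run
    the face recognition model on each enclosed image region."""
    results = []
    for (x1, y1, x2, y2) in face_detector(frame):
        region = frame[y1:y2, x1:x2]        # image region enclosed by the frame
        identity = face_recognizer(region)  # e.g. an identity name, or None
        results.append(((x1, y1, x2, y2), identity))
    return results
```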
Step 202: performing clothes detection and recognition on each frame using a pre-trained clothes detection and recognition model to generate the clothes detection frames and corresponding clothes features.
In this embodiment, the execution body may input each frame into the pre-trained clothes detection and recognition model in sequence to obtain the clothes detection frames in each frame and the clothes feature corresponding to each clothes detection frame. Here, the clothes detection and recognition model can detect and recognize the clothes in an image. In practice, the model may consist of one model or of two or more models, which is not limited herein. When it consists of one model, it may be obtained by supervised training of an existing network structure such as RCNN or Fast-RCNN.
In some optional implementations of this embodiment, the clothing detection and identification model may include a clothing detection model and a clothing identification model. The clothing detection model may be used to detect clothing regions in the image (here, each extracted frame) and generate a clothing detection frame. The clothing recognition model can be used for extracting clothing features of an image area surrounded by the generated clothing detection frame. In practice, the clothing detection model and the clothing recognition model may be obtained by performing supervised training on an existing convolutional neural network by using a machine learning method. Various existing structures can be used for the convolutional neural network, such as DenseBox, VGGNet, ResNet, SegNet, and the like.
Step 203: detecting the human head in each frame to generate human head detection frames.
In this embodiment, the executing subject may perform detection of the human head in the image by using a human head detection model trained in advance. The human head detection model can be obtained by performing supervised training on the existing convolutional neural network by using a machine learning method. And the execution main body inputs each frame to the human head detection model in sequence, so that the human head detection frame in the frame can be obtained. The area surrounded by the human head detection frame contains the human head.
Step 204: determining the pairing of the human head detection frame and the clothes detection frame in each frame.
In this embodiment, for the head detection frame in each frame, the execution subject may pair the head detection frame with a clothing detection frame that satisfies the following conditions at the same time: the distance between the center of the clothes detection frame and the center of the human head detection frame is smaller than or equal to a first preset value, and the center of the clothes detection frame is located below the center of the human head detection frame.
Step 205: clustering the clothes features corresponding to the clothes detection frames in each pairing.
In this embodiment, the execution subject may cluster the clothes features corresponding to the clothes detection boxes in each pair, so as to cluster the features of the same or similar clothes into one class. Specifically, the executing agent may perform clustering using various existing clustering algorithms. For example, a K-means algorithm (K-means Clustering algorithm), a Clara algorithm (Clustering method in LARge Applications), or the like may be employed. The clustering algorithm used herein is not limited.
It should be noted that the operations of the steps 203-205 are substantially the same as the operations of the steps 102-104, and are not described herein again.
Step 206: determining each face detection frame whose face recognition result indicates a target face as a target face detection frame.
In this embodiment, the face recognition result may indicate the identity corresponding to the face in a face detection frame. Therefore, the execution body may determine the target face detection frames based on the face recognition results. Specifically, for each face detection frame, if the corresponding face recognition result indicates that the face in the frame is a target face, that face detection frame may be determined as a target face detection frame. A target face is the face of a target person, and the target persons include the actors in the cast list of the target video.
Step 207: for each target face detection frame, determining the corresponding target human head detection frame, taking the pairing containing that target human head detection frame as a target pairing, and taking the target person corresponding to the target face detection frame as the person corresponding to that target pairing.
In this embodiment, for each target face detection frame, the execution subject may perform the following steps:
First, determine the target human head detection frame corresponding to the target face detection frame, namely a human head detection frame whose intersection-over-union (IoU) with the target face detection frame is greater than or equal to a second preset value. The second preset value may be set in advance based on statistics over a large amount of data, and is not limited herein.
Because a person's face detection frame and human head detection frame are usually close in position, a human head detection frame whose IoU with the target face detection frame is greater than or equal to the second preset value can be considered to correspond to the same person as the target face detection frame.
And secondly, determining the pair where the target human head detection frame is located as a target pair.
And thirdly, determining the target person corresponding to the target face detection frame as the person corresponding to the target pairing. Thus, the corresponding relation between the recognized target person and the target pair can be determined.
As an example, suppose step 206 determines 5 target face detection frames, namely target face detection frames A, B, C, D, and E, whose corresponding target human head detection frames are a, b, c, d, and e in sequence, and whose pairings are target pairing 1 through target pairing 5 respectively. The face recognition results corresponding to target face detection frames A and B indicate target person M, those corresponding to C and D indicate target person N, and that corresponding to E indicates target person P. The persons corresponding to target pairings 1 to 5 are therefore target persons M, M, N, N, and P, respectively.
Step 208: taking the category, in the clustering result, of the clothes feature corresponding to the clothes detection frame in each pairing as the category of that pairing.
In this embodiment, after clustering the clothing features, the obtained clustering result may include a plurality of categories. The executing body may use a category of the clothing feature corresponding to the clothing detection box in each pair in the clustering result as a category of the pair.
Step 209: for each target pairing, taking the category of the target pairing as a target category and establishing a mapping between the target person corresponding to the target pairing and the target category.
In this embodiment, for each target pair, the executing entity may regard the category of the target pair as a target category, and establish a mapping between a target person corresponding to the target pair and the target category. This makes it possible to obtain the type of clothing corresponding to each recognized target person.
Continuing with the above example, the persons corresponding to target pairings 1 to 5 are target persons M, M, N, N, and P, respectively. Suppose the categories of the clothes corresponding to the clothes detection frames in target pairings 1 to 5 are "white short-sleeve shirt", "white short-sleeve shirt", "black one-piece dress", "black one-piece dress", and "black one-piece dress", respectively. The following mappings may then be established in turn: a first mapping between target person M and "white short-sleeve shirt", a second mapping between target person M and "white short-sleeve shirt", a third mapping between target person N and "black one-piece dress", a fourth mapping between target person N and "black one-piece dress", and a fifth mapping between target person P and "black one-piece dress".
Step 210: for each category in the clustering result, determining the target person having the most mappings to that category, and determining that target person as the person corresponding to the other pairings in the category.
In this embodiment, for each category in the clustering result, the execution body may determine the target person having the most mappings to that category and determine that target person as the person corresponding to the other pairings in the category, where the other pairings are the pairings other than target pairings.
Continuing with the above example, consider the "white short-sleeve shirt" category and suppose it contains 10 pairings. Two of them are target pairings (target pairing 1 and target pairing 2), and the other 8 are not (no person was recognized for them because the face was blurred or no frontal face appeared). Since both target pairings in the category correspond to target person M, the target person with the most mappings to the category is M. Therefore, the person corresponding to the other 8 pairings is target person M.
For the "black dress" category. If 10 pairs are included in the category. Three of the pairs are targets (i.e., the target pair 3, the target pair 4, and the target pair 5), and the other 7 pairs are not targets (i.e., no person is identified due to face blurring or no front face being exposed, and the person corresponding to the 7 pairs is not identified). Since there are two pairs corresponding to the target person M in the category, only one has a target pair corresponding to the target person P, i.e., the person having the most mapping with the category is the target person N. Thus, the person corresponding to the other 7 pairs can be identified as the target person N.
Step 211: performing person labeling on the human head detection frame in each pairing based on the person corresponding to that pairing.
In this embodiment, after the executing entity identifies the person corresponding to each pair, the executing entity may perform person labeling on the head detection frame in each pair based on the person corresponding to each pair. Thus, more annotation data can be obtained. The labeling data can be further applied to model training, video collection of a certain person and the like.
As can be seen from fig. 2, compared with the embodiment corresponding to fig. 1, the process 200 of the person identification method in this embodiment relates to a specific implementation manner for determining persons corresponding to each pair based on the face detection frame, the face identification result, and the clustering result, and further relates to a step of processing the video frame and labeling the head detection frame in each pair by using multiple models. Therefore, the recall rate of the person identification in the video is improved, the person identity in the image can be automatically labeled, and the data labeling efficiency is improved.
With further reference to fig. 3, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of a person identification apparatus, which corresponds to the embodiment of the method shown in fig. 1, and which is particularly applicable to various electronic devices.
As shown in fig. 3, the person identification apparatus 300 of the present embodiment includes: a detection and recognition unit 301 configured to extract frames from a target video, detect and recognize the faces and clothes in each extracted frame, and generate face detection frames, clothes detection frames, a face recognition result corresponding to each face detection frame, and a clothes feature corresponding to each clothes detection frame; a detection unit 302 configured to detect the human head in each frame and generate human head detection frames; a pairing unit 303 configured to determine the pairing of the human head detection frame and the clothes detection frame in each frame; a clustering unit 304 configured to cluster the clothes features corresponding to the clothes detection frames in each pairing to obtain a clustering result; and a determination unit 305 configured to determine the person corresponding to each pairing based on the generated face detection frames, human head detection frames, face recognition results, and clustering result.
In some optional implementation manners of this embodiment, the detection and identification unit 301 may include: a first detection and recognition module 3011, configured to perform face detection and recognition on each extracted frame by using a pre-trained face detection and recognition model, and generate a face detection frame and a face recognition result corresponding to the face detection frame; the second detection and recognition module 3012 is configured to perform clothes detection and recognition on each frame by using a pre-trained clothes detection and recognition model, and generate a clothes detection frame and clothes features corresponding to the clothes detection frame.
In some optional implementation manners of this embodiment, the face detection and recognition model may include a face detection model and a face recognition model, where the face detection model is used to detect a face region in an image and generate a face detection frame, and the face recognition model is used to perform face recognition on an image region surrounded by the face detection frame and generate a face recognition result.
In some optional implementations of the embodiment, the clothes detection and identification model includes a clothes detection model and a clothes identification model, where the clothes detection model is used to detect a clothes area in an image and generate a clothes detection frame, and the clothes identification model is used to extract clothes features from an image area surrounded by the clothes detection frame.
In some optional implementations of the present embodiment, the pairing unit 303 may be further configured to: for the human head detection frame in each frame, the human head detection frame and the clothes detection frame simultaneously meeting the following conditions are taken as a pair: the distance between the center of the clothes detection frame and the center of the human head detection frame is smaller than or equal to a first preset value, and the center of the clothes detection frame is located below the center of the human head detection frame.
In some optional implementations of this embodiment, the determining unit 305 may include: a first determination module 3051 configured to determine each face detection frame whose face recognition result indicates a target face as a target face detection frame, where a target face is the face of a target person and the target persons include the actors in the cast list of the target video; and an execution module 3052 configured to perform, for each target face detection frame, the following steps: determining the target human head detection frame corresponding to the target face detection frame, where the intersection-over-union of the target face detection frame and the target human head detection frame is greater than or equal to a second preset value; determining the pairing containing the target human head detection frame as a target pairing; and determining the target person corresponding to the target face detection frame as the person corresponding to that target pairing.
In some optional implementations of this embodiment, the determining unit 305 may further include a category division module 3053, an establishing module 3054, and a second determination module 3055. The category division module is configured to take the category of the clothes feature corresponding to the clothes detection frame in each pairing as the category of that pairing; the establishing module is configured to, for each target pairing, take the category of the target pairing as a target category and establish a mapping between the target person corresponding to the target pairing and the target category; and the second determination module is configured to, for each category, determine the target person having the most mappings to that category and determine that target person as the person corresponding to the other pairings in the category, where the other pairings are the pairings other than target pairings.
In some optional implementations of this embodiment, the apparatus may further include: and the marking unit is configured to mark the person in the head detection frame in each pair based on the person corresponding to each pair.
In the apparatus provided by the above embodiment of the present application, the detection and recognition unit 301 performs frame extraction on the target video, so as to detect and recognize the face and the clothes in each extracted frame, thereby generating a face detection frame, a clothes detection frame, a face recognition result corresponding to the face detection frame, and a clothes feature corresponding to the clothes detection frame. Then, the detecting unit 302 detects the human head in each frame to generate a human head detecting frame, so that the pairing unit 303 determines the pairing of the human head detecting frame and the clothing detecting frame in each frame. After the clustering unit 304 clusters the clothing features corresponding to the clothing detection frames in each pair, the determining unit 305 may determine the people corresponding to each pair based on the face detection frame, the face recognition result, and the clustering result. Therefore, the identity of the person in the video can be determined by combining the detection and the identification of the face, the detection and the identification of the clothes and the detection of the head. Because the person in the video has a certain relation with the clothing worn, the person identity can be determined through the detection of the head of the person and the detection and identification of the clothing for the frame without the front face of the person or the frame with the unclear front face, so that the recall rate of the person identification in the video is improved.
Referring now to FIG. 4, shown is a block diagram of a computer system 400 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU)401 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A driver 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as necessary, so that a computer program read out therefrom is mounted into the storage section 408 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 401. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising a detection and identification unit, a detection unit, a pairing unit, a clustering unit, and a determining unit. The names of these units do not, in some cases, constitute a limitation on the units themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: extract frames from a target video, and detect and recognize the faces and clothes in each extracted frame to generate face detection frames, clothes detection frames, face recognition results corresponding to the face detection frames, and clothes features corresponding to the clothes detection frames; detect the human heads in each frame to generate human head detection frames; determine pairs of human head detection frames and clothes detection frames in each frame; cluster the clothes features corresponding to the clothes detection frames in each pair to obtain a clustering result; and determine the person corresponding to each pair based on the face detection frames, the face recognition results, and the clustering result.
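To make the claimed flow easier to follow, the structural sketch below restates the five steps in Python. It is illustrative only: the callables passed in (recognize_faces, detect_clothes, detect_heads, and the pairing, clustering, and voting helpers) are hypothetical placeholders for the pre-trained models and procedures described above, not an API defined by this application.

```python
# Structural sketch of the claimed pipeline (illustrative; all injected
# callables are hypothetical placeholders, not an API of this application).
def identify_persons(frames, recognize_faces, detect_clothes, detect_heads,
                     pair_head_with_clothes, cluster_features, vote_identity):
    faces, pairs = [], []
    for idx, frame in enumerate(frames):
        # Step 1: detect and recognize faces; detect clothes and extract features.
        faces += [(idx, face_box, person) for face_box, person in recognize_faces(frame)]
        clothes = detect_clothes(frame)  # -> list of (clothes_box, feature)
        # Step 2: detect human heads.
        for head_box in detect_heads(frame):
            # Step 3: pair each head detection frame with a clothes detection frame.
            match = pair_head_with_clothes(head_box, clothes)
            if match is not None:
                clothes_box, feature = match
                pairs.append((idx, head_box, clothes_box, feature))
    # Step 4: cluster the clothes features across all pairs.
    labels = cluster_features([feature for *_, feature in pairs])
    # Step 5: assign a person to every pair from recognized faces and clusters.
    return vote_identity(pairs, labels, faces)
```

Concrete readings of the pairing, matching, and voting helpers are sketched after claims 5, 6, and 7 below.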
The above description is only a preferred embodiment of the application and an illustration of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention disclosed herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, any of the above features may be replaced with (but not limited to) technical features with similar functions disclosed in the present application.
Claims (18)
1. A person identification method, comprising:
the method comprises the steps of performing frame extraction on a target video, detecting and identifying faces and clothes in each extracted frame, and generating a face detection frame, a clothes detection frame, a face identification result corresponding to the face detection frame and clothes characteristics corresponding to the clothes detection frame, wherein the clothes characteristics represent various elements of clothes;
detecting the human head in each frame to generate a human head detection frame;
determining the pairing of the human head detection frame and the clothes detection frame in each frame;
clustering clothes characteristics corresponding to the clothes detection frames in each pair to obtain a clustering result;
determining the person corresponding to each pair based on the generated face detection frame, the generated head detection frame, the face recognition result and the clustering result, wherein the determining comprises the following steps: determining the clothes corresponding to each recognized person based on the generated face detection frame, the generated head detection frame and the face recognition result, and determining the persons corresponding to the pairs based on the clustering result and the correspondence between persons and clothes; and, for each class in the clustering result, determining the person corresponding to the largest number of recognized pairs in that class as the person corresponding to the other pairs in that class.
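By way of illustration (not part of the claims): the clustering step does not prescribe a particular algorithm, so the sketch below clusters clothes-feature vectors with a simple greedy distance threshold, purely to make the step concrete; the threshold value and the Euclidean metric are assumptions.

```python
def greedy_cluster(features, distance_threshold):
    """Assign a cluster id to each feature vector (illustrative only; the
    claim does not prescribe a clustering algorithm). Each cluster is
    represented by its first member, which keeps the sketch short."""
    representatives, labels = [], []
    for f in features:
        best_id, best_d = None, None
        for cid, rep in enumerate(representatives):
            d = sum((a - b) ** 2 for a, b in zip(f, rep)) ** 0.5  # Euclidean
            if best_d is None or d < best_d:
                best_id, best_d = cid, d
        if best_id is not None and best_d <= distance_threshold:
            labels.append(best_id)              # close enough: join this cluster
        else:
            representatives.append(list(f))     # too far: start a new cluster
            labels.append(len(representatives) - 1)
    return labels

# Two similar "clothes features" and one distant one yield two clusters.
print(greedy_cluster([(0.1, 0.2), (0.12, 0.19), (0.9, 0.8)], 0.2))  # [0, 0, 1]
```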
2. The person identification method of claim 1, wherein the detecting and recognizing of the faces and clothes in the extracted frames to generate a face detection frame, a clothes detection frame, a face recognition result corresponding to the face detection frame and clothes features corresponding to the clothes detection frame comprises:
carrying out face detection and recognition on each extracted frame by using a pre-trained face detection and recognition model to generate a face detection frame and a face recognition result corresponding to the face detection frame;
and utilizing a pre-trained clothes detection and identification model to perform clothes detection and identification on each frame to generate a clothes detection frame and clothes characteristics corresponding to the clothes detection frame.
3. The person identification method according to claim 2, wherein the face detection and recognition model comprises a face detection model and a face recognition model, wherein the face detection model is used for detecting a face region in an image and generating a face detection frame, and the face recognition model is used for performing face recognition on the image region surrounded by the face detection frame and generating a face recognition result.
4. The person identification method according to claim 3, wherein the clothes detection and identification model comprises a clothes detection model and a clothes recognition model, wherein the clothes detection model is used for detecting a clothes region in the image and generating a clothes detection frame, and the clothes recognition model is used for extracting clothes features from the image region surrounded by the clothes detection frame.
5. The person identification method according to claim 1, wherein the determining of the pair of the head detection frame and the clothing detection frame in each frame includes:
for the human head detection frame in each frame, the human head detection frame and the clothes detection frame simultaneously meeting the following conditions are taken as a pair: the distance between the center of the clothes detection frame and the center of the human head detection frame is smaller than or equal to a first preset value, and the center of the clothes detection frame is located below the center of the human head detection frame.
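As a concrete reading of the two conditions in claim 5 (not part of the claims), the following sketch tests whether a clothes detection frame should be paired with a human head detection frame. It assumes boxes given as (x1, y1, x2, y2) in image coordinates with y increasing downward; max_center_distance stands in for the first preset value.

```python
import math

def box_center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def is_pair(head_box, clothes_box, max_center_distance):
    """Claim-5 style pairing test (illustrative; y grows downward)."""
    hx, hy = box_center(head_box)
    cx, cy = box_center(clothes_box)
    close_enough = math.hypot(cx - hx, cy - hy) <= max_center_distance
    clothes_below_head = cy > hy  # larger y means lower in the image
    return close_enough and clothes_below_head

# A head box with a shirt box directly beneath it pairs successfully.
print(is_pair((40, 10, 80, 50), (30, 55, 95, 140), max_center_distance=80.0))  # True
```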
6. The person identification method of claim 4, wherein the determining of the person corresponding to each pair based on the generated face detection frame, the generated head detection frame, the face recognition result and the clustering result comprises:
determining a face detection frame whose face recognition result indicates a target face as a target face detection frame, wherein the target face is the face of a target person, and the target person comprises an actor in an actor list of the target video;
for each target face detection frame, the following steps are performed:
determining a target human head detection frame corresponding to the target human face detection frame, wherein the target human head detection frame is a human head detection frame whose intersection-over-union with the target human face detection frame is greater than or equal to a second preset value;
and taking the pair in which the target human head detection frame is located as a target pair, and taking the target person corresponding to the target human face detection frame as the person corresponding to the target pair.
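Reading the "intersection-over-union" of claim 6 as the usual IoU overlap measure for detection frames, the matching of a recognized face to a head detection frame could look like the sketch below (illustrative, not part of the claims); iou_threshold stands in for the second preset value.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def find_target_head(face_box, head_boxes, iou_threshold):
    """Return the head box treated as belonging to the recognized face
    (claim-6 style matching), or None if no head overlaps enough."""
    best = max(head_boxes, key=lambda h: iou(face_box, h), default=None)
    if best is not None and iou(face_box, best) >= iou_threshold:
        return best
    return None

# The face box sits inside the first head box, so that head is selected.
print(find_target_head((45, 15, 75, 45), [(40, 10, 80, 50), (200, 10, 240, 50)], 0.3))
```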
7. The person identification method according to claim 6, wherein the determining of the person corresponding to each pair based on the generated face detection frame, head detection frame, face recognition result, and clustering result further comprises:
taking the class of the clothes feature corresponding to the clothes detection frame in each pair in the clustering result as the class of the pair;
for each target pair, taking the class of the target pair as a target class, and establishing a mapping between the target person corresponding to the target pair and the target class;
and for each class in the clustering result, determining the target person having the most mappings to the class, and determining that target person as the person corresponding to the other pairs in the class, wherein the other pairs are the pairs in the class other than the target pairs.
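Taken together, claims 6 and 7 amount to a majority vote inside each clothes cluster: identities obtained by direct face recognition are propagated to the pairs whose faces were not recognized. The sketch below illustrates that propagation (not part of the claims); the pair and cluster identifiers are hypothetical, and ties are broken arbitrarily by Counter.most_common.

```python
from collections import Counter

def propagate_identities(pair_clusters, recognized):
    """Claim-7 style propagation (illustrative).

    pair_clusters: dict pair_id -> cluster_id for every pair
    recognized:    dict pair_id -> person, for target pairs only
    Returns a dict pair_id -> person (None if the cluster has no identity)."""
    votes = {}  # cluster_id -> Counter of persons mapped to that cluster
    for pair_id, person in recognized.items():
        votes.setdefault(pair_clusters[pair_id], Counter())[person] += 1

    result = {}
    for pair_id, cluster_id in pair_clusters.items():
        if pair_id in recognized:
            result[pair_id] = recognized[pair_id]  # keep direct recognition
        elif cluster_id in votes:
            # The person with the most mappings to this cluster wins the vote.
            result[pair_id] = votes[cluster_id].most_common(1)[0][0]
        else:
            result[pair_id] = None  # no recognized face anywhere in the cluster
    return result

# Pair "a" was recognized as Actor X; "b" and "c" share its clothes cluster.
print(propagate_identities({"a": 0, "b": 0, "c": 0, "d": 1}, {"a": "Actor X"}))
# {'a': 'Actor X', 'b': 'Actor X', 'c': 'Actor X', 'd': None}
```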
8. The person identification method according to any one of claims 1 to 7, wherein after the person corresponding to each pair is determined based on the face detection frame, the face recognition result, and the clustering result, the method further comprises:
and performing person labeling on the human head detection frame in each pair based on the person corresponding to that pair.
9. A person recognition apparatus, comprising:
the detection and identification unit is configured to extract frames of a target video, detect and identify faces and clothes in each extracted frame, and generate a face detection frame, a clothes detection frame, a face identification result corresponding to the face detection frame and clothes features corresponding to the clothes detection frame, wherein the clothes features represent various elements of clothes;
a detection unit configured to detect a human head in each of the frames to generate a human head detection frame;
a pairing unit configured to determine a pairing of the head detection frame and the clothing detection frame in the respective frames;
the clustering unit is configured to cluster the clothes characteristics corresponding to the clothes detection frames in each pair to obtain a clustering result;
a determining unit configured to determine the person corresponding to each of the pairs based on the generated face detection frame, the head detection frame, the face recognition result, and the clustering result, including: determining the clothes corresponding to each recognized person based on the generated face detection frame, the head detection frame and the face recognition result, and determining the persons corresponding to the pairs based on the clustering result and the correspondence between persons and clothes; and, for each class in the clustering result, determining the person corresponding to the largest number of recognized pairs in that class as the person corresponding to the other pairs in that class.
10. The person recognition apparatus according to claim 9, wherein the detection and identification unit includes:
the first detection and recognition module is configured to perform face detection and recognition on each extracted frame by using a pre-trained face detection and recognition model to generate a face detection frame and a face recognition result corresponding to the face detection frame;
and the second detection and identification module is configured to detect and identify clothes of each frame by using a pre-trained clothes detection and identification model, and generate a clothes detection frame and clothes characteristics corresponding to the clothes detection frame.
11. The person recognition apparatus according to claim 10, wherein the face detection and recognition model includes a face detection model and a face recognition model, wherein the face detection model is configured to detect a face region in an image and generate a face detection frame, and the face recognition model is configured to perform face recognition on the image region surrounded by the face detection frame and generate a face recognition result.
12. The person recognition apparatus according to claim 11, wherein the clothes detection and identification model includes a clothes detection model and a clothes recognition model, wherein the clothes detection model is configured to detect a clothes region in the image and generate a clothes detection frame, and the clothes recognition model is configured to extract clothes features from the image region surrounded by the clothes detection frame.
13. The person recognition apparatus according to claim 9, wherein the pairing unit is further configured to:
for the human head detection frame in each frame, the human head detection frame and the clothes detection frame simultaneously meeting the following conditions are taken as a pair: the distance between the center of the clothes detection frame and the center of the human head detection frame is smaller than or equal to a first preset value, and the center of the clothes detection frame is located below the center of the human head detection frame.
14. The person recognition apparatus according to claim 12, wherein the determining unit includes:
a first determination module configured to determine a face detection frame indicating a face recognition result as a target face detection frame, wherein the target face is a face of a target character including an actor in an actor table of the target video; (ii) a
An execution module configured to execute, for each target face detection frame, the following steps:
determining a target human head detection frame corresponding to the target human face detection frame, wherein the target human head detection frame is a human head detection frame whose intersection-over-union with the target human face detection frame is greater than or equal to a second preset value;
and taking the pair in which the target human head detection frame is located as a target pair, and taking the target person corresponding to the target human face detection frame as the person corresponding to the target pair.
15. The person recognition apparatus according to claim 14, wherein the determining unit further includes:
the class dividing module is configured to take the class of the clothes feature corresponding to the clothes detection frame in each pair in the clustering result as the class of the pair;
the establishing module is configured to, for each target pair, take the class of the target pair as a target class, and establish a mapping between the target person corresponding to the target pair and the target class;
and the second determining module is configured to, for each class in the clustering result, determine the target person having the most mappings to the class, and determine that target person as the person corresponding to the other pairs in the class, wherein the other pairs are the pairs in the class other than the target pairs.
16. The person recognition apparatus according to any one of claims 9 to 15, wherein the apparatus further comprises:
and the labeling unit is configured to perform person labeling on the human head detection frame in each pair based on the person corresponding to that pair.
17. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-8.
18. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910304196.6A CN110163096B (en) | 2019-04-16 | 2019-04-16 | Person identification method, person identification device, electronic equipment and computer readable medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110163096A CN110163096A (en) | 2019-08-23 |
CN110163096B true CN110163096B (en) | 2021-11-02 |
Family
ID=67639519
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910304196.6A Active CN110163096B (en) | 2019-04-16 | 2019-04-16 | Person identification method, person identification device, electronic equipment and computer readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110163096B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569777B (en) * | 2019-08-30 | 2022-05-06 | 深圳市商汤科技有限公司 | Image processing method and device, electronic device and storage medium |
CN110807361B (en) * | 2019-09-19 | 2023-08-08 | 腾讯科技(深圳)有限公司 | Human body identification method, device, computer equipment and storage medium |
CN110751086A (en) * | 2019-10-17 | 2020-02-04 | 北京字节跳动网络技术有限公司 | Target searching method, device, equipment and storage medium based on video |
CN111476214A (en) * | 2020-05-21 | 2020-07-31 | 北京爱笔科技有限公司 | Image area matching method and related device |
CN112132030B (en) * | 2020-09-23 | 2024-05-28 | 湖南快乐阳光互动娱乐传媒有限公司 | Video processing method and device, storage medium and electronic equipment |
CN112308011B (en) * | 2020-11-12 | 2024-03-19 | 湖北九感科技有限公司 | Multi-feature combined target detection method and device |
CN112906525B (en) * | 2021-02-05 | 2024-10-18 | 广州市百果园信息技术有限公司 | Age identification method and device and electronic equipment |
CN114332940B (en) * | 2021-12-30 | 2024-06-07 | 北京爱奇艺科技有限公司 | Model training method, clothing recognition processing method, related device and terminal |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103942539A (en) * | 2014-04-09 | 2014-07-23 | 上海交通大学 | Method for accurately and efficiently extracting human head ellipse and detecting shielded human face |
CN107943955A (en) * | 2017-11-24 | 2018-04-20 | 谭云 | A kind of clothing information collecting device, information management system and method |
CN108154171A (en) * | 2017-12-20 | 2018-06-12 | 北京奇艺世纪科技有限公司 | A kind of character recognition method, device and electronic equipment |
CN108154099A (en) * | 2017-12-20 | 2018-06-12 | 北京奇艺世纪科技有限公司 | A kind of character recognition method, device and electronic equipment |
CN109241871A (en) * | 2018-08-16 | 2019-01-18 | 北京此时此地信息科技有限公司 | A kind of public domain stream of people's tracking based on video data |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107886032B (en) * | 2016-09-30 | 2021-12-14 | 阿里巴巴集团控股有限公司 | Terminal device, smart phone, authentication method and system based on face recognition |
2019-04-16: application CN201910304196.6A filed in CN; granted as patent CN110163096B (status: Active).
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110163096B (en) | Person identification method, person identification device, electronic equipment and computer readable medium | |
CN108509915B (en) | Method and device for generating face recognition model | |
CN107633204B (en) | Face occlusion detection method, apparatus and storage medium | |
WO2021139324A1 (en) | Image recognition method and apparatus, computer-readable storage medium and electronic device | |
Ahmed et al. | Vision based hand gesture recognition using dynamic time warping for Indian sign language | |
CN110826370B (en) | Method and device for identifying identity of person in vehicle, vehicle and storage medium | |
CN109034069B (en) | Method and apparatus for generating information | |
CN110796089B (en) | Method and apparatus for training face model | |
US11126827B2 (en) | Method and system for image identification | |
CN108960412B (en) | Image recognition method, device and computer readable storage medium | |
CN108491872B (en) | Object re-recognition method and apparatus, electronic device, program, and storage medium | |
Kalas | Real time face detection and tracking using OpenCV | |
CN111931628B (en) | Training method and device of face recognition model and related equipment | |
CN111597910A (en) | Face recognition method, face recognition device, terminal equipment and medium | |
CN114549557A (en) | Portrait segmentation network training method, device, equipment and medium | |
Lahiani et al. | Hand pose estimation system based on Viola-Jones algorithm for android devices | |
CN113378790B (en) | Viewpoint positioning method, apparatus, electronic device, and computer-readable storage medium | |
CN111783677B (en) | Face recognition method, device, server and computer readable medium | |
Azad et al. | Real-time human face detection in noisy images based on skin color fusion model and eye detection | |
Senthilkumar et al. | Suspicious human activity detection in classroom examination | |
CN117671553A (en) | Target identification method, system and related device | |
Muchtar et al. | Moving pedestrian localization and detection with guided filtering | |
CN111753618A (en) | Image recognition method and device, computer equipment and computer readable storage medium | |
US8509541B2 (en) | Method for eye detection for a given face | |
Kaliraj et al. | Robust skin color-based moving object detection for video surveillance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |