WO2021213158A1 - Method and system for a real-time face summary service on an intelligent video conference terminal - Google Patents

Method and system for a real-time face summary service on an intelligent video conference terminal

Info

Publication number
WO2021213158A1
WO2021213158A1 · PCT/CN2021/084231 · CN2021084231W
Authority
WO
WIPO (PCT)
Prior art keywords: face, gallery, real, image, detection
Prior art date
Application number
PCT/CN2021/084231
Other languages
English (en)
French (fr)
Inventor
张晓帅
Original Assignee
厦门亿联网络技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 厦门亿联网络技术股份有限公司 filed Critical 厦门亿联网络技术股份有限公司
Publication of WO2021213158A1

Links

Images

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval of video data
    • G06F16/73: Querying
    • G06F16/738: Presentation of query results
    • G06F16/739: Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation

Definitions

  • The invention belongs to the technical field of face recognition and specifically relates to a method and system for a real-time face summary service on an intelligent video conference terminal.
  • The face summary service collects face images from the video pictures and enters them into a face gallery.
  • The face gallery stores the facial features of the participants and filters them into high-quality face blocks for display. The face summary service therefore faces several technical problems. First, several faces may appear in the same scene of a video conference: how can the clearest view of each face be shown in the gallery? Second, because video conference images are dynamic, how can faces be extracted from the video quickly and in real time and displayed in the gallery?
  • Chinese patent application No. 201510158931.9 discloses a face summary method and a video summary method based on face recognition. The method generates face images of the different people appearing in an original video and forms a list of those face images; it includes scanning the image frames of the original video, detecting whether face regions exist in the frames, and performing facial feature extraction, face feature clustering, and face summary image generation.
  • Although the prior-art methods can generate face summaries and video summaries, they suffer from several problems.
  • The first is the clarity of the generated face summaries: existing methods generate the summary directly from face recognition, but complex and changeable conditions in the video scene, such as lighting changes, motion blur, and varying face scale, severely degrade the recognition rate, so the resulting summaries have low clarity and can hardly meet the scenario's requirements.
  • Second, most current methods rely on neural network models for face recognition and face detection, and the models used in the prior art are often large. The resulting computational load makes them difficult to deploy directly on terminal equipment, so fast detection and recognition usually depend on a large machine platform such as a cloud server. With that approach, detection and recognition are performed on the server and the results are returned to the terminal, which introduces delay and poor real-time performance, degrades the quality of the face summary service, and makes it difficult to meet the needs of scenarios such as video conferences.
  • In summary, existing face summary services have poor quality, low definition, and a large computational load; they cannot generate face summaries quickly and accurately in real time, the algorithm models cannot be deployed directly on terminal equipment, and reliance on large-scale computing equipment yields poor real-time performance at high cost.
  • The present invention provides a method and system for a real-time face summary service on an intelligent video conference terminal. The invention can generate a face summary of good quality and high definition; the method requires little computation, runs fast with good real-time performance, and can be deployed directly on the terminal device, which reduces the computing cost.
  • The real-time face summary service method of the intelligent video conference terminal of the present invention includes:
  • S1: Initialize the face detection model, face alignment model, face recognition model, and face gallery, and perform model loading and memory allocation;
  • S2: Acquire video frames and preprocess the frame images;
  • S3: Use the face detection model to perform face detection on the preprocessed frame images;
  • S4: Use the face detection results to initialize trackers, use the trackers to track the faces in the video frames, and capture the position information of each face;
  • S5: Crop face image blocks using the detection-frame coordinates output by face detection or face tracking, input each block into the face alignment model to obtain the coordinates of the facial key points, and then use a similarity transform to map the face to a standard face image;
  • S6: Input the standard face image into the face recognition model, perform feature mapping based on the distinguishing features of the face to obtain vectorized face feature data, and recognize the face in the frame image;
  • S7: Enter the recognized face images into the face gallery and update the gallery through face optimization.
  • In step S7, when updating the face gallery through face optimization, it is first judged whether pre-entered face image information exists in the gallery, and one of the following operations is performed according to the result:
  • If the gallery contains no pre-entered face image information, the face images appearing in the video are entered automatically; through face optimization, high-quality face images are updated automatically over time, and all faces that have appeared in the gallery are retained;
  • If the gallery contains pre-entered face image information, the corresponding face ID names are labeled, face images that appear in the video but were not pre-entered are added to the gallery, and face optimization then continuously updates the face images in the gallery.
  • The face optimization method includes:
  • Using the detection frames output by face detection, filter out face images whose detection-frame area is smaller than the face area threshold;
  • Using the confidence scores output by face detection, filter out face images whose confidence score is below the confidence threshold;
  • Using the facial key points, compute the face pose score and filter out face images whose pose score is below the pose score threshold;
  • Using the SMD algorithm, compute the sharpness of the face image and filter out face images whose sharpness is below the sharpness threshold;
  • From the detection-frame area, confidence score, pose score, and sharpness, the face quality value is calculated as:
  • Q = 10000 × Q_c + 3 × Q_a + Q_f + 2 × Q_s
  • where Q is the face quality value, Q_c the face confidence score, Q_a the face area score (Q_a = 1 − detection-frame area / 7680), Q_s the face sharpness, and Q_f the face pose score. A hedged computation sketch follows below.
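  • Read literally, the weighting makes the confidence score dominate the quality value. A minimal Python sketch of the scoring, assuming the component scores are obtained as described in this document (the function name is illustrative; the constants come from the formula above):

```python
def face_quality(conf: float, box_area: float, pose: float, sharpness: float) -> float:
    """Combine per-face scores into the quality value Q of the patent's formula.

    conf:      face confidence score Q_c from the detector
    box_area:  face detection-frame area in pixels
    pose:      face pose score Q_f computed from the key points
    sharpness: SMD sharpness Q_s of the face image block
    """
    q_a = 1.0 - box_area / 7680.0          # face area score as defined above
    return 10000.0 * conf + 3.0 * q_a + pose + 2.0 * sharpness
```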
  • In step S7, the specific method for updating the face gallery is described below (similarity judgment, quality comparison, and timed deletion).
  • In step S4, a single-target tracking scheme is adopted: when the trackers are initialized, one tracker is initialized for each detected face detection frame, and during the tracking period each tracker outputs the coordinates of its face's detection frame in the current frame.
  • The face detection model uses a cascaded convolutional neural network for face detection, composed in sequence of the P-Net, R-Net, and O-Net networks.
  • The P-Net network uses standard convolutions to roughly screen out candidate face detection frames in the video frame.
  • The R-Net and O-Net networks use standard convolutions and depthwise separable convolutions to extract facial feature data from the image, filtering and refining the detection frames to obtain the final face position information.
  • The face alignment model uses a convolutional neural network to extract the key points of the face: the network extracts key-point features with standard convolutions and depthwise separable convolutions and uses a single FC (fully connected) layer as its output.
  • The face recognition network model uses several MBConv convolutional network modules connected in series to extract distinguishing facial features and perform feature mapping, recognizing the faces in the video frame.
  • Before a face image block is sent to the face recognition model for face recognition, it undergoes a second detection pass to prevent false detections.
  • The real-time face summary service system of the intelligent video conference terminal of the present invention uses the method of the present invention to perform the real-time face summary service of the intelligent video conference terminal.
  • Compared with the prior art, the present invention has the following advantages:
  • (1) The real-time face summary service method of the smart video conference terminal performs face detection, face tracking, face alignment, face recognition, and face optimization, and then generates and updates a face gallery. Relative to prior-art methods, the invention adds a face optimization operation: during face summary generation, poor-quality face images are continuously filtered out, which on the one hand speeds up summary generation and on the other hand keeps updating the gallery so that its face images have higher quality and clarity.
  • (2) The face optimization method filters out face images whose detection-frame area is too small or whose confidence score is too low, based on the outputs of face detection and face tracking, thereby avoiding detection errors. Using the facial pose score, it filters out images in which the face is strongly tilted or in profile; using the computed sharpness, it filters out images blurred by lighting changes or motion. Filtering such non-conforming images effectively reduces computation and improves real-time performance. A calculation method for the face quality value is also given: the quality value makes it possible to evaluate face image quality quickly and to update the gallery's face image quality quickly, so the generated face summary has higher quality and better clarity.
  • (3) The present invention adopts a single-target tracker scheme: one tracker is initialized for each detected face, and it tracks for only one tracking period. Because face tracking is far more efficient than face detection, outputting face positions in real time through tracking effectively improves processing speed and hence the real-time performance of the face summary service; limiting tracking to one period avoids losing or missing faces when tracking runs too long, which effectively improves the method's accuracy.
  • (4) The face detection model of the present invention uses a cascaded convolutional neural network composed in sequence of P-Net, R-Net, and O-Net. The cascade, together with standard and depthwise separable convolutions, effectively reduces network depth and simplifies the convolutional-layer configuration; the model is only 83 Kb. This lowers the computational load and raises the network's speed, so faces in a video frame can be detected quickly and the algorithm runs in real time. Because the model is so small, it can be deployed on low-power video conferencing terminals without relying on a large computing platform, which improves real-time performance and reduces cost.
  • (5) The face alignment model of the present invention extracts facial key points with a convolutional neural network that uses standard and depthwise separable convolutions and a single FC fully connected layer as output. This simplifies the convolutional-layer configuration and reduces network depth and model size, lowering computation and improving the algorithm's real-time performance; the model can be deployed on low-power video conferencing terminals without a large computing platform, improving real-time performance while reducing cost.
  • (6) The face recognition model of the present invention uses several MBConv convolutional network modules in series instead of standard convolutions, which greatly reduces computation while significantly improving recognition precision; it therefore maintains high recognition accuracy with fast computation, so faces can be recognized from images quickly and accurately. The MBConv modules also keep the convolutional neural network model small, so it can be applied directly on low-power video conferencing terminals, reducing cost.
  • Figure 1 is a flowchart of the real-time face summary service method of the intelligent video conference terminal of the present invention;
  • Figure 2 is a structural diagram of the P-Net network in the face detection model of the present invention;
  • Figure 3 is a structural diagram of the R-Net network in the face detection model of the present invention;
  • Figure 4 is a structural diagram of the O-Net network in the face detection model of the present invention;
  • Figure 5 is a structural diagram of the convolutional neural network of the face alignment model of the present invention;
  • Figure 6 is a network structure diagram of the MBConv network module;
  • Figure 7 is a structural diagram of the convolutional neural network of the face recognition model of the present invention;
  • Figure 8 is an effect diagram of the method of the present invention applied to a video conference;
  • Figure 9 is an effect diagram of the method of the present invention applied to another video conference;
  • Figure 10 is an effect diagram of the method of the present invention applied to face check-in (attendance clock-in);
  • Figure 11 shows the face images pre-entered when the method of the present invention is applied to face check-in.
  • With reference to Figure 1, the method for the real-time face summary service of the intelligent video conference terminal of the present invention includes:
  • S1: Initialize the face detection model, face alignment model, face recognition model, and face gallery, and perform model loading and memory allocation;
  • S2: Acquire video frames and preprocess the frame images;
  • S3: Use the face detection model to perform face detection on the preprocessed frame images;
  • S4: Use the face detection results to initialize trackers, use the trackers to track the faces in the video frames, and capture the position information of each face;
  • S5: Crop face image blocks using the detection-frame coordinates output by face detection or face tracking, input each block into the face alignment model to obtain the coordinates of the facial key points, and then use a similarity transform to map the face to a standard face image;
  • S6: Input the standard face image into the face recognition model, perform feature mapping based on the distinguishing features of the face to obtain vectorized face feature data, and recognize the face in the frame image;
  • S7: Enter the recognized face images into the face gallery and update the gallery through face optimization.
  • The function of the face gallery is to display the faces that appear in the video.
  • In step S2, preprocessing of the frame image includes format conversion, scaling, and normalization.
  • The image input required by the various algorithm models is generally in RGB format, whereas the format of the actual video frames varies with the scenario (YUV, ARGB, and so on).
  • Format conversion means uniformly converting image pixels in these formats into an RGB arrangement to match the models' requirements.
  • Scaling means shrinking the original-resolution image, for example a 1080p image, proportionally; the scaling method is bilinear interpolation. Scaling reduces the models' running time: the higher the resolution, the longer the processing takes, so scaling raises the computation speed of the various models and gives them better real-time performance.
  • Normalization maps image pixels from 0-255 into the range -1 to 1. Normalization helps speed up convergence and improve accuracy during model training; since the data processing at test time must be identical to that used in training, incoming image data must be normalized the same way.
  • The formula used in the present invention to normalize the image maps each pixel linearly from 0-255 to -1 to 1:
  • x′ = (x − 127.5) / 127.5
  • where x is each pixel value of the original image and x′ is the normalized pixel value. A preprocessing sketch follows below.
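  • To make the three preprocessing steps concrete, here is a minimal OpenCV/NumPy sketch. The BGR input format and the 320 × 240 target size are illustrative assumptions; the patent does not fix the scaled resolution:

```python
import cv2
import numpy as np

def preprocess(frame_bgr: np.ndarray, size=(320, 240)) -> np.ndarray:
    # 1) Format conversion: unify the incoming pixel format (YUV, ARGB, ...) to RGB.
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    # 2) Scaling: shrink proportionally with bilinear interpolation to cut runtime.
    small = cv2.resize(rgb, size, interpolation=cv2.INTER_LINEAR)
    # 3) Normalization: map pixel values from 0..255 to -1..1.
    return (small.astype(np.float32) - 127.5) / 127.5
```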
  • Face detection determines the size and position of faces in the image, i.e., it answers the question "where is the face," and crops the true face region out of the image for subsequent facial feature analysis and recognition.
  • The face detection model of the present invention uses a cascaded convolutional neural network composed, in order, of the P-Net, R-Net, and O-Net networks.
  • The P-Net network uses standard convolutions to roughly screen out candidate face detection frames in the video frame; the R-Net and O-Net networks use standard and depthwise separable convolutions to extract facial feature data for filtering and refining the detection frames, yielding the final face position information. Here "standard convolution" refers to the ordinary, general form of convolution. The detection model outputs the number of people appearing in the current frame, the confidence scores, and the coordinates of the face detection frames.
  • As shown in Figure 2, the network structure of the P-Net (Proposal Network) is: convolutional layers C101 and C102 connected in sequence, with C102 feeding both convolutional layer C103 and convolutional layer C104.
  • Layers C101-C104 all use standard convolution.
  • The kernel size of C101 is 3 × 3 with 8 channels; C102 is 3 × 3 with 16 channels; C103 is 3 × 3 with 4 channels; and C104 is 3 × 3 with 2 channels. A hedged sketch of this layout follows below.
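  • A minimal PyTorch rendering of this P-Net layout, using the kernel sizes and channel counts listed above. Strides, padding, the activation function, and the assignment of C103/C104 to box-regression and classification heads are assumptions, since the text does not state them:

```python
import torch.nn as nn

class PNet(nn.Module):
    """Sketch of the P-Net stage: two stacked convs feeding two output heads."""
    def __init__(self):
        super().__init__()
        self.c101 = nn.Conv2d(3, 8, kernel_size=3)    # standard conv, 8 channels
        self.c102 = nn.Conv2d(8, 16, kernel_size=3)   # standard conv, 16 channels
        self.c103 = nn.Conv2d(16, 4, kernel_size=3)   # 4 channels: box offsets (assumed)
        self.c104 = nn.Conv2d(16, 2, kernel_size=3)   # 2 channels: face/non-face (assumed)
        self.act = nn.PReLU()                         # activation choice is an assumption

    def forward(self, x):
        x = self.act(self.c101(x))
        x = self.act(self.c102(x))
        return self.c104(x), self.c103(x)             # class map, box map
```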
  • As shown in Figure 3, the network structure of the R-Net (Refine Network) comprises convolutional layers C201 through C206 connected in sequence, with C206 feeding fully connected layers FC201 and FC202 for output.
  • Layers C201, C203, and C205 use standard convolution: C201 has a 3 × 3 kernel and 16 channels; C203 a 1 × 1 kernel and 32 channels; and C205 a 1 × 1 kernel and 64 channels.
  • Layers C202, C204, and C206 use depthwise separable convolution with 3 × 3 kernels.
  • Fully connected layers FC201 and FC202 each have 64 neurons; FC201 makes the classification judgment and FC202 outputs the coordinate values of the face detection frame.
  • In the embodiment of the present invention, the input size of the R-Net network is 24 × 24 × 3.
  • As shown in Figure 4, the network structure of the O-Net (Output Network) comprises convolutional layers C301 through C308 connected in sequence, with C308 feeding fully connected layers FC301 and FC302 for output.
  • Layers C301, C303, C305, and C307 use standard convolution: C301 has a 3 × 3 kernel and 16 channels; C303 a 1 × 1 kernel and 32 channels; C305 a 1 × 1 kernel and 64 channels; and C307 a 1 × 1 kernel and 128 channels.
  • Layers C302, C304, C306, and C308 use depthwise separable convolution with 3 × 3 kernels.
  • Fully connected layers FC301 and FC302 each have 128 nodes; FC301 makes the classification judgment and FC302 outputs the coordinate values of the face detection frame.
  • In the embodiment of the present invention, the input size of the O-Net network is 32 × 32 × 3.
  • The face detection model of the present invention thus uses a cascaded convolutional neural network whose depthwise separable and standard convolutions streamline the network depth and the convolutional-layer configuration, so the overall detection model is only 86 Kb. This realizes real-time face detection on low-power video conferencing terminals: faces in the video can be recognized quickly and in real time without large computing equipment, so the model is not only fast and real-time but also cost-reducing.
  • Face tracking continues to capture a face's position in subsequent frames once the face has been detected.
  • Tracker initialization depends on the detection-frame coordinates output by face detection. Since tracking is normally far more time-efficient than detection, outputting face positions in real time through tracking benefits the method's real-time performance.
  • To mitigate the loss or omission of faces during long tracking runs, a tracking period and a parameter MAX are set, meaning one tracking period lasts at most MAX frames; MAX is generally set to 10-25.
  • In step S4, a single-target tracking scheme is adopted: at initialization, one tracker is created for each detected face, and during the tracking period each tracker outputs the position of its face in the current frame, i.e., it tracks MAX frames of video according to the configured MAX value.
  • Through face tracking, face positions can be output in real time, which relieves the pressure on face detection and improves the real-time performance of the method. The period logic is sketched below.
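  • The period logic can be condensed into a short sketch: run full detection on the first frame of each period, then let one tracker per face carry its box for up to MAX frames. The detect and make_tracker callables are abstract placeholders; the patent does not name a specific tracking algorithm:

```python
MAX = 15  # one tracking period is at most MAX frames (the text suggests 10-25)

def face_boxes_stream(frames, detect, make_tracker):
    """Yield per-frame face boxes, re-running detection once per tracking period."""
    trackers, age = [], 0
    for frame in frames:
        if not trackers or age >= MAX:
            boxes = detect(frame)                                # full face detection
            trackers = [make_tracker(frame, b) for b in boxes]   # one tracker per face
            age = 0
        else:
            boxes = [t.update(frame) for t in trackers]          # cheap per-frame tracking
            age += 1
        yield boxes
```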
  • Face alignment is needed because the same person may present different poses and expressions in different image sequences, which hinders face recognition; all face images therefore need to be transformed to a uniform angle and pose.
  • The principle is to find the key points of the face. In the embodiment of the present invention there are five key points, namely the left eye, right eye, nose, left mouth corner, and right mouth corner; a similarity transform (rotation, scaling, and translation) computed from these key points then maps the face as closely as possible onto a standard face, completing the alignment process. A sketch of this transform follows below.
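  • A hedged OpenCV sketch of this alignment step: estimate a similarity transform (rotation, scaling, translation) from the five detected key points to a canonical five-point template and warp the image. The 112 × 112 template coordinates below are the widely used ArcFace layout, an assumption; the patent does not publish its standard-face coordinates:

```python
import cv2
import numpy as np

# Canonical positions of left eye, right eye, nose, left/right mouth corner in a
# 112x112 standard face (assumed template, not taken from the patent).
TEMPLATE = np.float32([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                       [41.5, 92.4], [70.7, 92.2]])

def align_face(image: np.ndarray, keypoints) -> np.ndarray:
    """keypoints: five (x, y) pairs ordered as in the template above."""
    m, _ = cv2.estimateAffinePartial2D(np.float32(keypoints), TEMPLATE)
    return cv2.warpAffine(image, m, (112, 112))
```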
  • To detect the facial key points, a key-point detection convolutional neural network was designed for the face alignment model of the present invention.
  • As shown in Figure 5, its structure is: convolutional layers C401 through C410 connected in sequence, followed by fully connected layer FC401.
  • Layers C401, C403, C405, C407, and C409 all use standard convolution.
  • The kernel size of C401 is 3 × 3 with 16 channels.
  • The kernel sizes of C403, C405, C407, and C409 are all 1 × 1, with 32, 48, 64, and 96 channels respectively.
  • Layers C402, C404, C406, C408, and C410 all use depthwise separable convolution with 3 × 3 kernels.
  • The fully connected layer FC401 has 96 neurons.
  • The key-point detection network identifies the coordinates of the facial key points; in the embodiment of the present invention these are the left eye, left mouth corner, right eye, right mouth corner, and nose, so the network outputs the coordinate values of the five key points.
  • The face alignment operation of the present invention detects facial key points with a convolutional neural network that uses standard and depthwise separable convolutions and only a single fully connected output layer, which streamlines the network depth and the convolutional-layer configuration and significantly reduces model size.
  • The model is smaller than 2 MB, so it can run on low-power video conferencing terminals: facial key points can be extracted quickly and in real time without large computing equipment, which facilitates the subsequent operations.
  • Face recognition specifically means mapping the facial features inside the detection frame: vectorized face features are obtained through deep-learning feature modeling, and the recognition result is obtained from a classifier's judgment.
  • The key to a face recognition model is how to obtain features that distinguish different faces. When recognizing a person, one usually looks at eyebrow shape, face contour, nose shape, eye type, and so on; the face recognition algorithm must learn similarly distinguishing features through network training.
  • The present invention therefore constructs the face recognition convolutional neural network model shown in Figure 7, which uses standard convolutions together with several MBConv network modules for face feature recognition.
  • The structure of the face recognition model is: convolutional layer C501, convolutional modules MBC-1 through MBC-16, convolutional layer C502, convolutional layer C503, and convolutional layer C504, connected in sequence.
  • Layers C501, C502, and C504 use standard convolution: C501 has a 3 × 3 kernel, and C502 and C504 have 1 × 1 kernels.
  • Layer C503 uses Global Depthwise Convolution (GDC) with a 5 × 4 kernel.
  • The network structure of the MBConv (Mobile inverted Bottleneck Convolution, an inverted-residual convolution module) is shown in Figure 6: it comprises convolutional layers C601, C602, and C603 connected in sequence.
  • Layers C601 and C603 use standard convolution with 1 × 1 kernels, and layer C602 uses depthwise separable convolution with a 3 × 3 kernel. The module's input passes through C601, C602, and C603 in turn, and the output of C603 is added to the module's input through a residual connection to form the module's output.
  • Each MBConv module has two parameters: the kernel size and the number of channels.
  • The MBC-1 module has a 3 × 3 kernel with channel number 1; MBC-2 through MBC-4 have 3 × 3 kernels with channel number 3; MBC-5 has a 3 × 3 kernel with channel number 3; MBC-6 through MBC-12 have 3 × 3 kernels with channel number 6; MBC-13 has a 3 × 3 kernel with channel number 3; and MBC-14 through MBC-16 have 3 × 3 kernels with channel number 6. A hedged sketch of such a block follows below.
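  • A minimal PyTorch sketch of an MBConv-style inverted-residual block consistent with Figure 6. The per-module "number of channels" values 1/3/6 above read like expansion factors, and that interpretation, together with the batch-norm and activation choices, is an assumption; the residual add requires matching input and output shapes, as in the figure:

```python
import torch.nn as nn

class MBConv(nn.Module):
    """Inverted residual: 1x1 expand -> 3x3 depthwise -> 1x1 project, plus skip."""
    def __init__(self, channels: int, expand: int):
        super().__init__()
        mid = channels * expand
        self.c601 = nn.Conv2d(channels, mid, 1, bias=False)   # standard 1x1 conv
        self.c602 = nn.Conv2d(mid, mid, 3, padding=1,
                              groups=mid, bias=False)          # 3x3 depthwise conv
        self.c603 = nn.Conv2d(mid, channels, 1, bias=False)   # standard 1x1 conv
        self.bn1 = nn.BatchNorm2d(mid)
        self.bn2 = nn.BatchNorm2d(mid)
        self.bn3 = nn.BatchNorm2d(channels)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        y = self.act(self.bn1(self.c601(x)))
        y = self.act(self.bn2(self.c602(y)))
        y = self.bn3(self.c603(y))
        return x + y  # residual connection back to the module input
```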
  • With the network model of this embodiment, MBConv modules replace traditional standard convolutional layers, which effectively reduces computation while retaining high accuracy, greatly improving both computation speed and recognition precision.
  • The model is smaller than 2 MB, so it can be deployed directly on the video conference terminal; no large computing equipment is needed, which improves real-time performance and effectively reduces cost.
  • When training the face recognition convolutional neural network, the present invention adopts the ArcFace algorithm, so that the trained face features generalize better. A sketch of the ArcFace margin computation follows below.
  • To achieve faster network inference on the video conference terminal, the training set images were manually cleaned during model training to remove interference, and they were cropped to cut away the low-information regions at the top, left, and right of each image. The trained convolutional neural network model can therefore be applied on video conferencing terminals without the aid of large computing equipment, meeting the needs of a real-time video conferencing system.
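  • For reference, a hedged sketch of the additive-angular-margin logit computation that the ArcFace training mentioned above is based on. The scale s and margin m are standard ArcFace hyperparameters, not values published in the patent:

```python
import torch
import torch.nn.functional as F

def arcface_logits(embeddings, weight, labels, s=64.0, m=0.5):
    """ArcFace: apply cos(theta + m) on the target class, then scale by s."""
    emb = F.normalize(embeddings)                  # unit-length face features
    w = F.normalize(weight)                        # unit-length class centers
    cos = emb @ w.t()                              # cosine similarity to every class
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, w.size(0)).bool()
    return s * torch.where(target, torch.cos(theta + m), cos)
```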
  • In step S7 of the present invention, when the face gallery is updated through face optimization, it is first judged whether pre-entered face image information exists in the gallery, and the following operations are performed according to the result: if the gallery contains no pre-entered face image information, the face images appearing in the video are entered automatically, high-quality face images are updated automatically over time through face optimization, and all faces that have appeared in the gallery are retained; if the gallery does contain pre-entered face image information, the corresponding face ID names are labeled, face images appearing in the video but not pre-entered are added to the gallery, and the gallery's face images are then continuously updated through face optimization.
  • The face optimization method includes:
  • Using the detection frames output by face detection, filter out face images whose detection-frame area is smaller than the face area threshold; in the embodiment of the present invention, the face area threshold ranges from 2400 to 3600.
  • Using the confidence scores output by face detection, filter out face images whose confidence score is below the confidence threshold; the confidence threshold ranges from 0.6 to 0.8.
  • Using the facial key points, compute the pose score of the face and filter out face images whose pose score is below the pose score threshold; the pose score threshold ranges from 0.5 to 1.
  • Using the SMD (Sum of Modulus of gray Difference) algorithm, compute the sharpness of the face image and filter out face images whose sharpness is below the sharpness threshold; in the embodiment, the sharpness threshold ranges from 80 to 100. A sketch of one common SMD variant follows below.
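  • One common form of the SMD gray-difference sharpness measure, as a sketch; the patent does not spell out which SMD variant or normalization it uses, so the division by pixel count (which makes the 80-100 threshold independent of image size) is an assumption:

```python
import numpy as np

def smd_sharpness(gray: np.ndarray) -> float:
    """Sum of absolute gray-level differences along both image axes."""
    g = gray.astype(np.float32)
    dx = np.abs(np.diff(g, axis=1)).sum()   # horizontal gray differences
    dy = np.abs(np.diff(g, axis=0)).sum()   # vertical gray differences
    return float((dx + dy) / g.size)        # per-pixel normalization (assumed)
```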
  • From the detection-frame area, confidence score, pose score, and sharpness, the face quality value is calculated with the following formula:
  • Q = 10000 × Q_c + 3 × Q_a + Q_f + 2 × Q_s
  • where Q is the face quality value, Q_c the face confidence score, Q_a the face area score (Q_a = 1 − detection-frame area / 7680), Q_s the face sharpness, and Q_f the face pose score.
  • The face pose score Q_f is calculated from the detected facial key points as follows:
  • Determine the first line: the line from the left eye to the left mouth corner;
  • Determine the second line: the line from the right eye to the right mouth corner;
  • The horizontal line through the nose tip intersects the first and second lines at the first and second intersection points respectively. The distance from the nose tip to the first intersection is the first distance, and the distance from the nose tip to the second intersection is the second distance; dividing the smaller of the two distances by the larger yields the face pose score Q_f. A sketch follows below.
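  • A direct transcription of this geometric rule as a Python sketch; key points are (x, y) pairs, and the small epsilon terms that guard against degenerate geometry are an implementation assumption:

```python
def pose_score(left_eye, right_eye, nose, left_mouth, right_mouth):
    """Q_f = min(d1, d2) / max(d1, d2): d1/d2 are the distances from the nose tip
    to where its horizontal line crosses the two eye-to-mouth-corner lines."""
    def x_at(p, q, y):  # x where line p-q crosses the horizontal line at height y
        (x0, y0), (x1, y1) = p, q
        t = (y - y0) / (y1 - y0 + 1e-9)
        return x0 + t * (x1 - x0)

    y = nose[1]
    d1 = abs(nose[0] - x_at(left_eye, left_mouth, y))    # first line: left eye to left mouth
    d2 = abs(nose[0] - x_at(right_eye, right_mouth, y))  # second line: right eye to right mouth
    return min(d1, d2) / (max(d1, d2) + 1e-9)
```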
  • The face similarity is judged. If the similarity between a face in the current frame and a face already entered in the gallery is above the given similarity threshold, the face is judged to have appeared before; the quality value of the current face image is then computed, and if it exceeds the quality value of the gallery image, the face appearing in the gallery is updated and replaced.
  • If the similarity between the face in the current frame and the previously entered gallery faces is below the threshold, a new person is judged to have entered; face optimization first filters out face images that do not satisfy the gallery's entry requirements, and qualifying images are added to the gallery. This mainly filters out relatively blurry images: the entry requirement for the gallery is essentially a certain degree of sharpness, set according to need in practical applications.
  • Face similarity is computed with cosine similarity, and in an embodiment of the present invention the similarity threshold is set to 60%; the threshold can be adjusted to suit the actual application. A sketch of this match-or-enroll decision follows below.
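  • A sketch of the match-or-enroll decision using cosine similarity. The 0.6 threshold is the embodiment's value; representing the gallery as a plain dict keyed by face ID is purely illustrative:

```python
import numpy as np

SIM_THRESHOLD = 0.6  # embodiment value; adjustable in practice

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def update_gallery(gallery: dict, new_id, feature, quality, image):
    """gallery maps face_id -> (feature, quality, image)."""
    best = max(gallery, default=None,
               key=lambda k: cosine_sim(gallery[k][0], feature))
    if best is not None and cosine_sim(gallery[best][0], feature) >= SIM_THRESHOLD:
        if quality > gallery[best][1]:          # seen before: keep the better image
            gallery[best] = (feature, quality, image)
    else:
        gallery[new_id] = (feature, quality, image)  # a new person has entered
```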
  • After the faces in the video frames have been entered into the gallery, if a face image disappears from the video and does not reappear within the time threshold, the corresponding face image is deleted from the gallery.
  • Specifically, the number of faces the gallery can display at once is limited, and after entry some faces leave the video and do not appear again for some time.
  • To prevent the gallery from accumulating ever more faces, the images of faces long absent from the video must be deleted. In practice, the faces appearing in the gallery are counted and a time threshold is defined: if a face image disappears from the video frames and does not reappear within the threshold, the corresponding gallery image is deleted.
  • Put more plainly, when a face disappears from the video at some moment, its gallery entry temporarily stops updating; if, a time-threshold-long interval after the disappearance, the face still has not reappeared in the video, its image is removed from the gallery. A bookkeeping sketch follows below.
  • In the embodiment of the present invention, the time threshold is set to 20 minutes: if the face does not reappear in the video within 20 minutes, its image is deleted from the gallery. The time threshold can be modified as needed in actual applications and is not strictly limited.
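  • The disappearance rule reduces to last-seen bookkeeping, sketched below; the 20-minute threshold is the embodiment's value, and the wall-clock time source and data layout are illustrative:

```python
import time

TIME_THRESHOLD = 20 * 60  # seconds; the embodiment uses 20 minutes, adjustable

def prune_gallery(gallery: dict, last_seen: dict, now=None):
    """Drop gallery faces that have been absent from the video past the threshold."""
    now = time.time() if now is None else now
    for face_id in list(gallery):               # copy keys; we mutate while iterating
        if now - last_seen.get(face_id, now) > TIME_THRESHOLD:
            del gallery[face_id]
            last_seen.pop(face_id, None)
```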
  • Before a face image block is sent to the face recognition model for face recognition, it undergoes a second detection pass to prevent false detections.
  • This second detection still uses the convolutional neural network structure of the invention's face detection model, but its parameters are fine-tuned according to requirements.
  • The method of the present invention can be applied to a video conference system.
  • In a video conference, the participants are not known in advance, so the face gallery usually contains no pre-entered face image information.
  • In that case, the face images appearing in the video are entered automatically; through face optimization, high-quality face images are updated automatically over time, and all faces that have appeared in the gallery are retained.
  • Figure 8 shows the effect in one video conference scene;
  • Figure 9 shows the effect in another video conference scene. Figures 8 and 9 show that the method of the present invention obtains a face summary in real time.
  • The face images in the face gallery are relatively clear; a few are not, because the corresponding faces in the video were in profile or merely flashed past, so complete face information could not be collected.
  • Figure 10 shows the effect of using the method of the present invention for face check-in (attendance clock-in).
  • For face check-in, the people who will clock in are usually known, so their face images are entered in advance.
  • Figure 11 shows these face images; the system pre-enters them into the face gallery, and Figure 10 shows that the faces of Figure 11 have been entered into the gallery.
  • During check-in, when the gallery contains a face matching one appearing in the video, the corresponding face ID name is labeled in the gallery, indicating that the person has clocked in successfully.
  • In summary, the present invention provides a real-time face summary service method for an intelligent video conference terminal that generates and updates a face gallery quickly and in real time through face detection, face tracking, face alignment, face recognition, and face optimization.
  • With the method of the present invention, the face summary service has good quality and high definition. At the same time, the method simplifies and optimizes the convolutional neural network models used for face detection, face recognition, and face alignment, reducing their computational load and thus effectively raising the speed of the face summary service.
  • The simplified and optimized neural network models are small enough to be applied directly on the ARM side of the video conference system's intelligent terminal, with no need for auxiliary computation on large computing equipment, which gives the video summary service better real-time performance and reduces cost.
  • The present invention also provides a real-time face summary service system of an intelligent video conference terminal; the system uses the method of the present invention to perform the real-time face summary service of the intelligent video conference terminal.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a method and system for a real-time face summary service on an intelligent video conference terminal, belonging to the technical field of face recognition. The invention comprises: initializing the models; acquiring video frames; performing face detection on the preprocessed frame images with a face detection model; initializing trackers from the face detection results and tracking the faces to capture their position information; obtaining the coordinates of the facial key points with a face alignment model and then transforming each face to a standard face image with a similarity transform; performing face feature mapping with a face recognition model to recognize the faces in the frame images; and entering the recognized face images into a face gallery, which is updated through face optimization. With the method of the invention, the face summary service has good quality and high definition, computes quickly with good real-time performance, and can be applied directly on video conferencing terminals without a large computing platform, reducing cost.

Description

Method and system for a real-time face summary service on an intelligent video conference terminal

Technical Field

The invention belongs to the technical field of face recognition and specifically relates to a method and system for a real-time face summary service on an intelligent video conference terminal.

Background Art

With the development of the economy and of technology, working patterns keep changing, and video conferencing has become increasingly prominent in every field. A key technology in video conferencing is the face summary service, which displays the faces of all meeting participants in a face gallery. For a video conference, the face summary collects face images from the video pictures and enters them into the face gallery; the gallery stores the participants' facial features and filters them into high-quality face blocks for display. The face summary service therefore faces many technical problems: several faces may appear in the same scene, raising the question of how to display the clearest face in the gallery, and because the conference images are dynamic, faces must be extracted from the video and displayed in the gallery quickly and in real time.

With the development of artificial intelligence, researchers have proposed many methods for generating face summaries. For example, Chinese patent application No. 201510158931.9 discloses a face summary method and a video summary method based on face recognition. The method generates face images of the different people appearing in an original video and forms a list of those images; it includes scanning the image frames of the original video, detecting whether face regions exist in the frames, and performing facial feature extraction, face feature clustering, and face summary image generation.

Although the prior-art methods can generate face summaries and video summaries, they have many problems. The first is the clarity of the generated summaries: existing methods generate the summary directly from face recognition, but complex and changeable conditions in the video scene, such as lighting changes, motion blur, and varying face scale, severely reduce the face recognition rate, so the summaries have low clarity and can hardly meet scenario requirements. Second, current face summary generation methods mostly use neural network models for face recognition and detection, and the models in the prior art are often large, producing a very heavy computational load that is hard to deploy directly on terminal devices. To detect and recognize faces in video quickly, a large machine platform is usually needed, for instance deployment on a cloud server; but then detection and recognition are performed on the server and returned to the terminal, causing delay and poor real-time performance, degrading the quality of the face summary service, so the needs of scenarios such as video conferencing are hard to satisfy.

In summary, existing face summary service methods give poor quality and low clarity with a heavy computational load, so face summaries are hard to generate quickly, accurately, and in real time; the algorithm models cannot be deployed directly on terminal devices and must rely on large computing equipment, with poor real-time performance and high cost.
Summary of the Invention

Technical problem: The present invention provides a method and system for a real-time face summary service on an intelligent video conference terminal. The invention can generate a face summary of good quality and high clarity; the method involves little computation, is fast, has good real-time performance, can be deployed directly on the terminal device, and reduces computing cost.

Technical solution: The real-time face summary service method of the intelligent video conference terminal of the present invention includes:

S1: Initialize the face detection model, face alignment model, face recognition model, and face gallery, and perform model loading and memory allocation;

S2: Acquire video frames and preprocess the frame images;

S3: Use the face detection model to perform face detection on the preprocessed frame images;

S4: Use the face detection results to initialize trackers, use the trackers to track the faces in the video frames, and capture the faces' position information;

S5: Crop face image blocks using the detection-frame coordinates output by face detection or face tracking, input the blocks into the face alignment model to obtain the coordinates of the facial key points, and then use a similarity transform to map each face to a standard face image;

S6: Input the standard face image into the face recognition model, perform feature mapping based on the distinguishing features of the face to obtain vectorized face feature data, and recognize the faces in the frame image;

S7: Enter the recognized face images into the face gallery and update the gallery through face optimization.
Further, in step S7, when the face gallery is updated through face optimization, it is first judged whether pre-entered face image information exists in the gallery, and the following operations are performed according to the result:

If the gallery contains no pre-entered face image information, the face images appearing in the video are entered automatically; through face optimization, high-quality face images are updated automatically over time, and all faces that have appeared in the gallery are retained;

If the gallery contains pre-entered face image information, the corresponding face ID names are labeled, face images that appear in the video but were not pre-entered are added to the gallery, and face optimization then continuously updates the gallery's face images.

Further, the face optimization method includes:

Using the detection frames output by face detection, filtering out face images whose detection-frame area is smaller than the face area threshold;

Using the confidence scores output by face detection, filtering out face images whose confidence score is below the confidence threshold;

Using the facial key points, computing the face pose score and filtering out face images whose pose score is below the pose score threshold;

Using the SMD algorithm, computing the sharpness of each face image and filtering out face images whose sharpness is below the sharpness threshold;

From the detection-frame area, confidence score, pose score, and sharpness, computing the face quality value.

Further, the method for computing the face quality value from the detection-frame area, confidence score, pose score, and sharpness is:

Q = 10000 × Q_c + 3 × Q_a + Q_f + 2 × Q_s

where Q is the face quality value, Q_c the face confidence score, Q_a the face area score, Q_s the face sharpness, and Q_f the face pose score, with Q_a = 1 − detection-frame area / 7680.
Further, in step S7 the specific method for updating the face gallery is:

Judge the face similarity. If the similarity between a face in the current frame and a face already entered in the gallery is above the given threshold, the face is judged to have appeared before; the quality value of the current face image is then computed, and if it is higher than that of the face in the gallery, the gallery face is updated and replaced;

If the similarity between the face in the current frame and the previously entered gallery faces is below the given threshold, a new person is judged to have entered; face optimization first filters out face images that do not meet the gallery's entry requirements, and qualifying images are added to the gallery;

If, after the faces in the video frames have been entered into the gallery, some face images disappear from the video and do not reappear within the time threshold, the corresponding face images are deleted from the gallery.

Further, in step S4 a single-target tracking scheme is adopted: when the trackers are initialized, one tracker is initialized for each detected face detection frame, and during the tracking period each tracker outputs the coordinates of its face's detection frame in the current frame.

Further, the face detection model uses a cascaded convolutional neural network composed, in order, of the P-Net, R-Net, and O-Net networks. The P-Net network uses standard convolutions to roughly screen out candidate face detection frames in the video frame; the R-Net and O-Net networks use standard and depthwise separable convolutions to extract facial feature data from the image, filtering and refining the detection frames to obtain the final face position information.

Further, the face alignment model extracts facial key points with a convolutional neural network that uses standard and depthwise separable convolutions and a single FC fully connected layer as the model's output.

Further, the face recognition network model uses several MBConv convolutional network modules in series to extract distinguishing facial features and perform feature mapping, recognizing the faces in the video frame.

Further, before a face image block is sent to the face recognition model for recognition, it undergoes a second detection pass to prevent false detections.

The real-time face summary service system of the intelligent video conference terminal of the present invention uses the method of the present invention to perform the real-time face summary service of the intelligent video conference terminal.
Beneficial effects: Compared with the prior art, the present invention has the following advantages:

(1) The real-time face summary service method of the intelligent video conference terminal performs face detection, face tracking, face alignment, face recognition, and face optimization, and then generates and updates a face gallery. Relative to prior-art methods, the invention adds a face optimization operation: during face summary generation, poor-quality face images are continuously filtered out, which on the one hand speeds up summary generation and on the other hand keeps updating the gallery's face images so that they have higher quality and clarity.

(2) In the method of the present invention, face optimization filters out face images whose detection-frame area is too small or whose confidence score is too low, based on the outputs of face detection and face tracking, avoiding detection errors. Using the facial pose score, it filters out images with low pose scores, removing captures in which the face is strongly tilted or in profile; by computing sharpness, it filters out images blurred by lighting changes or motion. Filtering out non-conforming images in this way effectively reduces computation and improves real-time performance. A calculation method for the face quality value is also given: with the quality value, the quality of a face image can be evaluated quickly and the gallery's face image quality updated quickly, so the generated face summary has higher quality and better clarity.

(3) The invention adopts a single-target tracker scheme: one tracker is initialized for each detected face and tracks for only one tracking period. Because face tracking is far more efficient than face detection, outputting face positions in real time through tracking effectively improves processing speed and hence the real-time performance of the face summary service; tracking for only one period avoids losing or missing faces when tracking runs too long, effectively improving the method's accuracy.

(4) The face detection model of the present invention uses a cascaded convolutional neural network composed, in order, of P-Net, R-Net, and O-Net. The cascade, with standard and depthwise separable convolutions, effectively reduces network depth and simplifies the convolutional-layer configuration; the model is only 83 Kb, which lowers the computational load and raises the network's computation speed, so faces in video frames can be detected quickly and the algorithm runs in real time. Moreover, the model is small enough to be deployed on low-power video conferencing terminals without a large computing platform, improving real-time performance while reducing cost.

(5) The face alignment model of the present invention extracts facial key points with a convolutional neural network that uses standard and depthwise separable convolutions and a single FC fully connected layer as output, simplifying the convolutional-layer configuration and reducing network depth and model size. This lowers computation, improves the algorithm's real-time performance, and allows deployment on low-power video conferencing terminals without a large computing platform, improving real-time performance while reducing cost.

(6) The face recognition model of the present invention uses several MBConv convolutional network modules in series instead of standard convolutions, greatly reducing computation while significantly improving recognition precision; it therefore maintains high recognition accuracy with fast computation, so faces can be recognized from images quickly and accurately. The MBConv modules also keep the convolutional neural network model small, so it can be applied directly on low-power video conferencing terminals, reducing cost.
Brief Description of the Drawings

Figure 1 is a flowchart of the real-time face summary service method of the intelligent video conference terminal of the present invention;

Figure 2 is a structural diagram of the P-Net network in the face detection model of the present invention;

Figure 3 is a structural diagram of the R-Net network in the face detection model of the present invention;

Figure 4 is a structural diagram of the O-Net network in the face detection model of the present invention;

Figure 5 is a structural diagram of the convolutional neural network of the face alignment model of the present invention;

Figure 6 is a network structure diagram of the MBConv network module;

Figure 7 is a structural diagram of the convolutional neural network of the face recognition model of the present invention;

Figure 8 is an effect diagram of the method of the present invention applied to a video conference;

Figure 9 is an effect diagram of the method of the present invention applied to a video conference;

Figure 10 is an effect diagram of the method of the present invention applied to face check-in;

Figure 11 shows the face images pre-entered when the method of the present invention is applied to face check-in.
Detailed Description of Embodiments

The invention is further described below with reference to the embodiments and the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description, claims, and drawings of the present invention are used to distinguish similar objects and do not necessarily describe a particular order or sequence. For the English labels appearing in the drawings: Conv denotes standard convolution and DwiseConv denotes depthwise separable convolution.
As shown in Figure 1, the real-time face summary service method of the intelligent video conference terminal of the present invention includes:

S1: Initialize the face detection model, face alignment model, face recognition model, and face gallery, and perform model loading and memory allocation;

S2: Acquire video frames and preprocess the frame images;

S3: Use the face detection model to perform face detection on the preprocessed frame images;

S4: Use the face detection results to initialize trackers, use the trackers to track the faces in the video frames, and capture the faces' position information;

S5: Crop face image blocks using the detection-frame coordinates output by face detection or face tracking, input the blocks into the face alignment model to obtain the coordinates of the facial key points, and then use a similarity transform to map each face to a standard face image;

S6: Input the standard face image into the face recognition model, perform feature mapping based on the distinguishing features of the face to obtain vectorized face feature data, and recognize the faces in the frame image;

S7: Enter the recognized face images into the face gallery and update the gallery through face optimization.

It is noted that the function of the face gallery is to display the faces that appear in the video.

Specifically, in step S2 of the present invention, preprocessing of the frame image includes format conversion, scaling, and normalization. The image input required by the various algorithm models is generally RGB, while actual video frame formats vary with the scenario (YUV, ARGB, and so on); format conversion uniformly converts image pixels in these formats into an RGB arrangement to match the models' requirements.

Scaling shrinks the original-resolution image, such as a 1080p image, proportionally; the method used is bilinear interpolation. Scaling reduces the models' running time: the higher the resolution, the longer the processing, so scaling raises the computation speed of the various models and gives them better real-time performance.

Normalization maps image pixels from 0-255 into the range -1 to 1; it helps speed up convergence and improve accuracy during model training. The data processing required at training and at test time must be identical, so incoming image data must be normalized. The formula the present invention uses to normalize the image is:

x′ = (x − 127.5) / 127.5

where x is each pixel value of the original image and x′ is the normalized pixel value.
Face detection determines the size and position of faces in the image, i.e., it solves the problem of "where is the face," cropping the true face region out of the image for subsequent facial feature analysis and recognition. The face detection model of the present invention uses a cascaded convolutional neural network composed, in order, of P-Net, R-Net, and O-Net. The P-Net network uses standard convolutions to roughly screen out candidate face detection frames in the video frame; R-Net and O-Net use standard and depthwise separable convolutions to extract facial feature data for filtering and refining the detection frames, obtaining the final face position information. "Standard convolution" refers to the ordinary, general convolution form. The face detection model outputs the number of people appearing in the current video frame, the confidence scores, and the coordinates of the face detection frames.

Specifically, as shown in Figure 2, the network structure of P-Net (Proposal Network) is: convolutional layers C101 and C102 connected in sequence, with C102 feeding convolutional layers C103 and C104. Layers C101-C104 all use standard convolution; C101 has a 3 × 3 kernel and 8 channels, C102 a 3 × 3 kernel and 16 channels, C103 a 3 × 3 kernel and 4 channels, and C104 a 3 × 3 kernel and 2 channels.

As shown in Figure 3, the network structure of R-Net (Refine Network) comprises convolutional layers C201 through C206 connected in sequence, with C206 feeding fully connected layers FC201 and FC202 for output. Layers C201, C203, and C205 use standard convolution: C201 has a 3 × 3 kernel and 16 channels, C203 a 1 × 1 kernel and 32 channels, and C205 a 1 × 1 kernel and 64 channels. Layers C202, C204, and C206 use depthwise separable convolution with 3 × 3 kernels. Fully connected layers FC201 and FC202 each have 64 neurons; FC201 makes the classification judgment and FC202 outputs the coordinates of the face detection frame.

In the embodiment of the present invention, the input size of the R-Net network is 24 × 24 × 3.

As shown in Figure 4, the network structure of O-Net (Output Network) comprises convolutional layers C301 through C308 connected in sequence, with C308 feeding fully connected layers FC301 and FC302 for output. Layers C301, C303, C305, and C307 use standard convolution: C301 has a 3 × 3 kernel and 16 channels, C303 a 1 × 1 kernel and 32 channels, C305 a 1 × 1 kernel and 64 channels, and C307 a 1 × 1 kernel and 128 channels. Layers C302, C304, C306, and C308 use depthwise separable convolution with 3 × 3 kernels. Fully connected layers FC301 and FC302 each have 128 nodes; FC301 makes the classification judgment and FC302 outputs the coordinates of the face detection frame. In the embodiment of the present invention, the input size of the O-Net network is 32 × 32 × 3.

The face detection model of the present invention uses a cascaded convolutional neural network with depthwise separable and standard convolutions that streamline the network depth and the convolutional-layer configuration, so the overall detection model is only 86 Kb. This realizes real-time face detection on a low-power video conferencing terminal: faces in the video are recognized quickly and in real time without large computing equipment, so the model is fast and real-time while also reducing cost.
Face tracking continues to capture a face's position in subsequent frames once the face has been detected. Tracker initialization depends on the detection-frame coordinates output by face detection. Tracking is normally far more time-efficient than detection, so outputting face information in real time through tracking benefits the method's real-time performance. In the method of the present invention, to mitigate the loss or omission of faces during long tracking runs, a tracking period and a parameter MAX are set, meaning one tracking period is at most MAX frames; MAX is generally set to 10-25. In step S4 the method adopts a single-target tracking scheme: at tracker initialization, one tracker is initialized for each detected face, and during the tracking period each tracker outputs the position of its face in the current frame, i.e., MAX frames of video are tracked according to the configured MAX value. Face tracking outputs face positions in real time, relieving the pressure on face detection and improving the method's real-time performance.
Face alignment is needed because the same person may present different poses and expressions in different image sequences, which hinders face recognition; all face images therefore need to be transformed to a uniform angle and pose. The principle is to find the facial key points. In the embodiment of the present invention there are five key points, namely the left eye, right eye, nose, left mouth corner, and right mouth corner; a similarity transform (rotation, scaling, and translation) computed from these key points then maps the face as closely as possible onto a standard face, completing the alignment process.

To identify the five facial key points, a key-point detection convolutional neural network was designed for the face alignment model of the present invention. Specifically, as shown in Figure 5, its structure comprises convolutional layers C401 through C410 connected in sequence, followed by fully connected layer FC401. Layers C401, C403, C405, C407, and C409 use standard convolution: C401 has a 3 × 3 kernel and 16 channels; C403, C405, C407, and C409 have 1 × 1 kernels with 32, 48, 64, and 96 channels respectively. Layers C402, C404, C406, C408, and C410 use depthwise separable convolution with 3 × 3 kernels, and the fully connected layer FC401 has 96 neurons. The key-point detection network identifies the coordinates of the facial key points; in the embodiment these are the left eye, left mouth corner, right eye, right mouth corner, and nose, so the network outputs the coordinate values of the five key points.

In practice, face image blocks must be cropped using the detection-frame coordinates output by face detection or face tracking and then input into the face alignment model; since both face detection and face tracking output face position information, the blocks can be cropped from either.

The face alignment operation of the present invention detects facial key points with a convolutional neural network that uses standard and depthwise separable convolutions and only a single fully connected output layer, streamlining the network depth and the convolutional-layer configuration and significantly reducing model size. The model is smaller than 2 MB and can therefore run on low-power video conferencing terminals: facial key points are extracted quickly and in real time without large computing equipment, facilitating the subsequent operations.
Face recognition specifically means mapping the facial features in the detection frame: vectorized face features are obtained through deep-learning feature modeling, and the recognition result is obtained from a classifier's judgment. The key to a face recognition model is how to obtain features that distinguish different faces; when recognizing a person one usually looks at eyebrow shape, face contour, nose shape, eye type, and so on, and the face recognition algorithm must learn similarly distinguishing features through network training. For face recognition the present invention constructs the convolutional neural network model shown in Figure 7, which uses standard convolutions and several MBConv network modules for face feature recognition. Specifically, its structure is: convolutional layer C501, convolutional modules MBC-1 through MBC-16, convolutional layer C502, convolutional layer C503, and convolutional layer C504, connected in sequence. Layers C501, C502, and C504 use standard convolution, with a 3 × 3 kernel for C501 and 1 × 1 kernels for C502 and C504; layer C503 uses Global Depthwise Convolution (GDC) with a 5 × 4 kernel.

The network structure of the MBConv (Mobile inverted Bottleneck Convolution, inverted-residual convolution) module is shown in Figure 6: it comprises convolutional layers C601, C602, and C603 connected in sequence. Layers C601 and C603 use standard convolution with 1 × 1 kernels, and layer C602 uses depthwise separable convolution with a 3 × 3 kernel. The module's input passes through C601, C602, and C603 in turn, and the output of C603 is combined with the module's input through a residual connection to form the module's output.

The MBConv module has two parameters, the kernel size and the number of channels. In the convolutional network of Figure 7 used by the present invention, MBC-1 has a 3 × 3 kernel with channel number 1; MBC-2 through MBC-4 have 3 × 3 kernels with channel number 3; MBC-5 has a 3 × 3 kernel with channel number 3; MBC-6 through MBC-12 have 3 × 3 kernels with channel number 6; MBC-13 has a 3 × 3 kernel with channel number 3; and MBC-14 through MBC-16 have 3 × 3 kernels with channel number 6. With the network model of this embodiment, MBConv modules replace traditional standard convolutional layers, effectively reducing computation with high accuracy, which greatly improves computation speed and recognition precision. The model is also smaller than 2 MB, so it can be deployed directly on the video conference terminal; no large computing equipment is needed, improving real-time performance while effectively reducing cost.

To give the trained face features better generalization, the present invention adopts the ArcFace algorithm when training the face recognition convolutional neural network. To achieve faster network inference on the video conference terminal, the training set images were manually cleaned during model training to remove interference, and they were cropped to cut away the low-information regions at the top, left, and right of each image, so that the trained model can be applied on video conferencing terminals without the aid of large computing equipment, meeting the needs of a real-time video conferencing system.
In step S7 of the present invention, when the face gallery is updated through face optimization, it is first judged whether pre-entered face image information exists in the gallery, and the following operations are performed according to the result: if the gallery contains no pre-entered face image information, the face images appearing in the video are entered automatically, high-quality face images are updated automatically over time through face optimization, and all faces that have appeared in the gallery are retained; if the gallery does contain pre-entered face image information, the corresponding face ID names are labeled, face images appearing in the video but not pre-entered are added to the gallery, and the gallery's face images are then continuously updated through face optimization.

The face optimization method includes:

Using the detection frames output by face detection, filtering out face images whose detection-frame area is smaller than the face area threshold; in the embodiment of the present invention the face area threshold ranges from 2400 to 3600.

Using the confidence scores output by face detection, filtering out face images whose confidence score is below the confidence threshold; in the embodiment the confidence threshold ranges from 0.6 to 0.8.

Using the facial key points, computing the face pose score and filtering out face images whose pose score is below the pose score threshold; in the embodiment the pose score threshold ranges from 0.5 to 1.

Using the SMD (Sum of Modulus of gray Difference) algorithm, computing the sharpness of the face image and filtering out face images whose sharpness is below the sharpness threshold; in the embodiment the sharpness threshold ranges from 80 to 100.

From the detection-frame area, confidence score, pose score, and sharpness, computing the face quality value with the following formula:

Q = 10000 × Q_c + 3 × Q_a + Q_f + 2 × Q_s

where Q is the face quality value, Q_c the face confidence score, Q_a the face area score (Q_a = 1 − detection-frame area / 7680), Q_s the face sharpness, and Q_f the face pose score.

The face pose score Q_f is computed from the detected facial key points as follows:

Determine the first line: the line from the left eye to the left mouth corner;

Determine the second line: the line from the right eye to the right mouth corner;

The horizontal line through the nose tip intersects the first and second lines at the first and second intersection points respectively; the distance from the nose tip to the first intersection is the first distance and the distance from the nose tip to the second intersection is the second distance, and dividing the smaller of the two distances by the larger yields the face pose score Q_f.
The specific method for updating the face gallery is:

Judge the face similarity. If the similarity between a face in the current frame and a face already entered in the gallery is above the given similarity threshold, the face is judged to have appeared before; the quality value of the current face image is then computed, and if it is higher than that of the face in the gallery, the gallery face is updated and replaced.

If the similarity between the face in the current frame and the previously entered gallery faces is below the given similarity threshold, a new person is judged to have entered. Face optimization first filters out face images that do not meet the gallery's entry requirements, and qualifying images are added to the gallery set; this mainly filters out relatively blurry images. The entry requirement for the gallery is essentially a certain degree of sharpness and, in practical applications, is set according to need.

In the method of the present invention, face similarity is computed with cosine similarity, and in one embodiment the similarity threshold is set to 60%; the threshold can be adjusted to the needs of the actual application.

If, after the faces in the video frames have been entered into the gallery, some face images disappear from the video and do not reappear within the time threshold, the corresponding gallery images are deleted. Specifically, the number of faces the gallery can display at once is limited, and some faces leave the video after their images were entered and do not appear again for some time. To keep the gallery from accumulating ever more faces, the images of faces long absent from the video must be deleted. In practice, the faces appearing in the gallery are counted and a time threshold is defined: when a face image disappears from the video frames and does not reappear within the threshold, the corresponding gallery image is deleted. Put more plainly, when a face disappears from the video at some moment, its gallery entry temporarily stops updating; if, a time-threshold-long interval after the disappearance, the face still has not reappeared in the video, its image is removed from the gallery.

In the embodiment of the present invention the time threshold is set to 20 minutes, i.e., if the face does not reappear in the video within 20 consecutive minutes, its image is deleted from the gallery; the time threshold can be modified as needed in actual applications and is not strictly limited.

It should be noted that when the gallery initially contains no face images, the detected faces are first entered into the gallery and then updated over time by the method above.
Further, in a preferred embodiment of the present invention, before a face image block is sent to the face recognition model for recognition, it undergoes a second detection pass to prevent false detections. This second detection still uses the convolutional neural network structure of the invention's face detection, with parameters fine-tuned according to requirements.

The method of the present invention can be applied to a video conference system. Because the participants in a conference are uncertain, the face gallery usually contains no pre-entered face image information; in that case the face images appearing in the video are entered automatically, high-quality images are updated automatically over time through face optimization, and all faces appearing in the gallery are retained. Figure 8 shows the effect in one video conference scene and Figure 9 in another; Figures 8 and 9 show that the method of the present invention obtains a face summary in real time and that the face images in the gallery are fairly clear. Some faces are unclear because the corresponding faces in the video were in profile or merely flashed past, so complete face information could not be collected.

Application of the method of the present invention is not limited to video conference systems; other applications are possible, for example face check-in. Figure 10 shows the effect of face check-in using the method of the present invention. For face check-in the people who will clock in are usually known, so their face images are entered in advance; Figure 11 shows these images, which the system pre-enters into the face gallery, and Figure 10 shows that the faces of Figure 11 have been entered. During check-in, when the gallery contains a face matching one appearing in the video, the face ID name is labeled in the gallery, indicating that the person has clocked in successfully.
The present invention provides a real-time face summary service method for an intelligent video conference terminal that generates and updates a face gallery through face detection, face tracking, face alignment, face recognition, and face optimization, producing face summaries quickly and in real time. With the method of the present invention, the face summary service has good quality and high clarity. The method also streamlines and optimizes the convolutional neural network models used for face detection, face recognition, and face alignment, reducing their computational load and thus effectively raising the speed of the face summary service; the streamlined models are small enough to be applied directly on the ARM side of the video conference system's intelligent terminal without auxiliary computation on large computing equipment, giving the video summary service better real-time performance at lower cost.

Based on the method, the present invention also provides a real-time face summary service system of an intelligent video conference terminal; the system uses the method of the present invention to perform the real-time face summary service of the intelligent video conference terminal.

The above embodiments are only preferred implementations of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and equivalent substitutions without departing from the principle of the present invention, and the technical solutions obtained by such improvements and equivalent substitutions of the claims of the present invention all fall within the protection scope of the present invention.

Claims (11)

  1. 一种智能视频会议终端的实时人脸摘要服务的方法,其特征在于,包括:
    S1:初始化人脸检测模型、人脸对齐模型、人脸识别模型以及人脸画廊,并进行模型加载和内存分配;
    S2:获取视频帧,并对帧图像进行预处理;
    S3:利用人脸检测模型对预处理后的帧图像进行人脸检测;
    S4:利用人脸检测结果,初始化跟踪器,利用跟踪器对视频帧中的人脸进行人脸跟踪,跟踪捕获人脸的位置信息;
    S5:根据人脸检测或人脸跟踪,输出的人脸检测框坐标裁剪出人脸图像块,将人脸图像块输入人脸对齐模型中,得到人脸关键点的坐标,然后采用相似变换将人脸变换到标准人脸图像;
    S6:将标准人脸图像输入到人脸识别模型,根据人脸上具有区分度的特征,进行人脸特征映射,得到向量化的人脸特征数据,识别出帧图像中的人脸;
    S7:将识别出的人脸图像录入人脸画廊中,并通过人脸优选,更新人脸画廊。
  2. 根据权利要求1所述的一种智能视频会议终端的实时人脸摘要服务的方法,其特征在于,所述步骤S7中,通过人脸优选,更新人脸画廊时,先判断人脸画廊中是否存在预先录入的人脸图像信息,并根据判断结果,分别执行如下操作:
    若人脸画廊中无预先录入的人脸图像信息,则自动录入视频中出现的人脸图像信息,并通过人脸优选,随时间自动更新高质量的人脸图像,并保存所有出现在画廊中的人脸;
    若人脸画廊中存在预先录入的人脸图像信息,则标定对应人脸ID名称,并将在视频中出现的但在人脸画廊中未预先录入的人脸图像录入人脸画廊中,然后通过人脸优选不断更新画廊中的人脸图像。
  3. 根据权利要求2所述的一种智能视频会议终端的实时人脸摘要服务的方法,其特征在于,所述人脸优选的方法包括:
    根据人脸检测输出的人脸检测框,过滤掉人脸检测框面积小于人脸面积阈值的人脸图像;
    根据人脸检测输出的置信度得分,过滤掉置信度得分小于置信度阈值的人脸图像;
    根据人脸关键点,计算人脸的姿态得分,过滤掉姿态得分小于姿态得分阈值的人脸图像;
    采用SMD算法,计算人脸图像的清晰度,并过滤掉清晰度低于清晰度阈值的人脸图像;
    根据人脸检测框的面积、置信度得分、姿态得分及清晰度,计算人脸质量值。
  4. 根据权利要求3所述的一种智能视频会议终端的实时人脸摘要服务的方法,其特征在于,根据人脸检测框的面积、置信度得分、姿态得分及清晰度,计算人脸质量值的方法为:
    Q=10000×Q c+3×Q a+Q f+2×Q s
    式中,Q表示人脸质量值,Q c表示人脸置信度得分,Q a表示人脸面积得分;Q s表示人脸清晰度,Q f表示人脸姿态角度,其中Q a=1-人脸检测框面积/7680。
  5. 根据权利要求1所述的一种智能视频会议终端的实时人脸摘要服务的方法,其特征在于,所述步骤S7中,对人脸画廊进行更新的具体方法为:
    对人脸的相似度进行判断,若当前帧图像中的人脸与之前已录入人脸画廊的人脸相似度高于给定阈值,判定此人脸已出现过,然后计算出当前人脸图像的质量值,若高于画廊内的人脸的质量值,则进行出现在画廊中的人脸的更新替换;
    若当前帧图像中的人脸与之前已录入人脸画廊的相似度低于给定阈值,则判定有新的人员进入,先通过人脸优选过滤掉一些不满足画廊录入要求的人脸图像,将满足要求的人脸图像加入到人脸画廊;
    若视频帧中人脸被录入人脸画廊中后,某些人脸图像从视频帧中消失,超过时间阈值未再次出现在视频中,则删除人脸画廊中对应的人脸图像。
  6. 根据权利要求1所述的一种智能视频会议终端的实时人脸摘要服务的方法,其特征在于,所述步骤S4中,采用单目标跟踪方案,跟踪器初始化时,为检测到的每个人脸检测框初始化一个跟踪器,并且在跟踪周期内,由跟踪器输出当前帧中人脸的检测框坐标。
  7. 根据权利要求1所述的一种智能视频会议终端的实时人脸摘要服务的方法,其特征在于,所述人脸检测模型,采用级联的卷积神经网络进行人脸检测,所述级联的卷积神经网络依次由P-Net、R-Net和O-Net网络级联,所述P-Net网络采用标准卷积粗略筛选出视频帧中人脸检测框,R-Net网络和O-Net利用标准卷积和深度可分卷积提取图像中的人脸特征数据,用于过滤和细化人脸检测框,得到最终人脸位置信息。
  8. The method for a real-time face summary service of an intelligent video conference terminal according to claim 1, characterized in that the face alignment model extracts the key points of the face by using a convolutional neural network model, the convolutional neural network model extracting key-point features of the face by using standard convolution and depthwise separable convolution and using one FC fully connected layer as the output of the convolutional neural network model.
  9. The method for a real-time face summary service of an intelligent video conference terminal according to claim 1, characterized in that the face recognition network model uses several MBConv convolutional network modules connected in series to extract discriminative features of the face, performs feature mapping, and recognizes the faces in the video frames.
  10. The method for a real-time face summary service of an intelligent video conference terminal according to any one of claims 1-9, characterized in that, before the face image block is fed into the face recognition model for face recognition, a secondary detection is performed on the face image block to prevent false detection.
  11. A system for a real-time face summary service of an intelligent video conference terminal, characterized in that the method according to any one of claims 1-10 is used to perform the real-time face summary service of the intelligent video conference terminal.
PCT/CN2021/084231 2020-04-20 2021-03-31 Method and system for real-time face summary service of intelligent video conference terminal WO2021213158A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010310359.4A CN111770299B (zh) 2020-04-20 2020-04-20 Method and system for real-time face summary service of intelligent video conference terminal
CN202010310359.4 2020-04-20

Publications (1)

Publication Number Publication Date
WO2021213158A1 true WO2021213158A1 (zh) 2021-10-28

Family

ID=72719264

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084231 WO2021213158A1 (zh) 2020-04-20 2021-03-31 Method and system for real-time face summary service of intelligent video conference terminal

Country Status (2)

Country Link
CN (1) CN111770299B (zh)
WO (1) WO2021213158A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111770299B (zh) 2020-04-20 2022-04-19 厦门亿联网络技术股份有限公司 Method and system for real-time face summary service of intelligent video conference terminal
CN112215174A (zh) * 2020-10-19 2021-01-12 江苏中讯通物联网技术有限公司 Sanitation vehicle state analysis method based on computer vision
CN112329665B (zh) * 2020-11-10 2022-05-17 上海大学 Face snapshot system
CN112541402A (zh) * 2020-11-20 2021-03-23 北京搜狗科技发展有限公司 Data processing method and apparatus, and electronic device
CN112686175A (zh) * 2020-12-31 2021-04-20 北京澎思科技有限公司 Face snapshot method and system, and computer-readable storage medium
CN113537139A (zh) * 2021-08-03 2021-10-22 山西长河科技股份有限公司 Face detection and localization method and device
CN114025198B (zh) * 2021-11-08 2023-06-27 深圳万兴软件有限公司 Attention-mechanism-based video cartoonization method, device, equipment and medium

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605969B (zh) * 2013-11-28 2018-10-09 Tcl集团股份有限公司 Face entry method and device
CN103984738B (zh) * 2014-05-22 2017-05-24 中国科学院自动化研究所 Character annotation method based on search matching
CN104731964A (zh) * 2015-04-07 2015-06-24 上海海势信息科技有限公司 Face summary method and video summary method based on face recognition, and devices therefor
CN105069408B (zh) * 2015-07-24 2018-08-03 上海依图网络科技有限公司 Video portrait tracking method based on face recognition in complex scenes
CN105574506B (zh) * 2015-12-16 2020-03-17 深圳市商汤科技有限公司 Intelligent face pursuit system and method based on deep learning and large-scale clusters
CN107145833A (zh) * 2017-04-11 2017-09-08 腾讯科技(上海)有限公司 Method and device for determining a face region
CN107748858A (zh) * 2017-06-15 2018-03-02 华南理工大学 Multi-pose eye localization method based on cascaded convolutional neural networks
CN107609497B (zh) * 2017-08-31 2019-12-31 武汉世纪金桥安全技术有限公司 Real-time video face recognition method and system based on visual tracking technology
CN109063581A (zh) * 2017-10-20 2018-12-21 奥瞳系统科技有限公司 Enhanced face detection and face tracking method and system for resource-limited embedded vision systems
CN108197604A (zh) * 2018-01-31 2018-06-22 上海敏识网络科技有限公司 Fast face localization and tracking method based on embedded devices
CN108388885B (zh) * 2018-03-16 2021-06-08 南京邮电大学 Real-time multi-person close-up recognition and automatic screenshot method for large live-broadcast scenes
CN109376645B (zh) * 2018-10-18 2021-03-26 深圳英飞拓科技股份有限公司 Face image data preference method and device, and terminal equipment
CN109598211A (zh) * 2018-11-16 2019-04-09 恒安嘉新(北京)科技股份公司 Real-time dynamic face recognition method and system
CN110288632A (zh) * 2019-05-15 2019-09-27 北京旷视科技有限公司 Image processing method and device, terminal, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010071442A1 (en) * 2008-12-15 2010-06-24 Tandberg Telecom As Method for speeding up face detection
CN102214291A (zh) * 2010-04-12 2011-10-12 云南清眸科技有限公司 Fast and accurate face detection and tracking method based on video sequences
US20140098174A1 (en) * 2012-10-08 2014-04-10 Citrix Systems, Inc. Facial Recognition and Transmission of Facial Images in a Videoconference
CN110837750A (zh) * 2018-08-15 2020-02-25 华为技术有限公司 Face quality evaluation method and device
CN109829436A (zh) * 2019-02-02 2019-05-31 福州大学 Multi-face tracking method based on deep appearance features and an adaptive aggregation network
CN111770299A (zh) * 2020-04-20 2020-10-13 厦门亿联网络技术股份有限公司 Method and system for real-time face summary service of intelligent video conference terminal

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115396726A (zh) * 2022-08-01 2022-11-25 陈兵 Presentation generation system and method for business live streaming
CN115396726B (zh) * 2022-08-01 2024-05-07 陈兵 Presentation generation system and method for business live streaming
CN116489502A (zh) * 2023-05-12 2023-07-25 深圳星河创意科技开发有限公司 Remote conference method based on an AI camera docking station, and AI camera docking station
CN116489502B (zh) * 2023-05-12 2023-10-31 深圳星河创意科技开发有限公司 Remote conference method based on an AI camera docking station, and AI camera docking station

Also Published As

Publication number Publication date
CN111770299B (zh) 2022-04-19
CN111770299A (zh) 2020-10-13

Similar Documents

Publication Publication Date Title
WO2021213158A1 (zh) Method and system for real-time face summary service of intelligent video conference terminal
Shao et al. Deep convolutional dynamic texture learning with adaptive channel-discriminability for 3D mask face anti-spoofing
CN109472198B (zh) Pose-robust video smiling-face recognition method
Black et al. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion
US20220092882A1 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN102332095B (zh) Face motion tracking method and system, and augmented reality method
Rikert et al. Gaze estimation using morphable models
CN102375970B (zh) Face-based identity authentication method and authentication device
CN108537754B (zh) Face image restoration system based on a deformation guide map
CN109360156A (zh) Single-image rain removal method using image blocks based on generative adversarial networks
CN107330371A (zh) Method, device and storage device for acquiring facial expressions of a 3D face model
CN110674701A (zh) Fast driver fatigue state detection method based on deep learning
KR20200063292A (ko) Facial-image-based emotion recognition system and method
CN109635693B (zh) Frontal face image detection method and device
CN112288627B (zh) Recognition-oriented super-resolution method for low-resolution face images
CN111476710B (zh) Video face-swapping method and system based on a mobile platform
CN112541422B (zh) Expression recognition method, device and storage medium robust to illumination and head pose
CN109117753A (zh) Body part recognition method and device, terminal, and storage medium
CN111597978B (zh) Method for automatically generating person re-identification images based on the StarGAN network model
CN115359534A (zh) Micro-expression recognition method based on multi-feature fusion and a dual-stream network
CN111582036A (zh) Cross-view person identification method based on shape and pose for wearable devices
CN113343927B (zh) Intelligent face recognition method and system suitable for patients with facial paralysis
Kakumanu et al. A local-global graph approach for facial expression recognition
CN106778576A (zh) Action recognition method based on SEHM feature map sequences
CN112487926A (zh) Scenic-area feeding behavior recognition method based on a spatiotemporal graph convolutional network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21792264

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21792264

Country of ref document: EP

Kind code of ref document: A1
