WO2023087891A1 - 实时人脸图像驱动方法、装置、电子设备及存储介质 - Google Patents

实时人脸图像驱动方法、装置、电子设备及存储介质 (Real-time face image driving method and apparatus, electronic device, and storage medium)

Info

Publication number
WO2023087891A1
WO2023087891A1, PCT/CN2022/119941 (CN2022119941W)
Authority
WO
WIPO (PCT)
Prior art keywords
face
face detection
real
detection frame
frame
Prior art date
Application number
PCT/CN2022/119941
Other languages
English (en)
French (fr)
Inventor
张子文
贾霞
申光
侯春华
刘明
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Publication of WO2023087891A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the embodiments of the present application relate to the field of image processing, and in particular to a real-time facial image driving method, device, electronic equipment, and storage medium.
  • Real-time face driving technology based on 2D images has many typical application scenarios and considerable potential in current production and everyday life. Face driving technology can be applied to entertainment, beauty/makeup, short-video and similar applications to make them more fun and playable; it can be embedded in video conferencing systems for ultra-low-bandwidth compression of conference video; and it can also be used in a bank's remote virtual customer-service system to provide a unified, high-quality customer-service persona. Face driving technology can be divided into 3D face driving and 2D image face driving. The 3D face driving approach usually collects 3D information of the user's head and face, builds a 3D model of the target figure, decouples head-pose parameters from facial-expression parameters, and maps them onto the target figure. By comparison, image-based 2D face driving technology has a wider range of application scenarios: a 2D video capture device is typically used to collect the user's facial expression and pose features, and the target figure to be driven is generated by a generative adversarial network.
  • Although the image-based 2D face driving scheme is cheaper and supports richer application scenarios, the robustness of its generation effect is poor and its use is therefore more constrained.
  • This is mainly a limitation of current 2D face driving algorithms: the video capture device is usually fixed, and the user's face region is required to stay within a predetermined capture window at all times. If the face approaches the edge of the capture window or moves outside it, a stable and effective real-time driving effect cannot be achieved. This inevitably increases discomfort and fatigue during long-term use and limits the application scenarios.
  • In addition, with a fixed capture window, the user will inevitably exhibit slight jitter or drift; amplified by the generation network, this causes jitter or even jumps in the images generated in real time, which seriously degrades the viewing experience.
  • Embodiments of the present application provide a real-time face image driving method, apparatus, electronic device, and storage medium, so that a stable and effective driving effect can be achieved even when the user's face is not kept within a predetermined capture window at all times, thereby relaxing the constraints on the position of the user's face and achieving highly robust, wide-range face driving.
  • An embodiment of the present application provides a real-time face image driving method, including: performing face detection on a currently captured video frame; determining, in the currently detected video frame, a face detection box for the region in which the target face is located; smoothing the determined face detection box; and extracting face features from the region covered by the smoothed face detection box, and driving a 2D face image on the basis of the extracted face features.
  • Embodiments of the present application provide a real-time face image driving apparatus, comprising: a face detection module configured to perform face detection on a currently captured video frame; a face detection box acquisition module configured to determine, in the currently detected video frame, a face detection box for the region in which the target face is located; a smoothing module configured to smooth the determined face detection box; and a driving module configured to extract face features from the region covered by the smoothed face detection box and drive a 2D face image on the basis of the extracted face features.
  • An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above real-time face image driving method.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above real-time face image driving method.
  • In the embodiments of the present application, face detection is performed frame by frame on the video, the face detection box of the region containing the face is determined for each frame, the box is smoothed, face features are extracted from the region covered by the smoothed box, and the target image is finally driven according to those features. As a result, the user's face region does not have to stay within a predetermined capture window, which enlarges the area over which the user's facial features can be collected.
  • Moreover, because the face detection box is smoothed, the face-driven generation result does not jitter or jump, and consecutive frames look more continuous and realistic.
  • Fig. 1 is a flowchart of the face image driving method provided by an embodiment of the present application;
  • Fig. 2a is a schematic diagram of the face capture scheme of a face image driving method in the related art;
  • Fig. 2b is a schematic diagram of the face capture scheme of the face image driving method provided by an embodiment of the present application;
  • Fig. 3 is a flowchart of the smoothing anti-shake algorithm in the face image driving method provided by an embodiment of the present application;
  • Fig. 4 is a schematic diagram of the modules implementing the face image driving method provided by an embodiment of the present application;
  • Fig. 5 is a flowchart of the face image driving method provided by an embodiment of the present application applied to a bank's remote digital virtual customer-service real-time service system;
  • Fig. 6 is a flowchart of the face image driving method provided by an embodiment of the present application applied to a mobile APP video call privacy encryption function;
  • Fig. 7 is a schematic structural diagram of the face image driving apparatus provided by an embodiment of the present application;
  • Fig. 8 is a schematic structural diagram of the electronic device provided by an embodiment of the present application.
  • An embodiment of the present application relates to a real-time face image driving method, including: performing face detection on a currently captured video frame; determining, in the currently detected video frame, a face detection box for the region in which the target face is located; smoothing the determined face detection box; and extracting face features from the region covered by the smoothed face detection box, and driving a 2D face image on the basis of the extracted face features. In this way, a stable and effective driving effect can be achieved even when the user's face is not always kept within the predetermined capture window, relaxing the constraints on the face position and achieving highly robust, wide-range face driving.
  • In step 101, face detection is performed on the currently captured video frame.
  • In one example, before performing face detection on the captured video frame, the terminal may also denoise the currently captured video frame to remove noise from the video.
  • In one example, the terminal smooths each video frame, i.e. the terminal adds a smoothing layer (such as Gaussian filtering) as the first layer of the face detection algorithm; the terminal then runs the face detection algorithm on the smoothed video frame. Adding a smoothing layer at the first layer of the detector helps suppress temporal disturbances caused by changes in ambient lighting and similar factors, reduces their influence on the regression of the detection-box position, and makes the resulting face detection box more accurate.
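  • As a rough illustration of this preprocessing, the Python sketch below applies a Gaussian smoothing pass to each captured frame before handing it to a detector. The detector call, kernel size and sigma are placeholders chosen for the example, not values taken from this publication; the blur applied outside the network simply plays the role of the smoothing layer described above.

      import cv2

      def detect_on_smoothed_frame(frame_bgr, detect_faces, ksize=5, sigma=1.0):
          """Run face detection on a Gaussian-smoothed copy of the frame.

          `detect_faces` stands in for the deployed detector (e.g. UltraFace,
          CenterFace or MTCNN) and is assumed to return (x, y, w, h) boxes.
          Smoothing damps frame-to-frame lighting noise so that the regressed
          box positions fluctuate less between frames.
          """
          smoothed = cv2.GaussianBlur(frame_bgr, (ksize, ksize), sigma)
          return detect_faces(smoothed)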
  • In step 102, the face detection box for the region in which the target face is located is determined in the currently detected video frame.
  • In one example, before the face detection box of the target face region is determined, if the terminal detects more than one face detection box, the boxes are sorted from largest to smallest and the first N boxes are kept, where the number of boxes to keep depends on the specific scene and N defaults to 3. Face recognition or face-feature comparison is then used to compare the faces in the retained boxes with the user's preset face, and the box with the highest similarity is kept; the preset face is usually the face inside the largest detection box in the first captured video frame. For example, suppose face detection boxes A, B, C and D appear in the currently detected video frame. With N set to 3, boxes B, C and D are kept according to box size; face recognition then compares the faces in B, C and D with the user's preset face, finds that the face in box B is the most similar, and therefore retains box B.
  • Keeping the detection box of the face most similar to the preset face avoids mislocating the user's face when several faces appear in the captured frame, and also speeds up determination of the user's face detection box. A minimal selection routine is sketched below.
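  • The routine below is only a sketch of that selection rule; `face_similarity` stands in for whatever face recognition or feature comparison the system actually uses, and boxes are assumed to be (x, y, w, h) tuples.

      def select_target_box(boxes, frame, reference_face, face_similarity, top_n=3):
          """Keep the N largest detections, then keep the one most similar to the preset face."""
          if len(boxes) <= 1:
              return boxes[0] if boxes else None
          largest = sorted(boxes, key=lambda b: b[2] * b[3], reverse=True)[:top_n]

          def crop(box):
              x, y, w, h = box
              return frame[y:y + h, x:x + w]

          # `reference_face` is the face from the largest box of the first frame;
          # `face_similarity` returns a score where higher means more similar.
          return max(largest, key=lambda box: face_similarity(crop(box), reference_face))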
  • In one example, the terminal obtains the face detection box with an improved face detection algorithm; candidate algorithms include UltraFace (a lightweight face detection network), CenterFace (a lightweight face detection network), and MTCNN (multi-task cascaded neural network, a cascaded face detection network), among others.
  • It is worth noting that conventional methods can only designate a fixed area within the capture device's field of view as the capture window, and driving only works when the user's face lies entirely inside that window. In the embodiment of the present application, by contrast, the terminal tracks the user's face position in real time by applying the face detection algorithm and the smoothing anti-shake algorithm to the video frames, so the user is not required to keep the face inside a small preset capture window and can choose a position fairly freely within the visible area of the capture device rather than being confined to a fixed region, as illustrated in Figures 2a and 2b. This avoids the limitation of the common technique in which faces can only be captured in a fixed region, allows the capture box to adjust dynamically to the face position, reduces the constraints placed on the user, and improves the user experience.
  • In step 103, the determined face detection box is smoothed.
  • In one example, before smoothing the face detection box, the terminal stretches the detected box into a square (equal width and height). Because the aspect ratio of the detected box is not fixed, normalizing it to a square makes the subsequent processing easier; a small helper is sketched below.
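  • A possible helper for that normalization step, assuming boxes are given as (x1, y1, x2, y2) corners; it simply expands the shorter side around the box centre.

      def make_square(box):
          """Stretch an (x1, y1, x2, y2) box to equal width and height, keeping its centre."""
          x1, y1, x2, y2 = box
          side = max(x2 - x1, y2 - y1)
          cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
          return (cx - side / 2.0, cy - side / 2.0, cx + side / 2.0, cy + side / 2.0)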
  • In one example, the terminal computes the IoU (Intersection over Union) value of the face detection boxes of two adjacent frames, i.e. the ratio of the intersection to the union of the two boxes. If the IoU of the two boxes is greater than a preset value, the coordinates of the face detection box are kept unchanged. If the IoU is less than or equal to the preset value, the terminal looks up the fixed constant corresponding to the interval in which the IoU value falls, multiplies that constant by the coordinate difference between the two boxes to obtain an offset, and adds the offset to the original coordinates of the face detection box to obtain new coordinates; these new coordinates are the updated face-detection-box coordinates. The correspondence between the fixed constants and the IoU intervals is preset by the terminal. The specific flow is shown in Figure 3.
  • By smoothing the face detection box in this way, for example by updating its coordinates according to the IoU of the boxes in adjacent frames, slight vibration and drift of the user or of the capture device are filtered out well, ensuring that the face-driven generation result does not jitter or jump and that consecutive frames look more continuous and realistic.
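  • A minimal sketch of this update rule follows. The IoU threshold and the interval-to-constant table are illustrative tuning values chosen for the example; the publication only states that such a preset value and preset constants exist, without giving their numbers.

      def iou(box_a, box_b):
          """Intersection over Union of two (x1, y1, x2, y2) boxes."""
          ax1, ay1, ax2, ay2 = box_a
          bx1, by1, bx2, by2 = box_b
          ix1, iy1 = max(ax1, bx1), max(ay1, by1)
          ix2, iy2 = min(ax2, bx2), min(ay2, by2)
          inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
          union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
          return inter / union if union > 0 else 0.0

      IOU_KEEP_THRESHOLD = 0.8                                     # assumed preset value
      SMOOTHING_CONSTANTS = [(0.6, 0.3), (0.4, 0.5), (0.0, 0.8)]   # assumed (interval lower bound, constant)

      def smooth_box(prev_box, new_box):
          """Anti-shake update of the face box between two adjacent frames."""
          overlap = iou(prev_box, new_box)
          if overlap > IOU_KEEP_THRESHOLD:
              # Small jitter only: keep the previous coordinates unchanged.
              return prev_box
          # Otherwise pick the constant for the interval containing the IoU value,
          # scale the coordinate difference by it, and add the offset to the old box.
          constant = next(c for lo, c in SMOOTHING_CONSTANTS if overlap >= lo)
          return tuple(p + constant * (n - p) for p, n in zip(prev_box, new_box))

  • Per frame, the square detection box would be passed through `smooth_box` together with the previous frame's box, and the face crop taken from the returned coordinates.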
  • In step 104, face features are extracted from the region covered by the smoothed face detection box, and a 2D face image is driven on the basis of the extracted features.
  • In one example, the terminal performs feature extraction on the captured face and parses the facial expression and pose features. This feature extraction is usually done with a convolutional neural network, for example the keypoint detector in FOMM (First Order Motion Model, a face driving network based on first-order Taylor expansion). The terminal sends the feature information to a cloud server so that the server's generator can generate the face image in real time.
  • Alternatively, the terminal can generate the face image in real time with a local generator. Generating the face image on a network-side server, as in the example above, saves local computation.
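  • Pulling the previous sketches together, one illustrative client-side iteration might look like the following. Here `detector`, `keypoint_model` and `send_to_server` are placeholders for the deployed face detector, the FOMM-style keypoint extractor and the uplink to the cloud generator; `make_square` and `smooth_box` refer to the helper sketches above, and the JSON payload format is an assumption made for the example.

      import json

      def process_frame(frame, prev_box, detector, keypoint_model, send_to_server):
          """One client-side step: detect, stabilise, crop, extract features, upload."""
          boxes = detector(frame)                    # e.g. UltraFace / CenterFace / MTCNN
          if boxes:
              x, y, w, h = boxes[0]
              box = smooth_box(prev_box, make_square((x, y, x + w, y + h)))
          else:
              box = prev_box                         # no detection: reuse the last stable box
          x1, y1, x2, y2 = (int(round(v)) for v in box)
          face_crop = frame[y1:y2, x1:x2]
          # Expression / pose features; assumed to come back as a JSON-serialisable list.
          features = keypoint_model(face_crop)
          send_to_server(json.dumps({"features": features}))
          return box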
  • In one example, face detection is performed on the target picture to be driven, and the face region of the target figure is separated from the non-face region. The user's facial-motion feature information and the face region of the target figure are then fed into the generator to produce a target figure carrying the user's facial expression and pose. Finally, the generated image is either stitched together with the non-driven region into a new image or used directly, and is sent to other devices or previewed and displayed on the front-end device in real time.
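  • A very simple version of that stitching step is sketched below; a production system would typically also blend the seam (for example with feathering or Poisson blending), which is omitted here.

      import cv2

      def paste_back(target_image, generated_face, face_box):
          """Paste the generated face crop back into the undriven part of the target picture."""
          x1, y1, x2, y2 = face_box
          out = target_image.copy()
          out[y1:y2, x1:x2] = cv2.resize(generated_face, (x2 - x1, y2 - y1))
          return out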
  • The real-time face image driving method provided by the embodiment of the present application is implemented by the modules shown in Figure 4: a video capture module S1, an image preprocessing module S2, a facial-motion feature extraction module S3, a target-figure preprocessing module S4, a face-driven image generation module S5, and a streaming/display module S6.
  • the video collection module S1 is configured to complete the real-time collection and transmission of videos in the area, wherein the video collection module can be fixed or mobile.
  • The image preprocessing module S2 is configured to preprocess the captured video frames: it first denoises the captured video frame by frame, then smooths each frame and runs face detection on it to locate the user's face in the video image, next applies the smoothing anti-shake algorithm to the per-frame face-detection-box positions, and finally crops out the face region at a 1:1 aspect ratio according to the adjusted box position.
  • the facial motion feature extraction module S3 is configured to extract facial features from the user's face within the smoothed face detection frame, and analyze facial expression and posture features.
  • the target image preprocessing module S4 is configured to perform face detection on the image of the target image to be driven, and separate the face area and non-face area of the target image.
  • the face-driven image generation module S5 is configured to send the user's face motion feature information and the face area of the target image into the generator to generate a target image with the user's facial expression and posture.
  • The streaming/display module S6 is configured to stitch the generated image together with the non-driven region into a new image, or to use the generated image directly, and to send it to other devices or preview and display it on the front-end device in real time.
  • To make the implementation of the real-time face image driving method clearer, it is described below through a bank remote digital virtual customer-service real-time service system designed on top of the method; in this system, real-time face generation is performed on the server, and the specific flow is shown in Figure 5.
  • In step 501, the user accesses the APP (application program), i.e. the user opens the bank's service APP on the terminal side (a mobile phone or PC).
  • In step 502, a background thread waits for the user to request the virtual customer service and keeps checking whether the user has enabled it, until the user enables the virtual customer service or exits the APP.
  • In step 503, the target customer-service figure is selected. For example, after the system detects that the user has requested the virtual customer service, it responds by entering virtual customer-service mode. The system presets several virtual customer-service figures (such as a bank brand mascot, a celebrity under contract with the bank, or bank employees) to meet the needs of users of different genders and groups, and the user picks one virtual customer-service figure for the subsequent service according to personal preference. The terminal APP sends a message to the server indicating the selected virtual customer service, and the server completes initialization and responds.
  • In step 504, the capture devices collect the real customer-service agent's audio and video in real time. For example, the system automatically assigns the request to a currently idle online customer-service agent and establishes a connection with the user's APP to start providing the virtual customer service. Depending on the business scenario, the real bank agent is provided with a video capture device (such as a fixed wide-angle camera) and an audio capture device (such as a microphone) to collect the agent's audio and video in real time. While the real-time virtual customer service is active, the capture device faces the agent, and the agent's head should, as far as possible, stay near the middle of the device's visible area so that head and face information can be captured in real time; the audio capture device should be close to the real agent to ensure good pickup. During the service, the user can ask the bank agent business questions as needed, and the agent answers with a natural expression and at a normal speaking pace.
  • In step 505, the real agent's face position is detected; for example, the agent's face is detected frame by frame in the captured video using a face detection algorithm that has a smoothing layer added as its first layer, yielding the face detection box.
  • In step 506, the position of the face detection box is adjusted by the anti-shake algorithm and the frame is cropped at the adjusted box position. For example, the face detection box is stretched into a square, and the IoU of the detection boxes of adjacent frames is computed. If the IoU is greater than the preset value, the coordinates of the face detection box are not updated; if the IoU is less than or equal to the preset value, the terminal obtains the fixed constant corresponding to the interval containing the IoU value, multiplies it by the coordinate difference of the boxes in the two adjacent frames to obtain an offset, and adds the offset to the original coordinates to obtain the updated face-detection-box coordinates. The original video captured in real time by the video capture device is then cropped according to the final face-detection-box position.
  • In step 507, the terminal extracts features from the captured face of the real bank agent and decouples the face pose and expression parameters. A convolutional neural network can be used for the face feature extraction, with a smoothing layer (such as Gaussian filtering) added to the network to reduce the disturbance that changes in ambient light and other factors cause to feature extraction.
  • In step 508, the feature information and the audio/video are encoded in real time and sent to the cloud server; the transmission can be performed, for example, over RTCP (RTP Control Protocol, a real-time transport control protocol).
  • In step 509, the server crops the face region of the target figure to be driven at the preset position, i.e. according to the virtual customer-service figure selected by the user, it cuts the region to be driven out of the target picture using a preset cropping box. The virtual customer-service figure preset in the system is usually a half-length portrait or a photo with a custom bank background.
  • In step 510, the generation network produces a new picture of the target figure in real time, i.e. the server uses the generation network to generate a virtual customer-service figure carrying the real bank agent's expression and facial pose. The server first decodes the information stream sent from the capture side, then feeds the extracted face feature parameters and the cropped target picture into a generator, such as the generator of a GAN-based generation network or of a face driving network based on first-order Taylor expansion, to generate in real time target pictures with the real agent's expression and pose.
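  • As a hedged sketch of this server-side step, reusing the `paste_back` helper above: `generator` is a placeholder for whatever driving network is deployed (the publication mentions GAN-based generation networks and first-order-Taylor-expansion face driving networks such as FOMM), and the payload format simply mirrors the client-side example rather than any format specified in the publication.

      import json

      def drive_target_frame(payload, generator, target_face, target_image, face_box):
          """Decode uplinked features, drive the target face, and composite the result."""
          features = json.loads(payload)["features"]
          generated = generator(target_face, features)   # placeholder driving generator
          return paste_back(target_image, generated, face_box)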
  • In step 511, the stitching algorithm adjusts the generated image and pastes it back onto the original picture, i.e. the newly generated picture of the target figure is stitched back into the original crop position so that it blends with the original image without visible seams.
  • In step 512, the server encodes the stitched video stream and audio stream of the target figure and pushes them to the user in real time (e.g., over RTCP).
  • step 513 real-time display is performed on the user end.
  • The bank remote digital virtual customer-service real-time service system designed on the basis of the real-time face image driving method allows a single, well-presented customer-service figure to be used uniformly nationwide while preserving real-time interaction with customers, which improves the customer experience.
  • The real-time face image driving method of the embodiment of the present application is described below through a mobile APP video call privacy encryption function designed on the basis of the method; in this case, real-time face generation is performed on the local side. The specific flow is shown in Figure 6, where the user terminal using the virtual call function is terminal 1 and the other user terminal is terminal 2.
  • step 601 the terminal 1 user accesses the video conferencing APP.
  • step 602 the terminal 1 determines whether the user enables the virtual call service.
  • In step 603, once it is determined that the user has enabled the virtual call, the user selects a target figure, i.e. picks one of the available target figures or celebrity figures provided by the system.
  • In step 604, terminal 1 captures the user's audio and video in real time. After the user turns on the target-figure video call function, the front camera of the phone must be able to capture the whole face; the face should be kept roughly centred and should not move quickly across the camera's view while the terminal records the user's voice and picture.
  • step 605 terminal 1 detects the position of the user's face through a face detection algorithm, and obtains a face detection frame.
  • In step 606, terminal 1 adjusts the position of the face detection box through smoothing and crops accordingly, i.e. the terminal tracks the user's face position frame by frame in real time, corrects the position of the face detection box with the anti-shake algorithm, and crops out the face region.
  • In step 607, terminal 1 extracts the user's facial expression and pose features, i.e. a facial feature encoder, usually a convolutional neural network (such as the keypoint detector in FOMM), is used to extract the expression and pose features, and the extracted feature information and the audio are pushed to the terminal of the remote user, i.e. terminal 2.
  • In step 608, the face region to be driven is cropped from the target figure at the preset position.
  • step 609 the face area separated from the target image is combined with the extracted feature information to generate a new target image.
  • Unlike the banking system described above, the generation network here is a lightweight, accelerated neural network that can run in real time on the phone.
  • step 610 the terminal 1 previews and displays the newly generated target image on the screen of the terminal 1 in real time.
  • In step 611, terminal 1 transmits the extracted facial feature information and the audio captured in real time to terminal 2 synchronously.
  • In step 612, terminal 2 generates an avatar carrying the facial expression and pose information of the terminal-1 user. Terminal 2 first decodes the information pushed by terminal 1 in real time, then sends the face region of the target figure together with the pushed feature information to the face driver to generate a new target figure with the user's expression and pose information. The generator here is a lightweight generation network that can be deployed on the terminal side and run in real time.
  • step 613 the newly generated target image is displayed on the terminal 2 in real time and played synchronously with the audio.
  • The real-time face image driving method provided by the embodiment of the present application, through face detection combined with the smoothing anti-shake algorithm, greatly expands the area over which the user's facial features can be collected: the target figure can be driven from anywhere within the wide visible area of the video capture device, and the user does not have to keep the face inside a specific capture box at all times, which improves the user experience and reduces fatigue during long-term use. In addition, the smoothing anti-shake algorithm effectively removes the influence of face-detection-box jitter, of the user's facial jitter and displacement, and of the capture device's jitter and displacement on the generation result, ensuring that the generated target figure does not exhibit large inter-frame jumps or shaking and improving the real-time driving effect.
  • The division of the above methods into steps is only for clarity of description; in implementation, steps may be merged into one step or split into several steps, and as long as the same logical relationship is preserved they fall within the scope of protection of this patent. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or flow without changing its core design also falls within the scope of protection of this patent.
  • the embodiment of this application also provides a real-time face image driving device, as shown in FIG. 7 , including: a face detection module 701 , a face detection frame acquisition module 702 , a smoothing module 703 and a driving module 704 .
  • the face detection module 701 is configured to perform face detection on the currently collected video frame;
  • The face detection box acquisition module 702 is configured to determine, in the currently detected video frame, the face detection box of the region in which the target face is located; the smoothing module 703 is configured to smooth the determined face detection box; and the driving module 704 is configured to extract face features from the region covered by the smoothed face detection box and to drive a 2D face image on the basis of the extracted features.
  • In one example, the real-time face image driving apparatus may further include a denoising module (not shown in the figure); before face detection is performed on the captured video frames, the denoising module denoises the currently captured frame to remove noise from the video.
  • In one example, the face detection module 701 smooths each video frame, i.e. a smoothing layer (such as Gaussian filtering) is added as the first layer of the detection algorithm, and face recognition is then performed on the smoothed frame.
  • the smoothing module 703 stretches the detected face detection frame into a frame with equal width and height before smoothing the face detection frame.
  • In one example, the smoothing module 703 computes the IoU (Intersection over Union) of the face detection boxes of two adjacent frames, i.e. the ratio of their intersection to their union. If the IoU of the boxes in the two adjacent frames is greater than a preset value, the coordinates of the face detection box are kept unchanged; if the IoU is less than or equal to the preset value, the corresponding fixed constant is obtained according to the interval containing the IoU value, the constant is multiplied by the coordinate difference of the boxes in the two adjacent frames to obtain an offset, and the offset is added to the original coordinates of the face detection box to obtain the updated coordinates. The correspondence between the fixed constants and the IoU intervals is preset.
  • In one example, the driving module 704 sends the extracted face features to a network-side server so that the server can generate a target image with the corresponding facial expression and pose, and then receives the target image fed back by the server and displays it in real time.
  • The real-time face image driving apparatus provided by the embodiment of the present application, through face detection combined with the smoothing anti-shake algorithm, expands the area over which the user's facial features can be collected: the target figure can be driven from anywhere within the wide visible area of the video capture device, without requiring the user to keep the face inside a specific capture box at all times, which improves the user experience and reduces fatigue during long-term use. The smoothing anti-shake algorithm likewise removes the influence of face-detection-box jitter, the user's facial jitter and displacement, and the capture device's jitter and displacement on the generation result, so that the generated target figure does not exhibit large inter-frame jumps or shaking, improving the real-time driving effect.
  • this embodiment is a device embodiment corresponding to the above embodiment of the real-time face image driving method, and this embodiment can be implemented in cooperation with the above real-time face image driving method embodiment.
  • the relevant technical details mentioned in the above embodiments of the real-time face image driving method are still valid in this embodiment, and will not be repeated here in order to reduce repetition.
  • the relevant technical details mentioned in this implementation manner can also be applied to the above embodiment of the real-time face image driving method.
  • It is worth mentioning that the modules involved in the above embodiments are logical modules; in practice, a logical unit may be a single physical unit, part of a physical unit, or a combination of several physical units. In addition, in order to highlight the innovative part of the embodiments of the present application, units that are not closely related to solving the technical problem addressed by these embodiments are not introduced here, but this does not mean that no other units exist in these embodiments.
  • An embodiment of the present application also provides an electronic device, as shown in Figure 8, including at least one processor 801 and a memory 802 communicatively connected to the at least one processor 801; the memory 802 stores instructions executable by the at least one processor 801, and the instructions are executed by the at least one processor 801 so that the at least one processor can perform the above face image driving method.
  • the memory and the processor are connected by a bus
  • the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors and various circuits of the memory together.
  • the bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • the data processed by the processor is transmitted on the wireless medium through the antenna, and further, the antenna also receives the data and transmits the data to the processor.
  • The processor is responsible for managing the bus and general processing, and can also provide various functions including timing, peripheral interfacing, voltage regulation, power management and other control functions, while the memory can be used to store data that the processor uses when performing operations.
  • Embodiments of the present application also provide a computer-readable storage medium storing a computer program.
  • the above method embodiments are realized when the computer program is executed by the processor.
  • Those skilled in the art will understand that all or some of the steps in the above method embodiments can be carried out by a program instructing the relevant hardware; the program is stored in a storage medium and includes a number of instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

A real-time face image driving method and apparatus, an electronic device, and a storage medium. The method comprises: performing face detection on a currently captured video frame (101); determining, in the currently detected video frame, a face detection box of the region in which the target face is located (102); smoothing the determined face detection box (103); and extracting face features from the region covered by the smoothed face detection box and driving a 2D face image on the basis of the extracted face features (104), so that a stable and effective driving effect can be achieved even when the user's face is not always kept within a predetermined capture window.

Description

实时人脸图像驱动方法、装置、电子设备及存储介质 技术领域
本申请实施例涉及图像处理领域,尤其涉及一种实时人脸图像驱动方法、装置、电子设备及存储介质。
背景技术
基于2D图像的实时人脸驱动技术在当前的生产生活中有很多典型的应用场景与应用潜力。人脸驱动技术可以适用于娱乐、美妆、短视频等类别的应用程序,增加应用程序的趣味性与可玩性;可嵌入视频会议系统用于超低带宽的视频会议带宽压缩;也可用于银行的远程虚拟客服系统,提供统一、高质量的客服形象。人脸驱动技术可以分为3D人脸驱动与2D图像人脸驱动两种。其中3D人脸驱动方法通常采集使用者的头部和面部的3D信息,对目标形象进行3D建模,解耦头部姿态参数与人脸表情参数并映射到目标形象上。相较而言,基于图像的2D人脸驱动技术的应用场景更加广泛,通常使用2D视频采集设备采集使用者的面部表情及姿态特征,通过对抗生成网络生成所要驱动的目标形象。
虽然基于图像的2D人脸驱动方案的成本更低,应用场景更加丰富,但是生成效果鲁棒性较差,应用场景限制多。主要是由于目前的2D人脸驱动算法的限制,通常将视频采集设备固定,并要求使用者的面部区域要始终保持在在既定的采集窗口内,若接近采集窗口边缘或者超出采集窗口,会导致无法实现稳定有效的实时驱动效果。这无疑会增加长时间使用下的不适感和劳累感,同时限制了应用场景。此外,采用固定采集窗口的方式,由于使用者难免会有轻微的抖动或者偏移,经过生成网络的放大效应,会导致实时生成图像的抖动甚至是跳变,严重影响观感。
发明内容
本申请实施例提供一种实时人脸图像驱动方法、装置、电子设备及存储介质,实现使用者的面部不必始终保持在既定的采集窗口内,也能实现稳定有效的驱动效果,从而可以放宽对使用者人脸位置的约束条件,实现高鲁棒性的大范围人脸驱动。
本申请的实施例提供了一种实时人脸图像驱动方法,包括:对当前采集到的视频帧进行人脸检测;在当前检测的视频帧中,确定出目标人脸所在区域的人脸检测框;对确定出的人脸检测框进行平滑处理;对经平滑处理后的人脸检测框的所在区域进行人脸特征的提取,并基于提取的人脸特征进行2D人脸图像的驱动。
本申请的实施例提供了一种实时人脸图像驱动装置,包括:人脸检测模块,设置为对当前采集到的视频帧进行人脸检测;人脸检测框获取模块,设置为在当前检测的视频帧中,确定出目标人脸所在区域的人脸检测框;平滑模块,设置为对确定出的人脸检测框进行平滑处理;驱动模块,设置为对经平滑处理后的人脸检测框的所在区域进行人脸特征的提取,并基于提取的人脸特征进行2D人脸图像的驱动。
本申请的实施例还提供了一种电子设备,包括:至少一个处理器;以及,与所述至少一 个处理器通信连接的存储器;其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行上述实时人脸图像驱动方法。
本申请的实施例还提供了一种计算机可读存储介质,存储有计算机程序,所述计算机程序被处理器执行时实现上述实时人脸图像驱动方法。
在本申请实施例中,通过对视频帧进行逐帧的人脸检测,确定每一帧的人脸所在区域的人脸检测框,并对人脸检测框进行平滑处理,对平滑处理后的人脸检测框的所在区域进行人脸特征提取,最后根据上述人脸特征对目标图像进行驱动,使得使用者的面部区域无需始终保持在既定的采集窗口内,扩展了使用者面部特征采集区域范围,即可以允许用户在采集设备可视区域内较灵活的选择位置,而不必局限在某一固定区域,从而放宽了对使用者人脸位置的约束条件,实现高鲁棒性的大范围人脸驱动,提升了用户体验,降低了用户长时间使用的疲劳。而且通过对人脸检测框的平滑处理,保证了人脸驱动生成效果不会因此出现抖动和跳变的情况,连续帧的观感上更具延续性和真实感。
附图说明
图1是本申请的一实施例提供的人脸图像驱动方法流程;
图2a是相关技术中的人脸图像驱动方法的人脸采集方案示意图;
图2b是本申请的一实施例提供的人脸图像驱动方法的人脸采集方案示意图;
图3是本申请的一实施例提供的人脸图像驱动方法中的平滑防抖算法流程图;
图4是本申请的一实施例提供的实现人脸图像驱动方法的模块示意图;
图5是本申请的一实施例提供的人脸图像驱动方法应用于银行远程数字虚拟客服实时服务系统的流程图;
图6是本申请的一实施例提供的人脸图像驱动方法应用于移动APP视频通话隐私加密功能的流程图;
图7是本申请的一实施例提供的人脸图像驱动装置的结构示意图;
图8是本申请的一实施例提供的电子设备的结构示意图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合附图对本申请的各实施方式进行详细的阐述。然而,本领域的普通技术人员可以理解,在本申请各实施方式中,为了使读者更好地理解本申请实施例而提出了许多技术细节。但是,即使没有这些技术细节和基于以下各实施方式的种种变化和修改,也可以实现本申请实施例所要求保护的技术方案。
本申请的一实施例涉及一种实时人脸图像驱动方法,包括:对当前采集到的视频帧进行人脸检测;在当前检测的视频帧中,确定出目标人脸所在区域的人脸检测框;对确定出的人脸检测框进行平滑处理;对经平滑处理后的人脸检测框的所在区域进行人脸特征的提取,并基于提取的人脸特征进行2D人脸图像的驱动。实现使用者的面部不必始终保持在既定的采集窗口内,也能实现生稳定有效的驱动效果,从而可以放宽对使用者人脸位置的约束条件,实现高鲁棒性的大范围人脸驱动。
下面对本实施例中的实时人脸图像驱动方法的实现细节进行具体的说明,以下内容仅为 方便理解本方案的实现细节,并非实施本方案的必须。具体流程如图1所示,至少包括但不限于如下步骤:
在步骤101中,对当前采集到的视频帧进行人脸检测。
在一个例子中,终端在对采集到的视频帧进行人脸检测之前,终端还可以对当前采集到的视频帧进行去噪处理,消除视频中的噪声。
在一个例子中,终端对每一视频帧进行平滑处理,即终端在人脸检测算法的第一层位置增加平滑层(如高斯滤波);然后终端通过人脸检测算法对经平滑处理后的视频帧进行人脸识别。
上述通过在人脸检测算法第一层位置增加平滑层,有利于降低外界光照等因素在时序上的变化造成的干扰,减少对人脸检测框位置回归的影响,使得到的人脸检测框的位置更加准确。
在步骤102中,在当前检测的视频帧中,确定出目标人脸所在区域的人脸检测框。
在一个例子中,在确定出目标人脸所在区域的人脸检测框之前,在终端检测到人脸检测框的数量大于1时,根据检测框的大小,从大到小对检测框进行排序,保留前N个检测框,其中,保留的检测框数量根据具体场景选择,N默认选择为3;然后使用人脸识别或者人脸特征比对,对保留的检测框内的人脸与使用者预设的人脸进行比较,保留相似度最高的人脸检测框,其中,预设人脸通常为采集到的视频帧的第一帧中最大人脸检测框中的人脸。例如:在当前检测的视频帧中,出现了人脸检测框A、B、C和D。设置N为3,根据检测框的大小,保留了人脸检测框B、C和D,然后使用人脸识别将人脸检测框B、C和D中的人脸分别和使用者预设的人脸进行比较,比较发现人脸检测框B中的人脸与预设的人脸相似度最高,因此保留人脸检测框B。
当算法检测出多个人脸检测框时,保留与预设人脸最相似的人脸所在区域的人脸检测框,避免采集的视频帧中存在多个人脸,导致使用者人脸位置定位错误的问题,同时提高了确定使用者人脸检测框的速度。
在一个例子中,终端采用改进的人脸检测算法检测得到人脸检测框,其中,算法包括:ultraface(一种轻量级的人脸检测网络)算法、Centerface(一种轻量级的人脸检测网络)算法以及MTCNN(Multi-tasks cascade neural network,级联人脸检测网络)算法等。
值得一提的是,由于常规方法只能在采集设备可视范围内划定某一固定区域作为采集窗口,只有使用者人脸部分完全在采集窗口内才有效,而本申请实施例中终端通过对视频帧使用人脸检测算法和平滑防抖算法实时追踪使用者人脸位置,不必要求用户将面部始终保持在预设的小范围采集窗口内,可以允许用户在采集设备可视区域内较灵活的选择位置,不必局限于某一固定区域,如图2a和2b所示,避免常用技术中人脸采集框只能在固定区域进行人脸采集的问题,实现了人脸采集框根据人脸位置动态调整,降低了对使用者的限制条件,提升了用户体验。
在步骤103中,对确定出的所述人脸检测框进行平滑处理。
在一个例子中,终端在对人脸检测框进行平滑处理之前,将检测到的人脸检测框拉伸成宽高相等的方框。由于检测出来的人脸检测框的长宽比不固定,所以此处终端将检测框设置成宽高相等的方框,更有利于后续的处理。
在一个例子中,终端获取相邻两帧的人脸检测框的IOU(Intersection of Union,交并 比)值,即相邻两帧的人脸检测框的交集与并集的比值,在相邻两帧的人脸检测框的IOU值大于预设值的情况下,保持人脸检测框的坐标不变;在相邻两帧的人脸检测框的IOU值小于或等于预设值的情况下,终端根据IOU值所属的区间范围获取对应的固定常量,然后将固定常量与相邻两帧的人脸检测框的坐标差相乘得到偏移量,最后将偏移量与人脸检测的原坐标值相加得到一个新的坐标,上述新的坐标就是更新的人脸检测框坐标,固定常量与IOU值区间范围的对应关系由终端预先设定,具体流程如图3所示。
通过对人脸检测框进行平滑处理,如根据相邻两帧的人脸检测框的IOU值来对人脸检测框的坐标进行更新,使得对于使用者和采集设备的轻微抖动、偏移可以很好的过滤,保证人脸驱动生成效果不会因此出现抖动和跳变的情况,连续帧的观感上更具延续性和真实感。
在步骤104中,对经平滑处理后的所述人脸检测框的所在区域进行人脸特征的提取,并基于提取的所述人脸特征进行2D人脸图像的驱动。
在一个例子中,终端对采集到的人脸进行特征提取,解析人脸的表情和姿态特征,上述特征提取操作往往采用的是卷积神经网络进行提取,比如FOMM(First Order Motion Model,基于一阶泰勒展开的人脸驱动网络)中的关键点检测器,终端将特征信息发送到云端服务器,供云端服务器的生成器实时生成人脸图像。另外,终端也可以直接在本地的生成器上实时生成人脸图像。在上述例子中,在网络侧服务器实时生成人脸图像,可以节省本地计算负担。
在一个例子中,对待驱动的目标图片进行人脸检测,分离目标形象人脸区域和非人脸区域部分,然后将使用者人脸运动特征信息和目标形象的人脸区域送入生成器,生成具有使用者面部表情和姿态的目标形象;最后将生成的图像与非待驱动区域图像拼接成一个新的图像,或者直接使用生成的图像,发送至其它设备或实时在前端设备上预览显示。
本申请实施例提供的实时人脸图像驱动方法是由如图4所示的模块实现的,包括:视频采集模块S1、图像预处理模块S2、人脸运动特征提取模块S3、目标形象预处理模块S4、人脸驱动图像生成模块S5以及推流/显示模块S6。
视频采集模块S1,设置为完成区域内的视频的实时采集与传输,其中,视频采集模块可以是固定或者移动的。
图像预处理模块S2,设置为对采集到的视频帧进行预处理,首先对采集到的视频逐帧进行去噪处理,然后对视频帧进行逐帧的平滑处理和人脸检测,确定使用者人脸在视频图像中的位置,其次使用平滑防抖算法对得到的逐帧人脸检测框位置进行平滑防抖,最后根据调整后的人脸检测框位置,按照1∶1宽高比例裁切出人脸区域。
人脸运动特征提取模块S3,设置为对平滑过的人脸检测框内的使用者人脸提取人脸特征,解析人脸的表情和姿态特征。
目标形象预处理模块S4,设置为对待驱动的目标形象图片进行人脸检测,分离目标形象人脸区域与非人脸区域部分。
人脸驱动图像生成模块S5,设置为将使用者人脸运动特征信息和目标形象人脸区域送入生成器,生成具有使用者面部表情和姿态的目标形象。
推流/显示模块S6,设置为将生成图像与非待驱动区域图像拼接后的新图像,或者直接使用生成的图像,发送至其它设备或实时在前端设备上预览显示。
为了使本申请实施例提供的实时人脸图像驱动方法的实现过程更加明了,接下来通过基于实时人脸图像驱动方法设计的银行远程数字虚拟客服实时服务系统来对上述方法进行具体 的说明,其中,银行远程数字虚拟客服实时服务系统在服务器进行实时人脸生成,具体流程参照图5:
在步骤501中,用户接入APP(应用程序),即用户在终端侧(手机或者PC)接入银行服务APP。
在步骤502中,后台线程等待用户发起虚拟客服业务申请,判断用户是否启用虚拟客服,直到用户启用虚拟客服或者退出APP。
在步骤503中,选择目标客服形象,例如,系统检测到用户发起虚拟客服业务申请后,响应进入虚拟客服模式。系统预设多张虚拟客服形象(如银行品牌宣传吉祥物、银行签约明星、银行内部员工等),以满足不同性别、不同群体的用户需求。用户根据兴趣选择一位虚拟客服提供后续服务。终端APP向服务端发送消息告知选择的虚拟客服,服务器完成初始化并反馈。
在步骤504中,采集设备实时采集真实客服音视频,例如,系统根据当前空闲的在线客服人员情况自动分配业务,并与用户APP建立连接,开始提供虚拟客服服务。根据业务场景需求,要为真实银行客服提供视频采集设备(如固定广角摄像头)和音频采集设备(如麦克风)用于实时采集真实客服的音视频信息。实时虚拟客服服务启用过程中,采集设备正对银行客服人员,尽量保证客服人员的头部位于采集设备可视区域的中间位置,用于实时采集捕捉客服人员的头部和面部信息。音频采集设备要靠近真实客服,保证拾音效果。服务进行期间,用户根据需要向银行客服咨询业务问题,银行客服保证表情自然,语速正常的回答用户问题。
在步骤505中,检测真实客服的人脸位置,例如,根据采集到的视频逐帧对客服人员的人脸进行检测,采用在第一层位置增加平滑层的人脸检测算法,检测得到人脸检测框。
在步骤506中,通过防抖算法调整人脸检测框位置,并对调整后的人脸检测框位置进行裁切,例如,将人脸检测框拉伸成宽高相等的方框,在帧与帧之间计算检测框的IOU值,若IOU值大于预设值,则不更新人脸检测框的坐标;若IOU值小于或者等于预设值,则终端根据IOU值所属的区间范围获取对应的固定常量,然后将固定常量与相邻两帧的人脸检测框的坐标差相乘得到偏移量,最后将偏移量与人脸检测的原坐标值相加得到一个新的坐标,上述新的坐标就是更新后的人脸检测框坐标。根据最终确定的人脸检测框位置对视频采集设备实时采集到的原视频进行裁切。
在步骤507中,终端对采集到的银行真实客服的人脸进行特征提取,将人脸姿态与表情参数解耦。可以采用卷积神经网络进行人脸特征的提取,并且卷积神经网络增加平滑层(如高斯滤波),降低环境光线等因素的变化对特征提取产生的扰动。
在步骤508中,将用户侧将特征信息与音视频实时编码并发送到云端服务器,其中,可以通过RTCP(RTP Control Protocol,实时传输控制协议)进行发送。
在步骤509中,服务器按照预设位置裁切出目标形象待驱动人脸区域,即根据用户选择的虚拟客服形象,使用预设的裁切框将目标形象中待驱动区域截取出来。系统中预设的虚拟客服形象通常为半身照或者增加银行自定义背景的照片。
在步骤510中,生成网络实时生成新的目标形象照片,即服务器使用生成网络生成具有真实银行客服表情与面部姿态的虚拟客服形象。其中,服务器先将用户侧发送的信息流解码,然后将提取的人脸特征参数与裁切后的目标形象图片送入生成器,如一种基于GAN的生成网 络和基于一阶泰勒展开的人脸驱动网络中的生成器,实时生成具有真实客服表情和姿态的目标形象图片。
在步骤511中,拼接算法调整生成图像并贴回原图,即将新生成的目标形象图片拼接回原裁切位置,保证与原图像贴合而不会有割裂感。
在步骤512中,服务器将拼接好的目标形象视频流和音频流编码并实时推送(如RTCP)给用户。
在步骤513中,在用户端进行实时显示。
上述基于实时人脸图像驱动方法设计的银行远程数字虚拟客服实时服务系统,实现了在全国范围内统一使用面容姣好的客服形象,保证了与客户的实时互动,增强了客户体验。
以下通过一个基于实时人脸图像驱动方法设计的移动APP视频通话隐私加密功能来对本申请实施例的实时人脸图像驱动方法进行说明,其中,移动APP视频通话隐私加密功能在本地侧进行实时人脸生成,具体流程参照图6,其中,使用虚拟通话功能的用户终端为终端1,另一用户终端为终端2)。
在步骤601中,终端1用户接入视频会议APP。
在步骤602中,终端1判断用户是否启用虚拟通话服务。
在步骤603中,当确定用户开启虚拟服务通话后,使用者选择目标形象,即根据系统中提供的可用目标形象或者明星形象,选择一个作为使用。
在步骤604中,终端1实时采集用户音视频,即用户开启目标形象视频通话功能后,需保证手机的前置摄像头可以采集到完整的人脸,尽量保持人脸位置居中,避免人脸在摄像头中快速的移动,终端对人的声音和画面进行采集。
在步骤605中,终端1通过人脸检测算法检测用户人脸的位置,并获取人脸检测框。
在步骤606中,终端1通过对人脸检测框进行的平滑处理,调整人脸检测框位置并裁切,即终端实时逐帧跟踪使用者的脸部位置,配合防抖算法修正人脸检测框的位置,将人脸区域裁切出来。
在步骤607中,终端1提取用户人脸表情及姿态特征,即使用人脸特征编码器提取使用者的表情与姿态特征。该模块通常使用的是卷积神经网络(如FOMM中的关键点检测器),并将提取到的特征信息和音频推流到对面用户的终端侧,即终端2。
在步骤608中,按照预设位置裁切出目标形象待驱动人脸区域。
在步骤609中,将目标形象分离出的人脸区域结合提取的特征信息,生成新的目标形象。
其中,与上述银行系统不同的是,生成网络为经过轻量化加速、可以实时运行在手机侧的神经网络。
在步骤610中,终端1将新生成的目标形象实时预览并显示在终端1的屏幕上。
在步骤611中,终端1将提取到的人脸特征信息与实时采集的音频同步传输到终端2。
在步骤612中,终端2生成具有终端1使用者面部表情与姿态信息的虚拟形象。终端2先将终端1实时推送的信息进行解码,将目标形象人脸区域与推流过来的特征信息一起送到人脸驱动器中,生成新的具有使用者表情姿态信息的目标形象。该生成器是经过轻量化的可以部署在终端侧实时运行的生成网络。
在步骤613中,在终端2实时显示新生成的目标形象并与音频同步播放。
本申请实施例提供的实时人脸图像驱动方法,通过人脸检测加平滑防抖算法加大的扩展 了使用者面部特征的采集区域范围,允许用户可以在视频采集设备的大范围可视区域内均可实现对目标形象的驱动,不必要求使用者将面部一直放置在特定的采集框,提升了用户体验,降低用户长时间使用的疲劳,并且通过平滑防抖算法可以有效的消除人脸检测框抖动、使用者面部抖动位移和视频采集设备的抖动位移对于生成效果的影响,保证生成的目标形象不会出现大范围的帧间跳变与抖动,提升实时驱动的效果。
上面各种方法的步骤划分,只是为了描述清楚,实现时可以合并为一个步骤或者对某些步骤进行拆分,分解为多个步骤,只要包括相同的逻辑关系,都在本专利的保护范围内;对算法中或者流程中添加无关紧要的修改或者引入无关紧要的设计,但不改变其算法和流程的核心设计都在该专利的保护范围内。
本申请的实施例还提供了一种实时人脸图像驱动装置,如图7所示,包括:人脸检测模块701、人脸检测框获取模块702、平滑模块703以及驱动模块704。
其中,人脸检测模块701,设置为对当前采集到的视频帧进行人脸检测;人脸检测框获取模块702,设置为在当前检测的视频帧中,确定出目标人脸所在区域的人脸检测框;平滑模块703,设置为对确定出的所述人脸检测框进行平滑处理;驱动模块704,设置为对经平滑处理后的所述人脸检测框的所在区域进行人脸特征的提取,并基于提取的所述人脸特征进行2D人脸图像的驱动。
在一个例子中,实时人脸图像驱动装置还可以包括去噪模块(图中未示出),在对采集到的视频帧进行人脸检测之前,去噪模块对当前采集到的视频帧进行去噪处理,消除视频中的噪声。
在一个例子中,人脸检测模块701对每一视频帧进行平滑处理,即终端在检测算法的第一层位置增加平滑层(如高斯滤波),对经平滑处理后的视频帧进行人脸识别。
在一个例子中,平滑模块703在对人脸检测框进行平滑处理之前,将检测到的人脸检测框拉仲成宽高相等的方框。
在一个例子中,平滑模块703获取相邻两帧的人脸检测框的IOU(Intersection of Union,交并比)值,即相邻两帧的人脸检测框的交集与并集的比值,在相邻两帧的人脸检测框的IOU值大于预设值的情况下,保持人脸检测框的坐标不变;在相邻两帧的人脸检测框的IOU值小于或等于预设值的情况下,则终端根据IOU值所属的区间范围获取对应的固定常量,然后将固定常量与相邻两帧的人脸检测框的坐标差相乘得到偏移量,最后将偏移量与人脸检测的原坐标值相加得到一个新的坐标,上述新的坐标就是更新后的人脸检测框坐标。其中,固定常量与IOU值区间范围的对应关系由终端预先设定。
在一个例子中,驱动模块704将提取的人脸特征发送至网络侧服务器,供所述网络侧服务器生成具有面部表情和姿态的目标图像,并接收网络侧服务器反馈的所述目标图像并实时显示。
本申请实施例提供的实时人脸图像驱动装置,通过人脸检测加平滑防抖算法加大的扩展了使用者面部特征的采集区域范围,允许用户可以在视频采集设备的大范围可视区域内均可实现对目标形象的驱动,不必要求使用者将面部一直放置在特定的采集框,提升了用户体验,降低用户长时间使用的疲劳,并且通过平滑防抖算法可以有效的消除人脸检测框抖动、使用者面部抖动位移和视频采集设备的抖动位移对于生成效果的影响,保证生成的目标形象不会出现大范围的帧间跳变与抖动,提升实时驱动的效果。
不难发现,本实施例为上述实时人脸图像驱动方法实施例相对应的装置实施例,本实施方式可与上述实时人脸图像驱动的方法实施例互相配合实施。上述实时人脸图像驱动方法实施例提到的相关技术细节在本实施方式中依然有效,为了减少重复,这里不再赘述。相应地,本实施方式中提到的相关技术细节也可应用在上述实时人脸图像驱动方法实施例中。
值得一提的是,本申请上述实施方式中所涉及到的各模块均为逻辑模块,在实际应用中,一个逻辑单元可以是一个物理单元,也可以是一个物理单元的一部分,还可以以多个物理单元的组合实现。此外,为了突出本申请实施例的创新部分,本实施方式中并没有将与解决本申请实施例所提出的技术问题关系不太密切的单元引入,但这并不表明本实施方式中不存在其它的单元。
本申请的实施例还提供一种电子设备,如图8所示,包括至少一个处理器801;以及,与所述至少一个处理器801通信连接的存储器802;其中,所述存储器802存储有可被所述至少一个处理器801执行的指令,所述指令被所述至少一个处理器801执行,以使所述至少一个处理器能够执行上述人脸图像驱动方法。
其中,存储器和处理器采用总线方式连接,总线可以包括任意数量的互联的总线和桥,总线将一个或多个处理器和存储器的各种电路连接在一起。总线还可以将诸如外围设备、稳压器和功率管理电路等之类的各种其他电路连接在一起,这些都是本领域所公知的,因此,本文不再对其进行进一步描述。总线接口在总线和收发机之间提供接口。收发机可以是一个元件,也可以是多个元件,比如多个接收器和发送器,提供用于在传输介质上与各种其他装置通信的单元。经处理器处理的数据通过天线在无线介质上进行传输,进一步,天线还接收数据并将数据传送给处理器。
处理器负责管理总线和通常的处理,还可以提供各种功能,包括定时,外围接口,电压调节、电源管理以及其他控制功能。而存储器可以被用于存储处理器在执行操作时所使用的数据。
上述产品可执行本申请实施例所提供的方法,具备执行方法相应的功能模块和有益效果,未在本实施例中详尽描述的技术细节,可参见本申请实施例所提供的方法。
本申请的实施例还提供一种计算机可读存储介质,存储有计算机程序。计算机程序被处理器执行时实现上述方法实施例。
本领域技术人员可以理解,实现上述实施例方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序存储在一个存储介质中,包括若干指令用以使得一个设备(可以是单片机,芯片等)或处理器(processor)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
上述实施例是提供给本领域普通技术人员来实现和使用本申请实施例的,本领域普通技术人员可以在脱离本申请实施例的发明思想的情况下,对上述实施例做出种种修改或变化,因而本申请实施例的保护范围并不被上述实施例所限,而应该符合权利要求书所提到的创新性特征的最大范围。

Claims (11)

  1. A real-time face image driving method, comprising:
    performing face detection on a currently captured video frame;
    determining, in the currently detected video frame, a face detection box of a region in which a target face is located;
    smoothing the determined face detection box;
    extracting face features from the region covered by the smoothed face detection box, and driving a 2D face image on the basis of the extracted face features.
  2. The real-time face image driving method according to claim 1, wherein before the determining of the face detection box of the region in which the target face is located, the method further comprises:
    in a case where the number of detected face detection boxes is greater than 1, retaining, according to the sizes of the face detection boxes, the N largest face detection boxes, where N is a natural number greater than 0;
    comparing the faces within the N face detection boxes with a preset face of the user, and retaining the face detection box with the highest similarity.
  3. The real-time face image driving method according to claim 1, wherein the smoothing of the determined face detection box comprises:
    acquiring an intersection-over-union (IoU) value of the face detection boxes of two adjacent frames;
    in a case where the IoU value is greater than a preset value, keeping the coordinates of the face detection box unchanged;
    in a case where the IoU value is less than or equal to the preset value, updating the coordinates of the face detection box according to the IoU value and a preset fixed constant.
  4. The real-time face image driving method according to claim 3, wherein the preset fixed constant comprises N constants, different constants respectively corresponding to different IoU value intervals; and
    the updating of the coordinates of the face detection box according to the IoU value and the preset fixed constant comprises:
    acquiring the corresponding fixed constant according to the interval to which the IoU value belongs, wherein the correspondence between the fixed constants and the IoU value intervals is preset;
    multiplying the fixed constant by the coordinate difference of the face detection boxes of the two adjacent frames to obtain an offset, and adding the offset to the original coordinate values of the face detection box to obtain the updated coordinates of the face detection box.
  5. The real-time face image driving method according to claim 1, wherein before the extracting of the face features from the region covered by the smoothed face detection box, the method further comprises:
    adjusting the aspect ratio of the face detection box to 1:1.
  6. The real-time face image driving method according to any one of claims 1 to 5, wherein the performing of face detection on the currently captured video frame comprises:
    smoothing each video frame;
    performing face detection on the smoothed video frame.
  7. The real-time face image driving method according to any one of claims 1 to 5, wherein the driving of the 2D face image on the basis of the extracted face features comprises:
    sending the extracted face features to a network-side server, for the network-side server to generate a target image having a facial expression and pose;
    receiving the target image fed back by the network-side server and displaying it in real time.
  8. The real-time face image driving method according to any one of claims 1 to 5, wherein before the performing of face detection on the currently captured video frame, the method further comprises:
    denoising the currently captured video frame;
    and the performing of face detection on the currently captured video frame comprises:
    performing face detection on the denoised video frame.
  9. A real-time face image driving apparatus, comprising:
    a face detection module, configured to perform face detection on a currently captured video frame;
    a face detection box acquisition module, configured to determine, in the currently detected video frame, a face detection box of a region in which a target face is located;
    a smoothing module, configured to smooth the determined face detection box;
    a driving module, configured to extract face features from the region covered by the smoothed face detection box and to drive a 2D face image on the basis of the extracted face features.
  10. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the real-time face image driving method according to any one of claims 1 to 8.
  11. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the real-time face image driving method according to any one of claims 1 to 8.
PCT/CN2022/119941 2021-11-18 2022-09-20 实时人脸图像驱动方法、装置、电子设备及存储介质 WO2023087891A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111371666.4A CN116189251A (zh) 2021-11-18 2021-11-18 实时人脸图像驱动方法、装置、电子设备及存储介质
CN202111371666.4 2021-11-18

Publications (1)

Publication Number Publication Date
WO2023087891A1 true WO2023087891A1 (zh) 2023-05-25

Family

ID=86396210

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/119941 WO2023087891A1 (zh) 2021-11-18 2022-09-20 实时人脸图像驱动方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN116189251A (zh)
WO (1) WO2023087891A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402691A (zh) * 2010-09-08 2012-04-04 中国科学院自动化研究所 一种对人脸姿态和动作进行跟踪的方法
US20140063236A1 (en) * 2012-08-29 2014-03-06 Xerox Corporation Method and system for automatically recognizing facial expressions via algorithmic periocular localization
CN107886558A (zh) * 2017-11-13 2018-04-06 电子科技大学 一种基于RealSense的人脸表情动画驱动方法
CN111325157A (zh) * 2020-02-24 2020-06-23 高新兴科技集团股份有限公司 人脸抓拍方法、计算机存储介质及电子设备
CN113177515A (zh) * 2021-05-20 2021-07-27 罗普特科技集团股份有限公司 一种基于图像的眼动追踪方法和系统


Also Published As

Publication number Publication date
CN116189251A (zh) 2023-05-30

Similar Documents

Publication Publication Date Title
WO2022001407A1 (zh) 一种摄像头的控制方法及显示设备
US10938725B2 (en) Load balancing multimedia conferencing system, device, and methods
US9424678B1 (en) Method for teleconferencing using 3-D avatar
WO2015074476A1 (zh) 一种图像处理方法、装置和存储介质
JP2009510877A (ja) 顔検出を利用したストリーミングビデオにおける顔アノテーション
US20090257623A1 (en) Generating effects in a webcam application
US11257293B2 (en) Augmented reality method and device fusing image-based target state data and sound-based target state data
US11917158B2 (en) Static video recognition
US11463270B2 (en) System and method for operating an intelligent face framing management system for videoconferencing applications
CN111654715A (zh) 直播的视频处理方法、装置、电子设备及存储介质
US20150222854A1 (en) Enhancing video conferences
CN112672174A (zh) 分屏直播方法、采集设备、播放设备及存储介质
CN114286021B (zh) 渲染方法、装置、服务器、存储介质及程序产品
US11423550B2 (en) Presenter-tracker management in a videoconferencing environment
WO2023087891A1 (zh) 实时人脸图像驱动方法、装置、电子设备及存储介质
CN108320331B (zh) 一种生成用户场景的增强现实视频信息的方法与设备
CN111243585B (zh) 多人场景下的控制方法、装置、设备及存储介质
KR102404130B1 (ko) 텔레 프레젠스 영상 송신 장치, 텔레 프레젠스 영상 수신 장치 및 텔레 프레젠스 영상 제공 시스템
CN114727120A (zh) 直播音频流的获取方法、装置、电子设备及存储介质
CN114339393A (zh) 直播画面的显示处理方法、服务器、设备、系统及介质
US20230186654A1 (en) Systems and methods for detection and display of whiteboard text and/or an active speaker
CN115424156A (zh) 一种虚拟视频会议方法及相关装置
WO2023078103A1 (zh) 多模式的人脸驱动方法、装置、电子设备和存储介质
US20240015264A1 (en) System for broadcasting volumetric videoconferences in 3d animated virtual environment with audio information, and procedure for operating said device
KR20230081243A (ko) Xr 환경 구축을 위한 다중 프로젝션 시스템에서의 roi 추적 및 최적화 기술

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22894429

Country of ref document: EP

Kind code of ref document: A1