WO2017211139A1 - Method and apparatus for implementing video communication - Google Patents

Method and apparatus for implementing video communication

Info

Publication number
WO2017211139A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
frame
image frame
parameter description
scene
Prior art date
Application number
PCT/CN2017/081956
Other languages
French (fr)
Chinese (zh)
Inventor
张殿凯
沈琳
瞿广财
王宁
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2017211139A1 publication Critical patent/WO2017211139A1/en

Classifications

    • H04N7/14 Systems for two-way working (H: Electricity; H04: Electric communication technique; H04N: Pictorial communication, e.g. television; H04N7/00: Television systems)
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • G06V20/00 Scenes; Scene-specific elements (G: Physics; G06: Computing, calculating or counting; G06V: Image or video recognition or understanding)
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Definitions

  • The present disclosure relates to the field of image processing technologies, and in particular to a method and apparatus for implementing video communication.
  • At present, the RCS (Rich Communication Suite) service is gradually emerging.
  • The RCS service is a converged communication service based on an enhanced mobile phone address book, integrating communication methods and functions such as voice, video, messaging, presence and content sharing.
  • Using the RCS service, users can update their own presence information (such as personal pictures, mood phrases, recommended links and status), meet communication needs such as instant messaging, chat and file transfer, share video and pictures within a session, and connect to the network side through standard protocol interfaces for registration, authentication, and audio and video call capabilities.
  • The technical problem to be solved by the present disclosure is to provide a method and apparatus for implementing video communication that can reduce the amount of data transmitted in video communication, saving users traffic and charges.
  • The present disclosure provides a method for implementing video communication, applied to a transmitting end, the method including:
  • collecting images through a camera;
  • performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result;
  • describing each image frame with parameters according to the features extracted from that frame, encoding the parameter description of each image frame, and sending it to the receiving end.
  • Optionally, the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
  • the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • Optionally, the capturing of images by the camera includes: taking, by the camera, an image frame serving as a portrait shooting reference image;
  • and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
  • after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key facial feature points extracted from it as facial feature point reference positions;
  • determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
  • Optionally, the capturing of images by the camera includes: taking, by the camera, an image frame serving as a portrait shooting reference image;
  • and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
  • after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key human body feature points extracted from it as body feature point reference positions;
  • determining the action parameters and action amplitudes corresponding to each image frame according to the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions.
  • Optionally, the capturing of images by the camera includes: taking, by the camera, an image frame serving as a scene shooting reference image;
  • and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
  • using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
  • Optionally, the method further includes: sending, to the receiving end, the image frame of the portrait shooting reference image or the image frame of the scene shooting reference image.
  • An embodiment of the present disclosure provides a method for implementing video communication, applied to a receiving end, the method including:
  • receiving a parameter description of each image frame, and decoding the received parameter description;
  • reconstructing each frame of the image by simulation according to the decoded parameter description, and displaying it.
  • Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • Optionally, the reconstructing of each frame image according to the decoded parameter description includes:
  • for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains human body action parameters and action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains scene parameters, using two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene, and displaying the reconstructed image frame.
  • Optionally, the two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
  • An embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a transmitting end, including:
  • an image acquisition module, configured to collect images through a camera;
  • an image recognition and feature extraction module, configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result;
  • a parameter description and encoding module, configured to describe each image frame with parameters according to the features extracted from that frame, encode the parameter description of each image frame, and send it to the receiving end.
  • Optionally, the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
  • the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
  • and the image recognition and feature extraction module is configured to take, after the portrait shooting reference image frame is captured, the positions of the key facial feature points extracted from it as facial feature point reference positions, and to determine the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
  • Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
  • and the image recognition and feature extraction module is configured to take, after the portrait shooting reference image frame is captured, the positions of the key human body feature points extracted from it as body feature point reference positions, and to determine the action parameters and action amplitudes corresponding to each image frame according to the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions.
  • Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a scene shooting reference image;
  • and the image recognition and feature extraction module is configured to use a deep learning algorithm to understand the scene of each image frame, extract the physical objects that can be described, and perform feature extraction on those objects.
  • Optionally, the apparatus further includes: an image sending module, configured to send, to the receiving end, the image frame of the portrait shooting reference image or the image frame of the scene shooting reference image.
  • An embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a receiving end, including:
  • a parameter description receiving module, configured to receive a parameter description of each image frame and decode the received parameter description;
  • an image reconstruction and display module, configured to reconstruct each frame of the image by simulation according to the decoded parameter description and display it.
  • Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • Optionally, the image reconstruction and display module is configured to: for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, use a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait and display the reconstructed image frame; and/or, when the decoded parameter description contains human body action parameters and action amplitudes, use a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait and display the reconstructed image frame; and/or, when the decoded parameter description contains scene parameters, use two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene and display the reconstructed image frame.
  • Optionally, the two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
  • Compared with the related art, the present disclosure provides a method and apparatus for implementing video communication in which descriptions of the character actions, expressions and scene parameters are transmitted instead of the images themselves, greatly reducing the amount of data transmitted and saving users traffic and charges.
  • In addition, controlling an animation to simulate a person's expressions and actions through expression parameters can make video communication more fun and entertaining, protect the user's personal privacy, and enhance the user experience.
  • FIG. 1 is a flowchart of a method for implementing video communication according to an embodiment of the present disclosure (transmitting end).
  • FIG. 2 is a flowchart of a method for implementing video communication according to an embodiment of the present disclosure (receiving end).
  • FIG. 3 is a schematic diagram of an apparatus for implementing video communication according to an embodiment of the present disclosure (transmitting end).
  • FIG. 4 is a schematic diagram of an apparatus for implementing video communication according to an embodiment of the present disclosure (receiving end).
  • FIG. 5 is a schematic diagram of an image of a user collected at the transmitting end in Example 2 of the present disclosure (left), and of the receiving end simulating the expression of the transmitting-end user with a cartoon image (a big shark) (right).
  • As shown in FIG. 1, an embodiment of the present disclosure provides a method for implementing video communication, applied to a sending end, the method including:
  • S110: collecting images through a camera;
  • S120: performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result;
  • S130: describing each image frame with parameters according to the features extracted from that frame, encoding the parameter description of each image frame, and sending it to the receiving end.
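  • As an illustration only, the following Python sketch shows the overall shape of this S110 to S130 transmit loop. The helper names extract_features, describe_parameters and send are assumptions introduced for the sketch rather than names from the patent; the examples later in this document flesh out what they would do.

```python
import cv2  # OpenCV, used here only for camera capture

def extract_features(frame):
    """S120 placeholder: image recognition plus feature extraction
    (face, body or scene features, per the examples below)."""
    return {}

def describe_parameters(features):
    """S130 placeholder: turn extracted features into a compact
    encoded parameter description (a few bytes per frame)."""
    return b""

def transmit_loop(send):
    camera = cv2.VideoCapture(0)  # S110: collect images through a camera
    while True:
        ok, frame = camera.read()
        if not ok:
            break
        send(describe_parameters(extract_features(frame)))  # bytes, not pixels
    camera.release()
```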
  • The method may also include the following features:
  • the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
  • the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene;
  • the collecting of images by the camera includes:
  • taking, by the camera, an image frame serving as a portrait shooting reference image; and/or
  • taking, by the camera, an image frame serving as a scene shooting reference image.
  • Here, the Adaboost algorithm can be used for face detection to determine the position and size of the face, and the SDM (Supervised Descent Method) algorithm can then be used within the face region to extract the coordinates of the key facial feature points;
  • the key feature points of the face include at least one of the following: the eyes, nose, mouth, eyebrows, and facial contour;
  • the facial expression parameters include at least one of the following: closing the eyes, opening the eyes, opening the mouth, closing the mouth, laughing, raising the head, lowering the head, turning the head to the left, turning the head to the right, tilting the head toward the left shoulder, and tilting the head toward the right shoulder.
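  • For concreteness, here is a hedged Python sketch of this detect-then-locate step. OpenCV's Haar cascade (an AdaBoost-based detector) stands in for the face detection, and OpenCV's LBF facemark model stands in for SDM, since no off-the-shelf SDM implementation is assumed; cv2.face requires the opencv-contrib build, and the lbfmodel.yaml file is assumed to have been obtained separately.

```python
import cv2
import numpy as np

# Haar cascades ship with OpenCV; LBF landmarks need opencv-contrib-python.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
facemark = cv2.face.createFacemarkLBF()
facemark.loadModel("lbfmodel.yaml")  # assumed to be downloaded beforehand

def extract_face_landmarks(frame_bgr):
    """Return (face_rect, 68 landmark points) for the largest detected face,
    or None so the caller can fall back to default action parameters."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep only the largest face, by area (width * height).
    faces = sorted(faces, key=lambda r: r[2] * r[3], reverse=True)[:1]
    ok, landmarks = facemark.fit(gray, np.array(faces))
    if not ok:
        return None
    return faces[0], landmarks[0][0]  # array of 68 (x, y) points
```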
  • A deep learning algorithm or a template matching algorithm may be used for human body detection, extracting features such as the position, direction, shape, and curvature of each key part of the body;
  • the key feature points of the human body include at least one of the following: the head, arms, hands, legs, feet, and waist;
  • a deep learning algorithm can be used to understand the scene and extract the features of each physical object in the scene;
  • the features of a physical object include its shape, size, color, material, and the like.
  • Performing image recognition on each captured image frame and feature extraction according to the image recognition result includes:
  • after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key facial feature points extracted from it as facial feature point reference positions;
  • and determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions, including the following:
  • a closed eye is determined when the distance between the upper and lower eyelids falls below the blink threshold; the blink threshold is the eyelid distance in the portrait shooting reference image frame minus a first error allowance, or the average eyelid distance over several portrait shooting reference image frames minus a second error allowance, where the first error allowance is an empirical value and the second error allowance is the motion recognition sensitivity coefficient multiplied by the variance of the eyelid distance over those reference frames; and/or
  • an open mouth is determined when the distance between the upper and lower lip edges exceeds the mouth-opening threshold; the mouth-opening threshold is the lip-edge distance in the portrait shooting reference image frame plus the first error allowance, or the average lip-edge distance over several portrait shooting reference image frames plus the second error allowance, where the first error allowance is an empirical value and the second error allowance is the motion recognition sensitivity coefficient multiplied by the variance of the lip-edge distance over those reference frames; and/or
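  • A minimal sketch of this threshold logic, using the averaged variant (mean over the reference frames plus or minus the sensitivity coefficient times the variance); the 68-point landmark indices and the value of LAMBDA are illustrative assumptions, not values from the patent.

```python
import numpy as np

LAMBDA = 0.5  # motion recognition sensitivity; smaller = more sensitive

def eyelid_distance(pts):
    # 68-point layout: point 37 is on the upper lid, 41 on the lower (left eye).
    return abs(pts[41][1] - pts[37][1])

def lip_distance(pts):
    # Point 62 is the inner upper lip, 66 the inner lower lip.
    return abs(pts[66][1] - pts[62][1])

def blink_and_mouth_thresholds(reference_landmarks):
    """reference_landmarks: list of 68x2 arrays from N neutral reference frames."""
    eye = np.array([eyelid_distance(p) for p in reference_landmarks])
    lip = np.array([lip_distance(p) for p in reference_landmarks])
    blink_thr = eye.mean() - LAMBDA * eye.var()  # eyes closed if below this
    mouth_thr = lip.mean() + LAMBDA * lip.var()  # mouth open if above this
    return blink_thr, mouth_thr
```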
  • the action and action amplitude of the head are determined according to the distance, angle and direction by which a head position reference point deviates from its reference position;
  • the head position reference point includes at least one of the following: the center point between the two eyes, the eyebrow point, and the tip of the nose;
  • the actions of the head include at least one of the following: raising the head, lowering the head, turning the head to the left, turning the head to the right, tilting the head toward the left shoulder, and tilting the head toward the right shoulder.
  • The method further includes:
  • performing image recognition on each captured image frame and feature extraction according to the image recognition result, including:
  • after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key human body feature points extracted from it as body feature point reference positions, and determining the action parameters and action amplitudes corresponding to each image frame from the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions;
  • the human body actions include, for example: moving the left hand forward, kicking the right leg forward, twisting the waist, gestures, and the like.
  • Performing image recognition on each captured image frame and feature extraction according to the image recognition result includes:
  • using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
  • As shown in FIG. 2, an embodiment of the present disclosure provides a method for implementing video communication, applied to a receiving end, the method including:
  • S210: receiving a parameter description of each image frame, and decoding the received parameter description;
  • S220: reconstructing each frame of the image by simulation according to the decoded parameter description, and displaying it.
  • The method may also include the following features:
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene;
  • the reconstructing of each frame image according to the decoded parameter description includes:
  • for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains human body action parameters and action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains scene parameters, using two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene, and displaying the reconstructed image frame;
  • the two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
  • As shown in FIG. 3, an embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a transmitting end, including:
  • an image acquisition module 301, configured to collect images through a camera;
  • an image recognition and feature extraction module 302, configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result;
  • a parameter description and encoding module 303, configured to describe each image frame with parameters according to the features extracted from that frame, encode the parameter description of each image frame, and send it to the receiving end.
  • The image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
  • the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • The image acquisition module 301 is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
  • the image recognition and feature extraction module 302 is configured to take, after the portrait shooting reference image frame is captured, the positions of the key facial feature points extracted from it as facial feature point reference positions;
  • and the image recognition and feature extraction module 302 is configured to determine the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions, including the following:
  • a closed eye is determined when the distance between the upper and lower eyelids falls below the blink threshold; the blink threshold is the eyelid distance in the portrait shooting reference image frame minus a first error allowance, or the average eyelid distance over several portrait shooting reference image frames minus a second error allowance, where the first error allowance is an empirical value and the second error allowance is the motion recognition sensitivity coefficient multiplied by the variance of the eyelid distance over those reference frames; and/or
  • an open mouth is determined when the distance between the upper and lower lip edges exceeds the mouth-opening threshold; the mouth-opening threshold is the lip-edge distance in the portrait shooting reference image frame plus the first error allowance, or the average lip-edge distance over several portrait shooting reference image frames plus the second error allowance, where the first error allowance is an empirical value and the second error allowance is the motion recognition sensitivity coefficient multiplied by the variance of the lip-edge distance over those reference frames; and/or
  • the action and action amplitude of the head are determined according to the distance, angle and direction by which a head position reference point deviates from its reference position.
  • The image acquisition module 301 is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
  • and the image recognition and feature extraction module 302 is configured to take, after the portrait shooting reference image frame is captured, the positions of the key human body feature points extracted from it as body feature point reference positions, and to determine the action parameters and action amplitudes corresponding to each image frame from the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions.
  • The image acquisition module 301 is configured to take, by the camera, an image frame serving as a scene shooting reference image;
  • and the image recognition and feature extraction module 302 is configured to use a deep learning algorithm to understand the scene of each image frame, extract the physical objects that can be described, and perform feature extraction on those objects.
  • The apparatus further includes:
  • an image sending module 304, configured to send the image frame of the portrait shooting reference image or the image frame of the scene shooting reference image to the receiving end.
  • As shown in FIG. 4, an embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a receiving end, including:
  • a parameter description receiving module 401, configured to receive a parameter description of each image frame and decode the received parameter description;
  • an image reconstruction and display module 402, configured to reconstruct each frame of the image by simulation according to the decoded parameter description and display it.
  • The parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • The image reconstruction and display module 402 is configured to reconstruct each frame image according to the decoded parameter description, including:
  • for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains human body action parameters and action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains scene parameters, using two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene, and displaying the reconstructed image frame.
  • The two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
  • Example 1: Wireless data traffic in mobile video communication is expensive.
  • In this example, parameter description, encoding and transmission can be performed for the facial expression alone. The implementation steps are as follows:
  • Step 1: The image acquisition module of the transmitting end first transmits the initially collected image using conventional encoding, that is, it sends the current face image to the receiving end for subsequent control and display;
  • Step 2: Each subsequently acquired image is sent to the feature extraction module for face detection and location of the key facial feature points, including the mouth, eyebrows, eyes, nose and facial contour;
  • Step 3: The feature extraction module performs face detection with the Adaboost algorithm to determine the position and size of the face, and then uses the SDM (Supervised Descent Method) algorithm to extract the coordinates of the facial feature points within the face region;
  • Step 4: Extract the facial expression action parameters and amplitudes from the feature point coordinates and encode them according to a predetermined rule; if no face is detected or the feature point positions cannot be located, default action parameters are used;
  • taking video frames in which the user is at rest with a neutral expression (no action) as reference frames, the mean position of each facial feature point over the N reference frames is computed as its reference value;
  • for a feature distance $D_X$, measured as $D_X^{(i)}$ in the $i$-th reference frame, the mean $\bar{D}_X$ and the variance $\sigma_{D_X}$ are calculated as follows: $\bar{D}_X = \frac{1}{N}\sum_{i=1}^{N} D_X^{(i)}$ and $\sigma_{D_X} = \frac{1}{N}\sum_{i=1}^{N} \big(D_X^{(i)} - \bar{D}_X\big)^2$;
  • $\sigma_{D_X}$ represents the variance of the feature point positions caused by video noise;
  • the mouth-opening action is determined by the distance $D_2$ between the upper and lower lip edges: if $D_2 > \bar{D}_2 + \lambda\,\sigma_{D_2}$, a mouth-opening action is determined to have occurred; here $\bar{D}_2$ is the mean distance between the upper and lower lip edges over the N reference frames, $\sigma_{D_2}$ is the variance of that distance over the N reference frames, and $\lambda$ is the motion recognition sensitivity coefficient: the smaller its value, the more sensitive the motion recognition result.
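  • The patent leaves the "predetermined rule" for encoding unspecified. As an illustration of how compact such a description can be, here is a hypothetical Python encoding in which each action takes two bytes (a code and a quantized amplitude); the action codes and the layout are invented for this sketch.

```python
import struct

ACTION_CODES = {"eyes_closed": 1, "mouth_open": 2, "head_up": 3,
                "head_down": 4, "turn_left": 5, "turn_right": 6}

def encode_actions(actions):
    """actions: list of (name, amplitude) pairs with amplitude in [0.0, 1.0].
    Returns a small payload: a 1-byte count, then 2 bytes per action."""
    payload = struct.pack("B", len(actions))
    for name, amplitude in actions:
        q = max(0, min(255, int(round(amplitude * 255))))  # quantize to a byte
        payload += struct.pack("BB", ACTION_CODES[name], q)
    return payload

# A whole frame's expression description fits in a few bytes, versus the
# kilobytes a conventionally encoded video frame would need.
frame_payload = encode_actions([("mouth_open", 0.6), ("turn_left", 0.2)])
```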
  • The value of the angle θ1 of the head tilting sideways (i.e., tilting clockwise or counterclockwise toward a shoulder) can be estimated from the horizontal and vertical distances of the eyebrow point, the center point between the eyes, or the tip of the nose from its reference position; the direction can be determined from the relative relationship between the current position and the reference position.
  • The value of the angle θ2 of the head turning left or right can be estimated from the horizontal distance of the eyebrow point, the center point between the eyes, or the tip of the nose from its reference position; the direction of the angle can be determined from the relative relationship between the current position and the reference position.
  • The value of the angle θ3 of the head rotating up or down (i.e., raising or lowering the head) can be estimated from the vertical distance of the eyebrow point, the center point between the eyes, or the tip of the nose from its reference position; the direction of the angle can be determined from the relative relationship between the current position and the reference position.
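  • A hedged sketch of this angle estimation follows. The use of the nose tip and the mid-point between the eyes, and the pixels-to-degrees factor K, are illustrative assumptions; a real implementation would calibrate them.

```python
import math

K = 0.5  # degrees per pixel of offset; would be calibrated in practice

def head_angles(nose_now, nose_ref, eyes_now, eyes_ref):
    """nose_*: (x, y) of the nose tip; eyes_*: (x, y) of the mid-point between
    the eyes. Returns (tilt, yaw, pitch) in degrees, signed so that the sign
    encodes the direction of the movement relative to the reference frame."""
    yaw = K * (nose_now[0] - nose_ref[0])    # theta2: turn left / right
    pitch = K * (nose_now[1] - nose_ref[1])  # theta3: head up / down

    def axis_angle(eyes, nose):
        # Angle of the eyes-to-nose axis, used for the sideways tilt theta1.
        return math.degrees(math.atan2(nose[1] - eyes[1], nose[0] - eyes[0]))

    tilt = axis_angle(eyes_now, nose_now) - axis_angle(eyes_ref, nose_ref)
    return tilt, yaw, pitch
```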
  • Step 5: The receiving end decodes the received action parameters to obtain the various action description parameters;
  • Step 6: The animation control module at the receiving end, according to the parameters obtained in step 4, drives the image obtained in step 1, or a 3D model constructed from that image, to simulate the actions and expressions, and sends the result to the display module for display.
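  • As a hedged companion to the encoding sketch above, the following shows how the receiving end could decode such a payload back into action descriptions for the animation control module; the byte layout is the same illustrative assumption, not the patent's actual predetermined rule.

```python
import struct

CODE_TO_ACTION = {1: "eyes_closed", 2: "mouth_open", 3: "head_up",
                  4: "head_down", 5: "turn_left", 6: "turn_right"}

def decode_actions(payload):
    """Inverse of encode_actions: 1-byte count, then 2 bytes per action."""
    (count,) = struct.unpack_from("B", payload, 0)
    actions = []
    for i in range(count):
        code, q = struct.unpack_from("BB", payload, 1 + 2 * i)
        actions.append((CODE_TO_ACTION[code], q / 255.0))  # de-quantize
    return actions  # e.g. [("mouth_open", 0.6), ("turn_left", 0.2)]
```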
  • Example 2: The two parties do not want the other party to see their real faces, but still want to see each other's facial expressions, so only the expression action parameters are transmitted, and an animation is then controlled at the receiving end.
  • The implementation steps are as follows:
  • Step 1: The image acquisition module of the sending end sends the collected image to the feature extraction module for face detection and location of the key facial feature points, including the mouth, eyebrows, eyes, nose and facial contour;
  • Step 2: The feature extraction module performs face detection with the Adaboost algorithm to determine the position and size of the face, and then uses the SDM (Supervised Descent Method) algorithm to extract the coordinates of the facial feature points within the face region;
  • Step 3: Extract the facial expression action parameters and amplitudes from the feature point coordinates and encode them according to a predetermined rule; if no face is detected or the feature point positions cannot be located, default action parameters are used (for the parameter definitions, see Example 1);
  • Step 4: The receiving end decodes the received expression action parameters to obtain the various expression action description parameters;
  • Step 5: The animation control module of the receiving end, according to the expression action description parameters obtained in step 4, drives a local picture or a cartoon animation model to simulate the expression actions and sends the result to the display module for display;
  • for example, the receiving end can display an animated image of a big shark that mimics the sender's mouth-opening expression, as shown in FIG. 5.
  • Example 3: This example performs parameter description and encoded transmission for body movements.
  • The implementation steps are as follows:
  • Step 1: The image acquisition module of the sending end sends the collected image to the feature extraction module for human body detection and localization of body parts, including the head, arms, legs, and waist;
  • Step 2: The feature extraction module extracts features such as the position, direction, shape, and curvature of each body part using a deep learning algorithm or a template matching algorithm;
  • Step 3: From the features extracted in step 2, recognize and judge the movements of the human body, such as moving the left hand forward, kicking the right leg forward, twisting the waist, and gestures, and encode the action parameters and action amplitudes according to the predetermined rule and send them to the receiving end;
  • Step 4: The receiving end decodes the received data according to the predetermined rule to obtain the description parameters of the various actions and action amplitudes;
  • Step 5: The animation control module of the receiving end drives a local human body picture or model according to the parameters obtained in step 4 to simulate the motion, and displays it in the display module.
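  • As an illustration, the following hypothetical check detects one such action (moving the left hand forward) from pose keypoints; the keypoint names and the movement threshold are assumptions for the sketch, not values from the patent.

```python
def detect_left_hand_forward(kp_now, kp_ref):
    """kp_*: dict mapping keypoint name -> (x, y) in image coordinates.
    Returns (detected, amplitude), where the amplitude is the wrist
    displacement normalized by the torso length in the reference frame."""
    torso = abs(kp_ref["neck"][1] - kp_ref["waist"][1]) or 1.0
    dx = kp_now["left_wrist"][0] - kp_ref["left_wrist"][0]
    amplitude = abs(dx) / torso
    return amplitude > 0.2, min(amplitude, 1.0)  # 0.2 is an assumed threshold
```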
  • Example 4: This example performs parameter description and encoded transmission for the scene.
  • The implementation steps are as follows:
  • Step 1: The image acquisition module of the sending end sends the collected image to the feature extraction module for understanding and extraction of the image content;
  • Step 2: The feature extraction module understands the scene through a deep learning algorithm, extracts the objects that can be described, such as tables, landmark buildings, computers, and dogs, and extracts descriptive features such as, for a table, its shape, size, color, height, material, and position in the image;
  • Step 3: Perform parameter description on the object features extracted in step 2, encode the parameters describing the objects according to a predetermined rule, and transmit them to the receiving end;
  • Step 4: The receiving end decodes the received data according to the predetermined rule and obtains the description parameters of each object in the scene;
  • Step 5: The animation control module of the receiving end selects various object models according to the parameters obtained in step 4 to simulate and construct the sending-end scene, so that the receiving end can experience a scene similar to the sending end's, and displays it in the display module.
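  • For illustration, a hypothetical scene description might look like the following; JSON is used here for readability, whereas the patent's predetermined rule would presumably be a more compact encoding, and all attribute names and values are invented for the sketch.

```python
import json

# Each recognized object becomes a small record of descriptive attributes.
scene_objects = [
    {"type": "table", "shape": "rectangular", "size_m": [1.2, 0.6, 0.75],
     "color": "brown", "material": "wood", "position_px": [320, 410]},
    {"type": "dog", "size_m": [0.6, 0.25, 0.4], "color": "white",
     "position_px": [520, 450]},
]

payload = json.dumps(scene_objects).encode("utf-8")
# A few hundred bytes describe the scene; the receiver picks matching models
# from its library and lays them out at the given positions.
```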
  • The foregoing embodiments provide a method and apparatus for implementing video communication.
  • By transmitting descriptions of the character actions, expressions, and scene parameters in the image instead of the image itself, the amount of data transmitted can be greatly reduced, saving users traffic and charges.
  • In addition, controlling an animation to simulate a person's expressions and actions through expression parameters can make video communication more fun and entertaining, protect the user's personal privacy, and enhance the user experience.
  • An embodiment of the present disclosure further provides an apparatus for implementing video communication, including: a processor; and a memory storing instructions executable by the processor. The processor performs actions according to the instructions stored in the memory, the actions including: acquiring images through a camera; performing image recognition on each captured image frame and feature extraction according to the image recognition result; and describing each image frame with parameters according to the features extracted from that frame, encoding the parameter description of each image frame, and sending it to the receiving end.
  • An embodiment of the present disclosure further provides an apparatus for implementing video communication, including: a processor; and a memory storing instructions executable by the processor. The processor performs actions according to the instructions stored in the memory, the actions including: receiving a parameter description of each image frame, decoding the received parameter description, and reconstructing and displaying each frame of the image by simulation according to the decoded parameter description.
  • An embodiment of the present disclosure further provides a computer storage medium in which executable instructions may be stored, the executable instructions being used to perform the method for implementing video communication of any of the above embodiments.
  • The method for implementing video communication provided by the present disclosure can be applied to terminal devices having video capture and communication functions; by transmitting descriptions of the character actions, expressions and scene parameters in the image instead of the image itself during video communication, it can greatly reduce the amount of transmitted data, saving users traffic and charges.

Abstract

Disclosed is a method for implementing video communication, applied to a sending end. The method comprises: collecting images by means of a camera; performing image recognition on each collected image frame, and performing feature extraction according to the image recognition result; and describing the image frames with parameters according to the features extracted from each frame, encoding the parameter description of each image frame, and then sending it to a receiving end. By means of the disclosure, the amount of data transmitted during video communication can be reduced, and traffic and charges can be reduced for users.

Description

Method and apparatus for implementing video communication
Technical field
The present disclosure relates to the field of image processing technologies, and in particular to a method and apparatus for implementing video communication.
Background
At present, the RCS (Rich Communication Suite) service is gradually emerging. The RCS service is a converged communication service based on an enhanced mobile phone address book, integrating communication methods and functions such as voice, video, messaging, presence and content sharing. Using the RCS service, users can update their own presence information (such as personal pictures, mood phrases, recommended links and status), meet communication needs such as instant messaging, chat and file transfer, share video and pictures within a session, and connect to the network side through standard protocol interfaces for registration, authentication, and audio and video call capabilities.
However, video communication, an important part of the RCS service, has not yet truly become popular in people's lives, due to issues such as expensive wireless traffic charges and personal privacy concerns.
Summary of the invention
The technical problem to be solved by the present disclosure is to provide a method and apparatus for implementing video communication that can reduce the amount of data transmitted in video communication, saving users traffic and charges.
The present disclosure provides a method for implementing video communication, applied to a transmitting end, the method including:
collecting images through a camera;
performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result;
describing each image frame with parameters according to the features extracted from that frame, encoding the parameter description of each image frame, and sending it to the receiving end.
Optionally, the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
Optionally, the collecting of images by the camera includes:
taking, by the camera, an image frame serving as a portrait shooting reference image;
and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key facial feature points extracted from it as facial feature point reference positions;
determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
Optionally, the collecting of images by the camera includes:
taking, by the camera, an image frame serving as a portrait shooting reference image;
and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key human body feature points extracted from it as body feature point reference positions;
determining the action parameters and action amplitudes corresponding to each image frame according to the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions.
Optionally, the collecting of images by the camera includes:
taking, by the camera, an image frame serving as a scene shooting reference image;
and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
Optionally, the method further includes:
sending, to the receiving end, the image frame of the portrait shooting reference image or the image frame of the scene shooting reference image.
An embodiment of the present disclosure provides a method for implementing video communication, applied to a receiving end, the method including:
receiving a parameter description of each image frame, and decoding the received parameter description;
reconstructing each frame of the image by simulation according to the decoded parameter description, and displaying it.
Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
Optionally, the reconstructing of each frame image according to the decoded parameter description includes:
for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains human body action parameters and action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains scene parameters, using two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene, and displaying the reconstructed image frame.
Optionally, the two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
An embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a transmitting end, including:
an image acquisition module, configured to collect images through a camera;
an image recognition and feature extraction module, configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result;
a parameter description and encoding module, configured to describe each image frame with parameters according to the features extracted from that frame, encode the parameter description of each image frame, and send it to the receiving end.
Optionally, the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
and the image recognition and feature extraction module is configured to take, after the portrait shooting reference image frame is captured, the positions of the key facial feature points extracted from it as facial feature point reference positions, and to determine the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
and the image recognition and feature extraction module is configured to take, after the portrait shooting reference image frame is captured, the positions of the key human body feature points extracted from it as body feature point reference positions, and to determine the action parameters and action amplitudes corresponding to each image frame according to the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions.
Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a scene shooting reference image;
and the image recognition and feature extraction module is configured to use a deep learning algorithm to understand the scene of each image frame, extract the physical objects that can be described, and perform feature extraction on those objects.
Optionally, the apparatus further includes:
an image sending module, configured to send, to the receiving end, the image frame of the portrait shooting reference image or the image frame of the scene shooting reference image.
An embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a receiving end, including:
a parameter description receiving module, configured to receive a parameter description of each image frame and decode the received parameter description;
an image reconstruction and display module, configured to reconstruct each frame of the image by simulation according to the decoded parameter description and display it.
Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
Optionally, the image reconstruction and display module is configured to reconstruct each frame image according to the decoded parameter description, including:
for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains human body action parameters and action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains scene parameters, using two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene, and displaying the reconstructed image frame.
Optionally, the two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
Compared with the related art, the present disclosure provides a method and apparatus for implementing video communication in which descriptions of the character actions, expressions and scene parameters in the image are transmitted instead of the images themselves, greatly reducing the amount of data transmitted and saving users traffic and charges. In addition, controlling an animation to simulate a person's expressions and actions through expression parameters can make video communication more fun and entertaining, protect the user's personal privacy, and enhance the user experience.
Brief description of the drawings
FIG. 1 is a flowchart of a method for implementing video communication according to an embodiment of the present disclosure (transmitting end).
FIG. 2 is a flowchart of a method for implementing video communication according to an embodiment of the present disclosure (receiving end).
FIG. 3 is a schematic diagram of an apparatus for implementing video communication according to an embodiment of the present disclosure (transmitting end).
FIG. 4 is a schematic diagram of an apparatus for implementing video communication according to an embodiment of the present disclosure (receiving end).
FIG. 5 is a schematic diagram of an image of a user collected at the transmitting end in Example 2 of the present disclosure (left), and of the receiving end simulating the expression of the transmitting-end user with a cartoon image (a big shark) (right).
具体实施方式detailed description
To make the objectives, technical solutions and advantages of the present disclosure clearer, embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments in the present application and the features in those embodiments may be combined with one another arbitrarily.
As shown in FIG. 1, an embodiment of the present disclosure provides a method for implementing video communication, applied to a transmitting end. The method includes:
S110: capturing images through a camera;
S120: performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result; and
S130: generating a parameter description of each image frame according to the features extracted from the frame, encoding the parameter description of each image frame, and sending it to a receiving end.
The method may further include the following features.
The image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition.
The feature extraction includes at least one of the following: expression feature extraction, motion feature extraction, and extraction of physical-object features in the scene.
The parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
Capturing images through the camera includes:
shooting, through the camera, an image frame serving as a portrait-shooting reference image; and/or
shooting, through the camera, an image frame serving as a scene-shooting reference image.
The Adaboost algorithm may be used for face detection to determine the position and size of the face, after which the SDM (Supervised Descent Method) algorithm is applied in the face region to extract the coordinates of the key facial feature points; a sketch of this detect-then-align flow is given after these notes.
The key feature points of the face include at least one of the following: the eyes, nose, mouth, eyebrows, and face contour.
The facial expression parameters include at least one of the following: eyes closed, eyes open, mouth open, mouth closed, smiling, raising the head, lowering the head, turning the head left, turning the head right, tilting the head toward the left shoulder, and tilting the head toward the right shoulder.
A deep learning algorithm or a template matching algorithm may be used for human body detection, extracting features such as the position, direction, shape and curvature of each key feature point of the human body.
The key feature points of the human body include at least one of the following: the head, arms, hands, legs, feet, and waist.
A deep learning algorithm may be used to understand the scene and extract the features of each physical object in it.
The features of a physical object include its shape, size, color, material, and the like.
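By way of a concrete, non-limiting illustration, the following Python sketch shows the detect-then-align flow described above. It uses the open-source dlib library, whose HOG-based detector and ensemble-regression shape predictor stand in for the Adaboost detector and SDM aligner named in the text (dlib does not implement SDM itself), and it assumes the standard 68-point landmark model file is available locally.

```python
import cv2
import dlib

# Stand-ins for the pipeline above: detect the face first, then regress
# the key feature point coordinates inside the detected region.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def face_landmarks(frame_bgr):
    """Return (x, y) key feature points (eyes, nose, mouth, eyebrows,
    face contour) of the first detected face, or None if no face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if len(faces) == 0:
        return None
    shape = predictor(gray, faces[0])
    return [(p.x, p.y) for p in shape.parts()]
```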
Optionally, performing image recognition on each captured image frame and performing feature extraction according to the image recognition result includes:
after an image frame serving as the portrait-shooting reference image is captured, taking the positions of the key facial feature points extracted from the portrait-shooting reference image frame as facial feature point reference positions; and
determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
When the portrait reference image frame is shot, the user is usually required to look straight ahead, keep the mouth closed without smiling, and stand upright.
Determining the facial expression parameters and expression action amplitudes corresponding to the image frame according to the positional relationship between the key facial feature point positions extracted from each image frame and the facial feature point reference positions includes:
if the distance between the upper and lower eyelids in the current image frame is smaller than an eye-open threshold, determining that the portrait's eyes are closed, and otherwise that they are open; where the eye-open threshold is the distance between the upper and lower eyelids in the portrait-shooting reference image frame minus a first error allowance, or the average of the distances between the upper and lower eyelids in a plurality of portrait-shooting reference image frames minus a second error allowance; the first error allowance is an empirical value, and the second error allowance is a motion recognition sensitivity coefficient multiplied by the variance of the distances between the upper and lower eyelids in the plurality of portrait-shooting reference image frames; and/or
if the distance between the edges of the upper and lower lips in the current image frame is greater than a mouth-closed threshold, determining that the portrait's mouth is open, and otherwise that it is closed; where the mouth-closed threshold is the distance between the edges of the upper and lower lips in the portrait-shooting reference image frame plus a first error allowance, or the average of the distances between the edges of the upper and lower lips in a plurality of portrait-shooting reference image frames plus a second error allowance; the first error allowance is an empirical value, and the second error allowance is a motion recognition sensitivity coefficient multiplied by the variance of the distances between the edges of the upper and lower lips in the plurality of portrait-shooting reference image frames; and/or
if the position of a head-position reference point in the current image frame deviates from the reference position of that head-position reference point, determining the head action and its amplitude according to the distance, angle and direction of the deviation.
The head-position reference point includes at least one of the following: the midpoint between the two eyes, the center of the brow, and the tip of the nose.
The head actions include at least one of the following: raising the head, lowering the head, turning the head left, turning the head right, tilting the head toward the left shoulder, and tilting the head toward the right shoulder.
The method further includes:
sending the portrait-shooting reference image frame to the receiving end.
The descriptions above of action parameters such as eyes open, eyes closed, mouth open and mouth closed are only examples; other expression parameters may be handled in a similar way, and the handling is not limited to the method described above.
Optionally, performing image recognition on each captured image frame and performing feature extraction according to the image recognition result includes:
after an image frame serving as the portrait-shooting reference image is captured, taking the positions of the key human-body feature points extracted from the portrait-shooting reference image frame as human-body feature point reference positions; and
determining the motion parameters and motion amplitudes corresponding to each image frame according to the positional relationship between the key human-body feature point positions extracted from that frame and the human-body feature point reference positions.
The human body motions include, for example, moving the left hand forward, kicking the right leg forward, twisting the waist, and gestures.
Optionally, performing image recognition on each captured image frame and performing feature extraction according to the image recognition result includes:
using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
As shown in FIG. 2, an embodiment of the present disclosure provides a method for implementing video communication, applied to a receiving end. The method includes:
S210: receiving the parameter description of each image frame, and decoding the received parameter description; and
S220: simulating and reconstructing each image frame according to the decoded parameter description, and displaying it.
The method may further include the following features.
Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
Optionally, simulating and reconstructing each image frame according to the decoded parameter description includes:
for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, reconstructing the expression of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains human body motion parameters and motion amplitudes, reconstructing the motion of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains scene parameters, reconstructing each physical object in the transmitting-end scene by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame.
The two-dimensional picture or three-dimensional image includes at least one of the following: a portrait-shooting reference image from the transmitting end, a scene-shooting reference image from the transmitting end, a picture from a picture library, and an animation model from an animation model library. A sketch of such parameter-driven reconstruction is given below.
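As a rough, non-authoritative sketch of how decoded parameters might drive such a picture or model, the Python fragment below assumes a hypothetical animation object exposing blendshape and rotation setters; the parameter field names are illustrative and not part of the disclosed encoding.

```python
from dataclasses import dataclass

@dataclass
class ExpressionParams:
    """Hypothetical decoded per-frame parameters (illustrative fields)."""
    eyes_closed: bool
    mouth_open: bool
    yaw_deg: float    # turn left/right
    pitch_deg: float  # raise/lower the head
    roll_deg: float   # tilt toward a shoulder

def render_frame(model, p: ExpressionParams):
    """Drive a 2D picture or 3D animation model with decoded parameters.
    `model` is an assumed animation object; any backend exposing
    blendshape and rotation setters could play this role."""
    model.set_blendshape("eye_blink", 1.0 if p.eyes_closed else 0.0)
    model.set_blendshape("jaw_open", 1.0 if p.mouth_open else 0.0)
    model.set_rotation(yaw=p.yaw_deg, pitch=p.pitch_deg, roll=p.roll_deg)
```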
As shown in FIG. 3, an embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a transmitting end, including:
an image acquisition module 301, configured to capture images through a camera;
an image recognition and feature extraction module 302, configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result; and
a parameter description and encoding module 303, configured to generate a parameter description of each image frame according to the features extracted from the frame, encode the parameter description of each image frame, and send it to a receiving end.
Optionally, the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
the feature extraction includes at least one of the following: expression feature extraction, motion feature extraction, and extraction of physical-object features in the scene; and
the parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
Optionally, the image acquisition module 301 is configured to capture images through the camera by:
shooting, through the camera, an image frame serving as a portrait-shooting reference image;
and the image recognition and feature extraction module 302 is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key facial feature points extracted from the image frame of the portrait-shooting reference image as facial feature point reference positions; and
determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
Optionally, the image recognition and feature extraction module 302 is configured to determine the facial expression parameters and expression action amplitudes corresponding to the image frame according to the positional relationship between the key facial feature point positions extracted from each image frame and the facial feature point reference positions by:
if the distance between the upper and lower eyelids in the current image frame is smaller than an eye-open threshold, determining that the portrait's eyes are closed, and otherwise that they are open; where the eye-open threshold is the distance between the upper and lower eyelids in the portrait-shooting reference image frame minus a first error allowance, or the average of the distances between the upper and lower eyelids in a plurality of portrait-shooting reference image frames minus a second error allowance; the first error allowance is an empirical value, and the second error allowance is a motion recognition sensitivity coefficient multiplied by the variance of the distances between the upper and lower eyelids in the plurality of portrait-shooting reference image frames; and/or
if the distance between the edges of the upper and lower lips in the current image frame is greater than a mouth-closed threshold, determining that the portrait's mouth is open, and otherwise that it is closed; where the mouth-closed threshold is the distance between the edges of the upper and lower lips in the portrait-shooting reference image frame plus a first error allowance, or the average of the distances between the edges of the upper and lower lips in a plurality of portrait-shooting reference image frames plus a second error allowance; the first error allowance is an empirical value, and the second error allowance is a motion recognition sensitivity coefficient multiplied by the variance of the distances between the edges of the upper and lower lips in the plurality of portrait-shooting reference image frames; and/or
if the position of a head-position reference point in the current image frame deviates from the reference position of that head-position reference point, determining the head action and its amplitude according to the distance, angle and direction of the deviation.
Optionally, the image acquisition module 301 is configured to capture images through the camera by:
shooting, through the camera, an image frame serving as a portrait-shooting reference image;
and the image recognition and feature extraction module 302 is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key human-body feature points extracted from the portrait-shooting reference image frame as human-body feature point reference positions; and
determining the motion parameters and motion amplitudes corresponding to each image frame according to the positional relationship between the key human-body feature point positions extracted from that frame and the human-body feature point reference positions.
Optionally, the image acquisition module 301 is configured to capture images through the camera by:
shooting, through the camera, an image frame serving as a scene-shooting reference image;
and the image recognition and feature extraction module 302 is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
Optionally, the apparatus further includes:
an image sending module 304, configured to send, to the receiving end, an image frame of the portrait-shooting reference image or an image frame of the scene-shooting reference image.
As shown in FIG. 4, an embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a receiving end, including:
a parameter description receiving module 401, configured to receive the parameter description of each image frame and decode the received parameter description; and
an image reconstruction and display module 402, configured to simulate and reconstruct each image frame according to the decoded parameter description and display it.
Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
Optionally, the image reconstruction and display module 402 is configured to simulate and reconstruct each image frame according to the decoded parameter description by:
for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, reconstructing the expression of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains human body motion parameters and motion amplitudes, reconstructing the motion of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains scene parameters, reconstructing each physical object in the transmitting-end scene by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame.
Optionally, the two-dimensional picture or three-dimensional image includes at least one of the following: a portrait-shooting reference image from the transmitting end, a scene-shooting reference image from the transmitting end, a picture from a picture library, and an animation model from an animation model library.
Example 1
Wireless data traffic is expensive in mobile video communication, so this example parameterizes, encodes and transmits only the facial expressions. The implementation steps are as follows:
Step 1: the image acquisition module at the transmitting end first transmits a captured image using conventional coding, that is, it sends the current face image to the receiving end for the subsequent controlled display.
Step 2: subsequently captured images are sent to the feature extraction module for face detection and localization of the key facial feature points, including the mouth, eyebrows, eyes, nose and face contour.
Step 3: the feature extraction module performs face detection with the Adaboost algorithm to determine the position and size of the face, and then applies the SDM (Supervised Descent Method) algorithm in the face region to extract the coordinates of the facial feature points.
Step 4: the facial expression action parameters and amplitudes are extracted from the feature point coordinates and encoded according to a predetermined rule; if no face is detected or the feature point positions cannot be located, default action parameters are used.
The calculation of several action parameters is illustrated below.

Video frames in which the user is at rest with a normal expression (no action) are taken as reference frames, and the mean $\overline{DX}$ and average deviation $\sigma_{DX}$ of the facial feature point positions within the N reference frames are computed as follows:

$\overline{DX} = \frac{1}{N}\sum_{i=1}^{N} DX_i$

$\sigma_{DX} = \frac{1}{N}\sum_{i=1}^{N}\left|DX_i - \overline{DX}\right|$

where $\overline{DX}$ denotes the mean of the feature point positions and $\sigma_{DX}$ characterizes the fluctuation of the feature point positions caused by video noise.

Several action calculations are introduced below; action parameters other than the following can be computed and extracted in a similar way:

1) The eye-closing action is determined from the distance $D1$ between the upper and lower eyelids: if $D1 < \overline{D1} - \alpha\,\sigma_{D1}$, an eye-closing action is determined to have occurred, where $\overline{D1}$ is the mean distance between the upper and lower eyelids within the N reference frames, $\sigma_{D1}$ is the deviation of that distance within the N reference frames, and $\alpha$ is the sensitivity coefficient of motion recognition; the smaller its value, the more sensitive the recognition result.

2) The mouth-opening action is determined from the distance $D2$ between the edges of the upper and lower lips: if $D2 > \overline{D2} + \alpha\,\sigma_{D2}$, a mouth-opening action is determined to have occurred, where $\overline{D2}$ is the mean distance between the edges of the upper and lower lips within the N reference frames, $\sigma_{D2}$ is the deviation of that distance within the N reference frames, and $\alpha$ is the sensitivity coefficient of motion recognition; the smaller its value, the more sensitive the recognition result.
3) The magnitude of the head side-tilt angle $\theta_1$ (clockwise or counterclockwise tilt) can be estimated from the horizontal and vertical distances by which the brow center, the midpoint between the eyes, or the nose tip deviates from its reference position; the direction of the angle can be determined from the relative relationship between the current position and the reference position.
4) The magnitude of the head left-right turn angle $\theta_2$ (turning left or right) can be estimated from the horizontal distance by which the brow center, the midpoint between the eyes, or the nose tip deviates from its reference position; the direction of the angle can be determined from the relative relationship between the current position and the reference position.
5) The magnitude of the head up-down rotation angle $\theta_3$ (raising or lowering the head) can be estimated from the vertical distance by which the brow center, the midpoint between the eyes, or the nose tip deviates from its reference position; the direction of the angle can be determined from the relative relationship between the current position and the reference position. A sketch of the action and angle calculations in 1) to 5) is given below.
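The following is a minimal Python sketch of calculations 1) to 5), assuming the average-deviation form of $\sigma$ reconstructed above; the sensitivity coefficient value, the linear pixels-per-degree scale, and the use of the eye-to-eye slope for the side-tilt angle are illustrative assumptions, not part of the disclosure.

```python
import math
import numpy as np

ALPHA = 1.5  # sensitivity coefficient; smaller -> more sensitive (assumed value)

def reference_stats(distances):
    """Mean and average deviation of a feature distance (e.g. D1 or D2)
    over the N neutral reference frames, per the formulas above."""
    d = np.asarray(distances, dtype=float)
    mean = d.mean()
    return mean, np.abs(d - mean).mean()

def is_eye_closed(d1, d1_mean, d1_dev, alpha=ALPHA):
    # 1) eye-closing: D1 < mean(D1) - alpha * sigma(D1)
    return d1 < d1_mean - alpha * d1_dev

def is_mouth_open(d2, d2_mean, d2_dev, alpha=ALPHA):
    # 2) mouth-opening: D2 > mean(D2) + alpha * sigma(D2)
    return d2 > d2_mean + alpha * d2_dev

def head_angles(nose, nose_ref, eye_left, eye_right, px_per_deg=4.0):
    """3)-5): rough head-angle estimates. Yaw and pitch follow the text
    (horizontal / vertical drift of the nose tip from its reference
    position); roll is taken from the slope of the eye-to-eye line, an
    equivalent of the side-tilt estimate."""
    yaw = (nose[0] - nose_ref[0]) / px_per_deg    # theta2: + right, - left
    pitch = (nose_ref[1] - nose[1]) / px_per_deg  # theta3: + up (image y grows downward)
    roll = math.degrees(math.atan2(eye_right[1] - eye_left[1],
                                   eye_right[0] - eye_left[0]))  # theta1
    return yaw, pitch, roll
```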
Step 5: the receiving end decodes the received action parameters to obtain the various action description parameters.
Step 6: the animation control module at the receiving end, according to the parameters obtained in Step 4, drives the picture obtained in Step 1, or a 3D model built from the image obtained in Step 1, to simulate the actions and expressions, and sends the result to the display module for display.
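The "predetermined rule" used for encoding in Steps 4 and 5 is not specified here. Purely to illustrate the bandwidth argument, the sketch below packs one frame's expression parameters into 7 bytes, versus tens of kilobytes for a conventionally coded video frame; the layout and field set are assumptions.

```python
import struct

FLAG_EYES_CLOSED = 0x01
FLAG_MOUTH_OPEN = 0x02

def encode_params(eyes_closed, mouth_open, yaw, pitch, roll):
    """One flag byte plus three signed 16-bit angles = 7 bytes/frame."""
    flags = ((FLAG_EYES_CLOSED if eyes_closed else 0)
             | (FLAG_MOUTH_OPEN if mouth_open else 0))
    # Angles carried in hundredths of a degree.
    return struct.pack("<Bhhh", flags,
                       int(yaw * 100), int(pitch * 100), int(roll * 100))

def decode_params(payload):
    flags, yaw, pitch, roll = struct.unpack("<Bhhh", payload)
    return (bool(flags & FLAG_EYES_CLOSED),
            bool(flags & FLAG_MOUTH_OPEN),
            yaw / 100.0, pitch / 100.0, roll / 100.0)
```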
Example 2
In some video communication scenarios, to protect privacy and add fun, the two parties do not want each other to see their real faces and only want to convey their expressions and actions. In that case only the expression action parameters are transmitted, and an animation is driven at the receiving end to simulate the expressions and actions. The implementation steps are as follows:
Step 1: the image acquisition module at the transmitting end sends the captured image to the feature extraction module for face detection and localization of the key facial feature points, including the mouth, eyebrows, eyes, nose and face contour.
Step 2: the feature extraction module performs face detection with the Adaboost algorithm to determine the position and size of the face, and then applies the SDM (Supervised Descent Method) algorithm in the face region to extract the coordinates of the facial feature points.
Step 3: the facial expression action parameters and amplitudes are extracted from the feature point coordinates and encoded according to a predetermined rule; if no face is detected or the feature point positions cannot be located, default action parameters are used; the parameters are defined as in Example 1.
Step 4: the receiving end decodes the received expression action parameters to obtain the various expression action description parameters.
Step 5: the animation control module at the receiving end, according to the expression action description parameters obtained in Step 4, drives a local picture or cartoon animation model to simulate the expressions and actions, and sends the result to the display module for display.
As shown in FIG. 5, the receiving end may display an animated image of a big shark to simulate the wide-open mouth of the transmitting-end user.
Example 3
In this example, body motions are parameterized, encoded and transmitted. The implementation steps are as follows:
Step 1: the image acquisition module at the transmitting end sends the captured image to the feature extraction module for human body detection and localization of body parts, including the head, arms, legs and waist.
Step 2: the feature extraction module extracts features such as the position, direction, shape and curvature of each body part through a deep learning algorithm or a template matching algorithm.
Step 3: the body motions, such as moving the left hand forward, kicking the right leg forward, twisting the waist and gestures, are recognized and judged from the features extracted in Step 2, and these motion parameters and motion amplitudes are encoded according to a predetermined rule and sent to the receiving end.
Step 4: the receiving end decodes the received data according to the predetermined rule to obtain the description parameters of the various motions and motion amplitudes.
Step 5: the animation control module at the receiving end, according to the parameters obtained in Step 4, drives a local human body image or model to simulate the motions, and displays the result on the display module.
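As an illustration of the recognition in Step 3, the sketch below classifies one body action from 2D keypoints. The keypoint dictionaries are assumed to come from any pose estimator that reports joint positions, and the pixel threshold is an arbitrary illustrative value.

```python
def detect_left_hand_forward(keypoints, ref_keypoints, threshold_px=30.0):
    """Classify a 'left hand forward' action by comparing the left-hand
    keypoint with its reference (neutral standing) position.
    `keypoints` / `ref_keypoints` map part names (head, hands, legs,
    waist, ...) to (x, y) positions. Returns (detected, amplitude)."""
    x, y = keypoints["left_hand"]
    rx, ry = ref_keypoints["left_hand"]
    amplitude = ((x - rx) ** 2 + (y - ry) ** 2) ** 0.5
    return amplitude > threshold_px, amplitude
```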
Example 4
In this example, the scene is parameterized, encoded and transmitted. The implementation steps are as follows:
Step 1: the image acquisition module at the transmitting end sends the captured image to the feature extraction module for image content understanding and extraction.
Step 2: the feature extraction module understands the scene through a deep learning algorithm, extracts objects that can be described, such as tables, landmark buildings, computers and dogs, and extracts and describes their features, for example a table's shape, size, color, height, material, and position in the image.
Step 3: the physical-object features extracted in Step 2 are turned into parameter descriptions, and these parameters describing the objects are encoded according to a predetermined rule and sent to the receiving end.
Step 4: the receiving end decodes the received data according to the predetermined rule to obtain the description parameters of each physical object in the scene.
Step 5: the animation control module at the receiving end, according to the parameters obtained in Step 4, selects local models of the various objects to simulate and construct the transmitting-end scene, so that the receiving end can experience a scene roughly the same as the transmitting end's, and displays it on the display module.
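As an illustration of Steps 3 and 4, the sketch below serializes the object descriptors; JSON stands in for the unspecified "predetermined rule" (a real system would likely choose a more compact binary encoding), and the field set mirrors the features named in Step 2.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SceneObject:
    """Illustrative descriptor mirroring the object features in Step 2."""
    label: str      # e.g. "table", "landmark building", "computer", "dog"
    shape: str
    size: str
    color: str
    material: str
    bbox: tuple     # (x, y, w, h) position in the image

def encode_scene(objects):
    # Serialize all object descriptors for one frame.
    return json.dumps([asdict(o) for o in objects]).encode("utf-8")

def decode_scene(payload):
    # Recover descriptors at the receiving end to drive model selection.
    return [SceneObject(**d) for d in json.loads(payload.decode("utf-8"))]
```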
The method and apparatus for implementing video communication provided by the above embodiments transmit descriptions of the person's motions and expressions and of the scene parameters in the image, instead of the image itself, which greatly reduces the amount of transmitted data and saves traffic and tariff costs for the user. In addition, driving an animation with the expression parameters to simulate the person's expressions and motions makes video communication more interesting and entertaining, protects the user's personal privacy, and improves the user experience.
In another embodiment of the present disclosure, a device for implementing video communication is further provided, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to perform, according to the instructions stored in the memory, actions including: capturing images through a camera; performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result; and generating a parameter description of each image frame according to the features extracted from the frame, encoding the parameter description of each image frame, and sending it to a receiving end.
In yet another embodiment of the present disclosure, a device for implementing video communication is further provided, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to perform, according to the instructions stored in the memory, actions including: receiving the parameter description of each image frame and decoding the received parameter description; and simulating and reconstructing each image frame according to the decoded parameter description and displaying it.
An embodiment of the present disclosure further provides a computer storage medium storing executable instructions for performing the method for implementing video communication of any one of the above embodiments.
Those of ordinary skill in the art will understand that all or some of the steps of the above methods may be completed by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented with one or more integrated circuits; accordingly, each module/unit in the above embodiments may be implemented in the form of hardware or in the form of a software functional module. The present disclosure is not limited to any specific combination of hardware and software.
It should be noted that the present disclosure may have various other embodiments. Without departing from the spirit and essence of the present disclosure, those skilled in the art may make various corresponding changes and modifications according to the present disclosure, and all such changes and modifications shall fall within the protection scope of the appended claims.
INDUSTRIAL APPLICABILITY
The method for implementing video communication provided by the present disclosure can be applied to terminal devices having video capture and communication functions. By transmitting descriptions of the person's motions and expressions and of the scene parameters in the image, instead of the image itself, during video communication, it can greatly reduce the amount of transmitted data and save traffic and tariff costs for the user.

Claims (20)

  1. A method for implementing video communication, applied to a transmitting end, the method comprising:
    capturing images through a camera;
    performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result; and
    generating a parameter description of each image frame according to the features extracted from the frame, encoding the parameter description of each image frame, and sending it to a receiving end.
  2. The method according to claim 1, wherein:
    the image recognition comprises at least one of the following: face recognition, human body recognition, and scene recognition;
    the feature extraction comprises at least one of the following: expression feature extraction, motion feature extraction, and extraction of physical-object features in the scene; and
    the parameter description comprises at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
  3. The method according to claim 2, wherein:
    capturing images through the camera comprises:
    shooting, through the camera, an image frame serving as a portrait-shooting reference image; and
    performing image recognition on each captured image frame and performing feature extraction according to the image recognition result comprises:
    after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key facial feature points extracted from the image frame of the portrait-shooting reference image as facial feature point reference positions; and
    determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
  4. The method according to claim 2, wherein:
    capturing images through the camera comprises:
    shooting, through the camera, an image frame serving as a portrait-shooting reference image; and
    performing image recognition on each captured image frame and performing feature extraction according to the image recognition result comprises:
    after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key human-body feature points extracted from the portrait-shooting reference image frame as human-body feature point reference positions; and
    determining the motion parameters and motion amplitudes corresponding to each image frame according to the positional relationship between the key human-body feature point positions extracted from that frame and the human-body feature point reference positions.
  5. The method according to claim 2, wherein:
    capturing images through the camera comprises:
    shooting, through the camera, an image frame serving as a scene-shooting reference image; and
    performing image recognition on each captured image frame and performing feature extraction according to the image recognition result comprises:
    using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
  6. The method according to any one of claims 2 to 5, further comprising:
    sending, to the receiving end, an image frame of the portrait-shooting reference image or an image frame of the scene-shooting reference image.
  7. A method for implementing video communication, applied to a receiving end, the method comprising:
    receiving a parameter description of each image frame, and decoding the received parameter description; and
    simulating and reconstructing each image frame according to the decoded parameter description, and displaying it.
  8. The method according to claim 7, wherein:
    the parameter description comprises at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
  9. The method according to claim 8, wherein simulating and reconstructing each image frame according to the decoded parameter description comprises:
    for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, reconstructing the expression of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
    for each image frame, when the decoded parameter description contains human body motion parameters and motion amplitudes, reconstructing the motion of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
    for each image frame, when the decoded parameter description contains scene parameters, reconstructing each physical object in the transmitting-end scene by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame.
  10. The method according to claim 9, wherein:
    the two-dimensional picture or three-dimensional image comprises at least one of the following: a portrait-shooting reference image from the transmitting end, a scene-shooting reference image from the transmitting end, a picture from a picture library, and an animation model from an animation model library.
  11. An apparatus for implementing video communication, applied to a transmitting end, comprising:
    an image acquisition module, configured to capture images through a camera;
    an image recognition and feature extraction module, configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result; and
    a parameter description and encoding module, configured to generate a parameter description of each image frame according to the features extracted from the frame, encode the parameter description of each image frame, and send it to a receiving end.
  12. The apparatus according to claim 11, wherein:
    the image recognition comprises at least one of the following: face recognition, human body recognition, and scene recognition;
    the feature extraction comprises at least one of the following: expression feature extraction, motion feature extraction, and extraction of physical-object features in the scene; and
    the parameter description comprises at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
  13. The apparatus according to claim 12, wherein:
    the image acquisition module is configured to capture images through the camera by:
    shooting, through the camera, an image frame serving as a portrait-shooting reference image; and
    the image recognition and feature extraction module is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
    after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key facial feature points extracted from the image frame of the portrait-shooting reference image as facial feature point reference positions; and
    determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
  14. The apparatus according to claim 12, wherein:
    the image acquisition module is configured to capture images through the camera by:
    shooting, through the camera, an image frame serving as a portrait-shooting reference image; and
    the image recognition and feature extraction module is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
    after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key human-body feature points extracted from the portrait-shooting reference image frame as human-body feature point reference positions; and
    determining the motion parameters and motion amplitudes corresponding to each image frame according to the positional relationship between the key human-body feature point positions extracted from that frame and the human-body feature point reference positions.
  15. The apparatus according to claim 12, wherein:
    the image acquisition module is configured to capture images through the camera by:
    shooting, through the camera, an image frame serving as a scene-shooting reference image; and
    the image recognition and feature extraction module is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
    using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
  16. The apparatus according to any one of claims 12 to 15, further comprising:
    an image sending module, configured to send, to the receiving end, an image frame of the portrait-shooting reference image or an image frame of the scene-shooting reference image.
  17. An apparatus for implementing video communication, applied to a receiving end, comprising:
    a parameter description receiving module, configured to receive the parameter description of each image frame and decode the received parameter description; and
    an image reconstruction and display module, configured to simulate and reconstruct each image frame according to the decoded parameter description and display it.
  18. The apparatus according to claim 17, wherein:
    the parameter description comprises at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
  19. The apparatus according to claim 18, wherein the image reconstruction and display module is configured to simulate and reconstruct each image frame according to the decoded parameter description by:
    for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, reconstructing the expression of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
    for each image frame, when the decoded parameter description contains human body motion parameters and motion amplitudes, reconstructing the motion of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
    for each image frame, when the decoded parameter description contains scene parameters, reconstructing each physical object in the transmitting-end scene by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame.
  20. The apparatus according to claim 19, wherein:
    the two-dimensional picture or three-dimensional image comprises at least one of the following: a portrait-shooting reference image from the transmitting end, a scene-shooting reference image from the transmitting end, a picture from a picture library, and an animation model from an animation model library.
PCT/CN2017/081956 2016-06-06 2017-04-26 Method and apparatus for implementing video communication WO2017211139A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610394389.1A CN107465885A (en) 2016-06-06 2016-06-06 A kind of method and apparatus for realizing video communication
CN201610394389.1 2016-06-06

Publications (1)

Publication Number Publication Date
WO2017211139A1 true WO2017211139A1 (en) 2017-12-14

Family

ID=60544535

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/081956 WO2017211139A1 (en) 2016-06-06 2017-04-26 Method and apparatus for implementing video communication

Country Status (2)

Country Link
CN (1) CN107465885A (en)
WO (1) WO2017211139A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921773A (en) * 2018-07-04 2018-11-30 百度在线网络技术(北京)有限公司 Human body tracking processing method, device, equipment and system
CN112235531A (en) * 2020-10-15 2021-01-15 北京字节跳动网络技术有限公司 Video processing method, device, terminal and storage medium

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102256110B1 (en) * 2017-05-26 2021-05-26 라인 가부시키가이샤 Method for image compression and method for image restoration
CN110276232A (en) * 2018-03-16 2019-09-24 东方联合动画有限公司 A kind of data processing method based on social scene, system
CN110799986B (en) * 2018-04-25 2020-09-18 北京嘀嘀无限科技发展有限公司 System and method for blink action recognition based on facial feature points
CN110769323B (en) * 2018-07-27 2021-06-18 Tcl科技集团股份有限公司 Video communication method, system, device and terminal equipment
CN109151430B (en) * 2018-09-30 2020-07-28 Oppo广东移动通信有限公司 Data processing method, terminal, server and computer storage medium
CN109302598B (en) * 2018-09-30 2021-08-31 Oppo广东移动通信有限公司 Data processing method, terminal, server and computer storage medium
CN109246409B (en) * 2018-09-30 2020-08-04 Oppo广东移动通信有限公司 Data processing method, terminal, server and computer storage medium
CN111131744B (en) * 2019-12-26 2021-04-20 杭州当虹科技股份有限公司 Privacy protection method based on video communication
CN112804245B (en) * 2021-01-26 2023-09-26 杨文龙 Data transmission optimization method, device and system suitable for video transmission

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101674363A (en) * 2009-09-23 2010-03-17 中兴通讯股份有限公司 Mobile equipment and talking method
CN102271241A (en) * 2011-09-02 2011-12-07 北京邮电大学 Image communication method and system based on facial expression/action recognition
CN103369289A (en) * 2012-03-29 2013-10-23 深圳市腾讯计算机系统有限公司 Communication method of video simulation image and device
CN103647922A (en) * 2013-12-20 2014-03-19 百度在线网络技术(北京)有限公司 Virtual video call method and terminals
CN104766041A (en) * 2014-01-07 2015-07-08 腾讯科技(深圳)有限公司 Image recognition method, device and system
CN104935860A (en) * 2014-03-18 2015-09-23 北京三星通信技术研究有限公司 Method and device for realizing video calling

Also Published As

Publication number Publication date
CN107465885A (en) 2017-12-12

Similar Documents

Publication Publication Date Title
WO2017211139A1 (en) Method and apparatus for implementing video communication
US11595617B2 (en) Communication using interactive avatars
US20170310934A1 (en) System and method for communication using interactive avatar
KR102506738B1 (en) snow texture inpainting
US11836866B2 (en) Deforming real-world object using an external mesh
JP7101749B2 (en) Mediation devices and methods, as well as computer-readable recording media {MEDIATING APPARATUS, METHOD AND COMPANY REDABLE RECORDING MEDIA FORM THEREOF}
US11790614B2 (en) Inferring intent from pose and speech input
US20220125337A1 (en) Adaptive skeletal joint smoothing
KR20230003555A (en) Texture-based pose validation
US20240062500A1 (en) Generating ground truths for machine learning
US20230120037A1 (en) True size eyewear in real time
WO2023121896A1 (en) Real-time motion and appearance transfer
WO2023121897A1 (en) Real-time garment exchange
WO2022146799A1 (en) Compressing image-to-image models
CN112804245A (en) Data transmission optimization method, device and system suitable for video transmission
KR20200134623A (en) Apparatus and Method for providing facial motion retargeting of 3 dimensional virtual character
US20240070950A1 (en) Avatar call on an eyewear device
US20240007585A1 (en) Background replacement using neural radiance field
KR20240049844A (en) Control AR games on fashion items
CN115499612A (en) Video communication method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17809579

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17809579

Country of ref document: EP

Kind code of ref document: A1