WO2017211139A1 - Method and apparatus for implementing video communication - Google Patents

Method and apparatus for implementing video communication

Info

Publication number
WO2017211139A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
frame
image frame
parameter description
scene
Prior art date
Application number
PCT/CN2017/081956
Other languages
English (en)
Chinese (zh)
Inventor
张殿凯
沈琳
瞿广财
王宁
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2017211139A1

Classifications

    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/14 Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Definitions

  • the present disclosure relates to the field of image processing technologies, and in particular, to a method and apparatus for implementing video communication.
  • RCS (Rich Communication Suite)
  • The RCS service is a converged communication service based on an enhanced mobile phone address book, which integrates communication methods and functions such as voice, video, messaging, presence, and content sharing.
  • Through RCS, users can update their own presence information (such as personal pictures, mood phrases, shared links, and status) and meet various communication needs during a session, such as instant messaging, chat, file transfer, picture sharing, and video sharing; the terminal connects to the network side through standard protocol interfaces to realize registration, authentication, and audio and video call capabilities.
  • the technical problem to be solved by the present disclosure is to provide a method and apparatus for implementing video communication, which can reduce the amount of data transmitted by video communication, and save traffic and tariffs for users.
  • the present disclosure provides a method for implementing video communication, which is applied to a transmitting end, and the method includes:
  • Image recognition is performed on each frame of image captured, and feature extraction is performed according to the image recognition result;
  • the parameter description of the image frame is performed according to the feature extracted from each frame of the image frame, and the parameter description of each frame of the image frame is encoded and sent to the receiving end.
  • the image recognition includes at least one of the following: face recognition, human body recognition, scene recognition;
  • the feature extraction includes at least one of the following: an expression feature extraction, an action feature extraction, and a physical feature extraction in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a physical feature parameter description in the scene.
  • the capturing images by the camera includes:
  • Image recognition is performed on each frame of the image frame collected, and feature extraction is performed according to the image recognition result, including:
  • a position of a face key feature point extracted from an image frame of the portrait shooting reference image is taken as a face feature point reference position;
  • the capturing images by the camera includes:
  • Image recognition is performed on each frame of the image frame collected, and feature extraction is performed according to the image recognition result, including:
  • a position of a human key feature point extracted from the portrait image reference image frame is used as a human body feature point reference position;
  • the capturing images by the camera includes:
  • Image recognition is performed on each frame of the image frame collected, and feature extraction is performed according to the image recognition result, including:
  • a deep learning algorithm is used to understand the scene of each image frame, extract physical objects that can be described, and perform feature extraction on the physical objects.
  • the method further includes:
  • An image frame of the portrait shooting reference image or an image frame of the scene capturing reference image is transmitted to the receiving end.
  • An embodiment of the present disclosure provides a method for implementing video communication, which is applied to a receiving end, and the method includes:
  • Each frame of the image is reconstructed and displayed according to the decoded parameter description.
  • the parameter description includes at least one of the following: a facial expression parameter description, a human motion parameter description, and a physical feature parameter description in the scene.
  • the reconstructing each frame image according to the decoded parameter description comprises:
  • a two-dimensional picture or a three-dimensional image is used to simulate and reconstruct the expression of the sender's portrait, and the image frame after the simulation and reconstruction is displayed; and/or
  • the motion of the image of the sender is reconstructed using a two-dimensional image or a three-dimensional image, and the image frame after the simulation reconstruction is displayed;
  • the two-dimensional picture or the three-dimensional image is used to simulate and reconstruct each object in the scene of the transmitting end, and the image frame after the simulation is displayed.
  • the two-dimensional picture or the three-dimensional image includes at least one of the following: a transmitting portrait shooting reference image, a transmitting end scene shooting reference image, a picture in the picture library, and an animation model in the animation model library.
  • An embodiment of the present disclosure provides an apparatus for implementing video communication, which is applied to a transmitting end, and includes:
  • An image acquisition module for collecting images through a camera
  • the image recognition and feature extraction module is configured to perform image recognition on each frame of the image frame collected, and perform feature extraction according to the image recognition result;
  • the parameter description and encoding module is configured to perform parameter description on the image frame according to the feature extracted from each frame of the image frame, encode the parameter description of each frame of the image frame, and send the parameter description to the receiving end.
  • the image recognition includes at least one of the following: face recognition, human body recognition, scene recognition;
  • the feature extraction includes at least one of the following: an expression feature extraction, an action feature extraction, and a physical feature extraction in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a physical feature parameter description in the scene.
  • an image acquisition module is configured to collect images by using a camera, including:
  • the image recognition and feature extraction module is configured to perform image recognition on each frame of the image frame collected, and perform feature extraction according to the image recognition result, including:
  • a position of a face key feature point extracted from an image frame of the portrait shooting reference image is taken as a face feature point reference position;
  • an image acquisition module is configured to collect images by using a camera, including:
  • the image recognition and feature extraction module is configured to perform image recognition on each frame of the image frame collected, and perform feature extraction according to the image recognition result, including:
  • a position of a human key feature point extracted from the portrait image reference image frame is used as a human body feature point reference position;
  • an image acquisition module is configured to collect images by using a camera, including:
  • the image recognition and feature extraction module is configured to perform image recognition on each frame of the image frame collected, and perform feature extraction according to the image recognition result, including:
  • a deep learning algorithm is used to understand the scene of each image frame, extract physical objects that can be described, and perform feature extraction on the physical objects.
  • the device further includes:
  • an image sending module configured to send, to the receiving end, an image frame of the portrait shooting reference image or an image frame of the scene capturing reference image.
  • An embodiment of the present disclosure provides an apparatus for implementing video communication, which is applied to a receiving end, and includes:
  • a parameter description receiving module configured to receive a parameter description of each frame of the image frame, and decode the received parameter description
  • the image reconstruction and display module is configured to simulate and reconstruct each frame image according to the decoded parameter description, and display the image.
  • the parameter description includes at least one of the following: a facial expression parameter description, a human motion parameter description, and a physical feature parameter description in the scene.
  • an image reconstruction and display module is configured to reconstruct each frame image according to the decoded parameter description, including:
  • a two-dimensional picture or a three-dimensional image is used to simulate and reconstruct the expression of the sender's portrait, and the image frame after the simulation and reconstruction is displayed; and/or
  • the motion of the image of the sender is reconstructed using a two-dimensional image or a three-dimensional image, and the image frame after the simulation reconstruction is displayed;
  • the two-dimensional picture or the three-dimensional image is used to simulate and reconstruct each object in the scene of the transmitting end, and the image frame after the simulation is displayed.
  • the two-dimensional picture or the three-dimensional image includes at least one of the following: a transmitting portrait shooting reference image, a transmitting end scene shooting reference image, a picture in the picture library, and an animation model in the animation model library.
  • the present disclosure provides a method and apparatus for implementing video communication, in which video communication can greatly reduce the amount of data transmitted by transmitting character motions, expressions, and scene parameter descriptions instead of images themselves, thereby saving traffic for users. And tariffs.
  • controlling the animation to simulate people's expressions and actions through expression parameters can increase the fun and entertainment of video communication, protect the user's personal privacy, and enhance the user experience.
  • FIG. 1 is a flowchart (transmission end) of a method for implementing video communication according to an embodiment of the present disclosure.
  • FIG. 2 is a flowchart (receiving end) of a method for implementing video communication according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram of a device (transmitting end) for implementing video communication according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a device (receiving end) for implementing video communication according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of an image of a user collected at the transmitting end in Example 2 of the present disclosure (left), and a schematic diagram of the receiving end imitating the expression of the transmitting-end user with a cartoon image (a big shark) (right).
  • an embodiment of the present disclosure provides a method for implementing video communication, which is applied to a sending end, and the method includes:
  • S120 performing image recognition on each frame of the image frame that is collected, and performing feature extraction according to the image recognition result;
  • S130 Perform parameter description on the image frame according to the feature extracted from each frame of the image frame, encode the parameter description of each frame of the image frame, and send the parameter description to the receiving end;
  • the method may also include the following features:
  • the image recognition includes at least one of the following: face recognition, body recognition, scene recognition;
  • the feature extraction includes at least one of the following: an expression feature extraction, an action feature extraction, and a physical feature extraction in the scene;
  • the parameter description includes at least one of the following: a description of a facial expression parameter, a description of a human motion parameter, and a description of a physical feature parameter in the scene;
  • the collecting images by the camera includes:
  • An image frame that is a reference image for shooting a scene is taken by a camera.
  • the Adaboost algorithm can be used for face detection to determine the position and size of the face, and then the SDM (Supervised Descent Method) algorithm is used to extract the key feature point coordinates of the face in the face region;
  • the key feature points of the face include at least one of the following: an eye, a nose, a mouth, an eyebrow, a facial contour, and the like;
  • the facial expression parameters include at least one of the following: closing the eyes, blinking, opening the mouth, closing the mouth, laughing, raising the head, lowering the head, turning the head to the left, turning the head to the right, tilting the head toward the left shoulder, tilting the head toward the right shoulder, etc.;
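
The disclosure names Adaboost for face detection and SDM for locating the key feature points but does not include an implementation. As a hedged illustration only, the sketch below uses OpenCV's Haar cascade detector (an AdaBoost-based method) and dlib's 68-point shape predictor as stand-ins for those two steps; the cascade file and the landmark model path are assumptions about the local environment, not part of the disclosure.

```python
# Illustrative stand-in for the Adaboost + SDM pipeline described above.
import cv2
import dlib

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# Model file path is an assumption; the file is downloaded separately from the dlib model zoo.
landmark_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_face_landmarks(frame_bgr):
    """Return (face_box, landmark_points) for the largest detected face, or (None, None)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, None                         # no face: caller falls back to default parameters
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])    # keep the largest face
    rect = dlib.rectangle(int(x), int(y), int(x + w), int(y + h))
    shape = landmark_predictor(gray, rect)
    points = [(p.x, p.y) for p in shape.parts()]          # eyes, nose, mouth, brows, jaw contour
    return (x, y, w, h), points
```
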
  • a deep learning algorithm or a template matching algorithm may be used for human body detection, and features such as the position, direction, shape, and curvature of each key human body part are extracted;
  • the key feature points of the human body include at least one of the following: a head, an arm, a hand, a leg, a foot, a waist, and the like;
  • the deep learning algorithm can be used to understand the scene and extract the features of each physical object in the scene
  • the characteristics of the physical object include: shape, size, color, material, etc.;
  • image recognition is performed on each frame of the image frame that is collected, and feature extraction is performed according to the image recognition result, including:
  • the position of the face key feature point extracted from the portrait image reference image frame is taken as the face feature point reference position
  • determining, according to the positional relationship between the face key feature point positions extracted from each image frame and the face feature point reference positions, the facial expression parameters and expression action amplitudes corresponding to the image frame includes:
  • the blink threshold is the distance between the upper and lower eyelids in the portrait shooting reference image frame minus a first error allowable value, or the blink threshold is the average of the distances between the upper and lower eyelids in a plurality of portrait shooting reference image frames minus a second error allowable value; wherein the first error allowable value is an empirical value, and the second error allowable value is the motion recognition sensitivity coefficient multiplied by the variance of the distances between the upper and lower eyelids in the plurality of portrait shooting reference image frames; and/or
  • the mouth-opening threshold is the distance between the upper and lower lip edges in the portrait shooting reference image frame plus a first error allowable value, or the mouth-opening threshold is the average of the distances between the upper and lower lip edges in a plurality of portrait shooting reference image frames plus a second error allowable value; wherein the first error allowable value is an empirical value, and the second error allowable value is the motion recognition sensitivity coefficient multiplied by the variance of the distances between the upper and lower lip edges in the plurality of portrait shooting reference image frames; and/or
  • the action and the action range of the head are determined according to the distance, angle and direction of the deviation
  • the head position reference point comprises at least one of the following: a center point of the two eyes, an eyebrow, and a tip of the nose;
  • the action of the head includes at least one of the following: raising the head, lowering the head, turning the head to the left, turning the head to the right, tilting the head to the left shoulder, and tilting the head to the right shoulder;
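
To make the threshold scheme above concrete, here is a minimal sketch. It assumes dlib's 68-point landmark layout for the eyelid and lip points and takes the standard deviation over the reference frames as the spread term written as "variance" above; the index choices and the sensitivity value alpha=3.0 are illustrative assumptions, not values fixed by the disclosure.

```python
import numpy as np

# Landmark indices assume dlib's 68-point layout (0-based); an illustrative choice only.
def eyelid_distance(pts):
    p = np.asarray(pts, dtype=float)
    # Average vertical gap of the left eye (upper lid points 37, 38 vs lower lid points 41, 40).
    return (np.linalg.norm(p[37] - p[41]) + np.linalg.norm(p[38] - p[40])) / 2.0

def lip_distance(pts):
    p = np.asarray(pts, dtype=float)
    return np.linalg.norm(p[62] - p[66])          # inner upper lip to inner lower lip

def compute_thresholds(reference_landmark_sets, alpha=3.0):
    """Derive blink / mouth-opening thresholds from N reference frames of a resting face.
    alpha is the motion-recognition sensitivity coefficient; 3.0 is an arbitrary example."""
    eye = np.array([eyelid_distance(lm) for lm in reference_landmark_sets])
    lip = np.array([lip_distance(lm) for lm in reference_landmark_sets])
    blink_thr = eye.mean() - alpha * eye.std()    # mean eyelid gap minus the allowance
    mouth_thr = lip.mean() + alpha * lip.std()    # mean lip gap plus the allowance
    return blink_thr, mouth_thr

def expression_parameters(pts, blink_thr, mouth_thr):
    """Map one frame's landmarks to coarse expression parameters."""
    return {
        "eyes_closed": eyelid_distance(pts) < blink_thr,
        "mouth_open": lip_distance(pts) > mouth_thr,
        "mouth_open_amplitude": max(0.0, lip_distance(pts) - mouth_thr),
    }
```
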
  • the method further includes:
  • image recognition is performed on each frame of the image frame that is collected, and feature extraction is performed according to the image recognition result, including:
  • a position of a human key feature point extracted from the portrait image reference image frame is used as a human body feature point reference position;
  • human body actions such as: moving the left hand forward, kicking the right leg forward, twisting the waist, making gestures, etc.;
  • image recognition is performed on each frame of the image frame that is collected, and feature extraction is performed according to the image recognition result, including:
  • a deep learning algorithm is used to understand the scene of each image frame, extract physical objects that can be described, and perform feature extraction on the physical objects.
  • an embodiment of the present disclosure provides a method for implementing video communication, which is applied to a receiving end, and the method includes:
  • S220 Simulate and reconstruct each frame image according to the decoded parameter description, and display the image.
  • the method may also include the following features:
  • the parameter description includes at least one of the following: a description of a facial expression parameter, a description of a human motion parameter, and a description of a physical feature parameter in the scene;
  • the reconstructing each frame image according to the decoded parameter description comprises:
  • a two-dimensional picture or a three-dimensional image is used to simulate and reconstruct the expression of the sender's portrait, and the image frame after the simulation and reconstruction is displayed; and/or
  • the motion of the image of the sender is reconstructed using a two-dimensional image or a three-dimensional image, and the image frame after the simulation reconstruction is displayed;
  • a two-dimensional picture or a three-dimensional image is used to simulate and reconstruct each object in the scene of the transmitting end, and the image frame after the simulation is displayed;
  • the two-dimensional picture or the three-dimensional image includes at least one of the following: a transmitting end portrait shooting reference image, a transmitting end scene shooting reference image, a picture in the picture library, and an animation model in the animation model library;
  • an embodiment of the present disclosure provides an apparatus for implementing video communication, which is applied to a transmitting end, and includes:
  • the image acquisition module 301 is configured to collect an image by using a camera
  • the image recognition and feature extraction module 302 is configured to perform image recognition on each frame of the captured image frame, and perform feature extraction according to the image recognition result;
  • the parameter description and encoding module 303 is configured to perform parameter description on the image frame according to the feature extracted from each frame of the image frame, encode the parameter description of each frame of the image frame, and send the parameter description to the receiving end.
  • the image recognition includes at least one of the following: face recognition, human body recognition, scene recognition;
  • the feature extraction includes at least one of the following: an expression feature extraction, an action feature extraction, and a physical feature extraction in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a physical feature parameter description in the scene.
  • the image acquisition module 301 is configured to collect images by using a camera, including:
  • the image recognition and feature extraction module 302 is configured to perform image recognition on each frame of the image frame that is collected, and perform feature extraction according to the image recognition result, including:
  • a position of a face key feature point extracted from an image frame of the portrait shooting reference image is taken as a face feature point reference position;
  • the image recognition and feature extraction module 302 is configured to determine, according to the positional relationship between the face key feature point positions extracted from each image frame and the face feature point reference positions, the facial expression parameters and expression action amplitudes corresponding to the image frame, including:
  • the blink threshold is the distance between the upper and lower eyelids in the portrait shooting reference image frame minus a first error allowable value, or the blink threshold is the average of the distances between the upper and lower eyelids in a plurality of portrait shooting reference image frames minus a second error allowable value; wherein the first error allowable value is an empirical value, and the second error allowable value is the motion recognition sensitivity coefficient multiplied by the variance of the distances between the upper and lower eyelids in the plurality of portrait shooting reference image frames; and/or
  • the mouth-opening threshold is the distance between the upper and lower lip edges in the portrait shooting reference image frame plus a first error allowable value, or the mouth-opening threshold is the average of the distances between the upper and lower lip edges in a plurality of portrait shooting reference image frames plus a second error allowable value; wherein the first error allowable value is an empirical value, and the second error allowable value is the motion recognition sensitivity coefficient multiplied by the variance of the distances between the upper and lower lip edges in the plurality of portrait shooting reference image frames; and/or
  • the motion and motion amplitude of the head are determined according to the distance, angle and direction of the deviation.
  • the image acquisition module 301 is configured to collect images by using a camera, including:
  • the image recognition and feature extraction module 302 is configured to perform image recognition on each frame of the image frame that is collected, and perform feature extraction according to the image recognition result, including:
  • a position of a human key feature point extracted from the portrait image reference image frame is used as a human body feature point reference position;
  • the image acquisition module 301 is configured to collect images by using a camera, including:
  • the image recognition and feature extraction module 302 is configured to perform image recognition on each frame of the image frame that is collected, and perform feature extraction according to the image recognition result, including:
  • a deep learning algorithm is used to understand the scene of each image frame, extract physical objects that can be described, and perform feature extraction on the physical objects.
  • the device further includes:
  • the image sending module 304 is configured to send an image frame of the portrait shooting reference image or an image frame of the scene capturing reference image to the receiving end.
  • an embodiment of the present disclosure provides an apparatus for implementing video communication, which is applied to a receiving end, and includes:
  • the parameter description receiving module 401 is configured to receive a parameter description of each frame of the image frame, and decode the received parameter description.
  • the image reconstruction and display module 402 is configured to simulate and reconstruct each frame image according to the decoded parameter description, and display the image.
  • the parameter description includes at least one of the following: a facial expression parameter description, a human motion parameter description, and a physical feature parameter description in the scene.
  • the image reconstruction and display module 402 is configured to reconstruct each frame image according to the decoded parameter description, including:
  • when the decoded parameter description includes facial expression parameters and expression action amplitudes, a two-dimensional picture or a three-dimensional image is used to simulate and reconstruct the expression of the sender's portrait, and the simulated and reconstructed image frame is displayed; and/or
  • the motion of the image of the sender is reconstructed using a two-dimensional image or a three-dimensional image, and the image frame after the simulation reconstruction is displayed;
  • the two-dimensional picture or the three-dimensional image is used to simulate and reconstruct each object in the scene of the transmitting end, and the image frame after the simulation is displayed.
  • the two-dimensional picture or the three-dimensional image includes at least one of the following: a transmitting portrait shooting reference image, a transmitting end scene shooting reference image, a picture in the picture library, and an animation model in the animation model library.
  • the wireless data traffic in mobile video communication is expensive.
  • the parameter description and encoding transmission can be performed only on the facial expression. The implementation steps are as follows:
  • Step 1 The image acquisition module of the transmitting end first transmits the collected image using conventional encoding, that is, it sends the current face image to the receiving end for subsequent control and display;
  • Step 2 The subsequently acquired image is sent to the feature extraction module for face detection and key feature point location of the face, including the mouth, eyebrows, eyes, nose and facial contours;
  • Step 3 The feature extraction module performs face detection through the Adaboost algorithm to determine the position and size of the face, and then uses the SDM (Supervised Descent Method) algorithm to extract the coordinates of the face feature points in the face region;
  • SDM Supervised Descent Method
  • Step 4 extracting facial expression motion parameters and amplitude according to feature point coordinates and encoding according to a predetermined rule. If no face is detected or the feature point position cannot be located, the default action parameter is adopted;
  • video frames of the user at rest with a normal expression (no action) are taken as reference frames, and the mean value of each face feature point position over the N reference frames is computed as the reference.
  • the variance σDX of a feature point position DX is calculated as σDX = (1/N) · Σ_{i=1..N} (DXi − D̄X)², where DXi is the position of the feature point in the i-th reference frame and D̄X is its mean over the N reference frames; σDX represents the variance of the feature point position due to video noise.
  • the mouth opening action is determined by the distance D2 between the upper and lower lip edges: if D2 > D̄2 + α·σD2, it is determined that a mouth opening action has occurred, where D̄2 is the mean distance between the upper and lower lip edges in the N reference frames, σD2 is the variance of that distance in the N reference frames, and α is the motion recognition sensitivity coefficient; the smaller the value of α, the more sensitive the motion recognition result.
  • the value of the angle θ1 of head tilting (i.e., tilting the head clockwise or counterclockwise toward a shoulder) can be estimated from the horizontal distance and the vertical distance of the eyebrow point, the center point of the eyes, or the tip of the nose from the corresponding reference point position;
  • the direction can be determined by the relative relationship between the current position and the position of the reference point.
  • the value of the angle θ2 of the left-right rotation of the head (i.e., turning left and turning right) can be estimated from the horizontal distance of the eyebrow point, the center point of the eyes, or the tip of the nose from the reference point position; the direction of the angle can be determined from the relative relationship between the current position and the reference point position.
  • the value of the angle θ3 of the up-down rotation of the head (i.e., raising or lowering the head) can be estimated from the vertical distance of the eyebrow point, the center point of the eyes, or the tip of the nose from the reference point position; the direction of the angle can be determined from the relative relationship between the current position and the reference point position.
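
The disclosure does not give closed-form expressions for θ1, θ2 and θ3, so the following is only a rough geometric sketch: the turn and nod angles are approximated from the horizontal and vertical offsets of the eye-center point relative to its reference position, using the face width as a crude depth proxy, and the tilt angle from the change in orientation of the line joining the two eye centers. All of these approximations are assumptions, not formulas taken from the disclosure.

```python
import math

def head_pose_parameters(eye_center, ref_eye_center, left_eye, right_eye,
                         ref_left_eye, ref_right_eye, face_width):
    """Rough head-pose estimate from the displacement of reference points (a sketch only)."""
    dx = eye_center[0] - ref_eye_center[0]        # horizontal offset from the reference position
    dy = eye_center[1] - ref_eye_center[1]        # vertical offset from the reference position

    # theta2: left/right turn, estimated from the horizontal offset (sign gives the direction).
    theta2 = math.degrees(math.atan2(dx, face_width))
    # theta3: up/down nod, estimated from the vertical offset.
    theta3 = math.degrees(math.atan2(dy, face_width))
    # theta1: sideways tilt, from the change in orientation of the inter-eye line.
    cur = math.atan2(right_eye[1] - left_eye[1], right_eye[0] - left_eye[0])
    ref = math.atan2(ref_right_eye[1] - ref_left_eye[1], ref_right_eye[0] - ref_left_eye[0])
    theta1 = math.degrees(cur - ref)
    return {"tilt": theta1, "turn": theta2, "nod": theta3}
```
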
  • Step 5 The receiving end decodes the received action parameters to obtain various action description parameters.
  • Step 6 The animation control module at the receiving end controls the image obtained in step 1, or a 3D model constructed from that image, according to the decoded action parameters, performs the expression and motion simulation, and sends the result to the display module for display.
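
The "predetermined rule" used to encode and decode the expression parameters in steps 4 and 5 is not specified in the disclosure. The sketch below shows one possible, purely illustrative layout: a fixed six-byte record per frame packed with Python's struct module; the field set and value ranges are assumptions.

```python
import struct

# Illustrative per-frame layout (not the disclosure's rule): flags, mouth amplitude,
# tilt, turn, nod, reserved.
FRAME_FORMAT = "<BBbbbB"

def encode_expression(p):
    flags = (int(p["eyes_closed"]) << 0) | (int(p["mouth_open"]) << 1)
    clamp = lambda v, lo, hi: int(max(lo, min(hi, round(v))))
    return struct.pack(FRAME_FORMAT, flags,
                       clamp(p["mouth_open_amplitude"], 0, 255),
                       clamp(p["tilt"], -90, 90),
                       clamp(p["turn"], -90, 90),
                       clamp(p["nod"], -90, 90),
                       0)

def decode_expression(payload):
    flags, mouth_amp, tilt, turn, nod, _ = struct.unpack(FRAME_FORMAT, payload)
    return {"eyes_closed": bool(flags & 1), "mouth_open": bool(flags & 2),
            "mouth_open_amplitude": mouth_amp, "tilt": tilt, "turn": turn, "nod": nod}
```

A payload of a few bytes per frame, compared with the much larger size of a conventionally encoded video frame, is where the traffic saving described above comes from.
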
  • the two parties do not want the other party to see their real faces, but only want to know each other's facial expressions, so they can transmit only the expression action parameters and then control an animation at the receiving end.
  • the implementation steps are as follows:
  • Step 1 The image acquisition module of the sending end sends the collected image to the feature extraction module for face detection and key feature point positioning of the face, including the mouth, eyebrows, eyes, nose, and facial contours;
  • Step 2 The feature extraction module performs face detection through the Adaboost algorithm to determine the position and size of the face, and then uses the SDM (Supervised Descent Method) algorithm to extract the coordinates of the face feature points in the face region;
  • SDM Supervised Descent Method
  • Step 3 Extract the facial expression action parameters and amplitudes according to the feature point coordinates and encode them according to a predetermined rule. If no face is detected or the feature point positions cannot be located, default action parameters are adopted; for the parameter definitions, refer to Example 1;
  • Step 4 The receiving end decodes the received expression action parameters to obtain various expression action description parameters
  • Step 5 The animation control module of the receiving end controls the local picture and the cartoon animation model according to the expression action description parameters obtained in step 4 to perform an expression motion simulation and send it to the display module for display;
  • the receiving end can display an animated image of a big shark to imitate the sender's mouth-opening expression.
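
How the animation control module maps decoded parameters onto the cartoon model (the big shark of FIG. 5) is not detailed in the disclosure. The sketch below shows one plausible mapping onto a minimal stand-in rig; the AvatarState fields and the scale factor are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class AvatarState:
    """Minimal stand-in for the receiving end's animation model (e.g. the big-shark cartoon);
    a real implementation would drive a 2D sprite or a rigged 3D model instead."""
    jaw_open: float = 0.0                 # 0.0 = closed, 1.0 = fully open
    eyes_closed: bool = False
    rotation: tuple = (0.0, 0.0, 0.0)     # (tilt, turn, nod) in degrees

def drive_avatar(decoded, state: AvatarState) -> AvatarState:
    # Scale factor 30.0 (pixels of lip gap mapped to a fully open jaw) is an arbitrary example.
    state.jaw_open = min(1.0, decoded["mouth_open_amplitude"] / 30.0)
    state.eyes_closed = decoded["eyes_closed"]
    state.rotation = (decoded["tilt"], decoded["turn"], decoded["nod"])
    return state
```
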
  • This example can perform parameter description and code transmission for body movements.
  • the implementation steps are as follows:
  • Step 1 The image acquisition module of the sending end sends the collected image to the feature extraction module for human body detection and body part positioning, including parts such as the head, arms, legs, and waist;
  • Step 2 The feature extraction module extracts features such as the position, direction, shape, and curvature of each body part through a deep learning algorithm or a template matching algorithm;
  • Step 3 According to the features extracted in step 2, identify and judge the movements of the human body, such as moving the left hand forward, kicking the right leg forward, twisting the waist, making gestures, etc., encode the action parameters and action amplitudes according to a predetermined rule, and send them to the receiving end;
  • Step 4 The receiving end decodes the received data according to a predetermined rule to obtain description parameters of various actions and action amplitudes;
  • Step 5 The animation control module of the receiving end controls the local human body image or model according to the parameters obtained in step 4 to perform motion simulation, and displays in the display module.
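
Step 3's action recognition is described only through examples (left hand forward, kicking, twisting the waist). The sketch below shows a simple rule-based stand-in operating on 2D keypoints from any pose estimator; the keypoint names and the displacement threshold are assumptions, and a real system would use the deep-learning or template-matching approach named in step 2.

```python
# Illustrative rule-based classification of a few of the body actions named above.
def classify_body_action(kp, ref_kp, min_shift=40):
    """kp / ref_kp: dicts mapping part names ('left_hand', 'right_foot', 'waist', ...) to
    (x, y) image coordinates for the current frame and the reference frame."""
    actions = {}
    if "left_hand" in kp and "left_hand" in ref_kp:
        dx = kp["left_hand"][0] - ref_kp["left_hand"][0]
        if abs(dx) > min_shift:
            actions["left_hand_forward"] = abs(dx)          # amplitude = displacement in pixels
    if "right_foot" in kp and "right_foot" in ref_kp:
        dy = ref_kp["right_foot"][1] - kp["right_foot"][1]  # image y grows downward
        if dy > min_shift:
            actions["right_leg_kick"] = dy
    if "waist" in kp and "waist" in ref_kp:
        dx = kp["waist"][0] - ref_kp["waist"][0]
        if abs(dx) > min_shift:
            actions["waist_twist"] = abs(dx)
    return actions
```
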
  • This example can perform parameter description and code transmission on the scene.
  • the implementation steps are as follows:
  • Step 1 The image acquisition module of the sending end sends the collected image to the feature extraction module for image content understanding and extraction;
  • Step 2 The feature extraction module understands the scene through a deep learning algorithm, extracts objects that can be described, such as tables, landmark buildings, computers, dogs, etc., and extracts descriptive features such as shape, size, color, height, material, and position in the image;
  • Step 3 performing parameter description on the physical features extracted in step 2, and encoding the parameters describing the physical objects according to a predetermined rule, and transmitting the parameters to the receiving end;
  • Step 4 The receiving end decodes the received data according to a predetermined rule, and obtains description parameters of each physical object in the scene.
  • Step 5 The receiving end animation control module selects various physical model simulations to construct the transmitting end scene according to the parameters obtained in step 4, so that the receiving end can experience a scene similar to the sending end and display in the display module.
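
The disclosure leaves open how the physical-object descriptions are serialized and how the receiving end picks stand-in models. The following sketch uses JSON for the parameter description and a simple type-to-builder lookup for reconstruction; the field names and the model_library interface are illustrative assumptions.

```python
import json

# Illustrative encoding of scene-object descriptions; JSON is an assumption, not the
# disclosure's predetermined rule.
def encode_scene(objects):
    """objects: list of dicts like {"type": "table", "shape": "rectangular",
    "size": [1.2, 0.6, 0.7], "color": "brown", "material": "wood", "position": [0.3, 0.5]}"""
    return json.dumps(objects).encode("utf-8")

def reconstruct_scene(payload, model_library):
    """Pick a local model for each described object, as in Step 5 above.
    model_library: dict mapping object type -> callable that builds/places a stand-in model."""
    scene = []
    for obj in json.loads(payload.decode("utf-8")):
        builder = model_library.get(obj["type"])
        if builder is not None:
            scene.append(builder(obj))   # e.g. place a generic table model with the given size/color
    return scene
```
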
  • the foregoing embodiment provides a method and apparatus for implementing video communication.
  • the amount of data transmitted can be greatly reduced by transmitting character motions, expressions, and scene parameter descriptions in the image instead of the image itself, thereby saving traffic and tariffs for the user.
  • controlling the animation to simulate people's expressions and actions through expression parameters can increase the fun and entertainment of video communication, protect the user's personal privacy, and enhance the user experience.
  • an apparatus for implementing video communication, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform, according to the instructions stored in the memory, actions including: acquiring an image by a camera; performing image recognition on each collected image frame and performing feature extraction according to the image recognition result; and performing parameter description on each image frame according to the features extracted from that frame, encoding the parameter description of each image frame, and sending it to the receiving end.
  • an apparatus for implementing video communication, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform, according to the instructions stored in the memory, actions including: receiving a parameter description of each image frame, decoding the received parameter description, and simulating and reconstructing each frame image according to the decoded parameter description for display.
  • a computer storage medium may store executable instructions for performing the method of implementing video communication in any of the above embodiments.
  • the method for implementing video communication provided by the present disclosure can be applied to a terminal device having video capture and communication functions, and can greatly reduce the amount of transmitted data by transmitting character motion, expression, and scene parameter descriptions instead of the image itself during video communication, saving users traffic and tariffs.

Abstract

The present disclosure provides a method for implementing video communication, which is applied to a sending end. The method comprises: collecting an image by means of a camera; performing image recognition on each collected image frame, and performing feature extraction according to the image recognition result; and performing parameter description on the image frames according to a feature extracted from each image frame, encoding the parameter description of each image frame, and then sending it to a receiving end. The present disclosure can reduce the amount of data transmitted during video communication, and traffic and fees can be saved for users.
PCT/CN2017/081956 2016-06-06 2017-04-26 Method and apparatus for implementing video communication WO2017211139A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610394389.1A CN107465885A (zh) 2016-06-06 2016-06-06 一种实现视频通讯的方法和装置
CN201610394389.1 2016-06-06

Publications (1)

Publication Number Publication Date
WO2017211139A1 true WO2017211139A1 (fr) 2017-12-14

Family

ID=60544535

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/081956 WO2017211139A1 (fr) 2016-06-06 2017-04-26 Procédé et appareil destinés à mettre en œuvre une communication vidéo

Country Status (2)

Country Link
CN (1) CN107465885A (fr)
WO (1) WO2017211139A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921773A (zh) * 2018-07-04 2018-11-30 百度在线网络技术(北京)有限公司 人体跟踪处理方法、装置、设备及系统
CN112235531A (zh) * 2020-10-15 2021-01-15 北京字节跳动网络技术有限公司 视频处理的方法、装置、终端及存储介质

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102256110B1 (ko) * 2017-05-26 2021-05-26 라인 가부시키가이샤 영상 압축 방법 및 영상 복원 방법
CN110276232A (zh) * 2018-03-16 2019-09-24 东方联合动画有限公司 一种基于社交场景的数据处理方法、系统
EP3707643A4 (fr) 2018-04-25 2020-11-18 Beijing Didi Infinity Technology and Development Co., Ltd. Systèmes et procédés de reconnaissance d'action de clignement sur la base de points caractéristiques faciaux
CN110769323B (zh) * 2018-07-27 2021-06-18 Tcl科技集团股份有限公司 一种视频通信方法、系统、装置和终端设备
CN109302598B (zh) * 2018-09-30 2021-08-31 Oppo广东移动通信有限公司 一种数据处理方法、终端、服务器和计算机存储介质
CN109151430B (zh) * 2018-09-30 2020-07-28 Oppo广东移动通信有限公司 一种数据处理方法、终端、服务器和计算机存储介质
CN109246409B (zh) * 2018-09-30 2020-08-04 Oppo广东移动通信有限公司 一种数据处理方法、终端、服务器和计算机存储介质
CN111131744B (zh) * 2019-12-26 2021-04-20 杭州当虹科技股份有限公司 一种基于视频通讯隐私保护的方法
CN112804245B (zh) * 2021-01-26 2023-09-26 杨文龙 适用于视频传输的数据传输优化方法、装置及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101674363A (zh) * 2009-09-23 2010-03-17 中兴通讯股份有限公司 移动设备及通话方法
CN102271241A (zh) * 2011-09-02 2011-12-07 北京邮电大学 一种基于面部表情/动作识别的图像通信方法及系统
CN103369289A (zh) * 2012-03-29 2013-10-23 深圳市腾讯计算机系统有限公司 一种视频模拟形象的通信方法和装置
CN103647922A (zh) * 2013-12-20 2014-03-19 百度在线网络技术(北京)有限公司 虚拟视频通话方法和终端
CN104766041A (zh) * 2014-01-07 2015-07-08 腾讯科技(深圳)有限公司 一种图像识别方法、装置及系统
CN104935860A (zh) * 2014-03-18 2015-09-23 北京三星通信技术研究有限公司 视频通话实现方法及装置

Also Published As

Publication number Publication date
CN107465885A (zh) 2017-12-12

Similar Documents

Publication Publication Date Title
WO2017211139A1 (fr) Procédé et appareil destinés à mettre en œuvre une communication vidéo
US11595617B2 (en) Communication using interactive avatars
KR102506738B1 (ko) 눈 텍스처 인페인팅
US11836866B2 (en) Deforming real-world object using an external mesh
JP7101749B2 (ja) 仲介装置及び方法、並びにコンピュータ読み取り可能な記録媒体{mediating apparatus、method and computer readable recording medium thereof}
US11790614B2 (en) Inferring intent from pose and speech input
US20220125337A1 (en) Adaptive skeletal joint smoothing
KR20230003555A (ko) 텍스처 기반 자세 검증
US20240062500A1 (en) Generating ground truths for machine learning
US20230120037A1 (en) True size eyewear in real time
WO2023121896A1 (fr) Transfert de mouvement et d'apparence en temps réel
WO2023121897A1 (fr) Échange de vêtements en temps réel
WO2022146799A1 (fr) Compression de modèles image-image
CN112804245A (zh) 适用于视频传输的数据传输优化方法、装置及系统
KR20200134623A (ko) 3차원 가상 캐릭터의 표정모사방법 및 표정모사장치
US20240070950A1 (en) Avatar call on an eyewear device
US20240007585A1 (en) Background replacement using neural radiance field
KR20240049844A (ko) 패션 아이템들 상에서 ar 게임들을 제어함
CN115499612A (zh) 一种视频通讯的方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17809579

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17809579

Country of ref document: EP

Kind code of ref document: A1