WO2017211139A1 - Method and apparatus for implementing video communication - Google Patents

Method and apparatus for implementing video communication

Info

Publication number
WO2017211139A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
frame
image frame
parameter description
scene
Prior art date
Application number
PCT/CN2017/081956
Other languages
French (fr)
Chinese (zh)
Inventor
张殿凯
沈琳
瞿广财
王宁
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2017211139A1 publication Critical patent/WO2017211139A1/en

Classifications

    • H04N7/14 Systems for two-way working (H: Electricity; H04: Electric communication technique; H04N: Pictorial communication, e.g. television; H04N7/00: Television systems)
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • G06V20/00 Scenes; Scene-specific elements (G: Physics; G06: Computing, calculating or counting; G06V: Image or video recognition or understanding)
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Definitions

  • The present disclosure relates to the field of image processing technologies, and in particular to a method and apparatus for implementing video communication.
  • At present, the RCS (Rich Communication Suite) service is gradually emerging.
  • The RCS service is a converged communication service based on an enhanced mobile phone address book, integrating communication methods and functions such as voice, video, messaging, presence and content sharing.
  • Using the RCS service, users can update their own presence information (such as personal pictures, mood phrases, recommended links and status), meet communication needs such as instant messaging, chat and file transfer, share video and pictures within a session, and connect to the network side through standard protocol interfaces for registration, authentication, and audio and video call capabilities.
  • The technical problem to be solved by the present disclosure is to provide a method and apparatus for implementing video communication that can reduce the amount of data transmitted in video communication, saving users traffic and charges.
  • The present disclosure provides a method for implementing video communication, applied to a transmitting end, the method including:
  • collecting images through a camera;
  • performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result;
  • describing each image frame with parameters according to the features extracted from that frame, encoding the parameter description of each image frame, and sending it to the receiving end.
  • Optionally, the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
  • the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • Optionally, the capturing of images by the camera includes: taking, by the camera, an image frame serving as a portrait shooting reference image;
  • and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
  • after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key facial feature points extracted from it as facial feature point reference positions;
  • determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
  • Optionally, the capturing of images by the camera includes: taking, by the camera, an image frame serving as a portrait shooting reference image;
  • and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
  • after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key human body feature points extracted from it as body feature point reference positions;
  • determining the action parameters and action amplitudes corresponding to each image frame according to the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions.
  • Optionally, the capturing of images by the camera includes: taking, by the camera, an image frame serving as a scene shooting reference image;
  • and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
  • using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
  • Optionally, the method further includes: sending, to the receiving end, the image frame of the portrait shooting reference image or the image frame of the scene shooting reference image.
  • An embodiment of the present disclosure provides a method for implementing video communication, applied to a receiving end, the method including:
  • receiving a parameter description of each image frame, and decoding the received parameter description;
  • reconstructing each frame of the image by simulation according to the decoded parameter description, and displaying it.
  • Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • Optionally, the reconstructing of each frame image according to the decoded parameter description includes:
  • for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains human body action parameters and action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains scene parameters, using two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene, and displaying the reconstructed image frame.
  • Optionally, the two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
  • An embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a transmitting end, including:
  • an image acquisition module, configured to collect images through a camera;
  • an image recognition and feature extraction module, configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result;
  • a parameter description and encoding module, configured to describe each image frame with parameters according to the features extracted from that frame, encode the parameter description of each image frame, and send it to the receiving end.
  • Optionally, the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
  • the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
  • and the image recognition and feature extraction module is configured to take, after the portrait shooting reference image frame is captured, the positions of the key facial feature points extracted from it as facial feature point reference positions, and to determine the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
  • Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
  • and the image recognition and feature extraction module is configured to take, after the portrait shooting reference image frame is captured, the positions of the key human body feature points extracted from it as body feature point reference positions, and to determine the action parameters and action amplitudes corresponding to each image frame according to the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions.
  • Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a scene shooting reference image;
  • and the image recognition and feature extraction module is configured to use a deep learning algorithm to understand the scene of each image frame, extract the physical objects that can be described, and perform feature extraction on those objects.
  • Optionally, the apparatus further includes: an image sending module, configured to send, to the receiving end, the image frame of the portrait shooting reference image or the image frame of the scene shooting reference image.
  • An embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a receiving end, including:
  • a parameter description receiving module, configured to receive a parameter description of each image frame and decode the received parameter description;
  • an image reconstruction and display module, configured to reconstruct each frame of the image by simulation according to the decoded parameter description and display it.
  • Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • Optionally, the image reconstruction and display module is configured to: for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, use a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait and display the reconstructed image frame; and/or, when the decoded parameter description contains human body action parameters and action amplitudes, use a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait and display the reconstructed image frame; and/or, when the decoded parameter description contains scene parameters, use two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene and display the reconstructed image frame.
  • Optionally, the two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
  • Compared with the related art, the present disclosure provides a method and apparatus for implementing video communication in which descriptions of the character actions, expressions and scene parameters are transmitted instead of the images themselves, greatly reducing the amount of data transmitted and saving users traffic and charges.
  • In addition, controlling an animation to simulate a person's expressions and actions through expression parameters can make video communication more fun and entertaining, protect the user's personal privacy, and enhance the user experience.
  • FIG. 1 is a flowchart of a method for implementing video communication according to an embodiment of the present disclosure (transmitting end).
  • FIG. 2 is a flowchart of a method for implementing video communication according to an embodiment of the present disclosure (receiving end).
  • FIG. 3 is a schematic diagram of an apparatus for implementing video communication according to an embodiment of the present disclosure (transmitting end).
  • FIG. 4 is a schematic diagram of an apparatus for implementing video communication according to an embodiment of the present disclosure (receiving end).
  • FIG. 5 is a schematic diagram of an image of a user collected at the transmitting end in Example 2 of the present disclosure (left), and of the receiving end simulating the expression of the transmitting-end user with a cartoon image (a big shark) (right).
  • As shown in FIG. 1, an embodiment of the present disclosure provides a method for implementing video communication, applied to a sending end, the method including:
  • S110: collecting images through a camera;
  • S120: performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result;
  • S130: describing each image frame with parameters according to the features extracted from that frame, encoding the parameter description of each image frame, and sending it to the receiving end.
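  • As an illustration only, the following Python sketch shows the overall shape of this S110 to S130 transmit loop. The helper names extract_features, describe_parameters and send are assumptions introduced for the sketch rather than names from the patent; the examples later in this document flesh out what they would do.

```python
import cv2  # OpenCV, used here only for camera capture

def extract_features(frame):
    """S120 placeholder: image recognition plus feature extraction
    (face, body or scene features, per the examples below)."""
    return {}

def describe_parameters(features):
    """S130 placeholder: turn extracted features into a compact
    encoded parameter description (a few bytes per frame)."""
    return b""

def transmit_loop(send):
    camera = cv2.VideoCapture(0)  # S110: collect images through a camera
    while True:
        ok, frame = camera.read()
        if not ok:
            break
        send(describe_parameters(extract_features(frame)))  # bytes, not pixels
    camera.release()
```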
  • The method may also include the following features:
  • the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
  • the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene;
  • the collecting of images by the camera includes:
  • taking, by the camera, an image frame serving as a portrait shooting reference image; and/or
  • taking, by the camera, an image frame serving as a scene shooting reference image.
  • Here, the Adaboost algorithm can be used for face detection to determine the position and size of the face, and the SDM (Supervised Descent Method) algorithm can then be used within the face region to extract the coordinates of the key facial feature points;
  • the key feature points of the face include at least one of the following: the eyes, nose, mouth, eyebrows, and facial contour;
  • the facial expression parameters include at least one of the following: closing the eyes, opening the eyes, opening the mouth, closing the mouth, laughing, raising the head, lowering the head, turning the head to the left, turning the head to the right, tilting the head toward the left shoulder, and tilting the head toward the right shoulder.
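  • For concreteness, here is a hedged Python sketch of this detect-then-locate step. OpenCV's Haar cascade (an AdaBoost-based detector) stands in for the face detection, and OpenCV's LBF facemark model stands in for SDM, since no off-the-shelf SDM implementation is assumed; cv2.face requires the opencv-contrib build, and the lbfmodel.yaml file is assumed to have been obtained separately.

```python
import cv2
import numpy as np

# Haar cascades ship with OpenCV; LBF landmarks need opencv-contrib-python.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
facemark = cv2.face.createFacemarkLBF()
facemark.loadModel("lbfmodel.yaml")  # assumed to be downloaded beforehand

def extract_face_landmarks(frame_bgr):
    """Return (face_rect, 68 landmark points) for the largest detected face,
    or None so the caller can fall back to default action parameters."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep only the largest face, by area (width * height).
    faces = sorted(faces, key=lambda r: r[2] * r[3], reverse=True)[:1]
    ok, landmarks = facemark.fit(gray, np.array(faces))
    if not ok:
        return None
    return faces[0], landmarks[0][0]  # array of 68 (x, y) points
```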
  • A deep learning algorithm or a template matching algorithm may be used for human body detection, extracting features such as the position, direction, shape, and curvature of each key part of the body;
  • the key feature points of the human body include at least one of the following: the head, arms, hands, legs, feet, and waist;
  • a deep learning algorithm can be used to understand the scene and extract the features of each physical object in the scene;
  • the features of a physical object include its shape, size, color, material, and the like.
  • Performing image recognition on each captured image frame and feature extraction according to the image recognition result includes:
  • after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key facial feature points extracted from it as facial feature point reference positions;
  • and determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions, including the following:
  • a closed eye is determined when the distance between the upper and lower eyelids falls below the blink threshold; the blink threshold is the eyelid distance in the portrait shooting reference image frame minus a first error allowance, or the average eyelid distance over several portrait shooting reference image frames minus a second error allowance, where the first error allowance is an empirical value and the second error allowance is the motion recognition sensitivity coefficient multiplied by the variance of the eyelid distance over those reference frames; and/or
  • an open mouth is determined when the distance between the upper and lower lip edges exceeds the mouth-opening threshold; the mouth-opening threshold is the lip-edge distance in the portrait shooting reference image frame plus the first error allowance, or the average lip-edge distance over several portrait shooting reference image frames plus the second error allowance, where the first error allowance is an empirical value and the second error allowance is the motion recognition sensitivity coefficient multiplied by the variance of the lip-edge distance over those reference frames; and/or
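  • A minimal sketch of this threshold logic, using the averaged variant (mean over the reference frames plus or minus the sensitivity coefficient times the variance); the 68-point landmark indices and the value of LAMBDA are illustrative assumptions, not values from the patent.

```python
import numpy as np

LAMBDA = 0.5  # motion recognition sensitivity; smaller = more sensitive

def eyelid_distance(pts):
    # 68-point layout: point 37 is on the upper lid, 41 on the lower (left eye).
    return abs(pts[41][1] - pts[37][1])

def lip_distance(pts):
    # Point 62 is the inner upper lip, 66 the inner lower lip.
    return abs(pts[66][1] - pts[62][1])

def blink_and_mouth_thresholds(reference_landmarks):
    """reference_landmarks: list of 68x2 arrays from N neutral reference frames."""
    eye = np.array([eyelid_distance(p) for p in reference_landmarks])
    lip = np.array([lip_distance(p) for p in reference_landmarks])
    blink_thr = eye.mean() - LAMBDA * eye.var()  # eyes closed if below this
    mouth_thr = lip.mean() + LAMBDA * lip.var()  # mouth open if above this
    return blink_thr, mouth_thr
```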
  • the action and action amplitude of the head are determined according to the distance, angle and direction by which a head position reference point deviates from its reference position;
  • the head position reference point includes at least one of the following: the center point between the two eyes, the eyebrow point, and the tip of the nose;
  • the actions of the head include at least one of the following: raising the head, lowering the head, turning the head to the left, turning the head to the right, tilting the head toward the left shoulder, and tilting the head toward the right shoulder.
  • The method further includes:
  • performing image recognition on each captured image frame and feature extraction according to the image recognition result, including:
  • after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key human body feature points extracted from it as body feature point reference positions, and determining the action parameters and action amplitudes corresponding to each image frame from the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions;
  • the human body actions include, for example: moving the left hand forward, kicking the right leg forward, twisting the waist, gestures, and the like.
  • Performing image recognition on each captured image frame and feature extraction according to the image recognition result includes:
  • using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
  • As shown in FIG. 2, an embodiment of the present disclosure provides a method for implementing video communication, applied to a receiving end, the method including:
  • S210: receiving a parameter description of each image frame, and decoding the received parameter description;
  • S220: reconstructing each frame of the image by simulation according to the decoded parameter description, and displaying it.
  • The method may also include the following features:
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene;
  • the reconstructing of each frame image according to the decoded parameter description includes:
  • for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains human body action parameters and action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains scene parameters, using two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene, and displaying the reconstructed image frame;
  • the two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
  • As shown in FIG. 3, an embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a transmitting end, including:
  • an image acquisition module 301, configured to collect images through a camera;
  • an image recognition and feature extraction module 302, configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result;
  • a parameter description and encoding module 303, configured to describe each image frame with parameters according to the features extracted from that frame, encode the parameter description of each image frame, and send it to the receiving end.
  • The image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
  • the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
  • the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • The image acquisition module 301 is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
  • the image recognition and feature extraction module 302 is configured to take, after the portrait shooting reference image frame is captured, the positions of the key facial feature points extracted from it as facial feature point reference positions;
  • and the image recognition and feature extraction module 302 is configured to determine the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions, including the following:
  • a closed eye is determined when the distance between the upper and lower eyelids falls below the blink threshold; the blink threshold is the eyelid distance in the portrait shooting reference image frame minus a first error allowance, or the average eyelid distance over several portrait shooting reference image frames minus a second error allowance, where the first error allowance is an empirical value and the second error allowance is the motion recognition sensitivity coefficient multiplied by the variance of the eyelid distance over those reference frames; and/or
  • an open mouth is determined when the distance between the upper and lower lip edges exceeds the mouth-opening threshold; the mouth-opening threshold is the lip-edge distance in the portrait shooting reference image frame plus the first error allowance, or the average lip-edge distance over several portrait shooting reference image frames plus the second error allowance, where the first error allowance is an empirical value and the second error allowance is the motion recognition sensitivity coefficient multiplied by the variance of the lip-edge distance over those reference frames; and/or
  • the action and action amplitude of the head are determined according to the distance, angle and direction by which a head position reference point deviates from its reference position.
  • The image acquisition module 301 is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
  • and the image recognition and feature extraction module 302 is configured to take, after the portrait shooting reference image frame is captured, the positions of the key human body feature points extracted from it as body feature point reference positions, and to determine the action parameters and action amplitudes corresponding to each image frame from the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions.
  • The image acquisition module 301 is configured to take, by the camera, an image frame serving as a scene shooting reference image;
  • and the image recognition and feature extraction module 302 is configured to use a deep learning algorithm to understand the scene of each image frame, extract the physical objects that can be described, and perform feature extraction on those objects.
  • The apparatus further includes:
  • an image sending module 304, configured to send the image frame of the portrait shooting reference image or the image frame of the scene shooting reference image to the receiving end.
  • As shown in FIG. 4, an embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a receiving end, including:
  • a parameter description receiving module 401, configured to receive a parameter description of each image frame and decode the received parameter description;
  • an image reconstruction and display module 402, configured to reconstruct each frame of the image by simulation according to the decoded parameter description and display it.
  • The parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
  • The image reconstruction and display module 402 is configured to reconstruct each frame image according to the decoded parameter description, including:
  • for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains human body action parameters and action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait, and displaying the reconstructed image frame; and/or
  • for each image frame, when the decoded parameter description contains scene parameters, using two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene, and displaying the reconstructed image frame.
  • The two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
  • Example 1: Wireless data traffic in mobile video communication is expensive.
  • In this example, parameter description, encoding and transmission can be performed for the facial expression alone. The implementation steps are as follows:
  • Step 1: The image acquisition module of the transmitting end first transmits the initially collected image using conventional encoding, that is, it sends the current face image to the receiving end for subsequent control and display;
  • Step 2: Each subsequently acquired image is sent to the feature extraction module for face detection and location of the key facial feature points, including the mouth, eyebrows, eyes, nose and facial contour;
  • Step 3: The feature extraction module performs face detection with the Adaboost algorithm to determine the position and size of the face, and then uses the SDM (Supervised Descent Method) algorithm to extract the coordinates of the facial feature points within the face region;
  • Step 4: Extract the facial expression action parameters and amplitudes from the feature point coordinates and encode them according to a predetermined rule; if no face is detected or the feature point positions cannot be located, default action parameters are used;
  • taking video frames in which the user is at rest with a neutral expression (no action) as reference frames, the mean position of each facial feature point over the N reference frames is computed as its reference value;
  • for a feature distance $D_X$, measured as $D_X^{(i)}$ in the $i$-th reference frame, the mean $\bar{D}_X$ and the variance $\sigma_{D_X}$ are calculated as follows: $\bar{D}_X = \frac{1}{N}\sum_{i=1}^{N} D_X^{(i)}$ and $\sigma_{D_X} = \frac{1}{N}\sum_{i=1}^{N} \big(D_X^{(i)} - \bar{D}_X\big)^2$;
  • $\sigma_{D_X}$ represents the variance of the feature point positions caused by video noise;
  • the mouth-opening action is determined by the distance $D_2$ between the upper and lower lip edges: if $D_2 > \bar{D}_2 + \lambda\,\sigma_{D_2}$, a mouth-opening action is determined to have occurred; here $\bar{D}_2$ is the mean distance between the upper and lower lip edges over the N reference frames, $\sigma_{D_2}$ is the variance of that distance over the N reference frames, and $\lambda$ is the motion recognition sensitivity coefficient: the smaller its value, the more sensitive the motion recognition result.
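  • The patent leaves the "predetermined rule" for encoding unspecified. As an illustration of how compact such a description can be, here is a hypothetical Python encoding in which each action takes two bytes (a code and a quantized amplitude); the action codes and the layout are invented for this sketch.

```python
import struct

ACTION_CODES = {"eyes_closed": 1, "mouth_open": 2, "head_up": 3,
                "head_down": 4, "turn_left": 5, "turn_right": 6}

def encode_actions(actions):
    """actions: list of (name, amplitude) pairs with amplitude in [0.0, 1.0].
    Returns a small payload: a 1-byte count, then 2 bytes per action."""
    payload = struct.pack("B", len(actions))
    for name, amplitude in actions:
        q = max(0, min(255, int(round(amplitude * 255))))  # quantize to a byte
        payload += struct.pack("BB", ACTION_CODES[name], q)
    return payload

# A whole frame's expression description fits in a few bytes, versus the
# kilobytes a conventionally encoded video frame would need.
frame_payload = encode_actions([("mouth_open", 0.6), ("turn_left", 0.2)])
```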
  • The value of the angle θ1 of the head tilting sideways (i.e., tilting clockwise or counterclockwise toward a shoulder) can be estimated from the horizontal and vertical distances of the eyebrow point, the center point between the eyes, or the tip of the nose from its reference position; the direction can be determined from the relative relationship between the current position and the reference position.
  • The value of the angle θ2 of the head turning left or right can be estimated from the horizontal distance of the eyebrow point, the center point between the eyes, or the tip of the nose from its reference position; the direction of the angle can be determined from the relative relationship between the current position and the reference position.
  • The value of the angle θ3 of the head rotating up or down (i.e., raising or lowering the head) can be estimated from the vertical distance of the eyebrow point, the center point between the eyes, or the tip of the nose from its reference position; the direction of the angle can be determined from the relative relationship between the current position and the reference position.
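  • A hedged sketch of this angle estimation follows. The use of the nose tip and the mid-point between the eyes, and the pixels-to-degrees factor K, are illustrative assumptions; a real implementation would calibrate them.

```python
import math

K = 0.5  # degrees per pixel of offset; would be calibrated in practice

def head_angles(nose_now, nose_ref, eyes_now, eyes_ref):
    """nose_*: (x, y) of the nose tip; eyes_*: (x, y) of the mid-point between
    the eyes. Returns (tilt, yaw, pitch) in degrees, signed so that the sign
    encodes the direction of the movement relative to the reference frame."""
    yaw = K * (nose_now[0] - nose_ref[0])    # theta2: turn left / right
    pitch = K * (nose_now[1] - nose_ref[1])  # theta3: head up / down

    def axis_angle(eyes, nose):
        # Angle of the eyes-to-nose axis, used for the sideways tilt theta1.
        return math.degrees(math.atan2(nose[1] - eyes[1], nose[0] - eyes[0]))

    tilt = axis_angle(eyes_now, nose_now) - axis_angle(eyes_ref, nose_ref)
    return tilt, yaw, pitch
```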
  • Step 5: The receiving end decodes the received action parameters to obtain the various action description parameters;
  • Step 6: The animation control module at the receiving end, according to the parameters obtained in step 4, drives the image obtained in step 1, or a 3D model constructed from that image, to simulate the actions and expressions, and sends the result to the display module for display.
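  • As a hedged companion to the encoding sketch above, the following shows how the receiving end could decode such a payload back into action descriptions for the animation control module; the byte layout is the same illustrative assumption, not the patent's actual predetermined rule.

```python
import struct

CODE_TO_ACTION = {1: "eyes_closed", 2: "mouth_open", 3: "head_up",
                  4: "head_down", 5: "turn_left", 6: "turn_right"}

def decode_actions(payload):
    """Inverse of encode_actions: 1-byte count, then 2 bytes per action."""
    (count,) = struct.unpack_from("B", payload, 0)
    actions = []
    for i in range(count):
        code, q = struct.unpack_from("BB", payload, 1 + 2 * i)
        actions.append((CODE_TO_ACTION[code], q / 255.0))  # de-quantize
    return actions  # e.g. [("mouth_open", 0.6), ("turn_left", 0.2)]
```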
  • Example 2: The two parties do not want the other party to see their real faces, but still want to see each other's facial expressions, so only the expression action parameters are transmitted, and an animation is then controlled at the receiving end.
  • The implementation steps are as follows:
  • Step 1: The image acquisition module of the sending end sends the collected image to the feature extraction module for face detection and location of the key facial feature points, including the mouth, eyebrows, eyes, nose and facial contour;
  • Step 2: The feature extraction module performs face detection with the Adaboost algorithm to determine the position and size of the face, and then uses the SDM (Supervised Descent Method) algorithm to extract the coordinates of the facial feature points within the face region;
  • Step 3: Extract the facial expression action parameters and amplitudes from the feature point coordinates and encode them according to a predetermined rule; if no face is detected or the feature point positions cannot be located, default action parameters are used (for the parameter definitions, see Example 1);
  • Step 4: The receiving end decodes the received expression action parameters to obtain the various expression action description parameters;
  • Step 5: The animation control module of the receiving end, according to the expression action description parameters obtained in step 4, drives a local picture or a cartoon animation model to simulate the expression actions and sends the result to the display module for display;
  • for example, the receiving end can display an animated image of a big shark that mimics the sender's mouth-opening expression, as shown in FIG. 5.
  • Example 3: This example performs parameter description and encoded transmission for body movements.
  • The implementation steps are as follows:
  • Step 1: The image acquisition module of the sending end sends the collected image to the feature extraction module for human body detection and localization of body parts, including the head, arms, legs, and waist;
  • Step 2: The feature extraction module extracts features such as the position, direction, shape, and curvature of each body part using a deep learning algorithm or a template matching algorithm;
  • Step 3: From the features extracted in step 2, recognize and judge the movements of the human body, such as moving the left hand forward, kicking the right leg forward, twisting the waist, and gestures, and encode the action parameters and action amplitudes according to the predetermined rule and send them to the receiving end;
  • Step 4: The receiving end decodes the received data according to the predetermined rule to obtain the description parameters of the various actions and action amplitudes;
  • Step 5: The animation control module of the receiving end drives a local human body picture or model according to the parameters obtained in step 4 to simulate the motion, and displays it in the display module.
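  • As an illustration, the following hypothetical check detects one such action (moving the left hand forward) from pose keypoints; the keypoint names and the movement threshold are assumptions for the sketch, not values from the patent.

```python
def detect_left_hand_forward(kp_now, kp_ref):
    """kp_*: dict mapping keypoint name -> (x, y) in image coordinates.
    Returns (detected, amplitude), where the amplitude is the wrist
    displacement normalized by the torso length in the reference frame."""
    torso = abs(kp_ref["neck"][1] - kp_ref["waist"][1]) or 1.0
    dx = kp_now["left_wrist"][0] - kp_ref["left_wrist"][0]
    amplitude = abs(dx) / torso
    return amplitude > 0.2, min(amplitude, 1.0)  # 0.2 is an assumed threshold
```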
  • Example 4: This example performs parameter description and encoded transmission for the scene.
  • The implementation steps are as follows:
  • Step 1: The image acquisition module of the sending end sends the collected image to the feature extraction module for understanding and extraction of the image content;
  • Step 2: The feature extraction module understands the scene through a deep learning algorithm, extracts the objects that can be described, such as tables, landmark buildings, computers, and dogs, and extracts descriptive features such as, for a table, its shape, size, color, height, material, and position in the image;
  • Step 3: Perform parameter description on the object features extracted in step 2, encode the parameters describing the objects according to a predetermined rule, and transmit them to the receiving end;
  • Step 4: The receiving end decodes the received data according to the predetermined rule and obtains the description parameters of each object in the scene;
  • Step 5: The animation control module of the receiving end selects various object models according to the parameters obtained in step 4 to simulate and construct the sending-end scene, so that the receiving end can experience a scene similar to the sending end's, and displays it in the display module.
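  • For illustration, a hypothetical scene description might look like the following; JSON is used here for readability, whereas the patent's predetermined rule would presumably be a more compact encoding, and all attribute names and values are invented for the sketch.

```python
import json

# Each recognized object becomes a small record of descriptive attributes.
scene_objects = [
    {"type": "table", "shape": "rectangular", "size_m": [1.2, 0.6, 0.75],
     "color": "brown", "material": "wood", "position_px": [320, 410]},
    {"type": "dog", "size_m": [0.6, 0.25, 0.4], "color": "white",
     "position_px": [520, 450]},
]

payload = json.dumps(scene_objects).encode("utf-8")
# A few hundred bytes describe the scene; the receiver picks matching models
# from its library and lays them out at the given positions.
```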
  • The foregoing embodiments provide a method and apparatus for implementing video communication.
  • By transmitting descriptions of the character actions, expressions, and scene parameters in the image instead of the image itself, the amount of data transmitted can be greatly reduced, saving users traffic and charges.
  • In addition, controlling an animation to simulate a person's expressions and actions through expression parameters can make video communication more fun and entertaining, protect the user's personal privacy, and enhance the user experience.
  • An embodiment of the present disclosure further provides an apparatus for implementing video communication, including: a processor; and a memory storing instructions executable by the processor. The processor performs actions according to the instructions stored in the memory, the actions including: acquiring images through a camera; performing image recognition on each captured image frame and feature extraction according to the image recognition result; and describing each image frame with parameters according to the features extracted from that frame, encoding the parameter description of each image frame, and sending it to the receiving end.
  • An embodiment of the present disclosure further provides an apparatus for implementing video communication, including: a processor; and a memory storing instructions executable by the processor. The processor performs actions according to the instructions stored in the memory, the actions including: receiving a parameter description of each image frame, decoding the received parameter description, and reconstructing and displaying each frame of the image by simulation according to the decoded parameter description.
  • An embodiment of the present disclosure further provides a computer storage medium in which executable instructions may be stored, the executable instructions being used to perform the method for implementing video communication of any of the above embodiments.
  • The method for implementing video communication provided by the present disclosure can be applied to terminal devices having video capture and communication functions; by transmitting descriptions of the character actions, expressions and scene parameters in the image instead of the image itself during video communication, it can greatly reduce the amount of transmitted data, saving users traffic and charges.

Abstract

Disclosed is a method for implementing video communication, applied to a sending end. The method comprises: collecting images by means of a camera; performing image recognition on each collected image frame, and performing feature extraction according to the image recognition result; and describing the image frames with parameters according to the features extracted from each frame, encoding the parameter description of each image frame, and then sending it to a receiving end. By means of the disclosure, the amount of data transmitted during video communication can be reduced, and traffic and charges can be reduced for users.

Description

Method and apparatus for implementing video communication
Technical field
The present disclosure relates to the field of image processing technologies, and in particular to a method and apparatus for implementing video communication.
Background
At present, the RCS (Rich Communication Suite) service is gradually emerging. The RCS service is a converged communication service based on an enhanced mobile phone address book, integrating communication methods and functions such as voice, video, messaging, presence and content sharing. Using the RCS service, users can update their own presence information (such as personal pictures, mood phrases, recommended links and status), meet communication needs such as instant messaging, chat and file transfer, share video and pictures within a session, and connect to the network side through standard protocol interfaces for registration, authentication, and audio and video call capabilities.
However, video communication, an important part of the RCS service, has not yet truly become popular in people's lives, due to issues such as expensive wireless traffic charges and personal privacy concerns.
Summary of the invention
The technical problem to be solved by the present disclosure is to provide a method and apparatus for implementing video communication that can reduce the amount of data transmitted in video communication, saving users traffic and charges.
The present disclosure provides a method for implementing video communication, applied to a transmitting end, the method including:
collecting images through a camera;
performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result;
describing each image frame with parameters according to the features extracted from that frame, encoding the parameter description of each image frame, and sending it to the receiving end.
Optionally, the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
Optionally, the collecting of images by the camera includes:
taking, by the camera, an image frame serving as a portrait shooting reference image;
and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key facial feature points extracted from it as facial feature point reference positions;
determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
Optionally, the collecting of images by the camera includes:
taking, by the camera, an image frame serving as a portrait shooting reference image;
and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
after the image frame serving as the portrait shooting reference image is captured, taking the positions of the key human body feature points extracted from it as body feature point reference positions;
determining the action parameters and action amplitudes corresponding to each image frame according to the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions.
Optionally, the collecting of images by the camera includes:
taking, by the camera, an image frame serving as a scene shooting reference image;
and the performing of image recognition on each captured image frame and feature extraction according to the image recognition result includes:
using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
Optionally, the method further includes:
sending, to the receiving end, the image frame of the portrait shooting reference image or the image frame of the scene shooting reference image.
An embodiment of the present disclosure provides a method for implementing video communication, applied to a receiving end, the method including:
receiving a parameter description of each image frame, and decoding the received parameter description;
reconstructing each frame of the image by simulation according to the decoded parameter description, and displaying it.
Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
Optionally, the reconstructing of each frame image according to the decoded parameter description includes:
for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains human body action parameters and action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains scene parameters, using two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene, and displaying the reconstructed image frame.
Optionally, the two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
An embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a transmitting end, including:
an image acquisition module, configured to collect images through a camera;
an image recognition and feature extraction module, configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result;
a parameter description and encoding module, configured to describe each image frame with parameters according to the features extracted from that frame, encode the parameter description of each image frame, and send it to the receiving end.
Optionally, the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
the feature extraction includes at least one of the following: expression feature extraction, action feature extraction, and extraction of physical object features in the scene;
the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
and the image recognition and feature extraction module is configured to take, after the portrait shooting reference image frame is captured, the positions of the key facial feature points extracted from it as facial feature point reference positions, and to determine the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a portrait shooting reference image;
and the image recognition and feature extraction module is configured to take, after the portrait shooting reference image frame is captured, the positions of the key human body feature points extracted from it as body feature point reference positions, and to determine the action parameters and action amplitudes corresponding to each image frame according to the positional relationship between the key body feature point positions extracted from that frame and the body feature point reference positions.
Optionally, the image acquisition module is configured to take, by the camera, an image frame serving as a scene shooting reference image;
and the image recognition and feature extraction module is configured to use a deep learning algorithm to understand the scene of each image frame, extract the physical objects that can be described, and perform feature extraction on those objects.
Optionally, the apparatus further includes:
an image sending module, configured to send, to the receiving end, the image frame of the portrait shooting reference image or the image frame of the scene shooting reference image.
An embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a receiving end, including:
a parameter description receiving module, configured to receive a parameter description of each image frame and decode the received parameter description;
an image reconstruction and display module, configured to reconstruct each frame of the image by simulation according to the decoded parameter description and display it.
Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body action parameter description, and a physical object feature parameter description for the scene.
Optionally, the image reconstruction and display module is configured to reconstruct each frame image according to the decoded parameter description, including:
for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the expression of the sender's portrait, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains human body action parameters and action amplitudes, using a two-dimensional picture or a three-dimensional image to simulate and reconstruct the actions of the sender's portrait, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains scene parameters, using two-dimensional pictures or three-dimensional images to simulate and reconstruct the physical objects in the sender's scene, and displaying the reconstructed image frame.
Optionally, the two-dimensional picture or three-dimensional image includes at least one of the following: the sender's portrait shooting reference image, the sender's scene shooting reference image, a picture from a picture library, and an animation model from an animation model library.
Compared with the related art, the present disclosure provides a method and apparatus for implementing video communication in which descriptions of the character actions, expressions and scene parameters in the image are transmitted instead of the images themselves, greatly reducing the amount of data transmitted and saving users traffic and charges. In addition, controlling an animation to simulate a person's expressions and actions through expression parameters can make video communication more fun and entertaining, protect the user's personal privacy, and enhance the user experience.
Brief description of the drawings
FIG. 1 is a flowchart of a method for implementing video communication according to an embodiment of the present disclosure (transmitting end).
FIG. 2 is a flowchart of a method for implementing video communication according to an embodiment of the present disclosure (receiving end).
FIG. 3 is a schematic diagram of an apparatus for implementing video communication according to an embodiment of the present disclosure (transmitting end).
FIG. 4 is a schematic diagram of an apparatus for implementing video communication according to an embodiment of the present disclosure (receiving end).
FIG. 5 is a schematic diagram of an image of a user collected at the transmitting end in Example 2 of the present disclosure (left), and of the receiving end simulating the expression of the transmitting-end user with a cartoon image (a big shark) (right).
具体实施方式detailed description
To make the objectives, technical solutions and advantages of the present disclosure clearer, embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments in the present application and the features in those embodiments may be combined with one another arbitrarily.
As shown in FIG. 1, an embodiment of the present disclosure provides a method for implementing video communication, applied to a transmitting end. The method includes:
S110: capturing images through a camera;
S120: performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result; and
S130: generating a parameter description of each image frame according to the features extracted from the frame, encoding the parameter description of each image frame, and sending it to a receiving end.
The method may further include the following features.
The image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition.
The feature extraction includes at least one of the following: expression feature extraction, motion feature extraction, and extraction of physical-object features in the scene.
The parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
Capturing images through the camera includes:
shooting, through the camera, an image frame serving as a portrait-shooting reference image; and/or
shooting, through the camera, an image frame serving as a scene-shooting reference image.
The Adaboost algorithm may be used for face detection to determine the position and size of the face, after which the SDM (Supervised Descent Method) algorithm is applied in the face region to extract the coordinates of the key facial feature points; a sketch of this detect-then-align flow is given after these notes.
The key feature points of the face include at least one of the following: the eyes, nose, mouth, eyebrows, and face contour.
The facial expression parameters include at least one of the following: eyes closed, eyes open, mouth open, mouth closed, smiling, raising the head, lowering the head, turning the head left, turning the head right, tilting the head toward the left shoulder, and tilting the head toward the right shoulder.
A deep learning algorithm or a template matching algorithm may be used for human body detection, extracting features such as the position, direction, shape and curvature of each key feature point of the human body.
The key feature points of the human body include at least one of the following: the head, arms, hands, legs, feet, and waist.
A deep learning algorithm may be used to understand the scene and extract the features of each physical object in it.
The features of a physical object include its shape, size, color, material, and the like.
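By way of a concrete, non-limiting illustration, the following Python sketch shows the detect-then-align flow described above. It uses the open-source dlib library, whose HOG-based detector and ensemble-regression shape predictor stand in for the Adaboost detector and SDM aligner named in the text (dlib does not implement SDM itself), and it assumes the standard 68-point landmark model file is available locally.

```python
import cv2
import dlib

# Stand-ins for the pipeline above: detect the face first, then regress
# the key feature point coordinates inside the detected region.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def face_landmarks(frame_bgr):
    """Return (x, y) key feature points (eyes, nose, mouth, eyebrows,
    face contour) of the first detected face, or None if no face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if len(faces) == 0:
        return None
    shape = predictor(gray, faces[0])
    return [(p.x, p.y) for p in shape.parts()]
```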
Optionally, performing image recognition on each captured image frame and performing feature extraction according to the image recognition result includes:
after an image frame serving as the portrait-shooting reference image is captured, taking the positions of the key facial feature points extracted from the portrait-shooting reference image frame as facial feature point reference positions; and
determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
When the portrait reference image frame is shot, the user is usually required to look straight ahead, keep the mouth closed without smiling, and stand upright.
Determining the facial expression parameters and expression action amplitudes corresponding to the image frame according to the positional relationship between the key facial feature point positions extracted from each image frame and the facial feature point reference positions includes:
if the distance between the upper and lower eyelids in the current image frame is smaller than an eye-open threshold, determining that the portrait's eyes are closed, and otherwise that they are open; where the eye-open threshold is the distance between the upper and lower eyelids in the portrait-shooting reference image frame minus a first error allowance, or the average of the distances between the upper and lower eyelids in a plurality of portrait-shooting reference image frames minus a second error allowance; the first error allowance is an empirical value, and the second error allowance is a motion recognition sensitivity coefficient multiplied by the variance of the distances between the upper and lower eyelids in the plurality of portrait-shooting reference image frames; and/or
if the distance between the edges of the upper and lower lips in the current image frame is greater than a mouth-closed threshold, determining that the portrait's mouth is open, and otherwise that it is closed; where the mouth-closed threshold is the distance between the edges of the upper and lower lips in the portrait-shooting reference image frame plus a first error allowance, or the average of the distances between the edges of the upper and lower lips in a plurality of portrait-shooting reference image frames plus a second error allowance; the first error allowance is an empirical value, and the second error allowance is a motion recognition sensitivity coefficient multiplied by the variance of the distances between the edges of the upper and lower lips in the plurality of portrait-shooting reference image frames; and/or
if the position of a head-position reference point in the current image frame deviates from the reference position of that head-position reference point, determining the head action and its amplitude according to the distance, angle and direction of the deviation.
The head-position reference point includes at least one of the following: the midpoint between the two eyes, the center of the brow, and the tip of the nose.
The head actions include at least one of the following: raising the head, lowering the head, turning the head left, turning the head right, tilting the head toward the left shoulder, and tilting the head toward the right shoulder.
The method further includes:
sending the portrait-shooting reference image frame to the receiving end.
The descriptions above of action parameters such as eyes open, eyes closed, mouth open and mouth closed are only examples; other expression parameters may be handled in a similar way, and the handling is not limited to the method described above.
Optionally, performing image recognition on each captured image frame and performing feature extraction according to the image recognition result includes:
after an image frame serving as the portrait-shooting reference image is captured, taking the positions of the key human-body feature points extracted from the portrait-shooting reference image frame as human-body feature point reference positions; and
determining the motion parameters and motion amplitudes corresponding to each image frame according to the positional relationship between the key human-body feature point positions extracted from that frame and the human-body feature point reference positions.
The human body motions include, for example, moving the left hand forward, kicking the right leg forward, twisting the waist, and gestures.
Optionally, performing image recognition on each captured image frame and performing feature extraction according to the image recognition result includes:
using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
As shown in FIG. 2, an embodiment of the present disclosure provides a method for implementing video communication, applied to a receiving end. The method includes:
S210: receiving the parameter description of each image frame, and decoding the received parameter description; and
S220: simulating and reconstructing each image frame according to the decoded parameter description, and displaying it.
The method may further include the following features.
Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
Optionally, simulating and reconstructing each image frame according to the decoded parameter description includes:
for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, reconstructing the expression of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains human body motion parameters and motion amplitudes, reconstructing the motion of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains scene parameters, reconstructing each physical object in the transmitting-end scene by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame.
The two-dimensional picture or three-dimensional image includes at least one of the following: a portrait-shooting reference image from the transmitting end, a scene-shooting reference image from the transmitting end, a picture from a picture library, and an animation model from an animation model library. A sketch of such parameter-driven reconstruction is given below.
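As a rough, non-authoritative sketch of how decoded parameters might drive such a picture or model, the Python fragment below assumes a hypothetical animation object exposing blendshape and rotation setters; the parameter field names are illustrative and not part of the disclosed encoding.

```python
from dataclasses import dataclass

@dataclass
class ExpressionParams:
    """Hypothetical decoded per-frame parameters (illustrative fields)."""
    eyes_closed: bool
    mouth_open: bool
    yaw_deg: float    # turn left/right
    pitch_deg: float  # raise/lower the head
    roll_deg: float   # tilt toward a shoulder

def render_frame(model, p: ExpressionParams):
    """Drive a 2D picture or 3D animation model with decoded parameters.
    `model` is an assumed animation object; any backend exposing
    blendshape and rotation setters could play this role."""
    model.set_blendshape("eye_blink", 1.0 if p.eyes_closed else 0.0)
    model.set_blendshape("jaw_open", 1.0 if p.mouth_open else 0.0)
    model.set_rotation(yaw=p.yaw_deg, pitch=p.pitch_deg, roll=p.roll_deg)
```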
As shown in FIG. 3, an embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a transmitting end, including:
an image acquisition module 301, configured to capture images through a camera;
an image recognition and feature extraction module 302, configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result; and
a parameter description and encoding module 303, configured to generate a parameter description of each image frame according to the features extracted from the frame, encode the parameter description of each image frame, and send it to a receiving end.
Optionally, the image recognition includes at least one of the following: face recognition, human body recognition, and scene recognition;
the feature extraction includes at least one of the following: expression feature extraction, motion feature extraction, and extraction of physical-object features in the scene; and
the parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
Optionally, the image acquisition module 301 is configured to capture images through the camera by:
shooting, through the camera, an image frame serving as a portrait-shooting reference image;
and the image recognition and feature extraction module 302 is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key facial feature points extracted from the image frame of the portrait-shooting reference image as facial feature point reference positions; and
determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
Optionally, the image recognition and feature extraction module 302 is configured to determine the facial expression parameters and expression action amplitudes corresponding to the image frame according to the positional relationship between the key facial feature point positions extracted from each image frame and the facial feature point reference positions by:
if the distance between the upper and lower eyelids in the current image frame is smaller than an eye-open threshold, determining that the portrait's eyes are closed, and otherwise that they are open; where the eye-open threshold is the distance between the upper and lower eyelids in the portrait-shooting reference image frame minus a first error allowance, or the average of the distances between the upper and lower eyelids in a plurality of portrait-shooting reference image frames minus a second error allowance; the first error allowance is an empirical value, and the second error allowance is a motion recognition sensitivity coefficient multiplied by the variance of the distances between the upper and lower eyelids in the plurality of portrait-shooting reference image frames; and/or
if the distance between the edges of the upper and lower lips in the current image frame is greater than a mouth-closed threshold, determining that the portrait's mouth is open, and otherwise that it is closed; where the mouth-closed threshold is the distance between the edges of the upper and lower lips in the portrait-shooting reference image frame plus a first error allowance, or the average of the distances between the edges of the upper and lower lips in a plurality of portrait-shooting reference image frames plus a second error allowance; the first error allowance is an empirical value, and the second error allowance is a motion recognition sensitivity coefficient multiplied by the variance of the distances between the edges of the upper and lower lips in the plurality of portrait-shooting reference image frames; and/or
if the position of a head-position reference point in the current image frame deviates from the reference position of that head-position reference point, determining the head action and its amplitude according to the distance, angle and direction of the deviation.
Optionally, the image acquisition module 301 is configured to capture images through the camera by:
shooting, through the camera, an image frame serving as a portrait-shooting reference image;
and the image recognition and feature extraction module 302 is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key human-body feature points extracted from the portrait-shooting reference image frame as human-body feature point reference positions; and
determining the motion parameters and motion amplitudes corresponding to each image frame according to the positional relationship between the key human-body feature point positions extracted from that frame and the human-body feature point reference positions.
Optionally, the image acquisition module 301 is configured to capture images through the camera by:
shooting, through the camera, an image frame serving as a scene-shooting reference image;
and the image recognition and feature extraction module 302 is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
Optionally, the apparatus further includes:
an image sending module 304, configured to send, to the receiving end, an image frame of the portrait-shooting reference image or an image frame of the scene-shooting reference image.
As shown in FIG. 4, an embodiment of the present disclosure provides an apparatus for implementing video communication, applied to a receiving end, including:
a parameter description receiving module 401, configured to receive the parameter description of each image frame and decode the received parameter description; and
an image reconstruction and display module 402, configured to simulate and reconstruct each image frame according to the decoded parameter description and display it.
Optionally, the parameter description includes at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
Optionally, the image reconstruction and display module 402 is configured to simulate and reconstruct each image frame according to the decoded parameter description by:
for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, reconstructing the expression of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains human body motion parameters and motion amplitudes, reconstructing the motion of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
for each image frame, when the decoded parameter description contains scene parameters, reconstructing each physical object in the transmitting-end scene by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame.
Optionally, the two-dimensional picture or three-dimensional image includes at least one of the following: a portrait-shooting reference image from the transmitting end, a scene-shooting reference image from the transmitting end, a picture from a picture library, and an animation model from an animation model library.
Example 1
Wireless data traffic is expensive in mobile video communication, so this example parameterizes, encodes and transmits only the facial expressions. The implementation steps are as follows:
Step 1: the image acquisition module at the transmitting end first transmits a captured image using conventional coding, that is, it sends the current face image to the receiving end for the subsequent controlled display.
Step 2: subsequently captured images are sent to the feature extraction module for face detection and localization of the key facial feature points, including the mouth, eyebrows, eyes, nose and face contour.
Step 3: the feature extraction module performs face detection with the Adaboost algorithm to determine the position and size of the face, and then applies the SDM (Supervised Descent Method) algorithm in the face region to extract the coordinates of the facial feature points.
Step 4: the facial expression action parameters and amplitudes are extracted from the feature point coordinates and encoded according to a predetermined rule; if no face is detected or the feature point positions cannot be located, default action parameters are used.
The calculation of several action parameters is illustrated below.

Video frames in which the user is at rest with a normal expression (no action) are taken as reference frames, and the mean $\overline{DX}$ and average deviation $\sigma_{DX}$ of the facial feature point positions within the N reference frames are computed as follows:

$\overline{DX} = \frac{1}{N}\sum_{i=1}^{N} DX_i$

$\sigma_{DX} = \frac{1}{N}\sum_{i=1}^{N}\left|DX_i - \overline{DX}\right|$

where $\overline{DX}$ denotes the mean of the feature point positions and $\sigma_{DX}$ characterizes the fluctuation of the feature point positions caused by video noise.

Several action calculations are introduced below; action parameters other than the following can be computed and extracted in a similar way:

1) The eye-closing action is determined from the distance $D1$ between the upper and lower eyelids: if $D1 < \overline{D1} - \alpha\,\sigma_{D1}$, an eye-closing action is determined to have occurred, where $\overline{D1}$ is the mean distance between the upper and lower eyelids within the N reference frames, $\sigma_{D1}$ is the deviation of that distance within the N reference frames, and $\alpha$ is the sensitivity coefficient of motion recognition; the smaller its value, the more sensitive the recognition result.

2) The mouth-opening action is determined from the distance $D2$ between the edges of the upper and lower lips: if $D2 > \overline{D2} + \alpha\,\sigma_{D2}$, a mouth-opening action is determined to have occurred, where $\overline{D2}$ is the mean distance between the edges of the upper and lower lips within the N reference frames, $\sigma_{D2}$ is the deviation of that distance within the N reference frames, and $\alpha$ is the sensitivity coefficient of motion recognition; the smaller its value, the more sensitive the recognition result.
3) The magnitude of the head side-tilt angle $\theta_1$ (clockwise or counterclockwise tilt) can be estimated from the horizontal and vertical distances by which the brow center, the midpoint between the eyes, or the nose tip deviates from its reference position; the direction of the angle can be determined from the relative relationship between the current position and the reference position.
4) The magnitude of the head left-right turn angle $\theta_2$ (turning left or right) can be estimated from the horizontal distance by which the brow center, the midpoint between the eyes, or the nose tip deviates from its reference position; the direction of the angle can be determined from the relative relationship between the current position and the reference position.
5) The magnitude of the head up-down rotation angle $\theta_3$ (raising or lowering the head) can be estimated from the vertical distance by which the brow center, the midpoint between the eyes, or the nose tip deviates from its reference position; the direction of the angle can be determined from the relative relationship between the current position and the reference position. A sketch of the action and angle calculations in 1) to 5) is given below.
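The following is a minimal Python sketch of calculations 1) to 5), assuming the average-deviation form of $\sigma$ reconstructed above; the sensitivity coefficient value, the linear pixels-per-degree scale, and the use of the eye-to-eye slope for the side-tilt angle are illustrative assumptions, not part of the disclosure.

```python
import math
import numpy as np

ALPHA = 1.5  # sensitivity coefficient; smaller -> more sensitive (assumed value)

def reference_stats(distances):
    """Mean and average deviation of a feature distance (e.g. D1 or D2)
    over the N neutral reference frames, per the formulas above."""
    d = np.asarray(distances, dtype=float)
    mean = d.mean()
    return mean, np.abs(d - mean).mean()

def is_eye_closed(d1, d1_mean, d1_dev, alpha=ALPHA):
    # 1) eye-closing: D1 < mean(D1) - alpha * sigma(D1)
    return d1 < d1_mean - alpha * d1_dev

def is_mouth_open(d2, d2_mean, d2_dev, alpha=ALPHA):
    # 2) mouth-opening: D2 > mean(D2) + alpha * sigma(D2)
    return d2 > d2_mean + alpha * d2_dev

def head_angles(nose, nose_ref, eye_left, eye_right, px_per_deg=4.0):
    """3)-5): rough head-angle estimates. Yaw and pitch follow the text
    (horizontal / vertical drift of the nose tip from its reference
    position); roll is taken from the slope of the eye-to-eye line, an
    equivalent of the side-tilt estimate."""
    yaw = (nose[0] - nose_ref[0]) / px_per_deg    # theta2: + right, - left
    pitch = (nose_ref[1] - nose[1]) / px_per_deg  # theta3: + up (image y grows downward)
    roll = math.degrees(math.atan2(eye_right[1] - eye_left[1],
                                   eye_right[0] - eye_left[0]))  # theta1
    return yaw, pitch, roll
```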
Step 5: the receiving end decodes the received action parameters to obtain the various action description parameters.
Step 6: the animation control module at the receiving end, according to the parameters obtained in Step 4, drives the picture obtained in Step 1, or a 3D model built from the image obtained in Step 1, to simulate the actions and expressions, and sends the result to the display module for display.
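The "predetermined rule" used for encoding in Steps 4 and 5 is not specified here. Purely to illustrate the bandwidth argument, the sketch below packs one frame's expression parameters into 7 bytes, versus tens of kilobytes for a conventionally coded video frame; the layout and field set are assumptions.

```python
import struct

FLAG_EYES_CLOSED = 0x01
FLAG_MOUTH_OPEN = 0x02

def encode_params(eyes_closed, mouth_open, yaw, pitch, roll):
    """One flag byte plus three signed 16-bit angles = 7 bytes/frame."""
    flags = ((FLAG_EYES_CLOSED if eyes_closed else 0)
             | (FLAG_MOUTH_OPEN if mouth_open else 0))
    # Angles carried in hundredths of a degree.
    return struct.pack("<Bhhh", flags,
                       int(yaw * 100), int(pitch * 100), int(roll * 100))

def decode_params(payload):
    flags, yaw, pitch, roll = struct.unpack("<Bhhh", payload)
    return (bool(flags & FLAG_EYES_CLOSED),
            bool(flags & FLAG_MOUTH_OPEN),
            yaw / 100.0, pitch / 100.0, roll / 100.0)
```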
Example 2
In some video communication scenarios, to protect privacy and add fun, the two parties do not want each other to see their real faces and only want to convey their expressions and actions. In that case only the expression action parameters are transmitted, and an animation is driven at the receiving end to simulate the expressions and actions. The implementation steps are as follows:
Step 1: the image acquisition module at the transmitting end sends the captured image to the feature extraction module for face detection and localization of the key facial feature points, including the mouth, eyebrows, eyes, nose and face contour.
Step 2: the feature extraction module performs face detection with the Adaboost algorithm to determine the position and size of the face, and then applies the SDM (Supervised Descent Method) algorithm in the face region to extract the coordinates of the facial feature points.
Step 3: the facial expression action parameters and amplitudes are extracted from the feature point coordinates and encoded according to a predetermined rule; if no face is detected or the feature point positions cannot be located, default action parameters are used; the parameters are defined as in Example 1.
Step 4: the receiving end decodes the received expression action parameters to obtain the various expression action description parameters.
Step 5: the animation control module at the receiving end, according to the expression action description parameters obtained in Step 4, drives a local picture or cartoon animation model to simulate the expressions and actions, and sends the result to the display module for display.
As shown in FIG. 5, the receiving end may display an animated image of a big shark to simulate the wide-open mouth of the transmitting-end user.
Example 3
In this example, body motions are parameterized, encoded and transmitted. The implementation steps are as follows:
Step 1: the image acquisition module at the transmitting end sends the captured image to the feature extraction module for human body detection and localization of body parts, including the head, arms, legs and waist.
Step 2: the feature extraction module extracts features such as the position, direction, shape and curvature of each body part through a deep learning algorithm or a template matching algorithm.
Step 3: the body motions, such as moving the left hand forward, kicking the right leg forward, twisting the waist and gestures, are recognized and judged from the features extracted in Step 2, and these motion parameters and motion amplitudes are encoded according to a predetermined rule and sent to the receiving end.
Step 4: the receiving end decodes the received data according to the predetermined rule to obtain the description parameters of the various motions and motion amplitudes.
Step 5: the animation control module at the receiving end, according to the parameters obtained in Step 4, drives a local human body image or model to simulate the motions, and displays the result on the display module.
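As an illustration of the recognition in Step 3, the sketch below classifies one body action from 2D keypoints. The keypoint dictionaries are assumed to come from any pose estimator that reports joint positions, and the pixel threshold is an arbitrary illustrative value.

```python
def detect_left_hand_forward(keypoints, ref_keypoints, threshold_px=30.0):
    """Classify a 'left hand forward' action by comparing the left-hand
    keypoint with its reference (neutral standing) position.
    `keypoints` / `ref_keypoints` map part names (head, hands, legs,
    waist, ...) to (x, y) positions. Returns (detected, amplitude)."""
    x, y = keypoints["left_hand"]
    rx, ry = ref_keypoints["left_hand"]
    amplitude = ((x - rx) ** 2 + (y - ry) ** 2) ** 0.5
    return amplitude > threshold_px, amplitude
```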
Example 4
In this example, the scene is parameterized, encoded and transmitted. The implementation steps are as follows:
Step 1: the image acquisition module at the transmitting end sends the captured image to the feature extraction module for image content understanding and extraction.
Step 2: the feature extraction module understands the scene through a deep learning algorithm, extracts objects that can be described, such as tables, landmark buildings, computers and dogs, and extracts and describes their features, for example a table's shape, size, color, height, material, and position in the image.
Step 3: the physical-object features extracted in Step 2 are turned into parameter descriptions, and these parameters describing the objects are encoded according to a predetermined rule and sent to the receiving end.
Step 4: the receiving end decodes the received data according to the predetermined rule to obtain the description parameters of each physical object in the scene.
Step 5: the animation control module at the receiving end, according to the parameters obtained in Step 4, selects local models of the various objects to simulate and construct the transmitting-end scene, so that the receiving end can experience a scene roughly the same as the transmitting end's, and displays it on the display module.
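As an illustration of Steps 3 and 4, the sketch below serializes the object descriptors; JSON stands in for the unspecified "predetermined rule" (a real system would likely choose a more compact binary encoding), and the field set mirrors the features named in Step 2.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SceneObject:
    """Illustrative descriptor mirroring the object features in Step 2."""
    label: str      # e.g. "table", "landmark building", "computer", "dog"
    shape: str
    size: str
    color: str
    material: str
    bbox: tuple     # (x, y, w, h) position in the image

def encode_scene(objects):
    # Serialize all object descriptors for one frame.
    return json.dumps([asdict(o) for o in objects]).encode("utf-8")

def decode_scene(payload):
    # Recover descriptors at the receiving end to drive model selection.
    return [SceneObject(**d) for d in json.loads(payload.decode("utf-8"))]
```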
The method and apparatus for implementing video communication provided by the above embodiments transmit descriptions of the person's motions and expressions and of the scene parameters in the image, instead of the image itself, which greatly reduces the amount of transmitted data and saves traffic and tariff costs for the user. In addition, driving an animation with the expression parameters to simulate the person's expressions and motions makes video communication more interesting and entertaining, protects the user's personal privacy, and improves the user experience.
In another embodiment of the present disclosure, a device for implementing video communication is further provided, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to perform, according to the instructions stored in the memory, actions including: capturing images through a camera; performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result; and generating a parameter description of each image frame according to the features extracted from the frame, encoding the parameter description of each image frame, and sending it to a receiving end.
In yet another embodiment of the present disclosure, a device for implementing video communication is further provided, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to perform, according to the instructions stored in the memory, actions including: receiving the parameter description of each image frame and decoding the received parameter description; and simulating and reconstructing each image frame according to the decoded parameter description and displaying it.
An embodiment of the present disclosure further provides a computer storage medium storing executable instructions for performing the method for implementing video communication of any one of the above embodiments.
Those of ordinary skill in the art will understand that all or some of the steps of the above methods may be completed by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented with one or more integrated circuits; accordingly, each module/unit in the above embodiments may be implemented in the form of hardware or in the form of a software functional module. The present disclosure is not limited to any specific combination of hardware and software.
It should be noted that the present disclosure may have various other embodiments. Without departing from the spirit and essence of the present disclosure, those skilled in the art may make various corresponding changes and modifications according to the present disclosure, and all such changes and modifications shall fall within the protection scope of the appended claims.
INDUSTRIAL APPLICABILITY
The method for implementing video communication provided by the present disclosure can be applied to terminal devices having video capture and communication functions. By transmitting descriptions of the person's motions and expressions and of the scene parameters in the image, instead of the image itself, during video communication, it can greatly reduce the amount of transmitted data and save traffic and tariff costs for the user.

Claims (20)

  1. A method for implementing video communication, applied to a transmitting end, the method comprising:
    capturing images through a camera;
    performing image recognition on each captured image frame, and performing feature extraction according to the image recognition result; and
    generating a parameter description of each image frame according to the features extracted from the frame, encoding the parameter description of each image frame, and sending it to a receiving end.
  2. The method according to claim 1, wherein:
    the image recognition comprises at least one of the following: face recognition, human body recognition, and scene recognition;
    the feature extraction comprises at least one of the following: expression feature extraction, motion feature extraction, and extraction of physical-object features in the scene; and
    the parameter description comprises at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
  3. The method according to claim 2, wherein:
    capturing images through the camera comprises:
    shooting, through the camera, an image frame serving as a portrait-shooting reference image; and
    performing image recognition on each captured image frame and performing feature extraction according to the image recognition result comprises:
    after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key facial feature points extracted from the image frame of the portrait-shooting reference image as facial feature point reference positions; and
    determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
  4. The method according to claim 2, wherein:
    capturing images through the camera comprises:
    shooting, through the camera, an image frame serving as a portrait-shooting reference image; and
    performing image recognition on each captured image frame and performing feature extraction according to the image recognition result comprises:
    after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key human-body feature points extracted from the portrait-shooting reference image frame as human-body feature point reference positions; and
    determining the motion parameters and motion amplitudes corresponding to each image frame according to the positional relationship between the key human-body feature point positions extracted from that frame and the human-body feature point reference positions.
  5. The method according to claim 2, wherein:
    capturing images through the camera comprises:
    shooting, through the camera, an image frame serving as a scene-shooting reference image; and
    performing image recognition on each captured image frame and performing feature extraction according to the image recognition result comprises:
    using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
  6. The method according to any one of claims 2 to 5, further comprising:
    sending, to the receiving end, an image frame of the portrait-shooting reference image or an image frame of the scene-shooting reference image.
  7. A method for implementing video communication, applied to a receiving end, the method comprising:
    receiving a parameter description of each image frame, and decoding the received parameter description; and
    simulating and reconstructing each image frame according to the decoded parameter description, and displaying it.
  8. The method according to claim 7, wherein:
    the parameter description comprises at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
  9. The method according to claim 8, wherein simulating and reconstructing each image frame according to the decoded parameter description comprises:
    for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, reconstructing the expression of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
    for each image frame, when the decoded parameter description contains human body motion parameters and motion amplitudes, reconstructing the motion of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
    for each image frame, when the decoded parameter description contains scene parameters, reconstructing each physical object in the transmitting-end scene by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame.
  10. The method according to claim 9, wherein:
    the two-dimensional picture or three-dimensional image comprises at least one of the following: a portrait-shooting reference image from the transmitting end, a scene-shooting reference image from the transmitting end, a picture from a picture library, and an animation model from an animation model library.
  11. An apparatus for implementing video communication, applied to a transmitting end, comprising:
    an image acquisition module, configured to capture images through a camera;
    an image recognition and feature extraction module, configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result; and
    a parameter description and encoding module, configured to generate a parameter description of each image frame according to the features extracted from the frame, encode the parameter description of each image frame, and send it to a receiving end.
  12. The apparatus according to claim 11, wherein:
    the image recognition comprises at least one of the following: face recognition, human body recognition, and scene recognition;
    the feature extraction comprises at least one of the following: expression feature extraction, motion feature extraction, and extraction of physical-object features in the scene; and
    the parameter description comprises at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
  13. The apparatus according to claim 12, wherein:
    the image acquisition module is configured to capture images through the camera by:
    shooting, through the camera, an image frame serving as a portrait-shooting reference image; and
    the image recognition and feature extraction module is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
    after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key facial feature points extracted from the image frame of the portrait-shooting reference image as facial feature point reference positions; and
    determining the facial expression parameters and expression action amplitudes corresponding to each image frame according to the positional relationship between the key facial feature point positions extracted from that frame and the facial feature point reference positions.
  14. The apparatus according to claim 12, wherein:
    the image acquisition module is configured to capture images through the camera by:
    shooting, through the camera, an image frame serving as a portrait-shooting reference image; and
    the image recognition and feature extraction module is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
    after the image frame serving as the portrait-shooting reference image is captured, taking the positions of the key human-body feature points extracted from the portrait-shooting reference image frame as human-body feature point reference positions; and
    determining the motion parameters and motion amplitudes corresponding to each image frame according to the positional relationship between the key human-body feature point positions extracted from that frame and the human-body feature point reference positions.
  15. The apparatus according to claim 12, wherein:
    the image acquisition module is configured to capture images through the camera by:
    shooting, through the camera, an image frame serving as a scene-shooting reference image; and
    the image recognition and feature extraction module is configured to perform image recognition on each captured image frame and perform feature extraction according to the image recognition result by:
    using a deep learning algorithm to understand the scene of each image frame, extracting the physical objects that can be described, and performing feature extraction on those objects.
  16. The apparatus according to any one of claims 12 to 15, further comprising:
    an image sending module, configured to send, to the receiving end, an image frame of the portrait-shooting reference image or an image frame of the scene-shooting reference image.
  17. An apparatus for implementing video communication, applied to a receiving end, comprising:
    a parameter description receiving module, configured to receive the parameter description of each image frame and decode the received parameter description; and
    an image reconstruction and display module, configured to simulate and reconstruct each image frame according to the decoded parameter description and display it.
  18. The apparatus according to claim 17, wherein:
    the parameter description comprises at least one of the following: a facial expression parameter description, a human body motion parameter description, and a description of physical-object features in the scene.
  19. The apparatus according to claim 18, wherein the image reconstruction and display module is configured to simulate and reconstruct each image frame according to the decoded parameter description by:
    for each image frame, when the decoded parameter description contains facial expression parameters and expression action amplitudes, reconstructing the expression of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
    for each image frame, when the decoded parameter description contains human body motion parameters and motion amplitudes, reconstructing the motion of the transmitting-end portrait by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame; and/or
    for each image frame, when the decoded parameter description contains scene parameters, reconstructing each physical object in the transmitting-end scene by simulation using a two-dimensional picture or a three-dimensional image, and displaying the reconstructed image frame.
  20. The apparatus according to claim 19, wherein:
    the two-dimensional picture or three-dimensional image comprises at least one of the following: a portrait-shooting reference image from the transmitting end, a scene-shooting reference image from the transmitting end, a picture from a picture library, and an animation model from an animation model library.
PCT/CN2017/081956 2016-06-06 2017-04-26 Method and apparatus for implementing video communication WO2017211139A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610394389.1A CN107465885A (en) 2016-06-06 2016-06-06 A kind of method and apparatus for realizing video communication
CN201610394389.1 2016-06-06

Publications (1)

Publication Number Publication Date
WO2017211139A1 true WO2017211139A1 (en) 2017-12-14

Family

ID=60544535

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/081956 WO2017211139A1 (en) 2016-06-06 2017-04-26 Method and apparatus for implementing video communication

Country Status (2)

Country Link
CN (1) CN107465885A (en)
WO (1) WO2017211139A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921773A (en) * 2018-07-04 2018-11-30 百度在线网络技术(北京)有限公司 Human body tracking processing method, device, equipment and system
CN112235531A (en) * 2020-10-15 2021-01-15 北京字节跳动网络技术有限公司 Video processing method, device, terminal and storage medium

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102256110B1 (en) * 2017-05-26 2021-05-26 라인 가부시키가이샤 Method for image compression and method for image restoration
CN110276232A (en) * 2018-03-16 2019-09-24 东方联合动画有限公司 A kind of data processing method based on social scene, system
CN110799986B (en) * 2018-04-25 2020-09-18 北京嘀嘀无限科技发展有限公司 System and method for blink action recognition based on facial feature points
CN110769323B (en) * 2018-07-27 2021-06-18 Tcl科技集团股份有限公司 Video communication method, system, device and terminal equipment
CN109151430B (en) * 2018-09-30 2020-07-28 Oppo广东移动通信有限公司 Data processing method, terminal, server and computer storage medium
CN109302598B (en) * 2018-09-30 2021-08-31 Oppo广东移动通信有限公司 Data processing method, terminal, server and computer storage medium
CN109246409B (en) * 2018-09-30 2020-08-04 Oppo广东移动通信有限公司 Data processing method, terminal, server and computer storage medium
CN111131744B (en) * 2019-12-26 2021-04-20 杭州当虹科技股份有限公司 Privacy protection method based on video communication
CN112804245B (en) * 2021-01-26 2023-09-26 杨文龙 Data transmission optimization method, device and system suitable for video transmission

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101674363A (en) * 2009-09-23 2010-03-17 中兴通讯股份有限公司 Mobile equipment and talking method
CN102271241A (en) * 2011-09-02 2011-12-07 北京邮电大学 Image communication method and system based on facial expression/action recognition
CN103369289A (en) * 2012-03-29 2013-10-23 深圳市腾讯计算机系统有限公司 Communication method of video simulation image and device
CN103647922A (en) * 2013-12-20 2014-03-19 百度在线网络技术(北京)有限公司 Virtual video call method and terminals
CN104766041A (en) * 2014-01-07 2015-07-08 腾讯科技(深圳)有限公司 Image recognition method, device and system
CN104935860A (en) * 2014-03-18 2015-09-23 北京三星通信技术研究有限公司 Method and device for realizing video calling

Also Published As

Publication number Publication date
CN107465885A (en) 2017-12-12

Similar Documents

Publication Publication Date Title
WO2017211139A1 (en) Method and apparatus for implementing video communication
US11595617B2 (en) Communication using interactive avatars
US20170310934A1 (en) System and method for communication using interactive avatar
KR102506738B1 (en) snow texture inpainting
US11836866B2 (en) Deforming real-world object using an external mesh
JP7101749B2 (en) Mediation devices and methods, as well as computer-readable recording media {MEDIATING APPARATUS, METHOD AND COMPANY REDABLE RECORDING MEDIA FORM THEREOF}
US11790614B2 (en) Inferring intent from pose and speech input
US20220125337A1 (en) Adaptive skeletal joint smoothing
KR20230003555A (en) Texture-based pose validation
US20240062500A1 (en) Generating ground truths for machine learning
US20230120037A1 (en) True size eyewear in real time
WO2023121896A1 (en) Real-time motion and appearance transfer
WO2023121897A1 (en) Real-time garment exchange
WO2022146799A1 (en) Compressing image-to-image models
CN112804245A (en) Data transmission optimization method, device and system suitable for video transmission
KR20200134623A (en) Apparatus and Method for providing facial motion retargeting of 3 dimensional virtual character
US20240070950A1 (en) Avatar call on an eyewear device
US20240007585A1 (en) Background replacement using neural radiance field
KR20240049844A (en) Control AR games on fashion items
CN115499612A (en) Video communication method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17809579

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17809579

Country of ref document: EP

Kind code of ref document: A1