WO2021139706A1 - Image processing method, device, and system - Google Patents

Image processing method, device, and system

Info

Publication number
WO2021139706A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
user
facial
images
frame
Application number
PCT/CN2021/070579
Other languages
English (en)
French (fr)
Inventor
梁运恺
高扬
叶威威
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021139706A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/141: Systems for two-way working between two video terminals, e.g. videophone
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302: Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/47: End-user applications
    • H04N21/478: Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788: Supplemental services, e.g. displaying phone caller identification, shopping application, communicating with other users, e.g. chatting
    • H04N21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85: Assembly of content; Generation of multimedia applications
    • H04N21/854: Content authoring
    • H04N21/8547: Content authoring involving timestamps for synchronizing content

Definitions

  • This application relates to the field of video technology, and in particular to an image processing method, device and system.
  • Compared with voice calls, video calls are a more effective way of remote communication and interaction.
  • Besides voice, a video call can convey information such as body movements and facial expressions, making the communication between the two parties more in-depth.
  • The traditional video method is a real-time one: the local end uses a camera to capture, in real time, the people and background participating in the video, generates a video stream, and transmits the video stream over the network to the remote end so that the remote end can present the video.
  • However, a high-resolution video stream demands high network transmission bandwidth, so it is difficult for traditional video methods to achieve real-time high-quality video calls.
  • When bandwidth is insufficient, the video picture suffers from packet loss, blurring, and similar artifacts.
  • As a result, video calls made with traditional video methods perform poorly, which degrades the user experience.
  • In view of this, the present application provides an image processing method, device, and system, so as to reduce the requirement for network transmission bandwidth and thereby improve the video call effect and user experience.
  • In a first aspect, the present application provides an image processing method, including: acquiring a first frame of facial image of a user, where the first frame of facial image includes multiple facial organ images; acquiring multiple first images that match the multiple facial organ images; and sending a data packet of the user's first frame of facial image to the receiving end, where the data packet includes indexes of the multiple first images, and the indexes are used to obtain the multiple first images.
  • Since the sending end does not need to send the user's first frame of facial image itself to the receiving end, but only a data packet containing the indexes of the multiple first images, the requirement for network bandwidth is reduced; that is, even with limited network transmission bandwidth, good video quality can still be ensured.
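  • To make the bandwidth argument concrete, the following is a minimal sketch of what such an index-only data packet could look like; the field names, sizes, and layout are illustrative assumptions, not a format defined by this application:

```python
import struct
from dataclasses import dataclass

@dataclass
class FaceFramePacket:
    """Hypothetical index-only facial image packet (illustrative layout)."""
    timestamp: int            # 32-bit RTP-style timestamp
    organ_indexes: list[int]  # one sample-library index per facial organ

    def serialize(self) -> bytes:
        # Header: timestamp + index count; body: one 16-bit index per organ.
        header = struct.pack("!IH", self.timestamp, len(self.organ_indexes))
        body = struct.pack(f"!{len(self.organ_indexes)}H", *self.organ_indexes)
        return header + body

# With 70 organ indexes, a frame costs 4 + 2 + 70 * 2 = 146 bytes,
# versus the megabits per second an encoded 2K video stream needs.
pkt = FaceFramePacket(timestamp=0, organ_indexes=[0] * 70)
assert len(pkt.serialize()) == 146
```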
  • Optionally, the multiple facial organ images are images of the user's real facial organs, and the multiple first images are images of the user's virtual facial organs. Since the first images are virtual, the user's personal privacy is protected, which broadens the applicability of the technical solution of the present application.
  • Optionally, acquiring the multiple first images that match the multiple facial organ images includes: for each facial organ image among the multiple facial organ images, comparing the facial organ image with the standard organ image corresponding to that facial organ and determining a first difference value; then acquiring the first image matching the facial organ image according to the first difference value, where the second difference value, i.e., the difference value between that first image and the standard organ image, and the first difference value satisfy a first condition.
  • Optionally, the above method further includes the sending end sending at least one audio data packet to the receiving end, where the timestamp of each audio data packet matches the timestamp of the data packet of the user's first frame of facial image. On this basis, the user experiences hearing and vision in synchronization.
  • Optionally, the above method further includes: acquiring a second frame of facial image of the user, where the user's second frame of facial image is earlier than the user's first frame of facial image; and acquiring multiple second images that match the multiple facial organ images of the user's second frame of facial image.
  • The data packet of the user's second frame of facial image is sent to the receiving end and includes indexes of the multiple second images, which are used to obtain the multiple second images. Since the sending end does not need to send the user's second frame of facial image itself, but only a data packet of indexes, the requirement for network bandwidth is again reduced.
  • Optionally, the above method further includes: receiving instruction information sent by the receiving end, where the instruction information instructs the sending end to send a facial image earlier than the user's first frame of facial image, that is, to send the data packet of such an earlier facial image. In other words, the sending end does not have to send an earlier facial image in all cases, which reduces the consumption of communication resources.
  • In a second aspect, the present application provides an image processing method, including: receiving a data packet of a user's first frame of facial image from a sending end, where the data packet includes indexes of multiple first images, the user's first frame of facial image includes multiple facial organ images, and the multiple first images match the multiple facial organ images; acquiring the multiple first images; and generating the first frame of facial image of the receiving end according to the multiple first images.
  • Since the sending end does not need to send the user's first frame of facial image itself, but only a data packet containing the indexes of the multiple first images, the requirement for network bandwidth is reduced; that is, even with limited network transmission bandwidth, good video quality can still be ensured.
  • Optionally, the multiple facial organ images are images of the user's real facial organs, and the multiple first images are images of the user's virtual facial organs.
  • Optionally, the above method further includes: receiving at least one audio data packet from the sending end, where the timestamp of each audio data packet matches the timestamp of the data packet of the user's first frame of facial image. On this basis, the user experiences hearing and vision in synchronization.
  • Optionally, the above method further includes: receiving a data packet of the user's second frame of facial image from the sending end, where the user's second frame of facial image is earlier than the user's first frame of facial image, the data packet includes indexes of multiple second images, and the multiple second images match the multiple facial organ images included in the user's second frame of facial image. Since the sending end only needs to send a data packet of indexes rather than the image itself, the requirement for network bandwidth is again reduced.
  • Optionally, the above method further includes: sending instruction information to the sending end, where the instruction information instructs the sending end to send a facial image earlier than the user's first frame of facial image. That is, the sending end sends such an earlier facial image only when it receives the instruction information, which reduces the consumption of communication resources.
  • Optionally, the above method further includes: if the first frame of facial image of the receiving end has already been generated, discarding the data packet of the user's second frame of facial image. The receiving end then does not need to generate a second facial image, which reduces its power consumption.
  • Optionally, the above method further includes: if the receiving-end third facial image corresponding to the user's third frame of facial image has not been generated, where the user's third frame of facial image is earlier than the user's second frame of facial image, generating the second facial image of the receiving end according to the data packet of the user's second frame of facial image.
  • Optionally, the receiving end uses AR/VR technology to generate the video background image, so that the first facial images of multiple receiving ends can be merged into one background scene, which improves user experience and interactivity.
  • the present application provides an image processing device, including: a first acquiring module, a second acquiring module, and a first sending module.
  • the first acquisition module is used to acquire a first frame of facial image of the user, and the first frame of facial image of the user includes a plurality of facial organ images.
  • the second acquiring module is used to acquire multiple first images matching multiple facial organ images.
  • The first sending module is used to send the data packet of the user's first frame of facial image to the receiving end, where the data packet includes indexes of multiple first images, and the indexes are used to obtain the multiple first images.
  • the present application provides an image processing device, including: a first receiving module, a first acquiring module, and a first generating module.
  • The first receiving module is configured to receive a data packet of the user's first frame of facial image from the sending end, where the data packet includes indexes of multiple first images, the user's first frame of facial image includes multiple facial organ images, and the multiple first images match the multiple facial organ images.
  • the first acquisition module is used to acquire multiple first images.
  • the first generating module is used for generating a first frame of facial image of the receiving end according to a plurality of first images.
  • this application provides a terminal device, including: a memory and a processor.
  • The memory stores instructions executable by the processor, and when executed by the processor the instructions cause the processor to perform the method of any one of the first aspect, the second aspect, the optional manners of the first aspect, and the optional manners of the second aspect.
  • The present application further provides a computer-readable storage medium storing computer instructions, where the computer instructions are used to cause a computer to execute the method of any one of the first aspect, the second aspect, the optional manners of the first aspect, and the optional manners of the second aspect.
  • The present application further provides a computer program product storing computer instructions, where the computer instructions are used to cause the computer to execute the method of any one of the first aspect, the second aspect, the optional manners of the first aspect, and the optional manners of the second aspect.
  • In summary, this application provides an image processing method, device, and system.
  • An image sample library is configured at both the sending end and the receiving end, and image indexes into the sample library are transferred between the two ends to realize image transfer, thereby reducing the bandwidth required for network transmission and further improving the video call effect and user experience.
  • The video scene is built on AR or VR technology, and virtual characters and virtual video scenes are used to deliver rich expression and posture information, thereby protecting the user's personal privacy.
  • Moreover, the receiving end uses AR/VR technology to generate the video background image, so that the first facial images of multiple receiving ends can be merged into one background scene, which improves user experience and interactivity.
  • FIG. 1 is a system architecture diagram provided by an embodiment of the application;
  • FIG. 2 is a flowchart of an image processing method provided by an embodiment of the application;
  • FIG. 3 is a schematic diagram of an image processing process provided by an embodiment of the application;
  • FIG. 4 is a flowchart of an image processing method provided by another embodiment of the application;
  • FIG. 5 is a schematic diagram of an audio data packet sequence and a facial image data packet sequence provided by an embodiment of the application;
  • FIG. 6 is a schematic diagram of a first data packet and a first buffer queue provided by an embodiment of the application;
  • FIG. 7 is a flowchart of a method for a receiving end to process facial image data packets according to an embodiment of the application;
  • FIG. 8 is a schematic diagram of image processing provided by an embodiment of the application;
  • FIG. 9 is a schematic diagram of image processing provided by another embodiment of the application;
  • FIG. 10 is a schematic diagram of image processing provided by still another embodiment of the application;
  • FIG. 11 is a schematic diagram of an image processing device provided by an embodiment of the application;
  • FIG. 12 is a schematic diagram of an image processing device provided by another embodiment of the application;
  • FIG. 13 is a schematic diagram of a terminal device provided by an embodiment of the application;
  • FIG. 14 is a schematic diagram of an image processing system provided by an embodiment of the application.
  • As mentioned above, the traditional video method is a real-time one: the local end uses a camera to capture, in real time, the people and background participating in the video, generates a video stream, and transmits the video stream over the network to the remote end so that the remote end can present the video.
  • Because high-resolution video streams demand high bandwidth, the effect of using traditional video methods for video calls is not good, which affects the user experience.
  • Moreover, traditional video methods tend to expose personal privacy, such as a person's clothing, location, or mental state, resulting in a narrow range of use.
  • this application provides an image processing method, device and system.
  • The main idea of this application is to configure an image sample library at both the sending end and the receiving end, and to transfer image indexes into the sample library between the two ends to realize image transmission, thereby reducing the bandwidth required for network transmission.
  • In addition, the video scene is built on augmented reality (AR) or virtual reality (VR) technology, and virtual characters and video scenes are used to deliver rich expression and posture information.
  • The technical solutions of the embodiments of this application can be applied to various communication systems and networks, such as the third generation (3G), fourth generation (4G), and fifth generation (5G) mobile communication systems, new radio (NR), or wireless fidelity (WiFi).
  • FIG. 1 is a system architecture diagram provided by an embodiment of the application.
  • both the sending end 11 and the receiving end 12 have cameras, through which image collection can be performed.
  • On the signaling side, the sending end 11 and the receiving end 12 use the session initiation protocol (SIP); on the media side, they use the real-time transport protocol (RTP) or the real-time transport control protocol (RTCP). The sending end 11 therefore uses RTP or RTCP to send data packets of facial images to the receiving end 12.
  • Specifically, the sending end 11 can call a real-time network (RTN) software development kit (SDK) to send the data packets of facial images to the server 13 through the RTN.
  • Correspondingly, the receiving end 12 calls the RTN SDK to receive the data packets of facial images, parses them according to the RTP data packet format, and uses a graphics processing unit (GPU) or network processing unit (NPU) to realize the three-dimensional (3D) image rendering function.
  • In FIG. 1, the dashed frame around the GPU/NPU indicates that the GPU/NPU is inside the terminal device rather than shown on the terminal device's display screen.
  • the aforementioned terminal device may be a mobile phone or an AR/VR device, for example, a VR head-mounted display device, AR glasses, and the like.
  • Alternatively, the sending end and receiving end may transmit data without going through the server, that is, they can be directly connected for data transmission.
  • In that case, the sending end calls the RTN SDK to send data packets of facial images to the receiving end through the RTN,
  • and the receiving end calls the RTN SDK to receive them, parses them according to the RTP data packet format, and implements the 3D image rendering function through the GPU or NPU according to the parsed data.
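  • Since the media side uses RTP, a receiver must at minimum parse the fixed 12-byte RTP header defined in RFC 3550 to recover the sequence number and timestamp used later in this description. The sketch below shows that parsing; it is background illustration, not code from this application:

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed 12-byte RTP header (RFC 3550).

    Returns the fields a receiver needs in order to reorder and
    synchronize facial image data packets.
    """
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,         # must be 2 for RTP
        "payload_type": b1 & 0x7F,  # identifies the media format
        "marker": (b1 >> 7) & 0x1,
        "sequence_number": seq,     # detects loss and reordering
        "timestamp": ts,            # drives audio/video synchronization
        "ssrc": ssrc,               # identifies the media source
    }
```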
  • FIG. 2 is a flowchart of an image processing method provided by an embodiment of the application.
  • The method involves a sending end and a receiving end, which may be two different terminal devices, such as two different mobile phones; or the sending end may be a mobile phone and the receiving end an AR/VR device; or the sending end an AR/VR device and the receiving end a mobile phone; and so on. This application does not restrict this.
  • the method includes the following steps:
  • Step S201: The sending end acquires a first frame of facial image of the user, where the first frame of facial image includes multiple facial organ images.
  • Step S202 The sending end acquires multiple first images that match multiple facial organ images.
  • Step S203 The sending end sends a data packet of the user's first frame of facial image to the receiving end, the data packet includes indexes of multiple first images, and the indexes of the multiple first images are used to obtain multiple first images.
  • Step S204 The receiving end acquires multiple first images.
  • Step S205 The receiving end generates a first frame of facial image of the receiving end according to the multiple first images.
  • Specifically, the sending end uses its own camera, such as a front camera, to capture the user, and can obtain multiple frames of facial images.
  • The user's "first frame of facial image" here denotes the current facial image; it may or may not be the first frame actually captured. The "first" merely distinguishes it from the "second frame of facial image" mentioned below and has no other meaning.
  • the multiple facial organ images included in the user's first frame of facial image are all images of the user's real facial organs.
  • The aforementioned facial organs may be facial organs at a coarser granularity, such as the eyes, nose, mouth, and ears, or at a finer granularity, such as the eyeball, the white of the eye, the eyelashes, the left and right wings of the nose, and the bridge of the nose.
  • The so-called first image matching a facial organ image means that the facial organ features presented by the first image are similar to those presented by the facial organ image.
  • Optionally, the first image may be one that satisfies the following condition: its difference from the facial organ image is the smallest, or the absolute value of its difference from the facial organ image is less than a preset threshold.
  • Alternatively, with the difference value between the facial organ image and its corresponding standard organ image denoted the first difference value, and the difference value between the first image and the standard organ image denoted the second difference value, the first image is one for which the difference between the second difference value and the first difference value is the smallest, or for which the absolute value of that difference is less than a preset threshold.
  • The standard organ image corresponding to any facial organ image is the standard image for that organ; for example, if the facial organ is the eye, the corresponding standard organ image is the standard eye image.
  • The first image in this application is an image of a virtual facial organ of the user, that is, a virtual facial organ image. Such a virtual image can be understood as, for example, an image of a cartoon character's facial organ or of a celebrity's facial organ.
  • Optionally, the multiple first images are acquired in the following manner: for each facial organ image among the multiple facial organ images, a first image is acquired whose difference from that facial organ image is the smallest.
  • For example, if the user's first frame of facial image shows the user laughing, the multiple facial organ images include images of the eyebrows, squinted eyes, nose, upturned mouth, and ears. For the image of the squinted eyes, the image is compared with at least one eye image in the sample library, and the eye image with the smallest difference is taken as the first image.
  • Alternatively, the first image may be one whose difference from the facial organ image has an absolute value less than a preset threshold, where the threshold can be set according to actual conditions. Again taking the user's laughing first frame as an example: for the squinted-eye image, compare it with at least one eye image in the sample library, and take as the first image an eye image whose difference from it has an absolute value below the preset threshold.
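  • A minimal sketch of this direct matching, assuming grayscale organ images as NumPy arrays and the sum of absolute pixel differences as the difference value (one of the measures described further below):

```python
import numpy as np

def sad(a: np.ndarray, b: np.ndarray) -> float:
    """Sum of absolute pixel differences between two equally sized images."""
    return float(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def match_direct(organ_img: np.ndarray,
                 sample_library: list[np.ndarray],
                 threshold: float | None = None) -> int:
    """Return the sample-library index of the matching first image.

    Without a threshold, pick the candidate with the smallest difference;
    with one, pick the first candidate whose difference falls below it.
    """
    diffs = [sad(organ_img, candidate) for candidate in sample_library]
    if threshold is not None:
        for idx, d in enumerate(diffs):
            if abs(d) < threshold:
                return idx
    return int(np.argmin(diffs))
```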
  • Alternatively, the facial organ image is compared with its corresponding standard organ image to determine the first difference value, and the first image corresponding to the facial organ image is acquired according to the first difference value, where the difference between the second difference value and the first difference value is the smallest. For example, if the user's first frame of facial image shows the user laughing, the multiple facial organ images include images of the eyebrows, squinted eyes, nose, upturned mouth, and ears. For the squinted-eye image, determine the first difference value between that image and the standard eye image, determine the second difference value between each of at least one eye image in the sample library and the standard eye image, and take as the first image the eye image for which the difference between the second difference value and the first difference value is the smallest.
  • Alternatively, the first image may be one for which the absolute value of the difference between its second difference value and the first difference value is less than a preset threshold, where the threshold can be set according to actual conditions. Again taking the user's laughing first frame as an example: determine the first difference value between the squinted-eye image and the standard eye image, determine the second difference value between each of at least one eye image in the sample library and the standard eye image, and take as the first image an eye image for which the absolute value of that difference is below the preset threshold.
  • Optionally, the sending end may determine the first difference value between the facial organ image and its corresponding standard organ image in the following ways, though it is not limited to these:
  • Method 1: The sending end obtains the pixel values of multiple first pixels in the facial organ image and the pixel values of multiple second pixels in each standard organ image in the sample library, where the first pixels correspond one-to-one with the second pixels. Then, for each standard organ image, the sending end calculates the absolute values of the differences between the pixel values of corresponding first and second pixels, and adds all the absolute values to obtain the first difference value.
  • Method 2: The pixels are obtained in the same way, but for each standard organ image the sending end sums the squares of all the absolute differences to obtain the first difference value.
  • The method for calculating the second difference value is the same as that for the first difference value and is not repeated in this application.
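  • The two calculations just described are sums of absolute differences and sums of squared differences. A hedged sketch of both, together with the standard-image-relative matching they support (array shapes and the sample-library layout are assumptions):

```python
import numpy as np

def difference_value(a: np.ndarray, b: np.ndarray, squared: bool = False) -> float:
    """Difference value between two equally sized organ images.

    squared=False: add all absolute pixel differences (Method 1 above).
    squared=True:  sum the squares of the absolute differences (Method 2).
    """
    d = np.abs(a.astype(np.int64) - b.astype(np.int64))
    return float((d ** 2).sum() if squared else d.sum())

def match_via_standard(organ_img: np.ndarray,
                       standard_img: np.ndarray,
                       sample_library: list[np.ndarray]) -> int:
    """Pick the library image whose second difference value (distance to the
    standard organ image) is closest to the organ image's own first
    difference value against that standard image."""
    d1 = difference_value(organ_img, standard_img)
    d2 = [difference_value(cand, standard_img) for cand in sample_library]
    return int(np.argmin([abs(d - d1) for d in d2]))
```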
  • each of the above-mentioned standard organ images and/or each first image may be in the local sample library of the sending end or in the sample library in the cloud, which is not limited in this application.
  • It should be understood that the indexes of the multiple first images correspond one-to-one with the multiple first images.
  • Optionally, each index is a floating-point value, and the number of indexes of the multiple first images is in the range [70, 312].
  • Alternatively, each index is an integer value.
  • Optionally, the first images may be stored in the sample library in the form of facial organ feature values. If the receiving end stores the feature values of the first images, the receiving end generates its first frame of facial image according to the feature values corresponding to the multiple first images.
  • Fig. 3 is a schematic diagram of the image processing process provided by an embodiment of the application.
  • As shown in FIG. 3, the receiving end stores the various indexes (indexes 1, 2, ..., 70 shown in FIG. 3) in a local sample library or a cloud sample library. The numbers are not the indexes themselves; they merely distinguish the first images of the facial organs (such as eyes, mouth, nose, and cheeks) corresponding to these 70 indexes. The local or cloud sample library of the receiving end stores the first image of each facial organ together with the index of each first image.
  • The receiving end can determine each first image from its index. For example, if the receiving end receives the index of the first image corresponding to squinted eyes, it determines the first image of the squinted eyes from that index.
  • Option 1: After the receiving end obtains the multiple first images, it renders them through a 3D model to generate the first frame of facial image of the receiving end, which is a virtual image.
  • Option 2: To handle cases where the data packet of the user's first frame of facial image does not include indexes for all facial organs, or where some indexes are lost while the data packet is being transmitted,
  • the receiving end may also obtain data packets of at least one other facial image of the user (the user's second frame of facial image is taken as an example below).
  • The data packet of the user's second frame of facial image includes indexes of multiple second images of multiple facial organs; the multiple second images can be determined through these indexes and are also virtual images.
  • The receiving end can then combine the data packet of the user's first frame of facial image and the data packet of the user's second frame of facial image to generate the first frame of facial image of the receiving end.
  • "combining the data packet of the user's first frame of facial image and the user's second frame of facial image to generate the first facial image of the receiving end” means: if the receiving end receives the user's first facial image There is an index corresponding to a certain facial organ in the data packet of the frame facial image, the first image corresponding to the facial organ is obtained through the index, and the first image is used as a component of the first facial image of the receiving end; if the receiving end receives The received data packet of the user's first frame of facial image does not include the index corresponding to a certain facial organ, and the data packet of the user's second frame of facial image includes the index corresponding to the facial organ, then the receiving end obtains the facial organ through the index Corresponding image, and use this image as a component of the first frame of the face image at
  • the image corresponding to the facial organ is obtained through the index, and Use this image as a component of the first face image at the receiving end. If the data packet of a facial image received earliest does not include the index corresponding to a certain facial organ, and the data packet of the subsequent facial image or the data packet of the first facial image of the user includes the index corresponding to the facial organ, the receiving end Obtain the image corresponding to the facial organ through the index, and use the image as a component of the first frame of the facial image at the receiving end.
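  • A toy sketch of this per-organ fallback; representing each packet as a mapping from a hypothetical organ ID to a sample-library index is an assumption made purely for illustration:

```python
def merge_frames(packets_newest_first: list[dict[int, int]],
                 organ_ids: list[int]) -> dict[int, int]:
    """For each facial organ, take its index from the newest packet that
    carries one; older packets only fill gaps left by lost indexes."""
    merged: dict[int, int] = {}
    for organ in organ_ids:
        for packet in packets_newest_first:
            if organ in packet:
                merged[organ] = packet[organ]
                break
    return merged

# The current frame lost the index for organ 2; the previous frame fills it.
current = {0: 17, 1: 42}            # organ ID -> sample-library index
previous = {0: 16, 1: 40, 2: 58}
assert merge_frames([current, previous], [0, 1, 2]) == {0: 17, 1: 42, 2: 58}
```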
  • the receiving end generates a video background image through AR/VR technology.
  • The receiving end uses AR/VR technology to generate the video background image, so that the first facial images of the receiving ends of multiple users can be merged into one background scene.
  • Optionally, the receiving end may select a video background image adapted to its first frame of facial image. For example, for a facial image of a cartoon character, the receiving end selects a cartoon background image; for a facial image of a celebrity, the receiving end selects a poster of a film or television work featuring that celebrity as the video background image.
  • It should be understood that the first frame of facial image at the receiving end has a corresponding relationship with the video background image, and the relationship may be one-to-one, one-to-many, many-to-one, or many-to-many.
  • For example, the first facial image of the receiving end may correspond to one video background image or to multiple video background images; in the latter case, the receiving end can select a video background image from among them arbitrarily or according to a preset rule.
  • Likewise, multiple first facial images of the receiving end may correspond to one video background image or to multiple video background images, from which the receiving end can select one arbitrarily or according to a preset rule.
  • Optionally, the receiving end can also rotate, zoom, or otherwise transform its first frame of facial image, and can add special effects such as expressions or gestures to the facial image to increase interest.
  • In summary, the present application provides an image processing method in which the sending end does not need to send the user's first frame of facial image to the receiving end; it only needs to send a data packet including the indexes of multiple first images.
  • Thereby, the requirement for network bandwidth can be reduced; that is, even with limited network transmission bandwidth, good video quality can still be ensured.
  • By contrast, current traditional video occupies a great deal of bandwidth for high-definition, high-frame-rate pictures.
  • For example, a traditional video method transmitting 2K video frames at 30 frames per second (FPS) with H.264 encoding requires about 8 megabits per second (Mbps) of bandwidth during transmission.
  • With the image processing method provided by this application, where the sending end only sends data packets including the index corresponding to each facial organ, presenting 2K-quality video at the receiving end requires far less bandwidth for the data packets of the user's facial images.
  • Based on this, the present application can also collect facial image data packets at frame rates of 60 FPS, 90 FPS, or even greater than 500 FPS, so as to present video images more coherently and finely.
  • In addition, the image processing method provided by this application does not expose personal privacy such as a person's clothing, location, or mental state, so the scope of application of the technical solution can be expanded.
  • Further, the receiving end uses AR/VR technology to generate a video background image, so that the first facial images of multiple receiving ends can be merged into one background scene, which improves user experience and interactivity.
  • FIG. 4 is a flowchart of an image processing method provided by another embodiment of the application. As shown in FIG. 4, the image processing method further includes the following steps:
  • Step S401: The sending end acquires a first frame of facial image of the user, where the first frame of facial image includes multiple facial organ images.
  • Step S402 The sending end acquires multiple first images that match multiple facial organ images.
  • Step S403 The sending end sends a data packet of the user's first frame of facial image to the receiving end, where the data packet includes indexes of multiple first images, and the indexes of the multiple first images are used to obtain multiple first images.
  • Step S404 The receiving end acquires multiple first images.
  • Step S405 The receiving end generates a first frame of facial image of the receiving end according to the multiple first images.
  • Step S406 The sending end sends at least one audio data packet to the receiving end.
  • Step S407: The receiving end displays its first frame of facial image and synchronously plays the above at least one audio data packet.
  • steps S401 to S405 are the same as steps S201 to S205, and the content can refer to the content of steps S201 to S205, which will not be repeated here.
  • In step S406, the timestamp of the at least one audio data packet matches the timestamp of the data packet of the user's first frame of facial image.
  • The phrase "the timestamp of at least one audio data packet matches the timestamp of the data packet of the user's first frame of facial image" means: the timestamp of each of the audio data packets is greater than or equal to the timestamp of the data packet of the user's first frame of facial image, and less than the timestamp of the next facial image data packet after it.
  • For example, if the timestamp of the data packet of the user's first frame of facial image is n and the timestamp of the next facial image data packet is n+3000, then the matching audio data packets are those with timestamps n, n+160, n+320, ..., and n+2880.
  • It should be understood that the timestamp in any audio or facial image data packet reflects the sampling instant of the first octet of that data packet, and the timestamp occupies 32 bits.
  • The sending end can set the initial value of the timestamp at random, for example, to n. Assuming the data packet of the user's first frame of facial image is the first facial image data packet in this video, its timestamp is n, and the timestamp of the first audio data packet among the above at least one audio data packet is also n.
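  • The matching rule can be stated compactly in code; the sketch below ignores 32-bit timestamp wraparound and assumes the tick values used in the example above:

```python
def audio_packets_for_frame(audio_timestamps: list[int],
                            frame_ts: int,
                            next_frame_ts: int) -> list[int]:
    """Audio packets matching a facial image packet are those whose
    timestamps lie in [frame_ts, next_frame_ts), per the rule above."""
    return [ts for ts in audio_timestamps if frame_ts <= ts < next_frame_ts]

# Audio every 160 ticks, facial frames every 3000 ticks (as in FIG. 5).
n = 0
audio = [n + 160 * k for k in range(40)]
matched = audio_packets_for_frame(audio, n, n + 3000)
assert matched == [n + 160 * k for k in range(19)]  # n, n+160, ..., n+2880
```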
  • Specifically, the sending end obtains multiple audio data packets according to the collection frequency of audio data packets, and obtains multiple facial image data packets according to the collection frequency of facial image data packets.
  • FIG. 5 is a schematic diagram of an audio data packet sequence and a facial image data packet sequence provided by an embodiment of the application. The first row is an audio data packet sequence composed of multiple audio data packets, and the second row is a facial image data packet sequence composed of multiple frames of facial image data packets.
  • The timestamp of the audio data packet of frame T is n, that of frame T+1 is n+160, ..., that of frame T+18 is n+2880, that of frame T+19 is n+3040, ..., and that of frame T+38 is n+6080.
  • The timestamp of the facial image data packet of frame T is n, that of frame T+1 is n+3000, and that of frame T+2 is n+6000.
  • In step S407, when the receiving end generates its first frame of facial image, it also generates the timestamp of that image, which may be the timestamp of the data packet of the user's first frame of facial image. The receiving end then adopts the same criterion as the sending end to determine the audio data packets matching its facial image: for example, for the receiving end's first facial image with timestamp n, the matching audio data packets are those with timestamps n, n+160, n+320, ..., and n+2880.
  • The first frame of facial image at the receiving end and the aforementioned at least one audio data packet need to be synchronized; therefore, the terminal device plays the content of the at least one audio data packet while displaying the first frame of facial image. For example, while displaying its first facial image, it simultaneously plays the audio data packets with timestamps n, n+160, n+320, ..., and n+2880.
  • It should be understood that part of step S406 can be performed at the same time as step S403, and the rest after step S403. For example, the first audio data packet among the at least one audio data packet needs to be sent to the receiving end at the same time as the data packet of the user's first frame of facial image, while the remaining audio data packets are sent after it.
  • Based on the above, the receiving end can play the matching audio data packets while displaying its first frame of facial image, so that the user perceives hearing and vision in synchronization.
  • Optionally, the receiving end also receives a data packet of the user's second frame of facial image from the sending end, where the user's second frame of facial image is earlier than the user's first frame of facial image, that is, the generation time of the user's second frame of facial image is earlier than that of the user's first frame.
  • The data packet of the user's second frame of facial image includes indexes of multiple second images; the user's second frame of facial image includes multiple facial organ images, and the multiple second images match those facial organ images.
  • In a first possibility, the sending end may send the user's first frame of facial image and the user's second frame of facial image to the receiving end separately; for example, it sends the user's first frame first and then the user's second frame.
  • In a second possibility, the sending end may send the user's first frame of facial image and the user's second frame of facial image to the receiving end together. For example, the sending end may send a first data packet to the receiving end, where the first data packet includes the data packet of the user's first frame of facial image and the data packet of the user's second frame of facial image.
  • It should be noted that, in this application, sending a facial image can also be understood as sending the data packet of that facial image.
  • Optionally, the receiving end may send instruction information to the sending end, where the instruction information instructs the sending end to send a facial image earlier than the user's first frame of facial image; the sending end then sends the data packet of the user's second frame of facial image to the receiving end according to the instruction information.
  • Specifically, the instruction information may indicate that a facial image earlier than the user's first frame should be sent together with the user's first frame. Considering that always sending the earlier facial image together with the user's first frame would increase the sending end's transmission burden, the receiving end may send the instruction information only after it has repeatedly failed to receive consecutive facial image data packets.
  • In some cases, the receiving end does not need the data packet of the user's second frame of facial image. For example, if the receiving end has already generated its first frame of facial image from the data packet of the user's first frame, it does not need to generate a second receiving-end facial image from the user's second frame, and it discards the data packet of the user's second frame of facial image.
  • Otherwise, if the receiving-end facial image corresponding to the user's third frame of facial image has not been generated, the receiving end can generate its second facial image according to the data packet of the user's second frame of facial image, where the generation time of the data packet of the user's third frame is earlier than that of the user's second frame.
  • The synchronization waiting time refers to the length of time the receiving end waits for a facial image data packet that arrives late; it can be 20 milliseconds, 30 milliseconds, and so on, which is not limited in this application.
  • As mentioned above, the sending end may send the data packet of the user's first frame of facial image and that of the user's second frame of facial image to the receiving end together, where the two data packets are continuous in time.
  • FIG. 6 is a schematic diagram of the first data packet and the first buffer queue provided by an embodiment of the application.
  • As shown in FIG. 6, the receiving end's first buffer queue stores the received facial image data packets of frames T-7 through T-3; because the facial image data packets of frames T-2 and T-1 were lost, the first buffer queue does not store them. The first data packet includes the facial image data packets of frames T, T-1, T-2, and T-3.
  • The facial image data packet of frame T may be the aforementioned data packet of the user's first frame of facial image, and that of frame T-1 the aforementioned data packet of the user's second frame of facial image.
  • In this case, the receiving end adds the facial image data packets of frames T-1 and T-2 to the first buffer queue to remedy the packet loss.
  • As described above, the receiving end can, after repeatedly failing to receive consecutive facial image data packets, send the sending end the instruction information indicating that a facial image earlier than the user's first frame should be carried when the user's first frame is sent. That is, when the sending end receives the instruction information, it carries the data packet of the user's first frame of facial image and the data packet of the user's second frame of facial image in the first data packet; when it has not received the instruction information, it does not carry the user's second frame of facial image when sending the user's first frame.
  • Specifically, the receiving end can set a network state variable S, whose initial value is 0.
  • When the first data packet received by the receiving end includes the user's first frame of facial image and the user's second frame of facial image, the receiving end selectively puts the data packet of the user's second frame of facial image into the first buffer queue. When S reaches N+1, that is, the receiving end has received N+1 consecutive facial image data packets, the receiving end sends the sending end another piece of indication information to indicate that a facial image earlier than the user's first frame no longer needs to be carried when the user's first frame is sent.
  • For ease of description, the instruction information used to indicate that facial images earlier than the current facial image should be carried when sending the current facial image is referred to as the first instruction information, and the instruction information used to indicate that they need not be carried is referred to as the second instruction information.
  • The first instruction information can also be understood as an instruction to add, when sending the current facial image, the carrying of facial images earlier than it; the second instruction information, as an instruction to reduce such carrying.
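  • One plausible reading of this mechanism as a small state machine is sketched below; the exact increment, decrement, and reset rules are assumptions beyond what the text states:

```python
class NetworkState:
    """Count consecutive in-order (S > 0) or gapped (S < 0) receptions and
    emit an instruction when a threshold of N + 1 is crossed."""

    def __init__(self, n: int):
        self.n = n
        self.s = 0  # network state variable S, initial value 0

    def on_packet(self, consecutive: bool) -> str | None:
        if consecutive:
            self.s = self.s + 1 if self.s >= 0 else 1
        else:
            self.s = self.s - 1 if self.s <= 0 else -1
        if self.s >= self.n + 1:         # N + 1 consecutive packets in a row
            self.s = 0
            return "second_instruction"  # stop bundling earlier frames
        if self.s <= -(self.n + 1):      # N + 1 gaps in a row
            self.s = 0
            return "first_instruction"   # start bundling earlier frames
        return None
```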
  • FIG. 7 is a flowchart of a method for processing facial image data packets at a receiving end according to an embodiment of the application. As shown in FIG. 7, the execution subject of the method is the receiving end, and the method includes the following steps:
  • Step S701 Receive a data packet of the user's first frame of facial image.
  • Step S702: Determine whether the data packet of the user's first frame of facial image and the previously received facial image data packet are consecutive data packets. If so, execute step S703; otherwise, execute step S707.
  • Step S704 Determine whether S reaches N+1, if yes, execute step S705, if not, execute step S706.
  • Step S706 Buffer the data packet of the user's first frame of facial image in the first buffer queue.
  • If the first data packet contains multiple frames of facial images, the data packet of the user's first frame of facial image is taken out of the first data packet and buffered in the first buffer queue.
  • For example, the user's first frame of facial image is the facial image of frame T, the user's second frame of facial image is that of frame T-1, and the facial image data packets of frames T, T-1, and T-2 are packaged and sent in the first data packet; the receiving end stores only the data packet of the facial image of frame T in the first buffer queue.
  • Step S708 Determine whether S reaches -(N+1), if yes, execute step S709, if not, execute step S710.
  • Step S710 Determine whether the first data packet includes a data packet of the user's first frame of facial image and a data packet of the user's second frame of facial image, if yes, perform step S711, if not, perform step S714.
  • Step S711: Determine whether the facial image with the earliest generation time in the first data packet is later than the facial image with the latest generation time in the first buffer queue. If so, execute step S712; if not, execute step S713.
  • Step S712 Add the data packet of the facial image in the first data packet to the first buffer queue.
  • For example, the above first data packet includes the facial image data packets of frames T, T-1, and T-2; the facial image with the latest generation time in the first buffer queue is that of frame T-3, and the facial image of frame T-2 is later than that of frame T-3. The receiving end therefore adds the facial image data packets of frames T, T-1, and T-2 to the first buffer queue.
  • Step S713: Add to the first buffer queue those facial image data packets in the first data packet that are later than the facial image with the latest generation time in the first buffer queue.
  • For example, the user's first frame of facial image is the facial image of frame T, the user's second frame of facial image is that of frame T-1, and the first data packet includes the facial image data packets of frames T, T-1, T-2, and T-3; the facial image with the latest generation time in the first buffer queue is that of frame T-3. In this case, the receiving end adds the facial image data packets of frames T, T-1, and T-2 to the first buffer queue and discards the data packet of the facial image of frame T-3 in the first data packet.
  • Step S714: Determine whether the user's first frame of facial image is earlier than the latest facial image in the first buffer queue. If so, execute step S715; otherwise, execute step S716.
  • Step S715 Discard the data packet of the user's first frame of facial image.
  • Step S716 Buffer the data packet of the user's first frame of facial image in the first buffer queue.
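  • Steps S710 through S716 condense to the following sketch, with packets reduced to frame numbers (larger means newer) and the S-variable bookkeeping of steps S702-S709 handled separately, as in the state-machine sketch above:

```python
from collections import deque

def handle_first_data_packet(first_packet: list[int],
                             buffer_queue: deque) -> None:
    """first_packet: frame numbers in the received bundle (larger = newer).
    buffer_queue: frame numbers already buffered, oldest first."""
    latest_buffered = max(buffer_queue, default=-1)
    if len(first_packet) > 1:                    # bundle of frames (S710: yes)
        if min(first_packet) > latest_buffered:  # S711 yes -> S712: keep all
            buffer_queue.extend(sorted(first_packet))
        else:                                    # S713: keep only newer frames
            buffer_queue.extend(sorted(f for f in first_packet
                                       if f > latest_buffered))
    else:                                        # single frame (S714 to S716)
        frame = first_packet[0]
        if frame > latest_buffered:              # S716: buffer it
            buffer_queue.append(frame)
        # else S715: discard, it is older than everything buffered

q = deque([93, 94, 95, 96, 97])                  # frames T-7 .. T-3 (T = 100)
handle_first_data_packet([100, 99, 98, 97], q)   # bundle T, T-1, T-2, T-3
assert list(q) == [93, 94, 95, 96, 97, 98, 99, 100]  # the T-3 copy is dropped
```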
  • the receiving end can select 2 to 3 frames of facial image data packets from the first buffer queue and buffer them in the second buffer queue for rendering.
  • FIG. 8 is a schematic diagram of image processing provided by an embodiment of the application.
  • As shown in FIG. 8, the receiving end has received the data packet of the facial image of frame T but has not yet stored it in the first buffer queue, which holds the facial image data packets of frames T-1 through T-7. When the receiving end generates its first frame of facial image, it schedules only the three facial image data packets of frames T, T-1, and T-2 into the second buffer queue, and clears the facial image data packets of frames T-7 through T-3 from the first buffer queue.
  • The rendering module in the receiving end can start rendering from the facial image of frame T-2 and proceed frame by frame. After the three frames of facial image data packets in the second buffer queue have been rendered, the second buffer queue continues to fetch facial image data packets from the first buffer queue.
  • Optionally, the receiving end's refresh rate of the second buffer queue may be 30 frames per second, as long as it is ensured that the rendering module can obtain 2 to 3 frames of facial image data packets each time.
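  • A sketch of this two-queue handoff under the stated assumptions (batches of three frames, rendering the oldest of the batch first):

```python
from collections import deque

def refill_render_queue(first_queue: deque, render_queue: deque,
                        batch: int = 3) -> None:
    """Move the newest `batch` frames into the render queue (oldest of the
    batch first) and drop the stale frames left in the first buffer queue."""
    if not render_queue and first_queue:
        newest = sorted(first_queue)[-batch:]
        render_queue.extend(newest)  # e.g., render T-2, then T-1, then T
        first_queue.clear()          # frames older than the batch are cleared

first_q = deque(range(93, 101))      # frames T-7 .. T with T = 100
render_q: deque = deque()
refill_render_queue(first_q, render_q)
assert list(render_q) == [98, 99, 100]  # frames T-2, T-1, T
```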
  • In summary, the data packet of the user's first frame of facial image and the data packet of the user's second frame can be carried in one data packet, where the user's second frame of facial image is continuous in time with the user's first frame. This prevents loss of facial image data packets and, on that basis, improves the quality of the first frame of facial image at the receiving end.
  • Further, the receiving end can send the instruction information to the sending end after repeatedly failing to receive consecutive facial image data packets, instructing that a facial image earlier than the user's first frame be sent together with the user's first frame of facial image.
  • When the sending end receives the instruction information, it sends the user's second frame of facial image together with the user's first frame; when it has not received the instruction information, it does not carry the user's second frame when sending the user's first frame, which reduces the sending end's transmission burden.
  • the receiving end can choose to discard the user’s second frame of facial image, and cache the user’s first frame of facial image data packet in the first cache queue; or choose to select the user’s second frame of facial image data packet and user The data packet of the first frame of facial image is buffered in the first buffer queue.
  • FIG. 9 is a schematic diagram of image processing provided by another embodiment of this application.
  • In the case shown in FIG. 9, the receiving end discards the user's second frame of facial image.
  • The receiving end first received the data packet of the T-th frame of facial image and has already buffered the T-th frame in the second buffer queue for rendering; it then received the data packets of the (T-1)-th and (T-2)-th frames of facial image.
  • To keep out-of-order facial image data packets out of the first buffer queue, the receiving end discards the data packets of the (T-1)-th and (T-2)-th frames of facial image.
  • The rendering module therefore obtains skipped frames, that is, the data packets of the T-th, (T-3)-th, and (T-4)-th frames of facial image.
  • Because the receiving end refreshes the second buffer queue at a relatively high rate, this does not affect the look and feel of the video call at the receiving end.
  • FIG. 10 is a schematic diagram of image processing provided by still another embodiment of this application.
  • In the case shown in FIG. 10, the receiving end adds the user's second frame of facial image to the first buffer queue.
  • The receiving end first received the data packet of the T-th frame of facial image but has not yet buffered the (T-3)-th frame in the second buffer queue for rendering; it then received the data packets of the (T-1)-th and (T-2)-th frames of facial image.
  • To preserve the continuity of facial image data packets in the first buffer queue, the receiving end adds the data packets of the (T-1)-th and (T-2)-th frames of facial image to the first buffer queue.
  • The rendering module can subsequently obtain the data packets of the T-th, (T-1)-th, and (T-2)-th frames of facial image, ensuring the continuity of the facial images rendered at the receiving end.
  • In other words, when frames arrive out of order, the user's second frame of facial image should have been received before the user's first frame of facial image but, because of delay, arrives after it. If the user's first frame of facial image has already been used to generate the first frame of facial image of the receiving end, the user's second frame of facial image is discarded; if the third frame of facial image of the receiving end has not yet been generated, where the user's third frame of facial image is earlier than the user's second frame of facial image, the user's second frame of facial image is added to the first buffer queue, that is, the second frame of facial image of the receiving end is generated according to the user's second frame of facial image. A sketch of this out-of-order decision follows.
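A minimal sketch of this out-of-order decision, assuming a rendered_up_to marker for the newest frame already used for rendering and a sorted first buffer queue of frame numbers; the names are illustrative assumptions:

import bisect

first_queue = []        # sorted list of buffered frame numbers
rendered_up_to = None   # newest frame number already scheduled for rendering

def on_late_frame(frame_no):
    """Handle a frame that arrives after a newer frame was already received."""
    if rendered_up_to is not None and frame_no <= rendered_up_to:
        # A newer frame was already used to generate a receiving-end image:
        # discard the late frame (the FIG. 9 case).
        return
    # The late frame is still ahead of what has been rendered: keep the
    # first buffer queue continuous by inserting it in order (the FIG. 10 case).
    bisect.insort(first_queue, frame_no)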
  • The receiving end can also combine the data packet of the user's first frame of facial image with the data packets of at least one other frame of the user's facial image to generate the first frame of facial image of the receiving end.
  • This application does not limit how many frames of the user's facial image data packets are used to generate a receiving-end facial image.
  • Fig. 11 is a schematic diagram of an image processing device provided by an embodiment of this application.
  • The image processing device is part or all of the foregoing sending end. As shown in Fig. 11, the device includes:
  • the first acquisition module 1101, configured to acquire the user's first frame of facial image, which includes multiple facial organ images;
  • the second acquisition module 1102, configured to acquire multiple first images matching the multiple facial organ images;
  • the first sending module 1103, configured to send the data packet of the user's first frame of facial image to the receiving end;
  • the data packet of the user's first frame of facial image includes the indexes of the multiple first images, which are used to obtain the multiple first images.
  • the multiple facial organ images are images of real facial organs of the user
  • the multiple first images are images of virtual facial organs of the user.
  • The second acquisition module 1102 is specifically configured to: for each facial organ image among the multiple facial organ images, compare the facial organ image with the standard organ image corresponding to that facial organ image to determine a first difference value.
  • The first image matching the facial organ image is then acquired according to the first difference value, where the second difference value, between the matching first image and the standard organ image, and the first difference value satisfy a first condition. A minimal matching sketch follows.
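A minimal sketch of this matching rule, assuming grayscale images as NumPy arrays, a sum-of-absolute-pixel-differences metric (one of the options the description names), and a per-organ library of candidate first images keyed by index; all of these are illustrative assumptions:

import numpy as np

def difference(img, standard):
    # Sum of absolute pixel-value differences between corresponding pixels.
    return float(np.abs(img.astype(np.int32) - standard.astype(np.int32)).sum())

def match_first_image(organ_img, standard_img, library):
    """Return (index, image) of the library entry whose difference to the
    standard organ image is closest to that of the captured organ image."""
    d1 = difference(organ_img, standard_img)          # first difference value
    best = min(library.items(),
               key=lambda kv: abs(difference(kv[1], standard_img) - d1))
    return best  # the first condition here: minimal |d2 - d1|

The first condition is taken here as minimizing the absolute gap between the two difference values; a threshold test, as the description also allows, would work the same way.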
  • Optionally, the device further includes: a second sending module 1104, configured to send at least one audio data packet to the receiving end, where the timestamp of the audio data packet matches the timestamp of the data packet of the user's first frame of facial image.
  • The device further includes:
  • the third acquisition module 1105, configured to acquire the user's second frame of facial image, which is earlier than the user's first frame of facial image;
  • the fourth acquisition module 1106, configured to acquire multiple second images that match the multiple facial organ images of the user's second frame of facial image;
  • the third sending module 1107, configured to send the data packet of the user's second frame of facial image to the receiving end;
  • the data packet of the user's second frame of facial image includes the indexes of the multiple second images, which are used to obtain the multiple second images.
  • Optionally, the device further includes: a receiving module 1108, configured to receive indication information sent by the receiving end, where the indication information is used to instruct the sending of a facial image earlier than the user's first frame of facial image.
  • The image processing device provided in this application can be used to execute the image processing method corresponding to the above-mentioned sending end.
  • For its content and effects, refer to the method embodiments, which are not repeated here.
  • Fig. 12 is a schematic diagram of an image processing device provided by another embodiment of this application.
  • The image processing device is part or all of the above-mentioned receiving end. As shown in Fig. 12, the device includes:
  • the first receiving module 1201, configured to receive the data packet of the user's first frame of facial image from the sending end, where the data packet includes the indexes of multiple first images, the user's first frame of facial image includes multiple facial organ images, and the multiple first images match the multiple facial organ images;
  • the first acquisition module 1202, configured to acquire the multiple first images;
  • the first generation module 1203, configured to generate the first frame of facial image of the receiving end according to the multiple first images.
  • the multiple facial organ images are images of real facial organs of the user
  • the multiple first images are images of virtual facial organs of the user.
  • Optionally, the device further includes: a second receiving module 1204, configured to receive at least one audio data packet from the sending end, where the timestamp of the audio data packet matches the timestamp of the data packet of the user's first frame of facial image.
  • Optionally, the device further includes: a third receiving module 1205, configured to receive the data packet of the user's second frame of facial image from the sending end, where the user's second frame of facial image is earlier than the user's first frame of facial image, and the data packet includes the indexes of multiple second images that match the multiple facial organ images included in the user's second frame of facial image.
  • Optionally, the device further includes: a sending module 1206, configured to send indication information to the sending end, where the indication information is used to instruct the sending of a facial image earlier than the user's first frame of facial image.
  • Optionally, the device further includes: a discarding module 1207, configured to discard the data packet of the user's second frame of facial image if the first frame of facial image of the receiving end has already been generated.
  • Optionally, the device further includes: a second generation module 1208, configured to generate the second frame of facial image of the receiving end according to the data packet of the user's second frame of facial image if the third frame of facial image of the receiving end corresponding to the user's third frame of facial image has not yet been generated, where the user's third frame of facial image is earlier than the user's second frame of facial image.
  • The image processing device provided by the present application can be used to execute the image processing method corresponding to the above receiving end.
  • For its content and effects, refer to the method embodiments, which are not repeated here.
  • FIG. 13 is a schematic diagram of a terminal device provided by an embodiment of this application.
  • The terminal device may be the aforementioned sending end or receiving end.
  • The terminal device includes: a memory 1301, a processor 1302, and a transceiver 1303.
  • The memory 1301 stores instructions executable by the processor; the instructions are executed by the processor so that the processor 1302 can execute the image processing method corresponding to the sending end or the receiving end.
  • The transceiver 1303 is used to implement data transmission between terminal devices.
  • The terminal device may include one or more processors 1302.
  • The memory 1301 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
  • the terminal device may also include one or more of the following components: a power supply component, a multimedia component, an audio component, an input/output (I/O) interface, and a sensor component.
  • the power supply component provides power to various components of the terminal.
  • the power supply components may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for terminal devices.
  • the multimedia component includes a touch screen that provides an output interface between the terminal device and the user.
  • the touch display screen may include a liquid crystal display (LCD) and a touch panel (TP).
  • the touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel.
  • the multimedia component includes a front camera and/or a rear camera. When the terminal device is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data.
  • Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component is configured to output and/or input audio signals.
  • the audio component includes a microphone (MIC).
  • When the terminal device is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals.
  • the received audio signal can be further stored in a memory or sent via a communication component.
  • the audio component further includes a speaker for outputting audio signals.
  • the I/O interface provides an interface between the processor and the peripheral interface module.
  • the above-mentioned peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.
  • The sensor component includes one or more sensors and may include a light sensor, such as at least one of a complementary metal oxide semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications.
  • the sensor component may further include at least one of an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • the terminal device provided in this application can be used to execute the image processing method corresponding to the above sending end or receiving end.
  • the content and effect please refer to the method embodiment part, which will not be repeated here.
  • FIG. 14 is a schematic diagram of an image processing system 1400 provided by an embodiment of the application.
  • the system includes: a sending end 1401 and a receiving end 1402.
  • The two can be directly connected or connected through an intermediate device, such as a server.
  • The sending end 1401 is used to execute the image processing method corresponding to the above sending end, and the receiving end 1402 is used to execute the image processing method corresponding to the above receiving end.
  • For the content and effects, refer to the method embodiments, which are not repeated here.
  • the application also provides a computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions, and the computer instructions are used to make a computer execute the image processing method provided in this application.
  • the computer-readable storage medium may include a storage program area and a storage data area.
  • the storage program area may store an operating system and an application program required by at least one function; the storage data area may store computer instructions for implementing the above-mentioned image processing method.
  • the computer-readable storage medium is also a memory, which can be a high-speed random access memory or a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • The present application also provides a computer program product, which stores computer instructions; the computer instructions are used to make the computer execute the above-mentioned image processing method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

This application provides an image processing method, device, and system. The method includes: acquiring the user's first frame of facial image, which includes multiple facial organ images; acquiring multiple first images that match the multiple facial organ images; and sending a data packet of the user's first frame of facial image to the receiving end, where the data packet includes the indexes of the multiple first images and the indexes are used to obtain the multiple first images. This lowers the requirement on network bandwidth, that is, a good video effect can still be ensured when the network transmission bandwidth is limited.

Description

Image processing method, device, and system
This application claims priority to the Chinese patent application No. 202010018738.6, entitled "Image processing method, device, and system", filed with the China Patent Office on January 8, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of video technologies, and in particular to an image processing method, device, and system.
Background
Video calling is currently a more effective way of remote interaction than voice calling: besides conveying voice information, it also conveys body movements, facial expressions, and other information, making the exchange between the two parties more thorough.
The traditional video approach is a live-scene approach: the local end uses a camera to capture, in real time, frames of the participants, background, and so on, generates a video stream, and transmits the stream over the network to the far end for presentation. However, transmitting a high-resolution video stream places high demands on network transmission bandwidth, so the traditional approach struggles to achieve real-time, high-quality video calls. In a poor network environment, the video picture may even suffer packet loss, visual artifacts, and similar problems. In short, with limited network transmission bandwidth, video calls made in the traditional way perform poorly, degrading the user experience.
Summary
This application provides an image processing method, device, and system that lower the requirement on network transmission bandwidth and thereby improve the video call effect and the user experience.
In a first aspect, this application provides an image processing method, including: acquiring the user's first frame of facial image, which includes multiple facial organ images; acquiring multiple first images that match the multiple facial organ images; and sending a data packet of the user's first frame of facial image to the receiving end, where the data packet includes the indexes of the multiple first images and the indexes are used to obtain the multiple first images.
In this application, the sending end does not need to send the user's first frame of facial image to the receiving end; it only needs to send a data packet that includes the indexes of the multiple first images. This lowers the requirement on network bandwidth, that is, a good video effect can still be ensured when the network transmission bandwidth is limited.
Optionally, the multiple facial organ images are images of the user's real facial organs, and the multiple first images are images of facial organs virtualized for the user. Because the first images are images of virtual facial organs, the user's privacy is protected, which broadens the applicability of the technical solution of this application.
Optionally, acquiring the multiple first images that match the multiple facial organ images includes: for each of the multiple facial organ images, comparing the facial organ image with the standard organ image corresponding to that facial organ image to determine a first difference value; and acquiring the first image matching the facial organ image according to the first difference value, where the second difference value, between the matching first image and the standard organ image, and the first difference value satisfy a first condition. This method effectively obtains multiple first images that match the multiple facial organ images.
Optionally, the method further includes the sending end sending at least one audio data packet to the receiving end, where the timestamp of the audio data packet matches the timestamp of the data packet of the user's first frame of facial image. On this basis, the user perceives audio and video as synchronized.
Optionally, the method further includes: acquiring the user's second frame of facial image, which is earlier than the user's first frame of facial image; acquiring multiple second images that match the multiple facial organ images of the user's second frame of facial image; and sending a data packet of the user's second frame of facial image to the receiving end, where the data packet includes the indexes of the multiple second images and the indexes are used to obtain the multiple second images. Because the sending end does not need to send the user's second frame of facial image itself, but only a data packet including the indexes of the multiple second images, the requirement on network bandwidth is lowered, that is, a good video effect can still be ensured when the network transmission bandwidth is limited.
Optionally, the method further includes: receiving indication information sent by the receiving end, where the indication information is used to instruct the sending of a facial image earlier than the user's first frame of facial image, that is, the indication information instructs the sending of the data packet of a facial image earlier than the user's first frame of facial image. In other words, the sending end does not have to send a facial image earlier than the user's first frame of facial image in every case, which reduces the consumption of communication resources.
In a second aspect, this application provides an image processing method, including: receiving, from the sending end, a data packet of the user's first frame of facial image, where the data packet includes the indexes of multiple first images, the user's first frame of facial image includes multiple facial organ images, and the multiple first images match the multiple facial organ images; acquiring the multiple first images; and generating the first frame of facial image of the receiving end according to the multiple first images. Because the sending end does not need to send the user's first frame of facial image, but only a data packet including the indexes of the multiple first images, the requirement on network bandwidth is lowered, that is, a good video effect can still be ensured when the network transmission bandwidth is limited.
Optionally, the multiple facial organ images are images of the user's real facial organs, and the multiple first images are images of facial organs virtualized for the user.
Optionally, the method further includes: receiving at least one audio data packet from the sending end, where the timestamp of the audio data packet matches the timestamp of the data packet of the user's first frame of facial image. On this basis, the user perceives audio and video as synchronized.
Optionally, the method further includes: receiving, from the sending end, a data packet of the user's second frame of facial image, where the user's second frame of facial image is earlier than the user's first frame of facial image, the data packet includes the indexes of multiple second images, and the multiple second images match the multiple facial organ images included in the user's second frame of facial image. Because the sending end only sends a data packet including the indexes of the multiple second images rather than the user's second frame of facial image itself, the requirement on network bandwidth is lowered, that is, a good video effect can still be ensured when the network transmission bandwidth is limited.
Optionally, the method further includes: sending indication information to the sending end, where the indication information is used to instruct the sending of a facial image earlier than the user's first frame of facial image. That is, the sending end sends a facial image earlier than the user's first frame of facial image only when it receives this indication information, not in every case, which reduces the consumption of communication resources.
Optionally, the method further includes: if the first frame of facial image of the receiving end has already been generated, discarding the data packet of the user's second frame of facial image. There is then no need to generate the second frame of facial image of the receiving end, which reduces the power consumption of the receiving end.
Optionally, the method further includes: if the third frame of facial image of the receiving end corresponding to the user's third frame of facial image has not yet been generated, where the user's third frame of facial image is earlier than the user's second frame of facial image, generating the second frame of facial image of the receiving end according to the data packet of the user's second frame of facial image.
Optionally, when the user at the receiving end is on video with multiple sending-end users at the same time, the receiving end generates a video background image through AR/VR technology so that the multiple receiving-end first frames of facial image can be fused into one background scene, which improves the user experience and interactivity.
The image processing apparatus, device, system, storage medium, and computer program product are introduced below; their effects correspond to those of the method described above and are not repeated below.
In a third aspect, this application provides an image processing apparatus, including a first acquisition module, a second acquisition module, and a first sending module. The first acquisition module is configured to acquire the user's first frame of facial image, which includes multiple facial organ images. The second acquisition module is configured to acquire multiple first images that match the multiple facial organ images. The first sending module is configured to send a data packet of the user's first frame of facial image to the receiving end, where the data packet includes the indexes of the multiple first images and the indexes are used to obtain the multiple first images.
In a fourth aspect, this application provides an image processing apparatus, including a first receiving module, a first acquisition module, and a first generation module. The first receiving module is configured to receive, from the sending end, a data packet of the user's first frame of facial image, where the data packet includes the indexes of multiple first images, the user's first frame of facial image includes multiple facial organ images, and the multiple first images match the multiple facial organ images. The first acquisition module is configured to acquire the multiple first images. The first generation module is configured to generate the first frame of facial image of the receiving end according to the multiple first images.
In a fifth aspect, this application provides a terminal device, including a memory and a processor. The memory stores instructions executable by the processor, and the instructions are executed by the processor so that the processor can perform the method of any one of the first aspect, the second aspect, or the optional implementations of the first or second aspect.
In a sixth aspect, this application provides a computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of the first aspect, the second aspect, or the optional implementations of the first or second aspect.
In a seventh aspect, this application provides a computer program product storing computer instructions that cause a computer to perform the method of any one of the first aspect, the second aspect, or the optional implementations of the first or second aspect.
In summary, this application provides an image processing method, device, and system in which an image sample library is configured at the sending end and the receiving end, and image indexes into the sample library are passed between them to realize the transfer of images, reducing the bandwidth required of the network and thereby improving the video call effect and user experience. Further, the video scene is built on AR or VR technology, and virtual characters and video scenes convey rich expression and posture information, which protects the user's privacy. Still further, when the user at the receiving end is on video with multiple sending-end users at the same time, the receiving end generates a video background image through AR/VR technology so that multiple receiving-end first frames of facial image can be fused into one background scene, improving the user experience and interactivity.
Brief Description of the Drawings
FIG. 1 is a system architecture diagram provided by an embodiment of this application;
FIG. 2 is a flowchart of an image processing method provided by an embodiment of this application;
FIG. 3 is a schematic diagram of an image processing procedure provided by an embodiment of this application;
FIG. 4 is a flowchart of an image processing method provided by another embodiment of this application;
FIG. 5 is a schematic diagram of an audio data packet sequence and a facial image data packet sequence provided by an embodiment of this application;
FIG. 6 is a schematic diagram of a first data packet and a first buffer queue provided by an embodiment of this application;
FIG. 7 is a flowchart of a method by which the receiving end processes facial image data packets, provided by an embodiment of this application;
FIG. 8 is a schematic diagram of image processing provided by an embodiment of this application;
FIG. 9 is a schematic diagram of image processing provided by another embodiment of this application;
FIG. 10 is a schematic diagram of image processing provided by still another embodiment of this application;
FIG. 11 is a schematic diagram of an image processing apparatus provided by an embodiment of this application;
FIG. 12 is a schematic diagram of an image processing apparatus provided by another embodiment of this application;
FIG. 13 is a schematic diagram of a terminal device provided by an embodiment of this application;
FIG. 14 is a schematic diagram of an image processing system provided by an embodiment of this application.
Detailed Description
The traditional video approach is a live-scene approach: the local end uses a camera to capture, in real time, frames of the participants, background, and so on, generates a video stream, and transmits the stream over the network to the far end for presentation. However, with limited network transmission bandwidth, video calls made in the traditional way perform poorly, degrading the user experience. Further, the traditional approach easily exposes personal privacy such as a person's dress, location, or mental state, which narrows its range of use.
To solve the above problems, this application provides an image processing method, device, and system. The core idea is: configure an image sample library at the sending end and the receiving end, and pass image indexes into the sample library between them to realize the transfer of images, reducing the bandwidth required of the network. Further, the video scene is built on augmented reality (AR) or virtual reality (VR) technology, and virtual characters and video scenes convey rich expression and posture information.
To make the purposes, technical solutions, and advantages of this application clearer, the technical solutions of the embodiments of this application are described below with reference to the accompanying drawings.
The technical solutions of the embodiments of this application can be applied to various communication systems, such as third-generation (3G) mobile communication systems, fourth-generation (4G) mobile communication systems, fifth-generation (5G) mobile communication systems, new radio (NR), or wireless fidelity (WiFi) networks.
By way of example, FIG. 1 is a system architecture diagram provided by an embodiment of this application. As shown in FIG. 1, the sending end 11 and the receiving end 12 each have a camera for image capture. On the signaling plane they use the session initiation protocol (SIP), and on the media plane the real-time transport protocol (RTP) or the real-time transport control protocol (RTCP); the sending end 11 therefore uses RTP or RTCP to send facial image data packets to the receiving end 12. The sending end 11 can call the real-time network (RTN) software development kit (SDK) to send the facial image data packets through the RTN to the server 13, and the server 13 forwards them to the receiving end 12. The receiving end 12 calls the RTN SDK to receive the facial image data packets, parses them according to the RTP packet format, and, based on the parsed packets, implements three-dimensional (3D) image rendering through a graphics processing unit (GPU) or a network processing unit (NPU). As shown in FIG. 1, the dashed box around GPU/NPU indicates that the GPU/NPU is inside the terminal device rather than shown on its display. The terminal devices above may be mobile phones or AR/VR devices, for example VR head-mounted displays or AR glasses.
It should be noted that the sending end and the receiving end may transmit data without going through a server, that is, they may be directly connected for data transmission. For example, the sending end calls the RTN SDK to send facial image data packets through the RTN to the receiving end; the receiving end calls the RTN SDK to receive them, parses them according to the RTP packet format, and implements 3D image rendering through the GPU or NPU based on the parsed packets.
The technical solution of this application is elaborated below:
FIG. 2 is a flowchart of an image processing method provided by an embodiment of this application. The method involves a sending end and a receiving end, which may be two different terminal devices, for example two different mobile phones; or the sending end is a mobile phone and the receiving end an AR/VR device; or the sending end is an AR/VR device and the receiving end a mobile phone. This application places no restriction on this. As shown in FIG. 2, the method includes the following steps:
Step S201: The sending end acquires the user's first frame of facial image, which includes multiple facial organ images.
Step S202: The sending end acquires multiple first images that match the multiple facial organ images.
Step S203: The sending end sends a data packet of the user's first frame of facial image to the receiving end; the data packet includes the indexes of the multiple first images, and the indexes are used to obtain the multiple first images.
Step S204: The receiving end acquires the multiple first images.
Step S205: The receiving end generates the first frame of facial image of the receiving end according to the multiple first images.
Steps S201 to S203 are described together:
In a video call scenario, the sending end captures pictures of the user through its own camera, such as the front camera, obtaining multiple frames of facial image. The user's first frame of facial image here denotes the current frame; it may or may not be the very first frame of the user. "First" merely distinguishes it from the second frame of facial image mentioned below and carries no substantive meaning. The multiple facial organ images included in the user's first frame of facial image are all images of the user's real facial organs. Note that the facial organs may be coarse-grained, such as the eyes, nose, mouth, and ears, or fine-grained, such as the eyeball, the white of the eye, the eyelashes, the left nostril wing, the right nostril wing, and the bridge of the nose.
For a given facial organ image, a first image matching it means a first image whose facial organ features approximate the features presented by that facial organ image. For example, the first image may satisfy the following condition: the difference between the first image and the facial organ image is the smallest, or the absolute value of that difference is smaller than a preset threshold. Alternatively, suppose the difference value between the facial organ image and its corresponding standard organ image is a first difference value, and the difference value between the first image and that standard organ image is a second difference value; then the difference between the second difference value and the first difference value is the smallest, or the absolute value of that difference is smaller than a preset threshold. The standard organ image corresponding to any facial organ image is the standard image for that facial organ; for example, if the facial organ is the eye, its standard organ image is the standard image of the eye.
Optionally, the first image in this application is an image of a facial organ virtualized for the user, that is, a virtual image of the facial organ; the virtual image can be understood as an image of a cartoon character's facial organ, a celebrity's facial organ, or the like.
Optionally, the multiple first images are obtained as follows: for each of the multiple facial organ images, acquire the first image according to the facial organ image, where the difference between the first image and the facial organ image is the smallest. For example, if the user's first frame of facial image captures the user laughing, the multiple facial organ images include images of the eyebrows, squinting eyes, nose, upturned mouth, and ears. For the image of the squinting eyes, compare it with at least one eye image in the sample library and obtain the eye image with the smallest difference; that eye image is the first image. Alternatively, the absolute value of the difference value between the first image and the facial organ image is smaller than a preset threshold, which can be set according to the actual situation. Still taking the laughing picture as an example, for the image of the squinting eyes, compare it with at least one eye image in the sample library and obtain an eye image whose difference has an absolute value smaller than the preset threshold; that eye image is the first image.
Alternatively, for each of the multiple facial organ images, compare the facial organ image with the standard organ image corresponding to it to determine the first difference value, and acquire the first image corresponding to the facial organ image according to the first difference value, where the difference between the second difference value and the first difference value is the smallest. For example, if the user's first frame of facial image captures the user laughing, then for the image of the squinting eyes, determine the first difference value between that image and the standard eye image, determine the second difference value between at least one eye image in the sample library and the standard eye image, and obtain the eye image for which the difference between the second and first difference values is the smallest; that eye image is the first image. Alternatively, the absolute value of the difference between the second difference value, of the first image against the standard organ image, and the first difference value is smaller than a preset threshold, which can be set according to the actual situation. Still taking the laughing picture as an example, for the image of the squinting eyes, determine the first difference value between that image and the standard eye image, determine the second difference value between at least one eye image in the sample library and the standard eye image, and obtain an eye image for which the absolute value of the difference between the second and first difference values is smaller than the preset threshold; that eye image is the first image.
The sending end can determine the first difference value between a facial organ image and its corresponding standard organ image in the following ways, though it is not limited to them:
Option 1: The sending end obtains the pixel values of multiple first pixels in the facial organ image and the pixel values of multiple second pixels in each standard organ image in the sample library, where the multiple first pixels correspond one-to-one to the multiple second pixels. Further, for each standard organ image, the sending end computes the absolute values of the differences between the pixel values of the multiple first pixels and those of the corresponding second pixels in that standard organ image, and adds up all the absolute values to obtain the first difference value.
Option 2: The sending end obtains the pixel values of multiple first pixels in the facial organ image and the pixel values of multiple second pixels in each standard organ image in the sample library, with one-to-one correspondence as above. Further, for each standard organ image, the sending end computes the absolute values of the differences between the pixel values of the multiple first pixels and those of the corresponding second pixels, and sums the squares of all the absolute values to obtain the first difference value.
Likewise, the sending end computes the second difference value by the same method as the first difference value, which is not repeated in this application.
The standard organ images and/or the first images may reside in a local sample library at the sending end or in a sample library in the cloud; this application places no restriction on this.
The indexes of the multiple first images correspond one-to-one to the multiple first images. Optionally, each index is a floating-point value, and the number of indexes of the first images ranges over [70, 312]. Optionally, each index is an integer value. Through an index, the receiving end can obtain the first image corresponding to that index from the sample library.
Note that the first images may be stored in the sample library in the form of facial organ feature values. If what the receiving end stores are the feature values of the first images, then the receiving end generates its first frame of facial image according to the feature values corresponding to the multiple first images.
Steps S204 and S205 are described below:
FIG. 3 is a schematic diagram of the image processing procedure provided by an embodiment of this application. As shown in FIG. 3, the receiving end stores, in a local or cloud sample library, the first images of the facial organs (such as eyes, mouth, nose, and cheeks) corresponding to the indexes (indexes 1, 2, ..., 70 in FIG. 3; the numbers do not mean the index equals that number, but merely distinguish the 70 indexes). That is, the local or cloud sample library of the receiving end stores the first images of the facial organs and the index of each first image. On this basis, the receiving end can determine each first image from its index. For example, if the receiving end receives the index of the first image corresponding to squinting eyes, it determines the first image of squinting eyes from that index. A minimal sketch of this lookup follows.
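As a minimal sketch, the receiving-end lookup can be thought of as a table from index to sample image; the dictionary, the file names, and the function name below are illustrative assumptions only:

# Local or cloud sample library: index -> first image (or its feature values).
sample_library = {
    1: "eye_squinting.png",   # illustrative entries only
    2: "mouth_upturned.png",
}

def resolve_indexes(indexes):
    """Map the indexes carried in a facial image data packet back to images."""
    return [sample_library[i] for i in indexes if i in sample_library]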
Option 1: After obtaining the multiple first images, the receiving end renders them through a 3D model to generate the first frame of facial image of the receiving end, which is a virtual image.
Option 2: To guard against the data packet of the user's first frame of facial image not fully containing the indexes of all facial organs, or against some indexes being lost while that data packet is transmitted, the receiving end can also obtain the data packets of at least one other frame of the user's facial image (the user's second frame of facial image is used as an example below). The data packet of the user's second frame of facial image includes the indexes of multiple second images of the facial organs; the multiple second images, which are also virtual images, can be determined from those indexes. On this basis, the receiving end can combine the data packet of the user's first frame of facial image with the data packet of the user's second frame of facial image to generate the first frame of facial image of the receiving end. Here, "combining" means: if the data packet of the user's first frame of facial image contains the index corresponding to a given facial organ, the first image for that organ is obtained from the index and used as a component of the receiving end's first frame of facial image; if it does not contain the index for a given facial organ but the data packet of the user's second frame of facial image does, the receiving end obtains the image for that organ from that index and uses it as a component of the receiving end's first frame of facial image.
Alternatively, following the order in which facial image data packets are received, if the earliest-received of the other at least one frame contains the index for a given facial organ, the image for that organ is obtained from the index and used as a component of the receiving end's first frame of facial image. If the earliest-received facial image data packet does not contain the index for a given facial organ, but a later facial image data packet or the data packet of the user's first frame of facial image does, the receiving end obtains the image for that organ from that index and uses it as a component of the receiving end's first frame of facial image.
Optionally, the receiving end generates a video background image through AR/VR technology. For example, when the user at the receiving end is on video with multiple sending-end users at the same time, the receiving end generates a video background image through AR/VR technology so that the receiving-end first frames of facial image of the multiple users can be fused into one background scene.
Optionally, the receiving end can choose a video background image adapted to the first frame of facial image of the receiving end. For example, if the receiving end's first frame of facial image is an image of a cartoon character's facial organs, the receiving end selects a cartoon background image; if it is an image of a celebrity's facial organs, the receiving end selects, as the video background image, a poster image of a film or television production the celebrity appeared in. The receiving end's first frame of facial image and the video background image have a correspondence, which may be one-to-one, one-to-many, many-to-one, or many-to-many. For example, in a two-person video call, the receiving end's display currently shows the receiving-end first frame of facial image corresponding to one user; that frame may correspond to one video background image or to multiple ones, and when it corresponds to multiple ones, the receiving end may pick any one of them or pick one according to a preset rule. In a video call of three or more people, the receiving end's display currently shows the receiving-end first frames of facial image corresponding to multiple users; these frames may correspond to one or multiple video background images, and when they correspond to multiple ones, the receiving end may pick any one of them or pick one according to a preset rule.
Optionally, in this application the receiving end can also rotate, scale, and otherwise transform the receiving end's first frame of facial image, and can add expression effects, gesture effects, and the like on the facial image to add interest.
In summary, this application provides an image processing method. First, because the sending end does not need to send the user's first frame of facial image to the receiving end, but only a data packet including the indexes of multiple first images, the requirement on network bandwidth is lowered; that is, a good video effect can still be ensured when the network transmission bandwidth is limited. For example, traditional video occupies large bandwidth for high-definition, high-frame-rate pictures. Generally, to present 2K-quality video at the receiving end, the traditional approach must transmit 2K video picture frames; encoded with H264 at 30 frames per second (FPS), the transmission requires a bandwidth of about 8 million bits per second (Mbps). With the image processing method provided by this application, where the sending end sends only data packets containing the indexes corresponding to the facial organs, presenting 2K-quality video at the receiving end makes the data packet of the user's first frame of facial image occupy approximately:
frame rate * number of indexes in the data packet of the user's first frame of facial image * bits per float / 1024 / text compression ratio = bandwidth
Assuming a frame rate of 30 FPS, 70 indexes in the data packet of the user's first frame of facial image, 32 bits per float, a divisor of 1024 (kb), and a text compression ratio of 10, the computed bandwidth is 6.56 kilobits per second (kbps), about 1/1250 of the bandwidth occupied by the traditional video approach. Therefore, this application can also capture facial image data packets at frame rates of 60 FPS, 90 FPS, or even above 500 FPS, presenting the video picture more coherently and finely. This calculation is reproduced in the sketch below.
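The calculation can be reproduced with a few lines; the figures are those assumed above:

frame_rate = 30          # FPS
indexes_per_frame = 70
bits_per_float = 32
compression = 10         # text compression ratio

bandwidth_kbps = frame_rate * indexes_per_frame * bits_per_float / 1024 / compression
print(round(bandwidth_kbps, 2))  # 6.56 kbps, roughly 1/1250 of the 8 Mbps above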
Second, the image processing method provided by this application does not expose personal privacy such as a person's dress, location, or mental state, which broadens the range of use of the technical solution of this application.
Finally, when the user at the receiving end is on video with multiple sending-end users at the same time, the receiving end generates a video background image through AR/VR technology so that multiple receiving-end first frames of facial image can be fused into one background scene, improving the user experience and interactivity.
Building on the previous embodiment, the sending end also sends audio data packets to the receiving end so that the user perceives audio and video as synchronized. The receiving end therefore needs to synchronize the receiving end's first frame of facial image with at least one audio data packet. Specifically, FIG. 4 is a flowchart of an image processing method provided by another embodiment of this application. As shown in FIG. 4, the image processing method further includes the following steps:
Step S401: The sending end acquires the user's first frame of facial image, which includes multiple facial organ images.
Step S402: The sending end acquires multiple first images that match the multiple facial organ images.
Step S403: The sending end sends a data packet of the user's first frame of facial image to the receiving end; the data packet includes the indexes of the multiple first images, and the indexes are used to obtain the multiple first images.
Step S404: The receiving end acquires the multiple first images.
Step S405: The receiving end generates the first frame of facial image of the receiving end according to the multiple first images.
Step S406: The sending end sends at least one audio data packet to the receiving end.
Step S407: The receiving end displays the receiving end's first frame of facial image and synchronizes the above at least one audio data packet.
Steps S401 to S405 are the same as steps S201 to S205; refer to the content of steps S201 to S205, which is not repeated here.
Regarding step S406: the timestamps of the at least one audio data packet match the timestamp of the data packet of the user's first frame of facial image. "Match" here means that the timestamp of each of the at least one audio data packet is greater than or equal to the timestamp of the data packet of the user's first frame of facial image, and smaller than the timestamp of the next data packet after the data packet of the user's first frame of facial image. For example: the data packet of the user's first frame of facial image has timestamp n, the audio data packets have timestamps n, n+160, n+320, ..., and n+2880, and the next data packet after the data packet of the user's first frame of facial image has timestamp n+3000.
The timestamp in any audio data packet or facial image data packet reflects the sampling instant of the first octet of that packet. In RTP, the timestamp occupies 32 bits.
Within one video call, the sending end can set the initial timestamp value randomly, for example to n. If the data packet of the user's first frame of facial image is the first facial image data packet of the call, its timestamp is n, and the timestamp of the first of the at least one audio data packet is also n.
The sending end obtains multiple audio data packets at the audio sampling rate and multiple facial image data packets at the facial image sampling rate. For example: at an audio sampling rate of 8 kilohertz (kHz), with one audio data packet packed every 0.02 seconds (s), the timestamp increment between adjacent audio data packets is 0.02 * 8000 = 160. At a facial image sampling rate of 90 kHz, with one facial image data packet packed every 1/30 s, the timestamp increment between adjacent facial image data packets is (1/30) * 90 * 1000 = 3000. FIG. 5 is a schematic diagram of an audio data packet sequence and a facial image data packet sequence provided by an embodiment of this application. As shown in FIG. 5, the first row is the audio data packet sequence formed by multiple audio data packets: the T-th audio data packet has timestamp n, the (T+1)-th has n+160, ..., the (T+18)-th has n+2880, the (T+19)-th has n+3040, ..., and the (T+38)-th has n+6080. The second row is the facial image data packet sequence formed by multiple frames of facial image data packets: the T-th facial image data packet has timestamp n, the (T+1)-th has n+3000, and the (T+2)-th has n+6000.
Regarding step S407: when generating the receiving end's first frame of facial image, the receiving end also generates its timestamp, which may be the timestamp of the data packet of the user's first frame of facial image. Further, the receiving end uses the same criterion as the sending end to determine the audio data packets whose timestamps match the receiving-end facial image. For example, for the receiving end's first frame of facial image with timestamp n, the matching audio data packets are those with timestamps n, n+160, n+320, ..., and n+2880.
The receiving end's first frame of facial image and the above at least one audio data packet need to be synchronized; therefore, while displaying the receiving end's first frame of facial image, the terminal device plays the content of the at least one audio data packet in synchrony. For example, while displaying the receiving end's first frame of facial image, it plays the audio data packets with timestamps n, n+160, n+320, ..., and n+2880 in synchrony. A minimal sketch of this timestamp grouping follows.
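A minimal sketch of this timestamp matching, assuming RTP-style integer timestamps, an audio increment of 160, and a video increment of 3000 as in the example above; the names are illustrative:

VIDEO_STEP = 3000  # timestamp increment between facial image data packets

def audio_matches_frame(audio_ts, frame_ts):
    """An audio packet matches a facial image packet if its timestamp lies in
    [frame_ts, frame_ts + VIDEO_STEP), i.e. before the next frame's timestamp."""
    return frame_ts <= audio_ts < frame_ts + VIDEO_STEP

# Example: for a frame with timestamp n = 0, packets 0, 160, ..., 2880 match.
n = 0
matching = [ts for ts in range(0, 6080, 160) if audio_matches_frame(ts, n)]
assert matching == list(range(0, 3000, 160))  # 19 packets: n through n+2880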
Note that part of step S406 can proceed simultaneously with step S403, while the rest of step S406 is performed after step S403: the first of the at least one audio data packet must be sent to the receiving end at the same time as the data packet of the user's first frame of facial image, while the remaining audio data packets are sent after it.
In summary, in this application the receiving end can, while displaying its first frame of facial image, synchronously play the at least one matching audio data packet, so that the user perceives audio and video as synchronized.
Optionally, the receiving end also receives, from the sending end, a data packet of the user's second frame of facial image, which is earlier than the user's first frame of facial image, that is, generated earlier than the user's first frame of facial image. The data packet of the user's second frame of facial image includes the indexes of multiple second images; the user's second frame of facial image includes multiple facial organ images, and the multiple second images match those facial organ images. The sending end may send the user's first and second frames of facial image to the receiving end separately, for example first the first frame and then the second frame. Or it may send them together, for example in a first data packet that includes both the data packet of the user's first frame of facial image and that of the user's second frame. Note that sending a facial image can also be understood as sending the data packet of that facial image.
The receiving end can send indication information to the sending end, instructing it to send a facial image earlier than the user's first frame of facial image. According to the indication information, the sending end sends the data packet of the user's second frame of facial image to the receiving end.
Further, the indication information can instruct that a facial image earlier than the user's first frame of facial image be carried when the user's first frame of facial image is sent. Since always sending the user's first frame of facial image together with earlier frames would increase the transmission burden on the sending end, the receiving end can send this indication information only after it has failed several times in a row to receive consecutive facial image data packets.
For the receiving end, however, there are cases in which it does not need the data packet of the user's second frame of facial image. For example, if the receiving end has already generated its first frame of facial image from the data packet of the user's first frame of facial image, it does not need to generate a second frame from the data packet of the user's second frame of facial image, and discards that data packet.
Conversely, if the receiving end has not yet generated its third frame of facial image from the user's third frame of facial image, it can generate its second frame of facial image from the data packet of the user's second frame of facial image, where the data packet of the user's third frame of facial image was generated earlier than that of the user's second frame.
When the sending end sends multiple facial image data packets separately, some arrive late because of poor network conditions and the like, so the receiving end can add a synchronization waiting period, that is, the time the receiving end waits for the delayed facial image data packets; it may be 20 milliseconds, 30 milliseconds, and so on, with no restriction in this application.
To guard against loss of facial image data packets, the sending end can send the data packet of the user's first frame of facial image together with that of the user's second frame of facial image, where the user's second frame of facial image is continuous in time with the user's first frame. For example: FIG. 6 is a schematic diagram of a first data packet and a first buffer queue provided by an embodiment of this application. As shown in FIG. 6, the first buffer queue at the receiving end stores the received data packets of the (T-7)-th through (T-3)-th frames of facial image; because the data packets of the (T-2)-th and (T-1)-th frames were lost, they are not stored in the first buffer queue. The first data packet includes the data packets of the T-th, (T-1)-th, (T-2)-th, and (T-3)-th frames of facial image, where the T-th frame packet can be the above data packet of the user's first frame of facial image and the (T-1)-th frame the above data packet of the user's second frame of facial image. The receiving end adds the data packets of the (T-1)-th and (T-2)-th frames to the first buffer queue, solving the packet loss problem.
To reduce the transmission burden on the sending end, the receiving end can, after failing several times in a row to receive consecutive facial image data packets, send the sending end indication information instructing that a facial image earlier than the user's first frame of facial image be carried when the user's first frame of facial image is sent. That is, only when the sending end receives this indication information does it carry both the data packet of the user's first frame of facial image and that of the user's second frame in the first data packet; when it has not received the indication information, it does not carry the user's second frame of facial image when sending the user's first frame. The receiving end can maintain a network state variable S with initial value 0: each time it receives a facial image data packet, it judges whether that packet and the previously received facial image data packet are consecutive; if yes, S is incremented by 1, otherwise decremented by 1. Once S reaches -(N+1), that is, the receiving end has received non-consecutive facial image data packets N+1 times in a row, the receiving end sends the sending end indication information instructing that a facial image earlier than the user's first frame of facial image be carried when the user's first frame is sent, and sets S = 0. In addition, when the first data packet received by the receiving end includes both the user's first and second frames of facial image, the receiving end selectively places the data packet of the user's second frame of facial image into the first buffer queue. Optionally, once S reaches N+1, that is, the receiving end has received consecutive facial image data packets N+1 times in a row, it sends the sending end another piece of indication information indicating that there is no need to carry a facial image earlier than the user's first frame of facial image. For convenience, indication information instructing that a facial image earlier than the current facial image be carried when sending the current facial image is called first indication information, and indication information indicating that this is unnecessary is called second indication information. Note that the first indication information can be replaced by an indication to increase, and the second by an indication to decrease, the carrying of facial images earlier than the current one. A sketch of this network state counter follows.
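A minimal sketch of the network state variable S, with the value of N and the two notification callbacks as illustrative assumptions (the application leaves N unspecified):

N = 3  # assumed threshold

class NetworkState:
    def __init__(self, send_first_indication, send_second_indication):
        self.s = 0
        self.last_frame = None
        self.send_first = send_first_indication    # ask for earlier frames
        self.send_second = send_second_indication  # say they are not needed

    def on_packet(self, frame_no):
        consecutive = self.last_frame is None or frame_no == self.last_frame + 1
        self.s += 1 if consecutive else -1
        self.last_frame = frame_no
        if self.s == N + 1:       # N+1 consecutive packets in a row
            self.send_second()
            self.s = 0
        elif self.s == -(N + 1):  # N+1 non-consecutive packets in a row
            self.send_first()
            self.s = 0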
Specifically, FIG. 7 is a flowchart of a method, provided by an embodiment of this application, by which the receiving end processes facial image data packets. As shown in FIG. 7, the method is executed by the receiving end and includes the following steps:
Step S701: Receive the data packet of the user's first frame of facial image.
Step S702: Judge whether the data packet of the user's first frame of facial image and the previously received user facial image data packet are consecutive. If they are consecutive, perform step S703; otherwise, perform step S707.
Step S703: Set S = S + 1.
Step S704: Judge whether S has reached N+1; if yes, perform step S705; if not, perform step S706.
Step S705: Send the second indication information to the sending end and set S = 0.
Step S706: Buffer the data packet of the user's first frame of facial image in the first buffer queue.
If the data packets of the user's first frame of facial image and the user's second frame of facial image were packed and sent in the first data packet, the data packet of the user's first frame of facial image is taken out of the first data packet and buffered in the first buffer queue. For example: the user's first frame of facial image is the T-th frame of facial image, the user's second frame of facial image is the (T-1)-th frame, and the data packets of the T-th, (T-1)-th, and (T-2)-th frames of facial image were packed and sent in the first data packet; the receiving end then stores the data packet of the T-th frame of facial image in the first buffer queue.
Step S707: Set S = S - 1.
Step S708: Judge whether S has reached -(N+1); if yes, perform step S709; if not, perform step S710.
Step S709: Send the first indication information to the sending end and set S = 0.
Step S710: Judge whether the first data packet includes both the data packet of the user's first frame of facial image and the data packet of the user's second frame of facial image; if yes, perform step S711; if not, perform step S714.
Step S711: Judge whether the earliest-generated facial image in the first data packet is earlier than the latest-generated facial image in the first buffer queue. If yes, perform step S712; if not, perform step S713.
Step S712: Add the facial image data packets in the first data packet to the first buffer queue.
Suppose the user's first frame of facial image is the T-th frame of facial image, the user's second frame of facial image is the (T-1)-th frame, the first data packet includes the data packets of the T-th, (T-1)-th, and (T-2)-th frames of facial image, and the latest-generated facial image in the first buffer queue is the (T-3)-th frame of facial image. In this case, the receiving end adds the data packets of the T-th, (T-1)-th, and (T-2)-th frames of facial image to the first buffer queue.
Step S713: Add to the first buffer queue those facial image data packets in the first data packet that are later than the latest-generated facial image in the first buffer queue.
Suppose the user's first frame of facial image is the T-th frame of facial image and the user's second frame of facial image is the (T-1)-th frame; the first data packet includes the data packets of the T-th, (T-1)-th, (T-2)-th, and (T-3)-th frames of facial image, while the latest-generated facial image in the first buffer queue is the (T-3)-th frame. In this case, the data packets of the T-th, (T-1)-th, and (T-2)-th frames of facial image are all added to the first buffer queue, and the data packet of the (T-3)-th frame of facial image in the first data packet is discarded.
Step S714: Judge whether the user's first frame of facial image is earlier than the latest-generated facial image in the first buffer queue; if yes, perform step S715; otherwise, perform step S716.
Step S715: Discard the data packet of the user's first frame of facial image.
Step S716: Buffer the data packet of the user's first frame of facial image in the first buffer queue. A sketch of steps S710 to S716 follows.
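A minimal sketch of steps S710 to S716, assuming a sorted list of frame numbers as the first buffer queue and helper names that are illustrative only:

import bisect

first_queue = []  # sorted frame numbers already buffered

def _insert(frame_no):
    if frame_no not in first_queue:
        bisect.insort(first_queue, frame_no)

def on_packet(frames):
    """frames: the frame numbers carried in one received packet.
    A bundle like [T-2, T-1, T] follows steps S710-S713; a single
    frame follows steps S714-S716."""
    newest = first_queue[-1] if first_queue else None
    if len(frames) > 1:                          # S710: bundled first data packet
        if newest is None or min(frames) < newest:
            for f in frames:                     # S711 -> S712: add the whole bundle
                _insert(f)
        else:
            for f in frames:                     # S713: keep only frames later than
                if f > newest:                   # the newest one already buffered
                    _insert(f)
    else:                                        # single frame: S714-S716
        f = frames[0]
        if newest is not None and f < newest:    # S714 -> S715: discard stale frame
            return
        _insert(f)                               # S716: buffer it

With the S713 example above, a bundle [T-3, T-2, T-1, T] against a queue whose newest frame is T-3 adds T-2, T-1, and T and drops the duplicate T-3.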
Finally, the receiving end can select the data packets of 2 to 3 frames of facial image from the first buffer queue and buffer them in the second buffer queue for rendering.
For example: FIG. 8 is a schematic diagram of image processing provided by an embodiment of this application. As shown in FIG. 8, the receiving end has received the data packet of the T-th frame of facial image but has not yet stored it in the first buffer queue; the first buffer queue currently stores the data packets of the (T-1)-th through (T-7)-th frames of facial image. When generating the receiving end's first frame of facial image, the receiving end schedules only the data packets of the T-th through (T-2)-th frames, stores these 3 frames of facial image data packets in the second buffer queue, and clears the data packets of the (T-7)-th through (T-3)-th frames from the first buffer queue. The rendering module in the receiving end can start rendering from the (T-2)-th frame of facial image and proceed frame by frame; after the 3 frames of facial image data packets in the second buffer queue are rendered, the second buffer queue continues to fetch facial image data packets from the first buffer queue. The receiving end may refresh the second buffer queue at 30 frames per second, as long as the rendering module can obtain 2 to 3 frames of facial image data packets each time.
In summary, in this application, the data packet of the user's first frame of facial image and that of the user's second frame can be carried in one data packet, where the user's second frame of facial image is continuous in time with the user's first frame; this guards against loss of facial image data packets and, on this basis, improves the quality of the receiving end's first frame of facial image. In addition, the receiving end can, after failing several times in a row to receive consecutive facial image data packets, send the sending end indication information instructing that a facial image earlier than the user's first frame of facial image be carried when the user's first frame is sent. That is, only when the sending end receives this indication information does it send the user's second frame of facial image together with the user's first frame; when it has not received the indication information, it does not carry the user's second frame of facial image when sending the user's first frame, which reduces the transmission burden on the sending end.
If the user's first frame of facial image and the latest-generated facial image in the first buffer queue are not consecutive, and after receiving the user's first frame of facial image the receiving end receives the user's second frame of facial image, which is consecutive with the first, then the later-generated user facial image is received by the receiving end first and the earlier-generated one later. Depending on the situation, the receiving end can choose to discard the user's second frame of facial image and buffer the data packet of the user's first frame of facial image in the first buffer queue, or choose to buffer both the data packet of the user's second frame of facial image and that of the user's first frame in the first buffer queue.
For example: FIG. 9 is a schematic diagram of image processing provided by another embodiment of this application. In the case shown in FIG. 9, the receiving end discards the user's second frame of facial image. As shown in FIG. 9, the receiving end first received the data packet of the T-th frame of facial image and has already buffered the T-th frame in the second buffer queue for rendering, and then received the data packets of the (T-1)-th and (T-2)-th frames of facial image. To keep out-of-order facial image data packets out of the first buffer queue, the receiving end discards the data packets of the (T-1)-th and (T-2)-th frames. The rendering module therefore obtains skipped frames, that is, the data packets of the T-th, (T-3)-th, and (T-4)-th frames of facial image; because the receiving end refreshes the second buffer queue at a relatively high rate, this does not affect the look and feel of the video call at the receiving end.
For example: FIG. 10 is a schematic diagram of image processing provided by still another embodiment of this application. In the case shown in FIG. 10, the receiving end adds the user's second frame of facial image to the first buffer queue. As shown in FIG. 10, the receiving end first received the data packet of the T-th frame of facial image but has not yet buffered the (T-3)-th frame in the second buffer queue for rendering, and then received the data packets of the (T-1)-th and (T-2)-th frames of facial image. To preserve the continuity of facial image data packets in the first buffer queue, the receiving end adds the data packets of the (T-1)-th and (T-2)-th frames to the first buffer queue. The rendering module can subsequently obtain the data packets of the T-th, (T-1)-th, and (T-2)-th frames of facial image, ensuring the continuity of the facial images rendered at the receiving end.
That is, in this application, if frames arrive out of order, the late user's second frame of facial image should have been received before the user's first frame of facial image but, because of delay, arrives after it. If the user's first frame of facial image has already been used to generate the receiving end's first frame of facial image, the user's second frame of facial image is discarded; if the receiving end's third frame of facial image has not yet been generated from the user's third frame of facial image, where the user's third frame of facial image is earlier than the user's second frame, the user's second frame of facial image is added to the first buffer queue, that is, the receiving end's second frame of facial image is generated according to the user's second frame of facial image.
Note that the above describes the receiving end generating a receiving-end facial image from the data packet of one frame of facial image each time. However, as described in Option 2 of step S205, the receiving end can also combine the data packet of the user's first frame of facial image with the data packets of at least one other frame of the user's facial image to generate the receiving end's first frame of facial image. This application does not limit how many frames of the user's facial image data packets are used to generate a receiving-end facial image.
FIG. 11 is a schematic diagram of an image processing apparatus provided by an embodiment of this application. The image processing apparatus is part or all of the above sending end. As shown in FIG. 11, the apparatus includes:
a first acquisition module 1101, configured to acquire the user's first frame of facial image, which includes multiple facial organ images;
a second acquisition module 1102, configured to acquire multiple first images that match the multiple facial organ images;
a first sending module 1103, configured to send a data packet of the user's first frame of facial image to the receiving end, where the data packet includes the indexes of the multiple first images and the indexes are used to obtain the multiple first images.
Optionally, the multiple facial organ images are images of the user's real facial organs, and the multiple first images are images of facial organs virtualized for the user.
Optionally, the second acquisition module 1102 is specifically configured to: for each of the multiple facial organ images, compare the facial organ image with the standard organ image corresponding to it to determine a first difference value; and acquire the first image matching the facial organ image according to the first difference value, where the second difference value, between the matching first image and the standard organ image, and the first difference value satisfy a first condition.
Optionally, the apparatus further includes: a second sending module 1104, configured to send at least one audio data packet to the receiving end, where the timestamp of the audio data packet matches the timestamp of the data packet of the user's first frame of facial image.
Optionally, the apparatus further includes:
a third acquisition module 1105, configured to acquire the user's second frame of facial image, which is earlier than the user's first frame of facial image;
a fourth acquisition module 1106, configured to acquire multiple second images that match the multiple facial organ images of the user's second frame of facial image;
a third sending module 1107, configured to send a data packet of the user's second frame of facial image to the receiving end, where the data packet includes the indexes of the multiple second images and the indexes are used to obtain the multiple second images.
Optionally, the apparatus further includes: a receiving module 1108, configured to receive indication information sent by the receiving end, where the indication information is used to instruct the sending of a facial image earlier than the user's first frame of facial image.
The image processing apparatus provided by this application can be used to execute the image processing method corresponding to the above sending end; for its content and effects, refer to the method embodiments, which are not repeated here.
FIG. 12 is a schematic diagram of an image processing apparatus provided by another embodiment of this application. The image processing apparatus is part or all of the above receiving end. As shown in FIG. 12, the apparatus includes:
a first receiving module 1201, configured to receive, from the sending end, a data packet of the user's first frame of facial image, where the data packet includes the indexes of multiple first images, the user's first frame of facial image includes multiple facial organ images, and the multiple first images match the multiple facial organ images;
a first acquisition module 1202, configured to acquire the multiple first images;
a first generation module 1203, configured to generate the receiving end's first frame of facial image according to the multiple first images.
Optionally, the multiple facial organ images are images of the user's real facial organs, and the multiple first images are images of facial organs virtualized for the user.
Optionally, the apparatus further includes: a second receiving module 1204, configured to receive at least one audio data packet from the sending end, where the timestamp of the audio data packet matches the timestamp of the data packet of the user's first frame of facial image.
Optionally, the apparatus further includes: a third receiving module 1205, configured to receive, from the sending end, a data packet of the user's second frame of facial image, where the user's second frame of facial image is earlier than the user's first frame of facial image, the data packet includes the indexes of multiple second images, and the multiple second images match the multiple facial organ images included in the user's second frame of facial image.
Optionally, the apparatus further includes: a sending module 1206, configured to send indication information to the sending end, where the indication information is used to instruct the sending of a facial image earlier than the user's first frame of facial image.
Optionally, the apparatus further includes: a discarding module 1207, configured to discard the data packet of the user's second frame of facial image if the receiving end's first frame of facial image has already been generated.
Optionally, the apparatus further includes: a second generation module 1208, configured to: if the receiving end's third frame of facial image corresponding to the user's third frame of facial image has not yet been generated, where the user's third frame of facial image is earlier than the user's second frame of facial image, generate the receiving end's second frame of facial image according to the data packet of the user's second frame of facial image.
The image processing apparatus provided by this application can be used to execute the image processing method corresponding to the above receiving end; for its content and effects, refer to the method embodiments, which are not repeated here.
FIG. 13 is a schematic diagram of a terminal device provided by an embodiment of this application. The terminal device may be the above sending end or receiving end. As shown in FIG. 13, the terminal device includes: a memory 1301, a processor 1302, and a transceiver 1303. The memory 1301 stores instructions executable by the processor; the instructions are executed by the processor so that the processor 1302 can execute the image processing method corresponding to the sending end or the receiving end. The transceiver 1303 is used to implement data transmission between terminal devices.
The terminal device may include one or more processors 1302. The memory 1301 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The terminal device may also include one or more of the following components: a power supply component, a multimedia component, an audio component, an input/output (I/O) interface, and a sensor component.
The power supply component provides power to the various components of the terminal. It may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia component includes a touch display screen that provides an output interface between the terminal device and the user. In some embodiments, the touch display screen may include a liquid crystal display (LCD) and a touch panel (TP). The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. In some embodiments, the multimedia component includes a front camera and/or a rear camera. When the terminal device is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
The audio component is configured to output and/or input audio signals. For example, the audio component includes a microphone (MIC); when the terminal device is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory or sent via a communication component. In some embodiments, the audio component also includes a speaker for outputting audio signals.
The I/O interface provides an interface between the processor and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor component includes one or more sensors and may include a light sensor, such as at least one of a complementary metal oxide semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component may also include at least one of an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The terminal device provided by this application can be used to execute the image processing method corresponding to the above sending end or receiving end; for its content and effects, refer to the method embodiments, which are not repeated here.
FIG. 14 is a schematic diagram of an image processing system 1400 provided by an embodiment of this application. As shown in FIG. 14, the system includes a sending end 1401 and a receiving end 1402; the two may be directly connected or connected through an intermediate device such as a server. The sending end 1401 is used to execute the image processing method corresponding to the above sending end, and the receiving end 1402 is used to execute the image processing method corresponding to the above receiving end; for the content and effects, refer to the method embodiments, which are not repeated here.
This application also provides a computer-readable storage medium. The computer-readable storage medium stores computer instructions that cause a computer to execute the image processing method provided by this application.
The computer-readable storage medium may include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store the computer instructions implementing the above image processing method. The computer-readable storage medium is also a memory, which may be a high-speed random access memory or a non-transitory memory, for example at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device.
This application also provides a computer program product storing computer instructions that cause a computer to execute the above image processing method; for the content and effects, refer to the method embodiments, which are not repeated here.

Claims (32)

  1. An image processing method, comprising:
    acquiring a first frame of facial image of a user, the first frame of facial image of the user comprising multiple facial organ images;
    acquiring multiple first images matching the multiple facial organ images;
    sending a data packet of the first frame of facial image of the user to a receiving end, the data packet of the first frame of facial image of the user comprising indexes of the multiple first images, the indexes of the multiple first images being used to obtain the multiple first images.
  2. The method according to claim 1, wherein
    the multiple facial organ images are images of the user's real facial organs, and the multiple first images are images of facial organs virtualized for the user.
  3. The method according to claim 1 or 2, wherein acquiring multiple first images matching the multiple facial organ images comprises:
    for each facial organ image among the multiple facial organ images, comparing the facial organ image with a standard organ image corresponding to the facial organ image to determine a first difference value;
    acquiring, according to the first difference value, the first image matching the facial organ image, wherein a second difference value, between the first image matching the facial organ image and the standard organ image, and the first difference value satisfy a first condition.
  4. The method according to any one of claims 1 to 3, further comprising:
    sending at least one audio data packet to the receiving end, wherein a timestamp of the audio data packet matches a timestamp of the data packet of the first frame of facial image of the user.
  5. The method according to any one of claims 1 to 4, further comprising:
    acquiring a second frame of facial image of the user, the second frame of facial image of the user being earlier than the first frame of facial image of the user;
    acquiring multiple second images matching multiple facial organ images of the second frame of facial image of the user;
    sending a data packet of the second frame of facial image of the user to the receiving end, the data packet of the second frame of facial image of the user comprising indexes of the multiple second images, the indexes of the multiple second images being used to obtain the multiple second images.
  6. The method according to claim 5, further comprising:
    receiving indication information sent by the receiving end, the indication information being used to instruct sending of a facial image earlier than the first frame of facial image of the user.
  7. An image processing method, comprising:
    receiving, from a sending end, a data packet of a first frame of facial image of a user, the data packet of the first frame of facial image of the user comprising indexes of multiple first images, the first frame of facial image of the user comprising multiple facial organ images, the multiple first images matching the multiple facial organ images;
    acquiring the multiple first images;
    generating a first frame of facial image of a receiving end according to the multiple first images.
  8. The method according to claim 7, wherein
    the multiple facial organ images are images of the user's real facial organs, and the multiple first images are images of facial organs virtualized for the user.
  9. The method according to claim 7 or 8, further comprising:
    receiving at least one audio data packet from the sending end, wherein a timestamp of the audio data packet matches a timestamp of the data packet of the first frame of facial image of the user.
  10. The method according to any one of claims 7 to 9, further comprising:
    receiving, from the sending end, a data packet of a second frame of facial image of the user, the second frame of facial image of the user being earlier than the first frame of facial image of the user, the data packet of the second frame of facial image of the user comprising indexes of multiple second images, the multiple second images matching multiple facial organ images comprised in the second frame of facial image of the user.
  11. The method according to claim 10, further comprising:
    sending indication information to the sending end, the indication information being used to instruct sending of a facial image earlier than the first frame of facial image of the user.
  12. The method according to claim 10 or 11, further comprising:
    if the first frame of facial image of the receiving end has been generated, discarding the data packet of the second frame of facial image of the user.
  13. The method according to claim 10 or 11, further comprising:
    if a third frame of facial image of the receiving end corresponding to a third frame of facial image of the user has not yet been generated, wherein the third frame of facial image of the user is earlier than the second frame of facial image of the user, generating a second frame of facial image of the receiving end according to the data packet of the second frame of facial image of the user.
  14. An image processing apparatus, comprising:
    a first acquisition module, configured to acquire a first frame of facial image of a user, the first frame of facial image of the user comprising multiple facial organ images;
    a second acquisition module, configured to acquire multiple first images matching the multiple facial organ images;
    a first sending module, configured to send a data packet of the first frame of facial image of the user to a receiving end, the data packet comprising indexes of the multiple first images, the indexes of the multiple first images being used to obtain the multiple first images.
  15. The apparatus according to claim 14, wherein
    the multiple facial organ images are images of the user's real facial organs, and the multiple first images are images of facial organs virtualized for the user.
  16. The apparatus according to claim 14 or 15, wherein the second acquisition module is specifically configured to:
    for each facial organ image among the multiple facial organ images, compare the facial organ image with a standard organ image corresponding to the facial organ image to determine a first difference value;
    acquire, according to the first difference value, the first image matching the facial organ image, wherein a second difference value, between the first image matching the facial organ image and the standard organ image, and the first difference value satisfy a first condition.
  17. The apparatus according to any one of claims 14 to 16, further comprising:
    a second sending module, configured to send at least one audio data packet to the receiving end, wherein a timestamp of the audio data packet matches a timestamp of the data packet of the first frame of facial image of the user.
  18. The apparatus according to any one of claims 14 to 17, further comprising:
    a third acquisition module, configured to acquire a second frame of facial image of the user, the second frame of facial image of the user being earlier than the first frame of facial image of the user;
    a fourth acquisition module, configured to acquire multiple second images matching multiple facial organ images of the second frame of facial image of the user;
    a third sending module, configured to send a data packet of the second frame of facial image of the user to the receiving end, the data packet comprising indexes of the multiple second images, the indexes of the multiple second images being used to obtain the multiple second images.
  19. The apparatus according to claim 18, further comprising:
    a receiving module, configured to receive indication information sent by the receiving end, the indication information being used to instruct sending of a facial image earlier than the first frame of facial image of the user.
  20. An image processing apparatus, comprising:
    a first receiving module, configured to receive, from a sending end, a data packet of a first frame of facial image of a user, the data packet comprising indexes of multiple first images, the first frame of facial image of the user comprising multiple facial organ images, the multiple first images matching the multiple facial organ images;
    a first acquisition module, configured to acquire the multiple first images;
    a first generation module, configured to generate a first frame of facial image of a receiving end according to the multiple first images.
  21. The apparatus according to claim 20, wherein
    the multiple facial organ images are images of the user's real facial organs, and the multiple first images are images of facial organs virtualized for the user.
  22. The apparatus according to claim 20 or 21, further comprising:
    a second receiving module, configured to receive at least one audio data packet from the sending end, wherein a timestamp of the audio data packet matches a timestamp of the data packet of the first frame of facial image of the user.
  23. The apparatus according to any one of claims 20 to 22, further comprising:
    a third receiving module, configured to receive, from the sending end, a data packet of a second frame of facial image of the user, the second frame of facial image of the user being earlier than the first frame of facial image of the user, the data packet comprising indexes of multiple second images, the multiple second images matching multiple facial organ images comprised in the second frame of facial image of the user.
  24. The apparatus according to claim 23, further comprising:
    a sending module, configured to send indication information to the sending end, the indication information being used to instruct sending of a facial image earlier than the first frame of facial image of the user.
  25. The apparatus according to claim 23 or 24, further comprising:
    a discarding module, configured to discard the data packet of the second frame of facial image of the user if the first frame of facial image of the receiving end has been generated.
  26. The apparatus according to claim 23 or 24, further comprising:
    a second generation module, configured to: if a third frame of facial image of the receiving end corresponding to a third frame of facial image of the user has not yet been generated, wherein the third frame of facial image of the user is earlier than the second frame of facial image of the user, generate a second frame of facial image of the receiving end according to the data packet of the second frame of facial image of the user.
  27. An image processing apparatus, comprising: a memory and a processor;
    the memory stores instructions executable by the processor, and the instructions are executed by the processor to enable the processor to perform the method according to any one of claims 1 to 13.
  28. The apparatus according to claim 27, wherein the apparatus is a terminal device.
  29. An image processing apparatus, configured to perform the method according to any one of claims 1 to 6, or configured to perform the method according to any one of claims 7 to 13.
  30. An image processing system, comprising: a sending end configured to perform the method according to any one of claims 1 to 6, and a receiving end configured to perform the method according to any one of claims 7 to 13.
  31. A computer-readable storage medium, wherein the storage medium stores computer instructions, and the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 13.
  32. A computer program product, storing computer instructions, the computer instructions being used to cause a computer to perform the method according to any one of claims 1 to 13.
PCT/CN2021/070579 2020-01-08 2021-01-07 Image processing method, device, and system WO2021139706A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010018738.6A CN113099150B (zh) 2020-01-08 2020-01-08 Image processing method, device, and system
CN202010018738.6 2020-01-08

Publications (1)

Publication Number Publication Date
WO2021139706A1 true WO2021139706A1 (zh) 2021-07-15

Family

ID=76663317

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070579 WO2021139706A1 (zh) 2020-01-08 2021-01-07 图像处理的方法、设备及系统

Country Status (2)

Country Link
CN (1) CN113099150B (zh)
WO (1) WO2021139706A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11213132A (ja) * 1998-01-27 1999-08-06 Atr Ningen Joho Tsushin Kenkyusho:Kk 任意表情を持つ3次元顔モデルの生成方法
CN101390375A (zh) * 2006-02-27 2009-03-18 京瓷株式会社 图像信息共享系统
JP2010086174A (ja) * 2008-09-30 2010-04-15 Fujifilm Corp 画像共有システムおよび画像共有方法
CN103258190A (zh) * 2013-05-13 2013-08-21 苏州福丰科技有限公司 一种用于移动终端的人脸识别方法
CN104574299A (zh) * 2014-12-25 2015-04-29 小米科技有限责任公司 人脸图片处理方法及装置
CN106331572A (zh) * 2016-08-26 2017-01-11 乐视控股(北京)有限公司 一种基于图像的控制方法和装置
GB2559975A (en) * 2017-02-22 2018-08-29 Cubic Motion Ltd Method and apparatus for tracking features
CN109740476A (zh) * 2018-12-25 2019-05-10 北京琳云信息科技有限责任公司 即时通讯方法、装置和服务器

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054287B (zh) * 2009-11-09 2015-05-06 腾讯科技(深圳)有限公司 面部动画视频生成的方法及装置
KR20130022434A (ko) * 2011-08-22 2013-03-07 (주)아이디피쉬 통신단말장치의 감정 컨텐츠 서비스 장치 및 방법, 이를 위한 감정 인지 장치 및 방법, 이를 이용한 감정 컨텐츠를 생성하고 정합하는 장치 및 방법
CN102271241A (zh) * 2011-09-02 2011-12-07 北京邮电大学 一种基于面部表情/动作识别的图像通信方法及系统
CN103368929B (zh) * 2012-04-11 2016-03-16 腾讯科技(深圳)有限公司 一种视频聊天方法及系统
CN103442137B (zh) * 2013-08-26 2016-04-13 苏州跨界软件科技有限公司 一种在手机通话中查看对方虚拟人脸的方法
CN103647922A (zh) * 2013-12-20 2014-03-19 百度在线网络技术(北京)有限公司 虚拟视频通话方法和终端
CN106204698A (zh) * 2015-05-06 2016-12-07 北京蓝犀时空科技有限公司 为自由组合创作的虚拟形象生成及使用表情的方法和系统
CN107333086A (zh) * 2016-04-29 2017-11-07 掌赢信息科技(上海)有限公司 一种在虚拟场景中进行视频通信的方法及装置
CN109670385B (zh) * 2017-10-16 2023-04-18 腾讯科技(深圳)有限公司 一种应用程序中表情更新的方法及装置
CN108038422B (zh) * 2017-11-21 2021-12-21 平安科技(深圳)有限公司 摄像装置、人脸识别的方法及计算机可读存储介质
CN108875539B (zh) * 2018-03-09 2023-04-07 北京旷视科技有限公司 表情匹配方法、装置和系统及存储介质
CN110472523A (zh) * 2019-07-25 2019-11-19 天脉聚源(杭州)传媒科技有限公司 用于生成虚拟形象的表情采集方法、系统、装置和介质
CN110557625A (zh) * 2019-09-17 2019-12-10 北京达佳互联信息技术有限公司 虚拟形象直播方法、终端、计算机设备及存储介质

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11213132A (ja) * 1998-01-27 1999-08-06 Atr Ningen Joho Tsushin Kenkyusho:Kk 任意表情を持つ3次元顔モデルの生成方法
CN101390375A (zh) * 2006-02-27 2009-03-18 京瓷株式会社 图像信息共享系统
JP2010086174A (ja) * 2008-09-30 2010-04-15 Fujifilm Corp 画像共有システムおよび画像共有方法
CN103258190A (zh) * 2013-05-13 2013-08-21 苏州福丰科技有限公司 一种用于移动终端的人脸识别方法
CN104574299A (zh) * 2014-12-25 2015-04-29 小米科技有限责任公司 人脸图片处理方法及装置
CN106331572A (zh) * 2016-08-26 2017-01-11 乐视控股(北京)有限公司 一种基于图像的控制方法和装置
GB2559975A (en) * 2017-02-22 2018-08-29 Cubic Motion Ltd Method and apparatus for tracking features
CN109740476A (zh) * 2018-12-25 2019-05-10 北京琳云信息科技有限责任公司 即时通讯方法、装置和服务器

Also Published As

Publication number Publication date
CN113099150B (zh) 2022-12-02
CN113099150A (zh) 2021-07-09

Similar Documents

Publication Publication Date Title
US11490132B2 (en) Dynamic viewpoints of live event
US9924159B2 (en) Shared scene mesh data synchronization
US10771736B2 (en) Compositing and transmitting contextual information during an audio or video call
CN106488265A (zh) 一种发送媒体流的方法和装置
CN113286184B (zh) 一种在不同设备上分别播放音频与视频的唇音同步方法
US11741616B2 (en) Expression transfer across telecommunications networks
WO2022019719A1 (en) Generation and distribution of immersive media content from streams captured via distributed mobile devices
CN113726815B (zh) 一种动态调整视频的方法、电子设备、芯片系统和存储介质
US10104415B2 (en) Shared scene mesh data synchronisation
CN112165598A (zh) 数据处理的方法、装置、终端和存储介质
WO2021139706A1 (zh) 图像处理的方法、设备及系统
US20160212180A1 (en) Shared Scene Object Synchronization
KR20120040622A (ko) 영상 통신 방법 및 장치
US11290680B1 (en) High-fidelity freeze-frame for precision video communication applications
CN112272305A (zh) 一种多路实时交互视频缓存存储方法
WO2024160031A1 (zh) 一种数字人通信方法及装置
WO2021199128A1 (ja) 画像データ転送装置、画像生成方法およびコンピュータプログラム
US12044845B2 (en) Towards subsiding motion sickness for viewport sharing for teleconferencing and telepresence for remote terminals
US20240187673A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2024167098A1 (en) Methods and device for providing seamless connectivity in a call
US20230421743A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
US20140225984A1 (en) Complimentary Video Content
US20220038756A1 (en) Network-based assistance for receiver processing of video data
CN117956211A (zh) 一种投屏方法及装置
CN118540513A (zh) 视频帧处理方法及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21738693

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21738693

Country of ref document: EP

Kind code of ref document: A1