WO2022143128A1 - Avatar-based video call method, apparatus, and terminal - Google Patents

Avatar-based video call method, apparatus, and terminal

Info

Publication number
WO2022143128A1
Authority
WO
WIPO (PCT)
Prior art keywords
terminal
feature information
frame
video
avatar
Application number
PCT/CN2021/137526
Inventor
林宇航
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2022143128A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302 Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307 Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/8547 Content authoring involving timestamps for synchronizing content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals

Definitions

  • the embodiments of the present application relate to the technical field of terminals, and in particular, to a method, device, and terminal for a video call based on an avatar.
  • Internet-based and mobile Internet-based audio and video call technologies are currently widely used communication methods in the social field. Compared with traditional telephones, Internet telephony charges are lower and it is more convenient to use. With the help of mobile Internet technology, Internet telephony does not require fixed terminal equipment, and users can use portable terminals such as mobile phones to access. In addition, compared to traditional phones that can only transmit audio, VoIP can also implement video calls.
  • face recognition technology has developed rapidly.
  • the recognition of faces and facial features through cameras has been widely used in the fields of identity recognition, face replacement, and expression mapping.
  • The avatar-based video call method, device, and terminal provided by the embodiments of the present application are used to solve the problem in the prior art that an avatar video call cannot be used when network conditions are poor.
  • a first aspect provides a method for a video call based on an avatar, which is applied to a first terminal, and the method includes:
  • the first terminal collects image data and audio data of the user during the call
  • the first terminal extracts multi-frame target feature information from the image data, where the multi-frame target feature information includes feature information used to characterize the user's facial expression and head movement;
  • the first terminal transmits the multi-frame target feature information and the audio data to the second terminal, and the second terminal is used to map the multi-frame target feature information to a preset target avatar so as to generate a video call image, where the video call image contains the target avatar with the above facial expressions and head movements.
  • the first terminal does not need to transmit a video stream to the second terminal, but only needs to transmit the feature information extracted from the image data, which greatly reduces the amount of data that needs to be transmitted, so that users can use video calls to connect with other users even in poor network conditions.
  • because the first terminal does not need to transmit the real-time image of the user during the call to the second terminal, the privacy and security of the user can also be guaranteed.
  • the image data includes multiple video frames
  • a first face recognition engine is configured in the first terminal, and when the first terminal extracts the multi-frame target feature information from the image data, it may use the first face recognition engine to analyze the facial features in each video frame and obtain the feature point information contained in each video frame; the first terminal then encodes the feature point information of each video frame to obtain multi-frame target feature information in one-to-one correspondence with the video frames.
  • the first terminal encodes the feature point information of each video frame to obtain multi-frame target feature information in one-to-one correspondence with the video frames, which can be performed according to the following steps: the first terminal determines the frame serial number of each frame of target feature information according to the order in which the video frames are received; the first terminal identifies a plurality of face regions according to the feature point information contained in each video frame; the first terminal obtains feature information of each face region, where the feature information includes state information and coordinate information of the face region; the first terminal stores the frame serial number and the feature information of each face region in a preset data structure to obtain the multi-frame target feature information.
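The encoding steps above can be sketched as follows. This is only a minimal illustration: the text does not define the preset data structure, so the class names, field names, and state values (`RegionFeature`, `TargetFeatureFrame`, `"open"`, `"closed"`) are hypothetical, and the face recognition engine output is replaced by hand-written stand-in data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RegionFeature:
    """Feature information for one face region (field names are hypothetical)."""
    state: str                          # state information, e.g. "open" / "closed"
    coords: List[Tuple[float, float]]   # coordinate information of the region


@dataclass
class TargetFeatureFrame:
    """One frame of target feature information."""
    frame_no: int                                           # frame serial number
    regions: Dict[str, RegionFeature] = field(default_factory=dict)


def encode_frames(per_frame_regions):
    """Number frames in arrival order and store each frame's region features."""
    frames = []
    for frame_no, regions in enumerate(per_frame_regions):
        frames.append(TargetFeatureFrame(frame_no=frame_no, regions=dict(regions)))
    return frames


# Stand-in for the per-frame output of a face recognition engine (assumed data).
analysed = [
    {"mouth": RegionFeature("closed", [(0.40, 0.70), (0.60, 0.70)])},
    {"mouth": RegionFeature("open", [(0.40, 0.72), (0.60, 0.72)])},
]
encoded = encode_frames(analysed)
```

Each `TargetFeatureFrame` corresponds to exactly one video frame, which matches the one-to-one correspondence described in the text.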
  • before the first terminal collects the image data and audio data of the user during the call, the method further includes: the first terminal determines the face region to be transmitted.
  • acquiring the feature information of each face region by the first terminal includes: the first terminal determines a key video frame from the plurality of video frames; for the key video frame, the first terminal acquires the feature information of the face region to be transmitted in the key video frame.
  • for the non-key video frames, the first terminal determines whether the feature information of the face region to be transmitted has changed between any two adjacent non-key video frames; if it has changed, the first terminal acquires the feature information of the face region to be transmitted in the changed non-key video frame.
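The key-frame handling above resembles delta encoding, and a rough sketch under stated assumptions may help. The text does not say how key video frames are chosen, so the fixed `key_interval` below is an assumption, as are the function name and the dictionary layout of each output item.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, Tuple[float, ...]]   # (state, flattened coordinates), assumed layout


def delta_encode(frames: List[Dict[str, Region]], key_interval: int = 30):
    """Emit full features for key frames and only changed regions otherwise."""
    out = []
    prev: Dict[str, Region] = {}
    for i, cur in enumerate(frames):
        if i % key_interval == 0:
            # Key frame: keep every region selected for transmission.
            out.append({"frame_no": i, "key": True, "regions": dict(cur)})
        else:
            # Non-key frame: keep only regions that differ from the previous frame.
            changed = {name: feat for name, feat in cur.items() if prev.get(name) != feat}
            out.append({"frame_no": i, "key": False, "regions": changed})
        prev = cur
    return out


frames = [
    {"mouth": ("closed", (0.4, 0.7)), "left_eye": ("open", (0.3, 0.4))},
    {"mouth": ("open", (0.4, 0.72)), "left_eye": ("open", (0.3, 0.4))},
    {"mouth": ("open", (0.4, 0.72)), "left_eye": ("open", (0.3, 0.4))},
]
packets = delta_encode(frames)
```

In this example only the mouth region is carried for the second frame, and the third frame carries no region data at all because nothing changed.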
  • the first terminal is configured with a first face recognition engine
  • the second terminal is configured with a second face recognition engine
  • the first face recognition engine and the second face recognition engine are the same type of face recognition engine
  • the multi-frame target feature information is the original feature information recognized by the first face recognition engine
  • the second terminal is used to map the original feature information to the target avatar using the second face recognition engine to generate the video call image.
  • before the first terminal transmits the multi-frame target feature information and audio data to the second terminal, the method further includes: the first terminal adds timestamps to the multi-frame target feature information and the audio data.
  • the first terminal transmitting the target feature information and audio data to the second terminal includes: the first terminal encapsulates the target feature information and audio data into a call data stream; the first terminal transmits the call data stream to the second terminal.
  • before the first terminal transmits the target feature information and audio data to the second terminal, the method further includes: the first terminal transmits avatar number information to the second terminal, where the avatar number information is used to instruct the second terminal to determine the target avatar from the plurality of avatars.
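One way to picture timestamping and encapsulation into a single call data stream is a length-prefixed packet format. The layout below (a 4-byte header length, a JSON header with `kind`, `ts`, and `size`, then the payload) is purely an assumption for illustration; the text does not specify the stream format.

```python
import json
import struct


def pack_packet(kind: str, timestamp_ms: int, payload: bytes) -> bytes:
    """Frame one packet as: 4-byte big-endian header length, JSON header, payload."""
    header = json.dumps({"kind": kind, "ts": timestamp_ms, "size": len(payload)}).encode("utf-8")
    return struct.pack(">I", len(header)) + header + payload


def unpack_stream(stream: bytes):
    """Split concatenated packets back into (kind, timestamp, payload) tuples."""
    packets, pos = [], 0
    while pos < len(stream):
        (hlen,) = struct.unpack_from(">I", stream, pos)
        pos += 4
        header = json.loads(stream[pos:pos + hlen].decode("utf-8"))
        pos += hlen
        payload = stream[pos:pos + header["size"]]
        pos += header["size"]
        packets.append((header["kind"], header["ts"], payload))
    return packets


features = json.dumps({"frame_no": 0, "mouth": "open"}).encode("utf-8")
audio = b"\x00\x01\x02\x03"   # stand-in for an encoded audio chunk
call_stream = pack_packet("feature", 0, features) + pack_packet("audio", 0, audio)
```

Because every packet carries its own timestamp, a receiver that runs `unpack_stream` can separate feature packets from audio packets and still line them up in time.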
  • a second aspect provides a method for a video call based on an avatar, which is applied to a second terminal communicating with the first terminal, and the method includes:
  • the second terminal receives the call data stream transmitted by the first terminal;
  • the call data stream includes audio data and multi-frame target feature information, and the multi-frame target feature information includes feature information used to characterize the user's facial expressions and head movements during the call;
  • the second terminal maps the multi-frame target feature information to the preset target avatar to generate a video call image, and the video call image includes the target avatar with the above-mentioned facial expressions and head movements;
  • when displaying the video call image, the second terminal synchronously plays the audio data.
  • the second terminal mapping the multi-frame target feature information to the preset target avatar to generate the video call image includes: the second terminal splits the audio data and the multi-frame target feature information from the call data stream; the second terminal determines the facial expressions and head movements contained in each frame of target feature information; the second terminal maps the facial expressions and head movements contained in each frame of target feature information to the preset target avatar to generate the video call image.
  • each frame of target feature information includes state information and coordinate information of multiple facial regions
  • the second terminal determining the facial expressions and head movements contained in each frame of target feature information includes: the second terminal calculates the orientation of the user's head according to the coordinate information of the multiple facial regions; the second terminal adjusts the orientation of the user's head according to the state information of the multiple facial regions, and simulates the facial expressions and head movements.
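The text does not state how the head orientation is computed from the coordinate information. One common geometric approach, shown here only as an assumed illustration, is to take the normal of the plane through three facial landmarks. The sketch assumes 3D coordinates are available and uses hypothetical landmark names.

```python
import math


def face_normal(left_eye, right_eye, mouth):
    """Unit normal of the plane through three 3D landmarks (assumed inputs)."""
    u = [r - l for r, l in zip(right_eye, left_eye)]   # eye-to-eye vector
    v = [m - l for m, l in zip(mouth, left_eye)]       # eye-to-mouth vector
    # Cross product u x v gives a vector perpendicular to the face plane.
    n = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    length = math.sqrt(sum(c * c for c in n))
    return [c / length for c in n]


# A face looking straight along the z axis: all landmarks lie in the z = 0 plane.
normal = face_normal((-1, 1, 0), (1, 1, 0), (0, -1, 0))
```

The sign of the normal depends on the order of the landmarks and on the coordinate convention, so a real implementation would need to fix both.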
  • the multi-frame target feature information includes target feature information corresponding to key video frames and target feature information corresponding to non-key video frames; the target feature information corresponding to a key video frame includes the complete feature information of the key video frame, and the target feature information corresponding to a non-key video frame includes the feature information that has changed in the non-key video frame; after the second terminal splits the audio data and the multi-frame target feature information from the call data stream, the method further includes: the second terminal generates complete feature information of the non-key video frame according to the complete feature information of the key video frame and the changed feature information of the non-key video frame.
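Regenerating complete feature information on the receiving side can be sketched as overlaying the changed regions on the most recent complete state. The packet layout (`frame_no`, `key`, `regions`) and the function name are assumptions made for this illustration, not a format given in the text.

```python
def reconstruct(packets):
    """Rebuild complete per-frame features from key frames plus changed regions."""
    full_frames = []
    current = {}
    for pkt in packets:
        if pkt["key"]:
            current = dict(pkt["regions"])            # key frame carries everything
        else:
            current = {**current, **pkt["regions"]}   # overlay only what changed
        full_frames.append({"frame_no": pkt["frame_no"], "regions": current})
    return full_frames


received = [
    {"frame_no": 0, "key": True, "regions": {"mouth": "closed", "left_eye": "open"}},
    {"frame_no": 1, "key": False, "regions": {"mouth": "open"}},
    {"frame_no": 2, "key": False, "regions": {}},
]
complete = reconstruct(received)
```

A new dictionary is built for every frame, so earlier reconstructed frames are not modified when later changes arrive.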
  • the first terminal is configured with a first face recognition engine
  • the second terminal is configured with a second face recognition engine
  • the first face recognition engine and the second face recognition engine are the same type of face recognition engine
  • the multi-frame target feature information is the original feature information recognized by the first face recognition engine
  • the second terminal mapping the multi-frame target feature information to the preset target avatar to generate the video call image includes: the second terminal uses the second face recognition engine to map the original feature information to the target avatar, so as to generate the video call image.
  • before the second terminal receives the call data stream transmitted by the first terminal, the method further includes: the second terminal receives the avatar number information transmitted by the first terminal; the second terminal determines the target avatar from among the plurality of avatars according to the avatar number information.
  • the multi-frame target feature information and audio data have time stamps
  • the second terminal synchronously playing the audio data when displaying the video call image includes: the second terminal determines the timestamp of each frame of the video call image according to the timestamps of the multi-frame target feature information; the second terminal synchronizes the video call image and the audio data according to the timestamp of each frame of the video call image and the timestamps of the audio data.
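A simple way to picture timestamp-based synchronization is to pair each rendered frame with the audio chunk whose timestamp is closest. The nearest-timestamp rule, the millisecond units, and the function names below are assumptions; the text only says that both streams carry timestamps and are synchronized by them.

```python
from bisect import bisect_left


def nearest_audio(audio_ts, frame_ts):
    """Index of the audio chunk whose timestamp is closest to frame_ts (audio_ts sorted)."""
    i = bisect_left(audio_ts, frame_ts)
    if i == 0:
        return 0
    if i == len(audio_ts):
        return len(audio_ts) - 1
    before, after = audio_ts[i - 1], audio_ts[i]
    return i if after - frame_ts < frame_ts - before else i - 1


def schedule(frame_ts_list, audio_ts):
    """Pair each rendered frame timestamp with the audio chunk to play alongside it."""
    return [(ts, nearest_audio(audio_ts, ts)) for ts in frame_ts_list]


audio_timestamps = [0, 20, 40, 60, 80]   # assumed 20 ms audio chunks
frame_timestamps = [0, 33, 66]           # assumed frame timestamps in ms
pairs = schedule(frame_timestamps, audio_timestamps)
```

Here each video call image inherits the timestamp of the target feature information it was generated from, which is what allows the lookup against the audio timestamps.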
  • a third aspect provides an avatar-based video call device, which can be applied to the first terminal and may specifically include the following modules:
  • the acquisition module is used to collect the image data and audio data of the user during the call;
  • the extraction module is used for extracting multi-frame target feature information from the image data, and the multi-frame target feature information includes feature information used to characterize the user's facial expression and head action;
  • the transmission module is used to transmit the multi-frame target feature information and audio data to the second terminal, and the second terminal is used to map the multi-frame target feature information to the preset target avatar, so as to generate a video call image, a video call image contains the target avatar with the above facial expressions and head movements.
  • the image data includes multiple video frames
  • the first terminal is configured with a first face recognition engine
  • the extraction module may specifically include the following submodules:
  • the parsing sub-module is used to analyze the facial features in each video frame by using the first face recognition engine to obtain the feature point information contained in each video frame;
  • the encoding sub-module is used to encode the feature point information of each video frame, so as to obtain multi-frame target feature information in one-to-one correspondence with the video frames.
  • the encoding sub-module may specifically include the following units:
  • a frame serial number determining unit used for respectively determining the frame serial number of each frame of target feature information according to the order in which each video frame is received;
  • a face area identification unit used for identifying multiple face areas according to the feature point information contained in each video frame
  • a feature information acquisition unit used to obtain the feature information of each face area, and the feature information includes the state information and coordinate information of each face area;
  • the feature information storage unit is configured to store the frame serial number and the feature information of each face region in a preset data structure to obtain multi-frame target feature information.
  • the encoding sub-module may further include the following units:
  • the face area determination unit is used to determine the face area to be transmitted.
  • the feature information acquisition unit may specifically include the following subunits:
  • a key video frame determination subunit for determining key video frames from multiple video frames
  • the first feature information obtaining subunit is used for obtaining the feature information of the face region to be transmitted in the key video frame for the key video frame;
  • the second feature information acquisition subunit is used to determine, for the non-key video frames, whether the feature information of the face region to be transmitted has changed between any two adjacent non-key video frames, and, if it has changed, to acquire the feature information of the face region to be transmitted in the changed non-key video frame.
  • the first terminal is configured with a first face recognition engine
  • the second terminal is configured with a second face recognition engine
  • the first face recognition engine and the second face recognition engine are the same type of face recognition engine
  • the multi-frame target feature information is the original feature information recognized by the first face recognition engine
  • the second terminal is used to map the original feature information to the target avatar using the second face recognition engine to generate the video call image.
  • the apparatus may further include the following modules:
  • the timestamp adding module is used to add timestamps to multi-frame target feature information and audio data.
  • the transmission module may specifically include the following submodules:
  • the encapsulation submodule is used to encapsulate the target feature information and audio data into a call data stream;
  • the transmission submodule is used for transmitting the call data stream to the second terminal.
  • the transmission module is further configured to transmit avatar number information to the second terminal, where the avatar number information is used to instruct the second terminal to determine the target avatar from the plurality of avatars.
  • a fourth aspect provides an avatar-based video call device, which can be applied to a second terminal and may specifically include the following modules:
  • the receiving module is used for receiving the call data stream transmitted by the first terminal, where the call data stream includes audio data and multi-frame target feature information, and the multi-frame target feature information includes feature information used to characterize the user's facial expressions and head movements during the call;
  • mapping module for mapping multiple frames of target feature information to a preset target avatar to generate a video call image, where the video call image includes a target avatar with facial expressions and head movements;
  • the call module is used to display the video call image and play audio data synchronously.
  • mapping module may specifically include the following submodules:
  • the splitting submodule is used to split the audio data and multi-frame target feature information from the call data stream;
  • the determination submodule is used to determine the facial expressions and head movements contained in each frame of target feature information;
  • the mapping sub-module is used to map the facial expressions and head movements contained in each frame of target feature information to the preset target avatar to generate the video call image.
  • each frame of target feature information includes the state information and coordinate information of a plurality of face regions
  • the determination submodule can specifically include the following units:
  • a calculation unit used for calculating the orientation of the user's head according to the coordinate information of the multiple face regions
  • the adjustment and simulation unit is used to adjust the orientation of the user's head according to the state information of multiple facial regions, and to simulate the facial expressions and head movements.
  • the multi-frame target feature information includes target feature information corresponding to key video frames and target feature information corresponding to non-key video frames; the target feature information corresponding to a key video frame includes the complete feature information of the key video frame, and the target feature information corresponding to a non-key video frame includes the feature information that has changed in the non-key video frame;
  • the mapping module may also include the following submodules:
  • the generating sub-module is used for generating complete feature information of the non-key video frame according to the complete feature information of the key video frame and the changed feature information of the non-key video frame.
  • the first terminal is configured with a first face recognition engine
  • the second terminal is configured with a second face recognition engine
  • the first face recognition engine and the second face recognition engine are the same type of face recognition engine
  • the multi-frame target feature information is the original feature information recognized by the first face recognition engine
  • the mapping sub-module is also used to map the original feature information to the target avatar using the second face recognition engine to generate the video call image.
  • the receiving module may further include the following submodules:
  • an avatar number information receiving submodule for receiving the avatar number information transmitted by the first terminal
  • the target avatar determination submodule is used for determining the target avatar from the plurality of avatars according to the avatar number information.
  • the multi-frame target feature information and audio data have timestamps
  • the call module may specifically include the following submodules:
  • the timestamp determination submodule is used to determine the timestamp of each frame of video call images according to the timestamps of the multi-frame target feature information
  • the audio and video synchronization sub-module is used to synchronize the video call image and the audio data according to the time stamp of each frame of the video call image and the time stamp of the audio data.
  • a fifth aspect provides a terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the avatar-based video call method according to any one of the first aspect or the second aspect.
  • a sixth aspect provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium, and when the computer instructions are executed on a terminal, the terminal executes the above-mentioned related method steps to implement the avatar-based video call method according to any one of the first aspect or the second aspect.
  • a seventh aspect provides a computer program product that, when run on a computer, causes the computer to execute the above-mentioned related steps to implement the avatar-based video call method according to any one of the first aspect or the second aspect.
  • an eighth aspect provides a chip, where the chip includes a memory and a processor, and the processor executes a computer program stored in the memory, so as to implement the avatar-based video call method according to any one of the first aspect or the second aspect.
  • a ninth aspect provides a communication system, comprising the first terminal according to any one of the first aspect, the second terminal according to any one of the second aspect, and a communication device for establishing a communication connection between the first terminal and the second terminal.
  • FIG. 1 is a schematic interface diagram of an avatar video call in the prior art.
  • FIG. 2 is a schematic diagram of comparison between the avatar-based video calling method provided by the embodiment of the present application and the conventional avatar video calling method in the prior art.
  • FIG. 3 is a schematic diagram of data transmission provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 5 is a software structural block diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of steps of a method for a video call based on an avatar provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an operation of triggering a first terminal to initiate a video call request according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an operation of accepting a video call request by a second terminal according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a call interface when a video call is performed between a first terminal and a second terminal according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a data processing process of a first terminal provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a processing manner of a video frame provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a data processing process of a second terminal provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a face normal provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of steps of an avatar-based video call method implemented on the first terminal side provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of steps of another avatar-based video call method implemented on the first terminal side provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of steps of another avatar-based video call method implemented on the first terminal side provided by an embodiment of the present application.
  • FIG. 17 is a schematic diagram of steps of an avatar-based video call method implemented on a second terminal side provided by an embodiment of the present application.
  • FIG. 18 is a structural block diagram of a device for video calling based on an avatar provided by an embodiment of the present application.
  • FIG. 19 is a structural block diagram of another avatar-based video call device provided by an embodiment of the present application.
  • words such as “first” and “second” are used to distinguish the same or similar items with basically the same function and effect.
  • the first face recognition engine, the second face recognition engine, etc. are only for distinguishing the face recognition engines on different terminals, and the number and execution order thereof are not limited.
  • “at least one” refers to one or more, and “multiple” refers to two or more.
  • “And/or”, which describes the association relationship of the associated objects, means that there can be three kinds of relationships, for example, A and/or B, it can mean that A exists alone, A and B exist at the same time, and B exists alone, where A, B can be singular or plural.
  • the character “/” generally indicates that the related objects are an “or” relationship.
  • “At least one of the following items” or similar expressions refer to any combination of these items, including any combination of a single item or plural items.
  • For example, at least one of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
  • the steps involved in the avatar-based video call method provided in the embodiment of the present application are only examples; not all steps are mandatory, and not all information or content in the messages is mandatory, so steps and content can be added or removed as needed during use.
  • FIG. 1 shows a schematic interface diagram of an avatar video call in the prior art.
  • the user needs to select one avatar from a plurality of candidate avatars as the avatar of the current call.
  • the user selects the avatar 103 from the area 100 containing a plurality of avatars to be selected as the avatar of the current call.
  • the first terminal or application uses face recognition technology to replace the user's face in the collected video frame with the selected avatar 103, and the replaced user's face is shown as 110 in FIG. 1.
  • the first terminal sends a video stream to the second terminal to implement the avatar video call.
  • the avatar video call in the prior art is to transmit the replaced picture as a complete picture to the peer device, and the whole process is no different from the process of transmitting the video stream and the audio stream in the traditional video call.
  • the size of the video stream to be transmitted during the video call is 1080*1920 pixels, and the number of frames per second (fps) transmitted is 30 frames.
  • the avatar will replace the face in each frame, and the final video stream will still be 1080*1920 pixels and the frame rate will be 30fps, which is not much different from the original video stream in terms of data size. In this way, when the network conditions accessed by the user are poor, such as when the bandwidth cannot support the video call, the video call of the avatar cannot be used.
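To give a feel for the scale of the difference, the figures above can be turned into rough arithmetic. Only the resolution and frame rate come from the text; the 3 bytes per pixel (uncompressed RGB), the 68 landmarks, and the 8 bytes per landmark are hypothetical values chosen for illustration, and a real video stream would be compressed, so the actual gap would be smaller.

```python
WIDTH, HEIGHT, FPS = 1080, 1920, 30        # figures from the text
BYTES_PER_PIXEL = 3                        # assumption: uncompressed 24-bit RGB

# Uncompressed pixel data produced per second of video.
raw_bytes_per_second = WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS

LANDMARKS = 68                             # assumption: hypothetical landmark count
BYTES_PER_LANDMARK = 2 * 4                 # assumption: x and y as 32-bit floats

# Feature data per second if every landmark were sent for every frame.
feature_bytes_per_second = LANDMARKS * BYTES_PER_LANDMARK * FPS

ratio = raw_bytes_per_second / feature_bytes_per_second
print(raw_bytes_per_second, feature_bytes_per_second, round(ratio))
```

Under these assumptions the raw pixel data is 186,624,000 bytes per second, while the feature data is 16,320 bytes per second.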
  • an embodiment of the present application provides a video call method based on an avatar.
  • the first terminal can extract feature information representing the user's facial expressions and head movements from the image data. Then, the first terminal transmits the audio data and the extracted feature information to the second terminal, and the second terminal maps the received feature information to the avatar to form a video call image.
  • the second terminal plays the received audio data synchronously, so that an avatar-based video call can be implemented between the first terminal and the second terminal.
  • the first terminal does not need to transmit the video stream to the second terminal, but only needs to transmit the feature information extracted from the video stream, which greatly reduces the amount of data that needs to be transmitted.
  • Users can therefore still use video calls to connect with other users.
  • Since the first terminal does not need to transmit the real-time image of the user during the call to the second terminal, the privacy and security of the user can also be protected.
  • FIG. 2 is a schematic diagram comparing the avatar-based video calling method provided by the embodiment of the present application with the traditional avatar video calling method in the prior art.
  • FIG. 2 shows a comparison of the data processing at the video call initiating end (i.e., the first terminal) in the embodiment of the present application and in the prior art.
  • In the prior art, the first terminal calls a camera to collect image data, and calls a microphone to collect audio data. Then, the first terminal superimposes the image data and the audio data into a video stream, and transmits the video stream to the opposite terminal (i.e., the second terminal).
  • In the embodiment of the present application, the first terminal may call the camera to collect image data, and call the microphone to collect audio data. Then, the first terminal processes the collected image data and identifies feature information such as facial expressions and head movements in the images. The first terminal superimposes the identified feature information and the audio data into a data stream, and transmits it to the second terminal at the opposite end.
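  • The superimposing of feature information and audio data into one data stream can be sketched as follows. This is only an illustration under stated assumptions: the packet layout, the `CallPacket` name, and the representation of feature information as a list of floating-point values are hypothetical and are not defined in the present application.

```python
import struct
from dataclasses import dataclass
from typing import List

# Hypothetical packet layout (an assumption, not defined in the source):
#   uint64 timestamp_ms | uint16 feature_count | uint32 audio_len
#   | float32 features... | audio bytes
_HEADER = struct.Struct("<QHI")


@dataclass
class CallPacket:
    timestamp_ms: int      # capture time, usable later for audio/image sync
    features: List[float]  # assumed encoding of expression / head-movement values
    audio: bytes           # audio data for the same time slice


def pack(packet: CallPacket) -> bytes:
    """Serialize feature information and audio into one byte string."""
    header = _HEADER.pack(packet.timestamp_ms, len(packet.features), len(packet.audio))
    body = struct.pack(f"<{len(packet.features)}f", *packet.features)
    return header + body + packet.audio


def unpack(data: bytes) -> CallPacket:
    """Inverse of pack(): recover timestamp, features, and audio."""
    ts, count, audio_len = _HEADER.unpack_from(data, 0)
    offset = _HEADER.size
    features = list(struct.unpack_from(f"<{count}f", data, offset))
    offset += 4 * count  # each float32 occupies 4 bytes
    return CallPacket(ts, features, data[offset:offset + audio_len])
```

  • With this layout, each packet adds only a few bytes per feature value on top of the audio payload, which is consistent with the reduced data volume described above.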
  • FIG. 2 also shows a comparison of the data processing at the video call receiving end (i.e., the second terminal) in the embodiment of the present application and in the prior art.
  • In the prior art, after receiving the data stream transmitted by the first terminal, the second terminal decodes the video stream and the audio stream, and displays the corresponding picture and plays the sound to implement the video call.
  • In the embodiment of the present application, the data stream received by the second terminal is not a video stream, but a special call stream in which feature information is superimposed on the audio stream. Therefore, on the one hand, the second terminal can decode the audio stream in the traditional way; on the other hand, the second terminal maps the feature information onto the avatar to form a video call image.
  • the second terminal synchronizes the image and audio according to the time stamp, and implements a video call between the first terminal and the second terminal by displaying the avatar image and playing the sound synchronously.
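  • The timestamp-based synchronization can be illustrated with the following minimal sketch, which picks, for a given audio timestamp, the feature frame with the closest timestamp. The function name, the millisecond unit, and the 40 ms tolerance are assumptions made for illustration; the present application only states that image and audio are synchronized according to the time stamp.

```python
from bisect import bisect_left
from typing import List, Optional


def nearest_feature_index(feature_ts: List[int], audio_ts: int,
                          tolerance_ms: int = 40) -> Optional[int]:
    """Return the index of the feature frame closest in time to audio_ts.

    feature_ts must be sorted in ascending order. Returns None when no
    frame lies within tolerance_ms (a hypothetical tolerance).
    """
    if not feature_ts:
        return None
    i = bisect_left(feature_ts, audio_ts)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(feature_ts)]
    best = min(candidates, key=lambda j: abs(feature_ts[j] - audio_ts))
    return best if abs(feature_ts[best] - audio_ts) <= tolerance_ms else None
```

  • The second terminal would then render the avatar from the selected feature frame while playing the audio that carries the matching timestamp.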
  • In the traditional method, the transmitted data is still a video stream. Since transmitting a video stream occupies a large amount of network bandwidth, the traditional method cannot implement the video call when network conditions are poor.
  • The video call method provided by the embodiment of the present application does not transmit a video stream, but a special data stream formed by adding feature information to the audio stream transmitted for a voice call, which requires less network bandwidth. Even when network conditions are poor, the video call method provided by the embodiments of the present application can implement a video call without downgrading it to a voice call.
  • The above-mentioned first terminal or second terminal may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, a personal computer (PC), a netbook, a personal digital assistant (PDA), or another electronic device with audio and video capture functions.
  • the first terminal and the second terminal in this embodiment of the present application may be electronic devices of the same type, for example, both the first terminal and the second terminal are mobile phones; or, the first terminal and the second terminal are both tablet computers.
  • the first terminal and the second terminal in the embodiments of the present application may also be different types of electronic devices.
  • For example, the first terminal is a mobile phone and the second terminal is a tablet computer; or, the first terminal is a tablet computer and the second terminal is a mobile phone.
  • FIG. 3 is a schematic diagram of data transmission provided by an embodiment of the present application.
  • FIG. 3 includes a first terminal 31 and a second terminal 32.
  • the first terminal 31 can be a mobile phone 311, a tablet computer 312, a PC device 313 or a smart TV 314; similarly, the second terminal 32 can also be a mobile phone 321, a tablet computer 322, a PC device 323 or a smart TV 324.
  • The communication device may be a communication base station, a cloud server, or another device.
  • The first terminal 31 transmits the collected feature information and audio data to the cloud server 30, the cloud server 30 forwards the data to the second terminal 32, and the second terminal 32 processes the data, thereby displaying the video call image of the avatar and playing the corresponding audio to implement the video call between the first terminal 31 and the second terminal 32.
  • The data stream between the first terminal 31 and the second terminal 32 may also be transmitted as a peer-to-peer (P2P) data stream, which is not limited in this embodiment of the present application.
  • FIG. 4 shows a schematic structural diagram of an electronic device 400 .
  • For the structure of the first terminal 31 and the second terminal 32 described above, reference may be made to the structure of the electronic device 400.
  • The electronic device 400 may include a processor 410, an external memory interface 420, an internal memory 421, a universal serial bus (USB) interface 430, a charging management module 440, a power management module 441, a battery 442, an antenna 1, an antenna 2, a mobile communication module 450, a wireless communication module 460, an audio module 470, a speaker 470A, a receiver 470B, a microphone 470C, a headphone jack 470D, a sensor module 480, buttons 490, a motor 491, an indicator 492, a camera 493, a display screen 494, a subscriber identification module (SIM) card interface 495, and so on.
  • The sensor module 480 may include a pressure sensor 480A, a gyroscope sensor 480B, an air pressure sensor 480C, a magnetic sensor 480D, an acceleration sensor 480E, a distance sensor 480F, a proximity light sensor 480G, a fingerprint sensor 480H, a temperature sensor 480J, a touch sensor 480K, an ambient light sensor 480L, a bone conduction sensor 480M, and the like.
  • the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the electronic device 400 .
  • The electronic device 400 may include more or fewer components than shown, combine some components, split some components, or arrange the components differently.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • Processor 410 may include one or more processing units.
  • The processor 410 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units can be stand-alone devices or integrated in one or more processors.
  • the controller can generate an operation control signal according to the instruction operation code and the timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 410 for storing instructions and data.
  • The memory in the processor 410 is a cache memory. This memory may hold instructions or data that the processor 410 has just used or uses repeatedly. If the processor 410 needs the instructions or data again, it can fetch them directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 410, thereby improving system efficiency.
  • the processor 410 may include one or more interfaces.
  • The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL).
  • the processor 410 may include multiple sets of I2C buses.
  • the processor 410 can be respectively coupled to the touch sensor 480K, the charger, the flash, the camera 493 and the like through different I2C bus interfaces.
  • the processor 410 can couple the touch sensor 480K through an I2C interface, so that the processor 410 and the touch sensor 480K communicate with each other through the I2C bus interface, so as to realize the touch function of the electronic device 400 .
  • the I2S interface can be used for audio communication.
  • the processor 410 may include multiple sets of I2S buses.
  • the processor 410 may be coupled with the audio module 470 through an I2S bus to implement communication between the processor 410 and the audio module 470 .
  • the audio module 470 may transmit audio signals to the wireless communication module 460 through the I2S interface, so as to realize the function of answering calls through a Bluetooth headset.
  • The PCM interface can also be used for audio communication, to sample, quantize, and encode analog signals.
  • the audio module 470 and the wireless communication module 460 may be coupled through a PCM bus interface.
  • the audio module 470 may also transmit audio signals to the wireless communication module 460 through the PCM interface, so as to realize the function of answering calls through a Bluetooth headset.
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication.
  • a UART interface is generally used to connect the processor 410 and the wireless communication module 460 .
  • the processor 410 communicates with the Bluetooth module in the wireless communication module 460 through the UART interface to implement the Bluetooth function.
  • the audio module 470 may transmit an audio signal to the wireless communication module 460 through a UART interface, so as to realize the function of playing music through a Bluetooth headset.
  • the MIPI interface can be used to connect the processor 410 with peripheral devices such as the display screen 494 and the camera 493 .
  • MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc.
  • the processor 410 communicates with the camera 493 through a CSI interface, so as to implement the shooting function of the electronic device 400 .
  • the processor 410 communicates with the display screen 494 through the DSI interface to implement the display function of the electronic device 400 .
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 410 with the camera 493, the display screen 494, the wireless communication module 460, the audio module 470, the sensor module 480, and the like.
  • the GPIO interface can also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
  • the USB interface 430 is an interface that conforms to the USB standard specification, and can specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
  • the USB interface 430 can be used to connect a charger to charge the electronic device 400, and can also be used to transmit data between the electronic device 400 and peripheral devices.
  • the USB interface 430 can also be used to connect an earphone and play audio through the earphone.
  • the interface can also be used to connect other electronic devices, such as AR devices.
  • the interface connection relationship between the modules illustrated in the embodiments of the present application is only a schematic illustration, and does not constitute a structural limitation of the electronic device 400 .
  • The electronic device 400 may also adopt interface connection manners different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the charging management module 440 is used to receive charging input from the charger.
  • the charger may be a wireless charger or a wired charger.
  • the charging management module 440 may receive charging input from the wired charger through the USB interface 430 .
  • the charging management module 440 may receive wireless charging input through a wireless charging coil of the electronic device 400 . While the charging management module 440 charges the battery 442 , it can also supply power to the electronic device through the power management module 441 .
  • the power management module 441 is used for connecting the battery 442 , the charging management module 440 and the processor 410 .
  • the power management module 441 receives input from the battery 442 and/or the charging management module 440, and supplies power to the processor 410, the internal memory 421, the display screen 494, the camera 493, the wireless communication module 460, and the like.
  • the power management module 441 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance).
  • the power management module 441 may also be provided in the processor 410 . In other embodiments, the power management module 441 and the charging management module 440 may also be provided in the same device.
  • The wireless communication function of the electronic device 400 may be implemented by the antenna 1, the antenna 2, the mobile communication module 450, the wireless communication module 460, the modem processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 400 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 450 may provide a wireless communication solution including 2G/3G/4G/5G etc. applied on the electronic device 400 .
  • the mobile communication module 450 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), and the like.
  • The mobile communication module 450 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • The mobile communication module 450 can also amplify the signal modulated by the modem processor, and then convert it into electromagnetic waves for radiation through the antenna 1.
  • At least part of the functional modules of the mobile communication module 450 may be provided in the processor 410 . In some embodiments of the present application, at least part of the functional modules of the mobile communication module 450 may be provided in the same device as at least part of the modules of the processor 410 .
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then, the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low frequency baseband signal is processed by the baseband processor and passed to the application processor.
  • the application processor outputs sound signals through audio devices (not limited to speaker 470A, receiver 470B, etc.), or displays images or videos through display screen 494 .
  • the modem processor may be an independent device. In other embodiments, the modem processor may be independent of the processor 410, and may be provided in the same device as the mobile communication module 450 or other functional modules.
  • The wireless communication module 460 can provide wireless communication solutions applied on the electronic device 400, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology.
  • the wireless communication module 460 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 460 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 410 .
  • the wireless communication module 460 can also receive the signal to be sent from the processor 410 , perform frequency modulation and amplification on the signal, and then convert it into an electromagnetic wave for radiation through the antenna 2 .
  • the antenna 1 of the electronic device 400 is coupled with the mobile communication module 450, and the antenna 2 is coupled with the wireless communication module 460, so that the electronic device 400 can communicate with the network and other devices through wireless communication technology.
  • The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a BeiDou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
  • the electronic device 400 implements a display function through a GPU, a display screen 494, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 494 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 410 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 494 is used to display images, video, and the like.
  • Display screen 494 includes a display panel.
  • The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, quantum dot light-emitting diodes (QLED), and so on.
  • the electronic device 400 may include one or N display screens 494 , where N is a positive integer greater than one.
  • the electronic device 400 may implement a shooting function through an ISP, a camera 493, a video codec, a GPU, a display screen 494, an application processor, and the like.
  • the ISP is used to process the data fed back by the camera 493 .
  • the shutter is opened, the light is transmitted to the camera photosensitive element through the lens, the light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, converting it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin tone.
  • ISP can also optimize parameters such as exposure and color temperature of the shooting scene.
  • the ISP may be set in the camera 493 .
  • Camera 493 is used to capture still images or video.
  • An object is projected through the lens to form an optical image on the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV.
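  • As an illustration of what a conversion between the RGB and YUV formats involves, the following sketch converts one normalized RGB pixel to YUV. The BT.601 luma weights and the 0.492 and 0.877 chroma scale factors are one common convention chosen here as an assumption; the present application does not specify which variant the DSP uses.

```python
from typing import Tuple


def rgb_to_yuv(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Convert a normalized RGB pixel (each channel in 0..1) to YUV.

    Assumption: BT.601 luma weights with the classic analog chroma scaling.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luma
    u = 0.492 * (b - y)                    # blue-difference chroma
    v = 0.877 * (r - y)                    # red-difference chroma
    return y, u, v
```

  • For a gray pixel (r = g = b), both chroma components are approximately zero, so only the luma component carries information.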
  • the electronic device 400 may include one or N cameras 493 , where N is a positive integer greater than one.
  • The digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the electronic device 400 selects a frequency point, the digital signal processor is used to perform a Fourier transform on the frequency point energy, and the like.
  • Video codecs are used to compress or decompress digital video.
  • Electronic device 400 may support one or more video codecs.
  • The electronic device 400 can play or record videos in multiple encoding formats, for example, moving picture experts group (MPEG) 1, MPEG2, MPEG3, and MPEG4.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the electronic device 400 can be implemented through the NPU, for example, image recognition, face recognition, speech recognition, text understanding, and the like.
  • the external memory interface 420 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 400.
  • The external memory card communicates with the processor 410 through the external memory interface 420 to implement the data storage function, for example, to save files such as music and videos on the external memory card.
  • Internal memory 421 may be used to store computer executable program code, which includes instructions.
  • the internal memory 421 may include a storage program area and a storage data area.
  • the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area can store data (such as audio data, phone book, etc.) created during the use of the electronic device 400 and the like.
  • The internal memory 421 may include high-speed random access memory, and may also include non-volatile memory, for example, at least one disk storage device, a flash memory device, or a universal flash storage (UFS).
  • the processor 410 executes various functional applications and data processing of the electronic device 400 by executing instructions stored in the internal memory 421, and/or instructions stored in a memory provided in the processor.
  • The electronic device 400 may implement audio functions, such as music playback and recording, through an audio module 470, a speaker 470A, a receiver 470B, a microphone 470C, a headphone jack 470D, an application processor, and the like.
  • the audio module 470 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 470 may also be used to encode and decode audio signals. In some embodiments of the present application, the audio module 470 may be provided in the processor 410 , or some functional modules of the audio module 470 may be provided in the processor 410 .
  • Speaker 470A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • The electronic device 400 can play music or a hands-free call through the speaker 470A.
  • The receiver 470B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
  • The voice can be heard by placing the receiver 470B close to the human ear.
  • Microphone 470C, also called a "mic", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 470C to input the sound signal into the microphone 470C.
  • the electronic device 400 may be provided with at least one microphone 470C. In other embodiments, the electronic device 400 may be provided with two microphones 470C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 400 may further be provided with three, four or more microphones 470C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the headphone jack 470D is used to connect wired headphones.
  • The headphone jack 470D can be the USB interface 430, or can be a 3.5mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the pressure sensor 480A is used to sense pressure signals, and can convert the pressure signals into electrical signals.
  • pressure sensor 480A may be provided on display screen 494 .
  • A capacitive pressure sensor may comprise at least two parallel plates made of conductive material. When a force is applied to the pressure sensor 480A, the capacitance between the electrodes changes.
  • the electronic device 400 determines the intensity of the pressure according to the change in capacitance. When a touch operation acts on the display screen 494, the electronic device 400 detects the intensity of the touch operation according to the pressure sensor 480A.
  • the electronic device 400 may also calculate the touched position according to the detection signal of the pressure sensor 480A.
  • touch operations acting on the same touch position but with different touch operation intensities may correspond to different operation instructions. For example, when a touch operation with a touch operation intensity less than the first pressure threshold acts on the short message application icon, the instruction for viewing the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, the instruction to create a new short message is executed.
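  • The threshold rule in the short message example can be written as a small function. The threshold value of 0.5 and the instruction names are placeholders introduced for illustration; the present application gives no numeric value for the first pressure threshold.

```python
FIRST_PRESSURE_THRESHOLD = 0.5  # hypothetical normalized value, not from the source


def sms_icon_instruction(intensity: float,
                         threshold: float = FIRST_PRESSURE_THRESHOLD) -> str:
    """Map the touch intensity on the short message icon to an instruction.

    Below the threshold: view the short message.
    At or above the threshold: create a new short message.
    """
    if intensity < threshold:
        return "view_short_message"
    return "create_short_message"
```

  • Note that a touch whose intensity exactly equals the threshold falls into the second branch, matching the "greater than or equal to" wording above.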
  • the gyro sensor 480B can be used to determine the motion attitude of the electronic device 400 .
  • the angular velocity of the electronic device 400 about three axes may be determined by the gyro sensor 480B.
  • the gyro sensor 480B can be used for image stabilization. Exemplarily, when the shutter is pressed, the gyro sensor 480B detects the shaking angle of the electronic device 400, calculates the distance to be compensated by the lens module according to the angle, and allows the lens to counteract the shaking of the electronic device 400 through reverse motion to achieve anti-shake.
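  • How a compensation distance might be derived from the shaking angle can be sketched with a simplified geometric model in which a rotation by an angle shifts the image by roughly the focal length times the tangent of that angle. This model, the function name, and the sign convention are assumptions for illustration; the present application does not give the actual calculation.

```python
import math


def lens_compensation_mm(shake_angle_rad: float, focal_length_mm: float) -> float:
    """Estimate the lens shift that counteracts a small device rotation.

    Simplified model (an assumption): the image shifts by f * tan(angle),
    so the lens is moved by the same amount in the opposite direction.
    """
    return -focal_length_mm * math.tan(shake_angle_rad)
```

  • The negative sign expresses the reverse motion: a shake in one direction is offset by moving the lens in the other.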
  • the gyro sensor 480B can also be used for navigation and somatosensory game scenarios.
  • Air pressure sensor 480C is used to measure air pressure.
  • The electronic device 400 calculates the altitude from the air pressure value measured by the air pressure sensor 480C to assist positioning and navigation.
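  • One widely used way to turn an air pressure reading into an altitude is the international barometric formula, sketched below. Its use here is an assumption: the present application does not state which formula the electronic device 400 applies, and the sea-level reference of 1013.25 hPa is the standard-atmosphere default rather than a value from the source.

```python
def altitude_m(pressure_hpa: float, sea_level_hpa: float = 1013.25) -> float:
    """Approximate altitude in metres from air pressure in hPa.

    Uses the international barometric formula for the standard atmosphere;
    accuracy depends on how well sea_level_hpa matches local conditions.
    """
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))
```

  • Lower measured pressure yields a higher altitude, and a reading equal to the sea-level reference yields zero.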
  • Magnetic sensor 480D includes a Hall sensor.
  • the electronic device 400 can detect the opening and closing of the flip holster using the magnetic sensor 480D.
  • When the electronic device 400 is a flip phone, the electronic device 400 can detect the opening and closing of the flip cover using the magnetic sensor 480D, and then set features such as automatic unlocking upon opening according to the detected opening and closing state of the holster or of the flip cover.
  • the acceleration sensor 480E can detect the magnitude of the acceleration of the electronic device 400 in various directions (generally three axes).
  • The magnitude and direction of gravity can be detected when the electronic device 400 is stationary. The acceleration sensor 480E can also be used to identify the posture of the electronic device, in applications such as switching between landscape and portrait display and pedometers.
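  • Landscape and portrait switching from the gravity direction can be sketched as a comparison of two acceleration components. The axis convention (x across the screen, y along its long edge) and the function name are assumptions for illustration and are not defined in the present application.

```python
def screen_orientation(ax: float, ay: float) -> str:
    """Guess the screen orientation from gravity measured while stationary.

    Assumed axes: x runs across the screen, y along its long edge. Gravity
    mostly along y means the device is upright (portrait).
    """
    return "portrait" if abs(ay) >= abs(ax) else "landscape"
```

  • A practical implementation would usually add hysteresis so that the display does not flip back and forth near the 45-degree boundary.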
  • the electronic device 400 can measure the distance by infrared or laser. In some embodiments of the present application, for example, in a shooting scene, the electronic device 400 can use the distance sensor 480F to measure the distance to achieve fast focusing.
  • Proximity light sensor 480G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes.
  • the light emitting diodes may be infrared light emitting diodes.
  • the electronic device 400 emits infrared light to the outside through the light emitting diode.
  • Electronic device 400 uses photodiodes to detect infrared reflected light from nearby objects. When sufficient reflected light is detected, it may be determined that there is an object near the electronic device 400 . When insufficient reflected light is detected, the electronic device 400 may determine that there is no object near the electronic device 400 .
  • the electronic device 400 can use the proximity light sensor 480G to detect that the user holds the electronic device 400 close to the ear to talk, so as to automatically turn off the screen to save power.
  • Proximity light sensor 480G can also be used in holster mode and pocket mode to automatically unlock and lock the screen.
  • the ambient light sensor 480L is used to sense ambient light brightness.
  • the electronic device 400 can adaptively adjust the brightness of the display screen 494 according to the perceived ambient light brightness.
  • the ambient light sensor 480L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 480L can also cooperate with the proximity light sensor 480G to detect whether the electronic device 400 is in the pocket to prevent accidental touch.
  • the fingerprint sensor 480H is used to collect fingerprints.
  • the electronic device 400 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking photos with fingerprints, answering incoming calls with fingerprints, and the like.
  • the temperature sensor 480J is used to detect the temperature.
  • the electronic device 400 uses the temperature detected by the temperature sensor 480J to execute a temperature processing strategy. For example, when the temperature reported by the temperature sensor 480J exceeds a threshold, the electronic device 400 reduces the performance of the processor located near the temperature sensor 480J in order to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is lower than another threshold, the electronic device 400 heats the battery 442 to avoid abnormal shutdown of the electronic device 400 caused by the low temperature. In some other embodiments, when the temperature is lower than yet another threshold, the electronic device 400 boosts the output voltage of the battery 442 to avoid abnormal shutdown caused by low temperature.
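  • As a non-limiting illustration, the temperature processing strategy described above can be sketched as a set of threshold checks. The threshold values, the action names, and the choice to combine the three embodiments in a single function are assumptions made only for this sketch; the present application does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThermalPolicy:
    # All threshold values below are illustrative assumptions.
    high_c: float = 45.0        # above this: reduce processor performance
    low_heat_c: float = 0.0     # below this: heat the battery
    low_boost_c: float = -10.0  # below this: boost the battery output voltage


def thermal_actions(temp_c: float, policy: ThermalPolicy = ThermalPolicy()) -> list:
    """Return the (hypothetically named) actions triggered by one temperature reading."""
    actions = []
    if temp_c > policy.high_c:
        actions.append("reduce_processor_performance")
    if temp_c < policy.low_heat_c:
        actions.append("heat_battery")
    if temp_c < policy.low_boost_c:
        actions.append("boost_battery_output_voltage")
    return actions
```

  • With these assumed thresholds, a reading between 0 and 45 degrees Celsius triggers no action, and a reading below -10 degrees Celsius triggers both low-temperature actions.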
  • the touch sensor 480K is also called “touch device”.
  • the touch sensor 480K may be disposed on the display screen 494, and the touch sensor 480K and the display screen 494 together form a touchscreen.
  • the touch sensor 480K is used to detect a touch operation on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to touch operations may be provided through display screen 494 .
  • the touch sensor 480K may also be disposed on the surface of the electronic device 400 at a different location than the display screen 494 .
  • the bone conduction sensor 480M can acquire vibration signals.
  • the bone conduction sensor 480M can acquire the vibration signal of the bone mass that vibrates when a person speaks.
  • the bone conduction sensor 480M can also be placed in contact with the human pulse to receive a blood pressure pulsation signal.
  • the bone conduction sensor 480M may also be disposed in an earphone to form a bone conduction earphone.
  • the audio module 470 can parse out a voice signal from the vibration signal of the vibrating bone mass obtained by the bone conduction sensor 480M, so as to implement a voice function.
  • the application processor can parse out heart rate information from the blood pressure pulsation signal obtained by the bone conduction sensor 480M, so as to implement a heart rate detection function.
  • the keys 490 include a power-on key, a volume key, and the like.
  • the key 490 may be a mechanical key or a touch key.
  • the electronic device 400 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 400 .
  • Motor 491 can generate vibrating cues.
  • the motor 491 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback.
  • touch operations acting on different applications can correspond to different vibration feedback effects.
  • the motor 491 can also correspond to different vibration feedback effects for touch operations on different areas of the display screen 494 .
  • Different application scenarios (for example, time reminders, receiving information, alarm clocks, and games) can also correspond to different vibration feedback effects.
  • the touch vibration feedback effect can also support customization.
  • the indicator 492 can be an indicator light, which can be used to indicate a charging state, a change in power, or a message, a missed call, a notification, and the like.
  • the SIM card interface 495 is used to connect a SIM card.
  • the SIM card can be inserted into the SIM card interface 495 or pulled out from the SIM card interface 495 to achieve contact and separation with the electronic device 400 .
  • the electronic device 400 may support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
  • the SIM card interface 495 can support Nano SIM cards, Micro SIM cards, SIM cards, and the like.
  • multiple cards can be inserted into the same SIM card interface 495 at the same time.
  • the types of the plurality of cards may be the same or different.
  • the SIM card interface 495 can also be compatible with different types of SIM cards.
  • the SIM card interface 495 is also compatible with external memory cards.
  • the electronic device 400 interacts with the network through the SIM card to implement functions such as calls and data communication.
  • the electronic device 400 adopts an eSIM (i.e., an embedded SIM card).
  • the eSIM card can be embedded in the electronic device 400 and cannot be separated from the electronic device 400 .
  • the software system of the electronic device 400 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present application take a system with a layered architecture as an example to describe the software structure of the electronic device 400.
  • FIG. 5 is a block diagram of a software structure of an electronic device 400 according to an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the runtime and system layer, and the kernel layer.
  • the application layer can include a series of application packages. As shown in FIG. 5 , the application package may include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, and short message.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, and the like.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide the communication function of the electronic device 400 .
  • for example, the management of call status (including connected, hung up, etc.).
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction.
  • the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of charts or scrolling text, such as notifications of applications running in the background, or display notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt tone sounds, the electronic device vibrates, or the indicator light flashes.
  • the runtime includes core libraries and a virtual machine. The runtime is responsible for the scheduling and management of the system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of the system.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the Java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system layer can include multiple functional modules, for example: a surface manager, media libraries, a 3D graphics processing library (e.g., OpenGL ES), and a 2D graphics engine (e.g., SGL).
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of many common audio and video formats, as well as still image files.
  • the media library can support multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer includes at least display drivers, camera drivers, audio drivers, and sensor drivers.
  • the following embodiments take a terminal having the above-mentioned hardware structure/software structure as an example to describe the avatar-based video call method provided by the embodiment of the present application.
  • FIG. 6 shows a schematic diagram of the steps of an avatar-based video call method provided by an embodiment of the present application; the method may specifically include the following steps:
  • the first terminal transmits avatar number information to the second terminal.
  • the first terminal may be a terminal that initiates a video call
  • the second terminal may be a terminal that receives the video call
  • the video call may be triggered by an operation of the first user on the first terminal.
  • the first user may refer to a user using the first terminal; correspondingly, the second user may refer to a user using the second terminal.
  • the first user may click the "Phone" control 701 in the interface of the first terminal as shown in (a) of FIG. 7 .
  • the first terminal enters the dialing interface as shown in (b) of FIG. 7 .
  • the first user can input the phone number of the second user or other contact information that can be used to contact the second user; or, if the contact information of the second user is stored in the first terminal, the first user can also call up the contact information of the second user directly from the first terminal.
  • the first user may input the phone number of the second user in the interface shown in (b) of FIG. 7 .
  • the first user can click the “video call” control 702 to trigger the first terminal to initiate a corresponding video call request to the second terminal.
  • the first terminal may display a dialog box 703 as shown in (d) of FIG. 7 to the first user.
  • the dialog 703 includes a "normal video call" control 7031 and an "avatar video call" control 7032, and the first user can select either video call mode through the corresponding control.
  • the ordinary video call may refer to a traditional video call method.
  • the first terminal can collect the image and voice of the first user in real time and transmit the collected image and voice to the second terminal, so as to implement the video call between the first terminal and the second terminal.
  • the first terminal transmits the image and voice of the first user to the second terminal, and the image displayed on the second terminal is the image of the first user.
  • the avatar video call may refer to the video call method provided in the embodiment of this application. During the avatar video call, the image displayed on the second terminal is not the first user's own image, but a processed avatar.
  • the first user clicks the "avatar video call” control 7032 as shown in (f) in FIG. 7 , and requests the first terminal to establish an avatar video call connection with the second terminal.
  • the first terminal may pop up a dialog 704 as shown in (g) in FIG. 7 .
  • the first terminal requests the first user to select the avatar that he or she wishes to use.
  • avatar 1 and avatar 2 are included in dialog 704 .
  • in the dialog 704 shown in (h) of FIG. 7, the first user can select the control 7041 corresponding to avatar 1. In this way, the first terminal can transmit the information of avatar 1 selected by the first user to the second terminal, and request to establish an avatar video call connection between the first terminal and the second terminal based on avatar 1.
  • the above-mentioned information of the avatar 1 is the avatar number information transmitted from the first terminal to the second terminal.
  • the avatar that can be used for the video call can be any type of avatar.
  • the virtual image may be a virtual pet image or a virtual character image, and the embodiment of the present application does not limit the type of the virtual image.
  • Table 1 shows an example of the data transmitted when the first terminal and the second terminal establish a video call connection, as provided by the embodiment of the present application.
  • Parameter | Data length | Optional/required | Field description
    characterId | 4 bytes | Required | Avatar number information
    otherData | - | - | Other data, added as required
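  • As a non-limiting illustration, the connection data of Table 1 can be serialized as a 4-byte characterId followed by any other data. The big-endian unsigned encoding and the function names are assumptions made for this sketch; Table 1 only fixes the field name and its 4-byte length.

```python
import struct


def encode_call_setup(character_id: int, other_data: bytes = b"") -> bytes:
    # characterId: required, 4 bytes (Table 1). Big-endian unsigned is an assumption.
    return struct.pack(">I", character_id) + other_data


def decode_call_setup(payload: bytes):
    # Reject payloads that cannot contain the required 4-byte characterId.
    if len(payload) < 4:
        raise ValueError("payload too short for characterId")
    (character_id,) = struct.unpack(">I", payload[:4])
    return character_id, payload[4:]
```

  • For example, the first terminal could send encode_call_setup(1) when the first user selects avatar 1, and the second terminal could recover the number with decode_call_setup.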
  • the second terminal determines a target avatar from a plurality of avatars according to the avatar number information.
  • the avatar video call request initiated by the first terminal may be transmitted to the second terminal based on any communication means.
  • the avatar video call request may be transmitted to the second terminal by means of a base station, a cloud server, or P2P.
  • FIG. 8 shows a schematic diagram of the interface displayed when the second terminal receives the avatar video call request sent by the first terminal.
  • this interface includes the communication number 801 of the first terminal and multiple operation controls with which the second user can handle the call request, such as an "answer" control 802, a "reject" control 803, and a "convert to voice" control 804.
  • the interface of the second terminal may also include display information 805a for displaying the type of the call request of this time.
  • the second terminal may inform the second user that the current call request is an avatar video call request by displaying the information 805a.
  • the second user can click the "answer" control 802 to establish a video call connection between the first terminal and the second terminal; alternatively, the second user can click the "convert to voice" control 804 to establish a voice call connection between the first terminal and the second terminal; alternatively, the second user can reject the communication request of the first terminal by clicking the "reject" control 803.
  • the second user may click the “answer” control 802 to accept the avatar video call request initiated by the first terminal.
  • the second terminal may pop up a dialog box that includes two selection controls 8021 and 8022, corresponding to "avatar 1" and "avatar 2", from which the second user can select either one. For example, the second user selects the control 8022 corresponding to "avatar 2" as shown in (c) in FIG. 8.
  • the second terminal can determine the target avatar to be displayed on the second terminal according to the received avatar number information transmitted by the first terminal.
  • the second terminal may also transmit the information of the avatar selected by the second user to the first terminal, and the first terminal determines, from the plurality of avatars on the first terminal, the target avatar to be displayed according to the received avatar number information.
  • the avatars selected by the first user and the second user may be the same avatar or different avatars, which are not limited in this embodiment of the present application.
  • for example, both the first user and the second user can select "avatar 1" or "avatar 2" as the avatar used during the video call; alternatively, the first user can select "avatar 1" while the second user selects "avatar 2", which is not limited in this embodiment of the present application.
  • the target avatar may refer to the image of the opposite end user displayed in the terminal.
  • the first terminal may transmit the information of the "avatar 1" selected by the first user to the second terminal.
  • the second terminal can determine "avatar 1" from the plurality of avatars as the target avatar according to the received information. That is, the avatar of the first user displayed on the second terminal is "avatar 1".
  • the second terminal may also transmit the information of "avatar 2" selected by the second user to the first terminal. In this way, the first terminal can also determine "avatar 2" as the target avatar from a plurality of avatars according to the received information. That is, the image of the second user displayed on the first terminal is "avatar 2".
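  • As a non-limiting illustration, determining the target avatar amounts to looking up the received avatar number in the set of avatars available on the receiving terminal. The avatar library, its contents, and the function names below are hypothetical.

```python
# Hypothetical local avatar library, keyed by avatar number.
AVATARS = {1: "avatar 1", 2: "avatar 2"}


def select_target_avatar(character_id: int, avatars: dict = AVATARS) -> str:
    """Determine the target avatar from the locally available avatars."""
    try:
        return avatars[character_id]
    except KeyError:
        raise ValueError(f"unknown avatar number: {character_id}") from None


def exchange_avatars(first_user_choice: int, second_user_choice: int) -> dict:
    # Each terminal displays the avatar chosen by the user at the opposite end.
    return {
        "shown_on_first_terminal": select_target_avatar(second_user_choice),
        "shown_on_second_terminal": select_target_avatar(first_user_choice),
    }
```

  • With the first user choosing avatar 1 and the second user choosing avatar 2, exchange_avatars(1, 2) reproduces the situation described above: the second terminal shows "avatar 1" and the first terminal shows "avatar 2".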
  • FIG. 9 shows schematic diagrams of the call interfaces of the first terminal and the second terminal after the avatar video call connection is established between the first terminal and the second terminal.
  • the call interface of the first terminal shown in (a) of FIG. 9 includes the communication number 9011 of the second terminal, the avatar 9021 of the first user, and the avatar 9031 of the second user; the call interface of the second terminal shown in (b) of FIG. 9 includes the communication number 9012 of the first terminal, the avatar 9022 of the second user, and the avatar 9032 of the first user.
  • the avatar 9031 of the second user displayed on the call interface shown in (a) of FIG. 9 is the same as the avatar 9022 of the second user displayed on the call interface shown in (b) of FIG. 9; the avatar 9032 of the first user displayed on the call interface shown in (b) of FIG. 9 is the same as the avatar 9021 of the first user displayed on the call interface shown in (a) of FIG. 9.
  • when the first user selects a video call type, he or she may select a normal video call. That is, the first user clicks the control 7031 shown in (e) of FIG. 7. In this way, the first terminal will request to establish a normal video call connection with the second terminal.
  • the video call request received by the second terminal may be as shown in (d) in FIG. 8 .
  • the display information 805b displayed by the second terminal when receiving the ordinary video call request indicates that the current video call is an ordinary video call. As shown in (e) of FIG.
  • the second user may click on the “answer” control 802 .
  • the second terminal may pop up a dialog box 806 as shown in (f) in FIG. 8 .
  • the second terminal may again request the second user to confirm whether to conduct a normal video call or an avatar video call with the first user.
  • if the second user clicks the "avatar video call" control 8062 shown in (f) of FIG. 8, the second terminal may pop up a dialog box requesting the second user to select the avatar to be used for the video call.
  • for example, the second user selects avatar 2 as shown in (g) of FIG. 8. In this way, the first terminal and the second terminal can establish a unilateral avatar video call connection.
  • in this case, the image of the second user displayed on the first terminal may be the avatar of the second user, while the image of the first user displayed on the second terminal may be the real image of the first user.
  • an avatar video call may also be established directly between the first terminal and the second terminal.
  • the video call interfaces displayed on the first terminal and the second terminal may be the call interfaces shown in (a) and (b) of FIG. 9, or may be the call interfaces shown in (c) and (d) of FIG. 9. This embodiment of the present application does not limit this.
  • the first terminal collects image data and audio data of the user during the call.
  • the first terminal may collect image data and audio data of the first user during the call.
  • the first terminal may call an image acquisition device, such as a camera, to capture a video of the first user to obtain corresponding image data.
  • the first terminal may call an audio collection device, such as a microphone, to collect the voice of the first user during the call to obtain corresponding audio data.
  • when the first terminal uses a camera to shoot a video of the first user, the camera may be a front camera or a rear camera.
  • the corresponding video information can be displayed on the main interface of the first terminal.
  • the corresponding video information can be displayed in the display device or module on the back of the first terminal, which is not limited in this embodiment of the present application.
  • the first terminal extracts multi-frame target feature information from the image data.
  • the image data collected by the first terminal may be composed of multiple video frames. Therefore, when the first terminal processes the image data, target feature information that can be used to characterize the facial expression and head movement of the first user can be extracted from each video frame.
  • the first terminal may be configured with a first face recognition engine.
  • a second face recognition engine may also be configured in the second terminal.
  • the first face recognition engine and the second face recognition engine may be the same type of face recognition engine, or may be different types of face recognition engines.
  • when the first terminal processes the collected image data, the first terminal can pass the video frames to the first face recognition engine frame by frame, and use the first face recognition engine to analyze the facial features in each video frame separately, so as to obtain the feature point information contained in each video frame. Then, the first terminal may encode the feature point information per video frame to obtain multiple data frames in one-to-one correspondence with the video frames. Each data frame corresponds to one frame of target feature information, and this target feature information is the data that subsequently needs to be transmitted to the second terminal.
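  • As a non-limiting illustration, the frame-by-frame processing described above can be sketched as a loop that passes each video frame to a face recognition engine and emits one data frame per video frame. The engine interface, the stub engine, and the field names are assumptions made for this sketch and do not describe any particular face recognition engine.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Point = Tuple[float, float]
FeaturePoints = Dict[str, List[Point]]


def extract_target_features(
    video_frames: Iterable[dict],
    engine: Callable[[dict], FeaturePoints],
) -> List[dict]:
    """Pass frames to the engine one by one; emit one data frame per video frame."""
    data_frames = []
    for seq, frame in enumerate(video_frames, start=1):
        points = engine(frame)  # feature point information of this video frame
        data_frames.append({"frame_seq": seq, "feature_points": points})
    return data_frames


def stub_engine(frame: dict) -> FeaturePoints:
    # Hypothetical stand-in for the first face recognition engine: it simply
    # returns landmark coordinates that are already stored in the test frame.
    return {"mouth": list(frame["mouth"])}


frames = [{"mouth": [(0.50, 0.70)]}, {"mouth": [(0.50, 0.72)]}]
target_features = extract_target_features(frames, stub_engine)
```

  • Only the feature points leave the loop; the pixel data of each video frame is never placed in a data frame.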
  • FIG. 11 shows a schematic diagram of a video frame processing manner provided by an embodiment of the present application.
  • FIG. 11 includes a schematic diagram of conventionally encoded video frames, comprising a plurality of I frames, B frames and P frames.
  • each video frame represents a still image.
  • various algorithms can be used to reduce the data capacity, and IPB is the most common compression encoding algorithm.
  • the I frame is a key frame, which uses intra-frame compression and contains the largest amount of, and the most critical, data or feature information. It can be understood as a complete preservation of the picture of this frame.
  • a P frame represents the difference between this frame and a previous key frame (or P frame). When decoding, it is necessary to superimpose the difference defined in this frame with the previously buffered picture to generate the final picture.
  • the P frame belongs to the difference frame, and the P frame does not have complete picture data, but only data that is different from the picture of the previous frame.
  • the B frame is a bidirectional difference frame, that is, the B frame records the differences between the current frame and both the preceding and the following frames. To decode a B frame, not only the previously buffered picture but also the decoded picture that follows it must be obtained, and the final picture is obtained by superimposing the preceding and following pictures with the data of the current frame.
  • in order to reduce the occupation of network bandwidth during the video call, the first terminal can extract the feature information in each video frame from the collected image data, and then encode the feature information frame by frame in the manner shown in (b) of FIG. 11 to obtain data frame 1, data frame 2, data frame 3, etc., which contain only feature information. These data frames are not the video frames transmitted during traditional video calls. Each data frame contains only the target feature information extracted from the corresponding video frame.
  • Table 2 shows an example of the data included in a data frame obtained by encoding according to the encoding method shown in (b) of FIG. 11.
  • the first terminal may first determine the frame serial number of each frame of target feature information according to the order in which the first terminal receives the video frames. Then, the first terminal identifies multiple face regions according to the feature point information contained in each video frame, and obtains the feature information of each face region, such as the state information and coordinate information of each face region.
  • the first terminal may store the frame serial number and the feature information of each face region in a preset data structure to obtain the data frames shown in Table 2 above, each data frame corresponding to a frame of target feature information.
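  • As a non-limiting illustration, the preset data structure described above can be sketched as a frame serial number plus, for each face region, its state information and coordinate information. The class names, field names, example state values, and the JSON serialization are assumptions made for this sketch; the actual layout is the one given in Table 2.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass
class FaceRegion:
    state: str                         # state information, e.g. "open" (hypothetical value)
    coords: List[Tuple[float, float]]  # coordinate information of the region


@dataclass
class FeatureFrame:
    frame_seq: int                     # frame serial number, in order of reception
    regions: Dict[str, FaceRegion]     # feature information of each face region

    def to_bytes(self) -> bytes:
        # Compact JSON is used only for readability; it is not the source's format.
        return json.dumps(asdict(self), separators=(",", ":")).encode("utf-8")


frame = FeatureFrame(1, {"mouth": FaceRegion("open", [(0.5, 0.7)])})
encoded = frame.to_bytes()
```

  • A data frame of this kind is a few dozen bytes for a single region, which is consistent in spirit with the low bit rate described for this encoding, although the exact size depends on the real layout.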
  • when the first terminal encodes the extracted feature point information according to the encoding method shown in (b) of FIG. 11, it neither compresses the target feature information nor performs inter-frame encoding. That is to say, data frames such as frame 1, frame 2, and so on each carry the original feature information characterizing the facial expression and head movement.
  • the feature point information in each video frame is extracted and encoded, so that the subsequent transmission to the second terminal is not the video picture, but only the expression feature information, and does not contain redundant data, so that the transmission efficiency is higher.
  • the bit rate is only about 30 kbps, and the amount of data transmitted is much smaller than that of the video stream directly transmitted by traditional video calls.
  • when the first terminal encodes the feature point information, the first terminal may also use an inter-frame compression encoding method.
  • the first terminal may determine the face regions to be transmitted. That is, the first terminal may first determine which face regions' feature information needs to be transmitted to the second terminal. In each subsequent data frame, only the corresponding frame serial number and the coordinates and state of the determined face regions need to be filled in.
  • the first terminal may determine a key video frame (I frame) from a plurality of video frames.
  • the information of the key video frame determined by the first terminal may be sent to the second terminal when a video call connection is established with the second terminal.
  • when the extracted feature point information is encoded by inter-frame compression, the data that the first terminal needs to transmit when establishing a video call connection with the second terminal may be as shown in Table 3 below.
  • for key video frames, the first terminal may obtain all the feature information of the face regions to be transmitted. For non-key video frames, the first terminal may first determine whether the feature information of the face regions to be transmitted has changed between any two adjacent non-key video frames; if it has changed, the first terminal obtains the feature information of the face regions to be transmitted in the changed non-key video frame, so that only the changed feature information is encoded.
  • FIG. 11 also shows a schematic diagram of encoding the extracted feature point information by means of inter-frame compression encoding.
  • the first terminal can retain all the feature information in the three video frames and, for the other video frames, retain only the feature information that has changed in each frame.
  • the key video frames retain complete frame data (feature information of the face region), and each non-key video frame in the middle retains only the changed facial feature information. Between two adjacent frames, there will be no drastic changes in expressions and actions, so in general, the data of each non-key video frame in the middle is smaller than the key frame data.
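  • As a non-limiting illustration, the inter-frame compression of feature information described above can be sketched as follows: a key frame carries the feature information of every face region to be transmitted, and each non-key frame carries only the regions whose feature information differs from the previous frame. The fixed key-frame interval, the field names, and the assumption that the set of transmitted regions does not change during the call are choices made only for this sketch.

```python
def encode_stream(frames, gop=3):
    """frames: list of dicts mapping face region -> feature information.

    Every `gop`-th frame is a key frame with complete data; the other frames
    keep only the regions that changed relative to the previous frame.
    """
    encoded, prev = [], None
    for i, cur in enumerate(frames):
        if i % gop == 0:
            encoded.append({"seq": i + 1, "key": True, "regions": dict(cur)})
        else:
            changed = {k: v for k, v in cur.items() if prev.get(k) != v}
            encoded.append({"seq": i + 1, "key": False, "regions": changed})
        prev = cur
    return encoded


def decode_stream(encoded):
    """Rebuild complete frames by overlaying each difference on the running state."""
    frames, state = [], {}
    for item in encoded:
        state = dict(item["regions"]) if item["key"] else {**state, **item["regions"]}
        frames.append(dict(state))
    return frames


source = [
    {"mouth": "closed", "eye": "open"},
    {"mouth": "open", "eye": "open"},
    {"mouth": "open", "eye": "open"},
    {"mouth": "open", "eye": "closed"},
]
compressed = encode_stream(source, gop=3)
```

  • In this example the second frame carries only the changed mouth region and the third frame carries nothing, which mirrors the observation that non-key frames are generally smaller than key frames.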
  • Table 4 and Table 5 are respectively examples of data contained in the data frames corresponding to the key video frame and the non-key video frame obtained after encoding according to the encoding method shown in (c) in FIG. 11 .
  • Table 4 is an example of the data contained in the data frame obtained after the key video frame is encoded:
  • with inter-frame compression, the overall code rate is further reduced compared with directly encoding the extracted feature point information. For users, less bandwidth is occupied and less traffic is consumed. The compression effect varies with the group of pictures (GOP) size and the actual degree of picture change. Generally speaking, the larger the GOP, the lower the code rate.
  • the first terminal adds a timestamp to the multi-frame target feature information and audio data.
  • the first terminal may add a timestamp to the multi-frame target feature information and the audio data, to ensure that each encoded frame of target feature information can be aligned with the audio data corresponding to that frame.
  • the first terminal may encapsulate the time stamped multi-frame target feature information and audio data into a call data stream, and then transmit the call data stream to the second terminal.
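  • A minimal sketch of timestamping and encapsulation follows. The packet layout (1-byte type, 8-byte millisecond timestamp, 4-byte payload length, big-endian), the type codes `AUDIO` and `FEATURE`, and the JSON serialization of feature information are all hypothetical choices made for this example; the embodiment does not specify a wire format.

```python
import json
import struct

AUDIO, FEATURE = 1, 2                 # hypothetical packet type codes
HEADER = struct.Struct(">BQI")        # type, timestamp in ms, payload length


def build_call_stream(feature_frames, audio_chunks) -> bytes:
    """Interleave time-stamped feature frames and audio chunks into one stream.

    feature_frames: iterable of (timestamp_ms, dict of feature information)
    audio_chunks:   iterable of (timestamp_ms, bytes of encoded audio)
    """
    packets = []
    for ts, features in feature_frames:
        packets.append((ts, FEATURE, json.dumps(features).encode("utf-8")))
    for ts, chunk in audio_chunks:
        packets.append((ts, AUDIO, chunk))
    # Order by timestamp so the receiver can consume the stream sequentially.
    packets.sort(key=lambda p: (p[0], p[1]))
    return b"".join(HEADER.pack(kind, ts, len(payload)) + payload
                    for ts, kind, payload in packets)
```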
  • The call data stream transmitted by the avatar-based video call method provided by the embodiment of the present application includes only audio data and the target feature information used to characterize the facial expression and head movement of the first user, which greatly reduces the occupation of network bandwidth during data transmission.
  • the second terminal splits the audio data and the multi-frame target feature information from the call data stream.
  • FIG. 12 is a schematic diagram of a data processing process of the second terminal provided by an embodiment of the present application.
  • the second terminal may firstly split the audio data and multi-frame target feature information in the call data stream.
  • the second terminal may split an audio stream and a video stream from the received call data stream, and the video stream may be multi-frame target feature information transmitted in the form of a data stream.
  • For the audio stream, the second terminal can perform audio decoding to obtain the corresponding audio data; for the video stream, the second terminal performs video decoding to obtain the target feature information of each frame.
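  • The splitting step on the receiving side can be sketched as below. The sketch assumes a hypothetical packet layout in which each packet carries a 1-byte type, an 8-byte millisecond timestamp and a 4-byte payload length (big-endian), with feature information serialized as JSON; none of these details are defined by the embodiment.

```python
import json
import struct

AUDIO, FEATURE = 1, 2                 # hypothetical packet type codes
HEADER = struct.Struct(">BQI")        # type, timestamp in ms, payload length


def split_call_stream(stream: bytes):
    """Split a call data stream into an audio list and a feature-frame list."""
    audio, features = [], []
    offset = 0
    while offset < len(stream):
        kind, ts, length = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        payload = stream[offset:offset + length]
        offset += length
        if kind == AUDIO:
            audio.append((ts, payload))
        elif kind == FEATURE:
            features.append((ts, json.loads(payload.decode("utf-8"))))
        # Unknown packet types are skipped in this sketch.
    return audio, features
```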
  • the second terminal maps the multi-frame target feature information to the target avatar to generate a video call image.
  • After mapping the multi-frame target feature information to the target avatar, the second terminal can generate multiple frames of images containing the facial expressions and head movements of the first user, and these images can constitute the corresponding video call picture.
  • Each frame of target feature information decoded by the second terminal may include state information and coordinate information of multiple face regions. The second terminal may calculate the orientation of the user's head, that is, the orientation of the first user's head, according to the coordinate information; then, the second terminal can adjust the orientation of the user's head according to the state information of the multiple face regions, and simulate the facial expression and head movement.
  • the second terminal may calculate the orientation of the head through the normal of the face according to the coordinates of the face region obtained by decoding.
  • FIG. 13 is a schematic diagram of a face normal provided by an embodiment of the present application. In (a) of FIG. 13, the distance le between the eyes, the vertical distance lf between the eyes and the lips, and the vertical distance lm between the tip of the nose and the lips are shown; (b) of FIG. 13 shows data such as the distance ln between the nose tip and the face, the vertical distance lf between the eyes and the lips, and the vertical distance lm between the nose tip and the lips.
  • the second terminal may calculate the orientation of the head of the first user according to the face normal shown in FIG. 13 according to the received coordinates of each face area. Then, the second terminal may adjust the orientation of the user's head according to the state information of the multiple facial regions, and simulate the facial expression and head movement of the first user.
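  • To make the idea of a face normal concrete, the following simplified sketch computes a unit normal and yaw/pitch angles from three landmarks. It is not the computation of FIG. 13, which works from the distances le, lf, lm and ln; instead it assumes that 3D coordinates of both eyes and the mouth are available, in a right-handed frame with x to the viewer's right, y up and z toward the viewer. The function name and the landmark choice are hypothetical.

```python
import math


def head_orientation(left_eye, right_eye, mouth):
    """Return (unit normal, yaw in degrees, pitch in degrees) of the facial plane.

    left_eye is the eye with the smaller x coordinate for a frontal face.
    """
    eye_axis = tuple(r - l for l, r in zip(left_eye, right_eye))
    eye_mid = tuple((l + r) / 2 for l, r in zip(left_eye, right_eye))
    down_axis = tuple(m - c for m, c in zip(mouth, eye_mid))
    # Cross product down_axis x eye_axis points out of the face, toward the viewer.
    nx = down_axis[1] * eye_axis[2] - down_axis[2] * eye_axis[1]
    ny = down_axis[2] * eye_axis[0] - down_axis[0] * eye_axis[2]
    nz = down_axis[0] * eye_axis[1] - down_axis[1] * eye_axis[0]
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0:
        raise ValueError("landmarks are collinear; the facial plane is undefined")
    nx, ny, nz = nx / length, ny / length, nz / length
    yaw = math.degrees(math.atan2(nx, nz))                    # left/right turn
    pitch = math.degrees(math.atan2(ny, math.hypot(nx, nz)))  # up/down tilt
    return (nx, ny, nz), yaw, pitch
```

  • For a frontal face the normal points straight along the z axis and both angles are zero; turning the head changes the yaw, and nodding changes the pitch.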
  • the second terminal may map the above facial expression and head movement to a preset target avatar, thereby generating a video call image.
  • the above-mentioned target avatar is the avatar determined according to the avatar number information transmitted by the first terminal when the first terminal and the second terminal establish a video call connection.
  • When displaying the video call image, the second terminal synchronously plays the audio data.
  • After mapping the facial expression and head action of the first user to the target avatar to obtain the video call image, the second terminal also needs to perform time synchronization on the video call image and the audio data.
  • the multi-frame target feature information and audio data decoded by the second terminal have a time stamp, and the time stamp is added to it by the first terminal.
  • The second terminal may determine the timestamp of each frame of the video call image according to the timestamps of the multi-frame target feature information; then, according to the timestamp of each frame of the video call image and the timestamp of the audio data, the second terminal may synchronize the video call image with the audio data, so that the audio data is played synchronously when the video call image is displayed.
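  • One possible way to align images and audio by timestamp is sketched below. It assumes that timestamps are integers in milliseconds and that audio timestamps arrive sorted; the function name and the "latest audio chunk not later than the image" policy are assumptions of this example, not requirements of the embodiment.

```python
import bisect
from typing import List, Optional


def align_audio_to_frames(frame_ts: List[int], audio_ts: List[int]) -> List[Optional[int]]:
    """For each image timestamp, return the index of the latest audio chunk
    whose timestamp is not later than it, or None if no such chunk exists.

    audio_ts must be sorted in ascending order.
    """
    aligned = []
    for ts in frame_ts:
        index = bisect.bisect_right(audio_ts, ts) - 1
        aligned.append(index if index >= 0 else None)
    return aligned
```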
  • the above-mentioned video call image is an image of an avatar on which the facial expression and head motion of the first user are mapped.
  • the first terminal transmits audio data and target feature information to the second terminal.
  • The second terminal processes the target feature information to simulate the facial expression and head movement of the first user, so as to present on the second terminal an avatar with the facial expression and head movement of the first user, thereby realizing the video call between the first user and the second user.
  • the first terminal can present on the first terminal an avatar with the facial expressions and head movements of the second user.
  • FIG. 14 shows a schematic diagram of steps of an avatar-based video call method implemented on the first terminal side provided by an embodiment of the present application, and the method may specifically include the following steps:
  • the first terminal transmits avatar number information to the second terminal, where the avatar number information is used to instruct the second terminal to determine a target avatar from multiple avatars.
  • the avatar number information may be transmitted from the first terminal to the second terminal after the first terminal establishes a video call connection with the second terminal.
  • the second terminal may determine the target avatar from the plurality of avatars according to the information.
  • the target avatar is the avatar that is subsequently displayed on the second terminal and is used to map the facial expression and head movement of the first user.
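  • The selection of the target avatar from the avatar number information amounts to a lookup, sketched below. The avatar table, its contents and the fallback to a default avatar for an unknown number are hypothetical; the embodiment only states that the number identifies one of multiple avatars.

```python
# Hypothetical table of avatars available on the second terminal.
AVATARS = {1: "cat", 2: "robot", 3: "panda"}


def select_target_avatar(avatar_number: int, avatars=AVATARS, default_number: int = 1) -> str:
    """Return the avatar identified by avatar_number, or a default one if unknown."""
    return avatars.get(avatar_number, avatars[default_number])
```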
  • the first terminal collects image data and audio data of the user during the call.
  • the foregoing embodiments describe the avatar-based video calling method of the present application by taking the first terminal and the second terminal as a whole.
  • the method of the present application is introduced on the first terminal side.
  • the image data and audio data of the user during the call collected by the first terminal may refer to the audio data and image data of the first user during the call.
  • These image data include multiple video frames.
  • the first terminal extracts multi-frame target feature information from the image data, where the multi-frame target feature information includes feature information used to represent the user's facial expressions and head movements.
  • a first face recognition engine is configured in the first terminal.
  • the first terminal may use the first face recognition engine to analyze the facial features in each video frame respectively to obtain feature point information contained in each video frame. Then, the first terminal may encode the feature point information according to each video frame to obtain multiple frames of target feature information corresponding to each video frame one-to-one.
  • When the first terminal extracts multiple frames of target feature information from the image data, it may first determine the frame sequence number of each frame of target feature information according to the sequence in which each video frame is received; then, the first terminal may identify multiple face regions according to the feature point information contained in each video frame. After acquiring the feature information of each face region, such as state information and coordinate information, the first terminal can store the frame sequence number and the feature information in a preset data structure to obtain the multi-frame target feature information.
  • The first terminal transmits the multi-frame target feature information and audio data to the second terminal. The second terminal is used to map the multi-frame target feature information to a preset target avatar to generate a video call image, where the video call image contains the target avatar with the above facial expressions and head movements.
  • Before transmitting the target feature information and audio data to the second terminal, the first terminal may add a timestamp to the multi-frame target feature information and audio data. Then, the first terminal may encapsulate the time-stamped target feature information and audio data into a call data stream, and transmit the call data stream to the second terminal. After receiving the call data stream transmitted by the first terminal, the second terminal can split and decode the call data stream and map the multi-frame target feature information to the preset target avatar to generate a video call image.
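  • A preset data structure of this kind might look like the following sketch. The class names `FaceRegion` and `TargetFeatureFrame`, the field names and the use of integer pixel coordinates are assumptions introduced for illustration; the embodiment only requires that the frame sequence number and the state and coordinate information of each face region be stored.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class FaceRegion:
    state: str   # e.g. "open" or "closed" (hypothetical state values)
    x: int       # coordinate information of the region
    y: int


@dataclass
class TargetFeatureFrame:
    seq: int                                        # frame sequence number
    regions: Dict[str, FaceRegion] = field(default_factory=dict)


def build_target_frames(
    per_frame_regions: Iterable[Dict[str, Tuple[str, int, int]]]
) -> List[TargetFeatureFrame]:
    """Assign sequence numbers in arrival order and store each region's features."""
    frames = []
    for seq, regions in enumerate(per_frame_regions):
        frame = TargetFeatureFrame(seq=seq)
        for name, (state, x, y) in regions.items():
            frame.regions[name] = FaceRegion(state, x, y)
        frames.append(frame)
    return frames
```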
  • the above-mentioned video call image includes a target avatar with facial expressions and head movements of the first user.
  • the network bandwidth may not be able to support a video call between the first terminal and the second terminal.
  • Since the first terminal only transmits audio data and target feature information that can characterize the facial expression and head movement of the first user to the second terminal, less data needs to be transmitted and the requirement on network bandwidth is lower.
  • the avatar video call can still be realized by using this method. The first user and the second user can still see each other's expressions and actions.
  • the embodiment of the present application completely uses virtual images, which will not expose the user's surrounding environment, and can effectively protect the privacy and security of the user.
  • FIG. 15 shows a schematic diagram of steps of another avatar-based video call method implemented on the first terminal side provided by an embodiment of the present application, and the method may specifically include the following steps:
  • the first terminal transmits avatar number information to the second terminal, where the avatar number information is used to instruct the second terminal to determine a target avatar from multiple avatars.
  • the first terminal determines the face area to be transmitted.
  • Each frame of target feature information transmitted by the first terminal to the second terminal is a data frame containing the complete facial feature information of the first user, including which face regions are present and their coordinates, state and other information.
  • it may be pre-determined which facial area data needs to be transmitted. In this way, in each subsequent frame of data, it is only necessary to fill in the frame serial number and the coordinates, status and other information of the face area, and the amount of transmitted data is further reduced by a method similar to the inter-frame compression in video coding.
  • the first terminal collects image data and audio data of the user during the call, where the image data includes multiple video frames.
  • the first terminal determines a key video frame from a plurality of video frames.
  • the first terminal may determine a key video frame from the multiple video frames collected.
  • the key video frame is the video frame that needs to transmit all the feature information in the frame to the second terminal.
  • the first terminal acquires the feature information of the face region to be transmitted in the key video frame.
  • The first terminal determines whether the feature information of the face region to be transmitted in any two adjacent non-key video frames has changed. When the feature information of the face region changes, the feature information of the face region to be transmitted in the changed non-key video frame is acquired.
  • all feature information of the face region to be transmitted in the video frame may be acquired.
  • The first terminal performs inter-frame compression coding on the feature point information of the key video frames and the non-key video frames, and obtains multi-frame target feature information corresponding to each video frame one-to-one, where the multi-frame target feature information includes feature information of the user's facial expressions and head movements.
  • The first terminal may perform inter-frame compression coding on the feature point information of key video frames and non-key video frames, thereby obtaining multiple data frames. Each data frame corresponds to one frame of target feature information, and the target feature information can be used to characterize the facial expression and head action of the first user.
  • The first terminal transmits the multi-frame target feature information and audio data to the second terminal. The second terminal is used to map the multi-frame target feature information to a preset target avatar to generate a video call image, where the video call image contains the target avatar with the above facial expressions and head movements.
  • the overall bit rate after processing is further reduced on the basis of the previous embodiment. For users, making video calls consumes less bandwidth and consumes less data.
  • FIG. 16 shows a schematic diagram of steps of another avatar-based video call method implemented on the first terminal side provided by an embodiment of the present application, and the method may specifically include the following steps:
  • The first terminal transmits avatar number information to the second terminal, where the avatar number information is used to instruct the second terminal to determine a target avatar from multiple avatars. The first terminal is configured with a first face recognition engine, the second terminal is configured with a second face recognition engine, and the first face recognition engine and the second face recognition engine are the same type of face recognition engine.
  • the first terminal collects image data and audio data of the user during the call.
  • The first terminal extracts multi-frame target feature information from the image data, where the multi-frame target feature information includes feature information used to represent the user's facial expression and head movement, and the multi-frame target feature information is the original feature information recognized by the first face recognition engine.
  • The first terminal transmits the multi-frame target feature information and audio data to the second terminal, where the second terminal is configured to use the second face recognition engine to map the original feature information to the target avatar to generate a video call image, and the video call image contains the target avatar with the above facial expressions and head movements.
  • the feature information representing facial expressions and head movements may not be processed on the sending side, but the original feature information may be sent to the receiving side for processing.
  • the first terminal may transmit the image data to the first face recognition engine for processing.
  • the first face recognition engine can return all processed raw data.
  • the first face recognition engine can return 276 original feature points, these original feature points not only include eyes, lips and other feature information that can be used to characterize facial expressions and head movements, but also include some redundant information.
  • The first terminal can transmit all the original feature information returned by the first face recognition engine to the second terminal, where it is processed by the second face recognition engine in the second terminal, and the facial expressions and head movements of the first user are mapped onto the target avatar.
  • the first terminal on the data sending side does not process the original feature information, but transmits all the original feature information to the second terminal, and the processing of the original feature information is performed on the receiving side. In this way, less information is discarded, and the receiving side can perform more accurate expression and action restoration based on the original feature information.
  • this embodiment needs to transmit a larger amount of data, and the data stream bit rate during a call will also increase to a certain extent.
  • the second terminal on the receiving side can also map more expressive expressions and actions, which helps to better restore the expressions and actions on the sending side.
  • FIG. 17 shows a schematic diagram of steps of an avatar-based video call method implemented on the second terminal side provided by an embodiment of the present application.
  • the method may specifically include the following steps:
  • the second terminal receives the avatar number information transmitted by the first terminal, and determines a target avatar from a plurality of avatars according to the avatar number information.
  • The second terminal receives the call data stream transmitted by the first terminal, where the call data stream includes audio data and multi-frame target feature information, and the multi-frame target feature information includes feature information used to characterize the facial expressions and head movements of the user during the call.
  • The second terminal maps multiple frames of target feature information to a preset target avatar to generate a video call image, where the video call image includes the target avatar with the above facial expressions and head movements.
  • When displaying the video call image, the second terminal synchronously plays the audio data.
  • the method of the present application is introduced on the second terminal side.
  • the second terminal may receive the avatar number information transmitted by the first terminal.
  • the second terminal may determine the target avatar from the plurality of avatars according to the avatar number information.
  • the target avatar is the avatar displayed on the second terminal and used to map the facial expressions and head movements of the first user.
  • the call data stream received by the second terminal may be a data stream including audio data and multi-frame target feature information.
  • the target feature information can be used to represent the facial expression and head movement of the first user during the call.
  • The second terminal may split the audio data and the multi-frame target feature information from the call data stream. Then, the second terminal may determine the facial expressions and head movements included in each frame of target feature information, respectively, and map the facial expressions and head movements included in each frame of target feature information to a preset target avatar to generate the video call image.
  • The second terminal may first calculate the orientation of the user's head according to the coordinate information of multiple face regions; then, according to the state information of the multiple face regions, the second terminal can adjust the orientation of the user's head and simulate the facial expressions and head movements.
  • the target feature information may be original feature information that has not been processed by the first terminal.
  • the original feature information may be recognized by the first face recognition engine on the first terminal.
  • the second terminal may transmit it to the second face recognition engine.
  • the second face recognition engine on the second terminal may be the same type of face recognition engine as the first face recognition engine. In this way, the second terminal can use the second face recognition engine to map the original feature information to the target avatar to generate a video call image.
  • The target feature information may be data frames obtained by performing feature extraction on multiple video frames and retaining, during encoding, all feature information that can be used to represent the facial expression and head movement of the first user.
  • the target feature information may be a data frame obtained after the first terminal performs inter-frame compression encoding on multiple video frames.
  • This type of target feature information includes target feature information corresponding to key video frames and target feature information corresponding to non-key video frames.
  • The target feature information corresponding to a key video frame includes the complete feature information of that key video frame, and the target feature information corresponding to a non-key video frame includes the feature information that has changed in that non-key video frame. Therefore, after splitting the audio data and the multi-frame target feature information from the call data stream, the second terminal can also generate the complete feature information of the non-key video frame according to the complete feature information of the key video frame and the changed feature information in the non-key video frame. Then, based on the complete feature information of the key video frames and the complete feature information of the non-key video frames, the facial expressions and head movements of the first user are mapped to the target avatar.
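  • The reconstruction of complete feature information on the receiving side can be sketched as follows. The frame layout (a `type` field with the values `"key"` and `"delta"`, and a `regions` dictionary) is a hypothetical format chosen for this example, and the sketch assumes that each non-key frame lists changes relative to the immediately preceding frame.

```python
from typing import Dict, List


def decode_frames(encoded: List[dict]) -> List[Dict[str, tuple]]:
    """Rebuild the complete feature information of every frame.

    Key frames carry all regions; non-key ("delta") frames carry only the
    regions that changed since the previous frame.
    """
    complete = []
    current: Dict[str, tuple] = {}
    for frame in encoded:
        if frame["type"] == "key":
            current = dict(frame["regions"])
        else:
            # Overlay the changed regions on the previous complete state.
            current = {**current, **frame["regions"]}
        complete.append(current)
    return complete
```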
  • the second terminal may determine the time stamp of each frame of the video call image according to the time stamps of the multi-frame target feature information, and then according to the time stamp of each frame of the video call image and the time stamp of the audio data, Synchronize video call image and audio data.
  • After completing the mapping of facial expressions and head movements, obtaining the corresponding video call images and synchronizing the video call images with the audio data, the second terminal can display these video call images, and multiple video call images form a video stream.
  • a video call between the first terminal and the second terminal is formed by superimposing the video stream and the audio stream.
  • the terminal device may be divided into functional modules according to the foregoing method examples.
  • each functional module may be divided corresponding to each function, or one or more functions may be integrated into one functional module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. It should be noted that, the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation. The following description takes as an example that each function module is divided corresponding to each function.
  • FIG. 18 shows a structural block diagram of an avatar-based video call device provided by an embodiment of the present application.
  • the device can be applied to the first terminal in the foregoing embodiments. Specifically, it may include the following modules: a collection module 1801, an extraction module 1802 and a transmission module 1803, wherein:
  • the collection module 1801 is used to collect the image data and audio data of the user during the call;
  • Extraction module 1802 for extracting multi-frame target feature information from the image data, where the multi-frame target feature information includes feature information used to characterize the user's facial expressions and head movements;
  • the transmission module 1803 is used to transmit the multi-frame target feature information and the audio data to a second terminal, and the second terminal is used to map the multi-frame target feature information to a preset target avatar to generate a video call image, where the video call image includes the target avatar with the facial expression and the head action.
  • the image data includes multiple video frames
  • the first terminal is configured with a first face recognition engine
  • the extraction module 1802 may specifically include the following submodules:
  • a parsing submodule used for using the first face recognition engine to parse the facial features in each video frame respectively, to obtain the feature point information contained in each video frame;
  • An encoding sub-module configured to encode the feature point information according to each video frame to obtain multiple frames of target feature information corresponding to each video frame one-to-one.
  • the encoding submodule may specifically include the following units:
  • a frame sequence number determining unit used to determine the frame sequence number of each frame of target feature information according to the sequence in which each video frame is received;
  • a face area identification unit for identifying a plurality of face areas according to the feature point information contained in each of the video frames
  • a feature information acquisition unit for acquiring feature information of each face region, the feature information comprising state information and coordinate information of each face region;
  • a feature information storage unit configured to store the frame serial number and the feature information of each face region in a preset data structure to obtain the multi-frame target feature information.
  • the encoding sub-module may further include the following units:
  • a face area determination unit for determining the face area to be transmitted
  • the feature information acquisition unit may specifically include the following subunits:
  • a key video frame determination subunit for determining a key video frame from the plurality of video frames
  • a first feature information obtaining subunit for obtaining the feature information of the face region to be transmitted in the key video frame for the key video frame
  • a second feature information acquisition subunit, used for determining, for the non-key video frames, whether the feature information of the to-be-transmitted face region in any two adjacent non-key video frames has changed, and, if the feature information of the to-be-transmitted face region in the any two adjacent non-key video frames has changed, acquiring the feature information of the to-be-transmitted face region in the changed non-key video frame.
  • The first terminal is configured with a first face recognition engine, the second terminal is configured with a second face recognition engine, and the first face recognition engine and the second face recognition engine are the same type of face recognition engine. The multi-frame target feature information is the original feature information recognized by the first face recognition engine, and the second terminal is used for using the second face recognition engine to map the original feature information to the target avatar to generate the video call image.
  • the device may further include the following modules:
  • a timestamp adding module configured to add timestamps to the multi-frame target feature information and the audio data.
  • the transmission module 1803 may specifically include the following sub-modules:
  • an encapsulation submodule for encapsulating the target feature information and the audio data into a call data stream
  • a transmission submodule configured to transmit the call data stream to the second terminal.
  • the transmission module 1803 is further configured to transmit avatar number information to the second terminal, where the avatar number information is used to instruct the second terminal to determine the target avatar.
  • FIG. 19 shows a structural block diagram of another avatar-based video call device provided by an embodiment of the present application.
  • the device can be applied to the second terminal in each of the foregoing embodiments, and the device can specifically include the following modules:
  • the receiving module 1901 is configured to receive the call data stream transmitted by the first terminal, where the call data stream includes audio data and multi-frame target feature information, and the multi-frame target feature information includes feature information used to represent the facial expressions and head movements of the user during the call;
  • the mapping module 1902 is used to map the multi-frame target feature information to a preset target avatar to generate a video call image, where the video call image includes the target avatar with the facial expression and the head movement;
  • the call module 1903 is configured to display the video call image and play the audio data synchronously.
  • mapping module 1902 may specifically include the following sub-modules:
  • Determining submodules for respectively determining the facial expressions and the head movements contained in the target feature information of each frame;
  • the mapping submodule is used to map the facial expressions and the head movements contained in each frame of target feature information to a preset target avatar, so as to generate a video call image.
  • the target feature information of each frame includes state information and coordinate information of multiple face regions
  • the determination submodule may specifically include the following units:
  • a calculation unit used for the second terminal to calculate the orientation of the user's head according to the coordinate information of the multiple face regions
  • the adjustment and simulation unit is used for the second terminal to adjust the orientation of the user's head according to the state information of the multiple facial regions, and to simulate the facial expression and the head movement.
  • the multi-frame target feature information includes target feature information corresponding to key video frames and target feature information corresponding to non-key video frames, the target feature information corresponding to the key video frames includes the complete feature information of the key video frame, and the target feature information corresponding to the non-key video frame includes the feature information that changes in the non-key video frame;
  • the mapping module 1902 may also include the following submodules:
  • a generating submodule is configured to generate complete feature information of the non-key video frame according to the complete feature information of the key video frame and the changed feature information of the non-key video frame.
  • The first terminal is configured with a first face recognition engine, the second terminal is configured with a second face recognition engine, and the first face recognition engine and the second face recognition engine are the same type of face recognition engine. The multi-frame target feature information is the original feature information identified by the first face recognition engine, and the mapping submodule is also used for using the second face recognition engine to map the original feature information to the target avatar to generate the video call image.
  • the receiving module 1901 may further include the following sub-modules:
  • an avatar number information receiving submodule for receiving the avatar number information transmitted by the first terminal
  • the target avatar determination submodule is configured to determine the target avatar from a plurality of avatars according to the avatar number information.
  • the multi-frame target feature information and the audio data have timestamps
  • the call module 1903 may specifically include the following sub-modules:
  • a timestamp determination submodule configured to determine the timestamp of each frame of the video call image according to the timestamps of the multi-frame target feature information
  • An audio and video synchronization submodule configured to synchronize the video call image and the audio data according to the time stamp of each frame of the video call image and the time stamp of the audio data.
  • An embodiment of the present application further provides a terminal, which may be the first terminal or the second terminal in the foregoing embodiments. The terminal includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the avatar-based video call method in each of the foregoing embodiments.
  • Embodiments of the present application further provide a computer-readable storage medium storing computer instructions; when the computer instructions are run on a terminal, the terminal executes the above-mentioned related method steps to implement the avatar-based video call method in the foregoing embodiments.
  • Embodiments of the present application further provide a computer program product, which, when the computer program product runs on a computer, causes the computer to execute the above-mentioned relevant steps, so as to realize the avatar-based video calling method in the above-mentioned various embodiments.
  • An embodiment of the present application further provides a communication system, including the first terminal and the second terminal in each of the foregoing embodiments, and a communication device for establishing a communication connection between the first terminal and the second terminal.
  • An embodiment of the present application further provides a chip, and the chip may be a general-purpose processor or a special-purpose processor.
  • the chip includes a processor.
  • the processor is configured to support the terminal to perform the above-mentioned relevant steps, so as to realize the avatar-based video calling method in the above-mentioned various embodiments.
  • the chip further includes a transceiver, which is controlled by the processor and supports the terminal in performing the above-mentioned relevant steps, so as to implement the avatar-based video call method in the foregoing embodiments.
  • the chip may further include a storage medium.
  • the chip can be implemented using the following circuits or devices: one or more field programmable gate arrays (FPGAs), programmable logic devices (PLDs), controllers, state machines, gate logic, discrete hardware components, any other suitable circuit, or any combination of circuits capable of performing the various functions described throughout this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Processing Or Creating Images (AREA)

Abstract

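The embodiments of this application are applicable to the field of terminal technologies and provide an avatar-based video call method, apparatus, and terminal. The method is applied to a first terminal and includes: the first terminal collects image data and audio data of a user during a call; the first terminal extracts multi-frame target feature information from the image data, where the multi-frame target feature information includes feature information used to represent the user's facial expression and head movement; and the first terminal transmits the multi-frame target feature information and the audio data to a second terminal, where the second terminal is configured to map the multi-frame target feature information to a preset target avatar to generate a video call image, and the video call image contains the target avatar having the facial expression and the head movement. This method solves the problem that an avatar video call cannot be used when network conditions are poor.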

Description

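Avatar-based video call method, apparatus, and terminal
This application claims priority to Chinese Patent Application No. 202011608114.6, filed with the China National Intellectual Property Administration on December 29, 2020 and entitled "Avatar-based video call method, apparatus, and terminal", which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of this application relate to the field of terminal technologies, and in particular to an avatar-based video call method, apparatus, and terminal.
Background
Audio and video call technologies based on the Internet and the mobile Internet, such as Internet telephony, are communication means widely used in social networking today. Compared with traditional telephony, Internet telephony is cheaper and more convenient to use. With mobile Internet technology, Internet telephony does not require fixed terminal equipment; users can access it from portable terminals such as mobile phones. In addition, whereas traditional telephony can only transmit audio, Internet telephony also supports video calls.
Meanwhile, face recognition technology has developed rapidly. Recognizing faces and facial features through a camera is now widely used in fields such as identity recognition, face replacement, and expression mapping. Applying face recognition to video calls, that is, recognizing the person in the call in real time and replacing that person with an avatar, yields a more entertaining avatar video call technology.
At present, most terminals or applications that support avatar video calls transmit a video stream, which is essentially no different from a traditional video call. When the user's network conditions are poor, for example when the bandwidth cannot support a video call, such an avatar video call cannot be used either.
Summary
The embodiments of this application provide an avatar-based video call method, apparatus, and terminal, to solve the prior-art problem that an avatar video call cannot be used when network conditions are poor.
To achieve the foregoing objective, this application adopts the following technical solutions: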
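According to a first aspect, an avatar-based video call method is provided, applied to a first terminal. The method includes:
the first terminal collects image data and audio data of a user during a call;
the first terminal extracts multi-frame target feature information from the image data, where the multi-frame target feature information includes feature information used to represent the user's facial expression and head movement; and
the first terminal transmits the multi-frame target feature information and the audio data to a second terminal, where the second terminal is configured to map the multi-frame target feature information to a preset target avatar to generate a video call image, and the video call image contains the target avatar having the facial expression and the head movement.
Implementing the embodiments of this application has the following beneficial effects. During a call, the first terminal does not need to transmit a video stream to the second terminal; it only needs to transmit the feature information extracted from the image data. This greatly reduces the amount of data to be transmitted, so that users can still contact other users by video call when network conditions are poor. In addition, because the first terminal does not need to transmit real-time images of the user to the second terminal during the call, the user's privacy is also protected.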
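In a possible implementation of the first aspect, the image data includes multiple video frames and the first terminal is configured with a first face recognition engine. When extracting the multi-frame target feature information from the image data, the first terminal may use the first face recognition engine to analyze the facial features in each video frame and obtain the feature point information contained in each video frame. The first terminal then encodes the feature point information frame by frame to obtain multi-frame target feature information in one-to-one correspondence with the video frames.
In a possible implementation of the first aspect, the first terminal may encode the feature point information frame by frame, to obtain multi-frame target feature information in one-to-one correspondence with the video frames, through the following steps: the first terminal determines the frame number of each frame of target feature information according to the order in which the video frames are received; the first terminal identifies multiple face regions according to the feature point information contained in each video frame; the first terminal obtains the feature information of each face region, which includes the state information and coordinate information of that face region; and the first terminal stores the frame number and the feature information of each face region in a preset data structure to obtain the multi-frame target feature information.
In a possible implementation of the first aspect, before the first terminal collects the image data and audio data of the user during the call, the method further includes: the first terminal determines the face regions to be transmitted. Correspondingly, the first terminal obtaining the feature information of each face region includes: the first terminal determines key video frames from the multiple video frames; for a key video frame, the first terminal obtains the feature information of the face regions to be transmitted in that key video frame; for non-key video frames, the first terminal determines whether the feature information of the face regions to be transmitted has changed between any two adjacent non-key video frames and, if it has, obtains the changed feature information of the face regions to be transmitted in the non-key video frame.
In a possible implementation of the first aspect, the first terminal is configured with a first face recognition engine, the second terminal is configured with a second face recognition engine, and the two engines are the same type of face recognition engine. The multi-frame target feature information is the original feature information identified by the first face recognition engine, and the second terminal is configured to use the second face recognition engine to map the original feature information to the target avatar to generate the video call image.
In a possible implementation of the first aspect, before the first terminal transmits the multi-frame target feature information and the audio data to the second terminal, the method further includes: the first terminal adds timestamps to the multi-frame target feature information and the audio data.
In a possible implementation of the first aspect, the first terminal transmitting the target feature information and the audio data to the second terminal includes: the first terminal encapsulates the target feature information and the audio data into a call data stream; and the first terminal transmits the call data stream to the second terminal.
In a possible implementation of the first aspect, before the first terminal transmits the target feature information and the audio data to the second terminal, the method further includes: the first terminal transmits avatar number information to the second terminal, where the avatar number information instructs the second terminal to determine the target avatar from a plurality of avatars.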
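According to a second aspect, an avatar-based video call method is provided, applied to a second terminal that communicates with a first terminal. The method includes:
the second terminal receives a call data stream transmitted by the first terminal, where the call data stream contains audio data and multi-frame target feature information, and the multi-frame target feature information includes feature information used to represent the user's facial expression and head movement during the call;
the second terminal maps the multi-frame target feature information to a preset target avatar to generate a video call image, where the video call image contains the target avatar having the facial expression and the head movement; and
the second terminal plays the audio data synchronously while displaying the video call image.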
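In a possible implementation of the second aspect, the second terminal mapping the multi-frame target feature information to the preset target avatar to generate the video call image includes: the second terminal splits the audio data and the multi-frame target feature information out of the call data stream; the second terminal determines the facial expression and head movement contained in each frame of target feature information; and the second terminal maps the facial expression and head movement contained in each frame of target feature information to the preset target avatar to generate the video call image.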
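As a rough illustration of how a call data stream carrying both audio data and per-frame target feature information could be packaged by the first terminal and split apart again by the second terminal, the sketch below uses a simple tag-length-value layout. The tag values, the 7-byte header and the function names `mux` and `demux` are assumptions made for illustration; the application does not specify a wire format.

```python
import struct

# Hypothetical packet tags; the application does not define a wire format.
TAG_AUDIO = 0x01
TAG_FEATURE = 0x02
_HEADER = struct.Struct(">BIH")  # tag, timestamp in ms, payload length


def mux(packets):
    """Pack (tag, timestamp_ms, payload) tuples into one byte stream."""
    out = bytearray()
    for tag, ts, payload in packets:
        out += _HEADER.pack(tag, ts, len(payload))
        out += payload
    return bytes(out)


def demux(stream):
    """Split a byte stream back into audio packets and feature frames."""
    audio, features = [], []
    offset = 0
    while offset < len(stream):
        tag, ts, length = _HEADER.unpack_from(stream, offset)
        offset += _HEADER.size
        payload = stream[offset:offset + length]
        offset += length
        (audio if tag == TAG_AUDIO else features).append((ts, payload))
    return audio, features
```

Keeping the timestamp in every packet header means the receiver can later align feature frames with audio without inspecting the payloads.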
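In a possible implementation of the second aspect, each frame of target feature information includes state information and coordinate information of multiple face regions, and the second terminal determining the facial expression and head movement contained in each frame of target feature information includes: the second terminal calculates the orientation of the user's head according to the coordinate information of the multiple face regions; and the second terminal adjusts the orientation of the user's head according to the state information of the multiple face regions, and simulates the facial expression and head movement.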
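One way to picture the head-orientation calculation is to take the normal of the plane through a few facial landmarks. The sketch below is only an assumption about how this could be done: the application does not give a formula, and the choice of three landmarks (two eyes and the mouth), the 3D coordinate convention (x to the right, y up, z toward the camera) and the function name `head_orientation` are all hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def head_orientation(left_eye, right_eye, mouth):
    """Return (yaw, pitch) in degrees from three 3D facial landmarks.

    The face normal is the normal of the plane through the three points,
    oriented so that a frontal face points along +z.
    """
    n = _cross(_sub(mouth, left_eye), _sub(right_eye, left_eye))
    norm = math.sqrt(sum(c * c for c in n))
    if norm == 0:
        raise ValueError("landmarks are collinear")
    nx, ny, nz = (c / norm for c in n)
    yaw = math.degrees(math.atan2(nx, nz))
    pitch = math.degrees(math.atan2(ny, math.hypot(nx, nz)))
    return yaw, pitch
```

For a frontal face such as `head_orientation((-1, 1, 0), (1, 1, 0), (0, -1, 0))`, both angles come out as zero; the state information of each region would then be applied on top of this pose.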
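In a possible implementation of the second aspect, the multi-frame target feature information includes target feature information corresponding to key video frames and target feature information corresponding to non-key video frames. The target feature information corresponding to a key video frame includes the complete feature information of that key video frame, and the target feature information corresponding to a non-key video frame includes the feature information that has changed in that non-key video frame. After the second terminal splits the audio data and the multi-frame target feature information out of the call data stream, the method further includes: the second terminal generates the complete feature information of the non-key video frame according to the complete feature information of the key video frame and the changed feature information of the non-key video frame.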
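The reconstruction step on the second terminal can be sketched as follows. This is a minimal illustration under stated assumptions: each frame is modeled as a dictionary from a face-region identifier to its feature value, changes in a non-key frame are taken relative to the immediately preceding frame, and the function name `reconstruct` is hypothetical.

```python
def reconstruct(frames):
    """Rebuild complete per-frame feature information.

    `frames` is a list of (is_key, regions) pairs in frame order, where
    `regions` maps a face-region id to its feature value. Key frames
    carry every region; non-key frames carry only the changed regions.
    """
    complete = []
    current = None
    for is_key, regions in frames:
        if is_key:
            current = dict(regions)
        elif current is None:
            raise ValueError("non-key frame received before any key frame")
        else:
            # Overlay the changed regions on the previous complete frame.
            current = {**current, **regions}
        complete.append(current)
    return complete
```

A non-key frame with no changes is represented by an empty dictionary and simply repeats the previous complete frame.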
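In a possible implementation of the second aspect, the first terminal is configured with a first face recognition engine, the second terminal is configured with a second face recognition engine, and the two engines are the same type of face recognition engine. The multi-frame target feature information is the original feature information identified by the first face recognition engine, and the second terminal mapping the multi-frame target feature information to the preset target avatar to generate the video call image includes: the second terminal uses the second face recognition engine to map the original feature information to the target avatar to generate the video call image.
In a possible implementation of the second aspect, before the second terminal receives the call data stream transmitted by the first terminal, the method further includes: the second terminal receives avatar number information transmitted by the first terminal; and the second terminal determines the target avatar from a plurality of avatars according to the avatar number information.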
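In a possible implementation of the second aspect, the multi-frame target feature information and the audio data have timestamps, and the second terminal playing the audio data synchronously while displaying the video call image includes: the second terminal determines the timestamp of each frame of the video call image according to the timestamps of the multi-frame target feature information; and the second terminal synchronizes the video call image and the audio data according to the timestamp of each frame of the video call image and the timestamp of the audio data.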
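Timestamp-based synchronization can be sketched as a lookup from the audio playback position to the video call image that should be on screen. The policy below (show the latest frame whose timestamp is not later than the audio timestamp) and the function name `frame_for_audio` are assumptions for illustration; the application only states that the two streams are synchronized by timestamp.

```python
import bisect


def frame_for_audio(frame_timestamps, audio_ts):
    """Return the index of the latest frame whose timestamp <= audio_ts.

    `frame_timestamps` must be sorted in ascending order. Returns None
    when no frame is due yet.
    """
    i = bisect.bisect_right(frame_timestamps, audio_ts)
    return i - 1 if i > 0 else None
```

Using the audio clock as the reference is a common design choice, since gaps in audio are usually more noticeable than a repeated or skipped image.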
第三方面,提供一种基于虚拟形象的视频通话装置,该装置可以应用于第一终端,该装置具体可以包括如下模块:
采集模块,用于采集用户在通话过程中的图像数据和音频数据;
提取模块,用于从图像数据中提取多帧目标特征信息,多帧目标特征信息包括用于表征用户的人脸表情和头部动作的特征信息;
传输模块,用于将多帧目标特征信息和音频数据传输至第二终端,第二终端用于将多帧目标特征信息映射至预设的目标虚拟形象中,以生成视频通话图像,视频通话图像中包含具有上述人脸表情和头部动作的目标虚拟形象。
在第三方面的一种可能的实现方式中,图像数据包括多个视频帧,第一终端中配置有第一人脸识别引擎,提取模块具体可以包括如下子模块:
解析子模块,用于采用第一人脸识别引擎分别解析每个视频帧中的面部特征,得到每个视频帧中包含的特征点信息;
编码子模块,用于根据每个视频帧为特征点信息进行编码,得到分别与每个视频帧一一对应的多帧目标特征信息。
在第三方面的一种可能的实现方式中,编码子模块具体可以包括如下单元:
帧序号确定单元,用于按照接收到每个视频帧的顺序,分别确定每帧目标特征信息的帧序号;
面部区域识别单元,用于分别根据每个视频帧中包含的特征点信息识别多个面部区域;
特征信息获取单元,用于获取每个面部区域的特征信息,特征信息包括每个面部区域的状态信息和坐标信息;
特征信息存储单元,用于将帧序号以及每个面部区域的特征信息存储至预设的数据结构中,得到多帧目标特征信息。
在第三方面的一种可能的实现方式中,编码子模块还可以包括如下单元:
面部区域确定单元,用于确定待传输的面部区域。
在第三方面的一种可能的实现方式中,特征信息获取单元具体可以包括如下子单元:
关键视频帧确定子单元,用于从多个视频帧中确定关键视频帧;
第一特征信息获取子单元,用于针对关键视频帧,获取关键视频帧中待传输的面部区域的特征信息;
第二特征信息获取子单元,用于针对非关键视频帧,确定任意相邻的两个非关键视频帧中待传输的面部区域的特征信息是否发生变化,若任意相邻的两个非关键视频帧中待传输的面部区域的特征信息发生变化,则获取发生变化的非关键视频帧中待传输的面部区域的特征信息。
在第三方面的一种可能的实现方式中,第一终端中配置有第一人脸识别引擎,第二终端中配置有第二人脸识别引擎,第一人脸识别引擎和第二人脸识别引擎为相同类型的人脸识别引擎,多帧目标特征信息为由第一人脸识别引擎识别的原始特征信息,第二终端用于采用第二人脸识别引擎将原始特征信息映射至目标虚拟形象中,以生成视频通话图像。
在第三方面的一种可能的实现方式中,该装置还可以包括如下模块:
时间戳添加模块,用于为多帧目标特征信息和音频数据添加时间戳。
在第三方面的一种可能的实现方式中,传输模块具体可以包括如下子模块:
封装子模块,用于将目标特征信息和音频数据封装成通话数据流;
传输子模块,用于将通话数据流传输至所述第二终端。
在第三方面的一种可能的实现方式中,传输模块还用于向第二终端传输虚拟形象编号信息,虚拟形象编号信息用于指示第二终端从多个虚拟形象中确定目标虚拟形象。
第四方面,提供一种基于虚拟形象的视频通话装置,该装置可以应用于第二终端,该装置具体可以包括如下模块:
接收模块,用于接收第一终端传输的通话数据流,通话数据流包含音频数据和多帧目标特征信息,多帧目标特征信息包括用于表征用户在通话过程中的人脸表情和头部动作的特征信息;
映射模块,用于将多帧目标特征信息映射至预设的目标虚拟形象中,以生成视频通话图像,视频通话图像中包含具有人脸表情和头部动作的目标虚拟形象;
通话模块,用于显示视频通话图像,并同步播放音频数据。
在第四方面的一种可能的实现方式中,映射模块具体可以包括如下子模块:
拆分子模块,用于从通话数据流中拆分出音频数据和多帧目标特征信息;
确定子模块,用于分别确定每帧目标特征信息中包含的人脸表情和头部动作;
映射子模块,用于分别将每帧目标特征信息中包含的人脸表情和头部动作映射至预设的目标虚拟形象中,以生成视频通话图像。
在第四方面的一种可能的实现方式中,每帧目标特征信息包括多个面部区域的状态信 息和坐标信息,确定子模块具体可以包括如下单元:
计算单元,用于根据多个面部区域的坐标信息计算用户头部的朝向;
调整及模拟单元,用于根据多个面部区域的状态信息对用户头部的朝向进行调整,以及模拟人脸表情和头部动作。
在第四方面的一种可能的实现方式中,多帧目标特征信息包括与关键视频帧对应的目标特征信息以及与非关键视频帧对应的目标特征信息,与关键视频帧对应的目标特征信息包括关键视频帧的完整特征信息,与非关键视频帧对应的目标特征信息包括在非关键视频帧中发生变化的特征信息;映射模块还可以包括如下子模块:
生成子模块,用于根据关键视频帧的完整特征信息和非关键视频帧中发生变化的特征信息,生成非关键视频帧的完整特征信息。
在第四方面的一种可能的实现方式中,第一终端中配置有第一人脸识别引擎,第二终端中配置有第二人脸识别引擎,第一人脸识别引擎和第二人脸识别引擎为相同类型的人脸识别引擎,多帧目标特征信息为由第一人脸识别引擎识别的原始特征信息,映射子模块还用于采用第二人脸识别引擎将原始特征信息映射至目标虚拟形象中,以生成视频通话图像。
在第四方面的一种可能的实现方式中,接收模块还可以包括如下子模块:
虚拟形象编号信息接收子模块,用于接收第一终端传输的虚拟形象编号信息;
目标虚拟形象确定子模块,用于根据虚拟形象编号信息从多个虚拟形象中确定目标虚拟形象。
在第四方面的一种可能的实现方式中,多帧目标特征信息和音频数据具有时间戳,通话模块具体可以包括如下子模块:
时间戳确定子模块,用于根据多帧目标特征信息的时间戳,确定每帧视频通话图像的时间戳;
音视频同步子模块,用于根据每帧视频通话图像的时间戳和音频数据的时间戳,对视频通话图像和音频数据进行同步。
第五方面,提供一种终端,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如上述第一方面或第二方面任一项所述的基于虚拟形象的视频通话方法。
第六方面,提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机指令,当该计算机指令在终端上运行时,使得终端执行上述相关方法步骤实现上述第一方面或第二方面任一项所述的基于虚拟形象的视频通话方法。
第七方面,提供一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述相关步骤,以实现上述第一方面或第二方面任一项所述的基于虚拟形象的视频通话方法。
第八方面,提供一种芯片,所述芯片包括存储器和处理器,所述处理器执行所述存储器中存储的计算机程序,以实现如上述第一方面或第二方面任一项所述的基于虚拟形象的视频通话方法。
第九方面,提供一种通信系统,包括如上述第一方面任一项所述的第一终端和上述第二方面任一项所述的第二终端,以及用于建立第一终端和第二终端之间的通信连接的通信设备。
可以理解的是,上述第二方面至第九方面的有益效果可以参见上述第一方面中的相关 描述,在此不再赘述。
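Brief Description of the Drawings
FIG. 1 is a schematic diagram of an interface of an avatar video call in the prior art.
FIG. 2 is a schematic diagram comparing the avatar-based video call method provided by the embodiments of this application with the traditional avatar video call method in the prior art.
FIG. 3 is a schematic diagram of data transmission according to an embodiment of this application.
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of this application.
FIG. 5 is a block diagram of the software structure of an electronic device according to an embodiment of this application.
FIG. 6 is a schematic diagram of the steps of an avatar-based video call method according to an embodiment of this application.
FIG. 7 is a schematic diagram of an operation that triggers the first terminal to initiate a video call request according to an embodiment of this application.
FIG. 8 is a schematic diagram of an operation in which the second terminal accepts a video call request according to an embodiment of this application.
FIG. 9 is a schematic diagram of a call interface during a video call between the first terminal and the second terminal according to an embodiment of this application.
FIG. 10 is a schematic diagram of the data processing procedure of the first terminal according to an embodiment of this application.
FIG. 11 is a schematic diagram of a way of processing video frames according to an embodiment of this application.
FIG. 12 is a schematic diagram of the data processing procedure of the second terminal according to an embodiment of this application.
FIG. 13 is a schematic diagram of a face normal according to an embodiment of this application.
FIG. 14 is a schematic diagram of the steps of an avatar-based video call method implemented on the first terminal side according to an embodiment of this application.
FIG. 15 is a schematic diagram of the steps of another avatar-based video call method implemented on the first terminal side according to an embodiment of this application.
FIG. 16 is a schematic diagram of the steps of yet another avatar-based video call method implemented on the first terminal side according to an embodiment of this application.
FIG. 17 is a schematic diagram of the steps of an avatar-based video call method implemented on the second terminal side according to an embodiment of this application.
FIG. 18 is a structural block diagram of an avatar-based video call apparatus according to an embodiment of this application.
FIG. 19 is a structural block diagram of another avatar-based video call apparatus according to an embodiment of this application.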
具体实施方式
为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。例如,第一人脸识别引擎、第二人脸识别引擎等等仅仅是为了区分不同终端上的人脸识别引擎,并不对其数量和执行次序进行限定。
需要说明的是,本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其他实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
本申请实施例描述的业务场景是为了更加清楚的说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定,本领域普通技术人员可知,随着新业务场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请实施例中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。
本申请实施例提供的一种基于虚拟形象的视频通话方法中所涉及到的步骤仅仅作为示例,并非所有的步骤均是必须执行的步骤,或者并非各个信息或消息中的内容均是必选的,在使用过程中可以根据需要酌情增加或减少。
本申请实施例中同一个步骤或者具有相同功能的步骤或者消息在不同实施例之间可以互相参考借鉴。
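FIG. 1 is a schematic diagram of an interface of an avatar video call in the prior art. When using the avatar video call shown in FIG. 1, the user needs to select one of several candidate avatars as the avatar for the current call. For example, the user selects avatar 103 from area 100, which contains multiple candidate avatars. The first terminal or application uses face recognition technology to replace the user's face in the captured video frames with the selected avatar 103; the replaced face is shown as 110 in FIG. 1. The first terminal then sends the video stream to the second terminal to implement the avatar video call.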
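As can be seen, an avatar video call in the prior art transmits the replaced picture to the peer device as a complete picture, and the whole process is no different from the way a traditional video call transmits video and audio streams. Suppose the video stream originally to be transmitted during the video call is 1080*1920 pixels at 30 frames per second (fps). Using an avatar replaces the face in every frame, but the resulting video stream is still 1080*1920 pixels at 30 fps, with no significant difference in data size from the original stream. Consequently, when the user's network conditions are poor, for example when the bandwidth cannot support a video call, such an avatar video call cannot be used either.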
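To give a sense of scale for the 1080*1920, 30 fps stream mentioned above, the following back-of-the-envelope calculation computes the uncompressed data rate. The 1.5 bytes per pixel figure (8-bit YUV 4:2:0) is an assumption, and a real call would transmit a compressed stream that is far smaller; the point is only that replacing the face with an avatar does not change the frame size or frame rate.

```python
WIDTH, HEIGHT, FPS = 1080, 1920, 30   # figures quoted in the text
BYTES_PER_PIXEL = 1.5                 # assumption: YUV 4:2:0, 8 bits per sample

pixels_per_second = WIDTH * HEIGHT * FPS
raw_bits_per_second = pixels_per_second * BYTES_PER_PIXEL * 8

print(pixels_per_second)              # 62208000
print(raw_bits_per_second / 1e6)      # 746.496 (Mbit/s, uncompressed)
```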
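To address the above problem, the embodiments of this application provide an avatar-based video call method. With this method, after collecting the image data and audio data of the user during a call, the first terminal can extract from the image data the feature information that represents the user's facial expression and head movement. The first terminal then transmits the audio data and the extracted feature information to the second terminal, and the second terminal maps the received feature information onto an avatar to form a video call image. By playing the received audio data synchronously while displaying the video call image, the second terminal enables an avatar-based video call between the first terminal and the second terminal. In this way, the first terminal does not need to transmit a video stream to the second terminal during the call; it only needs to transmit the feature information extracted from the video stream. This greatly reduces the amount of data to be transmitted, so that users can still contact other users by video call when network conditions are poor. In addition, because the first terminal does not need to transmit real-time images of the user to the second terminal during the call, the user's privacy is also protected.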
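Specifically, FIG. 2 is a schematic diagram comparing the avatar-based video call method provided by the embodiments of this application with the traditional avatar video call method in the prior art. Part (a) of FIG. 2 compares the data processing on the initiating side of the video call (that is, the first terminal) in the embodiments of this application and in the prior art. As shown in (a) of FIG. 2, in a traditional video call the first terminal invokes the camera to collect image data and the microphone to collect audio data, combines the image data and audio data into a video stream, and transmits the video stream to the peer (that is, the second terminal). In the video call method provided by the embodiments of this application, the first terminal likewise invokes the camera to collect image data and the microphone to collect audio data. The first terminal then processes the collected image data to identify feature information such as the facial expression and head movement in the images, combines the identified feature information with the audio data into a data stream, and transmits it to the second terminal. Part (b) of FIG. 2 compares the data processing on the receiving side of the video call (that is, the second terminal) in the embodiments of this application and in the prior art. In a traditional video call, after receiving the data stream transmitted by the first terminal, the second terminal decodes the video stream and the audio stream, displays the corresponding picture, and plays the sound. In the video call method provided by the embodiments of this application, the data stream received by the second terminal is not a video stream but a special call stream in which feature information is superimposed on an audio stream. The second terminal therefore decodes the audio stream in the traditional way and, in addition, extracts the feature information frame by frame, analyzes each frame of feature information, and maps the feature information containing the facial expression and head movement onto the avatar to form the video call image. Finally, the second terminal synchronizes the images and the audio according to their timestamps, and implements the video call between the first terminal and the second terminal by displaying the avatar picture and playing the sound synchronously.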
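As can be seen, the data transmitted in a traditional video call in the prior art is still a video stream. Because transmitting a video stream occupies considerable network bandwidth, a video call cannot be carried out in the traditional way when network conditions are poor. The video call method provided by the embodiments of this application does not transmit a video stream; instead, it transmits a special data stream formed by adding feature information to the audio stream of a voice call. This stream occupies little network bandwidth, so even when network conditions are poor the method can still deliver a video call rather than falling back to a voice call.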
在本申请实施例中,上述第一终端或第二终端可以是手机、平板电脑、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、个人计算机(personal computer,PC)、上网本、个人数字助理(personal digital assistant,PDA)等具备音视频采集功能的电子设备。本申请实施例对第一终端或第二终端的具体类型不作限定。
本申请实施例中的第一终端和第二终端可以是同类型的电子设备,例如,第一终端和第二终端均为手机;或者,第一终端和第二终端均为平板电脑。本申请实施例中的第一终端和第二终端也可以是不同类型的电子设备,例如,第一终端为手机,第二终端为平板电脑;或者,第一终端为平板电脑,第二终端为手机。
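FIG. 3 is a schematic diagram of data transmission according to an embodiment of this application. FIG. 3 includes a first terminal 31 and a second terminal 32. The first terminal 31 may be a mobile phone 311, a tablet computer 312, a PC device 313, or a smart TV 314; similarly, the second terminal 32 may be a mobile phone 321, a tablet computer 322, a PC device 323, or a smart TV 324. In one possible implementation, when the first terminal 31 communicates with the second terminal 32, the data streams between them may be transmitted through a communication device, which may be a communication base station, a cloud server, or the like. For example, the first terminal 31 transmits the collected feature information and audio data to the cloud server 30, and the cloud server 30 forwards the data to the second terminal 32; the second terminal 32 processes the data to display the video call image containing the avatar and play the corresponding audio, implementing the video call between the first terminal 31 and the second terminal 32. In another possible implementation, the data streams between the first terminal 31 and the second terminal 32 may also be transmitted as peer-to-peer (P2P) data streams. The embodiments of this application impose no limitation on this.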
示例性的,图4示出了一种电子设备400的结构示意图。上述第一终端31和第二终端32的结构可以参考电子设备400的结构。
电子设备400可以包括处理器410、外部存储器接口420、内部存储器421、通用串 行总线(universal serial bus,USB)接口430、充电管理模块440、电源管理模块441、电池442、天线1、天线2、移动通信模块450、无线通信模块460、音频模块470、扬声器470A、受话器470B、麦克风470C、耳机接口470D、传感器模块480、按键490、马达491、指示器492、摄像头493、显示屏494,以及用户标识模块(subscriber identification module,SIM)卡接口495等。其中,传感器模块480可以包括压力传感器480A、陀螺仪传感器480B、气压传感器480C、磁传感器480D、加速度传感器480E、距离传感器480F、接近光传感器480G、指纹传感器480H、温度传感器480J、触摸传感器480K、环境光传感器480L、骨传导传感器480M等。
可以理解的是,本申请实施例示意的结构并不构成对电子设备400的具体限定。在本申请一些实施例中,电子设备400可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器410可以包括一个或多个处理单元。例如,处理器410可以包括应用处理器(application processor,AP)、调制解调处理器、图形处理器(graphics processing unit,GPU)、图像信号处理器(image signal processor,ISP)、控制器、视频编解码器、数字信号处理器(digital signal processor,DSP)、基带处理器,和/或,神经网络处理器(neural-network processing unit,NPU)等。不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
其中,控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器410中还可以设置存储器,用于存储指令和数据。在本申请一些实施例中,处理器410中的存储器为高速缓冲存储器。该存储器可以保存处理器410刚用过或循环使用的指令或数据。如果处理器410需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器410的等待时间,因而提高了系统的效率。
在本申请一些实施例中,处理器410可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口、集成电路内置音频(inter-integrated circuit sound,I2S)接口、脉冲编码调制(pulse code modulation,PCM)接口、通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口、移动产业处理器接口(mobile industry processor interface,MIPI)、通用输入输出(general-purpose input/output,GPIO)接口、用户标识模块(subscriber identity module,SIM)接口,和/或,通用串行总线(universal serial bus,USB)接口等。
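The I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL). In some embodiments of this application, the processor 410 may include multiple groups of I2C buses. The processor 410 may be coupled to the touch sensor 480K, a charger, a flash, the camera 493, and the like through different I2C bus interfaces. For example, the processor 410 may be coupled to the touch sensor 480K through an I2C interface, so that the processor 410 communicates with the touch sensor 480K through the I2C bus interface to implement the touch function of the electronic device 400.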
I2S接口可以用于音频通信。在本申请一些实施例中,处理器410可以包含多组I2S总线。处理器410可以通过I2S总线与音频模块470耦合,实现处理器410与音频模块470之间的通信。在本申请一些实施例中,音频模块470可以通过I2S接口向无线通信模块460传递音频信号,实现通过蓝牙耳机接听电话的功能。
PCM接口也可以用于音频通信,将模拟信号抽样,量化和编码。在本申请一些实施例中,音频模块470与无线通信模块460可以通过PCM总线接口耦合。在本申请一些实施例中,音频模块470也可以通过PCM接口向无线通信模块460传递音频信号,实现通过蓝牙耳机接听电话的功能。
UART接口是一种通用串行数据总线,用于异步通信。该总线可以为双向通信总线。它将要传输的数据在串行通信与并行通信之间转换。在本申请一些实施例中,UART接口通常被用于连接处理器410与无线通信模块460。例如,处理器410通过UART接口与无线通信模块460中的蓝牙模块通信,实现蓝牙功能。在本申请一些实施例中,音频模块470可以通过UART接口向无线通信模块460传递音频信号,实现通过蓝牙耳机播放音乐的功能。
MIPI接口可以被用于连接处理器410与显示屏494、摄像头493等外围器件。MIPI接口包括摄像头串行接口(camera serial interface,CSI)、显示屏串行接口(display serial interface,DSI)等。
在本申请一些实施例中,处理器410和摄像头493通过CSI接口通信,实现电子设备400的拍摄功能。处理器410和显示屏494通过DSI接口通信,实现电子设备400的显示功能。
GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号,也可被配置为数据信号。在本申请一些实施例中,GPIO接口可以用于连接处理器410与摄像头493、显示屏494、无线通信模块460、音频模块470、传感器模块480等。GPIO接口还可以被配置为I2C接口、I2S接口、UART接口、MIPI接口等。
USB接口430是符合USB标准规范的接口,具体可以是Mini USB接口、Micro USB接口、USB Type C接口等。USB接口430可以用于连接充电器为电子设备400充电,也可以用于电子设备400与外围设备之间传输数据。USB接口430也可以用于连接耳机,通过耳机播放音频。该接口还可以用于连接其他电子设备,例如AR设备等。
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对电子设备400的结构限定。在本申请另一些实施例中,电子设备400也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块440用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。在一些有线充电的实施例中,充电管理模块440可以通过USB接口430接收有线充电器的充电输入。在一些无线充电的实施例中,充电管理模块440可以通过电子设备400的无线充电线圈接收无线充电输入。充电管理模块440为电池442充电的同时,还可以通过电源管理模块441为电子设备供电。
电源管理模块441用于连接电池442、充电管理模块440与处理器410。电源管理模块441接收电池442和/或充电管理模块440的输入,为处理器410、内部存储器421、显示屏494、摄像头493、无线通信模块460等供电。电源管理模块441还可以用于监测电池容量、电池循环次数、电池健康状态(漏电,阻抗)等参数。
在其他一些实施例中,电源管理模块441也可以设置于处理器410中。在另一些实施例中,电源管理模块441和充电管理模块440也可以设置于同一个器件中。
电子设备400的无线通信功能可以通过天线1、天线2、移动通信模块450、无线通信模块460、调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。电子设备400中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如,可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块450可以提供应用在电子设备400上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块450可以包括至少一个滤波器、开关、功率放大器、低噪声放大器(low noise amplifier,LNA)等。移动通信模块450可以由天线1接收电磁波,并对接收的电磁波进行滤波、放大等处理,传送至调制解调处理器进行解调。移动通信模块450还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。
在本申请一些实施例中,移动通信模块450的至少部分功能模块可以被设置于处理器410中。在本申请一些实施例中,移动通信模块450的至少部分功能模块可以与处理器410的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后,解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器470A、受话器470B等)输出声音信号,或通过显示屏494显示图像或视频。
在本申请一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器410,与移动通信模块450或其他功能模块设置在同一个器件中。
无线通信模块460可以提供应用在电子设备400上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络)、蓝牙(bluetooth,BT)、全球导航卫星系统(global navigation satellite system,GNSS)、调频(frequency modulation,FM)、近距离无线通信技术(near field communication,NFC)、红外技术(infrared,IR)等无线通信的解决方案。无线通信模块460可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块460经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器410。无线通信模块460还可以从处理器410接收待发送的信号,对其进行调频、放大,经天线2转为电磁波辐射出去。
在本申请一些实施例中,电子设备400的天线1和移动通信模块450耦合,天线2和无线通信模块460耦合,使得电子设备400可以通过无线通信技术与网络以及其他设备通信。所述无线通信技术可以包括全球移动通讯系统(global system for mobile communications,GSM)、通用分组无线服务(general packet radio service,GPRS)、码分多址接入(code division multiple access,CDMA)、宽带码分多址(wideband code division multiple access,WCDMA)、时分码分多址(time-division code division multiple access,TD-SCDMA)、长期演进(long term evolution,LTE)、BT、GNSS、WLAN、NFC、FM,和/或IR技术等。所述GNSS可以包括全球卫星定位系统(global positioning system,GPS)、全球导航卫星系统(global navigation satellite system,GLONASS)、北斗卫星导航系统(beidou navigation satellite system,BDS)、准天顶卫星系统(quasi-zenith satellite system,QZSS),和/或星基增强系统(satellite based augmentation systems,SBAS)。
电子设备400通过GPU、显示屏494以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏494和应用处理器。GPU用于执行数学和几何计算,用于 图形渲染。处理器410可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
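The display screen 494 is used to display images, videos, and the like. The display screen 494 includes a display panel. The display panel may use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), Miniled, MicroLed, Micro-oLed, quantum dot light-emitting diodes (QLED), or the like. In some embodiments of this application, the electronic device 400 may include 1 or N display screens 494, where N is a positive integer greater than 1.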
电子设备400可以通过ISP、摄像头493、视频编解码器、GPU、显示屏494以及应用处理器等实现拍摄功能。
ISP用于处理摄像头493反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点、亮度、肤色进行算法优化。ISP还可以对拍摄场景的曝光、色温等参数优化。在本申请一些实施例中,ISP可以设置在摄像头493中。
摄像头493用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB、YUV等格式的图像信号。在本申请一些实施例中,电子设备400可以包括1个或N个摄像头493,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备400在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。电子设备400可以支持一种或多种视频编解码器。这样,电子设备400可以播放或录制多种编码格式的视频,例如,动态图像专家组(moving picture experts group,MPEG)1、MPEG2、MPEG3、MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备400的智能认知等应用,例如,图像识别、人脸识别、语音识别、文本理解等。
外部存储器接口420可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备400的存储能力。外部存储卡通过外部存储器接口420与处理器410通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器421可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。内部存储器421可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储电子设备400使用过程中所创建的数据(比如音频数据、电话本等)等。
此外,内部存储器421可以包括高速随机存取存储器,还可以包括非易失性存储器。例如至少一个磁盘存储器件、闪存器件、通用闪存存储器(universal flash storage,UFS)等。
处理器410通过运行存储在内部存储器421的指令,和/或存储在设置于处理器中的 存储器的指令,执行电子设备400的各种功能应用以及数据处理。
电子设备400可以通过音频模块470、扬声器470A、受话器470B、麦克风470C、耳机接口470D,以及应用处理器等实现音频功能。例如音乐播放、录音等。
音频模块470用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块470还可以用于对音频信号编码和解码。在本申请一些实施例中,音频模块470可以设置于处理器410中,或将音频模块470的部分功能模块设置于处理器410中。
扬声器470A,也称“喇叭”,用于将音频电信号转换为声音信号。电子设备400可以通过扬声器470A收听音乐,或收听免提通话。
受话器470B,也称“听筒”,用于将音频电信号转换成声音信号。当电子设备400接听电话或语音信息时,可以通过将受话器470B靠近人耳接听语音。
麦克风470C,也称“话筒”、“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风470C发声,将声音信号输入到麦克风470C。电子设备400可以设置至少一个麦克风470C。在另一些实施例中,电子设备400可以设置两个麦克风470C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,电子设备400还可以设置三个、四个或更多麦克风470C,实现采集声音信号、降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口470D用于连接有线耳机。耳机接口470D可以是USB接口430,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口、美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
压力传感器480A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器480A可以设置于显示屏494。压力传感器480A的种类很多,如电阻式压力传感器、电感式压力传感器、电容式压力传感器等。电容式压力传感器可以是包括至少两个具有导电材料的平行板。当有力作用于压力传感器480A,电极之间的电容改变。电子设备400根据电容的变化确定压力的强度。当有触摸操作作用于显示屏494,电子设备400根据压力传感器480A检测所述触摸操作强度。电子设备400也可以根据压力传感器480A的检测信号计算触摸的位置。
在本申请一些实施例中,作用于相同触摸位置,但不同触摸操作强度的触摸操作,可以对应不同的操作指令。例如,当有触摸操作强度小于第一压力阈值的触摸操作作用于短消息应用图标时,执行查看短消息的指令。当有触摸操作强度大于或等于第一压力阈值的触摸操作作用于短消息应用图标时,执行新建短消息的指令。
陀螺仪传感器480B可以用于确定电子设备400的运动姿态。在本申请一些实施例中,可以通过陀螺仪传感器480B确定电子设备400围绕三个轴(即,x,y和z轴)的角速度。陀螺仪传感器480B可以用于拍摄防抖。示例性的,当按下快门,陀螺仪传感器480B检测电子设备400抖动的角度,根据角度计算出镜头模组需要补偿的距离,让镜头通过反向运动抵消电子设备400的抖动,实现防抖。陀螺仪传感器480B还可以用于导航、体感游戏场景。
气压传感器480C用于测量气压。在本申请一些实施例中,电子设备400通过气压传感器480C测得的气压值计算海拔高度、辅助定位和导航。
磁传感器480D包括霍尔传感器。电子设备400可以利用磁传感器480D检测翻盖皮套的开合。在本申请一些实施例中,当电子设备400是翻盖机时,电子设备400可以根据磁传感器480D检测翻盖的开合,进而根据检测到的皮套的开合状态或翻盖的开合状态,设置翻盖自动解锁等特性。
加速度传感器480E可检测电子设备400在各个方向上(一般为三轴)加速度的大小。当电子设备400静止时可检测出重力的大小及方向。还可以用于识别电子设备姿态,应用于横竖屏切换,计步器等应用。
距离传感器480F,用于测量距离。电子设备400可以通过红外或激光测量距离。在本申请一些实施例中,例如拍摄场景,电子设备400可以利用距离传感器480F测距以实现快速对焦。
接近光传感器480G可以包括例如发光二极管(LED)和光检测器,例如光电二极管。发光二极管可以是红外发光二极管。电子设备400通过发光二极管向外发射红外光。电子设备400使用光电二极管检测来自附近物体的红外反射光。当检测到充分的反射光时,可以确定电子设备400附近有物体。当检测到不充分的反射光时,电子设备400可以确定电子设备400附近没有物体。电子设备400可以利用接近光传感器480G检测用户手持电子设备400贴近耳朵通话,以便自动熄灭屏幕达到省电的目的。接近光传感器480G也可用于皮套模式,口袋模式自动解锁与锁屏。
环境光传感器480L用于感知环境光亮度。电子设备400可以根据感知的环境光亮度自适应调节显示屏494亮度。环境光传感器480L也可用于拍照时自动调节白平衡。环境光传感器480L还可以与接近光传感器480G配合,检测电子设备400是否在口袋里,以防误触。
指纹传感器480H用于采集指纹。电子设备400可以利用采集的指纹特性实现指纹解锁、访问应用锁、指纹拍照、指纹接听来电等。
温度传感器480J用于检测温度。在本申请一些实施例中,电子设备400利用温度传感器480J检测的温度,执行温度处理策略。例如,当温度传感器480J上报的温度超过阈值,电子设备400执行降低位于温度传感器480J附近的处理器的性能,以便降低功耗实施热保护。在另一些实施例中,当温度低于另一阈值时,电子设备400对电池442加热,以避免低温导致电子设备400异常关机。在其他一些实施例中,当温度低于又一阈值时,电子设备400对电池442的输出电压执行升压,以避免低温导致的异常关机。
触摸传感器480K,也称“触控器件”。触摸传感器480K可以设置于显示屏494,由触摸传感器480K与显示屏494组成触摸屏,也称“触控屏”。触摸传感器480K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏494提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器480K也可以设置于电子设备400的表面,与显示屏494所处的位置不同。
骨传导传感器480M可以获取振动信号。在本申请一些实施例中,骨传导传感器480M可以获取人体声部振动骨块的振动信号。骨传导传感器480M也可以接触人体脉搏,接收血压跳动信号。
在本申请一些实施例中,骨传导传感器480M也可以设置于耳机中,结合成骨传导耳机。音频模块470可以基于所述骨传导传感器480M获取的声部振动骨块的振动信号,解 析出语音信号,实现语音功能。应用处理器可以基于骨传导传感器480M获取的血压跳动信号解析心率信息,实现心率检测功能。
按键490包括开机键、音量键等。按键490可以是机械按键,也可以是触摸式按键。电子设备400可以接收按键输入,产生与电子设备400的用户设置以及功能控制有关的键信号输入。
马达491可以产生振动提示。马达491可以用于来电振动提示,也可以用于触摸振动反馈。例如,作用于不同应用(例如拍照、音频播放等)的触摸操作,可以对应不同的振动反馈效果。作用于显示屏494不同区域的触摸操作,马达491也可对应不同的振动反馈效果。不同的应用场景(例如,时间提醒、接收信息、闹钟、游戏等)也可以对应不同的振动反馈效果。触摸振动反馈效果还可以支持自定义。
指示器492可以是指示灯,可以用于指示充电状态、电量变化,也可以用于指示消息、未接来电、通知等。
SIM卡接口495用于连接SIM卡。SIM卡可以通过插入SIM卡接口495,或从SIM卡接口495拔出,实现和电子设备400的接触和分离。电子设备400可以支持1个或N个SIM卡接口,N为大于1的正整数。SIM卡接口495可以支持Nano SIM卡、Micro SIM卡、SIM卡等。同一个SIM卡接口495可以同时插入多张卡。所述多张卡的类型可以相同,也可以不同。SIM卡接口495也可以兼容不同类型的SIM卡。SIM卡接口495也可以兼容外部存储卡。电子设备400通过SIM卡和网络交互,实现通话以及数据通信等功能。在本申请一些实施例中,电子设备400采用eSIM(即嵌入式SIM卡)。eSIM卡可以嵌在电子设备400中,不能和电子设备400分离。
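The software system of the electronic device 400 may use a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of this application use the [Figure PCTCN2021137526-appb-000001] system with a layered architecture as an example to describe the software structure of the electronic device 400.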
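FIG. 5 is a block diagram of the software structure of the electronic device 400 in an embodiment of this application. The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments of this application, the [Figure PCTCN2021137526-appb-000002] system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the [Figure PCTCN2021137526-appb-000003] runtime ([Figure PCTCN2021137526-appb-000004] runtime) and system layer, and the kernel layer.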
应用程序层可以包括一系列应用程序包。如图5所示,应用程序包可以包括相机、图库、日历、通话、地图、导航、WLAN、蓝牙、音乐、视频、短信息等应用程序。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。
如图5所示,应用程序框架层可以包括窗口管理器、内容提供器、视图系统、电话管理器、资源管理器、通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小、判断是否有状态栏、锁定屏幕、截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频、图像、音频、拨打和接听的电话、浏览历史和书签、电话簿等。
视图系统包括可视控件,例如显示文字的控件、显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供电子设备400的通信功能。例如,通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串、图标、图片、布局文件、视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成、消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息、发出提示音、电子设备振动、指示灯闪烁等。
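The [Figure PCTCN2021137526-appb-000005] runtime includes core libraries and a virtual machine. The [Figure PCTCN2021137526-appb-000006] runtime is responsible for the scheduling and management of the [Figure PCTCN2021137526-appb-000007] system.
The core libraries consist of two parts: one part is the functions that the Java language needs to call, and the other part is the core libraries of [Figure PCTCN2021137526-appb-000008].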
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理、堆栈管理、线程管理、安全和异常的管理、以及垃圾回收等功能。
系统层可以包括多个功能模块。例如,表面管理器(surface manager)、媒体库(Media Libraries)、三维(3D)图形处理库(例如,OpenGL ES)、二维(2D)图形引擎(例如,SGL)等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频、视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如,MPEG4、H.264、MP3、AAC、AMR、JPG、PNG等。
三维图形处理库用于实现三维图形绘图、图像渲染、合成,以及图层处理等。
2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动、摄像头驱动、音频驱动、传感器驱动。
以下实施例以具有上述硬件结构/软件结构的终端为例,对本申请实施例提供的基于虚拟形象的视频通话方法进行说明。
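FIG. 6 is a schematic diagram of the steps of an avatar-based video call method according to an embodiment of this application. The method may specifically include the following steps:
S601. The first terminal transmits avatar number information to the second terminal.
In the embodiments of this application, the first terminal may be the terminal that initiates the video call, and the second terminal may be the terminal that receives the video call.
In a possible implementation of the embodiments of this application, the video call may be triggered by an operation of a first user on the first terminal. The first user is the user of the first terminal; correspondingly, the second user is the user of the second terminal.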
示例性地,第一用户希望与第二用户进行视频通话,则第一用户可以在如图7中的(a)所示的第一终端的界面中点击“电话”控件701。响应于第一用户点击“电话”控件701的操作,第一终端进入如图7中的(b)所示的拨号界面。在该拨号界面中,第一用户可以输入第二用户的电话号码或其他可用于联系第二用户的联系方式;或者,若第一终端中存储 有第二用户的联系方式,第一用户也可以直接从第一终端中调出第二用户的联系方式。在一种可能的实现方式,第一用户可以在如图7中的(b)所示的界面中输入第二用户的电话号码。待第一用户输入完整的电话号码后,如图7中的(c)所示,第一用户可以点击“视频通话”控件702,触发第一终端向第二终端发起相应的视频通话请求。
在本申请实施例的一种可能的实现方式中,第一终端在向第二终端发起视频通话请求前,可以向第一用户展示如图7中的(d)所示的对话框703。在该对话框703中,包括“普通视频通话”控件7031和“虚拟形象视频通话”控件7032,第一用户可以从上述两个控件7031或7032中选择任意一种视频通话方式。其中,普通视频通话可以是指传统的视频通话方式,第一终端可以实时采集第一用户的图像和语音,并将采集到的图像和语音传输至第二终端,实现第一终端和第二终端之间的视频通话。在普通视频通话模式下,第一终端向第二终端传输的是第一用户本人的图像和语音,显示于第二终端上的图像也就是第一用户本人的图像。虚拟形象视频通话可以是指本申请实施例中提供的视频通话方式,在虚拟形象视频通话过程中,显示于第二终端上的图像并非第一用户本人的图像,而是经过处理后的虚拟形象。
在一种示例中,第一用户点击如图7中的(f)所示的“虚拟形象视频通话”控件7032,请求第一终端建立与第二终端之间的虚拟形象视频通话连接。在第一用户选定“虚拟形象视频通话”控件7032后,第一终端可以弹出如图7中的(g)所示的对话框704,在对话框704中,第一终端请求第一用户选择希望使用的虚拟形象。例如,对话框704中包括虚拟形象1和虚拟形象2。如图7中的(h)所示,第一用户可以选定虚拟形象1对应的控件7041,这样,第一终端可以将第一用户选定的虚拟形象1的信息传输至第二终端,并请求基于虚拟形象1建立第一终端与第二终端之间的虚拟形象视频通话连接。上述虚拟形象1的信息即是第一终端传输至第二终端的虚拟形象编号信息。
需要说明的是,可用于视频通话的虚拟形象可以是任意类型的虚拟形象。例如,虚拟形象可以是虚拟宠物形象,也可以是虚拟人物形象,本申请实施例对虚拟形象的类型不作限定。
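Table 1 shows an example of the data transmitted when the first terminal and the second terminal establish a video call connection according to an embodiment of this application.
Table 1:
Parameter | Data length | Optional/Mandatory | Field description
charcterId | 4 Byte | Mandatory | Avatar number information
otherData | | | Other data, added as required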
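The connection-setup payload in Table 1 can be sketched as a fixed 4-byte avatar number followed by optional extra data. The big-endian unsigned encoding and the helper names `pack_setup` and `unpack_setup` are assumptions for illustration; the table specifies only the field name `charcterId` and its 4-byte length.

```python
import struct


def pack_setup(character_id, other_data=b""):
    """Serialize the call-setup payload: 4-byte charcterId, then otherData."""
    # Assumption: unsigned, big-endian (network byte order).
    return struct.pack(">I", character_id) + other_data


def unpack_setup(payload):
    """Return (character_id, other_data) from a call-setup payload."""
    (character_id,) = struct.unpack_from(">I", payload, 0)
    return character_id, payload[4:]
```

On the receiving side, the recovered number would be used to look up the target avatar among the avatars available on that terminal.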
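S602. The second terminal determines the target avatar from a plurality of avatars according to the avatar number information.
In the embodiments of this application, the avatar video call request initiated by the first terminal may be transmitted to the second terminal by any means of communication. For example, the request may be transmitted to the second terminal through a base station, a cloud server, or P2P.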
如图8中的(a)所示,是第二终端接收到第一终端发送的虚拟形象视频通话请求时的界面示意图。在该界面中,包括第一终端的通信号码801,以及可供第二用户对该通话请求进行处理的多个操作控件,如“接听”控件802、“拒绝”控件803、“转为语音”控件804等等。当然,第二终端的界面中还可以包括用于显示本次通话请求类型的显示信息805a。 第二终端通过显示信息805a可以告知第二用户当前的通话请求是虚拟形象视频通话请求。第二用户可以通过点击“接听”控件802,建立起第一终端与第二终端之间的视频通话连接;或者,第二用户也可以通过点击“转为语音”控件804,建立起第一终端与第二终端之间的语音通话连接;又或者,第二用户可以通过点击“拒绝”控件803,拒绝第一终端的通信请求。如图8中的(b)所示,第二用户可以点击“接听”控件802,接受第一终端发起的虚拟形象视频通话请求。当第二用户接受第一终端的虚拟视频通话请求后,第二终端可以弹出如图8中的(c)所示的对话框,该对话框中包括有“虚拟形象1”和“虚拟形象2”两个选择控件8021和8022,第二用户可以从中选择任意一个控件,例如第二用户选择如图8中的(c)所示的“虚拟形象2”对应的控件8022。在第二用户通过点击“虚拟形象2”对应的控件8022,建立起第一终端与第二终端之间的虚拟形象视频通话连接后,第二终端可以根据接收到的第一终端传输的虚拟形象编号信息确定在本终端上显示的目标虚拟形象。相应地,第二终端也可以将第二用户选定的虚拟形象的信息传输至第一终端,由第一终端根据接收到的虚拟形象编号信息,从多个虚拟形象中确定在第一终端上显示的目标虚拟形象。
需要说明的是,第一用户和第二用户选定的虚拟形象可以是相同的虚拟形象,也可以是不同的虚拟形象,本申请实施例对此不作限定。例如,第一用户和第二用户均可以选择“虚拟形象1”或者“虚拟形象2”作为视频通话过程中使用的虚拟形象;或者,第一用户选择使用“虚拟形象1”,而第二用户选择使用“虚拟形象2”,本申请实施例对此不作限定。
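In the embodiments of this application, the target avatar is the image of the peer user that is displayed on the local terminal. For example, after the first user selects "avatar 1" for the subsequent video call, the first terminal may transmit the information of "avatar 1" to the second terminal. The second terminal can then determine "avatar 1" as the target avatar from the plurality of avatars according to the received information. That is, the first user is displayed on the second terminal as "avatar 1". Correspondingly, after the second user accepts the avatar video call request initiated by the first terminal and selects "avatar 2" for the subsequent video call, the second terminal may transmit the information of "avatar 2" to the first terminal. The first terminal can then determine "avatar 2" as the target avatar from the plurality of avatars according to the received information. That is, the second user is displayed on the first terminal as "avatar 2".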
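Parts (a) and (b) of FIG. 9 are schematic diagrams of the call interfaces of the first terminal and the second terminal, respectively, after an avatar video call connection is established between them. Part (a) of FIG. 9 shows the call interface on the first terminal, which includes the communication number 9011 of the second terminal, the avatar 9021 of the first user, and the avatar 9031 of the second user. The call interface shown in (b) of FIG. 9 includes the communication number 9012 of the first terminal, the avatar 9022 of the second user, and the avatar 9032 of the first user. Note that the avatar 9031 of the second user displayed in the call interface of (a) of FIG. 9 is the same as the avatar 9022 of the second user displayed in the call interface of (b) of FIG. 9, and the avatar 9032 of the first user displayed in the call interface of (b) of FIG. 9 is the same as the avatar 9021 of the first user displayed in the call interface of (a) of FIG. 9.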
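In a possible implementation of the embodiments of this application, as shown in (e) of FIG. 7, the first user may choose an ordinary video call when selecting the type of video call. That is, the first user taps the control 7031 shown in (e) of FIG. 7, and the first terminal then requests to establish a video call connection with the second terminal. When the first user requests an ordinary video call between the first terminal and the second terminal, the video call request received by the second terminal may be as shown in (d) of FIG. 8. Comparing (a) and (d) of FIG. 8, the display information 805b shown by the second terminal on receiving an ordinary video call request indicates that the current video call is an ordinary video call. As shown in (e) of FIG. 8, the second user may tap the "Answer" control 802. The second terminal may then pop up the dialog box 806 shown in (f) of FIG. 8, in which it asks the second user to confirm again whether to have an ordinary video call or an avatar video call with the first user. If the second user taps the "avatar video call" control 8062 shown in (f) of FIG. 8, the second terminal may pop up a dialog box asking the second user to select the avatar to be used for the video call. For example, the user selects avatar 2 as shown in (g) of FIG. 8. In this way, the first terminal and the second terminal can establish a one-sided avatar video call connection.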
如图9中的(c)和(d)所示,在第一终端和第二终端建立单方的虚拟形象视频通话连接后,显示于第一终端上的第二用户的形象可以是第二用户的虚拟形象,显示于第二终端上的第一用户的形象可以是第一用户本人的真实形象。或者,在通话中的某一用户选择进行普通视频通话,而另一用户选择进行虚拟形象视频通话时,也可以在第一终端和第二终端之间直接建立虚拟形象视频通话。例如,在第一用户请求与第二用户进行普通视频通话,但第二用户选择接受虚拟形象视频通话时,第一终端和第二终端上显示的视频通话界面既可以是如图9中的(a)和(b)所示的通话界面,也可以是如图9中的(c)和(d)所示的通话界面。本申请实施例对此不作限定。
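S603. The first terminal collects image data and audio data of the user during the call.
Take the case in which both the first user and the second user choose an avatar video call as an example. After the first terminal and the second terminal establish an avatar video call connection, the first terminal may collect the image data and audio data of the first user during the call.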
如图10所示,是本申请实施例提供的第一终端的数据处理过程示意图。按照图10所示,第一终端可以调用图像采集装置,如摄像头对第一用户进行视频拍摄,得到相应的图像数据。另一方面,第一终端可以调用音频采集装置,如麦克风采集第一用户在通话过程中的声音,得到相应的音频数据。
需要说明的是,第一终端在使用摄像头对第一用户进行视频拍摄时,该摄像头可以是前置摄像头,也可以是后置摄像头。在第一用户使用第一终端的前置摄像头进行视频通话时,相应的视频信息可以显示于第一终端的主界面中,在第一用户使用第一终端的后置摄像头进行视频通话时,相应的视频信息可以显示于第一终端的背面的显示装置或模块中,本申请实施例对此亦不作限定。
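S604. The first terminal extracts multi-frame target feature information from the image data.
In the embodiments of this application, the image data collected by the first terminal may consist of multiple video frames. Therefore, when processing the image data, the first terminal may extract from each video frame the target feature information that can be used to represent the first user's facial expression and head movement.
In a possible implementation of the embodiments of this application, the first terminal may be configured with a first face recognition engine. Correspondingly, the second terminal may be configured with a second face recognition engine. The first face recognition engine and the second face recognition engine may be the same type of face recognition engine or different types.
Therefore, as shown in FIG. 10, when processing the collected image data, the first terminal may pass the video frames one by one to the first face recognition engine, which analyzes the facial features in each video frame to obtain the feature point information contained in each video frame. The first terminal may then encode the feature point information frame by frame to obtain multiple data frames in one-to-one correspondence with the video frames. Each data frame corresponds to one frame of target feature information, and this target feature information is the data that subsequently needs to be transmitted to the second terminal.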
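Figure 11 is a schematic diagram of the video frame processing approach provided by an embodiment of this application. Figure 11(a) is a schematic diagram of conventionally encoded video frames, comprising multiple I-frames, B-frames, and P-frames.
Generally, in video compression, each video frame represents a still image. In actual compression, various algorithms may be used to reduce the data volume, and IPB is one of the most common compression coding schemes. An I-frame is a key frame and uses intra-frame compression; it contains the most, and the most critical, data or feature information. It can be understood as a complete copy of the picture for that frame, so it can be decoded from its own data alone. A P-frame represents the difference between the current frame and the preceding key frame (or P-frame). Decoding it requires superimposing the difference defined by the current frame on the previously buffered picture to produce the final picture. In other words, a P-frame is a difference frame: it carries no complete picture data, only the data that differs from the previous frame. A B-frame is a bidirectional difference frame, that is, it records the differences between the current frame and both the preceding and following frames. Decoding a B-frame requires not only the previously buffered picture but also the decoded following picture; the final picture is obtained by superimposing the current frame's data on the preceding and following pictures.
If the video frame sequence encoded conventionally as shown in Figure 11(a) is transmitted, what is transmitted is in essence still a video stream.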
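In the embodiments of this application, to reduce the network bandwidth occupied during a video call, the first terminal may extract the feature information in each video frame from the collected image data and then encode it frame by frame using the encoding approach shown in Figure 11(b), obtaining data frame 1, data frame 2, data frame 3, and so on, which contain only feature information. These data frames are not the video frames transmitted in a conventional video call. Each data frame contains only the target feature information extracted from the corresponding video frame.
Table 2 shows an example of the data contained in a data frame obtained by encoding in the manner shown in Figure 11(b).
Table 2: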
Figure PCTCN2021137526-appb-000009
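Therefore, in a possible implementation of the embodiments of this application, when encoding the feature point information on a per-video-frame basis, the first terminal may first determine the frame sequence number of each frame of target feature information according to the order in which the video frames are received. The first terminal then identifies multiple facial regions from the feature point information contained in each video frame and obtains the feature information of each facial region, such as its state information and coordinate information. The first terminal may store the frame sequence number and the feature information of each facial region in a preset data structure to obtain the data frame shown in Table 2 above, with each data frame corresponding to one frame of target feature information.
It should be noted that, when encoding the extracted feature point information in the manner shown in Figure 11(b), the first terminal neither compresses the target feature information nor applies inter-frame coding. That is, data frames such as frame 1, frame 2, ..., frame 12 in Figure 11(b) contain the original feature information extracted from each video frame that can represent the facial expression and head movement of the first user.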
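The per-frame data structure described above (a frame sequence number plus the state and coordinates of each facial region) can be sketched as follows. Table 2 is only available as an image, so the field names, field widths, and little-endian layout below are assumptions chosen for illustration, not the layout used in the application.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass
class RegionFeature:
    region_type: int  # which facial region (hypothetical numeric code)
    state: int        # region state, e.g. eye open or closed (hypothetical code)
    x: int            # region coordinates in image pixels
    y: int


@dataclass
class FeatureFrame:
    seq: int                      # frame sequence number, in capture order
    regions: List[RegionFeature]  # one entry per facial region


# Assumed wire layout: uint32 sequence number and uint8 region count,
# followed by (uint8 type, uint8 state, uint16 x, uint16 y) per region.
_HEADER = struct.Struct("<IB")
_REGION = struct.Struct("<BBHH")


def encode_frame(frame: FeatureFrame) -> bytes:
    """Serialize one uncompressed data frame (no inter-frame coding)."""
    parts = [_HEADER.pack(frame.seq, len(frame.regions))]
    for r in frame.regions:
        parts.append(_REGION.pack(r.region_type, r.state, r.x, r.y))
    return b"".join(parts)


def decode_frame(data: bytes) -> FeatureFrame:
    """Inverse of encode_frame."""
    seq, count = _HEADER.unpack_from(data, 0)
    offset = _HEADER.size
    regions = []
    for _ in range(count):
        regions.append(RegionFeature(*_REGION.unpack_from(data, offset)))
        offset += _REGION.size
    return FeatureFrame(seq, regions)
```

With this assumed layout each region costs 6 bytes and the header 5 bytes, so a frame stays far below the size of an encoded video frame.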
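By extracting and encoding the feature point information in each video frame, the embodiments of this application ensure that what is subsequently transmitted to the second terminal is not a video picture but only expression feature information, with no redundant data, so transmission is more efficient. Calculated at 10 bytes of feature information per frame and a frame rate of 24 fps, the bit rate is only about 30 kbps, and the amount of data transmitted is far smaller than that of the video stream transmitted directly in a conventional video call.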
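The bit-rate estimate in the preceding paragraph can be checked with a short calculation. Taken literally, 10 bytes per frame at 24 fps gives about 1.92 kbps. The quoted figure of about 30 kbps corresponds to roughly 160 bytes per frame, which would match 10 bytes for each of 16 facial regions. That per-region reading is an assumption made here to reconcile the numbers; the text itself does not state it.

```python
def bitrate_kbps(bytes_per_frame: int, fps: int) -> float:
    """Bit rate in kilobits per second (1 kbps = 1000 bit/s)."""
    return bytes_per_frame * 8 * fps / 1000


# Literal reading: 10 bytes for the whole frame.
literal = bitrate_kbps(10, 24)

# Assumed reading: 10 bytes for each of 16 facial regions.
per_region = bitrate_kbps(10 * 16, 24)
```

Under either reading the stream is orders of magnitude smaller than a typical encoded video stream, which is the point the paragraph makes.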
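In another possible implementation of the embodiments of this application, to further reduce the amount of data to be transmitted during the video call, the first terminal may also use inter-frame compression coding when encoding the feature point information.
In the embodiments of this application, after establishing video call communication with the second terminal, the first terminal may determine the facial regions to be transmitted. That is, the first terminal may first determine which facial regions' feature information needs to be transmitted to the second terminal. In each subsequent frame of data, only the corresponding frame sequence number and the coordinates, state, and other information of the determined facial regions need to be filled in.
In a specific implementation, the first terminal may determine key video frames (I-frames) from the multiple video frames. The information about the key video frames determined by the first terminal may be sent to the second terminal when the video call connection with the second terminal is established.
Therefore, if the extracted feature point information is encoded using inter-frame compression, the data that the first terminal needs to transmit when establishing the video call connection with the second terminal may be as shown in Table 3 below.
Table 3: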
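| Parameter | Data length | Optional/Mandatory | Field description |
| --- | --- | --- | --- |
| charcterId | 4 Byte | Mandatory | Avatar number information |
| gop | 1 Byte | Mandatory | Number of frames between key frames |
| otherData | | | Other data, added as required |
| faceTypeList | N*1 Byte | Mandatory | Specifies how many kinds of facial features the subsequent data contains |
| facialAreaType | 1 Byte | Mandatory | Facial region; identifies 16 different facial regions |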
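The connection-setup parameters in Table 3 can be serialized as in the following sketch. The field names `charcterId`, `gop`, and `faceTypeList` come from the table; the big-endian byte order, the one-byte count in front of the list, and the omission of `otherData` are assumptions made only so that the example is self-contained.

```python
import struct
from typing import List, Tuple


def pack_setup(charcter_id: int, gop: int, face_type_list: List[int]) -> bytes:
    """Pack charcterId (4 bytes), gop (1 byte) and N one-byte facialAreaType values.

    Assumption: a 1-byte count N precedes the list so the receiver knows its length.
    """
    n = len(face_type_list)
    return struct.pack(f">IBB{n}B", charcter_id, gop, n, *face_type_list)


def unpack_setup(data: bytes) -> Tuple[int, int, List[int]]:
    """Inverse of pack_setup."""
    charcter_id, gop, n = struct.unpack_from(">IBB", data, 0)
    face_type_list = list(struct.unpack_from(f">{n}B", data, 6))
    return charcter_id, gop, face_type_list
```

Sending this once at connection setup is what lets every later data frame omit the list of regions and carry only a sequence number plus per-region values.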
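For a key video frame, the first terminal may obtain all the feature information of the facial regions to be transmitted in that frame. For non-key video frames, the first terminal may first determine whether the feature information of the facial regions to be transmitted has changed between any two adjacent non-key video frames; if it has, the first terminal may obtain the feature information of the facial regions to be transmitted in the non-key video frame in which the change occurred, so that only the changed feature information is encoded.
Figure 11(c) is a schematic diagram of encoding the extracted feature point information using inter-frame compression coding. For the key video frames, namely frame 1, frame 6, and frame 11 in Figure 11(c), the first terminal may retain all the feature information in these three video frames; for the other video frames, it may retain only the feature information that has changed in each frame.
A key video frame retains the complete frame data (the feature information of the facial regions), while each intermediate non-key video frame retains only the facial feature information that has changed. Expressions and movements do not change drastically between two adjacent frames, so in general the data of each intermediate non-key video frame is smaller than that of a key frame.
Tables 4 and 5 show examples of the data contained in the data frames corresponding to key video frames and non-key video frames, respectively, obtained by encoding in the manner shown in Figure 11(c).
Table 4, an example of the data contained in the data frame obtained by encoding a key video frame: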
Figure PCTCN2021137526-appb-000010
Table 5, an example of the data contained in the data frame obtained by encoding a non-key video frame:
Figure PCTCN2021137526-appb-000011
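Because compression coding is used in this embodiment, the overall bit rate is reduced further compared with directly encoding the extracted feature point information. For the user, less bandwidth is occupied and less data traffic is consumed. The compression effect varies with the GOP and with how much the actual picture changes. Generally, the larger the GOP, the lower the bit rate.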
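The inter-frame scheme of Figure 11(c) can be illustrated with a minimal sketch: every `gop`-th frame is kept whole as a key frame, and every other frame keeps only the regions whose values changed. The dictionary representation and the choice to compare each frame with the frame immediately before it (whether or not that frame is a key frame) are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

# region type -> (state, x, y); a hypothetical in-memory form of one frame
Features = Dict[int, Tuple[int, int, int]]


def encode_stream(frames: List[Features], gop: int) -> List[Tuple[str, Features]]:
    """Return ("key", full features) or ("delta", changed features) per frame."""
    encoded: List[Tuple[str, Features]] = []
    previous: Features = {}
    for index, current in enumerate(frames):
        if index % gop == 0:
            # Key frame: keep the complete feature information.
            encoded.append(("key", dict(current)))
        else:
            # Non-key frame: keep only regions that differ from the previous frame.
            changed = {k: v for k, v in current.items() if previous.get(k) != v}
            encoded.append(("delta", changed))
        previous = current
    return encoded
```

With `gop=5` the key frames fall on frames 1, 6, and 11 (indices 0, 5, and 10), as in Figure 11(c). A larger `gop` sends fewer full frames, which is why the bit rate generally falls as the GOP grows.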
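S605. The first terminal adds timestamps to the multiple frames of target feature information and the audio data.
As shown in Figure 10, after each video frame is processed and the corresponding multiple frames of target feature information are obtained, the first terminal may add timestamps to the target feature information and the audio data, so that each encoded frame of target feature information can be aligned with its corresponding audio data and the two stay synchronized.
S606. After encapsulating the timestamped multiple frames of target feature information and audio data into a call data stream, the first terminal transmits the call data stream to the second terminal.
After adding the timestamps, the first terminal may encapsulate the timestamped multiple frames of target feature information and audio data into a call data stream and then transmit the call data stream to the second terminal. Compared with the video stream transmitted in a conventional video call, the call data stream transmitted by the avatar-based video call method provided by the embodiments of this application contains only audio data and the target feature information representing the first user's facial expression and head movement, which greatly reduces the network bandwidth occupied by data transmission.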
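Steps S605 and S606 (timestamping, then encapsulation into one call data stream) can be sketched as below. The packet tuple, the millisecond timestamps, and the `"feature"`/`"audio"` labels are hypothetical; the application does not specify a container format.

```python
import heapq
from typing import Iterable, List, Tuple

Packet = Tuple[int, str, bytes]  # (timestamp in ms, kind, payload)


def stamp(payloads: Iterable[bytes], kind: str, start_ms: int, step_ms: int) -> List[Packet]:
    """Attach evenly spaced timestamps to a sequence of payloads."""
    return [(start_ms + i * step_ms, kind, p) for i, p in enumerate(payloads)]


def mux(feature_packets: List[Packet], audio_packets: List[Packet]) -> List[Packet]:
    """Interleave two timestamp-ordered packet lists into one call data stream."""
    return list(heapq.merge(feature_packets, audio_packets))


def demux(stream: List[Packet]) -> Tuple[List[Packet], List[Packet]]:
    """Split a call data stream back into feature packets and audio packets."""
    features = [p for p in stream if p[1] == "feature"]
    audio = [p for p in stream if p[1] == "audio"]
    return features, audio
```

The receiving side can apply `demux` to recover the two sub-streams with their timestamps intact, which is what later allows the avatar images to be aligned with the audio.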
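S607. The second terminal splits the audio data and the multiple frames of target feature information out of the call data stream.
Figure 12 is a schematic diagram of the data processing procedure of the second terminal provided by an embodiment of this application. According to the procedure shown in Figure 12, after receiving the call data stream transmitted by the first terminal, the second terminal may first separate the audio data and the multiple frames of target feature information in the call data stream.
In a specific implementation, the second terminal may split an audio stream and a video stream out of the received call data stream, where the video stream may be the multiple frames of target feature information transmitted in the form of a data stream. The second terminal may perform audio decoding on the audio stream to obtain the corresponding audio data, and perform video decoding on the video stream to obtain the target feature information of each frame.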
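S608. The second terminal maps the multiple frames of target feature information onto the target avatar to generate video call images.
In the embodiments of this application, because the target feature information represents the first user's facial expression and head movement, after mapping the multiple frames of target feature information onto the target avatar, the second terminal can generate multiple frames of images containing the first user's facial expression and head movement, and these images can form the corresponding video call picture.
In a possible implementation of the embodiments of this application, each frame of target feature information decoded by the second terminal may include state information and coordinate information of multiple facial regions. The second terminal may calculate the orientation of the user's head, that is, the head of the first user, from the coordinate information of the facial regions; the second terminal may then adjust the orientation of the user's head according to the state information of the facial regions and simulate the facial expression and head movement.
In a specific implementation, the second terminal may calculate the orientation of the head from the decoded coordinates of the facial regions by means of the face normal.
Figure 13 is a schematic diagram of a face normal provided by an embodiment of this application. Figure 13(a) shows the distance le between a person's two eyes, the vertical distance lf between the eyes and the lips, and the vertical distance lm between the nose tip and the lips; Figure 13(b) shows data such as the distance ln from the nose tip to the face, the vertical distance lf between the eyes and the lips, and the vertical distance lm between the nose tip and the lips. The second terminal may calculate the orientation of the first user's head from the received coordinates of the facial regions according to the face normal shown in Figure 13. The second terminal may then adjust the orientation of the user's head according to the state information of the facial regions and simulate the first user's facial expression and head movement.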
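The quantities named in Figure 13(a) can be computed from 2D landmark coordinates as in the sketch below. This is deliberately limited: it derives le, lf, lm, and the in-plane roll of the head from four hypothetical landmarks, and it does not compute the full 3D face normal, which also needs the nose depth ln of Figure 13(b). The landmark names and the use of the eye midpoint are assumptions.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def face_measurements(left_eye: Point, right_eye: Point,
                      nose_tip: Point, mouth: Point) -> Dict[str, float]:
    """Return le, lf, lm (in pixels) and the in-plane roll angle in degrees."""
    eye_mid = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)

    # le: distance between the two eyes.
    le = math.hypot(right_eye[0] - left_eye[0], right_eye[1] - left_eye[1])

    # Facial axis from the eye midpoint to the mouth; lf is its length.
    axis = (mouth[0] - eye_mid[0], mouth[1] - eye_mid[1])
    lf = math.hypot(axis[0], axis[1])
    if lf == 0:
        raise ValueError("eye midpoint and mouth coincide")
    unit = (axis[0] / lf, axis[1] / lf)

    # lm: distance from nose tip to mouth measured along the facial axis.
    lm = (mouth[0] - nose_tip[0]) * unit[0] + (mouth[1] - nose_tip[1]) * unit[1]

    # Roll: tilt of the eye line relative to the image x axis.
    roll_deg = math.degrees(math.atan2(right_eye[1] - left_eye[1],
                                       right_eye[0] - left_eye[0]))
    return {"le": le, "lf": lf, "lm": lm, "roll_deg": roll_deg}
```

Ratios such as lm / lf are scale independent, which is why measurements of this kind are useful inputs when estimating head orientation from landmark coordinates alone.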
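After determining the first user's facial expression and head movement, the second terminal may map them onto the preset target avatar to generate the video call images. The target avatar is the avatar determined, when the first terminal and the second terminal establish the video call connection, according to the avatar number information transmitted by the first terminal.
S609. The second terminal plays the audio data synchronously while displaying the video call images.
As shown in Figure 12, after mapping the first user's facial expression and head movement onto the target avatar and obtaining the video call images, the second terminal also needs to time-synchronize the video call images and the audio data.
In the embodiments of this application, the multiple frames of target feature information and the audio data decoded by the second terminal carry timestamps added by the first terminal. The second terminal may determine the timestamp of each frame of video call image from the timestamps of the target feature information; it then synchronizes the video call images and the audio data according to the timestamp of each video call image and the timestamps of the audio data, so that the audio data is played synchronously while the video call images are displayed. The video call images are images of the avatar onto which the first user's facial expression and head movement have been mapped.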
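One common way to realize the timestamp-based synchronization described above is to treat the audio playback position as the clock and show the latest avatar image whose timestamp does not exceed it. The sketch below assumes that policy and millisecond timestamps; the application does not prescribe a particular synchronization strategy.

```python
import bisect
from typing import List, Optional


def frame_for_audio_time(frame_timestamps_ms: List[int], audio_time_ms: int) -> Optional[int]:
    """Index of the latest frame with timestamp <= audio_time_ms, or None if none is due.

    frame_timestamps_ms must be sorted in ascending order.
    """
    i = bisect.bisect_right(frame_timestamps_ms, audio_time_ms)
    return i - 1 if i > 0 else None
```

Because each video call image inherits the timestamp of the feature frame it was rendered from, the same lookup works regardless of how long the avatar rendering itself takes.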
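It should be noted that the foregoing embodiment describes the avatar-based video call method only in one direction: the first terminal transmits audio data and target feature information to the second terminal, and the second terminal, after receiving them, processes the target feature information to simulate the first user's facial expression and head movement and presents an avatar having that expression and movement, thereby implementing the video call between the first user and the second user. It can be understood that a video call is bidirectional. The second terminal may likewise collect image data and audio data of the second user, extract target feature information from the image data, and transmit the target feature information and the audio data to the first terminal, which can then present, based on the received data, an avatar having the second user's facial expression and head movement. For how the first terminal and the second terminal process data in this procedure, refer to the description of the steps in the foregoing embodiment; details are not repeated here.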
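Figure 14 shows a schematic diagram of the steps of an avatar-based video call method implemented on the first terminal side, provided by an embodiment of this application. The method may specifically include the following steps:
S1401. The first terminal transmits avatar number information to the second terminal, where the avatar number information instructs the second terminal to determine a target avatar from multiple avatars.
In the embodiments of this application, the avatar number information may be transmitted by the first terminal to the second terminal after the video call connection between them is established. After receiving the avatar number information, the second terminal may determine the target avatar from multiple avatars according to it. The target avatar is the avatar subsequently displayed on the second terminal and used to map the first user's facial expression and head movement.
S1402. The first terminal collects image data and audio data of the user during the call.
It should be noted that the foregoing embodiment describes the avatar-based video call method of this application with the first terminal and the second terminal treated as a whole, whereas this embodiment describes the method from the first terminal side.
In the embodiments of this application, the image data and audio data of the user collected by the first terminal during the call may refer to the audio data and image data of the first user during the call. The image data includes multiple video frames.
S1403. The first terminal extracts multiple frames of target feature information from the image data, where the multiple frames of target feature information include feature information representing the user's facial expression and head movement.
In the embodiments of this application, a first face recognition engine is configured in the first terminal. The first terminal may use the first face recognition engine to parse the facial features in each video frame, obtaining the feature point information contained in each video frame. The first terminal may then encode the feature point information on a per-video-frame basis, obtaining multiple frames of target feature information in one-to-one correspondence with the video frames.
In a specific implementation, when extracting the multiple frames of target feature information from the image data, the first terminal may first determine the frame sequence number of each frame of target feature information according to the order in which the video frames are received; the first terminal may then identify multiple facial regions from the feature point information contained in each video frame; after obtaining the feature information of each facial region, including its state information and coordinate information, the first terminal may store the frame sequence number and the feature information of each facial region in a preset data structure to obtain the multiple frames of target feature information.
S1404. The first terminal transmits the multiple frames of target feature information and the audio data to the second terminal, where the second terminal is configured to map the multiple frames of target feature information onto a preset target avatar to generate video call images, and the video call images contain the target avatar having the facial expression and head movement.
In the embodiments of this application, before transmitting the target feature information and the audio data to the second terminal, the first terminal may add timestamps to the multiple frames of target feature information and the audio data. The first terminal may then encapsulate the timestamped target feature information and audio data into a call data stream and transmit it to the second terminal. After receiving the call data stream, the second terminal may split and decode it and map the multiple frames of target feature information onto the preset target avatar to generate video call images, which contain the target avatar having the first user's facial expression and head movement.
When network conditions are poor, the network bandwidth may be unable to support a video call between the first terminal and the second terminal. In the embodiments of this application, because the first terminal transmits to the second terminal only audio data and target feature information that can represent the first user's facial expression and head movement, the data stream to be transmitted is small and the bandwidth requirement is low. Even under poor network conditions, an avatar video call can still be carried out with this method, and the first user and the second user can still see each other's expressions and movements. In addition, the embodiments of this application use avatars exclusively and do not expose the user's surroundings, which effectively protects the user's privacy.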
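Figure 15 shows a schematic diagram of the steps of another avatar-based video call method implemented on the first terminal side, provided by an embodiment of this application. The method may specifically include the following steps:
S1501. The first terminal transmits avatar number information to the second terminal, where the avatar number information instructs the second terminal to determine a target avatar from multiple avatars.
Because S1501 is similar to S1401 in the foregoing embodiment, the two may be cross-referenced; details are not repeated in this embodiment.
S1502. The first terminal determines the facial regions to be transmitted.
It should be noted that, in the previous embodiment, every frame of target feature information transmitted by the first terminal to the second terminal is a data frame containing the first user's complete facial feature information, including which facial region it is and its coordinates, state, and other information. In this embodiment, after the video call connection is established between the first terminal and the second terminal, which facial regions' data needs to be transmitted may be determined in advance. In each subsequent frame of data, only the frame sequence number and the coordinates, state, and other information of the facial regions then need to be filled in, and the amount of transmitted data is further reduced in a manner similar to inter-frame compression in video coding.
S1503. The first terminal collects image data and audio data of the user during the call, where the image data includes multiple video frames.
Because S1503 is similar to S1402 in the foregoing embodiment, the two may be cross-referenced; details are not repeated in this embodiment.
S1504. The first terminal determines key video frames from the multiple video frames.
In the embodiments of this application, the first terminal may determine key video frames from the collected video frames. A key video frame is a video frame for which all the feature information in the frame needs to be transmitted to the second terminal.
S1505. For a key video frame, the first terminal obtains the feature information of the facial regions to be transmitted in the key video frame.
S1506. For non-key video frames, the first terminal determines whether the feature information of the facial regions to be transmitted has changed between any two adjacent non-key video frames, and if it has changed, obtains the feature information of the facial regions to be transmitted in the non-key video frame in which the change occurred.
In the embodiments of this application, for a key video frame, all the feature information of the facial regions to be transmitted in that frame may be obtained. For a non-key video frame, which feature information needs to be obtained may be determined by comparing whether the feature information of the facial regions has changed between two adjacent frames. If feature information has changed in a non-key video frame, the changed feature information may be obtained. In other words, the complete frame data is retained for key video frames, while only the changed feature information is retained for each intermediate non-key video frame.
S1507. The first terminal performs inter-frame compression coding on the feature point information of the key video frames and the non-key video frames, obtaining multiple frames of target feature information in one-to-one correspondence with the video frames, where the multiple frames of target feature information include feature information representing the user's facial expression and head movement.
In the embodiments of this application, the first terminal may perform inter-frame compression coding on the feature point information of the key video frames and the non-key video frames to obtain multiple data frames. Each data frame corresponds to one frame of target feature information, and this target feature information can represent the first user's facial expression and head movement.
S1508. The first terminal transmits the multiple frames of target feature information and the audio data to the second terminal, where the second terminal is configured to map the multiple frames of target feature information onto a preset target avatar to generate video call images, and the video call images contain the target avatar having the facial expression and head movement.
Because S1508 is similar to S1404 in the foregoing embodiment, the two may be cross-referenced; details are not repeated in this embodiment.
In this embodiment, because the video frames are processed using inter-frame compression coding, the overall bit rate after processing is lower than in the previous embodiment. For the user, the video call occupies less bandwidth and consumes less data traffic.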
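Figure 16 shows a schematic diagram of the steps of yet another avatar-based video call method implemented on the first terminal side, provided by an embodiment of this application. The method may specifically include the following steps:
S1601. The first terminal transmits avatar number information to the second terminal, where the avatar number information instructs the second terminal to determine a target avatar from multiple avatars; a first face recognition engine is configured in the first terminal, a second face recognition engine is configured in the second terminal, and the first face recognition engine and the second face recognition engine are of the same type.
S1602. The first terminal collects image data and audio data of the user during the call.
S1603. The first terminal extracts multiple frames of target feature information from the image data, where the multiple frames of target feature information include feature information representing the user's facial expression and head movement and are the original feature information recognized by the first face recognition engine.
S1604. The first terminal transmits the multiple frames of target feature information and the audio data to the second terminal, where the second terminal is configured to use the second face recognition engine to map the original feature information onto the target avatar to generate video call images, and the video call images contain the target avatar having the facial expression and head movement.
In the embodiments of this application, the feature information representing the facial expression and head movement may be left unprocessed on the sending side, and the original feature information may instead be sent to the receiving side for processing.
In a specific implementation, after collecting the image data and audio data of the first user during the call, the first terminal may pass the image data to the first face recognition engine for processing. The first face recognition engine may return all the raw data it produces. For example, the first face recognition engine may return 276 original feature points, which include not only feature information such as the eyes and lips that can represent the facial expression and head movement, but also some redundant information. The first terminal may transmit all the original feature information returned by the first face recognition engine to the second terminal, where the second face recognition engine processes it and maps the first user's facial expression and head movement onto the target avatar.
In this embodiment, the first terminal on the sending side does not process the original feature information but transmits all of it to the second terminal, and the original feature information is processed on the receiving side. Less information is discarded this way, and the receiving side can reproduce expressions and movements more precisely from the original feature information. Compared with the processing in the two foregoing embodiments, this embodiment needs to transfer more data, and the bit rate of the data stream during the call rises to some extent. However, because the first terminal passes on more raw data, the second terminal on the receiving side can correspondingly map more expressive expressions and movements, which helps reproduce the sender's expressions and movements more faithfully.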
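Figure 17 shows a schematic diagram of the steps of an avatar-based video call method implemented on the second terminal side, provided by an embodiment of this application. The method may specifically include the following steps:
S1701. The second terminal receives avatar number information transmitted by the first terminal and determines a target avatar from multiple avatars according to the avatar number information.
S1702. The second terminal receives a call data stream transmitted by the first terminal, where the call data stream contains audio data and multiple frames of target feature information, and the multiple frames of target feature information include feature information representing the user's facial expression and head movement during the call.
S1703. The second terminal maps the multiple frames of target feature information onto the preset target avatar to generate video call images, where the video call images contain the target avatar having the facial expression and head movement.
S1704. The second terminal plays the audio data synchronously while displaying the video call images.
It should be noted that this embodiment describes the method of this application from the second terminal side.
In the embodiments of this application, after the video call connection between the first terminal and the second terminal is established, the second terminal may receive the avatar number information transmitted by the first terminal. The second terminal may determine the target avatar from multiple avatars according to the avatar number information. The target avatar is the avatar displayed on the second terminal and used to map the first user's facial expression and head movement.
In the embodiments of this application, the call data stream received by the second terminal may be a data stream containing audio data and multiple frames of target feature information. The target feature information can represent the first user's facial expression and head movement during the call.
In a specific implementation, the second terminal may split the audio data and the multiple frames of target feature information out of the call data stream. The second terminal may then determine the facial expression and head movement contained in each frame of target feature information and generate the video call images by mapping the facial expression and head movement contained in each frame onto the preset target avatar.
When determining the facial expression and head movement contained in each frame of target feature information, the second terminal may first calculate the orientation of the user's head from the coordinate information of the multiple facial regions, and then adjust the orientation of the user's head according to the state information of the facial regions and simulate the facial expression and head movement.
In a possible implementation of the embodiments of this application, the target feature information may be original feature information that has not been processed by the first terminal. The original feature information may be recognized by the first face recognition engine on the first terminal. After receiving the unprocessed original feature information, the second terminal may pass it to the second face recognition engine. The second face recognition engine on the second terminal may be of the same type as the first face recognition engine. The second terminal can thus use the second face recognition engine to map the original feature information onto the target avatar to generate the video call images.
In another possible implementation of the embodiments of this application, the target feature information may be data frames obtained by performing feature extraction on multiple video frames and retaining, during encoding, all the feature information that can represent the first user's facial expression and head movement.
In another possible implementation of the embodiments of this application, the target feature information may be data frames obtained after the first terminal performs inter-frame compression coding on multiple video frames. Target feature information of this type includes target feature information corresponding to key video frames and target feature information corresponding to non-key video frames. The target feature information corresponding to a key video frame includes the complete feature information of that key video frame, and the target feature information corresponding to a non-key video frame includes the feature information that has changed in that non-key video frame. Therefore, after splitting the audio data and the multiple frames of target feature information out of the call data stream, the second terminal may also generate the complete feature information of the non-key video frames from the complete feature information of the key video frames and the changed feature information in the non-key video frames. The first user's facial expression and head movement are then mapped onto the target avatar based on the complete feature information of the key video frames and of the non-key video frames.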
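The reconstruction step on the receiving side (rebuilding the complete feature information of a non-key frame from the last complete state plus the changed values) can be sketched as follows. The `("key", ...)` and `("delta", ...)` tuples and the dictionary representation are hypothetical stand-ins for the decoded data frames.

```python
from typing import Dict, List, Tuple

# region type -> (state, x, y); a hypothetical in-memory form of one frame
Features = Dict[int, Tuple[int, int, int]]


def decode_stream(encoded: List[Tuple[str, Features]]) -> List[Features]:
    """Rebuild complete per-frame feature information from key and delta frames."""
    full_frames: List[Features] = []
    state: Features = {}
    for kind, payload in encoded:
        if kind == "key":
            # A key frame replaces the running state completely.
            state = dict(payload)
        else:
            # A non-key frame overrides only the regions that changed.
            state = {**state, **payload}
        full_frames.append(dict(state))
    return full_frames
```

An empty delta simply repeats the previous frame, so an unchanged face costs almost nothing to transmit while the avatar still receives a complete set of values for every frame.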
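To synchronize the video call images and the audio data, the second terminal may determine the timestamp of each frame of video call image from the timestamps of the multiple frames of target feature information, and then synchronize the video call images and the audio data according to the timestamp of each video call image and the timestamps of the audio data.
After completing the mapping of the facial expression and head movement, obtaining the corresponding video call images, and synchronizing the video call images with the audio data, the second terminal may display the video call images; the sequence of video call images forms a video stream. The video stream together with the audio stream constitutes the video call between the first terminal and the second terminal.
The embodiments of this application may divide the terminal device into functional modules according to the foregoing method examples. For example, each functional module may correspond to one function, or one or more functions may be integrated into a single functional module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division into modules in the embodiments of this application is illustrative and is merely a division by logical function; other divisions are possible in an actual implementation. The following description takes the division of one functional module per function as an example.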
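Corresponding to the foregoing embodiments, Figure 18 shows a structural block diagram of an avatar-based video call apparatus provided by an embodiment of this application. The apparatus may be applied to the first terminal in the foregoing embodiments and may specifically include the following modules: a collection module 1801, an extraction module 1802, and a transmission module 1803, where:
the collection module 1801 is configured to collect image data and audio data of the user during the call;
the extraction module 1802 is configured to extract multiple frames of target feature information from the image data, where the multiple frames of target feature information include feature information representing the user's facial expression and head movement;
the transmission module 1803 is configured to transmit the multiple frames of target feature information and the audio data to the second terminal, where the second terminal is configured to map the multiple frames of target feature information onto a preset target avatar to generate video call images, and the video call images contain the target avatar having the facial expression and the head movement.
In the embodiments of this application, the image data includes multiple video frames, a first face recognition engine is configured in the first terminal, and the extraction module 1802 may specifically include the following submodules:
a parsing submodule, configured to use the first face recognition engine to parse the facial features in each video frame, obtaining the feature point information contained in each video frame;
an encoding submodule, configured to encode the feature point information on a per-video-frame basis, obtaining multiple frames of target feature information in one-to-one correspondence with the video frames.
In the embodiments of this application, the encoding submodule may specifically include the following units:
a frame sequence number determining unit, configured to determine the frame sequence number of each frame of target feature information according to the order in which the video frames are received;
a facial region identification unit, configured to identify multiple facial regions from the feature point information contained in each video frame;
a feature information obtaining unit, configured to obtain the feature information of each facial region, where the feature information includes the state information and coordinate information of each facial region;
a feature information storage unit, configured to store the frame sequence number and the feature information of each facial region in a preset data structure to obtain the multiple frames of target feature information.
In the embodiments of this application, the encoding submodule may further include the following unit:
a facial region determining unit, configured to determine the facial regions to be transmitted;
In the embodiments of this application, the feature information obtaining unit may specifically include the following subunits:
a key video frame determining subunit, configured to determine key video frames from the multiple video frames;
a first feature information obtaining subunit, configured to obtain, for a key video frame, the feature information of the facial regions to be transmitted in the key video frame;
a second feature information obtaining subunit, configured to determine, for non-key video frames, whether the feature information of the facial regions to be transmitted has changed between any two adjacent non-key video frames, and if it has changed, obtain the feature information of the facial regions to be transmitted in the non-key video frame in which the change occurred.
In the embodiments of this application, a first face recognition engine is configured in the first terminal, a second face recognition engine is configured in the second terminal, the first face recognition engine and the second face recognition engine are of the same type, the multiple frames of target feature information are the original feature information recognized by the first face recognition engine, and the second terminal is configured to use the second face recognition engine to map the original feature information onto the target avatar to generate the video call images.
In the embodiments of this application, the apparatus may further include the following module:
a timestamp adding module, configured to add timestamps to the multiple frames of target feature information and the audio data.
In the embodiments of this application, the transmission module 1803 may specifically include the following submodules:
an encapsulation submodule, configured to encapsulate the target feature information and the audio data into a call data stream;
a transmission submodule, configured to transmit the call data stream to the second terminal.
In the embodiments of this application, the transmission module 1803 is further configured to transmit avatar number information to the second terminal, where the avatar number information instructs the second terminal to determine the target avatar from multiple avatars.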
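Figure 19 shows a structural block diagram of another avatar-based video call apparatus provided by an embodiment of this application. The apparatus may be applied to the second terminal in the foregoing embodiments and may specifically include the following modules: a receiving module 1901, a mapping module 1902, and a call module 1903, where:
the receiving module 1901 is configured to receive the call data stream transmitted by the first terminal, where the call data stream contains audio data and multiple frames of target feature information, and the multiple frames of target feature information include feature information representing the user's facial expression and head movement during the call;
the mapping module 1902 is configured to map the multiple frames of target feature information onto a preset target avatar to generate video call images, where the video call images contain the target avatar having the facial expression and the head movement;
the call module 1903 is configured to display the video call images and synchronously play the audio data.
In the embodiments of this application, the mapping module 1902 may specifically include the following submodules:
a splitting submodule, configured to split the audio data and the multiple frames of target feature information out of the call data stream;
a determining submodule, configured to determine the facial expression and the head movement contained in each frame of target feature information;
a mapping submodule, configured to map the facial expression and the head movement contained in each frame of target feature information onto the preset target avatar to generate the video call images.
In the embodiments of this application, each frame of target feature information includes state information and coordinate information of multiple facial regions, and the determining submodule may specifically include the following units:
a calculation unit, configured to calculate the orientation of the user's head according to the coordinate information of the multiple facial regions;
an adjustment and simulation unit, configured to adjust the orientation of the user's head according to the state information of the multiple facial regions and to simulate the facial expression and the head movement.
In the embodiments of this application, the multiple frames of target feature information include target feature information corresponding to key video frames and target feature information corresponding to non-key video frames, the target feature information corresponding to a key video frame includes the complete feature information of the key video frame, and the target feature information corresponding to a non-key video frame includes the feature information that has changed in the non-key video frame; the mapping module 1902 may further include the following submodule:
a generating submodule, configured to generate the complete feature information of the non-key video frames from the complete feature information of the key video frames and the changed feature information in the non-key video frames.
In the embodiments of this application, a first face recognition engine is configured in the first terminal, a second face recognition engine is configured in the second terminal, the first face recognition engine and the second face recognition engine are of the same type, the multiple frames of target feature information are the original feature information recognized by the first face recognition engine, and the mapping submodule is further configured to use the second face recognition engine to map the original feature information onto the target avatar to generate the video call images.
In the embodiments of this application, the receiving module 1901 may further include the following submodules:
an avatar number information receiving submodule, configured to receive the avatar number information transmitted by the first terminal;
a target avatar determining submodule, configured to determine the target avatar from multiple avatars according to the avatar number information.
In the embodiments of this application, the multiple frames of target feature information and the audio data carry timestamps, and the call module 1903 may specifically include the following submodules:
a timestamp determining submodule, configured to determine the timestamp of each frame of video call image from the timestamps of the multiple frames of target feature information;
an audio-video synchronization submodule, configured to synchronize the video call images and the audio data according to the timestamp of each frame of video call image and the timestamps of the audio data.
It should be noted that all the relevant content of the steps in the foregoing method embodiments may be cited in the functional descriptions of the corresponding functional modules; details are not repeated here.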
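An embodiment of this application further provides a terminal, which may be the first terminal or the second terminal in the foregoing embodiments. The terminal includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the avatar-based video call method in the foregoing embodiments is implemented.
An embodiment of this application further provides a computer-readable storage medium storing computer instructions; when the computer instructions run on a terminal, the terminal is caused to perform the foregoing related method steps to implement the avatar-based video call method in the foregoing embodiments.
An embodiment of this application further provides a computer program product; when the computer program product runs on a computer, the computer is caused to perform the foregoing related steps to implement the avatar-based video call method in the foregoing embodiments.
An embodiment of this application further provides a communication system, including the first terminal and the second terminal in the foregoing embodiments and a communication device configured to establish the communication connection between the first terminal and the second terminal.
An embodiment of this application further provides a chip, which may be a general-purpose processor or a dedicated processor. The chip includes a processor, which is configured to support the terminal in performing the foregoing related steps to implement the avatar-based video call method in the foregoing embodiments.
Optionally, the chip further includes a transceiver, which is controlled by the processor and configured to support the terminal in performing the foregoing related steps to implement the avatar-based video call method in the foregoing embodiments.
Optionally, the chip may further include a storage medium.
It should be noted that the chip may be implemented using the following circuits or devices: one or more field programmable gate arrays (FPGA), programmable logic devices (PLD), controllers, state machines, gate logic, discrete hardware components, any other suitable circuits, or any combination of circuits capable of performing the various functions described throughout this application.
Finally, it should be noted that the foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement within the technical scope disclosed in this application shall fall within the protection scope of this application.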

Claims (18)

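  1. An avatar-based video call method, characterized in that the method is applied to a first terminal and comprises:
    collecting, by the first terminal, image data and audio data of a user during a call;
    extracting, by the first terminal, multiple frames of target feature information from the image data, wherein the multiple frames of target feature information comprise feature information representing a facial expression and a head movement of the user;
    transmitting, by the first terminal, the multiple frames of target feature information and the audio data to a second terminal, wherein the second terminal is configured to map the multiple frames of target feature information onto a preset target avatar to generate video call images, and the video call images contain the target avatar having the facial expression and the head movement.
  2. The method according to claim 1, characterized in that the image data comprises multiple video frames, a first face recognition engine is configured in the first terminal, and the extracting, by the first terminal, multiple frames of target feature information from the image data comprises:
    parsing, by the first terminal using the first face recognition engine, facial features in each video frame to obtain feature point information contained in each video frame;
    encoding, by the first terminal, the feature point information on a per-video-frame basis to obtain multiple frames of target feature information in one-to-one correspondence with the video frames.
  3. The method according to claim 2, characterized in that the encoding, by the first terminal, the feature point information on a per-video-frame basis to obtain multiple frames of target feature information in one-to-one correspondence with the video frames comprises:
    determining, by the first terminal, a frame sequence number of each frame of target feature information according to the order in which the video frames are received;
    identifying, by the first terminal, multiple facial regions from the feature point information contained in each video frame;
    obtaining, by the first terminal, feature information of each facial region, wherein the feature information comprises state information and coordinate information of each facial region;
    storing, by the first terminal, the frame sequence number and the feature information of each facial region in a preset data structure to obtain the multiple frames of target feature information.
  4. The method according to claim 3, characterized in that, before the first terminal collects the image data and audio data of the user during the call, the method further comprises:
    determining, by the first terminal, facial regions to be transmitted;
    correspondingly, the obtaining, by the first terminal, feature information of each facial region comprises:
    determining, by the first terminal, key video frames from the multiple video frames;
    for a key video frame, obtaining, by the first terminal, the feature information of the facial regions to be transmitted in the key video frame;
    for non-key video frames, determining, by the first terminal, whether the feature information of the facial regions to be transmitted has changed between any two adjacent non-key video frames, and if it has changed, obtaining the feature information of the facial regions to be transmitted in the non-key video frame in which the change occurred.
  5. The method according to claim 1, characterized in that a first face recognition engine is configured in the first terminal, a second face recognition engine is configured in the second terminal, the first face recognition engine and the second face recognition engine are face recognition engines of the same type, the multiple frames of target feature information are original feature information recognized by the first face recognition engine, and the second terminal is configured to use the second face recognition engine to map the original feature information onto the target avatar to generate the video call images.
  6. The method according to any one of claims 1 to 5, characterized in that, before the first terminal transmits the multiple frames of target feature information and the audio data to the second terminal, the method further comprises:
    adding, by the first terminal, timestamps to the multiple frames of target feature information and the audio data.
  7. The method according to any one of claims 1 to 6, characterized in that the transmitting, by the first terminal, the target feature information and the audio data to the second terminal comprises:
    encapsulating, by the first terminal, the target feature information and the audio data into a call data stream;
    transmitting, by the first terminal, the call data stream to the second terminal.
  8. The method according to any one of claims 1 to 7, characterized in that, before the first terminal transmits the target feature information and the audio data to the second terminal, the method further comprises:
    transmitting, by the first terminal, avatar number information to the second terminal, wherein the avatar number information instructs the second terminal to determine the target avatar from multiple avatars.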
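  9. An avatar-based video call method, characterized in that the method is applied to a second terminal communicating with a first terminal and comprises:
    receiving, by the second terminal, a call data stream transmitted by the first terminal, wherein the call data stream contains audio data and multiple frames of target feature information, and the multiple frames of target feature information comprise feature information representing a facial expression and a head movement of a user during a call;
    mapping, by the second terminal, the multiple frames of target feature information onto a preset target avatar to generate video call images, wherein the video call images contain the target avatar having the facial expression and the head movement;
    playing, by the second terminal, the audio data synchronously while displaying the video call images.
  10. The method according to claim 9, characterized in that the mapping, by the second terminal, the multiple frames of target feature information onto a preset target avatar to generate video call images comprises:
    splitting, by the second terminal, the audio data and the multiple frames of target feature information out of the call data stream;
    determining, by the second terminal, the facial expression and the head movement contained in each frame of target feature information;
    mapping, by the second terminal, the facial expression and the head movement contained in each frame of target feature information onto the preset target avatar to generate the video call images.
  11. The method according to claim 10, characterized in that each frame of target feature information comprises state information and coordinate information of multiple facial regions, and the determining, by the second terminal, the facial expression and the head movement contained in each frame of target feature information comprises:
    calculating, by the second terminal, an orientation of the user's head according to the coordinate information of the multiple facial regions;
    adjusting, by the second terminal, the orientation of the user's head according to the state information of the multiple facial regions, and simulating the facial expression and the head movement.
  12. The method according to claim 10 or 11, characterized in that the multiple frames of target feature information comprise target feature information corresponding to key video frames and target feature information corresponding to non-key video frames, the target feature information corresponding to a key video frame comprises complete feature information of the key video frame, and the target feature information corresponding to a non-key video frame comprises feature information that has changed in the non-key video frame; after the second terminal splits the audio data and the multiple frames of target feature information out of the call data stream, the method further comprises:
    generating, by the second terminal, complete feature information of the non-key video frames according to the complete feature information of the key video frames and the changed feature information in the non-key video frames.
  13. The method according to claim 9, characterized in that a first face recognition engine is configured in the first terminal, a second face recognition engine is configured in the second terminal, the first face recognition engine and the second face recognition engine are face recognition engines of the same type, the multiple frames of target feature information are original feature information recognized by the first face recognition engine, and the mapping, by the second terminal, the multiple frames of target feature information onto a preset target avatar to generate video call images comprises:
    mapping, by the second terminal using the second face recognition engine, the original feature information onto the target avatar to generate the video call images.
  14. The method according to any one of claims 9 to 13, characterized in that, before the second terminal receives the call data stream transmitted by the first terminal, the method further comprises:
    receiving, by the second terminal, avatar number information transmitted by the first terminal;
    determining, by the second terminal, the target avatar from multiple avatars according to the avatar number information.
  15. The method according to any one of claims 9 to 14, characterized in that the multiple frames of target feature information and the audio data carry timestamps, and the playing, by the second terminal, the audio data synchronously while displaying the video call images comprises:
    determining, by the second terminal, a timestamp of each frame of video call image according to the timestamps of the multiple frames of target feature information;
    synchronizing, by the second terminal, the video call images and the audio data according to the timestamp of each frame of video call image and the timestamps of the audio data.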
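  16. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the avatar-based video call method according to any one of claims 1 to 15.
  17. A communication system, comprising the first terminal and the second terminal according to any one of claims 1 to 15, and a communication device configured to establish a communication connection between the first terminal and the second terminal.
  18. A chip, characterized in that the chip comprises a memory and a processor, and the processor executes a computer program stored in the memory to implement the avatar-based video call method according to any one of claims 1 to 15.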
PCT/CN2021/137526 2020-12-29 2021-12-13 基于虚拟形象的视频通话方法、装置和终端 WO2022143128A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011608114.6A CN114710640B (zh) 2020-12-29 2020-12-29 基于虚拟形象的视频通话方法、装置和终端
CN202011608114.6 2020-12-29

Publications (1)

Publication Number Publication Date
WO2022143128A1 true WO2022143128A1 (zh) 2022-07-07

Family

ID=82166346

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137526 WO2022143128A1 (zh) 2020-12-29 2021-12-13 基于虚拟形象的视频通话方法、装置和终端

Country Status (2)

Country Link
CN (1) CN114710640B (zh)
WO (1) WO2022143128A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115359156A (zh) * 2022-07-31 2022-11-18 荣耀终端有限公司 音频播放方法、装置、设备和存储介质
CN115512017A (zh) * 2022-10-19 2022-12-23 深圳市诸葛瓜科技有限公司 一种基于人物特征的动漫形象生成系统及方法
WO2023122488A1 (en) * 2021-12-21 2023-06-29 Snap Inc. Avatar call platform
CN116823591A (zh) * 2023-05-05 2023-09-29 国政通科技有限公司 一种基于卷积神经元的人形检测去隐私化方法及装置
CN117809002A (zh) * 2024-02-29 2024-04-02 成都理工大学 一种基于人脸表情识别与动作捕捉的虚拟现实同步方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023208090A1 (en) * 2022-04-28 2023-11-02 Neufast Limited Method and system for personal identifiable information removal and data processing of human multimedia
CN116112761B (zh) * 2023-04-12 2023-06-27 海马云(天津)信息技术有限公司 生成虚拟形象视频的方法及装置、电子设备和存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102271241A (zh) * 2011-09-02 2011-12-07 北京邮电大学 一种基于面部表情/动作识别的图像通信方法及系统
WO2013027893A1 (ko) * 2011-08-22 2013-02-28 Kang Jun-Kyu 통신단말장치의 감정 컨텐츠 서비스 장치 및 방법, 이를 위한 감정 인지 장치 및 방법, 이를 이용한 감정 컨텐츠를 생성하고 정합하는 장치 및 방법
WO2013152454A1 (en) * 2012-04-09 2013-10-17 Intel Corporation System and method for avatar management and selection
CN103415003A (zh) * 2013-08-26 2013-11-27 苏州跨界软件科技有限公司 一种虚拟人物通话系统
CN103647922A (zh) * 2013-12-20 2014-03-19 百度在线网络技术(北京)有限公司 虚拟视频通话方法和终端
CN105407313A (zh) * 2015-10-28 2016-03-16 掌赢信息科技(上海)有限公司 一种视频通话方法、设备和系统
CN107911644A (zh) * 2017-12-04 2018-04-13 吕庆祥 基于虚拟人脸表情进行视频通话的方法及装置

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7616821B2 (en) * 2005-07-19 2009-11-10 International Business Machines Corporation Methods for transitioning compression levels in a streaming image system
CN106254869A (zh) * 2016-08-25 2016-12-21 腾讯科技(深圳)有限公司 一种视频数据的编解码方法、装置和系统
JP2019057057A (ja) * 2017-09-20 2019-04-11 富士ゼロックス株式会社 情報処理装置、情報処理システム及びプログラム
CN109348125B (zh) * 2018-10-31 2020-02-04 Oppo广东移动通信有限公司 视频校正方法、装置、电子设备和计算机可读存储介质
CN110572723A (zh) * 2019-08-30 2019-12-13 华为终端有限公司 一种缩略图生成的方法以及相关装置
CN112016513B (zh) * 2020-09-08 2024-01-30 北京达佳互联信息技术有限公司 视频语义分割方法、模型训练方法、相关装置及电子设备

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023122488A1 (en) * 2021-12-21 2023-06-29 Snap Inc. Avatar call platform
CN115359156A (zh) * 2022-07-31 2022-11-18 荣耀终端有限公司 音频播放方法、装置、设备和存储介质
CN115359156B (zh) * 2022-07-31 2023-12-05 荣耀终端有限公司 音频播放方法、装置、设备和存储介质
CN115512017A (zh) * 2022-10-19 2022-12-23 深圳市诸葛瓜科技有限公司 一种基于人物特征的动漫形象生成系统及方法
CN115512017B (zh) * 2022-10-19 2023-11-28 邝文武 一种基于人物特征的动漫形象生成系统及方法
CN116823591A (zh) * 2023-05-05 2023-09-29 国政通科技有限公司 一种基于卷积神经元的人形检测去隐私化方法及装置
CN116823591B (zh) * 2023-05-05 2024-02-02 国政通科技有限公司 一种基于卷积神经元的人形检测去隐私化方法及装置
CN117809002A (zh) * 2024-02-29 2024-04-02 成都理工大学 一种基于人脸表情识别与动作捕捉的虚拟现实同步方法
CN117809002B (zh) * 2024-02-29 2024-05-14 成都理工大学 一种基于人脸表情识别与动作捕捉的虚拟现实同步方法

Also Published As

Publication number Publication date
CN114710640A (zh) 2022-07-05
CN114710640B (zh) 2023-06-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913869

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21913869

Country of ref document: EP

Kind code of ref document: A1