WO2022193910A1 - Data processing method, apparatus and system, and electronic device and readable storage medium


Info

Publication number
WO2022193910A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
target
face key
face
key point
Prior art date
Application number
PCT/CN2022/077098
Other languages
French (fr)
Chinese (zh)
Inventor
陈伟杰 (CHEN Weijie)
Original Assignee
Guangdong OPPO Mobile Telecommunications Corp., Ltd. (Oppo广东移动通信有限公司)
Priority date
Filing date
Publication date
Application filed by Guangdong OPPO Mobile Telecommunications Corp., Ltd.
Publication of WO2022193910A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/168 Feature extraction; Face representation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/40 Support for services or applications
    • H04L 65/60 Network streaming of media packets

Definitions

  • the present application relates to the technical field of data processing, and in particular, to a data processing method, apparatus, system, electronic device and readable storage medium.
  • Multimedia refers to a combination of multiple media, generally including sound, images, and other media forms.
  • the sending end may send multimedia data such as images, sounds, and videos to the receiving end.
  • For example, user A's terminal can collect multimedia data such as images, sounds, or videos of user A and send the collected multimedia data to other users' terminals; user A's terminal can also receive multimedia data sent by other users' terminals.
  • Embodiments of the present application provide a data processing method, apparatus, system, electronic device, and readable storage medium, which can reduce the consumption of network bandwidth resources by the transmission of multimedia data between terminals.
  • In a first aspect, a data processing method is provided, comprising:
  • receiving multimedia feature information sent by a sending end, where the multimedia feature information includes face key point information obtained from a first video of a target object and voice text obtained by performing voice recognition on a first audio of the target object, wherein
  • the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video; and
  • processing the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a speech corresponding to the speech text is presented.
  • In a second aspect, a data processing method is provided, comprising:
  • acquiring a first video and a first audio of a target object, and acquiring face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video; and
  • performing voice recognition on the first audio to obtain voice text, and sending the face key point information and the voice text to a receiving end as multimedia feature information;
  • wherein the multimedia feature information is used for the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the voice text is presented.
  • In a third aspect, a data processing method is provided, comprising:
  • the sending end obtains a first video and a first audio of a target object, and obtains face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;
  • the sending end performs voice recognition on the first audio to obtain voice text, and sends the face key point information and the voice text to the receiving end as multimedia feature information; and
  • the receiving end processes the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the voice text is presented.
  • In a fourth aspect, a data processing apparatus is provided, comprising:
  • a receiving module, configured to receive multimedia feature information sent by a sending end, where the multimedia feature information includes face key point information obtained from a first video of a target object and voice text obtained by performing voice recognition on a first audio of the target object, and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video; and
  • a processing module, configured to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a speech corresponding to the speech text is presented.
  • In a fifth aspect, a data processing apparatus is provided, comprising:
  • an acquisition module, configured to acquire a first video and a first audio of a target object, and acquire face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video; and
  • a processing module, configured to perform voice recognition on the first audio to obtain voice text, and send the face key point information and the voice text to a receiving end as multimedia feature information;
  • wherein the multimedia feature information is used for the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the voice text is presented.
  • In a sixth aspect, a data processing system is provided, including a sending end and a receiving end;
  • the receiving end is configured to execute the data processing method described in the first aspect; and
  • the sending end is configured to execute the data processing method described in the second aspect.
  • In a seventh aspect, an electronic device is provided, comprising a memory and a processor, where a computer program is stored in the memory, and the computer program, when executed by the processor, causes the processor to execute the steps of the method described in the first aspect or the second aspect.
  • In an eighth aspect, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method described in the first aspect or the second aspect.
  • In the embodiments of the present application, the multimedia feature information includes face key point information obtained from the first video of the target object and voice text obtained by performing speech recognition on the first audio of the target object; the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. The multimedia feature information is then processed based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information and the voice corresponding to the voice text are presented. It can be seen that the sending end in the embodiments of the present application does not have to send the original first video and first audio;
  • the sending end sends only the face key point information and the voice text, which can significantly reduce the amount of data sent by the sending end, reduce the network bandwidth consumed by
  • the transmission of multimedia data between the sending end and the receiving end, and help improve the efficiency of data transmission.
  • FIG. 1 is a diagram of an application environment of a data processing method in one embodiment;
  • FIG. 2 is a flowchart of a data processing method in one embodiment;
  • FIG. 3 is a schematic diagram of exemplary face key points of a target object;
  • FIG. 4 is a schematic diagram of exemplary skeleton joint points of a target object in one embodiment;
  • FIG. 5 is a schematic diagram of an exemplary data transmission process of a video call between terminal A and terminal B in one embodiment;
  • FIG. 6 is a schematic diagram of an exemplary data transmission process between a viewer terminal and a host terminal in one embodiment;
  • FIG. 7 is a schematic diagram of an exemplary co-streaming (Lianmai) data transmission process between anchor A and anchor B in one embodiment;
  • FIG. 10 is a flowchart of obtaining biometric information in one embodiment;
  • FIG. 11 is a flowchart of a data processing method in one embodiment;
  • FIG. 13 is a schematic diagram of terminal A creating the target face model of user A and the target timbre feature of user A in one embodiment;
  • FIG. 14 is a schematic diagram of terminal B creating the target face model of user B and the target timbre feature of user B in one embodiment;
  • FIG. 15 is a schematic diagram of an exemplary exchange of target biometric information between terminal A and terminal B in one embodiment;
  • FIG. 16 is a schematic diagram of an exemplary process of multimedia data transmission between terminal A and terminal B in one embodiment;
  • FIG. 17 is a structural block diagram of a data processing apparatus in one embodiment;
  • FIG. 18 is a structural block diagram of a data processing apparatus in one embodiment;
  • FIG. 19 is a structural block diagram of a data processing apparatus in one embodiment;
  • FIG. 20 is a schematic diagram of the internal structure of an electronic device in one embodiment.
  • FIG. 1 is a schematic diagram of an implementation environment involved in a data processing method provided by an embodiment of the present application.
  • the implementation environment may include a sending end 110 and a receiving end 120, and communication between the sending end 110 and the receiving end 120 may be performed through a wired network or a wireless network.
  • the transmitting end 110 and the receiving end 120 may be, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices.
  • In one embodiment, the sending end 110 may obtain a first video and a first audio of a target object, and obtain face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. The sending end 110 may perform speech recognition on the first audio to obtain voice text, and send the face key point information and the voice text to the receiving end 120 as multimedia feature information. The receiving end 120 may process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information and the voice corresponding to the voice text are presented.
  • FIG. 2 shows a flowchart of a data processing method provided by an embodiment of the present application, and the data processing method may be applied to the receiving end 120 shown in FIG. 1 .
  • the data processing method may include the following steps:
  • Step 201 The receiving end receives the multimedia feature information sent by the transmitting end.
  • In one embodiment, the sending end can record video of the target object to obtain the first video and record the sound of the target object to obtain the first audio, and the sending end extracts multimedia feature information from the first video and the first audio.
  • The multimedia feature information includes the face key point information obtained by the sending end from the first video of the target object and the voice text obtained by the sending end performing speech recognition on the first audio of the target object; the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video.
  • The face key points of the target object may be, for example, key points corresponding to one or more facial regions of the target object, such as the eyebrows, eyes, nose, and mouth; see FIG. 3, which is a schematic diagram of exemplary face key points of a target object. It should be noted that the embodiments of the present application do not specifically limit the number of face key points or the facial regions indicated by them.
  • The sending end can use a key point detection algorithm to perform face key point detection on each video frame in the first video to obtain the coordinates of the target object's face key points in each video frame, and can use these coordinates as the face key point information.
  • The key point detection algorithm can be, for example, CPR (Cascaded Pose Regression) or AAM (Active Appearance Model), and so on.
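  • The following is a minimal, illustrative sketch of this per-frame extraction, assuming a hypothetical detector object whose detect() method wraps whichever key point detection algorithm is used; only the OpenCV video-reading calls are real APIs.

```python
# Illustrative sketch only: extract face key point information from a video.
# "detector" is a hypothetical stand-in for any face key point detector
# (e.g. one based on cascaded pose regression or an active appearance model).
from typing import Dict, List, Tuple

import cv2  # OpenCV, used here only to read video frames


def extract_face_keypoint_info(video_path: str, detector) -> List[Dict]:
    """Return, per frame, the coordinates of each face key point."""
    keypoint_info: List[Dict] = []
    capture = cv2.VideoCapture(video_path)
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of the first video
            break
        # detector.detect is assumed to return {keypoint_id: (x, y)}, e.g.
        # {"nose": (412.0, 305.5), "left_eye": (380.2, 270.1), ...}
        coords: Dict[str, Tuple[float, float]] = detector.detect(frame)
        keypoint_info.append({"frame_index": frame_index, "keypoints": coords})
        frame_index += 1
    capture.release()
    return keypoint_info
```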
  • The sending end performs speech recognition on the first audio, for example by inputting the first audio into a speech recognition model to obtain the voice text corresponding to the first audio; the speech recognition model can be, for example, an Automatic Speech Recognition (ASR) model.
  • In this way, the sending end obtains the face key point information and the voice text based on the first video and the first audio, and sends them to the receiving end as multimedia feature information.
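  • A sketch of the whole sending-end pipeline, under the same assumptions: speech_to_text() stands in for whatever ASR model is used, and JSON is merely one possible serialization of the multimedia feature information.

```python
# Illustrative sending-end pipeline: key points plus voice text, not raw media.
import json


def speech_to_text(audio_path: str) -> str:
    """Hypothetical ASR call; replace with a real speech recognition model."""
    raise NotImplementedError


def build_multimedia_feature_info(video_path: str, audio_path: str, detector) -> bytes:
    feature_info = {
        # reuses the extraction helper sketched above
        "face_keypoint_info": extract_face_keypoint_info(video_path, detector),
        "voice_text": speech_to_text(audio_path),
    }
    # This serialized feature information is what travels over the network
    # instead of the original first video and first audio.
    return json.dumps(feature_info).encode("utf-8")
```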
  • Step 202 The receiving end processes the multimedia feature information based on the pre-obtained biometric feature information to obtain target multimedia data.
  • The biometric information may be pre-acquired by the receiving end: for example, it may be acquired in advance from the sending end, preset when the receiving end leaves the factory, or obtained in advance from a server; the manner in which the receiving end obtains the biometric information is not specifically limited here.
  • the biometric information may include a face model and a timbre feature
  • the face model may be a three-dimensional face model, or a two-dimensional face model
  • The three-dimensional face model may be obtained by performing three-dimensional modeling on a two-dimensional face image, and the two-dimensional face model may be a two-dimensional face image.
  • the two-dimensional face image may be a two-dimensional face image of the target object or a two-dimensional face image of another user.
  • The timbre feature may be a timbre feature of the target object extracted from audio data of the target object, or a timbre feature of another user extracted from that user's audio data.
  • The receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data; for example, the receiving end processes the face key point information based on the face model to obtain a second video, and processes the voice text based on the timbre feature to obtain a second audio.
  • In one implementation, the receiving end can detect whether the coordinates of the face key points in the face key point information sent by the sending end are two-dimensional or three-dimensional coordinates. If they are three-dimensional coordinates, the receiving end first converts them into two-dimensional coordinates and then drives and renders the two-dimensional face model based on the converted two-dimensional coordinates of the face key points to obtain the second video; if they are two-dimensional coordinates, the receiving end directly drives and renders the two-dimensional face model based on them to obtain the second video.
  • the receiving end may also obtain a three-dimensional face model and a two-dimensional face model in advance.
  • In this case, the receiving end can likewise detect whether the coordinates of the face key points in the face key point information sent by the sending end are two-dimensional or three-dimensional coordinates. If they are three-dimensional coordinates, the receiving end selects the three-dimensional face model, drives it based on the three-dimensional coordinates of the face key points, projects the driven three-dimensional face model onto a two-dimensional plane, and renders it to obtain the second video; if they are two-dimensional coordinates, the receiving end selects the two-dimensional face model and drives and renders it based on the two-dimensional coordinates of the face key points to obtain the second video.
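  • A sketch of this dispatch, assuming hypothetical face-model objects exposing drive(), render(), and project_and_render() methods; the patent does not prescribe any particular API.

```python
# Choose the 2D or 3D face model by the dimensionality of the coordinates.
def render_second_video(keypoint_info, face_model_2d, face_model_3d):
    frames = []
    for frame_info in keypoint_info:
        coords = frame_info["keypoints"]
        sample = next(iter(coords.values()))  # one key point's coordinates
        if len(sample) == 3:
            driven = face_model_3d.drive(coords)        # move model key points
            frames.append(driven.project_and_render())  # 3D -> 2D plane, render
        else:
            driven = face_model_2d.drive(coords)        # 2D coordinates directly
            frames.append(driven.render())
    return frames
```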
  • Driving the face model based on the coordinates of the target object's face key points in each video frame can be understood as making the coordinates of the face model's key points follow the changes of the target object's face key points in each video frame.
  • Different coordinates of the face model's key points correspond to different expressions, such as smiling or anger.
  • The receiving end can fuse the timbre feature and the voice text, that is, use them to perform speech synthesis to obtain a second audio that, when played, has a sound matching the voice text and the timbre feature.
  • It should be noted that the receiving end only processes the received multimedia feature information to obtain the target multimedia data; when the target multimedia data is displayed, the dynamic expressions corresponding to the face key point information are presented, together with the voice corresponding to the voice text.
  • the face key point information is the feature information extracted from the first video
  • the voice text is the feature information extracted from the first audio
  • The data volume of the first video is much larger than that of the face key point information, and the data volume of the first audio is much larger than that of the voice text. Therefore, compared with directly sending the first video and the first audio, sending only the face key point information and the voice text significantly reduces the amount of data sent by the sending end, reduces the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, and improves the efficiency of data transmission.
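  • A back-of-envelope comparison (the numbers here are illustrative assumptions, not taken from the patent):

```python
# One uncompressed 720p RGB frame versus one frame's worth of key points.
raw_frame_bytes = 1280 * 720 * 3   # 2,764,800 bytes (~2.6 MiB)
keypoint_bytes = 68 * 2 * 4        # 68 key points x (x, y) x 4-byte float = 544 bytes
print(raw_frame_bytes // keypoint_bytes)  # ~5000x smaller per frame
```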
  • In one embodiment, after the sending end obtains the multimedia feature information from the first video and the first audio, it may also use a preset encoding rule to encode the multimedia feature information and send the encoded multimedia feature information to the receiving end.
  • The receiving end receives the encoded multimedia feature information, decodes it to obtain the multimedia feature information, and then processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data. By encoding the multimedia feature information in this way, its data volume can be further compressed, further reducing the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end.
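  • One possible encoding rule is sketched below; the patent leaves the preset encoding rule unspecified, so the binary layout here is an assumption made for illustration.

```python
# Pack key point coordinates into a compact binary layout, then deflate it.
import struct
import zlib


def encode_feature_info(keypoint_info, voice_text: str) -> bytes:
    payload = bytearray()
    for frame_info in keypoint_info:
        coords = frame_info["keypoints"]
        # frame index (uint32) + key point count (uint16)
        payload += struct.pack("<IH", frame_info["frame_index"], len(coords))
        for x, y in coords.values():
            payload += struct.pack("<ff", x, y)  # two float32 coordinates
    text_bytes = voice_text.encode("utf-8")
    payload += struct.pack("<I", len(text_bytes)) + text_bytes
    return zlib.compress(bytes(payload))


def decode_payload(data: bytes) -> bytes:
    # The receiving end inverts the steps: decompress, then unpack the layout.
    return zlib.decompress(data)
```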
  • In one embodiment, the multimedia feature information may further include skeleton joint point information obtained by the sending end from the first video of the target object; the skeleton joint point information includes the coordinates of the skeleton joint points of the target object in each video frame of the first video.
  • FIG. 4 is a schematic diagram of an exemplary skeleton joint point of a target object.
  • The 25 skeleton joint points include: nose 0, neck 1, right shoulder 2, right elbow 3, right wrist 4, left shoulder 5, left elbow 6, left wrist 7, sacrum 8, right waist 9, right knee 10, right ankle 11, left waist 12, left knee 13, left ankle 14, right eye 15, left eye 16, right ear 17, left ear 18, left toe one 19, left toe two 20, left heel 21, right toe one 22, right toe two 23, and right heel 24.
  • In practice, the skeleton joint points of the user may include some or all of the 25 skeleton joint points shown in FIG. 4; the embodiments of the present application do not specifically limit this.
  • the biometric information may also include a body posture model
  • the receiving end may obtain a portrait model by splicing the body posture model and the face model.
  • the receiving end drives the portrait model based on the face key point information and the skeleton joint point information, so that the portrait model changes with the coordinate changes of the face key points and the skeleton joint points of the target object in each video frame.
  • Different coordinates of the face key points correspond to different expressions of the portrait model, and different coordinates of the skeleton joint points correspond to different poses of the portrait model.
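  • A sketch of this splice-and-drive step, with hypothetical splice(), set_face_keypoints(), set_skeleton_joints(), and render() methods standing in for the behaviour described above:

```python
# Drive a spliced portrait model with face key points and skeleton joints.
def drive_portrait(face_model, body_posture_model, face_coords, joint_coords):
    portrait = face_model.splice(body_posture_model)  # face + body posture
    portrait.set_face_keypoints(face_coords)    # expression follows the face
    portrait.set_skeleton_joints(joint_coords)  # pose follows the skeleton
    return portrait.render()
```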
  • After the receiving end processes the multimedia feature information and obtains the target multimedia data, the receiving end can also display the target multimedia data.
  • any terminal can be used as a receiving end or a sending end.
  • FIG. 5 is an exemplary schematic diagram of data transmission of a video call between terminal A and terminal B.
  • As a sending end, terminal A can collect the first video and first audio of user A (that is, the target object), obtain face key point information from the first video, and perform speech recognition on the first audio to obtain voice text; the face key point information and the voice text serve as the multimedia feature information, where the face key point information includes the coordinates of user A's face key points in each video frame of the first video.
  • Terminal A sends the multimedia feature information corresponding to user A to terminal B, which acts as the receiving end. After receiving the multimedia feature information, terminal B processes user A's face key point information based on the pre-acquired face model corresponding to user A, and processes user A's voice text based on the pre-acquired timbre feature corresponding to user A, thereby obtaining target multimedia data corresponding to user A; the target multimedia data may include a second video and a second audio. Terminal B may display the obtained second video and second audio corresponding to user A in time synchronization.
  • terminal B may also perform the steps performed by the above-mentioned sending end to send the multimedia feature information corresponding to user B to terminal A.
  • Correspondingly, terminal A can display the second video and the second audio corresponding to user B (that is, the target multimedia data corresponding to user B) in time synchronization based on the multimedia feature information corresponding to user B, thereby realizing the video call process between terminal A and terminal B.
  • FIG. 5 only exemplarily shows two terminals (terminal A and terminal B), and in other embodiments, more terminals may also be included. Any terminal can send the extracted multimedia feature information to other terminals; any terminal can also receive corresponding multimedia feature information sent by other terminals, and display corresponding target multimedia data.
  • In this way, the video call process of multiple terminals can be realized by transmitting only multimedia feature information, without transmitting the original video and audio, which reduces excessive consumption of bandwidth resources during a multi-terminal video call and helps improve call quality.
  • the viewer terminal held by the viewer may be used as the receiving terminal in this embodiment of the present application, and the host terminal held by the anchor may be used as the transmitting terminal.
  • FIG. 6 is an exemplary schematic diagram of data transmission between a viewer terminal and a host terminal.
  • the host terminal may collect the first video and first audio of the host (ie, the target object).
  • The host terminal obtains face key point information from the first video and performs speech recognition on the first audio to obtain voice text; the face key point information and the voice text serve as the multimedia feature information, where the face key point information includes the coordinates of the host's face key points in each video frame of the first video.
  • the host terminal sends the multimedia feature information corresponding to the host to the viewer terminal, and the viewer terminal acts as a receiver.
  • After receiving the multimedia feature information, the viewer terminal processes the host's face key point information based on the pre-acquired face model corresponding to the host, and processes the host's voice text based on the pre-acquired timbre feature corresponding to the host, to obtain target multimedia data corresponding to the host; the target multimedia data may include a second video and a second audio.
  • the viewer terminal can display the second video and the second audio corresponding to the anchor in time synchronization.
  • In a live streaming scenario, different anchors can also co-stream ('Lianmai'), a common form of web live broadcast.
  • When co-streaming, different anchors can interact, and the co-streaming interface of each participating anchor's host terminal, as well as of the viewer terminals held by the audience, can simultaneously display the multimedia data corresponding to each co-streaming anchor.
  • In this case, the host terminal A of anchor A and the host terminal B of anchor B can each act as a sending end, and host terminal A, host terminal B, and the viewer terminal can each act as a receiving end.
  • FIG. 7 is an exemplary schematic diagram of data transmission in connection between host A and host B.
  • the anchor terminal A can collect the first video and the first audio of the anchor A (that is, the target object).
  • The face key point information and the voice text serve as the multimedia feature information, where the face key point information includes the coordinates of anchor A's face key points in each video frame of the first video.
  • the host terminal A sends the multimedia feature information corresponding to the host A to the viewer terminal and the host terminal B.
  • the host terminal B can also send the multimedia feature information corresponding to the host B to the viewer terminal and the host terminal A in a similar manner.
  • After receiving the multimedia feature information corresponding to anchor A, the viewer terminal processes anchor A's face key point information based on the pre-acquired face model corresponding to anchor A, and processes anchor A's voice text based on the pre-acquired timbre feature corresponding to anchor A, to obtain the target multimedia data corresponding to anchor A. Similarly, after receiving the multimedia feature information corresponding to anchor B, the viewer terminal can obtain the target multimedia data corresponding to anchor B through a similar process.
  • the viewer terminal can simultaneously display the target multimedia data corresponding to the anchor A and the target multimedia data corresponding to the anchor B on the viewer terminal.
  • Similar to the processing at the viewer terminal, after host terminal A receives the multimedia feature information corresponding to anchor B, it can process anchor B's face key point information based on the pre-acquired face model corresponding to anchor B, and process anchor B's voice text based on the pre-acquired timbre feature corresponding to anchor B, thereby obtaining the target multimedia data corresponding to anchor B and displaying it on host terminal A.
  • Of course, host terminal A may also display the multimedia data of anchor A and the target multimedia data corresponding to anchor B at the same time.
  • Similarly, host terminal B may also display the target multimedia data corresponding to anchor A and the multimedia data of anchor B.
  • In this way, the live video process can be realized by transmitting only the anchors' multimedia feature information, without transmitting original multimedia data between each host terminal and the viewer terminals, which reduces excessive consumption of bandwidth resources during live streaming and helps improve its communication quality.
  • To sum up, in this embodiment the multimedia feature information includes the face key point information obtained from the first video of the target object and the voice text obtained by performing voice recognition on the first audio of the target object; the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. Then, based on the pre-acquired biometric information, which includes a face model and a timbre feature, the multimedia feature information is processed to obtain the target multimedia data; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information and the voice corresponding to the voice text are presented.
  • It can be seen that the sending end does not send the first video and the first audio, but sends only the face key point information and the voice text; the receiving end then processes these to obtain target multimedia data that presents the dynamic expression corresponding to the face key point information and the voice corresponding to the voice text. Because the data volume of the first video is far greater than that of the face key point information, and the data volume of the first audio is far greater than that of the voice text, sending only the face key point information and the voice text significantly reduces the amount of data sent by the sending end, reduces the network bandwidth consumed by multimedia data transmission between terminals, and helps improve the efficiency of data transmission.
  • In one embodiment, this embodiment relates to how the receiving end processes the multimedia feature information based on the pre-acquired biometric information when the target multimedia data includes the second video.
  • On the basis of the above embodiment, step 202 may include step 801 shown in FIG. 8:
  • Step 801 the receiving end transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame to obtain a second video.
  • the face model may be a three-dimensional face model or a two-dimensional face model.
  • Here, the implementation process is described by taking the face model being a two-dimensional face model as an example. It can be understood that this example does not constitute a limitation on the type of the face model.
  • The face model can be a two-dimensional face image of the target object, and it includes the coordinates of multiple face key points, which can be key points of the eyebrows, eyes, nose, mouth, and other facial regions.
  • After receiving the multimedia feature information sent by the sending end, the receiving end transforms the coordinates of the corresponding face key points in the face model according to the coordinates of the target object's face key points in each video frame.
  • That is, the receiving end can transform the coordinates of each face key point in the face model into the coordinates of the corresponding face key point in the video frame; for example, the coordinates of the face key point "nose" in the face model are transformed into the coordinates of the face key point "nose" in the video frame.
  • In this way, the face model takes on the same expression as the corresponding video frame.
  • the following describes the process of how the receiving end transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame.
  • The receiving end can perform the following step A1 to realize the process of transforming the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame:
  • Step A1: based on the time sequence order of the video frames in the first video, the receiving end sequentially transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame.
  • The face key point information may further include the time sequence order of each video frame in the first video and the identification of each face key point.
  • According to the sequence of the video frames from front to back in the first video, the receiving end can transform the coordinates of the face key points in the face model according to the coordinates of the face key points in each video frame in turn.
  • Specifically, for the first video frame, the receiving end can determine, according to the identification of each face key point in the face key point information, the face key point in the face model corresponding to each identification; the receiving end then transforms the coordinates of these face key points in the face model into the coordinates of the corresponding face key points in the video frame, so that the face model has the same expression as that video frame. Then, for the second video frame, the third video frame, and so on in the first video, the receiving end performs the same steps as for the first video frame, so that the face model is driven in frame-sequence order; the driven face model is rendered to obtain the second video.
  • In this embodiment, the second video can be generated simply by transforming the coordinates of the face key points in the face model. The operation mainly involves processing the coordinates of the face key points and does not need to process other parameters of the face model, which helps reduce the amount of computation at the receiving end and improve its performance.
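  • A sketch of step A1, assuming the face model exposes a set_coord() method keyed by key point identification and a renderer turns each driven model state into a video frame:

```python
# Transform the face model's key point coordinates frame by frame, in the
# time-sequence order of the first video, matching key points by identifier.
def transform_in_sequence(face_model, keypoint_info, renderer):
    frames = []
    for frame_info in sorted(keypoint_info, key=lambda f: f["frame_index"]):
        for keypoint_id, coords in frame_info["keypoints"].items():
            # e.g. move the model's "nose" to the coordinates of "nose"
            # in this video frame
            face_model.set_coord(keypoint_id, coords)
        frames.append(renderer.render(face_model))  # one second-video frame
    return frames
```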
  • In one embodiment, this embodiment relates to how the receiving end processes the multimedia feature information based on the pre-acquired biometric information when the target multimedia data includes the second audio.
  • On the basis of the above embodiment, step 202 may include step 901 shown in FIG. 9:
  • Step 901 the receiving end inputs the timbre feature and the speech text into the speech synthesis model to obtain the second audio.
  • The timbre feature may be obtained by performing Fourier analysis on the audio data of the target object to obtain a spectrum, and extracting spectral features from it.
  • The receiving end inputs the timbre feature and the speech text into a text-to-speech (TTS) speech synthesis model to obtain the second audio output by the model.
  • In this embodiment, the receiving end processes the speech text according to the pre-acquired timbre feature, so that the synthesized second audio has the characteristics represented by the timbre feature. The receiving end therefore does not need to obtain the first audio collected by the sending end through its microphone; from the voice text extracted from the first audio and the timbre feature alone, it can synthesize a second audio with the same timbre as the first audio. While ensuring the fidelity of the second audio, this greatly reduces the amount of data transmitted between the receiving end and the sending end, which helps improve the efficiency of multimedia data transmission.
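  • A sketch of this synthesis step; the tts_model interface is an assumption, but any TTS model conditioned on a speaker or timbre embedding has this general shape:

```python
# Fuse the pre-acquired timbre feature with the received voice text.
def synthesize_second_audio(tts_model, voice_text: str, timbre_feature):
    # timbre_feature may be a spectral feature extracted beforehand, e.g.
    # via Fourier analysis of the target object's audio data
    return tts_model.synthesize(text=voice_text, timbre=timbre_feature)
```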
  • In one embodiment, this embodiment relates to the process of how the receiving end acquires the biometric information. As shown in FIG. 10, before step 202, the data processing method of this embodiment further includes step 203 and step 204:
  • Step 203: the receiving end acquires the terminal identification corresponding to the sending end and, according to the terminal identification, detects whether target biometric information of the target object corresponding to the terminal identification is stored.
  • the target biometric information includes the target face model and the target timbre feature.
  • The sending end may collect a face image of the target object and determine it as the target face model of the target object; the sending end may also perform feature extraction on audio data of the target object to obtain the target timbre feature of the target object.
  • The receiving end can obtain the terminal identification corresponding to the sending end, and request the target biometric information of the target object from the sending end according to the terminal identification; if the target biometric information of the target object is received, the receiving end can store it in association with the terminal identification.
  • In this way, after receiving the multimedia feature information sent by the sending end, the receiving end can detect, according to the terminal identification of the sending end, whether the target biometric information of the target object corresponding to that terminal identification is stored.
  • Step 204 if the target biometric information is stored, the receiving end uses the target biometric information as the biometric information.
  • If the target biometric information of the target object is stored, it means that the receiving end has already acquired it from the sending end during a previous multimedia data transmission. The receiving end can then process the multimedia feature information sent by the sending end based on the target biometric information of the target object.
  • In this way, the receiving end can faithfully restore the real multimedia data of the target object at the sending end based on the target biometric information, ensuring the effect of the multimedia data transmission.
  • In one embodiment, the data processing method further includes step 205, step 206, step 207, and step 208:
  • Step 205 if the target biometric information is not stored, the receiving end detects whether the current data transmission rate is greater than the transmission rate threshold.
  • the receiving end determines whether to request the target biometric information from the sending end according to the current network quality.
  • Specifically, the receiving end detects whether the current data transmission rate is greater than a transmission rate threshold, and the transmission rate threshold can be set as required in implementation.
  • Step 206: if the current data transmission rate is greater than the transmission rate threshold, the receiving end sends an acquisition request to the sending end, where the acquisition request is used to request the sending end to return the target biometric information.
  • If the current data transmission rate is greater than the transmission rate threshold, it indicates that the current network quality of the receiving end is good, and the receiving end requests the target biometric information from the sending end.
  • Step 207 The receiving end receives the target biometric information sent by the transmitting end based on the acquisition request, and uses the target biometric information as the biometric information.
  • Step 208 if the current data transmission rate is less than or equal to the transmission rate threshold, the receiving end acquires the pre-stored general biometric information, and uses the general biometric information as the biometric information.
  • the general biometric information includes a general face model and a general timbre feature.
  • the general biometric information can be any preset user's face image and timbre features.
  • The receiving end can also process the multimedia feature information based on the locally stored general biometric information to obtain the target multimedia data.
  • Thus, even if the receiving end cannot obtain the target biometric information of the target object, multimedia data transmission can still be realized, reducing the demand of multimedia data transmission on network bandwidth and saving network bandwidth resources.
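  • The acquisition logic of steps 203 to 208 can be summarized as below; the store, network probe, and request call are hypothetical stand-ins for the described behaviour, and the threshold value is an arbitrary example.

```python
RATE_THRESHOLD_BPS = 1_000_000  # example transmission rate threshold


def get_biometric_info(terminal_id, store, network, sender):
    target = store.get(terminal_id)       # step 203: look up by terminal id
    if target is not None:
        return target                     # step 204: use stored target info
    if network.current_rate_bps() > RATE_THRESHOLD_BPS:
        target = sender.request_biometric_info()  # steps 206-207: ask sender
        store.put(terminal_id, target)    # keep for later transmissions
        return target
    return store.get_general()            # step 208: general fallback
```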
  • FIG. 11 shows a flowchart of a data processing method provided by an embodiment of the present application; the data processing method may be applied to the sending end 110 shown in FIG. 1.
  • the data processing method may include the following steps:
  • Step 111 the sending end obtains the first video and the first audio of the target object, and obtains face key point information from the first video.
  • The sending end can record video of the target object to obtain the first video and record the target object's sound to obtain the first audio; the sending end then obtains face key point information from the first video of the target object, where the face key point information includes the coordinates of the target object's face key points in each video frame of the first video.
  • The first video may include multiple video frames; for each video frame in the first video, the sending end performs face key point detection on the video frame to obtain the coordinates of the target object's face key points in that video frame, and the sending end generates the face key point information based on the coordinates of the target object's face key points in each video frame.
  • the sending end can input each video frame into the face key point detection model to obtain the coordinates of each face key point in the video frame output by the face key point detection model, wherein the face key point detection model It can be any pre-trained deep learning model for facial keypoint detection.
  • the sender takes the coordinates of the face key points in each video frame as the face key point information.
  • In one embodiment, the sending end may also add a corresponding identification to the coordinates of each face key point in the face key point information, together with the time sequence order of each video frame in the first video, and use the augmented face key point information as the final face key point information.
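  • One possible shape of the final face key point information, assumed here purely for illustration, with an identification per key point and each frame's time-sequence order:

```python
face_keypoint_info = [
    {
        "frame_index": 0,  # time-sequence order within the first video
        "keypoints": {
            "nose": (412.0, 305.5),       # identification -> coordinates
            "left_eye": (380.2, 270.1),
            "right_eye": (444.9, 271.3),
        },
    },
    # ... one entry per video frame of the first video
]
```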
  • Step 112 the sending end performs speech recognition on the first audio to obtain speech text, and sends the face key point information and the speech text to the receiving end as multimedia feature information.
  • the sending end may input the first audio into an automatic speech recognition model (Automatic Speech Recognition, ASR) to obtain the speech text output by the automatic speech recognition model.
  • The sending end sends the face key point information and the voice text to the receiving end as multimedia feature information; the multimedia feature information is used by the receiving end to process it based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature.
  • For the process in which the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data, reference may be made to the above embodiments; details are not repeated here.
  • a data processing method is provided, and the data processing method can be applied in the implementation environment shown in FIG. 1 .
  • the method may include the following steps:
  • Step 121 the sending end obtains the first video and the first audio of the target object, and obtains face key point information from the first video.
  • the face key point information includes the coordinates of the face key point of the target object in each video frame of the first video.
  • step 122 the transmitting end performs speech recognition on the first audio to obtain speech text, and sends the face key point information and the speech text to the receiving end as multimedia feature information.
  • Step 123 The receiving end processes the multimedia feature information based on the pre-obtained biometric feature information to obtain target multimedia data.
  • the biometric information includes a face model and a timbre feature.
  • the target multimedia data is displayed, the dynamic expression corresponding to the key point information of the face is presented, and the voice corresponding to the voice text is presented.
  • The specific limitations and beneficial effects of step 121, step 122, and step 123 in this embodiment are similar to those in the foregoing embodiments; reference may be made to the description above, which is not repeated here.
  • Step a: terminal A is preset with a general face model and a general timbre feature; terminal A uses the face image of user A as the target face model of user A, and extracts the target timbre feature of user A from user A's audio data.
  • FIG. 13 is an exemplary schematic diagram of terminal A creating a target face model of user A and a target timbre feature of user A.
  • Step b: terminal B is preset with a general face model and a general timbre feature; terminal B uses the face image of user B as the target face model of user B, and extracts the target timbre feature of user B from user B's audio data.
  • FIG. 14 is an exemplary schematic diagram of terminal B creating the target face model of user B and the target timbre feature of user B.
  • Step c: if terminal A and terminal B need to transmit multimedia data, and terminal A does not store user B's target face model and target timbre feature while terminal B does not store user A's target face model and target timbre feature, then
  • terminal A requests user B's target face model and target timbre feature from terminal B, and
  • terminal B requests user A's target face model and target timbre feature from terminal A, thereby completing the exchange of target biometric information.
  • FIG. 15 is a schematic diagram of an exemplary exchange of target biometric information between terminal A and terminal B.
  • Step d: terminal A collects the first video of user A through a camera and inputs each video frame of user A's first video into the face key point detection model to obtain user A's face key point information; terminal A collects the first audio of user A through a microphone and inputs it into the automatic speech recognition model to obtain user A's voice text.
  • Step e: terminal B collects the first video of user B through a camera and inputs each video frame of user B's first video into the face key point detection model to obtain user B's face key point information; terminal B collects the first audio of user B through a microphone and inputs it into the automatic speech recognition model to obtain user B's voice text.
  • Step f: terminal A sends user A's face key point information and user A's voice text to terminal B as user A's multimedia feature information.
  • Step g: terminal B sends user B's face key point information and user B's voice text to terminal A as user B's multimedia feature information.
  • Step h: terminal A processes user B's face key point information based on user B's target face model to obtain user B's second video; terminal A processes user B's voice text based on user B's target timbre feature to obtain user B's second audio; terminal A displays user B's second video and second audio as user B's target multimedia data.
  • Step i: terminal B processes user A's face key point information based on user A's target face model to obtain user A's second video; terminal B processes user A's voice text based on user A's target timbre feature to obtain user A's second audio; terminal B displays user A's second video and second audio as user A's target multimedia data.
  • FIG. 16 is a schematic diagram of an exemplary process of terminal A and terminal B performing multimedia data transmission.
  • It should be understood that although the steps in the flowcharts of the above embodiments are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least a part of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least a part of the sub-steps or stages of other steps.
  • FIG. 17 is a structural block diagram of a data processing apparatus according to an embodiment. As shown in Figure 17, the device includes:
  • The receiving module 1701 is configured to receive the multimedia feature information sent by the sending end, where the multimedia feature information includes face key point information obtained from the first video of the target object and voice text obtained by performing voice recognition on the first audio of the target object,
  • and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;
  • The processing module 1702 is configured to process the multimedia feature information based on the pre-acquired biometric information to obtain target multimedia data; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the voice text is presented.
  • the target multimedia data includes a second video
  • the processing module 1702 includes:
  • The coordinate transformation unit 1702a is configured to transform the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each of the video frames, so as to obtain the second video.
  • In one embodiment, the coordinate transformation unit 1702a is specifically configured to, based on the time sequence order of the video frames in the first video, sequentially transform the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each of the video frames.
  • the target multimedia data includes second audio; the processing module 1702 is specifically configured to input the timbre feature and the speech text into the speech synthesis model to obtain the second audio.
  • the apparatus further includes:
  • a first detection module, configured to obtain the terminal identification corresponding to the sending end and detect, according to the terminal identification, whether target biometric information of the target object corresponding to the terminal identification is stored, where the target biometric information includes the target face model and the target timbre feature;
  • a first determining module configured to use the target biometric information as the biometric information if the target biometric information is stored
  • a second detection module configured to detect whether the current data transmission rate is greater than the transmission rate threshold if the target biometric information is not stored
  • a request module configured to send an acquisition request to the sender if the current data transmission rate is greater than the transmission rate threshold, where the acquisition request is used to request the sender to return the target biometric information
  • a second determining module configured to receive the target biometric information sent by the sender based on the acquisition request, and use the target biometric information as the biometric information.
  • a third determining module, configured to acquire pre-stored general biometric information if the current data transmission rate is less than or equal to the transmission rate threshold, and use the general biometric information as the biometric information, where the general biometric information includes a general face model and a general timbre feature.
  • each module in the above data processing apparatus is only used for illustration. In other embodiments, the data processing apparatus may be divided into different modules as required to complete all or part of the functions of the above data processing apparatus.
  • Each module in the above-mentioned data processing apparatus can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • FIG. 19 is a structural block diagram of a data processing apparatus according to an embodiment. As shown in FIG. 19, the apparatus includes:
  • the obtaining module 1901 is configured to obtain the first video and first audio of the target object, and to obtain face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;
  • the processing module 1902 is configured to perform speech recognition on the first audio to obtain speech text, and to send the face key point information and the speech text to the receiving end as multimedia feature information;
  • the multimedia feature information is used by the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information is presented, and the speech corresponding to the speech text is presented.
  • the obtaining module 1901 is specifically configured to perform face key point detection on each video frame in the first video to obtain the coordinates of the face key points of the target object in that video frame, and to generate the face key point information based on the coordinates of the face key points of the target object in each video frame.
  • The division of modules in the above data processing apparatus is for illustration only. In other embodiments, the data processing apparatus may be divided into different modules as required to complete all or part of the functions of the apparatus.
  • Each module in the above-mentioned data processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a data processing system includes a transmitter and a receiver;
  • the receiving end is configured to execute the above method steps for the receiving end, so as to realize the above method embodiments for the receiving end.
  • the sending end is configured to execute the above method steps for the sending end, so as to realize the above method embodiments for the sending end.
  • FIG. 20 is a schematic diagram of the internal structure of an electronic device in one embodiment.
  • the electronic device includes a processor and a memory connected by a system bus.
  • the processor is used to provide computing and control capabilities to support the operation of the entire electronic device.
  • the memory may include non-volatile storage media and internal memory.
  • the nonvolatile storage medium stores an operating system and a computer program.
  • the computer program can be executed by the processor to implement a data processing method provided by the following embodiments.
  • The internal memory provides a cached execution environment for the operating system and the computer program in the non-volatile storage medium.
  • the electronic device may be any terminal device such as a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sale) terminal, a vehicle-mounted computer, a wearable device, and the like.
  • each module in the data processing apparatus provided in the embodiments of the present application may be in the form of a computer program.
  • the computer program can be run on a terminal or server.
  • the program modules constituted by the computer program can be stored in the memory of the electronic device. When the computer program is executed by the processor, the steps of the methods described in the embodiments of the present application are implemented.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • One or more non-transitory computer-readable storage media containing computer-executable instructions, when executed by one or more processors, cause the processors to perform the steps of a data processing method.
  • Embodiments of the present application also provide a computer program product containing instructions, which, when run on a computer, cause the computer to execute the steps of the data processing method.
  • Any reference to a memory, storage, database, or other medium as used herein may include non-volatile and/or volatile memory.
  • the non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory.
  • Volatile memory may include random access memory (RAM), which acts as external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are a data processing method, apparatus and system, and an electronic device and a readable storage medium. The method comprises: receiving multimedia feature information, which is sent by a sending end; and processing the multimedia feature information on the basis of pre-acquired biometric information, so as to obtain target multimedia data. By means of the technical solution provided in the embodiments of the present application, the consumption of network bandwidth resources in multimedia data transmission between terminals can be reduced.

Description

Data processing method, apparatus, system, electronic device and readable storage medium

Technical Field

The present application relates to the technical field of data processing, and in particular, to a data processing method, apparatus, system, electronic device and readable storage medium.

Background

With the development of mobile communication technology and the popularization of intelligent terminals, the forms of data transmitted between terminals are becoming increasingly diverse, and multimedia data is a typical data form. Multimedia refers to a combination of multiple media, generally including sound, images, and other media forms.

Taking multimedia data transmitted between terminals as an example, the sending end may send multimedia data such as images, sounds, and videos to the receiving end. For example, user A's terminal can collect multimedia data such as images, sounds, or videos of user A and send the collected multimedia data to other users' terminals; user A's terminal can also receive other users' multimedia data sent by those users' terminals.

However, transmitting large amounts of multimedia data between terminals places a heavy load on network bandwidth. How to reduce the consumption of network bandwidth resources by the transmission of multimedia data between terminals has therefore become an urgent problem to be solved.

Summary of the Invention

Embodiments of the present application provide a data processing method, apparatus, system, electronic device, and readable storage medium, which can reduce the consumption of network bandwidth resources by the transmission of multimedia data between terminals.
In a first aspect, a data processing method is provided, the method comprising:

receiving multimedia feature information sent by a sending end, where the multimedia feature information includes face key point information obtained from a first video of a target object and speech text obtained by performing speech recognition on a first audio of the target object, and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;

processing the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the speech text is presented.
In a second aspect, a data processing method is provided, the method comprising:

obtaining a first video and a first audio of a target object, and obtaining face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;

performing speech recognition on the first audio to obtain speech text, and sending the face key point information and the speech text to a receiving end as multimedia feature information;

where the multimedia feature information is used by the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and a voice corresponding to the speech text is presented.
In a third aspect, a data processing method is provided, the method comprising:

a sending end obtaining a first video and a first audio of a target object, and obtaining face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;

the sending end performing speech recognition on the first audio to obtain speech text, and sending the face key point information and the speech text to a receiving end as multimedia feature information;

the receiving end processing the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the speech text is presented.
In a fourth aspect, a data processing apparatus is provided, the apparatus comprising:

a receiving module configured to receive multimedia feature information sent by a sending end, where the multimedia feature information includes face key point information obtained from a first video of a target object and speech text obtained by performing speech recognition on a first audio of the target object, and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;

a processing module configured to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the speech text is presented.
In a fifth aspect, a data processing apparatus is provided, the apparatus comprising:

an obtaining module configured to obtain a first video and a first audio of a target object and to obtain face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;

a processing module configured to perform speech recognition on the first audio to obtain speech text, and to send the face key point information and the speech text to a receiving end as multimedia feature information;

where the multimedia feature information is used by the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and a voice corresponding to the speech text is presented.
In a sixth aspect, a data processing system is provided, the system including a sending end and a receiving end;

the receiving end is configured to execute the data processing method described in the first aspect;

the sending end is configured to execute the data processing method described in the second aspect.
In a seventh aspect, an electronic device is provided, comprising a memory and a processor, where a computer program is stored in the memory, and when the computer program is executed by the processor, the processor is caused to execute the steps of the method described in the first aspect or the second aspect.

In an eighth aspect, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the method described in the first aspect or the second aspect are implemented.
The beneficial effects brought by the technical solutions provided in the embodiments of the present application include at least the following:

The receiving end receives multimedia feature information sent by the sending end, where the multimedia feature information includes face key point information obtained from the first video of the target object and speech text obtained by performing speech recognition on the first audio of the target object, and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. The receiving end then processes the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and a voice corresponding to the speech text is presented. It can be seen that in the embodiments of the present application the sending end does not have to send the original first video and first audio, but only the face key point information and the speech text; the receiving end then processes the face key point information and the speech text to obtain target multimedia data that presents the dynamic expression corresponding to the face key point information and the voice corresponding to the speech text. Since the data volume of the first video is far greater than that of the face key point information, and the data volume of the first audio is far greater than that of the speech text, sending only the face key point information and the speech text, rather than the first video and first audio themselves, significantly reduces the amount of data sent by the sending end, reduces the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, and helps improve the efficiency of data transmission.
Brief Description of the Drawings

FIG. 1 is a diagram of an application environment of a data processing method in one embodiment;

FIG. 2 is a flowchart of a data processing method in one embodiment;

FIG. 3 is a schematic diagram of exemplary face key points of a target object;

FIG. 4 is a schematic diagram of exemplary skeleton joint points of a target object in one embodiment;

FIG. 5 is a schematic diagram of an exemplary data transmission process of a video call between terminal A and terminal B in one embodiment;

FIG. 6 is a schematic diagram of an exemplary data transmission process between a viewer terminal and a host terminal in one embodiment;

FIG. 7 is a schematic diagram of an exemplary co-streaming data transmission process between host A and host B in one embodiment;

FIG. 8 is a flowchart of a data processing method in another embodiment;

FIG. 9 is a flowchart of a data processing method in another embodiment;

FIG. 10 is a flowchart of obtaining biometric information in one embodiment;

FIG. 11 is a flowchart of a data processing method in one embodiment;

FIG. 12 is a flowchart of a data processing method in one embodiment;

FIG. 13 is a schematic diagram of terminal A creating a target face model and a target timbre feature of user A in one embodiment;

FIG. 14 is a schematic diagram of terminal B creating a target face model and a target timbre feature of user B in one embodiment;

FIG. 15 is a schematic diagram of an exemplary exchange of target biometric information between terminal A and terminal B in one embodiment;

FIG. 16 is a schematic diagram of an exemplary process of multimedia data transmission between terminal A and terminal B in one embodiment;

FIG. 17 is a structural block diagram of a data processing apparatus in one embodiment;

FIG. 18 is a structural block diagram of a data processing apparatus in one embodiment;

FIG. 19 is a structural block diagram of a data processing apparatus in one embodiment;

FIG. 20 is a schematic diagram of the internal structure of an electronic device in one embodiment.
Detailed Description of the Embodiments

In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.

The implementation environment involved in the data processing method provided by the embodiments of the present application is briefly described below.

FIG. 1 is a schematic diagram of an implementation environment involved in a data processing method provided by an embodiment of the present application. As shown in FIG. 1, the implementation environment may include a sending end 110 and a receiving end 120, and the sending end 110 and the receiving end 120 may communicate through a wired or wireless network.
The sending end 110 and the receiving end 120 may be, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.

In the implementation environment shown in FIG. 1, the sending end 110 may obtain the first video and the first audio of the target object and obtain face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. The sending end 110 may perform speech recognition on the first audio to obtain speech text, and send the face key point information and the speech text to the receiving end 120 as multimedia feature information. The receiving end 120 may process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the speech text is presented.
Please refer to FIG. 2, which shows a flowchart of a data processing method provided by an embodiment of the present application; the data processing method may be applied to the receiving end 120 shown in FIG. 1. As shown in FIG. 2, the data processing method may include the following steps:

Step 201: the receiving end receives the multimedia feature information sent by the sending end.

In the process of multimedia data transmission between the sending end and the receiving end, the sending end may record a video of the target object to obtain the first video and record the sound of the target object to obtain the first audio, and then extract multimedia feature information from the first video and the first audio. The multimedia feature information includes face key point information obtained by the sending end from the first video of the target object and speech text obtained by the sending end performing speech recognition on the first audio of the target object, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video.
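As a minimal sketch (not part of the original disclosure), the multimedia feature information could be represented roughly as follows; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class MultimediaFeatureInfo:
    # Frame index -> {face key point identifier -> (x, y) or (x, y, z)}.
    face_keypoints: Dict[int, Dict[int, Tuple[float, ...]]]
    # Speech text recognized from the first audio.
    speech_text: str
```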
The process by which the sending end obtains the face key point information and the speech text is introduced below.

The face key points of the target object may be, for example, several key points corresponding to one or more facial regions such as the eyebrows, eyes, nose, and mouth of the target object; see FIG. 3, which is a schematic diagram of exemplary face key points of a target object. It should be noted that the embodiments of the present application do not specifically limit the number of face key points or the facial regions indicated by the face key points.

The sending end may use a key point detection algorithm to perform face key point detection on each video frame in the first video to obtain the coordinates of the face key points of the target object in each video frame, and use these coordinates as the face key point information. The key point detection algorithm may be, for example, CPR (Cascaded Pose Regression) or AAM (Active Appearance Model), and so on.
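A minimal Python sketch of this per-frame detection loop is shown below, assuming OpenCV is available for frame decoding and abstracting the detector (for example a CPR- or AAM-based model) behind a hypothetical `detect_keypoints` callable.

```python
import cv2  # OpenCV, assumed available for decoding the first video

def extract_face_keypoint_info(video_path, detect_keypoints):
    """Run a face key point detector on every frame of the first video and
    collect the coordinates as the face key point information."""
    keypoint_info = {}
    cap = cv2.VideoCapture(video_path)
    frame_index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # `detect_keypoints` is a hypothetical callable returning
        # {keypoint_id: (x, y)} for the target object's face in this frame.
        keypoint_info[frame_index] = detect_keypoints(frame)
        frame_index += 1
    cap.release()
    return keypoint_info
```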
To perform speech recognition on the first audio, the sending end may input the first audio into a speech recognition model to obtain the speech text corresponding to the first audio; the speech recognition model may be, for example, an automatic speech recognition (ASR) model.
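Putting the two extraction steps together, a sender-side sketch might look like the following; the `transcribe` method stands in for whichever ASR model is used and is an assumed interface, and `MultimediaFeatureInfo` and `extract_face_keypoint_info` refer to the sketches above.

```python
def build_multimedia_feature_info(first_video_path, first_audio_samples,
                                  detect_keypoints, asr_model):
    # Assemble the two feature streams that replace the raw first video
    # and first audio in transmission.
    return MultimediaFeatureInfo(
        face_keypoints=extract_face_keypoint_info(first_video_path,
                                                  detect_keypoints),
        speech_text=asr_model.transcribe(first_audio_samples),
    )
```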
In this way, through the above implementation, the sending end obtains the face key point information and the speech text based on the first video and the first audio, and sends them to the receiving end as multimedia feature information.

Step 202: the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain target multimedia data.
The biometric information may be acquired by the receiving end in advance. For example, it may be obtained in advance from the sending end, preset in the receiving end at the factory, or obtained in advance from a server; the manner in which the receiving end acquires the biometric information is not specifically limited here.

In the embodiments of the present application, the biometric information may include a face model and a timbre feature. The face model may be a three-dimensional face model or a two-dimensional face model: a three-dimensional face model may be obtained by three-dimensional modeling of a two-dimensional face image, while a two-dimensional face model may simply be a two-dimensional face image. The two-dimensional face image may be a face image of the target object or a face image of another user. Likewise, the timbre feature may be extracted from audio data of the target object or from audio data of another user.

As one implementation, the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data by processing the face key point information based on the face model to obtain the second video, and processing the speech text based on the timbre feature to obtain the second audio.

The process by which the receiving end processes the face key point information based on the face model to obtain the second video is introduced below.

In one possible implementation, if the face model pre-acquired by the receiving end is a two-dimensional face model, the receiving end may check whether the coordinates of the face key points in the received face key point information are two-dimensional or three-dimensional. If the coordinates are three-dimensional, the receiving end first converts them into two-dimensional coordinates and then drives and renders the two-dimensional face model based on the converted two-dimensional coordinates to obtain the second video; if the coordinates are two-dimensional, the receiving end directly drives and renders the two-dimensional face model based on them to obtain the second video.

In another possible implementation, the receiving end may acquire both a three-dimensional face model and a two-dimensional face model in advance. In this case, the receiving end checks whether the coordinates of the face key points in the received face key point information are two-dimensional or three-dimensional. If the coordinates are three-dimensional, the receiving end selects the three-dimensional face model, drives it based on the three-dimensional coordinates of the face key points, and then projects the driven three-dimensional face model onto a two-dimensional plane and renders it to obtain the second video; if the coordinates are two-dimensional, the receiving end selects the two-dimensional face model and drives and renders it based on the two-dimensional coordinates to obtain the second video.
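The dimensionality check and model selection described in these two implementations could be sketched as follows; the `drive` and `project_2d` methods are hypothetical interfaces of the face models, not APIs from the original disclosure.

```python
def render_second_video(keypoint_info, model_2d, model_3d=None):
    # Inspect one coordinate tuple to decide whether the sender transmitted
    # 2-D or 3-D face key point coordinates (an illustrative heuristic).
    first_frame = keypoint_info[min(keypoint_info)]
    dims = len(next(iter(first_frame.values())))
    frames = []
    for _, coords in sorted(keypoint_info.items()):
        if dims == 3 and model_3d is not None:
            # Drive the 3-D face model, then project it to a 2-D plane.
            frames.append(model_3d.drive(coords).project_2d())
        else:
            if dims == 3:
                # Only a 2-D model is available: drop the depth component.
                coords = {k: xyz[:2] for k, xyz in coords.items()}
            frames.append(model_2d.drive(coords))
    return frames  # rendered frame sequence of the second video
```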
Driving the face model based on the coordinates of the face key points of the target object in each video frame can be understood as making the coordinates of the face key points of the face model follow the changes of the coordinates of the face key points of the target object across the video frames; different face key point coordinates of the face model correspond to different expressions, for example, smiling, annoyance, anger, and so on.

The process by which the receiving end processes the speech text based on the timbre feature to obtain the second audio is introduced below.

The receiving end can fuse the timbre feature and the speech text, that is, use them together to perform speech synthesis to obtain the second audio; when played, the second audio has a sound matching both the speech text and the timbre feature.

In this way, the receiving end only needs to process the received multimedia feature information to obtain the target multimedia data; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information is presented, and the voice corresponding to the speech text is presented.
In the embodiments of the present application, since the face key point information is feature information extracted from the first video and the speech text is feature information extracted from the first audio, the data volume of the first video is far greater than that of the face key point information, and the data volume of the first audio is far greater than that of the speech text. Therefore, compared with the sending end directly sending the first video and the first audio, sending only the face key point information and the speech text significantly reduces the amount of data sent by the sending end, reduces the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, and helps improve the efficiency of data transmission.

In one possible implementation, to further reduce the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, after the sending end obtains the multimedia feature information from the first video and the first audio, it may encode the multimedia feature information using a preset encoding rule to obtain encoded multimedia feature information and send the encoded information to the receiving end. The receiving end receives the encoded multimedia feature information, decodes it to recover the multimedia feature information, and then processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data. In this way, encoding the multimedia feature information further compresses its data volume, thereby further reducing the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end.
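The patent does not name a concrete codec, so the sketch below simply illustrates the idea of a preset encoding rule with standard-library serialization and compression; JSON plus zlib is an assumption, not the disclosed scheme.

```python
import json
import zlib

def encode_feature_info(info):
    # Sender side: serialize the multimedia feature information and
    # compress it before transmission (JSON turns integer keys into
    # strings here, which a real implementation would account for).
    payload = {"face_keypoints": info.face_keypoints,
               "speech_text": info.speech_text}
    return zlib.compress(json.dumps(payload).encode("utf-8"))

def decode_feature_info(data):
    # Receiver side: reverse the encoding before further processing.
    return json.loads(zlib.decompress(data).decode("utf-8"))
```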
In one possible implementation, to broaden the application scope of the data processing method of the embodiments of the present application, the multimedia feature information may also include skeleton joint point information obtained by the sending end from the first video of the target object, where the skeleton joint point information includes the coordinates of the skeleton joint points of the target object in each video frame of the first video.

Referring to FIG. 4, which is a schematic diagram of exemplary skeleton joint points of a target object, the 25 skeleton joint points shown include: nose 0, neck 1, right shoulder 2, right elbow 3, right wrist 4, left shoulder 5, left elbow 6, left wrist 7, sacrum 8, right waist 9, right knee 10, right ankle 11, left waist 12, left knee 13, left ankle 14, right eye 15, left eye 16, right ear 17, left ear 18, first left toe 19, second left toe 20, left heel 21, first right toe 22, second right toe 23, and right heel 24. It should be noted that, in the embodiments of the present application, the skeleton joint points of the user may include some or all of the 25 skeleton joint points shown in FIG. 4, which is not specifically limited here.

Correspondingly, the biometric information may also include a body posture model, and the receiving end may stitch the body posture model and the face model together to obtain a portrait model. The receiving end drives the portrait model based on the face key point information and the skeleton joint point information, so that the portrait model changes as the coordinates of the face key points and skeleton joint points of the target object change across the video frames; different face key point coordinates correspond to different expressions of the portrait model, and different skeleton joint point coordinates correspond to different poses of the portrait model. A sketch of this driving step follows.
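This is a minimal sketch assuming the portrait model exposes hypothetical `apply_face_keypoints`, `apply_skeleton_joints`, and `render` methods; none of these interfaces come from the original disclosure.

```python
def drive_portrait_model(portrait_model, face_keypoints, skeleton_joints):
    # For each frame, update the stitched portrait model (face model plus
    # body posture model) with both coordinate sets, so that expression and
    # body pose change together, then render the frame.
    frames = []
    for frame_index in sorted(face_keypoints):
        portrait_model.apply_face_keypoints(face_keypoints[frame_index])
        portrait_model.apply_skeleton_joints(skeleton_joints[frame_index])
        frames.append(portrait_model.render())
    return frames
```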
In the embodiments of the present application, after the receiving end processes the multimedia feature information and obtains the target multimedia data, it may also display the target multimedia data.

The application of the data processing method of the embodiments of the present application is illustrated below in combination with several different application scenarios:
Multi-party video call scenario

During a video call between multiple terminals, it can be understood that any terminal can act as a receiving end as well as a sending end.

Taking terminal A and terminal B as an example of multiple terminals, see FIG. 5, which is an exemplary schematic diagram of data transmission in a video call between terminal A and terminal B.

When terminal A acts as the sending end, terminal A can collect the first video and first audio of user A (the target object), obtain face key point information from the first video, perform speech recognition on the first audio to obtain speech text, and use the face key point information and the speech text as multimedia feature information, where the face key point information includes the coordinates of user A's face key points in each video frame of the first video.

Terminal A sends the multimedia feature information corresponding to user A to terminal B. After receiving the multimedia feature information, terminal B, acting as the receiving end, processes user A's face key point information based on the pre-acquired face model corresponding to user A and processes user A's speech text based on the pre-acquired timbre feature corresponding to user A, thereby obtaining target multimedia data corresponding to user A, which may include the second video and the second audio. Terminal B can then display the second video and second audio corresponding to user A in time synchronization.

Similarly, terminal B can also perform the steps performed by the above sending end to send multimedia feature information corresponding to user B to terminal A. Correspondingly, terminal A, as the receiving end, can display the second video and second audio corresponding to user B (that is, the target multimedia data corresponding to user B) in time synchronization based on user B's multimedia feature information, thereby realizing the video call process between terminal A and terminal B.

It should be noted that FIG. 5 only shows two terminals (terminal A and terminal B) as an example; in other embodiments, more terminals may be included. Any terminal can send the multimedia feature information it extracts to each of the other terminals; any terminal can also receive the corresponding multimedia feature information sent by other terminals and display the corresponding target multimedia data.

In this way, the video call process between multiple terminals can be realized by transmitting only the multimedia feature information, without transmitting the original video and audio, which reduces the excessive consumption of bandwidth resources during multi-terminal video calls and helps improve call quality.
Live video streaming scenario

Optionally, when a viewer watches a live web video stream through a viewer terminal, the viewer terminal held by the viewer may act as the receiving end of the embodiments of the present application, and the host terminal held by the host may act as the sending end.

Referring to FIG. 6, which is an exemplary schematic diagram of data transmission between a viewer terminal and a host terminal.

The host terminal may collect the first video and first audio of the host (the target object), obtain face key point information from the first video, perform speech recognition on the first audio to obtain speech text, and use the face key point information and the speech text as multimedia feature information, where the face key point information includes the coordinates of the host's face key points in each video frame of the first video.

The host terminal sends the multimedia feature information corresponding to the host to the viewer terminal. After receiving the multimedia feature information, the viewer terminal, acting as the receiving end, processes the host's face key point information based on the pre-acquired face model corresponding to the host and processes the host's speech text based on the pre-acquired timbre feature of the host, thereby obtaining target multimedia data corresponding to the host, which may include the second video and the second audio. The viewer terminal can then display the host's second video and second audio in time synchronization.
Optionally, while viewers watch a live web video stream through viewer terminals, different hosts may also co-stream (lianmai), which is a form of live web broadcasting. Through co-streaming, different hosts can interact with each other, and the co-streaming interface on each participating host's terminal and on the viewer terminals can simultaneously display the multimedia data corresponding to each co-streaming host.

In the case of co-streaming, for example when host A and host B co-stream, host A's host terminal A and host B's host terminal B can each act as a sending end, while host terminal A, host terminal B, and the viewer terminals can each act as a receiving end.

Referring to FIG. 7, which is an exemplary schematic diagram of co-streaming data transmission between host A and host B.

Host terminal A can collect the first video and first audio of host A (the target object), obtain face key point information from host A's first video, perform speech recognition on host A's first audio to obtain speech text, and use the face key point information and the speech text as multimedia feature information, where the face key point information includes the coordinates of host A's face key points in each video frame of the first video.

Host terminal A sends the multimedia feature information corresponding to host A to the viewer terminals and host terminal B. Similarly, host terminal B can send the multimedia feature information corresponding to host B to the viewer terminals and host terminal A in the same manner.

After receiving the multimedia feature information corresponding to host A, a viewer terminal processes host A's face key point information based on the pre-acquired face model corresponding to host A and processes host A's speech text based on the pre-acquired timbre feature corresponding to host A, thereby obtaining target multimedia data corresponding to host A. Similarly, after receiving the multimedia feature information corresponding to host B, the viewer terminal follows the same procedure to obtain target multimedia data corresponding to host B. The viewer terminal can then display the target multimedia data corresponding to host A and the target multimedia data corresponding to host B at the same time.

After host terminal A receives the multimedia feature information corresponding to host B, similarly to the processing at the viewer terminal, it can process host B's face key point information based on the pre-acquired face model corresponding to host B and process host B's speech text based on the pre-acquired timbre feature corresponding to host B, thereby obtaining target multimedia data corresponding to host B and displaying it on host terminal A. Optionally, host terminal A can also display host A's own multimedia data and the target multimedia data corresponding to host B on host terminal A at the same time.
Similarly, host terminal B can also display the target multimedia data corresponding to host A and host B's own multimedia data on host terminal B.

In this way, in the live streaming scenario, the live streaming process can be realized between the host terminals and the viewer terminals by transmitting only the hosts' multimedia feature information, without transmitting the original multimedia data, which reduces the excessive consumption of bandwidth resources during live streaming and helps improve communication quality.
In the above embodiments, the receiving end receives the multimedia feature information sent by the sending end, where the multimedia feature information includes face key point information obtained from the first video of the target object and speech text obtained by performing speech recognition on the first audio of the target object, and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. The receiving end then processes the multimedia feature information based on pre-acquired biometric information, which includes a face model and a timbre feature, to obtain target multimedia data; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information and the voice corresponding to the speech text are presented. It can be seen that the sending end in the embodiments of the present application does not have to send the original first video and first audio, but only the face key point information and the speech text; the receiving end then processes this information to obtain target multimedia data that presents the dynamic expression corresponding to the face key point information and the voice corresponding to the speech text. Since the data volume of the first video is far greater than that of the face key point information, and the data volume of the first audio is far greater than that of the speech text, sending only the face key point information and the speech text, rather than the first video and first audio, significantly reduces the amount of data sent by the sending end, reduces the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, and helps improve the efficiency of data transmission.
In one embodiment, based on the embodiment shown in FIG. 2 and referring to FIG. 8, this embodiment concerns how the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data when the target multimedia data includes the second video. As shown in FIG. 8, step 202 may include step 801:

Step 801: the receiving end transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame to obtain the second video.

In the embodiments of the present application, the face model may be a three-dimensional face model or a two-dimensional face model. For ease of description, the following takes a two-dimensional face model as an example when explaining the implementation process; it can be understood that this example does not limit the type of the face model.

As described above, the face model may be a two-dimensional face image of the target object, and the face model includes the coordinates of multiple face key points; as described above, the face key points may be key points of the eyebrows, eyes, nose, mouth, and so on.
After receiving the multimedia feature information sent by the sending end, the receiving end transforms the coordinates of the corresponding face key points in the face model according to the coordinates of the target object's face key points in each video frame.

As one implementation, for each video frame, the receiving end can transform the coordinates of each face key point in the face model into the coordinates of the corresponding face key point in that video frame; for example, the coordinates of the face key point "nose" in the face model are transformed into the coordinates of the corresponding face key point "nose" in the video frame.

It can be understood that, for each video frame, after the receiving end transforms the coordinates of each face key point in the face model into the coordinates of the corresponding face key point in that video frame, the face model has the same expression as the corresponding video frame.
The process of how the receiving end transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame is described below.

Optionally, the receiving end can perform the following step A1 to transform the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame:

Step A1: based on the timing order of the video frames in the first video, the receiving end transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame in turn.

In one possible implementation, the face key point information may also include the timing order of the video frames in the first video and the identifier of each face key point. The receiving end can then, following the front-to-back timing order of the video frames in the first video, transform the coordinates of the face key points in the face model according to the coordinates of the face key points in each video frame in turn.

For example, following the front-to-back timing order of the video frames in the first video, for the first video frame, the receiving end can determine, in the face model, the face key point corresponding to each identifier in the face key point information, and then transform the coordinates of each determined face key point in the face model into the coordinates of the corresponding face key point in that video frame, so that the face model has the same expression as that video frame. Then, for the second video frame, the third video frame, and so on in the first video, the receiving end performs the same steps as for the first video frame, driving the face model in the timing order of the video frames and rendering the driven face model, thereby obtaining the second video.
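Step A1 can be sketched as the loop below, which walks the frames in their timing order and overwrites the face model's key point coordinates with the coordinates reported for the matching identifiers; the renderer is a hypothetical placeholder.

```python
def transform_face_model_coords(face_model_coords, keypoint_info, render_frame):
    # `face_model_coords` maps key point identifiers to coordinates in the
    # face model; `keypoint_info` maps frame indices (in timing order) to
    # {identifier: coordinates}; `render_frame` is a hypothetical renderer.
    rendered_frames = []
    for frame_index in sorted(keypoint_info):
        for kp_id, coords in keypoint_info[frame_index].items():
            # Move the model's key point to the coordinate observed in
            # this video frame, reproducing the frame's expression.
            face_model_coords[kp_id] = coords
        rendered_frames.append(render_frame(face_model_coords))
    return rendered_frames  # frame sequence of the second video
```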
In the above embodiment, the second video can be generated merely by transforming the coordinates of the face key points in the face model. The computation mainly involves processing the coordinates of the face key points and does not require processing other parameters of the face model, which helps reduce the computational load on the receiving end and improve its performance.
In an embodiment, based on the embodiment shown in FIG. 2 and referring to FIG. 9, this embodiment concerns how, when the target multimedia data includes the second audio, the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data. As shown in FIG. 9, in this embodiment step 202 may include step 901 shown in FIG. 9:
Step 901: the receiving end inputs the timbre feature and the speech text into a speech synthesis model to obtain the second audio.
In the embodiments of the present application, the timbre feature may be obtained by performing Fourier analysis on the audio data of the target object to obtain a spectrum and then extracting spectral features from it.
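By way of illustration, such Fourier-based timbre extraction might be realized minimally as sketched below, assuming the audio data is a one-dimensional NumPy array; the framing parameters and the pooling of the averaged spectrum into a fixed-length vector are assumptions of this sketch:

    import numpy as np

    def extract_timbre_feature(audio, sample_rate, n_bins=64):
        # Frame the signal into 25 ms windows with a 10 ms hop.
        frame_len = int(0.025 * sample_rate)
        hop = int(0.010 * sample_rate)
        frames = [audio[i:i + frame_len]
                  for i in range(0, len(audio) - frame_len + 1, hop)]
        window = np.hanning(frame_len)
        # Fourier analysis of each frame, then average the magnitude spectra.
        spectra = [np.abs(np.fft.rfft(f * window)) for f in frames]
        mean_spectrum = np.mean(spectra, axis=0)
        # Pool the averaged spectrum into a fixed-length feature vector.
        return np.array([b.mean() for b in np.array_split(mean_spectrum, n_bins)])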
The receiving end inputs the timbre feature and the speech text into a text-to-speech (TTS) speech synthesis model and obtains the second audio output by the model.
In the above embodiment, the receiving end processes the speech text according to the pre-acquired timbre feature, so that the synthesized second audio carries the characteristics represented by that timbre feature. The receiving end therefore does not need to obtain the first audio captured by the sending end through its microphone; a second audio with the same timbre as the first audio can be synthesized solely from the speech text extracted from the first audio and the timbre feature. While preserving the fidelity of the second audio, this greatly reduces the amount of data transmitted between the receiving end and the sending end and helps improve the efficiency of multimedia data transmission.
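A minimal sketch of step 901 follows; tts_model and its synthesize signature are hypothetical stand-ins for any pre-trained multi-speaker text-to-speech model that accepts a speaker or timbre embedding:

    def synthesize_second_audio(tts_model, timbre_feature, speech_text):
        # Condition the TTS model on the stored timbre feature so that the
        # synthesized waveform matches the target object's voice.
        return tts_model.synthesize(text=speech_text,
                                    speaker_embedding=timbre_feature)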
In an embodiment, based on the embodiment shown in FIG. 2 and referring to FIG. 10, this embodiment concerns how the receiving end acquires the biometric information. As shown in FIG. 10, before step 202 the data processing method of this embodiment further includes steps 203 and 204:
Step 203: the receiving end acquires the terminal identifier corresponding to the sending end and, according to the terminal identifier, detects whether target biometric information of the target object corresponding to the terminal identifier is stored.
Here, the target biometric information includes a target face model and a target timbre feature. In a possible implementation, the sending end may capture a face image of the target object and determine that face image as the target object's target face model; the sending end may also perform feature extraction on the target object's audio data to obtain the target object's target timbre feature.
While establishing a communication connection with the sending end, the receiving end may acquire the terminal identifier corresponding to the sending end. If the sending end and the receiving end performed multimedia data transmission at some historical moment, the receiving end may have requested the target biometric information of the target object from the sending end according to that terminal identifier and, upon receiving it, stored the target biometric information in association with the terminal identifier.
In this way, in the embodiments of the present application, after receiving the multimedia feature information sent by the sending end, the receiving end can detect, according to the sending end's terminal identifier, whether target biometric information of the target object corresponding to that terminal identifier is stored.
Step 204: if the target biometric information is stored, the receiving end uses the target biometric information as the biometric information.
If the target biometric information of the target object is stored, the receiving end has already acquired it from the sending end during a historical multimedia data transmission, and it can process the multimedia feature information sent by the sending end based on that target biometric information.
Since the target biometric information consists of the target object's target face model and target timbre feature, the receiving end can faithfully reproduce the real multimedia data of the target object at the sending end based on this information, ensuring the effect of multimedia data transmission.
Still referring to FIG. 10, step 204 is followed by steps 205, 206, 207, and 208:
Step 205: if the target biometric information is not stored, the receiving end detects whether the current data transmission rate is greater than a transmission rate threshold.
If the receiving end has not stored the target biometric information, it decides, in light of the current network quality, whether to request the target biometric information from the sending end. Optionally, the receiving end detects whether the current data transmission rate is greater than a transmission rate threshold, which may be set as needed at implementation time.
Step 206: if the current data transmission rate is greater than the transmission rate threshold, the receiving end sends an acquisition request to the sending end, the acquisition request being used to request the sending end to return the target biometric information.
A current data transmission rate greater than the transmission rate threshold indicates that the receiving end's current network quality is good, so the receiving end requests the target biometric information from the sending end.
Step 207: the receiving end receives the target biometric information sent by the sending end based on the acquisition request and uses the target biometric information as the biometric information.
Step 208: if the current data transmission rate is less than or equal to the transmission rate threshold, the receiving end acquires pre-stored generic biometric information and uses the generic biometric information as the biometric information.
Here, the generic biometric information includes a generic face model and a generic timbre feature.
A current data transmission rate less than or equal to the transmission rate threshold indicates that the receiving end's current network quality is poor. To avoid degrading it further by requesting the target biometric information from the sending end, the receiving end acquires the pre-stored generic biometric information and uses it as the biometric information. The generic biometric information may be the face image and timbre feature of any preset user.
In this way, after receiving the multimedia feature information, the receiving end can also process it based on the locally stored generic biometric information to obtain the target multimedia data. Multimedia data transmission can thus be achieved even under poor network quality, which reduces the network bandwidth required for multimedia data transmission and saves network bandwidth resources.
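By way of illustration, steps 203 to 208 can be read as a single decision procedure. The sketch below assumes a simple dictionary store keyed by terminal identifier; request_fn stands in for the acquisition request of step 206, and generic_info for the locally preset generic face model and generic timbre feature:

    def resolve_biometric_info(store, terminal_id, current_rate,
                               rate_threshold, request_fn, generic_info):
        target = store.get(terminal_id)        # step 203: look up stored info
        if target is not None:                 # step 204: use stored target info
            return target
        if current_rate > rate_threshold:      # steps 205-207: network is good,
            target = request_fn(terminal_id)   # so request it from the sender
            store[terminal_id] = target        # cache for later sessions
            return target
        return generic_info                    # step 208: fall back to generic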
In an embodiment, referring to FIG. 11, which shows a flowchart of a data processing method provided by an embodiment of the present application, the data processing method may be applied to the sending end 120 shown in FIG. 1. As shown in FIG. 11, the data processing method may include the following steps:
Step 111: the sending end acquires the first video and the first audio of the target object and acquires face key point information from the first video.
During multimedia data transmission between the sending end and the receiving end, the sending end may record video of the target object to obtain the first video and record the target object's voice to obtain the first audio. The sending end acquires face key point information from the target object's first video, the face key point information including the coordinates of the target object's face key points in each video frame of the first video.
The following describes how the sending end acquires the face key point information from the first video.
Optionally, the first video may include multiple video frames. For each video frame in the first video, the sending end performs face key point detection on the video frame to obtain the coordinates of the target object's face key points in that frame, and generates the face key point information based on the coordinates of the target object's face key points in each video frame.
Specifically, the sending end may input each video frame into a face key point detection model to obtain the coordinates of each face key point in that frame as output by the model, where the face key point detection model may be any pre-trained deep learning model for face key point detection.
The sending end uses the coordinates of the face key points in each video frame as the face key point information. In a possible implementation, the sending end may further add, to the coordinates of each face key point in the face key point information, a corresponding identifier as well as the temporal order of each video frame in the first video, and use the augmented information as the final face key point information.
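A minimal sketch of assembling such face key point information follows; detect_keypoints is a hypothetical wrapper around any pre-trained face key point detection model, assumed to return a mapping from keypoint identifiers to coordinates for one frame:

    def build_face_keypoint_info(video_frames, detect_keypoints):
        info = []
        for order, frame in enumerate(video_frames):
            # Record the keypoint coordinates together with an identifier per
            # keypoint and the frame's temporal order in the first video.
            info.append({"frame_order": order,
                         "keypoints": detect_keypoints(frame)})
        return info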
Step 112: the sending end performs speech recognition on the first audio to obtain speech text and sends the face key point information and the speech text to the receiving end as multimedia feature information.
The sending end may input the first audio into an automatic speech recognition (ASR) model to obtain the speech text output by the model.
The sending end sends the face key point information and the speech text to the receiving end as multimedia feature information. The multimedia feature information is used by the receiving end, which processes it based on pre-acquired biometric information to obtain target multimedia data; the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information and speech corresponding to the speech text are presented.
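By way of illustration, the sender-side packaging of step 112 might look as follows; asr_model.transcribe is a hypothetical wrapper around any pre-trained automatic speech recognition model:

    def make_multimedia_feature_info(asr_model, first_audio, keypoint_info):
        speech_text = asr_model.transcribe(first_audio)
        # This compact payload is what is sent in place of the raw
        # first video and first audio.
        return {"face_keypoints": keypoint_info, "speech_text": speech_text}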
For the process by which the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data, reference may be made to the above embodiments; details are not repeated here.
During multimedia data transmission between the sending end and the receiving end, what the sending end sends to the receiving end is not the original first audio and first video but the multimedia feature information extracted from them. Since the data volume of the first video is far larger than that of the face key point information, and the data volume of the first audio is far larger than that of the speech text, sending only the face key point information and the speech text, rather than the first video and the first audio directly, significantly reduces the amount of data the sending end transmits, lowers the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, and helps improve data transmission efficiency.
In an embodiment, referring to FIG. 12, a data processing method is provided, which may be applied in the implementation environment shown in FIG. 1. As shown in FIG. 12, the method may include the following steps:
Step 121: the sending end acquires the first video and the first audio of the target object and acquires face key point information from the first video.
Here, the face key point information includes the coordinates of the target object's face key points in each video frame of the first video.
Step 122: the sending end performs speech recognition on the first audio to obtain speech text and sends the face key point information and the speech text to the receiving end as multimedia feature information.
Step 123: the receiving end processes the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data.
Here, the biometric information includes a face model and a timbre feature. When the target multimedia data is displayed, a dynamic expression corresponding to the face key point information and speech corresponding to the speech text are presented.
The specific limitations and beneficial effects of steps 121, 122, and 123 in this embodiment are similar to those of the above embodiments; reference may be made to the descriptions of those embodiments, which are not repeated here.
In the following, an exemplary implementation of the embodiments of the present application is introduced, taking multimedia data transmission between terminal A (corresponding to user A) and terminal B (corresponding to user B) as an example.
Step a: terminal A is preset with a generic face model and a generic timbre feature. Terminal A uses user A's face image as user A's target face model and extracts user A's target timbre feature from user A's audio data. Referring to FIG. 13, FIG. 13 is an exemplary schematic diagram of terminal A creating user A's target face model and user A's target timbre feature.
Step b: terminal B is preset with a generic face model and a generic timbre feature. Terminal B uses user B's face image as user B's target face model and extracts user B's target timbre feature from user B's audio data. Referring to FIG. 14, FIG. 14 is an exemplary schematic diagram of terminal B creating user B's target face model and user B's target timbre feature.
Step c: if terminal A and terminal B need to transmit multimedia data, terminal A does not store user B's target face model and target timbre feature, terminal B does not store user A's target face model and target timbre feature, and the current data transmission rates of both terminal A and terminal B are greater than the transmission rate threshold, then terminal A requests user B's target face model and target timbre feature from terminal B, and terminal B requests user A's target face model and target timbre feature from terminal A, thereby completing the exchange of standard models. Referring to FIG. 15, FIG. 15 is an exemplary schematic diagram of terminal A and terminal B exchanging target biometric information.
Step d: terminal A captures user A's first video through a camera and inputs each video frame of user A's first video into the face key point detection model to obtain user A's face key point information; terminal A captures user A's first audio through a microphone and inputs user A's first audio into the automatic speech recognition model to obtain user A's speech text.
Step e: terminal B captures user B's first video through a camera and inputs each video frame of user B's first video into the face key point detection model to obtain user B's face key point information; terminal B captures user B's first audio through a microphone and inputs user B's first audio into the automatic speech recognition model to obtain user B's speech text.
Step f: terminal A sends user A's face key point information and user A's speech text to terminal B as user A's multimedia feature information.
Step g: terminal B sends user B's face key point information and user B's speech text to terminal A as user B's multimedia feature information.
Step h: terminal A processes user B's face key point information based on user B's target face model to obtain user B's second video, and processes user B's speech text based on user B's target timbre feature to obtain user B's second audio; terminal A displays user B's second video and user B's second audio as user B's target multimedia data.
Step i: terminal B processes user A's face key point information based on user A's target face model to obtain user A's second video, and processes user A's speech text based on user A's target timbre feature to obtain user A's second audio; terminal B displays user A's second video and user A's second audio as user A's target multimedia data.
Referring to FIG. 16, FIG. 16 is a schematic diagram of an exemplary multimedia data transmission process between terminal A and terminal B.
In this way, during multimedia data transmission between terminal A and terminal B, the original complete multimedia data does not need to be transmitted; transmitting only the multimedia feature information suffices. This reduces the network bandwidth required for multimedia data transmission, improves transmission efficiency, and saves transmission resources.
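By way of illustration, one side of the exchange shown in FIG. 16 can be sketched by combining the earlier sketches; the field names and the render callback remain assumptions of this illustration:

    def reconstruct_peer_media(feature_info, peer_bio, tts_model, render):
        # peer_bio holds the peer's target face model and target timbre
        # feature, exchanged beforehand in step c.
        second_video = drive_face_model(dict(peer_bio["face_model"]),
                                        feature_info["face_keypoints"],
                                        render)
        second_audio = tts_model.synthesize(
            text=feature_info["speech_text"],
            speaker_embedding=peer_bio["timbre_feature"])
        return second_video, second_audio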
It should be understood that although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on their execution, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
FIG. 17 is a structural block diagram of a data processing apparatus according to an embodiment. As shown in FIG. 17, the apparatus includes:
a receiving module 1701, configured to receive multimedia feature information sent by a sending end, the multimedia feature information including face key point information acquired from a first video of a target object and speech text obtained by performing speech recognition on first audio of the target object, where the face key point information includes the coordinates of the target object's face key points in each video frame of the first video; and
a processing module 1702, configured to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information including a face model and a timbre feature, where, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information and speech corresponding to the speech text are presented.
Based on the embodiment shown in FIG. 17 and referring to FIG. 18, optionally, the target multimedia data includes a second video, and the processing module 1702 includes:
a coordinate transformation unit 1702a, configured to transform the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame, so as to obtain the second video.
Optionally, the coordinate transformation unit 1702a is specifically configured to, based on the temporal order of the video frames in the first video, successively transform the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame.
Based on the embodiment shown in FIG. 17, optionally, the target multimedia data includes second audio, and the processing module 1702 is specifically configured to input the timbre feature and the speech text into a speech synthesis model to obtain the second audio.
Based on the embodiment shown in FIG. 17, optionally, the apparatus further includes:
a first detection module, configured to acquire the terminal identifier corresponding to the sending end and, according to the terminal identifier, detect whether target biometric information of the target object corresponding to the terminal identifier is stored, the target biometric information including a target face model and a target timbre feature;
a first determination module, configured to use the target biometric information as the biometric information if the target biometric information is stored;
a second detection module, configured to detect whether the current data transmission rate is greater than a transmission rate threshold if the target biometric information is not stored;
a request module, configured to send an acquisition request to the sending end if the current data transmission rate is greater than the transmission rate threshold, the acquisition request being used to request the sending end to return the target biometric information;
a second determination module, configured to receive the target biometric information sent by the sending end based on the acquisition request and use the target biometric information as the biometric information; and
a third determination module, configured to acquire pre-stored generic biometric information if the current data transmission rate is less than or equal to the transmission rate threshold and use the generic biometric information as the biometric information, the generic biometric information including a generic face model and a generic timbre feature.
The division of the modules in the above data processing apparatus is for illustration only. In other embodiments, the data processing apparatus may be divided into different modules as needed to accomplish all or part of its functions.
For specific limitations on the data processing apparatus, reference may be made to the limitations above on the data processing method applied to the receiving end, which are not repeated here. Each module in the above data processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor in a computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
FIG. 19 is a structural block diagram of a data processing apparatus according to an embodiment. As shown in FIG. 19, the apparatus includes:
an acquisition module 1901, configured to acquire a first video and first audio of a target object and acquire face key point information from the first video, the face key point information including the coordinates of the target object's face key points in each video frame of the first video; and
a processing module 1902, configured to perform speech recognition on the first audio to obtain speech text and send the face key point information and the speech text to a receiving end as multimedia feature information, where the multimedia feature information is used by the receiving end, which processes it based on pre-acquired biometric information to obtain target multimedia data; the biometric information includes a face model and a timbre feature, and, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information and speech corresponding to the speech text are presented.
Based on the embodiment shown in FIG. 19, in a possible implementation, the acquisition module 1901 is specifically configured to, for each video frame in the first video, perform face key point detection on the video frame to obtain the coordinates of the target object's face key points in that frame, and to generate the face key point information based on the coordinates of the target object's face key points in each video frame.
The division of the modules in the above data processing apparatus is for illustration only. In other embodiments, the data processing apparatus may be divided into different modules as needed to accomplish all or part of its functions.
For specific limitations on the data processing apparatus, reference may be made to the limitations above on the data processing method applied to the sending end, which are not repeated here. Each module in the above data processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor in a computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In an embodiment, a data processing system is provided, the system including a sending end and a receiving end;
the receiving end is configured to execute the above method steps for the receiving end, so as to implement the above method embodiments for the receiving end; and
the sending end is configured to execute the above method steps for the sending end, so as to implement the above method embodiments for the sending end.
The implementation principles and technical effects of the data processing system provided by the embodiments of the present application are similar to those of the above method embodiments and are not repeated here.
FIG. 20 is a schematic diagram of the internal structure of an electronic device in an embodiment. As shown in FIG. 20, the electronic device includes a processor and a memory connected by a system bus. The processor provides computing and control capabilities to support the operation of the entire electronic device. The memory may include a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The computer program can be executed by the processor to implement the data processing methods provided by the various embodiments described herein. The internal memory provides a cached execution environment for the operating system and computer program in the non-volatile storage medium. The electronic device may be any terminal device such as a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sale) terminal, a vehicle-mounted computer, or a wearable device.
Each module in the data processing apparatus provided in the embodiments of the present application may be implemented in the form of a computer program. The computer program may run on a terminal or a server. The program modules constituted by the computer program may be stored in the memory of the electronic device. When the computer program is executed by the processor, the steps of the methods described in the embodiments of the present application are implemented.
Embodiments of the present application also provide a computer-readable storage medium: one or more non-volatile computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the data processing method.
Embodiments of the present application also provide a computer program product containing instructions that, when run on a computer, cause the computer to perform the steps of the data processing method.
Any reference to memory, storage, a database, or other media used in the present application may include non-volatile and/or volatile memory.
Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of this patent application. It should be noted that those of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A data processing method, wherein the method comprises:
    receiving multimedia feature information sent by a sending end, the multimedia feature information comprising face key point information acquired from a first video of a target object and speech text obtained by performing speech recognition on first audio of the target object, wherein the face key point information comprises coordinates of face key points of the target object in each video frame of the first video; and
    processing the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information comprising a face model and a timbre feature, wherein, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and speech corresponding to the speech text is presented.
  2. The method according to claim 1, wherein the target multimedia data comprises a second video, and processing the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data comprises:
    transforming the coordinates of the face key points in the face model according to the coordinates of the face key points of the target object in each of the video frames, so as to obtain the second video.
  3. The method according to claim 2, wherein transforming the coordinates of the face key points in the face model according to the coordinates of the face key points of the target object in each of the video frames comprises:
    based on the temporal order of the video frames in the first video, successively transforming the coordinates of the face key points in the face model according to the coordinates of the face key points of the target object in each of the video frames.
  4. The method according to claim 1, wherein the target multimedia data comprises second audio, and processing the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data comprises:
    inputting the timbre feature and the speech text into a speech synthesis model to obtain the second audio.
  5. The method according to claim 1, wherein, before processing the multimedia feature information based on the pre-acquired biometric information, the method further comprises:
    acquiring a terminal identifier corresponding to the sending end and, according to the terminal identifier, detecting whether target biometric information of the target object corresponding to the terminal identifier is stored, the target biometric information comprising a target face model and a target timbre feature; and
    if the target biometric information is stored, using the target biometric information as the biometric information.
  6. The method according to claim 5, wherein the method further comprises:
    if the target biometric information is not stored, detecting whether a current data transmission rate is greater than a transmission rate threshold;
    if the current data transmission rate is greater than the transmission rate threshold, sending an acquisition request to the sending end, the acquisition request being used to request the sending end to return the target biometric information; and
    receiving the target biometric information sent by the sending end based on the acquisition request and using the target biometric information as the biometric information.
  7. The method according to claim 6, wherein the method further comprises:
    if the current data transmission rate is less than or equal to the transmission rate threshold, acquiring pre-stored generic biometric information and using the generic biometric information as the biometric information, the generic biometric information comprising a generic face model and a generic timbre feature.
  8. The method according to claim 1, wherein the face model is a two-dimensional face model or a three-dimensional face model; the two-dimensional face model is a two-dimensional face image, and the three-dimensional face model is obtained by performing three-dimensional modeling on a two-dimensional face image.
  9. The method according to claim 1, wherein the method further comprises:
    displaying the target multimedia data.
  10. The method according to claim 1, wherein the multimedia feature information is obtained by the sending end encoding, using a preset encoding rule, the multimedia feature information acquired from the first video and the first audio.
  11. A data processing method, wherein the method comprises:
    acquiring a first video and first audio of a target object, and acquiring face key point information from the first video, the face key point information comprising coordinates of face key points of the target object in each video frame of the first video; and
    performing speech recognition on the first audio to obtain speech text, and sending the face key point information and the speech text to a receiving end as multimedia feature information;
    wherein the multimedia feature information is used by the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information comprises a face model and a timbre feature, and, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and speech corresponding to the speech text is presented.
  12. The method according to claim 11, wherein acquiring the face key point information from the first video comprises:
    for each of the video frames in the first video, performing face key point detection on the video frame to obtain the coordinates of the face key points of the target object in the video frame; and
    generating the face key point information based on the coordinates of the face key points of the target object in each of the video frames.
  13. The method according to claim 12, wherein generating the face key point information based on the coordinates of the face key points of the target object in each of the video frames comprises:
    using the coordinates of the face key points in each of the video frames as the face key point information.
  14. The method according to claim 12, wherein generating the face key point information based on the coordinates of the face key points of the target object in each of the video frames comprises:
    adding, to the coordinates of the face key points in each of the video frames, a corresponding identifier and the temporal order of each of the video frames in the first video, to obtain the face key point information.
  15. A data processing method, wherein the method comprises:
    a sending end acquiring a first video and first audio of a target object and acquiring face key point information from the first video, the face key point information comprising coordinates of face key points of the target object in each video frame of the first video;
    the sending end performing speech recognition on the first audio to obtain speech text and sending the face key point information and the speech text to a receiving end as multimedia feature information; and
    the receiving end processing the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information comprising a face model and a timbre feature, wherein, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and speech corresponding to the speech text is presented.
  16. A data processing apparatus, wherein the apparatus comprises:
    a receiving module, configured to receive multimedia feature information sent by a sending end, the multimedia feature information comprising face key point information acquired from a first video of a target object and speech text obtained by performing speech recognition on first audio of the target object, wherein the face key point information comprises coordinates of face key points of the target object in each video frame of the first video; and
    a processing module, configured to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information comprising a face model and a timbre feature, wherein, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and speech corresponding to the speech text is presented.
  17. A data processing apparatus, wherein the apparatus comprises:
    an acquisition module, configured to acquire a first video and first audio of a target object and acquire face key point information from the first video, the face key point information comprising coordinates of face key points of the target object in each video frame of the first video; and
    a processing module, configured to perform speech recognition on the first audio to obtain speech text and send the face key point information and the speech text to a receiving end as multimedia feature information;
    wherein the multimedia feature information is used by the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information comprises a face model and a timbre feature, and, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and speech corresponding to the speech text is presented.
  18. A data processing system, wherein the system comprises a sending end and a receiving end;
    the receiving end is configured to execute the data processing method according to any one of claims 1 to 10; and
    the sending end is configured to execute the data processing method according to any one of claims 11 to 14.
  19. An electronic device, comprising a memory and a processor, the memory storing a computer program, wherein, when the computer program is executed by the processor, the processor is caused to perform the steps of the method according to any one of claims 1 to 14.
  20. A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 14 are implemented.