WO2022193910A1 - Data processing method, apparatus and system, and electronic device and readable storage medium


Info

Publication number
WO2022193910A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
target
face key
face
key point
Prior art date
Application number
PCT/CN2022/077098
Other languages
French (fr)
Chinese (zh)
Inventor
陈伟杰 (CHEN Weijie)
Original Assignee
Guangdong OPPO Mobile Telecommunications Corp., Ltd. (Oppo广东移动通信有限公司)
Priority date
Filing date
Publication date
Application filed by Guangdong OPPO Mobile Telecommunications Corp., Ltd.
Publication of WO2022193910A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/168 Feature extraction; Face representation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/40 Support for services or applications
    • H04L 65/60 Network streaming of media packets

Definitions

  • the present application relates to the technical field of data processing, and in particular, to a data processing method, apparatus, system, electronic device and readable storage medium.
  • Multimedia refers to a combination of multiple media, generally including sound, images, and other media forms.
  • the sending end may send multimedia data such as images, sounds, and videos to the receiving end.
  • For example, user A's terminal can collect multimedia data such as images, sounds, or videos of user A and send the collected multimedia data to other users' terminals; user A's terminal can also receive multimedia data sent by other users' terminals.
  • Embodiments of the present application provide a data processing method, apparatus, system, electronic device, and readable storage medium, which can reduce the consumption of network bandwidth resources by the transmission of multimedia data between terminals.
  • In a first aspect, a data processing method is provided, comprising:
  • receiving multimedia feature information sent by a sending end, where the multimedia feature information includes face key point information obtained from a first video of a target object and voice text obtained by performing voice recognition on a first audio of the target object, wherein
  • the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video; and
  • processing the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a speech corresponding to the speech text is presented.
  • In a second aspect, a data processing method is provided, comprising:
  • acquiring a first video and a first audio of a target object, and acquiring face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video; and
  • performing voice recognition on the first audio to obtain voice text, and sending the face key point information and the voice text to a receiving end as multimedia feature information;
  • wherein the multimedia feature information is used for the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the voice text is presented.
  • In a third aspect, a data processing method is provided, comprising:
  • the sending end obtains a first video and a first audio of a target object, and obtains face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;
  • the sending end performs voice recognition on the first audio to obtain voice text, and sends the face key point information and the voice text to the receiving end as multimedia feature information; and
  • the receiving end processes the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the voice text is presented.
  • In a fourth aspect, a data processing apparatus is provided, comprising:
  • a receiving module, configured to receive multimedia feature information sent by a sending end, where the multimedia feature information includes face key point information obtained from a first video of a target object and voice text obtained by performing voice recognition on a first audio of the target object, and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video; and
  • a processing module, configured to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a speech corresponding to the speech text is presented.
  • In a fifth aspect, a data processing apparatus is provided, comprising:
  • an acquisition module, configured to acquire a first video and a first audio of a target object, and acquire face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video; and
  • a processing module, configured to perform voice recognition on the first audio to obtain voice text, and send the face key point information and the voice text to a receiving end as multimedia feature information;
  • wherein the multimedia feature information is used for the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the voice text is presented.
  • In a sixth aspect, a data processing system is provided, including a sending end and a receiving end;
  • the receiving end is configured to execute the data processing method described in the first aspect; and
  • the sending end is configured to execute the data processing method described in the second aspect.
  • In a seventh aspect, an electronic device is provided, comprising a memory and a processor, where a computer program is stored in the memory, and the computer program, when executed by the processor, causes the processor to execute the steps of the method described in the first aspect or the second aspect.
  • In an eighth aspect, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method described in the first aspect or the second aspect.
  • In the embodiments of the present application, the multimedia feature information includes face key point information obtained from the first video of the target object and voice text obtained by performing speech recognition on the first audio of the target object; the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. The multimedia feature information is then processed based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information and the voice corresponding to the voice text are presented. It can be seen that the sending end in the embodiments of the present application does not have to send the original first video and first audio;
  • the sending end sends only the face key point information and the voice text, which can significantly reduce the amount of data sent by the sending end, reduce the network bandwidth consumed by
  • the transmission of multimedia data between the sending end and the receiving end, and help improve the efficiency of data transmission.
  • FIG. 1 is a diagram of an application environment of a data processing method in one embodiment;
  • FIG. 2 is a flowchart of a data processing method in one embodiment;
  • FIG. 3 is a schematic diagram of exemplary face key points of a target object;
  • FIG. 4 is a schematic diagram of exemplary skeleton joint points of a target object in one embodiment;
  • FIG. 5 is a schematic diagram of an exemplary data transmission process of a video call between terminal A and terminal B in one embodiment;
  • FIG. 6 is a schematic diagram of an exemplary data transmission process between a viewer terminal and a host terminal in one embodiment;
  • FIG. 7 is a schematic diagram of an exemplary co-streaming (Lianmai) data transmission process between anchor A and anchor B in one embodiment;
  • FIG. 10 is a flowchart of obtaining biometric information in one embodiment;
  • FIG. 11 is a flowchart of a data processing method in one embodiment;
  • FIG. 13 is a schematic diagram of terminal A creating the target face model of user A and the target timbre feature of user A in one embodiment;
  • FIG. 14 is a schematic diagram of terminal B creating the target face model of user B and the target timbre feature of user B in one embodiment;
  • FIG. 15 is a schematic diagram of an exemplary exchange of target biometric information between terminal A and terminal B in one embodiment;
  • FIG. 16 is a schematic diagram of an exemplary process of multimedia data transmission between terminal A and terminal B in one embodiment;
  • FIG. 17 is a structural block diagram of a data processing apparatus in one embodiment;
  • FIG. 18 is a structural block diagram of a data processing apparatus in one embodiment;
  • FIG. 19 is a structural block diagram of a data processing apparatus in one embodiment;
  • FIG. 20 is a schematic diagram of the internal structure of an electronic device in one embodiment.
  • FIG. 1 is a schematic diagram of an implementation environment involved in a data processing method provided by an embodiment of the present application.
  • the implementation environment may include a sending end 110 and a receiving end 120, and communication between the sending end 110 and the receiving end 120 may be performed through a wired network or a wireless network.
  • the transmitting end 110 and the receiving end 120 may be, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices.
  • In one embodiment, the sending end 110 may obtain a first video and a first audio of a target object, and obtain face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. The sending end 110 may perform speech recognition on the first audio to obtain voice text, and send the face key point information and the voice text to the receiving end 120 as multimedia feature information. The receiving end 120 may process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information and the voice corresponding to the voice text are presented.
  • FIG. 2 shows a flowchart of a data processing method provided by an embodiment of the present application, and the data processing method may be applied to the receiving end 120 shown in FIG. 1 .
  • the data processing method may include the following steps:
  • Step 201 The receiving end receives the multimedia feature information sent by the transmitting end.
  • In one embodiment, the sending end can record video of the target object to obtain the first video and record the sound of the target object to obtain the first audio, and the sending end extracts multimedia feature information from the first video and the first audio.
  • The multimedia feature information includes the face key point information obtained by the sending end from the first video of the target object and the voice text obtained by the sending end performing speech recognition on the first audio of the target object; the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video.
  • The face key points of the target object may be, for example, key points corresponding to one or more facial regions of the target object, such as the eyebrows, eyes, nose, and mouth; see FIG. 3, which is a schematic diagram of exemplary face key points of a target object. It should be noted that the embodiments of the present application do not specifically limit the number of face key points or the facial regions indicated by them.
  • The sending end can use a key point detection algorithm to perform face key point detection on each video frame in the first video to obtain the coordinates of the target object's face key points in each video frame, and can use these coordinates as the face key point information.
  • The key point detection algorithm can be, for example, CPR (Cascaded Pose Regression) or AAM (Active Appearance Model), and so on.
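  • The following is a minimal, illustrative sketch of this per-frame extraction, assuming a hypothetical detector object whose detect() method wraps whichever key point detection algorithm is used; only the OpenCV video-reading calls are real APIs.

```python
# Illustrative sketch only: extract face key point information from a video.
# "detector" is a hypothetical stand-in for any face key point detector
# (e.g. one based on cascaded pose regression or an active appearance model).
from typing import Dict, List, Tuple

import cv2  # OpenCV, used here only to read video frames


def extract_face_keypoint_info(video_path: str, detector) -> List[Dict]:
    """Return, per frame, the coordinates of each face key point."""
    keypoint_info: List[Dict] = []
    capture = cv2.VideoCapture(video_path)
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of the first video
            break
        # detector.detect is assumed to return {keypoint_id: (x, y)}, e.g.
        # {"nose": (412.0, 305.5), "left_eye": (380.2, 270.1), ...}
        coords: Dict[str, Tuple[float, float]] = detector.detect(frame)
        keypoint_info.append({"frame_index": frame_index, "keypoints": coords})
        frame_index += 1
    capture.release()
    return keypoint_info
```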
  • The sending end performs speech recognition on the first audio, for example by inputting the first audio into a speech recognition model to obtain the voice text corresponding to the first audio; the speech recognition model can be, for example, an Automatic Speech Recognition (ASR) model.
  • In this way, the sending end obtains the face key point information and the voice text based on the first video and the first audio, and sends them to the receiving end as multimedia feature information.
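  • A sketch of the whole sending-end pipeline, under the same assumptions: speech_to_text() stands in for whatever ASR model is used, and JSON is merely one possible serialization of the multimedia feature information.

```python
# Illustrative sending-end pipeline: key points plus voice text, not raw media.
import json


def speech_to_text(audio_path: str) -> str:
    """Hypothetical ASR call; replace with a real speech recognition model."""
    raise NotImplementedError


def build_multimedia_feature_info(video_path: str, audio_path: str, detector) -> bytes:
    feature_info = {
        # reuses the extraction helper sketched above
        "face_keypoint_info": extract_face_keypoint_info(video_path, detector),
        "voice_text": speech_to_text(audio_path),
    }
    # This serialized feature information is what travels over the network
    # instead of the original first video and first audio.
    return json.dumps(feature_info).encode("utf-8")
```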
  • Step 202 The receiving end processes the multimedia feature information based on the pre-obtained biometric feature information to obtain target multimedia data.
  • The biometric information may be pre-acquired by the receiving end: for example, it may be acquired in advance from the sending end, preset when the receiving end leaves the factory, or obtained in advance from a server; the manner in which the receiving end obtains the biometric information is not specifically limited here.
  • the biometric information may include a face model and a timbre feature
  • the face model may be a three-dimensional face model, or a two-dimensional face model
  • The three-dimensional face model may be obtained by performing three-dimensional modeling on a two-dimensional face image, and the two-dimensional face model may be a two-dimensional face image.
  • the two-dimensional face image may be a two-dimensional face image of the target object or a two-dimensional face image of another user.
  • The timbre feature may be a timbre feature of the target object extracted from audio data of the target object, or a timbre feature of another user extracted from that user's audio data.
  • The receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data; for example, the receiving end processes the face key point information based on the face model to obtain a second video, and processes the voice text based on the timbre feature to obtain a second audio.
  • In one implementation, the receiving end can detect whether the coordinates of the face key points in the face key point information sent by the sending end are two-dimensional or three-dimensional coordinates. If they are three-dimensional coordinates, the receiving end first converts them into two-dimensional coordinates and then drives and renders the two-dimensional face model based on the converted two-dimensional coordinates of the face key points to obtain the second video; if they are two-dimensional coordinates, the receiving end directly drives and renders the two-dimensional face model based on them to obtain the second video.
  • the receiving end may also obtain a three-dimensional face model and a two-dimensional face model in advance.
  • In this case, the receiving end can likewise detect whether the coordinates of the face key points in the face key point information sent by the sending end are two-dimensional or three-dimensional coordinates. If they are three-dimensional coordinates, the receiving end selects the three-dimensional face model, drives it based on the three-dimensional coordinates of the face key points, projects the driven three-dimensional face model onto a two-dimensional plane, and renders it to obtain the second video; if they are two-dimensional coordinates, the receiving end selects the two-dimensional face model and drives and renders it based on the two-dimensional coordinates of the face key points to obtain the second video.
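  • A sketch of this dispatch, assuming hypothetical face-model objects exposing drive(), render(), and project_and_render() methods; the patent does not prescribe any particular API.

```python
# Choose the 2D or 3D face model by the dimensionality of the coordinates.
def render_second_video(keypoint_info, face_model_2d, face_model_3d):
    frames = []
    for frame_info in keypoint_info:
        coords = frame_info["keypoints"]
        sample = next(iter(coords.values()))  # one key point's coordinates
        if len(sample) == 3:
            driven = face_model_3d.drive(coords)        # move model key points
            frames.append(driven.project_and_render())  # 3D -> 2D plane, render
        else:
            driven = face_model_2d.drive(coords)        # 2D coordinates directly
            frames.append(driven.render())
    return frames
```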
  • Driving the face model based on the coordinates of the target object's face key points in each video frame can be understood as making the coordinates of the face model's key points follow the changes of the target object's face key points in each video frame.
  • Different coordinates of the face model's key points correspond to different expressions, such as smiling or anger.
  • The receiving end can fuse the timbre feature and the voice text, that is, use them to perform speech synthesis to obtain a second audio that, when played, has a sound matching the voice text and the timbre feature.
  • It should be noted that the receiving end only processes the received multimedia feature information to obtain the target multimedia data; when the target multimedia data is displayed, the dynamic expressions corresponding to the face key point information are presented, together with the voice corresponding to the voice text.
  • the face key point information is the feature information extracted from the first video
  • the voice text is the feature information extracted from the first audio
  • The data volume of the first video is much larger than that of the face key point information, and the data volume of the first audio is much larger than that of the voice text. Therefore, compared with directly sending the first video and the first audio, sending only the face key point information and the voice text significantly reduces the amount of data sent by the sending end, reduces the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, and improves the efficiency of data transmission.
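  • A back-of-envelope comparison (the numbers here are illustrative assumptions, not taken from the patent):

```python
# One uncompressed 720p RGB frame versus one frame's worth of key points.
raw_frame_bytes = 1280 * 720 * 3   # 2,764,800 bytes (~2.6 MiB)
keypoint_bytes = 68 * 2 * 4        # 68 key points x (x, y) x 4-byte float = 544 bytes
print(raw_frame_bytes // keypoint_bytes)  # ~5000x smaller per frame
```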
  • In one embodiment, after the sending end obtains the multimedia feature information from the first video and the first audio, it may also use a preset encoding rule to encode the multimedia feature information and send the encoded multimedia feature information to the receiving end.
  • The receiving end receives the encoded multimedia feature information, decodes it to obtain the multimedia feature information, and then processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data. By encoding the multimedia feature information in this way, its data volume can be further compressed, further reducing the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end.
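  • One possible encoding rule is sketched below; the patent leaves the preset encoding rule unspecified, so the binary layout here is an assumption made for illustration.

```python
# Pack key point coordinates into a compact binary layout, then deflate it.
import struct
import zlib


def encode_feature_info(keypoint_info, voice_text: str) -> bytes:
    payload = bytearray()
    for frame_info in keypoint_info:
        coords = frame_info["keypoints"]
        # frame index (uint32) + key point count (uint16)
        payload += struct.pack("<IH", frame_info["frame_index"], len(coords))
        for x, y in coords.values():
            payload += struct.pack("<ff", x, y)  # two float32 coordinates
    text_bytes = voice_text.encode("utf-8")
    payload += struct.pack("<I", len(text_bytes)) + text_bytes
    return zlib.compress(bytes(payload))


def decode_payload(data: bytes) -> bytes:
    # The receiving end inverts the steps: decompress, then unpack the layout.
    return zlib.decompress(data)
```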
  • In one embodiment, the multimedia feature information may further include skeleton joint point information obtained by the sending end from the first video of the target object; the skeleton joint point information includes the coordinates of the skeleton joint points of the target object in each video frame of the first video.
  • FIG. 4 is a schematic diagram of an exemplary skeleton joint point of a target object.
  • The 25 skeleton joint points include: nose 0, neck 1, right shoulder 2, right elbow 3, right wrist 4, left shoulder 5, left elbow 6, left wrist 7, sacrum 8, right waist 9, right knee 10, right ankle 11, left waist 12, left knee 13, left ankle 14, right eye 15, left eye 16, right ear 17, left ear 18, left toe one 19, left toe two 20, left heel 21, right toe one 22, right toe two 23, and right heel 24.
  • In practice, the skeleton joint points of the user may include some or all of the 25 skeleton joint points shown in FIG. 4; the embodiments of the present application do not specifically limit this.
  • the biometric information may also include a body posture model
  • the receiving end may obtain a portrait model by splicing the body posture model and the face model.
  • the receiving end drives the portrait model based on the face key point information and the skeleton joint point information, so that the portrait model changes with the coordinate changes of the face key points and the skeleton joint points of the target object in each video frame.
  • Different coordinates of the face key points correspond to different expressions of the portrait model, and different coordinates of the skeleton joint points correspond to different poses of the portrait model.
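  • A sketch of this splice-and-drive step, with hypothetical splice(), set_face_keypoints(), set_skeleton_joints(), and render() methods standing in for the behaviour described above:

```python
# Drive a spliced portrait model with face key points and skeleton joints.
def drive_portrait(face_model, body_posture_model, face_coords, joint_coords):
    portrait = face_model.splice(body_posture_model)  # face + body posture
    portrait.set_face_keypoints(face_coords)    # expression follows the face
    portrait.set_skeleton_joints(joint_coords)  # pose follows the skeleton
    return portrait.render()
```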
  • After the receiving end processes the multimedia feature information and obtains the target multimedia data, the receiving end can also display the target multimedia data.
  • any terminal can be used as a receiving end or a sending end.
  • FIG. 5 is an exemplary schematic diagram of data transmission of a video call between terminal A and terminal B.
  • As a sending end, terminal A can collect the first video and first audio of user A (that is, the target object), obtain face key point information from the first video, and perform speech recognition on the first audio to obtain voice text; the face key point information and the voice text serve as the multimedia feature information, where the face key point information includes the coordinates of user A's face key points in each video frame of the first video.
  • Terminal A sends the multimedia feature information corresponding to user A to terminal B, which acts as the receiving end. After receiving the multimedia feature information, terminal B processes user A's face key point information based on the pre-acquired face model corresponding to user A, and processes user A's voice text based on the pre-acquired timbre feature corresponding to user A, thereby obtaining target multimedia data corresponding to user A; the target multimedia data may include a second video and a second audio. Terminal B may display the obtained second video and second audio corresponding to user A in time synchronization.
  • terminal B may also perform the steps performed by the above-mentioned sending end to send the multimedia feature information corresponding to user B to terminal A.
  • Correspondingly, terminal A can display the second video and the second audio corresponding to user B (that is, the target multimedia data corresponding to user B) in time synchronization based on the multimedia feature information corresponding to user B, thereby realizing the video call process between terminal A and terminal B.
  • FIG. 5 only exemplarily shows two terminals (terminal A and terminal B), and in other embodiments, more terminals may also be included. Any terminal can send the extracted multimedia feature information to other terminals; any terminal can also receive corresponding multimedia feature information sent by other terminals, and display corresponding target multimedia data.
  • In this way, the video call process of multiple terminals can be realized by transmitting only multimedia feature information, without transmitting the original video and audio, which reduces excessive consumption of bandwidth resources during a multi-terminal video call and helps improve call quality.
  • the viewer terminal held by the viewer may be used as the receiving terminal in this embodiment of the present application, and the host terminal held by the anchor may be used as the transmitting terminal.
  • FIG. 6 is an exemplary schematic diagram of data transmission between a viewer terminal and a host terminal.
  • the host terminal may collect the first video and first audio of the host (ie, the target object).
  • The host terminal obtains face key point information from the first video and performs speech recognition on the first audio to obtain voice text; the face key point information and the voice text serve as the multimedia feature information, where the face key point information includes the coordinates of the host's face key points in each video frame of the first video.
  • the host terminal sends the multimedia feature information corresponding to the host to the viewer terminal, and the viewer terminal acts as a receiver.
  • After receiving the multimedia feature information, the viewer terminal processes the host's face key point information based on the pre-acquired face model corresponding to the host, and processes the host's voice text based on the pre-acquired timbre feature corresponding to the host, to obtain target multimedia data corresponding to the host; the target multimedia data may include a second video and a second audio.
  • the viewer terminal can display the second video and the second audio corresponding to the anchor in time synchronization.
  • In a live streaming scenario, different anchors can also co-stream ('Lianmai'), a common form of web live broadcast.
  • When co-streaming, different anchors can interact, and the co-streaming interface of each participating anchor's host terminal, as well as of the viewer terminals held by the audience, can simultaneously display the multimedia data corresponding to each co-streaming anchor.
  • In this case, the host terminal A of anchor A and the host terminal B of anchor B can each act as a sending end, and host terminal A, host terminal B, and the viewer terminal can each act as a receiving end.
  • FIG. 7 is an exemplary schematic diagram of data transmission in connection between host A and host B.
  • the anchor terminal A can collect the first video and the first audio of the anchor A (that is, the target object).
  • The face key point information and the voice text serve as the multimedia feature information, where the face key point information includes the coordinates of anchor A's face key points in each video frame of the first video.
  • the host terminal A sends the multimedia feature information corresponding to the host A to the viewer terminal and the host terminal B.
  • the host terminal B can also send the multimedia feature information corresponding to the host B to the viewer terminal and the host terminal A in a similar manner.
  • After receiving the multimedia feature information corresponding to anchor A, the viewer terminal processes anchor A's face key point information based on the pre-acquired face model corresponding to anchor A, and processes anchor A's voice text based on the pre-acquired timbre feature corresponding to anchor A, to obtain the target multimedia data corresponding to anchor A. Similarly, after receiving the multimedia feature information corresponding to anchor B, the viewer terminal can obtain the target multimedia data corresponding to anchor B through a similar process.
  • the viewer terminal can simultaneously display the target multimedia data corresponding to the anchor A and the target multimedia data corresponding to the anchor B on the viewer terminal.
  • Similar to the processing at the viewer terminal, after host terminal A receives the multimedia feature information corresponding to anchor B, it can process anchor B's face key point information based on the pre-acquired face model corresponding to anchor B, and process anchor B's voice text based on the pre-acquired timbre feature corresponding to anchor B, thereby obtaining the target multimedia data corresponding to anchor B and displaying it on host terminal A.
  • Of course, host terminal A may also display the multimedia data of anchor A and the target multimedia data corresponding to anchor B at the same time.
  • Similarly, host terminal B may also display the target multimedia data corresponding to anchor A and the multimedia data of anchor B.
  • In this way, the live video process can be realized by transmitting only the anchors' multimedia feature information, without transmitting original multimedia data between each host terminal and the viewer terminals, which reduces excessive consumption of bandwidth resources during live streaming and helps improve its communication quality.
  • To sum up, in this embodiment the multimedia feature information includes the face key point information obtained from the first video of the target object and the voice text obtained by performing voice recognition on the first audio of the target object; the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. Then, based on the pre-acquired biometric information, which includes a face model and a timbre feature, the multimedia feature information is processed to obtain the target multimedia data; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information and the voice corresponding to the voice text are presented.
  • It can be seen that the sending end does not send the first video and the first audio, but sends only the face key point information and the voice text; the receiving end then processes these to obtain target multimedia data that presents the dynamic expression corresponding to the face key point information and the voice corresponding to the voice text. Because the data volume of the first video is far greater than that of the face key point information, and the data volume of the first audio is far greater than that of the voice text, sending only the face key point information and the voice text significantly reduces the amount of data sent by the sending end, reduces the network bandwidth consumed by multimedia data transmission between terminals, and helps improve the efficiency of data transmission.
  • In one embodiment, this embodiment relates to how the receiving end processes the multimedia feature information based on the pre-acquired biometric information when the target multimedia data includes the second video.
  • On the basis of the above embodiment, step 202 may include step 801 shown in FIG. 8:
  • Step 801 the receiving end transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame to obtain a second video.
  • the face model may be a three-dimensional face model or a two-dimensional face model.
  • Here, the implementation process is described by taking the face model being a two-dimensional face model as an example. It can be understood that this example does not constitute a limitation on the type of the face model.
  • The face model can be a two-dimensional face image of the target object, and it includes the coordinates of multiple face key points, which can be key points of the eyebrows, eyes, nose, mouth, and other facial regions.
  • After receiving the multimedia feature information sent by the sending end, the receiving end transforms the coordinates of the corresponding face key points in the face model according to the coordinates of the target object's face key points in each video frame.
  • That is, the receiving end can transform the coordinates of each face key point in the face model into the coordinates of the corresponding face key point in the video frame; for example, the coordinates of the face key point "nose" in the face model are transformed into the coordinates of the face key point "nose" in the video frame.
  • In this way, the face model takes on the same expression as the corresponding video frame.
  • the following describes the process of how the receiving end transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame.
  • The receiving end can perform the following step A1 to realize the process of transforming the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame:
  • Step A1: based on the time sequence order of the video frames in the first video, the receiving end sequentially transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame.
  • The face key point information may further include the time sequence order of each video frame in the first video and the identification of each face key point.
  • According to the sequence of the video frames from front to back in the first video, the receiving end can transform the coordinates of the face key points in the face model according to the coordinates of the face key points in each video frame in turn.
  • Specifically, for the first video frame, the receiving end can determine, according to the identification of each face key point in the face key point information, the face key point in the face model corresponding to each identification; the receiving end then transforms the coordinates of these face key points in the face model into the coordinates of the corresponding face key points in the video frame, so that the face model has the same expression as that video frame. Then, for the second video frame, the third video frame, and so on in the first video, the receiving end performs the same steps as for the first video frame, so that the face model is driven in frame-sequence order; the driven face model is rendered to obtain the second video.
  • In this embodiment, the second video can be generated simply by transforming the coordinates of the face key points in the face model. The operation mainly involves processing the coordinates of the face key points and does not need to process other parameters of the face model, which helps reduce the amount of computation at the receiving end and improve its performance.
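  • A sketch of step A1, assuming the face model exposes a set_coord() method keyed by key point identification and a renderer turns each driven model state into a video frame:

```python
# Transform the face model's key point coordinates frame by frame, in the
# time-sequence order of the first video, matching key points by identifier.
def transform_in_sequence(face_model, keypoint_info, renderer):
    frames = []
    for frame_info in sorted(keypoint_info, key=lambda f: f["frame_index"]):
        for keypoint_id, coords in frame_info["keypoints"].items():
            # e.g. move the model's "nose" to the coordinates of "nose"
            # in this video frame
            face_model.set_coord(keypoint_id, coords)
        frames.append(renderer.render(face_model))  # one second-video frame
    return frames
```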
  • In one embodiment, this embodiment relates to how the receiving end processes the multimedia feature information based on the pre-acquired biometric information when the target multimedia data includes the second audio.
  • On the basis of the above embodiment, step 202 may include step 901 shown in FIG. 9:
  • Step 901 the receiving end inputs the timbre feature and the speech text into the speech synthesis model to obtain the second audio.
  • The timbre feature may be obtained by performing Fourier analysis on the audio data of the target object to obtain a spectrum, and extracting spectral features from it.
  • The receiving end inputs the timbre feature and the speech text into a text-to-speech (TTS) speech synthesis model to obtain the second audio output by the model.
  • In this embodiment, the receiving end processes the speech text according to the pre-acquired timbre feature, so that the synthesized second audio has the characteristics represented by the timbre feature. The receiving end therefore does not need to obtain the first audio collected by the sending end through its microphone; from the voice text extracted from the first audio and the timbre feature alone, it can synthesize a second audio with the same timbre as the first audio. While ensuring the fidelity of the second audio, this greatly reduces the amount of data transmitted between the receiving end and the sending end, which helps improve the efficiency of multimedia data transmission.
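  • A sketch of this synthesis step; the tts_model interface is an assumption, but any TTS model conditioned on a speaker or timbre embedding has this general shape:

```python
# Fuse the pre-acquired timbre feature with the received voice text.
def synthesize_second_audio(tts_model, voice_text: str, timbre_feature):
    # timbre_feature may be a spectral feature extracted beforehand, e.g.
    # via Fourier analysis of the target object's audio data
    return tts_model.synthesize(text=voice_text, timbre=timbre_feature)
```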
  • In one embodiment, this embodiment relates to the process of how the receiving end acquires the biometric information. As shown in FIG. 10, before step 202, the data processing method of this embodiment further includes step 203 and step 204:
  • Step 203: the receiving end acquires the terminal identification corresponding to the sending end and, according to the terminal identification, detects whether target biometric information of the target object corresponding to the terminal identification is stored.
  • the target biometric information includes the target face model and the target timbre feature.
  • The sending end may collect a face image of the target object and determine it as the target face model of the target object; the sending end may also perform feature extraction on audio data of the target object to obtain the target timbre feature of the target object.
  • The receiving end can obtain the terminal identification corresponding to the sending end, and request the target biometric information of the target object from the sending end according to the terminal identification; if the target biometric information of the target object is received, the receiving end can store it in association with the terminal identification.
  • In this way, after receiving the multimedia feature information sent by the sending end, the receiving end can detect, according to the terminal identification of the sending end, whether the target biometric information of the target object corresponding to that terminal identification is stored.
  • Step 204 if the target biometric information is stored, the receiving end uses the target biometric information as the biometric information.
  • If the target biometric information of the target object is stored, it means that the receiving end has already acquired it from the sending end during a previous multimedia data transmission. The receiving end can then process the multimedia feature information sent by the sending end based on the target biometric information of the target object.
  • In this way, the receiving end can faithfully restore the real multimedia data of the target object at the sending end based on the target biometric information, ensuring the effect of the multimedia data transmission.
  • In one embodiment, the data processing method further includes step 205, step 206, step 207, and step 208:
  • Step 205 if the target biometric information is not stored, the receiving end detects whether the current data transmission rate is greater than the transmission rate threshold.
  • the receiving end determines whether to request the target biometric information from the sending end according to the current network quality.
  • Specifically, the receiving end detects whether the current data transmission rate is greater than a transmission rate threshold, and the transmission rate threshold can be set as required in implementation.
  • Step 206: if the current data transmission rate is greater than the transmission rate threshold, the receiving end sends an acquisition request to the sending end, where the acquisition request is used to request the sending end to return the target biometric information.
  • If the current data transmission rate is greater than the transmission rate threshold, it indicates that the current network quality of the receiving end is good, and the receiving end requests the target biometric information from the sending end.
  • Step 207 The receiving end receives the target biometric information sent by the transmitting end based on the acquisition request, and uses the target biometric information as the biometric information.
  • Step 208 if the current data transmission rate is less than or equal to the transmission rate threshold, the receiving end acquires the pre-stored general biometric information, and uses the general biometric information as the biometric information.
  • the general biometric information includes a general face model and a general timbre feature.
  • the general biometric information can be any preset user's face image and timbre features.
  • The receiving end can also process the multimedia feature information based on the locally stored general biometric information to obtain the target multimedia data.
  • Thus, even if the receiving end cannot obtain the target biometric information of the target object, multimedia data transmission can still be realized, reducing the demand of multimedia data transmission on network bandwidth and saving network bandwidth resources.
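  • The acquisition logic of steps 203 to 208 can be summarized as below; the store, network probe, and request call are hypothetical stand-ins for the described behaviour, and the threshold value is an arbitrary example.

```python
RATE_THRESHOLD_BPS = 1_000_000  # example transmission rate threshold


def get_biometric_info(terminal_id, store, network, sender):
    target = store.get(terminal_id)       # step 203: look up by terminal id
    if target is not None:
        return target                     # step 204: use stored target info
    if network.current_rate_bps() > RATE_THRESHOLD_BPS:
        target = sender.request_biometric_info()  # steps 206-207: ask sender
        store.put(terminal_id, target)    # keep for later transmissions
        return target
    return store.get_general()            # step 208: general fallback
```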
  • FIG. 11 shows a flowchart of a data processing method provided by an embodiment of the present application; the data processing method may be applied to the sending end 110 shown in FIG. 1.
  • the data processing method may include the following steps:
  • Step 111 the sending end obtains the first video and the first audio of the target object, and obtains face key point information from the first video.
  • The sending end can record video of the target object to obtain the first video and record the target object's sound to obtain the first audio; the sending end then obtains face key point information from the first video of the target object, where the face key point information includes the coordinates of the target object's face key points in each video frame of the first video.
  • The first video may include multiple video frames; for each video frame in the first video, the sending end performs face key point detection on the video frame to obtain the coordinates of the target object's face key points in that video frame, and the sending end generates the face key point information based on the coordinates of the target object's face key points in each video frame.
  • the sending end can input each video frame into the face key point detection model to obtain the coordinates of each face key point in the video frame output by the face key point detection model, wherein the face key point detection model It can be any pre-trained deep learning model for facial keypoint detection.
  • the sender takes the coordinates of the face key points in each video frame as the face key point information.
  • In one embodiment, the sending end may also add a corresponding identification to the coordinates of each face key point in the face key point information, together with the time sequence order of each video frame in the first video, and use the augmented face key point information as the final face key point information.
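  • One possible shape of the final face key point information, assumed here purely for illustration, with an identification per key point and each frame's time-sequence order:

```python
face_keypoint_info = [
    {
        "frame_index": 0,  # time-sequence order within the first video
        "keypoints": {
            "nose": (412.0, 305.5),       # identification -> coordinates
            "left_eye": (380.2, 270.1),
            "right_eye": (444.9, 271.3),
        },
    },
    # ... one entry per video frame of the first video
]
```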
  • Step 112 the sending end performs speech recognition on the first audio to obtain speech text, and sends the face key point information and the speech text to the receiving end as multimedia feature information.
  • the sending end may input the first audio into an automatic speech recognition model (Automatic Speech Recognition, ASR) to obtain the speech text output by the automatic speech recognition model.
  • The sending end sends the face key point information and the voice text to the receiving end as multimedia feature information; the multimedia feature information is used by the receiving end to process it based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature.
  • For the process in which the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data, reference may be made to the above embodiments; details are not repeated here.
  • a data processing method is provided, and the data processing method can be applied in the implementation environment shown in FIG. 1 .
  • the method may include the following steps:
  • Step 121 the sending end obtains the first video and the first audio of the target object, and obtains face key point information from the first video.
  • the face key point information includes the coordinates of the face key point of the target object in each video frame of the first video.
  • step 122 the transmitting end performs speech recognition on the first audio to obtain speech text, and sends the face key point information and the speech text to the receiving end as multimedia feature information.
  • Step 123 The receiving end processes the multimedia feature information based on the pre-obtained biometric feature information to obtain target multimedia data.
  • the biometric information includes a face model and a timbre feature.
  • the target multimedia data is displayed, the dynamic expression corresponding to the key point information of the face is presented, and the voice corresponding to the voice text is presented.
  • The specific limitations and beneficial effects of step 121, step 122, and step 123 in this embodiment are similar to those in the foregoing embodiments; reference may be made to the description above, which is not repeated here.
  • Step a: terminal A is preset with a general face model and a general timbre feature; terminal A uses the face image of user A as the target face model of user A, and extracts the target timbre feature of user A from user A's audio data.
  • FIG. 13 is an exemplary schematic diagram of terminal A creating a target face model of user A and a target timbre feature of user A.
  • Step b: terminal B is preset with a general face model and a general timbre feature; terminal B uses the face image of user B as the target face model of user B, and extracts the target timbre feature of user B from user B's audio data.
  • FIG. 14 is an exemplary schematic diagram of terminal B creating the target face model of user B and the target timbre feature of user B.
  • Step c: if terminal A and terminal B need to transmit multimedia data, and terminal A does not store user B's target face model and target timbre feature while terminal B does not store user A's target face model and target timbre feature, then
  • terminal A requests user B's target face model and target timbre feature from terminal B, and
  • terminal B requests user A's target face model and target timbre feature from terminal A, thereby completing the exchange of target biometric information.
  • FIG. 15 is a schematic diagram of an exemplary exchange of target biometric information between terminal A and terminal B.
  • Step d: terminal A collects the first video of user A through a camera and inputs each video frame of user A's first video into the face key point detection model to obtain user A's face key point information; terminal A collects the first audio of user A through a microphone and inputs it into the automatic speech recognition model to obtain user A's voice text.
  • Step e: terminal B collects the first video of user B through a camera and inputs each video frame of user B's first video into the face key point detection model to obtain user B's face key point information; terminal B collects the first audio of user B through a microphone and inputs it into the automatic speech recognition model to obtain user B's voice text.
  • Step f: terminal A sends user A's face key point information and user A's voice text to terminal B as user A's multimedia feature information.
  • Step g: terminal B sends user B's face key point information and user B's voice text to terminal A as user B's multimedia feature information.
  • Step h: terminal A processes user B's face key point information based on user B's target face model to obtain user B's second video; terminal A processes user B's voice text based on user B's target timbre feature to obtain user B's second audio; terminal A displays user B's second video and second audio as user B's target multimedia data.
  • Step i: terminal B processes user A's face key point information based on user A's target face model to obtain user A's second video; terminal B processes user A's voice text based on user A's target timbre feature to obtain user A's second audio; terminal B displays user A's second video and second audio as user A's target multimedia data.
  • FIG. 16 is a schematic diagram of an exemplary process of terminal A and terminal B performing multimedia data transmission.
  • It should be understood that although the steps in the flowcharts of the above embodiments are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least a part of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least a part of the sub-steps or stages of other steps.
  • FIG. 17 is a structural block diagram of a data processing apparatus according to an embodiment. As shown in Figure 17, the device includes:
  • The receiving module 1701 is configured to receive the multimedia feature information sent by the sending end, where the multimedia feature information includes face key point information obtained from the first video of the target object and voice text obtained by performing voice recognition on the first audio of the target object,
  • and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;
  • The processing module 1702 is configured to process the multimedia feature information based on the pre-acquired biometric information to obtain target multimedia data; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the voice text is presented.
  • the target multimedia data includes a second video
  • the processing module 1702 includes:
  • The coordinate transformation unit 1702a is configured to transform the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each of the video frames, so as to obtain the second video.
  • In one embodiment, the coordinate transformation unit 1702a is specifically configured to, based on the time sequence order of the video frames in the first video, sequentially transform the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each of the video frames.
  • the target multimedia data includes second audio; the processing module 1702 is specifically configured to input the timbre feature and the speech text into the speech synthesis model to obtain the second audio.
  • the apparatus further includes:
  • a first detection module, configured to obtain the terminal identification corresponding to the sending end and detect, according to the terminal identification, whether target biometric information of the target object corresponding to the terminal identification is stored, where the target biometric information includes the target face model and the target timbre feature;
  • a first determining module configured to use the target biometric information as the biometric information if the target biometric information is stored
  • a second detection module configured to detect whether the current data transmission rate is greater than the transmission rate threshold if the target biometric information is not stored
  • a request module configured to send an acquisition request to the sender if the current data transmission rate is greater than the transmission rate threshold, where the acquisition request is used to request the sender to return the target biometric information
  • a second determining module configured to receive the target biometric information sent by the sender based on the acquisition request, and use the target biometric information as the biometric information.
  • a third determining module, configured to acquire pre-stored general biometric information if the current data transmission rate is less than or equal to the transmission rate threshold, and use the general biometric information as the biometric information, where the general biometric information includes a general face model and a general timbre feature.
  • each module in the above data processing apparatus is only used for illustration. In other embodiments, the data processing apparatus may be divided into different modules as required to complete all or part of the functions of the above data processing apparatus.
  • Each module in the above-mentioned data processing apparatus can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • FIG. 19 is a structural block diagram of a data processing apparatus according to an embodiment. As shown in FIG. 19, the apparatus includes:
  • the obtaining module 1901 is configured to obtain the first video and first audio of the target object, and to obtain face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;
  • the processing module 1902 is configured to perform speech recognition on the first audio to obtain speech text, and to send the face key point information and the speech text to the receiving end as multimedia feature information;
  • the multimedia feature information is used by the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information is presented, and the speech corresponding to the speech text is presented.
  • the obtaining module 1901 is specifically configured to perform face key point detection on each video frame in the first video to obtain the coordinates of the face key points of the target object in that video frame, and to generate the face key point information based on the coordinates of the face key points of the target object in each video frame.
  • The division of modules in the above data processing apparatus is for illustration only. In other embodiments, the data processing apparatus may be divided into different modules as required to complete all or part of the functions of the apparatus.
  • Each module in the above-mentioned data processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a data processing system includes a transmitter and a receiver;
  • the receiving end is configured to execute the above method steps for the receiving end, so as to realize the above method embodiments for the receiving end.
  • the sending end is configured to execute the above method steps for the sending end, so as to realize the above method embodiments for the sending end.
  • FIG. 20 is a schematic diagram of the internal structure of an electronic device in one embodiment.
  • the electronic device includes a processor and a memory connected by a system bus.
  • the processor is used to provide computing and control capabilities to support the operation of the entire electronic device.
  • the memory may include non-volatile storage media and internal memory.
  • the nonvolatile storage medium stores an operating system and a computer program.
  • the computer program can be executed by the processor to implement a data processing method provided by the following embodiments.
  • The internal memory provides a cached execution environment for the operating system and the computer program in the non-volatile storage medium.
  • the electronic device may be any terminal device such as a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sale) terminal, a vehicle-mounted computer, a wearable device, and the like.
  • each module in the data processing apparatus provided in the embodiments of the present application may be in the form of a computer program.
  • the computer program can be run on a terminal or server.
  • the program modules constituted by the computer program can be stored in the memory of the electronic device. When the computer program is executed by the processor, the steps of the methods described in the embodiments of the present application are implemented.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • One or more non-transitory computer-readable storage media containing computer-executable instructions, when executed by one or more processors, cause the processors to perform the steps of a data processing method.
  • Embodiments of the present application also provide a computer program product containing instructions, which, when run on a computer, cause the computer to execute the steps of the data processing method.
  • Any reference to a memory, storage, database, or other medium as used herein may include non-volatile and/or volatile memory.
  • the non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory.
  • Volatile memory may include random access memory (RAM), which acts as external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are a data processing method, apparatus and system, and an electronic device and a readable storage medium. The method comprises: receiving multimedia feature information, which is sent by a sending end; and processing the multimedia feature information on the basis of pre-acquired biometric information, so as to obtain target multimedia data. By means of the technical solution provided in the embodiments of the present application, the consumption of network bandwidth resources in multimedia data transmission between terminals can be reduced.

Description

Data processing method, apparatus, system, electronic device and readable storage medium

Technical Field

The present application relates to the technical field of data processing, and in particular, to a data processing method, apparatus, system, electronic device and readable storage medium.

Background

With the development of mobile communication technology and the popularization of intelligent terminals, the forms of data transmitted between terminals are becoming increasingly diverse, and multimedia data is a typical data form. Multimedia refers to a combination of multiple media, generally including sound, images, and other media forms.

Taking multimedia data transmitted between terminals as an example, the sending end may send multimedia data such as images, sounds, and videos to the receiving end. For example, user A's terminal can collect multimedia data such as images, sounds, or videos of user A and send the collected multimedia data to other users' terminals; user A's terminal can also receive other users' multimedia data sent by those users' terminals.

However, transmitting large amounts of multimedia data between terminals places a heavy load on network bandwidth. How to reduce the consumption of network bandwidth resources by the transmission of multimedia data between terminals has therefore become an urgent problem to be solved.

Summary of the Invention

Embodiments of the present application provide a data processing method, apparatus, system, electronic device, and readable storage medium, which can reduce the consumption of network bandwidth resources by the transmission of multimedia data between terminals.
In a first aspect, a data processing method is provided, the method comprising:

receiving multimedia feature information sent by a sending end, where the multimedia feature information includes face key point information obtained from a first video of a target object and speech text obtained by performing speech recognition on a first audio of the target object, and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;

processing the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the speech text is presented.
In a second aspect, a data processing method is provided, the method comprising:

obtaining a first video and a first audio of a target object, and obtaining face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;

performing speech recognition on the first audio to obtain speech text, and sending the face key point information and the speech text to a receiving end as multimedia feature information;

where the multimedia feature information is used by the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and a voice corresponding to the speech text is presented.
In a third aspect, a data processing method is provided, the method comprising:

a sending end obtaining a first video and a first audio of a target object, and obtaining face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;

the sending end performing speech recognition on the first audio to obtain speech text, and sending the face key point information and the speech text to a receiving end as multimedia feature information;

the receiving end processing the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the speech text is presented.
In a fourth aspect, a data processing apparatus is provided, the apparatus comprising:

a receiving module configured to receive multimedia feature information sent by a sending end, where the multimedia feature information includes face key point information obtained from a first video of a target object and speech text obtained by performing speech recognition on a first audio of the target object, and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;

a processing module configured to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the speech text is presented.
In a fifth aspect, a data processing apparatus is provided, the apparatus comprising:

an obtaining module configured to obtain a first video and a first audio of a target object and to obtain face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video;

a processing module configured to perform speech recognition on the first audio to obtain speech text, and to send the face key point information and the speech text to a receiving end as multimedia feature information;

where the multimedia feature information is used by the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and a voice corresponding to the speech text is presented.
In a sixth aspect, a data processing system is provided, the system including a sending end and a receiving end;

the receiving end is configured to execute the data processing method described in the first aspect;

the sending end is configured to execute the data processing method described in the second aspect.
In a seventh aspect, an electronic device is provided, comprising a memory and a processor, where a computer program is stored in the memory, and when the computer program is executed by the processor, the processor is caused to execute the steps of the method described in the first aspect or the second aspect.

In an eighth aspect, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the method described in the first aspect or the second aspect are implemented.
The beneficial effects brought by the technical solutions provided in the embodiments of the present application include at least the following:

The receiving end receives multimedia feature information sent by the sending end, where the multimedia feature information includes face key point information obtained from the first video of the target object and speech text obtained by performing speech recognition on the first audio of the target object, and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. The receiving end then processes the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and a voice corresponding to the speech text is presented. It can be seen that in the embodiments of the present application the sending end does not have to send the original first video and first audio, but only the face key point information and the speech text; the receiving end then processes the face key point information and the speech text to obtain target multimedia data that presents the dynamic expression corresponding to the face key point information and the voice corresponding to the speech text. Since the data volume of the first video is far greater than that of the face key point information, and the data volume of the first audio is far greater than that of the speech text, sending only the face key point information and the speech text, rather than the first video and first audio themselves, significantly reduces the amount of data sent by the sending end, reduces the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, and helps improve the efficiency of data transmission.
Brief Description of the Drawings

FIG. 1 is a diagram of an application environment of a data processing method in one embodiment;

FIG. 2 is a flowchart of a data processing method in one embodiment;

FIG. 3 is a schematic diagram of exemplary face key points of a target object;

FIG. 4 is a schematic diagram of exemplary skeleton joint points of a target object in one embodiment;

FIG. 5 is a schematic diagram of an exemplary data transmission process of a video call between terminal A and terminal B in one embodiment;

FIG. 6 is a schematic diagram of an exemplary data transmission process between a viewer terminal and a host terminal in one embodiment;

FIG. 7 is a schematic diagram of an exemplary co-streaming data transmission process between host A and host B in one embodiment;

FIG. 8 is a flowchart of a data processing method in another embodiment;

FIG. 9 is a flowchart of a data processing method in another embodiment;

FIG. 10 is a flowchart of obtaining biometric information in one embodiment;

FIG. 11 is a flowchart of a data processing method in one embodiment;

FIG. 12 is a flowchart of a data processing method in one embodiment;

FIG. 13 is a schematic diagram of terminal A creating a target face model and a target timbre feature of user A in one embodiment;

FIG. 14 is a schematic diagram of terminal B creating a target face model and a target timbre feature of user B in one embodiment;

FIG. 15 is a schematic diagram of an exemplary exchange of target biometric information between terminal A and terminal B in one embodiment;

FIG. 16 is a schematic diagram of an exemplary process of multimedia data transmission between terminal A and terminal B in one embodiment;

FIG. 17 is a structural block diagram of a data processing apparatus in one embodiment;

FIG. 18 is a structural block diagram of a data processing apparatus in one embodiment;

FIG. 19 is a structural block diagram of a data processing apparatus in one embodiment;

FIG. 20 is a schematic diagram of the internal structure of an electronic device in one embodiment.
Detailed Description of the Embodiments

In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.

The implementation environment involved in the data processing method provided by the embodiments of the present application is briefly described below.

FIG. 1 is a schematic diagram of an implementation environment involved in a data processing method provided by an embodiment of the present application. As shown in FIG. 1, the implementation environment may include a sending end 110 and a receiving end 120, and the sending end 110 and the receiving end 120 may communicate through a wired or wireless network.
The sending end 110 and the receiving end 120 may be, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.

In the implementation environment shown in FIG. 1, the sending end 110 may obtain the first video and the first audio of the target object and obtain face key point information from the first video, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. The sending end 110 may perform speech recognition on the first audio to obtain speech text, and send the face key point information and the speech text to the receiving end 120 as multimedia feature information. The receiving end 120 may process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, where the biometric information includes a face model and a timbre feature; when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented, and a voice corresponding to the speech text is presented.
Please refer to FIG. 2, which shows a flowchart of a data processing method provided by an embodiment of the present application; the data processing method may be applied to the receiving end 120 shown in FIG. 1. As shown in FIG. 2, the data processing method may include the following steps:

Step 201: the receiving end receives the multimedia feature information sent by the sending end.

In the process of multimedia data transmission between the sending end and the receiving end, the sending end may record a video of the target object to obtain the first video and record the sound of the target object to obtain the first audio, and then extract multimedia feature information from the first video and the first audio. The multimedia feature information includes face key point information obtained by the sending end from the first video of the target object and speech text obtained by the sending end performing speech recognition on the first audio of the target object, where the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video.
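As a minimal sketch (not part of the original disclosure), the multimedia feature information could be represented roughly as follows; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class MultimediaFeatureInfo:
    # Frame index -> {face key point identifier -> (x, y) or (x, y, z)}.
    face_keypoints: Dict[int, Dict[int, Tuple[float, ...]]]
    # Speech text recognized from the first audio.
    speech_text: str
```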
The process by which the sending end obtains the face key point information and the speech text is introduced below.

The face key points of the target object may be, for example, several key points corresponding to one or more facial regions such as the eyebrows, eyes, nose, and mouth of the target object; see FIG. 3, which is a schematic diagram of exemplary face key points of a target object. It should be noted that the embodiments of the present application do not specifically limit the number of face key points or the facial regions indicated by the face key points.

The sending end may use a key point detection algorithm to perform face key point detection on each video frame in the first video to obtain the coordinates of the face key points of the target object in each video frame, and use these coordinates as the face key point information. The key point detection algorithm may be, for example, CPR (Cascaded Pose Regression) or AAM (Active Appearance Model), and so on.
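A minimal Python sketch of this per-frame detection loop is shown below, assuming OpenCV is available for frame decoding and abstracting the detector (for example a CPR- or AAM-based model) behind a hypothetical `detect_keypoints` callable.

```python
import cv2  # OpenCV, assumed available for decoding the first video

def extract_face_keypoint_info(video_path, detect_keypoints):
    """Run a face key point detector on every frame of the first video and
    collect the coordinates as the face key point information."""
    keypoint_info = {}
    cap = cv2.VideoCapture(video_path)
    frame_index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # `detect_keypoints` is a hypothetical callable returning
        # {keypoint_id: (x, y)} for the target object's face in this frame.
        keypoint_info[frame_index] = detect_keypoints(frame)
        frame_index += 1
    cap.release()
    return keypoint_info
```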
To perform speech recognition on the first audio, the sending end may input the first audio into a speech recognition model to obtain the speech text corresponding to the first audio; the speech recognition model may be, for example, an automatic speech recognition (ASR) model.
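Putting the two extraction steps together, a sender-side sketch might look like the following; the `transcribe` method stands in for whichever ASR model is used and is an assumed interface, and `MultimediaFeatureInfo` and `extract_face_keypoint_info` refer to the sketches above.

```python
def build_multimedia_feature_info(first_video_path, first_audio_samples,
                                  detect_keypoints, asr_model):
    # Assemble the two feature streams that replace the raw first video
    # and first audio in transmission.
    return MultimediaFeatureInfo(
        face_keypoints=extract_face_keypoint_info(first_video_path,
                                                  detect_keypoints),
        speech_text=asr_model.transcribe(first_audio_samples),
    )
```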
In this way, through the above implementation, the sending end obtains the face key point information and the speech text based on the first video and the first audio, and sends them to the receiving end as multimedia feature information.

Step 202: the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain target multimedia data.
The biometric information may be acquired by the receiving end in advance. For example, it may be obtained in advance from the sending end, preset in the receiving end at the factory, or obtained in advance from a server; the manner in which the receiving end acquires the biometric information is not specifically limited here.

In the embodiments of the present application, the biometric information may include a face model and a timbre feature. The face model may be a three-dimensional face model or a two-dimensional face model: a three-dimensional face model may be obtained by three-dimensional modeling of a two-dimensional face image, while a two-dimensional face model may simply be a two-dimensional face image. The two-dimensional face image may be a face image of the target object or a face image of another user. Likewise, the timbre feature may be extracted from audio data of the target object or from audio data of another user.

As one implementation, the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data by processing the face key point information based on the face model to obtain the second video, and processing the speech text based on the timbre feature to obtain the second audio.

The process by which the receiving end processes the face key point information based on the face model to obtain the second video is introduced below.

In one possible implementation, if the face model pre-acquired by the receiving end is a two-dimensional face model, the receiving end may check whether the coordinates of the face key points in the received face key point information are two-dimensional or three-dimensional. If the coordinates are three-dimensional, the receiving end first converts them into two-dimensional coordinates and then drives and renders the two-dimensional face model based on the converted two-dimensional coordinates to obtain the second video; if the coordinates are two-dimensional, the receiving end directly drives and renders the two-dimensional face model based on them to obtain the second video.

In another possible implementation, the receiving end may acquire both a three-dimensional face model and a two-dimensional face model in advance. In this case, the receiving end checks whether the coordinates of the face key points in the received face key point information are two-dimensional or three-dimensional. If the coordinates are three-dimensional, the receiving end selects the three-dimensional face model, drives it based on the three-dimensional coordinates of the face key points, and then projects the driven three-dimensional face model onto a two-dimensional plane and renders it to obtain the second video; if the coordinates are two-dimensional, the receiving end selects the two-dimensional face model and drives and renders it based on the two-dimensional coordinates to obtain the second video.
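The dimensionality check and model selection described in these two implementations could be sketched as follows; the `drive` and `project_2d` methods are hypothetical interfaces of the face models, not APIs from the original disclosure.

```python
def render_second_video(keypoint_info, model_2d, model_3d=None):
    # Inspect one coordinate tuple to decide whether the sender transmitted
    # 2-D or 3-D face key point coordinates (an illustrative heuristic).
    first_frame = keypoint_info[min(keypoint_info)]
    dims = len(next(iter(first_frame.values())))
    frames = []
    for _, coords in sorted(keypoint_info.items()):
        if dims == 3 and model_3d is not None:
            # Drive the 3-D face model, then project it to a 2-D plane.
            frames.append(model_3d.drive(coords).project_2d())
        else:
            if dims == 3:
                # Only a 2-D model is available: drop the depth component.
                coords = {k: xyz[:2] for k, xyz in coords.items()}
            frames.append(model_2d.drive(coords))
    return frames  # rendered frame sequence of the second video
```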
Driving the face model based on the coordinates of the face key points of the target object in each video frame can be understood as making the coordinates of the face key points of the face model follow the changes of the coordinates of the face key points of the target object across the video frames; different face key point coordinates of the face model correspond to different expressions, for example, smiling, annoyance, anger, and so on.

The process by which the receiving end processes the speech text based on the timbre feature to obtain the second audio is introduced below.

The receiving end can fuse the timbre feature and the speech text, that is, use them together to perform speech synthesis to obtain the second audio; when played, the second audio has a sound matching both the speech text and the timbre feature.

In this way, the receiving end only needs to process the received multimedia feature information to obtain the target multimedia data; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information is presented, and the voice corresponding to the speech text is presented.
In the embodiments of the present application, since the face key point information is feature information extracted from the first video and the speech text is feature information extracted from the first audio, the data volume of the first video is far greater than that of the face key point information, and the data volume of the first audio is far greater than that of the speech text. Therefore, compared with the sending end directly sending the first video and the first audio, sending only the face key point information and the speech text significantly reduces the amount of data sent by the sending end, reduces the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, and helps improve the efficiency of data transmission.

In one possible implementation, to further reduce the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, after the sending end obtains the multimedia feature information from the first video and the first audio, it may encode the multimedia feature information using a preset encoding rule to obtain encoded multimedia feature information and send the encoded information to the receiving end. The receiving end receives the encoded multimedia feature information, decodes it to recover the multimedia feature information, and then processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data. In this way, encoding the multimedia feature information further compresses its data volume, thereby further reducing the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end.
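The patent does not name a concrete codec, so the sketch below simply illustrates the idea of a preset encoding rule with standard-library serialization and compression; JSON plus zlib is an assumption, not the disclosed scheme.

```python
import json
import zlib

def encode_feature_info(info):
    # Sender side: serialize the multimedia feature information and
    # compress it before transmission (JSON turns integer keys into
    # strings here, which a real implementation would account for).
    payload = {"face_keypoints": info.face_keypoints,
               "speech_text": info.speech_text}
    return zlib.compress(json.dumps(payload).encode("utf-8"))

def decode_feature_info(data):
    # Receiver side: reverse the encoding before further processing.
    return json.loads(zlib.decompress(data).decode("utf-8"))
```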
In one possible implementation, to broaden the application scope of the data processing method of the embodiments of the present application, the multimedia feature information may also include skeleton joint point information obtained by the sending end from the first video of the target object, where the skeleton joint point information includes the coordinates of the skeleton joint points of the target object in each video frame of the first video.

Referring to FIG. 4, which is a schematic diagram of exemplary skeleton joint points of a target object, the 25 skeleton joint points shown include: nose 0, neck 1, right shoulder 2, right elbow 3, right wrist 4, left shoulder 5, left elbow 6, left wrist 7, sacrum 8, right waist 9, right knee 10, right ankle 11, left waist 12, left knee 13, left ankle 14, right eye 15, left eye 16, right ear 17, left ear 18, first left toe 19, second left toe 20, left heel 21, first right toe 22, second right toe 23, and right heel 24. It should be noted that, in the embodiments of the present application, the skeleton joint points of the user may include some or all of the 25 skeleton joint points shown in FIG. 4, which is not specifically limited here.

Correspondingly, the biometric information may also include a body posture model, and the receiving end may stitch the body posture model and the face model together to obtain a portrait model. The receiving end drives the portrait model based on the face key point information and the skeleton joint point information, so that the portrait model changes as the coordinates of the face key points and skeleton joint points of the target object change across the video frames; different face key point coordinates correspond to different expressions of the portrait model, and different skeleton joint point coordinates correspond to different poses of the portrait model. A sketch of this driving step follows.
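This is a minimal sketch assuming the portrait model exposes hypothetical `apply_face_keypoints`, `apply_skeleton_joints`, and `render` methods; none of these interfaces come from the original disclosure.

```python
def drive_portrait_model(portrait_model, face_keypoints, skeleton_joints):
    # For each frame, update the stitched portrait model (face model plus
    # body posture model) with both coordinate sets, so that expression and
    # body pose change together, then render the frame.
    frames = []
    for frame_index in sorted(face_keypoints):
        portrait_model.apply_face_keypoints(face_keypoints[frame_index])
        portrait_model.apply_skeleton_joints(skeleton_joints[frame_index])
        frames.append(portrait_model.render())
    return frames
```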
In the embodiments of the present application, after the receiving end processes the multimedia feature information and obtains the target multimedia data, it may also display the target multimedia data.

The application of the data processing method of the embodiments of the present application is illustrated below in combination with several different application scenarios:
Multi-party video call scenario

During a video call between multiple terminals, it can be understood that any terminal can act as a receiving end as well as a sending end.

Taking terminal A and terminal B as an example of multiple terminals, see FIG. 5, which is an exemplary schematic diagram of data transmission in a video call between terminal A and terminal B.

When terminal A acts as the sending end, terminal A can collect the first video and first audio of user A (the target object), obtain face key point information from the first video, perform speech recognition on the first audio to obtain speech text, and use the face key point information and the speech text as multimedia feature information, where the face key point information includes the coordinates of user A's face key points in each video frame of the first video.

Terminal A sends the multimedia feature information corresponding to user A to terminal B. After receiving the multimedia feature information, terminal B, acting as the receiving end, processes user A's face key point information based on the pre-acquired face model corresponding to user A and processes user A's speech text based on the pre-acquired timbre feature corresponding to user A, thereby obtaining target multimedia data corresponding to user A, which may include the second video and the second audio. Terminal B can then display the second video and second audio corresponding to user A in time synchronization.

Similarly, terminal B can also perform the steps performed by the above sending end to send multimedia feature information corresponding to user B to terminal A. Correspondingly, terminal A, as the receiving end, can display the second video and second audio corresponding to user B (that is, the target multimedia data corresponding to user B) in time synchronization based on user B's multimedia feature information, thereby realizing the video call process between terminal A and terminal B.

It should be noted that FIG. 5 only shows two terminals (terminal A and terminal B) as an example; in other embodiments, more terminals may be included. Any terminal can send the multimedia feature information it extracts to each of the other terminals; any terminal can also receive the corresponding multimedia feature information sent by other terminals and display the corresponding target multimedia data.

In this way, the video call process between multiple terminals can be realized by transmitting only the multimedia feature information, without transmitting the original video and audio, which reduces the excessive consumption of bandwidth resources during multi-terminal video calls and helps improve call quality.
Live video streaming scenario

Optionally, when a viewer watches a live web video stream through a viewer terminal, the viewer terminal held by the viewer may act as the receiving end of the embodiments of the present application, and the host terminal held by the host may act as the sending end.

Referring to FIG. 6, which is an exemplary schematic diagram of data transmission between a viewer terminal and a host terminal.

The host terminal may collect the first video and first audio of the host (the target object), obtain face key point information from the first video, perform speech recognition on the first audio to obtain speech text, and use the face key point information and the speech text as multimedia feature information, where the face key point information includes the coordinates of the host's face key points in each video frame of the first video.

The host terminal sends the multimedia feature information corresponding to the host to the viewer terminal. After receiving the multimedia feature information, the viewer terminal, acting as the receiving end, processes the host's face key point information based on the pre-acquired face model corresponding to the host and processes the host's speech text based on the pre-acquired timbre feature of the host, thereby obtaining target multimedia data corresponding to the host, which may include the second video and the second audio. The viewer terminal can then display the host's second video and second audio in time synchronization.
Optionally, while viewers watch a live web video stream through viewer terminals, different hosts may also co-stream (lianmai), which is a form of live web broadcasting. Through co-streaming, different hosts can interact with each other, and the co-streaming interface on each participating host's terminal and on the viewer terminals can simultaneously display the multimedia data corresponding to each co-streaming host.

In the case of co-streaming, for example when host A and host B co-stream, host A's host terminal A and host B's host terminal B can each act as a sending end, while host terminal A, host terminal B, and the viewer terminals can each act as a receiving end.

Referring to FIG. 7, which is an exemplary schematic diagram of co-streaming data transmission between host A and host B.

Host terminal A can collect the first video and first audio of host A (the target object), obtain face key point information from host A's first video, perform speech recognition on host A's first audio to obtain speech text, and use the face key point information and the speech text as multimedia feature information, where the face key point information includes the coordinates of host A's face key points in each video frame of the first video.

Host terminal A sends the multimedia feature information corresponding to host A to the viewer terminals and host terminal B. Similarly, host terminal B can send the multimedia feature information corresponding to host B to the viewer terminals and host terminal A in the same manner.

After receiving the multimedia feature information corresponding to host A, a viewer terminal processes host A's face key point information based on the pre-acquired face model corresponding to host A and processes host A's speech text based on the pre-acquired timbre feature corresponding to host A, thereby obtaining target multimedia data corresponding to host A. Similarly, after receiving the multimedia feature information corresponding to host B, the viewer terminal follows the same procedure to obtain target multimedia data corresponding to host B. The viewer terminal can then display the target multimedia data corresponding to host A and the target multimedia data corresponding to host B at the same time.

After host terminal A receives the multimedia feature information corresponding to host B, similarly to the processing at the viewer terminal, it can process host B's face key point information based on the pre-acquired face model corresponding to host B and process host B's speech text based on the pre-acquired timbre feature corresponding to host B, thereby obtaining target multimedia data corresponding to host B and displaying it on host terminal A. Optionally, host terminal A can also display host A's own multimedia data and the target multimedia data corresponding to host B on host terminal A at the same time.
Similarly, host terminal B can also display the target multimedia data corresponding to host A and host B's own multimedia data on host terminal B.

In this way, in the live streaming scenario, the live streaming process can be realized between the host terminals and the viewer terminals by transmitting only the hosts' multimedia feature information, without transmitting the original multimedia data, which reduces the excessive consumption of bandwidth resources during live streaming and helps improve communication quality.
In the above embodiments, the receiving end receives the multimedia feature information sent by the sending end, where the multimedia feature information includes face key point information obtained from the first video of the target object and speech text obtained by performing speech recognition on the first audio of the target object, and the face key point information includes the coordinates of the face key points of the target object in each video frame of the first video. The receiving end then processes the multimedia feature information based on pre-acquired biometric information, which includes a face model and a timbre feature, to obtain target multimedia data; when the target multimedia data is displayed, the dynamic expression corresponding to the face key point information and the voice corresponding to the speech text are presented. It can be seen that the sending end in the embodiments of the present application does not have to send the original first video and first audio, but only the face key point information and the speech text; the receiving end then processes this information to obtain target multimedia data that presents the dynamic expression corresponding to the face key point information and the voice corresponding to the speech text. Since the data volume of the first video is far greater than that of the face key point information, and the data volume of the first audio is far greater than that of the speech text, sending only the face key point information and the speech text, rather than the first video and first audio, significantly reduces the amount of data sent by the sending end, reduces the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, and helps improve the efficiency of data transmission.
In one embodiment, based on the embodiment shown in FIG. 2 and referring to FIG. 8, this embodiment concerns how the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data when the target multimedia data includes the second video. As shown in FIG. 8, step 202 may include step 801:

Step 801: the receiving end transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame to obtain the second video.

In the embodiments of the present application, the face model may be a three-dimensional face model or a two-dimensional face model. For ease of description, the following takes a two-dimensional face model as an example when explaining the implementation process; it can be understood that this example does not limit the type of the face model.

As described above, the face model may be a two-dimensional face image of the target object, and the face model includes the coordinates of multiple face key points; as described above, the face key points may be key points of the eyebrows, eyes, nose, mouth, and so on.
After receiving the multimedia feature information sent by the sending end, the receiving end transforms the coordinates of the corresponding face key points in the face model according to the coordinates of the target object's face key points in each video frame.

As one implementation, for each video frame, the receiving end can transform the coordinates of each face key point in the face model into the coordinates of the corresponding face key point in that video frame; for example, the coordinates of the face key point "nose" in the face model are transformed into the coordinates of the corresponding face key point "nose" in the video frame.

It can be understood that, for each video frame, after the receiving end transforms the coordinates of each face key point in the face model into the coordinates of the corresponding face key point in that video frame, the face model has the same expression as the corresponding video frame.
The process of how the receiving end transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame is described below.

Optionally, the receiving end can perform the following step A1 to transform the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame:

Step A1: based on the timing order of the video frames in the first video, the receiving end transforms the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame in turn.

In one possible implementation, the face key point information may also include the timing order of the video frames in the first video and the identifier of each face key point. The receiving end can then, following the front-to-back timing order of the video frames in the first video, transform the coordinates of the face key points in the face model according to the coordinates of the face key points in each video frame in turn.

For example, following the front-to-back timing order of the video frames in the first video, for the first video frame, the receiving end can determine, in the face model, the face key point corresponding to each identifier in the face key point information, and then transform the coordinates of each determined face key point in the face model into the coordinates of the corresponding face key point in that video frame, so that the face model has the same expression as that video frame. Then, for the second video frame, the third video frame, and so on in the first video, the receiving end performs the same steps as for the first video frame, driving the face model in the timing order of the video frames and rendering the driven face model, thereby obtaining the second video.
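Step A1 can be sketched as the loop below, which walks the frames in their timing order and overwrites the face model's key point coordinates with the coordinates reported for the matching identifiers; the renderer is a hypothetical placeholder.

```python
def transform_face_model_coords(face_model_coords, keypoint_info, render_frame):
    # `face_model_coords` maps key point identifiers to coordinates in the
    # face model; `keypoint_info` maps frame indices (in timing order) to
    # {identifier: coordinates}; `render_frame` is a hypothetical renderer.
    rendered_frames = []
    for frame_index in sorted(keypoint_info):
        for kp_id, coords in keypoint_info[frame_index].items():
            # Move the model's key point to the coordinate observed in
            # this video frame, reproducing the frame's expression.
            face_model_coords[kp_id] = coords
        rendered_frames.append(render_frame(face_model_coords))
    return rendered_frames  # frame sequence of the second video
```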
In the above embodiment, the second video can be generated merely by transforming the coordinates of the face key points in the face model. The computation mainly involves processing the coordinates of the face key points and does not require processing other parameters of the face model, which helps reduce the computational load on the receiving end and improve its performance.
In an embodiment, based on the embodiment shown in FIG. 2 and referring to FIG. 9, this embodiment concerns how, when the target multimedia data includes the second audio, the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data. As shown in FIG. 9, in this embodiment step 202 may include step 901 shown in FIG. 9:
Step 901: the receiving end inputs the timbre feature and the speech text into a speech synthesis model to obtain the second audio.
In the embodiments of the present application, the timbre feature may be obtained by performing Fourier analysis on the audio data of the target object to obtain a spectrum and then extracting spectral features from it.
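By way of illustration, such Fourier-based timbre extraction might be realized minimally as sketched below, assuming the audio data is a one-dimensional NumPy array; the framing parameters and the pooling of the averaged spectrum into a fixed-length vector are assumptions of this sketch:

    import numpy as np

    def extract_timbre_feature(audio, sample_rate, n_bins=64):
        # Frame the signal into 25 ms windows with a 10 ms hop.
        frame_len = int(0.025 * sample_rate)
        hop = int(0.010 * sample_rate)
        frames = [audio[i:i + frame_len]
                  for i in range(0, len(audio) - frame_len + 1, hop)]
        window = np.hanning(frame_len)
        # Fourier analysis of each frame, then average the magnitude spectra.
        spectra = [np.abs(np.fft.rfft(f * window)) for f in frames]
        mean_spectrum = np.mean(spectra, axis=0)
        # Pool the averaged spectrum into a fixed-length feature vector.
        return np.array([b.mean() for b in np.array_split(mean_spectrum, n_bins)])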
The receiving end inputs the timbre feature and the speech text into a text-to-speech (TTS) speech synthesis model and obtains the second audio output by the model.
In the above embodiment, the receiving end processes the speech text according to the pre-acquired timbre feature, so that the synthesized second audio carries the characteristics represented by that timbre feature. The receiving end therefore does not need to obtain the first audio captured by the sending end through its microphone; a second audio with the same timbre as the first audio can be synthesized solely from the speech text extracted from the first audio and the timbre feature. While preserving the fidelity of the second audio, this greatly reduces the amount of data transmitted between the receiving end and the sending end and helps improve the efficiency of multimedia data transmission.
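A minimal sketch of step 901 follows; tts_model and its synthesize signature are hypothetical stand-ins for any pre-trained multi-speaker text-to-speech model that accepts a speaker or timbre embedding:

    def synthesize_second_audio(tts_model, timbre_feature, speech_text):
        # Condition the TTS model on the stored timbre feature so that the
        # synthesized waveform matches the target object's voice.
        return tts_model.synthesize(text=speech_text,
                                    speaker_embedding=timbre_feature)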
In an embodiment, based on the embodiment shown in FIG. 2 and referring to FIG. 10, this embodiment concerns how the receiving end acquires the biometric information. As shown in FIG. 10, before step 202 the data processing method of this embodiment further includes steps 203 and 204:
Step 203: the receiving end acquires the terminal identifier corresponding to the sending end and, according to the terminal identifier, detects whether target biometric information of the target object corresponding to the terminal identifier is stored.
Here, the target biometric information includes a target face model and a target timbre feature. In a possible implementation, the sending end may capture a face image of the target object and determine that face image as the target object's target face model; the sending end may also perform feature extraction on the target object's audio data to obtain the target object's target timbre feature.
While establishing a communication connection with the sending end, the receiving end may acquire the terminal identifier corresponding to the sending end. If the sending end and the receiving end performed multimedia data transmission at some historical moment, the receiving end may have requested the target biometric information of the target object from the sending end according to that terminal identifier and, upon receiving it, stored the target biometric information in association with the terminal identifier.
In this way, in the embodiments of the present application, after receiving the multimedia feature information sent by the sending end, the receiving end can detect, according to the sending end's terminal identifier, whether target biometric information of the target object corresponding to that terminal identifier is stored.
Step 204: if the target biometric information is stored, the receiving end uses the target biometric information as the biometric information.
If the target biometric information of the target object is stored, the receiving end has already acquired it from the sending end during a historical multimedia data transmission, and it can process the multimedia feature information sent by the sending end based on that target biometric information.
Since the target biometric information consists of the target object's target face model and target timbre feature, the receiving end can faithfully reproduce the real multimedia data of the target object at the sending end based on this information, ensuring the effect of multimedia data transmission.
Still referring to FIG. 10, step 204 is followed by steps 205, 206, 207, and 208:
Step 205: if the target biometric information is not stored, the receiving end detects whether the current data transmission rate is greater than a transmission rate threshold.
If the receiving end has not stored the target biometric information, it decides, in light of the current network quality, whether to request the target biometric information from the sending end. Optionally, the receiving end detects whether the current data transmission rate is greater than a transmission rate threshold, which may be set as needed at implementation time.
Step 206: if the current data transmission rate is greater than the transmission rate threshold, the receiving end sends an acquisition request to the sending end, the acquisition request being used to request the sending end to return the target biometric information.
A current data transmission rate greater than the transmission rate threshold indicates that the receiving end's current network quality is good, so the receiving end requests the target biometric information from the sending end.
Step 207: the receiving end receives the target biometric information sent by the sending end based on the acquisition request and uses the target biometric information as the biometric information.
Step 208: if the current data transmission rate is less than or equal to the transmission rate threshold, the receiving end acquires pre-stored generic biometric information and uses the generic biometric information as the biometric information.
Here, the generic biometric information includes a generic face model and a generic timbre feature.
A current data transmission rate less than or equal to the transmission rate threshold indicates that the receiving end's current network quality is poor. To avoid degrading it further by requesting the target biometric information from the sending end, the receiving end acquires the pre-stored generic biometric information and uses it as the biometric information. The generic biometric information may be the face image and timbre feature of any preset user.
In this way, after receiving the multimedia feature information, the receiving end can also process it based on the locally stored generic biometric information to obtain the target multimedia data. Multimedia data transmission can thus be achieved even under poor network quality, which reduces the network bandwidth required for multimedia data transmission and saves network bandwidth resources.
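By way of illustration, steps 203 to 208 can be read as a single decision procedure. The sketch below assumes a simple dictionary store keyed by terminal identifier; request_fn stands in for the acquisition request of step 206, and generic_info for the locally preset generic face model and generic timbre feature:

    def resolve_biometric_info(store, terminal_id, current_rate,
                               rate_threshold, request_fn, generic_info):
        target = store.get(terminal_id)        # step 203: look up stored info
        if target is not None:                 # step 204: use stored target info
            return target
        if current_rate > rate_threshold:      # steps 205-207: network is good,
            target = request_fn(terminal_id)   # so request it from the sender
            store[terminal_id] = target        # cache for later sessions
            return target
        return generic_info                    # step 208: fall back to generic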
In an embodiment, referring to FIG. 11, which shows a flowchart of a data processing method provided by an embodiment of the present application, the data processing method may be applied to the sending end 120 shown in FIG. 1. As shown in FIG. 11, the data processing method may include the following steps:
Step 111: the sending end acquires the first video and the first audio of the target object and acquires face key point information from the first video.
During multimedia data transmission between the sending end and the receiving end, the sending end may record video of the target object to obtain the first video and record the target object's voice to obtain the first audio. The sending end acquires face key point information from the target object's first video, the face key point information including the coordinates of the target object's face key points in each video frame of the first video.
The following describes how the sending end acquires the face key point information from the first video.
Optionally, the first video may include multiple video frames. For each video frame in the first video, the sending end performs face key point detection on the video frame to obtain the coordinates of the target object's face key points in that frame, and generates the face key point information based on the coordinates of the target object's face key points in each video frame.
Specifically, the sending end may input each video frame into a face key point detection model to obtain the coordinates of each face key point in that frame as output by the model, where the face key point detection model may be any pre-trained deep learning model for face key point detection.
The sending end uses the coordinates of the face key points in each video frame as the face key point information. In a possible implementation, the sending end may further add, to the coordinates of each face key point in the face key point information, a corresponding identifier as well as the temporal order of each video frame in the first video, and use the augmented information as the final face key point information.
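A minimal sketch of assembling such face key point information follows; detect_keypoints is a hypothetical wrapper around any pre-trained face key point detection model, assumed to return a mapping from keypoint identifiers to coordinates for one frame:

    def build_face_keypoint_info(video_frames, detect_keypoints):
        info = []
        for order, frame in enumerate(video_frames):
            # Record the keypoint coordinates together with an identifier per
            # keypoint and the frame's temporal order in the first video.
            info.append({"frame_order": order,
                         "keypoints": detect_keypoints(frame)})
        return info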
Step 112: the sending end performs speech recognition on the first audio to obtain speech text and sends the face key point information and the speech text to the receiving end as multimedia feature information.
The sending end may input the first audio into an automatic speech recognition (ASR) model to obtain the speech text output by the model.
The sending end sends the face key point information and the speech text to the receiving end as multimedia feature information. The multimedia feature information is used by the receiving end, which processes it based on pre-acquired biometric information to obtain target multimedia data; the biometric information includes a face model and a timbre feature, and when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information and speech corresponding to the speech text are presented.
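By way of illustration, the sender-side packaging of step 112 might look as follows; asr_model.transcribe is a hypothetical wrapper around any pre-trained automatic speech recognition model:

    def make_multimedia_feature_info(asr_model, first_audio, keypoint_info):
        speech_text = asr_model.transcribe(first_audio)
        # This compact payload is what is sent in place of the raw
        # first video and first audio.
        return {"face_keypoints": keypoint_info, "speech_text": speech_text}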
For the process by which the receiving end processes the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data, reference may be made to the above embodiments; details are not repeated here.
During multimedia data transmission between the sending end and the receiving end, what the sending end sends to the receiving end is not the original first audio and first video but the multimedia feature information extracted from them. Since the data volume of the first video is far larger than that of the face key point information, and the data volume of the first audio is far larger than that of the speech text, sending only the face key point information and the speech text, rather than the first video and the first audio directly, significantly reduces the amount of data the sending end transmits, lowers the network bandwidth consumed by multimedia data transmission between the sending end and the receiving end, and helps improve data transmission efficiency.
In an embodiment, referring to FIG. 12, a data processing method is provided, which may be applied in the implementation environment shown in FIG. 1. As shown in FIG. 12, the method may include the following steps:
Step 121: the sending end acquires the first video and the first audio of the target object and acquires face key point information from the first video.
Here, the face key point information includes the coordinates of the target object's face key points in each video frame of the first video.
Step 122: the sending end performs speech recognition on the first audio to obtain speech text and sends the face key point information and the speech text to the receiving end as multimedia feature information.
Step 123: the receiving end processes the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data.
Here, the biometric information includes a face model and a timbre feature. When the target multimedia data is displayed, a dynamic expression corresponding to the face key point information and speech corresponding to the speech text are presented.
The specific limitations and beneficial effects of steps 121, 122, and 123 in this embodiment are similar to those of the above embodiments; reference may be made to the descriptions of those embodiments, which are not repeated here.
In the following, an exemplary implementation of the embodiments of the present application is introduced, taking multimedia data transmission between terminal A (corresponding to user A) and terminal B (corresponding to user B) as an example.
Step a: terminal A is preset with a generic face model and a generic timbre feature. Terminal A uses user A's face image as user A's target face model and extracts user A's target timbre feature from user A's audio data. Referring to FIG. 13, FIG. 13 is an exemplary schematic diagram of terminal A creating user A's target face model and user A's target timbre feature.
Step b: terminal B is preset with a generic face model and a generic timbre feature. Terminal B uses user B's face image as user B's target face model and extracts user B's target timbre feature from user B's audio data. Referring to FIG. 14, FIG. 14 is an exemplary schematic diagram of terminal B creating user B's target face model and user B's target timbre feature.
Step c: if terminal A and terminal B need to transmit multimedia data, terminal A does not store user B's target face model and target timbre feature, terminal B does not store user A's target face model and target timbre feature, and the current data transmission rates of both terminal A and terminal B are greater than the transmission rate threshold, then terminal A requests user B's target face model and target timbre feature from terminal B, and terminal B requests user A's target face model and target timbre feature from terminal A, thereby completing the exchange of standard models. Referring to FIG. 15, FIG. 15 is an exemplary schematic diagram of terminal A and terminal B exchanging target biometric information.
Step d: terminal A captures user A's first video through a camera and inputs each video frame of user A's first video into the face key point detection model to obtain user A's face key point information; terminal A captures user A's first audio through a microphone and inputs user A's first audio into the automatic speech recognition model to obtain user A's speech text.
Step e: terminal B captures user B's first video through a camera and inputs each video frame of user B's first video into the face key point detection model to obtain user B's face key point information; terminal B captures user B's first audio through a microphone and inputs user B's first audio into the automatic speech recognition model to obtain user B's speech text.
Step f: terminal A sends user A's face key point information and user A's speech text to terminal B as user A's multimedia feature information.
Step g: terminal B sends user B's face key point information and user B's speech text to terminal A as user B's multimedia feature information.
Step h: terminal A processes user B's face key point information based on user B's target face model to obtain user B's second video, and processes user B's speech text based on user B's target timbre feature to obtain user B's second audio; terminal A displays user B's second video and user B's second audio as user B's target multimedia data.
Step i: terminal B processes user A's face key point information based on user A's target face model to obtain user A's second video, and processes user A's speech text based on user A's target timbre feature to obtain user A's second audio; terminal B displays user A's second video and user A's second audio as user A's target multimedia data.
Referring to FIG. 16, FIG. 16 is a schematic diagram of an exemplary multimedia data transmission process between terminal A and terminal B.
In this way, during multimedia data transmission between terminal A and terminal B, the original complete multimedia data does not need to be transmitted; transmitting only the multimedia feature information suffices. This reduces the network bandwidth required for multimedia data transmission, improves transmission efficiency, and saves transmission resources.
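By way of illustration, one side of the exchange shown in FIG. 16 can be sketched by combining the earlier sketches; the field names and the render callback remain assumptions of this illustration:

    def reconstruct_peer_media(feature_info, peer_bio, tts_model, render):
        # peer_bio holds the peer's target face model and target timbre
        # feature, exchanged beforehand in step c.
        second_video = drive_face_model(dict(peer_bio["face_model"]),
                                        feature_info["face_keypoints"],
                                        render)
        second_audio = tts_model.synthesize(
            text=feature_info["speech_text"],
            speaker_embedding=peer_bio["timbre_feature"])
        return second_video, second_audio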
It should be understood that although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on their execution, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
FIG. 17 is a structural block diagram of a data processing apparatus according to an embodiment. As shown in FIG. 17, the apparatus includes:
a receiving module 1701, configured to receive multimedia feature information sent by a sending end, the multimedia feature information including face key point information acquired from a first video of a target object and speech text obtained by performing speech recognition on first audio of the target object, where the face key point information includes the coordinates of the target object's face key points in each video frame of the first video; and
a processing module 1702, configured to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information including a face model and a timbre feature, where, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information and speech corresponding to the speech text are presented.
Based on the embodiment shown in FIG. 17 and referring to FIG. 18, optionally, the target multimedia data includes a second video, and the processing module 1702 includes:
a coordinate transformation unit 1702a, configured to transform the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame, so as to obtain the second video.
Optionally, the coordinate transformation unit 1702a is specifically configured to, based on the temporal order of the video frames in the first video, successively transform the coordinates of the face key points in the face model according to the coordinates of the target object's face key points in each video frame.
Based on the embodiment shown in FIG. 17, optionally, the target multimedia data includes second audio, and the processing module 1702 is specifically configured to input the timbre feature and the speech text into a speech synthesis model to obtain the second audio.
Based on the embodiment shown in FIG. 17, optionally, the apparatus further includes:
a first detection module, configured to acquire the terminal identifier corresponding to the sending end and, according to the terminal identifier, detect whether target biometric information of the target object corresponding to the terminal identifier is stored, the target biometric information including a target face model and a target timbre feature;
a first determination module, configured to use the target biometric information as the biometric information if the target biometric information is stored;
a second detection module, configured to detect whether the current data transmission rate is greater than a transmission rate threshold if the target biometric information is not stored;
a request module, configured to send an acquisition request to the sending end if the current data transmission rate is greater than the transmission rate threshold, the acquisition request being used to request the sending end to return the target biometric information;
a second determination module, configured to receive the target biometric information sent by the sending end based on the acquisition request and use the target biometric information as the biometric information; and
a third determination module, configured to acquire pre-stored generic biometric information if the current data transmission rate is less than or equal to the transmission rate threshold and use the generic biometric information as the biometric information, the generic biometric information including a generic face model and a generic timbre feature.
The division of the modules in the above data processing apparatus is for illustration only. In other embodiments, the data processing apparatus may be divided into different modules as needed to accomplish all or part of its functions.
For specific limitations on the data processing apparatus, reference may be made to the limitations above on the data processing method applied to the receiving end, which are not repeated here. Each module in the above data processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor in a computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
FIG. 19 is a structural block diagram of a data processing apparatus according to an embodiment. As shown in FIG. 19, the apparatus includes:
an acquisition module 1901, configured to acquire a first video and first audio of a target object and acquire face key point information from the first video, the face key point information including the coordinates of the target object's face key points in each video frame of the first video; and
a processing module 1902, configured to perform speech recognition on the first audio to obtain speech text and send the face key point information and the speech text to a receiving end as multimedia feature information, where the multimedia feature information is used by the receiving end, which processes it based on pre-acquired biometric information to obtain target multimedia data; the biometric information includes a face model and a timbre feature, and, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information and speech corresponding to the speech text are presented.
Based on the embodiment shown in FIG. 19, in a possible implementation, the acquisition module 1901 is specifically configured to, for each video frame in the first video, perform face key point detection on the video frame to obtain the coordinates of the target object's face key points in that frame, and to generate the face key point information based on the coordinates of the target object's face key points in each video frame.
The division of the modules in the above data processing apparatus is for illustration only. In other embodiments, the data processing apparatus may be divided into different modules as needed to accomplish all or part of its functions.
For specific limitations on the data processing apparatus, reference may be made to the limitations above on the data processing method applied to the sending end, which are not repeated here. Each module in the above data processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor in a computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In an embodiment, a data processing system is provided, the system including a sending end and a receiving end;
the receiving end is configured to execute the above method steps for the receiving end, so as to implement the above method embodiments for the receiving end; and
the sending end is configured to execute the above method steps for the sending end, so as to implement the above method embodiments for the sending end.
The implementation principles and technical effects of the data processing system provided by the embodiments of the present application are similar to those of the above method embodiments and are not repeated here.
FIG. 20 is a schematic diagram of the internal structure of an electronic device in an embodiment. As shown in FIG. 20, the electronic device includes a processor and a memory connected by a system bus. The processor provides computing and control capabilities to support the operation of the entire electronic device. The memory may include a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The computer program can be executed by the processor to implement the data processing methods provided by the various embodiments described herein. The internal memory provides a cached execution environment for the operating system and computer program in the non-volatile storage medium. The electronic device may be any terminal device such as a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sale) terminal, a vehicle-mounted computer, or a wearable device.
Each module in the data processing apparatus provided in the embodiments of the present application may be implemented in the form of a computer program. The computer program may run on a terminal or a server. The program modules constituted by the computer program may be stored in the memory of the electronic device. When the computer program is executed by the processor, the steps of the methods described in the embodiments of the present application are implemented.
Embodiments of the present application also provide a computer-readable storage medium: one or more non-volatile computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the data processing method.
Embodiments of the present application also provide a computer program product containing instructions that, when run on a computer, cause the computer to perform the steps of the data processing method.
Any reference to memory, storage, a database, or other media used in the present application may include non-volatile and/or volatile memory.
Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of this patent application. It should be noted that those of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A data processing method, wherein the method comprises:
    receiving multimedia feature information sent by a sending end, the multimedia feature information comprising face key point information acquired from a first video of a target object and speech text obtained by performing speech recognition on first audio of the target object, wherein the face key point information comprises coordinates of face key points of the target object in each video frame of the first video; and
    processing the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information comprising a face model and a timbre feature, wherein, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and speech corresponding to the speech text is presented.
  2. The method according to claim 1, wherein the target multimedia data comprises a second video, and processing the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data comprises:
    transforming the coordinates of the face key points in the face model according to the coordinates of the face key points of the target object in each of the video frames, so as to obtain the second video.
  3. The method according to claim 2, wherein transforming the coordinates of the face key points in the face model according to the coordinates of the face key points of the target object in each of the video frames comprises:
    based on the temporal order of the video frames in the first video, successively transforming the coordinates of the face key points in the face model according to the coordinates of the face key points of the target object in each of the video frames.
  4. The method according to claim 1, wherein the target multimedia data comprises second audio, and processing the multimedia feature information based on the pre-acquired biometric information to obtain the target multimedia data comprises:
    inputting the timbre feature and the speech text into a speech synthesis model to obtain the second audio.
  5. The method according to claim 1, wherein, before processing the multimedia feature information based on the pre-acquired biometric information, the method further comprises:
    acquiring a terminal identifier corresponding to the sending end and, according to the terminal identifier, detecting whether target biometric information of the target object corresponding to the terminal identifier is stored, the target biometric information comprising a target face model and a target timbre feature; and
    if the target biometric information is stored, using the target biometric information as the biometric information.
  6. The method according to claim 5, wherein the method further comprises:
    if the target biometric information is not stored, detecting whether a current data transmission rate is greater than a transmission rate threshold;
    if the current data transmission rate is greater than the transmission rate threshold, sending an acquisition request to the sending end, the acquisition request being used to request the sending end to return the target biometric information; and
    receiving the target biometric information sent by the sending end based on the acquisition request and using the target biometric information as the biometric information.
  7. The method according to claim 6, wherein the method further comprises:
    if the current data transmission rate is less than or equal to the transmission rate threshold, acquiring pre-stored generic biometric information and using the generic biometric information as the biometric information, the generic biometric information comprising a generic face model and a generic timbre feature.
  8. The method according to claim 1, wherein the face model is a two-dimensional face model or a three-dimensional face model; the two-dimensional face model is a two-dimensional face image, and the three-dimensional face model is obtained by performing three-dimensional modeling on a two-dimensional face image.
  9. The method according to claim 1, wherein the method further comprises:
    displaying the target multimedia data.
  10. The method according to claim 1, wherein the multimedia feature information is obtained by the sending end encoding, using a preset encoding rule, the multimedia feature information acquired from the first video and the first audio.
  11. A data processing method, wherein the method comprises:
    acquiring a first video and first audio of a target object, and acquiring face key point information from the first video, the face key point information comprising coordinates of face key points of the target object in each video frame of the first video; and
    performing speech recognition on the first audio to obtain speech text, and sending the face key point information and the speech text to a receiving end as multimedia feature information;
    wherein the multimedia feature information is used by the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information comprises a face model and a timbre feature, and, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and speech corresponding to the speech text is presented.
  12. The method according to claim 11, wherein acquiring the face key point information from the first video comprises:
    for each of the video frames in the first video, performing face key point detection on the video frame to obtain the coordinates of the face key points of the target object in the video frame; and
    generating the face key point information based on the coordinates of the face key points of the target object in each of the video frames.
  13. The method according to claim 12, wherein generating the face key point information based on the coordinates of the face key points of the target object in each of the video frames comprises:
    using the coordinates of the face key points in each of the video frames as the face key point information.
  14. The method according to claim 12, wherein generating the face key point information based on the coordinates of the face key points of the target object in each of the video frames comprises:
    adding, to the coordinates of the face key points in each of the video frames, a corresponding identifier and the temporal order of each of the video frames in the first video, to obtain the face key point information.
  15. A data processing method, wherein the method comprises:
    a sending end acquiring a first video and first audio of a target object and acquiring face key point information from the first video, the face key point information comprising coordinates of face key points of the target object in each video frame of the first video;
    the sending end performing speech recognition on the first audio to obtain speech text and sending the face key point information and the speech text to a receiving end as multimedia feature information; and
    the receiving end processing the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information comprising a face model and a timbre feature, wherein, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and speech corresponding to the speech text is presented.
  16. A data processing apparatus, wherein the apparatus comprises:
    a receiving module, configured to receive multimedia feature information sent by a sending end, the multimedia feature information comprising face key point information acquired from a first video of a target object and speech text obtained by performing speech recognition on first audio of the target object, wherein the face key point information comprises coordinates of face key points of the target object in each video frame of the first video; and
    a processing module, configured to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information comprising a face model and a timbre feature, wherein, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and speech corresponding to the speech text is presented.
  17. A data processing apparatus, wherein the apparatus comprises:
    an acquisition module, configured to acquire a first video and first audio of a target object and acquire face key point information from the first video, the face key point information comprising coordinates of face key points of the target object in each video frame of the first video; and
    a processing module, configured to perform speech recognition on the first audio to obtain speech text and send the face key point information and the speech text to a receiving end as multimedia feature information;
    wherein the multimedia feature information is used by the receiving end to process the multimedia feature information based on pre-acquired biometric information to obtain target multimedia data, the biometric information comprises a face model and a timbre feature, and, when the target multimedia data is displayed, a dynamic expression corresponding to the face key point information is presented and speech corresponding to the speech text is presented.
  18. A data processing system, wherein the system comprises a sending end and a receiving end;
    the receiving end is configured to execute the data processing method according to any one of claims 1 to 10; and
    the sending end is configured to execute the data processing method according to any one of claims 11 to 14.
  19. An electronic device, comprising a memory and a processor, the memory storing a computer program, wherein, when the computer program is executed by the processor, the processor is caused to perform the steps of the method according to any one of claims 1 to 14.
  20. A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 14 are implemented.