CN113420177A - Audio data processing method and device, computer equipment and storage medium - Google Patents
Audio data processing method and device, computer equipment and storage medium
- Publication number: CN113420177A (application CN202110738574.9A)
- Authority
- CN
- China
- Prior art keywords
- target
- audio data
- expression
- parameter set
- expression parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/636—Information retrieval of audio data; querying; filtering based on additional data, e.g. user or group profiles, by using biological or physiological data
- G06F16/638—Information retrieval of audio data; querying; presentation of query results
- G06F16/683—Information retrieval of audio data; retrieval characterised by using metadata automatically derived from the content
- G06F16/686—Information retrieval of audio data; retrieval characterised by using metadata using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
- G06F18/22—Pattern recognition; analysing; matching criteria, e.g. proximity measures
Abstract
The embodiment of the application discloses an audio data processing method and device, computer equipment and a storage medium, belonging to the field of computer technology. The method comprises the following steps: acquiring a spectrum image of target audio data; acquiring, based on the mapping relation between spectrum images and expression parameter sets, a target expression parameter set matched with the spectrum image of the target audio data; and adjusting a template face model according to the target expression parameter set to obtain a target face model with a facial expression. The target expression parameters in the target expression parameter set are used to simulate the facial expression of a singer when uttering the audio data, so the target face model obtained from these parameters carries that facial expression. When the audio data are subsequently played, the target face model is displayed synchronously to simulate the effect of a real person producing the audio, thereby improving the playing effect.
Description
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to an audio data processing method and device, computer equipment and a storage medium.
Background
With the development of computer technology and the emergence of audio playing applications, playing audio data has become a common form of entertainment in people's daily life. Usually, when target audio data is played, text content or other information in the target audio data is displayed synchronously so that the user can browse it conveniently. However, this playing mode is monotonous and the playing effect is poor.
Disclosure of Invention
The embodiment of the application provides an audio data processing method and device, a computer device and a storage medium, which improve the playing effect. The technical solution is as follows:
in one aspect, a method for processing audio data is provided, the method comprising:
acquiring a spectrum image of target audio data, wherein the spectrum image represents how the frequency of the sound in the target audio data changes over time;
acquiring a target expression parameter set matched with the spectrum image of the target audio data based on the mapping relation between spectrum images and expression parameter sets, wherein the target expression parameters in the target expression parameter set are used for simulating the facial expression of a singer when uttering the target audio data;
and adjusting a template face model according to the target expression parameter set to obtain a target face model with the facial expression.
In one possible implementation, the target audio data includes audio data in a plurality of time periods, and the obtaining a spectral image of the target audio data includes:
respectively converting the audio data in the time periods into matched spectrum images;
the obtaining of the target expression parameter set matched with the spectrum image of the target audio data based on the mapping relationship between the spectrum image and the expression parameter set includes:
and respectively acquiring target expression parameter sets matched with the plurality of spectrum images obtained by conversion based on the mapping relation.
In another possible implementation manner, the obtaining a target expression parameter set matched with a spectrum image of the target audio data based on a mapping relationship between the spectrum image and the expression parameter set includes:
inquiring the mapping relation based on the frequency spectrum image of the target audio data, and determining a reference frequency spectrum image similar to the frequency spectrum image of the target audio data in the mapping relation;
and determining the expression parameter set matched with the reference spectrum image in the mapping relation as the target expression parameter set.
In another possible implementation manner, the obtaining a target expression parameter set matched with a spectrum image of the target audio data based on a mapping relationship between the spectrum image and the expression parameter set includes:
and mapping the frequency spectrum image of the target audio data based on a mapping model to obtain a target expression parameter set matched with the frequency spectrum image of the target audio data, wherein the mapping model comprises a mapping relation between the frequency spectrum image and the expression parameter set.
In another possible implementation manner, the mapping the spectral image of the target audio data based on the mapping model to obtain a target expression parameter set matched with the spectral image of the target audio data includes:
based on the mapping model, performing feature extraction on the frequency spectrum image of the target audio data to obtain image features of the frequency spectrum image, wherein the image features are used for describing the relationship between the frequency and the time of the sound in the frequency spectrum image;
and performing feature transformation on the image features based on the mapping model to obtain the target expression parameter set.
In another possible implementation manner, the performing, based on the mapping model, feature extraction on a spectral image of the target audio data to obtain an image feature of the spectral image includes:
based on the mapping model, performing feature extraction on the frequency spectrum image of the target audio data to obtain image features corresponding to a plurality of time points, wherein the time points belong to a time period corresponding to the target audio data;
the performing feature transformation on the image features based on the mapping model to obtain the target expression parameter set includes:
and respectively carrying out feature transformation on the image features corresponding to each time point on the basis of the mapping model to obtain the target expression parameters corresponding to each time point.
In another possible implementation manner, after the feature extraction is performed on the spectral image of the target audio data based on the mapping model to obtain image features corresponding to a plurality of time points, the method further includes:
and determining the product of the image features corresponding to each time point and the corresponding weight as the updated image features corresponding to each time point on the basis of the mapping model, wherein the weights corresponding to the time points are used for adjusting the smoothness degree of the image features corresponding to the time points.
In another possible implementation, the method further includes:
acquiring a sample spectrum image and a sample expression parameter set of sample audio data, wherein the sample expression parameters in the sample expression parameter set are used for simulating facial expressions of a singer when the singer sends the sample audio data;
and training the mapping model based on the sample spectrum image and the sample expression parameter set.
In another possible implementation manner, the target expression parameter set includes target expression parameters corresponding to a plurality of time points; adjusting the template face model according to the target expression parameter set to obtain a target face model with the facial expression, comprising:
respectively adjusting the template face model according to the target expression parameters corresponding to the multiple time points in the target expression parameter set to obtain multiple target face models;
after the template face models are respectively adjusted according to the target expression parameters corresponding to the multiple time points in the target expression parameter set to obtain multiple target face models, the method further includes:
and in the process of playing the target audio data, playing the expression animation formed by the target face models.
In another possible implementation manner, before the template face models are respectively adjusted according to the target expression parameters corresponding to the multiple time points in the target expression parameter set to obtain multiple target face models, the method further includes:
and smoothing the obtained target expression parameters corresponding to the multiple time points.
In another possible implementation manner, the target expression parameters corresponding to any time point include expression parameters corresponding to a plurality of facial regions at the time point; the step of respectively adjusting the template face models according to the target expression parameters corresponding to the multiple time points in the target expression parameter set to obtain multiple target face models comprises the following steps:
and respectively adjusting the corresponding face regions in the template face model according to the target expression parameters of the plurality of face regions corresponding to the plurality of time points to obtain the plurality of target face models.
In another possible implementation manner, before the adjusting the corresponding facial regions in the template facial model respectively according to the target expression parameters corresponding to the multiple facial regions at the multiple time points to obtain the multiple target facial models, the method further includes:
and smoothing the target expression parameters of the same face region corresponding to the multiple time points.
In another possible implementation manner, after the template face models are respectively adjusted according to the target expression parameters corresponding to the multiple time points in the target expression parameter set to obtain multiple target face models, the method further includes:
storing the target audio data in correspondence with an expression animation composed of the plurality of target face models;
in the process of playing the target audio data, playing an expression animation composed of the plurality of target face models, including:
responding to a playing instruction of the target audio data, and inquiring the expression animation stored corresponding to the target audio data;
and playing the target audio data and synchronously playing the expression animation.
In another possible implementation manner, the acquiring a spectral image of target audio data includes:
responding to a playing instruction of the target audio data, and acquiring a frequency spectrum image of the target audio data;
in the process of playing the target audio data, playing an expression animation composed of the plurality of target face models, including:
and playing the target audio data and synchronously playing the expression animation.
In another possible implementation manner, after obtaining a target expression parameter set matched with the spectral image of the target audio data based on the mapping relationship between the spectral image and the expression parameter set, the method further includes:
correspondingly storing the target audio data and the target expression parameter set;
adjusting the template face model according to the target expression parameter set to obtain a target face model with the facial expression, comprising:
responding to a playing instruction of the target audio data, and inquiring the target expression parameter set stored corresponding to the target audio data;
and adjusting the template face model according to the inquired expression parameters in the target expression parameter set to obtain the target face model.
In another possible implementation manner, after the template face model is adjusted according to the target expression parameter set to obtain a target face model with the facial expression, the method further includes:
responding to a received data playing request sent by a terminal, and determining the target audio data indicated by the audio identifier and the target face model corresponding to the target audio data based on an audio identifier carried by the data playing request;
and sending the target audio data and the target face model to the terminal, playing the target audio data by the terminal, and synchronously playing the expression animation formed by the target face model.
In another aspect, an audio data processing apparatus is provided, the apparatus comprising:
the acquisition module is used for acquiring a frequency spectrum image of target audio data, and the frequency spectrum image is used for representing the condition that the frequency of sound in the target audio data changes along with time;
the obtaining module is further configured to obtain a target expression parameter set matched with the spectrum image of the target audio data based on a mapping relationship between the spectrum image and the expression parameter set, where a target expression parameter in the target expression parameter set is used to simulate a facial expression of a singer when the singer sends out the target audio data;
and the adjusting module is used for adjusting the template face model according to the target expression parameter set to obtain a target face model with the facial expression.
In a possible implementation manner, the target audio data includes audio data in a plurality of time periods, and the obtaining module is configured to convert the audio data in the plurality of time periods into matched spectral images respectively; and respectively acquiring target expression parameter sets matched with the plurality of spectrum images obtained by conversion based on the mapping relation.
In another possible implementation manner, the obtaining module is further configured to query the mapping relationship based on the spectral image of the target audio data, and determine a reference spectral image similar to the spectral image of the target audio data in the mapping relationship; and determining the expression parameter set matched with the reference spectrum image in the mapping relation as the target expression parameter set.
In another possible implementation manner, the obtaining module is further configured to map the spectral image of the target audio data based on a mapping model to obtain a target expression parameter set matched with the spectral image of the target audio data, where the mapping model includes a mapping relationship between the spectral image and the expression parameter set.
In another possible implementation manner, the obtaining module includes:
the feature extraction unit is used for extracting features of the frequency spectrum image of the target audio data based on the mapping model to obtain image features of the frequency spectrum image, and the image features are used for describing the relationship between the frequency and the time of the sound in the frequency spectrum image;
and the feature transformation unit is used for carrying out feature transformation on the image features based on the mapping model to obtain the target expression parameter set.
In another possible implementation manner, the feature extraction unit is configured to perform feature extraction on a spectrum image of the target audio data based on the mapping model to obtain image features corresponding to a plurality of time points, where the plurality of time points belong to a time period corresponding to the target audio data;
and the feature transformation unit is used for respectively carrying out feature transformation on the image features corresponding to each time point based on the mapping model to obtain the target expression parameters corresponding to each time point.
In another possible implementation manner, the apparatus further includes:
a determining module, configured to determine, based on the mapping model, a product of the image feature corresponding to each time point and a corresponding weight as an updated image feature corresponding to each time point, where the weights corresponding to multiple time points are used to adjust a smoothness degree of the image features corresponding to the multiple time points.
In another possible implementation manner, the apparatus further includes:
the acquisition module is further used for acquiring a sample spectrum image and a sample expression parameter set of sample audio data, and sample expression parameters in the sample expression parameter set are used for simulating facial expressions of a singer when the singer sends the sample audio data;
and the training module is used for training the mapping model based on the sample spectrum image and the sample expression parameter set.
In another possible implementation manner, the target expression parameter set includes target expression parameters corresponding to a plurality of time points; the adjusting module is used for adjusting the template face model according to target expression parameters corresponding to a plurality of time points in the target expression parameter set to obtain a plurality of target face models;
the device further comprises:
and the playing module is used for playing the expression animation formed by the plurality of target face models in the process of playing the target audio data.
In another possible implementation manner, the apparatus further includes:
and the processing module is used for smoothing the obtained target expression parameters corresponding to the multiple time points.
In another possible implementation manner, the target expression parameters corresponding to any time point include expression parameters corresponding to a plurality of facial regions at the time point; and the adjusting module is used for respectively adjusting the corresponding face regions in the template face model according to the target expression parameters of the plurality of face regions corresponding to the plurality of time points to obtain the plurality of target face models.
In another possible implementation manner, the apparatus further includes:
and the processing module is used for smoothing the target expression parameters of the same face area corresponding to the multiple time points.
In another possible implementation manner, the apparatus further includes:
the storage module is used for correspondingly storing the target audio data and the expression animation formed by the target face models;
the playing module is used for responding to a playing instruction of the target audio data and inquiring the expression animation stored corresponding to the target audio data; and playing the target audio data and synchronously playing the expression animation.
In another possible implementation manner, the obtaining module is configured to obtain a spectral image of the target audio data in response to a playing instruction for the target audio data;
and the playing module is used for playing the target audio data and synchronously playing the expression animation.
In another possible implementation manner, the apparatus further includes:
the storage module is used for correspondingly storing the target audio data and the target expression parameter set;
the adjusting module is used for responding to a playing instruction of the target audio data and inquiring the target expression parameter set stored corresponding to the target audio data; and adjusting the template face model according to the inquired expression parameters in the target expression parameter set to obtain the target face model.
In another possible implementation manner, the apparatus further includes:
the determining module is used for responding to a received data playing request sent by a terminal, and determining the target audio data indicated by the audio identifier and the target face model corresponding to the target audio data based on the audio identifier carried by the data playing request;
and the sending module is used for sending the target audio data and the target face model to the terminal, playing the target audio data by the terminal and synchronously playing the expression animation formed by the target face model.
In another aspect, a computer device is provided, which includes a processor and a memory, wherein at least one program code is stored in the memory, and the at least one program code is loaded and executed by the processor to implement the operations performed in the audio data processing method according to the above aspect.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, the at least one program code being loaded and executed by a processor to implement the operations performed in the audio data processing method according to the above aspect.
In another aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer program code stored in a computer readable storage medium, the computer program code being loaded and executed by a processor to implement the operations performed in the audio data processing method according to the above aspect.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the method, the device, the computer equipment and the storage medium provided by the embodiment of the application represent the change condition of the frequency of the sound in the audio data in the form of the frequency spectrum image, the target expression parameter set matched with the target audio data is obtained based on the mapping relation between the frequency spectrum image and the expression parameter set, and the target expression parameter in the target expression parameter set is used for simulating the facial expression of a singer when the singer sends the audio data, so that the target facial model with the facial expression is obtained according to the expression parameters in the expression parameter set, the facial expression of the target facial model is matched with the facial expression of the singer when the singer sends the audio data, and the accuracy of the target facial model is ensured. And subsequently, the target face model is displayed in the process of playing the audio data, so that the audio data and the facial expression of the target face model are synchronous, the effect of sending the audio data by a real person is simulated, and the playing effect is improved.
Drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
fig. 2 is a flowchart of an audio data processing method provided in an embodiment of the present application;
fig. 3 is a flowchart of another audio data processing method provided by an embodiment of the present application;
fig. 4 is a schematic diagram of an expression parameter after smoothing provided in an embodiment of the present application;
FIG. 5 is a flowchart of generating an expression animation according to an embodiment of the present application;
FIG. 6 is a flowchart of another method for generating an expression animation according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another audio data processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
It will be understood that, as used herein, "at least one" includes one, two, or more than two; "a plurality" includes two or more than two; "each" refers to each of the corresponding plurality; and "any" refers to any one of the plurality. For example, if the plurality of time points includes 3 time points, "each time point" refers to each of the 3 time points, and "any time point" refers to any one of the 3 time points, which may be the first, the second, or the third time point.
The audio data processing method provided by the embodiment of the application can be executed by computer equipment. Optionally, the computer device is a terminal or a server. Optionally, the terminal is a computer, a mobile phone, a tablet computer or other terminals. Optionally, the server is a background server of the target application or a cloud server providing services such as cloud computing and cloud storage.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 are connected via a wireless or wired network.
The terminal 101 has installed thereon a target application served by the server 102, through which the terminal 101 can implement functions such as data transmission, message interaction, audio playback, and the like. Optionally, the target application is a target application in an operating system of the terminal 101, or a target application provided by a third party. For example, the target application is an audio playback application having a function of audio playback.
In a possible implementation manner, the server 102 is configured to store audio data, and the terminal 101 is configured to log in to a target application based on a user identifier. Audio data can be requested from the server 102 through the target application, and the server 102 can send the requested audio data to the terminal 101. The terminal 101 obtains a target face model with facial expressions based on the audio data and displays the target face model while the audio data are played.
In another possible implementation manner, the server 102 is configured to store audio data, obtain, based on the stored audio data, a target face model with facial expressions corresponding to the audio data, and store the audio data in correspondence with the corresponding target face model. The terminal 101 is configured to log in to the target application based on the user identifier and to request the audio data and the corresponding target face model from the server 102 through the target application. The server 102 sends the requested audio data and the corresponding target face model to the terminal 101, and the terminal 101 plays the audio data and displays the target face model synchronously.
In another possible implementation manner, the server 102 is configured to store audio data, obtain, based on the stored audio data, an expression parameter set corresponding to the facial expressions matched with the audio data, and store the audio data in correspondence with the corresponding expression parameter set. The terminal 101 is configured to log in to the target application based on the user identifier and to request the audio data and the corresponding expression parameter set from the server 102 through the target application. The server 102 sends the requested audio data and expression parameter set to the terminal 101; the terminal 101 adjusts the template face model according to the expression parameters in the obtained expression parameter set to obtain a target face model with facial expressions, plays the audio data, and synchronously plays the expression animation composed of the target face models.
The method provided by the embodiment of the application can be applied to the audio data playing scene.
For example, the audio data is music. A music playing application is installed in the terminal, the server provides services for the music playing application, and a plurality of pieces of music are stored in the server. Using the method provided by the embodiment of the application, target face models with facial expressions corresponding to each piece of music are obtained, and each piece of music is stored in correspondence with its target face models. The terminal logs in to the music playing application based on a user identifier and sends a music obtaining request carrying a target music identifier to the server through the music playing application. The server receives the music obtaining request, queries the target music corresponding to the target music identifier and the plurality of target face models with facial expressions corresponding to the target music, and returns the queried target music and the plurality of target face models to the terminal. After receiving them, the terminal plays the target music and synchronously plays the expression animation composed of the plurality of target face models, so as to simulate the facial expression of a singer singing the target music, thereby improving the playing effect.
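Purely as an illustration of the request flow described above, the sketch below shows how a server might handle a music obtaining request carrying a target music identifier. The storage structure, identifiers, and handler function are invented for this example and are not specified by the application.

```python
# Hypothetical in-memory store: target music identifier -> (audio data, list of target face models)
MUSIC_STORE = {}

def handle_music_request(target_music_id):
    """Handle a music obtaining request carrying a target music identifier."""
    target_music, target_face_models = MUSIC_STORE[target_music_id]
    # Return the queried music together with its target face models; the terminal
    # plays the music and the expression animation composed of the models synchronously.
    return {"music": target_music, "face_models": target_face_models}
```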
Fig. 2 is a flowchart of an audio data processing method according to an embodiment of the present application. The execution subject of the embodiment of the application is computer equipment. Referring to fig. 2, the method comprises the steps of:
201. the computer device acquires a spectral image of the target audio data.
Wherein the spectral image represents how the frequency of the sound in the target audio data changes over time.
202. The computer equipment acquires a target expression parameter set matched with the frequency spectrum image of the target audio data based on the mapping relation between the frequency spectrum image and the expression parameter set.
The target expression parameters in the target expression parameter set are used for simulating the facial expression of the singer when uttering the target audio data. The mapping relation embodies the relation between spectrum images and expression parameter sets: the expression parameter set corresponding to any spectrum image can be determined through the mapping relation, and the expression parameters in the determined set can simulate the facial expression of a singer when uttering the audio data corresponding to that spectrum image.
203. And the computer equipment adjusts the template face model according to the target expression parameter set to obtain a target face model with facial expressions.
The template face model is a model for presenting facial expressions, and is an arbitrary face model. The target facial model has facial expressions that match the facial expressions of the singer who uttered the target audio data. The target face model is used for displaying in the process of playing the target audio data, namely, the target face model is displayed while the target audio data is played, so that the presented facial expression is matched with the target audio data, and the playing effect is improved.
According to the method provided by the embodiment of the application, how the frequency of the sound in the audio data changes over time is represented in the form of a spectrum image. The target expression parameter set matched with the target audio data is obtained based on the mapping relation between spectrum images and expression parameter sets, and the target expression parameters in the target expression parameter set are used to simulate the facial expression of a singer when uttering the audio data. The target face model with the facial expression is therefore obtained from the expression parameters in the parameter set, the facial expression of the target face model matches that of the singer when uttering the audio data, and the accuracy of the target face model is ensured. Subsequently, the target face model is displayed while the audio data are played, so that the audio data and the facial expression of the target face model are synchronized, the effect of a real person producing the audio data is simulated, and the playing effect is improved.
Fig. 3 is a flowchart of an audio data processing method according to an embodiment of the present application. The execution subject of the embodiment of the application is computer equipment. Referring to fig. 3, the method comprises the steps of:
301. the computer device acquires a spectral image of the target audio data.
The target audio data is any type of audio data; for example, the target audio data is music, news broadcast data, audio data for storytelling, and the like. The spectral image represents how the frequency of the sound in the matched audio data changes over time, that is, any spectral image describes the change of the sound frequency over the time period corresponding to its matched audio data. Optionally, the spectral image includes the frequency magnitudes of the sound at a plurality of time points within the corresponding time period. Optionally, the spectrogram includes a frequency curve and coordinate axes, where the horizontal axis represents time and the vertical axis represents the frequency of the sound, so that the frequency curve reflects how the frequency of the sound changes over time.
In one possible implementation, step 301 includes: the target audio data is converted into a matching spectral image.
By converting the target audio data into the matched spectrum image, the converted spectrum image can describe the change of the frequency of the sound in the target audio data with time, so that the expression parameter set matched with the target audio data can be acquired based on the spectrum image.
Optionally, the process of converting the audio data into a spectral image matched to the audio data includes: the computer device converts the target audio data into a spectral image matching the target audio data in response to a play instruction for the target audio data.
Wherein the playing instruction is used for indicating that the target audio data needs to be played. If a playing instruction of the target audio data is received, the target audio data is required to be played, and therefore, a frequency spectrum image of the target audio data is obtained, so that a target face model with facial expressions matched with the target audio data is obtained in a subsequent step, and the target audio data is displayed synchronously when being played.
In one possible implementation, step 301 includes: and respectively converting the audio data in a plurality of time periods included in the target audio data into matched spectral images.
The target audio data includes audio data in a plurality of time periods, and optionally the combined duration of these time periods equals the duration of the target audio data. Optionally, any two of the time periods have equal durations. For example, if the total duration of the target audio data is 20 seconds and the duration of each time period is 5 seconds, the target audio data includes audio data in 4 time periods: the first time period is 0 to 5 seconds, the second is 5 to 10 seconds, the third is 10 to 15 seconds, and the fourth is 15 to 20 seconds.
The target audio data comprises audio data in a plurality of time periods, and the audio data in each time period is converted into the matched spectral image, so that the spectral images matched with the audio data in the plurality of time periods are obtained, and the plurality of spectral images are obtained.
Optionally, step 301 includes: the computer device converts audio data in a plurality of time periods included in the target audio data into matched spectral images, respectively, in response to a play instruction for the target audio data.
If a play instruction for the target audio data is received, it indicates that the target audio data needs to be played, and therefore, the target audio data is converted into a plurality of matched spectral images.
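For illustration only, the following is a minimal sketch of how target audio data could be split into fixed-length time periods and converted into matched spectral images. The 5-second segment length and the use of scipy are assumptions made for the example, not details specified by the application.

```python
import numpy as np
from scipy.signal import spectrogram  # assumed tooling, not specified by the application

def audio_to_spectral_images(samples, sample_rate, segment_seconds=5.0):
    """Split audio into fixed-length time periods and convert each into a spectral image.

    Each returned image is a 2-D array of spectral magnitudes whose axes correspond to
    frequency and time, i.e. it describes how the frequency of the sound changes over
    time within that period.
    """
    segment_len = int(segment_seconds * sample_rate)
    images = []
    for start in range(0, len(samples), segment_len):
        segment = samples[start:start + segment_len]
        if len(segment) < segment_len:              # pad the trailing period
            segment = np.pad(segment, (0, segment_len - len(segment)))
        freqs, times, magnitudes = spectrogram(segment, fs=sample_rate)
        images.append(magnitudes)                   # one spectral image per time period
    return images
```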
302. And the computer equipment performs feature extraction on the frequency spectrum image of the target audio data based on the mapping model to obtain image features corresponding to the multiple time points.
In this embodiment of the application, the mapping model includes the mapping relation between spectrum images and expression parameter sets: the spectrum image is the input of the mapping model, and the expression parameter set is its output. That is, based on the mapping model, the expression parameter set of a spectrum image can be obtained by using the mapping relation between spectrum images and expression parameter sets contained in the mapping model.
The plurality of time points belong to the time period corresponding to the target audio data and can be any plurality of time points in that time period. Optionally, the durations between any two adjacent time points are equal. For example, the time period corresponding to the target audio data is divided equally into a plurality of sub-time periods, each sub-time period has a start time point and an end time point, and the end time point of each sub-time period is taken as a time point in the time period, so that a plurality of time points in the time period are obtained.
The image features are used for describing features of the spectrum image, the image features are used for describing a relation between frequency and time of sound in the spectrum image, and the image features corresponding to each time point can describe a relation between the frequency and time of the sound corresponding to the corresponding time point in the spectrum image. And performing feature extraction on the spectrum image based on the mapping model to obtain a plurality of image features corresponding to the spectrum image, wherein each image feature corresponds to one time point.
In one possible implementation, the mapping model includes a feature extraction sub-model, the 302 including: and performing feature extraction on the frequency spectrum image based on the feature extraction submodel to obtain image features corresponding to the multiple time points.
Wherein, the feature extraction submodel is used for extracting the image features of the spectrum image. Optionally, the feature extraction sub-model is an arbitrary network model, for example, the feature extraction sub-model is CNN (Convolutional Neural Networks).
Optionally, the feature extraction submodel includes a plurality of channels, each channel being configured to output an image feature corresponding to a time point. When the feature extraction submodel extracts features of any frequency spectrum image, the image features output by the channels can form a three-dimensional image feature.
For example, feature extraction is performed on the spectral image based on a feature extraction submodel to obtain a three-dimensional image feature, where the size of the three-dimensional image feature is C × W × H, C is used to indicate the number of channels included in the feature extraction submodel, W × H is used to indicate an image feature output by each channel, and each channel corresponds to one time point in a time period corresponding to the spectral image, i.e., image features of multiple time points in the time period corresponding to the spectral image are obtained.
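As a hedged sketch of the feature-extraction sub-model described above, the convolutional network below outputs a feature tensor with C channels, where each channel is treated as the image feature of one time point (the C × W × H tensor described above). The layer sizes, channel count, and use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureExtractionSubmodel(nn.Module):
    """Illustrative CNN: maps a 1-channel spectral image to C per-time-point feature maps."""

    def __init__(self, num_time_points=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, num_time_points, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, spectral_image):
        # spectral_image: (batch, 1, height, width)
        features = self.layers(spectral_image)
        # features: (batch, num_time_points, height, width); one channel per time point
        return features
```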
303. The computer device determines, based on the mapping model, a product of the image feature corresponding to each time point and the corresponding weight as an updated image feature corresponding to each time point.
Each time point corresponds to a weight, and the weights corresponding to the time points are used to adjust the smoothness of the image features at those time points. Blending the corresponding weight into the image feature of each time point guarantees the accuracy of the updated image feature for that time point, and also guarantees the smoothness between the expression parameters subsequently acquired from the multiple image features. This keeps consecutive facial expressions continuous, so that the presented facial expressions look natural and the subsequent display effect is guaranteed.
In one possible implementation, the mapping model includes a feature extraction sub-model, and the step 303 includes: and the computer equipment determines the product of the image characteristic corresponding to each time point and the corresponding weight as the updated image characteristic corresponding to each time point on the basis of the characteristic extraction submodel.
In this embodiment of the application, the image features corresponding to the multiple time points corresponding to the spectrum image are obtained by performing feature extraction on the spectrum image by the feature extraction sub-model, the image features corresponding to the multiple time points are output by multiple channels in the feature extraction sub-model, and each channel in the feature extraction sub-model has a corresponding weight. After the image features corresponding to the multiple time points are output by the multiple channels of the feature extraction submodel, the product of the image feature output by each channel and the corresponding weight is determined as the updated image feature corresponding to each channel, namely the updated image feature corresponding to each time point based on the feature extraction submodel.
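Continuing the sketch above, the per-channel weight can be kept as a learnable parameter and multiplied into the image feature of each time point to obtain the updated image feature. Treating the weights as learnable parameters of the sub-model is an assumption made for the example.

```python
import torch
import torch.nn as nn

class WeightedFeatures(nn.Module):
    """Multiplies the feature map of each time point (channel) by its own weight."""

    def __init__(self, num_time_points=32):
        super().__init__()
        # one weight per time point / channel, shaped for broadcasting over (batch, C, H, W)
        self.weights = nn.Parameter(torch.ones(1, num_time_points, 1, 1))

    def forward(self, features):
        # product of each time point's image feature and its weight = updated image feature
        return features * self.weights
```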
304. And the computer equipment respectively performs feature transformation on the image features corresponding to each time point based on the mapping model to obtain the target expression parameters corresponding to each time point.
The target expression parameters are used for simulating the facial expression of the singer when uttering the target audio data. Optionally, the expression parameters include mouth region parameters, eye region parameters, smile parameters, anger parameters, head posture parameters, and the like. For example, when a facial expression is described by expression parameters, the mouth region parameters describe how wide the mouth is open, the eye region parameters describe how wide the eyes are open, the smile and anger parameters describe whether the facial expression is smiling or angry, and the head posture parameters describe the raising angle, inclination angle, and the like of the head.
In the embodiment of the application, the image feature corresponding to any time point reflects the facial expression matched with the target audio data at that time point, and feature transformation of this image feature yields the target expression parameter for that time point. Since the image features of each time point were updated in step 303, the target expression parameters for each time point are obtained by performing feature transformation on the updated image features. The target expression parameters obtained for the multiple time points form the target expression parameter set matched with the spectrum image.
In one possible implementation, the mapping model includes a feature transformation submodel, and step 304 includes: respectively performing feature transformation on the image features corresponding to each time point based on the feature transformation submodel to obtain the expression parameters corresponding to each time point.
The feature conversion sub-model is used for acquiring expression parameters corresponding to image features, and the feature conversion sub-model is an arbitrary network model. For example, the feature transformation submodel is RNN (Recurrent Neural Networks) or DNN (Deep Neural Networks).
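A minimal sketch of the feature-transformation sub-model follows, here assumed to be a GRU followed by a linear layer that emits one expression-parameter vector per time point. The hidden size and the number of expression parameters are placeholders, not values taken from the application.

```python
import torch
import torch.nn as nn

class FeatureTransformSubmodel(nn.Module):
    """Maps per-time-point image features to per-time-point expression parameters."""

    def __init__(self, feature_dim, num_expression_params=52):
        # feature_dim must equal H * W of the extracted per-channel feature maps
        super().__init__()
        self.rnn = nn.GRU(feature_dim, 128, batch_first=True)
        self.head = nn.Linear(128, num_expression_params)

    def forward(self, features):
        # features: (batch, C, H, W), one channel per time point
        batch, c, h, w = features.shape
        sequence = features.reshape(batch, c, h * w)   # one flattened feature per time point
        hidden, _ = self.rnn(sequence)                 # (batch, C, 128)
        return self.head(hidden)                       # (batch, C, num_expression_params)
```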
It should be noted that the embodiment of the present application takes obtaining the image features of multiple time points from the spectral image, and then the target expression parameters corresponding to those time points, as an example. In another embodiment, steps 302 to 304 need not be executed: the multiple time points are not divided, and the target expression parameter set is obtained by taking the spectral image as a whole based on the mapping model.
It should be noted that the embodiment of the present application first obtains the image features corresponding to multiple time points of each spectral image and then obtains the target expression parameters corresponding to each time point. In another embodiment, steps 302 to 304 need not be executed; the processes of feature extraction and feature transformation can be omitted, and the target expression parameter set matched with the spectral image of the target audio data is mapped directly based on the mapping model.
It should be noted that the embodiment of the present application first obtains, based on the mapping model, the image features corresponding to multiple time points of each spectral image and then obtains the expression parameters corresponding to each time point. In another embodiment, steps 302 to 304 need not be executed; instead of the mapping model, other mapping manners can be adopted to obtain the target expression parameter set matched with the spectral image of the target audio data based on the mapping relation between spectrum images and expression parameter sets.
In one possible implementation manner, the process of acquiring the target expression parameter set corresponding to the spectrum image includes: the computer equipment inquires the mapping relation between the frequency spectrum image and the expression parameter set based on the frequency spectrum image of the target audio data, determines a reference frequency spectrum image similar to the frequency spectrum image of the target audio data in the mapping relation, and determines the expression parameter set matched with the reference frequency spectrum image in the mapping relation as the target expression parameter set.
In the embodiment of the application, the mapping relation between spectrum images and expression parameter sets is stored in the computer device, and the mapping relation includes a plurality of spectrum images and the expression parameter set corresponding to each of them. After the spectrum image of the target audio data is acquired, it is compared with each spectrum image in the mapping relation to determine a reference spectrum image similar to it; that is, the way the frequency of the sound changes over time as described by the spectrum image of the target audio data is similar to the way it changes as described by the reference spectrum image. The expression parameter set matched with the reference spectrum image is therefore used as the expression parameter set matched with the spectrum image of the target audio data.
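The query-based alternative can be sketched as a nearest-neighbour lookup: the spectral image of the target audio data is compared with every reference spectral image stored in the mapping relation, and the parameter set of the most similar reference is returned. The cosine-similarity measure used below is an assumption; the application only requires that a similar reference image be found.

```python
import numpy as np

def lookup_expression_parameters(spectral_image, mapping):
    """mapping: list of (reference_spectral_image, expression_parameter_set) pairs."""
    query = spectral_image.ravel()
    best_params, best_score = None, -np.inf
    for reference_image, parameter_set in mapping:
        ref = reference_image.ravel()
        score = np.dot(query, ref) / (np.linalg.norm(query) * np.linalg.norm(ref) + 1e-9)
        if score > best_score:            # most similar reference spectral image so far
            best_score, best_params = score, parameter_set
    return best_params                    # target expression parameter set
```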
305. And the computer equipment carries out smoothing processing on the obtained target expression parameters corresponding to the multiple time points.
Because the obtained expression parameters are discrete rather than continuous, smoothing processing is performed on the expression parameters corresponding to the obtained time points, so that the smoothed expression parameters are continuous and the difference between the expression parameters corresponding to any two adjacent time points is reduced.
In a possible implementation manner, polynomial interpolation is adopted to smooth the obtained expression parameters corresponding to the multiple time points.
In one possible implementation manner, the effect of smoothing the obtained expression parameters is represented in the form of a curve. As shown in fig. 4, the expression parameters 401 corresponding to the multiple time points are discrete points; after smoothing, the expression parameters form a curve that changes with time, so that the difference between any two adjacent time points is small. This ensures that the facial expressions obtained at the multiple time points when the template face model is subsequently adjusted based on these expression parameters are continuous, so that the presented facial expression is natural and the subsequent display effect is guaranteed.
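A minimal sketch of such smoothing by interpolation is given below; the use of cubic-spline interpolation (one form of polynomial interpolation) and the 0.04-second output step are illustrative assumptions.

```python
# A minimal sketch of smoothing discrete expression parameters by interpolation
# and dense resampling; step size and spline choice are illustrative assumptions.
import numpy as np
from scipy.interpolate import CubicSpline

def smooth_params(time_points, params, step=0.04):
    """time_points: (T,) seconds; params: (T, num_params) values per time point."""
    spline = CubicSpline(time_points, params, axis=0)
    dense_times = np.arange(time_points[0], time_points[-1] + 1e-9, step)
    return dense_times, spline(dense_times)  # a continuous curve sampled densely
```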
In one possible implementation, the target expression parameter corresponding to each time point includes an expression parameter corresponding to each time point of at least one facial region. For example, the target expression parameter corresponding to each time point includes an expression parameter corresponding to the mouth region at each time point, or the target expression parameter corresponding to each time point includes an expression parameter corresponding to the mouth region and the eye region at each time point.
In one possible implementation manner, the target expression parameters corresponding to any time point include the expression parameters of a plurality of facial regions at that time point; step 305 includes: smoothing, for each facial region, the target expression parameters of that region across the obtained multiple time points.
For example, the plurality of face regions are the eye region, the mouth region, the forehead region, and the like. For the obtained expression parameters corresponding to the multiple time points, smoothing is performed separately on the expression parameters of each face region across the multiple time points, so that the difference between any two adjacent time points in the smoothed expression parameters of each face region is small. This ensures that when each face region in the template face model is subsequently adjusted based on the expression parameters, the expression of each face region, and therefore the facial expression of the target face model, is natural, and the facial expressions obtained at the multiple time points are continuous, so that the presented facial expression is natural and the subsequent display effect is guaranteed.
306. The computer equipment adjusts the template face model according to the target expression parameters corresponding to the multiple time points in the target expression parameter set, respectively, to obtain a plurality of target face models.
The template face model is a model for presenting facial expressions, and the template face model is an arbitrary face model, for example, the template face model is a human face model, a cartoon character face model, an animal face model, or the like. For example, the template face model is a three-dimensional face model supporting Morph.
The target face model is used for display during playing of the audio data. The target face model has a certain facial expression, and the facial expression of any target face model is obtained through the corresponding expression parameters. After the expression parameters corresponding to the multiple time points are obtained, the template face model is adjusted according to the expression parameters corresponding to each time point to obtain the target face model corresponding to that time point, thereby obtaining the plurality of target face models. Since the expression parameters corresponding to the plurality of time points obtained in step 305 have been smoothed, the template face model is adjusted according to the smoothed expression parameters to obtain the plurality of target face models. In this embodiment of the present application, the expression parameters corresponding to different time points may be different, so the facial expressions in the obtained plurality of target face models may be different.
In one possible implementation, this step 306 includes: mapping the obtained expression parameters corresponding to the multiple time points to a target interval to obtain adjustment coefficients corresponding to the multiple time points, and adjusting the template face model according to the obtained adjustment coefficients to obtain multiple target face models.
In the embodiment of the present application, the target interval is the adjustment range of a face region in the template face model, and the adjustment coefficient is any value belonging to the target interval. For example, for the opening size of the mouth region in the template face model, the target interval is the adjustment range of the mouth region, say (0, 1), and the expression parameter describes the opening size of the mouth region. If the opening size of the mouth region is divided into 10 levels and an expression parameter indicates that the opening size belongs to level 5, mapping the expression parameter into the target interval yields an adjustment coefficient of 0.5; the template face model is then adjusted according to this adjustment coefficient, so as to ensure the accuracy of the adjusted target face model.
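The following minimal sketch illustrates mapping a level-valued expression parameter into the target interval to obtain the adjustment coefficient; the 10-level scale follows the example above, and the function name is an illustrative assumption.

```python
# A minimal sketch of mapping an expression parameter to an adjustment
# coefficient inside a target interval; the level scale is an assumption.
def to_adjust_coeff(level, num_levels=10, interval=(0.0, 1.0)):
    low, high = interval
    return low + (high - low) * (level / num_levels)  # level 5 -> 0.5

# e.g. mouth opening at level 5 on a 10-level scale -> adjustment coefficient 0.5
mouth_coeff = to_adjust_coeff(5)
```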
In one possible implementation manner, the target expression parameters corresponding to any time point include expression parameters corresponding to a plurality of facial regions at the time point; this step 306 includes: and respectively adjusting the corresponding face regions in the template face model according to the target expression parameters of the plurality of face regions corresponding to the plurality of time points to obtain a plurality of target face models.
For any time point, the corresponding facial regions in the template facial model are adjusted according to the expression parameters of the plurality of facial regions corresponding to that time point, to obtain the target facial model corresponding to that time point. In this way, by adjusting the corresponding face regions in the template face model according to the expression parameters of the plurality of face regions corresponding to the plurality of time points, the target face model corresponding to each time point is obtained, that is, the plurality of target face models are obtained.
It should be noted that, in the embodiment of the present application, smoothing is performed on the expression parameters corresponding to multiple time points, and then the target facial models corresponding to multiple time points are obtained, but in another embodiment, step 305 and step 306 do not need to be executed, and other manners can be adopted to adjust the template facial model according to the target expression parameter set, so as to obtain the target facial model with facial expressions.
307. The computer equipment plays the expression animation composed of the plurality of target face models in the process of playing the target audio data.
Since each target facial model has a facial expression, an expression animation that exhibits the change of facial expression over time can be constructed from the plurality of target facial models. In the process of playing the target audio data, the expression animation is played, and the facial expression in the expression animation matches the target audio data, so as to present the effect of a real person producing the audio data and improve the playing effect.
For example, the target audio data is target music; when the target music is played, the expression animation is played synchronously, and the facial expression in the expression animation matches the target music, so as to present the effect of a real person singing the target music.
In one possible implementation, the method further comprises: constructing the expression animation from the plurality of target face models according to the order of the time points, as shown in the sketch after the following explanation.
Each time point represents the interval between that time point and the start time point of the target audio data; for example, the plurality of time points are 1 second, 2 seconds, 3 seconds, and so on. Each target face model has a facial expression. By switching among the target face models over time according to the order of their corresponding time points, an expression animation in which the facial expression changes with time is obtained, that is, the expression animation composed of the plurality of target face models. When the target audio data and the expression animation are played synchronously, the plurality of target face models are displayed one after another as the playing time of the target audio data advances, in the order of their corresponding time points, so that an expression animation whose facial expression changes over time is presented and the target audio data and the expression animation are played synchronously.
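A minimal sketch of assembling the expression animation as time-ordered keyframes is given below; the ExpressionAnimation keyframe representation is an illustrative assumption rather than the patented data structure.

```python
# A minimal sketch of an expression animation as (time point, face model)
# keyframes ordered by time; the representation is an illustrative assumption.
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class ExpressionAnimation:
    keyframes: List[Tuple[float, Any]]  # (seconds from audio start, target face model)

    def model_at(self, play_time: float):
        """Return the face model whose time point was most recently reached."""
        current = self.keyframes[0][1]
        for t, model in self.keyframes:
            if t <= play_time:
                current = model
            else:
                break
        return current

def build_animation(models_by_time: dict) -> ExpressionAnimation:
    return ExpressionAnimation(sorted(models_by_time.items()))
```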
In one possible implementation, step 301 includes: in response to a playing instruction for the target audio data, acquiring the spectrum image of the target audio data; and step 307 includes: playing the target audio data and synchronously playing the expression animation composed of the plurality of target face models.
If a playing instruction for the target audio data is received, indicating that the target audio data needs to be played, the plurality of target facial models are obtained in real time according to steps 301 to 306; after the plurality of target facial models are obtained, the target audio data is played and the expression animation is played synchronously, so as to present the effect of facial expressions changing along with the playing of the target audio data, that is, the effect of a real person producing the audio data.
It should be noted that, in the embodiment of the present application, after the plurality of target facial models are obtained, the target audio data is played directly and the expression animation is played synchronously; in another embodiment, step 307 does not need to be performed directly after step 306, and the target audio data and the expression animation can be played under other conditions.
In one possible implementation, after step 306, the method further includes: the computer equipment stores the target audio data and the expression animation composed of the plurality of target face models in correspondence; in response to a playing instruction for the target audio data, queries the expression animation stored in correspondence with the target audio data; and plays the target audio data while synchronously playing the expression animation.
After the plurality of target face models are obtained, they form the expression animation, and the target audio data and the expression animation are stored in correspondence. If a playing instruction for the target audio data is received subsequently, the expression animation corresponding to the target audio data is queried, and the target audio data and the expression animation are played synchronously. The expression animation does not need to be regenerated and can be reused, which reduces the consumption of device resources and ensures playing efficiency.
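A minimal sketch of such storage and reuse is given below, assuming the expression animation is cached in memory keyed by a data identifier; the store layout and the audio_player / animation_player objects are illustrative assumptions.

```python
# A minimal sketch of caching an expression animation keyed by the audio's data
# identifier so it can be reused on later playback; all names are assumptions.
animation_store = {}

def store_animation(audio_id: str, animation) -> None:
    animation_store[audio_id] = animation

def play(audio_id: str, audio_player, animation_player) -> None:
    animation = animation_store.get(audio_id)  # reuse without regenerating
    audio_player.play(audio_id)
    if animation is not None:
        animation_player.play(animation)
```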
Alternatively, when the target audio data is stored in association with an expressive animation composed of a plurality of target face models, the plurality of target face models and corresponding time points are stored in association with the target audio data.
Each time point represents the interval between that time point and the start time point of the target audio data. Therefore, when the target audio data is played later, the target face models are displayed one after another according to the playing duration of the target audio data and the order of the time points corresponding to the target face models, so that the effect of the facial expression changing over time is presented.
Alternatively, an expression animation is generated based on a plurality of target face models, and when target audio data is stored in correspondence with an expression animation made up of a plurality of target face models, the expression animation is stored in correspondence with the target audio data.
And the duration of the expression animation is equal to the duration of the target audio data.
Optionally, the process of generating the expression animation includes: and generating the expression animation according to the sequence of the time points and based on the target face model corresponding to the time points.
Each time point represents the interval between that time point and the start time point of the target audio data; for example, the plurality of time points are 1 second, 2 seconds, 3 seconds, and so on. The expression animation is generated based on the plurality of target face models according to the order of the time points and the time intervals between them, so that the duration of the expression animation is equal to that of the target audio data and the corresponding target face model is displayed when each time point of the expression animation is reached. That is, when the expression animation is subsequently played, the facial expressions of the plurality of target face models are displayed in sequence over time, and the moment at which each facial expression is displayed is the same as the time point corresponding to the corresponding target face model.
For example, the plurality of time points are 1 second, 2 seconds, 3 seconds, 4 seconds, 5 seconds, and so on. During playing of the expression animation, when the playing time reaches 1 second, the displayed facial expression is that of the target facial model corresponding to 1 second; when the playing time reaches 2 seconds, the displayed facial expression is that of the target facial model corresponding to 2 seconds; and so on, presenting an animation in which the facial expression changes over time.
After the expression animation is generated, the expression animation and the target audio data are correspondingly stored, so that the expression animation can be synchronously played when the target audio data is played in the subsequent process.
Optionally, in the plurality of time points, the interval duration between any two adjacent time points is an arbitrary duration. For example, the interval duration is 1 second, or 0.04 second, or the like.
In order to ensure the playing effect of the expression animation, in the above embodiment, the expression parameters corresponding to the multiple time points are smoothed to obtain the smoothed expression parameters corresponding to the multiple time points, and then the target face models corresponding to the smoothed time points are obtained. Smoothing reduces the interval between any two adjacent time points; for example, the interval between any two adjacent smoothed time points is 0.04 second. The expression animation generated based on the target facial models corresponding to the smoothed time points can therefore present natural switching of facial expressions, ensuring the playing effect of the expression animation.
Optionally, the computer device comprises a terminal and a server. Steps 301 to 306 are executed by the server, the server stores the target audio data in correspondence with the expression animation, and step 307 is executed by the terminal. The terminal, in response to a playing instruction for the target audio data, sends a data acquisition request carrying the data identifier of the target audio data to the server; the server receives the data acquisition request, queries the target audio data corresponding to the data identifier, queries the expression animation stored in correspondence with the target audio data, and sends the queried target audio data and expression animation to the terminal; after receiving the target audio data and the expression animation, the terminal plays them synchronously.
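The following minimal sketch illustrates this terminal-server interaction; the request fields, the server-side lookup tables, and the player objects are illustrative assumptions, not the patented protocol.

```python
# A minimal sketch of the terminal-server interaction described above; all
# names, fields, and objects are illustrative assumptions.
audio_db, animation_db = {}, {}  # server-side stores keyed by data identifier

def handle_data_request(request: dict) -> dict:  # server side
    data_id = request["data_id"]
    return {"audio": audio_db[data_id], "animation": animation_db[data_id]}

def on_play_instruction(data_id: str, send, audio_player, animation_player):  # terminal side
    response = send({"data_id": data_id})     # send the data acquisition request
    audio_player.play(response["audio"])      # play audio and animation in sync
    animation_player.play(response["animation"])
```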
It should be noted that, in the embodiment of the present application, after the smoothed expression parameters are obtained, the template face model is adjusted directly to obtain the plurality of target face models; in another embodiment, step 306 does not need to be performed directly after step 305, and other manners can be adopted to obtain the plurality of target face models.
In one possible implementation, after step 305, the method further includes: storing the target audio data and the target expression parameter set in correspondence; in response to a playing instruction for the target audio data, querying the target expression parameter set stored in correspondence with the target audio data; and adjusting the template face model according to the expression parameters in the queried target expression parameter set to obtain the target face model.
That is, if a playing instruction for the target audio data is received subsequently, the expression parameters corresponding to the target audio data are queried, and the template face model is adjusted according to the queried expression parameters to obtain the target face model.
Optionally, after obtaining the target face model, playing the target audio data, and synchronously playing an expressive animation composed of a plurality of target face models.
If a playing instruction for the target audio data is received, the expression parameters corresponding to the target audio data are queried, the template face model is adjusted according to the queried expression parameters to obtain the target face models, and the target audio data and the expression animation composed of the plurality of target face models are played synchronously. In this way, the expression of the face model is driven in real time while the target audio data is played, the expression changes of a real person while singing are simulated in real time, the expressiveness of the music is enriched, and the music becomes perceivable.
Optionally, the computer device includes a terminal and a server. Steps 301 to 305 are executed by the server, and the server stores the target audio data in correspondence with the target expression parameter set. In response to a received data playing request sent by the terminal, the server determines, based on the audio identifier carried in the data playing request, the target audio data indicated by the audio identifier and the target expression parameter set corresponding to the target audio data, and sends them to the terminal. The terminal receives the target audio data and the target expression parameter set, adjusts the template face model according to the expression parameters in the target expression parameter set to obtain the target face models, plays the target audio data, and synchronously plays the expression animation composed of the target face models.
In this implementation, the terminal, in response to a playing instruction for the target audio data, sends a data acquisition request carrying the data identifier of the target audio data to the server; the server receives the data acquisition request, queries the target audio data corresponding to the data identifier and the expression parameter set stored in correspondence with the target audio data, and sends them to the terminal; after receiving the target audio data and the expression parameter set, the terminal obtains the plurality of target face models according to step 306, obtains the expression animation composed of the plurality of target face models, and plays the target audio data and the expression animation synchronously.
Optionally, the computer device includes a terminal and a server. Steps 301 to 305 are executed by the server, and the server stores the target audio data in correspondence with the target face models. In response to a received data playing request sent by the terminal, the server determines, based on the audio identifier carried in the data playing request, the target audio data indicated by the audio identifier and the target face models corresponding to the target audio data, and sends them to the terminal. The terminal receives the target audio data and the target face models, plays the target audio data, and synchronously plays the expression animation composed of the target face models.
It should be noted that, in the embodiment of the present application, after the plurality of target face models are obtained, the expression animation is composed of the plurality of target face models; in another embodiment, while the audio data is played, the template face model is adjusted in real time based on the obtained expression parameters corresponding to the multiple time points, and the template face model adjusted in real time is displayed, so as to present the effect of the facial expression in the template face model gradually changing over time.
It should be noted that, all the above optional technical solutions can be combined arbitrarily to form optional embodiments of the present application, and are not described in detail herein.
According to the method provided by the embodiment of the application, the change of the sound frequency in the audio data is represented in the form of a spectral image, and the target expression parameter set matched with the target audio data is obtained based on the mapping relationship between the spectral image and the expression parameter set. The target expression parameters in the target expression parameter set are used for simulating the facial expression of a singer when producing the audio data, so the target face model with facial expressions obtained according to the expression parameters in the set matches the facial expression of the singer when producing the audio data, which ensures the accuracy of the target face model. The target face model is subsequently displayed while the audio data is played, so that the audio data and the facial expression of the target face model are synchronized, the effect of a real person producing the audio data is simulated, and the playing effect is improved.
Furthermore, target face models at a plurality of time points are obtained, each having a facial expression matched with the audio data. When the audio data is played, the expression animation composed of the plurality of target face models is played synchronously to present the effect of a real person producing the audio data, which realizes a new form of audio data playback, enriches the expressiveness of the audio data, and improves the playing effect.
Furthermore, the spectral image matched with the target audio data is processed based on the mapping model to obtain the expression parameters, which ensures the accuracy of the obtained expression parameters and, in turn, the accuracy of the subsequent expression animation.
On the basis of the embodiment shown in fig. 3, before obtaining the target expression parameter set matched with the spectral image based on the mapping model, the mapping model needs to be trained, and the process of training the mapping model includes the following steps 308-310:
308. Acquire a sample spectrum image and a sample expression parameter set that are matched with sample audio data, where the sample expression parameters in the sample expression parameter set are used for simulating the facial expression of a singer when producing the sample audio data.
The sample audio data is any type of audio data; for example, the sample audio data is music, news broadcast data, audio data of storytelling, and the like. The sample spectrum image represents how the frequency of the sound in the sample audio data changes over time. The sample expression parameters describe facial expressions matched with the sample audio data and simulate the facial expression of the singer when producing the sample audio data; the sample expression parameters are obtained through manual labeling. For example, sample video data matched with the sample audio data is obtained, and the facial expressions in the sample video data are labeled to obtain the sample expression parameters. For example, the sample video data is manually selected standard sample video data that describes a scene of a singer producing audio data on a formal occasion, such as a picture of a singer singing a song at a large evening gala. The selected standard sample video data is labeled manually to ensure that the obtained sample expression parameters conform to the facial expressions of most people when producing the sample audio data, so as to ensure the accuracy of the subsequently trained mapping model.
In one possible implementation, the sample expression parameters include sample expression parameters corresponding to a plurality of facial regions at a plurality of time points. For example, at an interval of 0.04 second, the facial expression in a video frame of the sample video data is labeled to obtain the sample expression parameters of the plurality of facial regions at that time point, and the labeling is repeated to obtain the sample expression parameters corresponding to the plurality of facial regions at the plurality of time points.
In one possible implementation, this step 308 includes: obtaining the sample audio data and the sample expression parameters, and converting the sample audio data in each of a plurality of time periods included in the sample audio data into a matched sample spectrum image. The process of converting the sample audio data is the same as the conversion process in step 301 and is not described herein again.
309. Process the sample spectrum image based on the mapping model to obtain a predicted expression parameter set corresponding to the facial expression matched with the sample audio data.
This step is similar to steps 302 to 304 in the above embodiment and is not described herein again.
310. The mapping model is adjusted based on a difference between the predicted expression parameter set and the sample expression parameter set.
The predicted expression parameters are obtained by the mapping model, while the sample expression parameter set is obtained by manual labeling, so the difference between the predicted expression parameter set and the sample expression parameter set reflects the accuracy of the mapping model. The mapping model is adjusted based on this difference, so that its accuracy is improved and the training of the mapping model is realized.
Optionally, step 310 includes: training the mapping model based on the difference between the predicted expression parameters in the predicted expression parameter set and the sample expression parameters in the sample expression parameter set.
Optionally, the mapping model is iteratively trained in the manner described above.
It should be noted that, in the embodiment of the present application, the mapping model is adjusted based on the difference between the predicted expression parameters and the sample expression parameters; in another embodiment, steps 309 and 310 do not need to be executed, and other manners can be adopted to train the mapping model based on the sample spectrum image and the sample expression parameter set.
According to the method provided by the embodiment of the application, the mapping model is trained with the sample spectrum image and the sample expression parameter set, so that the trained mapping model learns the relationship between spectrum images and expression parameter sets; the accuracy of the mapping model is thereby improved, and the trained mapping model can map a spectrum image to a matched expression parameter set.
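A minimal training-loop sketch following steps 308 to 310 is given below, using the MappingModel sketched earlier and a mean-squared-error loss between the predicted and the manually labeled sample expression parameters; the dataset layout and hyperparameters are illustrative assumptions.

```python
# A minimal training-loop sketch for the mapping model; loss choice, dataset
# layout, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

def train_mapping_model(model, loader, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for sample_spec, sample_params in loader:       # (B,1,F,T), (B,T,num_params)
            pred_params = model(sample_spec)            # predicted expression parameter set
            loss = loss_fn(pred_params, sample_params)  # difference from labeled samples
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```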
It should be noted that the embodiment shown in fig. 3 is described by taking one spectrum image of the target audio data as an example. In another embodiment, the target audio data includes audio data in a plurality of time periods, and the audio data in each time period is converted into a matched spectrum image, so that a plurality of spectrum images are obtained. According to steps 302 to 304 in the embodiment shown in fig. 3, or by querying the mapping relationship, the expression parameter set matched with each spectrum image is obtained; the expression parameter set matched with any spectrum image includes the target expression parameters corresponding to a plurality of time points within the time period corresponding to that spectrum image. Then, steps 305 to 307 are executed, so that the expression animation is played synchronously while the target audio data is played.
Since the target audio data in the embodiment of the present application corresponds to a plurality of spectral images, and each spectral image corresponds to expression parameters at a plurality of time points, when step 305 is executed, the expression parameters corresponding to all time points of all the spectral images are smoothed together. For example, if the target audio data corresponds to 3 spectral images and each spectral image corresponds to 4 time points, expression parameters corresponding to 12 time points are obtained, and in step 305 the expression parameters corresponding to these 12 time points are smoothed.
Fig. 5 is a flowchart for generating an expression animation according to an embodiment of the present application, and as shown in fig. 5, the flowchart includes:
The audio data in a plurality of time periods included in the audio data is converted into matched spectral images, and the obtained spectral images form a spectrogram sequence.
Each spectral image is processed based on the mapping model to obtain expression parameters corresponding to a plurality of time points, the template face model is driven according to these expression parameters to obtain a plurality of target face models, and the obtained target face models form the expression animation.
Fig. 6 is another flowchart for generating an expression animation according to an embodiment of the present application, and as shown in fig. 6, the flowchart includes:
The audio data in a plurality of time periods included in the audio data is converted into matched spectral images, and the obtained spectral images form a spectrogram sequence.
For each spectral image, feature extraction is performed based on the feature extraction sub-model in the mapping model to obtain image features corresponding to the multiple time points of the spectral image, and the product of the image feature corresponding to each time point and the corresponding weight is determined as the updated image feature, so that updated image features corresponding to the multiple time points of the spectral image are obtained.
Feature conversion is performed on the updated image features corresponding to each spectral image based on the feature conversion sub-model in the mapping model, to obtain expression parameters corresponding to the multiple time points of each spectral image.
The obtained expression parameters corresponding to the multiple time points are smoothed to obtain updated expression parameters, the template face model is driven according to the updated expression parameters corresponding to the multiple time points to obtain a plurality of target face models, and the obtained target face models form the expression animation.
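The following minimal end-to-end sketch mirrors the flow of fig. 6, assuming librosa for the spectrogram conversion and reusing the MappingModel and smooth_params helpers sketched earlier; the segment length, the 0.04-second spacing, and the template_face.adjusted method are illustrative assumptions.

```python
# A minimal end-to-end sketch of the fig. 6 flow under the assumptions above.
import librosa
import numpy as np
import torch

def audio_to_spectrogram_sequence(path, segment_seconds=1.0, sr=16000):
    audio, _ = librosa.load(path, sr=sr)
    seg_len = int(segment_seconds * sr)
    specs = []
    for start in range(0, len(audio) - seg_len + 1, seg_len):
        seg = audio[start:start + seg_len]
        mel = librosa.feature.melspectrogram(y=seg, sr=sr)   # (freq_bins, time_bins)
        specs.append(librosa.power_to_db(mel, ref=np.max))
    return specs

def generate_expression_animation(path, model, template_face):
    params_per_segment = []
    for spec in audio_to_spectrogram_sequence(path):
        x = torch.tensor(spec, dtype=torch.float32)[None, None]  # (1, 1, F, T)
        with torch.no_grad():
            params_per_segment.append(model(x)[0].numpy())       # (T, num_params)
    params = np.concatenate(params_per_segment, axis=0)
    times = np.arange(len(params)) * 0.04                        # assumed 0.04 s spacing
    _, smoothed = smooth_params(times, params)                   # smoothing step
    return [template_face.adjusted(p) for p in smoothed]         # drive the template model
```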
Fig. 7 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present application. Referring to fig. 7, the apparatus includes:
an obtaining module 701, configured to obtain a spectrum image of target audio data, where the spectrum image is used to indicate a situation that a frequency of a sound in the target audio data changes with time;
the obtaining module 701 is further configured to obtain a target expression parameter set matched with the spectral image of the target audio data based on a mapping relationship between the spectral image and the expression parameter set, where a target expression parameter in the target expression parameter set is used to simulate a facial expression of a singer when the singer sends out the target audio data;
an adjusting module 702, configured to adjust the template face model according to the target expression parameter set, so as to obtain a target face model with facial expressions.
In a possible implementation manner, the target audio data includes audio data in multiple time periods, and the obtaining module 701 is configured to convert the audio data in the multiple time periods into matched spectral images respectively; and respectively acquiring target expression parameter sets matched with the plurality of spectrum images obtained by conversion based on the mapping relation.
In another possible implementation manner, the obtaining module 701 is further configured to query a mapping relationship based on the spectral image of the target audio data, and determine a reference spectral image similar to the spectral image of the target audio data in the mapping relationship; and determining the expression parameter set matched with the reference spectrum image in the mapping relation as a target expression parameter set.
In another possible implementation manner, the obtaining module 701 is further configured to map the spectral image of the target audio data based on a mapping model to obtain a target expression parameter set matched with the spectral image of the target audio data, where the mapping model includes a mapping relationship between the spectral image and the expression parameter set.
In another possible implementation manner, as shown in fig. 8, the obtaining module 701 includes:
the feature extraction unit 7101 is used for performing feature extraction on the spectral image of the target audio data based on the mapping model to obtain image features of the spectral image, wherein the image features are used for describing the relationship between the frequency and the time of sound in the spectral image;
and the feature transformation unit 7102 is used for performing feature transformation on the image features based on the mapping model to obtain a target expression parameter set.
In another possible implementation manner, the feature extraction unit 7101 is configured to perform feature extraction on a spectral image of the target audio data based on the mapping model to obtain image features corresponding to multiple time points, where the multiple time points belong to a time period corresponding to the target audio data;
and the feature transformation unit 7102 is used for respectively performing feature transformation on the image features corresponding to each time point based on the mapping model to obtain the target expression parameters corresponding to each time point.
In another possible implementation manner, as shown in fig. 8, the apparatus further includes:
a determining module 703, configured to determine, based on the mapping model, a product of the image feature corresponding to each time point and the corresponding weight as an updated image feature corresponding to each time point, where the weights corresponding to multiple time points are used to adjust the smoothness of the image features corresponding to multiple time points.
In another possible implementation manner, as shown in fig. 8, the apparatus further includes:
the obtaining module 701 is further configured to obtain a sample spectrum image of the sample audio data and a sample expression parameter set, where a sample expression parameter in the sample expression parameter set is used to simulate a facial expression of a singer when the singer sends out the sample audio data;
and a training module 704, configured to train the mapping model based on the sample spectrum image and the sample expression parameter set.
In another possible implementation manner, the target expression parameter set includes target expression parameters corresponding to a plurality of time points; an adjusting module 702, configured to adjust the template face model according to target expression parameters corresponding to multiple time points in the target expression parameter set, respectively, to obtain multiple target face models;
as shown in fig. 8, the apparatus further comprises:
and a playing module 705, configured to play an expressive animation composed of a plurality of target face models during playing the target audio data.
In another possible implementation manner, as shown in fig. 8, the apparatus further includes:
and the processing module 706 is configured to perform smoothing processing on the obtained target expression parameters corresponding to the multiple time points.
In another possible implementation manner, the target expression parameters corresponding to any time point include expression parameters of a plurality of facial regions corresponding to the time point; an adjusting module 702, configured to adjust corresponding face regions in the template face model according to target expression parameters of the multiple face regions corresponding to the multiple time points, respectively, to obtain multiple target face models.
In another possible implementation manner, as shown in fig. 8, the apparatus further includes:
and the processing module 706 is configured to perform smoothing processing on the target expression parameters corresponding to multiple time points in the same face region.
In another possible implementation manner, as shown in fig. 8, the apparatus further includes:
a storage module 707 for storing the target audio data in correspondence with an expressive animation composed of a plurality of target face models;
the playing module 705 is configured to query, in response to a playing instruction for the target audio data, an expression animation stored in correspondence with the target audio data; and playing the target audio data and synchronously playing the expression animation.
In another possible implementation manner, the obtaining module 701 is configured to obtain a spectrum image of the target audio data in response to a playing instruction for the target audio data;
and the playing module 705 is configured to play the target audio data and synchronously play the expression animation.
In another possible implementation manner, as shown in fig. 8, the apparatus further includes:
a storage module 707, configured to correspondingly store the target audio data and the target expression parameter set;
an adjusting module 702, configured to query, in response to a play instruction for the target audio data, a target expression parameter set stored in correspondence with the target audio data; and adjusting the template face model according to the inquired expression parameters in the target expression parameter set to obtain the target face model.
In another possible implementation manner, as shown in fig. 8, the apparatus further includes:
the determining module 703 is configured to determine, in response to a received data playing request sent by a terminal, target audio data indicated by an audio identifier and a target face model corresponding to the target audio data based on the audio identifier carried in the data playing request;
the sending module 708 is configured to send the target audio data and the target face model to the terminal, and the terminal plays the target audio data and synchronously plays the expression animation composed of the target face model.
It should be noted that, in the audio data processing apparatus provided in the above embodiment, when audio data is processed, the division into the above functional modules is merely used as an example for description. In practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the computer device may be divided into different functional modules to complete all or part of the functions described above. In addition, the audio data processing apparatus provided in the above embodiment and the audio data processing method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, and details are not described herein again.
The embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, and the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor, so as to implement the operations executed in the audio data processing method of the foregoing embodiment.
Optionally, the computer device is provided as a terminal. Fig. 9 is a schematic structural diagram of a terminal 900 according to an embodiment of the present application. The terminal 900 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 900 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
The terminal 900 includes: a processor 901 and a memory 902.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 905 may be one, disposed on the front panel of the terminal 900; in other embodiments, the number of the display panels 905 may be at least two, and each of the display panels is disposed on a different surface of the terminal 900 or is in a foldable design; in other embodiments, the display 905 may be a flexible display disposed on a curved surface or a folded surface of the terminal 900. Even more, the display screen 905 may be arranged in a non-rectangular irregular figure, i.e. a shaped screen. The Display panel 905 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. The front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The positioning component 908 is used to locate the current geographic Location of the terminal 900 for navigation or LBS (Location Based Service). The Positioning component 908 may be a Positioning component based on the united states GPS (Global Positioning System), the chinese beidou System, the russian glonass Positioning System, or the european union galileo Positioning System.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the display screen 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 912 may detect a body direction and a rotation angle of the terminal 900, and the gyro sensor 912 may cooperate with the acceleration sensor 911 to acquire a 3D motion of the user on the terminal 900. The processor 901 can implement the following functions according to the data collected by the gyro sensor 912: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 913 may be disposed on a side bezel of the terminal 900 and/or underneath the display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the user's holding signal of the terminal 900 may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at a lower layer of the display screen 905, the processor 901 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 905. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 914 is used for collecting a fingerprint of the user, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 901 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or vendor Logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor Logo.
The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is reduced. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
A proximity sensor 916, also referred to as a distance sensor, is provided on the front panel of the terminal 900. The proximity sensor 916 is used to collect the distance between the user and the front face of the terminal 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually decreases, the processor 901 controls the display 905 to switch from the bright-screen state to the off-screen state; when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually increases, the processor 901 controls the display 905 to switch from the off-screen state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of terminal 900, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
Optionally, the computer device is provided as a server. Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application, where the server 1000 may generate a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 1001 and one or more memories 1002, where the memory 1002 stores at least one computer program, and the at least one computer program is loaded and executed by the processors 1001 to implement the methods provided by the method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
The embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, and the memory stores at least one program code, and the at least one program code is loaded and executed by the processor, so as to implement the operations executed in the audio data processing method of the foregoing embodiment.
The embodiment of the present application further provides a computer-readable storage medium, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the operations executed in the audio data processing method of the foregoing embodiment.
Embodiments of the present application also provide a computer program product or a computer program comprising computer program code stored in a computer readable storage medium. The computer program code is loaded and executed by a processor to implement the operations performed in the audio data processing method of the above-described embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only an alternative embodiment of the present application and is not intended to limit the present application, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (19)
1. A method of audio data processing, the method comprising:
acquiring a spectrum image of target audio data, wherein the spectrum image is used for representing the condition that the frequency of sound in the target audio data changes along with time;
acquiring a target expression parameter set matched with the spectrum image of the target audio data based on the mapping relation between the spectrum image and the expression parameter set, wherein the target expression parameter in the target expression parameter set is used for simulating the facial expression of a singer when the singer sends out the target audio data;
and adjusting the template face model according to the target expression parameter set to obtain a target face model with the facial expression.
2. The method of claim 1, wherein the target audio data comprises audio data over a plurality of time periods, and wherein the acquiring the spectrum image of the target audio data comprises:
respectively converting the audio data in the plurality of time periods into matched spectrum images;
wherein the acquiring the target expression parameter set matched with the spectrum image of the target audio data based on the mapping relationship between the spectrum image and the expression parameter set comprises:
and respectively acquiring, based on the mapping relationship, target expression parameter sets matched with the plurality of spectrum images obtained by the conversion.
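Claim 2 splits the target audio into multiple time periods and converts each period into its own spectrum image. A small sketch of that segmentation, assuming fixed-length periods (the period length and STFT settings are arbitrary choices for illustration):

```python
import numpy as np
from scipy.signal import spectrogram

def split_into_periods(samples, sample_rate, period_seconds=1.0):
    """Cut the audio into consecutive fixed-length periods; the tail is kept as a shorter period."""
    period_len = int(sample_rate * period_seconds)
    return [samples[i:i + period_len] for i in range(0, len(samples), period_len)]

def periods_to_spectrum_images(samples, sample_rate):
    """One spectrum image per time period; each is later mapped to its own expression parameter set."""
    images = []
    for segment in split_into_periods(samples, sample_rate):
        if len(segment) < 2:
            continue
        _, _, sxx = spectrogram(segment, fs=sample_rate, nperseg=min(512, len(segment)))
        images.append(np.log1p(sxx))
    return images
```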
3. The method of claim 1, wherein the acquiring the target expression parameter set matched with the spectrum image of the target audio data based on the mapping relationship between the spectrum image and the expression parameter set comprises:
querying the mapping relationship based on the spectrum image of the target audio data, and determining, in the mapping relationship, a reference spectrum image similar to the spectrum image of the target audio data;
and determining the expression parameter set matched with the reference spectrum image in the mapping relationship as the target expression parameter set.
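Claim 3 realises the mapping as a lookup: find the stored reference spectrum image most similar to the query and return its expression parameter set. A sketch using cosine similarity over flattened images; the similarity measure and the assumption that all spectrum images share one resolution are illustrative choices, since the claim only requires "similar":

```python
import numpy as np

def lookup_expression_params(query_image, reference_images, reference_param_sets):
    """Return the expression parameter set whose reference spectrum image is most similar
    to the query spectrum image."""
    q = query_image.ravel()
    q = q / (np.linalg.norm(q) + 1e-8)
    best_idx, best_score = 0, -np.inf
    for idx, ref in enumerate(reference_images):
        r = ref.ravel()
        score = float(q @ (r / (np.linalg.norm(r) + 1e-8)))
        if score > best_score:
            best_idx, best_score = idx, score
    return reference_param_sets[best_idx]
```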
4. The method of claim 1, wherein the acquiring the target expression parameter set matched with the spectrum image of the target audio data based on the mapping relationship between the spectrum image and the expression parameter set comprises:
and mapping the spectrum image of the target audio data based on a mapping model to obtain the target expression parameter set matched with the spectrum image of the target audio data, wherein the mapping model comprises the mapping relationship.
5. The method of claim 4, wherein the mapping the spectrum image of the target audio data based on the mapping model to obtain the target expression parameter set matched with the spectrum image of the target audio data comprises:
performing, based on the mapping model, feature extraction on the spectrum image of the target audio data to obtain image features of the spectrum image, wherein the image features describe the relationship between the frequency of the sound in the spectrum image and time;
and performing feature transformation on the image features based on the mapping model to obtain the target expression parameter set.
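Claims 4 and 5 describe a mapping model with two stages: feature extraction over the spectrum image, followed by a feature transformation into the target expression parameter set. A minimal PyTorch sketch under the assumption of a small convolutional encoder and a fully connected head; the layer sizes and the parameter count of 52 are illustrative only and are not specified by the claims:

```python
import torch
import torch.nn as nn

class SpectrumToExpressionModel(nn.Module):
    """Mapping model: image features describe the frequency-vs-time content of the spectrum
    image; the feature transformation emits one expression parameter set."""

    def __init__(self, n_expression_params=52):
        super().__init__()
        # Feature extraction (claim 5, first step): convolutions over the spectrum image.
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Feature transformation (claim 5, second step): map features to expression parameters.
        self.transform = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, n_expression_params),
            nn.Sigmoid(),  # keep parameters in [0, 1], a common blendshape convention
        )

    def forward(self, spectrum_image):
        # spectrum_image: (batch, 1, n_freq_bins, n_time_frames)
        features = self.feature_extractor(spectrum_image)
        return self.transform(features)
```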
6. The method according to claim 5, wherein the performing feature extraction on the spectrum image of the target audio data based on the mapping model to obtain the image features of the spectrum image comprises:
performing, based on the mapping model, feature extraction on the spectrum image of the target audio data to obtain image features corresponding to a plurality of time points, wherein the plurality of time points belong to a time period corresponding to the target audio data;
the performing feature transformation on the image features based on the mapping model to obtain the target expression parameter set includes:
and performing, based on the mapping model, feature transformation on the image features corresponding to each time point, respectively, to obtain target expression parameters corresponding to each time point.
7. The method according to claim 6, wherein after the feature extraction is performed on the spectrum image of the target audio data based on the mapping model to obtain the image features corresponding to the plurality of time points, the method further comprises:
and determining, based on the mapping model, the product of the image features corresponding to each time point and a corresponding weight as updated image features corresponding to each time point, wherein the weights corresponding to the time points are used to adjust the degree of smoothness of the image features corresponding to the time points.
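Claim 7 multiplies the image features at each time point by a per-time-point weight that controls how smooth the feature sequence is. The scaling itself is a one-liner; the `smoothness_weights` helper below is only one hypothetical way to choose the weights (damping time points whose features jump far from the sequence mean) and is not prescribed by the claim:

```python
import numpy as np

def weight_time_point_features(features, weights):
    """Scale the image features at each time point by its weight.

    features: (n_time_points, feature_dim); weights: (n_time_points,)
    """
    weights = np.asarray(weights, dtype=features.dtype)
    return features * weights[:, None]

def smoothness_weights(features, eps=1e-8):
    """Hypothetical weight choice: damp time points whose features are far from the
    sequence mean, so the weighted feature sequence varies more gently over time."""
    distances = np.linalg.norm(features - features.mean(axis=0, keepdims=True), axis=1)
    return 1.0 / (1.0 + distances / (distances.mean() + eps))
```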
8. The method of claim 4, further comprising:
acquiring a sample spectrum image and a sample expression parameter set of sample audio data, wherein sample expression parameters in the sample expression parameter set are used to simulate facial expressions of a singer when the singer produces the sample audio data;
and training the mapping model based on the sample spectrum image and the sample expression parameter set.
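Claim 8 trains the mapping model on pairs of sample spectrum images and labelled sample expression parameter sets. A bare-bones supervised training loop under the assumption of an MSE objective; the loss, optimiser, and the `SpectrumToExpressionModel` class from the earlier sketch are illustrative, not specified by the claims:

```python
import torch
import torch.nn as nn

def train_mapping_model(model, sample_images, sample_param_sets, epochs=10, lr=1e-3):
    """Fit the spectrum-image -> expression-parameter mapping on labelled samples.

    sample_images: tensor (n_samples, 1, n_freq_bins, n_time_frames)
    sample_param_sets: tensor (n_samples, n_expression_params)
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        predicted = model(sample_images)
        loss = loss_fn(predicted, sample_param_sets)
        loss.backward()
        optimizer.step()
    return model
```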
9. The method according to any one of claims 1 to 8, wherein the target expression parameter set comprises target expression parameters corresponding to a plurality of time points, and the adjusting the template face model according to the target expression parameter set to obtain the target face model with the facial expression comprises:
respectively adjusting the template face model according to the target expression parameters corresponding to the plurality of time points in the target expression parameter set to obtain a plurality of target face models;
wherein after the template face model is respectively adjusted according to the target expression parameters corresponding to the plurality of time points in the target expression parameter set to obtain the plurality of target face models, the method further comprises:
and in the process of playing the target audio data, playing an expression animation composed of the plurality of target face models.
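Claim 9 builds one target face model per time point and plays the resulting expression animation while the audio plays. A sketch of the per-time-point adjustment and a simple frame-selection rule for keeping the animation on the audio clock; the blendshape-style arrays and the frame rate are assumptions carried over from the first sketch:

```python
import numpy as np

def build_expression_animation(template_vertices, blendshape_deltas, params_per_time_point):
    """One adjusted face model per time point; together they form the expression animation."""
    return [template_vertices + np.tensordot(params, blendshape_deltas, axes=1)
            for params in params_per_time_point]

def frame_for_playback_time(animation_frames, elapsed_seconds, frames_per_second=30):
    """Pick the animation frame to display while the target audio data is playing, so the
    expression animation stays synchronised with the audio."""
    index = int(elapsed_seconds * frames_per_second)
    return animation_frames[min(index, len(animation_frames) - 1)]
```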
10. The method according to claim 9, wherein before the template face model is adjusted according to the target expression parameters corresponding to the plurality of time points in the target expression parameter set, the method further comprises:
and smoothing the acquired target expression parameters corresponding to the plurality of time points.
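Claim 10 smooths the target expression parameters across time points before the face model is adjusted, which suppresses frame-to-frame jitter in the animation. A simple moving-average sketch; the window size is an arbitrary illustrative choice, and the same per-column smoothing also covers the per-facial-region case of claim 12:

```python
import numpy as np

def smooth_expression_params(params_per_time_point, window=5):
    """Moving-average smoothing of expression parameters along the time axis.

    params_per_time_point: (n_time_points, n_expression_params); each column (one parameter,
    e.g. one facial region's parameter) is smoothed independently.
    """
    params = np.asarray(params_per_time_point, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.empty_like(params)
    for col in range(params.shape[1]):
        smoothed[:, col] = np.convolve(params[:, col], kernel, mode="same")
    return smoothed
```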
11. The method of claim 9, wherein the target expression parameters corresponding to any time point comprise expression parameters of a plurality of facial regions corresponding to the time point, and the respectively adjusting the template face model according to the target expression parameters corresponding to the plurality of time points in the target expression parameter set to obtain the plurality of target face models comprises:
and respectively adjusting the corresponding facial regions in the template face model according to the target expression parameters of the plurality of facial regions corresponding to the plurality of time points to obtain the plurality of target face models.
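Claim 11 adjusts only the facial regions addressed by each parameter rather than the whole model at once. A sketch of one time point's per-region adjustment, assuming each region is described by a hypothetical vertex mask and a displacement field (both are illustration-only data structures, not part of the disclosure):

```python
import numpy as np

def adjust_facial_regions(template_vertices, region_vertex_masks, region_deltas, region_params):
    """Adjust only the corresponding facial regions of the template face model.

    region_vertex_masks[name]: boolean mask over the vertices belonging to that facial region
    region_deltas[name]: (n_vertices, 3) displacement applied at full parameter value
    region_params[name]: scalar expression parameter for that region at one time point
    """
    adjusted = template_vertices.copy()
    for name, param in region_params.items():
        mask = region_vertex_masks[name]
        adjusted[mask] += param * region_deltas[name][mask]
    return adjusted
```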
12. The method according to claim 11, wherein before the respectively adjusting the corresponding facial regions in the template face model according to the target expression parameters of the plurality of facial regions corresponding to the plurality of time points to obtain the plurality of target face models, the method further comprises:
and smoothing the target expression parameters of a same facial region corresponding to the plurality of time points.
13. The method according to claim 9, wherein after the template face model is respectively adjusted according to the target expression parameters corresponding to the plurality of time points in the target expression parameter set to obtain the plurality of target face models, the method further comprises:
storing the target audio data in correspondence with an expression animation composed of the plurality of target face models;
wherein the playing, in the process of playing the target audio data, of the expression animation composed of the plurality of target face models comprises:
in response to a playing instruction for the target audio data, querying the expression animation stored in correspondence with the target audio data;
and playing the target audio data and synchronously playing the expression animation.
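Claims 13 to 15 store the expression animation (or the parameter set) keyed to the target audio data so that a later playing instruction can fetch it and play both in sync. A minimal in-memory sketch of that store-and-query step; a real implementation would presumably persist to disk or a database, and the audio-identifier keying is an assumption:

```python
# Keyed by an audio identifier; values pair the audio with its precomputed animation frames.
_animation_store = {}

def store_animation(audio_id, audio_samples, animation_frames):
    """Store the target audio data in correspondence with its expression animation."""
    _animation_store[audio_id] = (audio_samples, animation_frames)

def on_play_instruction(audio_id):
    """On a playing instruction, look up the stored animation so the caller can play the
    target audio data and the expression animation synchronously."""
    audio_samples, animation_frames = _animation_store[audio_id]
    return audio_samples, animation_frames
```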
14. The method of claim 9, wherein the acquiring the spectrum image of the target audio data comprises:
in response to a playing instruction for the target audio data, acquiring the spectrum image of the target audio data;
and the playing, in the process of playing the target audio data, of the expression animation composed of the plurality of target face models comprises:
and playing the target audio data and synchronously playing the expression animation.
15. The method according to any one of claims 1 to 8, wherein after the acquiring the target expression parameter set matched with the spectrum image of the target audio data based on the mapping relationship between the spectrum image and the expression parameter set, the method further comprises:
correspondingly storing the target audio data and the target expression parameter set;
wherein the adjusting the template face model according to the target expression parameter set to obtain the target face model with the facial expression comprises:
in response to a playing instruction for the target audio data, querying the target expression parameter set stored in correspondence with the target audio data;
and adjusting the template face model according to the queried expression parameters in the target expression parameter set to obtain the target face model.
16. The method of any one of claims 1 to 8, wherein after the adjusting the template face model according to the target expression parameter set to obtain the target face model with the facial expression, the method further comprises:
in response to receiving a data playing request sent by a terminal, determining, based on an audio identifier carried in the data playing request, the target audio data indicated by the audio identifier and the target face model corresponding to the target audio data;
and sending the target audio data and the target face model to the terminal, so that the terminal plays the target audio data and synchronously plays an expression animation composed of the target face model.
17. An audio data processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a spectrum image of target audio data, wherein the spectrum image represents how the frequency of sound in the target audio data changes over time;
the acquisition module being further configured to acquire a target expression parameter set matched with the spectrum image of the target audio data based on a mapping relationship between the spectrum image and the expression parameter set, wherein target expression parameters in the target expression parameter set are used to simulate a facial expression of a singer when the singer produces the target audio data;
and an adjustment module, configured to adjust a template face model according to the target expression parameter set to obtain a target face model with the facial expression.
18. A computer device comprising a processor and a memory, the memory having stored therein at least one program code, the at least one program code being loaded into and executed by the processor to perform the operations carried out in the audio data processing method according to any one of claims 1 to 16.
19. A computer-readable storage medium having stored therein at least one program code, the at least one program code being loaded and executed by a processor to perform the operations performed in the audio data processing method according to any one of claims 1 to 16.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110738574.9A CN113420177A (en) | 2021-06-30 | 2021-06-30 | Audio data processing method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113420177A true CN113420177A (en) | 2021-09-21 |
Family
ID=77717885
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110738574.9A (CN113420177A, pending) | Audio data processing method and device, computer equipment and storage medium | 2021-06-30 | 2021-06-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113420177A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6404925B1 (en) * | 1999-03-11 | 2002-06-11 | Fuji Xerox Co., Ltd. | Methods and apparatuses for segmenting an audio-visual recording using image similarity searching and audio speaker recognition |
CN111383642A (en) * | 2018-12-27 | 2020-07-07 | Tcl集团股份有限公司 | Voice response method based on neural network, storage medium and terminal equipment |
CN109887524A (en) * | 2019-01-17 | 2019-06-14 | 深圳壹账通智能科技有限公司 | A kind of singing marking method, device, computer equipment and storage medium |
CN110009716A (en) * | 2019-03-28 | 2019-07-12 | 网易(杭州)网络有限公司 | Generation method, device, electronic equipment and the storage medium of facial expression |
CN111489424A (en) * | 2020-04-10 | 2020-08-04 | 网易(杭州)网络有限公司 | Virtual character expression generation method, control method, device and terminal equipment |
CN112618911A (en) * | 2020-12-31 | 2021-04-09 | 四川音乐学院 | Music feedback adjusting system based on signal processing |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114581570A (en) * | 2022-03-01 | 2022-06-03 | 浙江同花顺智能科技有限公司 | Three-dimensional face action generation method and system |
CN114581570B (en) * | 2022-03-01 | 2024-01-26 | 浙江同花顺智能科技有限公司 | Three-dimensional face action generation method and system |
CN115170703A (en) * | 2022-06-30 | 2022-10-11 | 北京百度网讯科技有限公司 | Virtual image driving method, device, electronic equipment and storage medium |
CN115359156A (en) * | 2022-07-31 | 2022-11-18 | 荣耀终端有限公司 | Audio playing method, device, equipment and storage medium |
CN115359156B (en) * | 2022-07-31 | 2023-12-05 | 荣耀终端有限公司 | Audio playing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |