CN112990283A - Image generation method and device and electronic equipment - Google Patents

Image generation method and device and electronic equipment Download PDF

Info

Publication number
CN112990283A
CN112990283A (application CN202110237774.6A)
Authority
CN
China
Prior art keywords
audio
parameter
feature
image
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110237774.6A
Other languages
Chinese (zh)
Other versions
CN112990283B (en)
Inventor
袁燚
许曼玲
范长杰
胡志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority claimed from CN202110237774.6A
Publication of CN112990283A
Application granted
Publication of CN112990283B
Status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/08 Feature extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image generation method and apparatus, and an electronic device. The method includes: adjusting an initial action parameter based on an audio feature of a target audio to obtain a first action parameter, where the action indicated by the first action parameter matches the audio feature; and generating a target image based on the first action parameter and an initial image containing a target object, where, in the target image, the target object has the action indicated by the first action parameter. In this method, the action parameter is adjusted by means of the audio feature of the audio, so that the action indicated by the resulting first action parameter matches the audio feature and the target object in the generated image performs that action.

Description

Image generation method and device and electronic equipment
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image generation method and apparatus, and an electronic device.
Background
When a terminal device plays audio, a specific image is displayed on the screen and its content changes with the rhythm of the audio, which can improve the user's visual experience while listening. In the related art, the image that changes with the audio rhythm is usually a jumping bar spectrogram: a Fourier transform is applied to the audio being played to obtain its frequency-domain features, and the bar spectrogram is generated from those features. However, such image content is monotonous and unattractive to users, so the visual experience is poor.
Disclosure of Invention
In view of this, the present invention provides an image generation method and apparatus, and an electronic device, so that the displayed image content varies during audio playback and the user's visual experience is improved.
In a first aspect, an embodiment of the present invention provides an image generation method, where the method includes: adjusting the initial action parameter based on the audio characteristics of the target audio to obtain a first action parameter; wherein the action indicated by the first action parameter matches the audio feature; generating a target image based on the first motion parameter and an initial image containing the target object; in the target image, the target object has a motion indicated by the first motion parameter.
The target object comprises a human face; the motion indicated by the first motion parameter comprises an expressive motion of a human face.
The audio feature of the target audio is used to adjust the action amplitude of the action indicated by the initial action parameter; the action amplitude of the action indicated by the first action parameter matches the audio feature.
The step of adjusting the initial motion parameter based on the audio characteristic of the target audio to obtain the first motion parameter includes: determining a parameter adjustment weight according to the audio characteristics of the target audio; and scaling the initial action parameter based on the parameter adjustment weight to obtain a first action parameter.
The step of determining the parameter adjustment weight according to the audio characteristics of the target audio includes: on the time dimension of the audio features, calculating an average value of feature vectors corresponding to all time points on the time dimension to obtain initial parameters; and mapping the initial parameters to a preset numerical range to obtain parameter adjustment weights.
Before the step of determining the parameter adjustment weight according to the audio feature of the target audio, the method further includes: in the audio features, a specified number of intermediate time points and feature vectors corresponding to each intermediate time point are inserted between any two adjacent initial time points to obtain final audio features; and determining the characteristic vectors corresponding to the intermediate time points based on the characteristic vectors corresponding to the two initial time points adjacent to the intermediate time points.
The audio characteristics of the target audio are obtained by the following steps: extracting a Mel Frequency Cepstrum Coefficient (MFCC) parameter of the target audio; the MFCC parameters comprise a plurality of time points of a preset time interval, and each time point corresponds to an MFCC value; and inputting the MFCC parameters into a pre-trained feature extraction network, and outputting the audio features of the target audio.
The feature extraction network comprises a plurality of feature extraction modules which are sequentially connected in series; the feature extraction module comprises a convolution layer, a batch normalization layer and an activation function layer.
Before the step of inputting the MFCC parameters into the pre-trained feature extraction network and outputting the audio features of the target audio, the method further includes: based on a preset filling value, performing numerical filling on numerical values on a frequency dimension of the MFCC parameter to enable the numerical value quantity on the frequency dimension to be matched with the numerical value quantity on a time dimension of the MFCC parameter; and copying the MFCC parameters after numerical filling to obtain the MFCC parameters with the specified channel number.
The feature extraction network is obtained by training in the following way: inputting MFCC parameters of the sample audio into a coding network, and outputting a feature vector of the sample audio; inputting the characteristic vector of the sample audio into a decoding network to obtain an output parameter of the sample audio; calculating a loss value between the output parameter and the MFCC parameter of the sample audio based on a preset loss function, training a coding network and a decoding network based on the loss value, and determining the coding network after training as a feature extraction network.
The decoding network comprises a plurality of decoding modules which are sequentially connected in series; the decoding module comprises a transposed convolution layer, a batch normalization layer and an activation function layer.
The step of generating the target image based on the first motion parameter and the initial image including the target object includes: extracting image features of an initial image containing a target object; wherein the image features comprise global features and detail features; fusing the first action parameter and the detail characteristic to obtain a fused characteristic; and generating a target image based on the fusion feature and the global feature.
The step of performing fusion processing on the first action parameter and the detail feature to obtain a fusion feature includes: acquiring feature data in a first appointed channel from the image features; wherein the feature data in the first specified channel contains detail features; acquiring parameter data in a second specified channel from the first action parameter; and carrying out point-by-point addition processing on the feature data in the first specified channel and the parameter data in the second specified channel to obtain the fusion feature.
The step of generating the target image based on the fusion feature and the global feature includes: inputting the fusion feature and the global feature, as a hidden vector, into a pre-trained image generation network, and outputting the target image; wherein the hidden vector is used to control the image generation network to output an image matching the features indicated by the hidden vector.
In a second aspect, an embodiment of the present invention provides an image generating apparatus, including: the parameter adjusting module is used for adjusting the initial action parameters based on the audio characteristics of the target audio to obtain first action parameters; wherein the action indicated by the first action parameter matches the audio feature; an image generation module for generating a target image based on the first motion parameter and an initial image containing the target object; in the target image, the target object has a motion indicated by the first motion parameter.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a processor and a memory, where the memory stores machine executable instructions capable of being executed by the processor, and the processor executes the machine executable instructions to implement the image generation method.
In a fourth aspect, embodiments of the present invention provide a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the image generation method described above.
The embodiment of the invention has the following beneficial effects:
According to the image generation method and apparatus and the electronic device provided by the embodiments of the present invention, an initial action parameter is first adjusted based on an audio feature of a target audio to obtain a first action parameter whose indicated action matches the audio feature; a target image is then generated based on the first action parameter and an initial image containing a target object, and the target object in the target image has the action indicated by the first action parameter. Because the action parameter is adjusted by means of the audio feature of the audio, the action indicated by the first action parameter matches the audio feature, and the target object in the generated image performs that action.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of an image generation method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a training process of a feature extraction network according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating adjusting image features based on audio features according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a StyleGAN network according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating generation of a plurality of target images during playing of target audio according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a change in facial image with audio rhythm according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Considering that, in the related art, the content of the image displayed on the screen while a terminal device plays music is relatively monotonous, which leads to a poor visual experience, the image generation method and apparatus and the electronic device provided by the embodiments of the present invention can be applied to image display scenarios during music playback, such as web pages and applications (APPs).
Referring first to the flowchart of an image generation method shown in fig. 1, the method includes the steps of:
step S102, adjusting initial action parameters based on the audio characteristics of the target audio to obtain first action parameters; wherein the action indicated by the first action parameter matches the audio feature;
the target audio can be audio such as music, voice or other types of sounds; the audio features of the target audio can be extracted through a trained audio feature extraction network; the audio features of the target audio may generally include characteristics of tone, rhythm, speech rate, emotion, and the like of sounds in the audio. The initial motion parameter generally corresponds to a preset initial motion, taking an expression motion of a human face as an example, the initial motion may be that the human face does not have the expression motion, or has a default expression motion. The initial motion parameter may be set in advance, or may be extracted from an image, for example, from a target image described below. When the action parameters change, actions corresponding to the action parameters also change, and the playing content of the target audio changes along with the change of the time sequence in the playing process, so that the audio characteristics of different time points or time points also change; and adjusting the initial motion parameter based on the audio characteristics, wherein the obtained first motion parameter is continuously changed. The specific adjustment manner may be to scale, replace, invert, map according to a certain rule, and the like.
Different action parameters may indicate different types of actions, or the same type of action with different amplitudes. Conversely, when the audio features are the same, the resulting first action parameters are generally the same. Therefore, through the above step, there is a definite correspondence between the audio feature and the first action parameter, and hence between the audio feature and the action indicated by the first action parameter; the two are matched.
Step S104, generating a target image based on the first motion parameter and the initial image containing the target object; in the target image, the target object has a motion indicated by the first motion parameter.
The target object may be a human face, another part of a human body, another animal, or a still object; this embodiment does not limit the type of the target object. The target object in the initial image has a default action; for a human face, this may be a neutral or a smiling expression. The initial image contains image information of the target object and may also contain image information of a background area. The first action parameter can change the action of the target object so that it performs the action indicated by the first action parameter, thereby yielding the target image.
In actual implementation, the image features of the initial image may be extracted, the first action parameter may be incorporated into those image features as the action feature of the target object so that the image features carry information related to the first action parameter, and the target image may then be generated from the resulting image features.
In the image generation method described above, the initial action parameter is first adjusted based on the audio feature of the target audio to obtain the first action parameter, whose indicated action matches the audio feature; the target image is then generated based on the first action parameter and the initial image containing the target object, and the target object in the target image has the action indicated by the first action parameter. Because the action parameter is adjusted by means of the audio feature, the action indicated by the first action parameter matches the audio feature, and the target object in the generated image performs that action.
To further enrich the image variation, the target object in this embodiment may include a human face, and the initial image containing the target object may be preset or provided by the user; the action indicated by the first action parameter then includes an expression action of the face. Controlling the facial expression in the image through the audio makes the expression change along with the audio content, which makes the image display more engaging and further improves the user's visual experience.
In a specific implementation, the audio feature of the target audio is used to adjust the action amplitude of the action indicated by the initial action parameter: when the audio is gentle and soft, the adjusted action amplitude is small, and when the audio is intense, the adjusted action amplitude is large. Taking a facial expression as an example, as the audio gradually changes from gentle to intense, the expression can change from a slight smile to a laugh and then to an exuberant laugh; that is, the amplitude of the laughing action grows. Because the action amplitude indicated by the first action parameter matches the audio feature, the object's action follows the audio more coherently and vividly, so the user experiences the change of the audio rhythm both visually and aurally, which improves the overall experience.
The following describes how the audio feature of the target audio is obtained. First, Mel-scale Frequency Cepstral Coefficient (MFCC) parameters of the target audio are extracted; the MFCC parameters comprise a plurality of time points at a preset time interval, and each time point corresponds to an MFCC value. Since audio is a time sequence, the target audio is clipped at fixed intervals when extracting the MFCC parameters, and the MFCCs are computed for each clip; for example, the audio is clipped every 100 milliseconds and the MFCCs are computed for each 100-millisecond segment. Thus, a target audio of a given length is usually clipped into multiple segments, so its MFCC parameters contain multiple MFCC values, each corresponding to a time point, with adjacent time points spaced by the clipping interval. In addition, considering that the sound characteristics perceived by the human ear are mainly concentrated in the low-frequency range, the audio may be filtered to discard the high-frequency components and reduce unnecessary computation, and the MFCCs are computed only for the low-frequency part; in practical implementation, a frequency threshold may be preset, for example 13, and frequencies above the threshold are filtered out. The MFCC parameters are then input into a pre-trained feature extraction network, which outputs the audio feature of the target audio.
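As a concrete illustration of this step, the following is a minimal sketch of the MFCC extraction using the librosa library; the sampling rate, FFT size, 100-millisecond hop and the use of 13 coefficients are assumptions made for illustration rather than values fixed by this embodiment.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path: str, n_mfcc: int = 13, frame_ms: int = 100) -> np.ndarray:
    """Return an (n_mfcc, T) matrix: one MFCC vector per ~frame_ms of audio."""
    y, sr = librosa.load(wav_path, sr=16000)      # mono waveform (16 kHz assumed)
    hop = int(sr * frame_ms / 1000)                # clip the audio once every 100 ms
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=2048, hop_length=hop)
    return mfcc                                    # frequency dimension x time dimension
```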
The feature extraction network comprises a plurality of feature extraction modules connected in series, each consisting of a convolution layer, a batch normalization layer, and an activation function layer. In practical implementation, eight feature extraction modules may be connected in series; the modules share the same structure, but their parameters after training may differ. The convolution layer may specifically be a 2D convolution layer, the batch normalization layer is also known as a BatchNorm layer, and the activation function may specifically be the ReLU function.
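A minimal PyTorch sketch of such an encoder is given below: eight serially connected blocks of 2D convolution, batch normalization and ReLU, followed by a projection to a 512-dimensional feature. The channel counts, kernel size, stride and the final linear head are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # one feature extraction module: 2D convolution + batch normalization + ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class AudioFeatureEncoder(nn.Module):
    def __init__(self, in_channels: int = 3, base: int = 16):
        super().__init__()
        chs = [in_channels] + [base * 2 ** min(i, 5) for i in range(8)]  # e.g. 3, 16, 32, ..., 512
        self.blocks = nn.Sequential(*[conv_block(chs[i], chs[i + 1]) for i in range(8)])
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(512))      # 512-d audio feature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.blocks(x))
```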
So that the input data meets the format requirements of the feature extraction network, the MFCC parameters need to be preprocessed before being fed into the network. Specifically, the values in the frequency dimension of the MFCC parameters are first padded with a preset filling value so that the number of values in the frequency dimension matches the number of values in the time dimension. The MFCC parameters of the target audio are two-dimensional data with a time dimension and a frequency dimension: the time dimension contains a series of time points at the preset interval, and the frequency dimension contains the MFCC values corresponding to each time point. When the target audio is long, the number of time points may greatly exceed the number of MFCC values per time point; to obtain a two-dimensional matrix of equal width and height, the frequency dimension is padded, for example by appending the filling value (zero or another value) after the MFCC values, until the two dimensions contain the same number of values. In other embodiments, the feature extraction network may also require a specific number of input channels; since the MFCC matrix is single-channel data, the value-padded MFCC parameters are copied to obtain MFCC parameters with the specified number of channels. In this way, the dimensionality or the channel number of the input data is increased by stacking.
In addition, the MFCC values may be distributed over a wide range; in that case the values in the MFCC parameters are normalized, i.e. mapped to a specified range such as [0, 1], to facilitate subsequent processing. After the preprocessed MFCC parameters are fed into the feature extraction network, the audio feature of the target audio is output.
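The preprocessing described in the preceding two paragraphs can be sketched as follows; the zero filling value, the three-channel target and the min-max normalization are assumptions used for illustration.

```python
import numpy as np

def preprocess_mfcc(mfcc: np.ndarray, channels: int = 3, pad_value: float = 0.0) -> np.ndarray:
    """mfcc: (n_freq, n_time). Pad the frequency dimension, normalize to [0, 1], replicate channels."""
    n_freq, n_time = mfcc.shape
    if n_freq < n_time:                                   # pad frequency dim up to the time dim
        pad = np.full((n_time - n_freq, n_time), pad_value, dtype=mfcc.dtype)
        mfcc = np.concatenate([mfcc, pad], axis=0)
    lo, hi = mfcc.min(), mfcc.max()
    mfcc = (mfcc - lo) / (hi - lo + 1e-8)                 # map values into [0, 1]
    return np.repeat(mfcc[None, :, :], channels, axis=0)  # stack to the specified channel number
```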
The feature extraction network needs to be trained in advance. In this embodiment it is trained in an unsupervised manner, which avoids labeling the sample data and saves time and labor. Specifically, the feature extraction network is obtained by the following training procedure: the MFCC parameters of a sample audio are input into an encoding network, which outputs a feature vector of the sample audio; the feature vector is input into a decoding network, which outputs an output parameter of the sample audio; a loss value between the output parameter and the MFCC parameters of the sample audio is computed using a preset loss function, the encoding network and the decoding network are trained based on the loss value, and the trained encoding network is taken as the feature extraction network.
For ease of understanding, Fig. 2 shows the training flow of the feature extraction network. After a sample audio is input, its MFCC parameters are extracted and preprocessed (value padding, stacking to increase dimensionality, and so on); the preprocessed MFCC parameters are input into the encoding network, which outputs the encoding vector of the sample audio, i.e. its feature vector; the encoding vector is input into the decoding network to obtain the output parameter of the sample audio; a loss value between the output parameter and the MFCC parameters of the sample audio is computed with the preset loss function, and the network parameters of the encoding and decoding networks are optimized based on the loss value; it is then checked whether the preset number of training iterations has been reached, and if not, the procedure returns to extracting MFCC parameters of sample audio, otherwise the training ends.
The network structure of the encoding network can follow that of the feature extraction network, and the decoding network mirrors the encoding network: it comprises a plurality of decoding modules connected in series, each consisting of a transposed convolution layer, a batch normalization layer, and an activation function layer. In practical implementation, eight decoding modules may be connected in series; the modules share the same structure, but their parameters after training may differ. The transposed convolution layer may specifically be a 2D transposed convolution layer, the batch normalization layer is also known as a BatchNorm layer, and the activation function may specifically be the ReLU function.
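The unsupervised training described above can be sketched as follows; the Adam optimizer, learning rate, epoch count and mean-squared-error loss are assumptions, and the decoder is assumed to map the encoding vector back to a tensor of the same shape as the input MFCC batch.

```python
import torch
import torch.nn as nn

def deconv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # one decoding module: transposed convolution + batch normalization + ReLU
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def train_autoencoder(encoder: nn.Module, decoder: nn.Module, loader, epochs: int = 50) -> nn.Module:
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    criterion = nn.MSELoss()                      # preset loss function (assumed to be MSE here)
    for _ in range(epochs):
        for mfcc in loader:                       # preprocessed MFCC batches, shape (B, C, T, T)
            code = encoder(mfcc)                  # feature vector of the sample audio
            recon = decoder(code)                 # output parameter of the sample audio
            loss = criterion(recon, mfcc)         # loss between output parameter and input MFCCs
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return encoder                                # trained encoder = feature extraction network
```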
Through unsupervised learning with the decoding network and the encoding network, an encoding network suitable for audio feature extraction is obtained and used as the feature extraction network. After training is complete, the MFCC parameters of the target audio are used as the network input, and audio features are extracted and output at the specified time interval; each audio feature can be a time-stamped vector Wc with a dimension of (1, 512).
After the audio feature of the target audio is obtained, the initial action parameter is adjusted based on it. In a specific implementation, a parameter adjustment weight is determined from the audio feature of the target audio, and the initial action parameter is scaled by that weight to obtain the first action parameter. The parameter adjustment weight can enlarge or reduce the values in the initial action parameter and thereby control the action amplitude; specifically, the first action parameter can be obtained by multiplying the initial action parameter by the parameter adjustment weight. When the weight is not equal to one, the amplitude of the action indicated by the first action parameter usually differs from that of the initial action parameter. In this way, the weight by which the action parameter is adjusted is determined from the audio feature, so the action amplitude can be adjusted based on the audio feature.
When determining the parameter adjustment weight from the audio feature, in a specific implementation, an average is taken, along the time dimension of the audio feature, over the feature vector corresponding to each time point to obtain an initial parameter: each time point corresponds to a feature vector of a certain length, such as (1, 512), so the feature vector contains multiple feature values, and averaging these values yields the initial parameter. Considering that the feature values may spread over a wide range, the resulting average may be too large or too small; if the initial parameter were used directly as the weight, the final action of the target object might be exaggerated and startle or unsettle the user. To avoid this, this embodiment maps the initial parameter into a preset numerical range to obtain the parameter adjustment weight; the range may be preset, for example [0, 1], or another value range. In this way, the parameter adjustment weight stays within a reasonable range, so the generated action of the target object is reasonable and lifelike and provides the user with a good visual experience.
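A sketch of this weight computation and the subsequent scaling is given below; using a sigmoid to map the mean into [0, 1] is one possible choice of mapping and is an assumption here, as are the function names.

```python
import numpy as np

def adjustment_weight(feature_vec: np.ndarray) -> float:
    mean = float(feature_vec.mean())        # initial parameter: mean of the feature values
    return 1.0 / (1.0 + np.exp(-mean))      # map the initial parameter into the preset range [0, 1]

def first_action_parameter(initial_param: np.ndarray, feature_vec: np.ndarray) -> np.ndarray:
    # scale the initial action parameter by the parameter adjustment weight
    return adjustment_weight(feature_vec) * initial_param
```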
As described above, the audio feature of the target audio contains a number of time points at the preset interval and a feature vector for each time point. If the interval is long, the feature vectors of two adjacent time points can differ considerably, making the motion of the target object jumpy rather than smooth. To avoid this, before the parameter adjustment weight is determined, a specified number of intermediate time points, together with a feature vector for each, are inserted between every two adjacent initial time points of the audio feature to obtain the final audio feature; the feature vector of an intermediate time point is determined from the feature vectors of the two initial time points adjacent to it. Here, an initial time point is a time point at which a feature vector was extracted, and an intermediate time point is an inserted one. This approach keeps the cost of generating audio features low while allowing the generated images to meet the smoothness the human eye expects of motion.
As an example, if one feature vector is extracted every 0.25 seconds, the interval between adjacent initial time points is 0.25 seconds; seven intermediate time points can then be inserted, uniformly distributed, between every two adjacent initial time points. The feature vector of an intermediate time point is a weighted average of the feature vectors of the two adjacent initial time points, with the weights determined by the position of the intermediate time point. For example, for two initial time points A and B in chronological order, when computing the feature vector of an intermediate time point close to A, a larger weight is given to A so that the resulting vector is more similar to A's feature vector; likewise, an intermediate time point close to B gives a larger weight to B so that its vector is more similar to B's; and an intermediate time point midway between A and B weights the two equally. In this way, the motion of the target object changes smoothly over the period from A to B.
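The interpolation in this example can be sketched as follows (linear interpolation between neighbouring initial time points; the function name and array layout are illustrative).

```python
import numpy as np

def interpolate_features(features: np.ndarray, n_insert: int = 7) -> np.ndarray:
    """features: (T, D), one vector per initial time point; returns the densified sequence."""
    out = []
    for a, b in zip(features[:-1], features[1:]):
        out.append(a)
        for k in range(1, n_insert + 1):
            t = k / (n_insert + 1)              # near A -> small t -> larger weight on A
            out.append((1.0 - t) * a + t * b)   # weighted average of the two neighbours
    out.append(features[-1])
    return np.stack(out)
```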
After the first action parameter is obtained, the target image is generated based on the first action parameter and the initial image containing the target object. Specifically, the image features of the initial image are first extracted; these image features include global features and detail features. The first action parameter is then fused with the detail features to obtain fusion features, and the target image is generated based on the fusion features and the global features.
An image feature extraction network can be trained in advance; inputting the initial image into this network yields the image features of the initial image. The image feature extraction network may be implemented with a reverse encoding network such as a StyleGAN2 encoder: after the initial image is input and a number of iterations are run, for example 1000, an encoding vector Wf with a scale of (18, 512) is output, and this encoding vector is the image feature of the initial image. In the reverse encoding network, a hidden vector w is initialized to zero and used to generate a random image; features are extracted from the randomly generated image and from the target sample image with a VGG16 network; the difference between the two sets of features serves as the loss function, and the hidden vector w is iteratively optimized so that the image generated from w approaches the target image as closely as possible, which yields the reverse encoding, i.e. the encoding vector, corresponding to the target image.
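A minimal sketch of this latent inversion follows. It assumes a StyleGAN-style generator object exposing a `synthesis` call that accepts a (1, 18, 512) latent, that the rendered and target images share resolution and normalization, and that a truncated VGG16 feature extractor is used as the perceptual comparator; the iteration count and learning rate are also assumptions.

```python
import torch
import torchvision.models as models

def invert_image(generator, target_img: torch.Tensor, steps: int = 1000) -> torch.Tensor:
    """Optimize a hidden vector w so that generator.synthesis(w) approaches target_img."""
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
    for p in vgg.parameters():
        p.requires_grad_(False)
    w = torch.zeros(1, 18, 512, requires_grad=True)       # hidden vector initialized to zero
    optimizer = torch.optim.Adam([w], lr=0.01)
    target_feat = vgg(target_img)                          # VGG16 features of the target image
    for _ in range(steps):
        rendered = generator.synthesis(w)                  # image generated from the current latent
        loss = torch.nn.functional.mse_loss(vgg(rendered), target_feat)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.detach()                                      # reverse encoding / encoding vector Wf
```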
On this basis, a standard image of an object and an image of the same object performing a specified action are separately fed into the reverse encoding network to obtain a first hidden vector for the standard image and a second hidden vector for the image with the specified action; the difference between the two is the control vector for that action, i.e. the initial action parameter corresponding to the specified action. Taking face images as an example, reverse-encoding two expressions of the same face (such as not smiling and smiling) gives two hidden vectors whose difference is the control vector of the "smiling" facial feature, which can serve as the initial action parameter. In this way, initial action parameters corresponding to various expressions, and more generally to various actions, can be obtained.
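Reusing the `invert_image` sketch above, the expression control vector can be obtained as the difference of two inverted latents; the image arguments and the function name are illustrative.

```python
def expression_control_code(generator, neutral_img, expression_img):
    w_neutral = invert_image(generator, neutral_img)       # first hidden vector (standard image)
    w_expr = invert_image(generator, expression_img)       # second hidden vector (specified action)
    return w_expr - w_neutral                               # initial action parameter for the action
```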
The image features comprise the global features and the detail features of the initial image. For an initial image containing a human face, the global features may be attributes such as the gender and age of the face, and the detail features may be the facial expression, the hair texture, and similar details. Taking the (18, 512) encoding vector as an example, it contains 18 dimensions (also called channels) with a feature length of 512 per dimension; the global features may lie in dimensions 1 to 10 and the detail features in dimensions 11 to 18.
To prevent the blending-in of the first action parameter from weakening the overall similarity of the image, and to avoid large changes of the target object that would spoil the visual rhythm, this embodiment fuses the first action parameter only with the detail features. The first action parameter thus changes only local details of the target object, for example local actions such as the mouth pose and the eye pose of a face, while the rest of the image and its overall visual appearance remain unchanged, which improves the rhythmic effect of the image following the audio.
When fusing the first action parameter with the detail features, in a specific implementation, feature data in a first specified channel is obtained from the image features, where the feature data in the first specified channel contains the detail features; parameter data in a second specified channel is obtained from the first action parameter; and the feature data in the first specified channel and the parameter data in the second specified channel are added point by point to obtain the fusion features.
Image features generally comprise feature data in multiple channels, and different channels contain different types of features, so the channels containing the detail features are taken as the first specified channel and their feature data is obtained. If the image features comprise 18 channels, the feature data in the leading channels contains the global (large-scale) features of the image and the feature data in the trailing channels contains the detail features; part of the trailing channels can then be selected as the first specified channel, for example the last 8 channels. Blending the first action parameter only into the channels that contain the detail features changes only the details of the image and does not affect its global appearance. For a face image, this means the fine expression of the face can be controlled while its other characteristics remain unchanged.
Likewise, the first action parameter usually contains parameter data in multiple channels, and to facilitate the fusion only the parameter data of part of its channels is blended into the image features. If the first action parameter also comprises 18 channels, the last 8 channels may be selected as the second specified channel. In a specific implementation, the feature data in the first specified channel and the parameter data in the second specified channel have the same number of channels and the same feature scale per channel. Point-by-point addition means that, for a feature value at a certain position in the first specified channel, the parameter value at the corresponding position in the second specified channel is obtained and the two values are added; that is, the two added values occupy the same or corresponding positions in their respective channels. Besides point-by-point addition, point-by-point multiplication or other feature fusion methods may be used.
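A sketch of the point-by-point fusion on the detail channels is given below; splitting the (18, 512) code at channel 10 follows the example above, and the function is assumed to receive the already scaled expression control code.

```python
import numpy as np

def fuse_detail_channels(image_code: np.ndarray, scaled_action_param: np.ndarray,
                         detail: slice = np.s_[10:18]) -> np.ndarray:
    """image_code, scaled_action_param: (18, 512) latent codes; returns the audio-controlled code."""
    fused = image_code.copy()
    fused[detail] = fused[detail] + scaled_action_param[detail]  # point-by-point addition on detail channels
    return fused                                                 # global channels (0..9) stay untouched
```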
For ease of understanding, Fig. 3 shows the flow of adjusting the image features based on the audio features, taking a face image as an example; the initial action parameter may also be called an expression control code. The audio features of the target audio, i.e. the group Wc of audio feature encoding vectors, are input; intermediate time points are interpolated between adjacent feature vectors of Wc to obtain the interpolated vector group Wc'; the feature vector of each time point in Wc' is averaged to obtain a one-dimensional weight vector in which each time point corresponds to a weight value. The expression control code We is scaled by the weight vector, and the scaled expression control code is superposed on the face image encoding Wf (a specific instance of the image features described above) to obtain Wf'; finally, the time-sequenced audio-controlled face encoding Wf' is output. The audio-controlled face encoding contains the fusion features and the global features of the preceding embodiments.
After the fusion features are obtained, the target image is generated based on the fusion features and the global features. Specifically, the fusion features and the global features are input, as the hidden vector, into a pre-trained image generation network, which outputs the target image; the hidden vector is used to control the image generation network to output an image matching the features it indicates. The image generation network may be implemented with a StyleGAN network, of which Fig. 4 is a schematic diagram. A StyleGAN network can generate images of relatively high resolution: random noise z is decoupled by an eight-layer fully connected mapping network and converted into a hidden vector w; w is fed in as input part A to control the overall style of the generated picture, while noise is fed in as input part B to control the generated details (such as the hair of a face), and both are fed into the generator's synthesis network, which completes the image generation in the adversarial game with the discriminator. Specifically, in this embodiment, the above fusion features and global features are input as the hidden vector w at part A in Fig. 4, thereby controlling the generator to generate the target image.
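The generation step can be sketched as follows, again assuming a generator object whose `synthesis` network accepts the (18, 512) hidden vector directly; the interface name and output layout are assumptions about the concrete StyleGAN implementation.

```python
import torch

def generate_target_image(generator, fused_latent) -> torch.Tensor:
    w = torch.as_tensor(fused_latent, dtype=torch.float32).unsqueeze(0)  # (1, 18, 512) hidden vector
    with torch.no_grad():
        img = generator.synthesis(w)          # the hidden vector controls the generated image
    return img                                # (1, 3, H, W) target image
```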
Fig. 5 shows the process of generating a sequence of target images while the target audio is playing, thereby producing a video. First the StyleGAN network model is loaded; then an audio-controlled face encoding is input, a target image is generated by the StyleGAN network, and the images are arranged in time order to form the video; it is then checked whether the target audio has finished, and if not, the step of inputting an audio-controlled face encoding continues, otherwise the video generation ends. Fig. 6 shows a face image changing with the audio rhythm: taking the mouth motion as an example, as the audio plays, the face goes from not smiling to smiling to laughing, realizing the audio-driven change of the object's fine motion in the image.
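A sketch of the playback loop of Fig. 5 follows, reusing `generate_target_image` above; writing the frames with imageio, the frame rate and the assumption that generator outputs lie in [-1, 1] are illustrative choices.

```python
import imageio
import torch

def render_video(generator, fused_latents, out_path: str = "out.mp4", fps: int = 30) -> None:
    frames = []
    for w in fused_latents:                           # time-ordered audio-controlled face codes
        img = generate_target_image(generator, w)     # (1, 3, H, W), values assumed in [-1, 1]
        frame = ((img[0].permute(1, 2, 0).clamp(-1, 1) + 1) * 127.5).to(torch.uint8).cpu().numpy()
        frames.append(frame)
    imageio.mimsave(out_path, frames, fps=fps)        # arrange the images in time order into a video
```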
In this way, the rhythm of the audio is visualized: the user can supply an arbitrary image, such as a personal photo, and the music finely controls the motion of the object in that image, which gives a richer experience and improves the visual presentation of music playback products.
Corresponding to the above method embodiment, referring to fig. 7, a schematic structural diagram of an image generating apparatus is shown, the apparatus comprising:
a parameter adjusting module 70, configured to adjust the initial motion parameter based on an audio feature of the target audio to obtain a first motion parameter; wherein the action indicated by the first action parameter matches the audio feature;
an image generation module 72 for generating a target image based on the first motion parameter and an initial image containing the target object; in the target image, the target object has a motion indicated by the first motion parameter.
The image generation device firstly adjusts the initial action parameter based on the audio characteristic of the target audio to obtain a first action parameter; the motion indicated by the first motion parameter matches the audio feature; then generating a target image based on the first action parameter and the initial image containing the target object; the target object in the target image has a motion indicated by the first motion parameter. In the method, the action parameter is adjusted through the audio characteristic of the audio, so that the action indicated by the obtained first action parameter can be matched with the audio characteristic, and the target object in the generated image has the action indicated by the first action parameter.
The target object comprises a human face; the motion indicated by the first motion parameter comprises an expressive motion of a human face.
The audio frequency characteristics of the target audio frequency are used for adjusting the action amplitude of the action indicated by the initial action parameters; the motion amplitude of the motion indicated by the first motion parameter matches the audio feature.
The parameter adjusting module is further configured to: determining a parameter adjustment weight according to the audio characteristics of the target audio; and scaling the initial action parameter based on the parameter adjustment weight to obtain a first action parameter.
The parameter adjusting module is further configured to: on the time dimension of the audio features, calculating an average value of feature vectors corresponding to all time points on the time dimension to obtain initial parameters; and mapping the initial parameters to a preset numerical range to obtain parameter adjustment weights.
The above-mentioned device still includes: an interpolation module to: in the audio features, a specified number of intermediate time points and feature vectors corresponding to each intermediate time point are inserted between any two adjacent initial time points to obtain final audio features; and determining the characteristic vectors corresponding to the intermediate time points based on the characteristic vectors corresponding to the two initial time points adjacent to the intermediate time points.
The device also comprises a feature extraction module, which is used for obtaining the audio features of the target audio by the following modes: extracting a Mel Frequency Cepstrum Coefficient (MFCC) parameter of the target audio; the MFCC parameters comprise a plurality of time points of a preset time interval, and each time point corresponds to an MFCC value; and inputting the MFCC parameters into a pre-trained feature extraction network, and outputting the audio features of the target audio.
The feature extraction network comprises a plurality of feature extraction modules which are sequentially connected in series; the feature extraction module comprises a convolution layer, a batch normalization layer and an activation function layer.
The above-mentioned device still includes: a pre-processing module to: based on a preset filling value, performing numerical filling on numerical values on a frequency dimension of the MFCC parameter to enable the numerical value quantity on the frequency dimension to be matched with the numerical value quantity on a time dimension of the MFCC parameter; and copying the MFCC parameters after numerical filling to obtain the MFCC parameters with the specified channel number.
The device also comprises a network training module which is used for training in the following way to obtain the feature extraction network: inputting MFCC parameters of the sample audio into a coding network, and outputting a feature vector of the sample audio; inputting the characteristic vector of the sample audio into a decoding network to obtain an output parameter of the sample audio; calculating a loss value between the output parameter and the MFCC parameter of the sample audio based on a preset loss function, training a coding network and a decoding network based on the loss value, and determining the coding network after training as a feature extraction network.
The decoding network comprises a plurality of decoding modules which are sequentially connected in series; the decoding module comprises a transposed convolution layer, a batch normalization layer and an activation function layer.
The image generation module is further configured to: extract image features of the initial image containing the target object, where the image features comprise global features and detail features; fuse the first action parameter with the detail features to obtain fusion features; and generate the target image based on the fusion features and the global features.
The image generation module is further configured to: obtain feature data in a first specified channel from the image features, where the feature data in the first specified channel contains the detail features; obtain parameter data in a second specified channel from the first action parameter; and add the feature data in the first specified channel and the parameter data in the second specified channel point by point to obtain the fusion features.
The image generation module is further configured to: input the fusion features and the global features, as a hidden vector, into a pre-trained image generation network and output the target image, where the hidden vector is used to control the image generation network to output an image matching the features indicated by the hidden vector.
The embodiment also provides an electronic device, which comprises a processor and a memory, wherein the memory stores machine executable instructions capable of being executed by the processor, and the processor executes the machine executable instructions to realize the image generation method. The electronic device may be a server or a terminal device.
Referring to fig. 8, the electronic device includes a processor 100 and a memory 101, the memory 101 stores machine executable instructions capable of being executed by the processor 100, and the processor 100 executes the machine executable instructions to implement the image generating method.
Further, the electronic device shown in fig. 8 further includes a bus 102 and a communication interface 103, and the processor 100, the communication interface 103, and the memory 101 are connected through the bus 102.
The Memory 101 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used. The bus 102 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 8, but that does not indicate only one bus or one type of bus.
Processor 100 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 100. The Processor 100 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 101, and the processor 100 reads the information in the memory 101 and completes the steps of the method of the foregoing embodiment in combination with the hardware thereof.
The present embodiments also provide a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the image generation method described above.
The computer program product of the image generation method and apparatus, the electronic device and the storage medium provided by the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to perform the method described in the foregoing method embodiments. For specific implementation, reference may be made to the method embodiments, and details are not repeated here.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be construed broadly; for example, a connection may be a fixed connection, a removable connection, or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediate medium, or an internal communication between two elements. Those skilled in the art can understand the specific meanings of the above terms in the present invention according to specific circumstances.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the foregoing embodiments are merely specific implementations used to illustrate the technical solutions of the present invention, not to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features within the technical scope disclosed by the present invention; such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (17)

1. An image generation method, characterized in that the method comprises:
adjusting the initial action parameter based on the audio feature of the target audio to obtain a first action parameter; wherein the action indicated by the first action parameter matches the audio feature;
generating a target image based on the first action parameter and an initial image containing a target object; wherein, in the target image, the target object has the action indicated by the first action parameter.
2. The method of claim 1, wherein the target object comprises a human face; and the action indicated by the first action parameter comprises a facial expression action.
3. The method of claim 1, wherein the audio feature of the target audio is used to adjust the amplitude of the action indicated by the initial action parameter; and the amplitude of the action indicated by the first action parameter matches the audio feature.
4. The method of claim 1, wherein the step of adjusting the initial action parameter based on the audio feature of the target audio to obtain the first action parameter comprises:
determining a parameter adjustment weight according to the audio feature of the target audio;
and scaling the initial action parameter based on the parameter adjustment weight to obtain the first action parameter.
5. The method of claim 4, wherein the step of determining the parameter adjustment weight according to the audio feature of the target audio comprises:
averaging, in the time dimension of the audio feature, the feature vectors corresponding to the respective time points to obtain initial parameters;
and mapping the initial parameters to a preset numerical range to obtain the parameter adjustment weight.
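As a purely illustrative sketch of claims 4 and 5, the weight computation and the scaling step might be implemented as follows; the sigmoid mapping and the target range are assumptions of the sketch, since the claims only require mapping the averaged result to a preset numerical range:

```python
import numpy as np

def parameter_adjust_weight(audio_feature: np.ndarray,
                            low: float = 0.0, high: float = 1.0) -> np.ndarray:
    """audio_feature: (T, D) — one feature vector per time point.
    Averages the feature vectors over the time dimension, then maps the result
    into the preset range [low, high] (sigmoid mapping is an illustrative choice)."""
    initial = audio_feature.mean(axis=0)        # average over the time dimension -> initial parameters
    squashed = 1.0 / (1.0 + np.exp(-initial))   # squash to (0, 1)
    return low + (high - low) * squashed        # map into the preset numerical range

def scale_action_parameter(initial_action: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Claim 4: scale the initial action parameter by the adjustment weight
    (element-wise scaling, assuming compatible shapes)."""
    return weight * initial_action
```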
6. The method of claim 4, wherein before the step of determining the parameter adjustment weight according to the audio feature of the target audio, the method further comprises:
inserting, between any two adjacent initial time points in the audio feature, a specified number of intermediate time points and a feature vector corresponding to each intermediate time point to obtain a final audio feature; wherein the feature vector corresponding to each intermediate time point is determined based on the feature vectors corresponding to the two initial time points adjacent to that intermediate time point.
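An illustrative sketch of this insertion step follows; linear interpolation is an assumption, as the claim only requires that each intermediate feature vector be determined from the two adjacent initial feature vectors:

```python
import numpy as np

def insert_intermediate_points(audio_feature: np.ndarray, num_insert: int) -> np.ndarray:
    """audio_feature: (T, D). Inserts `num_insert` intermediate time points between every
    pair of adjacent initial time points; each inserted vector is linearly interpolated
    from its two neighbouring initial feature vectors (interpolation scheme assumed)."""
    frames = [audio_feature[0]]
    for prev, nxt in zip(audio_feature[:-1], audio_feature[1:]):
        for k in range(1, num_insert + 1):
            alpha = k / (num_insert + 1)
            frames.append((1 - alpha) * prev + alpha * nxt)   # intermediate feature vector
        frames.append(nxt)
    return np.stack(frames)

# Example: (10, 128) audio feature with one inserted point per gap -> (19, 128) final feature.
```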
7. The method of claim 1, wherein the audio feature of the target audio is obtained by:
extracting Mel Frequency Cepstrum Coefficient (MFCC) parameters of the target audio; wherein the MFCC parameters comprise a plurality of time points spaced at a preset time interval, and each time point corresponds to an MFCC value;
and inputting the MFCC parameters into a pre-trained feature extraction network, and outputting the audio feature of the target audio.
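For illustration, MFCC parameters of this form could be extracted with librosa as sketched below; the sample rate, number of coefficients and hop interval are assumed values, not values taken from this disclosure:

```python
import librosa
import numpy as np

def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 13,
                 hop_seconds: float = 0.01) -> np.ndarray:
    """Returns an (n_mfcc, T) matrix: the frequency dimension first, then one MFCC
    value vector per time point at a preset time interval (hop_seconds)."""
    audio, sr = librosa.load(path, sr=sr)
    hop_length = int(sr * hop_seconds)   # preset time interval between time points
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
```

The resulting matrix would then be fed to the pre-trained feature extraction network described in the following claims.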
8. The method of claim 7, wherein the feature extraction network comprises a plurality of feature extraction modules connected in series in sequence; the feature extraction module comprises a convolution layer, a batch normalization layer and an activation function layer.
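A minimal sketch of one feature extraction module, and of several such modules connected in series, is shown below; kernel sizes, strides, channel counts and the choice of ReLU are illustrative assumptions:

```python
import torch.nn as nn

def feature_extraction_module(in_ch: int, out_ch: int) -> nn.Sequential:
    """One module: convolution layer + batch normalization layer + activation function layer."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Several modules connected in series form the feature extraction network.
feature_extraction_network = nn.Sequential(
    feature_extraction_module(3, 32),
    feature_extraction_module(32, 64),
    feature_extraction_module(64, 128),
)
```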
9. The method of claim 7, wherein before the step of inputting the MFCC parameters into the pre-trained feature extraction network and outputting the audio feature of the target audio, the method further comprises:
performing, based on a preset padding value, numerical padding on the values in the frequency dimension of the MFCC parameters, so that the number of values in the frequency dimension matches the number of values in the time dimension of the MFCC parameters;
and copying the numerically padded MFCC parameters to obtain MFCC parameters with a specified number of channels.
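The padding and channel-copying step can be sketched as follows; the padding value of 0.0 and the channel count of 3 are illustrative assumptions, and the sketch presumes fewer frequency bins than time frames:

```python
import numpy as np

def prepare_mfcc_input(mfcc: np.ndarray, num_channels: int = 3,
                       pad_value: float = 0.0) -> np.ndarray:
    """mfcc: (n_freq, n_time). Pads the frequency dimension with a preset value so that it
    matches the time dimension, then replicates the result to the specified channel count."""
    n_freq, n_time = mfcc.shape
    if n_freq < n_time:
        pad = np.full((n_time - n_freq, n_time), pad_value, dtype=mfcc.dtype)
        mfcc = np.concatenate([mfcc, pad], axis=0)    # numerical padding on the frequency dimension
    # Copy to the specified number of channels: (num_channels, n_time, n_time).
    return np.repeat(mfcc[np.newaxis, :, :], num_channels, axis=0)
```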
10. The method of claim 7, wherein the feature extraction network is trained by:
inputting MFCC parameters of sample audio into an encoding network, and outputting a feature vector of the sample audio; inputting the feature vector of the sample audio into a decoding network to obtain an output parameter of the sample audio;
and calculating a loss value between the output parameter and the MFCC parameters of the sample audio based on a preset loss function, training the encoding network and the decoding network based on the loss value, and determining the trained encoding network as the feature extraction network.
11. The method of claim 10, wherein the decoding network comprises a plurality of decoding modules connected in series in sequence; the decoding module comprises a transposed convolution layer, a batch normalization layer and an activation function layer.
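Claims 10 and 11 describe an autoencoder-style training of the feature extraction network; a purely illustrative sketch is given below, in which the MSE loss, the Adam optimizer and the layer hyper-parameters are assumptions rather than details fixed by the claims:

```python
import torch
import torch.nn as nn

def decoding_module(in_ch: int, out_ch: int) -> nn.Sequential:
    """One decoding module: transposed convolution + batch normalization + activation."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# A decoder per claim 11 could be several decoding modules connected in series, e.g.:
# decoder = nn.Sequential(decoding_module(128, 64), decoding_module(64, 32), decoding_module(32, 3))

def train_feature_extractor(encoder: nn.Module, decoder: nn.Module,
                            sample_mfcc: torch.Tensor, epochs: int = 10,
                            lr: float = 1e-3) -> nn.Module:
    """Encode the sample MFCC parameters, decode them back, minimise a reconstruction loss,
    and keep the trained encoder as the feature extraction network (shapes assumed compatible)."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        feature = encoder(sample_mfcc)          # feature vector of the sample audio
        output = decoder(feature)               # output parameter of the sample audio
        loss = loss_fn(output, sample_mfcc)     # loss between output and input MFCC parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return encoder                              # trained encoding network = feature extraction network
```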
12. The method of claim 1, wherein the step of generating a target image based on the first action parameter and an initial image containing a target object comprises:
extracting image features of the initial image containing the target object; wherein the image features comprise global features and detail features;
performing fusion processing on the first action parameter and the detail features to obtain a fusion feature;
and generating the target image based on the fusion feature and the global features.
13. The method according to claim 12, wherein the step of performing fusion processing on the first action parameter and the detail features to obtain a fusion feature comprises:
acquiring feature data in a first specified channel from the image features; wherein the feature data in the first specified channel contains the detail feature;
acquiring parameter data in a second specified channel from the first action parameter;
and adding the feature data in the first specified channel and the parameter data in the second specified channel point by point to obtain the fusion feature.
14. The method of claim 12, wherein the step of generating the target image based on the fusion feature and the global features comprises:
inputting the fusion feature and the global features, as a hidden vector, into a pre-trained image generation network, and outputting the target image; wherein the hidden vector is used to control the image generation network to output an image matching the feature indicated by the hidden vector.
15. An image generation apparatus, characterized in that the apparatus comprises:
a parameter adjusting module, configured to adjust the initial action parameter based on the audio feature of the target audio to obtain a first action parameter; wherein the action indicated by the first action parameter matches the audio feature;
an image generation module, configured to generate a target image based on the first action parameter and an initial image containing a target object; wherein, in the target image, the target object has the action indicated by the first action parameter.
16. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the image generation method of any one of claims 1 to 14.
17. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the image generation method of any of claims 1-14.
CN202110237774.6A 2021-03-03 2021-03-03 Image generation method and device and electronic equipment Active CN112990283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110237774.6A CN112990283B (en) 2021-03-03 2021-03-03 Image generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110237774.6A CN112990283B (en) 2021-03-03 2021-03-03 Image generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112990283A true CN112990283A (en) 2021-06-18
CN112990283B CN112990283B (en) 2024-07-26

Family

ID=76352585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110237774.6A Active CN112990283B (en) 2021-03-03 2021-03-03 Image generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112990283B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200380031A1 (en) * 2018-07-04 2020-12-03 Tencent Technology (Shenzhen) Company Limited Image processing method, storage medium, and computer device
CN109089156A * 2018-09-19 2018-12-25 Tencent Technology (Shenzhen) Co., Ltd. A kind of effect adjusting method, device and terminal
US20200126584A1 (en) * 2018-10-19 2020-04-23 Microsoft Technology Licensing, Llc Transforming Audio Content into Images
CN111383307A * 2018-12-29 2020-07-07 Shanghai Zhizhen Intelligent Network Technology Co., Ltd. Video generation method and device based on portrait and storage medium
US20200302667A1 (en) * 2019-03-21 2020-09-24 Electronic Arts Inc. Generating Facial Position Data based on Audio Data
CN110009716A * 2019-03-28 2019-07-12 Netease (Hangzhou) Network Co., Ltd. Generation method, device, electronic equipment and the storage medium of facial expression
WO2020237855A1 * 2019-05-30 2020-12-03 Ping An Technology (Shenzhen) Co., Ltd. Sound separation method and apparatus, and computer readable storage medium
WO2020244287A1 * 2019-06-03 2020-12-10 China University of Mining and Technology Method for generating image semantic description
US20210027511A1 (en) * 2019-07-23 2021-01-28 LoomAi, Inc. Systems and Methods for Animation Generation
WO2021034463A1 (en) * 2019-08-19 2021-02-25 Neon Evolution Inc. Methods and systems for image and voice processing
CN111489424A * 2020-04-10 2020-08-04 Netease (Hangzhou) Network Co., Ltd. Virtual character expression generation method, control method, device and terminal equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114741561A * 2022-02-28 2022-07-12 Sensetime International Pte. Ltd. Action generating method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112990283B (en) 2024-07-26

Similar Documents

Publication Publication Date Title
CN111243626B (en) Method and system for generating speaking video
CN111489424A (en) Virtual character expression generation method, control method, device and terminal equipment
CN112927712B (en) Video generation method and device and electronic equipment
KR101214402B1 (en) Method, apparatus and computer program product for providing improved speech synthesis
CN113077537B (en) Video generation method, storage medium and device
JP6428066B2 (en) Scoring device and scoring method
CN112420014A (en) Virtual face construction method and device, computer equipment and computer readable medium
CN113299312B (en) Image generation method, device, equipment and storage medium
CN113449590B (en) Speaking video generation method and device
CN111459452A (en) Interactive object driving method, device, equipment and storage medium
CN117523051B (en) Method, device, equipment and storage medium for generating dynamic image based on audio
CN116721191B (en) Method, device and storage medium for processing mouth-shaped animation
CN112990283A (en) Image generation method and device and electronic equipment
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN116129852A (en) Training method of speech synthesis model, speech synthesis method and related equipment
CN115550744B (en) Method and device for generating video by voice
CN116664731A (en) Face animation generation method and device, computer readable storage medium and terminal
CN116310004A (en) Virtual human teaching animation generation method, device, computer equipment and storage medium
CN115937375A (en) Digital body-separating synthesis method, device, computer equipment and storage medium
CN116246328A (en) Face data generation method, device, computer equipment and storage medium
CN113963092B (en) Audio and video fitting associated computing method, device, medium and equipment
CN115083371A (en) Method and device for driving virtual digital image singing
CN116129853A (en) Training method of speech synthesis model, speech synthesis method and related equipment
CN113886640A (en) Digital human generation method, apparatus, device and medium
CN112992120A (en) Method for converting voice into virtual face image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant