CN110148424A

CN110148424A - Method of speech processing, device, electronic equipment and storage medium

Info

Publication number: CN110148424A
Application number: CN201910381777.XA
Authority: CN
Inventors: 方轲; 郑文; 宋丛礼
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2019-05-08
Filing date: 2019-05-08
Publication date: 2019-08-20
Anticipated expiration: 2039-05-08
Also published as: CN110148424B

Abstract

Present disclose provides a kind of method of speech processing, device and electronic equipment and storage mediums, which comprises obtains the corresponding first voice spectrum figure of the first voice data, the corresponding second voice spectrum figure of second speech data and voice spectrum figure to be processed；Pass through preset two-dimensional convolution neural network model, obtain the first content eigenmatrix of the first voice spectrum figure, first style and features matrix of the second voice spectrum figure and the corresponding second content characteristic matrix of the voice spectrum figure to be processed and the second style and features matrix；Reconstruct loss function is obtained according to the first content eigenmatrix and the second content characteristic matrix；Style loss function is obtained according to the first style and features matrix and the second style and features matrix；The voice spectrum figure to be processed is handled to obtain target voice spectrogram according to the reconstruct loss function and the style loss function, and the corresponding voice data of target voice spectrogram is obtained by default speech reconstructing algorithm.

Description

Voice processing method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of information processing technologies, and in particular, to a voice processing method and apparatus, an electronic device, and a storage medium.

Background

Because the sound styles (i.e. timbres, etc.) of different users are different, a certain voice content can be converted into different voice styles to be played, so that the voice effect is diversified.

In the related art, it is considered that the voice is a continuous waveform, so that a voice signal of a one-dimensional discrete sequence can be obtained by sampling from the waveform, and the voice signal is input into a one-dimensional network model to obtain a one-dimensional voice feature matrix. It can be seen that the voice features extracted in the prior art are one-dimensional features, and the result of voice style migration is poor due to the fact that the image features and the voice features are different, and the one-dimensional voice features are adopted in the image style migration method.

Disclosure of Invention

In order to overcome the problems in the related art, the present disclosure provides a voice processing method, an apparatus, an electronic device, and a storage medium, so as to solve the problem in the prior art that a voice style migration result is poor due to a voice style migration performed by using an image style migration method.

According to a first aspect of embodiments of the present disclosure, there is provided a speech processing method, the method including:

acquiring a first voice spectrogram corresponding to the first voice data, a second voice spectrogram corresponding to the second voice data and a voice spectrogram to be processed; the first voice data is used for extracting voice content; the second voice data is used for extracting voice styles;

acquiring a first content characteristic matrix of the first voice spectrogram, a first style characteristic matrix of the second voice spectrogram, and a second content characteristic matrix and a second style characteristic matrix corresponding to the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model;

acquiring a reconstruction loss function according to the first content characteristic matrix and the second content characteristic matrix;

obtaining a style loss function according to the first style characteristic matrix and the second style characteristic matrix;

and processing the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquiring voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm.

Optionally, the obtaining a first voice spectrogram corresponding to the first voice data and a second voice spectrogram corresponding to the second voice data include:

and respectively carrying out instantaneous Fourier transform on the first voice data and the second voice data to obtain the corresponding first voice spectrogram and the corresponding second voice spectrogram.

Optionally, the obtaining the first style characteristic matrix of the second speech spectrogram through a preset two-dimensional convolutional neural network model includes:

acquiring a third content characteristic matrix corresponding to the second voice spectrogram through the two-dimensional convolution neural network model;

expanding the third content characteristic matrix along a time axis to obtain a first content characteristic conversion matrix; the dimension of the third content feature matrix is represented as (n)_C1，n_H1，n_W1) The dimension of the first content feature transformation matrix is represented as (n)_C1×n_H1，n_W1)，n_C1A channel number axis, n, corresponding to the third content feature matrix_H1Representing the frequency axis, n, corresponding to the third content feature matrix_W1Representing a time axis corresponding to the third content feature matrix; n is₁×n_H1A number axis generated after a channel number axis corresponding to the third content feature matrix and a frequency number axis corresponding to the third content feature matrix are stacked and unfolded is represented;

and performing transposition processing on the first content characteristic conversion matrix to obtain a first content characteristic transposition matrix, and calculating the product between the first content characteristic conversion matrix and the first content characteristic transposition matrix to obtain the first style characteristic matrix.

Optionally, the obtaining a second style feature matrix of the speech spectrogram to be processed through a preset two-dimensional convolutional neural network model includes:

expanding the second content characteristic matrix along a time axis to obtain a second content characteristic conversion matrix; the dimension of the second content feature matrix is represented as (n)_C2,n_H2,n_W2) The dimension of the second content feature transformation matrix is represented as (n)_C2×n_H2，n_W2)，n_C2Representing the second content feature matrix pairNumber of channels, n_H2Representing the frequency axis, n, corresponding to the second content feature matrix_W2Representing the time axis, n, corresponding to the second content feature matrix_C2×n_H2Representing a channel number axis corresponding to the second content characteristic matrix and a frequency number axis corresponding to the second content characteristic matrix, which are generated after stacking and unfolding;

and performing transposition processing on the second content characteristic conversion matrix to obtain a second content characteristic transposition matrix, and calculating the product between the second content characteristic conversion matrix and the second content characteristic transposition matrix to obtain the second style characteristic matrix.

Optionally, the obtaining a style loss function according to the first style feature matrix and the second style feature matrix includes:

calculating the style loss function by the following formula:

or,

where loss represents the style loss function, s_ijRepresenting the style value, q, corresponding to the (i, j) position in the first style feature matrix_ijRepresenting the style value corresponding to the (i, j) position in the second style feature matrix.

According to a second aspect of the embodiments of the present disclosure, there is provided a speech processing apparatus, the apparatus comprising:

the spectrogram acquiring module is configured to acquire a first voice spectrogram corresponding to the first voice data, a second voice spectrogram corresponding to the second voice data and a voice spectrogram to be processed; the first voice data is used for extracting voice content; the second voice data is used for extracting voice styles;

the feature matrix acquisition module is configured to acquire a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model;

a reconstruction function obtaining module configured to obtain a reconstruction loss function according to the first content feature matrix and the second content feature matrix;

a style function obtaining module configured to obtain a style loss function according to the first style feature matrix and the second style feature matrix;

and the voice processing module is configured to process the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquire voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm.

Optionally, the spectrogram acquiring module is configured to perform instantaneous fourier transform on the first voice data and the second voice data respectively to obtain the corresponding first voice spectrogram and the corresponding second voice spectrogram.

Optionally, the feature matrix obtaining module is configured to obtain, through the two-dimensional convolutional neural network model, a third content feature matrix corresponding to the second speech spectrogram;

expanding the third content characteristic matrix along a time axis to obtain a first content characteristic conversion matrix; the dimension of the third content feature matrix is represented as (n)_C1，n_H1，n_W1) The dimension of the first content feature transformation matrix is represented as (n)_C1×n_H1，n_W1)，n_C1A channel number axis, n, corresponding to the third content feature matrix_H1Representing the third content feature matrix pairCorresponding frequency axis, n_W1Representing a time axis corresponding to the third content feature matrix; n is₁×n_H1A number axis generated after a channel number axis corresponding to the third content feature matrix and a frequency number axis corresponding to the third content feature matrix are stacked and unfolded is represented;

Optionally, the feature matrix obtaining module is configured to expand the second content feature matrix along a time axis to obtain a second content feature transformation matrix; the dimension of the second content feature matrix is represented as (n)_C2,n_H2,n_W2) The dimension of the second content feature transformation matrix is represented as (n)_C2×n_H2，n_W2)，n_C2Representing the number axis of the channels corresponding to the second content feature matrix, n_H2Representing the frequency axis, n, corresponding to the second content feature matrix_W2Representing the time axis, n, corresponding to the second content feature matrix_C2×n_H2Representing a channel number axis corresponding to the second content characteristic matrix and a frequency number axis corresponding to the second content characteristic matrix, which are generated after stacking and unfolding;

Optionally, the style function obtaining module is configured to calculate the style loss function according to the following formula:

or，

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to perform the speech processing method described above.

According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, wherein instructions of the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the above-mentioned speech processing method.

According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising one or more instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the above-described speech processing method.

The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:

the voice processing method shown in the present exemplary embodiment obtains a first voice spectrogram corresponding to first voice data, a second voice spectrogram corresponding to second voice data, and a voice spectrogram to be processed; the first voice data is used for extracting voice content; the second voice data is used for extracting voice styles; acquiring a first content characteristic matrix of the first voice spectrogram, a first style characteristic matrix of the second voice spectrogram, and a second content characteristic matrix and a second style characteristic matrix corresponding to the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model; acquiring a reconstruction loss function according to the first content characteristic matrix and the second content characteristic matrix; obtaining a style loss function according to the first style characteristic matrix and the second style characteristic matrix; and processing the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquiring voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm. It can be seen that, in the voice processing method provided in the embodiment of the present disclosure, since the image style migration method is applied to the image and the voice data is a voice signal of a one-dimensional discrete sequence, in order to apply the image style migration method to the voice data, before the first voice data and the second voice data are input to the preset two-dimensional convolutional neural network model, the first voice data and the second voice data need to be respectively converted into the corresponding first voice spectrogram and second voice spectrogram, so that the first voice spectrogram and the second voice spectrogram conform to the image characteristics in the image style migration method, and thus the accuracy of the voice style migration result is improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a flow diagram illustrating a method of speech processing according to an exemplary embodiment;

FIG. 2 is a block diagram illustrating a speech processing apparatus according to an example embodiment;

fig. 3 is a block diagram illustrating a structure of an electronic device according to an example embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

FIG. 1 is a flow diagram illustrating a method of speech processing according to an exemplary embodiment, which may include the steps of, as shown in FIG. 1:

in step 101, a first voice spectrogram corresponding to first voice data, a second voice spectrogram corresponding to second voice data, and a voice spectrogram to be processed are obtained; the first voice data is used for extracting voice content; the second voice data is data for extracting a voice style.

In this disclosure, the to-be-processed speech spectrogram may be a spectrogram obtained after random initialization, and in the subsequent steps of this disclosure, the to-be-processed speech spectrogram needs to be processed to obtain a target speech spectrogram, so that the target speech spectrogram includes speech content in the first speech spectrogram and speech style in the second speech spectrogram.

In addition, in the prior art, the voice data of the voice content is extracted, the voice data of the voice style are extracted as voice signals of a one-dimensional discrete sequence, and the voice signals to be processed are subjected to image style migration through the voice signals, but the image style migration method is generally specific to images, and the voice features and the image features are different from each other, so that the voice style migration result is poor when the voice signals are subjected to the voice style migration through the image style migration method. In order to solve the problem, the present disclosure may perform an instantaneous fourier transform on the first voice data and the second voice data, respectively, to obtain the corresponding first voice spectrogram and the corresponding second voice spectrogram. Performing instantaneous Fourier transform on the first voice data to obtain a first voice spectrogram; and carrying out instantaneous Fourier transform on the second voice data to obtain a second voice spectrogram. In this way, the first voice data and the second voice data are respectively converted into voice spectrograms, so that the first voice spectrograms, the second voice spectrograms and the voice spectrograms to be processed conform to the image style migration method.

In step 102, a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed are obtained through a preset two-dimensional convolutional neural network model.

In this disclosure, in a case where the first voice data is data for extracting voice content, the first content feature matrix is a matrix corresponding to a spectrogram content (i.e., voice content) in the first voice spectrogram, and in a case where the second voice data is data for extracting a voice style, the first style feature matrix is a matrix corresponding to a spectrogram style (i.e., voice style) in the second voice spectrogram. In addition, the second content feature matrix may be a matrix corresponding to spectrogram content (i.e., voice content) of the voice spectrogram to be processed, and the second style feature matrix may be a matrix corresponding to spectrogram style (i.e., voice style) of the voice spectrogram to be processed.

The single-layer two-dimensional convolutional neural network model can achieve a good feature extraction effect, so that the feature extraction can be performed through the single-layer two-dimensional convolutional neural network model. Further, a plurality of channels (for example, 32 channels or 64 channels, etc.) may be set for the single-layer two-dimensional convolutional neural network model, so that features extracted by the channels are complemented, and the obtained feature matrix is more accurate. The present disclosure is mainly described for a single-layer two-dimensional convolutional neural network model, and specifically, each feature matrix in the present disclosure may be obtained by:

because the first content feature matrix is a matrix corresponding to the spectrogram content in the first speech spectrogram, the first speech spectrogram can be input into the single-layer two-dimensional convolutional neural network model to obtain the first content feature matrix. In the embodiment of the present disclosure, the multiple channels may be the same channels (that is, filters used by the two-dimensional convolutional neural network model are the same), and of course, the multiple channels may also be different channels (that is, filters used by the two-dimensional convolutional neural network model are different), that is, the channels are different, or at least two of the multiple channels are different, which is not limited by the present disclosure.

Because the first style characteristic matrix is a matrix corresponding to the style of the spectrogram in the second voice spectrogram, a third content characteristic matrix corresponding to the second voice spectrogram can be obtained through a single-layer two-dimensional convolutional neural network model. Therefore, when the style characteristic matrix of the image style is obtained in the prior art, the meanings of the coordinates of all axes in the image are the same (namely, the coordinates of the pixels are all), at the moment, the different coordinates of any pixel in the image are exchanged, namely, the position of the pixel in the image is changed, but for the voice spectrogram, the meanings of the different coordinates of each pixel are different, for example, one coordinate of a certain pixel is a frequency axis, the other coordinate of the certain pixel is a time axis, at the moment, the different coordinates are exchanged, namely, the frequency and the time are changed, and in order to solve the problem, the voice style cannot be changed along with the change of the time. Therefore, can first beThe third content feature matrix is expanded along a time axis to obtain a first content feature transformation matrix; the dimension of the third content feature matrix is represented as (n)_C1，n_H1，n_W1) The dimension of the first content feature transformation matrix is represented as (n)_C1×n_H1，n_W1)，n_C1A channel number axis, n, corresponding to the third content feature matrix_H1Representing the frequency axis, n, corresponding to the third content feature matrix_W1Representing a time axis corresponding to the third content feature matrix; n is₁×n_H1A number axis generated after a channel number axis corresponding to the third content feature matrix and a frequency number axis corresponding to the third content feature matrix are stacked and unfolded is represented; and then transposing the first content feature conversion matrix to obtain a first content feature transpose matrix, and calculating a product between the first content feature conversion matrix and the first content feature transpose matrix to obtain the first style feature matrix.

Because the second content feature matrix is a matrix corresponding to the spectrogram content of the speech spectrogram to be processed, and the second style feature matrix is a matrix corresponding to the spectrogram style of the speech spectrogram to be processed, the speech spectrogram to be processed can be input into the single-layer two-dimensional convolutional neural network model to obtain the second content feature matrix, and the speech style does not change along with the change of time, so that the second content feature matrix is expanded along a time axis to obtain a second content feature transformation matrix; the dimension of the second content feature matrix is represented as (n)_C2,n_H2,n_W2) The dimension of the second content feature transformation matrix is represented as (n)_C2×n_H2，n_W2)，n_C2Representing the number axis of the channels corresponding to the second content feature matrix, n_H2Representing the frequency axis, n, corresponding to the second content feature matrix_W2Representing the time axis, n, corresponding to the second content feature matrix_C2×n_H2Representing the number axis of the channels corresponding to the second content feature matrix and corresponding to the second content feature matrixThe frequency number axis is generated after stacking and unfolding; and performing transposition processing on the second content characteristic conversion matrix to obtain a second content characteristic transposition matrix, and calculating the product between the second content characteristic conversion matrix and the second content characteristic transposition matrix to obtain the second style characteristic matrix. If a plurality of channels are set in the single-layer two-dimensional convolutional neural network, a second content feature matrix corresponding to each channel can be obtained.

It should be noted that, in the present disclosure, feature extraction may also be performed through a multi-layer two-dimensional convolutional neural network model to obtain each feature matrix in the present disclosure. In particular, in the field of image processing, for a multilayer two-dimensional convolutional neural network model, the convolution result of a lower layer network can better reflect the image content, and the image style can be obtained through the convolution result of each layer network. In this way, since the first content feature matrix is a matrix corresponding to the spectrogram content in the first voice spectrogram, after the first voice spectrogram is input into the multi-layer two-dimensional convolutional neural network model, the output result of the designated lower-layer network layer in the multi-layer two-dimensional convolutional neural network model is used as the first content feature matrix; because the first style characteristic matrix is a matrix corresponding to the style of the spectrogram in the second voice spectrogram, the second voice spectrogram is input into a multi-layer two-dimensional convolutional neural network model so as to obtain a first initial content characteristic matrix corresponding to each layer of convolutional network, and the first style characteristic matrix corresponding to each layer of convolutional network is obtained according to the first initial content characteristic matrix corresponding to each layer of convolutional network, and the specific process can refer to obtaining the first style characteristic matrix through a single-layer convolutional neural network model and is not repeated; because the second content feature matrix is a matrix corresponding to the spectrogram content of the speech spectrogram to be processed, and the second style feature matrix is a matrix corresponding to the spectrogram style of the speech spectrogram to be processed, after the speech spectrogram to be processed is input into the multi-layer two-dimensional convolutional neural network model, the output result of a designated lower-layer network layer in the multi-layer two-dimensional convolutional neural network model is used as the second content feature matrix, the output result of each layer of convolutional network in the multi-layer two-dimensional convolutional neural network model is used as the second initial content feature matrix of each layer of convolutional network, and the second style feature matrix of each layer of convolutional network is obtained according to the second initial content feature matrix corresponding to each layer of convolutional network, and in a specific process, the first style feature matrix can be obtained by referring to a single-layer convolutional neural network model, and will not be described in detail. Illustratively, taking the example that the multi-layer two-dimensional convolutional neural network model includes a VGG (visual Geometry group) network model as an example, a layer 4.1 network in the VGG network model may be used as the specified lower layer network layer. The foregoing examples are illustrative only, and the disclosure is not limited thereto.

In step 103, a reconstruction loss function is obtained according to the first content feature matrix and the second content feature matrix.

In this step, since the first content feature matrix is a matrix from which the speech content of the first speech spectrogram is extracted, and the second content feature matrix is a matrix from which the speech content of the speech spectrogram to be processed is extracted, it is obvious that the present disclosure needs to make the speech content to be processed in the speech spectrogram to be processed and the speech content in the first speech spectrogram as close as possible.

In one possible implementation, in the case that the two-dimensional convolutional neural network model is a single-layer convolutional neural network model, if the single-layer convolutional neural network model includes a single channel, the reconstruction loss function may be expressed asL₁Representing the reconstruction loss function corresponding to a convolutional neural network model comprising a single layer of a single channel, F_nRepresenting the content value, P, corresponding to the n position in the first content feature matrix_nRepresenting the content value corresponding to the n position in the second content characteristic matrix; if the single-layer convolutional neural network model includes multiple channels, the reconstruction loss function can be expressed asWherein L is₂Representing the reconstruction loss function corresponding to a convolutional neural network model comprising a single layer of a plurality of channels, F_m,nRepresenting the content value corresponding to the n position in the first content characteristic matrix corresponding to the mth channel; p_m,nAnd representing the content value corresponding to the n position in the first content characteristic matrix corresponding to the mth channel. Under the condition that the two-dimensional convolutional neural network model is a multilayer convolutional neural network model, the reconstruction loss function is obtained in the following mode: the method includes obtaining a reconstruction weighted value corresponding to a reconstruction loss function of each layer of the network, and calculating a sum of the reconstruction weighted values corresponding to each layer of the network to obtain a style loss function of the multi-layer two-dimensional convolutional neural network model.

In step 104, a style loss function is obtained according to the first style feature matrix and the second style feature matrix.

In the embodiment of the present disclosure, since the first style feature matrix is a matrix from which a speech style of the second speech spectrogram is extracted, and the second style feature matrix is a matrix from which a speech style to be processed of the speech spectrogram to be processed is extracted, it is obvious that the present disclosure needs to make the speech style to be processed in the speech spectrogram to be processed and the speech style in the second speech spectrogram as close as possible.

In one possible implementation, in the case that the two-dimensional convolutional neural network model is a single-layer two-dimensional convolutional network model, the style loss function may be calculated by the following formula:or,where loss represents the style loss function, s_ijRepresenting the style value, q, corresponding to the (i, j) position in the first style feature matrix_ijRepresents the secondAnd (3) the style value corresponding to the (i, j) position in the style characteristic matrix. Since the obtained style feature matrix (i.e. the first style feature matrix and/or the second style feature matrix) may have abnormal values, the style loss function used may be determined according to the requirement, for example, if the style loss function is not sensitive to the abnormal values in the style feature matrix, the style loss function may be usedIf sensitive to outliers in the style feature matrix, this can be usedUnder the condition that the two-dimensional convolutional neural network model is a multilayer two-dimensional convolutional neural network model, the style loss function is obtained in the following mode: and obtaining the style weighted value corresponding to the style loss function of each layer of network, and calculating the sum value of the style weighted values corresponding to each layer of network to obtain the style loss function of the multi-layer two-dimensional convolutional neural network model.

In step 105, the speech spectrogram to be processed is processed according to the reconstruction loss function and the style loss function to obtain a target speech spectrogram, and speech data corresponding to the target speech spectrogram is obtained through a preset speech reconstruction algorithm.

In this step, a total loss function, Y being α loss + β L, may be obtained from the reconstruction loss function and the style loss function, α indicating the style weight, β indicating the content weight, L indicating the reconstruction loss function, and loss indicating the style loss function.

Through the total loss function, the style weight and the content weight can be set to fixed values, so that the second style characteristic matrix in the style loss function and the second content characteristic matrix in the reconstruction loss function can be continuously adjusted. Under the condition that the total loss function is the minimum value, the adjusted second style feature matrix and the adjusted second content feature matrix can be correspondingly obtained, and the target speech spectrogram can be obtained according to the adjusted second style feature matrix and the adjusted second content feature matrix, so that the speech data corresponding to the target speech spectrogram can be obtained according to a preset speech reconstruction algorithm, the preset speech reconstruction algorithm can be a Griffin-Lim algorithm, a sp/si (Switching P Picture/Switching I Picture; Switching predicted frame/Switching Start frame) frame technical algorithm or an audio decoding algorithm such as WaveNet, and the speech data is the data including the speech content in the first speech data and the speech style in the second speech data.

By adopting the method, a first voice spectrogram corresponding to the first voice data, a second voice spectrogram corresponding to the second voice data and a voice spectrogram to be processed are obtained; the first voice data is used for extracting voice content; the second voice data is used for extracting voice styles; acquiring a first content characteristic matrix of the first voice spectrogram, a first style characteristic matrix of the second voice spectrogram, and a second content characteristic matrix and a second style characteristic matrix corresponding to the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model; acquiring a reconstruction loss function according to the first content characteristic matrix and the second content characteristic matrix; obtaining a style loss function according to the first style characteristic matrix and the second style characteristic matrix; and processing the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquiring voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm. It can be seen that, in the voice processing method provided in the embodiment of the present disclosure, since the image style migration method is applied to the image and the voice data is a voice signal of a one-dimensional discrete sequence, in order to apply the image style migration method to the voice data, before the first voice data and the second voice data are input to the preset two-dimensional convolutional neural network model, the first voice data and the second voice data need to be respectively converted into the corresponding first voice spectrogram and second voice spectrogram, so that the first voice spectrogram and the second voice spectrogram conform to the image characteristics in the image style migration method, and thus the accuracy of the voice style migration result is improved.

FIG. 2 is a block diagram illustrating a speech processing apparatus according to an example embodiment. Referring to fig. 2, the apparatus includes: a spectrogram-acquiring module 201, a feature matrix-acquiring module 202, a reconstruction function-acquiring module 203, a style function-acquiring module 204, and a speech-processing module 205, and, in particular,

a spectrogram acquiring module 201 configured to acquire a first voice spectrogram corresponding to the first voice data, a second voice spectrogram corresponding to the second voice data, and a to-be-processed voice spectrogram; the first voice data is used for extracting voice content; the second voice data is used for extracting voice styles;

a feature matrix obtaining module 202, configured to obtain, through a preset two-dimensional convolutional neural network model, a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed;

a reconstruction function obtaining module 203 configured to obtain a reconstruction loss function according to the first content feature matrix and the second content feature matrix;

a style function obtaining module 204 configured to obtain a style loss function according to the first style feature matrix and the second style feature matrix;

the voice processing module 205 is configured to process the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquire voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm.

Optionally, in another embodiment, the spectrogram acquiring module 201 is configured to perform instantaneous fourier transform on the first voice data and the second voice data respectively to obtain the corresponding first voice spectrogram and the corresponding second voice spectrogram.

Optionally, in another embodiment, the feature matrix obtaining module 202 is configured to obtain, through the two-dimensional convolutional neural network model, a third content feature matrix corresponding to the second speech spectrogram;

Optionally, in another embodiment, the feature matrix obtaining module 202 is configured to expand the second content feature matrix along a time axis to obtain a second content feature transformation matrix; the dimension of the second content feature matrix is represented as (n)_C2,n_H2,n_W2) The dimension of the second content feature transformation matrix is represented as (n)_C2×n_H2，n_W2)，n_C2Representing the number axis of the channels corresponding to the second content feature matrix, n_H2Representing the frequency axis, n, corresponding to the second content feature matrix_W2Representing the time axis, n, corresponding to the second content feature matrix_C2×n_H2Representing a channel number axis corresponding to the second content characteristic matrix and a frequency number axis corresponding to the second content characteristic matrix, which are generated after stacking and unfolding;

The second content feature transformation matrix is obtained by expanding the second content feature matrix along a time axis; the dimension of the second content feature matrix is represented as (n)_C2,n_H2,n_W2) The dimension of the second content feature transformation matrix is represented as (n)_C2×n_H2，n_W2)，n_C2Representing the number axis of the channels corresponding to the second content feature matrix, n_H2Representing the frequency axis, n, corresponding to the second content feature matrix_W2Representing the time axis, n, corresponding to the second content feature matrix_C2×n_H2Representing a channel number axis corresponding to the second content characteristic matrix and a frequency number axis corresponding to the second content characteristic matrix, which are generated after stacking and unfolding;

Optionally, in another embodiment, the style function obtaining module 204 is configured to calculate the style loss function by the following formula:

or,

By adopting the device, a first voice spectrogram corresponding to the first voice data, a second voice spectrogram corresponding to the second voice data and a voice spectrogram to be processed are obtained; the first voice data is used for extracting voice content; the second voice data is used for extracting voice styles; acquiring a first content characteristic matrix of the first voice spectrogram, a first style characteristic matrix of the second voice spectrogram, and a second content characteristic matrix and a second style characteristic matrix corresponding to the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model; acquiring a reconstruction loss function according to the first content characteristic matrix and the second content characteristic matrix; obtaining a style loss function according to the first style characteristic matrix and the second style characteristic matrix; and processing the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquiring voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm. It can be seen that, in the voice processing method provided in the embodiment of the present disclosure, since the image style migration method is applied to the image and the voice data is a voice signal of a one-dimensional discrete sequence, in order to apply the image style migration method to the voice data, before the first voice data and the second voice data are input to the preset two-dimensional convolutional neural network model, the first voice data and the second voice data need to be respectively converted into the corresponding first voice spectrogram and second voice spectrogram, so that the first voice spectrogram and the second voice spectrogram conform to the image characteristics in the image style migration method, and thus the accuracy of the voice style migration result is improved.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 3 is a block diagram illustrating an electronic device 300 in accordance with an example embodiment. The electronic device may be a mobile terminal or a server, and in the embodiment of the present disclosure, the electronic device is taken as an example for description. For example, the electronic device 300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 3, electronic device 300 may include one or more of the following components: a processing component 302, a memory 304, a power component 306, a multimedia component 308, an audio component 310, an input/output (I/O) interface 312, a sensor component 314, and a communication component 316.

The processing component 302 generally controls overall operation of the electronic device 300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 302 may include one or more processors 320 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 302 can include one or more modules that facilitate interaction between the processing component 302 and other components. For example, the processing component 302 may include a multimedia module to facilitate interaction between the multimedia component 308 and the processing component 302.

The memory 304 is configured to store various types of data to support operations at the electronic device 300. Examples of such data include instructions for any application or method operating on the electronic device 300, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 304 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 306 provides power to the various components of the electronic device 300. The power components 306 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 300.

The multimedia component 308 comprises a screen providing an output interface between the electronic device 300 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 308 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 300 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 310 is configured to output and/or input audio signals. For example, the audio component 310 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 300 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 304 or transmitted via the communication component 316. In some embodiments, audio component 310 also includes a speaker for outputting audio signals.

The I/O interface 312 provides an interface between the processing component 302 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

Sensor assembly 314 includes one or more sensors for providing various aspects of status assessment for electronic device 300. For example, sensor assembly 314 may detect an open/closed state of electronic device 300, the relative positioning of components, such as a display and keypad of electronic device 300, sensor assembly 314 may also detect a change in the position of electronic device 300 or a component of electronic device 300, the presence or absence of user contact with electronic device 300, the orientation or acceleration/deceleration of electronic device 300, and a change in the temperature of electronic device 300. Sensor assembly 314 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 314 may also include a light sensor, such as a CMOS or CCD spectrum sensor, for use in imaging applications. In some embodiments, the sensor assembly 314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 316 is configured to facilitate wired or wireless communication between the electronic device 300 and other devices. The electronic device 300 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 316 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 316 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the voice processing method illustrated in fig. 1 described above.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 304 comprising instructions, executable by the processor 320 of the electronic device 300 to perform the speech processing method illustrated in fig. 1 described above, is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, there is also provided a computer program product, the instructions of which, when executed by the processor 320 of the electronic device 300, cause the electronic device 300 to perform the above-described speech processing method illustrated in fig. 1.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A method of speech processing, the method comprising:

2. The method of claim 1, wherein the obtaining a first voice spectrogram corresponding to first voice data and a second voice spectrogram corresponding to second voice data comprises:

3. The method of claim 1, wherein the obtaining the first style feature matrix of the second speech spectrogram through a preset two-dimensional convolutional neural network model comprises:

4. The method according to claim 1, wherein the obtaining the second style feature matrix of the speech spectrogram to be processed through a preset two-dimensional convolutional neural network model comprises:

expanding the second content characteristic matrix along a time axis to obtain a second content characteristic conversion matrix; the dimension of the second content feature matrix is represented as (n)_C2,n_H2,n_W2) The dimension of the second content feature transformation matrix is represented as (n)_C2×n_H2，n_W2)，n_C2Representing the number axis of the channels corresponding to the second content feature matrix, n_H2Representing the frequency axis, n, corresponding to the second content feature matrix_W2Representing the time axis, n, corresponding to the second content feature matrix_C2×n_H2Representing a channel number axis corresponding to the second content characteristic matrix and a frequency number axis corresponding to the second content characteristic matrix, which are generated after stacking and unfolding;

5. The method of claim 1, wherein obtaining a style loss function from the first style feature matrix and the second style feature matrix comprises:

calculating the style loss function by the following formula:

or,

6. A speech processing apparatus, characterized in that the apparatus comprises:

7. The apparatus of claim 6, wherein the spectrogram acquisition module is configured to perform an instantaneous Fourier transform on the first voice data and the second voice data, respectively, to obtain the corresponding first voice spectrogram and the corresponding second voice spectrogram.

8. The apparatus of claim 6, wherein the feature matrix obtaining module is configured to obtain a third content feature matrix corresponding to the second speech spectrogram through the two-dimensional convolutional neural network model;

9. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to perform the steps of the speech processing method of any of claims 1 to 5.

10. A non-transitory computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the steps of the speech processing method of any of claims 1 to 5.