CN110148424A - Voice processing method and device, electronic equipment and storage medium

Info

Publication number
CN110148424A
Authority
CN
China
Prior art keywords
matrix, voice, content, style, spectrogram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910381777.XA
Other languages
Chinese (zh)
Other versions
CN110148424B (en)
Inventor
方轲
郑文
宋丛礼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910381777.XA
Publication of CN110148424A
Application granted
Publication of CN110148424B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/18 - the extracted parameters being spectral information of each sub-band
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks
    • G10L25/48 - specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present disclosure provides a voice processing method and apparatus, an electronic device, and a storage medium. The method includes: obtaining a first voice spectrogram corresponding to first voice data, a second voice spectrogram corresponding to second voice data, and a voice spectrogram to be processed; obtaining, through a preset two-dimensional convolutional neural network model, a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed; obtaining a reconstruction loss function according to the first content feature matrix and the second content feature matrix; obtaining a style loss function according to the first style feature matrix and the second style feature matrix; and processing the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and obtaining the voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm.

Description

Voice processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular, to a voice processing method and apparatus, an electronic device, and a storage medium.
Background
Because different users have different sound styles (e.g., timbre), a given piece of voice content can be converted into different voice styles for playback, diversifying the voice effect.
In the related art, speech is treated as a continuous waveform, so a one-dimensional discrete voice signal can be obtained by sampling the waveform, and this signal is fed into a one-dimensional network model to obtain a one-dimensional voice feature matrix. The voice features extracted in the prior art are therefore one-dimensional; because image features and voice features differ, applying an image style transfer method to such one-dimensional voice features yields poor voice style transfer results.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a voice processing method, an apparatus, an electronic device, and a storage medium, so as to solve the problem in the prior art that voice style transfer performed with an image style transfer method yields poor results.
According to a first aspect of embodiments of the present disclosure, there is provided a speech processing method, the method including:
acquiring a first voice spectrogram corresponding to the first voice data, a second voice spectrogram corresponding to the second voice data and a voice spectrogram to be processed; the first voice data is used for extracting voice content; the second voice data is used for extracting voice styles;
acquiring a first content characteristic matrix of the first voice spectrogram, a first style characteristic matrix of the second voice spectrogram, and a second content characteristic matrix and a second style characteristic matrix corresponding to the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model;
acquiring a reconstruction loss function according to the first content characteristic matrix and the second content characteristic matrix;
obtaining a style loss function according to the first style characteristic matrix and the second style characteristic matrix;
and processing the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquiring voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm.
Optionally, the obtaining of the first voice spectrogram corresponding to the first voice data and the second voice spectrogram corresponding to the second voice data includes:
performing a short-time Fourier transform on the first voice data and the second voice data respectively to obtain the corresponding first voice spectrogram and the corresponding second voice spectrogram.
Optionally, the obtaining the first style characteristic matrix of the second speech spectrogram through a preset two-dimensional convolutional neural network model includes:
acquiring a third content characteristic matrix corresponding to the second voice spectrogram through the two-dimensional convolution neural network model;
expanding the third content feature matrix along a time axis to obtain a first content feature transformation matrix; the dimensions of the third content feature matrix are denoted (n_C1, n_H1, n_W1), and the dimensions of the first content feature transformation matrix are denoted (n_C1 × n_H1, n_W1), where n_C1 denotes the channel axis corresponding to the third content feature matrix, n_H1 denotes its frequency axis, and n_W1 denotes its time axis; n_C1 × n_H1 denotes the axis generated after the channel axis and the frequency axis corresponding to the third content feature matrix are stacked and unfolded;
and transposing the first content feature transformation matrix to obtain a first content feature transpose matrix, and calculating the product of the first content feature transformation matrix and the first content feature transpose matrix to obtain the first style feature matrix.
Optionally, the obtaining a second style feature matrix of the speech spectrogram to be processed through a preset two-dimensional convolutional neural network model includes:
expanding the second content feature matrix along a time axis to obtain a second content feature transformation matrix; the dimensions of the second content feature matrix are denoted (n_C2, n_H2, n_W2), and the dimensions of the second content feature transformation matrix are denoted (n_C2 × n_H2, n_W2), where n_C2 denotes the channel axis corresponding to the second content feature matrix, n_H2 denotes its frequency axis, and n_W2 denotes its time axis; n_C2 × n_H2 denotes the axis generated after the channel axis and the frequency axis corresponding to the second content feature matrix are stacked and unfolded;
and transposing the second content feature transformation matrix to obtain a second content feature transpose matrix, and calculating the product of the second content feature transformation matrix and the second content feature transpose matrix to obtain the second style feature matrix.
Optionally, the obtaining a style loss function according to the first style feature matrix and the second style feature matrix includes:
calculating the style loss function by one of the following formulas:
loss = Σ_{i,j} |s_ij - q_ij|
or
loss = Σ_{i,j} (s_ij - q_ij)^2
where loss denotes the style loss function, s_ij denotes the style value at position (i, j) in the first style feature matrix, and q_ij denotes the style value at position (i, j) in the second style feature matrix.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech processing apparatus, the apparatus comprising:
the spectrogram acquiring module is configured to acquire a first voice spectrogram corresponding to the first voice data, a second voice spectrogram corresponding to the second voice data and a voice spectrogram to be processed; the first voice data is used for extracting voice content; the second voice data is used for extracting voice styles;
the feature matrix acquisition module is configured to acquire a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model;
a reconstruction function obtaining module configured to obtain a reconstruction loss function according to the first content feature matrix and the second content feature matrix;
a style function obtaining module configured to obtain a style loss function according to the first style feature matrix and the second style feature matrix;
and the voice processing module is configured to process the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquire voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm.
Optionally, the spectrogram acquiring module is configured to perform a short-time Fourier transform on the first voice data and the second voice data respectively to obtain the corresponding first voice spectrogram and the corresponding second voice spectrogram.
Optionally, the feature matrix obtaining module is configured to obtain, through the two-dimensional convolutional neural network model, a third content feature matrix corresponding to the second speech spectrogram;
expanding the third content feature matrix along a time axis to obtain a first content feature transformation matrix; the dimensions of the third content feature matrix are denoted (n_C1, n_H1, n_W1), and the dimensions of the first content feature transformation matrix are denoted (n_C1 × n_H1, n_W1), where n_C1 denotes the channel axis corresponding to the third content feature matrix, n_H1 denotes its frequency axis, and n_W1 denotes its time axis; n_C1 × n_H1 denotes the axis generated after the channel axis and the frequency axis corresponding to the third content feature matrix are stacked and unfolded;
and transposing the first content feature transformation matrix to obtain a first content feature transpose matrix, and calculating the product of the first content feature transformation matrix and the first content feature transpose matrix to obtain the first style feature matrix.
Optionally, the feature matrix obtaining module is configured to expand the second content feature matrix along a time axis to obtain a second content feature transformation matrix; the dimensions of the second content feature matrix are denoted (n_C2, n_H2, n_W2), and the dimensions of the second content feature transformation matrix are denoted (n_C2 × n_H2, n_W2), where n_C2 denotes the channel axis corresponding to the second content feature matrix, n_H2 denotes its frequency axis, and n_W2 denotes its time axis; n_C2 × n_H2 denotes the axis generated after the channel axis and the frequency axis corresponding to the second content feature matrix are stacked and unfolded;
and to transpose the second content feature transformation matrix to obtain a second content feature transpose matrix, and to calculate the product of the second content feature transformation matrix and the second content feature transpose matrix to obtain the second style feature matrix.
Optionally, the style function obtaining module is configured to calculate the style loss function by one of the following formulas:
loss = Σ_{i,j} |s_ij - q_ij|
or
loss = Σ_{i,j} (s_ij - q_ij)^2
where loss denotes the style loss function, s_ij denotes the style value at position (i, j) in the first style feature matrix, and q_ij denotes the style value at position (i, j) in the second style feature matrix.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to perform the speech processing method described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, wherein instructions of the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the above-mentioned speech processing method.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising one or more instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the above-described speech processing method.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
the voice processing method shown in the present exemplary embodiment obtains a first voice spectrogram corresponding to first voice data, a second voice spectrogram corresponding to second voice data, and a voice spectrogram to be processed; the first voice data is used for extracting voice content; the second voice data is used for extracting voice styles; acquiring a first content characteristic matrix of the first voice spectrogram, a first style characteristic matrix of the second voice spectrogram, and a second content characteristic matrix and a second style characteristic matrix corresponding to the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model; acquiring a reconstruction loss function according to the first content characteristic matrix and the second content characteristic matrix; obtaining a style loss function according to the first style characteristic matrix and the second style characteristic matrix; and processing the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquiring voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm. It can be seen that, in the voice processing method provided in the embodiment of the present disclosure, since the image style migration method is applied to the image and the voice data is a voice signal of a one-dimensional discrete sequence, in order to apply the image style migration method to the voice data, before the first voice data and the second voice data are input to the preset two-dimensional convolutional neural network model, the first voice data and the second voice data need to be respectively converted into the corresponding first voice spectrogram and second voice spectrogram, so that the first voice spectrogram and the second voice spectrogram conform to the image characteristics in the image style migration method, and thus the accuracy of the voice style migration result is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of speech processing according to an exemplary embodiment;
FIG. 2 is a block diagram illustrating a speech processing apparatus according to an example embodiment;
FIG. 3 is a block diagram illustrating a structure of an electronic device according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
FIG. 1 is a flow diagram illustrating a voice processing method according to an exemplary embodiment. As shown in FIG. 1, the method may include the following steps:
in step 101, a first voice spectrogram corresponding to first voice data, a second voice spectrogram corresponding to second voice data, and a voice spectrogram to be processed are obtained; the first voice data is used for extracting voice content; the second voice data is data for extracting a voice style.
In this disclosure, the to-be-processed speech spectrogram may be a spectrogram obtained after random initialization, and in the subsequent steps of this disclosure, the to-be-processed speech spectrogram needs to be processed to obtain a target speech spectrogram, so that the target speech spectrogram includes speech content in the first speech spectrogram and speech style in the second speech spectrogram.
In addition, in the prior art, the voice data from which content is extracted and the voice data from which style is extracted are both one-dimensional discrete voice signals, and style transfer is performed directly on these signals. However, image style transfer methods are designed for images, and voice features differ from image features, so applying an image style transfer method directly to one-dimensional voice signals yields a poor style transfer result. To address this, the present disclosure performs a short-time Fourier transform on the first voice data to obtain the first voice spectrogram, and a short-time Fourier transform on the second voice data to obtain the second voice spectrogram. In this way, the first voice data and the second voice data are each converted into a voice spectrogram, so that the first voice spectrogram, the second voice spectrogram, and the voice spectrogram to be processed are suitable inputs for an image style transfer method.
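As a rough sketch of this step (not code from the patent itself), the snippet below converts two voice clips into magnitude spectrograms with a short-time Fourier transform; librosa is assumed for convenience, and the file names and STFT parameters are illustrative:

```python
import numpy as np
import librosa

def voice_to_spectrogram(path, n_fft=512, hop_length=128):
    """Load a mono voice clip and return its magnitude spectrogram
    (frequency x time), obtained via the short-time Fourier transform."""
    signal, sr = librosa.load(path, sr=None, mono=True)
    return np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)), sr

content_spec, sr = voice_to_spectrogram("first_voice.wav")   # source of voice content
style_spec, _ = voice_to_spectrogram("second_voice.wav")     # source of voice style
```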
In step 102, a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed are obtained through a preset two-dimensional convolutional neural network model.
In this disclosure, in a case where the first voice data is data for extracting voice content, the first content feature matrix is a matrix corresponding to a spectrogram content (i.e., voice content) in the first voice spectrogram, and in a case where the second voice data is data for extracting a voice style, the first style feature matrix is a matrix corresponding to a spectrogram style (i.e., voice style) in the second voice spectrogram. In addition, the second content feature matrix may be a matrix corresponding to spectrogram content (i.e., voice content) of the voice spectrogram to be processed, and the second style feature matrix may be a matrix corresponding to spectrogram style (i.e., voice style) of the voice spectrogram to be processed.
A single-layer two-dimensional convolutional neural network model can already achieve a good feature extraction effect, so feature extraction may be performed with a single-layer model. Further, multiple channels (for example, 32 or 64 channels) may be configured for the single-layer model so that the features extracted by different channels complement one another, making the resulting feature matrix more accurate. The present disclosure is described mainly for a single-layer two-dimensional convolutional neural network model; specifically, each feature matrix in the present disclosure may be obtained as follows.
because the first content feature matrix is a matrix corresponding to the spectrogram content in the first speech spectrogram, the first speech spectrogram can be input into the single-layer two-dimensional convolutional neural network model to obtain the first content feature matrix. In the embodiment of the present disclosure, the multiple channels may be the same channels (that is, filters used by the two-dimensional convolutional neural network model are the same), and of course, the multiple channels may also be different channels (that is, filters used by the two-dimensional convolutional neural network model are different), that is, the channels are different, or at least two of the multiple channels are different, which is not limited by the present disclosure.
Because the first style feature matrix is the matrix corresponding to the spectrogram style in the second voice spectrogram, a third content feature matrix corresponding to the second voice spectrogram can first be obtained through the single-layer two-dimensional convolutional neural network model. Note that when a style feature matrix is computed for an image in the prior art, the coordinates along all axes of the image have the same meaning (they are all pixel coordinates), so exchanging the coordinates of any pixel merely changes that pixel's position in the image. For a voice spectrogram, however, the coordinates of each pixel have different meanings: one coordinate lies on the frequency axis and the other on the time axis, so exchanging them would confuse frequency with time. Moreover, the voice style should not change as time changes. Therefore, the third content feature matrix can first be expanded along the time axis to obtain a first content feature transformation matrix; the dimensions of the third content feature matrix are denoted (n_C1, n_H1, n_W1), and the dimensions of the first content feature transformation matrix are denoted (n_C1 × n_H1, n_W1), where n_C1 denotes the channel axis corresponding to the third content feature matrix, n_H1 denotes its frequency axis, and n_W1 denotes its time axis; n_C1 × n_H1 denotes the axis generated after the channel axis and the frequency axis corresponding to the third content feature matrix are stacked and unfolded. The first content feature transformation matrix is then transposed to obtain a first content feature transpose matrix, and the product of the first content feature transformation matrix and the first content feature transpose matrix is calculated to obtain the first style feature matrix.
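Continuing the PyTorch assumption, a sketch of this computation: the (n_C, n_H, n_W) feature map is unfolded so that the channel and frequency axes are stacked while the time axis is preserved, and the result is multiplied by its transpose:

```python
import torch

def style_matrix(features: torch.Tensor) -> torch.Tensor:
    """Style (Gram) matrix of an (n_C, n_H, n_W) feature map. Unfolding along
    the time axis keeps time as the second dimension, so the resulting
    statistics are invariant to when sounds occur."""
    n_c, n_h, n_w = features.shape
    unfolded = features.reshape(n_c * n_h, n_w)   # (n_C x n_H, n_W)
    return unfolded @ unfolded.t()                # (n_C x n_H, n_C x n_H)
```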
Because the second content feature matrix is the matrix corresponding to the spectrogram content of the voice spectrogram to be processed, and the second style feature matrix is the matrix corresponding to its spectrogram style, the voice spectrogram to be processed can be input into the single-layer two-dimensional convolutional neural network model to obtain the second content feature matrix. Since the voice style does not change as time changes, the second content feature matrix is likewise expanded along the time axis to obtain a second content feature transformation matrix; the dimensions of the second content feature matrix are denoted (n_C2, n_H2, n_W2), and the dimensions of the second content feature transformation matrix are denoted (n_C2 × n_H2, n_W2), where n_C2 denotes the channel axis corresponding to the second content feature matrix, n_H2 denotes its frequency axis, and n_W2 denotes its time axis; n_C2 × n_H2 denotes the axis generated after the channel axis and the frequency axis corresponding to the second content feature matrix are stacked and unfolded. The second content feature transformation matrix is then transposed to obtain a second content feature transpose matrix, and the product of the second content feature transformation matrix and the second content feature transpose matrix is calculated to obtain the second style feature matrix. If multiple channels are configured in the single-layer two-dimensional convolutional neural network, a second content feature matrix can be obtained for each channel.
It should be noted that, in the present disclosure, feature extraction may also be performed through a multi-layer two-dimensional convolutional neural network model to obtain each feature matrix. In the field of image processing, for a multi-layer two-dimensional convolutional neural network model, the convolution results of lower network layers better reflect the image content, while the image style can be obtained from the convolution results of every layer. Accordingly, since the first content feature matrix corresponds to the spectrogram content of the first voice spectrogram, after the first voice spectrogram is input into the multi-layer model, the output of a designated lower network layer is taken as the first content feature matrix. Since the first style feature matrix corresponds to the spectrogram style of the second voice spectrogram, the second voice spectrogram is input into the multi-layer model to obtain a first initial content feature matrix for each convolutional layer, and a first style feature matrix for each layer is computed from that layer's initial content feature matrix in the same way as for the single-layer model described above (not repeated here). Since the second content feature matrix and the second style feature matrix correspond respectively to the spectrogram content and spectrogram style of the voice spectrogram to be processed, after the voice spectrogram to be processed is input into the multi-layer model, the output of the designated lower network layer is taken as the second content feature matrix, the output of each convolutional layer is taken as that layer's second initial content feature matrix, and each layer's second style feature matrix is computed from it, again as in the single-layer case. Illustratively, when the multi-layer two-dimensional convolutional neural network model is a VGG (Visual Geometry Group) network model, the conv4_1 layer of the VGG network model may serve as the designated lower network layer. The foregoing examples are illustrative only, and the disclosure is not limited thereto.
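For the multi-layer variant, a sketch assuming torchvision's pretrained VGG19 (the patent names VGG but no specific variant) with conv4_1 as the designated lower layer; the chosen style layers are an assumption borrowed from common image style transfer practice:

```python
import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()

# Conv-layer indices inside vgg19.features: conv1_1=0, conv2_1=5, conv3_1=10,
# conv4_1=19, conv5_1=28. conv4_1 serves as the designated lower (content) layer.
STYLE_LAYERS = (0, 5, 10, 19, 28)
CONTENT_LAYER = 19

def extract_features(x: torch.Tensor):
    """x: (1, 3, H, W); a 1-channel spectrogram would first be repeated to
    3 channels (an assumption - the patent does not specify this detail).
    Returns the content feature map and the per-layer style feature maps."""
    content, styles = None, []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in STYLE_LAYERS:
            styles.append(x)
        if i == CONTENT_LAYER:
            content = x
    return content, styles
```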
In step 103, a reconstruction loss function is obtained according to the first content feature matrix and the second content feature matrix.
In this step, since the first content feature matrix captures the voice content of the first voice spectrogram and the second content feature matrix captures the content of the voice spectrogram to be processed, the goal is to make the content of the voice spectrogram to be processed as close as possible to the content of the first voice spectrogram.
In one possible implementation, when the two-dimensional convolutional neural network model is a single-layer convolutional neural network model with a single channel, the reconstruction loss function may be expressed as L_1 = Σ_n (F_n - P_n)^2, where L_1 denotes the reconstruction loss function for a convolutional neural network model comprising a single layer with a single channel, F_n denotes the content value at position n in the first content feature matrix, and P_n denotes the content value at position n in the second content feature matrix. If the single-layer convolutional neural network model comprises multiple channels, the reconstruction loss function may be expressed as L_2 = Σ_m Σ_n (F_{m,n} - P_{m,n})^2, where L_2 denotes the reconstruction loss function for a convolutional neural network model comprising a single layer with multiple channels, F_{m,n} denotes the content value at position n in the first content feature matrix of the m-th channel, and P_{m,n} denotes the content value at position n in the second content feature matrix of the m-th channel. When the two-dimensional convolutional neural network model is a multi-layer convolutional neural network model, the reconstruction loss function is obtained as follows: the reconstruction loss of each network layer is weighted by that layer's reconstruction weight, and the weighted reconstruction losses of all layers are summed to obtain the reconstruction loss function of the multi-layer two-dimensional convolutional neural network model.
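A sketch of the reconstruction loss under the same PyTorch assumption; the squared-difference form is an assumption consistent with the variable definitions above:

```python
def reconstruction_loss(f: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Sum of squared differences between the content feature matrix F (from
    the first voice spectrogram) and P (from the spectrogram being optimized).
    Summing over all elements covers both the single-channel case (sum over
    positions n) and the multi-channel case (sum over channels m and
    positions n)."""
    return ((f - p) ** 2).sum()
```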
In step 104, a style loss function is obtained according to the first style feature matrix and the second style feature matrix.
In the embodiments of the present disclosure, since the first style feature matrix captures the voice style of the second voice spectrogram and the second style feature matrix captures the style of the voice spectrogram to be processed, the goal is to make the style of the voice spectrogram to be processed as close as possible to the style of the second voice spectrogram.
In one possible implementation, when the two-dimensional convolutional neural network model is a single-layer two-dimensional convolutional network model, the style loss function may be calculated by one of the following formulas: loss = Σ_{i,j} |s_ij - q_ij| or loss = Σ_{i,j} (s_ij - q_ij)^2, where loss denotes the style loss function, s_ij denotes the style value at position (i, j) in the first style feature matrix, and q_ij denotes the style value at position (i, j) in the second style feature matrix. Because the computed style feature matrices (the first and/or the second style feature matrix) may contain abnormal values, the style loss function can be chosen as required: if insensitivity to abnormal values is desired, the absolute-difference form loss = Σ_{i,j} |s_ij - q_ij| may be used; if sensitivity to abnormal values is acceptable, the squared-difference form loss = Σ_{i,j} (s_ij - q_ij)^2 may be used. When the two-dimensional convolutional neural network model is a multi-layer two-dimensional convolutional neural network model, the style loss function is obtained as follows: the style loss of each network layer is weighted by that layer's style weight, and the weighted style losses of all layers are summed to obtain the style loss function of the multi-layer two-dimensional convolutional neural network model.
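A sketch of the two style-loss variants under the same assumptions, switchable according to the outlier-sensitivity trade-off just described:

```python
def style_loss(s: torch.Tensor, q: torch.Tensor, robust: bool = True) -> torch.Tensor:
    """Distance between the style matrix s of the style spectrogram and the
    style matrix q of the spectrogram being optimized.
    robust=True  -> absolute differences (insensitive to abnormal values)
    robust=False -> squared differences (sensitive to abnormal values)"""
    diff = s - q
    return diff.abs().sum() if robust else (diff ** 2).sum()
```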
In step 105, the speech spectrogram to be processed is processed according to the reconstruction loss function and the style loss function to obtain a target speech spectrogram, and speech data corresponding to the target speech spectrogram is obtained through a preset speech reconstruction algorithm.
In this step, a total loss function Y = α × loss + β × L may be obtained from the reconstruction loss function and the style loss function, where α denotes the style weight, β denotes the content weight, L denotes the reconstruction loss function, and loss denotes the style loss function.
Through the total loss function, the style weight and the content weight can be set to fixed values, so that the second style feature matrix in the style loss function and the second content feature matrix in the reconstruction loss function are adjusted continuously. When the total loss function reaches its minimum, the adjusted second style feature matrix and the adjusted second content feature matrix are obtained accordingly, and the target voice spectrogram is derived from them. The voice data corresponding to the target voice spectrogram can then be obtained through a preset voice reconstruction algorithm, which may be the Griffin-Lim algorithm, an SP/SI (Switching P Picture/Switching I Picture) frame algorithm, or an audio decoding algorithm such as WaveNet; the resulting voice data contains the voice content of the first voice data and the voice style of the second voice data.
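Tying the pieces together, a sketch of the optimization and reconstruction step. It reuses conv, style_matrix, reconstruction_loss, style_loss, content_spec, and style_spec from the earlier snippets; the loss weights, learning rate, and iteration count are illustrative assumptions, and Griffin-Lim is used as the preset voice reconstruction algorithm:

```python
import librosa
import torch

content_input = torch.from_numpy(content_spec).float()[None, None]  # (1, 1, n_freq, n_time)
style_input = torch.from_numpy(style_spec).float()[None, None]

spec = torch.randn_like(content_input, requires_grad=True)  # spectrogram to be processed
optimizer = torch.optim.Adam([spec], lr=0.05)
alpha, beta = 1e3, 1.0                                      # style weight, content weight

with torch.no_grad():
    f_content = conv(content_input)                 # content features of the first spectrogram
    s_style = style_matrix(conv(style_input)[0])    # style matrix of the second spectrogram

for _ in range(500):
    optimizer.zero_grad()
    feats = conv(spec)
    total = alpha * style_loss(s_style, style_matrix(feats[0])) \
            + beta * reconstruction_loss(f_content, feats)   # Y = alpha*loss + beta*L
    total.backward()
    optimizer.step()

# Invert the optimized magnitude spectrogram back to audio via Griffin-Lim.
target_spec = spec.detach().squeeze(0).squeeze(0).clamp(min=0).numpy()
waveform = librosa.griffinlim(target_spec, hop_length=128)
```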
By adopting this method, a first voice spectrogram corresponding to the first voice data, a second voice spectrogram corresponding to the second voice data, and a voice spectrogram to be processed are obtained, where the first voice data is used for extracting voice content and the second voice data is used for extracting a voice style; a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed are obtained through a preset two-dimensional convolutional neural network model; a reconstruction loss function is obtained according to the first and second content feature matrices; a style loss function is obtained according to the first and second style feature matrices; and the voice spectrogram to be processed is processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, whose corresponding voice data is then obtained through a preset voice reconstruction algorithm. It can be seen that, because image style transfer methods are designed for images while voice data is a one-dimensional discrete voice signal, the first voice data and the second voice data are converted into the corresponding first and second voice spectrograms before being input into the preset two-dimensional convolutional neural network model. The spectrograms then conform to the image characteristics expected by the image style transfer method, which improves the accuracy of the voice style transfer result.
FIG. 2 is a block diagram illustrating a voice processing apparatus according to an example embodiment. Referring to FIG. 2, the apparatus includes a spectrogram acquiring module 201, a feature matrix acquiring module 202, a reconstruction function acquiring module 203, a style function acquiring module 204, and a voice processing module 205. Specifically:
a spectrogram acquiring module 201 configured to acquire a first voice spectrogram corresponding to the first voice data, a second voice spectrogram corresponding to the second voice data, and a to-be-processed voice spectrogram; the first voice data is used for extracting voice content; the second voice data is used for extracting voice styles;
a feature matrix obtaining module 202, configured to obtain, through a preset two-dimensional convolutional neural network model, a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed;
a reconstruction function obtaining module 203 configured to obtain a reconstruction loss function according to the first content feature matrix and the second content feature matrix;
a style function obtaining module 204 configured to obtain a style loss function according to the first style feature matrix and the second style feature matrix;
the voice processing module 205 is configured to process the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquire voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm.
Optionally, in another embodiment, the spectrogram acquiring module 201 is configured to perform a short-time Fourier transform on the first voice data and the second voice data respectively to obtain the corresponding first voice spectrogram and the corresponding second voice spectrogram.
Optionally, in another embodiment, the feature matrix obtaining module 202 is configured to obtain, through the two-dimensional convolutional neural network model, a third content feature matrix corresponding to the second speech spectrogram;
expanding the third content feature matrix along a time axis to obtain a first content feature transformation matrix; the dimensions of the third content feature matrix are denoted (n_C1, n_H1, n_W1), and the dimensions of the first content feature transformation matrix are denoted (n_C1 × n_H1, n_W1), where n_C1 denotes the channel axis corresponding to the third content feature matrix, n_H1 denotes its frequency axis, and n_W1 denotes its time axis; n_C1 × n_H1 denotes the axis generated after the channel axis and the frequency axis corresponding to the third content feature matrix are stacked and unfolded;
and transposing the first content feature transformation matrix to obtain a first content feature transpose matrix, and calculating the product of the first content feature transformation matrix and the first content feature transpose matrix to obtain the first style feature matrix.
Optionally, in another embodiment, the feature matrix obtaining module 202 is configured to expand the second content feature matrix along a time axis to obtain a second content feature transformation matrix; the dimensions of the second content feature matrix are denoted (n_C2, n_H2, n_W2), and the dimensions of the second content feature transformation matrix are denoted (n_C2 × n_H2, n_W2), where n_C2 denotes the channel axis corresponding to the second content feature matrix, n_H2 denotes its frequency axis, and n_W2 denotes its time axis; n_C2 × n_H2 denotes the axis generated after the channel axis and the frequency axis corresponding to the second content feature matrix are stacked and unfolded;
and to transpose the second content feature transformation matrix to obtain a second content feature transpose matrix, and to calculate the product of the second content feature transformation matrix and the second content feature transpose matrix to obtain the second style feature matrix.
Optionally, in another embodiment, the style function obtaining module 204 is configured to calculate the style loss function by one of the following formulas:
loss = Σ_{i,j} |s_ij - q_ij|
or
loss = Σ_{i,j} (s_ij - q_ij)^2
where loss denotes the style loss function, s_ij denotes the style value at position (i, j) in the first style feature matrix, and q_ij denotes the style value at position (i, j) in the second style feature matrix.
By adopting this apparatus, a first voice spectrogram corresponding to the first voice data, a second voice spectrogram corresponding to the second voice data, and a voice spectrogram to be processed are obtained, where the first voice data is used for extracting voice content and the second voice data is used for extracting a voice style; a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed are obtained through a preset two-dimensional convolutional neural network model; a reconstruction loss function is obtained according to the first and second content feature matrices; a style loss function is obtained according to the first and second style feature matrices; and the voice spectrogram to be processed is processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, whose corresponding voice data is then obtained through a preset voice reconstruction algorithm. It can be seen that, because image style transfer methods are designed for images while voice data is a one-dimensional discrete voice signal, the first voice data and the second voice data are converted into the corresponding first and second voice spectrograms before being input into the preset two-dimensional convolutional neural network model. The spectrograms then conform to the image characteristics expected by the image style transfer method, which improves the accuracy of the voice style transfer result.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 3 is a block diagram illustrating an electronic device 300 according to an example embodiment. The electronic device may be a mobile terminal or a server; in the embodiments of the present disclosure, a mobile terminal is taken as an example for description. For example, the electronic device 300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 3, electronic device 300 may include one or more of the following components: a processing component 302, a memory 304, a power component 306, a multimedia component 308, an audio component 310, an input/output (I/O) interface 312, a sensor component 314, and a communication component 316.
The processing component 302 generally controls overall operation of the electronic device 300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 302 may include one or more processors 320 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 302 can include one or more modules that facilitate interaction between the processing component 302 and other components. For example, the processing component 302 may include a multimedia module to facilitate interaction between the multimedia component 308 and the processing component 302.
The memory 304 is configured to store various types of data to support operations at the electronic device 300. Examples of such data include instructions for any application or method operating on the electronic device 300, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 304 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 306 provides power to the various components of the electronic device 300. The power components 306 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 300.
The multimedia component 308 comprises a screen providing an output interface between the electronic device 300 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 308 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 300 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 310 is configured to output and/or input audio signals. For example, the audio component 310 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 300 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 304 or transmitted via the communication component 316. In some embodiments, audio component 310 also includes a speaker for outputting audio signals.
The I/O interface 312 provides an interface between the processing component 302 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
Sensor assembly 314 includes one or more sensors for providing various aspects of status assessment for the electronic device 300. For example, the sensor assembly 314 may detect the open/closed state of the electronic device 300 and the relative positioning of components (such as the display and keypad of the electronic device 300), and may also detect a change in position of the electronic device 300 or of one of its components, the presence or absence of user contact with the electronic device 300, the orientation or acceleration/deceleration of the electronic device 300, and a change in its temperature. The sensor assembly 314 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact, and may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 316 is configured to facilitate wired or wireless communication between the electronic device 300 and other devices. The electronic device 300 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 316 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 316 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the voice processing method illustrated in fig. 1 described above.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium comprising instructions, such as the memory 304 comprising instructions, the instructions being executable by the processor 320 of the electronic device 300 to perform the speech processing method illustrated in fig. 1 described above. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product, the instructions of which, when executed by the processor 320 of the electronic device 300, cause the electronic device 300 to perform the above-described speech processing method illustrated in fig. 1.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of speech processing, the method comprising:
acquiring a first voice spectrogram corresponding to first voice data, a second voice spectrogram corresponding to second voice data, and a voice spectrogram to be processed; the first voice data is used for extracting voice content; the second voice data is used for extracting a voice style;
acquiring a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model;
acquiring a reconstruction loss function according to the first content feature matrix and the second content feature matrix;
acquiring a style loss function according to the first style feature matrix and the second style feature matrix;
processing the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquiring voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm.
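Taken together, the steps of claim 1 amount to spectrogram-domain style transfer. The end-to-end sketch below is one reading of the claim under stated assumptions: librosa for the short-time Fourier transform and Griffin-Lim reconstruction, a single fixed random convolution layer standing in for the preset two-dimensional convolutional neural network, and hypothetical file names content.wav and style.wav; the loss weight and iteration count are likewise illustrative, not taken from the patent.

```python
import librosa
import numpy as np
import torch
import torch.nn.functional as F

def magnitude_spectrogram(path, n_fft=256, hop=64):
    # Short-time Fourier transform magnitude (compare claim 2).
    y, _ = librosa.load(path, sr=16000)
    return np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

def gram(feat):
    # (C, H, W) -> (C*H, W), then product with its own transpose:
    # the style feature matrix construction of claims 3 and 4.
    c, h, w = feat.shape
    f = feat.reshape(c * h, w)
    return f @ f.t()

# A fixed, randomly initialized convolution layer stands in for the
# "preset two-dimensional convolutional neural network model"; the
# patent does not specify the architecture.
torch.manual_seed(0)
conv = torch.nn.Conv2d(1, 8, kernel_size=3, padding=1)
for p in conv.parameters():
    p.requires_grad_(False)

def features(spec):
    # spec: (H, W) -> feature tensor (C, H, W).
    return F.relu(conv(spec.unsqueeze(0).unsqueeze(0))).squeeze(0)

content = torch.tensor(magnitude_spectrogram("content.wav"))  # first voice data
style = torch.tensor(magnitude_spectrogram("style.wav"))      # second voice data
x = content.clone().requires_grad_(True)  # voice spectrogram to be processed

target_content = features(content).detach()    # first content feature matrix
target_style = gram(features(style)).detach()  # first style feature matrix

optimizer = torch.optim.Adam([x], lr=0.05)
for _ in range(300):
    optimizer.zero_grad()
    fx = features(x)                                 # second content feature matrix
    recon_loss = F.mse_loss(fx, target_content)      # reconstruction loss
    style_loss = F.mse_loss(gram(fx), target_style)  # style loss
    (recon_loss + 1e-5 * style_loss).backward()
    optimizer.step()

# Griffin-Lim as one possible "preset voice reconstruction algorithm".
target = np.clip(x.detach().numpy(), 0.0, None)
audio = librosa.griffinlim(target, hop_length=64)
```

Initializing the spectrogram to be processed from the content spectrogram keeps the two content feature matrices the same shape, so the reconstruction loss is well defined from the first iteration.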
2. The method of claim 1, wherein the obtaining a first voice spectrogram corresponding to first voice data and a second voice spectrogram corresponding to second voice data comprises:
respectively performing a short-time Fourier transform on the first voice data and the second voice data to obtain the corresponding first voice spectrogram and second voice spectrogram.
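Read concretely, the transform of claim 2 is the short-time Fourier transform, whose magnitude gives each voice spectrogram (frequency bins on one axis, time frames on the other). A minimal sketch, assuming librosa; the file names, sample rate, and frame/hop sizes are illustrative:

```python
import numpy as np
import librosa

y1, sr = librosa.load("first_voice.wav", sr=16000)   # voice content source
y2, _ = librosa.load("second_voice.wav", sr=16000)   # voice style source

# Magnitude of the short-time Fourier transform of each waveform.
first_spectrogram = np.abs(librosa.stft(y1, n_fft=512, hop_length=128))
second_spectrogram = np.abs(librosa.stft(y2, n_fft=512, hop_length=128))
```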
3. The method of claim 1, wherein the obtaining the first style feature matrix of the second voice spectrogram through a preset two-dimensional convolutional neural network model comprises:
acquiring a third content feature matrix corresponding to the second voice spectrogram through the two-dimensional convolutional neural network model;
expanding the third content feature matrix along a time axis to obtain a first content feature conversion matrix; the dimension of the third content feature matrix is represented as (n_C1, n_H1, n_W1), and the dimension of the first content feature conversion matrix is represented as (n_C1 × n_H1, n_W1), where n_C1 represents the channel-number axis corresponding to the third content feature matrix, n_H1 represents the frequency axis corresponding to the third content feature matrix, n_W1 represents the time axis corresponding to the third content feature matrix, and n_C1 × n_H1 represents the number axis generated after the channel-number axis and the frequency axis corresponding to the third content feature matrix are stacked and unfolded; and
performing transposition processing on the first content feature conversion matrix to obtain a first content feature transposed matrix, and calculating the product of the first content feature conversion matrix and the first content feature transposed matrix to obtain the first style feature matrix.
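Claims 3 and 4 describe a Gram-matrix construction: the feature tensor is flattened so that the channel and frequency axes merge into one, and the product with its own transpose sums over time. A minimal numpy sketch, with a random tensor standing in for the third content feature matrix (the shapes are illustrative):

```python
import numpy as np

# Hypothetical third content feature matrix: 8 channels, 129 frequency
# bins, 200 time frames -- i.e. dimensions (n_C1, n_H1, n_W1).
feat = np.random.rand(8, 129, 200).astype(np.float32)
n_c, n_h, n_w = feat.shape

# Expand along the time axis: stack the channel and frequency axes into
# one, giving the first content feature conversion matrix of dimension
# (n_C1 * n_H1, n_W1).
conversion = feat.reshape(n_c * n_h, n_w)

# Product of the conversion matrix and its transpose yields the first
# style feature matrix, a Gram matrix over the time axis.
style_matrix = conversion @ conversion.T   # shape (n_C1*n_H1, n_C1*n_H1)
```

Because the time axis is summed out, style matrices computed from spectrograms of different durations are directly comparable, which is what lets the style loss of claim 5 compare the two voices entry by entry.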
4. The method according to claim 1, wherein the obtaining the second style feature matrix of the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model comprises:
expanding the second content feature matrix along a time axis to obtain a second content feature conversion matrix; the dimension of the second content feature matrix is represented as (n_C2, n_H2, n_W2), and the dimension of the second content feature conversion matrix is represented as (n_C2 × n_H2, n_W2), where n_C2 represents the channel-number axis corresponding to the second content feature matrix, n_H2 represents the frequency axis corresponding to the second content feature matrix, n_W2 represents the time axis corresponding to the second content feature matrix, and n_C2 × n_H2 represents the number axis generated after the channel-number axis and the frequency axis corresponding to the second content feature matrix are stacked and unfolded; and
performing transposition processing on the second content feature conversion matrix to obtain a second content feature transposed matrix, and calculating the product of the second content feature conversion matrix and the second content feature transposed matrix to obtain the second style feature matrix.
5. The method of claim 1, wherein obtaining a style loss function from the first style feature matrix and the second style feature matrix comprises:
calculating the style loss function by one of two formulas [shown as images in the original publication and not reproduced in this text],
where loss represents the style loss function, s_ij represents the style value corresponding to the (i, j) position in the first style feature matrix, and q_ij represents the style value corresponding to the (i, j) position in the second style feature matrix.
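The exact formulas are not recoverable from this text, but a style loss defined over two Gram-type matrices with entries s_ij and q_ij is conventionally a sum of squared entry-wise differences, possibly normalized by matrix size. The sketch below shows that assumed form; it is an illustration, not the patent's claimed formula:

```python
import numpy as np

def style_loss(s: np.ndarray, q: np.ndarray) -> float:
    # Assumed form: mean of squared differences between corresponding
    # style values s_ij and q_ij of the two style feature matrices.
    # An unnormalized variant would use np.sum instead of np.mean.
    return float(np.mean((s - q) ** 2))
```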
6. A speech processing apparatus, characterized in that the apparatus comprises:
the spectrogram acquiring module is configured to acquire a first voice spectrogram corresponding to first voice data, a second voice spectrogram corresponding to second voice data, and a voice spectrogram to be processed; the first voice data is used for extracting voice content; the second voice data is used for extracting a voice style;
the feature matrix acquisition module is configured to acquire a first content feature matrix of the first voice spectrogram, a first style feature matrix of the second voice spectrogram, and a second content feature matrix and a second style feature matrix corresponding to the voice spectrogram to be processed through a preset two-dimensional convolutional neural network model;
a reconstruction function obtaining module configured to obtain a reconstruction loss function according to the first content feature matrix and the second content feature matrix;
a style function obtaining module configured to obtain a style loss function according to the first style feature matrix and the second style feature matrix;
and the voice processing module is configured to process the voice spectrogram to be processed according to the reconstruction loss function and the style loss function to obtain a target voice spectrogram, and acquire voice data corresponding to the target voice spectrogram through a preset voice reconstruction algorithm.
7. The apparatus of claim 6, wherein the spectrogram acquiring module is configured to perform a short-time Fourier transform on the first voice data and the second voice data, respectively, to obtain the corresponding first voice spectrogram and second voice spectrogram.
8. The apparatus of claim 6, wherein the feature matrix acquisition module is configured to acquire a third content feature matrix corresponding to the second voice spectrogram through the two-dimensional convolutional neural network model;
expand the third content feature matrix along a time axis to obtain a first content feature conversion matrix, where the dimension of the third content feature matrix is represented as (n_C1, n_H1, n_W1), the dimension of the first content feature conversion matrix is represented as (n_C1 × n_H1, n_W1), n_C1 represents the channel-number axis corresponding to the third content feature matrix, n_H1 represents the frequency axis corresponding to the third content feature matrix, n_W1 represents the time axis corresponding to the third content feature matrix, and n_C1 × n_H1 represents the number axis generated after the channel-number axis and the frequency axis corresponding to the third content feature matrix are stacked and unfolded; and
perform transposition processing on the first content feature conversion matrix to obtain a first content feature transposed matrix, and calculate the product of the first content feature conversion matrix and the first content feature transposed matrix to obtain the first style feature matrix.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to perform the steps of the speech processing method of any of claims 1 to 5.
10. A non-transitory computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the steps of the speech processing method of any of claims 1 to 5.
CN201910381777.XA 2019-05-08 2019-05-08 Voice processing method and device, electronic equipment and storage medium Active CN110148424B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910381777.XA CN110148424B (en) 2019-05-08 2019-05-08 Voice processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110148424A (en) 2019-08-20
CN110148424B CN110148424B (en) 2021-05-25

Family

ID=67594261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910381777.XA Active CN110148424B (en) 2019-05-08 2019-05-08 Voice processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110148424B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1521729A (en) * 2003-01-21 2004-08-18 Method of speech recognition using hidden trajectory hidden markov models
US8886538B2 (en) * 2003-09-26 2014-11-11 Nuance Communications, Inc. Systems and methods for text-to-speech synthesis using spoken example
WO2009021183A1 (en) * 2007-08-08 2009-02-12 Lessac Technologies, Inc. System-effected text annotation for expressive prosody in speech synthesis and recognition
JP5580585B2 (en) * 2009-12-25 2014-08-27 日本電信電話株式会社 Signal analysis apparatus, signal analysis method, and signal analysis program
CN102306492A (en) * 2011-09-09 2012-01-04 中国人民解放军理工大学 Voice conversion method based on convolutive nonnegative matrix factorization
CN105304080A (en) * 2015-09-22 2016-02-03 科大讯飞股份有限公司 Speech synthesis device and speech synthesis method
US20180068463A1 (en) * 2016-09-02 2018-03-08 Artomatix Ltd. Systems and Methods for Providing Convolutional Neural Network Based Image Synthesis Using Stable and Controllable Parametric Models, a Multiscale Synthesis Framework and Novel Network Architectures
CN106847294A (en) * 2017-01-17 2017-06-13 百度在线网络技术(北京)有限公司 Audio-frequency processing method and device based on artificial intelligence
CN108510975A (en) * 2017-02-24 2018-09-07 百度(美国)有限责任公司 System and method for real-time neural text-to-speech
CN108304436A (en) * 2017-09-12 2018-07-20 深圳市腾讯计算机系统有限公司 The generation method of style sentence, the training method of model, device and equipment
EP3457401A1 (en) * 2017-09-18 2019-03-20 Thomson Licensing Method for modifying a style of an audio object, and corresponding electronic device, computer readable program products and computer readable storage medium
CN107767328A (en) * 2017-10-13 2018-03-06 上海交通大学 The moving method and system of any style and content based on the generation of a small amount of sample
CN107977414A (en) * 2017-11-22 2018-05-01 西安财经学院 Image Style Transfer method and its system based on deep learning
CN109410080A (en) * 2018-10-16 2019-03-01 合肥工业大学 A kind of social image recommended method based on level attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YANG CHEN ET AL.: "Transforming photos to comics using convolutional neural networks", 2017 IEEE International Conference on Image Processing (ICIP) *
DOU Yaling et al.: "Image style transfer technology based on convolutional neural networks", Modern Computer *
XIE Zhifeng et al.: "HDR image style transfer technology based on generative adversarial networks", Journal of Shanghai University (Natural Science Edition) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037754A (en) * 2020-09-09 2020-12-04 广州华多网络科技有限公司 Method for generating speech synthesis training data and related equipment
CN112037754B (en) * 2020-09-09 2024-02-09 广州方硅信息技术有限公司 Method for generating speech synthesis training data and related equipment
CN117198308A (en) * 2023-09-11 2023-12-08 辽宁工程技术大学 Style migration method for in-vehicle feedback sound effect
CN117198308B (en) * 2023-09-11 2024-03-19 辽宁工程技术大学 Style migration method for in-vehicle feedback sound effect

Also Published As

Publication number Publication date
CN110148424B (en) 2021-05-25

Similar Documents

Publication Publication Date Title
US20170304735A1 (en) Method and Apparatus for Performing Live Broadcast on Game
CN110517185B (en) Image processing method, device, electronic equipment and storage medium
CN104156947B (en) Image partition method, device and equipment
CN108108418B (en) Picture management method, device and storage medium
CN107967459B (en) Convolution processing method, convolution processing device and storage medium
CN110944230B (en) Video special effect adding method and device, electronic equipment and storage medium
WO2016127671A1 (en) Image filter generating method and device
CN107798654B (en) Image buffing method and device and storage medium
CN110677734B (en) Video synthesis method and device, electronic equipment and storage medium
CN111009257A (en) Audio signal processing method and device, terminal and storage medium
CN111128221A (en) Audio signal processing method and device, terminal and storage medium
CN109919829A (en) Image Style Transfer method, apparatus and computer readable storage medium
CN106534951B (en) Video segmentation method and device
CN111586296B (en) Image capturing method, image capturing apparatus, and storage medium
CN111078170B (en) Display control method, display control device, and computer-readable storage medium
CN107729880A (en) Method for detecting human face and device
CN112330570A (en) Image processing method, image processing device, electronic equipment and storage medium
CN105574834B (en) Image processing method and device
CN111652107A (en) Object counting method and device, electronic equipment and storage medium
CN110148424B (en) Voice processing method and device, electronic equipment and storage medium
CN110931028A (en) Voice processing method and device and electronic equipment
CN111933171B (en) Noise reduction method and device, electronic equipment and storage medium
US11600300B2 (en) Method and device for generating dynamic image
CN112116528A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111340690A (en) Image processing method, image processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant