CN114863945A - Text-based voice changing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114863945A
Authority
CN
China
Prior art keywords
frequency spectrum
target
text
mel
mel frequency
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN202210416138.4A
Other languages
Chinese (zh)
Inventor
朱超
Current Assignee (the listed assignee may be inaccurate)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210416138.4A
Publication of CN114863945A
Legal status: Pending

Classifications

    • G10L25/24: speech or voice analysis characterised by the extracted parameters being the cepstrum
    • G06N3/044: recurrent neural networks, e.g. Hopfield networks
    • G06N3/08: neural network learning methods
    • G10L21/013: changing voice quality; adapting to target pitch
    • G10L25/30: speech or voice analysis using neural networks
    • G10L2021/0135: voice conversion or morphing

Abstract

The invention relates to the field of speech semantics and discloses a text-based voice changing method, a text-based voice changing apparatus, an electronic device and a readable storage medium. The method comprises: acquiring a target text and audio data, and performing phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a vocoder; performing spectrum conversion on the audio data to obtain a target spectrum; sequentially processing the phoneme sequence with the encoder, the decoder and the residual network according to the target spectrum to obtain a target Mel frequency spectrum; and performing audio conversion on the target Mel frequency spectrum by using the vocoder to obtain a target audio. The invention can improve the accuracy and efficiency of voice changing.

Description

Text-based voice changing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of voice semantics, in particular to a text-based voice changing method and device, electronic equipment and a readable storage medium.
Background
Voice changing refers to a technique that alters the timbre of a voice by modifying its frequency; for example, a voice changer can achieve this effect.
At present, voice changing is commonly performed on the basis of a recorded speech segment. When the recording environment is noisy or the speaker's pronunciation is non-standard, the result easily deviates from the intended voice, causing the voice change to fail.
Disclosure of Invention
The invention provides a text-based voice changing method and device, electronic equipment and a readable storage medium, and aims to improve the accuracy and efficiency of text-based voice changing.
In order to achieve the above object, the present invention provides a text-based voice changing method, including:
acquiring target text and audio data, and performing phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a vocoder;
carrying out frequency spectrum conversion on the audio data to obtain a target frequency spectrum;
extracting the context feature of the phoneme sequence by using the encoder to obtain a hidden feature matrix;
predicting the Mel frequency spectrum of the target text by using the decoder according to the hidden feature matrix and the target frequency spectrum to obtain a predicted Mel frequency spectrum;
performing residual connection on the predicted Mel frequency spectrum by using the residual network to obtain a target Mel frequency spectrum;
and performing audio conversion on the target Mel frequency spectrum by using the vocoder to obtain a target audio.
Optionally, the performing, by using the encoder, context feature extraction on the phoneme sequence to obtain a hidden feature matrix includes:
performing convolution processing on the phoneme sequence by using a preset number of convolutional layers in the encoder to obtain a feature matrix of the phoneme sequence;
performing rectified linear unit (ReLU) activation processing and batch normalization processing on the feature matrix to obtain an optimized feature matrix;
and calculating the optimized feature matrix by using a bidirectional long short-term memory (BLSTM) network preset in the encoder to obtain a hidden feature matrix.
Optionally, the predicting the mel frequency spectrum of the target text by using the decoder according to the hidden feature matrix and the target frequency spectrum to obtain a predicted mel frequency spectrum includes:
extracting a context vector in the hidden feature matrix by using an attention network in the decoder to obtain a context vector of a first current time step;
concatenating the context vector of the first current time step with a preset Mel frequency spectrum, and inputting the concatenation result into a two-layer long short-term memory (LSTM) layer in the decoder to obtain a context vector of a second current time step;
performing first linear projection on the context vector of the second current time step by utilizing a post-processing network in the decoder to obtain a context scalar quantity of the current time step;
according to the target frequency spectrum, performing second linear projection on the context vector of the second current time step by using the post-processing network, and performing Mel frequency spectrum prediction on the context scalar subjected to the second linear projection to obtain a Mel frequency spectrum of the second current time step;
calculating the probability of completing prediction of the Mel frequency spectrum by using a preset first activation function according to the context scalar of the current time step;
judging whether the probability of completing prediction of the Mel frequency spectrum is smaller than a preset threshold value or not;
and when the probability of completing prediction of the Mel frequency spectrum is not less than the threshold, concatenating the context vector of the second current time step with the Mel frequency spectrum of the second current time step, and returning to the step of inputting the concatenation result into the two-layer long short-term memory layer in the decoder, until the probability of completing prediction of the Mel frequency spectrum is less than the threshold, thereby obtaining the predicted Mel frequency spectrum.
Optionally, the extracting, by using an attention network in the decoder, a context vector in the hidden feature matrix to obtain a context vector of a first current time step includes:
carrying out linear projection on the hidden feature matrix by utilizing a linear layer in the attention network to obtain a key matrix;
inputting the attention weight value in the attention network into a preset convolution layer to generate a position characteristic matrix;
performing linear projection on the position feature matrix by using the linear layer to obtain an additional feature matrix;
adding the additional feature matrix and the key matrix, and processing an addition result by using a preset second activation function to obtain an attention probability vector;
mapping the attention probability vector by using a preset mapping function to obtain a weight vector of the current attention;
and multiplying the current attention weight vector and the hidden feature matrix to obtain a context vector of a first current time step.
Optionally, the performing spectrum conversion on the audio data to obtain a target spectrum includes:
pre-emphasis processing, framing processing and windowing processing are carried out on the audio data to obtain a target audio signal;
and performing Fourier transform on the target audio signal to obtain the target spectrum.
optionally, the performing phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence includes:
performing language analysis on the target text by using a language analysis tool to determine the language of the target text;
performing sentence segmentation processing on the target text by using the word segmentation rule corresponding to the language type to obtain a segmented sentence text;
converting non-characters in the segmented sentence text into characters according to a preset text format rule;
performing word segmentation processing on the segmented sentence text to obtain a word segmentation text;
mapping the word segmentation text according to a preset character phoneme mapping dictionary to obtain phonemes;
performing vector conversion on the phoneme to obtain a phoneme vector;
and coding and sequencing the phoneme vectors according to the text sequence to obtain a phoneme sequence.
Optionally, the performing residual connection on the predicted mel frequency spectrum by using the residual network to obtain a target mel frequency spectrum includes:
performing residual error calculation on the predicted Mel frequency spectrum by using a preset residual error network to obtain a residual Mel frequency spectrum;
and superposing the residual Mel frequency spectrum and the predicted Mel frequency spectrum to obtain a target Mel frequency spectrum.
In order to solve the above problems, the present invention also provides a text-based voice changing apparatus, including:
the phoneme sequence conversion module is used for acquiring target text and audio data and performing phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a vocoder;
a target mel-frequency spectrum obtaining module, configured to perform frequency spectrum conversion on the audio data to obtain a target frequency spectrum, perform context feature extraction on the phoneme sequence by using the encoder to obtain a hidden feature matrix, predict a mel-frequency spectrum of the target text by using the decoder according to the hidden feature matrix and the target frequency spectrum to obtain a predicted mel-frequency spectrum, and perform residual error connection on the predicted mel-frequency spectrum by using the residual error network to obtain a target mel-frequency spectrum;
and the target audio acquisition module is used for performing audio conversion on the target Mel frequency spectrum by using the vocoder to obtain a target audio.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one computer program; and
a processor executing a computer program stored in the memory to implement the text-based voice changing method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium having at least one computer program stored therein, the at least one computer program being executed by a processor in an electronic device to implement the text-based voice changing method described above.
By converting the target text into a phoneme sequence, the embodiment of the invention obtains the pronunciation attribute of each word in the target text, avoiding pronunciation errors caused by polyphonic characters and improving the accuracy of voice changing. Performing spectrum conversion on the audio data yields a target spectrum that determines the frequency of the voice change, thereby ensuring the direction of the change and further improving accuracy. Context feature extraction on the phoneme sequence with the encoder produces a hidden feature matrix; the decoder then predicts the Mel frequency spectrum of the target text from the hidden feature matrix and the target spectrum to obtain a predicted Mel frequency spectrum, and the residual network performs residual connection on the predicted Mel frequency spectrum to obtain the target Mel frequency spectrum, completing the conversion from text to speech spectrum and reducing the influence of noise during the voice change. Finally, the vocoder performs audio conversion on the target Mel frequency spectrum to obtain the target audio, completing a text-based voice changing process that is less affected by the environment. Therefore, the text-based voice changing method and apparatus, electronic device and readable storage medium provided by the embodiments of the invention can improve the accuracy and efficiency of voice changing.
Drawings
Fig. 1 is a schematic flowchart of a text-based voice changing method according to an embodiment of the present invention;
figs. 2 to 8 are detailed flowcharts of individual steps in a text-based voice changing method according to an embodiment of the present invention;
FIG. 9 is a block diagram of a text-based voice changing apparatus according to an embodiment of the present invention;
fig. 10 is a schematic internal structural diagram of an electronic device implementing a text-based voice changing method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides a text-based voice changing method. The execution subject of the method includes, but is not limited to, at least one of the electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server or a terminal. In other words, the method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), big data and artificial intelligence platforms.
Referring to fig. 1, a flowchart of a text-based voice changing method according to an embodiment of the present invention is shown. In an embodiment of the present invention, the text-based voice changing method includes the following steps S1-S6:
s1, obtaining target text and audio data, and performing phoneme conversion on the target file by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual error network and a vocoder.
In the embodiment of the present invention, the source and text type of the target text may take various forms, where the text type includes Chinese, English and the like. The audio data may be speech data containing the target timbre of the voice change. A phoneme is the minimum speech unit divided according to the natural attributes of speech, analyzed according to the articulatory actions within a syllable, one action forming one phoneme; for example, the phonemes of a Chinese character may be its pinyin and tone.
In an optional embodiment of the invention, the target text and audio data can be obtained through network download or user input, which saves labor cost, reduces the influence of the surrounding environment on speech synthesis, and improves the accuracy of voice changing.
In the embodiment of the invention, the preset speech synthesis model is used to perform phoneme conversion on the target text to obtain the phoneme sequence, i.e. the most basic units of the target text's pronunciation, thereby avoiding pronunciation errors and improving the accuracy of speech synthesis and voice changing.
Further, referring to fig. 2, as an optional embodiment of the present invention, the performing phoneme conversion on the target text by using the preset speech synthesis model to obtain a phoneme sequence includes the following steps S11-S17:
s11, performing language analysis on the target text by using a language analysis tool, and determining the language of the target text;
s12, performing statement segmentation processing on the target text by using the language corresponding word segmentation rule to obtain a segmented statement text;
s13, converting the non-characters in the sentence segmentation text into characters according to a preset text format rule;
s14, performing word segmentation processing on the segmented sentence text to obtain a word segmentation text;
s15, mapping the word segmentation text according to a preset character phoneme mapping dictionary to obtain phonemes;
s16, carrying out vector conversion on the phonemes to obtain phoneme vectors;
and S17, coding and sequencing the phoneme vectors according to the text sequence to obtain a phoneme sequence.
In the embodiment of the present invention, the language analysis tool may be translation software. The preset text format rule may be that, in the obtained target text, any Arabic numerals are converted into characters, and the synthesized text is then standardized according to the set rule. For example, in "there are 56 persons", "56" is an Arabic numeral and needs to be converted into the Chinese characters for "fifty-six", which facilitates subsequent processes such as character-to-phoneme conversion. The character-phoneme mapping dictionary comprises each character and its corresponding phoneme.
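The normalization and dictionary-mapping steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the digit table and the word-to-phoneme dictionary below are hypothetical stand-ins, not the patent's actual resources.

```python
# Illustrative digit table for Chinese numeral normalization (assumption).
DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def normalize_number(num: str) -> str:
    """Convert a small Arabic numeral (two digits) to Chinese characters,
    e.g. '56' -> '五十六', matching the example in the text."""
    if len(num) == 2:
        tens, ones = num
        out = ("" if tens == "1" else DIGITS[tens]) + "十"
        if ones != "0":
            out += DIGITS[ones]
        return out
    return DIGITS.get(num, num)

# Hypothetical word -> phoneme dictionary (pinyin with tone digits).
PHONE_DICT = {"一共": ["yi1", "gong4"], "有": ["you3"],
              "五十六": ["wu3", "shi2", "liu4"], "人": ["ren2"]}

def words_to_phonemes(words):
    """Map segmented words to phonemes via the dictionary (step S15)."""
    phonemes = []
    for w in words:
        phonemes.extend(PHONE_DICT.get(w, ["<unk>"]))
    return phonemes

print(normalize_number("56"))                          # 五十六
print(words_to_phonemes(["一共", "有", "五十六", "人"]))
```

A production system would of course use a full text-normalization grammar and pronunciation lexicon rather than these toy tables.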
In an optional embodiment of the present invention, the pronunciation rule of the target text can be determined by language analysis; the target text is then segmented into words so that phoneme conversion can be performed accurately to obtain phonemes; finally, the phonemes are coded and ordered to obtain a phoneme sequence. This ordering ensures the accuracy of the phoneme sequence, avoids confusion in the pronunciation of the target text, and improves the accuracy of speech synthesis.
In another optional embodiment of the present invention, the phoneme conversion of the target text into a phoneme sequence may be implemented with an open-source grapheme-to-phoneme conversion tool such as G2P.
And S2, carrying out frequency spectrum conversion on the audio data to obtain a target frequency spectrum.
According to the embodiment of the invention, performing spectrum conversion on the audio data yields the target spectrum, i.e. the spectrum of the target timbre of the voice change, which ensures the accuracy of the voice change and improves its efficiency.
Further, referring to fig. 3, in an alternative embodiment of the present invention, the performing spectrum conversion on the audio data to obtain a target spectrum includes the following steps S21 and S22:
s21, performing pre-emphasis processing, framing processing and windowing processing on the audio data to obtain a target audio signal;
and S22, carrying out Fourier transform on the target voice signal to obtain a target frequency spectrum.
In an optional embodiment of the present invention, pre-emphasis processing of the audio data is implemented by a high-pass filter with a first-order FIR transfer function, thereby emphasizing the high-frequency part of the audio data and eliminating the effect of the speaker's lip radiation.
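The first-order FIR pre-emphasis described above is simply y[n] = x[n] − a·x[n−1]. A minimal sketch, assuming the common coefficient a = 0.97 (the patent does not state a value):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """First-order FIR high-pass: y[n] = x[n] - a * x[n-1].
    The first sample is passed through unchanged."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])
# A constant (zero-frequency) signal is attenuated to 1 - a = 0.03
# everywhere after the first sample.
print(pre_emphasis(x))
```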
In an optional embodiment of the invention, the audio data is a non-stationary speech signal, mainly due to transient changes of the vocal organs; therefore the pre-emphasized speech signal is divided into frames using short-time processing, which ensures that the signal is approximately stationary within each very short interval.
In an optional embodiment of the present invention, since framing makes the signal deviate increasingly from the original at frame boundaries, windowing the framed speech signal reduces the discontinuity at the beginning and end of each frame.
In an optional embodiment of the invention, the signal is converted into a spectrum by the Fourier transform, realizing the conversion of the audio data into the target spectrum and obtaining the spectrum of the target timbre, so that text can be converted into speech of the target timbre and the accuracy of voice changing is improved.
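The framing, windowing and Fourier steps (S21-S22) can be sketched together. The frame length, hop size and Hann window below are illustrative assumptions; the patent does not specify them:

```python
import numpy as np

def spectrum(signal, frame_len=400, hop=160):
    """Frame the signal, apply a Hann window per frame, and take the
    magnitude of the FFT of each frame (steps S21-S22)."""
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len//2 + 1)

# 1 second of a 440 Hz sine at a 16 kHz sampling rate.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = spectrum(sig)
print(spec.shape)  # (98, 201)
```

With 25 ms frames (400 samples) at 16 kHz, each FFT bin spans 40 Hz, so the 440 Hz tone peaks at bin 11 of every frame.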
And S3, extracting the context feature of the phoneme sequence by using the encoder to obtain a hidden feature matrix.
In the embodiment of the invention, the encoder comprises convolutional layers and a bidirectional long short-term memory (BLSTM) network. The hidden feature matrix includes information such as the context vectors of the phoneme sequence.
In the embodiment of the invention, the meaning of each character in the target text is often closely related to its context. For example, the Chinese character "好" has two pronunciations, and its pronunciation cannot be determined by analyzing the character alone, which easily causes pronunciation errors; therefore the context feature information of each character needs to be extracted, further ensuring the accuracy of speech synthesis.
Further, referring to fig. 4, as an alternative embodiment of the present invention, the performing context feature extraction on the phoneme sequence by using the encoder to obtain a hidden feature matrix includes the following steps S31 to S33:
s31, performing convolution processing on the phoneme sequence by utilizing convolution layers with preset number of layers in the encoder to obtain a feature matrix of the phoneme sequence;
s32, performing modified linear unit activation processing and batch normalization processing on the feature matrix to obtain an optimized feature matrix;
and S33, calculating the optimized feature matrix by using a bidirectional long-time memory network preset in the encoder to obtain a hidden feature matrix.
In the embodiment of the present invention, the bidirectional long short-term memory network may be used to obtain and store the context vectors of the phoneme sequence.
In an optional embodiment of the invention, the encoder extracts features from the phoneme sequence to obtain the hidden feature matrix, which contains information such as the context vectors of the phoneme sequence. Obtaining the hidden feature matrix therefore captures the context features of the phoneme sequence, strengthening their influence on the model and improving the pronunciation accuracy of the speech synthesis model.
And S4, predicting the Mel frequency spectrum of the target text by using the decoder according to the hidden feature matrix and the target frequency spectrum to obtain a predicted Mel frequency spectrum.
In an embodiment of the present invention, the decoder may be an autoregressive recurrent neural network, wherein the autoregressive recurrent neural network includes an attention network and a post-processing network.
According to the hidden feature matrix and the target spectrum, the decoder predicts the Mel frequency spectrum of the target text to obtain a predicted Mel frequency spectrum, ensuring that the timbre of the speech synthesis result is consistent with the audio data.
Further, referring to fig. 5, as an alternative embodiment of the present invention, the predicting the mel spectrum of the target text by the decoder according to the hidden feature matrix and the target spectrum to obtain a predicted mel spectrum includes the following steps S41 to S48:
s41, extracting the context vector in the hidden feature matrix by using the attention network in the decoder to obtain the context vector of the first current time step;
s42, performing tandem operation on the context vector of the first current time step and a preset Mel frequency spectrum, and inputting a tandem result into a double-layer long-short time memory layer in the decoder to obtain a context vector of a second current time step;
s43, performing first linear projection on the context vector of the second current time step by utilizing a post-processing network in the decoder to obtain a context scalar quantity of the current time step;
s44, according to the target frequency spectrum, performing second linear projection on the context vector of the second current time step by using the post-processing network, and performing Mel frequency spectrum prediction on the context scalar after the second linear projection to obtain a Mel frequency spectrum of the second current time step;
s45, calculating the probability of completion of prediction of the Mel frequency spectrum by using a preset first activation function according to the context scalar of the current time step;
s46, judging whether the probability of the prediction completion of the Mel frequency spectrum is smaller than a preset threshold value;
s47, when the probability of completing prediction of the mel spectrum is not less than the threshold, performing a concatenation operation on the context vector of the second current time step and the mel spectrum of the second current time step, and returning to S43;
and S48, obtaining a predicted Mel frequency spectrum when the probability of the Mel frequency spectrum prediction completion is smaller than the threshold value.
In an embodiment of the present invention, the attention network includes a location-sensitive attention mechanism and the two-layer LSTM, and is mainly used to determine which part of the encoder input should be attended to. The first activation function may be a sigmoid function.
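The stop-probability loop of steps S45-S48 can be sketched as follows. Note the patent's convention: decoding continues while the sigmoid probability is not less than the threshold, and stops once it falls below it. The decoder step here is a random stand-in, not the patent's network; the declining stop scalar is an assumption made purely to terminate the toy loop:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(n_mels=80, threshold=0.5, max_steps=100):
    """Toy autoregressive loop: emit a mel frame per step and stop once
    the sigmoid of the context scalar drops below the threshold (S48)."""
    rng = np.random.default_rng(0)
    frames, step = [], 0
    while step < max_steps:
        frames.append(rng.standard_normal(n_mels))  # stand-in mel frame
        stop_scalar = 10.0 - step                   # assumed declining scalar
        if sigmoid(stop_scalar) < threshold:        # prediction complete (S48)
            break
        step += 1                                   # otherwise loop again (S47)
    return np.stack(frames)                         # predicted mel spectrogram

mel = decode()
print(mel.shape)
```

With this toy scalar schedule, sigmoid(10 − step) first drops below 0.5 at step 11, so the loop emits 12 frames.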
Further, referring to fig. 6, the extracting the context vector in the hidden feature matrix by using the attention network in the decoder to obtain the context vector at the first current time step includes the following steps S411 to S415:
s411, performing linear projection on the hidden feature matrix by using a linear layer in the attention network to obtain a key matrix;
s412, inputting the attention weight value in the attention network into a preset convolution layer to generate a position characteristic matrix;
s413, performing linear projection on the position feature matrix by using the linear layer to obtain an additional feature matrix;
s414, adding the additional characteristic matrix and the key matrix, and processing an addition result by using a preset second activation function to obtain an attention probability vector;
s415, mapping the attention probability vector by using a preset mapping function to obtain a weight vector of the current attention;
and S416, multiplying the current attention weight vector and the hidden feature matrix to obtain a context vector of the first current time step.
In this embodiment of the present invention, the attention weight value may be obtained by concatenating the attention weight of the previous time step and the accumulated attention weights of all previous time steps. The second activation function may be a Tanh function. The mapping function may be a softmax function.
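Steps S411-S416 can be traced end to end in NumPy. All weight matrices below are random stand-ins and the dimensions are assumptions; only the data flow (linear projection to keys, convolved location features, tanh scoring, softmax mapping, weighted sum) follows the text:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, d_att, k = 6, 8, 4, 3       # time steps, feature dim, attention dim, conv width

H = rng.standard_normal((T, d))              # hidden feature matrix from the encoder
W_key = rng.standard_normal((d, d_att))      # linear layer -> key matrix (S411)
conv_kernel = rng.standard_normal(k)         # 1-D conv over attention weights (S412)
W_loc = rng.standard_normal((1, d_att))      # projection of location features (S413)
v = rng.standard_normal(d_att)               # scoring vector (assumption)

prev_weights = np.full(T, 1.0 / T)           # previous-step attention weights

keys = H @ W_key                                           # (T, d_att) key matrix
loc = np.convolve(prev_weights, conv_kernel, mode="same")  # (T,) location features
extra = loc[:, None] @ W_loc                               # (T, d_att) additional features
scores = np.tanh(keys + extra) @ v                         # add + Tanh activation (S414)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                                   # softmax mapping (S415)
context = weights @ H                                      # context vector (S416)
print(context.shape)
```

The softmax guarantees the attention weights form a probability distribution over the T encoder time steps, so the context vector is a convex combination of the hidden features.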
And S5, performing residual error connection on the predicted Mel frequency spectrum by using the residual error network to obtain a target Mel frequency spectrum.
In an embodiment of the present invention, the residual network includes convolutional layers and their associated activation functions.
The embodiment of the invention uses the residual network to perform residual connection on the predicted Mel frequency spectrum and determine the final output Mel frequency spectrum of the target text, thereby realizing a preliminary conversion from text to speech and ensuring the feasibility of text-based voice changing.
Further, referring to fig. 7, as an alternative embodiment of the present invention, the residual connection of the predicted mel spectrum by using the residual network to obtain the target mel spectrum includes the following steps S51 and S52:
s51, performing residual error calculation on the predicted Mel frequency spectrum by using a preset residual error network to obtain a residual error Mel frequency spectrum;
and S52, overlapping the residual Mel frequency spectrum and the predicted Mel frequency spectrum to obtain a target Mel frequency spectrum.
In an optional embodiment of the invention, the predicted Mel frequency spectrum obtained through N decoding steps is fed into the residual network, which generates a residual that is superposed with the predicted Mel frequency spectrum to produce the target Mel frequency spectrum. The residual network is composed of 5 convolutional layers, each consisting of 512 convolution kernels of shape 5 × 1; batch normalization is applied to each convolutional layer, and every convolutional layer except the last is activated with a Tanh activation function.
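The residual superposition of steps S51 and S52 can be sketched as follows. The channel count is scaled down from the stated 512 kernels so the sketch stays small, and all weights are random placeholders rather than a trained network; the structure (5 convolutional layers, Tanh plus batch normalization on all but the last, then addition back onto the input) follows the text.

```python
import numpy as np

def conv1d(x, w):
    """Same-padding 1-D convolution along time. x: (T, C_in), w: (k, C_in, C_out)."""
    k, _, c_out = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], c_out))
    for t in range(x.shape[0]):
        out[t] = np.einsum("kc,kco->o", xp[t:t + k], w)
    return out

def batch_norm(x, eps=1e-5):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def postnet_residual(mel, channels=32, n_layers=5, kernel=5, rng=None):
    """S51: compute the residual with a conv stack; S52: superpose it on the input."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_mels = mel.shape[1]
    dims = [n_mels] + [channels] * (n_layers - 1) + [n_mels]
    h = mel
    for i in range(n_layers):
        w = rng.normal(scale=0.05, size=(kernel, dims[i], dims[i + 1]))
        h = conv1d(h, w)
        if i < n_layers - 1:                   # last layer is linear, per the text
            h = batch_norm(np.tanh(h))
    return mel + h                             # S52: residual + predicted mel
```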
And S6, performing audio conversion on the target Mel frequency spectrum by using the vocoder to obtain target audio. In the embodiment of the invention, the vocoder may be a publicly available WaveNet vocoder or WaveGlow vocoder.
The embodiment of the invention utilizes the vocoder to perform audio conversion on the target Mel frequency spectrum to obtain the target audio, thereby realizing voice change, reducing the influence of environment and human factors in the voice change process and improving the accuracy and efficiency of voice change.
Further, referring to fig. 8, as an alternative embodiment of the present invention, the audio conversion of the target mel spectrum by using the vocoder to obtain the target audio includes the following steps S61 and S62:
s61, performing voice waveform conversion on the target Mel frequency spectrum by using the vocoder to obtain a target voice waveform;
and S62, performing audio conversion on the target voice waveform to obtain a target audio.
In an alternative embodiment of the present invention, the conversion of the waveform to audio may be achieved by sampling, quantizing, and encoding the target speech waveform.
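The sampling/quantization/encoding step above can be illustrated with Python's standard-library `wave` module. The sample rate, 16-bit depth, and the sine test tone standing in for the target speech waveform are illustrative assumptions, not values specified by the patent.

```python
import io
import math
import struct
import wave

def encode_waveform(samples, sample_rate=22050):
    """Quantize a float waveform in [-1, 1] to 16-bit PCM and encode it as a
    mono WAV container; returns the encoded bytes."""
    pcm = b"".join(
        struct.pack("<h", max(-32768, min(32767, int(round(s * 32767)))))
        for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 16-bit quantization
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()

# a 440 Hz test tone as a stand-in "target speech waveform"
tone = [math.sin(2 * math.pi * 440 * t / 22050) for t in range(2205)]
wav_bytes = encode_waveform(tone)
```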
In an alternative embodiment of the present invention, WaveGlow may be selected as the vocoder to convert the target Mel spectrum into the target audio. The WaveGlow vocoder is a flow-based model that can generate high-quality audio samples in parallel, thereby increasing the speed of speech synthesis.
The embodiment of the invention obtains the pronunciation attribute of each character in the target text by converting the target text into a phoneme sequence, which avoids pronunciation errors caused by polyphonic characters and improves the accuracy of voice changing. Spectrum conversion is then performed on the audio data to obtain the target frequency spectrum, which determines the frequency of the voice change, thereby ensuring the direction of the voice change and further improving its accuracy. Next, the encoder performs context feature extraction on the phoneme sequence to obtain the hidden feature matrix; the decoder predicts the Mel frequency spectrum of the target text from the hidden feature matrix and the target frequency spectrum to obtain the predicted Mel frequency spectrum; and the residual network performs residual connection on the predicted Mel frequency spectrum to obtain the target Mel frequency spectrum, completing the conversion from text to speech spectrum and reducing the influence of noise during voice changing. Finally, the vocoder performs audio conversion on the target Mel frequency spectrum to obtain the target audio, completing the text-based voice changing process and reducing the influence of the environment, thereby improving the accuracy and efficiency of voice changing. Therefore, the text-based voice changing method provided by the embodiment of the invention can improve the accuracy and efficiency of voice changing.
Fig. 9 is a functional block diagram of the text-based voice changing apparatus according to the present invention.
The text-based voice changing apparatus 100 according to the present invention may be installed in an electronic device. According to the implemented functions, the text-based voice changing apparatus 100 may include a phoneme sequence conversion module 101, a target Mel spectrum obtaining module 102 and a target audio obtaining module 103. These modules, which may also be referred to as units, are series of computer program segments that can be executed by a processor of the electronic device to perform fixed functions, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the phoneme sequence conversion module 101 is configured to obtain a target text and audio data, and perform phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence, where the speech synthesis model includes an encoder, a decoder, a residual network, and a vocoder.
In the embodiment of the present invention, the source and the text type of the target text may take various forms, wherein the text type includes Chinese, English, and the like. The audio data may be speech data containing the target timbre for the voice change. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech, analyzed according to the pronunciation actions within a syllable, with one action forming one phoneme; for example, the phonemes of a Chinese character may be its pinyin and tone.
In the optional embodiment of the invention, the target text and audio data can be obtained through network downloading or user input, so that the labor cost is saved, the influence of the ambient environment on the speech synthesis is reduced, and the accuracy of the speech change is improved.
In the embodiment of the invention, phoneme conversion is performed on the target text by using the preset speech synthesis model to obtain the phoneme sequence, i.e., the most basic units of the target text's pronunciation, so that pronunciation errors are avoided and the accuracy of speech synthesis and voice changing is improved.
Further, as an optional embodiment of the present invention, the performing phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence includes:
performing language analysis on the target text by using a language analysis tool to determine the language of the target text;
performing sentence segmentation processing on the target text by using the word segmentation rule corresponding to the language type to obtain a segmented sentence text;
converting non-characters in the segmented sentence text into characters according to a preset text format rule;
performing word segmentation processing on the segmented sentence text to obtain a word segmentation text;
mapping the word segmentation text according to a preset character phoneme mapping dictionary to obtain phonemes;
performing vector conversion on the phoneme to obtain a phoneme vector;
and coding and sequencing the phoneme vectors according to the text sequence to obtain a phoneme sequence.
In the embodiment of the present invention, the language analysis tool may be translation software. The preset text format rule may be that, if the obtained target text contains Arabic numerals, the numerals are converted into characters and the text is then standardized according to the set rule; for example, in "there are 56 persons", "56" is an Arabic numeral and needs to be converted into the Chinese characters for "fifty-six", which facilitates subsequent processes such as character-to-phoneme conversion. The character-phoneme mapping dictionary contains each character and its corresponding phoneme.
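The numeral-to-character rule can be sketched as follows. `number_to_chinese` and `normalize_text` are minimal hypothetical helpers covering integers 0–9999, not a production text-normalization front end; the patent does not specify the conversion algorithm.

```python
import re

CN_DIGITS = "零一二三四五六七八九"
CN_UNITS = ["", "十", "百", "千"]

def number_to_chinese(n):
    """Convert an integer in [0, 9999] to Chinese characters."""
    if n == 0:
        return CN_DIGITS[0]
    s = str(n)
    parts = []
    for i, ch in enumerate(s):
        d = int(ch)
        if d:
            parts.append(CN_DIGITS[d] + CN_UNITS[len(s) - 1 - i])
        elif parts and not parts[-1].endswith(CN_DIGITS[0]):
            parts.append(CN_DIGITS[0])   # a single 零 marks skipped positions
    text = "".join(parts).rstrip(CN_DIGITS[0])
    # conventional contraction: 一十X -> 十X
    return text[1:] if text.startswith("一十") else text

def normalize_text(text):
    """Replace every run of Arabic numerals with its Chinese reading."""
    return re.sub(r"\d+", lambda m: number_to_chinese(int(m.group())), text)
```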
In an optional embodiment of the present invention, a pronunciation rule of a target text may be determined by performing language analysis on the target text, and further, the target text is subjected to word segmentation processing to accurately perform phoneme conversion on the target text to obtain phonemes, and finally, the phonemes are coded and sequenced to obtain a phoneme sequence, so that accuracy of the phoneme sequence is ensured by arranging, confusion of pronunciation of the target text is avoided, and accuracy of speech synthesis is improved.
In another alternative embodiment of the present invention, the phoneme conversion of the target text may be implemented with an open-source grapheme-to-phoneme conversion tool, such as G2P, to obtain the phoneme sequence.
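The dictionary mapping, vector conversion, and ordering steps listed above can be sketched as follows. The toy `PHONEME_DICT` pinyin entries and the integer-id vectorization are illustrative assumptions; a real system uses a full pronunciation lexicon and learned embeddings.

```python
# toy character-to-phoneme dictionary (illustrative pinyin entries only)
PHONEME_DICT = {"你": "ni3", "好": "hao3", "中": "zhong1", "国": "guo2"}

def text_to_phoneme_sequence(words):
    """Map each segmented character to its phoneme, convert the phoneme to a
    vector (here a single integer id), and order the results by text position."""
    phonemes = [PHONEME_DICT[w] for w in words]
    vocab = {p: i for i, p in enumerate(sorted(set(PHONEME_DICT.values())))}
    return [(pos, p, vocab[p]) for pos, p in enumerate(phonemes)]
```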
The target mel-frequency spectrum obtaining module 102 is configured to perform frequency spectrum conversion on the audio data to obtain a target frequency spectrum, perform context feature extraction on the phoneme sequence by using the encoder to obtain a hidden feature matrix, predict the mel-frequency spectrum of the target text by using the decoder according to the hidden feature matrix and the target frequency spectrum to obtain a predicted mel-frequency spectrum, and perform residual error connection on the predicted mel-frequency spectrum by using the residual error network to obtain the target mel-frequency spectrum.
According to the embodiment of the invention, the target frequency spectrum is obtained by carrying out frequency spectrum conversion on the audio data, and the frequency spectrum of the sound changing target tone is obtained, so that the sound changing accuracy is ensured, and the sound changing efficiency and accuracy are improved.
Further, in an optional embodiment of the present invention, the performing spectrum conversion on the audio data to obtain a target spectrum includes:
pre-emphasis processing, framing processing and windowing processing are carried out on the audio data to obtain a target audio signal;
and carrying out Fourier transform on the target audio signal to obtain a target frequency spectrum.
In an alternative embodiment of the present invention, pre-emphasis of the audio data is implemented with a high-pass filter having a first-order FIR transfer function, thereby emphasizing the high-frequency portion of the audio data and reducing the effect of lip radiation from the speaker.
In an optional embodiment of the invention, the audio data is a non-stationary speech signal, mainly due to transient changes of the vocal organs. Therefore, short-time processing is applied: the pre-emphasized signal is divided into short frames, within which the audio data can be regarded as stationary.
In an optional embodiment of the present invention, since framing introduces discontinuities at the beginning and end of each frame, a window function is applied to the framed signal to reduce this discontinuity.
In an optional embodiment of the invention, the windowed signal is converted into a frequency spectrum through the Fourier transform, realizing the conversion of the audio data into the target frequency spectrum, i.e., the spectrum of the target timbre for the voice change, so that text can be converted into speech with the target timbre and the accuracy of voice changing is improved.
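The two steps above (pre-emphasis, framing, and windowing, followed by the Fourier transform) can be sketched in numpy. The frame length, hop size, pre-emphasis coefficient, and the Hamming window choice are common defaults assumed here, not values specified by the patent.

```python
import numpy as np

def audio_to_spectrum(signal, frame_len=256, hop=128, pre_emph=0.97):
    """Pre-emphasis (first-order FIR high-pass), framing, Hamming windowing,
    and FFT; returns the magnitude spectrum, shape (n_frames, frame_len//2 + 1)."""
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    frames *= np.hamming(frame_len)   # windowing reduces frame-edge discontinuity
    return np.abs(np.fft.rfft(frames, axis=1))
```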
In the embodiment of the invention, the encoder comprises convolutional layers and a bidirectional long short-term memory (BiLSTM) network. The hidden feature matrix includes information such as the context vectors of the phoneme sequence.
In the embodiment of the invention, the meaning of each character in the target text is often closely related to its context. For example, the Chinese character "好" ("good") has two pronunciations, and analyzing the character in isolation cannot determine which pronunciation applies, which easily leads to pronunciation errors. Therefore, the context feature information of each character needs to be extracted to further ensure the accuracy of speech synthesis.
Further, as an optional embodiment of the present invention, the performing context feature extraction on the phoneme sequence by using the encoder to obtain a hidden feature matrix includes:
performing convolution processing on the phoneme sequence by utilizing convolution layers with preset number of layers in the encoder to obtain a feature matrix of the phoneme sequence;
performing rectified linear unit (ReLU) activation processing and batch normalization processing on the feature matrix to obtain an optimized feature matrix;
and calculating the optimized feature matrix by using the bidirectional long short-term memory network preset in the encoder to obtain the hidden feature matrix.
In the embodiment of the present invention, the bidirectional long short-term memory network may be used to obtain and store the context vectors of the phoneme sequence.
In the optional embodiment of the invention, the encoder is used for extracting the features of the phoneme sequence to obtain the hidden feature matrix, and the hidden feature matrix contains information such as the context vector of the phoneme sequence, so that the context features of the phoneme sequence can be obtained by obtaining the hidden feature matrix, thereby improving the influence of the context features of the phoneme sequence on the phoneme sequence and improving the pronunciation accuracy of the speech synthesis model.
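The activation and normalization step in the encoder can be sketched as follows; the convolutional layers and the bidirectional LSTM are omitted for brevity, and the normalization axis and epsilon are common defaults assumed here.

```python
import numpy as np

def optimize_features(feature_matrix, eps=1e-5):
    """ReLU activation followed by batch normalization over the time axis:
    the 'optimized feature matrix' step before the bidirectional LSTM."""
    activated = np.maximum(feature_matrix, 0.0)   # rectified linear unit
    mean = activated.mean(axis=0)
    var = activated.var(axis=0)
    return (activated - mean) / np.sqrt(var + eps)
```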
In an embodiment of the present invention, the decoder may be an autoregressive recurrent neural network, wherein the autoregressive recurrent neural network includes an attention network and a post-processing network.
According to the hidden feature matrix and the target frequency spectrum, the decoder is used to predict the Mel frequency spectrum of the target text to obtain the predicted Mel frequency spectrum, ensuring that the timbre of the speech synthesis result is consistent with the audio data.
Further, as an optional embodiment of the present invention, the predicting the mel spectrum of the target text by using the decoder according to the hidden feature matrix and the target spectrum to obtain a predicted mel spectrum includes:
extracting a context vector in the hidden feature matrix by using an attention network in the decoder to obtain a context vector of a first current time step;
performing a concatenation operation on the context vector of the first current time step and a preset Mel frequency spectrum, and inputting the concatenation result into the two-layer long short-term memory layer in the decoder to obtain a context vector of a second current time step;
performing first linear projection on the context vector of the second current time step by utilizing a post-processing network in the decoder to obtain a context scalar quantity of the current time step;
according to the target frequency spectrum, performing second linear projection on the context vector of the second current time step by using the post-processing network, and performing Mel frequency spectrum prediction on the context scalar subjected to the second linear projection to obtain a Mel frequency spectrum of the second current time step;
calculating the probability of completing prediction of the Mel frequency spectrum by using a preset first activation function according to the context scalar of the current time step;
judging whether the probability of completing prediction of the Mel frequency spectrum is smaller than a preset threshold value or not;
when the probability of completing the Mel frequency spectrum prediction is not less than the threshold, performing a concatenation operation on the context vector of the second current time step and the Mel frequency spectrum of the second current time step, and returning to the step of inputting the concatenation result into the preset two-layer long short-term memory layer;
and when the probability of the prediction completion of the Mel frequency spectrum is smaller than the threshold value, obtaining a predicted Mel frequency spectrum.
In an embodiment of the present invention, the attention network includes a location-sensitive attention mechanism and two long short-term memory (LSTM) layers, and is mainly used to determine which part of the encoder output should be focused on. The first activation function may be a sigmoid function.
Further, the extracting a context vector in the hidden feature matrix by using an attention network in the decoder to obtain a context vector at a first current time step includes:
carrying out linear projection on the hidden feature matrix by utilizing a linear layer in the attention network to obtain a key matrix;
inputting the attention weight value in the attention network into a preset convolution layer to generate a position characteristic matrix;
performing linear projection on the position feature matrix by using the linear layer to obtain an additional feature matrix;
adding the additional feature matrix and the key matrix, and processing an addition result by using a preset second activation function to obtain an attention probability vector;
mapping the attention probability vector by using a preset mapping function to obtain a weight vector of the current attention;
and multiplying the current attention weight vector and the hidden feature matrix to obtain a context vector of a first current time step.
In this embodiment of the present invention, the attention weight value may be obtained by concatenating the attention weight of the previous time step and the accumulated attention weights of all previous time steps. The second activation function may be a Tanh function. The mapping function may be a softmax function.
In an embodiment of the present invention, the residual network includes convolutional layers and their associated activation functions.
The embodiment of the invention uses the residual network to perform residual connection on the predicted Mel frequency spectrum and determine the final output Mel frequency spectrum of the target text, thereby realizing a preliminary conversion from text to speech and ensuring the feasibility of text-based voice changing.
Further, as an optional embodiment of the present invention, the performing residual error connection on the predicted mel frequency spectrum by using the residual error network to obtain a target mel frequency spectrum includes:
performing residual error calculation on the predicted Mel frequency spectrum by using a preset residual error network to obtain a residual Mel frequency spectrum;
and superposing the residual Mel frequency spectrum and the predicted Mel frequency spectrum to obtain a target Mel frequency spectrum.
In an optional embodiment of the invention, the predicted Mel frequency spectrum obtained through N decoding steps is fed into the residual network, which generates a residual that is superposed with the predicted Mel frequency spectrum to produce the target Mel frequency spectrum. The residual network is composed of 5 convolutional layers, each consisting of 512 convolution kernels of shape 5 × 1; batch normalization is applied to each convolutional layer, and every convolutional layer except the last is activated with a Tanh activation function.
The target audio obtaining module 103 is configured to perform audio conversion on the target mel spectrum by using the vocoder to obtain a target audio.
The embodiment of the invention utilizes the vocoder to perform audio conversion on the target Mel frequency spectrum to obtain the target audio, thereby realizing voice change, reducing the influence of environment and human factors in the voice change process and improving the accuracy and efficiency of voice change.
Further, as an optional embodiment of the present invention, the performing audio conversion on the target mel spectrum by using the vocoder to obtain a target audio includes:
performing voice waveform conversion on the target Mel frequency spectrum by using the vocoder to obtain a target voice waveform;
and carrying out audio conversion on the target voice waveform to obtain a target audio.
In an alternative embodiment of the present invention, the conversion of the waveform to audio may be achieved by sampling, quantizing, and encoding the target speech waveform.
In an alternative embodiment of the present invention, WaveGlow may be selected as the vocoder to convert the target Mel spectrum into the target audio. The WaveGlow vocoder is a flow-based model that can generate high-quality audio samples in parallel, thereby increasing the speed of speech synthesis.
Fig. 10 is a schematic structural diagram of an electronic device implementing the text-based voice changing method according to the present invention.
The electronic device may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a text-based voice changing program, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as the code of the text-based voice changing program, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the Control Unit of the electronic device; it connects the various components of the whole electronic device through various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules stored in the memory 11 (e.g., the text-based voice changing program) and calling data stored in the memory 11.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The communication bus 12 is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
Fig. 10 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 10 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Optionally, the communication interface 13 may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which is generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the communication interface 13 may further include a user interface, which may be a Display (Display), an input unit (such as a Keyboard (Keyboard)), and optionally, a standard wired interface, or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The text-based voice changing program stored in the memory 11 of the electronic device is a combination of a plurality of computer programs which, when executed in the processor 10, may implement:
acquiring a target text and audio data, and performing phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a vocoder;
carrying out frequency spectrum conversion on the audio data to obtain a target frequency spectrum;
extracting the context feature of the phoneme sequence by using the encoder to obtain a hidden feature matrix;
predicting the Mel frequency spectrum of the target text by using the decoder according to the hidden feature matrix and the target frequency spectrum to obtain a predicted Mel frequency spectrum;
performing residual connection on the predicted Mel frequency spectrum by using the residual network to obtain a target Mel frequency spectrum;
and performing audio conversion on the target Mel frequency spectrum by using the vocoder to obtain a target audio.
Specifically, the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the computer program, which is not described herein again.
Further, the integrated module/unit of the electronic device, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. The computer-readable medium may be non-volatile or volatile. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a Read-Only Memory (ROM).
Embodiments of the present invention may also provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor of an electronic device, the computer program may implement:
acquiring a target text and audio data, and performing phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a vocoder;
carrying out frequency spectrum conversion on the audio data to obtain a target frequency spectrum;
extracting the context feature of the phoneme sequence by using the encoder to obtain a hidden feature matrix;
predicting the Mel frequency spectrum of the target text by using the decoder according to the hidden feature matrix and the target frequency spectrum to obtain a predicted Mel frequency spectrum;
performing residual connection on the predicted Mel frequency spectrum by using the residual network to obtain a target Mel frequency spectrum;
and performing audio conversion on the target Mel frequency spectrum by using the vocoder to obtain a target audio.
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed electronic device, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or devices recited in the system claims may also be implemented by one unit or device in software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention is described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope.

Claims (10)

1. A text-based voice changing method, the method comprising:
acquiring target text and audio data, and performing phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a vocoder;
carrying out frequency spectrum conversion on the audio data to obtain a target frequency spectrum;
extracting the context feature of the phoneme sequence by using the encoder to obtain a hidden feature matrix;
predicting the Mel frequency spectrum of the target text by using the decoder according to the hidden feature matrix and the target frequency spectrum to obtain a predicted Mel frequency spectrum;
performing residual connection on the predicted Mel frequency spectrum by using the residual network to obtain a target Mel frequency spectrum;
and performing audio conversion on the target Mel frequency spectrum by using the vocoder to obtain target audio.
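The steps of claim 1 can be read as a straight composition of the four named components. The sketch below is a hypothetical illustration only; the `demo` object and its stand-in callables are not the patent's implementation:

```python
from types import SimpleNamespace

def change_voice(target_text, audio, synth):
    """Hypothetical sketch of the claim-1 pipeline; `synth` bundles the
    encoder, decoder, residual network and vocoder as plain callables."""
    phonemes = synth.to_phonemes(target_text)               # phoneme conversion
    target_spectrum = synth.spectrum(audio)                 # frequency spectrum conversion
    hidden = synth.encoder(phonemes)                        # context feature extraction
    predicted_mel = synth.decoder(hidden, target_spectrum)  # Mel spectrum prediction
    target_mel = predicted_mel + synth.residual(predicted_mel)  # residual connection
    return synth.vocoder(target_mel)                        # Mel spectrum -> audio

# stand-in components, just to exercise the pipeline's data flow
demo = SimpleNamespace(
    to_phonemes=lambda t: [1, 2],
    spectrum=lambda a: a,
    encoder=lambda p: sum(p),
    decoder=lambda h, s: h + s,
    residual=lambda m: 0,
    vocoder=lambda m: m,
)
result = change_voice("target text", 3, demo)
```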
2. The method of claim 1, wherein the extracting the context feature of the phoneme sequence by the encoder to obtain the hidden feature matrix comprises:
performing convolution processing on the phoneme sequence by using a preset number of convolutional layers in the encoder to obtain a feature matrix of the phoneme sequence;
performing rectified linear unit (ReLU) activation processing and batch normalization processing on the feature matrix to obtain an optimized feature matrix;
and calculating the optimized feature matrix by using a bidirectional long short-term memory (LSTM) network preset in the encoder to obtain the hidden feature matrix.
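A minimal numpy sketch of the claim-2 encoder follows. Layer sizes, the single-sequence normalization, and the forward/backward LSTM sharing one weight set are simplifying assumptions, not the patent's design:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    # Normalize each feature channel over the time axis (the claim's batch
    # normalization, simplified here to a single sequence)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def conv1d(x, w):
    # x: (T, C_in), w: (C_out, C_in, K); 'same' padding over time
    k = w.shape[2]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return np.stack([np.einsum('kc,ock->o', xp[t:t + k], w)
                     for t in range(x.shape[0])])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm(x, Wx, Wh, b):
    # Minimal LSTM over time; returns all hidden states, shape (T, H)
    H = Wh.shape[1]
    h, c, out = np.zeros(H), np.zeros(H), []
    for xt in x:
        i, f, g, o = np.split(Wx @ xt + Wh @ h + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out.append(h)
    return np.stack(out)

def encoder(phonemes, conv_ws, lstm_params):
    x = phonemes
    for w in conv_ws:                           # stacked convolution layers
        x = batch_norm(relu(conv1d(x, w)))      # ReLU activation + normalization
    fwd = lstm(x, *lstm_params)
    bwd = lstm(x[::-1], *lstm_params)[::-1]     # backward pass (weights shared for brevity)
    return np.concatenate([fwd, bwd], axis=1)   # bidirectional hidden feature matrix

rng = np.random.default_rng(0)
T, C, H = 5, 8, 4
conv_ws = [rng.standard_normal((C, C, 3)) * 0.1 for _ in range(2)]
params = (rng.standard_normal((4 * H, C)) * 0.1,
          rng.standard_normal((4 * H, H)) * 0.1,
          np.zeros(4 * H))
hidden = encoder(rng.standard_normal((T, C)), conv_ws, params)
```

The bidirectional concatenation doubles the hidden size, so a sequence of T phoneme frames yields a (T, 2H) hidden feature matrix.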
3. The method of claim 1, wherein the predicting the Mel frequency spectrum of the target text by the decoder according to the hidden feature matrix and the target frequency spectrum to obtain a predicted Mel frequency spectrum comprises:
extracting a context vector in the hidden feature matrix by using an attention network in the decoder to obtain a context vector of a first current time step;
concatenating the context vector of the first current time step with a preset Mel frequency spectrum, and inputting the concatenated result into a two-layer long short-term memory (LSTM) layer in the decoder to obtain a context vector of a second current time step;
performing first linear projection on the context vector of the second current time step by utilizing a post-processing network in the decoder to obtain a context scalar of the current time step;
according to the target frequency spectrum, performing second linear projection on the context vector of the second current time step by using the post-processing network, and performing Mel frequency spectrum prediction on the context scalar after the second linear projection to obtain a Mel frequency spectrum of the second current time step;
calculating, according to the context scalar of the current time step, the probability that prediction of the Mel frequency spectrum is complete by using a preset first activation function;
judging whether the probability that prediction of the Mel frequency spectrum is complete is smaller than a preset threshold;
and when the probability is not less than the threshold, concatenating the context vector of the second current time step with the Mel frequency spectrum of the second current time step and returning to the step of inputting the concatenated result into the two-layer long short-term memory (LSTM) layer in the decoder, until the probability is less than the threshold, thereby obtaining the predicted Mel frequency spectrum.
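The claim-3 loop can be sketched as an autoregressive decoder that feeds each predicted Mel frame back in. The callables below are hypothetical stand-ins for the attention network, the two-layer LSTM, the projection, and the completion-probability activation; note the claim stops decoding once the probability falls below the threshold:

```python
import numpy as np

def decode(context_fn, lstm_step, project_mel, stop_probability,
           n_mels=8, threshold=0.5, max_steps=100):
    """Hypothetical sketch of the claim-3 decoding loop."""
    mel_prev = np.zeros(n_mels)               # preset (all-zero) Mel frame
    frames = []
    for _ in range(max_steps):
        ctx = context_fn()                    # context vector for this time step
        state = lstm_step(np.concatenate([ctx, mel_prev]))  # two-layer LSTM stand-in
        mel_prev = project_mel(state)         # linear projection -> Mel frame
        frames.append(mel_prev)
        # Per the claim, decoding continues while the completion probability
        # is not less than the threshold and ends once it drops below it
        if stop_probability(state) < threshold:
            break
    return np.stack(frames)                   # predicted Mel frequency spectrum

# stand-in components: stop after five steps
stops = iter([1.0, 1.0, 1.0, 1.0, 0.0])
mel = decode(context_fn=lambda: np.ones(4),
             lstm_step=lambda v: v[:8],
             project_mel=lambda s: 0.5 * s,
             stop_probability=lambda s: next(stops))
```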
4. The text-based voice changing method of claim 3, wherein the extracting the context vector in the hidden feature matrix by using the attention network in the decoder to obtain the context vector of the first current time step comprises:
performing linear projection on the hidden feature matrix by using a linear layer in the attention network to obtain a key matrix;
inputting the attention weight values in the attention network into a preset convolution layer to generate a position feature matrix;
performing linear projection on the position feature matrix by using the linear layer to obtain an additional feature matrix;
adding the additional feature matrix and the key matrix, and processing an addition result by using a preset second activation function to obtain an attention probability vector;
mapping the attention probability vector by using a preset mapping function to obtain a weight vector of the current attention;
and multiplying the current attention weight vector and the hidden feature matrix to obtain a context vector of a first current time step.
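The claim-4 steps describe a location-sensitive attention pass. In the sketch below, all weight shapes are illustrative assumptions, and tanh and softmax stand in for the second activation function and the mapping function:

```python
import numpy as np

def location_attention(hidden, prev_weights, W_key, loc_filters, W_loc, v):
    """Illustrative sketch of the claim-4 attention step."""
    keys = hidden @ W_key                                # key matrix
    # convolve previous attention weights -> position feature matrix
    loc = np.stack([np.convolve(prev_weights, f, mode='same')
                    for f in loc_filters], axis=1)
    extra = loc @ W_loc                                  # additional feature matrix
    scores = np.tanh(keys + extra) @ v                   # add, then second activation
    weights = np.exp(scores) / np.exp(scores).sum()      # softmax mapping -> weight vector
    context = weights @ hidden                           # context vector of the time step
    return context, weights

rng = np.random.default_rng(1)
hidden = rng.standard_normal((6, 4))                     # T=6 steps, D=4 features
ctx, weights = location_attention(
    hidden, np.full(6, 1 / 6),                           # uniform previous attention
    rng.standard_normal((4, 3)),
    [rng.standard_normal(5) for _ in range(2)],
    rng.standard_normal((2, 3)),
    rng.standard_normal(3))
```

Feeding the previous attention weights through a convolution is what lets the attention track its position in the phoneme sequence from step to step.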
5. The text-based voice changing method of claim 1, wherein the performing frequency spectrum conversion on the audio data to obtain a target frequency spectrum comprises:
performing pre-emphasis processing, framing processing and windowing processing on the audio data to obtain a target audio signal;
and performing Fourier transform on the target audio signal to obtain the target frequency spectrum.
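The claim-5 chain maps directly onto a short numpy routine. The frame length and hop (25 ms / 10 ms at 16 kHz) and the pre-emphasis coefficient are illustrative assumptions:

```python
import numpy as np

def target_spectrum(audio, frame_len=400, hop=160, preemph=0.97):
    """Sketch of claim 5: pre-emphasis, framing, windowing, Fourier transform."""
    # Pre-emphasis: y[n] = x[n] - a * x[n-1] boosts high frequencies
    emphasized = np.append(audio[0], audio[1:] - preemph * audio[:-1])
    # Framing: split into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: a Hann window tapers each frame to reduce spectral leakage
    frames = frames * np.hanning(frame_len)
    # Fourier transform: magnitude spectrum per frame
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000
spec = target_spectrum(np.sin(2 * np.pi * 440 * np.arange(sr) / sr))
```

With 400-sample frames the FFT bins are 40 Hz apart, so a 440 Hz tone peaks at bin 11.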
6. The method of claim 1, wherein the performing phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence comprises:
performing language analysis on the target text by using a language analysis tool to determine the language of the target text;
performing sentence segmentation processing on the target text by using a sentence segmentation rule corresponding to the language to obtain a segmented sentence text;
converting non-character symbols in the segmented sentence text into characters according to a preset text format rule;
performing word segmentation processing on the segmented sentence text to obtain a word segmentation text;
mapping the word segmentation text according to a preset character phoneme mapping dictionary to obtain phonemes;
performing vector conversion on the phonemes to obtain phoneme vectors;
and coding and sequencing the phoneme vectors according to the text order to obtain the phoneme sequence.
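The front-end steps of claim 6 can be sketched in plain Python. The toy lexicon, the digit table, and the English-only tokenization below are illustrative assumptions; a real system would use a full pronunciation dictionary for the detected language:

```python
import re

# Hypothetical toy grapheme-to-phoneme lexicon and number table
G2P = {"hello": ["HH", "AH", "L", "OW"],
       "world": ["W", "ER", "L", "D"],
       "two":   ["T", "UW"]}
NUMBERS = {"2": "two"}

def text_to_phoneme_sequence(text):
    """Sketch of claim 6: sentence segmentation, non-character normalization,
    word segmentation, dictionary mapping, then integer encoding in text order."""
    vocab = {p: i for i, p in
             enumerate(sorted({p for ps in G2P.values() for p in ps}))}
    phonemes = []
    for sentence in re.split(r"[.!?]+", text.lower()):     # sentence segmentation
        # convert non-character symbols (digits here) into words
        normalized = " ".join(NUMBERS.get(tok, tok) for tok in sentence.split())
        for word in re.findall(r"[a-z]+", normalized):     # word segmentation
            phonemes.extend(G2P.get(word, []))             # phoneme mapping
    return [vocab[p] for p in phonemes]                    # encoded phoneme sequence

seq = text_to_phoneme_sequence("Hello world. 2!")
```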
7. The text-based voice changing method of claim 1, wherein the performing residual connection on the predicted Mel frequency spectrum by using the residual network to obtain a target Mel frequency spectrum comprises:
performing residual calculation on the predicted Mel frequency spectrum by using the residual network to obtain a residual Mel frequency spectrum;
and superposing the residual Mel frequency spectrum and the predicted Mel frequency spectrum to obtain a target Mel frequency spectrum.
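Claim 7 is a standard residual (skip) connection: the residual network predicts a correction, and the correction is added back to the decoder's output. A minimal sketch, with a stand-in lambda in place of the residual network:

```python
import numpy as np

def refine_mel(predicted_mel, residual_net):
    """Sketch of claim 7: residual calculation, then superposition."""
    residual_mel = residual_net(predicted_mel)   # residual Mel frequency spectrum
    return predicted_mel + residual_mel          # superposition -> target Mel spectrum

# stand-in residual network: a small proportional correction
target_mel = refine_mel(np.ones((10, 80)), lambda m: 0.1 * m)
```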
8. A text-based voice changing apparatus, the apparatus comprising:
a phoneme sequence conversion module, configured to acquire target text and audio data and perform phoneme conversion on the target text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a vocoder;
a target Mel frequency spectrum obtaining module, configured to perform frequency spectrum conversion on the audio data to obtain a target frequency spectrum, perform context feature extraction on the phoneme sequence by using the encoder to obtain a hidden feature matrix, predict the Mel frequency spectrum of the target text by using the decoder according to the hidden feature matrix and the target frequency spectrum to obtain a predicted Mel frequency spectrum, and perform residual connection on the predicted Mel frequency spectrum by using the residual network to obtain a target Mel frequency spectrum;
and a target audio acquisition module, configured to perform audio conversion on the target Mel frequency spectrum by using the vocoder to obtain target audio.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to perform the text-based voice changing method of any one of claims 1-7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the text-based voice changing method of any one of claims 1-7.
CN202210416138.4A 2022-04-20 2022-04-20 Text-based voice changing method and device, electronic equipment and storage medium Pending CN114863945A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210416138.4A CN114863945A (en) 2022-04-20 2022-04-20 Text-based voice changing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210416138.4A CN114863945A (en) 2022-04-20 2022-04-20 Text-based voice changing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114863945A true CN114863945A (en) 2022-08-05

Family

ID=82631088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210416138.4A Pending CN114863945A (en) 2022-04-20 2022-04-20 Text-based voice changing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114863945A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189654A (en) * 2023-02-23 2023-05-30 京东科技信息技术有限公司 Voice editing method and device, electronic equipment and storage medium
CN116959422A (en) * 2023-09-21 2023-10-27 深圳麦风科技有限公司 Many-to-many real-time voice sound changing method, equipment and storage medium
CN116959422B (en) * 2023-09-21 2023-11-24 深圳麦风科技有限公司 Many-to-many real-time voice sound changing method, equipment and storage medium

Similar Documents

Publication Publication Date Title
US11881205B2 (en) Speech synthesis method, device and computer readable storage medium
JP7280382B2 (en) End-to-end automatic speech recognition of digit strings
US8990089B2 (en) Text to speech synthesis for texts with foreign language inclusions
CN112086086A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN112397047A (en) Speech synthesis method, device, electronic equipment and readable storage medium
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
CN114863945A (en) Text-based voice changing method and device, electronic equipment and storage medium
WO2022142105A1 (en) Text-to-speech conversion method and apparatus, electronic device, and storage medium
CN112634865B (en) Speech synthesis method, apparatus, computer device and storage medium
CN113327574B (en) Speech synthesis method, device, computer equipment and storage medium
WO2022227190A1 (en) Speech synthesis method and apparatus, and electronic device and storage medium
CN112466273A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111862937A (en) Singing voice synthesis method, singing voice synthesis device and computer readable storage medium
WO2022121158A1 (en) Speech synthesis method and apparatus, and electronic device and storage medium
CN113345431A (en) Cross-language voice conversion method, device, equipment and medium
CN115002491A (en) Network live broadcast method, device, equipment and storage medium based on intelligent machine
CN116469374A (en) Speech synthesis method, device, equipment and storage medium based on emotion space
CN116580698A (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
CN115206284B (en) Model training method, device, server and medium
CN114842880A (en) Intelligent customer service voice rhythm adjusting method, device, equipment and storage medium
Shreekanth et al. Duration modelling using neural networks for Hindi TTS system considering position of syllable in a word
CN112802451A (en) Prosodic boundary prediction method and computer storage medium
CN114373445B (en) Voice generation method and device, electronic equipment and storage medium
CN116645957B (en) Music generation method, device, terminal, storage medium and program product
CN113160793A (en) Speech synthesis method, device, equipment and storage medium based on low resource language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination