CN115966196A - Text-based voice editing method, system, electronic device and storage medium

Info

Publication number
CN115966196A
Authority
CN
China
Prior art keywords
text
voice
representation
duration
speech
Prior art date
Legal status
Pending
Application number
CN202211696422.8A
Other languages
Chinese (zh)
Inventor
Kai Yu
Xie Chen
Zheng Liang
Chenpeng Du
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202211696422.8A priority Critical patent/CN115966196A/en
Publication of CN115966196A publication Critical patent/CN115966196A/en


Landscapes

  • Document Processing Apparatus (AREA)

Abstract

An embodiment of the invention provides a text-based speech editing method, system, electronic device, and storage medium. The method comprises the following steps: inputting the edited text into a text encoder, determining the speech duration corresponding to the modified portion of the edited text, and determining a text representation of the edited text based on the speech duration and the phoneme encoding of the edited text; inputting the speech duration and the speech before modification into a speech encoder, and masking the corresponding modified portion of the speech based on the speech duration to obtain an acoustic representation, a hidden representation with masked context, and a mel spectrogram with a masked region; and inputting the text representation, the masked acoustic representation, and the hidden representation into a joint network to obtain the predicted mel spectrogram for the masked region. Embodiments of the invention allow the model to exploit the contextual information of the original speech, so that the speech predicted for the edited region better matches the original audio, avoiding the unnatural, discontinuous speech produced by concatenation-based methods.

Description

Text-based voice editing method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of intelligent speech, and in particular, to a text-based speech editing method, system, electronic device, and storage medium.
Background
Given a recording and its corresponding text, the user only needs to edit the text, and the system outputs the corresponding edited speech. Text-to-Speech (TTS) synthesis models are closely related to text-based speech editing methods. Current text-based speech editing methods and systems fall into two main categories: concatenation-based methods and end-to-end methods. A neural-network-based TTS model generally takes words or phonemes as input and generates a mel spectrogram, from which a vocoder produces speech, or it generates speech directly.
In text-based speech editing methods and systems based on concatenation, the speech segment for the edited region of the text is typically synthesized by a TTS model or selected from existing speech data, and the resulting segment is then inserted into the corresponding region of the original speech.
End-to-end text-based speech editing methods typically use a neural network model to predict the speech of the edited region of the text: they take the edited text as input and directly output the edited speech or mel spectrogram.
In the process of implementing the invention, the inventors found at least the following problems in the related art:
a TTS model cannot directly perform the text-based speech editing task, and it is difficult for it to synthesize speech for the edited region that matches the speaker, prosody, and other characteristics of the original speech; a large amount of speech data similar to the original speech, together with the corresponding text data, is therefore required to fine-tune the TTS model;
for concatenation-based systems, there is an audible gap between the edited and unedited regions of the generated speech, a discontinuity caused by directly splicing the speech; because the speech-segment generation, conversion, and splicing stages are separate, little use is made of the acoustic features of the original speech, so the speech for the edited text differs greatly from the original speech;
end-to-end methods use extracted speaker features to make the predicted speech of the edited region conform to the original speech, which makes them strongly dependent on those speaker features and thus generalize poorly to new speakers; some methods use an encoder-decoder architecture with autoregressive decoding, which makes prediction inefficient.
Disclosure of Invention
Embodiments of the invention aim to at least solve the problems of incoherent generated speech and poor generalization in text-based speech editing in the prior art.
In a first aspect, an embodiment of the present invention provides a text-based speech editing method, including:
inputting an edited text into a text encoder, determining a first speech duration corresponding to the modified portion of the edited text and a second speech duration corresponding to the entire edited text, and determining a text representation of the edited text based on the second speech duration and the phoneme encoding of the edited text;
inputting the first speech duration and the speech before modification of the edited text into a speech encoder, and masking the modified portion of the speech before modification based on the first speech duration to obtain a masked acoustic representation, a hidden representation with masked context, and a mel spectrogram with a masked region, wherein the length of the text representation is consistent with that of the mel spectrogram with the masked region;
inputting the text representation, the masked acoustic representation, and the hidden representation with masked context into a joint network to obtain a predicted mel spectrogram for the masked region, and obtaining the modified speech of the edited text based on the mel spectrogram with the masked region and the predicted mel spectrogram.
In a second aspect, an embodiment of the present invention provides a text-based speech editing system, including:
a text encoding program module, used to input an edited text into a text encoder, determine a first speech duration corresponding to the modified portion of the edited text and a second speech duration corresponding to the entire edited text, and determine a text representation of the edited text based on the second speech duration and the phoneme encoding of the edited text;
a speech encoding program module, used to input the first speech duration and the speech before modification of the edited text into a speech encoder, and mask the modified portion of the speech before modification based on the first speech duration to obtain a masked acoustic representation, a hidden representation with masked context, and a mel spectrogram with a masked region, wherein the length of the text representation is consistent with that of the mel spectrogram with the masked region;
and a speech editing program module, used to input the text representation, the masked acoustic representation, and the hidden representation with masked context into a joint network to obtain a predicted mel spectrogram for the masked region, and obtain the modified speech of the edited text based on the mel spectrogram with the masked region and the predicted mel spectrogram.
In a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the text-based speech editing method of any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, implement the steps of the text-based speech editing method according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effects: with BERT-based speech context modeling, the speech of the edited region captures rich contextual information from the recorded audio during prediction, including speaker, environment, and pitch characteristics; the model can thus make good use of the contextual information of the original speech, predict speech for the edited region that better matches the original audio, and avoid the unnatural, discontinuous speech produced by concatenation-based methods.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a method for text-based speech editing according to an embodiment of the present invention;
fig. 2 is a schematic overall structure diagram of a text-based speech editing method according to an embodiment of the present invention;
FIG. 3 shows the MCD evaluation results of different baseline models according to an embodiment of the present invention;
FIG. 4 shows the MOS scores of a text-based speech editing method for seen/unseen speakers according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an editing process of a text-based speech editing method according to an embodiment of the present invention;
FIG. 6 is a block diagram of a text-based speech editing system according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device for text-based speech editing according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a text-based speech editing method according to an embodiment of the present invention, which includes the following steps:
s11: inputting an edited text into a text encoder, determining a first voice duration corresponding to a modified part in the edited text and a second voice duration corresponding to the whole edited text, and determining a text representation of the edited text based on the second voice duration and a phoneme code of the edited text;
s12: inputting the first voice duration and the voice before the modification of the edited text into a voice coder, and covering the modified part in the voice before the modification based on the first voice duration to obtain a covered acoustic representation, a hidden representation with a covered context and a Mel frequency spectrum with a covered area, wherein the length of the text representation is consistent with the Mel frequency spectrum with the covered area;
s13: inputting the text representation, the covered acoustic representation and the hidden representation with the covered context into a joint network to obtain a predicted Mel frequency spectrum corresponding to a covered area, and obtaining the modified voice of the edited text based on the Mel frequency spectrum with the covered area and the predicted Mel frequency spectrum.
In this embodiment, the proposed model can be regarded as an effective combination of the non-autoregressive TTS model FastSpeech2 and BERT (Bidirectional Encoder Representations from Transformers), which has excellent context-modeling capability; the model is called BEdit-TTS (a text-based speech editing system with bidirectional Transformers). As shown in fig. 2, the model comprises three parts: a text encoder, a speech encoder, and a joint network. A conventional TTS model takes text as input and targets a mel spectrogram as output, whereas this model takes the text and the masked real speech as input and targets the mel spectrogram of the masked region as output. A text and the speech corresponding to it are prepared in advance, and when the user edits a target word in the text, the edited text is obtained.
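For orientation only, the data flow between the three parts can be sketched as follows in Python; the callables, argument names, and the frame-level splice are illustrative assumptions, not the patent's reference implementation:

```python
import torch

def bedit_tts_edit(text_encoder, speech_encoder, joint_net,
                   edited_phonemes, orig_mel, edit_start):
    """One editing pass returning the full edited mel spectrogram.

    text_encoder / speech_encoder / joint_net are assumed callables
    mirroring the three parts in FIG. 2; names and shapes are illustrative.
    """
    # Text encoder: phoneme IDs -> frame-level text representation plus the
    # predicted duration (in frames) of the edited span.
    text_repr, span_frames = text_encoder(edited_phonemes)

    # Speech encoder: mask span_frames frames at the edit position, yielding
    # the masked acoustic representation, the hidden representation with
    # masked context, and the mel spectrogram with the masked region.
    acoustic_repr, hidden_repr, masked_mel = speech_encoder(
        orig_mel, edit_start, span_frames)

    # Joint network: predict mel frames for the masked region only.
    pred_region = joint_net(text_repr, acoustic_repr, hidden_repr)

    # Splice the predicted region back into the original context frames.
    return torch.cat([masked_mel[:edit_start],
                      pred_region,
                      masked_mel[edit_start + span_frames:]], dim=0)
```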
For step S11, the purpose of the text encoder is to extract, from the input edited text, a text representation of the text as modified by the user.
Specifically, the text encoder includes: a phoneme embedding block, an encoder, a duration predictor, and a length adjuster, wherein:
the phoneme embedding block is used to determine the phoneme embedding of the edited text;
the encoder is used to determine the text representation of the edited text from the phoneme embedding and the corresponding positional encoding;
the duration predictor is used to determine the first speech duration corresponding to the modified portion of the edited text and the second speech duration corresponding to the entire edited text;
and the length adjuster is used to adjust the length of the text representation according to the second speech duration, so that the adjusted text representation is consistent in length with the mel spectrogram with the masked region.
In this embodiment, as shown in the text encoder in the lower left part of FIG. 2, this part is designed similarly to the FastSpeech2 structure and includes a phoneme embedding block, an encoder, a duration predictor, and a length adjuster. The phoneme embedding block determines the phoneme embedding corresponding to the input edited text. The encoder then encodes the phoneme embedding together with the corresponding positional encoding to obtain the text representation of the user-edited text. However, because the content of the edited text has changed, its length must be adjusted in advance so that it matches the acoustic features in subsequent steps. The duration predictor determines the speech duration of the user-modified portion and the speech duration corresponding to the entire edited text.
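The length adjustment itself can be sketched as FastSpeech2-style length regulation, i.e., repeating each phoneme-level vector for its predicted number of frames; this is a minimal sketch under that assumption:

```python
import torch

def length_adjust(phoneme_repr, durations):
    """Expand per-phoneme encoder outputs to frame level.

    phoneme_repr: (num_phonemes, dim) text representations
    durations:    (num_phonemes,) integer frame counts per phoneme
    The result has sum(durations) frames, matching the mel spectrogram
    length as required by the subsequent steps.
    """
    return torch.repeat_interleave(phoneme_repr, durations, dim=0)

# Example: three phonemes lasting 2, 4, and 3 frames -> 9 frame vectors.
frames = length_adjust(torch.randn(3, 256), torch.tensor([2, 4, 3]))
assert frames.shape == (9, 256)
```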
This part extracts text features whose length is consistent with the mel spectrogram of the corresponding speech, but these features do not contain some key acoustic characteristics, such as speaker identity and prosody. The method therefore uses a separate speech encoder to extract the acoustic information.
For step S12, the existing speech corresponding to the text before the user's modification is input into the speech encoder, shown in the lower right part of fig. 2. This module takes the real speech as input and aims to learn rich acoustic information, including speaker identity, prosody, channel effects, and the like.
As an embodiment, the speech encoder includes: a masking operation block and a conversion encoder, wherein:
the masking operation block is used to receive the speech before modification of the edited text and its corresponding mel spectrogram, and to mask the corresponding modified portion of that speech according to the first speech duration corresponding to the modified portion of the edited text, obtaining the masked acoustic representation and the mel spectrogram with the masked region;
and the conversion encoder is used to encode the mel spectrogram with the masked region to obtain the hidden representation with masked context.
In this embodiment, the duration of the masked region determined by the text encoder is used to mask the corresponding modified portion of the speech and the mel spectrogram, yielding the speech and mel spectrogram with the modified portion masked; a specific masking process may fill the masked region with a fixed value, such as 0 or 1. The conversion encoder (a Transformer encoder) then computes a continuous, compact hidden representation from the masked mel spectrogram.
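A minimal sketch of this masking step (the fixed fill value and the tensor layout are the only assumptions):

```python
import torch

def mask_region(mel, start, length, mask_value=0.0):
    """Mask the frames of the modified portion with a fixed value.

    mel: (num_frames, n_mels) mel spectrogram of the original speech.
    The masked copy is what the conversion encoder consumes to produce
    the hidden representation with masked context.
    """
    masked = mel.clone()
    masked[start:start + length] = mask_value  # fixed value, e.g. 0 or 1
    return masked
```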
For step S13, as shown in the joint network in the upper part of fig. 2, the purpose of this module is to effectively combine the text information extracted by the text encoder with the acoustic information extracted by the speech encoder, generating pitch and energy features and mel spectrogram features similar to real speech.
Specifically, the joint network includes: a pitch-energy converter and a mel spectrogram decoder, wherein:
the pitch-energy converter is used to generate pitch and energy features imitating real speech, together with predicted mel spectrogram features, from the received text representation, masked acoustic representation, and hidden representation with masked context;
and the mel spectrogram decoder is used to determine the predicted mel spectrogram for the masked region from the pitch and energy features imitating real speech and the predicted mel spectrogram features.
In this embodiment, to predict the masked portion of the speech (i.e., the user-modified target word), the module fuses the text information with the acoustic information and uses a Transformer model to predict the pitch and energy of the masked region; the text information, acoustic information, and predicted pitch and energy are then fed to the mel spectrogram decoder to predict the mel spectrogram features of the masked region. The mel spectrogram decoder is implemented with the feed-forward Transformer of FastSpeech2. The resulting mel spectrogram features of the masked region are spliced with the mel spectrogram with the masked region to obtain the speech after text editing.
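The fusion and two-stage prediction can be sketched as follows; layer sizes, layer counts, and fusion by summation are assumptions, and plain Transformer encoders stand in for the pitch-energy converter and the feed-forward-Transformer decoder:

```python
import torch
import torch.nn as nn

class JointNet(nn.Module):
    """Illustrative joint network: fuse text/acoustic features, predict
    pitch and energy for the masked region, then decode mel frames."""

    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.pitch_energy = nn.TransformerEncoder(layer, num_layers=2)
        self.pe_proj = nn.Linear(dim, 2)    # per-frame pitch and energy
        self.pe_embed = nn.Linear(2, dim)   # embed predictions back
        self.mel_decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mel_proj = nn.Linear(dim, n_mels)

    def forward(self, text_repr, acoustic_repr, hidden_repr):
        # All inputs: (batch, frames, dim); summation is an assumed fusion.
        fused = text_repr + acoustic_repr + hidden_repr
        pe = self.pe_proj(self.pitch_energy(fused))       # pitch/energy
        decoded = self.mel_decoder(fused + self.pe_embed(pe))
        return self.mel_proj(decoded)                     # predicted mel
```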
As an embodiment, the text encoder is trained from modified text data and the speech data corresponding to the modified text data, including:
inputting the modified text data into the text encoder to obtain the predicted speech duration of the speech corresponding to the modified text data;
determining the true speech duration of the speech data using a Gaussian mixture-hidden Markov model;
training the duration predictor and the length adjuster in the text encoder based on the loss between the true speech duration and the predicted speech duration.
In this embodiment, the duration predictor is trained with only the text information as input and does not consider the durations of the phonemes adjacent to the masked region; in that case, when the real speech is spoken unusually fast or slowly, the predicted speaking rate of the masked region may be inconsistent with the recording. To obtain accurate, context-aware phoneme durations, the predicted duration is adjusted as follows:

$$\hat{d}_i = d'_i \cdot \frac{\sum_{j \in M} d_j}{\sum_{j \in M} d'_j}$$

where $\hat{d}_i$ is the adjusted duration of phoneme $x_i$, $M$ is the set of indices of the unmasked phonemes, $d_j$ is the true duration of unmasked phoneme $x_j$, and $d'_j$ is the duration of unmasked phoneme $x_j$ predicted by the duration predictor. The true duration of a phoneme is obtained by forced alignment of phonemes and speech with a Gaussian mixture-hidden Markov model (GMM-HMM). The duration predictor and the length adjuster in the text encoder are trained on the loss between the true and predicted speech durations until convergence.
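A small worked example of this adjustment (plain Python; rounding the result to whole frames is an assumption):

```python
def adjust_durations(pred_dur, true_dur, unmasked_idx, masked_idx):
    """Scale predicted durations of masked phonemes by the ratio of true to
    predicted durations over the unmasked set M, matching the formula above.
    """
    ratio = (sum(true_dur[j] for j in unmasked_idx)
             / sum(pred_dur[j] for j in unmasked_idx))
    return {i: round(pred_dur[i] * ratio) for i in masked_idx}

# Example: the recording is ~20% slower than the predictor expects,
# so the masked phoneme's duration is stretched accordingly.
pred = {0: 10, 1: 8, 2: 12}   # frames from the duration predictor
true = {0: 12, 1: 10}         # true durations of unmasked phonemes (GMM-HMM)
print(adjust_durations(pred, true, unmasked_idx=[0, 1], masked_idx=[2]))
# -> {2: 15}  (approximately 12 * 22/18)
```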
As an embodiment, the modification of the edited text includes: replacement, insertion, and deletion of target words in the edited text.
In this embodiment, if the user replaces a target word in the text, the user substitutes the target word for a word in the original text; the replaced text is then input into the model's text encoder, and the mel spectrogram, pitch, and energy features of the original speech are input into the speech encoder. The duration predictor predicts the duration of the target word from the text information; in the speech encoder, the length of the masked region is then adjusted to match that duration. The masked pitch, energy, and mel spectrogram features are converted into hidden features by the conversion encoder in the speech encoder, after which the inference process described above is carried out. Finally, the mel spectrogram of the target word output by the model is spliced with the original masked mel spectrogram, and the complete spliced mel spectrogram is converted into speech by the vocoder.
If the user inserts a target word at some position in the text, the insertion operation is similar to replacement: the user inserts the target word at a position in the text, the edited text is input into the text encoder, and the mel spectrogram, pitch, and energy features of the original speech are input into the speech encoder. The duration predictor predicts the duration of the target word from the text information; in the speech encoder, a masked region whose length matches the predicted duration of the target word is inserted at the corresponding position in the mel spectrogram and in the pitch and energy features of the original speech. Prediction is then carried out, the generated mel spectrogram of the target word is inserted at the corresponding position in the original mel spectrogram, and the edited mel spectrogram is converted into speech by the vocoder.
If the user deletes a target word from the text, the deletion operation works in one of two ways, as sketched below. In one, the user deletes the target word in the text, the system deletes the mel spectrogram at the corresponding position, and the vocoder converts the edited mel spectrogram into speech. The other works through the replacement operation: the target word together with its adjacent words is replaced by the adjacent words alone, so that deleting the target word from the speech is accomplished via replacement.
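The three frame-level operations can be sketched directly on the mel spectrogram; indices are frame positions, and the predicted regions come from the joint network (a sketch, not the patent's code):

```python
import torch

def replace_frames(mel, start, length, new_region):
    """Replacement: swap `length` frames at `start` for the predicted region."""
    return torch.cat([mel[:start], new_region, mel[start + length:]], dim=0)

def insert_frames(mel, position, new_region):
    """Insertion: splice predicted frames in at `position`."""
    return torch.cat([mel[:position], new_region, mel[position:]], dim=0)

def delete_frames(mel, start, length):
    """Deletion (first mode): drop the frames of the target word."""
    return torch.cat([mel[:start], mel[start + length:]], dim=0)
```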
According to the above embodiments, with BERT-based speech context modeling, the speech of the edited region can capture rich contextual information from the recorded audio during prediction, including speaker, environment, and pitch characteristics; the model can make good use of the contextual information of the original speech, predicting speech for the edited region that better matches the original audio and avoiding the unnatural, discontinuous speech produced by concatenation-based methods.
The method is illustrated with specific experiments on two English datasets, HiFiTTS and LibriTTS. The HiFiTTS dataset contains approximately 292 hours of audio data and corresponding text from 10 speakers in total. For each speaker, 30 sentences were randomly selected to form a seen-speaker test set, and the remaining HiFiTTS data were used as the training set. In addition, to test the model on unseen speakers, 8-9 sentences per speaker were randomly selected from the clean test set (test-clean) of LibriTTS, which contains 39 speakers in total.
All speech audio is sampled at 16 kHz. The original audio is converted into 80-dimensional log-mel filterbank (fbank) features using a frame length of 50 ms and a frame shift of 12.5 ms. Both G2P (grapheme-to-phoneme) conversion and forced-alignment information are obtained by building a GMM-HMM model in Kaldi. The proposed model is built on the ESPnet toolkit, and HiFi-GAN trained on the same training set is used as the vocoder.
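One way to reproduce this feature configuration is torchaudio's Kaldi-compatible frontend; treating this as equivalent to the Kaldi fbank setup used in the experiments is an assumption:

```python
import torchaudio

def extract_fbank(wav_path):
    """80-dim log-mel fbank, 50 ms frames, 12.5 ms shift, 16 kHz audio."""
    waveform, sr = torchaudio.load(wav_path)
    assert sr == 16000, "the experiments use 16 kHz speech throughout"
    return torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=80,
        frame_length=50.0,        # ms
        frame_shift=12.5,         # ms
        sample_frequency=16000.0)
```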
Three baseline models are used for comparison: first, a TTS model generates the entire speech of the edited text; second, a TTS model synthesizes only the speech of the target word, and the synthesized speech is inserted at the corresponding position in the original speech; third, a TTS model synthesizes the entire speech of the edited text, but the speech of the target word is cut out of it and inserted at the corresponding position in the original speech.
Two kinds of evaluation metrics are used, objective and subjective.
Objective evaluation experiment: the objective metric is the average MCD (mel-cepstral distortion) along the DTW (dynamic time warping) path; a lower MCD indicates higher similarity. In the objective experiments, 1-4 words were masked at random in each sentence, and the MCD was computed for the target words and for the whole edited sentence. To avoid the influence of vocoder synthesis, the target-word portion was cut out of the speech synthesized by the proposed system and by the baseline systems and inserted at the corresponding position in the original speech, making the experiment fairer. The results are shown in fig. 3 as the MCD evaluation of the models on the HiFiTTS test set (seen speakers) and the LibriTTS test set (unseen speakers).
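The MCD-over-DTW metric can be computed roughly as below; the scaling constant is the conventional one, while using librosa's DTW and the exact choice of cepstral coefficients are assumptions about the protocol:

```python
import numpy as np
import librosa

def mcd_dtw(mc_ref, mc_syn):
    """Mean mel-cepstral distortion along a DTW path (lower = more similar).

    mc_ref, mc_syn: (frames, dims) mel-cepstral coefficient sequences.
    """
    # librosa.sequence.dtw expects (dims, frames) feature matrices and
    # returns the accumulated cost matrix and the warping path.
    _, path = librosa.sequence.dtw(mc_ref.T, mc_syn.T, metric='euclidean')
    k = 10.0 * np.sqrt(2.0) / np.log(10.0)  # conventional MCD constant
    dists = [np.linalg.norm(mc_ref[i] - mc_syn[j]) for i, j in path]
    return k * float(np.mean(dists))
```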
The proposed BEdit-TTS model obtains the lowest MCD, for both the target-word speech and the whole sentence, on both the seen-speaker and unseen-speaker test sets, indicating that the speech it synthesizes is closer to the real speech and more natural.
Subjective evaluation experiment: the subjective metric is MOS (Mean Opinion Score). In the experiment, 15 sentences were randomly selected from the seen-speaker test set for the replacement and insertion operations, respectively. Since the baseline models use no speaker-adaptation technology, 15 sentences were randomly selected from the unseen-speaker test set for reconstruction and compared with the real speech. The experiment had 15 participants in total; each had to listen to all the audio and give a score, and was informed of the edited region before each sentence was tested. Scores range from 1 to 5, where 1 is very poor, 2 poor, 3 fair, 4 good, and 5 excellent. The results are shown in fig. 4.
On the seen-speaker test set, the edited speech generated by BEdit-TTS after the replacement and insertion operations obtains the highest MOS, indicating that the generated speech has high naturalness and quality, that the characteristics of the generated edited region conform to those of the original speech, and that the boundary between the edited and unedited regions is smooth and natural. On the unseen-speaker test set, the speech that BEdit-TTS reconstructs for the masked region scores close to the real speech, showing that the model still performs well when the speaker is unseen.
In addition, voice cloning (VC) can be realized by means of the model's replacement operation; the specific flow is as follows. Voice cloning is achieved by repeatedly applying the replacement operation. A speech recording and its corresponding text are given as the original speech and original text, together with a target text. First, the target text and the original text are divided into the same number of parts, each part corresponding to one or more words. Then the corresponding part of the original text is replaced with the part of the target text, and the model repeats the replacement operation with the replaced text and the mel spectrogram, pitch, and energy features of the original speech, until every part of the original text has been replaced by the corresponding part of the target text. Finally, the vocoder converts the mel spectrogram obtained after these multiple editing operations into the speech of the target text; the corresponding flow is shown in fig. 5.
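A sketch of this repeated-replacement loop; `split_into_parts` and the `model.replace` interface are hypothetical names for the operations described above, not an API defined by the patent:

```python
def clone_voice(model, vocoder, orig_text, orig_mel, target_text, n_parts):
    """Voice cloning by repeatedly applying the replacement operation."""
    src_parts = split_into_parts(orig_text, n_parts)    # hypothetical helper
    tgt_parts = split_into_parts(target_text, n_parts)
    text, mel = orig_text, orig_mel
    for src, tgt in zip(src_parts, tgt_parts):
        text = text.replace(src, tgt, 1)    # edit one part of the text
        # Re-run the replacement operation on the current text and features.
        mel = model.replace(text, mel, src, tgt)
    return vocoder(mel)  # convert the fully edited mel spectrogram to speech
```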
In general, the method proposes a new text-based speech editing model, named BEdit-TTS, to simplify various operations on recorded audio, including replacement, insertion, and deletion. For the speech editing task, the speech quality and acoustic consistency of the synthesized speech are equally important. To this end, BEdit-TTS integrates the strengths of neural TTS in high-fidelity audio generation with the strengths of BERT in context modeling. Experimental results show that the proposed model generates speech of good quality and high similarity to the recorded audio.
Fig. 6 is a schematic structural diagram of a text-based speech editing system according to an embodiment of the present invention; the system can execute the text-based speech editing method of any of the above embodiments and is configured in a terminal.
The present embodiment provides a text-based speech editing system 10, which includes: a text encoding program module 11, a speech encoding program module 12 and a speech editing program module 13.
The text encoding program module 11 is configured to input an edited text into a text encoder, determine a first speech duration corresponding to the modified portion of the edited text and a second speech duration corresponding to the entire edited text, and determine a text representation of the edited text based on the second speech duration and the phoneme encoding of the edited text. The speech encoding program module 12 is configured to input the first speech duration and the speech before modification into a speech encoder, and mask the modified portion of the speech before modification based on the first speech duration to obtain a masked acoustic representation, a hidden representation with masked context, and a mel spectrogram with a masked region, where the length of the text representation is consistent with that of the mel spectrogram with the masked region. The speech editing program module 13 is configured to input the text representation, the masked acoustic representation, and the hidden representation with masked context into a joint network to obtain a predicted mel spectrogram for the masked region, and obtain the modified speech of the edited text based on the mel spectrogram with the masked region and the predicted mel spectrogram.
An embodiment of the present invention also provides a non-volatile computer storage medium storing computer-executable instructions that can execute the text-based speech editing method of any of the above method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
inputting an edited text into a text encoder, determining a first speech duration corresponding to the modified portion of the edited text and a second speech duration corresponding to the entire edited text, and determining a text representation of the edited text based on the second speech duration and the phoneme encoding of the edited text;
inputting the first speech duration and the speech before modification of the edited text into a speech encoder, and masking the modified portion of the speech before modification based on the first speech duration to obtain a masked acoustic representation, a hidden representation with masked context, and a mel spectrogram with a masked region, wherein the length of the text representation is consistent with that of the mel spectrogram with the masked region;
inputting the text representation, the masked acoustic representation, and the hidden representation with masked context into a joint network to obtain a predicted mel spectrogram for the masked region, and obtaining the modified speech of the edited text based on the mel spectrogram with the masked region and the predicted mel spectrogram.
The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the text-based speech editing method of any of the above method embodiments.
Fig. 7 is a schematic hardware structure diagram of an electronic device for a text-based speech editing method according to another embodiment of the present application, and as shown in fig. 7, the electronic device includes:
one or more processors 710 and a memory 720, one processor 710 being illustrated in fig. 7. The apparatus of the text-based speech editing method may further include: an input device 730 and an output device 740.
The processor 710, the memory 720, the input device 730, and the output device 740 may be connected by a bus or other means, as exemplified by the bus connection in fig. 7.
The memory 720, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the text-based speech editing method in the embodiments of the present application. The processor 710 executes the various functional applications and data processing of the server, i.e., implements the text-based speech editing method of the above method embodiments, by running the non-volatile software programs, instructions, and modules stored in the memory 720.
The memory 720 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data and the like. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 720 may optionally include memory located remotely from processor 710, which may be connected to a mobile device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 may receive input numeric or character information. The output device 740 may include a display device such as a display screen.
The one or more modules are stored in the memory 720 and, when executed by the one or more processors 710, perform the text-based speech editing method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The non-volatile computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the text-based speech editing method of any of the embodiments of the present invention.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has the functions of calculation and processing, and generally has the mobile internet access characteristic. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another like element in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A text-based speech editing method, comprising:
inputting an edited text into a text encoder, determining a first speech duration corresponding to the modified portion of the edited text and a second speech duration corresponding to the entire edited text, and determining a text representation of the edited text based on the second speech duration and the phoneme encoding of the edited text;
inputting the first speech duration and the speech before modification of the edited text into a speech encoder, and masking the modified portion of the speech before modification based on the first speech duration to obtain a masked acoustic representation, a hidden representation with masked context, and a mel spectrogram with a masked region, wherein the length of the text representation is consistent with that of the mel spectrogram with the masked region;
and inputting the text representation, the masked acoustic representation, and the hidden representation with masked context into a joint network to obtain a predicted mel spectrogram for the masked region, and obtaining the modified speech of the edited text based on the mel spectrogram with the masked region and the predicted mel spectrogram.
2. The method of claim 1, wherein the text encoder comprises: a phoneme embedding block, an encoder, a duration predictor, and a length adjuster, wherein:
the phoneme embedding block is used to determine the phoneme embedding of the edited text;
the encoder is used to determine the text representation of the edited text from the phoneme embedding and the corresponding positional encoding;
the duration predictor is used to determine the first speech duration corresponding to the modified portion of the edited text and the second speech duration corresponding to the entire edited text;
and the length adjuster is used to adjust the length of the text representation according to the second speech duration, so that the adjusted text representation is consistent in length with the mel spectrogram with the masked region.
3. The method of claim 1, wherein the speech encoder comprises: a masking operation block and a conversion encoder, wherein:
the masking operation block is used to receive the speech before modification of the edited text and its corresponding mel spectrogram, and to mask the corresponding modified portion of that speech according to the first speech duration corresponding to the modified portion of the edited text, obtaining the masked acoustic representation and the mel spectrogram with the masked region;
and the conversion encoder is used to encode the mel spectrogram with the masked region to obtain the hidden representation with masked context.
4. The method of claim 1, wherein the joint network comprises: a pitch-energy converter and a mel spectrogram decoder, wherein:
the pitch-energy converter is used to generate pitch and energy features imitating real speech, together with predicted mel spectrogram features, from the received text representation, masked acoustic representation, and hidden representation with masked context;
and the mel spectrogram decoder is used to determine the predicted mel spectrogram for the masked region from the pitch and energy features imitating real speech and the predicted mel spectrogram features.
5. The method of claim 2, wherein the text encoder is trained from modified text data and the speech data corresponding to the modified text data, comprising:
inputting the modified text data into the text encoder to obtain the predicted speech duration of the speech corresponding to the modified text data;
determining the true speech duration of the speech data using a Gaussian mixture-hidden Markov model;
and training the duration predictor and the length adjuster in the text encoder based on the loss between the true speech duration and the predicted speech duration.
6. The method of claim 1, wherein the modification of the edited text comprises: replacement, insertion, and deletion of target words in the edited text.
7. A text-based speech editing system, comprising:
a text encoding program module, used to input an edited text into a text encoder, determine a first speech duration corresponding to the modified portion of the edited text and a second speech duration corresponding to the entire edited text, and determine a text representation of the edited text based on the second speech duration and the phoneme encoding of the edited text;
a speech encoding program module, used to input the first speech duration and the speech before modification of the edited text into a speech encoder, and mask the modified portion of the speech before modification based on the first speech duration to obtain a masked acoustic representation, a hidden representation with masked context, and a mel spectrogram with a masked region, wherein the length of the text representation is consistent with that of the mel spectrogram with the masked region;
and a speech editing program module, used to input the text representation, the masked acoustic representation, and the hidden representation with masked context into a joint network to obtain a predicted mel spectrogram for the masked region, and obtain the modified speech of the edited text based on the mel spectrogram with the masked region and the predicted mel spectrogram.
8. The system of claim 7, wherein the text encoder is trained from modified text data and the speech data corresponding to the modified text data, comprising:
inputting the modified text data into the text encoder to obtain the predicted speech duration of the speech corresponding to the modified text data;
determining the true speech duration of the speech data using a Gaussian mixture-hidden Markov model;
and training the duration predictor and the length adjuster in the text encoder based on the loss between the true speech duration and the predicted speech duration.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-6.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202211696422.8A 2022-12-28 2022-12-28 Text-based voice editing method, system, electronic device and storage medium Pending CN115966196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211696422.8A CN115966196A (en) 2022-12-28 2022-12-28 Text-based voice editing method, system, electronic device and storage medium


Publications (1)

Publication Number Publication Date
CN115966196A true CN115966196A (en) 2023-04-14

Family

ID=87359694



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination