WO2023207541A1

WO2023207541A1 - Speech processing method and related device

Info

Publication number: WO2023207541A1
Application number: PCT/CN2023/086497
Authority: WO
Inventors: 邓利群; 朱杰明; 张立超; 赵洲
Original assignee: 华为技术有限公司
Priority date: 2022-04-29
Filing date: 2023-04-06
Publication date: 2023-11-02
Also published as: CN114882862A

Abstract

A speech processing method and a related device, which are applied to the field of song editing. The method comprises: acquiring original speech and second text; predicting a second pitch feature of the second text according to a first pitch feature of non-edited speech in the original speech and information of target text; according to the second pitch feature and the second text, and by means of a neural network, obtaining a first speech feature corresponding to the second text; and according to the first speech feature, generating target edited speech corresponding to the second text. By means of the present application, a pitch feature of second text is predicted, a first speech feature of the second text is generated according to the pitch feature, and target edited speech corresponding to the second text is generated on the basis of the first speech feature, such that pitch features of speech before and after song editing is performed are similar to each other, and thus the acoustic experience of the target edited speech is similar to the acoustic experience of the original speech.

Description

Speech processing method and related equipment

This application claims priority to the Chinese patent application filed with the China Patent Office on April 29, 2022, with the application number 202210468926.8 and the invention title "A speech processing method and related equipment", the entire content of which is incorporated into this application by reference. middle.

Technical field

The embodiments of the present application relate to the field of artificial intelligence, and in particular, to a speech processing method and related equipment.

Background technique

Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that can respond in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.

Currently, voice editing has very important practical significance. For example, in scenarios where users record songs (such as singing a cappella), some content in the voice is often wrong due to slips of the tongue. In this case, voice editing can help users quickly correct the erroneous content in the original singing voice and generate corrected voice. A commonly used speech editing method is to pre-build a database containing a large number of speech segments, obtain segments of pronunciation units from the database, and use the segments to replace erroneous segments in the original speech to generate corrected speech.

However, the above-mentioned voice editing method relies on the diversity of voice segments in the database. When there are few voice segments in the database, the corrected voice (such as the user's singing voice) will have a poor hearing quality.

Contents of the invention

Embodiments of the present application provide a voice processing method and related equipment, which can achieve a listening experience of edited singing that is similar to that of original speech, thereby improving user experience.

In the first aspect, this application provides a voice processing method, which can be applied to scenarios such as users recording short videos and teachers recording teaching voices. The method may be executed by the speech processing device, or may be executed by a component of the speech processing device (such as a processor, a chip, or a chip system, etc.). Wherein, the speech processing device can be a terminal device or a cloud device, and the method includes: obtaining the original speech and the second text, the second text being the text other than the first text in the target text, the target text The original text corresponding to the original speech includes the first text, and the speech corresponding to the first text in the original speech is a non-edited speech; according to the first pitch of the non-edited speech features and the information of the target text, predict the second pitch feature of the second text; according to the second pitch feature and the second text, obtain the first pitch corresponding to the second text through a neural network Speech features: generate a target editing voice corresponding to the second text according to the first voice features. This application predicts the pitch characteristics of the second text (text to be edited), generates the first speech characteristics of the second text based on the pitch characteristics, and based on the first The voice features generate a second text that corresponds to the target edited voice, so that the pitch characteristics of the voice before and after the singing are similar, thereby achieving a listening experience of the target edited voice that is similar to that of the original voice. .

In addition, there are many ways to obtain the second text, which can be to obtain the second text directly; or to obtain the position information first (which can also be understood as mark information, used to indicate the position of the second text in the target text). When obtaining the second text according to the position and the target text, the position information is used to represent the position of the second text in the target text; it can also be obtained by obtaining the target text and the original text (or obtaining the target text and the original voice, and recognizing the original voice to obtain original text), and then determine the second text based on the original text and the target text.

In one possible implementation, generating a target editing voice corresponding to the second text based on the second voice feature includes: generating the target editing voice through a vocoder based on the second voice feature.

In this possible implementation, the second voice features are converted into the target edited voice according to the vocoder, so that the target edited voice has voice features similar to the original voice, thereby improving the user's listening experience.

In one possible implementation, the content of the original voice is the user's singing voice, which may be, for example, the voice recorded when the user sings a cappella.

In one possible implementation, obtaining the original voice and the second text includes: receiving the original voice and the second text sent by the terminal device; the method also includes: sending the target editing voice to the terminal device, and the target editing voice is used for generation by the terminal device The target speech corresponding to the target text. It can also be understood as an interactive scenario. The cloud device performs complex calculation operations, and the terminal device performs a simple splicing operation. The original voice and the second text are obtained from the terminal device. After the cloud device generates the target editing voice, it is sent to the terminal device. The target edits the voice, and then the terminal device splices it to obtain the target voice.

In this possible implementation, when the voice processing device is a cloud device, on the one hand, through the interaction between the cloud device and the terminal device, the cloud device can perform complex calculations to obtain the target edited voice and return it to the terminal device. Reduce the computing power and storage space of the terminal device. On the other hand, a target edited voice corresponding to the modified text can be generated based on the voice characteristics of the non-edited area in the original voice, and then a target voice corresponding to the target text can be generated from the non-edited voice.

Optionally, in a possible implementation of the first aspect, the above steps: obtaining the original voice and the second text include: receiving the original voice and the target text sent by the terminal device; the method further includes: based on the non-edited voice and the second text The target editing voice generates a target voice corresponding to the target text, and sends the target voice to the terminal device.

In this possible implementation, the original voice and the target text sent by the terminal device are received, the non-edited voice can be obtained, and the second voice feature corresponding to the second text is generated based on the first voice feature of the unedited voice, and then the second voice feature corresponding to the second text is generated according to the vocoded voice code. The processor obtains the target edited voice, and splices the target edited voice and the non-edited voice to generate the target voice. Equivalently, the processing is done on the voice processing device, and the results are returned to the terminal device. The cloud device performs complex calculations to obtain the target voice and returns it to the terminal device, which can reduce the computing power and storage space of the terminal device.

In a possible implementation, the first pitch (pitch) feature based on the non-edited voice and the second text include: based on the first pitch (pitch) feature of the non-edited voice, the The information of the target text and the second speech feature of the non-edited speech; the second speech feature carries at least one of the following information: some speech frames or all speech frames of the non-edited speech; the non-edited speech The voiceprint characteristics of the voice; the timbre of the non-edited voice Features; the prosodic features of the non-edited speech; and the rhythmic features of the non-edited speech.

Among them, the first speech feature can be the same or similar to the second speech feature in terms of rhythm, timbre, and/or signal-to-noise ratio. Rhythm can reflect the speaker's emotional state or speech form. Rhythm generally refers to intonation, pitch, and stress emphasis. , pauses or rhythm characteristics.

In one possible implementation, the second voice feature carries the voiceprint feature of the original voice. Among them, the voiceprint features may be obtained directly, or the voiceprint features may be obtained by recognizing original speech, etc.

In this possible implementation, on the one hand, by introducing the voiceprint feature of the original voice, the subsequently generated first voice feature also carries the voiceprint feature of the original voice, thereby improving the similarity between the target edited voice and the original voice. . On the other hand, when there are multiple speakers (or users), introducing voiceprint features can improve the subsequently predicted voice features to be more similar to the voiceprints of the speakers of the original speech.

In a possible implementation, the target text information includes:

Text embedding of each phoneme in the target text.

In a possible implementation, the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by deleting the first part of the first text. Text, the second text is text adjacent to the first part of text;

Predicting the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text includes:

Fusion of the first pitch feature of the non-edited speech and the information of the target text to obtain a first fusion result;

The first fusion result is input into the second neural network to obtain the second pitch feature of the second text.

In a possible implementation, the target text is obtained by replacing the second part of the text in the first text with the second text;

Input the first pitch (pitch) feature of the non-edited speech into a third neural network to obtain an initial pitch feature, where the first initial pitch feature includes the pitch of each frame in a plurality of frames;

Input the information of the target text into a fourth neural network to obtain the pronunciation feature of the second text, where the pronunciation feature is used to indicate whether each of the multiple frames included in the initial pitch feature is pronunciated;

The initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.

In a possible implementation, the method further includes:

According to the frame number of each phoneme in the non-edited speech and the information of the target text, the frame number of each phoneme in the second text is predicted.

In a possible implementation, the first pitch (pitch) feature includes: the pitch feature of each frame in the multiple frames of the non-edited speech;

The second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited speech.

In a possible implementation, the information based on the number of frames of each phoneme in the non-edited speech and the target text includes:

According to the frame number of each phoneme in the non-edited speech, the information of the target text and the second speech feature of the non-edited speech.

In a possible implementation, the above steps further include: obtaining the position of the second text in the target text; and splicing the target editing voice and the non-editing voice based on the position to obtain the target voice corresponding to the target text. It can also be understood as replacing the edited voice in the original voice with the target edited voice, and the edited voice is the voice in the original voice except the non-edited voice.

In this possible implementation, the target editing voice and the non-editing voice can be spliced according to the position of the second text in the target text. If the first text is all overlapping text in the original text and the target text, the voice of the desired text (ie, the target text) can be generated without changing the non-edited voice in the original voice.

Optionally, in a possible implementation of the first aspect, the above steps further include: determining the non-edited voice based on the target text, the original text and the original voice. Specifically, it may be: determining the first text based on the target text and the original text. ; Determine the non-edited voice based on the first text, original text and original voice.

In this possible implementation, the non-edited voice of the first text in the original voice is determined by comparing the original text and the original voice, so as to facilitate the subsequent generation of the first voice feature.

Optionally, in a possible implementation of the first aspect, the above steps: determining the first text based on the target text and the original text, including: determining the overlapping text based on the target text and the original text; displaying the overlapping text to the user; and responding The user's second operation determines the first text from the overlapping text.

In a second aspect, this application provides a voice processing device, which includes:

Acquisition module, used to obtain the original speech and the second text. The second text is the text in the target text except the first text. The target text and the original text corresponding to the original speech both include the first text. , the voice corresponding to the first text in the original voice is non-edited voice;

A pitch prediction module, configured to predict the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text;

A generation module, configured to obtain the first speech feature corresponding to the second text through a neural network based on the second pitch feature and the second text;

According to the first voice characteristics, a target editing voice corresponding to the second text is generated.

In one possible implementation, the content of the original voice is the user's singing voice.

In a possible implementation, the first pitch (pitch) feature of the non-edited voice and the second text include:

According to the first pitch feature of the non-edited voice, the information of the target text and the second voice feature of the non-edited voice; the second voice feature carries at least one of the following information:

Partial speech frames or all speech frames of the non-edited speech;

The voiceprint characteristics of the non-edited speech;

The timbre characteristics of the non-edited voice;

The prosodic characteristics of the unedited speech; and,

The rhythmic characteristics of the non-edited speech.

In a possible implementation, the information of the target text includes: text embedding of each phoneme in the target text.

The pitch prediction module is specifically used for:

In a possible implementation, the device further includes:

A duration prediction module, configured to predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text.

The second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited voice.

In a possible implementation, the duration prediction module is specifically used to:

In a possible implementation, the acquisition module is also used to:

Obtain the position of the second text in the target text;

The generating module is further configured to splice the target edited voice and the non-edited voice based on the position to obtain a target voice corresponding to the target text.

A third aspect of the present application provides a voice processing device that performs the method in the foregoing first aspect or any possible implementation of the first aspect.

A fourth aspect of the present application provides a speech processing device, including: a processor, the processor is coupled to a memory, and the memory is used to store programs or instructions. When the program or instructions are executed by the processor, the speech processing device implements the above-mentioned first step. A method in any possible implementation of an aspect or first aspect.

The fifth aspect of the present application provides a computer-readable medium on which a computer program or instructions are stored. When the computer program or instructions are run on a computer, the computer is caused to execute the foregoing first aspect or any possible implementation of the first aspect. method within the method.

A sixth aspect of the present application provides a computer program product, which, when executed on a computer, causes the computer to execute the method in the foregoing first aspect or any possible implementation of the first aspect.

Description of the drawings

Figure 1 is a schematic structural diagram of a system architecture provided by this application;

Figure 2 is a schematic structural diagram of a convolutional neural network provided by this application;

Figure 3 is a schematic structural diagram of another convolutional neural network provided by this application;

Figure 4 is a schematic diagram of the chip hardware structure provided by this application;

Figure 5 is a schematic flow chart of a neural network training method provided by this application;

Figure 6 is a schematic structural diagram of a neural network provided by this application;

Figure 7a is a schematic flow chart of the speech processing method provided by this application;

Figure 7b is a schematic diagram of duration prediction provided by this application;

Figure 7c is a schematic diagram of pitch prediction provided by this application;

Figure 7d is a schematic diagram of pitch prediction provided by this application;

Figures 8 to 10 are several schematic diagrams of the display interface of the voice processing device provided by this application;

Figure 11 is a schematic structural diagram of a bidirectional decoder provided by this application;

Figure 12 is another schematic diagram of the display interface of the voice processing device provided by this application;

Figure 13 is another schematic flow chart of the speech processing method provided by this application;

Figures 14-16 are schematic structural diagrams of several speech processing devices provided by this application.

Detailed ways

Embodiments of the present application provide a speech processing method and related equipment, which can realize that the listening feeling of edited speech is similar to that of original speech, thereby improving user experience.

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without making creative efforts fall within the scope of protection of this application.

In order to facilitate understanding, the relevant terms and concepts mainly involved in the embodiments of this application are first introduced below.

1. Neural network

The neural network can be composed of neural units. The neural unit can refer to an arithmetic unit that takes X _s and intercept 1 as input. The output of the arithmetic unit can be:

Among them, s=1, 2,...n, n is a natural number greater than 1, W _s is the weight of X _s , and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function. A neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field. The local receptive field can be an area composed of several neural units.

2. Deep neural network

Deep neural network (DNN), also known as multi-layer neural network, can be understood as a neural network with many hidden layers. There is no special metric for "many" here. From the division of DNN according to the position of different layers, the neural network inside DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer. Of course, the deep neural network may not include hidden layers, and there is no specific limitation here.

The work of each layer in a deep neural network can be expressed mathematically To describe: From the physical level, the work of each layer in the deep neural network can be understood as completing the transformation from the input space to the output space (that is, the row space of the matrix to the columns) through five operations on the input space (a collection of input vectors). space), these five operations include: 1. Dimension raising/reducing; 2. Zoom in/out; 3. Rotation; 4. Translation; 5. "Bend". Among them, the operations of 1, 2 and 3 are performed by Completed, operation 4 is performed by Completed, the operation of 5 is implemented by α(). The reason why the word "space" is used here is because the object to be classified is not a single thing, but a class of things. Space refers to the collection of all individuals of this type of thing. Among them, W is a weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer. This vector W determines the spatial transformation from the input space to the output space mentioned above, that is, the weight W of each layer controls how Transform space. The purpose of training a deep neural network is to finally obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vectors W of many layers). Therefore, the training process of neural network is essentially to learn how to control spatial transformation, and more specifically, to learn the weight matrix.

3. Convolutional neural network

Convolutional neural network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers. The feature extractor can be viewed as a filter, and the convolution process can be viewed as convolving the same trainable filter with an input image or feature map. The convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal. In the convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neighboring layer neurons. A convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as a way to extract image information independent of position. The underlying principle is that the statistical information of one part of the image is the same as that of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the same learned image information can be used for all positions on the image. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.

The convolution kernel can be initialized in the form of a random-sized matrix. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting. The separation network, recognition network, detection network, depth estimation network and other networks in the embodiments of this application can all be CNNs.

4. Recurrent Neural Network (RNN)

In the traditional neural network model, the layers are fully connected, and the nodes between each layer are unconnected. But this ordinary neural network cannot solve many problems. For example, predicting what the next word of a sentence is, because the preceding and following words in a sentence are not independent, generally the previous word needs to be used. Recurrent neural network (RNN) refers to the current output of a sequence that is also related to the previous output. The specific form of expression is that the network will remember the previous information, save it in the internal state of the network, and apply it to the calculation of the current output.

5. Loss function

In the process of training a deep neural network, because we hope that the output of the deep neural network is as close as possible to the value that we really want to predict, we can compare the predicted value of the current network with the really desired target value, and then based on the difference between the two to update the weight vector of each layer of the neural network according to the difference (of course, there is usually an initialization process before the first update, that is, preconfiguring parameters for each layer in the deep neural network). For example, if the predicted value of the network If it is high, adjust the weight vector to make it predict a lower value, and continue to adjust until the neural network can predict the target value that you really want. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function (loss function) or objective function (objective function), which is used to measure the difference between the predicted value and the target value. Important equations. Among them, taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference. Then the training of the deep neural network becomes a process of reducing this loss as much as possible.

6. From text to speech

Text to speech (TTS) is a program or software system that converts text into speech.

7. Vocoder

A vocoder is a sound signal processing module or software that encodes acoustic features into a sound waveform.

8. Pitch

Pitch can also be called fundamental frequency. When a sound-emitting body emits sound due to vibration, the sound can generally be decomposed into many simple sine waves. That is to say, all natural sounds are basically composed of many sine waves with different frequencies. , the sine wave with the lowest frequency is the fundamental tone (that is, the fundamental frequency, which can be represented by F0), while other sine waves with higher frequencies are overtones.

9. Rhythm

In the field of speech synthesis, prosody broadly refers to features that control functions such as intonation, pitch, emphasis, pauses, and rhythm. Prosody can reflect the speaker's emotional state or speech form, etc.

10. Phonemes

Phoneme (phone): It is the smallest unit of speech divided according to the natural properties of speech. It is analyzed based on the pronunciation movements in the syllable. One movement constitutes a phoneme. Phonemes are divided into two categories: vowels and consonants. For example, the Chinese syllable a (for example, one tone: ah) has only one phoneme, ai (for example, four tones: love) has two phonemes, dai (for example, one tone: stay) has three phonemes, etc.

11. Word vector (embedding)

Word vectors can also be called "word embeddings", "vectorization", "vector mapping", "embeddings", etc. Formally speaking, a word vector represents an object as a dense vector.

12. Voice characteristics

Speech features: Convert the processed speech signal into a concise and logical representation that is more discriminating and reliable than the actual signal. After acquiring a segment of speech signal, speech features can be extracted from the speech signal. Among them, the extraction method usually extracts a multi-dimensional feature vector for each speech signal. There are many ways to represent the parameterization of speech signals, such as: perceptual linear prediction (PLP), linear predictive coding (LPC) and frequency cepstrum coefficient (MFCC), etc.

13. transformer layer

The neural network includes an embedding layer and at least one transformer layer. At least one transformer layer can be N transformer layers (N is an integer greater than 0), where each transformer layer includes successively adjacent attention layers, summation and normalization. (add&norm) layer, feed forward layer and summation and normalization layer. In the embedding layer, the current input is embedded to obtain multiple feature vectors; in the attention layer, P input vectors are obtained from the upper layer of the first transformer layer, and any of the P input vectors are The first input vector is the center. Based on the correlation between each input vector within the preset attention window range and the first input vector, the intermediate vector corresponding to the first input vector is obtained. In this way, P input vectors are determined. Corresponding P intermediate vectors; in the pooling layer, the P intermediate vectors are merged into Q output vectors, where the multiple output vectors obtained by the last transformer layer in the transformer layer are used as features of the current input express.

Next, each of the above steps will be introduced in detail with specific examples.

First, in the embedding layer, the current input is embedded to obtain multiple feature vectors.

The embedding layer can be called the input embedding layer. The current input can be text input, for example, it can be a paragraph of text or a sentence. The text can be Chinese text, English text, or other language text. After obtaining the current input, the embedding layer can embed each word in the current input to obtain the feature vector of each word. In some embodiments, the embedding layer includes an input embedding layer and a positional encoding layer. In the input embedding layer, word embedding processing can be performed on each word in the current input to obtain the word embedding vector of each word. In the position coding layer, the position of each word in the current input can be obtained, and then a position vector is generated for the position of each word. In some examples, the position of each word may be the absolute position of each word in the current input. Taking the current input as "What number should I pay back the Huabei?" for example, the position of "number" can be represented as the first digit, the position of "number" can be represented as the second digit,... In some examples, the position of each word may be a relative position between each word. Still taking the current input as "what number should I pay back Huabei" as an example, the position of "what number" can be expressed as before "number", and the position of "number" can be expressed as after "what number" and before "should",... …. When the word embedding vector and position vector of each word in the current input are obtained, the position vector of each word and the corresponding word embedding vector can be combined to obtain the feature vector of each word, that is, multiple feature vectors corresponding to the current input are obtained. Multiple feature vectors can be represented as embedding matrices with preset dimensions. The number of eigenvectors in the plurality of eigenvectors can be set to M, and the default dimension is H dimension, then the plurality of eigenvectors can be expressed as an M×H embedding matrix.

14. attention mechanism

The attention mechanism imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience and external sensation to increase the precision of observation in some areas, and can use limited attention resources to quickly filter out high-value information from a large amount of information. . The attention mechanism can quickly extract important features of sparse data and is therefore widely used in natural language processing tasks, especially machine translation. The self-attention mechanism is an improvement of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features. The essential idea of the attention mechanism can be rewritten as the following formula:

Among them, Lx=||Source|| represents the length of Source. The meaning of the formula is to imagine that the constituent elements in Source are composed of a series of data pairs. At this time, given a certain element Query in the target Target, by calculating the Query and Based on the similarity or correlation of each Key, the weight coefficient of each Key's corresponding Value is obtained, and then the Value is weighted and summed to obtain the final Attention value. So essentially the Attention mechanism is a weighted summation of the Value values of the elements in the Source, and Query and Key are used to calculate the weight coefficient of the corresponding Value. Conceptually, Attention can be understood as selectively filtering out a small amount of important information from a large amount of information and focusing on this important information, while ignoring most of the unimportant information. The process of focusing is reflected in the calculation of the weight coefficient. The greater the weight, the more focused it is on its corresponding Value value. That is, the weight represents the importance of the information, and the Value is its corresponding information. The self-attention mechanism can be understood as internal Attention (intra attention). The Attention mechanism occurs between the Target element Query and all elements in the Source. The self-attention mechanism refers to between the internal elements of the Source or between the internal elements of the Target. The Attention mechanism that occurs can also be understood as the attention calculation mechanism in the special case of Target=Source. The specific calculation process is the same, but the calculation object has changed.

At present, there are more and more scenarios for voice editing. For example, the scenario of singing voice editing is when the user is recording a song (such as singing a cappella). In order to repair the erroneous content in the original voice caused by a slip of the tongue, voice editing is usually used. Head The current voice editing method is to obtain voice segments from the database, replace the erroneous content with the voice segments, and then generate corrected speech.

However, this method relies too much on the speech fragments stored in the database. If the timbre, rhythm, signal-to-noise ratio, etc. of the speech fragments are greatly different from the original speech, the corrected human voice will be incoherent and the rhythm will be unnatural, resulting in The corrected speech sounds worse. And although the scene of singing editing is very similar to that of voice editing, unlike the smooth voice of a speaking voice, singing data changes more in dimensions such as pronunciation duration, sound energy and pitch, making it difficult for existing voice editing technology to be directly applied to singing editing.

In order to solve the above problems, this application provides a voice editing method. When editing a singing voice, the pitch characteristics will affect the hearing sense of the target edited voice and the hearing sense of the original voice. This application predicts the second text (text to be edited) by predicting Pitch feature, generate the first voice feature of the second text based on the pitch feature, and generate the target editing voice corresponding to the second text based on the first voice feature, so that the pitch features of the voice before and after singing editing are similar, thereby achieving the target editing voice The listening feeling of the target is similar to that of the original speech.

First, the system architecture provided by the embodiment of this application is introduced.

Referring to Figure 1, an embodiment of the present application provides a system architecture 10. As shown in the system architecture 10, the data collection device 16 is used to collect training data. In the embodiment of the present application, the training data includes training speech and training text corresponding to the training speech. The training data is stored in the database 13, and the training device 12 trains to obtain the target model/rule 101 based on the training data maintained in the database 13. The following will describe in more detail how the training device 12 obtains the target model/rules 101 based on the training data. The target model/rules 101 can be used to implement the speech processing method provided by the embodiment of the present application, that is, the text is input after relevant preprocessing. The target model/rule 101 can obtain the phonetic features of the text. The target model/rule 101 in the embodiment of this application may specifically be a neural network. It should be noted that in actual applications, the training data maintained in the database 13 may not all be collected from the data collection device 16, and may also be received from other devices. In addition, it should be noted that the training device 12 may not necessarily train the target model/rules 101 based entirely on the training data maintained by the database 13. It may also obtain training data from the cloud or other places for model training. The above description should not be used as a limitation of this application. Limitations of Examples.

The target model/rules 101 trained according to the training device 12 can be applied to different systems or devices, such as to the execution device 11 shown in Figure 1. The execution device 11 can be a terminal, such as a mobile phone terminal, a tablet computer, Laptops, AR/VR, vehicle terminals, etc., or servers or clouds, etc. In Figure 1, the execution device 11 is configured with an I/O interface 112 for data interaction with external devices. The user can input data to the I/O interface 112 through the client device 14. The input data is used in the embodiment of the present application. may include: second voice features, target text and mark information, and the input data may also include second voice features and second text. In addition, the input data can be input by the user, or uploaded by the user through other devices. Of course, it can also come from a database, and there is no specific limit here.

If the input data includes the second speech feature, target text, and mark information, the preprocessing module 113 is configured to perform preprocessing according to the target text and mark information received by the I/O interface 112. In the embodiment of the present application, the preprocessing module 113 may be used to determine the target editing text in the target text based on the target text and mark information. If the input data includes the second speech feature and the second text, the preprocessing module 113 is configured to perform preprocessing according to the target text and mark information received by the I/O interface 112, for example, converting the target text into phonemes and other preparatory work.

When the execution device 11 preprocesses the input data, or when the calculation module 111 of the execution device 11 performs calculations and other related processes, the execution device 11 can call data, codes, etc. in the data storage system 15 for corresponding processing. , the data, instructions, etc. obtained by corresponding processing can also be stored in the data storage system 15 .

Finally, the I/O interface 112 returns the processing result, such as the first voice feature obtained as described above, to the client device 14, thereby providing it to the user.

It is worth mentioning that the training device 12 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the The above tasks, thereby providing the user with the desired results or providing input for other subsequent processing.

In the situation shown in FIG. 1 , the user can manually set the input data, and the manual setting can be operated through the interface provided by the I/O interface 112 . In another case, the client device 14 can automatically send input data to the I/O interface 112. If requiring the client device 14 to automatically send the input data requires the user's authorization, the user can set corresponding permissions in the client device 14. The user can view the results output by the execution device 11 on the client device 14, and the specific presentation form may be display, sound, action, etc. The client device 14 can also be used as a data collection terminal to collect input data from the input I/O interface 112 and output results from the output I/O interface 112 as new sample data, and store them in the database 13 . Of course, it is also possible to collect data without going through the client device 14. Instead, the I/O interface 112 directly uses the input data input to the I/O interface 112 and the output result outputted from the I/O interface 112 as a new sample as shown in the figure. The data is stored in database 13.

It is worth noting that Figure 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation. For example, in Figure 1 , the data storage system 15 is an external memory relative to the execution device 11. In other cases, the data storage system 15 can also be placed in the execution device 11.

As shown in Figure 1, a target model/rule 101 is obtained by training according to the training device 12. The target model/rule 101 can be a neural network in the embodiment of the present application. Specifically, in the network provided by the embodiment of the present application, the neural network It can be a recurrent neural network, a long short-term memory network, etc. The prediction network can be a convolutional neural network, a recurrent neural network, etc.

Optionally, the neural network and the prediction network in the embodiment of this application can be two separate networks, or they can be a multi-task neural network, one of which is to output duration, one to predict pitch features, and the other to predict pitch characteristics. The task is to output speech features.

Since CNN is a very common neural network, the structure of CNN will be introduced in detail below in conjunction with Figure 2. As mentioned in the previous introduction to the basic concepts, a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture. The deep learning architecture refers to the algorithm of machine learning. Multiple levels of learning at different levels of abstraction. As a deep learning architecture, CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to the image input into it.

As shown in Figure 2, a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional/pooling layer 120, and a neural network layer 130 where the pooling layer is optional.

Convolutional layer/pooling layer 120:

Convolutional layer:

As shown in Figure 2, the convolution layer/pooling layer 120 may include layers 121-126 as examples. In one implementation, layer 121 is a convolution layer, layer 122 is a pooling layer, layer 123 is a convolution layer, and 124 The layer is a pooling layer, 125 is a convolution layer, and 126 is a pooling layer; in another implementation, 121 and 122 are convolution layers, 123 is a pooling layer, 124 and 125 are convolution layers, and 126 is Pooling layer. That is, the output of the convolutional layer can be used as the input of the subsequent pooling layer, or can be used as the input of another convolutional layer to continue the convolution operation.

Taking the convolution layer 121 as an example, the convolution layer 121 may include many convolution operators. The convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix. The convolution operator can essentially be a weight matrix. This weight matrix is usually predefined. During the convolution operation on the image, the weight matrix is usually pixel by pixel along the horizontal direction on the input image ( Or two pixels followed by two pixels... This depends on the value of the step size) to complete the work of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image. During the convolution operation, the weight matrix will extend to Enter the entire depth of the image. Therefore, convolution with a single weight matrix will produce a convolved output with a single depth dimension, but in most cases, instead of using a single weight matrix, multiple weight matrices with the same dimension are applied. The output of each weight matrix is stacked to form the depth dimension of the convolutional image. Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to remove unnecessary noise in the image. Perform fuzzification...the multiple weight matrices have the same dimensions, and the feature maps extracted by the multiple weight matrices with the same dimensions also have the same dimensions, and then the extracted feature maps with the same dimensions are combined to form the output of the convolution operation .

The weight values in these weight matrices need to be obtained through a large amount of training in practical applications. Each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.

When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (for example, 121) often extracts more general features, which can also be called low-level features; as the convolutional neural network As the depth of the network 100 deepens, the features extracted by subsequent convolutional layers (for example, 126) become more and more complex, such as high-level semantic features. Features with higher semantics are more suitable for the problem to be solved.

Pooling layer:

Since it is often necessary to reduce the number of training parameters, it is often necessary to periodically introduce a pooling layer after the convolution layer, that is, each layer 121-126 as shown at 120 in Figure 2, which can be a convolution layer followed by a layer The pooling layer can also be a multi-layer convolution layer followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image. The average pooling operator can calculate the average value of pixel values in an image within a specific range. The max pooling operator can take the pixel with the largest value in a specific range as the result of max pooling. In addition, just like the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image. The size of the output image processed by the pooling layer can be smaller than the size of the image input to the pooling layer. Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.

Neural network layer 130:

After being processed by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not enough to output the required output information. Because as mentioned above, the convolutional layer/pooling layer 120 will only extract features and reduce the parameters brought by the input image. However, in order to generate the final output information (required class information or other related information), the convolutional neural network 100 needs to use the neural network layer 130 to generate an output or a set of required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in Figure 2) and an output layer 140. The parameters included in the multiple hidden layers may be based on specific task types. Related training data are pre-trained. For example, the task type can include image recognition, image classification, image super-resolution reconstruction, etc.

After the multi-layer hidden layer in the neural network layer 130, that is, the last layer of the entire convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to classification cross entropy, specifically used to calculate the prediction error, Once the forward propagation of the entire convolutional neural network 100 (the propagation from 110 to 140 in Figure 2 is forward propagation) is completed, the back propagation (the propagation from 140 to 110 in Figure 2 is back propagation) will start to update. The weight values and deviations of each layer mentioned above are used to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.

It should be noted that the convolutional neural network 100 shown in Figure 2 is only an example of a convolutional neural network. In specific applications, the convolutional neural network can also exist in the form of other network models, for example, as The multiple convolutional layers/pooling layers shown in Figure 3 are parallel, and the extracted features are all input to the full neural network layer 130 for processing.

The following introduces a chip hardware structure provided by the embodiment of the present application.

Figure 4 is a chip hardware structure provided by an embodiment of the present application. The chip includes a neural network processor 40. The chip can be disposed in the execution device 110 as shown in Figure 1 to complete the calculation work of the calculation module 111. The chip can also be provided in the training device 120 as shown in Figure 1 to complete the training work of the training device 120 and output the target model/rules 101. The algorithms of each layer in the convolutional neural network shown in Figure 2 can be implemented in the chip shown in Figure 4.

The neural network processor 40 may be a neural network processor (neural-network processing unit, NPU), a tensor processing unit (TPU), or a graphics processor (graphics processing unit, GPU), etc., which are suitable for large-scale applications. A processor that handles scaled XOR operations. Take the NPU as an example: the neural network processor NPU40 is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and the main CPU allocates tasks. The core part of the NPU is the arithmetic circuit 403. The controller 404 controls the arithmetic circuit 403 to extract data in the memory (weight memory or input memory) and perform operations.

In some implementations, the computing circuit 403 internally includes multiple processing engines (PEs). In some implementations, arithmetic circuit 403 is a two-dimensional systolic array. The arithmetic circuit 403 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 403 is a general-purpose matrix processor.

For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit obtains the corresponding data of matrix B from the weight memory 402 and caches it on each PE in the arithmetic circuit. The arithmetic circuit converts the input memory 401 The matrix A data is obtained and matrix B is used for matrix operation, and the partial result or final result of the matrix is stored in the accumulator 408 .

The vector calculation unit 407 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. For example, the vector calculation unit 407 can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc. .

In some implementations, the vector calculation unit can 407 store the processed output vector into the unified buffer 406 . For example, the vector calculation unit 407 may apply a nonlinear function to the output of the operation circuit 403, such as a vector of accumulated values, to generate an activation value. In some implementations, vector calculation unit 407 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 403, such as for use in a subsequent layer in a neural network.

The unified memory 406 is used to store input data and output data.

The weight data directly transfers the input data in the external memory to the input memory 401 and/or unified memory 406 through the storage unit access controller 405 (direct memory access controller, DMAC), and stores the weight data in the external memory into the weight memory 402. And store the data in the unified memory 506 into the external memory.

A bus interface unit (BIU) 410 is used to implement interaction between the main CPU, the DMAC and the fetch memory 409 through the bus.

An instruction fetch buffer 409 connected to the controller 404 is used to store instructions used by the controller 404.

The controller 404 is used to call instructions cached in the memory 409 to control the working process of the computing accelerator.

Generally, the unified memory 406, the input memory 401, the weight memory 402 and the instruction memory 409 are all on-chip memories, and the external memory is a memory external to the NPU. The external memory can be double data rate synchronous dynamic random access. Memory (double data rate synchronous dynamic random access memory, DDR SDRAM for short), high bandwidth memory (high bandwidth memory, HBM) or other readable and writable memory.

Among them, the operations of each layer in the convolutional neural network shown in Figure 2 or Figure 3 can be performed by the operation circuit 403 or the vector calculation unit 407.

First, the application scenarios to which the voice processing method provided by the embodiment of the present application is applicable are described. This voice processing method can be applied to scenarios where voice content needs to be modified, such as scenarios where users record short videos, teachers record teaching voices, etc. The voice processing method can be applied to applications, software or voice processing devices with voice editing functions such as smart voice assistants on mobile phones, computers, detachable terminals that can produce sounds, smart speakers, etc.

Among them, the voice processing device is a terminal device used to serve users, or a cloud device. The terminal device may include a head mount display (HMD), which may be a combination of a virtual reality (VR) box and a terminal, a VR all-in-one machine, or a personal computer (PC), Augmented reality (AR) devices, mixed reality (MR) devices, etc. The terminal device may also include a cellular phone, a smart phone, a personal digital assistant (personal digital assistant), etc. digital assistant (PDA), tablet computer, laptop computer (laptop computer), personal computer (PC), vehicle-mounted terminal, etc. The details are not limited here.

The following is a detailed introduction to the neural network, the training method of the prediction network, and the speech processing method in the embodiments of the present application with reference to the accompanying drawings.

The neural network and the prediction network in the embodiment of this application can be two separate networks, or they can be a multi-task neural network, one of which is to output duration, and the other is to output speech features.

Secondly, the training method of the neural network according to the embodiment of the present application will be introduced in detail with reference to Figure 5 . The training method shown in Figure 5 can be executed by a neural network training device. The neural network training device can be a cloud service device or a terminal device. For example, a computer, server, etc. with sufficient computing power to execute the neural network. The training method device may also be a system composed of cloud service equipment and terminal equipment. For example, the training method may be executed by the training device 120 in FIG. 1 and the neural network processor 40 in FIG. 4 .

Optionally, the training method can be processed by the CPU, or it can be processed by the CPU and GPU together, or it can not use the GPU but use other processors suitable for neural network calculations, which is not limited by this application.

The training method shown in Figure 5 includes step 501 and step 502. Step 501 and step 502 will be described in detail below.

First, let’s briefly describe the training process of the prediction network. The prediction network in the embodiment of this application can be a transformer network, RNN, CNN, etc., and is not specifically limited here. In the training phase of the prediction network, the input is the vector of the training text, and the output is the duration, pitch characteristics or voice characteristics of each phoneme in the training text. Then continue to narrow the difference between the duration, pitch characteristics or phonetic features of each phoneme in the training text output by the prediction network and the actual duration, actual pitch features or actual phonetic features of the training text corresponding to the training text, and then obtain the trained predictions network.

Step 501: Obtain training data.

The training data in the embodiment of the present application includes training speech, or includes training speech and training text corresponding to the training speech. If the training data does not include training text, the training text can be obtained by recognizing the training speech.

Optionally, if there are multiple speakers (or users), in order to ensure that the subsequently predicted speech features are correct, the training speech features in the training data may also include user identification, or include the voiceprint features of the training speech, or include using A vector identifying the voiceprint features of the training speech.

Optionally, the training data may also include start and end duration information of each phoneme in the training speech.

In the embodiments of this application, the training data can be obtained by directly recording the utterances of the voicing object, or by the user inputting audio information and video information, or by receiving transmissions from the collection device. In practical applications , there are other ways to obtain training data, and there are no specific limitations on how to obtain training data.

Step 502: Use the training data as the input of the neural network, train the neural network with the goal that the value of the loss function is less than the threshold, and obtain a trained neural network.

Optionally, some preprocessing can be performed on the training data. For example, if the training data includes training speech as described above, the training text can be obtained by identifying the training speech, and the training text can be input into the neural network using phoneme representation.

During the training process, the entire training text can be regarded as the target editing text and used as input to train the neural network with the goal of reducing the value of the loss function, that is, continuously reducing the speech features output by the neural network and the training speech The difference between the corresponding actual speech features. This training process can be understood as a prediction task. The loss function can be understood as the loss function corresponding to the prediction task.

The neural network in the embodiment of this application may specifically be an attention mechanism model, such as transformer, tacotron2, etc. Among them, the attention mechanism model includes an encoder-decoder, and the structure of the encoder or decoder can be a recurrent neural network, a long short-term memory network (long short-term memory, LSTM), etc.

The neural network in the embodiment of the present application includes an encoder and a decoder. The structural types of the encoder and decoder may be RNN, LSTM, etc., and are not limited here. The function of the encoder is to encode the training text into a text vector (a vector representation in units of phonemes, with each input corresponding to a vector). The function of the decoder is to obtain the phonetic features corresponding to the text based on the text vector. During the training process of the decoder, the calculation of each step is calculated based on the real speech features corresponding to the previous step.

Furthermore, in order to ensure the coherence of the preceding and following speech, the prediction network can be used to correct the speech duration corresponding to the text vector. That is, it can be understood as upsampling the text vector according to the duration of each phoneme in the training speech (it can also be understood as extending the number of frames of the vector) to obtain a vector of the corresponding number of frames. The function of the decoder is to obtain the speech features corresponding to the text based on the above vector corresponding to the frame number.

Optionally, the above-mentioned decoder may be a unidirectional decoder or a bidirectional decoder (that is, two directions are parallel), and the details are not limited here. Among them, the two directions refer to the direction of the training text, which can also be understood as the direction of the vector corresponding to the training text. It can also be understood as the forward or reverse order of the training text. One direction means that one side of the training text points to the training text. The other side, the other direction is the other side of the training text pointing to the side of the training text.

For example, if the training text is: "Have you had lunch?", the first direction or positive sequence can be from "mid" to "no", and the second direction or reverse sequence can be from "no" to "no" "Center" direction.

If the decoder is a bidirectional decoder, the decoders in both directions (or forward and reverse order) are trained in parallel and are calculated independently during the training process, so there is no dependence on the results. Of course, if the prediction network and the neural network are a multi-task network, the prediction network can be called a prediction module, and the decoder can correct the speech features output by the neural network based on the real duration information corresponding to the training text.

For example, taking singing editing as an example, the input during model training can be the original singing audio, the corresponding lyric text (expressed in units of phonemes). According to the original singing audio, the duration information of each phoneme in the original audio, and the singing voice are obtained. Pattern features, frame-level pitch information, etc., can be obtained through other pre-trained models or tools (such as singing lyrics alignment tool, Singer voiceprint extraction tool, and pitch extraction algorithm, etc.). The output can be a trained acoustic model, and the training goal is to minimize the error between the predicted singing voice features and the singing voice features.

In terms of data preparation for training samples, training data sets can be synthesized based on singing voices, and corresponding training data samples can be constructed by simulating "insertion, deletion and replacement" operation scenarios.

Training process:

Stage1: First use ground-truth lyrics and audio as well as pitch and duration data to train a singing voice synthesis model to obtain the trained text encoding module and audio feature decoding module;

Stage2: Fixed text encoding module and audio feature decoding module, use simulated editing operation training data set to train duration regularization module and pitch prediction module;

Stage3: End-to-end training, finetune the entire model using all training data.

The architecture of the neural network in the embodiment of this application can be seen in Figure 6. Among them, neural network includes encoder and decoder. Optionally, the neural network may also include a prediction module and an upsampling module. The prediction module is specifically used to implement the function of the above prediction network, and the upsampling module is specifically used to implement the above process of upsampling the text vector according to the duration of each phoneme in the training speech, which will not be described again here.

It should be noted that the training process may also adopt other training methods instead of the aforementioned training methods, which are not limited here.

The speech processing method according to the embodiment of the present application will be introduced in detail below with reference to the accompanying drawings.

First of all, the voice processing method provided by the embodiment of the present application can be applied to replacing scenes, inserting scenes, or deleting scenes. The above scenario can be understood as replacing, inserting, deleting, etc. the original speech corresponding to the original text to obtain the target speech, so as to achieve a similar listening experience between the target speech and the original speech and/or improve the fluency of the target speech. Among them, the original voice can be considered to include the voice to be modified, and the target voice is the voice obtained after the user wants to modify the original voice.

To facilitate understanding, several examples of the above scenarios are described below:

1. For replacement scenarios.

The original text is "Today the weather is very good in Shenzhen", and the target text is "Today the weather is very good in Guangzhou". Among them, the overlapping text is "The weather is very good today". The non-overlapping text in the original text is "Shenzhen", and the non-overlapping text in the target text is "Guangzhou". The target text includes a first text and a second text, and the first text is an overlapping text or a part of the overlapping text. The second text is the text in the target text other than the first text. For example: If the first text is "The weather is very good today", the second text is "Guangzhou". If the first text is "Today's weather is very good", the second text is "Heaven, Guangzhou, Tian".

2. For insertion scenes.

The original text is "The weather in Shenzhen is very good today", and the target text is "The weather in Shenzhen is very good this morning". Among them, the overlapping text is "The weather in Shenzhen is very good today". The non-overlapping text in the target text is "AM". In order to achieve coherence before and after the target speech, the insertion scene can be regarded as a replacement scene in which "tianshen" in the original speech is replaced by "tianchenshen". That is, the first text is "The weather in Zhenzhou is very good today", and the second text is "It's morning and late in the day".

3. For deleted scenes.

The original text is "The weather in Shenzhen is very good today" and the target text is "The weather is very good today". Among them, the overlapping text is "The weather is very good today". The non-overlapping text in the original text is "Shenzhen". In order to achieve continuity before and after the target speech, the deleted scene can be regarded as a replacement scene in which "天天天" in the original speech is replaced by "天天". That is, the first text is "Today's weather is very good" and the second text is "Every day".

Optionally, the above scenarios are just examples. In actual applications, there are other scenarios, which are not limited here.

Since the above-mentioned deletion scenes and insertion scenes can be replaced by replacement scenes, the speech processing method provided by the embodiment of the present application is described below only by taking the replacement scene as an example. The voice processing method provided by the embodiments of this application can be executed by the terminal device or the cloud device alone, or can be completed by the terminal device and the cloud device together. They are described below:

Embodiment 1: The terminal device or the cloud device executes the voice processing method independently.

Please refer to Figure 7a, which is an example of a voice processing method provided by the embodiment of the present application. This method can be executed by a voice processing device or by a component of the voice processing device (such as a processor, a chip, or a chip system, etc.). The voice processing device may be a terminal device or a cloud device, and this embodiment includes steps 701 to 704.

Step 701: Obtain the original voice and the second text.

In the embodiment of the present application, the speech processing device can directly obtain the original speech, the original text and the second text. It is also possible to obtain the original speech and the second text first, and then recognize the original speech to obtain the original text corresponding to the original speech. The second text is the text in the target text other than the first text, and the original text and the target text contain the first text. The first text can be understood as part or all of the overlapping text of the original text and the target text.

In the embodiment of this application, there are multiple ways for the voice processing device to obtain the second text, which are described below:

First, the voice processing device can directly obtain the second text through input from other devices or users.

Second, the speech processing device obtains the target text, obtains the overlapping text based on the original text corresponding to the target text and the original speech, and then determines the second text based on the overlapping text. Specifically, the characters in the original text and the target text can be compared one by one or a comparison model can be input to determine overlapping text and/or non-overlapping text between the original text and the target text. Then determine the first text based on the overlapping text. The first text may be an overlapping text, or may be part of the overlapping text.

In the embodiment of the present application, there are many ways to determine the first text based on the overlapping text. The speech processing device can directly determine the overlapping text as the first text, can also determine the first text in the overlapping text according to preset rules, or can also determine the first text in the overlapping text according to the user. The operation determines the first text in the overlapping text. The preset rule may be to obtain the first text after removing N characters in the overlapping content, where N is a positive integer.

It can be understood that the above two methods are just examples. In actual applications, there are other ways to obtain the second text, which are not limited here.

In addition, the speech processing device can align the original text with the original speech, determine the starting and ending positions of each phoneme in the original text in the original speech, and can learn the duration of each phoneme in the original text. Then, the phonemes corresponding to the first text are obtained, that is, the speech corresponding to the first text in the original speech (that is, the non-edited speech) is obtained.

Optionally, the voice processing device can align the original text with the original speech by using a forced alignment method, such as: Montreal forced aligner (MFA), neural network with alignment function and other alignment tools, specifically There are no limitations here.

Optionally, after the speech processing device obtains the original speech and the original text, the user interface can be displayed to the user, and the user interface includes the original speech and the original text. Further, the user performs a first operation on the original text through the user interface, and the speech processing device determines the target text in response to the user's first operation. The first operation can be understood as the user's editing of the original text, and the editing can be the aforementioned replacement, insertion, or deletion.

Illustratively, continue the example in the above replacement scenario. The original text is "Today the weather is very good in Shenzhen", and the target text is "Today the weather is very good in Guangzhou". For example, the speech processing device is a mobile phone. After the speech processing device obtains the original text and the original voice, the user is presented with an interface as shown in Figure 8, which includes the original text and the original voice. As shown in Figure 9, the user can perform the first operation 901 on the original text, such as modifying "Shenzhen" to "Guangzhou" and other aforementioned insertion, deletion, and replacement operations. Here, only replacement is used as an example for description.

Optionally, after determining the overlapping text between the original text and the target text, the speech processing device displays the overlapping text to the user, and then determines the first text from the overlapping text according to the user's second operation, and then determines the second text. Among them, the second The operation can be click, drag, slide, etc. There are no specific limitations here.

Illustratively, continuing the above example, the second text is "Guangzhou", the first text is "The weather is very good today", and the non-edited voice is the voice of the first text in the original voice. Assuming that one text corresponds to 2 frames and the original speech corresponding to the original text includes 16 frames, the non-edited speech is equivalent to frames 1 to 4 and frames 9 to 16 in the original speech. It can be understood that in practical applications, the correspondence between text and voice frames is not necessarily 1:2 as in the above example. The above example is only for the convenience of understanding the non-editing area. The number of frames corresponding to the original text is not specifically limited here. After determining the target text, the speech processing device can display an interface as shown in Figure 10, which can include the second text, the target text, the non-edited voice and the edited voice in the original voice, where the second text is "Guangzhou" and the target text The text is "The weather in Guangzhou is very good today", the non-edited voice is the voice corresponding to "The weather is very good today", and the edited voice is the voice corresponding to "Shenzhen". It can also be understood that as the user edits the target text, the speech processing device determines the non-edited voice in the original voice based on the target text, the original text, and the original voice.

Optionally, the voice processing device receives an editing request sent by the user, where the editing request includes the original voice and the second text. Optionally, the edit request also includes the original text and/or speaker identification. Of course, the editing request can also include the original speech and the target text.

Step 702: Predict the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text.

In a possible implementation, the text embedding of each phoneme in the target text can be obtained through the text encoding module (Text Encoder) based on the target text. For example, the target text can be converted into the corresponding phoneme sequence (for example, the phoneme corresponding to "How can love not ask whether it is right or wrong" is the sequence of initial consonants and finals in pinyin), and then input to the Text Encoder and converted into the corresponding phoneme-based phoneme sequence. Text embedding of units. The network structure of Text Encoder can be exemplified by the Tacotron 2 model.

In a possible implementation, the number of frames of each phoneme in the non-edited speech (which can also be called the duration) can be obtained, and based on the number of frames of each phoneme in the non-edited speech and the information of the target text, Predict the number of frames for each phoneme in the second text.

In one possible implementation, the neural network used to predict the number of frames of each phoneme in the second text can be as shown in Figure 7b (for example, it can be a duration prediction model based on a mask mechanism that fuses the original real duration), It uses the output of Text Encoder and the original real duration (Reference Duration, that is, the duration of each phoneme in the first text) and the corresponding mask as input to predict the duration of each phoneme to be edited (that is, each phoneme in the second text) (That is, the number of frames corresponding to the audio).

In a possible implementation, after obtaining the number of frames of each phoneme in the target text (including the first text and the second text), each text embedding can be performed based on the predicted duration of each phoneme. Upsample to obtain the embedding result corresponding to the number of frames (for example, if the prediction duration of phoneme ai is 10 frames, you can copy the text embedding corresponding to ai N copies, N is a positive number greater than 1, for example, N is 10 ).

It should be understood that, optionally, in the singing editing scenario, the singing itself will follow a certain music score, and the music score also stipulates the pronunciation duration and pitch of each word. Therefore, when editing the singing voice, for the area that does not need to be edited (non-edited voice), There is no need to predict the corresponding duration and pitch information, just get the accurate real value directly and use it.

Next, a diagram for predicting the duration of the second text is given:

Referring to Figure 7b, Reference Durations is the real duration of each phoneme in the original singing audio, and the dotted box is the duration to be predicted of each phoneme in the second text (since it is unknown at this time, it can be replaced by 0); while the Edit Mask is used Mark the phonemes to be predicted (Mask=0 indicates that prediction is required); the Embedding Layer fuses the Reference durations with the edit Mask (for example, it can be fused by performing an inner product operation), and the result is then the same as the Text Embedding and Singer Embedding (the extracted sound Grain features) are accumulated. Among them, 1 FFT Block can be a Transformer block. For example, 4 (i.e. N=4) FFT blocks can be used; finally, the model predicts the duration of the phoneme corresponding to Mask=0 and compares it with other unedited phonemes. The duration is also used as output.

In one possible implementation, the predicted duration of each phoneme in the second text can be used to upsample each input in pitch feature prediction. For example, the input for pitch feature prediction can include text embeddings, above Each text embedding before sampling corresponds to a phoneme, and the text embedding after upsampling includes the number of text embeddings corresponding to the number of frames of the phoneme.

In a possible implementation, the second speech feature of the non-edited speech may also be obtained based on the non-edited speech. The second voice feature may carry at least one of the following information: some voice frames or all voice frames of the non-edited voice; voiceprint features of the non-edited voice; timbre features of the non-edited voice; The prosodic characteristics of the non-edited speech; and the rhythmic characteristics of the non-edited speech.

The speech features in the embodiments of the present application can be used to represent the characteristics of speech (such as timbre, rhythm, emotion or rhythm, etc.). The speech features can be expressed in many forms, such as speech frames, sequences, vectors, etc., specifically here No restrictions. In addition, the speech features in the embodiments of the present application may specifically be parameters extracted from the above-mentioned expression forms through the aforementioned PLP, LPC, MFCC and other methods.

Optionally, at least one speech frame is selected from the non-edited speech as the second speech feature. Furthermore, the second speech feature of the context is further combined with the first speech feature. The text corresponding to at least one speech frame may be text adjacent to the second text in the first text.

Optionally, the non-edited speech is encoded through a coding model to obtain a target sequence, and the target sequence is used as the second speech feature. Among them, the coding model can be CNN, RNN, etc., and there is no specific limit here.

In addition, the second voice feature may also carry the voiceprint feature of the original voice. Among them, the voiceprint features may be obtained directly, or the voiceprint features may be obtained by recognizing original speech, etc. On the one hand, by introducing the voiceprint feature of the original voice, the subsequently generated first voice feature also carries the voiceprint feature of the original voice, thereby improving the similarity between the target edited voice and the original voice. On the other hand, when there are multiple speakers (or users), introducing voiceprint features can improve the subsequently predicted voice features to be more similar to the voiceprints of the speakers of the original speech.

Optionally, the speech processing device can also obtain the speaker identification of the original speech, so that when there are multiple speakers, the speech corresponding to the corresponding speaker can be matched, and the similarity between the subsequent target edited speech and the original speech can be improved.

The following description only takes speech frames as speech features (or is understood as obtaining speech features based on speech frames) as an example. Illustratively, continuing the above example, at least one frame from the 1st to 4th frame and the 9th to 16th frame in the original speech is selected as the second speech feature.

For example, the second speech feature is a Mel spectrum feature.

In a possible implementation, the second speech feature can be expressed in the form of a vector. In a possible implementation, the predicted duration of each phoneme in the second text can be used to perform each input in pitch feature prediction. For upsampling, for example, the input for pitch feature prediction may include the second speech feature, each vector before upsampling corresponds to a phoneme, and the text embedding after upsampling includes a vector corresponding to the number of frames of the phoneme.

In a possible implementation, the second pitch feature of the second text can be predicted based on the first pitch feature of the non-edited voice and the information of the target text.

In a possible implementation, the first pitch (pitch) feature of the non-edited speech can be obtained through an existing pitch extraction algorithm, which is not limited by this application.

In a possible implementation, the neural grid can be used to predict the target text according to the first pitch feature of the non-edited voice, the information of the target text, and the second voice feature of the non-edited voice. Describe the second pitch characteristics of the second text.

Next, we will introduce how to predict the second pitch feature of the second text based on the first pitch feature of the non-edited speech and the information of the target text:

In a possible implementation, the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by deleting the first part of the first text. Text, the second text is text adjacent to the first part of the text; the first pitch feature of the non-edited voice and the information of the target text can be fused to obtain the first fusion Result: Input the first fusion result to the second neural network to obtain the second pitch feature of the second text.

For insertion and deletion operations: Use the model shown in Figure 7c to predict the frame-level pitch features of the target edited phoneme. For the pitch prediction model for insertion and deletion operations, the model structure can be consistent or similar to the setting in Figure 7b. The only difference is that the input at this time is at the frame level (the input in Figure 7b is at the phoneme level) extracted from real singing. The pitch value (the duration information is entered in Figure 7b), where the pitch of the area to be edited is marked by a dotted box, and its corresponding Edit Mask mark is set to 0.

In a possible implementation, the target text is obtained by replacing the second part of the first text with the second text; the first pitch of the non-edited voice may be The features are input to the third neural network to obtain the initial pitch feature, and the first initial pitch feature includes the pitch of each frame in multiple frames; the information of the target text is input to the fourth neural network to obtain the The pronunciation features of the second text are used to indicate whether each of the multiple frames included in the initial pitch feature is pronunciated; the initial pitch features and the pronunciation features are fused to obtain the Describe the second pitch characteristics of the second text.

For the replacement operation (the replacement operation here only means that the number of words in the new edited text is consistent with the number of words in the replaced text. If they are not consistent, the replacement operation will be decomposed into two editing operations: first deletion and then insertion). Since the replaced text may have a big difference in pronunciation, in order to ensure the coherence of the singing before and after the replacement, the model shown in Figure 7d is used to predict the new pitch:

Pitch prediction model for replacement operations. Frame-level voiced/unvoiced (U/UV) prediction can be introduced to help pitch prediction. For example, the design of the V/UV Predictor and F0Predictor modules can refer to the F0predictor in Fastspeech2.

In a possible implementation, the input first pitch (pitch) feature may include the pitch feature of each frame in the multiple frames of the non-edited speech; correspondingly, the output second pitch feature The high features may include pitch features of each of the plurality of frames of the target edited speech.

Step 703: According to the second pitch feature and the second text, obtain the first speech feature corresponding to the second text through a neural network.

In a possible implementation, the second pitch feature and the second text (for example, the text embedding of the second text) can be fused (for example, added), and the fusion result can be input into the neural network to obtain The first speech feature corresponding to the second text. The first speech feature corresponding to the second text may be a Mel spectrum feature.

In a possible implementation, the description of the second voice feature may be based on the first pitch feature of the non-edited voice, the information of the target text, and the second voice feature of the non-edited voice. Reference may be made to the description of the second speech feature in the above embodiment, which will not be described again here.

In a possible implementation, after obtaining the second speech feature, the first speech feature corresponding to the second text can be obtained through a neural network based on the second speech feature and the second text. The neural network may include an encoder and a decoder. The second text is input into the encoder to obtain a first vector corresponding to the second text, and then the first vector is decoded by the decoder based on the second speech feature to obtain the first speech feature. Among them, the second speech feature can be the same as or similar to the first speech feature in terms of rhythm, timbre, and/or signal-to-noise ratio. Rhythm can reflect the speaker's emotional state or speech form. Rhythm generally refers to intonation, pitch, and emphasis. , pauses or rhythm characteristics.

Optionally, an attention mechanism can be introduced between the encoder and the decoder to adjust the quantitative correspondence between the input and the output.

Optionally, the target text where the second text is located can be introduced during the coding process of the encoder, so that the generated first vector of the second text refers to the target text, so that the second text described by the first vector is more accurate. That is, the first speech feature corresponding to the second text can be obtained through the neural network based on the second speech feature, the target text, and the mark information. Specifically, the target text and mark information may be input into an encoder to obtain a first vector corresponding to the second text, and then the first vector may be decoded by a decoder based on the second speech feature to obtain the first speech feature. The marking information is used to mark the second text in the target text.

The decoder in the embodiment of the present application may be a unidirectional decoder or a bidirectional decoder, which are described respectively below.

First, the decoder is a one-way decoder.

The decoder calculates the first vector or the speech frame obtained by the second vector from the first direction of the target text as the first speech feature based on the second speech feature. The first direction is a direction from one side of the target text to the other side of the target text. In addition, the first direction can be understood as the forward or reverse order of the target text (for related descriptions, please refer to the description of the forward or reverse order in the embodiment shown in FIG. 5).

Optionally, the second speech feature and the first vector are input into the decoder to obtain the first speech feature. Or the second speech feature and the second vector are input into the decoder to obtain the first speech feature.

Second, if the second text is in the middle area of the target text, the decoder can be a bidirectional decoder (it can also be understood that the encoder includes a first encoder and a second encoder).

The above-mentioned second text is in the middle area of the target text, which can be understood to mean that the second text is not at both ends of the target text.

There are many situations of bidirectional decoders in the embodiments of this application, which are described below:

1. The first speech feature output by the bidirectional decoder from the first direction is the speech feature corresponding to the second text, and the fourth speech feature output by the bidirectional decoder from the second direction is the speech feature corresponding to the second text.

In this case, it can be understood that the complete phonetic features corresponding to the two second texts can be obtained through the left and right sides (ie, forward and reverse order), and the first phonetic features can be obtained based on the two phonetic features.

The first decoder calculates a first vector or a second vector from the first direction of the target text based on the second speech feature to obtain the first speech feature of the second text (hereinafter referred to as LR). The second decoder calculates a first vector or a second vector from the second direction of the target text based on the second speech feature to obtain a fourth speech feature of the second text (hereinafter referred to as RL). and generate the first voice feature according to the first voice feature and the fourth voice feature. Wherein, the first direction is the direction from one side of the target text to the other side of the target text, and the second direction is opposite to the first direction (or understood as the second direction is from the other side of the target text to one side of the target text). side direction). The first direction may be the above-mentioned forward sequence, and the second direction may be the above-mentioned reverse sequence.

For a bidirectional decoder, when the first encoder decodes the first frame of the first vector or the second vector in the first direction, the non-edited speech may be adjacent to the side of the second text (which may also be called the left side). The speech frame is decoded as a condition to obtain N frames LR. When the second encoder decodes the first vector or the first frame of the second vector in the second direction, the speech frame adjacent to the other side (also called the right side) of the second text in the non-edited speech can be used as a condition Decode to obtain N frames of RL. Optionally, the structure of the bidirectional decoder can be referred to Figure 11. After obtaining N frames of LR and N frames of RL, the frame with the difference between LR and RL less than the threshold can be used as a transition frame (position is m, m<n,), or the frame with the smallest difference between LR and RL can be used as a transition frame . Then the N frames of the first speech feature may include the first m frames in LR and the last n-m frames in RL, or the N frames of the first speech feature may include the first n-m frames in LR and the last m frames in RL. Among them, the difference between LR and RL can be understood as the distance between vectors. In addition, if the speaker identification is obtained in the aforementioned step 701, the first vector or the second vector in this step may also include a third vector used to identify the speaker. It can also be understood that the third vector is used to identify the voiceprint characteristics of the original speech.

Illustratively, continuing the above example, assume that the first encoder obtains the LR frames corresponding to "Guangzhou" including LR ₁ , LR ₂ , LR ₃ , and LR ₄ . The second encoder obtains the RL frames corresponding to "Guangzhou" including RL ₁ , RL ₂ , RL ₃ , and RL ₄ . And the difference between LR ₂ and RL ₂ is the smallest, then LR ₁ , LR ₂ , RL ₃ , RL ₄ or LR ₁ , RL ₂ , RL ₃ , RL ₄ are used as the first speech features.

2. The first speech feature output by the bidirectional decoder from the first direction is the speech feature corresponding to the third text in the second text, and the fourth speech feature output by the bidirectional decoder from the second direction is the speech feature corresponding to the fourth text in the second text. voice characteristics.

In this case, it can be understood that the partial speech features corresponding to the second text can be obtained through the left and right sides (ie, forward and reverse order), and the complete first speech features can be obtained based on the two partial speech features. That is, one part of the phonetic features is taken from the forward direction, another part of the phonetic features is taken from the reverse direction, and one part of the phonetic features and another part of the phonetic features are spliced to obtain the overall phonetic features.

Illustratively, continuing the above example, assume that the first encoder obtains the LR frame corresponding to the third text ("Guang") including LR ₁ and LR ₂ . The second encoder obtains the RL frame corresponding to the fourth text ("state") including RL ₃ and RL ₄ . Then the first speech feature is obtained by splicing LR ₁ , LR ₂ , RL ₃ and RL ₄ .

It can be understood that the above two methods are only examples. In practical applications, there are other methods to obtain the first speech feature, which are not limited here.

Step 704: Generate a target editing voice corresponding to the second text according to the first voice feature.

In a possible implementation, after obtaining the first speech feature, the first speech feature can be converted into a target edited voice corresponding to the second text according to the vocoder. Among them, the vocoder can be a traditional vocoder (such as Griffin-lim algorithm), or a neural network vocoder (such as Melgan or Hifigan pre-trained using audio training data), etc. The details will not be discussed here. limited.

Illustratively, continuing the above example, the target editing voice corresponding to "Guangzhou" is shown in Figure 12.

Step 705: Obtain the position of the second text in the target text. This step is optional.

Optionally, if what is obtained in step 701 is the original speech and the second text, the position of the second text in the target text is obtained.

Optionally, if the target text has been obtained in step 701, the starting and ending positions of each phoneme in the original text in the original speech can be determined by aligning the original speech and the original text through the alignment technology in step 701. And determine the position of the second text in the target text based on the starting and ending positions of each phoneme.

Step 706: Splice the target edited voice and the non-edited voice based on the position to generate a target voice corresponding to the target text. This step is optional.

The position in the embodiment of this application is used to splice the non-edited speech and the target edited speech. The position can be the position of the second text in the target text, the position of the first text in the target text, or the non-edited speech. The position in the original speech can also be the position of the edited speech in the original speech.

Optionally, after obtaining the position of the second text in the target text, the original speech and the original text can be aligned using the alignment technology in step 701 to determine the starting and ending positions of each phoneme in the original text in the original speech. And based on the position of the first text in the original text, the position of the non-edited speech or the edited speech in the original speech is determined. Then, the speech processing device splices the target edited speech and the non-edited speech based on the position to obtain the target speech. That is, the target speech corresponding to the second text is replaced with the editing area in the original speech to obtain the target speech.

Illustratively, continuing the above example, the non-edited speech is equivalent to frames 1 to 4 and frames 9 to 16 in the original speech. The target editing voices are LR ₁ , LR ₂ , RL ₃ , RL ₄ or LR ₁ , RL ₂ , RL ₃ , RL ₄ . Splicing the target edited speech and the non-edited speech can be understood as replacing the 5th to 8th frames in the original speech with the four obtained frames, thereby obtaining the target speech. That is, the voice corresponding to "Guangzhou" is replaced with the voice corresponding to "Shenzhen" in the original voice, and then the target text is obtained: the target voice corresponding to "The weather in Guangzhou is very good today". The target speech corresponding to "The weather in Guangzhou is very good today" is shown in Figure 12.

Optionally, after acquiring the target editing voice or the target voice, the voice processing device plays the target editing voice or the target voice.

In one possible implementation manner, the speech processing method provided by the embodiment of the present application includes steps 701 to 704. In another possible implementation manner, the speech processing method provided by the embodiment of the present application includes steps 701 to 705. In another possible implementation manner, the speech processing method provided by the embodiment of the present application includes steps 701 to 706. In addition, the various steps shown in Figure 7a in the embodiment of the present application do not limit the timing relationship. For example, step 705 in the above method can also be performed after step 704, or before step 701, or can be executed together with step 701.

An embodiment of the present application provides a speech processing method. The method includes: obtaining original speech and a second text. The second text is a text other than the first text in the target text. The target text is the same as the original speech. The original texts corresponding to the speech all include the first text, and the speech corresponding to the first text in the original speech is non-edited language. sound; predicting the second pitch characteristic of the second text according to the first pitch characteristic of the non-edited speech and the information of the target text; predicting the second pitch characteristic of the second text according to the second pitch characteristic and the third For the second text, obtain the first voice feature corresponding to the second text through a neural network; and generate the target editing voice corresponding to the second text based on the first voice feature. This application predicts the pitch characteristics of the second text (text to be edited), generates the first speech characteristics of the second text based on the pitch characteristics, and generates the target editing speech corresponding to the second text based on the first speech characteristics, so that the singing voice can be edited before and after The pitch characteristics of the target edited voice are similar to that of the original voice, so that the listening experience of the target edited voice is similar to that of the original voice.

Next, we will introduce the speech processing method in the embodiment of this application with a schematic:

Take the scene of singing voice editing as an example, take the original singing voice W to be edited (the voice content is S "Love can be right or wrong"), and the following three different target voices as examples:

Editing request Q1: Its target voice is W1 (the corresponding text T1 of the voice content is "How can love not ask whether it is right or wrong"),

Editing request Q2: Its target voice is W2 (the corresponding text T2 of the voice content is "Love does not ask whether it is right or wrong"),

Editing request Q3: Its target voice is W3 (the corresponding text T2 of the voice content is "How can love not ask whether it is right or wrong");

Step S1: Receive the user’s “voice editing” request;

The request at least includes the original voice to be edited W, the original lyric text S, the target text T (T1 or T2 or T3) and other data. The pre-operation includes: comparing the original text S and the target text to determine the editing type of the current editing request: that is, for Q1, Q2 and Q3 can be determined to be insertion, deletion and replacement operations respectively; extract the audio features and pitch features of each frame from W; W extracts Singer embedding through the voiceprint model; convert S and target text T* The representation form of phonemes, such as T2, its phoneme sequence is [ai4 b u2 w en4d ui4c cuo4]; according to W and S, extract the duration (i.e., the number of frames) corresponding to each phoneme in S; determine the Mask area according to the operation type , for Q1, it is an insertion operation (inserting the word "how"), then its target Mask phoneme is the phoneme corresponding to "how", that is, the final target text phoneme of Q1 is [ai4 z en3 m e5 k e2 y i3 b u2 w en4 d ui4 c cuo4]; (red indicates the phoneme being Masked), for Q2, it is a deletion operation (delete the word "can"), then its target phoneme is the phoneme of the original word adjacent to "can" in S; That is, the final target text phoneme of Q2 is [ai4 b u2 w en4 d ui4 c cuo4]; (red indicates the phoneme being Masked); for Q3, it is a replacement operation (replacing "can" with "how"), so The target text phoneme is [ai4 z en3 me e5 b u2 w en4 d ui4 c cuo4]; (red indicates the phoneme being Masked);

Step S2: The target text phonemes obtained in S1 are used by the text encoding module to generate text features, that is, Phoneme-level Text Embedding;

Step S3: Predict the duration information of each phoneme in the target text through the duration regularization module; this step can be completed through the following sub-steps:

Generate a Mask vector and a reference duration vector according to the Mask tag of the phoneme: that is, for non-Mask phonemes, the reference duration is the real duration extracted in step S1, otherwise it is set to 0; for non-Mask phonemes, the corresponding position in the Mask vector is set to 1, otherwise set to 0;

Taking the Text Embedding, Singer Embedding reference duration vector and Mask vector as input, use the duration prediction module as shown in Figure 2-2 to predict the duration corresponding to the Mask phoneme;

According to the duration corresponding to each phoneme, the Embedding of each phoneme is upsampled (that is, if the duration of phoneme A is 10, then the Embedding of A is copied 10 times), thereby generating Frame-level Text Embedding;

Step S4: Predict the pitch value of each frame through the pitch prediction module. This step can be completed through the following sub-steps:

For Q1 and Q2, use the model shown in Figure 2-3 to predict the pitch of the frame corresponding to the Mask phoneme:

Among them, for non-Mask phonemes, the reference pitch is the real pitch extracted in S1, and its corresponding position in the Mask vector is marked as 1; for Mask phonemes, the pitch on the corresponding frame is set to 0, and the Mask is set to 0; Predict the Frame-level pitch corresponding to the Mask phoneme.

For the replacement operation Q3, the model shown in Figure 2-4 is used to predict the Frame-level Pitch of the Mask phoneme;

Step S5: Add Frame-Level text Embedding and Pitch together and input them into the audio feature decoding module to predict the audio feature frame corresponding to the new Mask phoneme.

It should be understood that if an editing request involves multiple editing operations, the editing can be performed one by one using the process described above in a processing order from left to right. On the other hand, a replacement operation can also be implemented by two operations: "delete first and then insert".

The voice processing method implemented by the terminal device or the cloud device alone is described above, and the voice processing method implemented by the terminal device and the cloud device jointly is described below.

Embodiment 2: The terminal device and the cloud device jointly execute the voice processing method.

Please refer to Figure 13, which is an example of a voice processing method provided by the embodiment of the present application. This method can be executed jointly by the terminal device and the cloud device, or can be performed by components of the terminal device (such as a processor, a chip, or a chip system, etc.) and The components of the cloud device (such as a processor, a chip, or a chip system, etc.) execute. This embodiment includes steps 1301 to 1306.

Step 1301: The terminal device obtains the original voice and the second text.

Step 1301 performed by the terminal device in this embodiment is similar to step 701 performed by the voice processing device in the embodiment shown in Figure 7a, and will not be described again here.

Step 1302: The terminal device sends the original voice and the second text to the cloud device.

After the terminal device obtains the original voice and the second text, it can send the original voice and the second text to the cloud device.

Optionally, if in step 1301, the terminal device obtains the original voice and the target text, the terminal device sends the original voice and the target text to the cloud device.

Step 1303: The cloud device obtains the non-edited voice based on the original voice and the second text.

Step 1303 performed by the cloud device in this embodiment is similar to the description of determining non-edited voice in step 701 performed by the speech processing device in the embodiment shown in Figure 7a, and will not be described again here.

Step 1304: The cloud device obtains the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text.

Step 1303 performed by the cloud device in this embodiment is similar to the description of determining non-edited voice in step 702 performed by the speech processing device in the embodiment shown in Figure 7a, and will not be described again here.

Step 1305: The cloud device obtains the first speech feature corresponding to the second text through a neural network based on the second pitch feature and the second text.

Step 1306: The cloud device generates a target editing voice corresponding to the second text based on the first voice feature.

Steps 1304 to 1306 performed by the cloud device in this embodiment are similar to steps 702 to 704 performed by the voice processing device in the embodiment shown in Figure 7a, and will not be described again here.

Step 1307: The cloud device sends the target editing voice to the terminal device. This step is optional.

Optionally, after the cloud device obtains the target editing voice, it can send the target editing voice to the terminal device.

Step 1308: The terminal device or cloud device obtains the position of the second text in the target text. This step is optional.

Step 1309: The terminal device or the cloud device splices the target edited voice and the non-edited voice based on the location to generate a target voice corresponding to the target text. This step is optional. This step is optional.

Steps 1308 and 1309 in this embodiment are similar to steps 705 to 706 performed by the speech processing device in the embodiment shown in FIG. 7a, and will not be described again here. Steps 1308 and 1309 in this embodiment can be executed by a terminal device or a cloud device.

Step 1310: The cloud device sends the target voice to the terminal device. This step is optional.

Optionally, if steps 1308 and 1309 are executed by the cloud device, then after acquiring the target voice, the cloud device sends the target voice to the terminal device. If steps 1308 and 1309 are executed by the terminal device, this step may not be executed.

Optionally, after acquiring the target editing voice or target voice, the terminal device plays the target editing voice or target voice.

In a possible implementation manner, the voice processing method provided by the embodiment of the present application may include: the cloud device generates the target editing voice and sends the target editing voice to the terminal device, that is, the method includes steps 1301 to 1307. In another possible implementation manner, the speech processing method provided by the embodiment of the present application may include: the cloud device generates the target edited voice, generates the target voice based on the target edited voice and the non-edited voice, and sends the target voice to the terminal device. That is, the method includes steps 1301 to 1306, and steps 1308 to 1310. In another possible implementation manner, the voice processing method provided by the embodiment of the present application may include: the cloud device generates the target editing voice, and sends the target editing voice to the terminal device. The terminal device generates the target voice based on the target edited voice and the non-edited voice. That is, the method includes steps 1301 to 1309.

In the embodiments of this application, on the one hand, through the interaction between the cloud device and the terminal device, the cloud device performs complex calculations to obtain the target edited voice or the target voice and returns it to the terminal device, which can reduce the computing power and storage space of the terminal device. On the other hand, a target edited voice corresponding to the modified text can be generated based on the voice characteristics of the non-edited area in the original voice, and then a target voice corresponding to the target text can be generated from the non-edited voice. On the other hand, the user can modify the text in the original text to obtain the target editing voice corresponding to the modified text (ie, the second text). Improve users' editing experience based on text-based voice editing. On the other hand, the non-edited speech is not modified when the target speech is generated, and the pitch characteristics of the target edited speech are similar to those of the non-edited speech, making it difficult for users to hear the original speech when listening to the original speech and the target speech. Differences in phonetic characteristics from the target voice.

The speech processing method in the embodiment of the present application is described above, and the speech processing device in the embodiment of the present application is described below. Please refer to Figure 14. An embodiment of the speech processing device in the embodiment of the present application includes:

Obtaining module 1401 is used to obtain the original speech and the second text. The second text is the text in the target text except the first text. The target text and the original text corresponding to the original speech both include the first text. Text, the voice corresponding to the first text in the original voice is a non-edited voice;

For a specific description of the acquisition module 1401, reference may be made to the description of step 701 in the above embodiment, which will not be described again here.

Pitch prediction module 1402, configured to use the first pitch (pitch) feature of the non-edited speech and the target Text information, predicting the second pitch feature of the second text;

For a detailed description of the pitch prediction module 1402, reference may be made to the description of step 702 in the above embodiment, and will not be described again here.

Generating module 1403, configured to obtain the first speech feature corresponding to the second text through a neural network according to the second pitch feature and the second text;

For detailed description of the generation module 1403, reference may be made to the description of steps 703 and 704 in the above embodiment, which will not be described again here.

Partial speech frames or all speech frames of the non-edited speech;

The voiceprint characteristics of the non-edited speech;

The timbre characteristics of the non-edited voice;

The prosodic characteristics of the unedited speech; and,

The rhythmic characteristics of the non-edited speech.

The pitch prediction module is specifically used for:

In a possible implementation, the device further includes:

A duration prediction module used to predict the number of frames of each phoneme in the non-edited speech and the information of the target text, Predict the number of frames for each phoneme in the second text.

In a possible implementation, the acquisition module is also used to:

Obtain the position of the second text in the target text;

Please refer to Figure 15. This embodiment of the present application provides another voice processing device. For convenience of explanation, only the parts related to the embodiment of the present application are shown. For specific technical details that are not disclosed, please refer to the method part of the embodiment of the present application. . The voice processing device can be any terminal device including a mobile phone, tablet computer, personal digital assistant (PDA), point of sales (POS), vehicle-mounted computer, etc. Taking the voice processing device as a mobile phone as an example:

FIG. 15 shows a block diagram of a partial structure of a mobile phone related to the voice processing device provided by an embodiment of the present application. Referring to Figure 15, the mobile phone includes: radio frequency (RF) circuit 1510, memory 1520, input unit 1530, display unit 1540, sensor 1550, audio circuit 1560, wireless fidelity (WiFi) module 1570, processor 1580 , and power supply 1590 and other components. Those skilled in the art can understand that the structure of the mobile phone shown in FIG. 15 does not constitute a limitation on the mobile phone, and may include more or fewer components than shown in the figure, or combine certain components, or arrange different components.

The following is a detailed introduction to each component of the mobile phone in conjunction with Figure 15:

The RF circuit 1510 can be used to receive and transmit information or signals during a call. In particular, after receiving downlink information from the base station, it is processed by the processor 1580; in addition, the designed uplink data is sent to the base station. Typically, the RF circuit 1510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, etc. Additionally, RF circuitry 1510 can communicate with networks and other devices through wireless communications. The above wireless communication can use any communication standard or protocol, including but not limited to global system of mobile communication (GSM), general packet radio service (GPRS), code division multiple access (code division) multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short messaging service (SMS), etc.

The memory 1520 can be used to store software programs and modules. The processor 1580 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1520 . The memory 1520 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the storage data area may store data based on Where to use your mobile phone Created data (such as audio data, phone book, etc.), etc. In addition, memory 1520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

The input unit 1530 may be used to receive input numeric or character information, and generate key signal input related to user settings and function control of the mobile phone. Specifically, the input unit 1530 may include a touch panel 1531 and other input devices 1532. The touch panel 1531 , also known as a touch screen, can collect the user's touch operations on or near the touch panel 1531 (for example, the user uses a finger, stylus, or any suitable object or accessory on or near the touch panel 1531 operation), and drive the corresponding connection device according to the preset program. Optionally, the touch panel 1531 may include two parts: a touch detection device and a touch controller. Among them, the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact point coordinates, and then sends it to the touch controller. to the processor 1580, and can receive commands sent by the processor 1580 and execute them. In addition, the touch panel 1531 can be implemented using various types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 1531, the input unit 1530 may also include other input devices 1532. Specifically, other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), trackball, mouse, joystick, etc.

The display unit 1540 may be used to display information input by the user or information provided to the user as well as various menus of the mobile phone. The display unit 1540 may include a display panel 1541. Optionally, the display panel 1541 may be configured in the form of a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), etc. Further, the touch panel 1531 can cover the display panel 1541. When the touch panel 1531 detects a touch operation on or near it, it is sent to the processor 1580 to determine the type of the touch event. The processor 1580 then determines the type of the touch event. Type provides corresponding visual output on display panel 1541. Although in Figure 15, the touch panel 1531 and the display panel 1541 are used as two independent components to implement the input and input functions of the mobile phone, in some embodiments, the touch panel 1531 and the display panel 1541 can be integrated. Realize the input and output functions of mobile phone.

The phone may also include at least one sensor 1550, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor may adjust the brightness of the display panel 1541 according to the brightness of the ambient light. The proximity sensor may close the display panel 1541 and/or when the mobile phone is moved to the ear. or backlight. As a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes). It can detect the magnitude and direction of gravity when stationary. It can be used to identify applications of mobile phone posture (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration), vibration recognition related functions (such as pedometer, tapping), etc.; as for other sensors such as gyroscope, barometer, hygrometer, thermometer, infrared sensor, etc. that can be configured on the mobile phone, we will not mention them here. Repeat.

The audio circuit 1560, speaker 1561, and microphone 1562 can provide an audio interface between the user and the mobile phone. The audio circuit 1560 can transmit the electrical signal converted from the received audio data to the speaker 1561, and the speaker 1561 converts it into a sound signal for output; on the other hand, the microphone 1562 converts the collected sound signal into an electrical signal, and the audio circuit 1560 After receiving, it is converted into audio data, and then processed by the audio data output processor 1580, and then sent to, for example, another mobile phone through the RF circuit 1510, or the audio data is output to the memory 1520 for further processing.

WiFi is a short-distance wireless transmission technology. Through the WiFi module 1570, mobile phones can help users send and receive emails, Browsing the web and accessing streaming media, etc., it provides users with wireless broadband Internet access. Although FIG. 15 shows the WiFi module 1570, it can be understood that it is not a necessary component of the mobile phone.

The processor 1580 is the control center of the mobile phone, using various interfaces and lines to connect various parts of the entire mobile phone, by running or executing software programs and/or modules stored in the memory 1520, and calling data stored in the memory 1520 to execute Various functions of the mobile phone and processing data, thereby conducting overall monitoring of the mobile phone. Optionally, the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor and a modem processor, where the application processor mainly processes operating systems, user interfaces, application programs, etc. , the modem processor mainly handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 1580.

The mobile phone also includes a power supply 1590 (such as a battery) that supplies power to various components. Preferably, the power supply can be logically connected to the processor 1580 through a power management system, so that functions such as charging, discharging, and power consumption management can be implemented through the power management system.

Although not shown, the mobile phone may also include a camera, a Bluetooth module, etc., which will not be described in detail here.

In this embodiment of the present application, the processor 1580 included in the terminal device can perform the functions of the voice processing device in the embodiment shown in Figure 7a, or the functions of the terminal device in the embodiment shown in Figure 13, which will not be described again here. .

Refer to Figure 16, which is a schematic structural diagram of another voice processing device provided by this application. The voice processing device may be a cloud device. The cloud device may include a processor 1601, a memory 1602, and a communication interface 1603. The processor 1601, memory 1602 and communication interface 1603 are interconnected through lines. Among them, the memory 1602 stores program instructions and data.

The memory 1602 stores program instructions and data corresponding to the steps executed by the speech processing device in the aforementioned embodiment corresponding to FIG. 7a. Or the program instructions and data corresponding to the steps executed by the cloud device in the aforementioned embodiment corresponding to Figure 13 are stored.

The processor 1601 is configured to perform the steps performed by the speech processing device shown in any of the embodiments shown in FIG. 7a. Or used to perform the steps performed by the cloud device shown in any of the embodiments shown in FIG. 13 .

The communication interface 1603 can be used to receive and send data, and to perform steps related to obtaining, sending, and receiving in any of the embodiments shown in FIG. 7a or FIG. 13 .

In one implementation, the cloud device may include more or fewer components than in Figure 16 , which is only an illustrative description in this application and is not limiting.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated. to another system, or some features can be ignored, or not implemented. On the other hand, the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit. The above integrated unit can Implemented in whole or in part by software, hardware, firmware, or any combination thereof.

When the integrated unit is implemented using software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another, e.g., the computer instructions may be transferred from a website, computer, server, or data center Transmission to another website, computer, server or data center through wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more available media integrated. The available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (SSD)), etc.

The terms "first", "second", etc. in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances, and are merely a way of distinguishing objects with the same attributes in describing the embodiments of the present application. Furthermore, the terms "include" and "having" and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, product or apparatus comprising a series of elements need not be limited to those elements, but may include not explicitly other elements specifically listed or inherent to such processes, methods, products or equipment.

Claims

A speech processing method, characterized in that the method includes:

Obtain the original speech and the second text, the second text is the text in the target text except the first text, the target text and the original text corresponding to the original speech both include the first text, the first text The voice corresponding to the text in the original voice is non-edited voice;

Predict the second pitch feature of the second text based on the first pitch feature of the non-edited speech and the information of the target text;

According to the second pitch feature and the second text, obtain the first speech feature corresponding to the second text through a neural network;

According to the first voice characteristics, a target editing voice corresponding to the second text is generated.
The method according to claim 1, characterized in that the content of the original voice is the user's singing voice.
The method according to any one of claims 1 or 2, characterized in that the first pitch (pitch) feature of the non-edited voice and the second text include:

According to the first pitch feature of the non-edited voice, the information of the target text and the second voice feature of the non-edited voice; the second voice feature carries at least one of the following information:

Partial speech frames or all speech frames of the non-edited speech;

The voiceprint characteristics of the non-edited speech;

The timbre characteristics of the non-edited voice;

The prosodic characteristics of the unedited speech; and,

The rhythmic characteristics of the non-edited speech.
The method according to any one of claims 1 to 3, characterized in that the information of the target text includes:

Text embedding of each phoneme in the target text.
The method according to any one of claims 1 to 4, wherein the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by inserting the third text into the first text. A text obtained by deleting the first part of a text, and the second text is a text adjacent to the first part of the text;

Predicting the second pitch feature of the second text based on the first pitch feature of the non-edited speech and the information of the target text includes:

Fusion of the first pitch feature of the non-edited speech and the information of the target text to obtain a first fusion result;

The first fusion result is input into the second neural network to obtain the second pitch feature of the second text.
The method according to any one of claims 1 to 5, characterized in that the target text is obtained by replacing the second part of the text in the first text with the second text;

Predicting the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text includes:

Input the first pitch (pitch) feature of the non-edited speech into a third neural network to obtain an initial pitch feature, where the first initial pitch feature includes the pitch of each frame in a plurality of frames;

Input the information of the target text into a fourth neural network to obtain the pronunciation feature of the second text, where the pronunciation feature is used to indicate whether each of the multiple frames included in the initial pitch feature is pronunciated;

The initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
The method according to any one of claims 1 to 6, characterized in that the method further includes:

According to the frame number of each phoneme in the non-edited speech and the information of the target text, the frame number of each phoneme in the second text is predicted.
The method according to any one of claims 1 to 7, characterized in that the first pitch (pitch) feature includes: the pitch feature of each frame in the multiple frames of the non-edited speech;

The second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited speech.
The method according to claim 7 or 8, characterized in that the information based on the frame number of each phoneme in the non-edited speech and the target text includes:

According to the frame number of each phoneme in the non-edited speech, the information of the target text and the second speech feature of the non-edited speech.
The method according to any one of claims 1 to 9, characterized in that the method further includes:

Obtain the position of the second text in the target text;

The target edited voice and the non-edited voice are spliced based on the position to obtain the target voice corresponding to the target text.
A speech processing device, characterized in that the device includes:

Acquisition module, used to obtain the original speech and the second text. The second text is the text in the target text except the first text. The target text and the original text corresponding to the original speech both include the first text. , the voice corresponding to the first text in the original voice is non-edited voice;

A pitch prediction module, configured to predict the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text;

A generation module, configured to obtain the first speech feature corresponding to the second text through a neural network based on the second pitch feature and the second text;

According to the first voice characteristics, a target editing voice corresponding to the second text is generated.
The device according to claim 11, characterized in that the content of the original voice is the user's singing voice.
The device according to claim 11 or 12, characterized in that the first pitch (pitch) feature according to the non-edited voice and the second text include:

According to the first pitch feature of the non-edited voice, the information of the target text and the second voice feature of the non-edited voice; the second voice feature carries at least one of the following information:

Partial speech frames or all speech frames of the non-edited speech;

The voiceprint characteristics of the non-edited speech;

The timbre characteristics of the non-edited voice;

The prosodic characteristics of the unedited speech; and,

The rhythmic characteristics of the non-edited speech.
The device according to any one of claims 11 to 13, characterized in that the information of the target text includes: text embedding of each phoneme in the target text.
The device according to any one of claims 11 to 14, wherein the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by inserting the second text into the first text. A text obtained by deleting the first part of a text, and the second text is a text adjacent to the first part of the text;

The pitch prediction module is specifically used for:

Fusion of the first pitch feature of the non-edited speech and the information of the target text to obtain a first fusion result;

The first fusion result is input into the second neural network to obtain the second pitch feature of the second text.
The device according to any one of claims 11 to 14, wherein the target text is obtained by replacing the second part of the text in the first text with the second text;

The pitch prediction module is specifically used for:

Input the first pitch (pitch) feature of the non-edited speech into a third neural network to obtain an initial pitch feature, where the first initial pitch feature includes the pitch of each frame in a plurality of frames;

Input the information of the target text into a fourth neural network to obtain the pronunciation feature of the second text, where the pronunciation feature is used to indicate whether each of the multiple frames included in the initial pitch feature is pronunciated;

The initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
The device according to any one of claims 11 to 16, characterized in that the device further includes:

A duration prediction module, configured to predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text.
The device according to any one of claims 11 to 17, wherein the first pitch (pitch) feature includes: the pitch feature of each frame in the plurality of frames of the non-edited speech;

The second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited speech.
The device according to claim 17 or 18, characterized in that the duration prediction module is specifically used for:

According to the frame number of each phoneme in the non-edited speech, the information of the target text and the second speech feature of the non-edited speech.
The device according to any one of claims 11 to 19, characterized in that the acquisition module is also used to:

Obtain the position of the second text in the target text;

The generating module is further configured to splice the target edited voice and the non-edited voice based on the position to obtain a target voice corresponding to the target text.
A speech processing device, characterized in that it includes: a processor, the processor is coupled to a memory, the memory is used to store programs or instructions, and when the programs or instructions are executed by the processor, the The speech processing device performs the method according to any one of claims 1 to 10.
The device according to claim 21, characterized in that the device further includes:

an input unit for receiving the second text;

An output unit is used to play the target editing voice corresponding to the second text or the target voice corresponding to the target text.
A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, they cause the computer to execute as described in any one of claims 1 to 10 Methods.
A computer program product, characterized in that, when executed on a computer, the computer program product causes the computer to execute the method according to any one of claims 1 to 10.