WO2023207541A1 - Speech processing method and related device - Google Patents

Speech processing method and related device

Info

Publication number
WO2023207541A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
speech
voice
feature
target
Prior art date
Application number
PCT/CN2023/086497
Other languages
English (en)
Chinese (zh)
Inventor
邓利群
朱杰明
张立超
赵洲
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023207541A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/0335: Pitch control
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and in particular, to a speech processing method and related equipment.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that can respond in a manner similar to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • voice editing has very important practical significance. For example, in scenarios where users record songs (such as singing a cappella), some content in the voice is often wrong due to slips of the tongue. In this case, voice editing can help users quickly correct the erroneous content in the original singing voice and generate corrected voice.
  • a commonly used speech editing method is to pre-build a database containing a large number of speech segments, obtain segments of pronunciation units from the database, and use the segments to replace erroneous segments in the original speech to generate corrected speech.
  • the above-mentioned voice editing method relies on the diversity of the voice segments in the database; if the database does not contain suitable segments, the corrected voice (such as the user's corrected singing voice) will have a poor listening quality.
  • Embodiments of the present application provide a voice processing method and related equipment, which can achieve a listening experience of edited singing that is similar to that of original speech, thereby improving user experience.
  • this application provides a voice processing method, which can be applied to scenarios such as users recording short videos and teachers recording teaching voices.
  • the method may be executed by the speech processing device, or may be executed by a component of the speech processing device (such as a processor, a chip, or a chip system, etc.).
  • the speech processing device can be a terminal device or a cloud device, and the method includes: obtaining the original speech and the second text, where the second text is the text in the target text other than the first text, both the target text and the original text corresponding to the original speech include the first text, and the speech corresponding to the first text in the original speech is non-edited speech; predicting the second pitch feature of the second text according to the first pitch feature of the non-edited speech and the information of the target text; obtaining, through a neural network, the first speech feature corresponding to the second text according to the second pitch feature and the second text; and generating a target edited speech corresponding to the second text according to the first speech feature.
  • This application predicts the pitch feature of the second text (the text to be edited), generates the first speech feature of the second text based on that pitch feature, and generates the target edited speech corresponding to the second text based on the first speech feature, so that the pitch features of the speech before and after the edit are similar, and the target edited speech therefore sounds similar to the original speech.
  • the second text can be obtained directly; alternatively, position information (which can also be understood as mark information indicating the position of the second text in the target text) can be obtained first, where the position information represents the position of the second text in the target text; or the target text and the original text can be obtained (or the target text and the original speech, with the original text obtained by recognizing the original speech), and the second text is then determined based on the original text and the target text.
  • generating a target editing voice corresponding to the second text based on the second voice feature includes: generating the target editing voice through a vocoder based on the second voice feature.
  • the second speech feature is converted into the target edited speech by the vocoder, so that the target edited speech has speech features similar to those of the original speech, thereby improving the user's listening experience.
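  • For illustration only (not part of the claims), the following minimal sketch mirrors the flow described above: take the non-edited speech, predict the pitch feature of the second text, obtain the speech feature through a neural network, and run a vocoder. Every model or helper passed in is a hypothetical placeholder, not an API defined by this application.

```python
from typing import Callable, Sequence

def edit_speech(
    non_edited_speech: Sequence[float],   # speech for the first (unchanged) text
    second_text: Sequence[str],           # phonemes of the text to be edited
    target_text: Sequence[str],           # phonemes of the full target text
    extract_pitch: Callable,              # non-edited speech -> first pitch feature
    predict_pitch: Callable,              # (first pitch, target text) -> second pitch feature
    acoustic_model: Callable,             # (second pitch, second text) -> first speech feature
    vocoder: Callable,                    # speech feature -> waveform
) -> Sequence[float]:
    """Return the target edited speech for the second text (all models are placeholders)."""
    first_pitch = extract_pitch(non_edited_speech)
    second_pitch = predict_pitch(first_pitch, target_text)
    speech_feature = acoustic_model(second_pitch, second_text)
    return vocoder(speech_feature)
```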
  • the content of the original voice is the user's singing voice, which may be, for example, the voice recorded when the user sings a cappella.
  • obtaining the original voice and the second text includes: receiving the original voice and the second text sent by the terminal device; the method also includes: sending the target edited voice to the terminal device, where the target edited voice is used by the terminal device to generate the target speech corresponding to the target text. This can also be understood as an interactive scenario.
  • the cloud device performs complex calculation operations, and the terminal device performs a simple splicing operation.
  • the original voice and the second text are obtained from the terminal device; after the cloud device generates the target edited voice, it sends the target edited voice to the terminal device, and the terminal device then splices it to obtain the target voice.
  • the voice processing device when the voice processing device is a cloud device, on the one hand, through the interaction between the cloud device and the terminal device, the cloud device can perform complex calculations to obtain the target edited voice and return it to the terminal device. Reduce the computing power and storage space of the terminal device.
  • a target edited voice corresponding to the modified text can be generated based on the voice characteristics of the non-edited area in the original voice, and a target voice corresponding to the target text can then be generated from the non-edited voice and the target edited voice.
  • the above step of obtaining the original voice and the second text includes: receiving the original voice and the target text sent by the terminal device; the method further includes: generating a target voice corresponding to the target text based on the non-edited voice and the target edited voice corresponding to the second text, and sending the target voice to the terminal device.
  • after the original voice and the target text sent by the terminal device are received, the non-edited voice can be obtained, the speech feature corresponding to the second text is generated based on the speech feature of the non-edited voice, the target edited voice is then obtained through the vocoder, and the target edited voice and the non-edited voice are spliced to generate the target voice. Equivalently, the processing is done on the voice processing device and the result is returned to the terminal device.
  • the cloud device performs complex calculations to obtain the target voice and returns it to the terminal device, which can reduce the computing power and storage space of the terminal device.
  • predicting the second pitch feature of the second text based on the first pitch (pitch) feature of the non-edited speech and the information of the target text includes: predicting the second pitch feature based on the first pitch feature of the non-edited speech, the information of the target text and the second speech feature of the non-edited speech; the second speech feature carries at least one of the following information: some or all speech frames of the non-edited speech; the voiceprint feature of the non-edited speech; the timbre feature of the non-edited speech; the prosodic feature of the non-edited speech; and the rhythm feature of the non-edited speech.
  • the first speech feature can be the same or similar to the second speech feature in terms of rhythm, timbre, and/or signal-to-noise ratio.
  • Prosody can reflect the speaker's emotional state or speech form; it generally refers to intonation, pitch, stress, pauses or rhythm characteristics.
  • the second voice feature carries the voiceprint feature of the original voice.
  • the voiceprint features may be obtained directly, or the voiceprint features may be obtained by recognizing original speech, etc.
  • the subsequently generated first voice feature also carries the voiceprint feature of the original voice, thereby improving the similarity between the target edited voice and the original voice.
  • introducing voiceprint features can improve the subsequently predicted voice features to be more similar to the voiceprints of the speakers of the original speech.
  • the information of the target text includes: text embedding of each phoneme in the target text.
  • the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by deleting a first part of text from the first text, and the second text is text adjacent to the first part of text;
  • Predicting the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text includes:
  • the first fusion result is input into the second neural network to obtain the second pitch feature of the second text.
  • the target text is obtained by replacing the second part of the text in the first text with the second text;
  • Predicting the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text includes:
  • the initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
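  • As a rough illustration of the "fuse, then predict" step described above, the toy module below concatenates a pitch feature with a text embedding and feeds the result to a small network; the architecture, dimensions and data are assumptions, not taken from this application.

```python
import torch
import torch.nn as nn

class PitchPredictor(nn.Module):
    """Toy fusion-then-predict module (sizes are illustrative assumptions)."""
    def __init__(self, pitch_dim=1, text_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pitch_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),          # one pitch value per frame of the second text
        )

    def forward(self, pitch_feature, text_embedding):
        # "Fusion" here is simple concatenation along the feature axis.
        fused = torch.cat([pitch_feature, text_embedding], dim=-1)
        return self.net(fused)

pred = PitchPredictor()
pitch = torch.randn(10, 1)        # pitch context per frame (10 frames)
text = torch.randn(10, 256)       # text embedding per frame
second_pitch = pred(pitch, text)  # predicted pitch for the second text
```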
  • the method further includes:
  • the number of frames of each phoneme in the second text is predicted based on the number of frames of each phoneme in the non-edited speech and the information of the target text.
  • the first pitch (pitch) feature includes: the pitch feature of each frame in the multiple frames of the non-edited speech;
  • the second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited speech.
  • predicting the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text includes: predicting the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech, the information of the target text and the second speech feature of the non-edited speech.
  • the above steps further include: obtaining the position of the second text in the target text; and splicing the target editing voice and the non-editing voice based on the position to obtain the target voice corresponding to the target text. It can also be understood as replacing the edited voice in the original voice with the target edited voice, and the edited voice is the voice in the original voice except the non-edited voice.
  • the target editing voice and the non-editing voice can be spliced according to the position of the second text in the target text. If the first text is all overlapping text in the original text and the target text, the voice of the desired text (ie, the target text) can be generated without changing the non-edited voice in the original voice.
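  • A minimal sketch of the splicing step, assuming the waveform segments on either side of the edit position are already available; boundary smoothing (e.g. cross-fading) is omitted, and the sample values here are placeholders.

```python
import numpy as np

def splice_waveforms(prefix: np.ndarray, edited: np.ndarray, suffix: np.ndarray) -> np.ndarray:
    """Concatenate the non-edited speech before the edit point, the target edited
    speech, and the non-edited speech after the edit point."""
    return np.concatenate([prefix, edited, suffix])

sr = 16000
before = np.zeros(sr)            # 1 s of non-edited speech before the edit
after = np.zeros(sr)             # 1 s of non-edited speech after the edit
edited = np.zeros(sr // 2)       # 0.5 s target edited speech
target_speech = splice_waveforms(before, edited, after)
```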
  • the above steps further include: determining the non-edited voice based on the target text, the original text and the original voice. Specifically, this may be: determining the first text based on the target text and the original text; and determining the non-edited voice based on the first text, the original text and the original voice.
  • the non-edited voice of the first text in the original voice is determined by comparing the original text and the original voice, so as to facilitate the subsequent generation of the first voice feature.
  • the above step of determining the first text based on the target text and the original text includes: determining the overlapping text based on the target text and the original text; displaying the overlapping text to the user; and, in response to a second operation of the user, determining the first text from the overlapping text.
  • this application provides a voice processing device, which includes:
  • an acquisition module, used to obtain the original speech and the second text.
  • the second text is the text in the target text except the first text.
  • the target text and the original text corresponding to the original speech both include the first text.
  • the voice corresponding to the first text in the original voice is non-edited voice;
  • a pitch prediction module configured to predict the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text
  • a generation module, configured to obtain the first speech feature corresponding to the second text through a neural network based on the second pitch feature and the second text, and to generate a target edited voice corresponding to the second text based on the first speech feature.
  • the content of the original voice is the user's singing voice.
  • the pitch prediction module is specifically configured to predict the second pitch feature of the second text based on the first pitch (pitch) feature of the non-edited voice, the information of the target text and the second speech feature of the non-edited voice;
  • the second speech feature carries at least one of the following information: some or all speech frames of the non-edited speech; the voiceprint feature of the non-edited speech; the timbre feature of the non-edited speech; the prosodic feature of the non-edited speech; and the rhythm feature of the non-edited speech.
  • the information of the target text includes: text embedding of each phoneme in the target text.
  • the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by deleting a first part of text from the first text, and the second text is text adjacent to the first part of text;
  • the pitch prediction module is specifically used for:
  • the first fusion result is input into the second neural network to obtain the second pitch feature of the second text.
  • the target text is obtained by replacing the second part of the text in the first text with the second text;
  • the pitch prediction module is specifically used for:
  • the initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
  • the device further includes:
  • a duration prediction module configured to predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text.
  • the first pitch (pitch) feature includes: the pitch feature of each frame in the multiple frames of the non-edited speech;
  • the second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited voice.
  • the duration prediction module is specifically used to: predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech, the information of the target text and the second speech feature of the non-edited speech.
  • the acquisition module is also used to: obtain the position of the second text in the target text;
  • the generating module is further configured to splice the target edited voice and the non-edited voice based on the position to obtain a target voice corresponding to the target text.
  • a third aspect of the present application provides a voice processing device that performs the method in the foregoing first aspect or any possible implementation of the first aspect.
  • a fourth aspect of the present application provides a speech processing device, including: a processor, where the processor is coupled to a memory and the memory is used to store programs or instructions; when the programs or instructions are executed by the processor, the speech processing device implements the method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • a fifth aspect of the present application provides a computer-readable medium on which a computer program or instructions are stored; when the computer program or instructions are run on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
  • a sixth aspect of the present application provides a computer program product, which, when executed on a computer, causes the computer to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
  • Figure 1 is a schematic structural diagram of a system architecture provided by this application.
  • Figure 2 is a schematic structural diagram of a convolutional neural network provided by this application.
  • FIG. 3 is a schematic structural diagram of another convolutional neural network provided by this application.
  • FIG. 4 is a schematic diagram of the chip hardware structure provided by this application.
  • Figure 5 is a schematic flow chart of a neural network training method provided by this application.
  • Figure 6 is a schematic structural diagram of a neural network provided by this application.
  • Figure 7a is a schematic flow chart of the speech processing method provided by this application.
  • Figure 7b is a schematic diagram of duration prediction provided by this application.
  • Figure 7c is a schematic diagram of pitch prediction provided by this application.
  • Figure 7d is a schematic diagram of pitch prediction provided by this application.
  • FIGS 8 to 10 are several schematic diagrams of the display interface of the voice processing device provided by this application.
  • Figure 11 is a schematic structural diagram of a bidirectional decoder provided by this application.
  • Figure 12 is another schematic diagram of the display interface of the voice processing device provided by this application.
  • FIG. 13 is another schematic flow chart of the speech processing method provided by this application.
  • FIGS 14-16 are schematic structural diagrams of several speech processing devices provided by this application.
  • Embodiments of the present application provide a speech processing method and related equipment, which can realize that the listening feeling of edited speech is similar to that of original speech, thereby improving user experience.
  • the neural network can be composed of neural units.
  • the neural unit can refer to an arithmetic unit that takes $x_s$ and an intercept of 1 as input, and the output of the arithmetic unit can be: $h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$, where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
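  • A minimal numeric sketch of such a neural unit with a sigmoid activation (the input, weight and bias values are arbitrary illustrations):

```python
import numpy as np

def neural_unit(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """h = f(sum_s W_s * x_s + b) with a sigmoid activation f."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

print(neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.3]), b=0.2))
```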
  • A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special metric for "many" here. Dividing a DNN according to the positions of its layers, the layers inside a DNN can be divided into three categories: input layer, hidden layers and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.
  • the deep neural network may not include hidden layers, and there is no specific limitation here.
  • the work of each layer in a deep neural network can be described mathematically as $\vec{y} = \alpha(W \cdot \vec{x} + \vec{b})$. From a physical point of view, the work of each layer can be understood as completing a transformation from the input space to the output space (that is, from the row space of the matrix to its column space) through five operations on the input space (a set of input vectors). These five operations are: 1. raising/reducing the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are performed by $W \cdot \vec{x}$, operation 4 is performed by $+\vec{b}$, and operation 5 is implemented by $\alpha()$. The word "space" is used here because the object being classified is not a single thing, but a class of things.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
  • This vector W determines the spatial transformation from the input space to the output space described above; that is, the weight W of each layer controls how the space is transformed.
  • the purpose of training a deep neural network is to finally obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vectors W of many layers). Therefore, the training process of neural network is essentially to learn how to control spatial transformation, and more specifically, to learn the weight matrix.
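  • The per-layer transformation $\vec{y} = \alpha(W \cdot \vec{x} + \vec{b})$ can be illustrated with a toy two-layer forward pass; random weights stand in here for the learned weight matrices mentioned above.

```python
import numpy as np

def dense_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One fully connected layer: y = alpha(W @ x + b), with ReLU as alpha."""
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                   # input vector
h = dense_layer(x, rng.normal(size=(8, 4)), rng.normal(size=8))   # hidden layer
y = dense_layer(h, rng.normal(size=(3, 8)), rng.normal(size=3))   # output layer
print(y.shape)                                           # (3,)
```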
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolving the same trainable filter with an input image or feature map.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can be connected to only some of the neighboring layer neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as a way to extract image information independent of position. The underlying principle is that the statistical information of one part of the image is the same as that of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the same learned image information can be used for all positions on the image.
  • multiple convolution kernels can be used to extract different image information. Generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a random-sized matrix.
  • the convolution kernel can obtain reasonable weights through learning.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the separation network, recognition network, detection network, depth estimation network and other networks in the embodiments of this application can all be CNNs.
  • A recurrent neural network (RNN) is a network in which the current output for a sequence is also related to the previous outputs.
  • the specific form of expression is that the network will remember the previous information, save it in the internal state of the network, and apply it to the calculation of the current output.
  • loss function
  • objective function
  • Text to speech is a program or software system that converts text into speech.
  • a vocoder is a sound signal processing module or software that encodes acoustic features into a sound waveform.
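  • As a library-based stand-in for a vocoder (real systems typically use a trained neural vocoder), Griffin-Lim inversion of a mel spectrogram via librosa illustrates the acoustic-feature-to-waveform step; the file names here are hypothetical.

```python
import librosa
import soundfile as sf

# Hypothetical input recording; any speech file works.
y, sr = librosa.load("original.wav", sr=22050)

# Acoustic feature: a mel spectrogram (a common choice of speech feature).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

# Stand-in vocoder: Griffin-Lim based mel-to-audio inversion.
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
sf.write("reconstructed.wav", y_rec, sr)
```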
  • Pitch can also be called fundamental frequency.
  • Fundamental frequency: when a sound-emitting body vibrates and produces sound, the sound can generally be decomposed into many simple sine waves; that is, all natural sounds are basically composed of many sine waves of different frequencies. The sine wave with the lowest frequency is the fundamental tone (that is, the fundamental frequency, which can be represented by F0), while the other sine waves with higher frequencies are overtones.
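  • A short example of frame-level F0 (fundamental frequency) estimation using librosa's pYIN implementation; the file name is hypothetical and the frequency range is a generic singing-voice assumption.

```python
import librosa
import numpy as np

y, sr = librosa.load("singing.wav", sr=22050)

# Frame-level F0 estimation; unvoiced frames are returned as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
print("mean F0 (Hz):", np.nanmean(f0))
```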
  • prosody In the field of speech synthesis, prosody broadly refers to features that control functions such as intonation, pitch, emphasis, pauses, and rhythm. Prosody can reflect the speaker's emotional state or speech form, etc.
  • Phoneme It is the smallest unit of speech divided according to the natural properties of speech. It is analyzed based on the pronunciation movements in the syllable. One movement constitutes a phoneme. Phonemes are divided into two categories: vowels and consonants. For example, the Chinese syllable a (for example, one tone: ah) has only one phoneme, ai (for example, four tones: love) has two phonemes, dai (for example, one tone: stay) has three phonemes, etc.
  • Word vectors can also be called “word embeddings”, “vectorization”, “vector mapping”, “embeddings”, etc. Formally speaking, a word vector represents an object as a dense vector.
  • Speech features: convert the processed speech signal into a concise and logical representation that is more discriminative and reliable than the raw signal. After a segment of speech signal is acquired, speech features can be extracted from it; the extraction method usually extracts a multi-dimensional feature vector for each speech signal. There are many ways to parameterize a speech signal, such as perceptual linear prediction (PLP), linear predictive coding (LPC) and Mel-frequency cepstral coefficients (MFCC).
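  • For example, MFCC features can be extracted as a multi-dimensional vector per frame (file name hypothetical):

```python
import librosa

y, sr = librosa.load("speech.wav", sr=16000)

# A common parameterization: 13 Mel-frequency cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```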
  • the neural network includes an embedding layer and at least one transformer layer.
  • At least one transformer layer can be N transformer layers (N is an integer greater than 0), where each transformer layer includes, in sequence, an attention layer, a summation and normalization (add & norm) layer, a feed-forward layer, and another summation and normalization layer.
  • In the embedding layer, the current input is embedded to obtain multiple feature vectors. In the attention layer, P input vectors are obtained from the layer above the first transformer layer; taking any first input vector among the P input vectors as the center, an intermediate vector corresponding to that first input vector is obtained based on the correlation between the first input vector and each input vector within a preset attention window, and in this way the P intermediate vectors corresponding to the P input vectors are determined. In the pooling layer, the P intermediate vectors are merged into Q output vectors, and the multiple output vectors obtained by the last transformer layer are used as the feature representation of the current input.
  • the current input is embedded to obtain multiple feature vectors.
  • the embedding layer can be called the input embedding layer.
  • the current input can be text input, for example, it can be a paragraph of text or a sentence.
  • the text can be Chinese text, English text, or other language text.
  • the embedding layer can embed each word in the current input to obtain the feature vector of each word.
  • the embedding layer includes an input embedding layer and a positional encoding layer.
  • word embedding processing can be performed on each word in the current input to obtain the word embedding vector of each word.
  • the position coding layer the position of each word in the current input can be obtained, and then a position vector is generated for the position of each word.
  • the position of each word may be the absolute position of each word in the current input. Taking the current input "what number should I pay back Huabei" as an example, the position of "what" can be represented as the first position, the position of "number" can be represented as the second position, and so on. In some examples, the position of each word may be a relative position between words: still taking "what number should I pay back Huabei" as an example, the position of "what" can be expressed as being before "number", and the position of "number" can be expressed as being after "what" and before "should", and so on.
  • the position vector of each word and the corresponding word embedding vector can be combined to obtain the feature vector of each word, that is, multiple feature vectors corresponding to the current input are obtained.
  • Multiple feature vectors can be represented as embedding matrices with preset dimensions.
  • if the number of feature vectors is M and the preset dimension is H, the multiple feature vectors can be expressed as an M x H embedding matrix.
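  • A minimal sketch of the embedding layer described above: a word (token) embedding plus an absolute position embedding, summed into an M x H matrix. The vocabulary size, maximum length and dimension are assumptions, not values from this application.

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Word embedding plus learned absolute position embedding, summed."""
    def __init__(self, vocab_size=1000, max_len=512, dim=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)  # (batch, M, H)

emb = InputEmbedding()
x = torch.randint(0, 1000, (1, 6))   # a 6-token "current input"
print(emb(x).shape)                  # torch.Size([1, 6, 256]), i.e. an M x H matrix
```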
  • the attention mechanism imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience and external sensation to increase the precision of observation in some areas, and can use limited attention resources to quickly filter out high-value information from a large amount of information. .
  • the attention mechanism can quickly extract important features of sparse data and is therefore widely used in natural language processing tasks, especially machine translation.
  • the self-attention mechanism is an improvement of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features.
  • the essential idea of the attention mechanism can be expressed by the following formula: $\mathrm{Attention}(Query, Source) = \sum_{i=1}^{L_x} \mathrm{Similarity}(Query, Key_i) \cdot Value_i$, where $L_x$ represents the length of Source.
  • the meaning of the formula is as follows: imagine that the constituent elements of Source are a series of <Key, Value> data pairs. Given a certain element Query in the target Target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and that Key, and the Values are then weighted and summed to obtain the final Attention value. So essentially the Attention mechanism is a weighted summation of the Values of the elements in Source, with Query and Key used to calculate the weight coefficients of the corresponding Values.
  • Attention can be understood as selectively filtering out a small amount of important information from a large amount of information and focusing on this important information, while ignoring most of the unimportant information.
  • the process of focusing is reflected in the calculation of the weight coefficient.
  • the self-attention mechanism can be understood as internal Attention (intra attention).
  • the Attention mechanism occurs between the Target element Query and all elements in the Source.
  • the self-attention mechanism refers to between the internal elements of the Source or between the internal elements of the Target.
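  • A minimal numeric sketch of the weighted-sum formulation above, using a dot-product similarity followed by softmax weights; the dimensions and random values are arbitrary illustrations.

```python
import numpy as np

def attention(query: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Weighted sum of Values, with weights from Query-Key similarity."""
    scores = keys @ query                    # Similarity(Query, Key_i) for i = 1..Lx
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax normalization
    return weights @ values                  # sum_i weight_i * Value_i

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(5, 8))   # Lx = 5 Key vectors
V = rng.normal(size=(5, 8))   # Lx = 5 Value vectors
print(attention(q, K, V).shape)  # (8,)
```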
  • the scenario of singing voice editing is when the user is recording a song (such as singing a cappella).
  • voice editing is usually used. The current voice editing method is to obtain voice segments from a database, replace the erroneous content with those segments, and then generate corrected speech.
  • this application provides a voice editing method.
  • the pitch characteristics affect whether the target edited voice sounds similar to the original voice.
  • This application predicts the pitch feature of the second text (the text to be edited), generates the first speech feature of the second text based on the pitch feature, and generates the target edited voice corresponding to the second text based on the first speech feature, so that the pitch features of the voice before and after the singing edit are similar, and the target edited voice therefore sounds similar to the original voice.
  • an embodiment of the present application provides a system architecture 10.
  • the data collection device 16 is used to collect training data.
  • the training data includes training speech and training text corresponding to the training speech.
  • the training data is stored in the database 13, and the training device 12 trains to obtain the target model/rule 101 based on the training data maintained in the database 13.
  • the target model/rule 101 can be used to implement the speech processing method provided by the embodiments of the present application; that is, after relevant preprocessing, the text is input into the target model/rule 101, which can then obtain the speech features of the text.
  • the target model/rule 101 in the embodiment of this application may specifically be a neural network. It should be noted that, in actual applications, the training data maintained in the database 13 may not all be collected by the data collection device 16 and may also be received from other devices. In addition, the training device 12 does not necessarily train the target model/rule 101 entirely on the training data maintained in the database 13; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be construed as a limitation on the embodiments of this application.
  • the target model/rules 101 trained according to the training device 12 can be applied to different systems or devices, such as to the execution device 11 shown in Figure 1.
  • the execution device 11 can be a terminal, such as a mobile phone terminal, a tablet computer, Laptops, AR/VR, vehicle terminals, etc., or servers or clouds, etc.
  • the execution device 11 is configured with an I/O interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 14.
  • the input data in the embodiment of the present application may include: the second voice feature, the target text and the mark information; the input data may also include the second voice feature and the second text.
  • the input data can be input by the user, or uploaded by the user through other devices. Of course, it can also come from a database, and there is no specific limit here.
  • the preprocessing module 113 is configured to perform preprocessing according to the target text and mark information received by the I/O interface 112. In the embodiment of the present application, the preprocessing module 113 may be used to determine the target editing text in the target text based on the target text and mark information. If the input data includes the second speech feature and the second text, the preprocessing module 113 is configured to perform preprocessing according to the target text and mark information received by the I/O interface 112, for example, converting the target text into phonemes and other preparatory work.
  • when the execution device 11 preprocesses the input data, or when the calculation module 111 of the execution device 11 performs calculation and other related processing, the execution device 11 can call data, code, etc. in the data storage system 15 for the corresponding processing, and the data, instructions, etc. obtained by the corresponding processing can also be stored in the data storage system 15.
  • the I/O interface 112 returns the processing result, such as the first voice feature obtained as described above, to the client device 14, thereby providing it to the user.
  • the training device 12 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired results or providing input for other subsequent processing.
  • the user can manually set the input data, and the manual setting can be operated through the interface provided by the I/O interface 112 .
  • the client device 14 can automatically send input data to the I/O interface 112. If requiring the client device 14 to automatically send the input data requires the user's authorization, the user can set corresponding permissions in the client device 14. The user can view the results output by the execution device 11 on the client device 14, and the specific presentation form may be display, sound, action, etc.
  • the client device 14 can also be used as a data collection terminal to collect input data from the input I/O interface 112 and output results from the output I/O interface 112 as new sample data, and store them in the database 13 .
  • the I/O interface 112 directly uses the input data input to the I/O interface 112 and the output result output from the I/O interface 112, as shown in the figure, as new sample data and stores it in the database 13.
  • Figure 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 15 is an external memory relative to the execution device 11. In other cases, the data storage system 15 can also be placed in the execution device 11.
  • a target model/rule 101 is obtained by training according to the training device 12.
  • the target model/rule 101 can be a neural network in the embodiment of the present application.
  • The neural network can be a recurrent neural network, a long short-term memory network, etc.
  • the prediction network can be a convolutional neural network, a recurrent neural network, etc.
  • the neural network and the prediction network in the embodiment of this application can be two separate networks, or they can be one multi-task neural network in which one task is to output duration, one task is to predict pitch features, and another task is to output speech features.
  • CNN is a very common neural network
  • the structure of CNN will be introduced in detail below in conjunction with Figure 2.
  • a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to the image input into it.
  • a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional/pooling layer 120, and a neural network layer 130 where the pooling layer is optional.
  • the convolution layer/pooling layer 120 may include, for example, layers 121 to 126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, and layer 124 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 121 may include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • during convolution, the weight matrix extends across the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolved output with a single depth dimension, but in most cases, instead of a single weight matrix, multiple weight matrices of the same dimensions are applied, and the output of each weight matrix is stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract edge information of the image, another is used to extract a specific color of the image, and yet another is used to blur unwanted noise in the image. The multiple weight matrices have the same dimensions, so the feature maps they extract also have the same dimensions, and the extracted feature maps of the same dimensions are then combined to form the output of the convolution operation.
  • weight values in these weight matrices need to be obtained through a large amount of training in practical applications.
  • Each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
  • the features extracted by the initial convolutional layers (for example, layer 121) are relatively simple, while the features extracted by subsequent convolutional layers (for example, layer 126) become more and more complex, such as high-level semantic features.
  • the arrangement can also be multiple convolutional layers followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the average value of pixel values in an image within a specific range.
  • the max pooling operator can take the pixel with the largest value in a specific range as the result of max pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image processed by the pooling layer can be smaller than the size of the image input to the pooling layer.
  • Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
  • after being processed by the convolutional layer/pooling layer 120, the convolutional neural network 100 is still not able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 120 only extracts features and reduces the number of parameters brought by the input image. In order to generate the final output information (the required class information or other related information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs whose number equals the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in Figure 2) and an output layer 140. The parameters of the multiple hidden layers may be pre-trained based on training data related to the specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, etc.
  • after the multiple hidden layers in the neural network layer 130, the last layer of the entire convolutional neural network 100 is the output layer 140.
  • the output layer 140 has a loss function similar to classification cross entropy, specifically used to calculate the prediction error.
  • the convolutional neural network 100 shown in Figure 2 is only an example of a convolutional neural network.
  • the convolutional neural network can also exist in the form of other network models; for example, as shown in Figure 3, multiple convolutional layers/pooling layers are arranged in parallel, and the features they extract are all input to the neural network layer 130 for processing.
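  • A toy network mirroring the structure described above (input, convolution/pooling layers 120, hidden layers 130 and an output layer 140); all channel counts, kernel sizes and class counts are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer (shared-weight kernels)
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer reduces spatial size
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 64),                   # hidden layer
    nn.ReLU(),
    nn.Linear(64, 10),                           # output layer (e.g. 10 classes)
)
x = torch.randn(1, 3, 32, 32)                    # one 32x32 RGB image
print(cnn(x).shape)                              # torch.Size([1, 10])
```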
  • Figure 4 is a chip hardware structure provided by an embodiment of the present application.
  • the chip includes a neural network processor 40.
  • the chip can be disposed in the execution device 110 as shown in Figure 1 to complete the calculation work of the calculation module 111.
  • the chip can also be provided in the training device 120 as shown in Figure 1 to complete the training work of the training device 120 and output the target model/rules 101.
  • the algorithms of each layer in the convolutional neural network shown in Figure 2 can be implemented in the chip shown in Figure 4.
  • the neural network processor 40 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or another processor suitable for large-scale neural network computation.
  • the NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and the main CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 403.
  • the controller 404 controls the arithmetic circuit 403 to extract data in the memory (weight memory or input memory) and perform operations.
  • the computing circuit 403 internally includes multiple processing engines (PEs).
  • arithmetic circuit 403 is a two-dimensional systolic array.
  • the arithmetic circuit 403 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 403 is a general-purpose matrix processor.
  • the arithmetic circuit obtains the corresponding data of matrix B from the weight memory 402 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit obtains the data of matrix A from the input memory 401, performs a matrix operation between matrix A and matrix B, and stores the partial or final result of the matrix in the accumulator 408.
  • the vector calculation unit 407 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • the vector calculation unit 407 can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc. .
  • the vector calculation unit 407 can store the processed output vector into the unified buffer 406.
  • the vector calculation unit 407 may apply a nonlinear function to the output of the operation circuit 403, such as a vector of accumulated values, to generate an activation value.
  • vector calculation unit 407 generates normalized values, merged values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 403, such as for use in a subsequent layer in a neural network.
  • the unified memory 406 is used to store input data and output data.
  • the direct memory access controller (DMAC) 405 transfers the input data in the external memory to the input memory 401 and/or the unified memory 406, stores the weight data in the external memory into the weight memory 402, and stores the data in the unified memory 406 into the external memory.
  • a bus interface unit (BIU) 410 is used to implement interaction between the main CPU, the DMAC and the fetch memory 409 through the bus.
  • An instruction fetch buffer 409 connected to the controller 404 is used to store instructions used by the controller 404.
  • the controller 404 is used to call instructions cached in the memory 409 to control the working process of the computing accelerator.
  • the unified memory 406, the input memory 401, the weight memory 402 and the instruction memory 409 are all on-chip memories, and the external memory is a memory external to the NPU.
  • the external memory can be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM) or other readable and writable memory.
  • each layer in the convolutional neural network shown in Figure 2 or Figure 3 can be performed by the operation circuit 403 or the vector calculation unit 407.
  • This voice processing method can be applied to scenarios where voice content needs to be modified, such as scenarios where users record short videos, teachers record teaching voices, etc.
  • the voice processing method can be applied to applications, software or voice processing devices with voice editing functions such as smart voice assistants on mobile phones, computers, detachable terminals that can produce sounds, smart speakers, etc.
  • the voice processing device is a terminal device used to serve users, or a cloud device.
  • the terminal device may include a head mount display (HMD), which may be a combination of a virtual reality (VR) box and a terminal, a VR all-in-one machine, or a personal computer (PC), Augmented reality (AR) devices, mixed reality (MR) devices, etc.
  • the terminal device may also include a cellular phone, a smart phone, a personal digital assistant (PDA), a tablet computer, a laptop computer, a personal computer (PC), a vehicle-mounted terminal, etc.; the details are not limited here.
  • the neural network and the prediction network in the embodiment of this application can be two separate networks, or they can be a multi-task neural network, one of which is to output duration, and the other is to output speech features.
  • the training method shown in Figure 5 can be executed by a neural network training device.
  • the neural network training device can be a cloud service device or a terminal device.
  • the training method device may also be a system composed of cloud service equipment and terminal equipment.
  • the training method may be executed by the training device 120 in FIG. 1 and the neural network processor 40 in FIG. 4 .
  • the training method can be processed by the CPU, or it can be processed by the CPU and GPU together, or it can not use the GPU but use other processors suitable for neural network calculations, which is not limited by this application.
  • the training method shown in Figure 5 includes step 501 and step 502. Step 501 and step 502 will be described in detail below.
  • the prediction network in the embodiment of this application can be a transformer network, RNN, CNN, etc., and is not specifically limited here.
  • when the prediction network is trained, the input is the vector of the training text, and the output is the duration, pitch feature or speech feature of each phoneme in the training text. Training then continues to narrow the difference between the duration, pitch feature or speech feature of each phoneme output by the prediction network and the actual duration, actual pitch feature or actual speech feature corresponding to the training text, so as to obtain the trained prediction network.
  • Step 501 Obtain training data.
  • the training data in the embodiment of the present application includes training speech, or includes training speech and training text corresponding to the training speech. If the training data does not include training text, the training text can be obtained by recognizing the training speech.
  • the training data may also include a user identifier, or the voiceprint feature of the training speech, or a vector used to identify the voiceprint feature of the training speech.
  • the training data may also include start and end duration information of each phoneme in the training speech.
  • the training data can be obtained by directly recording the utterances of the voicing object, or by the user inputting audio information and video information, or by receiving transmissions from the collection device.
  • there are also other ways to obtain the training data, and the way of obtaining the training data is not specifically limited here.
  • Step 502 Use the training data as the input of the neural network, train the neural network with the goal that the value of the loss function is less than the threshold, and obtain a trained neural network.
  • some preprocessing can be performed on the training data.
  • the training data includes training speech as described above
  • the training text can be obtained by identifying the training speech, and the training text can be input into the neural network using phoneme representation.
• the entire training text can be regarded as the target editing text and used as input to train the neural network with the goal of reducing the value of the loss function, that is, continuously reducing the difference between the speech features output by the neural network and the actual speech features corresponding to the training speech.
  • This training process can be understood as a prediction task.
  • the loss function can be understood as the loss function corresponding to the prediction task.
  • the neural network in the embodiment of this application may specifically be an attention mechanism model, such as transformer, tacotron2, etc.
  • the attention mechanism model includes an encoder-decoder, and the structure of the encoder or decoder can be a recurrent neural network, a long short-term memory network (long short-term memory, LSTM), etc.
  • the neural network in the embodiment of the present application includes an encoder and a decoder.
  • the structural types of the encoder and decoder may be RNN, LSTM, etc., and are not limited here.
  • the function of the encoder is to encode the training text into a text vector (a vector representation in units of phonemes, with each input corresponding to a vector).
  • the function of the decoder is to obtain the phonetic features corresponding to the text based on the text vector.
• the calculation at each decoding step is based on the real speech features corresponding to the previous step.
  • the prediction network can be used to correct the speech duration corresponding to the text vector. That is, it can be understood as upsampling the text vector according to the duration of each phoneme in the training speech (it can also be understood as extending the number of frames of the vector) to obtain a vector of the corresponding number of frames.
  • the function of the decoder is to obtain the speech features corresponding to the text based on the above vector corresponding to the frame number.
  • the above-mentioned decoder may be a unidirectional decoder or a bidirectional decoder (that is, two directions are parallel), and the details are not limited here.
  • the two directions refer to the direction of the training text, which can also be understood as the direction of the vector corresponding to the training text. It can also be understood as the forward or reverse order of the training text.
• one direction runs from one side of the training text towards the other side, and the other direction runs from the other side back towards the first side.
• that is, the first direction (forward order) reads the training text from its first character to its last character, and the second direction (reverse order) reads it from its last character back to its first character.
  • the decoder is a bidirectional decoder
  • the decoders in both directions are trained in parallel and are calculated independently during the training process, so there is no dependence on the results.
  • the prediction network and the neural network are a multi-task network
  • the prediction network can be called a prediction module, and the decoder can correct the speech features output by the neural network based on the real duration information corresponding to the training text.
• the input during model training can be the original singing audio and the corresponding lyric text (expressed in units of phonemes).
• the duration information of each phoneme in the original audio, the singer voiceprint features, frame-level pitch information, etc. can be obtained through other pre-trained models or tools (such as a singing-lyrics alignment tool, a singer voiceprint extraction tool, a pitch extraction algorithm, etc.).
• the output can be a trained acoustic model, and the training goal is to minimize the error between the predicted singing voice features and the real singing voice features.
  • training data sets can be synthesized based on singing voices, and corresponding training data samples can be constructed by simulating "insertion, deletion and replacement" operation scenarios.
• Stage 1: first use ground-truth lyrics and audio, together with pitch and duration data, to train a singing voice synthesis model, obtaining the trained text encoding module and audio feature decoding module;
• Stage 2: fix the text encoding module and the audio feature decoding module, and use the simulated editing-operation training data set to train the duration regularization module and the pitch prediction module;
• Stage 3: end-to-end training, finetuning the entire model using all training data.
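• the three stages above can be organized, for example, as the following training schedule; this is a hedged sketch that assumes the model exposes text_encoder, decoder, duration_regulator and pitch_predictor sub-modules and a train_epoch helper, all of which are illustrative names rather than names used in the application.

```python
import torch

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def staged_training(model, synth_data, edit_sim_data, all_data, train_epoch):
    # Stage 1: train the text encoder + audio feature decoder on ground-truth lyrics/audio.
    set_trainable(model.duration_regulator, False)
    set_trainable(model.pitch_predictor, False)
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
    for _ in range(10):
        train_epoch(model, synth_data, opt)

    # Stage 2: freeze encoder/decoder, train the duration regularization and pitch
    # prediction modules on the simulated "insert/delete/replace" data set.
    set_trainable(model.text_encoder, False)
    set_trainable(model.decoder, False)
    set_trainable(model.duration_regulator, True)
    set_trainable(model.pitch_predictor, True)
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
    for _ in range(10):
        train_epoch(model, edit_sim_data, opt)

    # Stage 3: unfreeze everything and finetune end to end with a smaller learning rate.
    set_trainable(model, True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(5):
        train_epoch(model, all_data, opt)
```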
• the neural network includes an encoder and a decoder.
  • the neural network may also include a prediction module and an upsampling module.
  • the prediction module is specifically used to implement the function of the above prediction network
  • the upsampling module is specifically used to implement the above process of upsampling the text vector according to the duration of each phoneme in the training speech, which will not be described again here.
• the training process may also adopt other training methods instead of the aforementioned ones, which are not limited here.
  • the voice processing method provided by the embodiment of the present application can be applied to replacing scenes, inserting scenes, or deleting scenes.
  • the above scenario can be understood as replacing, inserting, deleting, etc. the original speech corresponding to the original text to obtain the target speech, so as to achieve a similar listening experience between the target speech and the original speech and/or improve the fluency of the target speech.
  • the original voice can be considered to include the voice to be modified, and the target voice is the voice obtained after the user wants to modify the original voice.
  • the original text is "Today the weather is very good in Shenzhen", and the target text is “Today the weather is very good in Guangzhou”.
  • the overlapping text is "The weather is very good today”.
  • the non-overlapping text in the original text is "Shenzhen”, and the non-overlapping text in the target text is "Guangzhou”.
  • the target text includes a first text and a second text, and the first text is an overlapping text or a part of the overlapping text.
• the second text is the text in the target text other than the first text. For example, if the first text is the whole overlapping text "The weather is very good today", the second text is "Guangzhou"; if the first text is only part of the overlapping text (with the characters adjacent to the edit position removed), the second text is "Guangzhou" together with those adjacent overlapping characters (in the Chinese example, the "tian" characters on either side of "Guangzhou").
  • the original text is "The weather in Shenzhen is very good today", and the target text is "The weather in Shenzhen is very good this morning".
  • the overlapping text is "The weather in Shenzhen is very good today”.
  • the non-overlapping text in the target text is "AM”.
• the insertion scene can be regarded as a replacement scene in which the two characters adjacent to the insertion point in the original speech (the "tian" of "today" and the "shen" of "Shenzhen") are replaced by those same characters with "this morning" inserted between them; that is, the first text is the overlapping text excluding those two adjacent characters, and the second text consists of those adjacent characters together with "this morning".
  • the original text is "The weather in Shenzhen is very good today” and the target text is "The weather is very good today”.
  • the overlapping text is "The weather is very good today”.
  • the non-overlapping text in the original text is "Shenzhen”.
• the deletion scene can be regarded as a replacement scene in which "Shenzhen" together with its two adjacent "tian" characters in the original speech is replaced by the two "tian" characters alone; that is, the first text is the overlapping text excluding those adjacent characters, and the second text consists of the two adjacent "tian" characters.
  • the speech processing method provided by the embodiment of the present application is described below only by taking the replacement scene as an example.
  • the voice processing method provided by the embodiments of this application can be executed by the terminal device or the cloud device alone, or can be completed by the terminal device and the cloud device together. They are described below:
  • Embodiment 1 The terminal device or the cloud device executes the voice processing method independently.
  • Figure 7a is an example of a voice processing method provided by the embodiment of the present application.
  • This method can be executed by a voice processing device or by a component of the voice processing device (such as a processor, a chip, or a chip system, etc.).
  • the voice processing device may be a terminal device or a cloud device, and this embodiment includes steps 701 to 704.
  • Step 701 Obtain the original voice and the second text.
  • the speech processing device can directly obtain the original speech, the original text and the second text. It is also possible to obtain the original speech and the second text first, and then recognize the original speech to obtain the original text corresponding to the original speech.
  • the second text is the text in the target text other than the first text, and the original text and the target text contain the first text.
  • the first text can be understood as part or all of the overlapping text of the original text and the target text.
  • the content of the original voice is the user's singing voice, which may be, for example, the voice recorded when the user sings a cappella.
  • the voice processing device can directly obtain the second text through input from other devices or users.
  • the speech processing device obtains the target text, obtains the overlapping text based on the original text corresponding to the target text and the original speech, and then determines the second text based on the overlapping text.
• the characters in the original text and the target text can be compared one by one, or the two texts can be input into a comparison model, to determine the overlapping text and/or non-overlapping text between the original text and the target text.
  • the first text may be an overlapping text, or may be part of the overlapping text.
• the speech processing device can directly determine the overlapping text as the first text, can determine the first text in the overlapping text according to preset rules, or can determine the first text in the overlapping text according to a user operation.
  • the preset rule may be to obtain the first text after removing N characters in the overlapping content, where N is a positive integer.
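• as a hedged illustration of how overlapping and non-overlapping text could be determined by comparing the two texts, the standard-library sketch below uses difflib; the function name is illustrative and the application does not prescribe any particular comparison algorithm.

```python
import difflib

def split_edit(original_text, target_text):
    """Return (overlapping_text, non_overlap_in_original, non_overlap_in_target)."""
    sm = difflib.SequenceMatcher(a=original_text, b=target_text, autojunk=False)
    overlap, removed, inserted = [], [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            overlap.append(original_text[i1:i2])
        else:
            removed.append(original_text[i1:i2])
            inserted.append(target_text[j1:j2])
    return "".join(overlap), "".join(removed), "".join(inserted)

# English rendering of the example above: "Shenzhen" is replaced by "Guangzhou".
print(split_edit("Today the weather is very good in Shenzhen",
                 "Today the weather is very good in Guangzhou"))
```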
  • the speech processing device can align the original text with the original speech, determine the starting and ending positions of each phoneme in the original text in the original speech, and can learn the duration of each phoneme in the original text. Then, the phonemes corresponding to the first text are obtained, that is, the speech corresponding to the first text in the original speech (that is, the non-edited speech) is obtained.
  • the voice processing device can align the original text with the original speech by using a forced alignment method, such as: Montreal forced aligner (MFA), neural network with alignment function and other alignment tools, specifically There are no limitations here.
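• once a forced aligner such as MFA has produced start and end times for each phoneme, they can be turned into per-phoneme frame counts as in the sketch below; the sampling rate, hop length and interval values are illustrative assumptions.

```python
def intervals_to_frames(phoneme_intervals, sample_rate=22050, hop_length=256):
    """phoneme_intervals: list of (phoneme, start_sec, end_sec) from a forced aligner."""
    frames_per_second = sample_rate / hop_length
    durations = []
    for phoneme, start, end in phoneme_intervals:
        n_frames = max(1, round((end - start) * frames_per_second))
        durations.append((phoneme, n_frames))
    return durations

# Illustrative alignment of the first two phonemes of the original speech.
print(intervals_to_frames([("j", 0.00, 0.08), ("in1", 0.08, 0.21)]))
```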
  • the user interface can be displayed to the user, and the user interface includes the original speech and the original text. Further, the user performs a first operation on the original text through the user interface, and the speech processing device determines the target text in response to the user's first operation.
  • the first operation can be understood as the user's editing of the original text, and the editing can be the aforementioned replacement, insertion, or deletion.
  • the original text is "Today the weather is very good in Shenzhen", and the target text is "Today the weather is very good in Guangzhou”.
  • the speech processing device is a mobile phone. After the speech processing device obtains the original text and the original voice, the user is presented with an interface as shown in Figure 8, which includes the original text and the original voice. As shown in Figure 9, the user can perform the first operation 901 on the original text, such as modifying "Shenzhen" to "Guangzhou” and other aforementioned insertion, deletion, and replacement operations.
  • only replacement is used as an example for description.
  • the speech processing device displays the overlapping text to the user, and then determines the first text from the overlapping text according to the user's second operation, and then determines the second text.
• the second operation can be a click, a drag, a slide, etc., and is not specifically limited here.
  • the second text is "Guangzhou”
  • the first text is "The weather is very good today”
  • the non-edited voice is the voice of the first text in the original voice.
  • the non-edited speech is equivalent to frames 1 to 4 and frames 9 to 16 in the original speech. It can be understood that in practical applications, the correspondence between text and voice frames is not necessarily 1:2 as in the above example.
  • the above example is only for the convenience of understanding the non-editing area.
  • the number of frames corresponding to the original text is not specifically limited here.
• the speech processing device can display an interface as shown in Figure 10, which can include the second text, the target text, the non-edited voice and the edited voice in the original voice, where the second text is "Guangzhou" and the target text is "The weather in Guangzhou is very good today".
  • the non-edited voice is the voice corresponding to "The weather is very good today”
  • the edited voice is the voice corresponding to "Shenzhen”. It can also be understood that as the user edits the target text, the speech processing device determines the non-edited voice in the original voice based on the target text, the original text, and the original voice.
  • the voice processing device receives an editing request sent by the user, where the editing request includes the original voice and the second text.
  • the edit request also includes the original text and/or speaker identification.
  • the editing request can also include the original speech and the target text.
  • Step 702 Predict the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text.
  • the information of the target text includes: text embedding of each phoneme in the target text.
• the text embedding of each phoneme in the target text can be obtained from the target text through a text encoding module (Text Encoder).
• specifically, the target text can first be converted into the corresponding phoneme sequence (for example, the phonemes corresponding to "How can love not ask whether it is right or wrong" are the sequence of pinyin initials and finals), and the phoneme sequence is then input into the Text Encoder and converted into text embeddings in units of phonemes; the Text Encoder can be exemplified by the Tacotron 2 model.
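• as a hedged example of converting Chinese lyrics into a sequence of pinyin initials and toned finals, the sketch below uses the third-party pypinyin package; this is only one possible grapheme-to-phoneme tool, and the exact tones (e.g. tone sandhi such as "bu2") may differ from the sequence shown later in this description.

```python
from pypinyin import lazy_pinyin, Style  # one possible G2P helper, not mandated here

def text_to_phonemes(text):
    """Split each character into a pinyin initial plus a toned final, e.g. '错' -> 'c', 'uo4'."""
    initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, style=Style.FINALS_TONE3, strict=False)
    phonemes = []
    for ini, fin in zip(initials, finals):
        if ini:                 # zero-initial syllables (e.g. 'ai4') have no initial part
            phonemes.append(ini)
        phonemes.append(fin)
    return phonemes

print(text_to_phonemes("爱不问对错"))
# -> roughly ['ai4', 'b', 'u4', 'w', 'en4', 'd', 'ui4', 'c', 'uo4'] (no tone sandhi applied)
```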
  • the number of frames of each phoneme in the non-edited speech (which can also be called the duration) can be obtained, and based on the number of frames of each phoneme in the non-edited speech and the information of the target text, Predict the number of frames for each phoneme in the second text.
• the neural network used to predict the number of frames of each phoneme in the second text can be as shown in Figure 7b (for example, a duration prediction model based on a mask mechanism that fuses the original real durations); it takes the output of the Text Encoder, the original real durations (Reference Duration, that is, the duration of each phoneme in the first text) and the corresponding mask as input, and predicts the duration (that is, the number of corresponding audio frames) of each phoneme to be edited (that is, each phoneme in the second text).
• each text embedding can then be upsampled based on the predicted duration of each phoneme to obtain the frame-level embedding result (for example, if the predicted duration of phoneme "ai" is 10 frames, the text embedding corresponding to "ai" is copied N times, where N is a positive number greater than 1, for example N is 10).
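• the upsampling (length-regulation) step described above can be sketched as follows; a PyTorch tensor layout is assumed and the sizes are illustrative.

```python
import torch

def length_regulate(text_embeddings, durations):
    """Repeat each phoneme-level embedding by its predicted number of frames.

    text_embeddings: (num_phonemes, embed_dim)
    durations:       (num_phonemes,) integer frame counts, e.g. 10 for phoneme 'ai'
    returns:         (total_frames, embed_dim) frame-level embeddings
    """
    return torch.repeat_interleave(text_embeddings, durations, dim=0)

emb = torch.randn(3, 256)                              # three phoneme embeddings
frames = length_regulate(emb, torch.tensor([10, 4, 7]))
print(frames.shape)                                    # torch.Size([21, 256])
```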
• singing itself follows a certain music score, and the music score also stipulates the pronunciation duration and pitch of each word. Therefore, when editing the singing voice, there is no need to predict the duration and pitch information for the area that does not need to be edited (the non-edited voice); the accurate real values can be obtained and used directly.
• the FFT Block can be a Transformer block.
  • the predicted duration of each phoneme in the second text can be used to upsample each input in pitch feature prediction.
• the input for pitch feature prediction can include text embeddings; before upsampling, each text embedding corresponds to one phoneme, and after upsampling, the number of text embeddings for a phoneme corresponds to its number of frames.
  • the second speech feature of the non-edited speech may also be obtained based on the non-edited speech.
• the second voice feature may carry at least one of the following information: some voice frames or all voice frames of the non-edited voice; the voiceprint feature of the non-edited voice; the timbre feature of the non-edited voice; the prosodic feature of the non-edited speech; and the rhythm feature of the non-edited speech.
  • the speech features in the embodiments of the present application can be used to represent the characteristics of speech (such as timbre, rhythm, emotion or rhythm, etc.).
• the speech features can be expressed in many forms, such as speech frames, sequences, vectors, etc., which are not specifically limited here.
  • the speech features in the embodiments of the present application may specifically be parameters extracted from the above-mentioned expression forms through the aforementioned PLP, LPC, MFCC and other methods.
  • At least one speech frame is selected from the non-edited speech as the second speech feature.
• in this way, the second speech feature, which carries context information, is further incorporated when the first speech feature is generated.
  • the text corresponding to at least one speech frame may be text adjacent to the second text in the first text.
  • the non-edited speech is encoded through a coding model to obtain a target sequence, and the target sequence is used as the second speech feature.
  • the coding model can be CNN, RNN, etc., and there is no specific limit here.
  • the second voice feature may also carry the voiceprint feature of the original voice.
  • the voiceprint features may be obtained directly, or the voiceprint features may be obtained by recognizing original speech, etc.
  • the subsequently generated first voice feature also carries the voiceprint feature of the original voice, thereby improving the similarity between the target edited voice and the original voice.
• introducing voiceprint features can make the subsequently predicted voice features more similar to the voiceprint of the speaker of the original speech.
  • the speech processing device can also obtain the speaker identification of the original speech, so that when there are multiple speakers, the speech corresponding to the corresponding speaker can be matched, and the similarity between the subsequent target edited speech and the original speech can be improved.
• the following takes speech frames as the speech features (which can also be understood as obtaining speech features based on speech frames) as an example.
  • at least one frame from the 1st to 4th frame and the 9th to 16th frame in the original speech is selected as the second speech feature.
  • the second speech feature is a Mel spectrum feature.
  • the second speech feature can be expressed in the form of a vector.
• the predicted duration of each phoneme in the second text can be used to upsample each input in pitch feature prediction.
• the input for pitch feature prediction may include the second speech feature; before upsampling, each vector corresponds to one phoneme, and after upsampling, the number of vectors for a phoneme corresponds to its number of frames.
  • the second pitch feature of the second text can be predicted based on the first pitch feature of the non-edited voice and the information of the target text.
  • the first pitch (pitch) feature of the non-edited speech can be obtained through an existing pitch extraction algorithm, which is not limited by this application.
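• purely as an example of "an existing pitch extraction algorithm", the sketch below uses librosa's pYIN implementation to obtain a frame-level pitch contour and a voiced/unvoiced flag; the file name and frame parameters are illustrative.

```python
import numpy as np
import librosa  # one publicly available option; the application does not mandate it

def extract_pitch(wav_path, hop_length=256):
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
        hop_length=hop_length,
    )
    # Unvoiced frames come back as NaN; store 0 there so the contour can double as a U/UV cue.
    return np.nan_to_num(f0), voiced_flag

# f0, uv = extract_pitch("original_speech.wav")  # path is illustrative
```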
• a neural network can be used to predict the second pitch feature of the second text according to the first pitch feature of the non-edited voice, the information of the target text, and the second voice feature of the non-edited voice.
• in a possible implementation, the target text is a text obtained by inserting the second text into the first text, or the target text is a text obtained by deleting a first part of the first text, where the second text is the text adjacent to the first part; in this case, the first pitch feature of the non-edited voice and the information of the target text can be fused to obtain a first fusion result, and the first fusion result is input into the second neural network to obtain the second pitch feature of the second text.
• in another possible implementation, the target text is obtained by replacing a second part of the first text with the second text; in this case, the first pitch feature of the non-edited voice can be input into the third neural network to obtain an initial pitch feature, where the initial pitch feature includes the pitch of each of multiple frames; the information of the target text is input into the fourth neural network to obtain the pronunciation feature of the second text, which indicates whether each of the multiple frames included in the initial pitch feature is voiced; and the initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
• (the replacement operation here only means that the number of words in the newly edited text is consistent with the number of words in the replaced text; if they are not consistent, the replacement operation is decomposed into two editing operations, first deletion and then insertion). Since the replacing text may differ greatly in pronunciation from the replaced text, in order to ensure the coherence of the singing before and after the replacement, the model shown in Figure 7d is used to predict the new pitch:
  • Frame-level voiced/unvoiced (U/UV) prediction can be introduced to help pitch prediction.
• the design of the V/UV Predictor and F0 Predictor modules can refer to the F0 predictor in FastSpeech 2.
• the input first pitch feature may include the pitch feature of each of the multiple frames of the non-edited speech; correspondingly, the output second pitch feature may include the pitch feature of each of the multiple frames of the target edited speech.
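• the frame-level pitch prediction with a voiced/unvoiced branch described above can be sketched roughly as follows (loosely in the spirit of the FastSpeech 2-style predictor mentioned earlier); the layer sizes and the fusion by zeroing unvoiced frames are illustrative assumptions, not the exact structure of Figure 7d.

```python
import torch
import torch.nn as nn

class FramePitchPredictor(nn.Module):
    """Predict frame-level pitch plus a voiced/unvoiced flag and fuse them (illustrative)."""
    def __init__(self, dim=256):
        super().__init__()
        self.f0_head = nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
                                     nn.Conv1d(dim, 1, 1))
        self.vuv_head = nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
                                      nn.Conv1d(dim, 1, 1))

    def forward(self, frame_features):                    # (batch, frames, dim)
        x = frame_features.transpose(1, 2)                # Conv1d expects (batch, dim, frames)
        f0 = self.f0_head(x).squeeze(1)                   # raw pitch value per frame
        vuv = torch.sigmoid(self.vuv_head(x)).squeeze(1)  # probability that a frame is voiced
        return f0 * (vuv > 0.5)                           # zero out pitch on unvoiced frames

predictor = FramePitchPredictor()
print(predictor(torch.randn(2, 120, 256)).shape)          # torch.Size([2, 120])
```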
  • Step 703 According to the second pitch feature and the second text, obtain the first speech feature corresponding to the second text through a neural network.
  • the second pitch feature and the second text (for example, the text embedding of the second text) can be fused (for example, added), and the fusion result can be input into the neural network to obtain The first speech feature corresponding to the second text.
  • the first speech feature corresponding to the second text may be a Mel spectrum feature.
• optionally, the prediction may further be based on the second voice feature of the non-edited voice, in addition to the first pitch feature of the non-edited voice and the information of the target text; for the description of the second speech feature, reference may be made to the above embodiment, and details are not repeated here.
  • the first speech feature corresponding to the second text can be obtained through a neural network based on the second speech feature and the second text.
  • the neural network may include an encoder and a decoder.
  • the second text is input into the encoder to obtain a first vector corresponding to the second text, and then the first vector is decoded by the decoder based on the second speech feature to obtain the first speech feature.
  • the second speech feature can be the same as or similar to the first speech feature in terms of rhythm, timbre, and/or signal-to-noise ratio.
• rhythm can reflect the speaker's emotional state or speech form, and generally refers to intonation, pitch, emphasis, pauses or rhythm characteristics.
  • an attention mechanism can be introduced between the encoder and the decoder to adjust the quantitative correspondence between the input and the output.
  • the target text where the second text is located can be introduced during the coding process of the encoder, so that the generated first vector of the second text refers to the target text, so that the second text described by the first vector is more accurate. That is, the first speech feature corresponding to the second text can be obtained through the neural network based on the second speech feature, the target text, and the mark information.
  • the target text and mark information may be input into an encoder to obtain a first vector corresponding to the second text, and then the first vector may be decoded by a decoder based on the second speech feature to obtain the first speech feature.
  • the marking information is used to mark the second text in the target text.
  • the decoder in the embodiment of the present application may be a unidirectional decoder or a bidirectional decoder, which are described respectively below.
  • the decoder is a one-way decoder.
• the decoder calculates the first vector or the second vector from the first direction of the target text based on the second speech feature, and the obtained speech frames serve as the first speech feature.
  • the first direction is a direction from one side of the target text to the other side of the target text.
  • the first direction can be understood as the forward or reverse order of the target text (for related descriptions, please refer to the description of the forward or reverse order in the embodiment shown in FIG. 5).
  • the second speech feature and the first vector are input into the decoder to obtain the first speech feature.
  • the second speech feature and the second vector are input into the decoder to obtain the first speech feature.
  • the decoder can be a bidirectional decoder (it can also be understood that the encoder includes a first encoder and a second encoder).
  • the above-mentioned second text is in the middle area of the target text, which can be understood to mean that the second text is not at both ends of the target text.
  • the first speech feature output by the bidirectional decoder from the first direction is the speech feature corresponding to the second text
  • the fourth speech feature output by the bidirectional decoder from the second direction is the speech feature corresponding to the second text.
• two complete speech features corresponding to the second text can thus be obtained from the left and the right side (i.e., in forward and reverse order), and the first speech feature can be obtained based on these two speech features.
  • the first decoder calculates a first vector or a second vector from the first direction of the target text based on the second speech feature to obtain the first speech feature of the second text (hereinafter referred to as LR).
• the second decoder calculates the first vector or the second vector from the second direction of the target text based on the second speech feature to obtain a fourth speech feature of the second text (hereinafter referred to as RL), and the first speech feature is then generated according to LR and the fourth speech feature RL.
  • the first direction is the direction from one side of the target text to the other side of the target text
• the second direction is opposite to the first direction (it can also be understood as the direction from the other side of the target text to the first side of the target text).
  • the first direction may be the above-mentioned forward sequence
  • the second direction may be the above-mentioned reverse sequence.
• when the first decoder decodes the first frame of the first vector or the second vector in the first direction, the speech frame in the non-edited speech adjacent to one side of the second text (which may also be called the left side) can be used as a condition for decoding, so as to obtain N frames of LR; correspondingly, the speech frame in the non-edited speech adjacent to the other side of the second text (which may also be called the right side) can be used as a condition for decoding, so as to obtain N frames of RL.
  • the structure of the bidirectional decoder can be referred to Figure 11.
• a frame at which the difference between LR and RL is less than a threshold can be used as the transition frame (its position is m, m ≤ n), or the frame with the smallest difference between LR and RL can be used as the transition frame.
  • the N frames of the first speech feature may include the first m frames in LR and the last n-m frames in RL, or the N frames of the first speech feature may include the first n-m frames in LR and the last m frames in RL.
  • the difference between LR and RL can be understood as the distance between vectors.
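• the selection of a transition frame and the splicing of the two directions can be sketched as follows; the Euclidean distance and the 80-band mel layout are illustrative choices.

```python
import torch

def fuse_bidirectional(lr_frames, rl_frames):
    """lr_frames, rl_frames: (n, feat_dim) features predicted in forward / reverse order."""
    dist = torch.norm(lr_frames - rl_frames, dim=-1)   # per-frame distance between LR and RL
    m = int(torch.argmin(dist))                        # transition frame index
    # Keep the first m+1 frames from LR and the remaining frames from RL.
    return torch.cat([lr_frames[:m + 1], rl_frames[m + 1:]], dim=0)

lr = torch.randn(4, 80)   # e.g. LR1..LR4
rl = torch.randn(4, 80)   # e.g. RL1..RL4
print(fuse_bidirectional(lr, rl).shape)   # torch.Size([4, 80]); with m=1 this yields LR1, LR2, RL3, RL4
```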
  • the first vector or the second vector in this step may also include a third vector used to identify the speaker. It can also be understood that the third vector is used to identify the voiceprint characteristics of the original speech.
  • the first encoder obtains the LR frames corresponding to "Guangzhou” including LR 1 , LR 2 , LR 3 , and LR 4 .
  • the second encoder obtains the RL frames corresponding to "Guangzhou” including RL 1 , RL 2 , RL 3 , and RL 4 .
  • the difference between LR 2 and RL 2 is the smallest, then LR 1 , LR 2 , RL 3 , RL 4 or LR 1 , RL 2 , RL 3 , RL 4 are used as the first speech features.
  • the first speech feature output by the bidirectional decoder from the first direction is the speech feature corresponding to the third text in the second text
• the fourth speech feature output by the bidirectional decoder from the second direction is the speech feature corresponding to the fourth text in the second text.
  • the partial speech features corresponding to the second text can be obtained through the left and right sides (ie, forward and reverse order), and the complete first speech features can be obtained based on the two partial speech features. That is, one part of the phonetic features is taken from the forward direction, another part of the phonetic features is taken from the reverse direction, and one part of the phonetic features and another part of the phonetic features are spliced to obtain the overall phonetic features.
• the first decoder obtains the LR frames corresponding to the third text ("Guang"), namely LR1 and LR2.
• the second decoder obtains the RL frames corresponding to the fourth text ("zhou"), namely RL3 and RL4.
• the first speech feature is obtained by splicing LR1, LR2, RL3 and RL4.
  • Step 704 Generate a target editing voice corresponding to the second text according to the first voice feature.
  • the first speech feature can be converted into a target edited voice corresponding to the second text according to the vocoder.
• the vocoder can be a traditional vocoder (such as the Griffin-Lim algorithm), or a neural network vocoder (such as MelGAN or HiFi-GAN pre-trained using audio training data), etc., which is not limited here.
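• as the simplest (and lowest-quality) option among the vocoders mentioned above, a Griffin-Lim style reconstruction of the waveform from a mel-spectrogram speech feature could look like the sketch below; a neural vocoder such as HiFi-GAN would normally be preferred, and the parameter values are illustrative.

```python
import librosa
import soundfile as sf  # for writing the result; any audio I/O library would do

def mel_to_wave(mel_spectrogram, sr=22050, n_fft=1024, hop_length=256):
    """Griffin-Lim style reconstruction from a power mel spectrogram of shape (n_mels, frames)."""
    return librosa.feature.inverse.mel_to_audio(
        mel_spectrogram, sr=sr, n_fft=n_fft, hop_length=hop_length)

# wave = mel_to_wave(predicted_mel)                     # predicted_mel: the first speech feature
# sf.write("target_edited_speech.wav", wave, 22050)     # file name is illustrative
```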
  • Step 705 Obtain the position of the second text in the target text. This step is optional.
• if what is obtained in step 701 is the original speech and the second text, the position of the second text in the target text is obtained in this step.
  • the starting and ending positions of each phoneme in the original text in the original speech can be determined by aligning the original speech and the original text through the alignment technology in step 701. And determine the position of the second text in the target text based on the starting and ending positions of each phoneme.
  • Step 706 Splice the target edited voice and the non-edited voice based on the position to generate a target voice corresponding to the target text. This step is optional.
  • the position in the embodiment of this application is used to splice the non-edited speech and the target edited speech.
• the position can be the position of the second text in the target text, the position of the first text in the target text, the position of the non-edited speech in the original speech, or the position of the edited speech in the original speech.
  • the original speech and the original text can be aligned using the alignment technology in step 701 to determine the starting and ending positions of each phoneme in the original text in the original speech. And based on the position of the first text in the original text, the position of the non-edited speech or the edited speech in the original speech is determined. Then, the speech processing device splices the target edited speech and the non-edited speech based on the position to obtain the target speech. That is, the target speech corresponding to the second text is replaced with the editing area in the original speech to obtain the target speech.
  • the non-edited speech is equivalent to frames 1 to 4 and frames 9 to 16 in the original speech.
• the target edited voice is LR1, LR2, RL3, RL4 or LR1, RL2, RL3, RL4.
• splicing the target edited speech and the non-edited speech can be understood as replacing the 5th to 8th frames in the original speech with the four obtained frames, thereby obtaining the target speech. That is, the voice corresponding to "Shenzhen" in the original voice is replaced with the voice corresponding to "Guangzhou", and the target voice corresponding to the target text "The weather in Guangzhou is very good today" is obtained.
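• at the feature-frame level, the splicing just described amounts to a simple concatenation, as in the hedged sketch below (frame indices follow the 16-frame example above; the 80-band layout is an assumption).

```python
import numpy as np

def splice_speech(original_frames, edited_frames, edit_start, edit_end):
    """Replace frames [edit_start, edit_end) of the original speech features
    with the target edited speech features, keeping the non-edited frames."""
    return np.concatenate([original_frames[:edit_start],
                           edited_frames,
                           original_frames[edit_end:]], axis=0)

original = np.random.randn(16, 80)    # 16 frames of the original speech
edited = np.random.randn(4, 80)       # 4 newly generated frames for "Guangzhou"
target = splice_speech(original, edited, edit_start=4, edit_end=8)  # replaces the 5th-8th frames
print(target.shape)                   # (16, 80)
```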
  • the target speech corresponding to "The weather in Guangzhou is very good today” is shown in Figure 12.
  • the voice processing device plays the target editing voice or the target voice.
  • the speech processing method provided by the embodiment of the present application includes steps 701 to 704. In another possible implementation manner, the speech processing method provided by the embodiment of the present application includes steps 701 to 705. In another possible implementation manner, the speech processing method provided by the embodiment of the present application includes steps 701 to 706.
  • the various steps shown in Figure 7a in the embodiment of the present application do not limit the timing relationship. For example, step 705 in the above method can also be performed after step 704, or before step 701, or can be executed together with step 701.
  • An embodiment of the present application provides a speech processing method.
• the method includes: obtaining an original speech and a second text, where the second text is the text in the target text other than the first text, both the target text and the original text corresponding to the original speech include the first text, and the speech corresponding to the first text in the original speech is the non-edited speech; predicting the second pitch feature of the second text according to the first pitch feature of the non-edited speech and the information of the target text; obtaining, according to the second pitch feature and the second text, the first speech feature corresponding to the second text through a neural network; and generating the target edited speech corresponding to the second text based on the first speech feature.
• this application predicts the pitch feature of the second text (the text to be edited), generates the first speech feature of the second text based on the pitch feature, and generates the target edited speech corresponding to the second text based on the first speech feature, so that before and after the singing voice is edited, the pitch feature of the target edited speech is similar to that of the original speech, and the listening experience of the target edited speech is similar to that of the original speech.
• Editing request Q1: its target voice is W1 (the text T1 corresponding to the voice content is "How can love not ask whether it is right or wrong");
• Editing request Q2: its target voice is W2 (the text T2 corresponding to the voice content is "Love does not ask whether it is right or wrong");
• Editing request Q3: its target voice is W3 (the text T3 corresponding to the voice content is "How can love not ask whether it is right or wrong");
  • Step S1 Receive the user’s “voice editing” request
  • the request at least includes the original voice to be edited W, the original lyric text S, the target text T (T1 or T2 or T3) and other data.
• the pre-operation includes: comparing the original text S and the target text to determine the editing type of the current editing request, that is, Q1, Q2 and Q3 can be determined to be insertion, deletion and replacement operations respectively; extracting the audio features and pitch features of each frame from W; extracting the Singer embedding from W through the voiceprint model; converting S and the target text T* into a phoneme representation (for example, for T2 the phoneme sequence is [ai4 b u2 w en4 d ui4 c uo4]); extracting, according to W and S, the duration (i.e., the number of frames) corresponding to each phoneme in S; and determining the Mask area according to the operation type. For Q1, which is an insertion operation (inserting the word "how"), the target Mask phonemes are the phonemes corresponding to "how" in the final target text phoneme sequence.
  • Step S2 The target text phonemes obtained in S1 are used by the text encoding module to generate text features, that is, Phoneme-level Text Embedding;
  • Step S3 Predict the duration information of each phoneme in the target text through the duration regularization module; this step can be completed through the following sub-steps:
• for phonemes whose real duration is available (non-Mask phonemes), the reference duration is the real duration extracted in step S1; otherwise it is set to 0;
• for those phonemes, the corresponding position in the Mask vector is set to 1; otherwise it is set to 0;
  • the Embedding of each phoneme is upsampled (that is, if the duration of phoneme A is 10, then the Embedding of A is copied 10 times), thereby generating Frame-level Text Embedding;
  • Step S4 Predict the pitch value of each frame through the pitch prediction module. This step can be completed through the following sub-steps:
• for non-Mask phonemes, the reference pitch is the real pitch extracted in S1, and its corresponding position in the Mask vector is marked as 1; for Mask phonemes, the pitch of the corresponding frames is set to 0 and the Mask is set to 0; the Frame-level pitch corresponding to the Mask phonemes is then predicted.
  • the model shown in Figure 2-4 is used to predict the Frame-level Pitch of the Mask phoneme
  • Step S5 Add Frame-Level text Embedding and Pitch together and input them into the audio feature decoding module to predict the audio feature frame corresponding to the new Mask phoneme.
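• step S5 can be sketched as the element-wise addition of the frame-level text embedding and a projected frame-level pitch, followed by decoding into audio feature frames; the GRU decoder, the 256/80 dimensions and the variable names are illustrative assumptions, not the actual audio feature decoding module.

```python
import torch
import torch.nn as nn

# Illustrative shapes: 120 frames, 256-dim embeddings, 80-band mel output.
frame_text_embedding = torch.randn(1, 120, 256)    # from step S3 (after upsampling)
frame_pitch = torch.randn(1, 120)                  # from step S4

pitch_projection = nn.Linear(1, 256)               # project the scalar pitch to embedding size
decoder = nn.GRU(input_size=256, hidden_size=256, batch_first=True)
mel_head = nn.Linear(256, 80)

decoder_in = frame_text_embedding + pitch_projection(frame_pitch.unsqueeze(-1))
hidden, _ = decoder(decoder_in)
predicted_mel = mel_head(hidden)                   # (1, 120, 80) audio feature frames
print(predicted_mel.shape)
```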
  • an editing request involves multiple editing operations
  • the editing can be performed one by one using the process described above in a processing order from left to right.
  • a replacement operation can also be implemented by two operations: "delete first and then insert”.
  • the voice processing method implemented by the terminal device or the cloud device alone is described above, and the voice processing method implemented by the terminal device and the cloud device jointly is described below.
  • Embodiment 2 The terminal device and the cloud device jointly execute the voice processing method.
  • Figure 13 is an example of a voice processing method provided by the embodiment of the present application.
  • This method can be executed jointly by the terminal device and the cloud device, or can be performed by components of the terminal device (such as a processor, a chip, or a chip system, etc.) and The components of the cloud device (such as a processor, a chip, or a chip system, etc.) execute.
  • This embodiment includes steps 1301 to 1306.
  • Step 1301 The terminal device obtains the original voice and the second text.
  • Step 1301 performed by the terminal device in this embodiment is similar to step 701 performed by the voice processing device in the embodiment shown in Figure 7a, and will not be described again here.
  • Step 1302 The terminal device sends the original voice and the second text to the cloud device.
• after the terminal device obtains the original voice and the second text, it can send the original voice and the second text to the cloud device.
• if in step 1301 the terminal device obtains the original voice and the target text, the terminal device sends the original voice and the target text to the cloud device.
  • Step 1303 The cloud device obtains the non-edited voice based on the original voice and the second text.
  • Step 1303 performed by the cloud device in this embodiment is similar to the description of determining non-edited voice in step 701 performed by the speech processing device in the embodiment shown in Figure 7a, and will not be described again here.
  • Step 1304 The cloud device obtains the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text.
• Step 1304 performed by the cloud device in this embodiment is similar to step 702 performed by the speech processing device in the embodiment shown in Figure 7a, and will not be described again here.
  • Step 1305 The cloud device obtains the first speech feature corresponding to the second text through a neural network based on the second pitch feature and the second text.
  • Step 1306 The cloud device generates a target editing voice corresponding to the second text based on the first voice feature.
  • Steps 1304 to 1306 performed by the cloud device in this embodiment are similar to steps 702 to 704 performed by the voice processing device in the embodiment shown in Figure 7a, and will not be described again here.
  • Step 1307 The cloud device sends the target editing voice to the terminal device. This step is optional.
  • the cloud device after the cloud device obtains the target editing voice, it can send the target editing voice to the terminal device.
  • Step 1308 The terminal device or cloud device obtains the position of the second text in the target text. This step is optional.
• Step 1309 The terminal device or the cloud device splices the target edited voice and the non-edited voice based on the position to generate a target voice corresponding to the target text. This step is optional.
  • Steps 1308 and 1309 in this embodiment are similar to steps 705 to 706 performed by the speech processing device in the embodiment shown in FIG. 7a, and will not be described again here. Steps 1308 and 1309 in this embodiment can be executed by a terminal device or a cloud device.
  • Step 1310 The cloud device sends the target voice to the terminal device. This step is optional.
  • steps 1308 and 1309 are executed by the cloud device, then after acquiring the target voice, the cloud device sends the target voice to the terminal device. If steps 1308 and 1309 are executed by the terminal device, this step may not be executed.
  • the terminal device plays the target editing voice or target voice.
  • the voice processing method provided by the embodiment of the present application may include: the cloud device generates the target editing voice and sends the target editing voice to the terminal device, that is, the method includes steps 1301 to 1307.
  • the speech processing method provided by the embodiment of the present application may include: the cloud device generates the target edited voice, generates the target voice based on the target edited voice and the non-edited voice, and sends the target voice to the terminal device. That is, the method includes steps 1301 to 1306, and steps 1308 to 1310.
  • the voice processing method provided by the embodiment of the present application may include: the cloud device generates the target editing voice, and sends the target editing voice to the terminal device. The terminal device generates the target voice based on the target edited voice and the non-edited voice. That is, the method includes steps 1301 to 1309.
  • the cloud device performs complex calculations to obtain the target edited voice or the target voice and returns it to the terminal device, which can reduce the computing power and storage space of the terminal device.
• a target edited voice corresponding to the modified text can be generated based on the voice characteristics of the non-edited area in the original voice, and a target voice corresponding to the target text can then be generated from the target edited voice and the non-edited voice.
• the user can modify the text in the original text to obtain the target edited voice corresponding to the modified text (i.e., the second text), which improves the user's experience of text-based voice editing.
• the non-edited speech is not modified when the target speech is generated, and the pitch characteristics of the target edited speech are similar to those of the non-edited speech, making it difficult for users, when listening to the original speech and the target speech, to hear differences in speech characteristics between them.
  • An embodiment of the speech processing device in the embodiment of the present application includes:
• Obtaining module 1401 is used to obtain the original speech and the second text, where the second text is the text in the target text other than the first text, both the target text and the original text corresponding to the original speech include the first text, and the voice corresponding to the first text in the original voice is the non-edited voice;
• For a specific description of the acquisition module 1401, reference may be made to the description of step 701 in the above embodiment, which will not be described again here.
• Pitch prediction module 1402 is configured to predict the second pitch feature of the second text by using the first pitch feature of the non-edited speech and the information of the target text;
• For a detailed description of the pitch prediction module 1402, reference may be made to the description of step 702 in the above embodiment, which will not be described again here.
• Generating module 1403 is configured to obtain the first speech feature corresponding to the second text through a neural network according to the second pitch feature and the second text, and to generate, according to the first speech feature, the target edited voice corresponding to the second text.
  • the content of the original voice is the user's singing voice.
• in a possible implementation, predicting the second pitch feature according to the first pitch feature of the non-edited voice and the information of the target text includes predicting it further according to the second speech feature of the non-edited voice, where the second voice feature carries at least one of the information items described in the foregoing embodiment (for example, some or all speech frames, the voiceprint feature, the timbre feature, the prosodic feature or the rhythm feature of the non-edited voice).
  • the information of the target text includes: text embedding of each phoneme in the target text.
• in a possible implementation, the target text is a text obtained by inserting the second text into the first text, or the target text is a text obtained by deleting a first part of the first text, where the second text is the text adjacent to the first part;
  • the pitch prediction module is specifically used for:
• fusing the first pitch feature of the non-edited voice and the information of the target text to obtain a first fusion result, and inputting the first fusion result into the second neural network to obtain the second pitch feature of the second text.
  • the target text is obtained by replacing the second part of the text in the first text with the second text;
  • the pitch prediction module is specifically used for:
• inputting the first pitch feature of the non-edited voice into the third neural network to obtain an initial pitch feature, inputting the information of the target text into the fourth neural network to obtain the pronunciation feature of the second text, and fusing the initial pitch feature and the pronunciation feature to obtain the second pitch feature of the second text.
  • the device further includes:
• a duration prediction module, configured to predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text.
  • the first pitch (pitch) feature includes: the pitch feature of each frame in the multiple frames of the non-edited speech;
  • the second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited voice.
  • the duration prediction module is specifically used to:
• predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech, the information of the target text and the second speech feature of the non-edited speech.
• the acquisition module is further configured to obtain the position of the second text in the target text;
  • the generating module is further configured to splice the target edited voice and the non-edited voice based on the position to obtain a target voice corresponding to the target text.
  • the voice processing device can be any terminal device including a mobile phone, tablet computer, personal digital assistant (PDA), point of sales (POS), vehicle-mounted computer, etc. Taking the voice processing device as a mobile phone as an example:
  • FIG. 15 shows a block diagram of a partial structure of a mobile phone related to the voice processing device provided by an embodiment of the present application.
  • the mobile phone includes: radio frequency (RF) circuit 1510, memory 1520, input unit 1530, display unit 1540, sensor 1550, audio circuit 1560, wireless fidelity (WiFi) module 1570, processor 1580 , and power supply 1590 and other components.
• the RF circuit 1510 can be used to receive and transmit information or signals during a call; in particular, after receiving downlink information from the base station, it hands the information to the processor 1580 for processing, and it sends uplink data to the base station.
  • the RF circuit 1510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, etc. Additionally, RF circuitry 1510 can communicate with networks and other devices through wireless communications.
• the above wireless communication can use any communication standard or protocol, including but not limited to global system of mobile communication (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short messaging service (SMS), etc.
  • the memory 1520 can be used to store software programs and modules.
  • the processor 1580 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1520 .
• the memory 1520 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), etc., and the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), etc.
  • memory 1520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
  • the input unit 1530 may be used to receive input numeric or character information, and generate key signal input related to user settings and function control of the mobile phone.
  • the input unit 1530 may include a touch panel 1531 and other input devices 1532.
  • the touch panel 1531 also known as a touch screen, can collect the user's touch operations on or near the touch panel 1531 (for example, the user uses a finger, stylus, or any suitable object or accessory on or near the touch panel 1531 operation), and drive the corresponding connection device according to the preset program.
  • the touch panel 1531 may include two parts: a touch detection device and a touch controller.
• the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and sends them to the processor 1580; it can also receive commands sent by the processor 1580 and execute them.
  • the touch panel 1531 can be implemented using various types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 1530 may also include other input devices 1532.
  • other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), trackball, mouse, joystick, etc.
  • the display unit 1540 may be used to display information input by the user or information provided to the user as well as various menus of the mobile phone.
  • the display unit 1540 may include a display panel 1541.
  • the display panel 1541 may be configured in the form of a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), etc.
• the touch panel 1531 can cover the display panel 1541. When the touch panel 1531 detects a touch operation on or near it, the operation is transmitted to the processor 1580 to determine the type of the touch event, and the processor 1580 then provides corresponding visual output on the display panel 1541 according to the type of the touch event.
• although in Figure 15 the touch panel 1531 and the display panel 1541 are used as two independent components to implement the input and output functions of the mobile phone, in some embodiments the touch panel 1531 and the display panel 1541 can be integrated to realize the input and output functions of the mobile phone.
  • the phone may also include at least one sensor 1550, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor.
  • the ambient light sensor may adjust the brightness of the display panel 1541 according to the brightness of the ambient light.
• the proximity sensor may turn off the display panel 1541 and/or the backlight when the mobile phone is moved close to the ear.
  • the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes). It can detect the magnitude and direction of gravity when stationary.
  • the audio circuit 1560, speaker 1561, and microphone 1562 can provide an audio interface between the user and the mobile phone.
• the audio circuit 1560 can transmit the electrical signal converted from the received audio data to the speaker 1561, and the speaker 1561 converts it into a sound signal for output; on the other hand, the microphone 1562 converts the collected sound signal into an electrical signal, which is received by the audio circuit 1560 and converted into audio data; the audio data is then processed by the processor 1580 and sent, for example, to another mobile phone through the RF circuit 1510, or output to the memory 1520 for further processing.
  • WiFi is a short-distance wireless transmission technology.
  • through the WiFi module 1570, the mobile phone can help users send and receive e-mails, browse web pages, access streaming media, and so on, providing users with wireless broadband Internet access.
  • although FIG. 15 shows the WiFi module 1570, it can be understood that the module is not a necessary component of the mobile phone.
  • the processor 1580 is the control center of the mobile phone; it uses various interfaces and lines to connect the various parts of the entire mobile phone, and executes the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 1520 and by calling the data stored in the memory 1520, thereby monitoring the mobile phone as a whole.
  • the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs, and so on, while the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 1580.
  • the mobile phone also includes a power supply 1590 (such as a battery) that supplies power to various components.
  • the power supply can be logically connected to the processor 1580 through a power management system, so that functions such as charging, discharging, and power consumption management can be implemented through the power management system.
  • the mobile phone may also include a camera, a Bluetooth module, etc., which will not be described in detail here.
  • the processor 1580 included in the terminal device can perform the functions of the voice processing device in the embodiment shown in Figure 7a, or the functions of the terminal device in the embodiment shown in Figure 13, which will not be described again here.
  • the voice processing device may be a cloud device.
  • the cloud device may include a processor 1601, a memory 1602, and a communication interface 1603.
  • the processor 1601, memory 1602 and communication interface 1603 are interconnected through lines.
  • the memory 1602 stores program instructions and data.
  • the memory 1602 stores the program instructions and data corresponding to the steps executed by the speech processing device in the aforementioned embodiment corresponding to Figure 7a, or the program instructions and data corresponding to the steps executed by the cloud device in the aforementioned embodiment corresponding to Figure 13.
  • the processor 1601 is configured to perform the steps performed by the speech processing device in any of the embodiments shown in Figure 7a, or to perform the steps performed by the cloud device in any of the embodiments shown in Figure 13.
  • the communication interface 1603 can be used to receive and send data, and to perform the steps related to obtaining, sending, and receiving in any of the embodiments shown in Figure 7a or Figure 13.
  • the cloud device may include more or fewer components than shown in Figure 16; this is only an illustrative description in this application and is not limiting.
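A minimal sketch of how a cloud device of this kind might accept a request over its communication interface, run the speech-editing steps on its processor, and return the result is given below. The JSON-over-TCP wire format and the process() placeholder are assumptions for illustration; the embodiments do not specify any particular protocol.

```python
import json
import socket

def process(original_speech: list, second_text: str) -> dict:
    """Placeholder for the speech-editing steps performed by the speech processing device."""
    return {"edited_speech_frames": len(original_speech), "second_text": second_text}

def serve(host: str = "0.0.0.0", port: int = 9000) -> None:
    """Receive a request on the communication interface, process it, and send the result back."""
    with socket.create_server((host, port)) as server:
        while True:                                   # one request per connection, for simplicity
            conn, _ = server.accept()
            with conn:
                request = json.loads(conn.recv(1 << 20).decode("utf-8"))
                result = process(request["original_speech"], request["second_text"])
                conn.sendall(json.dumps(result).encode("utf-8"))

if __name__ == "__main__":
    serve()   # blocks and serves requests until interrupted
```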
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated unit can be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when the integrated unit is implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or by wireless means (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media.
  • the available media may be magnetic media (e.g., a floppy disk, hard disk, or magnetic tape), optical media (e.g., a DVD), or semiconductor media (e.g., a solid state disk (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

A speech processing method and a related device, applied to the field of song editing. The method comprises: acquiring original speech and a second text; predicting a second pitch feature of the second text according to a first pitch feature of unedited speech in the original speech and target text information; obtaining, according to the second pitch feature and the second text and by means of a neural network, a first speech feature corresponding to the second text; and generating, according to the first speech feature, target edited speech corresponding to the second text. By means of the present application, a pitch feature of the second text is predicted, a first speech feature of the second text is generated according to the pitch feature, and target edited speech corresponding to the second text is generated on the basis of the first speech feature, so that the pitch features of the speech before and after song editing are similar to each other, and thus the acoustic experience of the target edited speech is similar to that of the original speech.
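The data flow summarized in the abstract above can be sketched as follows. This is only a schematic rendering under assumed tensor shapes and module choices (PyTorch, a GRU pitch predictor, a feed-forward feature generator, and a placeholder vocoder); none of these details are specified by the publication.

```python
import torch
import torch.nn as nn

# Schematic rendering of the described flow, with assumed shapes and modules:
#   1) predict the second pitch feature of the second text from the first pitch
#      feature of the unedited speech and the target text information;
#   2) obtain a first speech feature for the second text from that pitch feature
#      and the text, via a neural network;
#   3) generate the target edited speech from the first speech feature.

class PitchPredictor(nn.Module):
    def __init__(self, text_dim: int = 256, pitch_dim: int = 1, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(text_dim + pitch_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pitch_dim)

    def forward(self, text_emb: torch.Tensor, unedited_pitch: torch.Tensor) -> torch.Tensor:
        # Condition the text frames on a summary of the unedited speech's pitch.
        ctx = unedited_pitch.mean(dim=1, keepdim=True).expand(-1, text_emb.size(1), -1)
        h, _ = self.rnn(torch.cat([text_emb, ctx], dim=-1))
        return self.out(h)                                   # predicted pitch per text frame

class FeatureGenerator(nn.Module):
    def __init__(self, text_dim: int = 256, pitch_dim: int = 1, mel_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + pitch_dim, hidden), nn.ReLU(), nn.Linear(hidden, mel_dim))

    def forward(self, text_emb: torch.Tensor, pitch: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([text_emb, pitch], dim=-1))   # first speech feature (e.g. mel frames)

def vocode(speech_feature: torch.Tensor) -> torch.Tensor:
    """Placeholder for waveform generation from the speech feature."""
    return torch.zeros(speech_feature.size(0), speech_feature.size(1) * 256)

# Dummy usage: one utterance, 20 text frames, 50 frames of unedited speech.
text_emb = torch.randn(1, 20, 256)           # target text information (embedded second text)
unedited_pitch = torch.rand(1, 50, 1)        # first pitch feature of the unedited speech
second_pitch = PitchPredictor()(text_emb, unedited_pitch)
speech_feat = FeatureGenerator()(text_emb, second_pitch)
target_edited_speech = vocode(speech_feat)
```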
PCT/CN2023/086497 2022-04-29 2023-04-06 Procédé de traitement de la parole et dispositif associé WO2023207541A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210468926.8A CN114882862A (zh) 2022-04-29 2022-04-29 一种语音处理方法及相关设备
CN202210468926.8 2022-04-29

Publications (1)

Publication Number Publication Date
WO2023207541A1 true WO2023207541A1 (fr) 2023-11-02

Family

ID=82673378

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/086497 WO2023207541A1 (fr) 2022-04-29 2023-04-06 Procédé de traitement de la parole et dispositif associé

Country Status (2)

Country Link
CN (1) CN114882862A (fr)
WO (1) WO2023207541A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882862A (zh) * 2022-04-29 2022-08-09 华为技术有限公司 一种语音处理方法及相关设备
CN116189654B (zh) * 2023-02-23 2024-06-18 京东科技信息技术有限公司 语音编辑方法、装置、电子设备及存储介质
CN117153144B (zh) * 2023-10-31 2024-02-06 杭州宇谷科技股份有限公司 基于端计算的电池信息语音播报方法和装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006349787A (ja) * 2005-06-14 2006-12-28 Hitachi Information & Control Solutions Ltd 音声合成方法および装置
JP2011170191A (ja) * 2010-02-19 2011-09-01 Fujitsu Ltd 音声合成装置、音声合成方法、及び音声合成プログラム
CN111899706A (zh) * 2020-07-30 2020-11-06 广州酷狗计算机科技有限公司 音频制作方法、装置、设备及存储介质
CN113421547A (zh) * 2021-06-03 2021-09-21 华为技术有限公司 一种语音处理方法及相关设备
CN113808555A (zh) * 2021-09-17 2021-12-17 广州酷狗计算机科技有限公司 歌曲合成方法及其装置、设备、介质、产品
CN113920977A (zh) * 2021-09-30 2022-01-11 宿迁硅基智能科技有限公司 一种语音合成模型、模型的训练方法以及语音合成方法
CN114882862A (zh) * 2022-04-29 2022-08-09 华为技术有限公司 一种语音处理方法及相关设备

Also Published As

Publication number Publication date
CN114882862A (zh) 2022-08-09

Similar Documents

Publication Publication Date Title
CN112487182B (zh) 文本处理模型的训练方法、文本处理方法及装置
CN110782870B (zh) 语音合成方法、装置、电子设备及存储介质
CN110853618B (zh) 一种语种识别的方法、模型训练的方法、装置及设备
CN111048062B (zh) 语音合成方法及设备
WO2023207541A1 (fr) Procédé de traitement de la parole et dispositif associé
CN113421547B (zh) 一种语音处理方法及相关设备
CN111179962B (zh) 语音分离模型的训练方法、语音分离方法及装置
CN111933115B (zh) 语音识别方法、装置、设备以及存储介质
KR102346046B1 (ko) 3차원 가상 인물 입모양 변화 제어 방법 및 장치
KR20210007786A (ko) 시각 보조 음성 처리
CN112069309A (zh) 信息获取方法、装置、计算机设备及存储介质
CN112233698A (zh) 人物情绪识别方法、装置、终端设备及存储介质
WO2022057759A1 (fr) Procédé de conversion de voix et dispositif associé
CN112632244A (zh) 一种人机通话的优化方法、装置、计算机设备及存储介质
WO2023197749A1 (fr) Procédé et appareil de détermination de point temporel d'insertion de musique de fond, dispositif et support de stockage
CN113822076A (zh) 文本生成方法、装置、计算机设备及存储介质
CN115688937A (zh) 一种模型训练方法及其装置
CN113948060A (zh) 一种网络训练方法、数据处理方法及相关设备
CN115240713B (zh) 基于多模态特征和对比学习的语音情感识别方法及装置
CN110781329A (zh) 图像搜索方法、装置、终端设备及存储介质
CN116978359A (zh) 音素识别方法、装置、电子设备及存储介质
CN115169472A (zh) 针对多媒体数据的音乐匹配方法、装置和计算机设备
KR20230120790A (ko) 가변적 언어모델을 이용한 음성인식 헬스케어 서비스
CN114333772A (zh) 语音识别方法、装置、设备、可读存储介质及产品
CN113822084A (zh) 语句翻译方法、装置、计算机设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23794974

Country of ref document: EP

Kind code of ref document: A1