WO2023207541A1 - Speech processing method and related device - Google Patents

Speech processing method and related device

Info

Publication number
WO2023207541A1
WO2023207541A1 PCT/CN2023/086497 CN2023086497W WO2023207541A1 WO 2023207541 A1 WO2023207541 A1 WO 2023207541A1 CN 2023086497 W CN2023086497 W CN 2023086497W WO 2023207541 A1 WO2023207541 A1 WO 2023207541A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
speech
voice
feature
target
Prior art date
Application number
PCT/CN2023/086497
Other languages
French (fr)
Chinese (zh)
Inventor
邓利群
朱杰明
张立超
赵洲
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023207541A1 publication Critical patent/WO2023207541A1/en

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335: Pitch control
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08: Speech classification or search

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and in particular, to a speech processing method and related equipment.
  • Artificial intelligence refers to theories, methods, technologies and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
  • artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that can respond in a manner similar to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • voice editing has very important practical significance. For example, in scenarios where users record songs (such as singing a cappella), some content in the voice is often wrong due to slips of the tongue. In this case, voice editing can help users quickly correct the erroneous content in the original singing voice and generate corrected voice.
  • a commonly used speech editing method is to pre-build a database containing a large number of speech segments, obtain segments of pronunciation units from the database, and use the segments to replace erroneous segments in the original speech to generate corrected speech.
  • However, the above-mentioned voice editing method relies on the diversity of the voice segments in the database. When the corrected voice (such as the user's singing voice) cannot be matched well by segments in the database, the corrected voice will have poor listening quality.
  • Embodiments of the present application provide a voice processing method and related equipment, which can make the listening experience of the edited singing voice similar to that of the original speech, thereby improving user experience.
  • this application provides a voice processing method, which can be applied to scenarios such as users recording short videos and teachers recording teaching voices.
  • the method may be executed by the speech processing device, or may be executed by a component of the speech processing device (such as a processor, a chip, or a chip system, etc.).
  • the speech processing device can be a terminal device or a cloud device. The method includes: obtaining the original speech and the second text, where the second text is the text in the target text other than the first text, and both the target text and the original text corresponding to the original speech include the first text; the speech corresponding to the first text in the original speech is the non-edited speech; predicting the second pitch feature of the second text according to the first pitch feature of the non-edited speech and the information of the target text; obtaining, through a neural network, the first speech feature corresponding to the second text according to the second pitch feature and the second text; and generating a target edited voice corresponding to the second text according to the first speech feature.
  • This application predicts the pitch feature of the second text (the text to be edited), generates the first speech feature of the second text based on the pitch feature, and generates the target edited voice corresponding to the second text based on the first speech feature, so that the pitch characteristics of the singing voice before and after editing are similar, thereby making the listening experience of the target edited voice similar to that of the original voice.
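  • As an illustration only, the following Python sketch outlines the data flow described above (predict the pitch feature of the text to be edited, generate speech features from it, then synthesize the edited voice). The callables `pitch_predictor`, `acoustic_model` and `vocoder` are hypothetical placeholders, not components defined by this application.

```python
import numpy as np

def edit_speech(non_edited_pitch: np.ndarray,     # first pitch feature, per frame of the non-edited speech
                target_text_phonemes: list[str],  # information of the target text (phoneme sequence)
                second_text_phonemes: list[str],  # phonemes of the second text (text to be edited)
                pitch_predictor, acoustic_model, vocoder):
    """Hedged sketch of the described pipeline; the three callables are assumed
    pre-trained models and are not specified by the source document."""
    # 1) Predict the second pitch feature of the second text from the first pitch
    #    feature of the non-edited speech and the target-text information.
    second_pitch = pitch_predictor(non_edited_pitch, target_text_phonemes)

    # 2) Obtain the first speech feature (e.g. a mel-spectrogram) for the second
    #    text through a neural network, conditioned on the predicted pitch.
    first_speech_feature = acoustic_model(second_text_phonemes, second_pitch)

    # 3) Generate the target edited voice waveform from the speech feature.
    target_edited_voice = vocoder(first_speech_feature)
    return target_edited_voice
```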
  • The second text can be obtained directly; or the position information (which can also be understood as mark information, used to indicate the position of the second text in the target text) can be obtained first and the second text determined from it; or the target text and the original text can be obtained (or the target text and the original voice, where the original voice is recognized to obtain the original text), and the second text is then determined based on the original text and the target text.
  • generating a target editing voice corresponding to the second text based on the second voice feature includes: generating the target editing voice through a vocoder based on the second voice feature.
  • the second voice feature is converted into the target edited voice by the vocoder, so that the target edited voice has voice features similar to those of the original voice, thereby improving the user's listening experience.
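  • The application does not fix a particular vocoder. As a minimal stand-in, the sketch below inverts a mel-spectrogram to a waveform with librosa's Griffin-Lim based inverter; a practical system would more likely use a trained neural vocoder.

```python
import librosa
import numpy as np
import soundfile as sf

def features_to_waveform(mel_spectrogram: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Convert a (n_mels, n_frames) mel-spectrogram into audio samples.
    Griffin-Lim is used here only as an illustrative, training-free vocoder."""
    return librosa.feature.inverse.mel_to_audio(mel_spectrogram, sr=sr)

# Usage sketch: write the target edited voice to disk.
# sf.write("target_edited_voice.wav", features_to_waveform(mel), 22050)
```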
  • the content of the original voice is the user's singing voice, which may be, for example, the voice recorded when the user sings a cappella.
  • obtaining the original voice and the second text includes: receiving the original voice and the second text sent by the terminal device; the method also includes: sending the target edited voice to the terminal device, where the target edited voice is used by the terminal device to generate the target speech corresponding to the target text. This can also be understood as an interactive scenario.
  • the cloud device performs complex calculation operations, and the terminal device performs a simple splicing operation.
  • The original voice and the second text are obtained from the terminal device. After the cloud device generates the target edited voice, it sends the target edited voice to the terminal device, and the terminal device then splices it to obtain the target voice.
  • When the voice processing device is a cloud device, through the interaction between the cloud device and the terminal device, the cloud device can perform the complex calculations to obtain the target edited voice and return it to the terminal device, which reduces the computing power and storage space required of the terminal device.
  • a target edited voice corresponding to the modified text can be generated based on the voice characteristics of the non-edited area in the original voice, and a target voice corresponding to the target text can then be generated from the non-edited voice and the target edited voice.
  • In a possible implementation, the above step of obtaining the original voice and the second text includes: receiving the original voice and the target text sent by the terminal device; the method further includes: generating a target voice corresponding to the target text based on the non-edited voice and the target edited voice, and sending the target voice to the terminal device.
  • After the original voice and the target text sent by the terminal device are received, the non-edited voice can be obtained, the second voice feature corresponding to the second text is generated based on the first voice feature of the non-edited voice, the target edited voice is then obtained from the second voice feature through the vocoder, and the target edited voice and the non-edited voice are spliced to generate the target voice. Equivalently, the processing is done on the voice processing device, and the result is returned to the terminal device.
  • the cloud device performs complex calculations to obtain the target voice and returns it to the terminal device, which can reduce the computing power and storage space of the terminal device.
  • Optionally, predicting the second pitch feature of the second text based on the first pitch (pitch) feature of the non-edited speech and the information of the target text includes: predicting the second pitch feature based on the first pitch feature of the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech; the second speech feature carries at least one of the following information: some or all of the speech frames of the non-edited speech; the voiceprint feature of the non-edited speech; the timbre feature of the non-edited speech; the prosodic feature of the non-edited speech; and the rhythmic feature of the non-edited speech.
  • the first speech feature can be the same or similar to the second speech feature in terms of rhythm, timbre, and/or signal-to-noise ratio.
  • Prosody can reflect the speaker's emotional state or speech form. Prosody generally refers to characteristics such as intonation, pitch, stress, pauses, or rhythm.
  • the second voice feature carries the voiceprint feature of the original voice.
  • the voiceprint features may be obtained directly, or the voiceprint features may be obtained by recognizing original speech, etc.
  • the subsequently generated first voice feature also carries the voiceprint feature of the original voice, thereby improving the similarity between the target edited voice and the original voice.
  • introducing voiceprint features can make the subsequently predicted voice features more similar to the voiceprint of the speaker of the original speech.
  • Optionally, the information of the target text includes: a text embedding of each phoneme in the target text.
  • Optionally, the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by deleting a first part of text from the first text, and the second text is text adjacent to the first part of text;
  • Predicting the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text includes: fusing the first pitch feature and the information of the target text to obtain a first fusion result; and inputting the first fusion result into a second neural network to obtain the second pitch feature of the second text.
  • the target text is obtained by replacing the second part of the text in the first text with the second text;
  • Predicting the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text includes:
  • the initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
  • the method further includes:
  • predicting the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text.
  • the first pitch (pitch) feature includes: the pitch feature of each frame in the multiple frames of the non-edited speech;
  • the second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited speech.
  • Optionally, predicting the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text includes: predicting the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech.
  • The method further includes: obtaining the position of the second text in the target text; and splicing the target edited voice and the non-edited voice based on the position to obtain the target voice corresponding to the target text. This can also be understood as replacing the edited voice in the original voice with the target edited voice, where the edited voice is the voice in the original voice other than the non-edited voice.
  • The target edited voice and the non-edited voice can be spliced according to the position of the second text in the target text. If the first text is all of the overlapping text between the original text and the target text, the voice of the desired text (i.e., the target text) can be generated without changing the non-edited voice in the original voice.
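  • A minimal sketch of the splicing step, assuming the boundaries of the edited region in the original waveform are already known from the position of the second text (the index variables here are hypothetical):

```python
import numpy as np

def splice(original: np.ndarray, target_edited: np.ndarray,
           edit_start: int, edit_end: int) -> np.ndarray:
    """Replace the edited region [edit_start:edit_end] (in samples) of the
    original voice with the target edited voice; the surrounding non-edited
    voice is kept unchanged."""
    return np.concatenate([original[:edit_start],   # non-edited voice before the edit
                           target_edited,           # newly generated target edited voice
                           original[edit_end:]])    # non-edited voice after the edit
```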
  • The method further includes: determining the non-edited voice based on the target text, the original text and the original voice. Specifically, this may be: determining the first text based on the target text and the original text; and determining the non-edited voice based on the first text, the original text and the original voice.
  • the non-edited voice of the first text in the original voice is determined by comparing the original text and the original voice, so as to facilitate the subsequent generation of the first voice feature.
  • The above step of determining the first text based on the target text and the original text includes: determining the overlapping text based on the target text and the original text; displaying the overlapping text to the user; and determining the first text from the overlapping text in response to a second operation by the user.
  • In a second aspect, this application provides a voice processing device, which includes:
  • an acquisition module, used to obtain the original speech and the second text;
  • the second text is the text in the target text except the first text.
  • the target text and the original text corresponding to the original speech both include the first text.
  • the voice corresponding to the first text in the original voice is non-edited voice;
  • a pitch prediction module configured to predict the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text
  • a generation module, configured to obtain the first speech feature corresponding to the second text through a neural network based on the second pitch feature and the second text, and to generate a target edited voice corresponding to the second text based on the first speech feature.
  • the content of the original voice is the user's singing voice.
  • Optionally, the second pitch feature of the second text is predicted based on the first pitch (pitch) feature of the non-edited voice, the information of the target text, and the second speech feature of the non-edited speech, where the second speech feature carries at least one of the following information: some or all of the speech frames of the non-edited speech; the voiceprint feature of the non-edited speech; the timbre feature of the non-edited speech; the prosodic feature of the non-edited speech; and the rhythmic feature of the non-edited speech.
  • the information of the target text includes: text embedding of each phoneme in the target text.
  • Optionally, the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by deleting a first part of text from the first text, and the second text is text adjacent to the first part of text;
  • the pitch prediction module is specifically configured to: fuse the first pitch feature and the information of the target text to obtain a first fusion result; and input the first fusion result into a second neural network to obtain the second pitch feature of the second text.
  • the target text is obtained by replacing the second part of the text in the first text with the second text;
  • the pitch prediction module is specifically used for:
  • the initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
  • the device further includes:
  • a duration prediction module configured to predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text.
  • the first pitch (pitch) feature includes: the pitch feature of each frame in the multiple frames of the non-edited speech;
  • the second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited voice.
  • the duration prediction module is specifically configured to predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech.
  • the acquisition module is further configured to obtain the position of the second text in the target text;
  • the generating module is further configured to splice the target edited voice and the non-edited voice based on the position to obtain a target voice corresponding to the target text.
  • a third aspect of the present application provides a voice processing device that performs the method in the foregoing first aspect or any possible implementation of the first aspect.
  • A fourth aspect of the present application provides a speech processing device, including: a processor, where the processor is coupled to a memory, and the memory is used to store programs or instructions. When the programs or instructions are executed by the processor, the speech processing device implements the method in the foregoing first aspect or any possible implementation of the first aspect.
  • A fifth aspect of the present application provides a computer-readable medium on which a computer program or instructions are stored. When the computer program or instructions are run on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
  • a sixth aspect of the present application provides a computer program product, which, when executed on a computer, causes the computer to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
  • Figure 1 is a schematic structural diagram of a system architecture provided by this application.
  • Figure 2 is a schematic structural diagram of a convolutional neural network provided by this application.
  • FIG. 3 is a schematic structural diagram of another convolutional neural network provided by this application.
  • FIG. 4 is a schematic diagram of the chip hardware structure provided by this application.
  • Figure 5 is a schematic flow chart of a neural network training method provided by this application.
  • Figure 6 is a schematic structural diagram of a neural network provided by this application.
  • Figure 7a is a schematic flow chart of the speech processing method provided by this application.
  • Figure 7b is a schematic diagram of duration prediction provided by this application.
  • Figure 7c is a schematic diagram of pitch prediction provided by this application.
  • Figure 7d is a schematic diagram of pitch prediction provided by this application.
  • FIGS 8 to 10 are several schematic diagrams of the display interface of the voice processing device provided by this application.
  • Figure 11 is a schematic structural diagram of a bidirectional decoder provided by this application.
  • Figure 12 is another schematic diagram of the display interface of the voice processing device provided by this application.
  • FIG. 13 is another schematic flow chart of the speech processing method provided by this application.
  • FIGS 14-16 are schematic structural diagrams of several speech processing devices provided by this application.
  • Embodiments of the present application provide a speech processing method and related equipment, which can realize that the listening feeling of edited speech is similar to that of original speech, thereby improving user experience.
  • the neural network can be composed of neural units.
  • the neural unit can refer to an arithmetic unit that takes $x_s$ and an intercept of 1 as inputs. The output of the arithmetic unit can be: $h_{W,b}(x) = f(W^T x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$, where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
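  • The formula above can be sketched in a few lines of Python; the sigmoid activation used here is just one possible choice of f:

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Output of a single neural unit: f(sum_s W_s * x_s + b)."""
    return sigmoid(np.dot(w, x) + b)

# Example: three inputs with their weights and a bias.
print(neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.3]), b=0.2))
```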
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • Deep neural network also known as multi-layer neural network
  • DNN can be understood as a neural network with many hidden layers. There is no special metric for "many” here. From the division of DNN according to the position of different layers, the neural network inside DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • the deep neural network may not include hidden layers, and there is no specific limitation here.
  • The work of each layer in a deep neural network can be described mathematically by the expression $y = \alpha(W \cdot x + b)$. From the physical level, the work of each layer in the deep neural network can be understood as completing the transformation from the input space to the output space (that is, from the row space of the matrix to the column space) through five operations on the input space (a collection of input vectors). These five operations are: 1. raising/reducing the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by $W \cdot x$, operation 4 is completed by $+b$, and operation 5 is implemented by $\alpha()$. The word "space" is used here because the object to be classified is not a single thing, but a class of things.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer.
  • This vector W determines the spatial transformation from the input space to the output space described above; that is, the weight W of each layer controls how the space is transformed.
  • the purpose of training a deep neural network is to finally obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vectors W of many layers). Therefore, the training process of neural network is essentially to learn how to control spatial transformation, and more specifically, to learn the weight matrix.
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolving the same trainable filter with an input image or feature map.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can be connected to only some of the neighboring layer neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as a way to extract image information independent of position. The underlying principle is that the statistical information of one part of the image is the same as that of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the same learned image information can be used for all positions on the image.
  • multiple convolution kernels can be used to extract different image information. Generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a random-sized matrix.
  • the convolution kernel can obtain reasonable weights through learning.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
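  • As an illustration of weight sharing, the sketch below slides a single 3×3 convolution kernel over a 2-D input; the same kernel (the shared weights) is applied at every position:

```python
import numpy as np

def conv2d_single_kernel(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Valid (no-padding) 2-D convolution with one shared weight matrix."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # same kernel weights reused everywhere
    return out

# Example: a simple edge-detection-like kernel applied to a random "image".
print(conv2d_single_kernel(np.random.rand(6, 6), np.array([[1, 0, -1]] * 3)).shape)  # (4, 4)
```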
  • the separation network, recognition network, detection network, depth estimation network and other networks in the embodiments of this application can all be CNNs.
  • A recurrent neural network (RNN) is a network in which the current output of a sequence is also related to previous outputs.
  • the specific form of expression is that the network will remember the previous information, save it in the internal state of the network, and apply it to the calculation of the current output.
  • Loss function (also called objective function): an equation used to measure the difference between the predicted value output by a neural network and the target value; training the network is essentially the process of making this difference (the loss) as small as possible.
  • Text to speech is a program or software system that converts text into speech.
  • a vocoder is a sound signal processing module or software that encodes acoustic features into a sound waveform.
  • Pitch can also be called fundamental frequency.
  • Fundamental frequency: when a sound-emitting body emits sound due to vibration, the sound can generally be decomposed into many simple sine waves; that is to say, all natural sounds are basically composed of many sine waves with different frequencies. The sine wave with the lowest frequency is the fundamental tone (that is, the fundamental frequency, which can be represented by F0), while the other sine waves with higher frequencies are overtones.
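  • As a hedged example of how frame-level pitch (F0) can be extracted in practice (the application itself only requires a pitch extraction algorithm), librosa's pYIN implementation can be used:

```python
import librosa
import numpy as np

def extract_f0(wav_path: str) -> np.ndarray:
    """Return the frame-level fundamental frequency (F0) of a recording;
    unvoiced frames come back as NaN from pyin and are set to 0 here."""
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    return np.nan_to_num(f0)
```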
  • prosody In the field of speech synthesis, prosody broadly refers to features that control functions such as intonation, pitch, emphasis, pauses, and rhythm. Prosody can reflect the speaker's emotional state or speech form, etc.
  • Phoneme It is the smallest unit of speech divided according to the natural properties of speech. It is analyzed based on the pronunciation movements in the syllable. One movement constitutes a phoneme. Phonemes are divided into two categories: vowels and consonants. For example, the Chinese syllable a (for example, one tone: ah) has only one phoneme, ai (for example, four tones: love) has two phonemes, dai (for example, one tone: stay) has three phonemes, etc.
  • Word vectors can also be called “word embeddings”, “vectorization”, “vector mapping”, “embeddings”, etc. Formally speaking, a word vector represents an object as a dense vector.
  • Speech features: the processed speech signal is converted into a concise and logical representation that is more discriminative and reliable than the raw signal. After acquiring a segment of speech signal, speech features can be extracted from it; the extraction usually produces a multi-dimensional feature vector for each speech signal. There are many ways to parameterize speech signals, such as perceptual linear prediction (PLP), linear predictive coding (LPC) and Mel-frequency cepstral coefficients (MFCC).
  • PLP: perceptual linear prediction; LPC: linear predictive coding; MFCC: Mel-frequency cepstral coefficients.
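  • For instance, MFCC features (one of the parameterizations listed above) can be computed as an (n_mfcc, n_frames) matrix; this is only an illustrative feature extractor, not the specific speech feature used by the application:

```python
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 13):
    """Extract a multi-dimensional feature vector (MFCC) per frame of speech."""
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
```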
  • the neural network includes an embedding layer and at least one transformer layer.
  • The at least one transformer layer can be N transformer layers (N is an integer greater than 0), where each transformer layer includes an attention layer, a summation and normalization (add & norm) layer, a feed-forward layer, and another summation and normalization layer that are successively adjacent.
  • In the embedding layer, the current input is embedded to obtain multiple feature vectors. In the attention layer, P input vectors are obtained from the layer above the first transformer layer; taking any first input vector among the P input vectors as the center, the intermediate vector corresponding to the first input vector is obtained based on the correlation between the first input vector and each input vector within a preset attention window, and in this way the P intermediate vectors corresponding to the P input vectors are determined. In the pooling layer, the P intermediate vectors are merged into Q output vectors, where the multiple output vectors obtained by the last transformer layer are used as the feature representation of the current input.
  • the current input is embedded to obtain multiple feature vectors.
  • the embedding layer can be called the input embedding layer.
  • the current input can be text input, for example, it can be a paragraph of text or a sentence.
  • the text can be Chinese text, English text, or other language text.
  • the embedding layer can embed each word in the current input to obtain the feature vector of each word.
  • the embedding layer includes an input embedding layer and a positional encoding layer.
  • word embedding processing can be performed on each word in the current input to obtain the word embedding vector of each word.
  • the position coding layer the position of each word in the current input can be obtained, and then a position vector is generated for the position of each word.
  • The position of each word may be the absolute position of each word in the current input. Taking the current input "What number should I pay back the Huabei?" as an example, the position of the first word can be represented as the first position, the position of the second word as the second position, and so on. In some examples, the position of each word may instead be a relative position between words; still taking the same input as an example, the position of one word can be expressed as being before or after its neighboring words.
  • the position vector of each word and the corresponding word embedding vector can be combined to obtain the feature vector of each word, that is, multiple feature vectors corresponding to the current input are obtained.
  • Multiple feature vectors can be represented as embedding matrices with preset dimensions.
  • If the number of feature vectors is set to M and the preset dimension is H, the multiple feature vectors can be expressed as an M×H embedding matrix.
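  • A minimal sketch of the embedding layer described above: each token gets a word-embedding vector plus a position vector, and the M tokens of the current input form an M×H embedding matrix. The vocabulary, the dimension H and the sinusoidal form of the position vector are arbitrary illustrative choices, not requirements of the application.

```python
import numpy as np

def embed(current_input: list[str], vocab: dict[str, int], H: int = 8) -> np.ndarray:
    """Combine word embeddings with position vectors -> M x H embedding matrix."""
    rng = np.random.default_rng(0)
    word_table = rng.normal(size=(len(vocab), H))          # learned in a real model
    M = len(current_input)
    word_emb = np.stack([word_table[vocab[w]] for w in current_input])
    # Absolute-position encoding (the classic sinusoidal form, one possible choice).
    pos = np.arange(M)[:, None]
    dim = np.arange(H)[None, :]
    angle = pos / np.power(10000, (2 * (dim // 2)) / H)
    pos_emb = np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))
    return word_emb + pos_emb                               # feature vector of each word

vocab = {"what": 0, "number": 1, "should": 2, "i": 3, "repay": 4}
print(embed(["what", "number", "should", "i", "repay"], vocab).shape)  # (5, 8)
```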
  • the attention mechanism imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience and external sensation to increase the precision of observation in some areas, and can use limited attention resources to quickly filter out high-value information from a large amount of information. .
  • the attention mechanism can quickly extract important features of sparse data and is therefore widely used in natural language processing tasks, especially machine translation.
  • the self-attention mechanism is an improvement of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features.
  • the essential idea of the attention mechanism can be expressed by the following formula: $\text{Attention}(Query, Source) = \sum_{i=1}^{L_x} \text{Similarity}(Query, Key_i) \cdot Value_i$, where $L_x = \|Source\|$ represents the length of Source.
  • The meaning of the formula is as follows: imagine that the constituent elements in Source are composed of a series of Key-Value data pairs. Given a certain element Query in the target Target, the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and each Key, and the Values are then weighted and summed to obtain the final Attention value. So essentially the Attention mechanism is a weighted summation of the Value values of the elements in Source, with Query and Key used to calculate the weight coefficients of the corresponding Values.
  • Attention can be understood as selectively filtering out a small amount of important information from a large amount of information and focusing on this important information, while ignoring most of the unimportant information.
  • the process of focusing is reflected in the calculation of the weight coefficient.
  • the self-attention mechanism can be understood as internal Attention (intra attention).
  • the Attention mechanism occurs between the Target element Query and all elements in the Source.
  • the self-attention mechanism refers to between the internal elements of the Source or between the internal elements of the Target.
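  • The weighted-summation idea in the formula above can be sketched as follows; the softmax over dot-product similarities is one common way (not the only one) of turning Query-Key correlations into weight coefficients:

```python
import numpy as np

def attention(query: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """query: (d,), keys: (Lx, d), values: (Lx, dv) -> weighted sum of the values."""
    scores = keys @ query / np.sqrt(query.shape[0])      # similarity of Query with each Key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                             # weight coefficient of each Value
    return weights @ values                              # final Attention value

# Self-attention simply takes query, keys and values from the same source sequence.
x = np.random.rand(4, 16)                                # Lx = 4 source elements
print(attention(x[0], x, x).shape)                       # (16,)
```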
  • the scenario of singing voice editing is when the user is recording a song (such as singing a cappella).
  • voice editing is usually used. Currently, the common voice editing method is to obtain voice segments from a database, replace the erroneous content with those segments, and then generate the corrected speech.
  • this application provides a voice editing method.
  • the pitch characteristics affect how similar the listening experience of the target edited voice is to that of the original voice.
  • This application predicts the pitch feature of the second text (the text to be edited), generates the first voice feature of the second text based on the pitch feature, and generates the target edited voice corresponding to the second text based on the first voice feature, so that the pitch features of the singing voice before and after editing are similar, thereby making the listening experience of the target edited voice similar to that of the original speech.
  • an embodiment of the present application provides a system architecture 10.
  • the data collection device 16 is used to collect training data.
  • the training data includes training speech and training text corresponding to the training speech.
  • the training data is stored in the database 13, and the training device 12 trains to obtain the target model/rule 101 based on the training data maintained in the database 13.
  • The target model/rules 101 can be used to implement the speech processing method provided by the embodiments of the present application; that is, after relevant preprocessing, the text is input into the target model/rule 101, which then obtains the speech features of the text.
  • The target model/rule 101 in the embodiment of this application may specifically be a neural network. It should be noted that, in actual applications, the training data maintained in the database 13 may not all be collected by the data collection device 16 and may also be received from other devices. In addition, the training device 12 may not necessarily train the target model/rules 101 entirely based on the training data maintained in the database 13; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be construed as a limitation on the embodiments of this application.
  • the target model/rules 101 trained according to the training device 12 can be applied to different systems or devices, such as to the execution device 11 shown in Figure 1.
  • The execution device 11 can be a terminal, such as a mobile phone, a tablet computer, a laptop, an AR/VR device, or a vehicle-mounted terminal, or it can be a server or a cloud, etc.
  • the execution device 11 is configured with an I/O interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 14.
  • In this embodiment of the present application, the input data may include: the second voice feature, the target text and the mark information; the input data may also include the second voice feature and the second text.
  • the input data can be input by the user, or uploaded by the user through other devices. Of course, it can also come from a database, and there is no specific limit here.
  • the preprocessing module 113 is configured to perform preprocessing according to the target text and mark information received by the I/O interface 112. In the embodiment of the present application, the preprocessing module 113 may be used to determine the target editing text in the target text based on the target text and mark information. If the input data includes the second speech feature and the second text, the preprocessing module 113 is configured to perform preprocessing according to the target text and mark information received by the I/O interface 112, for example, converting the target text into phonemes and other preparatory work.
  • When the execution device 11 preprocesses the input data, or when the calculation module 111 of the execution device 11 performs calculations and other related processing, the execution device 11 can call data, code, etc. in the data storage system 15 for the corresponding processing, and the data, instructions, etc. obtained by that processing can also be stored in the data storage system 15.
  • the I/O interface 112 returns the processing result, such as the first voice feature obtained as described above, to the client device 14, thereby providing it to the user.
  • the training device 12 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the The above tasks, thereby providing the user with the desired results or providing input for other subsequent processing.
  • the user can manually set the input data, and the manual setting can be operated through the interface provided by the I/O interface 112 .
  • the client device 14 can automatically send input data to the I/O interface 112. If requiring the client device 14 to automatically send the input data requires the user's authorization, the user can set corresponding permissions in the client device 14. The user can view the results output by the execution device 11 on the client device 14, and the specific presentation form may be display, sound, action, etc.
  • the client device 14 can also be used as a data collection terminal to collect input data from the input I/O interface 112 and output results from the output I/O interface 112 as new sample data, and store them in the database 13 .
  • As shown in the figure, the I/O interface 112 can also directly use the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data and store them in the database 13.
  • Figure 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 15 is an external memory relative to the execution device 11. In other cases, the data storage system 15 can also be placed in the execution device 11.
  • a target model/rule 101 is obtained by training according to the training device 12.
  • the target model/rule 101 can be a neural network in the embodiment of the present application.
  • The neural network can be a recurrent neural network, a long short-term memory network, etc.
  • the prediction network can be a convolutional neural network, a recurrent neural network, etc.
  • The neural network and the prediction network in the embodiment of this application can be two separate networks, or they can be one multi-task neural network, in which one task is to output durations, another task is to predict pitch features, and another task is to output speech features.
  • CNN is a very common neural network
  • the structure of CNN will be introduced in detail below in conjunction with Figure 2.
  • a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to the image input into it.
  • a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional/pooling layer 120, and a neural network layer 130 where the pooling layer is optional.
  • The convolution layer/pooling layer 120 may include, as an example, layers 121-126. In one implementation, layer 121 is a convolution layer, layer 122 is a pooling layer, layer 123 is a convolution layer, and layer 124 is a pooling layer; in another implementation, layers 121 and 122 are convolution layers, layer 123 is a pooling layer, layers 124 and 125 are convolution layers, and layer 126 is a pooling layer. That is, the output of a convolution layer can be used as the input of a subsequent pooling layer, or as the input of another convolution layer to continue the convolution operation.
  • the convolution layer 121 may include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride) to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • During the convolution operation, the weight matrix extends across the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolved output with a single depth dimension, but in most cases, instead of a single weight matrix, multiple weight matrices with the same dimensions are applied, and the output of each weight matrix is stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features in the image; for example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same dimensions, so the feature maps they extract also have the same dimensions, and the extracted feature maps are then combined to form the output of the convolution operation.
  • weight values in these weight matrices need to be obtained through a large amount of training in practical applications.
  • Each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
  • The features extracted by the initial convolution layer (for example, layer 121) are relatively general, while the features extracted by subsequent convolution layers (for example, layer 126) become more and more complex, such as high-level semantic features.
  • A pooling layer may follow a single convolution layer, or multiple convolution layers may be followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the average value of pixel values in an image within a specific range.
  • the max pooling operator can take the pixel with the largest value in a specific range as the result of max pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image processed by the pooling layer can be smaller than the size of the image input to the pooling layer.
  • Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
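  • A small illustration of 2×2 max pooling and average pooling on a feature map (with the stride equal to the window size, the usual default):

```python
import numpy as np

def pool2d(x: np.ndarray, size: int = 2, mode: str = "max") -> np.ndarray:
    """Downsample an (H, W) feature map; each output pixel summarizes one window."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]                     # drop ragged borders
    windows = x.reshape(H // size, size, W // size, size)
    return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(feature_map, mode="max"))    # 2x2 output of window maxima
print(pool2d(feature_map, mode="avg"))    # 2x2 output of window averages
```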
  • After being processed by the convolution layer/pooling layer 120, the convolutional neural network 100 is still not able to output the required output information, because, as mentioned above, the convolution layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. In order to generate the final output information (the required class information or other related information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs of the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in Figure 2) and an output layer 140. The parameters included in the multiple hidden layers may be pre-trained based on relevant training data for a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, etc.
  • After the multiple hidden layers in the neural network layer 130, that is, as the last layer of the entire convolutional neural network 100, comes the output layer 140.
  • the output layer 140 has a loss function similar to classification cross entropy, specifically used to calculate the prediction error.
  • the convolutional neural network 100 shown in Figure 2 is only an example of a convolutional neural network.
  • The convolutional neural network can also exist in the form of other network models, for example, a network in which multiple convolution layers/pooling layers are parallel, as shown in Figure 3, and the features extracted by each branch are all input to the neural network layer 130 for processing.
  • Figure 4 is a chip hardware structure provided by an embodiment of the present application.
  • the chip includes a neural network processor 40.
  • The chip can be disposed in the execution device 11 shown in Figure 1 to complete the calculation work of the calculation module 111.
  • The chip can also be provided in the training device 12 shown in Figure 1 to complete the training work of the training device 12 and output the target model/rules 101.
  • the algorithms of each layer in the convolutional neural network shown in Figure 2 can be implemented in the chip shown in Figure 4.
  • The neural network processor 40 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or another processor suitable for large-scale computation.
  • the NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and the main CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 403.
  • the controller 404 controls the arithmetic circuit 403 to extract data in the memory (weight memory or input memory) and perform operations.
  • the computing circuit 403 internally includes multiple processing engines (PEs).
  • arithmetic circuit 403 is a two-dimensional systolic array.
  • the arithmetic circuit 403 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 403 is a general-purpose matrix processor.
  • the arithmetic circuit obtains the corresponding data of matrix B from the weight memory 402 and caches it on each PE in the arithmetic circuit.
  • The arithmetic circuit obtains the data of matrix A from the input memory 401, performs a matrix operation with matrix B, and stores the partial or final result of the matrix in the accumulator 408.
  • the vector calculation unit 407 can further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • the vector calculation unit 407 can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc. .
  • The vector calculation unit 407 can store the processed output vector into the unified buffer 406.
  • the vector calculation unit 407 may apply a nonlinear function to the output of the operation circuit 403, such as a vector of accumulated values, to generate an activation value.
  • vector calculation unit 407 generates normalized values, merged values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 403, such as for use in a subsequent layer in a neural network.
  • the unified memory 406 is used to store input data and output data.
  • The direct memory access controller (DMAC) 405 transfers the input data in the external memory to the input memory 401 and/or the unified memory 406, stores the weight data in the external memory into the weight memory 402, and stores the data in the unified memory 406 into the external memory.
  • a bus interface unit (BIU) 410 is used to implement interaction between the main CPU, the DMAC and the fetch memory 409 through the bus.
  • An instruction fetch buffer 409 connected to the controller 404 is used to store instructions used by the controller 404.
  • the controller 404 is used to call instructions cached in the memory 409 to control the working process of the computing accelerator.
  • the unified memory 406, the input memory 401, the weight memory 402 and the instruction memory 409 are all on-chip memories, and the external memory is a memory external to the NPU.
  • the external memory can be double data rate synchronous dynamic random access. Memory (double data rate synchronous dynamic random access memory, DDR SDRAM for short), high bandwidth memory (high bandwidth memory, HBM) or other readable and writable memory.
  • each layer in the convolutional neural network shown in Figure 2 or Figure 3 can be performed by the operation circuit 403 or the vector calculation unit 407.
  • This voice processing method can be applied to scenarios where voice content needs to be modified, such as scenarios where users record short videos, teachers record teaching voices, etc.
  • the voice processing method can be applied to applications, software or voice processing devices with voice editing functions such as smart voice assistants on mobile phones, computers, detachable terminals that can produce sounds, smart speakers, etc.
  • the voice processing device is a terminal device used to serve users, or a cloud device.
  • the terminal device may include a head mount display (HMD), which may be a combination of a virtual reality (VR) box and a terminal, a VR all-in-one machine, or a personal computer (PC), Augmented reality (AR) devices, mixed reality (MR) devices, etc.
  • The terminal device may also include a cellular phone, a smart phone, a personal digital assistant (PDA), a tablet computer, a laptop computer, a personal computer (PC), a vehicle-mounted terminal, etc. The details are not limited here.
  • the neural network and the prediction network in the embodiment of this application can be two separate networks, or they can be a multi-task neural network, one of which is to output duration, and the other is to output speech features.
  • the training method shown in Figure 5 can be executed by a neural network training device.
  • the neural network training device can be a cloud service device or a terminal device.
  • The neural network training device may also be a system composed of a cloud service device and a terminal device.
  • The training method may be executed by the training device 12 in Figure 1 and the neural network processor 40 in Figure 4.
  • the training method can be processed by the CPU, or it can be processed by the CPU and GPU together, or it can not use the GPU but use other processors suitable for neural network calculations, which is not limited by this application.
  • the training method shown in Figure 5 includes step 501 and step 502. Step 501 and step 502 will be described in detail below.
  • the prediction network in the embodiment of this application can be a transformer network, RNN, CNN, etc., and is not specifically limited here.
  • the input is the vector of the training text
  • The output is the duration, pitch feature or speech feature of each phoneme in the training text. The prediction network is then trained by continuously narrowing the difference between the duration, pitch feature or speech feature of each phoneme output by the prediction network and the actual duration, actual pitch feature or actual speech feature corresponding to the training text, thereby obtaining the trained prediction network.
  • Step 501 Obtain training data.
  • the training data in the embodiment of the present application includes training speech, or includes training speech and training text corresponding to the training speech. If the training data does not include training text, the training text can be obtained by recognizing the training speech.
  • The training data may also include a user identification, or the voiceprint features of the training speech, or a vector used to identify the voiceprint features of the training speech.
  • the training data may also include start and end duration information of each phoneme in the training speech.
  • the training data can be obtained by directly recording the utterances of the voicing object, or by the user inputting audio information and video information, or by receiving transmissions from the collection device.
  • the training data there are other ways to obtain training data, and there are no specific limitations on how to obtain training data.
  • Step 502 Use the training data as the input of the neural network, train the neural network with the goal that the value of the loss function is less than the threshold, and obtain a trained neural network.
  • some preprocessing can be performed on the training data.
  • the training data includes training speech as described above
  • the training text can be obtained by identifying the training speech, and the training text can be input into the neural network using phoneme representation.
  • The entire training text can be regarded as the target editing text and used as input to train the neural network with the goal of reducing the value of the loss function, that is, continuously reducing the difference between the speech features output by the neural network and the actual speech features corresponding to the training speech.
  • This training process can be understood as a prediction task.
  • the loss function can be understood as the loss function corresponding to the prediction task.
  • the neural network in the embodiment of this application may specifically be an attention mechanism model, such as transformer, tacotron2, etc.
  • the attention mechanism model includes an encoder-decoder, and the structure of the encoder or decoder can be a recurrent neural network, a long short-term memory network (long short-term memory, LSTM), etc.
  • the neural network in the embodiment of the present application includes an encoder and a decoder.
  • the structural types of the encoder and decoder may be RNN, LSTM, etc., and are not limited here.
  • the function of the encoder is to encode the training text into a text vector (a vector representation in units of phonemes, with each input corresponding to a vector).
  • the function of the decoder is to obtain the phonetic features corresponding to the text based on the text vector.
• the calculation of each step is based on the real speech features corresponding to the previous step.
  • the prediction network can be used to correct the speech duration corresponding to the text vector. That is, it can be understood as upsampling the text vector according to the duration of each phoneme in the training speech (it can also be understood as extending the number of frames of the vector) to obtain a vector of the corresponding number of frames.
  • the function of the decoder is to obtain the speech features corresponding to the text based on the above vector corresponding to the frame number.
  • the above-mentioned decoder may be a unidirectional decoder or a bidirectional decoder (that is, two directions are parallel), and the details are not limited here.
  • the two directions refer to the direction of the training text, which can also be understood as the direction of the vector corresponding to the training text. It can also be understood as the forward or reverse order of the training text.
• one direction runs from one side of the training text to the other side, and the other direction is the reverse of it.
• for example, the first direction (forward order) runs from the first character of the training text to the last character, and the second direction (reverse order) runs from the last character back to the first.
  • the decoder is a bidirectional decoder
  • the decoders in both directions are trained in parallel and are calculated independently during the training process, so there is no dependence on the results.
  • the prediction network and the neural network are a multi-task network
  • the prediction network can be called a prediction module, and the decoder can correct the speech features output by the neural network based on the real duration information corresponding to the training text.
• the input during model training can be the original singing audio and the corresponding lyric text (expressed in units of phonemes).
• the duration information of each phoneme in the original audio, the singer voiceprint features, frame-level pitch information, etc. can be obtained through other pre-trained models or tools (such as a singing-lyrics alignment tool, a singer voiceprint extraction tool, and a pitch extraction algorithm).
• the output can be a trained acoustic model, and the training goal is to minimize the error between the predicted singing voice features and the real singing voice features.
  • training data sets can be synthesized based on singing voices, and corresponding training data samples can be constructed by simulating "insertion, deletion and replacement" operation scenarios.
• Stage 1: First use ground-truth lyrics and audio, together with pitch and duration data, to train a singing voice synthesis model and obtain the trained text encoding module and audio feature decoding module;
• Stage 2: Fix the text encoding module and the audio feature decoding module, and use the simulated editing-operation training data set to train the duration regularization module and the pitch prediction module;
• Stage 3: End-to-end training, fine-tuning the entire model using all training data (a sketch of this three-stage schedule follows below).
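• The following sketch illustrates one way such a three-stage schedule could be organized; the module names (text_encoder, decoder, duration_predictor, pitch_predictor), the placeholder losses, and the dummy data are all assumptions for illustration, not the modules of this application.

```python
import itertools
import torch
import torch.nn as nn

# Tiny placeholder modules standing in for the four components named above.
text_encoder, decoder = nn.Linear(8, 8), nn.Linear(8, 8)
duration_predictor, pitch_predictor = nn.Linear(8, 1), nn.Linear(8, 1)

def train(params, batches, loss_fn, lr=1e-4):
    """Optimize only the parameters that currently require gradients."""
    opt = torch.optim.Adam([p for p in params if p.requires_grad], lr=lr)
    for x, y in batches:
        opt.zero_grad()
        loss_fn(x, y).backward()
        opt.step()

batches = [(torch.randn(4, 8), torch.randn(4, 8))]   # dummy training data

# Stage 1: train the text encoder and audio feature decoder on ground truth.
train(itertools.chain(text_encoder.parameters(), decoder.parameters()),
      batches, lambda x, y: ((decoder(text_encoder(x)) - y) ** 2).mean())

# Stage 2: freeze encoder/decoder, train the duration and pitch modules on
# the simulated editing-operation data set.
for p in itertools.chain(text_encoder.parameters(), decoder.parameters()):
    p.requires_grad = False
train(itertools.chain(duration_predictor.parameters(),
                      pitch_predictor.parameters()),
      batches,
      lambda x, y: ((duration_predictor(x) - y[:, :1]) ** 2).mean()
                   + ((pitch_predictor(x) - y[:, :1]) ** 2).mean())

# Stage 3: unfreeze everything and fine-tune the whole model end to end.
for p in itertools.chain(text_encoder.parameters(), decoder.parameters()):
    p.requires_grad = True
train(itertools.chain(text_encoder.parameters(), decoder.parameters(),
                      duration_predictor.parameters(),
                      pitch_predictor.parameters()),
      batches,
      lambda x, y: ((decoder(text_encoder(x)) - y) ** 2).mean()
                   + ((duration_predictor(x) - y[:, :1]) ** 2).mean()
                   + ((pitch_predictor(x) - y[:, :1]) ** 2).mean())
```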
• the neural network includes an encoder and a decoder.
  • the neural network may also include a prediction module and an upsampling module.
  • the prediction module is specifically used to implement the function of the above prediction network
  • the upsampling module is specifically used to implement the above process of upsampling the text vector according to the duration of each phoneme in the training speech, which will not be described again here.
• the training process may also adopt training methods other than the aforementioned ones, which are not limited here.
  • the voice processing method provided by the embodiment of the present application can be applied to replacing scenes, inserting scenes, or deleting scenes.
  • the above scenario can be understood as replacing, inserting, deleting, etc. the original speech corresponding to the original text to obtain the target speech, so as to achieve a similar listening experience between the target speech and the original speech and/or improve the fluency of the target speech.
  • the original voice can be considered to include the voice to be modified, and the target voice is the voice obtained after the user wants to modify the original voice.
  • the original text is "Today the weather is very good in Shenzhen", and the target text is “Today the weather is very good in Guangzhou”.
  • the overlapping text is "The weather is very good today”.
  • the non-overlapping text in the original text is "Shenzhen”, and the non-overlapping text in the target text is "Guangzhou”.
  • the target text includes a first text and a second text, and the first text is an overlapping text or a part of the overlapping text.
• the second text is the text in the target text other than the first text. For example, if the first text is "The weather is very good today", the second text is "Guangzhou"; if the first text is only part of the overlapping text, the second text includes "Guangzhou" together with the remaining characters of the overlapping text.
  • the original text is "The weather in Shenzhen is very good today", and the target text is "The weather in Shenzhen is very good this morning".
  • the overlapping text is "The weather in Shenzhen is very good today”.
• the non-overlapping text in the target text is "morning".
• the insertion scene can be regarded as a replacement scene in which the characters adjacent to the insertion point in the original speech are replaced by those same characters with "morning" inserted between them. That is, the first text is the rest of the target text, and the second text is "morning" together with its adjacent characters.
  • the original text is "The weather in Shenzhen is very good today” and the target text is "The weather is very good today”.
  • the overlapping text is "The weather is very good today”.
  • the non-overlapping text in the original text is "Shenzhen”.
• the deletion scene can be regarded as a replacement scene in which the deleted text together with its adjacent characters in the original speech is replaced by those adjacent characters alone. That is, the first text is the remaining part of the target text, and the second text is the characters adjacent to the deleted text.
  • the speech processing method provided by the embodiment of the present application is described below only by taking the replacement scene as an example.
  • the voice processing method provided by the embodiments of this application can be executed by the terminal device or the cloud device alone, or can be completed by the terminal device and the cloud device together. They are described below:
  • Embodiment 1 The terminal device or the cloud device executes the voice processing method independently.
  • Figure 7a is an example of a voice processing method provided by the embodiment of the present application.
  • This method can be executed by a voice processing device or by a component of the voice processing device (such as a processor, a chip, or a chip system, etc.).
  • the voice processing device may be a terminal device or a cloud device, and this embodiment includes steps 701 to 704.
  • Step 701 Obtain the original voice and the second text.
  • the speech processing device can directly obtain the original speech, the original text and the second text. It is also possible to obtain the original speech and the second text first, and then recognize the original speech to obtain the original text corresponding to the original speech.
  • the second text is the text in the target text other than the first text, and the original text and the target text contain the first text.
  • the first text can be understood as part or all of the overlapping text of the original text and the target text.
  • the content of the original voice is the user's singing voice, which may be, for example, the voice recorded when the user sings a cappella.
  • the voice processing device can directly obtain the second text through input from other devices or users.
• the speech processing device obtains the target text, obtains the overlapping text based on the target text and the original text corresponding to the original speech, and then determines the second text based on the overlapping text.
• the characters in the original text and the target text can be compared one by one, or the two texts can be input into a comparison model, to determine the overlapping text and/or non-overlapping text between the original text and the target text (a comparison sketch follows after this discussion).
  • the first text may be an overlapping text, or may be part of the overlapping text.
• the speech processing device can directly determine the overlapping text as the first text, can determine the first text in the overlapping text according to preset rules, or can determine the first text in the overlapping text according to a user operation.
  • the preset rule may be to obtain the first text after removing N characters in the overlapping content, where N is a positive integer.
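• As one illustrative way of carrying out the character-by-character comparison mentioned above, the following sketch uses Python's standard difflib module; it is a simple example under the assumption that the texts are compared as character sequences, not the comparison model of this application.

```python
from difflib import SequenceMatcher

def split_texts(original: str, target: str):
    """Return (overlapping, original_only, target_only) text."""
    matcher = SequenceMatcher(a=original, b=target, autojunk=False)
    overlap, orig_only, tgt_only = [], [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            overlap.append(original[i1:i2])
        else:                          # 'replace', 'delete' or 'insert'
            orig_only.append(original[i1:i2])
            tgt_only.append(target[j1:j2])
    return "".join(overlap), "".join(orig_only), "".join(tgt_only)

# Example for the replacement scene; with Chinese text each character would
# be one comparison unit, and very short spurious matches can be filtered.
overlap, removed, added = split_texts(
    "Today the weather is very good in Shenzhen",
    "Today the weather is very good in Guangzhou")
print(overlap)   # candidate overlapping text (first text)
print(removed)   # non-overlapping text in the original text
print(added)     # non-overlapping text in the target text (second text)
```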
  • the speech processing device can align the original text with the original speech, determine the starting and ending positions of each phoneme in the original text in the original speech, and can learn the duration of each phoneme in the original text. Then, the phonemes corresponding to the first text are obtained, that is, the speech corresponding to the first text in the original speech (that is, the non-edited speech) is obtained.
• the voice processing device can align the original text with the original speech by using a forced alignment method, such as the Montreal forced aligner (MFA), a neural network with an alignment function, or other alignment tools, which are not specifically limited here (a sketch of turning alignment intervals into phoneme durations follows below).
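• For illustration, once a forced aligner such as MFA has produced start and end times for each phoneme, the per-phoneme duration in feature frames can be derived as in the sketch below; the frame parameters, the interval format, and the example times are assumptions.

```python
def phoneme_frame_counts(intervals, hop_length=256, sample_rate=22050):
    """intervals: list of (phoneme, start_sec, end_sec) from a forced aligner.
    Returns a list of (phoneme, number_of_feature_frames)."""
    frames_per_second = sample_rate / hop_length
    counts = []
    for phoneme, start, end in intervals:
        n_frames = max(1, round((end - start) * frames_per_second))
        counts.append((phoneme, n_frames))
    return counts

# Hypothetical alignment fragment with illustrative pinyin phonemes and times.
alignment = [("j", 0.00, 0.06), ("in1", 0.06, 0.18),
             ("t", 0.18, 0.24), ("ian1", 0.24, 0.40)]
print(phoneme_frame_counts(alignment))
```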
  • the user interface can be displayed to the user, and the user interface includes the original speech and the original text. Further, the user performs a first operation on the original text through the user interface, and the speech processing device determines the target text in response to the user's first operation.
  • the first operation can be understood as the user's editing of the original text, and the editing can be the aforementioned replacement, insertion, or deletion.
  • the original text is "Today the weather is very good in Shenzhen", and the target text is "Today the weather is very good in Guangzhou”.
  • the speech processing device is a mobile phone. After the speech processing device obtains the original text and the original voice, the user is presented with an interface as shown in Figure 8, which includes the original text and the original voice. As shown in Figure 9, the user can perform the first operation 901 on the original text, such as modifying "Shenzhen" to "Guangzhou” and other aforementioned insertion, deletion, and replacement operations.
  • only replacement is used as an example for description.
  • the speech processing device displays the overlapping text to the user, and then determines the first text from the overlapping text according to the user's second operation, and then determines the second text.
• the second operation can be a click, a drag, a slide, etc., which is not specifically limited here.
  • the second text is "Guangzhou”
  • the first text is "The weather is very good today”
  • the non-edited voice is the voice of the first text in the original voice.
  • the non-edited speech is equivalent to frames 1 to 4 and frames 9 to 16 in the original speech. It can be understood that in practical applications, the correspondence between text and voice frames is not necessarily 1:2 as in the above example.
  • the above example is only for the convenience of understanding the non-editing area.
  • the number of frames corresponding to the original text is not specifically limited here.
• the speech processing device can display an interface as shown in Figure 10, which can include the second text, the target text, and the non-edited voice and the edited voice in the original voice, where the second text is "Guangzhou" and the target text is "The weather in Guangzhou is very good today".
• the non-edited voice is the voice corresponding to "The weather is very good today", and the edited voice is the voice corresponding to "Shenzhen". It can also be understood that as the user edits the target text, the speech processing device determines the non-edited voice in the original voice based on the target text, the original text, and the original voice.
  • the voice processing device receives an editing request sent by the user, where the editing request includes the original voice and the second text.
  • the edit request also includes the original text and/or speaker identification.
  • the editing request can also include the original speech and the target text.
  • Step 702 Predict the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text.
  • the information of the target text includes: text embedding of each phoneme in the target text.
  • the text embedding of each phoneme in the target text can be obtained through the text encoding module (Text Encoder) based on the target text.
• the target text can be converted into the corresponding phoneme sequence (for example, the phonemes corresponding to "How can love not ask whether it is right or wrong" are the sequence of pinyin initials and finals), and then input into the Text Encoder to be converted into the corresponding phoneme-level text embeddings.
• the Text Encoder can be exemplified by the text encoder of the Tacotron 2 model.
  • the number of frames of each phoneme in the non-edited speech (which can also be called the duration) can be obtained, and based on the number of frames of each phoneme in the non-edited speech and the information of the target text, Predict the number of frames for each phoneme in the second text.
  • the neural network used to predict the number of frames of each phoneme in the second text can be as shown in Figure 7b (for example, it can be a duration prediction model based on a mask mechanism that fuses the original real duration), It uses the output of Text Encoder and the original real duration (Reference Duration, that is, the duration of each phoneme in the first text) and the corresponding mask as input to predict the duration of each phoneme to be edited (that is, each phoneme in the second text) (That is, the number of frames corresponding to the audio).
• each text embedding can be upsampled based on the predicted duration of each phoneme to obtain an embedding sequence matching the number of frames (for example, if the predicted duration of phoneme "ai" is 10 frames, the text embedding corresponding to "ai" can be copied N times, where N is a positive integer greater than 1, for example N is 10); a sketch of this upsampling follows below.
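• The following is a minimal sketch of this upsampling (length regulation) step, assuming the phoneme-level embeddings are stored as a tensor and the predicted durations are integer frame counts; the sizes are illustrative.

```python
import torch

def length_regulate(phoneme_embeddings: torch.Tensor,
                    durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme-level embedding durations[i] times so that the
    sequence length matches the number of speech frames.

    phoneme_embeddings: [num_phonemes, dim]
    durations:          [num_phonemes] integer frame counts
    returns:            [sum(durations), dim] frame-level embeddings
    """
    return torch.repeat_interleave(phoneme_embeddings, durations, dim=0)

# Example: phoneme "ai" predicted to last 10 frames, phoneme "b" 3 frames.
embeddings = torch.randn(2, 256)
durations = torch.tensor([10, 3])
print(length_regulate(embeddings, durations).shape)   # torch.Size([13, 256])
```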
• since singing itself follows a certain music score, and the score also stipulates the pronunciation duration and pitch of each word, when editing the singing voice there is no need to predict the duration and pitch information for the area that does not need to be edited (the non-edited voice); the accurate real values can be obtained and used directly.
• an FFT Block here can be a Transformer block.
  • the predicted duration of each phoneme in the second text can be used to upsample each input in pitch feature prediction.
• the input for pitch feature prediction can include text embeddings; each text embedding before upsampling corresponds to one phoneme, and after upsampling the number of text embeddings corresponds to the number of frames of that phoneme.
  • the second speech feature of the non-edited speech may also be obtained based on the non-edited speech.
  • the second voice feature may carry at least one of the following information: some voice frames or all voice frames of the non-edited voice; voiceprint features of the non-edited voice; timbre features of the non-edited voice; The prosodic characteristics of the non-edited speech; and the rhythmic characteristics of the non-edited speech.
  • the speech features in the embodiments of the present application can be used to represent the characteristics of speech (such as timbre, rhythm, emotion or rhythm, etc.).
• the speech features can be expressed in many forms, such as speech frames, sequences, vectors, etc., which are not specifically limited here.
  • the speech features in the embodiments of the present application may specifically be parameters extracted from the above-mentioned expression forms through the aforementioned PLP, LPC, MFCC and other methods.
  • At least one speech frame is selected from the non-edited speech as the second speech feature.
• in this way, the subsequently generated first speech feature further incorporates the contextual second speech feature.
  • the text corresponding to at least one speech frame may be text adjacent to the second text in the first text.
  • the non-edited speech is encoded through a coding model to obtain a target sequence, and the target sequence is used as the second speech feature.
  • the coding model can be CNN, RNN, etc., and there is no specific limit here.
  • the second voice feature may also carry the voiceprint feature of the original voice.
  • the voiceprint features may be obtained directly, or the voiceprint features may be obtained by recognizing original speech, etc.
  • the subsequently generated first voice feature also carries the voiceprint feature of the original voice, thereby improving the similarity between the target edited voice and the original voice.
  • introducing voiceprint features can improve the subsequently predicted voice features to be more similar to the voiceprints of the speakers of the original speech.
  • the speech processing device can also obtain the speaker identification of the original speech, so that when there are multiple speakers, the speech corresponding to the corresponding speaker can be matched, and the similarity between the subsequent target edited speech and the original speech can be improved.
• the following description takes speech frames as the speech features (which can also be understood as obtaining the speech features based on speech frames) as an example.
  • at least one frame from the 1st to 4th frame and the 9th to 16th frame in the original speech is selected as the second speech feature.
  • the second speech feature is a Mel spectrum feature.
  • the second speech feature can be expressed in the form of a vector.
• the predicted duration of each phoneme in the second text can be used to upsample each input in pitch feature prediction.
  • the input for pitch feature prediction may include the second speech feature, each vector before upsampling corresponds to a phoneme, and the text embedding after upsampling includes a vector corresponding to the number of frames of the phoneme.
  • the second pitch feature of the second text can be predicted based on the first pitch feature of the non-edited voice and the information of the target text.
  • the first pitch (pitch) feature of the non-edited speech can be obtained through an existing pitch extraction algorithm, which is not limited by this application.
• the neural network can be used to predict the second pitch feature of the second text according to the first pitch feature of the non-edited voice, the information of the target text, and the second voice feature of the non-edited voice.
• when the target text is a text obtained by inserting the second text into the first text, or the target text is a text obtained by deleting a first part of the text from the first text and the second text is the text adjacent to the first part of the text, the first pitch feature of the non-edited voice and the information of the target text can be fused to obtain a first fusion result, and the first fusion result is input into the second neural network to obtain the second pitch feature of the second text.
• when the target text is obtained by replacing a second part of the text in the first text with the second text, the first pitch feature of the non-edited voice may be input into the third neural network to obtain an initial pitch feature, where the initial pitch feature includes the pitch of each frame in multiple frames; the information of the target text is input into the fourth neural network to obtain the pronunciation feature of the second text, which is used to indicate whether each of the multiple frames included in the initial pitch feature is voiced; and the initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
• the replacement operation here assumes that the number of words in the newly edited text is consistent with the number of words in the replaced text (if they are not consistent, the replacement operation is decomposed into two editing operations: deletion first and then insertion). Since the replacing text may differ greatly in pronunciation from the replaced text, in order to ensure the coherence of the singing before and after the replacement, the model shown in Figure 7d is used to predict the new pitch:
  • Frame-level voiced/unvoiced (U/UV) prediction can be introduced to help pitch prediction.
• the design of the V/UV Predictor and F0 Predictor modules can refer to the F0 predictor in FastSpeech 2.
• the input first pitch (pitch) feature may include the pitch feature of each frame in the multiple frames of the non-edited speech; correspondingly, the output second pitch feature may include the pitch feature of each frame in the multiple frames of the target edited speech (a sketch of fusing pitch and voiced/unvoiced predictions follows below).
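• The sketch below illustrates one way of fusing a frame-level pitch prediction with a frame-level voiced/unvoiced (V/UV) prediction by zeroing the pitch of unvoiced frames; the convolutional predictor shape is only loosely inspired by FastSpeech 2, and all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class PitchWithVUV(nn.Module):
    """Predict a frame-level pitch contour and a voiced/unvoiced flag, then
    fuse them by zeroing the pitch of frames predicted as unvoiced."""
    def __init__(self, dim=256):
        super().__init__()
        def predictor():
            return nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(dim, 1, kernel_size=3, padding=1))
        self.f0_predictor = predictor()
        self.vuv_predictor = predictor()

    def forward(self, frame_features):            # [batch, frames, dim]
        x = frame_features.transpose(1, 2)         # [batch, dim, frames]
        f0 = self.f0_predictor(x).squeeze(1)       # [batch, frames]
        vuv_logit = self.vuv_predictor(x).squeeze(1)
        voiced = (torch.sigmoid(vuv_logit) > 0.5).float()
        return f0 * voiced, vuv_logit              # fused pitch, U/UV logits

model = PitchWithVUV()
pitch, vuv = model(torch.randn(1, 40, 256))        # 40 frames of fused input
```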
  • Step 703 According to the second pitch feature and the second text, obtain the first speech feature corresponding to the second text through a neural network.
• the second pitch feature and the second text (for example, the text embedding of the second text) can be fused (for example, added), and the fusion result can be input into the neural network to obtain the first speech feature corresponding to the second text.
  • the first speech feature corresponding to the second text may be a Mel spectrum feature.
• the first speech feature may also be obtained based on the first pitch feature of the non-edited voice, the information of the target text, and the second voice feature of the non-edited voice; for the description of the second voice feature, reference may be made to the description in the above embodiment, which will not be repeated here.
  • the first speech feature corresponding to the second text can be obtained through a neural network based on the second speech feature and the second text.
  • the neural network may include an encoder and a decoder.
  • the second text is input into the encoder to obtain a first vector corresponding to the second text, and then the first vector is decoded by the decoder based on the second speech feature to obtain the first speech feature.
  • the second speech feature can be the same as or similar to the first speech feature in terms of rhythm, timbre, and/or signal-to-noise ratio.
• rhythm (prosody) can reflect the speaker's emotional state or speech form, and generally refers to intonation, pitch, emphasis, pauses, or rhythm characteristics.
  • an attention mechanism can be introduced between the encoder and the decoder to adjust the quantitative correspondence between the input and the output.
  • the target text where the second text is located can be introduced during the coding process of the encoder, so that the generated first vector of the second text refers to the target text, so that the second text described by the first vector is more accurate. That is, the first speech feature corresponding to the second text can be obtained through the neural network based on the second speech feature, the target text, and the mark information.
  • the target text and mark information may be input into an encoder to obtain a first vector corresponding to the second text, and then the first vector may be decoded by a decoder based on the second speech feature to obtain the first speech feature.
  • the marking information is used to mark the second text in the target text.
  • the decoder in the embodiment of the present application may be a unidirectional decoder or a bidirectional decoder, which are described respectively below.
  • the decoder is a one-way decoder.
• the decoder calculates, based on the second speech feature, the first vector or the second vector from the first direction of the target text to obtain speech frames as the first speech feature.
  • the first direction is a direction from one side of the target text to the other side of the target text.
  • the first direction can be understood as the forward or reverse order of the target text (for related descriptions, please refer to the description of the forward or reverse order in the embodiment shown in FIG. 5).
  • the second speech feature and the first vector are input into the decoder to obtain the first speech feature.
  • the second speech feature and the second vector are input into the decoder to obtain the first speech feature.
• the decoder can be a bidirectional decoder (it can also be understood that the decoder includes a first decoder and a second decoder).
• the above-mentioned second text is in the middle area of the target text, which can be understood to mean that the second text is not at either end of the target text.
  • the first speech feature output by the bidirectional decoder from the first direction is the speech feature corresponding to the second text
  • the fourth speech feature output by the bidirectional decoder from the second direction is the speech feature corresponding to the second text.
• two complete speech features corresponding to the second text can be obtained from the left and right sides (i.e., in forward and reverse order), and the first speech feature can be obtained based on these two speech features.
• the first decoder calculates the first vector or the second vector from the first direction of the target text based on the second speech feature to obtain a speech feature of the second text (hereinafter referred to as LR).
• the second decoder calculates the first vector or the second vector from the second direction of the target text based on the second speech feature to obtain a fourth speech feature of the second text (hereinafter referred to as RL), and the first speech feature is generated according to LR and RL.
  • the first direction is the direction from one side of the target text to the other side of the target text
  • the second direction is opposite to the first direction (or understood as the second direction is from the other side of the target text to one side of the target text). side direction).
  • the first direction may be the above-mentioned forward sequence
  • the second direction may be the above-mentioned reverse sequence.
• when the first decoder decodes the first frame of the first vector or the second vector in the first direction, the speech frame in the non-edited speech adjacent to one side of the second text (which may also be called the left side) can be used as a condition for decoding to obtain N frames of LR.
• similarly, the speech frame in the non-edited speech adjacent to the other side of the second text (also called the right side) can be used as a condition for decoding to obtain N frames of RL.
  • the structure of the bidirectional decoder can be referred to Figure 11.
• a frame whose difference between LR and RL is less than a threshold can be used as the transition frame (its position is m, m ≤ N), or the frame with the smallest difference between LR and RL can be used as the transition frame.
• the N frames of the first speech feature may include the first m frames in LR and the last N-m frames in RL, or the N frames of the first speech feature may include the first N-m frames in LR and the last m frames in RL.
  • the difference between LR and RL can be understood as the distance between vectors.
  • the first vector or the second vector in this step may also include a third vector used to identify the speaker. It can also be understood that the third vector is used to identify the voiceprint characteristics of the original speech.
• the first decoder obtains the LR frames corresponding to "Guangzhou", including LR1, LR2, LR3, and LR4.
• the second decoder obtains the RL frames corresponding to "Guangzhou", including RL1, RL2, RL3, and RL4.
• if the difference between LR2 and RL2 is the smallest, then LR1, LR2, RL3, RL4 or LR1, RL2, RL3, RL4 are used as the first speech feature (a splicing sketch follows below).
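• The following sketch shows one way of choosing the transition frame between the forward (LR) and reverse (RL) decoded features and splicing them, assuming both are frame sequences of equal length with the same feature dimension; the numbers mirror the four-frame "Guangzhou" example above.

```python
import torch

def splice_bidirectional(lr: torch.Tensor, rl: torch.Tensor) -> torch.Tensor:
    """lr, rl: [N, feat_dim] speech features decoded in forward and reverse
    order. Pick the position m where they differ least as the transition
    frame, then take the first m+1 frames from LR and the rest from RL."""
    distances = torch.linalg.norm(lr - rl, dim=-1)   # per-frame difference
    m = int(torch.argmin(distances))                 # transition position
    return torch.cat([lr[:m + 1], rl[m + 1:]], dim=0)

# With four frames, if LR2 and RL2 are the closest pair (m = 1 counting from
# zero), the result is [LR1, LR2, RL3, RL4].
lr = torch.randn(4, 80)
rl = torch.randn(4, 80)
print(splice_bidirectional(lr, rl).shape)            # torch.Size([4, 80])
```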
• alternatively, the speech feature output by the bidirectional decoder from the first direction is the speech feature corresponding to the third text in the second text, and the fourth speech feature output by the bidirectional decoder from the second direction is the speech feature corresponding to the fourth text in the second text.
  • the partial speech features corresponding to the second text can be obtained through the left and right sides (ie, forward and reverse order), and the complete first speech features can be obtained based on the two partial speech features. That is, one part of the phonetic features is taken from the forward direction, another part of the phonetic features is taken from the reverse direction, and one part of the phonetic features and another part of the phonetic features are spliced to obtain the overall phonetic features.
• the first decoder obtains the LR frames corresponding to the third text ("Guang"), including LR1 and LR2.
• the second decoder obtains the RL frames corresponding to the fourth text ("zhou"), including RL3 and RL4.
• the first speech feature is obtained by splicing LR1, LR2, RL3, and RL4.
  • Step 704 Generate a target editing voice corresponding to the second text according to the first voice feature.
  • the first speech feature can be converted into a target edited voice corresponding to the second text according to the vocoder.
• the vocoder can be a traditional vocoder (such as the Griffin-Lim algorithm) or a neural network vocoder (such as MelGAN or HiFi-GAN pre-trained using audio training data), etc., which is not limited here (a Griffin-Lim sketch follows below).
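• As one illustration of the traditional option, a predicted mel-spectrogram feature can be converted back to a waveform with Griffin-Lim phase estimation via librosa's mel inversion utility; the STFT parameters and the assumption of a power mel spectrogram are illustrative choices, and a neural vocoder would replace this step in practice.

```python
import numpy as np
import librosa

def mel_to_wav(mel: np.ndarray, sr=22050, n_fft=1024, hop_length=256):
    """mel: [n_mels, frames] power mel spectrogram (e.g. the predicted first
    speech feature). Returns a waveform via Griffin-Lim phase estimation."""
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length, n_iter=60)

# Usage with a dummy 80-band mel spectrogram of 100 frames.
mel = np.abs(np.random.randn(80, 100)).astype(np.float32)
wav = mel_to_wav(mel)
print(wav.shape)
```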
  • Step 705 Obtain the position of the second text in the target text. This step is optional.
• if what is obtained in step 701 is the original speech and the second text, the position of the second text in the target text is obtained.
  • the starting and ending positions of each phoneme in the original text in the original speech can be determined by aligning the original speech and the original text through the alignment technology in step 701. And determine the position of the second text in the target text based on the starting and ending positions of each phoneme.
  • Step 706 Splice the target edited voice and the non-edited voice based on the position to generate a target voice corresponding to the target text. This step is optional.
  • the position in the embodiment of this application is used to splice the non-edited speech and the target edited speech.
• the position can be the position of the second text in the target text, the position of the first text in the target text, the position of the non-edited speech in the original speech, or the position of the edited speech in the original speech.
  • the original speech and the original text can be aligned using the alignment technology in step 701 to determine the starting and ending positions of each phoneme in the original text in the original speech. And based on the position of the first text in the original text, the position of the non-edited speech or the edited speech in the original speech is determined. Then, the speech processing device splices the target edited speech and the non-edited speech based on the position to obtain the target speech. That is, the target speech corresponding to the second text is replaced with the editing area in the original speech to obtain the target speech.
  • the non-edited speech is equivalent to frames 1 to 4 and frames 9 to 16 in the original speech.
• the target editing voices are LR1, LR2, RL3, RL4 or LR1, RL2, RL3, RL4.
  • Splicing the target edited speech and the non-edited speech can be understood as replacing the 5th to 8th frames in the original speech with the four obtained frames, thereby obtaining the target speech. That is, the voice corresponding to "Guangzhou” is replaced with the voice corresponding to "Shenzhen" in the original voice, and then the target text is obtained: the target voice corresponding to "The weather in Guangzhou is very good today".
  • the target speech corresponding to "The weather in Guangzhou is very good today” is shown in Figure 12.
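• For illustration, the frame-level splicing described above can be written as in the following sketch; the frame counts follow the earlier example (16 original frames, with frames 5 to 8 forming the edited region), and the feature dimension is an assumption.

```python
import numpy as np

def splice_target_speech(original_frames: np.ndarray,
                         edited_frames: np.ndarray,
                         edit_start: int, edit_end: int) -> np.ndarray:
    """Replace original_frames[edit_start:edit_end] (the edited region) with
    the target edited speech frames; the non-edited frames are kept as-is."""
    return np.concatenate([original_frames[:edit_start],
                           edited_frames,
                           original_frames[edit_end:]], axis=0)

# Frames 5-8 (indices 4..7) correspond to "Shenzhen" and are replaced by the
# four newly generated "Guangzhou" frames.
original = np.random.randn(16, 80)
guangzhou_frames = np.random.randn(4, 80)
target = splice_target_speech(original, guangzhou_frames, 4, 8)
print(target.shape)   # (16, 80)
```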
  • the voice processing device plays the target editing voice or the target voice.
  • the speech processing method provided by the embodiment of the present application includes steps 701 to 704. In another possible implementation manner, the speech processing method provided by the embodiment of the present application includes steps 701 to 705. In another possible implementation manner, the speech processing method provided by the embodiment of the present application includes steps 701 to 706.
  • the various steps shown in Figure 7a in the embodiment of the present application do not limit the timing relationship. For example, step 705 in the above method can also be performed after step 704, or before step 701, or can be executed together with step 701.
  • An embodiment of the present application provides a speech processing method.
  • the method includes: obtaining original speech and a second text.
• the second text is the text in the target text other than the first text; both the target text and the original text corresponding to the original speech include the first text, and the speech corresponding to the first text in the original speech is the non-edited speech.
• the second pitch feature of the second text is predicted according to the first pitch feature of the non-edited speech and the information of the target text; the first speech feature corresponding to the second text is obtained through a neural network according to the second pitch feature and the second text; and the target editing voice corresponding to the second text is generated based on the first speech feature.
• this application predicts the pitch feature of the second text (the text to be edited), generates the first speech feature of the second text based on the pitch feature, and generates the target editing voice corresponding to the second text based on the first speech feature, so that before and after the singing voice is edited, the pitch characteristics of the target edited voice remain similar to those of the original voice, and the listening experience of the target edited voice is therefore similar to that of the original voice.
  • Editing request Q1 Its target voice is W1 (the corresponding text T1 of the voice content is "How can love not ask whether it is right or wrong"),
  • Editing request Q2 Its target voice is W2 (the corresponding text T2 of the voice content is "Love does not ask whether it is right or wrong"),
• Editing request Q3 Its target voice is W3 (the corresponding text T3 of the voice content is "How can love not ask whether it is right or wrong");
  • Step S1 Receive the user’s “voice editing” request
  • the request at least includes the original voice to be edited W, the original lyric text S, the target text T (T1 or T2 or T3) and other data.
• the pre-operation includes: comparing the original text S and the target text to determine the editing type of the current editing request, that is, Q1, Q2 and Q3 can be determined to be insertion, deletion and replacement operations respectively; extracting the audio features and pitch features of each frame from W; extracting the Singer embedding from W through the voiceprint model; and converting S and the target text T* into the representation form of phonemes.
• for example, for T2 the phoneme sequence is [ai4 b u2 w en4 d ui4 c uo4]; according to W and S, the duration (i.e., the number of frames) corresponding to each phoneme in S is extracted; the Mask area is determined according to the operation type, e.g., Q1 is an insertion operation (inserting the word "how"), so its target Mask phoneme is the phoneme corresponding to "how" in the final target text phoneme sequence.
  • Step S2 The target text phonemes obtained in S1 are used by the text encoding module to generate text features, that is, Phoneme-level Text Embedding;
  • Step S3 Predict the duration information of each phoneme in the target text through the duration regularization module; this step can be completed through the following sub-steps:
• for non-Mask phonemes, the reference duration is the real duration extracted in step S1, and the corresponding position in the Mask vector is set to 1;
• for Mask phonemes, the reference duration is set to 0, and the corresponding position in the Mask vector is set to 0;
  • the Embedding of each phoneme is upsampled (that is, if the duration of phoneme A is 10, then the Embedding of A is copied 10 times), thereby generating Frame-level Text Embedding;
  • Step S4 Predict the pitch value of each frame through the pitch prediction module. This step can be completed through the following sub-steps:
• for non-Mask phonemes, the reference pitch is the real pitch extracted in S1, and its corresponding position in the Mask vector is marked as 1; for Mask phonemes, the pitch on the corresponding frames is set to 0 and the Mask is set to 0; the Frame-level pitch corresponding to the Mask phonemes is then predicted.
  • the model shown in Figure 2-4 is used to predict the Frame-level Pitch of the Mask phoneme
  • Step S5 Add Frame-Level text Embedding and Pitch together and input them into the audio feature decoding module to predict the audio feature frame corresponding to the new Mask phoneme.
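• The sketch below illustrates how the reference values and Mask vectors described in steps S3 and S4 can be assembled at either phoneme level (duration) or frame level (pitch); the zero-filling convention follows the description above, and all names and numbers are illustrative.

```python
import torch

def build_reference_and_mask(values, is_masked):
    """values: ground-truth durations (phoneme level) or pitch (frame level).
    is_masked: True where the phoneme/frame belongs to the edited Mask area.
    Non-Mask positions keep their real value and get mask = 1; Mask positions
    are set to 0 and get mask = 0, and are what the model must predict."""
    values = torch.as_tensor(values, dtype=torch.float32)
    keep = ~torch.as_tensor(is_masked, dtype=torch.bool)
    reference = torch.where(keep, values, torch.zeros_like(values))
    return reference, keep.float()

# Phoneme-level example for Q1 (inserting "how"): the inserted phoneme has no
# real duration, so its reference is 0 and its mask entry is 0.
durations = [12, 7, 0, 9, 11]           # 0 as placeholder for the new phoneme
masked    = [False, False, True, False, False]
ref_dur, dur_mask = build_reference_and_mask(durations, masked)
print(ref_dur.tolist(), dur_mask.tolist())
```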
• when an editing request involves multiple editing operations, the edits can be performed one by one using the process described above, in a left-to-right processing order.
  • a replacement operation can also be implemented by two operations: "delete first and then insert”.
  • the voice processing method implemented by the terminal device or the cloud device alone is described above, and the voice processing method implemented by the terminal device and the cloud device jointly is described below.
  • Embodiment 2 The terminal device and the cloud device jointly execute the voice processing method.
  • Figure 13 is an example of a voice processing method provided by the embodiment of the present application.
  • This method can be executed jointly by the terminal device and the cloud device, or can be performed by components of the terminal device (such as a processor, a chip, or a chip system, etc.) and The components of the cloud device (such as a processor, a chip, or a chip system, etc.) execute.
  • This embodiment includes steps 1301 to 1306.
  • Step 1301 The terminal device obtains the original voice and the second text.
  • Step 1301 performed by the terminal device in this embodiment is similar to step 701 performed by the voice processing device in the embodiment shown in Figure 7a, and will not be described again here.
  • Step 1302 The terminal device sends the original voice and the second text to the cloud device.
• after the terminal device obtains the original voice and the second text, it can send the original voice and the second text to the cloud device.
• if in step 1301 the terminal device obtains the original voice and the target text, the terminal device sends the original voice and the target text to the cloud device.
  • Step 1303 The cloud device obtains the non-edited voice based on the original voice and the second text.
  • Step 1303 performed by the cloud device in this embodiment is similar to the description of determining non-edited voice in step 701 performed by the speech processing device in the embodiment shown in Figure 7a, and will not be described again here.
  • Step 1304 The cloud device obtains the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text.
• Step 1304 performed by the cloud device in this embodiment is similar to step 702 performed by the speech processing device in the embodiment shown in Figure 7a, and will not be described again here.
  • Step 1305 The cloud device obtains the first speech feature corresponding to the second text through a neural network based on the second pitch feature and the second text.
  • Step 1306 The cloud device generates a target editing voice corresponding to the second text based on the first voice feature.
  • Steps 1304 to 1306 performed by the cloud device in this embodiment are similar to steps 702 to 704 performed by the voice processing device in the embodiment shown in Figure 7a, and will not be described again here.
  • Step 1307 The cloud device sends the target editing voice to the terminal device. This step is optional.
• after the cloud device obtains the target editing voice, it can send the target editing voice to the terminal device.
  • Step 1308 The terminal device or cloud device obtains the position of the second text in the target text. This step is optional.
• Step 1309 The terminal device or the cloud device splices the target edited voice and the non-edited voice based on the position to generate a target voice corresponding to the target text. This step is optional.
  • Steps 1308 and 1309 in this embodiment are similar to steps 705 to 706 performed by the speech processing device in the embodiment shown in FIG. 7a, and will not be described again here. Steps 1308 and 1309 in this embodiment can be executed by a terminal device or a cloud device.
  • Step 1310 The cloud device sends the target voice to the terminal device. This step is optional.
• if steps 1308 and 1309 are executed by the cloud device, then after acquiring the target voice, the cloud device sends the target voice to the terminal device. If steps 1308 and 1309 are executed by the terminal device, this step may not be executed.
  • the terminal device plays the target editing voice or target voice.
  • the voice processing method provided by the embodiment of the present application may include: the cloud device generates the target editing voice and sends the target editing voice to the terminal device, that is, the method includes steps 1301 to 1307.
  • the speech processing method provided by the embodiment of the present application may include: the cloud device generates the target edited voice, generates the target voice based on the target edited voice and the non-edited voice, and sends the target voice to the terminal device. That is, the method includes steps 1301 to 1306, and steps 1308 to 1310.
  • the voice processing method provided by the embodiment of the present application may include: the cloud device generates the target editing voice, and sends the target editing voice to the terminal device. The terminal device generates the target voice based on the target edited voice and the non-edited voice. That is, the method includes steps 1301 to 1309.
  • the cloud device performs complex calculations to obtain the target edited voice or the target voice and returns it to the terminal device, which can reduce the computing power and storage space of the terminal device.
  • a target edited voice corresponding to the modified text can be generated based on the voice characteristics of the non-edited area in the original voice, and then a target voice corresponding to the target text can be generated from the non-edited voice.
  • the user can modify the text in the original text to obtain the target editing voice corresponding to the modified text (ie, the second text). Improve users' editing experience based on text-based voice editing.
• the non-edited speech is not modified when the target speech is generated, and the pitch characteristics of the target edited speech are similar to those of the non-edited speech, making it difficult for users to hear differences in speech characteristics between the original speech and the target speech when listening to both.
  • An embodiment of the speech processing device in the embodiment of the present application includes:
  • Obtaining module 1401 is used to obtain the original speech and the second text.
  • the second text is the text in the target text except the first text.
• the target text and the original text corresponding to the original speech both include the first text, and the voice corresponding to the first text in the original voice is the non-edited voice;
• For a specific description of the acquisition module 1401, reference may be made to the description of step 701 in the above embodiment, which will not be described again here.
• Pitch prediction module 1402, configured to predict the second pitch feature of the second text according to the first pitch (pitch) feature of the non-edited speech and the information of the target text;
• For a detailed description of the pitch prediction module 1402, reference may be made to the description of step 702 in the above embodiment, which will not be described again here.
  • Generating module 1403 configured to obtain the first speech feature corresponding to the second text through a neural network according to the second pitch feature and the second text;
  • a target editing voice corresponding to the second text is generated.
  • the content of the original voice is the user's singing voice.
  • the first pitch (pitch) feature of the non-edited voice and the second text include:
  • the second voice feature carries at least one of the following information:
  • the information of the target text includes: text embedding of each phoneme in the target text.
• the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by deleting a first part of the text from the first text, where the second text is the text adjacent to the first part of the text;
  • the pitch prediction module is specifically used for:
  • the first fusion result is input into the second neural network to obtain the second pitch feature of the second text.
  • the target text is obtained by replacing the second part of the text in the first text with the second text;
  • the pitch prediction module is specifically used for:
  • the initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
  • the device further includes:
• a duration prediction module, configured to predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text.
  • the first pitch (pitch) feature includes: the pitch feature of each frame in the multiple frames of the non-edited speech;
  • the second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited voice.
  • the duration prediction module is specifically used to:
• predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech.
  • the acquisition module is also used to:
  • the generating module is further configured to splice the target edited voice and the non-edited voice based on the position to obtain a target voice corresponding to the target text.
  • the voice processing device can be any terminal device including a mobile phone, tablet computer, personal digital assistant (PDA), point of sales (POS), vehicle-mounted computer, etc. Taking the voice processing device as a mobile phone as an example:
  • FIG. 15 shows a block diagram of a partial structure of a mobile phone related to the voice processing device provided by an embodiment of the present application.
  • the mobile phone includes: radio frequency (RF) circuit 1510, memory 1520, input unit 1530, display unit 1540, sensor 1550, audio circuit 1560, wireless fidelity (WiFi) module 1570, processor 1580 , and power supply 1590 and other components.
  • the RF circuit 1510 can be used to receive and transmit information or signals during a call. In particular, after receiving downlink information from the base station, it is processed by the processor 1580; in addition, the designed uplink data is sent to the base station.
  • the RF circuit 1510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, etc. Additionally, RF circuitry 1510 can communicate with networks and other devices through wireless communications.
  • the above wireless communication can use any communication standard or protocol, including but not limited to global system of mobile communication (GSM), general packet radio service (GPRS), code division multiple access (code division) multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short messaging service (SMS), etc.
  • the memory 1520 can be used to store software programs and modules.
  • the processor 1580 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1520 .
• the memory 1520 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like; the storage data area may store data created based on the use of the mobile phone (such as audio data, a phone book, etc.), and the like.
  • memory 1520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
  • the input unit 1530 may be used to receive input numeric or character information, and generate key signal input related to user settings and function control of the mobile phone.
  • the input unit 1530 may include a touch panel 1531 and other input devices 1532.
  • the touch panel 1531 also known as a touch screen, can collect the user's touch operations on or near the touch panel 1531 (for example, the user uses a finger, stylus, or any suitable object or accessory on or near the touch panel 1531 operation), and drive the corresponding connection device according to the preset program.
  • the touch panel 1531 may include two parts: a touch detection device and a touch controller.
• the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact point coordinates, and then sends them to the processor 1580, and can receive commands sent by the processor 1580 and execute them.
  • the touch panel 1531 can be implemented using various types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 1530 may also include other input devices 1532.
  • other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), trackball, mouse, joystick, etc.
  • the display unit 1540 may be used to display information input by the user or information provided to the user as well as various menus of the mobile phone.
  • the display unit 1540 may include a display panel 1541.
  • the display panel 1541 may be configured in the form of a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), etc.
• the touch panel 1531 can cover the display panel 1541. When the touch panel 1531 detects a touch operation on or near it, it transmits the operation to the processor 1580 to determine the type of the touch event, and the processor 1580 then provides corresponding visual output on the display panel 1541 according to the type of the touch event.
• although the touch panel 1531 and the display panel 1541 are used as two independent components to implement the input and output functions of the mobile phone, in some embodiments the touch panel 1531 and the display panel 1541 can be integrated to realize the input and output functions of the mobile phone.
  • the phone may also include at least one sensor 1550, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor.
  • the ambient light sensor may adjust the brightness of the display panel 1541 according to the brightness of the ambient light.
• the proximity sensor may turn off the display panel 1541 and/or the backlight when the mobile phone is moved to the ear.
  • the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes). It can detect the magnitude and direction of gravity when stationary.
  • the audio circuit 1560, speaker 1561, and microphone 1562 can provide an audio interface between the user and the mobile phone.
• the audio circuit 1560 can transmit the electrical signal converted from the received audio data to the speaker 1561, and the speaker 1561 converts it into a sound signal for output; on the other hand, the microphone 1562 converts the collected sound signal into an electrical signal, which the audio circuit 1560 receives and converts into audio data; the audio data is then output to the processor 1580 for processing and sent, for example, to another mobile phone through the RF circuit 1510, or output to the memory 1520 for further processing.
  • WiFi is a short-distance wireless transmission technology.
• through the WiFi module 1570, the mobile phone can help users send and receive emails, browse web pages, access streaming media, etc., providing users with wireless broadband Internet access.
• although FIG. 15 shows the WiFi module 1570, it can be understood that it is not a necessary component of the mobile phone.
  • the processor 1580 is the control center of the mobile phone, using various interfaces and lines to connect various parts of the entire mobile phone, by running or executing software programs and/or modules stored in the memory 1520, and calling data stored in the memory 1520 to execute Various functions of the mobile phone and processing data, thereby conducting overall monitoring of the mobile phone.
  • the processor 1580 may include one or more processing units; preferably, the processor 1580 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, etc., and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may not be integrated into the processor 1580.
  • the mobile phone also includes a power supply 1590 (such as a battery) that supplies power to various components.
  • the power supply can be logically connected to the processor 1580 through a power management system, so that functions such as charging, discharging, and power consumption management can be implemented through the power management system.
  • the mobile phone may also include a camera, a Bluetooth module, etc., which will not be described in detail here.
  • the processor 1580 included in the terminal device can perform the functions of the voice processing device in the embodiment shown in Figure 7a, or the functions of the terminal device in the embodiment shown in Figure 13, which will not be described again here.
  • the voice processing device may be a cloud device.
  • the cloud device may include a processor 1601, a memory 1602, and a communication interface 1603.
  • the processor 1601, memory 1602 and communication interface 1603 are interconnected through lines.
  • the memory 1602 stores program instructions and data.
  • the memory 1602 stores the program instructions and data corresponding to the steps executed by the speech processing device in the aforementioned embodiment corresponding to FIG. 7a, or the program instructions and data corresponding to the steps executed by the cloud device in the aforementioned embodiment corresponding to FIG. 13.
  • the processor 1601 is configured to perform the steps performed by the speech processing device in any of the embodiments shown in FIG. 7a, or to perform the steps performed by the cloud device in any of the embodiments shown in FIG. 13.
  • the communication interface 1603 can be used to receive and send data, and to perform steps related to obtaining, sending, and receiving in any of the embodiments shown in FIG. 7a or FIG. 13 .
  • the cloud device may include more or fewer components than in Figure 16 , which is only an illustrative description in this application and is not limiting.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated unit can be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when the integrated unit is implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center through wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (SSD)), etc.

Abstract

A speech processing method and a related device, which are applied to the field of song editing. The method comprises: acquiring original speech and second text; predicting a second pitch feature of the second text according to a first pitch feature of non-edited speech in the original speech and information of the target text; obtaining, according to the second pitch feature and the second text and by means of a neural network, a first speech feature corresponding to the second text; and generating, according to the first speech feature, target edited speech corresponding to the second text. By means of the present application, a pitch feature of the second text is predicted, a first speech feature of the second text is generated according to the pitch feature, and target edited speech corresponding to the second text is generated on the basis of the first speech feature, such that the pitch features of the speech before and after song editing are similar to each other, and thus the acoustic experience of the target edited speech is similar to that of the original speech.

Description

一种语音处理方法及相关设备Speech processing method and related equipment
This application claims priority to the Chinese patent application filed with the China Patent Office on April 29, 2022, with application number 202210468926.8 and the invention title "A speech processing method and related equipment", the entire content of which is incorporated into this application by reference.
技术领域Technical field
本申请实施例涉及人工智能领域领域,尤其涉及一种语音处理方法及相关设备。The embodiments of the present application relate to the field of artificial intelligence, and in particular, to a speech processing method and related equipment.
背景技术Background technique
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that can respond in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
目前,语音编辑具有非常重要的实用意义。比如,在用户录制歌曲(例如清唱)等场景下,经常会由于口误而导致语音中的某些内容出错。该种情况下,语音编辑便可帮助用户快速地修正原始歌声中的错误内容,生成校正后的语音。常用的语音编辑方法是通过预先构建含有大量语音片段的数据库,从数据库中获取发音单元的片段,并用该片段替换原始语音中的错误片段,进而生成校正后的语音。Currently, voice editing has very important practical significance. For example, in scenarios where users record songs (such as singing a cappella), some content in the voice is often wrong due to slips of the tongue. In this case, voice editing can help users quickly correct the erroneous content in the original singing voice and generate corrected voice. A commonly used speech editing method is to pre-build a database containing a large number of speech segments, obtain segments of pronunciation units from the database, and use the segments to replace erroneous segments in the original speech to generate corrected speech.
然而,上述语音编辑的方式依赖数据库中语音片段的多样性,在数据库中语音片段较少的情况下,会导致校正后的语音(例如用户的歌声)的听感较差。However, the above-mentioned voice editing method relies on the diversity of voice segments in the database. When there are few voice segments in the database, the corrected voice (such as the user's singing voice) will have a poor hearing quality.
发明内容Contents of the invention
本申请实施例提供了一种语音处理方法及相关设备,可以实现编辑歌声的听感与原始语音的听感类似,提升用户体验。Embodiments of the present application provide a voice processing method and related equipment, which can achieve a listening experience of edited singing that is similar to that of original speech, thereby improving user experience.
In a first aspect, this application provides a speech processing method, which can be applied to scenarios such as a user recording a short video or a teacher recording a lecture. The method may be executed by a speech processing device, or by a component of the speech processing device (such as a processor, a chip, or a chip system). The speech processing device may be a terminal device or a cloud device. The method includes: obtaining original speech and second text, where the second text is text in target text other than first text, the target text and the original text corresponding to the original speech both include the first text, and the speech corresponding to the first text in the original speech is non-edited speech; predicting a second pitch feature of the second text according to a first pitch feature of the non-edited speech and information of the target text; obtaining, through a neural network, a first speech feature corresponding to the second text according to the second pitch feature and the second text; and generating target edited speech corresponding to the second text according to the first speech feature. By predicting the pitch feature of the second text (the text to be edited), generating the first speech feature of the second text according to the pitch feature, and generating the target edited speech corresponding to the second text based on the first speech feature, this application makes the pitch features of the speech before and after singing-voice editing similar, so that the target edited speech sounds similar to the original speech.
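To make the described flow concrete, the following is a minimal sketch of how the steps of the first aspect could be wired together, assuming hypothetical placeholder models (pitch_predictor, acoustic_model, vocoder) that stand in for the trained networks; none of these names come from the patent itself.

```python
import numpy as np

# Minimal sketch of the editing flow described above. All model objects
# (pitch_predictor, acoustic_model, vocoder) are hypothetical placeholders
# standing in for trained neural networks.

def edit_speech(non_edited_pitch, target_text_info, second_text,
                pitch_predictor, acoustic_model, vocoder):
    # 1. Predict the pitch (F0) contour of the text to be edited from the
    #    pitch of the non-edited speech and the target-text information.
    predicted_pitch = pitch_predictor(non_edited_pitch, target_text_info)

    # 2. Generate the speech (acoustic) features of the second text,
    #    conditioned on the predicted pitch.
    speech_features = acoustic_model(second_text, predicted_pitch)

    # 3. Convert the speech features into a waveform with a vocoder.
    target_edited_speech = vocoder(speech_features)
    return target_edited_speech

if __name__ == "__main__":
    dummy = lambda *args: np.zeros(10)   # stand-in callables for the demo
    wav = edit_speech(np.zeros(50), np.zeros((5, 8)), "text",
                      dummy, dummy, dummy)
    print(wav.shape)
```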
In addition, there are multiple ways to obtain the second text: the second text may be obtained directly; or position information (which can also be understood as marking information indicating the position of the second text in the target text) may be obtained first, and the second text is then obtained according to the position information and the target text, where the position information represents the position of the second text in the target text; or the target text and the original text may be obtained (or the target text and the original speech are obtained and the original speech is recognized to obtain the original text), and the second text is then determined based on the original text and the target text.
在一种可能的实现中,基于第二语音特征生成第二文本对应的目标编辑语音,包括:基于第二语音特征,通过声码器,生成目标编辑语音。In one possible implementation, generating a target editing voice corresponding to the second text based on the second voice feature includes: generating the target editing voice through a vocoder based on the second voice feature.
该种可能的实现方式中,根据声码器将第二语音特征转化为目标编辑语音,进而使得目标编辑语音具有与原始语音相近的语音特征,提升用户的听感。In this possible implementation, the second voice features are converted into the target edited voice according to the vocoder, so that the target edited voice has voice features similar to the original voice, thereby improving the user's listening experience.
在一种可能的实现中,所述原始语音的内容为用户的歌声,例如可以为用户清唱时录制的语音。In one possible implementation, the content of the original voice is the user's singing voice, which may be, for example, the voice recorded when the user sings a cappella.
In one possible implementation, obtaining the original speech and the second text includes: receiving the original speech and the second text sent by a terminal device; the method further includes: sending the target edited speech to the terminal device, where the target edited speech is used by the terminal device to generate target speech corresponding to the target text. This can also be understood as an interaction scenario: the cloud device performs the complex computation and the terminal device performs a simple splicing operation; the original speech and the second text are obtained from the terminal device, and after the cloud device generates the target edited speech, it sends the target edited speech to the terminal device, which then performs the splicing to obtain the target speech.
In this possible implementation, when the speech processing device is a cloud device, on the one hand, through the interaction between the cloud device and the terminal device, the cloud device can perform the complex computation to obtain the target edited speech and return it to the terminal device, which can reduce the computing power and storage space required of the terminal device. On the other hand, the target edited speech corresponding to the modified text can be generated according to the speech features of the non-edited region in the original speech, and the target speech corresponding to the target text can then be generated together with the non-edited speech.
可选地,在第一方面的一种可能的实现方式中,上述步骤:获取原始语音与第二文本,包括:接收终端设备发送的原始语音与目标文本;方法还包括:基于非编辑语音与目标编辑语音生成目标文本对应的目标语音,向终端设备发送目标语音。Optionally, in a possible implementation of the first aspect, the above steps: obtaining the original voice and the second text include: receiving the original voice and the target text sent by the terminal device; the method further includes: based on the non-edited voice and the second text The target editing voice generates a target voice corresponding to the target text, and sends the target voice to the terminal device.
In this possible implementation, the original speech and the target text sent by the terminal device are received; the non-edited speech can be obtained, the second speech feature corresponding to the second text is generated according to the first speech feature of the non-edited speech, the target edited speech is then obtained through the vocoder, and the target edited speech and the non-edited speech are spliced to generate the target speech. In other words, the processing is performed on the speech processing device and the result is returned to the terminal device. Having the cloud device perform the complex computation to obtain the target speech and return it to the terminal device can reduce the computing power and storage space required of the terminal device.
In a possible implementation, the prediction according to the first pitch feature of the non-edited speech and the second text includes: prediction according to the first pitch feature of the non-edited speech, the information of the target text, and a second speech feature of the non-edited speech, where the second speech feature carries at least one of the following information: some or all speech frames of the non-edited speech; a voiceprint feature of the non-edited speech; a timbre feature of the non-edited speech; a prosody feature of the non-edited speech; and a rhythm feature of the non-edited speech.
在一种可能的实现中,第二语音特征携带有原始语音的声纹特征。其中,获取声纹特征的方式可以是直接获取,也可以是通过识别原始语音得到该声纹特征等。In one possible implementation, the second voice feature carries the voiceprint feature of the original voice. Among them, the voiceprint features may be obtained directly, or the voiceprint features may be obtained by recognizing original speech, etc.
该种可能的实现方式中,一方面,通过引入原始语音的声纹特征,使得后续生成的第一语音特征也携带有该原始语音的声纹特征,进而提升目标编辑语音与原始语音的相近程度。另一方面,在发音者(或者用户)的数量为多个的情况下,引入声纹特征可以提升后续预测的语音特征更加与原始语音的发音者的声纹相似。In this possible implementation, on the one hand, by introducing the voiceprint feature of the original voice, the subsequently generated first voice feature also carries the voiceprint feature of the original voice, thereby improving the similarity between the target edited voice and the original voice. . On the other hand, when there are multiple speakers (or users), introducing voiceprint features can improve the subsequently predicted voice features to be more similar to the voiceprints of the speakers of the original speech.
在一种可能的实现中,所述目标文本的信息,包括:In a possible implementation, the target text information includes:
所述目标文本中各个音素的文本嵌入(text embedding)。Text embedding of each phoneme in the target text.
In a possible implementation, the target text is text obtained by inserting the second text into the first text; or, the target text is text obtained by deleting a first part of the first text, and the second text is text adjacent to the first part of the text;
所述根据所述非编辑语音的第一音高(pitch)特征以及所述目标文本的信息,预测所述第二文本的第二音高特征,包括:Predicting the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text includes:
将所述非编辑语音的第一音高(pitch)特征以及所述目标文本的信息进行融合,以得到第一融合结果;Fusion of the first pitch feature of the non-edited speech and the information of the target text to obtain a first fusion result;
将所述第一融合结果输入到第二神经网络,得到所述第二文本的第二音高特征。The first fusion result is input into the second neural network to obtain the second pitch feature of the second text.
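As an illustration of the fusion-then-predict arrangement just described (insertion/deletion case), the sketch below concatenates the frame-level pitch of the non-edited speech with the target-text information and feeds the fusion result to a small recurrent network standing in for the "second neural network". All layer types and sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a possible fusion-then-predict arrangement for the
# insertion/deletion case. Layer sizes and the GRU choice are assumptions.

class FusionPitchPredictor(nn.Module):
    def __init__(self, pitch_dim=1, text_dim=256, hidden=128):
        super().__init__()
        self.proj = nn.Linear(pitch_dim + text_dim, hidden)  # fuse by concatenation + projection
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)  # stands in for the "second neural network"
        self.out = nn.Linear(hidden, 1)                      # per-frame pitch

    def forward(self, pitch, text_info):
        # pitch: (B, T, 1) frame-level F0 of the non-edited speech
        # text_info: (B, T, text_dim) target-text information aligned to frames
        fused = torch.cat([pitch, text_info], dim=-1)        # the "first fusion result"
        h, _ = self.rnn(torch.relu(self.proj(fused)))
        return self.out(h)                                   # predicted second pitch feature

f0 = FusionPitchPredictor()(torch.zeros(2, 40, 1), torch.zeros(2, 40, 256))
print(f0.shape)  # torch.Size([2, 40, 1])
```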
在一种可能的实现中,所述目标文本为将所述第一文本中的第二部分文本替换为所述第二文本得到的;In a possible implementation, the target text is obtained by replacing the second part of the text in the first text with the second text;
所述根据所述非编辑语音的第一音高(pitch)特征以及所述目标文本的信息,预测所述第二文本的第二音高特征,包括:Predicting the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text includes:
将所述非编辑语音的第一音高(pitch)特征输入到第三神经网络,得到初始音高特征,所述第一初始音高特征包括多个帧中每个帧的音高;Input the first pitch (pitch) feature of the non-edited speech into a third neural network to obtain an initial pitch feature, where the first initial pitch feature includes the pitch of each frame in a plurality of frames;
将所述目标文本的信息输入到第四神经网络,得到所述第二文本的发音特征,所述发音特征用于指示所述初始音高特征包括的多个帧中各个帧是否发音;Input the information of the target text into a fourth neural network to obtain the pronunciation feature of the second text, where the pronunciation feature is used to indicate whether each of the multiple frames included in the initial pitch feature is pronunciated;
将所述初始音高特征和所述发音特征进行融合,以得到所述第二文本的第二音高特征。The initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
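For the replacement case, a possible reading of the three steps above is sketched below: one assumed network produces an initial per-frame pitch contour from the non-edited speech, another predicts from the target-text information whether each frame is voiced, and the two outputs are fused by masking. The architecture is illustrative only.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the replacement case under assumed shapes: one network
# predicts an initial per-frame pitch contour, another predicts a per-frame
# voiced/unvoiced probability from the target-text information, and the two are
# fused (here by masking) into the second pitch feature.

class ReplacementPitchPredictor(nn.Module):
    def __init__(self, text_dim=256, hidden=128):
        super().__init__()
        self.pitch_net = nn.GRU(1, hidden, batch_first=True)      # stands in for the "third neural network"
        self.pitch_out = nn.Linear(hidden, 1)
        self.voicing_net = nn.Sequential(                         # stands in for the "fourth neural network"
            nn.Linear(text_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, non_edited_pitch, text_info):
        h, _ = self.pitch_net(non_edited_pitch)                   # (B, T, hidden)
        initial_pitch = self.pitch_out(h)                         # pitch of every frame
        voicing = torch.sigmoid(self.voicing_net(text_info))      # probability that each frame is voiced
        return initial_pitch * (voicing > 0.5).float()            # fused second pitch feature

pred = ReplacementPitchPredictor()(torch.zeros(2, 40, 1), torch.zeros(2, 40, 256))
print(pred.shape)  # torch.Size([2, 40, 1])
```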
在一种可能的实现中,所述方法还包括:In a possible implementation, the method further includes:
根据所述非编辑语音中各个音素的帧数以及所述目标文本的信息,预测所述第二文本中各个音素的帧数。 According to the frame number of each phoneme in the non-edited speech and the information of the target text, the frame number of each phoneme in the second text is predicted.
在一种可能的实现中,所述第一音高(pitch)特征,包括:所述非编辑语音的多帧中的每一帧的音高特征;In a possible implementation, the first pitch (pitch) feature includes: the pitch feature of each frame in the multiple frames of the non-edited speech;
所述第二音高特征,包括:所述目标编辑语音的多帧中的每一帧的音高特征。The second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited speech.
在一种可能的实现中,所述根据所述非编辑语音中各个音素的帧数以及所述目标文本的信息,包括:In a possible implementation, the information based on the number of frames of each phoneme in the non-edited speech and the target text includes:
根据所述非编辑语音中各个音素的帧数、所述目标文本的信息以及所述非编辑语音的第二语音特征。According to the frame number of each phoneme in the non-edited speech, the information of the target text and the second speech feature of the non-edited speech.
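A rough sketch of such a phoneme-duration predictor is given below; the inputs (per-phoneme text information plus the known frame counts of the non-edited speech) follow the description above, while the network structure itself is an assumption made for illustration.

```python
import torch
import torch.nn as nn

# Rough sketch of a phoneme-duration predictor of the kind described above.
# Dimensions and the bidirectional GRU with a regression head are assumptions.

class DurationPredictor(nn.Module):
    def __init__(self, text_dim=256, dur_dim=1, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(text_dim + dur_dim, hidden,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, text_info, known_durations):
        # text_info: (B, N, text_dim) per-phoneme target-text information
        # known_durations: (B, N, 1) frame counts of phonemes in the non-edited
        # speech, with a placeholder (e.g. zero) at the positions to predict
        h, _ = self.rnn(torch.cat([text_info, known_durations], dim=-1))
        return torch.relu(self.out(h))  # predicted frame count per phoneme

dur = DurationPredictor()(torch.zeros(2, 12, 256), torch.zeros(2, 12, 1))
print(dur.shape)  # torch.Size([2, 12, 1])
```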
在一种可能的实现中,上述步骤还包括:获取第二文本在目标文本中的位置;基于位置拼接目标编辑语音与非编辑语音得到目标文本对应的目标语音。也可以理解为是用目标编辑语音替换原始语音中的编辑语音,该编辑语音为原始语音中除了非编辑语音以外的语音。In a possible implementation, the above steps further include: obtaining the position of the second text in the target text; and splicing the target editing voice and the non-editing voice based on the position to obtain the target voice corresponding to the target text. It can also be understood as replacing the edited voice in the original voice with the target edited voice, and the edited voice is the voice in the original voice except the non-edited voice.
该种可能的实现方式中,可以根据第二文本在目标文本中的位置拼接目标编辑语音与非编辑语音。如果第一文本是原始文本与目标文本中的所有重叠文本,则可以在不改变原始语音中非编辑语音的情况下生成所需文本(即目标文本)的语音。In this possible implementation, the target editing voice and the non-editing voice can be spliced according to the position of the second text in the target text. If the first text is all overlapping text in the original text and the target text, the voice of the desired text (ie, the target text) can be generated without changing the non-edited voice in the original voice.
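The splicing step itself can be illustrated very simply: the samples of the edited region in the original speech are replaced by the newly generated target edited speech, while the non-edited speech is kept unchanged. The sample indices used below are hypothetical and would in practice come from the alignment between text and speech.

```python
import numpy as np

# Simple illustration of the splicing step: the waveform of the edited region
# in the original speech is replaced by the target edited speech.

def splice(original_wave, target_edited_wave, edit_start, edit_end):
    """Keep the non-edited speech and insert the target edited speech
    between edit_start and edit_end (sample indices)."""
    return np.concatenate([original_wave[:edit_start],
                           target_edited_wave,
                           original_wave[edit_end:]])

original = np.random.randn(16000)   # stand-in for 1 s of original speech at 16 kHz
edited = np.random.randn(4000)      # stand-in for the newly generated segment
target_speech = splice(original, edited, 6000, 9000)
print(target_speech.shape)          # (17000,)
```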
Optionally, in a possible implementation of the first aspect, the above steps further include: determining the non-edited speech based on the target text, the original text, and the original speech. Specifically, this may be: determining the first text based on the target text and the original text; and determining the non-edited speech based on the first text, the original text, and the original speech.
该种可能的实现方式中,通过对比原始文本与原始语音,确定第一文本在原始语音中的非编辑语音,便于后续第一语音特征的生成。In this possible implementation, the non-edited voice of the first text in the original voice is determined by comparing the original text and the original voice, so as to facilitate the subsequent generation of the first voice feature.
Optionally, in a possible implementation of the first aspect, the above step of determining the first text based on the target text and the original text includes: determining overlapping text based on the target text and the original text; displaying the overlapping text to the user; and in response to a second operation of the user, determining the first text from the overlapping text.
第二方面,本申请提供了一种语音处理装置,所述装置包括:In a second aspect, this application provides a voice processing device, which includes:
an acquisition module, configured to obtain original speech and second text, where the second text is text in the target text other than the first text, the target text and the original text corresponding to the original speech both include the first text, and the speech corresponding to the first text in the original speech is non-edited speech;
音高预测模块,用于根据所述非编辑语音的第一音高(pitch)特征以及所述目标文本的信息,预测所述第二文本的第二音高特征;A pitch prediction module, configured to predict the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text;
生成模块,用于根据所述第二音高特征以及所述第二文本,通过神经网络得到所述第二文本对应的第一语音特征;A generation module, configured to obtain the first speech feature corresponding to the second text through a neural network based on the second pitch feature and the second text;
根据所述第一语音特征,生成所述第二文本对应的目标编辑语音。According to the first voice characteristics, a target editing voice corresponding to the second text is generated.
在一种可能的实现中,所述原始语音的内容为用户的歌声。 In one possible implementation, the content of the original voice is the user's singing voice.
在一种可能的实现中,所述根据所述非编辑语音的第一音高(pitch)特征以及所述第二文本包括:In a possible implementation, the first pitch (pitch) feature of the non-edited voice and the second text include:
根据所述非编辑语音的第一音高(pitch)特征、所述目标文本的信息以及所述非编辑语音的第二语音特征;所述第二语音特征携带有如下信息的至少一种:According to the first pitch feature of the non-edited voice, the information of the target text and the second voice feature of the non-edited voice; the second voice feature carries at least one of the following information:
所述非编辑语音的部分语音帧或全部语音帧;Partial speech frames or all speech frames of the non-edited speech;
所述非编辑语音的声纹特征;The voiceprint characteristics of the non-edited speech;
所述非编辑语音的音色特征;The timbre characteristics of the non-edited voice;
所述非编辑语音的韵律特征;以及,The prosodic characteristics of the unedited speech; and,
所述非编辑语音的节奏特征。The rhythmic characteristics of the non-edited speech.
在一种可能的实现中,所述目标文本的信息,包括:所述目标文本中各个音素的文本嵌入(text embedding)。In a possible implementation, the information of the target text includes: text embedding of each phoneme in the target text.
In a possible implementation, the target text is text obtained by inserting the second text into the first text; or, the target text is text obtained by deleting a first part of the first text, and the second text is text adjacent to the first part of the text;
所述音高预测模块,具体用于:The pitch prediction module is specifically used for:
将所述非编辑语音的第一音高(pitch)特征以及所述目标文本的信息进行融合,以得到第一融合结果;Fusion of the first pitch feature of the non-edited speech and the information of the target text to obtain a first fusion result;
将所述第一融合结果输入到第二神经网络,得到所述第二文本的第二音高特征。The first fusion result is input into the second neural network to obtain the second pitch feature of the second text.
在一种可能的实现中,所述目标文本为将所述第一文本中的第二部分文本替换为所述第二文本得到的;In a possible implementation, the target text is obtained by replacing the second part of the text in the first text with the second text;
所述音高预测模块,具体用于:The pitch prediction module is specifically used for:
将所述非编辑语音的第一音高(pitch)特征输入到第三神经网络,得到初始音高特征,所述第一初始音高特征包括多个帧中每个帧的音高;Input the first pitch (pitch) feature of the non-edited speech into a third neural network to obtain an initial pitch feature, where the first initial pitch feature includes the pitch of each frame in a plurality of frames;
将所述目标文本的信息输入到第四神经网络,得到所述第二文本的发音特征,所述发音特征用于指示所述初始音高特征包括的多个帧中各个帧是否发音;Input the information of the target text into a fourth neural network to obtain the pronunciation feature of the second text, where the pronunciation feature is used to indicate whether each of the multiple frames included in the initial pitch feature is pronunciated;
将所述初始音高特征和所述发音特征进行融合,以得到所述第二文本的第二音高特征。The initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
在一种可能的实现中,所述装置还包括:In a possible implementation, the device further includes:
时长预测模块,用于根据所述非编辑语音中各个音素的帧数以及所述目标文本的信息,预测所述第二文本中各个音素的帧数。A duration prediction module, configured to predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text.
在一种可能的实现中,所述第一音高(pitch)特征,包括:所述非编辑语音的多帧中的每一帧的音高特征;In a possible implementation, the first pitch (pitch) feature includes: the pitch feature of each frame in the multiple frames of the non-edited speech;
所述第二音高特征,包括:所述目标编辑语音的多帧中的每一帧的音高特征。 The second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited voice.
在一种可能的实现中,所述时长预测模块,具体用于:In a possible implementation, the duration prediction module is specifically used to:
根据所述非编辑语音中各个音素的帧数、所述目标文本的信息以及所述非编辑语音的第二语音特征。According to the frame number of each phoneme in the non-edited speech, the information of the target text and the second speech feature of the non-edited speech.
在一种可能的实现中,所述获取模块还用于:In a possible implementation, the acquisition module is also used to:
获取所述第二文本在所述目标文本中的位置;Obtain the position of the second text in the target text;
所述生成模块,还用于基于所述位置拼接所述目标编辑语音与所述非编辑语音得到所述目标文本对应的目标语音。The generating module is further configured to splice the target edited voice and the non-edited voice based on the position to obtain a target voice corresponding to the target text.
本申请第三方面提供了一种语音处理设备,该语音处理设备执行前述第一方面或第一方面的任意可能的实现方式中的方法。A third aspect of the present application provides a voice processing device that performs the method in the foregoing first aspect or any possible implementation of the first aspect.
A fourth aspect of this application provides a speech processing device, including a processor, where the processor is coupled to a memory, and the memory is used to store programs or instructions; when the programs or instructions are executed by the processor, the speech processing device is enabled to implement the method in the foregoing first aspect or any possible implementation of the first aspect.
A fifth aspect of this application provides a computer-readable medium on which a computer program or instructions are stored; when the computer program or instructions are run on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
本申请第六方面提供了一种计算机程序产品,该计算机程序产品在计算机上执行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法。A sixth aspect of the present application provides a computer program product, which, when executed on a computer, causes the computer to execute the method in the foregoing first aspect or any possible implementation of the first aspect.
附图说明Description of the drawings
图1为本申请提供的一种系统架构的结构示意图;Figure 1 is a schematic structural diagram of a system architecture provided by this application;
图2为本申请提供的一种卷积神经网络结构示意图;Figure 2 is a schematic structural diagram of a convolutional neural network provided by this application;
图3为本申请提供的另一种卷积神经网络结构示意图;Figure 3 is a schematic structural diagram of another convolutional neural network provided by this application;
图4为本申请提供的一种芯片硬件结构示意图;Figure 4 is a schematic diagram of the chip hardware structure provided by this application;
图5为本申请提供的一种神经网络的训练方法的示意性流程图;Figure 5 is a schematic flow chart of a neural network training method provided by this application;
图6为本申请提供的一种神经网络的结构示意图;Figure 6 is a schematic structural diagram of a neural network provided by this application;
图7a为本申请提供的语音处理方法一个流程示意图;Figure 7a is a schematic flow chart of the speech processing method provided by this application;
图7b为本申请提供的一个时长预测示意图;Figure 7b is a schematic diagram of duration prediction provided by this application;
图7c为本申请提供的一个音高预测示意图;Figure 7c is a schematic diagram of pitch prediction provided by this application;
图7d为本申请提供的一个音高预测示意图;Figure 7d is a schematic diagram of pitch prediction provided by this application;
图8-图10为本申请提供的语音处理设备显示界面的几种示意图;Figures 8 to 10 are several schematic diagrams of the display interface of the voice processing device provided by this application;
图11为本申请提供的一种双向解码器的结构示意图;Figure 11 is a schematic structural diagram of a bidirectional decoder provided by this application;
图12为本申请提供的语音处理设备显示界面的另一种示意图;Figure 12 is another schematic diagram of the display interface of the voice processing device provided by this application;
图13为本申请提供的语音处理方法另一个流程示意图;Figure 13 is another schematic flow chart of the speech processing method provided by this application;
图14-图16本申请提供的语音处理设备的几种结构示意图。 Figures 14-16 are schematic structural diagrams of several speech processing devices provided by this application.
Detailed Description of Embodiments
本申请实施例提供了一种语音处理方法及相关设备,可以实现编辑语音的听感与原始语音的听感类似,提升用户体验。Embodiments of the present application provide a speech processing method and related equipment, which can realize that the listening feeling of edited speech is similar to that of original speech, thereby improving user experience.
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获取的所有其他实施例,都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without making creative efforts fall within the scope of protection of this application.
为了便于理解,下面先对本申请实施例主要涉及的相关术语和概念进行介绍。In order to facilitate understanding, the relevant terms and concepts mainly involved in the embodiments of this application are first introduced below.
1、神经网络1. Neural network
A neural network may be composed of neural units. A neural unit may be an operation unit that takes $X_s$ and an intercept of 1 as inputs, and the output of the operation unit may be:

$h_{W,b}(x)=f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
其中,s=1、2、……n,n为大于1的自然数,Ws为Xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。Among them, s=1, 2,...n, n is a natural number greater than 1, W s is the weight of X s , and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function. A neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field. The local receptive field can be an area composed of several neural units.
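As a purely numerical illustration of the neural-unit formula, the snippet below evaluates f(ΣW_s·x_s + b) for a single unit, with a sigmoid chosen as the example activation function f.

```python
import numpy as np

# Numerical illustration of the neural-unit formula above, with a sigmoid
# activation chosen as the example f.

def neuron(x, w, b):
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))  # f(sum_s W_s * x_s + b)

x = np.array([0.5, -1.2, 3.0])   # inputs X_s
w = np.array([0.4, 0.1, -0.2])   # weights W_s
print(neuron(x, w, b=1.0))       # output of the neural unit
```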
2、深度神经网络2. Deep neural network
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。当然,深度神经网络也可能不包括隐藏层,具体此处不做限定。Deep neural network (DNN), also known as multi-layer neural network, can be understood as a neural network with many hidden layers. There is no special metric for "many" here. From the division of DNN according to the position of different layers, the neural network inside DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer. Of course, the deep neural network may not include hidden layers, and there is no specific limitation here.
The work of each layer in a deep neural network can be described by the mathematical expression $y=\alpha(W\cdot x+b)$: at the physical level, the work of each layer in a deep neural network can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of a matrix) through five operations on the input space (the set of input vectors). These five operations include: 1. raising/reducing the dimension; 2. scaling up/down; 3. rotation; 4. translation; and 5. "bending". Operations 1, 2 and 3 are performed by $W\cdot x$, operation 4 is performed by $+b$, and operation 5 is implemented by $\alpha()$. The word "space" is used here because the classified object is not a single thing but a class of things, and space refers to the set of all individuals of this class of things. W is a weight vector, and each value in the vector represents the weight value of one neuron in this layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training a deep neural network is ultimately to obtain the weight matrices of all layers of the trained neural network (weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
3、卷积神经网络3. Convolutional neural network
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使同一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习获取的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。Convolutional neural network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers. The feature extractor can be viewed as a filter, and the convolution process can be viewed as convolving the same trainable filter with an input image or feature map. The convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal. In the convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neighboring layer neurons. A convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as a way to extract image information independent of position. The underlying principle is that the statistical information of one part of the image is the same as that of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the same learned image information can be used for all positions on the image. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习获取合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。本申请实施例中的分离网络、识别网络、检测网络、深度估计网络等网络都可以是CNN。The convolution kernel can be initialized in the form of a random-sized matrix. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting. The separation network, recognition network, detection network, depth estimation network and other networks in the embodiments of this application can all be CNNs.
4、循环神经网络(RNN)4. Recurrent Neural Network (RNN)
在传统的神经网络中模型中,层与层之间是全连接的,每层之间的节点是无连接的。但是这种普通的神经网络对于很多问题是无法解决的。比如,预测句子的下一个单词是什么,因为一个句子中前后单词并不是独立的,一般需要用到前面的单词。循环神经网络(recurrent neural network,RNN)指的是一个序列当前的输出与之前的输出也有关。具体的表现形式为网络会对前面的信息进行记忆,保存在网络的内部状态中,并应用于当前输出的计算中。In the traditional neural network model, the layers are fully connected, and the nodes between each layer are unconnected. But this ordinary neural network cannot solve many problems. For example, predicting what the next word of a sentence is, because the preceding and following words in a sentence are not independent, generally the previous word needs to be used. Recurrent neural network (RNN) refers to the current output of a sequence that is also related to the previous output. The specific form of expression is that the network will remember the previous information, save it in the internal state of the network, and apply it to the calculation of the current output.
5、损失函数5. Loss function
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。In the process of training a deep neural network, because we hope that the output of the deep neural network is as close as possible to the value that we really want to predict, we can compare the predicted value of the current network with the really desired target value, and then based on the difference between the two to update the weight vector of each layer of the neural network according to the difference (of course, there is usually an initialization process before the first update, that is, preconfiguring parameters for each layer in the deep neural network). For example, if the predicted value of the network If it is high, adjust the weight vector to make it predict a lower value, and continue to adjust until the neural network can predict the target value that you really want. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function (loss function) or objective function (objective function), which is used to measure the difference between the predicted value and the target value. Important equations. Among them, taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference. Then the training of the deep neural network becomes a process of reducing this loss as much as possible.
6、从文本到语音 6. From text to speech
从文本到语音(text to speech,TTS)是将文本转换成语音的程序或软件系统。Text to speech (TTS) is a program or software system that converts text into speech.
7、声码器7. Vocoder
声码器是一种声音信号处理模块或软件,可以将声学特征编码生成声音波形。A vocoder is a sound signal processing module or software that encodes acoustic features into a sound waveform.
8、音高8. Pitch
音高也可以称之为基频,当发声体由于振动而发出声音时,声音一般可以分解为许多单纯的正弦波,也就是说所有的自然声音基本都是由许多频率不同的正弦波组成的,其中频率最低的正弦波即为基音(即基频,可以用F0表示),而其他频率较高的正弦波则为泛音。Pitch can also be called fundamental frequency. When a sound-emitting body emits sound due to vibration, the sound can generally be decomposed into many simple sine waves. That is to say, all natural sounds are basically composed of many sine waves with different frequencies. , the sine wave with the lowest frequency is the fundamental tone (that is, the fundamental frequency, which can be represented by F0), while other sine waves with higher frequencies are overtones.
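For illustration, a frame-level F0 (pitch) contour can be extracted with an off-the-shelf estimator; the example below uses librosa's pyin on a synthetic 220 Hz tone. The patent does not prescribe any particular pitch extractor, so this is only one possible choice.

```python
import numpy as np
import librosa

# Extract a frame-level F0 contour from a synthetic 220 Hz tone with pyin.
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 220.0 * t).astype(np.float32)   # 220 Hz test tone

f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=80, fmax=600, sr=sr)
print(np.nanmean(f0))   # average F0 of voiced frames, roughly 220 Hz
```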
9、韵律9. Rhythm
语音合成领域中,韵律泛指控制语调、音调、重音强调、停顿和节奏等功能的特征。韵律可以反映出说话者的情感状态或讲话形式等。In the field of speech synthesis, prosody broadly refers to features that control functions such as intonation, pitch, emphasis, pauses, and rhythm. Prosody can reflect the speaker's emotional state or speech form, etc.
10、音素10. Phonemes
音素(phone):是根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个动作构成一个音素。音素分为元音与辅音两大类。例如,汉语音节a(例如,一声:啊)只有一个音素,ai(例如四声:爱)有两个音素,dai(例如一声:呆)有三个音素等。Phoneme (phone): It is the smallest unit of speech divided according to the natural properties of speech. It is analyzed based on the pronunciation movements in the syllable. One movement constitutes a phoneme. Phonemes are divided into two categories: vowels and consonants. For example, the Chinese syllable a (for example, one tone: ah) has only one phoneme, ai (for example, four tones: love) has two phonemes, dai (for example, one tone: stay) has three phonemes, etc.
11、词向量(embedding)11. Word vector (embedding)
词向量也可以称为“词嵌入”、“向量化”、“向量映射”、“嵌入”等。从形式上讲,词向量是用一个稠密的向量表示一个对象。Word vectors can also be called "word embeddings", "vectorization", "vector mapping", "embeddings", etc. Formally speaking, a word vector represents an object as a dense vector.
12、语音特征12. Voice characteristics
语音特征:将经过处理的语音信号转换成一种简洁而有逻辑的表示形式,比实际信号更有鉴别性和可靠性。在获取一段语音信号后,可以从语音信号中提取语音特征。其中,提取方法通常为每个语音信号提取一个多维特征向量。语音信号的参数化表示方法有很多种,例如:感知线性预测(perceptual linear predictive,PLP)、线性预测编码(linear predictive coding,LPC)和频率倒谱系数(mel frequency cepstrum coefficient,MFCC)等。Speech features: Convert the processed speech signal into a concise and logical representation that is more discriminating and reliable than the actual signal. After acquiring a segment of speech signal, speech features can be extracted from the speech signal. Among them, the extraction method usually extracts a multi-dimensional feature vector for each speech signal. There are many ways to represent the parameterization of speech signals, such as: perceptual linear prediction (PLP), linear predictive coding (LPC) and frequency cepstrum coefficient (MFCC), etc.
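As an example of such a parametric representation, the snippet below computes MFCCs for a signal with librosa; the signal here is random noise standing in for real speech, and PLP or LPC features would be computed with other tools.

```python
import numpy as np
import librosa

# Example of turning a signal into one common parametric representation (MFCCs).
sr = 16000
y = np.random.randn(sr).astype(np.float32)           # stand-in for 1 s of speech
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13-dimensional MFCC per frame
print(mfcc.shape)                                    # (13, number_of_frames)
```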
13、transformer层13. transformer layer
神经网络包括嵌入层和至少一个transformer层,至少一个transformer层可以为N个transformer层(N大于0的整数),其中,每个transformer层包括依次相邻的注意力层、加和与归一化(add&norm)层、前馈(feed forward)层和加和与归一化层。在嵌入层,对当前输入进行嵌入处理,得到多个特征向量;在所述注意力层,从所述第一transformer层的上一层获取P个输入向量,以P个输入向量中的任意的第一输入向量为中心,基于预设的注意力窗口范围内的各个输入向量与该第一输入向量之间的关联度,得到该第一输入向量对应的中间向量,如此确定出P个输入向量对应的P个中间向量;在所述池化层,将所述P个中间向量合并为Q个输出向量,其中transformer层中最后一个transformer层得到的多个输出向量用作所述当前输入的特征表示。The neural network includes an embedding layer and at least one transformer layer. At least one transformer layer can be N transformer layers (N is an integer greater than 0), where each transformer layer includes successively adjacent attention layers, summation and normalization. (add&norm) layer, feed forward layer and summation and normalization layer. In the embedding layer, the current input is embedded to obtain multiple feature vectors; in the attention layer, P input vectors are obtained from the upper layer of the first transformer layer, and any of the P input vectors are The first input vector is the center. Based on the correlation between each input vector within the preset attention window range and the first input vector, the intermediate vector corresponding to the first input vector is obtained. In this way, P input vectors are determined. Corresponding P intermediate vectors; in the pooling layer, the P intermediate vectors are merged into Q output vectors, where the multiple output vectors obtained by the last transformer layer in the transformer layer are used as features of the current input express.
接下来,结合具体例子对上述各步骤进行具体介绍。Next, each of the above steps will be introduced in detail with specific examples.
首先,在所述嵌入层,对当前输入进行嵌入处理,得到多个特征向量。 First, in the embedding layer, the current input is embedded to obtain multiple feature vectors.
嵌入层可以称为输入嵌入(input embedding)层。当前输入可以为文本输入,例如可以为一段文本,也可以为一个句子。文本可以为中文文本,也可以为英文文本,还可以为其他语言文本。嵌入层在获取当前输入后,可以对该当前输入中各个词进行嵌入处理,可得到各个词的特征向量。在一些实施例中,所述嵌入层包括输入嵌入层和位置编码(positional encoding)层。在输入嵌入层,可以对当前输入中的各个词进行词嵌入处理,从而得到各个词的词嵌入向量。在位置编码层,可以获取各个词在该当前输入中的位置,进而对各个词的位置生成位置向量。在一些示例中,各个词的位置可以为各个词在该当前输入中的绝对位置。以当前输入为“几号应还花呗”为例,其中的“几”的位置可以表示为第一位,“号”的位置可以表示为第二位,……。在一些示例中,各个词的位置可以为各个词之间的相对位置。仍以当前输入为“几号应还花呗”为例,其中的“几”的位置可以表示为“号”之前,“号”的位置可以表示为“几”之后、“应”之前,……。当得到当前输入中各个词的词嵌入向量和位置向量时,可以将各个词的位置向量和对应的词嵌入向量进行组合,得到各个词特征向量,即得到该当前输入对应的多个特征向量。多个特征向量可以表示为具有预设维度的嵌入矩阵。可以设定该多个特征向量中的特征向量个数为M,预设维度为H维,则该多个特征向量可以表示为M×H的嵌入矩阵。The embedding layer can be called the input embedding layer. The current input can be text input, for example, it can be a paragraph of text or a sentence. The text can be Chinese text, English text, or other language text. After obtaining the current input, the embedding layer can embed each word in the current input to obtain the feature vector of each word. In some embodiments, the embedding layer includes an input embedding layer and a positional encoding layer. In the input embedding layer, word embedding processing can be performed on each word in the current input to obtain the word embedding vector of each word. In the position coding layer, the position of each word in the current input can be obtained, and then a position vector is generated for the position of each word. In some examples, the position of each word may be the absolute position of each word in the current input. Taking the current input as "What number should I pay back the Huabei?" for example, the position of "number" can be represented as the first digit, the position of "number" can be represented as the second digit,... In some examples, the position of each word may be a relative position between each word. Still taking the current input as "what number should I pay back Huabei" as an example, the position of "what number" can be expressed as before "number", and the position of "number" can be expressed as after "what number" and before "should",... …. When the word embedding vector and position vector of each word in the current input are obtained, the position vector of each word and the corresponding word embedding vector can be combined to obtain the feature vector of each word, that is, multiple feature vectors corresponding to the current input are obtained. Multiple feature vectors can be represented as embedding matrices with preset dimensions. The number of eigenvectors in the plurality of eigenvectors can be set to M, and the default dimension is H dimension, then the plurality of eigenvectors can be expressed as an M×H embedding matrix.
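The embedding layer described above can be sketched as a word embedding plus a position embedding whose outputs are combined into an M×H matrix of feature vectors; the vocabulary size, H, and the additive combination below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the embedding layer: a word embedding plus a position embedding,
# combined (here by addition) into an M x H matrix of feature vectors.
vocab_size, H, M = 1000, 64, 6
word_emb = nn.Embedding(vocab_size, H)
pos_emb = nn.Embedding(512, H)                    # one vector per absolute position

token_ids = torch.randint(0, vocab_size, (1, M))  # M tokens of the current input
positions = torch.arange(M).unsqueeze(0)
features = word_emb(token_ids) + pos_emb(positions)
print(features.shape)                             # torch.Size([1, 6, 64]) -> the M x H embedding matrix
```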
14、注意力机制(attention mechanism)14. attention mechanism
注意力机制模仿了生物观察行为的内部过程,即一种将内部经验和外部感觉对齐从而增加部分区域的观察精细度的机制,能够利用有限的注意力资源从大量信息中快速筛选出高价值信息。注意力机制可以快速提取稀疏数据的重要特征,因而被广泛用于自然语言处理任务,特别是机器翻译。而自注意力机制(self-attention mechanism)是注意力机制的改进,其减少了对外部信息的依赖,更擅长捕捉数据或特征的内部相关性。注意力机制的本质思想可以改写为如下公式:The attention mechanism imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience and external sensation to increase the precision of observation in some areas, and can use limited attention resources to quickly filter out high-value information from a large amount of information. . The attention mechanism can quickly extract important features of sparse data and is therefore widely used in natural language processing tasks, especially machine translation. The self-attention mechanism is an improvement of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features. The essential idea of the attention mechanism can be rewritten as the following formula:
$\mathrm{Attention}(Query,\,Source)=\sum_{i=1}^{L_x}\mathrm{Similarity}(Query,\,Key_i)\cdot Value_i$

where Lx=||Source|| represents the length of Source. The meaning of the formula is to imagine the constituent elements of Source as a series of <Key, Value> data pairs. Given an element Query of the target Target, the similarity or correlation between the Query and each Key is computed to obtain the weight coefficient of the Value corresponding to each Key, and the Values are then weighted and summed to obtain the final Attention value. So essentially the Attention mechanism performs a weighted summation of the Value values of the elements in Source, while Query and Key are used to compute the weight coefficients of the corresponding Values. Conceptually, Attention can be understood as selectively filtering a small amount of important information out of a large amount of information and focusing on that important information while ignoring most of the unimportant information. The focusing process is reflected in the computation of the weight coefficients: the larger the weight, the more the focus is placed on its corresponding Value; that is, the weight represents the importance of the information, and the Value is the corresponding information. The self-attention mechanism can be understood as intra attention: whereas the Attention mechanism occurs between the element Query of the Target and all elements in the Source, the self-attention mechanism refers to the Attention mechanism occurring between elements inside the Source or between elements inside the Target. It can also be understood as the attention computation mechanism in the special case of Target = Source; the specific computation process is the same, only the object of computation changes.
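Read literally, the formula computes a similarity between the Query and every Key, turns the similarities into weight coefficients, and returns the weighted sum of the Values. A small numerical sketch (with softmax chosen as the normalization, which the text does not mandate) is shown below.

```python
import numpy as np

# Direct numerical reading of the attention formula: similarity(Query, Key_i)
# gives a weight for each Value_i (normalized here with softmax), and the
# result is the weighted sum of the Values. Shapes are arbitrary.

def attention(query, keys, values):
    scores = keys @ query                      # similarity of Query with every Key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # weight coefficients
    return weights @ values                    # weighted sum of Values

Lx, d, dv = 5, 8, 4
query = np.random.randn(d)
keys = np.random.randn(Lx, d)
values = np.random.randn(Lx, dv)
print(attention(query, keys, values).shape)    # (4,)
```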
At present, there are more and more speech editing scenarios. For example, a singing-voice editing scenario is one in which a user records a song (for example, a cappella singing); in order to repair erroneous content in the original speech caused by a slip of the tongue, speech editing is usually used.
The current speech editing method is to obtain a speech segment from a database, replace the erroneous content with the speech segment, and then generate the corrected speech.
为了解决上述问题,本申请提供一种语音编辑方法,在歌声编辑时,音高特征会影响目标编辑语音的听感与原始语音的听感,本申请通过预测第二文本(待编辑文本)的音高特征,根据音高特征生成第二文本的第一语音特征,并基于第一语音特征生成第二文本对应目标编辑语音,使得歌声编辑前后的语音的音高特征相似,进而实现目标编辑语音的听感与原始语音的听感目标编辑语音的听感与原始语音的听感类似。In order to solve the above problems, this application provides a voice editing method. When editing a singing voice, the pitch characteristics will affect the hearing sense of the target edited voice and the hearing sense of the original voice. This application predicts the second text (text to be edited) by predicting Pitch feature, generate the first voice feature of the second text based on the pitch feature, and generate the target editing voice corresponding to the second text based on the first voice feature, so that the pitch features of the voice before and after singing editing are similar, thereby achieving the target editing voice The listening feeling of the target is similar to that of the original speech.
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings. Evidently, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.

First, the system architecture provided by an embodiment of this application is introduced.
Referring to FIG. 1, an embodiment of this application provides a system architecture 10. As shown in the system architecture 10, a data collection device 16 is configured to collect training data. In this embodiment of the application, the training data includes training speech and training text corresponding to the training speech. The training data is stored in a database 13, and a training device 12 obtains a target model/rule 101 by training based on the training data maintained in the database 13. How the training device 12 obtains the target model/rule 101 based on the training data is described in more detail below. The target model/rule 101 can be used to implement the speech processing method provided in the embodiments of this application; that is, after related preprocessing, text is input into the target model/rule 101 to obtain the speech features of the text. The target model/rule 101 in this embodiment of the application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 13 is not necessarily all collected by the data collection device 16 and may also be received from other devices. It should further be noted that the training device 12 does not necessarily train the target model/rule 101 based entirely on the training data maintained in the database 13; it may also obtain training data from the cloud or elsewhere for model training. The foregoing description shall not be construed as a limitation on the embodiments of this application.

The target model/rule 101 obtained through training by the training device 12 can be applied to different systems or devices, for example, to the execution device 11 shown in FIG. 1. The execution device 11 may be a terminal such as a mobile phone, a tablet computer, a laptop computer, an AR/VR device, or a vehicle-mounted terminal, or it may be a server, a cloud, or the like. In FIG. 1, the execution device 11 is configured with an I/O interface 112 for data interaction with external devices. A user can input data to the I/O interface 112 through a client device 14. In this embodiment of the application, the input data may include a second speech feature, a target text, and marking information; the input data may alternatively include a second speech feature and a second text. In addition, the input data may be entered by the user, uploaded by the user through another device, or obtained from a database; this is not specifically limited here.

If the input data includes the second speech feature, the target text, and the marking information, the preprocessing module 113 is configured to perform preprocessing based on the target text and the marking information received by the I/O interface 112. In this embodiment of the application, the preprocessing module 113 may be configured to determine, based on the target text and the marking information, the target editing text in the target text. If the input data includes the second speech feature and the second text, the preprocessing module 113 is configured to perform preprocessing based on the data received by the I/O interface 112, for example, preparatory work such as converting the target text into phonemes.
When the execution device 11 preprocesses the input data, or when the computation module 111 of the execution device 11 performs computation or other related processing, the execution device 11 may call data, code, and the like in a data storage system 15 for the corresponding processing, and may also store data, instructions, and the like obtained through the corresponding processing into the data storage system 15.

Finally, the I/O interface 112 returns the processing result, such as the first speech feature obtained as described above, to the client device 14, thereby providing it to the user.

It is worth noting that the training device 12 can generate, for different goals or different tasks, corresponding target models/rules 101 based on different training data. The corresponding target model/rule 101 can then be used to achieve the above goal or complete the above task, thereby providing the user with the desired result or providing input for other subsequent processing.

In the case shown in FIG. 1, the user can manually specify the input data, and this manual specification can be performed through an interface provided by the I/O interface 112. In another case, the client device 14 can automatically send input data to the I/O interface 112. If the user's authorization is required for the client device 14 to automatically send input data, the user can set the corresponding permission in the client device 14. The user can view, on the client device 14, the result output by the execution device 11; the specific presentation form may be display, sound, action, or the like. The client device 14 can also serve as a data collection terminal, collecting the input data of the I/O interface 112 and the output result of the I/O interface 112 shown in the figure as new sample data and storing them in the database 13. Alternatively, without collection by the client device 14, the I/O interface 112 may directly store the input data of the I/O interface 112 and the output result of the I/O interface 112 shown in the figure into the database 13 as new sample data.

It is worth noting that FIG. 1 is merely a schematic diagram of a system architecture provided by an embodiment of this application, and the positional relationships among the devices, components, modules, and the like shown in the figure constitute no limitation. For example, in FIG. 1 the data storage system 15 is an external memory relative to the execution device 11; in other cases, the data storage system 15 may instead be placed inside the execution device 11.
As shown in FIG. 1, the target model/rule 101 is obtained through training by the training device 12. In this embodiment of the application, the target model/rule 101 may be a neural network. Specifically, in the network provided by this embodiment of the application, the neural network may be a recurrent neural network, a long short-term memory network, or the like, and the prediction network may be a convolutional neural network, a recurrent neural network, or the like.

Optionally, the neural network and the prediction network in this embodiment of the application may be two separate networks, or they may be a single multi-task neural network in which one task outputs durations, one task predicts pitch features, and another task outputs speech features.
Since the CNN is a very common neural network, the structure of the CNN is described in detail below with reference to FIG. 2. As stated in the introduction to basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine-learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image input into it.

As shown in FIG. 2, a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130.
Convolutional layer/pooling layer 120:

Convolutional layer:

As shown in FIG. 2, the convolutional layer/pooling layer 120 may include, for example, layers 121 to 126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer. In another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can serve as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.

Taking the convolutional layer 121 as an example, the convolutional layer 121 may include many convolution operators. A convolution operator, also called a kernel, acts in image processing as a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is typically moved across the input image in the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract specific features from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases multiple weight matrices of the same dimensions are applied rather than a single one. The outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix extracts image edge information, another extracts a specific color of the image, and yet another blurs unwanted noise in the image. The multiple weight matrices have the same dimensions, the feature maps extracted by these weight matrices of the same dimensions also have the same dimensions, and the extracted feature maps of the same dimensions are then combined to form the output of the convolution operation.
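As an illustrative aside only (not part of the claimed embodiments), the following sketch shows how several same-sized kernels slide across an input with a given stride and how their outputs are stacked along the depth dimension; all names and shapes are assumptions introduced here.

import numpy as np

def conv2d(image, kernels, stride=1):
    """image:   (H, W, C)       input with depth C
    kernels: (K, kh, kw, C)  K weight matrices, each spanning the full input depth
    returns: (H_out, W_out, K) one output channel per weight matrix
    """
    H, W, C = image.shape
    K, kh, kw, _ = kernels.shape
    H_out = (H - kh) // stride + 1
    W_out = (W - kw) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):                      # each kernel yields one depth slice
        for i in range(H_out):
            for j in range(W_out):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw, :]
                out[i, j, k] = np.sum(patch * kernels[k])
    return out  # stacking the K slices forms the depth dimension of the output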
In practical applications, the weight values in these weight matrices need to be obtained through extensive training. The weight matrices formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 make correct predictions.

When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (for example, 121) tends to extract more general features, which may also be called low-level features. As the depth of the convolutional neural network 100 increases, the features extracted by later convolutional layers (for example, 126) become increasingly complex, such as high-level semantic features; features with higher-level semantics are better suited to the problem to be solved.
Pooling layer:

Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. That is, for the layers 121 to 126 exemplified by 120 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average-pooling operator and/or a max-pooling operator for sampling the input image to obtain an image of smaller size. The average-pooling operator computes the average of the pixel values of the image within a specific range. The max-pooling operator takes, within a specific range, the pixel with the largest value in that range as the max-pooling result. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer; each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
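Purely as an illustration (the names below are assumptions, not defined by the embodiments), a minimal max/average pooling sketch over non-overlapping windows:

import numpy as np

def pool2d(image, size=2, mode="max"):
    """Downsample an (H, W) image with non-overlapping size x size windows."""
    H, W = image.shape
    out = np.zeros((H // size, W // size))
    for i in range(H // size):
        for j in range(W // size):
            window = image[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out  # each output pixel is the max/average of its sub-region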
Neural network layer 130:

After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information. As described above, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 100 needs the neural network layer 130 to generate one output or a set of outputs whose number equals the number of required classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132, ..., 13n as shown in FIG. 2) and an output layer 140. The parameters contained in these hidden layers may be obtained by pre-training on training data related to a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.

After the multiple hidden layers in the neural network layer 130, that is, as the final layer of the entire convolutional neural network 100, comes the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy and is specifically used to compute the prediction error. Once the forward propagation of the entire convolutional neural network 100 (the propagation from 110 to 140 in FIG. 2) is completed, back propagation (the propagation from 140 to 110 in FIG. 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.

It should be noted that the convolutional neural network 100 shown in FIG. 2 is merely an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models, for example, multiple parallel convolutional layers/pooling layers as shown in FIG. 3, whose separately extracted features are all input to the neural network layer 130 for processing.
The following describes a chip hardware structure provided by an embodiment of this application.

FIG. 4 shows a chip hardware structure provided by an embodiment of this application. The chip includes a neural network processor 40. The chip may be disposed in the execution device 11 shown in FIG. 1 to complete the computation work of the computation module 111. The chip may also be disposed in the training device 12 shown in FIG. 1 to complete the training work of the training device 12 and output the target model/rule 101. The algorithms of all layers in the convolutional neural network shown in FIG. 2 can be implemented in the chip shown in FIG. 4.

The neural network processor 40 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), a graphics processing unit (GPU), or any other processor suitable for large-scale exclusive-OR operation processing. Taking the NPU as an example: the neural network processor NPU 40 is mounted as a coprocessor onto a host central processing unit (host CPU), and the host CPU allocates tasks. The core part of the NPU is the arithmetic circuit 403; the controller 404 controls the arithmetic circuit 403 to fetch data from memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuit 403 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 403 is a two-dimensional systolic array. The arithmetic circuit 403 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 403 is a general-purpose matrix processor.

For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 402 and caches it on each PE in the arithmetic circuit. The arithmetic circuit fetches the matrix A data from the input memory 401 and performs a matrix operation with matrix B; the partial or final result of the resulting matrix is stored in the accumulator 408.
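As an informal illustration only (the tile size and names below are assumptions, not part of the hardware description), the following sketch mimics how partial products of A times B can be accumulated tile by tile before the final result is read out:

import numpy as np

def tiled_matmul(A, B, tile=4):
    """Compute C = A @ B by accumulating partial results over tiles of the
    shared dimension, analogous to accumulating partial sums in an accumulator."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))                     # plays the role of the accumulator
    for k0 in range(0, K, tile):
        a_tile = A[:, k0:k0 + tile]          # data fetched from the "input memory"
        b_tile = B[k0:k0 + tile, :]          # data fetched from the "weight memory"
        C += a_tile @ b_tile                 # partial result accumulated
    return C                                 # final result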
The vector computation unit 407 can further process the output of the arithmetic circuit, for example, vector multiplication, vector addition, exponential operations, logarithmic operations, magnitude comparison, and so on. For example, the vector computation unit 407 can be used for network computations of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization, and local response normalization.

In some implementations, the vector computation unit 407 can store the processed output vector into the unified buffer 406. For example, the vector computation unit 407 may apply a nonlinear function to the output of the arithmetic circuit 403, for example to a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 407 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 403, for example for use in a subsequent layer of the neural network.
The unified memory 406 is used to store input data and output data.

A direct memory access controller (DMAC) 405 transfers the input data in the external memory to the input memory 401 and/or the unified memory 406, stores the weight data in the external memory into the weight memory 402, and stores the data in the unified memory 406 into the external memory.

A bus interface unit (BIU) 410 is used to implement interaction among the host CPU, the DMAC, and the instruction fetch buffer 409 through a bus.

The instruction fetch buffer 409 connected to the controller 404 is used to store instructions used by the controller 404.

The controller 404 is used to call the instructions cached in the instruction fetch buffer 409 to control the working process of the computation accelerator.

Generally, the unified memory 406, the input memory 401, the weight memory 402, and the instruction fetch buffer 409 are all on-chip memories, and the external memory is a memory external to the NPU. The external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.

The operations of the layers in the convolutional neural network shown in FIG. 2 or FIG. 3 may be performed by the arithmetic circuit 403 or the vector computation unit 407.
First, the application scenarios to which the speech processing method provided in the embodiments of this application is applicable are described. The speech processing method can be applied to scenarios in which speech content needs to be modified, for example, a user recording a short video or a teacher recording a lecture. The speech processing method can be applied to applications, software, or speech processing devices with a speech-editing function, such as an intelligent voice assistant on a mobile phone, a computer, or a sound-capable wearable terminal, or a smart speaker.

The speech processing device is a terminal device for serving a user, or a cloud device. The terminal device may include a head-mounted display (HMD); the head-mounted display may be a combination of a virtual reality (VR) box and a terminal, an all-in-one VR headset, a personal computer (PC), an augmented reality (AR) device, a mixed reality (MR) device, or the like. The terminal device may also include a cellular phone, a smart phone, a personal digital assistant (PDA), a tablet computer, a laptop computer, a personal computer (PC), a vehicle-mounted terminal, or the like; this is not specifically limited here.
The neural network, the training method of the prediction network, and the speech processing method in the embodiments of this application are described in detail below with reference to the accompanying drawings.

The neural network and the prediction network in the embodiments of this application may be two separate networks, or they may be a single multi-task neural network in which one task outputs durations and another task outputs speech features.

Next, the training method of the neural network in the embodiments of this application is described in detail with reference to FIG. 5. The training method shown in FIG. 5 may be performed by a neural-network training apparatus. The training apparatus may be a cloud service device or a terminal device, for example, a computer, a server, or another apparatus whose computing power is sufficient to perform the neural-network training method; it may also be a system composed of a cloud service device and a terminal device. For example, the training method may be performed by the training device 12 in FIG. 1 and the neural network processor 40 in FIG. 4.

Optionally, the training method may be processed by a CPU, jointly by a CPU and a GPU, or, without using a GPU, by another processor suitable for neural-network computation; this application imposes no limitation.
The training method shown in FIG. 5 includes step 501 and step 502, which are described in detail below.

First, the training process of the prediction network is briefly described. The prediction network in this embodiment of the application may be a transformer network, an RNN, a CNN, or the like, which is not specifically limited here. During training of the prediction network, the input is the vector of the training text, and the output is the duration, pitch feature, or speech feature of each phoneme in the training text. The difference between the duration, pitch feature, or speech feature of each phoneme in the training text output by the prediction network and the actual duration, actual pitch feature, or actual speech feature of the training speech corresponding to the training text is then continuously reduced, so as to obtain a trained prediction network.
Step 501: Obtain training data.

The training data in this embodiment of the application includes training speech, or includes training speech and training text corresponding to the training speech. If the training data does not include training text, the training text can be obtained by recognizing the training speech.

Optionally, if there are multiple speakers (or users), in order for the subsequently predicted speech features to be correct, the training speech features in the training data may further include a user identifier, or include the voiceprint feature of the training speech, or include a vector used to identify the voiceprint feature of the training speech.

Optionally, the training data may further include start/end duration information of each phoneme in the training speech.

In the embodiments of this application, the training data may be obtained by directly recording the utterances of the speaking object, by the user inputting audio information or video information, or by receiving data sent by a collection device. In practical applications there are still other ways to obtain the training data, and the manner of obtaining the training data is not specifically limited here.
Step 502: Using the training data as the input of the neural network, train the neural network with the goal that the value of the loss function is less than a threshold, to obtain a trained neural network.

Optionally, some preprocessing may be performed on the training data. For example, as described above, if the training data includes training speech, the training text can be obtained by recognizing the training speech, and the training text can be represented by phonemes and input into the neural network.

During training, the entire training text can be treated as the target editing text and used as input, and the neural network is trained with the goal of reducing the value of the loss function, that is, continuously reducing the difference between the speech features output by the neural network and the actual speech features corresponding to the training speech. This training process can be understood as a prediction task, and the loss function can be understood as the loss function corresponding to the prediction task.
The neural network in this embodiment of the application may specifically be an attention mechanism model, for example, transformer or tacotron2. The attention mechanism model includes an encoder-decoder, and the structure of the encoder or decoder may be a recurrent neural network, a long short-term memory (LSTM) network, or the like.

The neural network in this embodiment of the application includes an encoder and a decoder. The structure types of the encoder and the decoder may be RNN, LSTM, or the like, and are not specifically limited here. The function of the encoder is to encode the training text into a text vector (a vector representation in units of phonemes, with each input corresponding to one vector), and the function of the decoder is to obtain the speech features corresponding to the text based on the text vector. During training of the decoder, the computation at each step is conditioned on the real speech features corresponding to the previous step.

Further, to ensure the coherence of the preceding and following speech, the prediction network can be used to correct the speech duration corresponding to the text vector. That is, the text vector is upsampled according to the duration of each phoneme in the training speech (which can also be understood as expanding the number of frames of the vector) to obtain a vector of the corresponding number of frames. The function of the decoder is then to obtain the speech features corresponding to the text based on this vector of the corresponding number of frames.
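For illustration only, a minimal sketch of this duration-based upsampling (length regulation) follows; the names and the phoneme-to-frame mapping are assumptions introduced here.

import numpy as np

def upsample_by_duration(phoneme_vectors, durations):
    """Expand per-phoneme vectors to frame level.

    phoneme_vectors: (P, d) one vector per phoneme
    durations:       (P,)   number of frames each phoneme occupies
    returns:         (sum(durations), d) frame-level sequence for the decoder
    """
    frames = [np.repeat(v[None, :], int(n), axis=0)
              for v, n in zip(phoneme_vectors, durations)]
    return np.concatenate(frames, axis=0)

# e.g. a phoneme vector whose duration is 10 frames is repeated into 10 frame vectors.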
Optionally, the above decoder may be a unidirectional decoder or a bidirectional decoder (that is, two directions in parallel), which is not specifically limited here. The two directions refer to the directions of the training text, which can also be understood as the directions of the vector corresponding to the training text, or as the forward order or reverse order of the training text: one direction points from one side of the training text to the other side, and the other direction points from the other side of the training text back to the first side.

For example, if the training text is "中午吃饭了没" ("Have you had lunch?"), the first direction or forward order may be the direction from "中" to "没", and the second direction or reverse order may be the direction from "没" to "中".

If the decoder is a bidirectional decoder, the decoders of the two directions (or of the forward and reverse orders) are trained in parallel and computed independently of each other during training, with no dependence between their results. Of course, if the prediction network and the neural network form one multi-task network, the prediction network can be called a prediction module, and the decoder can correct the speech features output by the neural network based on the real duration information corresponding to the training text.
For example, taking singing editing as an example, the input for model training may be the original singing audio, the corresponding lyric text (expressed in units of phonemes), the duration information of each phoneme in the original audio obtained from the original singing audio, the singer's voiceprint feature, frame-level pitch information, and so on. These can be obtained, for example, through other pre-trained models or tools (such as a singing-lyrics alignment tool, a singer voiceprint extraction tool, and a pitch extraction algorithm). The output may be a trained acoustic model, and the training objective is to minimize the error between the predicted singing features and the singing speech features.

In terms of data preparation of training samples, the corresponding training data samples can be constructed based on a singing-synthesis training data set by separately simulating the "insert, delete, and replace" operation scenarios.

Training process (an illustrative sketch of this staged schedule is given after the list below):
Stage 1: First, use the ground-truth lyrics and audio together with the pitch and duration data to train a singing-voice synthesis model, thereby obtaining a trained text encoding module and audio feature decoding module.

Stage 2: Fix the text encoding module and the audio feature decoding module, and use the simulated-editing-operation training data set to train the duration regularization module and the pitch prediction module.

Stage 3: End-to-end training; finetune the entire model using all of the training data.
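Purely as an illustrative sketch of such a staged schedule (the module names, optimizer, losses, and step counts below are assumptions, not the actual implementation of the embodiments), freezing and unfreezing parameters per stage might look like this in a PyTorch-style framework:

import torch

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def run_stage(model, data, loss_fn, steps):
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
    for _, (inputs, targets) in zip(range(steps), data):
        loss = loss_fn(model(inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

def staged_training(model, stage1_data, stage2_data, all_data, loss_fn, steps=1000):
    # Stage 1: train text encoder + feature decoder on ground-truth data.
    set_trainable(model.text_encoder, True)
    set_trainable(model.feature_decoder, True)
    run_stage(model, stage1_data, loss_fn, steps)

    # Stage 2: freeze encoder/decoder, train the duration and pitch predictors
    # on the simulated editing-operation data.
    set_trainable(model.text_encoder, False)
    set_trainable(model.feature_decoder, False)
    set_trainable(model.duration_predictor, True)
    set_trainable(model.pitch_predictor, True)
    run_stage(model, stage2_data, loss_fn, steps)

    # Stage 3: finetune everything end to end on all data.
    for m in (model.text_encoder, model.feature_decoder,
              model.duration_predictor, model.pitch_predictor):
        set_trainable(m, True)
    run_stage(model, all_data, loss_fn, steps)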
The architecture of the neural network in this embodiment of the application can be seen in FIG. 6. The neural network includes an encoder and a decoder. Optionally, the neural network may further include a prediction module and an upsampling module. The prediction module is specifically used to implement the functions of the above prediction network, and the upsampling module is specifically used to implement the above process of upsampling the text vector according to the duration of each phoneme in the training speech, which is not described again here.

It should be noted that the training process may also adopt training methods other than the foregoing, which is not limited here.
The speech processing method of the embodiments of this application is described in detail below with reference to the accompanying drawings.

First, the speech processing method provided in the embodiments of this application can be applied to a replacement scenario, an insertion scenario, or a deletion scenario. These scenarios can be understood as replacing, inserting into, or deleting from the original speech corresponding to the original text to obtain the target speech, so that the target speech sounds similar to the original speech and/or the fluency of the target speech is improved. The original speech can be regarded as containing the speech to be modified, and the target speech is the speech the user wants to obtain after correcting the original speech.

For ease of understanding, several examples of the above scenarios are described below:

1. Replacement scenario.
The original text is "今天深圳天气很好" ("The weather in Shenzhen is good today") and the target text is "今天广州天气很好" ("The weather in Guangzhou is good today"). The overlapping text is "今天天气很好". The non-overlapping text in the original text is "深圳" (Shenzhen), and the non-overlapping text in the target text is "广州" (Guangzhou). The target text includes a first text and a second text; the first text is the overlapping text or part of the overlapping text, and the second text is the text in the target text other than the first text. For example, if the first text is "今天天气很好", the second text is "广州"; if the first text is "今气很好" (a partial overlap), the second text is "天广州天".
2. Insertion scenario.

The original text is "今天深圳天气很好" and the target text is "今天上午深圳天气很好" ("The weather in Shenzhen is good this morning"). The overlapping text is "今天深圳天气很好", and the non-overlapping text in the target text is "上午" (morning). To achieve coherence before and after the target speech, this insertion scenario can be regarded as a replacement scenario in which "天深" in the original speech is replaced by "天上午深". That is, the first text is "今圳天气很好" and the second text is "天上午深".
3. Deletion scenario.

The original text is "今天深圳天气很好" and the target text is "今天天气很好". The overlapping text is "今天天气很好", and the non-overlapping text in the original text is "深圳". To achieve coherence before and after the target speech, this deletion scenario can be regarded as a replacement scenario in which "天深圳天" in the original speech is replaced by "天天". That is, the first text is "今气很好" and the second text is "天天".

Optionally, the above scenarios are merely examples; in practical applications there are other scenarios, which are not specifically limited here.
Since both the deletion scenario and the insertion scenario described above can be replaced by the replacement scenario, the speech processing method provided in the embodiments of this application is described below using only the replacement scenario as an example. The speech processing method provided in the embodiments of this application may be performed by a terminal device or a cloud device alone, or jointly by a terminal device and a cloud device. These cases are described separately below:

Embodiment 1: A terminal device or a cloud device performs the speech processing method alone.

Referring to FIG. 7a, an embodiment of the speech processing method provided by the embodiments of this application is shown. The method may be performed by a speech processing device or by a component of the speech processing device (for example, a processor, a chip, or a chip system). The speech processing device may be a terminal device or a cloud device. This embodiment includes steps 701 to 704.
Step 701: Obtain the original speech and the second text.

In this embodiment of the application, the speech processing device may directly obtain the original speech, the original text, and the second text. Alternatively, it may first obtain the original speech and the second text, and then recognize the original speech to obtain the original text corresponding to the original speech. The second text is the text in the target text other than the first text, and both the original text and the target text contain the first text. The first text can be understood as part or all of the overlapping text of the original text and the target text.

In a possible implementation, the content of the original speech is the user's singing, for example, speech recorded when the user sings a cappella.
In this embodiment of the application, the speech processing device can obtain the second text in multiple ways, which are described separately below:

In the first way, the speech processing device can directly obtain the second text through input from another device or from the user.

In the second way, the speech processing device obtains the target text, obtains the overlapping text based on the target text and the original text corresponding to the original speech, and then determines the second text based on the overlapping text. Specifically, the characters in the original text and the target text may be compared one by one, or input into a comparison model, to determine the overlapping text and/or non-overlapping text of the original text and the target text. The first text is then determined based on the overlapping text, where the first text may be the overlapping text or part of the overlapping text.

In this embodiment of the application, there are multiple ways to determine the first text based on the overlapping text. The speech processing device may directly determine the overlapping text as the first text, determine the first text in the overlapping text according to a preset rule, or determine the first text in the overlapping text according to a user operation. The preset rule may be to obtain the first text after removing N characters from the overlapping content, where N is a positive integer.

It can be understood that the above two ways are merely examples; in practical applications there are other ways to obtain the second text, which are not specifically limited here.
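As an illustration only (this is not the comparison model of the embodiments), one hedged way to obtain the overlapping and non-overlapping text by character-level comparison is with Python's standard difflib; the function name is an assumption introduced here.

from difflib import SequenceMatcher

def split_overlap(original, target):
    """Return the overlapping text plus the non-overlapping parts of each text."""
    matcher = SequenceMatcher(a=original, b=target)
    overlap, orig_only, target_only = [], [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            overlap.append(original[i1:i2])
        else:
            orig_only.append(original[i1:i2])
            target_only.append(target[j1:j2])
    return "".join(overlap), "".join(orig_only), "".join(target_only)

# Example from the replacement scenario above:
# split_overlap("今天深圳天气很好", "今天广州天气很好")
# -> ("今天天气很好", "深圳", "广州")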
In addition, the speech processing device can align the original text with the original speech and determine the start and end positions of each phoneme of the original text in the original speech, so as to learn the duration of each phoneme in the original text. It can then obtain the phonemes corresponding to the first text, that is, obtain the speech corresponding to the first text in the original speech (namely, the non-edited speech).

Optionally, the speech processing device may align the original text with the original speech by means of forced alignment, for example, using an alignment tool such as the Montreal Forced Aligner (MFA) or a neural network with an alignment function; this is not specifically limited here.

Optionally, after obtaining the original speech and the original text, the speech processing device may present a user interface to the user, where the user interface includes the original speech and the original text. Further, the user performs a first operation on the original text through the user interface, and the speech processing device determines the target text in response to the user's first operation. The first operation can be understood as the user's editing of the original text; the editing may specifically be the aforementioned replacement, insertion, or deletion.

For example, continuing the example in the above replacement scenario, the original text is "今天深圳天气很好" and the target text is "今天广州天气很好". Taking the speech processing device being a mobile phone as an example: after obtaining the original text and the original speech, the speech processing device presents to the user the interface shown in FIG. 8, which includes the original text and the original speech. As shown in FIG. 9, the user can perform a first operation 901 on the original text, for example, modifying "深圳" to "广州", or the aforementioned insertion, deletion, or replacement operations; only replacement is used as an example here.
Optionally, after determining the overlapping text of the original text and the target text, the speech processing device presents the overlapping text to the user, then determines the first text from the overlapping text according to a second operation of the user, and further determines the second text. The second operation may be clicking, dragging, sliding, or the like, which is not specifically limited here.

For example, continuing the above example, the second text is "广州", the first text is "今天天气很好", and the non-edited speech is the speech of the first text in the original speech. Assuming that one character corresponds to 2 frames and the original speech corresponding to the original text includes 16 frames, the non-edited speech corresponds to frames 1 to 4 and frames 9 to 16 of the original speech. It can be understood that, in practical applications, the correspondence between characters and speech frames is not necessarily the 1-to-2 ratio of this example; the example is given only for ease of understanding the non-edited region, and the number of frames corresponding to the original text is not specifically limited here. After the target text is determined, the speech processing device may display the interface shown in FIG. 10, which may include the second text, the target text, and the non-edited speech and the edited speech in the original speech, where the second text is "广州", the target text is "今天广州天气很好", the non-edited speech is the speech corresponding to "今天天气很好", and the edited speech is the speech corresponding to "深圳". This can also be understood as follows: as the user edits the target text, the speech processing device determines the non-edited speech in the original speech based on the target text, the original text, and the original speech.
Optionally, the speech processing device receives an editing request sent by the user, where the editing request includes the original speech and the second text. Optionally, the editing request further includes the original text and/or a speaker identifier. Of course, the editing request may also include the original speech and the target text.

Step 702: Predict the second pitch feature of the second text based on the first pitch feature of the non-edited speech and the information of the target text.

In a possible implementation, the information of the target text includes the text embedding of each phoneme in the target text.

In a possible implementation, the text embedding of each phoneme in the target text can be obtained from the target text through a text encoding module (Text Encoder). For example, the target text can be converted into the corresponding phoneme sequence (for example, the phonemes corresponding to "爱怎么可以不问对错" are the sequence of initials and finals of its pinyin), which is then input to the Text Encoder and converted into the corresponding text embedding in units of phonemes. The network structure of the Text Encoder may, for example, be the Tacotron 2 model.
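As an illustration only (the lexicon, dimensions, and the toy module below are assumptions and not the Text Encoder of the embodiments), converting text into a phoneme sequence and then into per-phoneme embeddings might be sketched as follows:

import torch
import torch.nn as nn

# Hypothetical lexicon mapping each character to its pinyin initial/final phonemes.
LEXICON = {"爱": ["ai"], "怎": ["z", "en"], "么": ["m", "e"]}
PHONEME_IDS = {p: i for i, p in enumerate(["ai", "z", "en", "m", "e"])}

def text_to_phonemes(text):
    """Flatten the target text into a phoneme sequence using the lexicon."""
    return [p for ch in text for p in LEXICON.get(ch, [])]

class SimpleTextEncoder(nn.Module):
    """Toy stand-in for a text encoder: one embedding per phoneme."""
    def __init__(self, num_phonemes, dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, phoneme_ids):          # (P,) -> (P, dim)
        return self.proj(self.embed(phoneme_ids))

phonemes = text_to_phonemes("爱怎么")
ids = torch.tensor([PHONEME_IDS[p] for p in phonemes])
embeddings = SimpleTextEncoder(len(PHONEME_IDS))(ids)   # one embedding per phoneme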
In a possible implementation, the number of frames (which may also be called the duration) of each phoneme in the non-edited speech can be obtained, and the number of frames of each phoneme in the second text is predicted based on the number of frames of each phoneme in the non-edited speech and the information of the target text.

In a possible implementation, the neural network used to predict the number of frames of each phoneme in the second text may be as shown in FIG. 7b (for example, a mask-mechanism-based duration prediction model that fuses the original real durations). It takes the output of the Text Encoder, the original real durations (Reference Durations, that is, the durations of the phonemes in the first text), and the corresponding mask as input, and predicts the duration (that is, the number of frames in the corresponding audio) of each phoneme to be edited (that is, each phoneme in the second text).

In a possible implementation, after the number of frames of each phoneme in the target text (including the first text and the second text) is obtained, each text embedding can be upsampled according to the predicted duration of each phoneme to obtain an embedding result of the corresponding number of frames (for example, if the predicted duration of phoneme ai is 10 frames, the text embedding corresponding to ai can be copied N times, where N is a positive number greater than 1, for example N is 10).
It should be understood that, optionally, in the singing-voice editing scenario, the singing itself follows a certain musical score, and the score specifies the pronunciation duration and pitch of each word. Therefore, when editing singing, for the region that does not need to be edited (the non-edited speech), its duration and pitch information does not need to be predicted; the accurate ground-truth values can be obtained and used directly.
接下来给出一个对第二文本进行时长预测的示意:Next, a diagram for predicting the duration of the second text is given:
Referring to Figure 7b, the Reference Durations are the ground-truth durations of each phoneme in the original singing audio, where the dashed boxes mark the to-be-predicted durations of the phonemes in the second text (since they are unknown at this time, they can be replaced by 0). The Edit Mask marks the phonemes to be predicted (where Mask=0 indicates that prediction is required). The Embedding Layer fuses the Reference Durations with the Edit Mask (for example, fusion can be performed by an inner-product operation), and the result is then summed with the Text Embedding and the Singer Embedding (the extracted voiceprint feature). One FFT Block can be a Transformer block; for example, 4 FFT blocks (i.e. N=4) can be used. Finally, the model predicts the durations of the phonemes corresponding to Mask=0 and outputs them together with the durations of the other, unedited phonemes.
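A minimal sketch of such a masked duration predictor is given below, under the assumptions that the fusion of reference durations and edit mask can be implemented with a small projection layer and that an FFT block is a standard Transformer encoder layer; the module and parameter names are illustrative, not the exact design of Figure 7b.

```python
import torch
import torch.nn as nn

class MaskedDurationPredictor(nn.Module):
    def __init__(self, dim: int = 256, n_blocks: int = 4):
        super().__init__()
        self.dur_proj = nn.Linear(2, dim)   # fuses (reference duration, edit mask) per phoneme
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fft_blocks = nn.TransformerEncoder(layer, num_layers=n_blocks)
        self.out = nn.Linear(dim, 1)        # predicted duration (frames) per phoneme

    def forward(self, text_emb, singer_emb, ref_dur, edit_mask):
        # text_emb: (B, P, dim); singer_emb: (B, dim); ref_dur, edit_mask: (B, P)
        dur_feat = self.dur_proj(torch.stack([ref_dur, edit_mask], dim=-1).float())
        x = text_emb + singer_emb.unsqueeze(1) + dur_feat   # accumulate the three inputs
        pred = self.out(self.fft_blocks(x)).squeeze(-1)     # (B, P)
        # keep the known durations where mask == 1, use predictions where mask == 0
        return torch.where(edit_mask.bool(), ref_dur.float(), pred)
```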
In a possible implementation, the predicted duration of each phoneme in the second text can be used to upsample each input of the pitch feature prediction. For example, the input of the pitch feature prediction may include text embeddings; before upsampling, each text embedding corresponds to one phoneme, and after upsampling the text embeddings include as many copies of each embedding as the number of frames of the corresponding phoneme.
In a possible implementation, a second speech feature of the non-edited speech may also be obtained based on the non-edited speech. The second speech feature may carry at least one of the following pieces of information: some or all speech frames of the non-edited speech; the voiceprint feature of the non-edited speech; the timbre feature of the non-edited speech; the prosodic feature of the non-edited speech; and the rhythm feature of the non-edited speech.
The speech features in the embodiments of the present application can be used to represent characteristics of speech (for example, timbre, prosody, emotion or rhythm). Speech features can be expressed in many forms, such as speech frames, sequences or vectors, which are not limited here. In addition, the speech features in the embodiments of the present application may specifically be parameters extracted from the above representations through the aforementioned PLP, LPC, MFCC and other methods.
Optionally, at least one speech frame is selected from the non-edited speech as the second speech feature. Further, so that the first speech feature better incorporates a second speech feature that reflects the context, the text corresponding to the at least one speech frame may be the text in the first text that is adjacent to the second text.
可选地,将非编辑语音通过编码模型编码得到目标序列,将该目标序列作为第二语音特征。其中,编码模型可以是CNN、RNN等,具体此处不做限定。Optionally, the non-edited speech is encoded through a coding model to obtain a target sequence, and the target sequence is used as the second speech feature. Among them, the coding model can be CNN, RNN, etc., and there is no specific limit here.
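As an illustration only, the following sketch encodes the non-edited speech frames into a feature sequence with a small convolutional plus recurrent encoder; the concrete architecture is an assumption, since the coding model may be a CNN, an RNN or another model.

```python
import torch
import torch.nn as nn

class SpeechFrameEncoder(nn.Module):
    """Encodes non-edited speech frames (e.g. Mel frames) into a target feature sequence."""
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, dim, kernel_size=5, padding=2)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (B, T, n_mels) -> (B, T, dim) sequence used as the second speech feature
        x = self.conv(mel.transpose(1, 2)).transpose(1, 2)
        out, _ = self.rnn(x)
        return out
```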
In addition, the second speech feature may also carry the voiceprint feature of the original speech. The voiceprint feature may be obtained directly, or may be obtained by recognizing the original speech, and so on. On the one hand, by introducing the voiceprint feature of the original speech, the subsequently generated first speech feature also carries the voiceprint feature of the original speech, which improves the similarity between the target edited speech and the original speech. On the other hand, when there are multiple speakers (or users), introducing the voiceprint feature helps make the subsequently predicted speech features more similar to the voiceprint of the speaker of the original speech.
可选地,语音处理设备还可以获取原始语音的发音者标识,以便于在发音者为多个时,可以匹配相应发音者对应的语音,提升后续目标编辑语音与原始语音的相似度。Optionally, the speech processing device can also obtain the speaker identification of the original speech, so that when there are multiple speakers, the speech corresponding to the corresponding speaker can be matched, and the similarity between the subsequent target edited speech and the original speech can be improved.
下面仅以将语音帧作为语音特征(或者理解为是根据语音帧获取语音特征)为例进行描述。示例性的,延续上述举例,选择原始语音中的第1帧至第4帧以及第9帧至第16帧中的至少一帧作为第二语音特征。 The following description only takes speech frames as speech features (or is understood as obtaining speech features based on speech frames) as an example. Illustratively, continuing the above example, at least one frame from the 1st to 4th frame and the 9th to 16th frame in the original speech is selected as the second speech feature.
示例性的,第二语音特征为梅尔频谱特征。For example, the second speech feature is a Mel spectrum feature.
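For reference, a Mel-spectrum feature of this kind can be extracted with a standard audio toolkit; the file name and frame parameters below are illustrative assumptions, not values specified by this embodiment.

```python
import librosa

# Load the original speech and compute a Mel spectrogram (80 bands is a common choice).
y, sr = librosa.load("original_speech.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
# mel has shape (80, n_frames); each column is one speech frame's Mel-spectrum feature.
```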
In a possible implementation, the second speech feature can be expressed in the form of vectors. In a possible implementation, the predicted duration of each phoneme in the second text can be used to upsample each input of the pitch feature prediction. For example, the input of the pitch feature prediction may include the second speech feature; before upsampling, each vector corresponds to one phoneme, and after upsampling the representation includes as many vectors as the number of frames of the corresponding phoneme.
在一种可能的实现中,可以根据所述非编辑语音的第一音高(pitch)特征以及所述目标文本的信息,预测所述第二文本的第二音高特征。In a possible implementation, the second pitch feature of the second text can be predicted based on the first pitch feature of the non-edited voice and the information of the target text.
在一种可能的实现中,非编辑语音的第一音高(pitch)特征可以通过现有的Pitch提取算法得到,本申请并不限定。In a possible implementation, the first pitch (pitch) feature of the non-edited speech can be obtained through an existing pitch extraction algorithm, which is not limited by this application.
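As one example of an existing pitch extraction algorithm, the probabilistic YIN implementation in librosa yields a frame-level F0 contour; the sketch below only illustrates how the first pitch feature might be obtained, and the parameter values are assumptions.

```python
import librosa
import numpy as np

y, sr = librosa.load("non_edited_speech.wav", sr=22050)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr, hop_length=256)
f0 = np.nan_to_num(f0)  # unvoiced frames are NaN; replace with 0 as a simple convention
# f0 now holds one pitch value per frame of the non-edited speech.
```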
In a possible implementation, the second pitch feature of the second text can be predicted through a neural network based on the first pitch feature of the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech.
接下来介绍如何根据所述非编辑语音的第一音高(pitch)特征以及所述目标文本的信息,预测所述第二文本的第二音高特征:Next, we will introduce how to predict the second pitch feature of the second text based on the first pitch feature of the non-edited speech and the information of the target text:
In a possible implementation, the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by deleting a first part of the first text, and the second text is the text adjacent to the first part. The first pitch feature of the non-edited speech and the information of the target text can be fused to obtain a first fusion result; the first fusion result is input into a second neural network to obtain the second pitch feature of the second text.
For insertion and deletion operations: the model shown in Figure 7c is used to predict the frame-level pitch features of the target edited phonemes. The structure of the pitch prediction model for insertion and deletion operations can be the same as or similar to that in Figure 7b; the only difference is that the input here is the frame-level pitch values extracted from the real singing (whereas the input in Figure 7b is the phoneme-level duration information), where the pitch of the region to be edited is marked by dashed boxes and its corresponding Edit Mask is set to 0.
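A sketch of frame-level pitch prediction for the insertion/deletion case is given below; it reuses the MaskedDurationPredictor class from the duration sketch above, with frame-level reference pitch and edit mask in place of phoneme-level durations. The tensor sizes and the frame range of the edit region are purely illustrative.

```python
import torch

# Frame-level pitch prediction for insert/delete, reusing the masked-predictor structure
# sketched earlier (MaskedDurationPredictor); here the per-position inputs are
# (reference pitch, edit mask) for each frame instead of per-phoneme durations.
pitch_predictor = MaskedDurationPredictor(dim=256, n_blocks=4)

frame_text_emb = torch.randn(1, 120, 256)   # frame-level text embedding after upsampling
singer_emb = torch.randn(1, 256)            # voiceprint / singer embedding
ref_pitch = torch.zeros(1, 120)             # real pitch for unedited frames, 0 for frames to predict
edit_mask = torch.ones(1, 120)              # 1 = keep real pitch, 0 = predict
edit_mask[:, 40:60] = 0                     # frames of the region to be edited
predicted_pitch = pitch_predictor(frame_text_emb, singer_emb, ref_pitch, edit_mask)
```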
In a possible implementation, the target text is obtained by replacing a second part of the first text with the second text. The first pitch feature of the non-edited speech can be input into a third neural network to obtain an initial pitch feature, where the initial pitch feature includes the pitch of each of multiple frames; the information of the target text is input into a fourth neural network to obtain the pronunciation feature of the second text, where the pronunciation feature is used to indicate whether each of the multiple frames included in the initial pitch feature is voiced; and the initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
针对于替换操作(此处替换操作仅表示新编辑文本字数同被替换文本字数一致的情况,若不一致,则将替换操作分解成先删除后插入两个编辑操作)。由于替换的文本可能在发音上存在很大区别,所以为保障替换后前后歌声的连贯性,使用图7d所示的模型来预测新pitch:For the replacement operation (the replacement operation here only means that the number of words in the new edited text is consistent with the number of words in the replaced text. If they are not consistent, the replacement operation will be decomposed into two editing operations: first deletion and then insertion). Since the replaced text may have a big difference in pronunciation, in order to ensure the coherence of the singing before and after the replacement, the model shown in Figure 7d is used to predict the new pitch:
Pitch prediction model for the replacement operation. Frame-level voiced/unvoiced (V/UV) prediction can be introduced to help the pitch prediction. Illustratively, the design of the V/UV Predictor and F0 Predictor modules can refer to the F0 predictor in FastSpeech 2.
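The following sketch illustrates the replacement-case pitch prediction under the assumption that an F0 predictor produces an initial per-frame pitch contour, a V/UV predictor produces a per-frame voicing probability, and the two are fused by silencing the pitch of unvoiced frames; the internal layer choices are illustrative and are not the exact design of Figure 7d.

```python
import torch
import torch.nn as nn

class ReplacementPitchPredictor(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.f0_predictor = nn.Sequential(      # third network: initial per-frame pitch
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(), nn.Conv1d(dim, 1, 1))
        self.vuv_predictor = nn.Sequential(     # fourth network: per-frame voiced/unvoiced
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(), nn.Conv1d(dim, 1, 1), nn.Sigmoid())

    def forward(self, pitch_feat, text_feat):
        # pitch_feat, text_feat: (B, T, dim) frame-level features
        f0 = self.f0_predictor(pitch_feat.transpose(1, 2)).squeeze(1)   # (B, T)
        vuv = self.vuv_predictor(text_feat.transpose(1, 2)).squeeze(1)  # (B, T), ~1 = voiced
        return f0 * (vuv > 0.5).float()         # fuse: zero out pitch of unvoiced frames
```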
In a possible implementation, the input first pitch feature may include the pitch feature of each of the multiple frames of the non-edited speech; correspondingly, the output second pitch feature may include the pitch feature of each of the multiple frames of the target edited speech.
步骤703,根据所述第二音高特征以及所述第二文本,通过神经网络得到所述第二文本对应的第一语音特征。Step 703: According to the second pitch feature and the second text, obtain the first speech feature corresponding to the second text through a neural network.
In a possible implementation, the second pitch feature and the second text (for example, the text embedding of the second text) can be fused (for example, added), and the fusion result is input into the neural network to obtain the first speech feature corresponding to the second text. The first speech feature corresponding to the second text may be a Mel-spectrum feature.
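A minimal sketch of this fusion-and-decoding step is given below; the decoder structure and dimensions are assumptions for illustration, not the exact network of this embodiment.

```python
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    """Decodes fused frame-level features into Mel-spectrum frames (illustrative structure)."""
    def __init__(self, dim: int = 256, n_mels: int = 80):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, frame_text_emb, frame_pitch_emb):
        # Fusion by addition of the frame-level text embedding and a pitch embedding,
        # then decoding into the first speech feature (Mel frames) of the second text.
        fused = frame_text_emb + frame_pitch_emb
        return self.to_mel(self.decoder(fused))     # (B, T, n_mels)
```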
In a possible implementation, the first speech feature can be obtained based on the first pitch feature of the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech; for the description of the second speech feature, reference may be made to the description in the above embodiments, which is not repeated here.
In a possible implementation, after the second speech feature is obtained, the first speech feature corresponding to the second text can be obtained through a neural network based on the second speech feature and the second text. The neural network may include an encoder and a decoder. The second text is input into the encoder to obtain a first vector corresponding to the second text, and the first vector is then decoded by the decoder based on the second speech feature to obtain the first speech feature. The second speech feature may be the same as or similar to the first speech feature in terms of prosody, timbre and/or signal-to-noise ratio. Prosody can reflect the speaker's emotional state or speaking style, and generally refers to characteristics such as intonation, pitch, stress, pauses or rhythm.
可选地,编码器与解码器之间可以引入注意力机制,用于调整输入与输出之间数量的对应关系。Optionally, an attention mechanism can be introduced between the encoder and the decoder to adjust the quantitative correspondence between the input and the output.
Optionally, the target text containing the second text can be introduced into the encoding process of the encoder, so that the generated first vector of the second text takes the target text into account and describes the second text more accurately. That is, the first speech feature corresponding to the second text can be obtained through the neural network based on the second speech feature, the target text and marking information. Specifically, the target text and the marking information may be input into the encoder to obtain the first vector corresponding to the second text, and the first vector is then decoded by the decoder based on the second speech feature to obtain the first speech feature. The marking information is used to mark the second text within the target text.
本申请实施例中的解码器可以是单向解码器,也可以是双向解码器,下面分别描述。The decoder in the embodiment of the present application may be a unidirectional decoder or a bidirectional decoder, which are described respectively below.
第一种,解码器是单向解码器。First, the decoder is a one-way decoder.
Based on the second speech feature, the decoder computes speech frames from the first vector or the second vector along a first direction of the target text, and these frames serve as the first speech feature. The first direction is a direction from one side of the target text to the other side of the target text. In addition, the first direction can be understood as the forward or reverse order of the target text (for a related description, refer to the description of forward and reverse order in the embodiment shown in Figure 5).
可选地,将第二语音特征与第一向量输入解码器得到第一语音特征。或者将第二语音特征与第二向量输入解码器得到第一语音特征。Optionally, the second speech feature and the first vector are input into the decoder to obtain the first speech feature. Or the second speech feature and the second vector are input into the decoder to obtain the first speech feature.
第二种,若第二文本在目标文本的中间区域,解码器可以是双向解码器(也可以理解为编码器包括第一编码器与第二编码器)。Second, if the second text is in the middle area of the target text, the decoder can be a bidirectional decoder (it can also be understood that the encoder includes a first encoder and a second encoder).
上述的第二文本在目标文本的中间区域,可以理解为第二文本并不在目标文本的两端。The above-mentioned second text is in the middle area of the target text, which can be understood to mean that the second text is not at both ends of the target text.
本申请实施例中的双向解码器有多种情况,下面分别描述: There are many situations of bidirectional decoders in the embodiments of this application, which are described below:
1、双向解码器从第一方向输出的第一语音特征为第二文本对应的语音特征,双向解码器从第二方向输出的第四语音特征为第二文本对应的语音特征。1. The first speech feature output by the bidirectional decoder from the first direction is the speech feature corresponding to the second text, and the fourth speech feature output by the bidirectional decoder from the second direction is the speech feature corresponding to the second text.
该种情况,可以理解为可以分别通过左右两侧(即正序反序)得到两种第二文本对应的完整语音特征,并根据两种语音特征得到第一语音特征。In this case, it can be understood that the complete phonetic features corresponding to the two second texts can be obtained through the left and right sides (ie, forward and reverse order), and the first phonetic features can be obtained based on the two phonetic features.
Based on the second speech feature, the first decoder computes the first vector or the second vector along the first direction of the target text to obtain the first speech feature of the second text (hereinafter referred to as LR). Based on the second speech feature, the second decoder computes the first vector or the second vector along the second direction of the target text to obtain the fourth speech feature of the second text (hereinafter referred to as RL). The first speech feature is then generated from the LR and RL features (i.e. from the first speech feature and the fourth speech feature). The first direction is a direction from one side of the target text to the other side of the target text, and the second direction is opposite to the first direction (or can be understood as a direction from the other side of the target text to the one side of the target text). The first direction may be the above-mentioned forward order, and the second direction may be the above-mentioned reverse order.
For the bidirectional decoder, when the first encoder decodes the first frame of the first vector or the second vector in the first direction, the speech frame in the non-edited speech adjacent to one side (which may also be called the left side) of the second text can be used as the condition for decoding, obtaining N frames of LR. When the second encoder decodes the first frame of the first vector or the second vector in the second direction, the speech frame in the non-edited speech adjacent to the other side (which may also be called the right side) of the second text can be used as the condition, obtaining N frames of RL by decoding. Optionally, the structure of the bidirectional decoder can refer to Figure 11. After the N frames of LR and the N frames of RL are obtained, a frame whose LR/RL difference is smaller than a threshold can be used as the transition frame (at position m, m < n), or the frame with the smallest LR/RL difference can be used as the transition frame. The N frames of the first speech feature may then include the first m frames of LR and the last n−m frames of RL, or the first n−m frames of LR and the last m frames of RL. The difference between LR and RL can be understood as the distance between the vectors. In addition, if the speaker identification is obtained in the aforementioned step 701, the first vector or the second vector in this step may also include a third vector used to identify the speaker. It can also be understood that the third vector is used to identify the voiceprint feature of the original speech.
Illustratively, continuing the above example, assume that the first encoder obtains the LR frames corresponding to "Guangzhou" as LR1, LR2, LR3, LR4, and the second encoder obtains the RL frames corresponding to "Guangzhou" as RL1, RL2, RL3, RL4. If the difference between LR2 and RL2 is the smallest, then LR1, LR2, RL3, RL4 or LR1, RL2, RL3, RL4 are used as the first speech feature.
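The transition-frame selection illustrated above can be written as a small numpy sketch; it implements only the splicing rule (smallest LR/RL difference), not the decoders themselves, and the shapes are illustrative.

```python
import numpy as np

def splice_bidirectional(lr_frames: np.ndarray, rl_frames: np.ndarray) -> np.ndarray:
    """lr_frames, rl_frames: (N, dim) features decoded in the two directions."""
    # distance between the two decodings at each position
    diff = np.linalg.norm(lr_frames - rl_frames, axis=1)
    m = int(np.argmin(diff))                 # transition frame: smallest LR/RL difference
    # take the first m+1 frames from LR and the remaining frames from RL
    return np.concatenate([lr_frames[:m + 1], rl_frames[m + 1:]], axis=0)

# Example with N = 4 frames, as in the "Guangzhou" illustration above.
lr = np.random.randn(4, 80)
rl = np.random.randn(4, 80)
first_speech_feature = splice_bidirectional(lr, rl)   # shape (4, 80)
```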
2、双向解码器从第一方向输出的第一语音特征为第二文本中第三文本对应的语音特征,双向解码器从第二方向输出的第四语音特征为第二文本中第四文本对应的语音特征。2. The first speech feature output by the bidirectional decoder from the first direction is the speech feature corresponding to the third text in the second text, and the fourth speech feature output by the bidirectional decoder from the second direction is the speech feature corresponding to the fourth text in the second text. voice characteristics.
该种情况,可以理解为可以分别通过左右两侧(即正序反序)得到第二文本对应的部分语音特征,并根据两个部分语音特征得到完整的第一语音特征。即从正序的方向上取一部分语音特征,从反序的方向上取另一部分语音特征,并拼接一部分语音特征与另一部分语音特征得到整体的语音特征。In this case, it can be understood that the partial speech features corresponding to the second text can be obtained through the left and right sides (ie, forward and reverse order), and the complete first speech features can be obtained based on the two partial speech features. That is, one part of the phonetic features is taken from the forward direction, another part of the phonetic features is taken from the reverse direction, and one part of the phonetic features and another part of the phonetic features are spliced to obtain the overall phonetic features.
Illustratively, continuing the above example, assume that the first encoder obtains the LR frames corresponding to the third text ("广") as LR1 and LR2, and the second encoder obtains the RL frames corresponding to the fourth text ("州") as RL3 and RL4. Then LR1, LR2, RL3 and RL4 are spliced to obtain the first speech feature.
可以理解的是,上述两种方式只是举例,在实际应用中,还有其他方式获取第一语音特征,具体此处不做限定。It can be understood that the above two methods are only examples. In practical applications, there are other methods to obtain the first speech feature, which are not limited here.
步骤704,根据所述第一语音特征,生成所述第二文本对应的目标编辑语音。 Step 704: Generate a target editing voice corresponding to the second text according to the first voice feature.
In a possible implementation, after the first speech feature is obtained, the first speech feature can be converted into the target edited speech corresponding to the second text by a vocoder. The vocoder may be a traditional vocoder (for example, the Griffin-Lim algorithm) or a neural-network vocoder (for example, MelGAN or HiFi-GAN pre-trained with audio training data), which is not limited here.
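As a hedged illustration of the traditional-vocoder option, a Mel-spectrum feature can be inverted to a waveform with librosa's Griffin-Lim based Mel inversion; the STFT parameters must match those used when the Mel features were extracted, and the file names and values below are assumptions.

```python
import librosa
import numpy as np
import soundfile as sf

# mel: (80, n_frames) Mel-spectrum feature of the target edited speech (power spectrogram assumed)
mel = np.load("predicted_mel.npy")
wav = librosa.feature.inverse.mel_to_audio(mel, sr=22050, n_fft=1024, hop_length=256)
sf.write("target_edited_speech.wav", wav, 22050)
```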
示例性的,延续上述举例,“广州”对应的目标编辑语音如图12所示。Illustratively, continuing the above example, the target editing voice corresponding to "Guangzhou" is shown in Figure 12.
步骤705,获取第二文本在目标文本中的位置。本步骤是可选地。Step 705: Obtain the position of the second text in the target text. This step is optional.
可选地,如果步骤701中获取的是原始语音与第二文本,则获取第二文本在目标文本中的位置。Optionally, if what is obtained in step 701 is the original speech and the second text, the position of the second text in the target text is obtained.
可选地,如果步骤701中已获取目标文本,则可以通过前述步骤701中的对齐技术对齐原始语音与原始文本确定原始文本中各个音素在原始语音中的起止位置。并根据各音素的起止位置确定第二文本在目标文本中的位置。Optionally, if the target text has been obtained in step 701, the starting and ending positions of each phoneme in the original text in the original speech can be determined by aligning the original speech and the original text through the alignment technology in step 701. And determine the position of the second text in the target text based on the starting and ending positions of each phoneme.
步骤706,基于位置拼接目标编辑语音与非编辑语音生成与目标文本对应的目标语音。本步骤是可选地。Step 706: Splice the target edited voice and the non-edited voice based on the position to generate a target voice corresponding to the target text. This step is optional.
The position in the embodiments of the present application is used to splice the non-edited speech and the target edited speech. The position may be the position of the second text in the target text, the position of the first text in the target text, the position of the non-edited speech in the original speech, or the position of the edited speech in the original speech.
Optionally, after the position of the second text in the target text is obtained, the original speech and the original text can be aligned through the alignment technology in the aforementioned step 701 to determine the starting and ending positions of each phoneme of the original text in the original speech. Based on the position of the first text in the original text, the position of the non-edited speech or the edited speech in the original speech is then determined. The speech processing device then splices the target edited speech and the non-edited speech based on the position to obtain the target speech. That is, the edited region in the original speech is replaced with the target edited speech corresponding to the second text to obtain the target speech.
Illustratively, continuing the above example, the non-edited speech corresponds to frames 1 to 4 and frames 9 to 16 of the original speech, and the target edited speech is LR1, LR2, RL3, RL4 or LR1, RL2, RL3, RL4. Splicing the target edited speech and the non-edited speech can be understood as replacing frames 5 to 8 of the original speech with the four obtained frames to obtain the target speech. That is, the speech corresponding to "Guangzhou" replaces the speech corresponding to "Shenzhen" in the original speech, yielding the target speech corresponding to the target text "The weather in Guangzhou is very good today". The target speech corresponding to "The weather in Guangzhou is very good today" is shown in Figure 12.
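The frame splicing in this example can be expressed directly as array concatenation; the 16-frame layout and frame indices follow the illustrative example above and are not general.

```python
import numpy as np

# original: (16, dim) frames of the original speech; edited: (4, dim) target edited frames
original = np.random.randn(16, 80)
edited = np.random.randn(4, 80)

# Replace frames 5-8 (indices 4..7) of the original speech with the target edited frames.
target = np.concatenate([original[:4], edited, original[8:]], axis=0)
assert target.shape[0] == 16
```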
可选地,语音处理设备在获取目标编辑语音或目标语音之后,对目标编辑语音或目标语音进行播放。Optionally, after acquiring the target editing voice or the target voice, the voice processing device plays the target editing voice or the target voice.
一种可能实现的方式中,本申请实施例提供的语音处理方法包括步骤701至步骤704。另一种可能实现的方式中,本申请实施例提供的语音处理方法包括步骤701至步骤705。另一种可能实现的方式中,本申请实施例提供的语音处理方法包括步骤701至步骤706。另外,本申请实施例中图7a所示的各个步骤不限定时序关系。例如:上述方法中的步骤705也可以在步骤704之后,也可以在步骤701之前,还可以与步骤701共同执行。In one possible implementation manner, the speech processing method provided by the embodiment of the present application includes steps 701 to 704. In another possible implementation manner, the speech processing method provided by the embodiment of the present application includes steps 701 to 705. In another possible implementation manner, the speech processing method provided by the embodiment of the present application includes steps 701 to 706. In addition, the various steps shown in Figure 7a in the embodiment of the present application do not limit the timing relationship. For example, step 705 in the above method can also be performed after step 704, or before step 701, or can be executed together with step 701.
An embodiment of the present application provides a speech processing method. The method includes: obtaining an original speech and a second text, where the second text is the text in a target text other than a first text, both the target text and the original text corresponding to the original speech include the first text, and the speech corresponding to the first text in the original speech is the non-edited speech; predicting a second pitch feature of the second text based on a first pitch feature of the non-edited speech and the information of the target text; obtaining, through a neural network, a first speech feature corresponding to the second text based on the second pitch feature and the second text; and generating a target edited speech corresponding to the second text based on the first speech feature. By predicting the pitch feature of the second text (the text to be edited), generating the first speech feature of the second text from the pitch feature, and generating the target edited speech corresponding to the second text from the first speech feature, the present application makes the pitch features of the speech before and after the singing-voice edit similar, so that the target edited speech sounds similar to the original speech.
接下来结合一个示意介绍本申请实施例中的语音处理方法:Next, we will introduce the speech processing method in the embodiment of this application with a schematic:
Take the singing-voice editing scenario as an example, with the original singing voice W to be edited (whose speech content S is "爱可以不问对错", roughly "love can go without asking right or wrong") and the following three different target speeches as examples:
编辑请求Q1:其目标语音为W1(语音内容对应文本T1为“爱怎么可以不问对错”),Editing request Q1: Its target voice is W1 (the corresponding text T1 of the voice content is "How can love not ask whether it is right or wrong"),
编辑请求Q2:其目标语音为W2(语音内容对应文本T2为“爱不问对错”),Editing request Q2: Its target voice is W2 (the corresponding text T2 of the voice content is "Love does not ask whether it is right or wrong"),
Editing request Q3: its target speech is W3 (the text T3 corresponding to the speech content is "爱怎么不问对错", "how can love not ask right or wrong");
步骤S1:接收用户“语音编辑”请求;Step S1: Receive the user’s “voice editing” request;
The request includes at least the original speech W to be edited, the original lyric text S, the target text T (T1, T2 or T3), and other data. The pre-operations include: comparing the original text S with the target text to determine the editing type of the current editing request, i.e. Q1, Q2 and Q3 can be determined to be insertion, deletion and replacement operations respectively; extracting the per-frame audio features and pitch features from W; extracting the Singer Embedding from W through the voiceprint model; converting S and the target text T* into phoneme representations, e.g. for T2 the phoneme sequence is [ai4 b u2 w en4 d ui4 c cuo4]; extracting, from W and S, the duration (i.e. the number of frames) corresponding to each phoneme in S; and determining the Mask region according to the operation type. For Q1, which is an insertion operation (inserting the word "怎么"), the target Mask phonemes are the phonemes corresponding to "怎么", i.e. the final target text phonemes of Q1 are [ai4 z en3 m e5 k e2 y i3 b u2 w en4 d ui4 c cuo4] (where red indicates the masked phonemes). For Q2, which is a deletion operation (deleting the word "可以"), the target phonemes are the phonemes of the words originally adjacent to "可以" in S, i.e. the final target text phonemes of Q2 are [ai4 b u2 w en4 d ui4 c cuo4] (where red indicates the masked phonemes). For Q3, which is a replacement operation (replacing "可以" with "怎么"), the target text phonemes are [ai4 z en3 m e5 b u2 w en4 d ui4 c cuo4] (where red indicates the masked phonemes);
步骤S2:S1中所得的目标文本音素经文本编码模块生成文本特征,即Phoneme-level Text Embedding;Step S2: The target text phonemes obtained in S1 are used by the text encoding module to generate text features, that is, Phoneme-level Text Embedding;
步骤S3:经时长规整模块预测出目标文本中各音素的时长信息;该步骤可通过以下子步骤完成:Step S3: Predict the duration information of each phoneme in the target text through the duration regularization module; this step can be completed through the following sub-steps:
根据音素的Mask标记生成Mask向量和参考时长向量:即对于非Mask音素,其参考时长即为S1步骤所提取的真实时长,否则则设为0;对于非Mask音素,Mask向量中对应位置设置为1,否则设置为0;Generate a Mask vector and a reference duration vector according to the Mask tag of the phoneme: that is, for non-Mask phonemes, the reference duration is the real duration extracted in step S1, otherwise it is set to 0; for non-Mask phonemes, the corresponding position in the Mask vector is set to 1, otherwise set to 0;
将Text Embedding,Singer Embedding参考时长向量以及Mask向量作为输入,使用如Figure2-2所示的时长预测模块预测出Mask音素所对应的时长;Taking the Text Embedding, Singer Embedding reference duration vector and Mask vector as input, use the duration prediction module as shown in Figure 2-2 to predict the duration corresponding to the Mask phoneme;
根据各音素所对应的时长,将各音素的Embedding向上采样(即若音素A的时长为10,则将A的Embedding复制10份),从而生成Frame-level Text Embedding; According to the duration corresponding to each phoneme, the Embedding of each phoneme is upsampled (that is, if the duration of phoneme A is 10, then the Embedding of A is copied 10 times), thereby generating Frame-level Text Embedding;
步骤S4:经Pitch预测模块预测出各帧的Pitch值,该步骤可通过以下子步骤完成:Step S4: Predict the pitch value of each frame through the pitch prediction module. This step can be completed through the following sub-steps:
对于Q1和Q2,使用Figure2-3所示的模型预测出Mask音素所对应的帧的pitch:For Q1 and Q2, use the model shown in Figure 2-3 to predict the pitch of the frame corresponding to the Mask phoneme:
For non-Mask phonemes, the reference pitch is the real pitch extracted in S1, and the corresponding position in the Mask vector is marked as 1; for Mask phonemes, the pitch on the corresponding frames is set to 0 and the Mask is set to 0. The model then predicts the frame-level pitch corresponding to the Mask phonemes.
对于替换操作Q3,则使用Figure2-4所示的模型预测出Mask音素的Frame-level Pitch;For the replacement operation Q3, the model shown in Figure 2-4 is used to predict the Frame-level Pitch of the Mask phoneme;
步骤S5:将Frame-Level text Embedding和Pitch加到一起输入到音频特征解码模块,预测出新的Mask音素所对应的音频特征帧。Step S5: Add Frame-Level text Embedding and Pitch together and input them into the audio feature decoding module to predict the audio feature frame corresponding to the new Mask phoneme.
应理解,若一个编辑请求中涉及多个编辑操作,则可以按照从左至右的处理顺序一一使用如上所述的流程进行编辑。另一方面,一个替换操作也可以通过“先删除后插入”两个操作来实现。It should be understood that if an editing request involves multiple editing operations, the editing can be performed one by one using the process described above in a processing order from left to right. On the other hand, a replacement operation can also be implemented by two operations: "delete first and then insert".
上面对终端设备或云端设备单独实施的语音处理方法进行了描述,下面对终端设备与云端设备共同执行的语音处理方法进行描述。The voice processing method implemented by the terminal device or the cloud device alone is described above, and the voice processing method implemented by the terminal device and the cloud device jointly is described below.
实施例二:终端设备与云端设备共同执行语音处理方法。Embodiment 2: The terminal device and the cloud device jointly execute the voice processing method.
Please refer to Figure 13, which shows an embodiment of the speech processing method provided by the embodiments of the present application. The method can be executed jointly by the terminal device and the cloud device, or by components of the terminal device (such as a processor, a chip, or a chip system) together with components of the cloud device (such as a processor, a chip, or a chip system). This embodiment includes steps 1301 to 1306.
步骤1301,终端设备获取原始语音与第二文本。Step 1301: The terminal device obtains the original voice and the second text.
本实施例中终端设备执行的步骤1301与前述图7a所示实施例中语音处理设备执行的步骤701类似,此处不再赘述。Step 1301 performed by the terminal device in this embodiment is similar to step 701 performed by the voice processing device in the embodiment shown in Figure 7a, and will not be described again here.
步骤1302,终端设备向云端设备发送原始语音与第二文本。Step 1302: The terminal device sends the original voice and the second text to the cloud device.
终端设备获取原始语音与第二文本之后,可以向云端设备发送原始语音与第二文本。After the terminal device obtains the original voice and the second text, it can send the original voice and the second text to the cloud device.
可选地,若步骤1301中,终端设备获取的是原始语音与目标文本,则终端设备向云端设备发送原始语音与目标文本。Optionally, if in step 1301, the terminal device obtains the original voice and the target text, the terminal device sends the original voice and the target text to the cloud device.
步骤1303,云端设备基于原始语音与第二文本获取非编辑语音。Step 1303: The cloud device obtains the non-edited voice based on the original voice and the second text.
本实施例中云端设备执行的步骤1303与前述图7a所示实施例中语音处理设备执行的步骤701中确定非编辑语音的描述类似,此处不再赘述。Step 1303 performed by the cloud device in this embodiment is similar to the description of determining non-edited voice in step 701 performed by the speech processing device in the embodiment shown in Figure 7a, and will not be described again here.
步骤1304,云端设备基于非编辑语音的第一音高特征和目标文本的信息,获取第二文本的第二音高特征。Step 1304: The cloud device obtains the second pitch feature of the second text based on the first pitch feature of the non-edited voice and the information of the target text.
Step 1304 performed by the cloud device in this embodiment is similar to step 702 performed by the speech processing device in the embodiment shown in Figure 7a, and will not be described again here.
步骤1305,云端设备基于第二音高特征、第二文本通过神经网络得到第二文本对应的第一语音特征。Step 1305: The cloud device obtains the first speech feature corresponding to the second text through a neural network based on the second pitch feature and the second text.
步骤1306,云端设备基于第一语音特征生成与第二文本对应的目标编辑语音。Step 1306: The cloud device generates a target editing voice corresponding to the second text based on the first voice feature.
本实施例中云端设备执行的步骤1304至步骤1306与前述图7a所示实施例中语音处理设备执行的步骤702至步骤704类似,此处不再赘述。 Steps 1304 to 1306 performed by the cloud device in this embodiment are similar to steps 702 to 704 performed by the voice processing device in the embodiment shown in Figure 7a, and will not be described again here.
步骤1307,云端设备向终端设备发送目标编辑语音。本步骤是可选地。Step 1307: The cloud device sends the target editing voice to the terminal device. This step is optional.
可选地,云端设备获取目标编辑语音之后,可以向终端设备发送目标编辑语音。Optionally, after the cloud device obtains the target editing voice, it can send the target editing voice to the terminal device.
步骤1308,终端设备或云端设备获取第二文本在目标文本中的位置。本步骤是可选地。Step 1308: The terminal device or cloud device obtains the position of the second text in the target text. This step is optional.
Step 1309: The terminal device or the cloud device splices the target edited speech and the non-edited speech based on the position to generate the target speech corresponding to the target text. This step is optional.
本实施例中的步骤1308、步骤1309与前述图7a所示实施例中语音处理设备执行的步骤705至步骤706类似,此处不再赘述。本实施例中的步骤1308、步骤1309可以由终端设备或云端设备执行。Steps 1308 and 1309 in this embodiment are similar to steps 705 to 706 performed by the speech processing device in the embodiment shown in FIG. 7a, and will not be described again here. Steps 1308 and 1309 in this embodiment can be executed by a terminal device or a cloud device.
步骤1310,云端设备向终端设备发送目标语音。本步骤是可选地。Step 1310: The cloud device sends the target voice to the terminal device. This step is optional.
可选地,若步骤1308与步骤1309由云端设备执行,则云端设备获取目标语音后,向终端设备发送目标语音。若步骤1308与步骤1309由终端设备执行,则可以不执行本步骤。Optionally, if steps 1308 and 1309 are executed by the cloud device, then after acquiring the target voice, the cloud device sends the target voice to the terminal device. If steps 1308 and 1309 are executed by the terminal device, this step may not be executed.
可选地,终端设备在获取目标编辑语音或目标语音之后,对目标编辑语音或目标语音进行播放。Optionally, after acquiring the target editing voice or target voice, the terminal device plays the target editing voice or target voice.
一种可能实现的方式中,本申请实施例提供的语音处理方法可以包括:云端设备生成目标编辑语音,并向终端设备发送目标编辑语音,即该方法包括步骤1301至步骤1307。另一种可能实现的方式中,本申请实施例提供的语音处理方法可以包括:云端设备生成目标编辑语音,并根据目标编辑语音与非编辑语音生成目标语音,向终端设备发送目标语音。即该方法包括步骤1301至步骤1306、步骤1308至步骤1310。另一种可能实现的方式中,本申请实施例提供的语音处理方法可以包括:云端设备生成目标编辑语音,向终端设备发送目标编辑语音。终端设备在根据目标编辑语音与非编辑语音生成目标语音。即该方法包括步骤1301至步骤1309。In a possible implementation manner, the voice processing method provided by the embodiment of the present application may include: the cloud device generates the target editing voice and sends the target editing voice to the terminal device, that is, the method includes steps 1301 to 1307. In another possible implementation manner, the speech processing method provided by the embodiment of the present application may include: the cloud device generates the target edited voice, generates the target voice based on the target edited voice and the non-edited voice, and sends the target voice to the terminal device. That is, the method includes steps 1301 to 1306, and steps 1308 to 1310. In another possible implementation manner, the voice processing method provided by the embodiment of the present application may include: the cloud device generates the target editing voice, and sends the target editing voice to the terminal device. The terminal device generates the target voice based on the target edited voice and the non-edited voice. That is, the method includes steps 1301 to 1309.
In the embodiments of the present application, on the one hand, through the interaction between the cloud device and the terminal device, the cloud device can perform the complex computation to obtain the target edited speech or the target speech and return it to the terminal device, which reduces the computing power and storage space required of the terminal device. On the other hand, the target edited speech corresponding to the modified text can be generated from the speech features of the non-edited region of the original speech, and the target speech corresponding to the target text can then be generated together with the non-edited speech. In addition, the user can modify the text in the original text to obtain the target edited speech corresponding to the modified text (i.e. the second text), improving the user's experience of text-based speech editing. Furthermore, when the target speech is generated, the non-edited speech is not modified, and the pitch features of the target edited speech are similar to those of the non-edited speech, so that when listening to the original speech and the target speech, it is difficult for the user to hear a difference in speech characteristics between them.
上面对本申请实施例中的语音处理方法进行了描述,下面对本申请实施例中的语音处理设备进行描述,请参阅图14,本申请实施例中语音处理设备的一个实施例包括:The speech processing method in the embodiment of the present application is described above, and the speech processing device in the embodiment of the present application is described below. Please refer to Figure 14. An embodiment of the speech processing device in the embodiment of the present application includes:
Obtaining module 1401, configured to obtain an original speech and a second text, where the second text is the text in a target text other than a first text, both the target text and the original text corresponding to the original speech include the first text, and the speech corresponding to the first text in the original speech is the non-edited speech;
其中,关于获取模块1401的具体描述可以参照上述实施例中步骤701的描述,这里不再赘述。For a specific description of the acquisition module 1401, reference may be made to the description of step 701 in the above embodiment, which will not be described again here.
Pitch prediction module 1402, configured to predict a second pitch feature of the second text based on a first pitch feature of the non-edited speech and the information of the target text;
生成模块1403,用于根据所述第二音高特征以及所述第二文本,通过神经网络得到所述第二文本对应的第一语音特征;Generating module 1403, configured to obtain the first speech feature corresponding to the second text through a neural network according to the second pitch feature and the second text;
根据所述第一语音特征,生成所述第二文本对应的目标编辑语音。According to the first voice characteristics, a target editing voice corresponding to the second text is generated.
其中,关于生成模块1403的具体描述可以参照上述实施例中步骤703和704的描述,这里不再赘述。For detailed description of the generation module 1403, reference may be made to the description of steps 703 and 704 in the above embodiment, which will not be described again here.
在一种可能的实现中,所述原始语音的内容为用户的歌声。In one possible implementation, the content of the original voice is the user's singing voice.
In a possible implementation, the prediction based on the first pitch feature of the non-edited speech and the second text includes:
prediction based on the first pitch feature of the non-edited speech, the information of the target text, and a second speech feature of the non-edited speech, where the second speech feature carries at least one of the following pieces of information:
所述非编辑语音的部分语音帧或全部语音帧;Partial speech frames or all speech frames of the non-edited speech;
所述非编辑语音的声纹特征;The voiceprint characteristics of the non-edited speech;
所述非编辑语音的音色特征;The timbre characteristics of the non-edited voice;
所述非编辑语音的韵律特征;以及,The prosodic characteristics of the unedited speech; and,
所述非编辑语音的节奏特征。The rhythmic characteristics of the non-edited speech.
在一种可能的实现中,所述目标文本的信息,包括:所述目标文本中各个音素的文本嵌入(text embedding)。In a possible implementation, the information of the target text includes: text embedding of each phoneme in the target text.
In a possible implementation, the target text is a text obtained by inserting the second text into the first text; or, the target text is a text obtained by deleting a first part of the first text, and the second text is the text adjacent to the first part;
所述音高预测模块,具体用于:The pitch prediction module is specifically used for:
将所述非编辑语音的第一音高(pitch)特征以及所述目标文本的信息进行融合,以得到第一融合结果;Fusion of the first pitch feature of the non-edited speech and the information of the target text to obtain a first fusion result;
将所述第一融合结果输入到第二神经网络,得到所述第二文本的第二音高特征。The first fusion result is input into the second neural network to obtain the second pitch feature of the second text.
在一种可能的实现中,所述目标文本为将所述第一文本中的第二部分文本替换为所述第二文本得到的;In a possible implementation, the target text is obtained by replacing the second part of the text in the first text with the second text;
所述音高预测模块,具体用于:The pitch prediction module is specifically used for:
inputting the first pitch feature of the non-edited speech into a third neural network to obtain an initial pitch feature, where the initial pitch feature includes the pitch of each of multiple frames;
将所述目标文本的信息输入到第四神经网络,得到所述第二文本的发音特征,所述发音特征用于指示所述初始音高特征包括的多个帧中各个帧是否发音;Input the information of the target text into a fourth neural network to obtain the pronunciation feature of the second text, where the pronunciation feature is used to indicate whether each of the multiple frames included in the initial pitch feature is pronunciated;
将所述初始音高特征和所述发音特征进行融合,以得到所述第二文本的第二音高特征。The initial pitch feature and the pronunciation feature are fused to obtain the second pitch feature of the second text.
在一种可能的实现中,所述装置还包括:In a possible implementation, the device further includes:
a duration prediction module, configured to predict the number of frames of each phoneme in the second text based on the number of frames of each phoneme in the non-edited speech and the information of the target text.
所述第二音高特征,包括:所述目标编辑语音的多帧中的每一帧的音高特征。The second pitch feature includes: the pitch feature of each frame in the plurality of frames of the target edited voice.
在一种可能的实现中,所述时长预测模块,具体用于:In a possible implementation, the duration prediction module is specifically used to:
perform the prediction based on the number of frames of each phoneme in the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech.
在一种可能的实现中,所述获取模块还用于:In a possible implementation, the acquisition module is also used to:
获取所述第二文本在所述目标文本中的位置;Obtain the position of the second text in the target text;
所述生成模块,还用于基于所述位置拼接所述目标编辑语音与所述非编辑语音得到所述目标文本对应的目标语音。The generating module is further configured to splice the target edited voice and the non-edited voice based on the position to obtain a target voice corresponding to the target text.
请参阅图15,本申请实施例提供了另一种语音处理设备,为了便于说明,仅示出了与本申请实施例相关的部分,具体技术细节未揭示的,请参照本申请实施例方法部分。该语音处理设备可以为包括手机、平板电脑、个人数字助理(personal digital assistant,PDA)、销售终端设备(point of sales,POS)、车载电脑等任意终端设备,以语音处理设备为手机为例:Please refer to Figure 15. This embodiment of the present application provides another voice processing device. For convenience of explanation, only the parts related to the embodiment of the present application are shown. For specific technical details that are not disclosed, please refer to the method part of the embodiment of the present application. . The voice processing device can be any terminal device including a mobile phone, tablet computer, personal digital assistant (PDA), point of sales (POS), vehicle-mounted computer, etc. Taking the voice processing device as a mobile phone as an example:
图15示出的是与本申请实施例提供的语音处理设备相关的手机的部分结构的框图。参考图15,手机包括:射频(radio frequency,RF)电路1510、存储器1520、输入单元1530、显示单元1540、传感器1550、音频电路1560、无线保真(wireless fidelity,WiFi)模块1570、处理器1580、以及电源1590等部件。本领域技术人员可以理解,图15中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。FIG. 15 shows a block diagram of a partial structure of a mobile phone related to the voice processing device provided by an embodiment of the present application. Referring to Figure 15, the mobile phone includes: radio frequency (RF) circuit 1510, memory 1520, input unit 1530, display unit 1540, sensor 1550, audio circuit 1560, wireless fidelity (WiFi) module 1570, processor 1580 , and power supply 1590 and other components. Those skilled in the art can understand that the structure of the mobile phone shown in FIG. 15 does not constitute a limitation on the mobile phone, and may include more or fewer components than shown in the figure, or combine certain components, or arrange different components.
下面结合图15对手机的各个构成部件进行具体的介绍:The following is a detailed introduction to each component of the mobile phone in conjunction with Figure 15:
RF电路1510可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给处理器1580处理;另外,将设计上行的数据发送给基站。通常,RF电路1510包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(low noise amplifier,LNA)、双工器等。此外,RF电路1510还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(global system of mobile communication,GSM)、通用分组无线服务(general packet radio service,GPRS)、码分多址(code division multiple access,CDMA)、宽带码分多址(wideband code division multiple access,WCDMA)、长期演进(long term evolution,LTE)、电子邮件、短消息服务(short messaging service,SMS)等。The RF circuit 1510 can be used to receive and transmit information or signals during a call. In particular, after receiving downlink information from the base station, it is processed by the processor 1580; in addition, the designed uplink data is sent to the base station. Typically, the RF circuit 1510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, etc. Additionally, RF circuitry 1510 can communicate with networks and other devices through wireless communications. The above wireless communication can use any communication standard or protocol, including but not limited to global system of mobile communication (GSM), general packet radio service (GPRS), code division multiple access (code division) multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short messaging service (SMS), etc.
The memory 1520 can be used to store software programs and modules. The processor 1580 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1520. The memory 1520 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data or a phone book), and the like. In addition, the memory 1520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The input unit 1530 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1530 may include a touch panel 1531 and other input devices 1532. The touch panel 1531, also known as a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 1531 with a finger, a stylus or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 1531 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends it to the processor 1580, and can receive commands sent by the processor 1580 and execute them. In addition, the touch panel 1531 can be implemented in various types such as resistive, capacitive, infrared and surface acoustic wave. In addition to the touch panel 1531, the input unit 1530 may also include other input devices 1532. Specifically, the other input devices 1532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 1540 may be used to display information input by the user or information provided to the user, as well as various menus of the mobile phone. The display unit 1540 may include a display panel 1541. Optionally, the display panel 1541 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 1531 may cover the display panel 1541. When the touch panel 1531 detects a touch operation on or near it, the touch panel 1531 transmits the operation to the processor 1580 to determine the type of the touch event, and the processor 1580 then provides corresponding visual output on the display panel 1541 according to the type of the touch event. Although in FIG. 15 the touch panel 1531 and the display panel 1541 are shown as two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 1531 and the display panel 1541 may be integrated to implement the input and output functions of the mobile phone.
The mobile phone may further include at least one sensor 1550, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor may adjust the brightness of the display panel 1541 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1541 and/or the backlight when the mobile phone is moved close to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in various directions (generally on three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the posture of the mobile phone (such as switching between landscape and portrait modes, related games, and magnetometer posture calibration) and in vibration-recognition related functions (such as a pedometer and tap detection). Other sensors that may also be configured on the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described in detail here.
The audio circuit 1560, a speaker 1561, and a microphone 1562 may provide an audio interface between the user and the mobile phone. The audio circuit 1560 may transmit an electrical signal converted from received audio data to the speaker 1561, and the speaker 1561 converts the electrical signal into a sound signal for output. Conversely, the microphone 1562 converts a collected sound signal into an electrical signal, which is received by the audio circuit 1560 and converted into audio data. The audio data is then output to the processor 1580 for processing and sent, for example, to another mobile phone through the RF circuit 1510, or output to the memory 1520 for further processing.
WiFi is a short-distance wireless transmission technology. Through the WiFi module 1570, the mobile phone can help the user send and receive emails, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although FIG. 15 shows the WiFi module 1570, it can be understood that the module is not a necessary component of the mobile phone.
The processor 1580 is the control center of the mobile phone. It connects all parts of the mobile phone through various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 1520 and calling the data stored in the memory 1520, thereby monitoring the mobile phone as a whole. Optionally, the processor 1580 may include one or more processing units. Preferably, the processor 1580 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 1580.
The mobile phone further includes a power supply 1590 (such as a battery) that supplies power to the components. Preferably, the power supply may be logically connected to the processor 1580 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a Bluetooth module, and the like, which are not described in detail here.
In this embodiment of the present application, the processor 1580 included in the terminal device may perform the functions of the speech processing device in the embodiment corresponding to FIG. 7a, or perform the functions of the terminal device in the embodiment shown in FIG. 13; details are not described here again.
Refer to FIG. 16, which is a schematic structural diagram of another speech processing device provided by this application. The speech processing device may be a cloud device. The cloud device may include a processor 1601, a memory 1602, and a communication interface 1603. The processor 1601, the memory 1602, and the communication interface 1603 are interconnected through lines, and the memory 1602 stores program instructions and data.
The memory 1602 stores the program instructions and data corresponding to the steps performed by the speech processing device in the embodiment corresponding to FIG. 7a, or stores the program instructions and data corresponding to the steps performed by the cloud device in the embodiment corresponding to FIG. 13.
The processor 1601 is configured to perform the steps performed by the speech processing device in any of the embodiments shown in FIG. 7a, or to perform the steps performed by the cloud device in any of the embodiments shown in FIG. 13.
The communication interface 1603 may be used to receive and send data, and to perform the steps related to obtaining, sending, and receiving in any of the embodiments shown in FIG. 7a or FIG. 13.
In one implementation, the cloud device may include more or fewer components than those shown in FIG. 16. This is merely an example in this application and is not a limitation.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
When the integrated unit is implemented using software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, through a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or in a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
The terms "first", "second", and the like in the specification, claims, and accompanying drawings of this application are used to distinguish between similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the terms so used are interchangeable in appropriate circumstances; this is merely the manner used to distinguish between objects with the same attributes when describing the embodiments of this application. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not explicitly listed or that are inherent to such a process, method, product, or device.

Claims (24)

  1. A speech processing method, characterized in that the method comprises:
    obtaining an original speech and a second text, wherein the second text is text in a target text other than a first text, both the target text and an original text corresponding to the original speech comprise the first text, and the speech corresponding to the first text in the original speech is non-edited speech;
    predicting a second pitch feature of the second text according to a first pitch feature of the non-edited speech and information of the target text;
    obtaining, through a neural network, a first speech feature corresponding to the second text according to the second pitch feature and the second text; and
    generating, according to the first speech feature, a target edited speech corresponding to the second text.
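The following is an illustrative, non-limiting sketch in Python (PyTorch) of the data flow recited in claim 1: the pitch of the non-edited speech and the target-text information drive a pitch predictor, the predicted pitch and the second text drive an acoustic model that yields the first speech feature, and a vocoder stand-in turns that feature into a waveform. Every module, dimension, and name below is an assumption introduced only for explanation; the claim does not prescribe any concrete architecture, and frame alignment between the inputs is glossed over.

    import torch
    import torch.nn as nn

    TEXT_DIM, N_MELS = 256, 80

    # Placeholder networks for the claimed steps; any architecture could be substituted.
    pitch_predictor = nn.Sequential(nn.Linear(1 + TEXT_DIM, 128), nn.ReLU(),
                                    nn.Linear(128, 1))       # predicts the second pitch feature
    acoustic_model = nn.Sequential(nn.Linear(1 + TEXT_DIM, 128), nn.ReLU(),
                                   nn.Linear(128, N_MELS))   # yields the first speech feature
    vocoder_stub = nn.Linear(N_MELS, 1)                      # stands in for a real vocoder

    def edit_speech(unedited_pitch, target_text_info, second_text_emb):
        # unedited_pitch:   (T, 1)        per-frame pitch of the non-edited speech
        # target_text_info: (T, TEXT_DIM) e.g. phoneme embeddings of the target text
        # second_text_emb:  (T, TEXT_DIM) embeddings of the second (edited-in) text
        second_pitch = pitch_predictor(
            torch.cat([unedited_pitch, target_text_info], dim=-1))
        first_speech_feature = acoustic_model(
            torch.cat([second_pitch, second_text_emb], dim=-1))
        return vocoder_stub(first_speech_feature).squeeze(-1)  # pseudo-waveform of shape (T,)

    if __name__ == "__main__":
        T = 120
        wav = edit_speech(torch.rand(T, 1), torch.rand(T, TEXT_DIM), torch.rand(T, TEXT_DIM))
        print(wav.shape)  # torch.Size([120])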
  2. The method according to claim 1, characterized in that the content of the original speech is singing of a user.
  3. The method according to claim 1 or 2, characterized in that the processing according to the first pitch feature of the non-edited speech and the second text comprises:
    processing according to the first pitch feature of the non-edited speech, the information of the target text, and a second speech feature of the non-edited speech, wherein the second speech feature carries at least one of the following information:
    some or all speech frames of the non-edited speech;
    a voiceprint feature of the non-edited speech;
    a timbre feature of the non-edited speech;
    a prosody feature of the non-edited speech; and
    a rhythm feature of the non-edited speech.
  4. The method according to any one of claims 1 to 3, characterized in that the information of the target text comprises:
    a text embedding of each phoneme in the target text.
  5. The method according to any one of claims 1 to 4, characterized in that the target text is a text obtained by inserting the second text into the first text, or the target text is a text obtained by deleting a first part of the first text, and the second text is text adjacent to the first part;
    the predicting a second pitch feature of the second text according to the first pitch feature of the non-edited speech and the information of the target text comprises:
    fusing the first pitch feature of the non-edited speech and the information of the target text to obtain a first fusion result; and
    inputting the first fusion result into a second neural network to obtain the second pitch feature of the second text.
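A non-limiting sketch of the claim-5 branch (insertion or deletion edits): the first pitch feature of the non-edited speech and the target-text information are fused, here simply by concatenation, and the fusion result is fed to a single recurrent network standing in for the "second neural network". The concatenation-based fusion, the GRU, and all dimensions are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class FusedPitchPredictor(nn.Module):
        def __init__(self, text_dim=256, hidden=128):
            super().__init__()
            self.rnn = nn.GRU(1 + text_dim, hidden, batch_first=True)  # "second neural network"
            self.out = nn.Linear(hidden, 1)

        def forward(self, unedited_pitch, target_text_info):
            # unedited_pitch: (B, T, 1); target_text_info: (B, T, text_dim),
            # assumed here to share a common frame grid.
            first_fusion_result = torch.cat([unedited_pitch, target_text_info], dim=-1)
            h, _ = self.rnn(first_fusion_result)
            return self.out(h)  # (B, T, 1) second pitch feature

    second_pitch = FusedPitchPredictor()(torch.rand(1, 80, 1), torch.rand(1, 80, 256))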
  6. The method according to any one of claims 1 to 5, characterized in that the target text is obtained by replacing a second part of the first text with the second text;
    the predicting a second pitch feature of the second text according to the first pitch feature of the non-edited speech and the information of the target text comprises:
    inputting the first pitch feature of the non-edited speech into a third neural network to obtain an initial pitch feature, wherein the initial pitch feature comprises a pitch of each of a plurality of frames;
    inputting the information of the target text into a fourth neural network to obtain a pronunciation feature of the second text, wherein the pronunciation feature is used to indicate whether each of the plurality of frames included in the initial pitch feature is pronounced; and
    fusing the initial pitch feature and the pronunciation feature to obtain the second pitch feature of the second text.
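A non-limiting sketch of the claim-6 branch (replacement edits): a "third neural network" extends the first pitch feature into a per-frame initial pitch, a "fourth neural network" predicts from the target-text information whether each frame is pronounced (voiced), and the two outputs are fused, here by zeroing the pitch of unvoiced frames. The masking rule and all dimensions are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class InitialPitchNet(nn.Module):                  # "third neural network"
        def __init__(self, hidden=128):
            super().__init__()
            self.rnn = nn.GRU(1, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)

        def forward(self, first_pitch_feature):        # (B, T, 1)
            h, _ = self.rnn(first_pitch_feature)
            return self.out(h)                         # (B, T, 1) initial pitch

    class PronunciationNet(nn.Module):                 # "fourth neural network"
        def __init__(self, text_dim=256, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

        def forward(self, target_text_info):           # (B, T, text_dim)
            return torch.sigmoid(self.net(target_text_info))  # per-frame voicing probability

    def second_pitch_feature(first_pitch_feature, target_text_info):
        initial_pitch = InitialPitchNet()(first_pitch_feature)
        voiced = PronunciationNet()(target_text_info) > 0.5
        return initial_pitch * voiced.float()          # fusion: unvoiced frames get zero pitch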
  7. The method according to any one of claims 1 to 6, characterized in that the method further comprises:
    predicting the number of frames of each phoneme in the second text according to the number of frames of each phoneme in the non-edited speech and the information of the target text.
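A non-limiting sketch of the duration prediction recited in claim 7: given the observed per-phoneme frame counts of the non-edited speech and embeddings of the target-text phonemes, a small network regresses the number of frames for each phoneme of the second text. Predicting in the log domain and rounding to at least one frame are assumptions, not requirements of the claim.

    import torch
    import torch.nn as nn

    class DurationPredictor(nn.Module):
        def __init__(self, text_dim=256, hidden=128):
            super().__init__()
            self.rnn = nn.GRU(text_dim + 1, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)

        def forward(self, phoneme_emb, observed_frames):
            # phoneme_emb:     (B, P, text_dim) embeddings of the target-text phonemes
            # observed_frames: (B, P, 1) known frame counts from the non-edited speech,
            #                  zero for the phonemes of the second text (to be predicted)
            x = torch.cat([phoneme_emb, observed_frames.float()], dim=-1)
            h, _ = self.rnn(x)
            log_dur = self.out(h)
            return torch.clamp(torch.round(torch.exp(log_dur)), min=1)  # frames per phoneme

    frames = DurationPredictor()(torch.rand(1, 12, 256), torch.zeros(1, 12, 1))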
  8. The method according to any one of claims 1 to 7, characterized in that the first pitch feature comprises a pitch feature of each of a plurality of frames of the non-edited speech; and
    the second pitch feature comprises a pitch feature of each of a plurality of frames of the target edited speech.
  9. The method according to claim 7 or 8, characterized in that the predicting according to the number of frames of each phoneme in the non-edited speech and the information of the target text comprises:
    predicting according to the number of frames of each phoneme in the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech.
  10. The method according to any one of claims 1 to 9, characterized in that the method further comprises:
    obtaining a position of the second text in the target text; and
    splicing, based on the position, the target edited speech and the non-edited speech to obtain a target speech corresponding to the target text.
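A non-limiting sketch of the splicing recited in claim 10, assuming the position of the second text has already been mapped to sample indices of the non-edited speech; a practical system would typically also cross-fade at the boundaries, which is omitted here.

    import numpy as np

    def splice(non_edited: np.ndarray, target_edited: np.ndarray,
               start: int, end: int) -> np.ndarray:
        # Replace non_edited[start:end] with the generated target edited speech.
        # For a pure insertion pass end == start; for a pure deletion pass an
        # empty target_edited array.
        return np.concatenate([non_edited[:start], target_edited, non_edited[end:]])

    # Example: insert a 0.5 s generated segment at the 1.0 s mark of a 16 kHz recording.
    sr = 16000
    original = np.zeros(3 * sr, dtype=np.float32)
    generated = np.random.randn(sr // 2).astype(np.float32)
    target_speech = splice(original, generated, start=sr, end=sr)
    assert target_speech.shape[0] == 3 * sr + sr // 2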
  11. A speech processing apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain an original speech and a second text, wherein the second text is text in a target text other than a first text, both the target text and an original text corresponding to the original speech comprise the first text, and the speech corresponding to the first text in the original speech is non-edited speech;
    a pitch prediction module, configured to predict a second pitch feature of the second text according to a first pitch feature of the non-edited speech and information of the target text; and
    a generation module, configured to obtain, through a neural network, a first speech feature corresponding to the second text according to the second pitch feature and the second text, and
    generate, according to the first speech feature, a target edited speech corresponding to the second text.
  12. The apparatus according to claim 11, characterized in that the content of the original speech is singing of a user.
  13. The apparatus according to claim 11 or 12, characterized in that the processing according to the first pitch feature of the non-edited speech and the second text comprises:
    processing according to the first pitch feature of the non-edited speech, the information of the target text, and a second speech feature of the non-edited speech, wherein the second speech feature carries at least one of the following information:
    some or all speech frames of the non-edited speech;
    a voiceprint feature of the non-edited speech;
    a timbre feature of the non-edited speech;
    a prosody feature of the non-edited speech; and
    a rhythm feature of the non-edited speech.
  14. The apparatus according to any one of claims 11 to 13, characterized in that the information of the target text comprises a text embedding of each phoneme in the target text.
  15. The apparatus according to any one of claims 11 to 14, characterized in that the target text is a text obtained by inserting the second text into the first text, or the target text is a text obtained by deleting a first part of the first text, and the second text is text adjacent to the first part;
    the pitch prediction module is specifically configured to:
    fuse the first pitch feature of the non-edited speech and the information of the target text to obtain a first fusion result; and
    input the first fusion result into a second neural network to obtain the second pitch feature of the second text.
  16. The apparatus according to any one of claims 11 to 14, characterized in that the target text is obtained by replacing a second part of the first text with the second text;
    the pitch prediction module is specifically configured to:
    input the first pitch feature of the non-edited speech into a third neural network to obtain an initial pitch feature, wherein the initial pitch feature comprises a pitch of each of a plurality of frames;
    input the information of the target text into a fourth neural network to obtain a pronunciation feature of the second text, wherein the pronunciation feature is used to indicate whether each of the plurality of frames included in the initial pitch feature is pronounced; and
    fuse the initial pitch feature and the pronunciation feature to obtain the second pitch feature of the second text.
  17. The apparatus according to any one of claims 11 to 16, characterized in that the apparatus further comprises:
    a duration prediction module, configured to predict the number of frames of each phoneme in the second text according to the number of frames of each phoneme in the non-edited speech and the information of the target text.
  18. The apparatus according to any one of claims 11 to 17, characterized in that the first pitch feature comprises a pitch feature of each of a plurality of frames of the non-edited speech; and
    the second pitch feature comprises a pitch feature of each of a plurality of frames of the target edited speech.
  19. The apparatus according to claim 17 or 18, characterized in that the duration prediction module is specifically configured to:
    perform the prediction according to the number of frames of each phoneme in the non-edited speech, the information of the target text, and the second speech feature of the non-edited speech.
  20. The apparatus according to any one of claims 11 to 19, characterized in that the obtaining module is further configured to:
    obtain a position of the second text in the target text; and
    the generation module is further configured to splice, based on the position, the target edited speech and the non-edited speech to obtain a target speech corresponding to the target text.
  21. A speech processing device, characterized by comprising a processor, wherein the processor is coupled to a memory, the memory is configured to store programs or instructions, and when the programs or instructions are executed by the processor, the speech processing device is enabled to perform the method according to any one of claims 1 to 10.
  22. The device according to claim 21, characterized in that the device further comprises:
    an input unit, configured to receive the second text; and
    an output unit, configured to play the target edited speech corresponding to the second text or the target speech corresponding to the target text.
  23. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions, and when the instructions are executed on a computer, the computer is enabled to perform the method according to any one of claims 1 to 10.
  24. A computer program product, characterized in that, when the computer program product is executed on a computer, the computer is enabled to perform the method according to any one of claims 1 to 10.
PCT/CN2023/086497 2022-04-29 2023-04-06 Speech processing method and related device WO2023207541A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210468926.8 2022-04-29
CN202210468926.8A CN114882862A (en) 2022-04-29 2022-04-29 Voice processing method and related equipment

Publications (1)

Publication Number Publication Date
WO2023207541A1 true WO2023207541A1 (en) 2023-11-02

Family

ID=82673378

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/086497 WO2023207541A1 (en) 2022-04-29 2023-04-06 Speech processing method and related device

Country Status (2)

Country Link
CN (1) CN114882862A (en)
WO (1) WO2023207541A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882862A (en) * 2022-04-29 2022-08-09 华为技术有限公司 Voice processing method and related equipment
CN116189654A (en) * 2023-02-23 2023-05-30 京东科技信息技术有限公司 Voice editing method and device, electronic equipment and storage medium
CN117153144B (en) * 2023-10-31 2024-02-06 杭州宇谷科技股份有限公司 Battery information voice broadcasting method and device based on terminal calculation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006349787A (en) * 2005-06-14 2006-12-28 Hitachi Information & Control Solutions Ltd Method and device for synthesizing voices
JP2011170191A (en) * 2010-02-19 2011-09-01 Fujitsu Ltd Speech synthesis device, speech synthesis method and speech synthesis program
CN111899706A (en) * 2020-07-30 2020-11-06 广州酷狗计算机科技有限公司 Audio production method, device, equipment and storage medium
CN113421547A (en) * 2021-06-03 2021-09-21 华为技术有限公司 Voice processing method and related equipment
CN113808555A (en) * 2021-09-17 2021-12-17 广州酷狗计算机科技有限公司 Song synthesis method and device, equipment, medium and product thereof
CN113920977A (en) * 2021-09-30 2022-01-11 宿迁硅基智能科技有限公司 Speech synthesis model, model training method and speech synthesis method
CN114882862A (en) * 2022-04-29 2022-08-09 华为技术有限公司 Voice processing method and related equipment


Also Published As

Publication number Publication date
CN114882862A (en) 2022-08-09

Similar Documents

Publication Publication Date Title
CN112487182B (en) Training method of text processing model, text processing method and device
CN110782870B (en) Speech synthesis method, device, electronic equipment and storage medium
CN110490213B (en) Image recognition method, device and storage medium
CN110853618B (en) Language identification method, model training method, device and equipment
CN111048062B (en) Speech synthesis method and apparatus
WO2023207541A1 (en) Speech processing method and related device
CN113421547B (en) Voice processing method and related equipment
CN111179962B (en) Training method of voice separation model, voice separation method and device
CN111933115B (en) Speech recognition method, apparatus, device and storage medium
KR102346046B1 (en) 3d virtual figure mouth shape control method and device
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
KR20210007786A (en) Vision-assisted speech processing
CN112347795A (en) Machine translation quality evaluation method, device, equipment and medium
CN112069309A (en) Information acquisition method and device, computer equipment and storage medium
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
CN113822076A (en) Text generation method and device, computer equipment and storage medium
CN115688937A (en) Model training method and device
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
WO2022057759A1 (en) Voice conversion method and related device
CN113948060A (en) Network training method, data processing method and related equipment
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
KR20230120790A (en) Speech Recognition Healthcare Service Using Variable Language Model
CN114333772A (en) Speech recognition method, device, equipment, readable storage medium and product
CN113822084A (en) Statement translation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23794974

Country of ref document: EP

Kind code of ref document: A1