WO2022253061A1 - Speech processing method and related device - Google Patents

Speech processing method and related device

Info

Publication number
WO2022253061A1
WO2022253061A1 (PCT/CN2022/094838; CN2022094838W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
speech
voice
speech feature
feature
Prior art date
Application number
PCT/CN2022/094838
Other languages
English (en)
French (fr)
Inventor
邓利群
谭达新
郑念祖
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP22815107.2A priority Critical patent/EP4336490A1/en
Publication of WO2022253061A1 publication Critical patent/WO2022253061A1/zh
Priority to US18/524,208 priority patent/US20240105159A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 - Prosody rules derived from text; Stress or intonation
    • G10L 15/00 - Speech recognition
    • G10L 15/04 - Segmentation; Word boundary detection
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants
    • G10L 21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Definitions

  • the embodiments of the present application relate to the fields of artificial intelligence and audio applications, and in particular to a speech processing method and related equipment.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is the branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that respond in ways similar to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • voice editing has very important practical significance. For example, when a user records a short video or a teacher records a lecture voice, some content in the voice is often wrong due to a slip of the tongue. In this case, voice editing can help users or teachers to conveniently and quickly correct the erroneous content in the original voice and generate corrected voice.
  • the commonly used speech editing method is to construct a database containing a large number of speech fragments in advance, obtain the fragments of pronunciation units from the database, and replace the wrong fragments in the original speech with the fragments, and then generate corrected speech.
  • the above-mentioned voice editing method relies on the diversity of voice segments in the database, and when there are few voice segments in the database, the corrected voice will have a poor sense of hearing.
  • Embodiments of the present application provide a voice processing method and related equipment, which can realize that the sense of hearing of the edited voice is similar to that of the original voice, thereby improving user experience.
  • the first aspect of the embodiments of the present application provides a voice processing method, which can be applied to scenarios such as a user recording a short video, a teacher recording a lecture voice, and the like.
  • the method may be executed by the speech processing device, or may be executed by components of the speech processing device (such as a processor, a chip, or a chip system, etc.).
  • the speech processing device may be a terminal device or a cloud device, and the method includes: obtaining the original speech and the second text, where the second text is text in the target text other than the first text, the target text and the original text corresponding to the original speech both include the first text, and the speech corresponding to the first text in the original speech is non-edited speech; obtaining a first speech feature based on the non-edited speech; obtaining a second speech feature corresponding to the second text through a neural network based on the first speech feature and the second text; and generating target edited speech corresponding to the second text based on the second speech feature.
  • the prosody, timbre and/or signal-to-noise ratio of the first speech feature can be the same as or similar to those of the second speech feature; prosody can reflect the speaker's emotional state or speech form, and generally refers to intonation, pitch, stress emphasis, pauses, or rhythm.
  • the second text may be obtained directly; or position information may be obtained first (it can also be understood as marking information, used to indicate the position of the second text in the target text) and the second text obtained from the position information and the target text; or the target text and the original text may be obtained (or the target text and the original speech, with the original text obtained by recognizing the original speech), and the second text is then determined based on the original text and the target text.
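  • As a minimal illustrative sketch of the last option (deriving the second text from the original text and the target text), a plain text diff can mark the spans of the target text that do not overlap the original text; the function name and the character-level comparison below are assumptions for illustration, not the method defined by this application:

        import difflib

        def find_edited_text(original_text, target_text):
            # spans of target_text not covered by the overlapping (first) text,
            # together with their positions in target_text
            sm = difflib.SequenceMatcher(a=original_text, b=target_text)
            return [(target_text[j1:j2], j1)
                    for tag, i1, i2, j1, j2 in sm.get_opcodes()
                    if tag in ("replace", "insert")]

        print(find_edited_text("we meet at nine", "we meet at ten"))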
  • the second speech feature corresponding to the second text in the target text is obtained through the first speech feature of the first text in the original speech; that is, the second speech feature of the second text in the target text is generated by referring to the first speech feature of the first text in the original text, so that the sense of hearing of the target edited speech is similar to that of the original speech.
  • the above step of obtaining the original speech and the second text includes: receiving the original speech and the second text sent by the terminal device; the method further includes: sending the target edited speech to the terminal device, where the target edited speech is used by the terminal device to generate the target speech corresponding to the target text. This can also be understood as an interactive scenario.
  • the cloud device performs complex calculation operations, and the terminal device performs simple splicing operations.
  • the original voice and the second text are obtained from the terminal device. After the cloud device generates the target editing voice, it sends it to the terminal device.
  • the target edited voice is spliced by the terminal device to obtain the target voice.
  • when the speech processing device is a cloud device, on the one hand, through the interaction between the cloud device and the terminal device, the cloud device can perform complex calculations to obtain the target edited speech and return it to the terminal device, reducing the computing power and storage space required of the terminal device.
  • on the other hand, the target edited speech corresponding to the modified text can be generated according to the speech features of the non-edited region in the original speech, and the target speech corresponding to the target text can then be generated from the non-edited speech and the target edited speech.
  • the above step: obtaining the original voice and the second text includes: receiving the original voice and the target text sent by the terminal device; the method further includes: based on the non-edited voice and the The target editing voice generates a target voice corresponding to the target text, and sends the target voice to the terminal device.
  • from the original speech and the target text sent by the terminal device, the non-edited speech can be obtained, the second speech feature corresponding to the second text is generated according to the first speech feature of the non-edited speech, the target edited speech is then obtained through the vocoder, and the target edited speech and the non-edited speech are spliced to generate the target speech. This is equivalent to performing all the processing in the speech processing device and returning the result to the terminal device.
  • the cloud device performs complex calculations to obtain the target voice and returns it to the terminal device, which can reduce the computing power and storage space of the terminal device.
  • the above step of: obtaining the original voice and the second text includes: receiving an editing request from the user, where the editing request includes the original voice and the second text.
  • the edit request includes the original speech and the target text.
  • the target text can be understood as the text corresponding to the voice that the user wants to generate.
  • the user can obtain the target edited speech corresponding to the modified text (i.e., the second text) by modifying text in the original text, which improves the user's editing experience for text-based speech editing.
  • the above steps further include: acquiring the position of the second text in the target text; splicing the target editing voice and non-editing voice based on the position to obtain the target voice corresponding to the target text . It can also be understood as replacing the edited voice in the original voice with the target edited voice, where the edited voice is a voice other than the non-edited voice in the original voice.
  • the target edited speech and the non-edited speech may be spliced according to the position of the second text in the target text. If the first text is all overlapping texts in the original text and the target text, then the voice of the desired text (i.e. the target text) can be generated without changing the non-editing voice in the original voice.
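  • As a minimal illustrative sketch of this splicing (assuming the waveforms are available as NumPy arrays and that the position simply tells us which non-edited segments sit to the left and right of the edited region), the operation can be as simple as concatenation; the function and variable names are assumptions for illustration:

        import numpy as np

        def splice(left_non_edited, target_edited, right_non_edited):
            # target speech = non-edited speech around the position of the second text,
            # with the target edited speech inserted in between
            return np.concatenate([left_non_edited, target_edited, right_non_edited])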
  • the above step: obtaining the first speech feature based on the non-edited speech includes: obtaining at least one speech frame in the non-edited speech; obtaining the first speech feature based on the at least one speech frame A speech feature, the first speech feature is used to represent the feature of at least one speech frame, and the first speech feature is a feature vector or sequence.
  • the target speech can also be obtained (similarly to the above); to ensure a smoother transition between the non-edited speech and the target edited speech, when multiple speech frames are used, the text corresponding to the selected speech frames may be text adjacent to the second text.
  • the first speech feature is obtained from the speech frames in the non-edited speech, which can make the generated target edited speech have the same or similar speech features as the non-edited speech, reducing the difference in the sense of hearing between the original speech and the target edited speech.
  • the text corresponding to the selected speech frame may be similar to the second text, so that when the target speech is generated, the transition between the target edited speech and the non-edited speech is smoother.
  • speech features may also be embodied in a non-physical quantity manner, for example, in a sequence or vector manner.
  • the text corresponding to at least one speech frame in the above steps is text adjacent to the second text in the first text. That is, the non-edited speech corresponding to the first speech feature is adjacent to the non-edited speech in the target speech.
  • the speech feature of the second text is generated by using the first speech feature of the context of the second text, so that the second speech feature better incorporates the first speech feature of the context. That is, by predicting the speech corresponding to the second text from the speech frames of its context, the speech feature of the second text can be made close to that of the context speech frames, so that the target edited speech of the second text sounds similar to the original speech.
  • the above step of obtaining the second speech feature corresponding to the second text through the neural network based on the first speech feature and the second text includes: obtaining the second speech feature corresponding to the second text through the neural network based on the first speech feature, the target text, and marking information, where the marking information is used to mark the second text in the target text.
  • the marking information can also be understood as position information, which is used to indicate the position of the second text in the target text.
  • the entire target text can be referred to, which avoids the case in which the subsequently generated target edited speech, once spliced with the non-edited speech of the original speech, sounds disjointed because the generated speech did not take the target text as a whole into account.
  • the neural network includes an encoder and a decoder, and obtaining the second speech feature corresponding to the second text through the neural network based on the first speech feature and the second text includes: obtaining a first vector corresponding to the second text through the encoder based on the second text; and obtaining the second speech feature through the decoder based on the first vector and the first speech feature. This can also be understood as inputting the first vector and the first speech feature into the decoder to obtain the second speech feature.
  • the decoder decodes the first vector based on the first speech feature, so that the generated second speech feature is similar to the first speech feature; in other words, the generated second speech feature carries features similar to those in the first speech feature (such as prosody, timbre, and/or signal-to-noise ratio).
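  • A minimal sketch of this conditioning idea, assuming the encoder output and the first speech feature are plain NumPy arrays and using made-up projection matrices (the actual encoder and decoder of this application are neural networks whose internals are not specified here):

        import numpy as np

        def encode(second_text_ids, embedding):
            # encoder: second text (phoneme/character ids) -> first vector, shape (T, d)
            return embedding[second_text_ids]

        def decode(first_vector, first_speech_feature, W_text, W_ctx, b):
            # the decoder conditions the text representation on the acoustic context, so the
            # predicted second speech feature resembles the first speech feature
            ctx = np.tile(first_speech_feature @ W_ctx, (first_vector.shape[0], 1))
            return np.tanh(first_vector @ W_text + ctx + b)   # second speech feature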
  • the above step of obtaining the first vector corresponding to the second text through the encoder based on the second text includes: obtaining the first vector corresponding to the second text through the encoder based on the target text. This can also be understood as inputting the target text and position information into the encoder to obtain the first vector, where the position information is used to indicate the position of the second text in the target text.
  • the target text where the second text is located is introduced during the encoding process of the encoder, so that the first vector of the generated second text refers to the target text, so that the second text described by the first vector is more accurate.
  • the above steps further include: predicting the first duration and the second duration through the prediction network based on the target text, the first duration being the corresponding time length of the first text in the target text phoneme duration, the second duration is the corresponding phoneme duration of the second text in the target text; modify the second duration based on the first duration and the third duration to obtain the first modified duration, and the third duration is the first text in the original speech phoneme duration; based on the first vector and the first speech feature, the second speech feature is obtained through the decoder, including: based on the first vector, the first speech feature and the first modified duration, the second speech feature is obtained through the decoder .
  • the duration of the target edited speech may be corrected.
  • the above step of modifying the second duration based on the first duration and the third duration to obtain the first modified duration includes: calculating the ratio of the third duration to the first duration; and obtaining the first modified duration based on the ratio and the second duration.
  • the second duration is corrected by using a ratio of the third duration to the first duration.
  • the degree of consistency between the duration of the target edited speech corresponding to the second text and the speech rate of the non-edited speech can be improved.
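  • A small numeric sketch of this correction, assuming per-phoneme durations in seconds (the variable names are illustrative only): if the predicted duration of the first text is 2.0 s and its actual duration in the original speech is 2.4 s, the ratio is 1.2, and every predicted phoneme duration of the second text is stretched by that factor:

        def modify_duration(first_duration, second_duration, third_duration):
            # ratio = actual duration of the first text in the original speech (third duration)
            #         divided by the predicted duration of the first text (first duration)
            ratio = third_duration / first_duration
            return [d * ratio for d in second_duration]   # first modified duration

        print(modify_duration(2.0, [0.12, 0.08, 0.20], 2.4))   # roughly [0.144, 0.096, 0.24]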
  • the above step of obtaining the second speech feature through the decoder based on the first vector, the first speech feature and the first modified duration includes: upsampling the first vector based on the first modified duration to obtain a second vector; and obtaining the second speech feature through the decoder based on the second vector and the first speech feature.
  • the second vector and the first speech feature are input into the decoder to obtain the second speech feature.
  • the decoder includes multiple coding units connected in series; the second vector and the first speech feature may be input to the same coding unit or to different coding units.
  • upsampling the first vector through the first modified duration can also be understood as expanding the first vector using the first modified duration to obtain the second vector, so that the duration of the target edited speech is consistent with the speech rate of the non-edited speech.
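  • A minimal sketch of this length-regulation step, assuming the first vector has one row per phoneme and the first modified duration has already been converted into an integer number of frames per phoneme (the names are illustrative):

        import numpy as np

        def upsample(first_vector, frames_per_phoneme):
            # first_vector: (num_phonemes, d); frames_per_phoneme: list of ints
            # each phoneme vector is repeated for as many frames as its modified duration
            return np.repeat(first_vector, frames_per_phoneme, axis=0)   # second vector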
  • the above steps further include: predicting a fourth duration based on the second text through a prediction network, where the fourth duration is the total duration of all phonemes corresponding to the second text; obtaining the original Speech rate of speech; modify the fourth duration based on the speech rate to obtain the second modified duration; based on the first vector and the first speech feature, obtain the second speech feature through the decoder, including: based on the first vector, the first speech feature With the second modified duration, the second speech feature is obtained through the decoder.
  • the speech rate of the original speech is used to adjust the duration of the speech frames corresponding to the second text, which can increase the consistency between the duration of the target edited speech corresponding to the second text and the speech rate of the non-edited speech.
  • the above step of acquiring the second speech feature through the decoder based on the first vector and the first speech feature includes: decoding the first vector in the forward order or the reverse order of the target text based on the decoder and the first speech feature to obtain the second speech feature. For example, if the target text is the Chinese sentence meaning "happy today", the forward order runs from its first character to its last character, and the reverse order runs from the last character back to the first.
  • the decoder can predict the second speech feature along the forward-order or reverse-order direction of the text.
  • the above-mentioned second text is in a middle area of the target text, or in other words, the second text is not at both ends of the target text.
  • obtaining the second speech feature through the decoder includes: decoding the first vector in the forward order of the target text based on the decoder and the first speech feature to obtain a third speech feature; decoding the first vector in the reverse order of the target text based on the decoder and the first speech feature to obtain a fourth speech feature; and obtaining the second speech feature based on the third speech feature and the fourth speech feature.
  • the decoder is a bidirectional decoder, which can obtain, from the left and right sides respectively (i.e., in forward and reverse order), two speech features corresponding to the second text, and obtain the second speech feature from these two speech features, making the second speech feature more similar to the feature of the first text in the original speech and improving the auditory effect of the target edited speech.
  • the above-mentioned second text includes a third text and a fourth text, the third speech feature is the speech feature corresponding to the third text, and the fourth speech feature is the speech feature corresponding to the fourth text; obtaining the second speech feature based on the third speech feature and the fourth speech feature includes: concatenating the third speech feature and the fourth speech feature to obtain the second speech feature.
  • a part of the speech features is taken from the direction of the forward sequence, and another part of the speech features is taken from the direction of the reverse order, and the part of the speech features is spliced with the other part of the speech features to obtain the overall speech feature.
  • the third speech feature in the above steps is the speech feature corresponding to the second text obtained by the decoder based on the forward sequence
  • the fourth speech feature is the speech feature corresponding to the second text obtained by the decoder based on the reverse sequence
  • the second speech feature is selected from the two complete speech features in a complementary manner, using the transition between them, so that the second speech feature refers to both the forward-order decoding and the reverse-order decoding, increasing the degree of similarity between the second speech feature and the first speech feature.
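  • One hedged way to picture this complementary selection, assuming both decoded features have been aligned to forward time order and that the transition is simply taken at the midpoint (the application itself does not fix a particular transition rule):

        import numpy as np

        def combine_bidirectional(third_feature, fourth_feature):
            # third_feature: frames decoded in forward order for the second text
            # fourth_feature: frames decoded in reverse order, re-ordered to forward time
            t = min(len(third_feature), len(fourth_feature))
            mid = t // 2
            # keep the first half from the forward pass and the second half from the
            # backward pass, so each end transitions smoothly into its adjacent context
            return np.concatenate([third_feature[:mid], fourth_feature[mid:t]], axis=0)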
  • the above step of generating the target edited speech corresponding to the second text based on the second speech feature includes: generating the target edited speech through a vocoder based on the second speech feature.
  • the second voice feature is converted into the target edited voice according to the vocoder, so that the target edited voice has voice features similar to the original voice, and the user's sense of hearing is improved.
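  • The application does not prescribe a particular vocoder; as a rough stand-in for illustration only, a classical Griffin-Lim inversion of a mel spectrogram (via librosa) shows the shape of this step, whereas a neural vocoder would normally be used in practice:

        import librosa

        def mel_to_waveform(mel_power, sr=16000, n_fft=1024, hop_length=256):
            # invert a (n_mels, T) power mel spectrogram back to a waveform
            return librosa.feature.inverse.mel_to_audio(
                mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length)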
  • the first voice feature carries a voiceprint feature of the original voice.
  • the manner of acquiring the voiceprint feature may be directly acquiring, or obtaining the voiceprint feature by recognizing the original voice, and the like.
  • the subsequently generated second voice features also carry the voiceprint features of the original voice, thereby improving the similarity between the target edited voice and the original voice .
  • introducing the voiceprint feature helps make the subsequently predicted speech features more similar to the voiceprint of the speaker of the original speech.
  • the above steps further include: determining the non-edited voice based on the target text, the original text, and the original voice, specifically, determining the first text based on the target text and the original text ; Determine a non-edited voice based on the first text, the original text, and the original voice.
  • the non-edited voice of the first text in the original voice is determined, which facilitates subsequent generation of first voice features.
  • determining the first text based on the target text and the original text includes: determining the overlapping text based on the target text and the original text; displaying the overlapping text to the user; and, in response to a second operation of the user, determining the first text from the overlapping text.
  • the first text can be determined according to the user's operation.
  • the operability of the user's voice editing can be improved.
  • more non-editing voices can be referred to.
  • the above neural network is obtained by using training data as the input of the neural network and training the neural network with the goal that the value of the loss function is less than a second threshold, where the training data includes the training speech and the training text corresponding to the training speech, and the loss function is used to indicate the difference between the speech features output by the neural network and the actual speech features, the actual speech features being obtained from the training speech.
  • the neural network is trained with the goal of reducing the value of the loss function, that is, the difference between the speech features output by the neural network and the actual speech features is continuously reduced. Therefore, the second speech feature output by the neural network is more accurate.
  • the above step of determining the non-edited speech based on the first text, the original text, and the original speech includes: determining the start and end positions of each phoneme of the original text in the original speech; and determining the non-edited speech based on the start and end positions and the first text.
  • the non-edited speech is determined according to the start and end positions of the phonemes in the original speech and the first text, so that the determined non-edited speech is more accurate in the phoneme dimension.
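  • A minimal sketch of cutting out the non-edited speech once phoneme start and end positions are known, assuming a forced aligner has already produced per-phoneme time spans (the data layout below is an assumption for illustration):

        def cut_non_edited(waveform, sr, phoneme_spans, first_text_phoneme_indices):
            # phoneme_spans: list of (start_sec, end_sec), one entry per phoneme of the original text
            # first_text_phoneme_indices: indices of the phonemes belonging to the first text
            segments = []
            for idx in first_text_phoneme_indices:
                start, end = phoneme_spans[idx]
                segments.append(waveform[int(start * sr):int(end * sr)])
            return segments   # the non-edited speech, phoneme by phoneme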
  • the above-mentioned first speech feature and the second speech feature are Mel spectrum features.
  • the second aspect of the embodiment of the present application provides a voice processing method, which can be applied to scenarios such as a user recording a short video, a teacher recording a lecture voice, and the like.
  • the method may be executed by the speech processing device, or may be executed by components of the speech processing device (such as a processor, a chip, or a chip system, etc.).
  • the speech processing device is a terminal device, and the method includes: acquiring the original speech and the second text, where the second text is text in the target text other than the first text, the target text and the original text corresponding to the original speech both include the first text, and the speech corresponding to the first text in the original speech is non-edited speech; sending the original speech and the second text to the cloud device, where the original speech and the second text are used by the cloud device to generate the target edited speech corresponding to the second text; and receiving the target edited speech sent by the cloud device.
  • the cloud device can perform complex calculations to obtain the target editing voice and return it to the terminal device, which can reduce the computing power and storage space of the terminal device.
  • the target edited speech corresponding to the modified text can be generated according to the speech features of the non-edited region in the original speech, and the target speech corresponding to the target text can then be generated from the non-edited speech and the target edited speech.
  • the above step of: obtaining the original voice and the second text includes: receiving an editing request from the user, where the editing request includes the original voice and the second text.
  • the edit request includes the original speech and the target text.
  • the target text can be understood as the text corresponding to the voice that the user wants to generate.
  • the user can obtain the target edited speech corresponding to the modified text (i.e., the second text) by modifying text in the original text, which improves the user's editing experience for text-based speech editing.
  • the third aspect of the embodiment of the present application provides a voice processing method, which can be applied to scenarios such as a user recording a short video, a teacher recording a lecture voice, and the like.
  • the method may be executed by the speech processing device, or may be executed by components of the speech processing device (such as a processor, a chip, or a chip system, etc.).
  • the speech processing device is a cloud device
  • the method includes: receiving the original speech and the second text sent by the terminal device, where the second text is text in the target text other than the first text, the target text and the original text corresponding to the original speech both include the first text, and the speech corresponding to the first text in the original speech is non-edited speech; obtaining a first speech feature based on the non-edited speech; obtaining a second speech feature corresponding to the second text through a neural network based on the first speech feature and the second text; and generating the target edited speech corresponding to the second text based on the second speech feature.
  • the second speech feature corresponding to the second text in the target text is obtained through the first speech feature of the first text in the original speech; that is, the second speech feature of the second text in the target text is generated by referring to the first speech feature of the first text in the original text, so that the sense of hearing of the target edited speech is similar to that of the original speech.
  • the above steps further include: sending the target editing voice to the terminal device.
  • the cloud device performs complex calculations to obtain the target editing voice and returns it to the terminal device, which can reduce the computing power and storage space of the terminal device.
  • the above steps further include: generating the target voice based on the target edited voice and the non-edited voice; and sending the target voice to the terminal device.
  • the cloud device performs complex calculations to obtain the target voice and returns it to the terminal device, which can reduce the computing power and storage space of the terminal device.
  • the fourth aspect of the present application provides a voice processing device, which can be applied to scenarios such as a user recording a short video, a teacher recording a lecture voice, and the like.
  • the speech processing device may be a terminal device or a cloud device, and the speech processing device includes: an acquisition unit, configured to acquire the original speech and the second text, where the second text is text in the target text other than the first text, the target text and the original text corresponding to the original speech both include the first text, and the speech corresponding to the first text in the original speech is non-edited speech; the acquisition unit is further configured to obtain a first speech feature based on the non-edited speech; a processing unit, configured to obtain a second speech feature corresponding to the second text through a neural network based on the first speech feature and the second text; and a generation unit, configured to generate a target edited speech corresponding to the second text based on the second speech feature.
  • the prosody, timbre and/or signal-to-noise ratio of the first speech feature can be the same as or similar to those of the second speech feature; prosody can reflect the speaker's emotional state or speech form, and generally refers to intonation, pitch, stress emphasis, pauses, or rhythm.
  • the above-mentioned obtaining unit is specifically configured to receive the original voice and the second text sent by the terminal device.
  • the voice processing device further includes: a sending unit, configured to send the target edited voice to the terminal device, where the target edited voice is used by the terminal device to generate a target voice corresponding to the target text.
  • the above-mentioned obtaining unit is specifically configured to receive the original voice and the target text sent by the terminal device.
  • the generating unit is further configured to generate a target voice corresponding to the target text based on the non-edited voice and the target edited voice
  • the voice processing device further includes: a sending unit configured to send the target voice to the terminal device.
  • the above-mentioned obtaining unit is specifically configured to receive an editing request from a user, where the editing request includes the original voice and the second text.
  • the edit request includes the original speech and the target text.
  • the target text can be understood as the text corresponding to the voice that the user wants to generate.
  • the above-mentioned acquisition unit is further configured to acquire the position of the second text in the target text; the speech processing device further includes: a splicing unit, configured to splice the target edited speech and the non-edited speech based on the position to obtain the target speech corresponding to the target text.
  • the above-mentioned acquisition unit is specifically configured to acquire at least one speech frame in the non-edited speech, and to acquire the first speech feature based on the at least one speech frame, where the first speech feature is used to represent the feature of the at least one speech frame and is a feature vector or sequence.
  • the text corresponding to the above at least one speech frame is text adjacent to the second text in the first text.
  • the above-mentioned processing unit is specifically configured to obtain the second speech feature corresponding to the second text through a neural network based on the first speech feature, the target text, and the marking information, The marking information is used to mark the second text in the target text.
  • the above-mentioned neural network includes an encoder and a decoder; the processing unit is specifically configured to obtain the first vector corresponding to the second text through the encoder based on the second text, and to obtain the second speech feature through the decoder based on the first vector and the first speech feature.
  • the above processing unit is specifically configured to obtain the first vector through an encoder based on the target text.
  • the above speech processing device further includes: a first prediction unit, configured to predict the first duration and the second duration through a prediction network based on the target text, where the first duration is the phoneme duration corresponding to the first text in the target text and the second duration is the phoneme duration corresponding to the second text in the target text; and a first correction unit, configured to modify the second duration based on the first duration and the third duration to obtain the first modified duration, where the third duration is the phoneme duration of the first text in the original speech; the processing unit is specifically configured to obtain the second speech feature through the decoder based on the first vector, the first speech feature and the first modified duration.
  • the above-mentioned first correction unit is specifically configured to calculate a ratio between the third duration and the first duration; and obtain the first corrected duration based on the ratio and the second duration.
  • the above-mentioned processing unit is specifically configured to upsample the first vector based on the first correction duration to obtain the second vector; the processing unit is specifically configured to Based on the second vector and the first speech feature, the second speech feature is acquired through a decoder. Specifically, the processing unit is specifically configured to input the second vector and the first speech feature into the decoder to obtain the second speech feature.
  • the decoder includes multiple coding units connected in series; the second vector and the first speech feature may be input to the same coding unit or to different coding units.
  • the above-mentioned acquisition unit is further configured to obtain the speech rate of the original speech; the speech processing device further includes: a second prediction unit, configured to predict a fourth duration through the prediction network based on the second text, where the fourth duration is the total duration of all phonemes corresponding to the second text; and a second correction unit, configured to modify the fourth duration based on the speech rate to obtain the second modified duration; the processing unit is specifically configured to obtain the second speech feature through the decoder based on the first vector, the first speech feature and the second modified duration.
  • the above-mentioned processing unit is specifically configured to decode the first vector from the forward sequence or reverse sequence of the target text based on the decoder and the first speech feature to obtain the second speech feature.
  • the above-mentioned second text is in the middle region of the target text; the processing unit is specifically configured to: decode the first vector in the forward order of the target text based on the decoder and the first speech feature to obtain the third speech feature; decode the first vector in the reverse order of the target text based on the decoder and the first speech feature to obtain the fourth speech feature; and obtain the second speech feature based on the third speech feature and the fourth speech feature.
  • the above-mentioned second text includes the third text and the fourth text, the third speech feature is the speech feature corresponding to the third text, and the fourth speech feature is the speech feature corresponding to the fourth text.
  • the above-mentioned third speech feature is the complete speech feature corresponding to the second text obtained by the decoder in forward order, the fourth speech feature is the complete speech feature corresponding to the second text obtained by the decoder in reverse order, and the second speech feature is obtained by intercepting, at corresponding positions, complementary portions from the third speech feature and the fourth speech feature.
  • the above generating unit is specifically configured to generate the target edited voice by using a vocoder based on the second voice feature.
  • the above-mentioned first voice feature carries a voiceprint feature of the original voice.
  • the above-mentioned acquisition unit is further configured to determine the non-edited speech based on the target text, the original text, and the original speech; specifically, the acquisition unit is configured to determine the first text based on the target text and the original text, and to determine the non-edited speech based on the first text, the original text, and the original speech.
  • the above-mentioned acquisition unit is specifically configured to determine the overlapping text based on the target text and the original text, display the overlapping text to the user, and determine the first text from the overlapping text in response to a second operation of the user.
  • the above-mentioned neural network is obtained by using training data as the input of the neural network and training the neural network with the goal that the value of the loss function is less than a second threshold, where the training data includes the training speech and the training text corresponding to the training speech, and the loss function is used to indicate the difference between the speech features output by the neural network and the actual speech features, the actual speech features being obtained from the training speech.
  • the above-mentioned acquisition unit is specifically configured to determine the start and end positions of each phoneme of the original text in the original speech, and to determine the non-edited speech based on the start and end positions and the first text.
  • the above-mentioned first speech feature and the second speech feature are Mel spectrum features.
  • the fifth aspect of the present application provides a voice processing device, which can be applied to scenarios such as a user recording a short video, a teacher recording a lecture voice, and the like.
  • the voice processing device may be a terminal device.
  • the speech processing device includes: an acquisition unit, configured to acquire the original speech and the second text, where the second text is text in the target text other than the first text, the target text and the original text corresponding to the original speech both include the first text, and the speech corresponding to the first text in the original speech is non-edited speech; and a sending unit, configured to send the original speech and the second text to the cloud device, where the original speech and the second text are used by the cloud device to generate the target edited speech corresponding to the second text;
  • the acquisition unit is also used to receive the target editing voice sent by the cloud device.
  • the acquiring unit can also be understood as an input unit
  • the sending unit can also be understood as an output unit.
  • the above-mentioned obtaining unit is specifically configured to receive an editing request from a user, where the editing request includes the original voice and the second text.
  • the edit request includes the original speech and the target text.
  • the target text can be understood as the text corresponding to the voice that the user wants to generate.
  • the sixth aspect of the present application provides a voice processing device, which can be applied to scenarios such as a user recording a short video, a teacher recording a lecture voice, and the like.
  • the speech processing device may be a cloud device, and the speech processing device includes: a receiving unit, configured to receive the original speech and the second text sent by the terminal device, where the second text is text in the target text other than the first text, the target text and the original text corresponding to the original speech both include the first text, and the speech corresponding to the first text in the original speech is non-edited speech; an acquisition unit, configured to obtain a first speech feature based on the non-edited speech; a processing unit, configured to obtain a second speech feature corresponding to the second text through a neural network based on the first speech feature and the second text; and a generation unit, configured to generate the target edited speech corresponding to the second text based on the second speech feature.
  • the above voice processing device further includes: a sending unit, configured to send the target edited voice to the terminal device.
  • the above generating unit is further configured to generate the target voice based on the target edited voice and the non-edited voice; the sending unit is configured to send the target voice to the terminal device.
  • the seventh aspect of the present application provides a speech processing device, the speech processing device executes the method in the aforementioned first aspect or any possible implementation of the first aspect, or executes the aforementioned second aspect or any possible implementation of the second aspect The method in the implementation manner, or execute the method in the foregoing third aspect or any possible implementation manner of the third aspect.
  • the eighth aspect of the present application provides a speech processing device, including: a processor, where the processor is coupled with a memory, and the memory is used to store programs or instructions; when the programs or instructions are executed by the processor, the speech processing device is enabled to implement the method in the first aspect or any possible implementation of the first aspect, the method in the second aspect or any possible implementation of the second aspect, or the method in the third aspect or any possible implementation of the third aspect.
  • the ninth aspect of the present application provides a computer-readable medium on which computer programs or instructions are stored; when the computer programs or instructions are run on a computer, the computer is caused to execute the method in the first aspect or any possible implementation of the first aspect, the method in the second aspect or any possible implementation of the second aspect, or the method in the third aspect or any possible implementation of the third aspect.
  • the tenth aspect of the present application provides a computer program product; when the computer program product is executed on a computer, the computer is caused to execute the method in the first aspect or any possible implementation of the first aspect, the method in the second aspect or any possible implementation of the second aspect, or the method in the third aspect or any possible implementation of the third aspect.
  • for the technical effects brought by the third, fourth, sixth, seventh, eighth, ninth, and tenth aspects or any of their possible implementations, refer to the technical effects brought by the first aspect or the different possible implementations of the first aspect; details are not repeated here.
  • for the technical effects brought by the fifth, seventh, eighth, ninth, and tenth aspects or any of their possible implementations, refer to the technical effects brought by the second aspect or the different possible implementations of the second aspect; details are not repeated here.
  • the embodiments of the present application have the following advantages: the second speech feature corresponding to the second text in the target text is obtained through the first speech feature of the first text in the original speech, that is, by referring to the first speech feature in the original text The first speech feature of a text generates the second speech feature of the second text in the target text, so that the sense of hearing of the target edited speech is similar to that of the original speech, thereby improving user experience.
  • FIG. 1 is a schematic structural diagram of a system architecture provided by the present application.
  • Fig. 2 is a schematic diagram of a convolutional neural network structure provided by the present application.
  • FIG. 3 is a schematic diagram of another convolutional neural network structure provided by the present application.
  • FIG. 4 is a schematic diagram of a chip hardware structure provided by the present application.
  • Fig. 5 is a schematic flow chart of a neural network training method provided by the present application.
  • Fig. 6 is a schematic structural diagram of a neural network provided by the present application.
  • Fig. 7 is a schematic flow chart of the voice processing method provided by the present application.
  • Fig. 8-Fig. 10 are several schematic diagrams of the display interface of the voice processing device provided by the present application.
  • FIG. 11 is a schematic structural diagram of a bidirectional decoder provided by the present application.
  • FIG. 12 is another schematic diagram of the display interface of the voice processing device provided by the present application.
  • FIG. 13 is another schematic flowchart of the speech processing method provided by the present application.
  • Fig. 14-Fig. 18 are schematic structural diagrams of several voice processing devices provided in this application.
  • Embodiments of the present application provide a voice processing method and related equipment, which can realize that the sense of hearing of the edited voice is similar to that of the original voice, thereby improving user experience.
  • a neural network may be composed of neural units, where a neural unit may refer to an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit may be h = f(∑_s W_s·x_s + b), where W_s is the weight of x_s and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function may be a sigmoid function.
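  • A tiny sketch of such a neural unit with a sigmoid activation (plain Python, for illustration only):

        import math

        def neural_unit(x, w, b):
            # h = f(sum_s w_s * x_s + b), with f taken as the sigmoid function
            z = sum(w_s * x_s for w_s, x_s in zip(w, x)) + b
            return 1.0 / (1.0 + math.exp(-z))

        print(neural_unit([0.2, -1.0, 0.5], [0.4, 0.1, -0.3], b=0.05))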
  • a neural network is a network formed by connecting many of the above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • a deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special metric for "many" here.
  • the layers of a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • the deep neural network may not include a hidden layer, which is not limited here.
  • the work of each layer in a deep neural network can be described by the mathematical expression y = a(W·x + b). From the physical level, the work of each layer can be understood as completing a transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (a collection of input vectors); these five operations are: 1. raising/lowering the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is realized by a().
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in this layer of neural network.
  • the vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how to transform the space.
  • the purpose of training the deep neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vector W of many layers). Therefore, the training process of the neural network is essentially to learn the way to control the spatial transformation, and more specifically, to learn the weight matrix.
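  • A one-layer sketch of the transformation y = a(W·x + b) described above, using NumPy and an arbitrary tanh activation (the sizes are illustrative only):

        import numpy as np

        def layer(x, W, b):
            # W·x raises/lowers dimension, scales and rotates; +b translates; a() "bends"
            return np.tanh(W @ x + b)

        x = np.array([1.0, 2.0, 3.0])
        W = np.random.randn(2, 3)   # maps the 3-dimensional input space to a 2-dimensional output space
        b = np.zeros(2)
        print(layer(x, W, b))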
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of a convolutional layer and a subsampling layer.
  • the feature extractor can be seen as a filter, and the convolution process can be seen as convolving the same trainable filter with an input image or convolutional feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to some adjacent neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units of the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as a way to extract image information that is independent of location. The underlying principle is that the statistical information of a certain part of the image is the same as that of other parts. That means that the image information learned in one part can also be used in another part. So for all positions on the image, the same learned image information can be used.
  • multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
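  • A bare-bones sketch of weight sharing in a convolution, where one small kernel is slid over every position of the input (NumPy, illustration only; real CNN layers add many kernels, strides and padding):

        import numpy as np

        def conv2d_valid(image, kernel):
            # the same kernel (shared weights) is applied at every image position
            H, W = image.shape
            k = kernel.shape[0]
            out = np.zeros((H - k + 1, W - k + 1))
            for i in range(out.shape[0]):
                for j in range(out.shape[1]):
                    out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
            return out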
  • the separation network, recognition network, detection network, and depth estimation network in the embodiment of the present application can all be CNNs.
  • a recurrent neural network is one in which the current output of a sequence is also related to the previous outputs; specifically, the network remembers previous information, saves it in the internal state of the network, and applies it to the calculation of the current output.
  • Text to speech is a program or software system that converts text into speech.
  • a vocoder is a sound signal processing module or software that encodes acoustic features into sound waveforms.
  • when a sounding body emits sound due to vibration, the sound can generally be decomposed into many simple sine waves; that is to say, all natural sounds are basically composed of many sine waves with different frequencies. The sine wave with the lowest frequency is the fundamental tone (i.e., the fundamental frequency, which can be represented by F0), while the other sine waves with higher frequencies are overtones.
  • Prosody: in the field of speech synthesis, prosody generally refers to features that control intonation, pitch, stress emphasis, pauses, and rhythm. Prosody can reflect the speaker's emotional state or speech form.
  • Phoneme: the smallest unit of speech, divided according to the natural properties of speech and analyzed according to the pronunciation actions in syllables; one action constitutes one phoneme. Phonemes are divided into two categories: vowels and consonants. For example, the Chinese syllable a (first tone: "ah") has only one phoneme, ai (fourth tone: "love") has two phonemes, and dai (first tone: "stay") has three phonemes.
  • Word vectors can also be called “word embeddings”, “vectorization”, “vector mapping”, “embedding”, etc.
  • A word vector uses a dense vector to represent an object, for example, using a vector to represent a user identity document (ID) number, an item ID, and so on.
  • Speech features: the processed speech signal is transformed into a concise and logical representation that is more discriminative and reliable than the raw signal. After a piece of speech signal is acquired, speech features can be extracted from it; the extraction usually yields a multi-dimensional feature vector for each speech signal. There are many parametric representations of speech signals, such as perceptual linear prediction (PLP), linear predictive coding (LPC), and mel-frequency cepstral coefficients (MFCC).
  • PLP perceptual linear predictive
  • LPC linear predictive coding
  • MFCC mel-frequency cepstral coefficient
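  • As a hedged illustration of such parametric representations (not part of the patent text), the sketch below extracts MFCC and mel-spectrogram feature vectors with the librosa library; the file name "speech.wav" and the sampling rate are placeholder assumptions.

```python
import librosa

# Load a waveform and extract a multi-dimensional feature vector per frame.
y, sr = librosa.load("speech.wav", sr=16000)                  # placeholder audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # shape: (13, n_frames)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)   # mel spectrogram, another common speech feature
```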
  • To modify recorded speech content, voice editing is usually used.
  • The current voice editing method is to obtain a speech segment from a database, replace the wrong content with that segment, and then generate the corrected speech.
  • This method relies too heavily on the speech segments stored in the database: if a segment differs greatly from the original speech in timbre, prosody, or signal-to-noise ratio, the corrected speech will be incoherent, its rhythm will be unnatural, and it will sound poor to the listener.
  • Therefore, the present application provides a voice editing method that determines the second speech feature of the modified content by referring to the first speech feature corresponding to the context of the content to be modified, and generates the target edited speech corresponding to the second text based on the second speech feature, so that the target edited speech sounds similar to the original speech, thereby improving the user experience.
  • the embodiment of the present application provides a system architecture 10 .
  • the data collection device 16 is used to collect training data.
  • the training data includes training speech and training text corresponding to the training speech.
  • the training data is stored in the database 13 , and the training device 12 obtains the target model/rule 101 based on training data maintained in the database 13 .
  • The following describes in more detail how the training device 12 obtains the target model/rule 101 based on the training data. The target model/rule 101 can be used to implement the speech processing method provided by the embodiment of the present application; that is, after relevant preprocessing, the text is input into the target model/rule 101 to obtain the speech features of the text.
  • The target model/rule 101 in the embodiment of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 13 may not all be collected by the data collection device 16, but may also be received from other devices. In addition, the training device 12 does not necessarily train the target model/rule 101 entirely based on the training data maintained in the database 13; it may also obtain training data from the cloud or elsewhere for model training. The foregoing description should not be construed as a limitation on this embodiment of the present application.
  • The target model/rule 101 obtained through training by the training device 12 can be applied to different systems or devices, such as the execution device 11 shown in FIG. 1, which may be a laptop, an AR/VR device, a vehicle-mounted terminal, or the like, and may also be a server or a cloud device.
  • the execution device 11 is configured with an I/O interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 14.
  • In the embodiment of this application, the input data may include the first speech feature, the target text, and the mark information; the input data may also include the first speech feature and the second text.
  • the input data may be input by the user, or uploaded by the user through other devices, and of course may also come from a database, which is not limited here.
  • the preprocessing module 113 is used to perform preprocessing according to the target text and tag information received by the I/O interface 112.
  • For example, the preprocessing module 113 can be used to determine the target editing text in the target text based on the target text and the mark information. If the input data includes the first speech feature and the second text, the preprocessing module 113 is used to perform preprocessing according to the second text received by the I/O interface 112, for example, converting the second text into phonemes and other preparation.
  • When the execution device 11 preprocesses the input data, or when the calculation module 111 of the execution device 11 performs calculation and other related processing, the execution device 11 can call the data, code, and the like in the data storage system 15 for corresponding processing, and the correspondingly processed data and instructions may also be stored in the data storage system 15.
  • the I/O interface 112 returns the processing result, such as the second speech feature obtained above, to the client device 14, so as to provide it to the user.
  • The training device 12 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired result or providing input for subsequent processing.
  • the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 112 .
  • The client device 14 can automatically send the input data to the I/O interface 112. If automatic sending of input data by the client device 14 requires the user's authorization, the user can set the corresponding permission in the client device 14.
  • the user can view the results output by the execution device 11 on the client device 14, and the specific presentation form may be specific ways such as display, sound, and action.
  • the client device 14 can also be used as a data collection terminal, collecting the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data, and storing them in the database 13 .
  • Alternatively, the client device 14 may be bypassed, and the I/O interface 112 directly stores the input data input to the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, in the database 13 as new sample data.
  • It is worth noting that FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, and the like shown in the figure does not constitute any limitation. For example, in FIG. 1 the data storage system 15 is an external memory relative to the execution device 11; in other cases, the data storage system 15 may also be placed in the execution device 11.
  • the target model/rule 101 is obtained according to the training of the training device 12.
  • the target model/rule 101 may be a neural network in the embodiment of the present application.
  • The neural network can be a recurrent neural network, a long short-term memory network, or the like.
  • the predictive network can be a convolutional neural network, a recurrent neural network, etc.
  • the neural network and the prediction network in the embodiment of the present application may be two separate networks, or a multi-task neural network, where one task is to output duration and the other task is to output speech features.
  • CNN is a very common neural network
  • the structure of CNN will be introduced in detail in combination with Figure 2 below.
  • the convolutional neural network is a deep neural network with a convolutional structure, and it is a deep learning architecture.
  • The deep learning architecture refers to performing multiple levels of learning at different abstraction levels through machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images input into it.
  • a convolutional neural network (CNN) 100 may include an input layer 110 , a convolutional/pooling layer 120 , and a neural network layer 130 where the pooling layer is optional.
  • The convolutional layer/pooling layer 120 may include, for example, layers 121 to 126.
  • In one example, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer.
  • In another example, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer.
  • That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolutional layer 121 can include many convolutional operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually slid along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract specific features from the image.
  • The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends to the entire depth of the input image.
  • convolving with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases instead of using a single weight matrix, multiple weight matrices of the same dimension are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolved image.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to filter unwanted noise in the image.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
  • the initial convolutional layer (such as 121) often extracts more general features, which can also be referred to as low-level features;
  • the features extracted by the later convolutional layers (such as 126) become more and more complex, such as features such as high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
  • A convolutional layer may be followed by a single pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • the sole purpose of pooling layers is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
  • the average pooling operator can calculate the average value of the pixel values in the image within a specific range.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. Also, just like the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
  • After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as mentioned earlier, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 100 needs the neural network layer 130 to generate one output or a group of outputs whose number equals the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 2) and an output layer 140, and the parameters contained in these hidden layers can be pre-trained based on training data related to the specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • The output layer 140 has a loss function similar to categorical cross-entropy and is specifically used to calculate the prediction error.
  • Once the forward propagation of the entire convolutional neural network 100 (the propagation from 110 to 140 in FIG. 2) is completed, back propagation (the propagation from 140 to 110 in FIG. 2) starts to update the aforementioned weight values and biases of each layer, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
  • the convolutional neural network 100 shown in FIG. 2 is only an example of a convolutional neural network.
  • The convolutional neural network can also exist in the form of other network models, for example, multiple parallel convolutional layers/pooling layers as shown in FIG. 3, whose separately extracted features are all input to the neural network layer 130 for processing.
  • a chip hardware structure provided by the embodiment of the present application is introduced below.
  • FIG. 4 is a hardware structure of a chip provided by an embodiment of the present application, and the chip includes a neural network processor 40 .
  • The chip can be set in the execution device 11 shown in FIG. 1 to complete the computing work of the calculation module 111.
  • The chip can also be set in the training device 12 shown in FIG. 1 to complete the training work of the training device 12 and output the target model/rule 101.
  • the algorithms of each layer in the convolutional neural network shown in Figure 2 can be implemented in the chip shown in Figure 4 .
  • the neural network processor 40 can be a neural network processor (neural-network processing unit, NPU), a tensor processor (tensor processing unit, TPU), or a graphics processing unit (graphics processing unit, GPU), etc.
  • NPU neural-network processing unit
  • TPU tensor processing unit
  • GPU graphics processing unit
  • The neural network processor 40 may also be another processor suitable for large-scale exclusive-OR operation processing. Taking the NPU as an example: the neural network processor NPU 40 is mounted as a coprocessor on the host central processing unit (CPU), and the host CPU assigns tasks.
  • the core part of the NPU is the operation circuit 403, and the controller 404 controls the operation circuit 403 to extract data in the memory (weight memory or input memory) and perform operations.
  • the operation circuit 403 includes multiple processing units (process engine, PE).
  • arithmetic circuit 403 is a two-dimensional systolic array.
  • the arithmetic circuit 403 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 403 is a general-purpose matrix processor.
  • In some implementations, the operation circuit fetches the data corresponding to matrix B from the weight memory 402 and caches it in each PE in the operation circuit.
  • the operation circuit fetches the data of matrix A from the input memory 401 and performs matrix operation with matrix B, and the obtained partial results or final results of the matrix are stored in the accumulator 408 .
  • the vector calculation unit 407 can perform further processing on the output of the calculation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • The vector calculation unit 407 can be used for network calculations of non-convolution/non-FC layers in a neural network, such as pooling, batch normalization, local response normalization, and so on.
  • In some implementations, the vector calculation unit 407 can store the processed output vectors in the unified buffer 406.
  • the vector calculation unit 407 may apply a non-linear function to the output of the operation circuit 403, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 407 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as an activation input to operational circuitry 403, eg, for use in subsequent layers in a neural network.
  • the unified memory 406 is used to store input data and output data.
  • The direct memory access controller (DMAC) 405 transfers the input data in the external memory to the input memory 401 and/or the unified memory 406, stores the weight data in the external memory into the weight memory 402, and stores the data in the unified memory 406 into the external memory.
  • DMAC direct memory access controller
  • a bus interface unit (bus interface unit, BIU) 410 is configured to implement interaction between the main CPU, DMAC and instruction fetch memory 409 through the bus.
  • An instruction fetch buffer 409 connected to the controller 404 is used to store instructions used by the controller 404.
  • The controller 404 is configured to invoke the instructions cached in the instruction fetch memory 409 to control the operation process of the computing accelerator.
  • the unified memory 406, the input memory 401, the weight memory 402, and the instruction fetch memory 409 are all on-chip (On-Chip) memories
  • the external memory is a memory outside the NPU
  • The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • DDR SDRAM double data rate synchronous dynamic random access memory
  • HBM high bandwidth memory
  • each layer in the convolutional neural network shown in FIG. 2 or FIG. 3 may be performed by the operation circuit 403 or the vector calculation unit 407 .
  • the voice processing method can be applied to scenarios where voice content needs to be modified, for example, a user records a short video, a teacher is recording a lecture voice, and the like.
  • The voice processing method can be applied to applications, software, or voice processing devices with a voice editing function, such as mobile phones, computers, intelligent voice assistants on mobile terminals capable of producing sound, and smart speakers.
  • the voice processing device is a terminal device for serving users, or a cloud device.
  • The terminal device may include a head mounted display (HMD); the head mounted display device may be a combination of a virtual reality (VR) box and a terminal, a VR all-in-one machine, a personal computer (PC), an augmented reality (AR) device, a mixed reality (MR) device, and so on.
  • the terminal equipment may also include cellular phone (cellular phone), smart phone (smart phone), personal digital assistant (personal digital assistant, PDA) ), tablet computer, laptop computer (laptop computer), personal computer (personal computer, PC), vehicle-mounted terminal, etc., which are not limited here.
  • the neural network and the prediction network in the embodiment of the present application can be two separate networks, or a multi-task neural network, one of which is to output duration, and the other is to output speech features.
  • The training method shown in Figure 5 can be executed by a neural network training device, which can be a cloud service device or a terminal device, for example, a computer, a server, or another device whose computing capability is sufficient to execute the training method of the neural network; it may also be a system composed of a cloud service device and a terminal device.
  • the training method can be executed by the training device 120 in FIG. 1 and the neural network processor 40 in FIG. 4 .
  • the training method may be processed by CPU, or jointly processed by CPU and GPU, or other processors suitable for neural network calculation may be used instead of GPU, which is not limited in this application.
  • the training method shown in FIG. 5 includes step 501 and step 502 .
  • Step 501 and step 502 will be described in detail below.
  • the prediction network in the embodiment of the present application may be RNN, CNN, etc., which are not limited here.
  • The input of the prediction network is the vector of the training text, and the output is the duration of each phoneme in the training text. The difference between the duration of each phoneme output by the prediction network and the actual duration of that phoneme in the training speech corresponding to the training text is then continuously reduced, yielding a trained prediction network (a minimal sketch is given below).
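  • A minimal sketch of such a duration prediction network, assuming PyTorch and arbitrary layer sizes (the patent does not prescribe a specific structure): phoneme IDs of the training text go in, a per-phoneme duration comes out, and the training loss is the difference from the actual durations.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Toy phoneme-duration predictor: phoneme IDs in, frames-per-phoneme out."""
    def __init__(self, n_phonemes=100, dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, 1)

    def forward(self, phoneme_ids):              # (batch, seq_len)
        x = self.embed(phoneme_ids)
        x, _ = self.rnn(x)
        return self.proj(x).squeeze(-1)          # (batch, seq_len) predicted durations

model = DurationPredictor()
phonemes = torch.randint(0, 100, (1, 12))        # training text as phoneme IDs
target_durations = torch.rand(1, 12) * 20        # actual durations from forced alignment (in frames)
loss = nn.MSELoss()(model(phonemes), target_durations)
loss.backward()                                  # reduce the difference between predicted and actual durations
```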
  • Step 501 acquire training data.
  • the training data in this embodiment of the present application includes training speech, or includes training speech and training text corresponding to the training speech. If the training data does not include the training text, the training text can be obtained by recognizing the training speech.
  • The training data may also include a user identification, or the voiceprint feature of the training speech, or a vector used to identify the voiceprint feature of the training speech.
  • the training data may also include start and end duration information of each phoneme in the training speech.
  • The training data can be acquired by directly recording the utterance of the speaking object, by the user inputting audio and video information, or by receiving it from a collection device.
  • the method of obtaining training data is not limited here.
  • Step 502: use the training data as the input of the neural network, and train the neural network with the goal of making the value of the loss function smaller than the second threshold, so as to obtain a trained neural network.
  • some preprocessing can be performed on the training data.
  • the training data includes the training speech as described above
  • the training text can be obtained by recognizing the training speech, and the training text can be represented by phonemes and input into the neural network.
  • The entire training text can be used as the target editing text and used as input to train the neural network with the goal of reducing the value of the loss function, that is, to continuously reduce the difference between the speech features output by the neural network and the actual speech features corresponding to the training speech.
  • This training process can be understood as a prediction task.
  • the loss function can be understood as the loss function corresponding to the prediction task.
  • the neural network in the embodiment of the present application may specifically be an attention mechanism model, such as transformer, tacotron2, and the like.
  • the attention mechanism model includes an encoder-decoder, and the structure of the encoder or decoder can be a recurrent neural network, a long short-term memory network (long short-term memory, LSTM), etc.
  • the neural network in the embodiment of the present application includes an encoder (encoder) and a decoder (decoder), and the structure types of the encoder and decoder can be RNN, LSTM, etc., which are not limited here.
  • the role of the encoder is to encode the training text into a text vector (vector representation in units of phonemes, each input corresponds to a vector), and the role of the decoder is to obtain the corresponding speech features of the text according to the text vector.
  • the calculation of each step is performed on the basis of the real speech features corresponding to the previous step.
  • the prediction network can be used to correct the speech duration corresponding to the text vector. That is, it can be understood as upsampling the text vector according to the duration of each phoneme in the training speech (it can also be understood as expanding the number of frames of the vector) to obtain a vector corresponding to the number of frames.
  • the role of the decoder is to obtain the speech features corresponding to the text according to the above-mentioned vector corresponding to the number of frames.
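  • A minimal sketch of the upsampling (frame-expansion) step described above, assuming numpy and arbitrary dimensions: each phoneme's text vector is repeated for as many frames as its duration.

```python
import numpy as np

def upsample_by_duration(text_vectors, durations):
    """Repeat each phoneme vector `duration` times so the sequence length matches the number of speech frames."""
    frames = [np.repeat(v[None, :], int(d), axis=0) for v, d in zip(text_vectors, durations)]
    return np.concatenate(frames, axis=0)

text_vectors = np.random.rand(5, 256)   # one vector per phoneme from the encoder
durations = [3, 2, 4, 1, 2]             # frames per phoneme from the prediction network
frame_vectors = upsample_by_duration(text_vectors, durations)   # shape (12, 256)
```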
  • the above-mentioned decoder may be a unidirectional decoder or a bidirectional decoder (that is, two directions are parallel), which is not specifically limited here.
  • The two directions refer to the directions of the training text (which can also be understood as the directions of the vector corresponding to the training text, or as the forward and reverse order of the training text): one direction goes from one side of the training text to the other side, and the other direction goes from the other side back to the first side.
  • For example, the first direction (forward order) is the direction from the first character of the training text to the last character, and the second direction (reverse order) is the direction from the last character back to the first character.
  • the decoder is a bidirectional decoder
  • the decoders in the two directions are trained in parallel, and they are independently calculated during the training process, and there is no result dependence.
  • the prediction network and the neural network are a multi-task network
  • the prediction network can be called a prediction module
  • the decoder can modify the speech features output by the neural network according to the real duration information corresponding to the training text.
  • the architecture of the neural network in the embodiment of the present application can refer to FIG. 6 .
  • the neural network includes an encoder and a decoder.
  • the neural network may also include a prediction module and an upsampling module.
  • the prediction module is specifically used to realize the above-mentioned function of the prediction network
  • the up-sampling module is specifically used to realize the above-mentioned process of up-sampling the text vector according to the duration of each phoneme in the training speech, and details are not repeated here.
  • the voice processing method provided in the embodiment of the present application can be applied to a replacement scene, an insertion scene or a deletion scene.
  • the above scenario can be understood as replacing, inserting, deleting, etc. the original speech corresponding to the original text to obtain the target speech, so as to realize the similarity between the target speech and the original speech and/or improve the fluency of the target speech.
  • the original speech can be regarded as including the speech to be modified, and the target speech is the speech obtained after the user wants to modify the original speech.
  • the original text is "the weather is fine in Shenzhen today", and the target text is “the weather is fine in Guangzhou today”.
  • the overlapping text is "the weather is fine today”.
  • the non-overlapping text in the original text is "Shenzhen”, and the non-overlapping text in the target text is "Guangzhou”.
  • the target text includes a first text and a second text, and the first text is the overlapping text or a part of the overlapping text.
  • The second text is the text other than the first text in the target text. For example, if the first text is "the weather is fine today", then the second text is "Guangzhou"; if the first text is only part of the overlapping text (for example, the overlapping text with the characters adjacent to the non-overlapping position removed), then the second text is "Guangzhou" together with those removed characters.
  • the original text is "the weather in Shenzhen is fine today", and the target text is “the weather in Shenzhen is fine this morning".
  • the overlapping text is "the weather in Shenzhen is very good today”.
  • the non-overlapping text in the target text is "morning".
  • The insertion scene can be regarded as a replacement scene in which the speech of the characters adjacent to the insertion point in the original speech is replaced with speech for those characters plus the inserted "morning"; that is, the second text is "morning" together with the characters adjacent to the insertion point, and the first text is the remainder of the target text.
  • the original text is "the weather is fine in Shenzhen today", and the target text is "the weather is fine today".
  • the overlapping text is "the weather is fine today”.
  • the non-overlapping text in the original text is "Shenzhen”.
  • The deletion scene can be regarded as a replacement scene in which the deleted "Shenzhen" together with its adjacent characters in the original speech is replaced with just those adjacent characters; that is, the second text consists of the characters adjacent to the deleted content, and the first text is the remainder of the target text.
  • the voice processing method provided by the embodiment of the present application will be described below only by taking the replacement scene as an example.
  • the voice processing method provided by the embodiment of the present application can be executed by the terminal device or the cloud device alone, or can be jointly completed by the terminal device and the cloud device, which are described separately below:
  • Embodiment 1 The terminal device or the cloud device independently executes the voice processing method.
  • the method may be executed by a speech processing device, or may be executed by components of a speech processing device (such as a processor, a chip, or a chip system, etc.).
  • the voice processing device may be a terminal device or a cloud device, and this embodiment includes steps 701 to 706.
  • Step 701 acquire original voice and second text.
  • the voice processing device may directly acquire the original voice, the original text, and the second text.
  • the original speech and the second text may also be obtained first, and the original text corresponding to the original speech is obtained after recognizing the original speech.
  • the second text is the text in the target text except the first text, and the original text and the target text contain the first text.
  • the first text can be understood as part or all of the overlapping text between the original text and the target text.
  • the voice processing device can directly obtain the second text through input from other devices or users.
  • the speech processing device obtains the target text, obtains the overlapping text according to the original text corresponding to the target text and the original speech, and then determines the second text according to the overlapping text.
  • the characters in the original text and the target text may be compared one by one or input into a comparison model to determine overlapping text and/or non-overlapping text between the original text and the target text.
  • The first text determined according to the overlapping text may be the entire overlapping text, or part of the overlapping text.
  • The voice processing device can directly determine the overlapping text as the first text, or determine the first text within the overlapping text according to a preset rule, or determine the first text within the overlapping text according to a user operation.
  • the preset rule may be to obtain the first text after removing N characters in the overlapping content, where N is a positive integer.
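  • As an illustrative stand-in for the character-by-character comparison or comparison model mentioned above (not the patent's implementation), overlapping and non-overlapping text can be found with Python's difflib; the word-level English rendering of the example is an assumption.

```python
import difflib

original_text = "the weather is fine in Shenzhen today".split()
target_text = "the weather is fine in Guangzhou today".split()

matcher = difflib.SequenceMatcher(None, original_text, target_text)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        print("overlapping:", " ".join(target_text[j1:j2]))
    else:  # 'replace', 'insert' or 'delete'
        print("non-overlapping in target (candidate second text):", " ".join(target_text[j1:j2]))
```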
  • the speech processing device can align the original text with the original speech, determine the start and end positions of each phoneme in the original text in the original speech, and obtain the duration of each phoneme in the original text. Further, the phoneme corresponding to the first text is obtained, that is, the speech corresponding to the first text in the original speech (that is, the non-edited speech) is obtained.
  • The voice processing device may align the original text with the original voice by using a forced alignment method, for example, the Montreal forced aligner (MFA), or an alignment tool such as a neural network with an alignment function, which is not specifically limited here.
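  • As a hedged illustration (not the patent's procedure), once a forced aligner has produced phoneme start and end times, per-phoneme durations in frames can be derived as below; the phoneme labels, times, and frame shift are made-up assumptions.

```python
# Hypothetical alignment output: (phoneme, start_sec, end_sec), e.g. from a forced aligner such as MFA.
alignment = [("j", 0.00, 0.08), ("in1", 0.08, 0.22), ("t", 0.22, 0.30), ("ian1", 0.30, 0.47)]

hop_seconds = 0.0125  # assumed frame shift of the speech features

# Duration of each phoneme in frames; the frames covered by the first text's phonemes form the non-edited speech.
durations_in_frames = [round((end - start) / hop_seconds) for _, start, end in alignment]
```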
  • the voice processing device may display a user interface to the user, where the user interface includes the original voice and the original text. Further, the user performs a first operation on the original text through the user interface, and the voice processing device determines the target text in response to the user's first operation.
  • the first operation can be understood as editing the original text by the user, and the editing can specifically be the aforementioned replacement, insertion, or deletion.
  • Take a mobile phone as the voice processing device as an example: after the voice processing device acquires the original text and the original voice, it displays the interface shown in FIG. 8 to the user, and the interface includes the original text and the original voice. As shown in FIG. 9, the user can perform a first operation 901 on the original text, such as modifying "Shenzhen" to "Guangzhou", or the aforementioned operations of inserting, deleting, and replacing.
  • only replacement is described as an example.
  • Optionally, after determining the overlapping text between the original text and the target text, the speech processing device presents the overlapping text to the user, determines the first text from the overlapping text according to the second operation of the user, and then determines the second text.
  • the second operation may be operations such as clicking, dragging, and sliding, which are not specifically limited here.
  • the second text is "Guangzhou”
  • the first text is "the weather is fine today”
  • the non-edited voice is the voice of the first text in the original voice.
  • For example, assume that one character corresponds to 2 speech frames; then the original speech corresponding to the original text includes 16 frames, and the non-edited speech corresponds to frames 1 to 4 and frames 9 to 16 in the original speech.
  • It can be understood that in practical applications the correspondence between text and speech frames is not necessarily 1:2 as in the above example; the example is only for the convenience of understanding the non-edited region, and the number of frames corresponding to the original text is not specifically limited here.
  • Optionally, the voice processing device may display an interface as shown in Figure 10, which may include the second text, the target text, and the non-edited speech and edited speech in the original speech, wherein the second text is "Guangzhou", the target text is "the weather in Guangzhou is very good today", the non-edited speech is the speech corresponding to "the weather is fine today", and the edited speech is the speech corresponding to "Shenzhen".
  • the voice processing device further determines the non-edited voice in the original voice based on the target text, the original text, and the original voice.
  • the voice processing device receives an edit request sent by the user, where the edit request includes the original voice and the second text.
  • the edit request also includes the original text and/or speaker identification.
  • the editing request may also include the original voice and the target text.
  • Step 702 Obtain first voice features based on the non-edited voice.
  • The speech features in the embodiment of the present application can be used to represent characteristics of the speech (for example, timbre, prosody, emotion, or rhythm). Speech features can take many forms, such as speech frames, sequences, or vectors, which are not specifically limited here.
  • the speech features in the embodiment of the present application may specifically be parameters extracted from the above-mentioned representation forms through the aforementioned methods such as PLP, LPC, and MFCC.
  • At least one speech frame is selected from non-edited speech as the first speech feature.
  • In this way, the second speech feature is subsequently generated by combining the first speech feature of the context.
  • the text corresponding to at least one speech frame may be text adjacent to the second text in the first text.
  • the non-edited speech is coded through a coding model to obtain a target sequence, and the target sequence is used as the first speech feature.
  • the coding model may be CNN, RNN, etc., which are not limited here.
  • the first voice feature may also carry the voiceprint feature of the original voice.
  • the manner of acquiring the voiceprint feature may be directly acquiring, or obtaining the voiceprint feature by recognizing the original voice, and the like.
  • the subsequently generated second voice features also carry the voiceprint features of the original voice, thereby improving the similarity between the target edited voice and the original voice.
  • Introducing the voiceprint feature makes the subsequently predicted speech features more similar to the voiceprint of the speaker of the original voice.
  • the voice processing device can also obtain the speaker identification of the original voice, so that when there are multiple speakers, the voice corresponding to the corresponding speaker can be matched, and the similarity between the subsequent target edited voice and the original voice can be improved.
  • In the following, the case where speech frames are used as the speech feature (or, equivalently, where the speech feature is obtained from speech frames) is taken as an example for description.
  • at least one frame among the 1st frame to the 4th frame and the 9th frame to the 16th frame in the original speech is selected as the first speech feature.
  • the first speech feature is a Mel spectrum feature.
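  • A minimal sketch of selecting the first speech feature from the non-edited speech in the 16-frame example above, assuming numpy and an 80-dimensional mel spectrogram; the chosen frame indices and random values are illustrative assumptions.

```python
import numpy as np

mel = np.random.rand(16, 80)        # original speech: 16 mel-spectrogram frames (as in the example above)
edited_frames = set(range(4, 8))    # frames 5-8 (0-based 4..7) correspond to "Shenzhen"
non_edited_frames = [i for i in range(mel.shape[0]) if i not in edited_frames]

# First speech feature: e.g. a few non-edited frames adjacent to the edited region.
first_speech_feature = mel[[2, 3, 8, 9], :]
```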
  • Step 703 Based on the first speech feature and the second text, a second speech feature corresponding to the second text is obtained through a neural network.
  • After the speech processing device acquires the first speech feature, it may obtain the second speech feature corresponding to the second text through a neural network based on the first speech feature and the second text.
  • the neural network includes an encoder and a decoder.
  • the second text is input into the encoder to obtain a first vector corresponding to the second text, and then based on the first speech feature, the decoder decodes the first vector to obtain the second speech feature.
  • The prosody, timbre, and/or signal-to-noise ratio of the first speech feature can be the same as or similar to those of the second speech feature. Prosody can reflect the speaker's emotional state or speech form, and generally refers to intonation, pitch, stress emphasis, pauses, or rhythm.
  • an attention mechanism can be introduced between the encoder and the decoder to adjust the corresponding relationship between input and output.
  • Optionally, the target text where the second text is located may be introduced during the encoding process of the encoder, so that the first vector generated for the second text refers to the target text and describes the second text more accurately. That is, the second speech feature corresponding to the second text can be obtained through the neural network based on the first speech feature, the target text, and the mark information. Specifically, the target text and the mark information may be input into the encoder to obtain the first vector corresponding to the second text, and then the first vector is decoded by the decoder based on the first speech feature to obtain the second speech feature. The mark information is used to mark the second text in the target text.
  • the duration of the target edited speech may be corrected.
  • The specific steps of the correction may include: predicting the total duration through the prediction network, where the total duration is the total duration of all phonemes corresponding to the target text; splitting the total duration into a first duration and a second duration, where the first duration is the phoneme duration corresponding to the first text in the target text and the second duration is the phoneme duration corresponding to the second text in the target text; and then correcting the second duration according to the first duration and a third duration to obtain the first modified duration, where the third duration is the phoneme duration of the first text in the original speech.
  • the specific step of correcting may include: predicting a fourth duration based on the second text through a prediction network, the fourth duration being the total duration of all phonemes corresponding to the second text; obtaining the speech rate of the original speech; The fourth duration is corrected based on the speech rate to obtain a second corrected duration; and based on the first vector, the first speech feature and the second corrected duration, the second speech feature is obtained through a decoder.
  • the phoneme duration of the second text in the target text is corrected based on the difference between the phoneme duration of the first text in the original speech and the phoneme duration of the first text in the target text predicted by the prediction network.
  • The difference coefficient s is calculated by formula one from the real and predicted durations of the first text; a natural form consistent with the definitions below is s = (RP_1 + ... + RP_n) / (LP_1 + ... + LP_n), where:
  • n is the number of phonemes in the first text;
  • RP_k is the duration of the k-th phoneme of the first text in the original speech (i.e., the third duration);
  • LP_k is the predicted duration of the k-th phoneme of the first text (i.e., the first duration).
  • The first modified duration = s × the second duration.
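  • A minimal numeric sketch of this duration correction, using the assumed form of formula one above; all duration values are made up for illustration.

```python
# RP: real phoneme durations of the first text in the original speech (third duration).
# LP: predicted phoneme durations of the first text (first duration).
RP = [12, 9, 15, 10]
LP = [10, 10, 12, 10]

s = sum(RP) / sum(LP)                           # difference coefficient (assumed form of formula one)

predicted_second_duration = [8, 7, 9]           # predicted phoneme durations of the second text (second duration)
first_modified_duration = [s * d for d in predicted_second_duration]
```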
  • Optionally, the first vector can be upsampled using the modified duration (the first modified duration or the second modified duration) to obtain the second vector, and then the decoder decodes the second vector based on the first speech feature to obtain the second speech feature.
  • the upsampling here can be understood as extending or stretching the second duration corresponding to the first vector to the corrected duration corresponding to the second vector.
  • the decoder may also obtain the second speech feature through auto-regression, that is, adjust the second speech feature while generating the second speech feature.
  • the decoder in the embodiment of the present application may be a unidirectional decoder or a bidirectional decoder, which will be described separately below.
  • the decoder is a one-way decoder.
  • The decoder, based on the first speech feature, calculates the first vector or the second vector from the first direction of the target text, and uses the resulting speech frames as the second speech feature.
  • the first direction is a direction from one side of the target text to the other side of the target text.
  • the first direction can be understood as the forward order or reverse order of the target text (for related descriptions, please refer to the description about forward order and reverse order in the embodiment shown in FIG. 5 ).
  • the first speech feature and the first vector are input into the decoder to obtain the second speech feature.
  • input the first speech feature and the second vector into the decoder to obtain the second speech feature.
  • The decoder can be a bidirectional decoder (it can also be understood that the decoder includes a first decoder and a second decoder).
  • The aforementioned second text is in the middle area of the target text, which can be understood as meaning that the second text is not at either end of the target text.
  • the two-way decoder in the embodiment of the present application has many situations, which are described below:
  • the third speech feature output by the bidirectional decoder from the first direction is the speech feature corresponding to the second text
  • the fourth speech feature output by the bidirectional decoder from the second direction is the speech feature corresponding to the second text.
  • In this case, two complete speech features corresponding to the second text can be obtained from the left and right sides (i.e., forward and reverse order), and the second speech feature can be obtained from these two speech features.
  • the first decoder calculates the first vector or the second vector from the first direction of the target text based on the first speech feature to obtain a third speech feature (hereinafter referred to as LR) of the second text.
  • the second decoder calculates the first vector or the second vector from the second direction of the target text based on the first speech feature to obtain a fourth speech feature (hereinafter referred to as RL) of the second text. And generate the second speech feature according to the third speech feature and the fourth speech feature.
  • the first direction is a direction from one side of the target text to the other side of the target text
  • the second direction is opposite to the first direction (or understood as the second direction is from the other side of the target text to one side of the target text sideways).
  • the first direction may be the above-mentioned forward sequence, and the second direction may be the above-mentioned reverse sequence.
  • When decoding in the first direction, the speech frames in the non-edited speech adjacent to one side (also called the left side) of the second text can be used as the decoding condition to obtain N frames of LR.
  • When decoding in the second direction, the speech frames in the non-edited speech adjacent to the other side (also called the right side) of the second text can be used as the decoding condition to obtain N frames of RL.
  • the structure of the bidirectional decoder refer to FIG. 11 .
  • A frame at which the difference between LR and RL is smaller than a threshold can be used as the transition frame (its position is m, m ≤ N), or the frame with the smallest difference between LR and RL can be used as the transition frame.
  • The N frames of the second speech feature may include the first m frames of LR and the last N−m frames of RL, or the first N−m frames of LR and the last m frames of RL.
  • The difference between LR and RL can be understood as the distance between the corresponding vectors.
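  • A minimal sketch of choosing the transition frame and splicing the forward (LR) and backward (RL) predictions, assuming numpy and 80-dimensional frames; the random values are placeholders.

```python
import numpy as np

def splice_bidirectional(LR, RL):
    """Pick the position where LR and RL are closest, then take LR before it and RL after it."""
    diffs = np.linalg.norm(LR - RL, axis=1)     # per-frame distance between the two predictions
    m = int(np.argmin(diffs))                   # transition position (0-based)
    return np.concatenate([LR[:m + 1], RL[m + 1:]], axis=0)

LR = np.random.rand(4, 80)   # N = 4 frames decoded in the forward direction
RL = np.random.rand(4, 80)   # N = 4 frames decoded in the reverse direction
second_speech_feature = splice_bidirectional(LR, RL)   # still N frames
```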
  • the first vector or the second vector in this step may further include a third vector for identifying the speaker. It can also be understood that the third vector is used to identify the voiceprint feature of the original voice.
  • The LR frames corresponding to "Guangzhou" obtained by the first decoder include LR1, LR2, LR3, and LR4.
  • The RL frames corresponding to "Guangzhou" obtained by the second decoder include RL1, RL2, RL3, and RL4.
  • If the difference between LR2 and RL2 is the smallest, then LR1, LR2, RL3, RL4 or LR1, RL2, RL3, RL4 are used as the second speech feature.
  • In another case, the third speech feature output by the bidirectional decoder from the first direction is the speech feature corresponding to a third text in the second text, and the fourth speech feature output by the bidirectional decoder from the second direction is the speech feature corresponding to a fourth text in the second text.
  • In this case, two partial speech features corresponding to the second text can be obtained from the left and right sides (i.e., forward and reverse order), and the complete second speech feature can be obtained from these two partial speech features; that is, part of the speech features is taken from the forward direction, the other part is taken from the reverse direction, and the two parts are spliced to obtain the overall speech feature.
  • The first decoder obtains the LR frames corresponding to the third text (the character "Guang"), including LR1 and LR2.
  • The second decoder obtains the RL frames corresponding to the fourth text (the character "zhou"), including RL3 and RL4.
  • LR1, LR2, RL3, and RL4 are concatenated to obtain the second speech feature.
  • Step 704 generating a target editing voice corresponding to the second text based on the second voice feature.
  • After the voice processing device acquires the second speech feature, it may convert the second speech feature into the target edited speech corresponding to the second text by using a vocoder.
  • The vocoder can be a traditional vocoder (such as the Griffin-Lim algorithm) or a neural network vocoder (such as MelGAN or HiFi-GAN pre-trained with audio training data), which is not limited here.
  • the target editing voice corresponding to "Guangzhou” is shown in FIG. 12 .
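  • As a hedged illustration of the traditional (Griffin-Lim) vocoder path only, not the neural vocoders mentioned above, the sketch below inverts a predicted mel spectrogram to a waveform with librosa and writes it with soundfile; the shapes, sampling rate, and file name are assumptions.

```python
import numpy as np
import librosa
import soundfile as sf

second_speech_feature = np.random.rand(4, 80)                     # stand-in for the predicted frames of "Guangzhou"
mel = second_speech_feature.T                                     # (n_mels, n_frames), linear-power mel assumed
waveform = librosa.feature.inverse.mel_to_audio(mel, sr=16000)    # Griffin-Lim based inversion
sf.write("target_edited_speech.wav", waveform, 16000)
```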
  • Step 705 acquire the position of the second text in the target text. This step is optional.
  • Optionally, the position of the second text in the target text may be obtained in step 701.
  • the original speech and the original text may be aligned using the alignment technique in step 701 to determine the start and end positions of each phoneme in the original text in the original speech. And determine the position of the second text in the target text according to the start and end positions of each phoneme.
  • Step 706 splicing the target edited speech and non-edited speech based on the position to generate a target speech corresponding to the target text. This step is optional.
  • the position in the embodiment of the present application is used to splice the non-edited speech and the target edited speech.
  • The position can be the position of the second text in the target text, or the position of the first text in the target text, or the position of the non-edited speech in the original speech, or the position of the edited speech in the original speech.
  • Specifically, the original speech and the original text may be aligned by the alignment technique in step 701 to determine the start and end positions of each phoneme of the original text in the original speech, and the position of the non-edited speech or the edited speech in the original speech is determined according to the position of the first text in the original text. Furthermore, the speech processing device splices the target edited speech and the non-edited speech based on the position to obtain the target speech; that is, the edited area in the original speech is replaced with the target edited speech corresponding to the second text to obtain the target speech.
  • the non-edited voice corresponds to the first frame to the fourth frame and the ninth frame to the sixteenth frame in the original voice.
  • the target editing voice is LR 1 , LR 2 , RL 3 , RL 4 or LR 1 , RL 2 , RL 3 , RL 4 .
  • Splicing the target edited speech and the non-edited speech can be understood as replacing the 5th to 8th frames of the original speech with the four obtained frames to obtain the target speech; that is, the speech corresponding to "Shenzhen" in the original speech is replaced with the speech corresponding to "Guangzhou", yielding the target speech corresponding to the target text "The weather in Guangzhou is very good today".
  • Figure 12 shows the target speech corresponding to "The weather in Guangzhou is very good today".
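  • A minimal sketch of this frame-level splicing in the 16-frame example, assuming numpy mel-spectrogram frames; the random values are placeholders.

```python
import numpy as np

original_mel = np.random.rand(16, 80)    # original speech; frames 5-8 (0-based 4..7) are the edited region ("Shenzhen")
target_edited = np.random.rand(4, 80)    # frames predicted for "Guangzhou"

# Replace the edited region while keeping the non-edited frames unchanged.
target_mel = np.concatenate([original_mel[:4], target_edited, original_mel[8:]], axis=0)
# target_mel now corresponds to "The weather in Guangzhou is very good today"
```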
  • the voice processing device plays the target edited voice or the target voice.
  • the voice processing method provided in the embodiment of the present application includes step 701 to step 704 .
  • the voice processing method provided in the embodiment of the present application includes step 701 to step 705 .
  • the voice processing method provided in the embodiment of the present application includes step 701 to step 706 .
  • the various steps shown in FIG. 7 in the embodiment of the present application do not limit the time sequence relationship. For example: step 705 in the above method may also be performed after step 704, may also be before step 701, and may also be performed together with step 701.
  • In the embodiment of the present application, the second speech feature corresponding to the second text in the target text is obtained by using the first speech feature of the first text in the original speech; that is, the second speech feature of the second text in the target text is generated with reference to the first speech feature of the first text, so that the target edited speech/target speech (that is, the edited speech) sounds similar to the original speech, thereby improving the user experience.
  • the target voice is similar to the speech rate of the original voice, thereby improving user experience.
  • the original voice can be modified by directly modifying the original text to improve the user's operability for voice editing, and the edited target edited voice is highly similar to the original voice in terms of timbre and rhythm.
  • When the target speech is generated, the non-edited speech is not modified, and the second speech feature of the target edited speech is similar to the first speech feature of the non-edited speech, making it difficult for the user to hear the difference between the original speech and the target speech.
  • the voice processing method implemented independently by the terminal device or the cloud device is described above, and the voice processing method jointly performed by the terminal device and the cloud device is described below.
  • Embodiment 2 The terminal device and the cloud device jointly execute the voice processing method.
  • FIG. 13 an embodiment of the speech processing method provided by the embodiment of the present application.
  • The method can be executed jointly by the terminal device and the cloud device, or can be executed by components of the terminal device (such as a processor, a chip, or a chip system) together with components of the cloud device (such as a processor, a chip, or a chip system). This embodiment includes steps 1301 to 1310.
  • Step 1301 the terminal device obtains the original voice and the second text.
  • Step 1301 performed by the terminal device in this embodiment is similar to step 701 performed by the voice processing device in the foregoing embodiment shown in FIG. 7 , and will not be repeated here.
  • Step 1302 the terminal device sends the original voice and the second text to the cloud device.
  • After the terminal device obtains the original voice and the second text, it can send the original voice and the second text to the cloud device.
  • Optionally, if in step 1301 the terminal device obtains the original voice and the target text, the terminal device sends the original voice and the target text to the cloud device.
  • Step 1303 the cloud device obtains the non-edited voice based on the original voice and the second text.
  • Step 1303 performed by the cloud device in this embodiment is similar to the description of determining the non-edited voice in step 701 performed by the voice processing device in the embodiment shown in FIG. 7 , and will not be repeated here.
  • Step 1304 the cloud device acquires the first voice feature based on the non-edited voice.
  • Step 1305 the cloud device obtains the second speech feature corresponding to the second text through the neural network based on the first speech feature and the second text.
  • Step 1306 the cloud device generates a target editing voice corresponding to the second text based on the second voice feature.
  • Steps 1304 to 1306 performed by the cloud device in this embodiment are similar to steps 702 to 704 performed by the voice processing device in the embodiment shown in FIG. 7 , and will not be repeated here.
  • Step 1307 the cloud device sends the target editing voice to the terminal device. This step is optional.
  • the cloud device may send the target editing voice to the terminal device.
  • Step 1308 the terminal device or the cloud device obtains the position of the second text in the target text. This step is optional.
  • step 1309 the terminal device or the cloud device splices the target editing voice and the non-editing voice based on the position to generate a target voice corresponding to the target text.
  • This step is optional.
  • Step 1308 and step 1309 in this embodiment are similar to steps 705 to 706 performed by the voice processing device in the embodiment shown in FIG. 7 , and will not be repeated here.
  • Step 1308 and step 1309 in this embodiment may be executed by a terminal device or a cloud device.
  • Step 1310 the cloud device sends the target voice to the terminal device. This step is optional.
  • Optionally, if steps 1308 and 1309 are performed by the cloud device, the cloud device sends the target speech to the terminal device after obtaining the target speech. If steps 1308 and 1309 are performed by the terminal device, this step may be skipped.
  • Optionally, after obtaining the target edited speech or the target speech, the terminal device plays the target edited speech or the target speech.
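  • As a rough illustration only, the following Python sketch strings the cloud-side processing of steps 1303 to 1306 into a few functions. All function bodies are simplified placeholders, the names (align_non_edited, first_feature, acoustic_model, vocoder) are assumptions made for illustration rather than interfaces defined by this application, and speech is represented as a mel-spectrogram frame matrix for simplicity.

```python
import numpy as np

def align_non_edited(original_mels, edit_span):
    """Step 1303 (placeholder): keep the frames outside the edited span,
    i.e. the non-edited speech; a real system would use forced alignment."""
    start, end = edit_span
    return np.concatenate([original_mels[:start], original_mels[end:]], axis=0)

def first_feature(non_edited_mels, n_ctx=4):
    """Step 1304 (placeholder): take a few context frames as the first speech feature."""
    return non_edited_mels[-n_ctx:]

def acoustic_model(feat, second_text):
    """Step 1305 (placeholder): predict mel frames for the second text,
    conditioned on the first speech feature (here simply 2 frames per character)."""
    n_pred = 2 * len(second_text)
    return np.repeat(feat.mean(axis=0, keepdims=True), n_pred, axis=0)

def vocoder(mels):
    """Step 1306 (placeholder): turn the predicted mel frames into a waveform."""
    return mels.flatten()

original_mels = np.random.rand(16, 80)                # 16 frames x 80 mel bins
non_edited = align_non_edited(original_mels, (4, 8))  # frames 5-8 (1-based) are edited
target_edited = vocoder(acoustic_model(first_feature(non_edited), "广州"))
print(target_edited.shape)                            # (4 * 80,)
```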
  • the voice processing method provided in the embodiment of the present application may include: the cloud device generates the target edited voice, and sends the target edited voice to the terminal device, that is, the method includes steps 1301 to 1307.
  • the voice processing method provided by the embodiment of the present application may include: the cloud device generates the target edited voice, generates the target voice according to the target edited voice and the non-edited voice, and sends the target voice to the terminal device. That is, the method includes steps 1301 to 1306, and steps 1308 to 1310.
  • In another possible implementation, the speech processing method provided in this embodiment of the present application may include: the cloud device generates the target edited speech and sends the target edited speech to the terminal device, and the terminal device generates the target speech from the target edited speech and the non-edited speech; that is, the method includes steps 1301 to 1309.
  • the cloud device can perform complex calculations to obtain the target editing voice or target voice and return it to the terminal device, which can reduce the computing power and storage space of the terminal device.
  • In addition, the target edited speech corresponding to the modified text can be generated according to the speech features of the non-edited region in the original speech, and then spliced with the non-edited speech to generate the target speech corresponding to the target text.
  • The user can obtain the target edited speech corresponding to the modified text (that is, the second text) simply by modifying the text in the original text, which improves the user's experience of text-based speech editing.
  • When the target speech is generated, the non-edited speech is not modified, and the second speech feature of the target edited speech is similar to the first speech feature of the non-edited speech, so that it is difficult for the user to hear the difference in speech features between the original speech and the target speech.
  • An embodiment of the speech processing device in the embodiment of the present application includes:
  • the acquiring unit 1401 is configured to acquire the original speech and the second text, where the second text is the text in the target text other than the first text, both the target text and the original text corresponding to the original speech include the first text, and the speech corresponding to the first text in the original speech is the non-edited speech;
  • the obtaining unit 1401 is further configured to obtain the first speech feature based on the non-edited speech
  • a processing unit 1402 configured to obtain a second speech feature corresponding to the second text through a neural network based on the first speech feature and the second text;
  • a generating unit 1403, configured to generate a target editing speech corresponding to the second text based on the second speech feature.
  • the voice processing device in this embodiment also includes:
  • the splicing unit 1404 is configured to splice the target edited speech and the non-edited speech based on the position to obtain the target speech corresponding to the target text.
  • the first prediction unit 1405 is configured to predict a first duration and a second duration through the prediction network based on the target text, where the first duration is the phoneme duration corresponding to the first text in the target text, and the second duration is the phoneme duration corresponding to the second text in the target text;
  • the first modification unit 1406 is configured to modify the second duration based on the first duration and the third duration to obtain the first modified duration, and the third duration is the phoneme duration of the first text in the original speech;
  • the second prediction unit 1407 is configured to predict the fourth duration based on the second text through the prediction network, and the fourth duration is the total duration of all phonemes corresponding to the second text;
  • the second correction unit 1408 is configured to correct the fourth duration based on the speech rate to obtain a second corrected duration;
  • the cloud device may further include a sending unit 1409, configured to send the target edited speech or the target speech to the terminal device (a sketch of the duration corrections performed by units 1405 to 1408 follows).
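  • As an illustration of the two duration-correction paths handled by units 1405 to 1408, the short sketch below computes the first corrected duration from the ratio of the third duration to the first duration, and the second corrected duration from the speech rate of the original speech. The helper names, the per-phoneme duration lists, and the speech-rate convention are assumptions for illustration only.

```python
def correct_by_ratio(first_dur, second_dur, third_dur):
    """Units 1405/1406: scale the predicted second-text durations by
    s = sum(third duration) / sum(first duration)."""
    s = sum(third_dur) / sum(first_dur)
    return [d * s for d in second_dur]          # first corrected duration

def correct_by_speech_rate(n_phonemes, speech_rate):
    """Units 1407/1408 (one simple choice): the second corrected duration is the
    time the second-text phonemes would take at the original speaking rate
    (speech_rate given in phonemes per second)."""
    return n_phonemes / speech_rate             # second corrected duration

first_dur = [0.10, 0.12, 0.11]    # predicted first-text phoneme durations (seconds)
third_dur = [0.13, 0.15, 0.14]    # real first-text phoneme durations in the original speech
second_dur = [0.10, 0.10]         # predicted second-text phoneme durations
print(correct_by_ratio(first_dur, second_dur, third_dur))
print(correct_by_speech_rate(n_phonemes=2, speech_rate=8.0))
```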
  • each unit in the voice processing device is similar to those described in the foregoing embodiments shown in FIG. 7 to FIG. 12 , and will not be repeated here.
  • In this embodiment, the processing unit 1402 obtains the second speech feature corresponding to the second text in the target text from the first speech feature of the first text in the original speech; that is, the processing unit 1402 generates the second speech feature of the second text with reference to the first speech feature of the first text, so that the target edited speech/target speech generated by the generating unit 1403 sounds similar to the original speech, thereby improving user experience.
  • In addition, the first correction unit 1406 or the second correction unit 1408 corrects the duration of the target edited speech, so that the speech rate of the target speech is similar to that of the original speech, thereby improving user experience.
  • Further, the original speech can be modified by directly modifying the original text, which improves the operability of speech editing for the user, and the edited target edited speech is highly similar to the original speech in dimensions such as timbre and prosody.
  • When the target speech is generated, the non-edited speech is not modified, and the second speech feature of the target edited speech is similar to the first speech feature of the non-edited speech, so that it is difficult for the user to hear the difference in speech features between the original speech and the target speech.
  • the terminal equipment includes:
  • the acquisition unit 1501 is configured to acquire the original speech and the second text, where the second text is the text in the target text other than the first text, both the target text and the original text corresponding to the original speech include the first text, and the speech corresponding to the first text in the original speech is the non-edited speech;
  • the sending unit 1502 is configured to send the original voice and the second text to the cloud device, and the original voice and the second text are used by the cloud device to generate a target editing voice corresponding to the second text;
  • the acquiring unit 1501 is also configured to receive the target editing voice sent by the cloud device.
  • each unit in the voice processing device is similar to the description of the steps performed by the terminal device in the embodiment shown in FIG. 13 above, and will not be repeated here.
  • the cloud device can perform complex calculations to obtain the target editing voice or target voice and return it to the terminal device, which can reduce the computing power and storage space of the terminal device.
  • In addition, the user can obtain the target edited speech corresponding to the modified text (that is, the second text) simply by modifying the text in the original text, which improves the user's experience of text-based speech editing.
  • the cloud device includes:
  • the receiving unit 1601 is configured to receive the original speech and the second text sent by the terminal device, where the second text is the text in the target text other than the first text, both the target text and the original text corresponding to the original speech include the first text, and the speech corresponding to the first text in the original speech is the non-edited speech;
  • An obtaining unit 1602 configured to obtain a first speech feature based on the non-edited speech
  • a processing unit 1603, configured to obtain a second speech feature corresponding to the second text through a neural network based on the first speech feature and the second text;
  • a generating unit 1604 configured to generate a target editing speech corresponding to the second text based on the second speech feature.
  • the generating unit 1604 is further configured to generate the target voice based on the target edited voice and the non-edited voice.
  • the voice processing device in this embodiment also includes:
  • the sending unit 1605 is configured to send the target editing voice or the target voice to the terminal device.
  • each unit in the voice processing device is similar to the description of the steps performed by the cloud device in the embodiment shown in FIG. 13 , and will not be repeated here.
  • the cloud device can perform complex calculations to obtain the target editing voice or target voice and return it to the terminal device, which can reduce the computing power and storage space of the terminal device.
  • In addition, the user can obtain the target edited speech corresponding to the modified text (that is, the second text) simply by modifying the text in the original text, which improves the user's experience of text-based speech editing.
  • In addition, when the generating unit 1604 generates the target speech, the non-edited speech is not modified, and the second speech feature of the target edited speech is similar to the first speech feature of the non-edited speech, so that when the user listens to the original speech and the target speech, it is difficult to hear the difference in speech features between them.
  • the embodiment of the present application provides another voice processing device.
  • the voice processing device can be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sales (POS), a vehicle-mounted computer, etc.
  • the voice processing device is a mobile phone as an example:
  • FIG. 17 is a block diagram showing a partial structure of a mobile phone related to the voice processing device provided by the embodiment of the present application.
  • the mobile phone includes: a radio frequency (radio frequency, RF) circuit 1710, a memory 1720, an input unit 1730, a display unit 1740, a sensor 1750, an audio circuit 1760, a wireless fidelity (wireless fidelity, WiFi) module 1770, and a processor 1780 , and power supply 1790 and other components.
  • The RF circuit 1710 can be used to receive and send signals during information transmission and reception or during a call. In particular, after downlink information from a base station is received, it is delivered to the processor 1780 for processing; in addition, uplink data is sent to the base station.
  • the RF circuit 1710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (low noise amplifier, LNA), a duplexer, and the like.
  • RF circuitry 1710 may also communicate with networks and other devices via wireless communications.
  • The above wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), e-mail, short message service (SMS), and the like.
  • the memory 1720 can be used to store software programs and modules, and the processor 1780 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1720 .
  • The memory 1720 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (such as a sound playback function and an image playback function), and the data storage area may store data created based on use of the mobile phone (such as audio data and a phonebook), and the like.
  • the memory 1720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage devices.
  • the input unit 1730 can be used to receive input numbers or character information, and generate key signal input related to user settings and function control of the mobile phone.
  • the input unit 1730 may include a touch panel 1731 and other input devices 1732 .
  • The touch panel 1731, also referred to as a touch screen, can collect a touch operation performed by the user on or near it (for example, an operation performed by the user on or near the touch panel 1731 by using any suitable object or accessory such as a finger or a stylus), and drive a corresponding connection apparatus according to a preset program.
  • the touch panel 1731 may include two parts, a touch detection device and a touch controller.
  • The touch detection device detects the touch orientation of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends the coordinates to the processor 1780, and can receive and execute commands sent by the processor 1780.
  • the touch panel 1731 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 1730 may also include other input devices 1732 .
  • other input devices 1732 may include but not limited to one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), trackball, mouse, joystick, and the like.
  • the display unit 1740 may be used to display information input by or provided to the user and various menus of the mobile phone.
  • the display unit 1740 may include a display panel 1741.
  • the display panel 1741 may be configured in the form of a liquid crystal display (liquid crystal display, LCD) or an organic light-emitting diode (OLED).
  • Further, the touch panel 1731 can cover the display panel 1741. When the touch panel 1731 detects a touch operation on or near it, the touch panel 1731 transmits the operation to the processor 1780 to determine the type of the touch event, and then the processor 1780 provides a corresponding visual output on the display panel 1741 according to the type of the touch event.
  • Although in FIG. 17 the touch panel 1731 and the display panel 1741 are used as two independent components to implement the input and output functions of the mobile phone, in some embodiments the touch panel 1731 and the display panel 1741 may be integrated to implement the input and output functions of the mobile phone.
  • the handset may also include at least one sensor 1750, such as a light sensor, motion sensor, and other sensors.
  • Specifically, the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display panel 1741 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 1741 and/or the backlight when the mobile phone is moved close to the ear.
  • As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in various directions (generally on three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that identify the posture of the mobile phone (such as switching between landscape and portrait modes, related games, and magnetometer attitude calibration) and for vibration-recognition-related functions (such as a pedometer and tapping); other sensors that may also be configured on the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.
  • the audio circuit 1760, the speaker 1761, and the microphone 1762 can provide an audio interface between the user and the mobile phone.
  • The audio circuit 1760 can convert received audio data into an electrical signal and transmit it to the speaker 1761, and the speaker 1761 converts it into a sound signal for output; on the other hand, the microphone 1762 converts a collected sound signal into an electrical signal, which is received by the audio circuit 1760 and converted into audio data; the audio data is then processed by the processor 1780 and sent, for example, to another mobile phone through the RF circuit 1710, or output to the memory 1720 for further processing.
  • WiFi is a short-distance wireless transmission technology.
  • the mobile phone can help users send and receive emails, browse web pages, and access streaming media through the WiFi module 1770. It provides users with wireless broadband Internet access.
  • Although FIG. 17 shows the WiFi module 1770, it can be understood that the WiFi module is not an essential component of the mobile phone.
  • The processor 1780 is the control center of the mobile phone and connects all parts of the entire mobile phone through various interfaces and lines. By running or executing the software programs and/or modules stored in the memory 1720 and invoking the data stored in the memory 1720, the processor 1780 performs various functions of the mobile phone and processes data, so as to monitor the mobile phone as a whole.
  • The processor 1780 may include one or more processing units. Preferably, the processor 1780 may integrate an application processor and a modem processor, where the application processor mainly processes the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 1780.
  • the mobile phone also includes a power supply 1790 (such as a battery) for supplying power to various components.
  • the power supply can be logically connected to the processor 1780 through the power management system, so that functions such as charging, discharging, and power consumption management can be realized through the power management system.
  • the mobile phone may also include a camera, a Bluetooth module, etc., which will not be repeated here.
  • In this embodiment of the present application, the processor 1780 included in the terminal device can perform the functions of the speech processing device in the foregoing embodiment shown in FIG. 7, or perform the functions of the terminal device in the foregoing embodiment shown in FIG. 13, which will not be repeated here.
  • the voice processing device may be a cloud device.
  • the cloud device may include a processor 1801 , a memory 1802 and a communication interface 1803 .
  • the processor 1801, the memory 1802 and the communication interface 1803 are interconnected by wires.
  • the memory 1802 stores program instructions and data.
  • the memory 1802 stores program instructions and data corresponding to the steps executed by the speech processing device in the aforementioned embodiment corresponding to FIG. 7 .
  • the program instructions and data corresponding to the steps executed by the cloud device in the aforementioned embodiment corresponding to FIG. 13 are stored.
  • The processor 1801 is configured to perform the steps performed by the speech processing device in any one of the foregoing embodiments shown in FIG. 7, or to perform the steps performed by the cloud device in any one of the foregoing embodiments shown in FIG. 13.
  • the communication interface 1803 may be used for receiving and sending data, and for performing steps related to acquisition, sending, and receiving in any of the embodiments shown in FIG. 7 or FIG. 13 .
  • the cloud device may include more or fewer components than those shown in FIG. 18 , which is only an example in the present application and not limited thereto.
  • the disclosed system, device and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of units is only a logical function division, and there may be other division manners in actual implementation.
  • For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be fully or partially realized by software, hardware, firmware or any combination thereof.
  • When the integrated units are implemented using software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (solid state disk, SSD)), etc.


Abstract

A speech processing method and a related device, applicable to scenarios such as a user recording a short video or a teacher recording a lecture. The method includes: obtaining an original speech and a second text (701), where both the original text corresponding to the original speech and the target text to which the second text belongs include a first text, and the speech corresponding to the first text in the original speech is a non-edited speech; obtaining a first speech feature based on the non-edited speech (702); obtaining, through a neural network, a second speech feature corresponding to the second text based on the first speech feature and the second text (703); and generating, based on the second speech feature, a target edited speech corresponding to the second text (704). The method makes the target edited speech corresponding to the modified part sound similar to the non-edited speech corresponding to the correct text, improving user experience.

Description

Speech processing method and related device
This application claims priority to Chinese Patent Application No. 202110621213.6, filed with the China Patent Office on June 3, 2021 and entitled "Speech processing method and related device", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of this application relate to the field of artificial intelligence and the field of audio applications, and in particular, to a speech processing method and a related device.
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
目前,语音编辑具有非常重要的实用意义。比如,在用户录制短视频、老师在录制授课语音等场景下,经常会由于口误而导致语音中的某些内容出错。该种情况下,语音编辑便可帮助用户或老师方便又快速地修正原始语音中的错误内容,生成校正后的语音。常用的语音编辑方法是通过预先构建含有大量语音片段的数据库,从数据库中获取发音单元的片段,并用该片段替换原始语音中的错误片段,进而生成校正后的语音。
然而,上述语音编辑的方式依赖数据库中语音片段的多样性,在数据库中语音片段较少的情况下,会导致校正后的语音听感较差。
发明内容
本申请实施例提供了一种语音处理方法及相关设备,可以实现编辑语音的听感与原始语音的听感类似,提升用户体验。
本申请实施例第一方面提供了一种语音处理方法,可以应用于用户录制短视频、老师录制授课语音等场景。该方法可以由语音处理设备执行,也可以由语音处理设备的部件(例如处理器、芯片、或芯片系统等)执行。其中,该语音处理设备可以是终端设备也可以是云端设备,该方法包括:获取原始语音与第二文本,第二文本为目标文本中除了第一文本以外的文本,目标文本与原始语音对应的原始文本都包括第一文本,第一文本在原始语音中对应的语音为非编辑语音;基于非编辑语音获取第一语音特征;基于第一语音特征与第二文本通过神经网络得到第二文本对应的第二语音特征;基于第二语音特征生成第二文本对应的目标编辑语音。其中,第一语音特征可以与第二语音特征的韵律、音色和/或信噪比等相同或相近,韵律可以反映出发音者的情感状态或讲话形式等,韵律泛指语调、音调、重音强调、停顿或 节奏等特征。
另外,获取第二文本的方式有多种,可以是直接获取第二文本;也可以是先获取位置信息(也可以理解为是标记信息,用于指示第二文本在目标文本中的位置),在根据位置与目标文本获取第二文本,位置信息用于表示第二文本在目标文本中的位置;还可以是获取目标文本与原始文本(或者获取目标文本与原始语音,对原始语音进行识别得到原始文本),再基于原始文本与目标文本确定第二文本。
本申请实施例中,通过第一文本在原始语音中的第一语音特征获取目标文本中第二文本对应的第二语音特征,即通过参考原始文本中第一文本的第一语音特征生成目标文本中第二文本的第二语音特征,进而实现目标编辑语音的听感与原始语音的听感类似,提升用户体验。
可选地,在第一方面的一种可能的实现方式中,上述步骤:获取原始语音与第二文本,包括:接收终端设备发送的原始语音与第二文本;方法还包括:向终端设备发送目标编辑语音,目标编辑语音用于终端设备生成目标文本对应的目标语音。也可以理解为是交互场景,由云端设备进行复杂的计算操作,由终端设备执行简单的拼接操作,从终端设备处获取原始语音与第二文本,云端设备生成目标编辑语音之后,向终端设备发送目标编辑语音,再由终端设备进行拼接得到目标语音。
该种可能的实现方式中,在语音处理设备是云端设备的情况下,一方面,可以通过云端设备与终端设备的交互,由云端设备进行复杂的计算得到目标编辑语音并返给终端设备,可以减少终端设备的算力与存储空间。另一方面,可以根据原始语音中非编辑区域的语音特征生成修改文本对应的目标编辑语音,进而与非编辑语音生成目标文本对应的目标语音。
可选地,在第一方面的一种可能的实现方式中,上述步骤:获取原始语音与第二文本,包括:接收终端设备发送的原始语音与目标文本;方法还包括:基于非编辑语音与目标编辑语音生成目标文本对应的目标语音,向终端设备发送目标语音。
该种可能的实现方式中,接收终端设备发送的原始语音与目标文本,可以获取非编辑语音,并根据非编辑语音的第一语音特征生成第二文本对应的第二语音特征,进而根据声码器得到目标编辑语音,并拼接目标编辑语音与非编辑语音生成目标语音。相当于,处理过程都在语音处理设备,结果返回给终端设备。由云端设备进行复杂的计算得到目标语音并返给终端设备,可以减少终端设备的算力与存储空间。
可选地,在第一方面的一种可能的实现方式中,上述步骤:获取原始语音与第二文本,包括:接收来自用户的编辑请求,编辑请求中包括原始语音与第二文本。或者编辑请求中包括原始语音与目标文本。该目标文本可以理解为是用户想要生成语音对应的文本。
该种可能的实现方式中,用户可以通过对原始文本中的文本进行修改,得到修改文本(即第二文本)对应的目标编辑语音。提升用户基于文本进行语音编辑的编辑体验。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:获取第二文本在目标文本中的位置;基于位置拼接目标编辑语音与非编辑语音得到目标文本对应的目标语音。也可以理解为是用目标编辑语音替换原始语音中的编辑语音,该编辑语音为原始语音中除了非编辑语音以外的语音。
该种可能的实现方式中,可以根据第二文本在目标文本中的位置拼接目标编辑语音与非编辑语音。如果第一文本是原始文本与目标文本中的所有重叠文本,则可以在不改变原始语 音中非编辑语音的情况下生成所需文本(即目标文本)的语音。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于非编辑语音获取第一语音特征,包括:获取非编辑语音中的至少一个语音帧;基于至少一个语音帧获取第一语音特征,第一语音特征用于表示至少一个语音帧的特征,第一语音特征为特征向量或序列。另外,还可以获取目标语音(方式与前述类似),为了保证非编辑语音与目标编辑语音的衔接处更加平缓,在多个语音帧的情况下,选取的语音帧对应的文本可以与第二文本相近。
该种可能的实现方式中,一方面,通过非编辑语音中的语音帧获取第一语音特征,可以使得生成的目标编辑语音具有与非编辑语音相同或相近的语音特征,减少原始语音与目标编辑语音的听感差异。另一方面,在多个语音帧的情况下,选取的语音帧对应的文本可以与第二文本相近,进而在生成目标语音时,使得目标编辑语音与非编辑语音的衔接处更加平缓。另外,还可以通过非物理量的方式,例如,序列、向量的方式等体现语音特征。
可选地,在第一方面的一种可能的实现方式中,上述步骤中的至少一个语音帧对应的文本为第一文本中与第二文本相邻的文本。即第一语音特征对应的非编辑语音在目标语音中与非编辑语音相邻。
该种可能的实现方式中,通过第二文本的上下文的第一语音特征生成第二文本的语音特征,使得第二语音特征更加结合了上下文的第一语音特征。即通过上下文对应的语音帧预测第二文本对应的语音,可以使得第二文本的语音帧与上下文的语音帧的语音特征近似,实现第二文本的目标编辑语音与原始语音的听感类似。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第一语音特征与第二文本通过神经网络得到第二文本对应的第二语音特征,包括:基于第一语音特征、目标文本以及标记信息通过神经网络得到第二文本对应的第二语音特征,标记信息用于标记目标文本中的第二文本。该标记信息也可以理解为是位置信息,用于指示第二文本在目标文本中的位置。
该种可能的实现方式中,通过引入目标文本,在后续生成第二文本对应的语音特征时,可以参考整个目标文本,避免后续生成的目标编辑语音与原始语音中的非编辑语音拼接得到的目标语音没有关注目标文本。
可选地,在第一方面的一种可能的实现方式中,上述步骤:神经网络包括编码器与解码器,基于第一语音特征与第二文本通过神经网络得到第二文本对应的第二语音特征,包括:基于第二文本,通过编码器,获取第二文本对应的第一向量;基于第一向量与第一语音特征,通过解码器,获取第二语音特征。也可以理解是将第一向量与第一语音特征输入解码器得到第二语音特征。
该种可能的实现方式中,解码器以第一语音特征为条件对第一向量解码,可以使得生成的第二语音特征与第一语音特征类似,或者说生成的第二语音特征携带有第一语音特征中的类似特征(例如韵律、音色和/或信噪比等)。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第二文本,通过编码器,获取第二文本对应的第一向量,包括:基于目标文本,通过编码器,获取第一向量。也可以理解为是将目标文本以及位置信息输入编码器器得到第一向量,位置信息用于指示第二文本在目标文本中的位置。
该种可能的实现方式,在编码器编码过程中引入第二文本所在的目标文本,使得生成的 第二文本的第一向量参考了目标文本,使得该第一向量描述的第二文本更加准确。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:基于目标文本通过预测网络预测第一时长与第二时长,第一时长为第一文本在目标文本中对应的音素时长,第二时长为第二文本在目标文本中对应的音素时长;基于第一时长与第三时长修正第二时长,以得到第一修正时长,第三时长为第一文本在原始语音中的音素时长;基于第一向量与第一语音特征,通过解码器,获取第二语音特征,包括:基于第一向量、第一语音特征与第一修正时长,通过解码器,获取第二语音特征。
该种可能的实现方式中,为了保证第二文本对应的目标编辑语音的时长与非编辑语音在语速上一致,可以对目标编辑语音的时长进行修正。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第一时长与第三时长修正第二时长,以得到第一修正时长,包括:计算第三时长与第一时长的比值;基于比值与第二时长获取第一修正时长。
该种可能的实现方式中,利用第三时长与第一时长的比值修正第二时长。可以提升第二文本对应的目标编辑语音的时长与非编辑语音在语速上的一致程度。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第一向量、第一语音特征与第一修正时长,通过解码器,获取第二语音特征,包括:基于第一修正时长对第一向量进行上采样,以得到第二向量;基于第二向量与第一语音特征,通过解码器,获取第二语音特征。具体的,将第二向量与第一语音特征输入解码器中,得到第二语音特征。在解码器包括串联的多个编码单元时,第二向量与第一语音特征可以是输入的同一个编码单元,也可以是输入的不同编码单元等。
该种可能的实现方式中,通过第一修正时长对第一向量进行上采样,也可以理解为是利用第一修正时长对第一向量进行扩充得到第二向量,使得目标编辑语音的时长与非编辑语音在语速上一致。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:基于第二文本通过预测网络预测第四时长,第四时长为第二文本对应所有音素的总时长;获取原始语音的语速;基于语速修正第四时长,得到第二修正时长;基于第一向量与第一语音特征,通过解码器,获取第二语音特征,包括:基于第一向量、第一语音特征与第二修正时长,通过解码器,获取第二语音特征。
该种可能的实现方式中,利用原始语音的音素调整第二文本对应语音帧的时长,可以提升第二文本对应的目标编辑语音的时长与非编辑语音在语速上的一致程度。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第一向量与第一语音特征,通过解码器,获取第二语音特征,包括:基于解码器与第一语音特征从目标文本的正序或反序解码第一向量得到第二语音特征。例如,目标文本为“今天开心”,则正序为从“今”至“心”的顺序,反序为从“心”至“今”的顺序。
该种可能的实现方式中,编码器可以通过文本的正序或反序方向预测第二语音特征。
可选地,在第一方面的一种可能的实现方式中,上述的第二文本在目标文本的中间区域,或者说第二文本并不在目标文本的两端。基于第一向量与第一语音特征,通过解码器,获取第二语音特征,包括:基于解码器与第一语音特征从目标文本的正序解码第一向量得到第三 语音特征;基于解码器与第一语音特征从目标文本的反序解码第一向量得到第四语音特征;基于第三语音特征与第四语音特征获取第二语音特征。
该种可能的实现方式中,解码器为双向解码器,可以分别通过左右两侧(即正序反序)得到两种第二文本对应的语音特征,并根据两种语音特征得到第二语音特征,使得第二语音特征与第一文本在原始语音中的特征更加近似,提升目标编辑语音的听觉效果。
可选地,在第一方面的一种可能的实现方式中,上述的第二文本包括第三文本和第四文本,第三语音特征为第三文本对应的语音特征,第四语音特征为第四文本对应的语音特征;基于第三语音特征与第四语音特征获取第二语音特征,包括:拼接第三语音特征与第四语音特征得到第二语音特征。
该种可能的实现方式中,从正序的方向上取一部分语音特征,从反序的方向上取另一部分语音特征,并拼接一部分语音特征与另一部分语音特征得到整体的语音特征。
可选地,在第一方面的一种可能的实现方式中,上述步骤的第三语音特征为解码器基于正序得到的第二文本对应的语音特征,第四语音特征为解码器基于反序得到的第二文本对应的语音特征;基于第三语音特征与第四语音特征获取第二语音特征,包括:确定第三语音特征与第四语音特征中相似度大于第一阈值的语音特征为过渡语音特征;拼接第五语音特征与第六语音特征得到第二语音特征,第五语音特征为基于过渡语音特征在第三语音特征中的位置从第三语音特征中截取得到的,第六语音特征为基于过渡语音特征在第四语音特征中的位置从第四语音特征中截取得到的。
该种可能的实现方式中,通过两个完整语音特征中的过渡语音特征,从两个完整的语音特征中互补的方式选取第二语音特征,使得第二语音特征既参考了正序又参考了反序,提升第二语音特征与第一语音特征的相似程度。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第二语音特征生成第二文本对应的目标编辑语音,包括:基于第二语音特征,通过声码器,生成目标编辑语音。
该种可能的实现方式中,根据声码器将第二语音特征转化为目标编辑语音,进而使得目标编辑语音具有与原始语音相近的语音特征,提升用户的听感。
可选地,在第一方面的一种可能的实现方式中,上述步骤:第一语音特征携带有原始语音的声纹特征。其中,获取声纹特征的方式可以是直接获取,也可以是通过识别原始语音得到该声纹特征等。
该种可能的实现方式中,一方面,通过引入原始语音的声纹特征,使得后续生成的第二语音特征也携带有该原始语音的声纹特征,进而提升目标编辑语音与原始语音的相近程度。另一方面,在发音者(或者用户)的数量为多个的情况下,引入声纹特征可以提升后续预测的语音特征更加与原始语音的发音者的声纹相似。
可选地,在第一方面的一种可能的实现方式中,上述步骤还包括:基于目标文本、原始文本以及原始语音确定非编辑语音,具体可以是:基于目标文本与原始文本确定第一文本;基于第一文本、原始文本与原始语音确定非编辑语音。
该种可能的实现方式中,通过对比原始文本与原始语音,确定第一文本在原始语音中的非编辑语音,便于后续第一语音特征的生成。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于目标文本与原始文本确 定第一文本,包括:基于目标文本与原始文本确定重叠文本;向用户显示重叠文本;响应用户的第二操作,从重叠文本中确定第一文本。
该种可能的实现方式中,可以根据用户的操作确定第一文本,一方面可以提升用户语音编辑的可操作性,另一方面,相较于使用重叠文本,可以参考更多非编辑语音的语音特征,提升目标编辑语音的听感。
可选地,在第一方面的一种可能的实现方式中,上述的神经网络是通过以训练数据作为神经网络的输入,以损失函数的值小于第二阈值为目标对神经网络进行训练得到,训练数据包括训练语音以及与训练语音对应的训练文本;损失函数用于指示神经网络输出的语音特征与实际语音特征之间的差异,实际语音特征由训练语音获取。
该种可能的实现方式中,以减小损失函数的值为目标对神经网络进行训练,也就是不断缩小神经网络输出的语音特征与实际语音特征之间的差异。从而使得神经网络输出的第二语音特征更加准确。
可选地,在第一方面的一种可能的实现方式中,上述步骤:基于第一文本、原始文本以及原始语音确定非编辑语音,包括:确定原始文本中各个音素在原始语音的起止位置;基于起止位置与第一文本确定非编辑语音。
该种可能的实现方式中,根据音素在原始语音的起止位置与第一文本确定非编辑语音,使得确定的非编辑语音在音素维度上更加准确。
可选地,在第一方面的一种可能的实现方式中,上述的第一语音特征与第二语音特征为梅尔频谱特征。
本申请实施例第二方面提供了一种语音处理方法,可以应用于用户录制短视频、老师录制授课语音等场景。该方法可以由语音处理设备执行,也可以由语音处理设备的部件(例如处理器、芯片、或芯片系统等)执行。其中,该语音处理设备为终端设备,该方法包括:获取原始语音与第二文本,第二文本为目标文本中除了第一文本以外的文本,目标文本与原始语音对应的原始文本都包括第一文本,第一文本在原始语音中对应的语音为非编辑语音;向云端设备发送原始语音与第二文本,原始语音与第二文本用于云端设备生成第二文本对应的目标编辑语音;接收云端设备发送的目标编辑语音。
本申请实施例中,可以通过云端设备与终端设备的交互,由云端设备进行复杂的计算得到目标编辑语音并返给终端设备,可以减少终端设备的算力与存储空间。另一方面,可以根据原始语音中非编辑区域的语音特征生成修改文本对应的目标编辑语音,进而与非编辑语音生成目标文本对应的目标语音。
可选地,在第二方面的一种可能的实现方式中,上述步骤:获取原始语音与第二文本,包括:接收来自用户的编辑请求,编辑请求中包括原始语音与第二文本。或者编辑请求中包括原始语音与目标文本。该目标文本可以理解为是用户想要生成语音对应的文本。
该种可能的实现方式中,用户可以通过对原始文本中的文本进行修改,得到修改文本(即第二文本)对应的目标编辑语音。提升用户基于文本进行语音编辑的编辑体验。
本申请实施例第三方面提供了一种语音处理方法,可以应用于用户录制短视频、老师录制授课语音等场景。该方法可以由语音处理设备执行,也可以由语音处理设备的部件(例如处理器、芯片、或芯片系统等)执行。其中,该语音处理设备为云端设备,该方法包括:接 收终端设备发送的原始语音与第二文本,第二文本为目标文本中除了第一文本以外的文本,目标文本与原始语音对应的原始文本都包括第一文本,第一文本在原始语音中对应的语音为非编辑语音;基于非编辑语音获取第一语音特征;基于第一语音特征与第二文本通过神经网络得到第二文本对应的第二语音特征;基于第二语音特征生成第二文本对应的目标编辑语音。
本申请实施例中,通过第一文本在原始语音中的第一语音特征获取目标文本中第二文本对应的第二语音特征,即通过参考原始文本中第一文本的第一语音特征生成目标文本中第二文本的第二语音特征,进而实现目标编辑语音的听感与原始语音的听感类似,提升用户体验。
可选地,在第三方面的一种可能的实现方式中,上述步骤还包括:向终端设备发送目标编辑语音。
该种可能的实现方式中,由云端设备进行复杂的计算得到目标编辑语音并返给终端设备,可以减少终端设备的算力与存储空间。
可选地,在第三方面的一种可能的实现方式中,上述步骤还包括:基于目标编辑语音与非编辑语音生成目标语音;向终端设备发送目标语音。
该种可能的实现方式中,由云端设备进行复杂的计算得到目标语音并返给终端设备,可以减少终端设备的算力与存储空间。
本申请第四方面提供一种语音处理设备,该语音处理设备可以应用于用户录制短视频、老师录制授课语音等场景。其中,该语音处理设备可以是终端设备也可以是云端设备,该语音处理设备包括:获取单元,用于获取原始语音与第二文本,第二文本为目标文本中除了第一文本以外的文本,目标文本与原始语音对应的原始文本都包括第一文本,第一文本在原始语音中对应的语音为非编辑语音;获取单元,还用于基于非编辑语音获取第一语音特征;处理单元,用于基于第一语音特征与第二文本通过神经网络得到第二文本对应的第二语音特征;生成单元,用于基于第二语音特征生成第二文本对应的目标编辑语音。其中,第一语音特征可以与第二语音特征的韵律、音色和/或信噪比等相同或相近,韵律可以反映出发音者的情感状态或讲话形式等,韵律泛指语调、音调、重音强调、停顿或节奏等特征。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,具体用于接收终端设备发送的原始语音与第二文本。语音处理设备还包括:发送单元,用于向终端设备发送目标编辑语音,目标编辑语音用于终端设备生成目标文本对应的目标语音。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,具体用于接收终端设备发送的原始语音与目标文本。生成单元,还用于基于非编辑语音与目标编辑语音生成目标文本对应的目标语音,语音处理设备还包括:发送单元,用于向终端设备发送目标语音。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,具体用于接收来自用户的编辑请求,编辑请求中包括原始语音与第二文本。或者编辑请求中包括原始语音与目标文本。该目标文本可以理解为是用户想要生成语音对应的文本。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,还用于获取第二文本在目标文本中的位置;语音处理设备还包括:拼接单元,用于基于位置拼接目标编辑语音与非编辑语音得到目标文本对应的目标语音。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,具体用于获取非编辑语音中的至少一个语音帧;获取单元,具体用于基于至少一个语音帧获取第一语音特征,第 一语音特征用于表示至少一个语音帧的特征,第一语音特征为特征向量或序列。
可选地,在第四方面的一种可能的实现方式中,上述的至少一个语音帧对应的文本为第一文本中与第二文本相邻的文本。
可选地,在第四方面的一种可能的实现方式中,上述的处理单元,具体用于基于第一语音特征、目标文本以及标记信息通过神经网络得到第二文本对应的第二语音特征,标记信息用于标记目标文本中的第二文本。
可选地,在第四方面的一种可能的实现方式中,上述的神经网络包括编码器与解码器,处理单元,具体用于基于第二文本,通过编码器,获取第二文本对应的第一向量;处理单元,具体用于基于第一向量与第一语音特征,通过解码器,获取第二语音特征。
可选地,在第四方面的一种可能的实现方式中,上述的处理单元,具体用于基于目标文本,通过编码器,获取第一向量。
可选地,在第四方面的一种可能的实现方式中,上述的语音处理设备还包括:第一预测单元,用于基于目标文本通过预测网络预测第一时长与第二时长,第一时长为第一文本在目标文本中对应的音素时长,第二时长为第二文本在目标文本中对应的音素时长;第一修正单元,用于基于第一时长与第三时长修正第二时长,以得到第一修正时长,第三时长为第一文本在原始语音中的音素时长;处理单元,具体用于基于第一向量、第一语音特征与第一修正时长,通过解码器,获取第二语音特征。
可选地,在第四方面的一种可能的实现方式中,上述的第一修正单元,具体用于计算第三时长与第一时长的比值;基于比值与第二时长获取第一修正时长。
可选地,在第四方面的一种可能的实现方式中,上述的处理单元,具体用于基于第一修正时长对第一向量进行上采样,以得到第二向量;处理单元,具体用于基于第二向量与第一语音特征,通过解码器,获取第二语音特征。具体的,处理单元,具体用于将第二向量与第一语音特征输入解码器中,得到第二语音特征。在解码器包括串联的多个编码单元时,第二向量与第一语音特征可以是输入的同一个编码单元,也可以是输入的不同编码单元等。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,还用于获取原始语音的语速;语音处理设备还包括:第二预测单元,用于基于第二文本通过预测网络预测第四时长,第四时长为第二文本对应所有音素的总时长;第二修正单元,用于基于语速修正第四时长,得到第二修正时长;处理单元,具体用于基于第一向量、第一语音特征与第二修正时长,通过解码器,获取第二语音特征。
可选地,在第四方面的一种可能的实现方式中,上述的处理单元,具体用于基于解码器与第一语音特征从目标文本的正序或反序解码第一向量得到第二语音特征。
可选地,在第四方面的一种可能的实现方式中,上述的第二文本在目标文本的中间区域,处理单元,具体用于基于解码器与第一语音特征从目标文本的正序解码第一向量得到第三语音特征;处理单元,具体用于基于解码器与第一语音特征从目标文本的反序解码第一向量得到第四语音特征;处理单元,具体用于基于第三语音特征与第四语音特征获取第二语音特征。
可选地,在第四方面的一种可能的实现方式中,上述的第二文本包括第三文本和第四文本,第三语音特征为第三文本对应的语音特征,第四语音特征为第四文本对应的语音特征;处理单元,具体用于拼接第三语音特征与第四语音特征得到第二语音特征。
可选地,在第四方面的一种可能的实现方式中,上述的第三语音特征为解码器基于正序得到的第二文本对应的语音特征,第四语音特征为解码器基于反序得到的第二文本对应的语音特征;处理单元,具体用于确定第三语音特征与第四语音特征中相似度大于第一阈值的语音特征为过渡语音特征;处理单元,具体用于拼接第五语音特征与第六语音特征得到第二语音特征,第五语音特征为基于过渡语音特征在第三语音特征中的位置从第三语音特征中截取得到的,第六语音特征为基于过渡语音特征在第四语音特征中的位置从第四语音特征中截取得到的。
可选地,在第四方面的一种可能的实现方式中,上述的生成单元,具体用于基于第二语音特征,通过声码器,生成目标编辑语音。
可选地,在第四方面的一种可能的实现方式中,上述的第一语音特征携带有原始语音的声纹特征。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,还用于基于目标文本、原始文本以及原始语音确定非编辑语音,获取单元,具体用于基于目标文本与原始文本确定第一文本;获取单元,具体用于基于第一文本、原始文本与原始语音确定非编辑语音。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,具体用于基于目标文本与原始文本确定重叠文本;获取单元,具体用于向用户显示重叠文本;获取单元,具体用于响应用户的第二操作,从重叠文本中确定第一文本。
可选地,在第四方面的一种可能的实现方式中,上述的神经网络是通过以训练数据作为神经网络的输入,以损失函数的值小于第二阈值为目标对神经网络进行训练得到,训练数据包括训练语音以及与训练语音对应的训练文本;损失函数用于指示神经网络输出的语音特征与实际语音特征之间的差异,实际语音特征由训练语音获取。
可选地,在第四方面的一种可能的实现方式中,上述的获取单元,具体用于确定原始文本中各个音素在原始语音的起止位置;获取单元,具体用于基于起止位置与第一文本确定非编辑语音。
可选地,在第四方面的一种可能的实现方式中,上述的第一语音特征与第二语音特征为梅尔频谱特征。
本申请第五方面提供一种语音处理设备,该语音处理设备可以应用于用户录制短视频、老师录制授课语音等场景。其中,该语音处理设备可以是终端设备。该语音处理设备包括:获取单元,用于获取原始语音与第二文本,第二文本为目标文本中除了第一文本以外的文本,目标文本与原始语音对应的原始文本都包括第一文本,第一文本在原始语音中对应的语音为非编辑语音;发送单元,用于向云端设备发送原始语音与第二文本,原始语音与第二文本用于云端设备生成第二文本对应的目标编辑语音;获取单元,还用于接收云端设备发送的目标编辑语音。其中,获取单元也可以理解为是输入单元,发送单元也可以理解为是输出单元。
可选地,在第五方面的一种可能的实现方式中,上述的获取单元,具体用于接收来自用户的编辑请求,编辑请求中包括原始语音与第二文本。或者编辑请求中包括原始语音与目标文本。该目标文本可以理解为是用户想要生成语音对应的文本。
本申请第六方面提供一种语音处理设备,该语音处理设备可以应用于用户录制短视频、老师录制授课语音等场景。其中,该语音处理设备可以是云端设备,该语音处理设备包括: 接收单元,用于接收终端设备发送的原始语音与第二文本,第二文本为目标文本中除了第一文本以外的文本,目标文本与原始语音对应的原始文本都包括第一文本,第一文本在原始语音中对应的语音为非编辑语音;获取单元,用于基于非编辑语音获取第一语音特征;处理单元,用于基于第一语音特征与第二文本通过神经网络得到第二文本对应的第二语音特征;生成单元,用于基于第二语音特征生成第二文本对应的目标编辑语音。
可选地,在第六方面的一种可能的实现方式中,上述语音处理设备还包括:发送单元,用于向终端设备发送目标编辑语音。
可选地,在第六方面的一种可能的实现方式中,上述的生成单元,还用于基于目标编辑语音与非编辑语音生成目标语音;发送单元,用于向终端设备发送目标语音。
本申请第七方面提供了一种语音处理设备,该语音处理设备执行前述第一方面或第一方面的任意可能的实现方式中的方法,或者执行前述第二方面或第二方面的任意可能的实现方式中的方法,或者执行前述第三方面或第三方面的任意可能的实现方式中的方法。
本申请第八方面提供了一种语音处理设备,包括:处理器,处理器与存储器耦合,存储器用于存储程序或指令,当程序或指令被处理器执行时,使得该语音处理设备实现上述第一方面或第一方面的任意可能的实现方式中的方法,或者使得该语音处理设备实现上述第二方面或第二方面的任意可能的实现方式中的方法,或者使得该语音处理设备实现上述第三方面或第三方面的任意可能的实现方式中的方法。
本申请第九方面提供了一种计算机可读介质,其上存储有计算机程序或指令,当计算机程序或指令在计算机上运行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法,或者使得计算机执行前述第二方面或第二方面的任意可能的实现方式中的方法,或者使得计算机执行前述第三方面或第三方面的任意可能的实现方式中的方法。
本申请第十方面提供了一种计算机程序产品,该计算机程序产品在计算机上执行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法,或者使得计算机执行前述第二方面或第二方面的任意可能的实现方式中的方法,或者使得计算机执行前述第三方面或第三方面的任意可能的实现方式中的方法。
其中,第三、第四、第六、第七、第八、第九、第十方面或者其中任一种可能实现方式所带来的技术效果可参见第一方面或第一方面不同可能实现方式所带来的技术效果,此处不再赘述。
其中,第五、第七、第八、第九、第十方面或者其中任一种可能实现方式所带来的技术效果可参见第二方面或第二方面不同可能实现方式所带来的技术效果,此处不再赘述。
从以上技术方案可以看出,本申请实施例具有以下优点:通过第一文本在原始语音中的第一语音特征获取目标文本中第二文本对应的第二语音特征,即通过参考原始文本中第一文本的第一语音特征生成目标文本中第二文本的第二语音特征,进而实现目标编辑语音的听感与原始语音的听感类似,提升用户体验。
附图说明
图1为本申请提供的一种系统架构的结构示意图;
图2为本申请提供的一种卷积神经网络结构示意图;
图3为本申请提供的另一种卷积神经网络结构示意图;
图4为本申请提供的一种芯片硬件结构示意图;
图5为本申请提供的一种神经网络的训练方法的示意性流程图;
图6为本申请提供的一种神经网络的结构示意图;
图7为本申请提供的语音处理方法一个流程示意图;
图8-图10为本申请提供的语音处理设备显示界面的几种示意图;
图11为本申请提供的一种双向解码器的结构示意图;
图12为本申请提供的语音处理设备显示界面的另一种示意图;
图13为本申请提供的语音处理方法另一个流程示意图;
图14-图18本申请提供的语音处理设备的几种结构示意图。
具体实施方式
本申请实施例提供了一种语音处理方法及相关设备,可以实现编辑语音的听感与原始语音的听感类似,提升用户体验。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获取的所有其他实施例,都属于本申请保护的范围。
为了便于理解,下面先对本申请实施例主要涉及的相关术语和概念进行介绍。
1、神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以X s和截距1为输入的运算单元,该运算单元的输出可以为:
$$h_{W,b}(X)=f\left(W^{T}X\right)=f\left(\sum_{s=1}^{n}W_{s}X_{s}+b\right)$$
其中,s=1、2、……n,n为大于1的自然数,W s为X s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
2、深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。当然,深度神经网络也可能不包括隐藏层,具体此处不做限定。
深度神经网络中的每一层的工作可以用数学表达式
$$\vec{y}=\alpha\left(W\cdot\vec{x}+\vec{b}\right)$$
来描述:从物理层面深度神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作,完成输入空间到输出空间的变换(即矩阵的行空间到列空间),这五种操作包括:1、升维/降维;2、放大/缩小;3、旋转;4、平移;5、“弯曲”。其中1、2、3的操作由
$W\cdot\vec{x}$
完成,4的操作由
$+\vec{b}$
完成,5的操作则由α()来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物,而是一类事物,空间是指这类事物所有个体的集合。其中,W是权重向量,该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文所述的输入空间到输出空间的空间变换,即每一层的权重W控制着如何变换空间。训练深度神经网络的目的,也就是最终获取训练好的神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。因此,神经网络的训练过程本质上就是学习控制空间变换的方式,更具体的就是学习权重矩阵。
3、卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使同一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习获取的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习获取合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。本申请实施例中的分离网络、识别网络、检测网络、深度估计网络等网络都可以是CNN。
4、循环神经网络(RNN)
在传统的神经网络中模型中,层与层之间是全连接的,每层之间的节点是无连接的。但是这种普通的神经网络对于很多问题是无法解决的。比如,预测句子的下一个单词是什么,因为一个句子中前后单词并不是独立的,一般需要用到前面的单词。循环神经网络(recurrent neural network,RNN)指的是一个序列当前的输出与之前的输出也有关。具体的表现形式为网络会对前面的信息进行记忆,保存在网络的内部状态中,并应用于当前输出的计算中。
5、损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程, 即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
6、从文本到语音
从文本到语音(text to speech,TTS)是将文本转换成语音的程序或软件系统。
7、声码器
声码器是一种声音信号处理模块或软件,可以将声学特征编码生成声音波形。
8、基频
当发声体由于振动而发出声音时,声音一般可以分解为许多单纯的正弦波,也就是说所有的自然声音基本都是由许多频率不同的正弦波组成的,其中频率最低的正弦波即为基音(即基频,可以用F0表示),而其他频率较高的正弦波则为泛音。
9、韵律
语音合成领域中,韵律泛指控制语调、音调、重音强调、停顿和节奏等功能的特征。韵律可以反映出说话者的情感状态或讲话形式等。
10、音素
音素(phone):是根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个动作构成一个音素。音素分为元音与辅音两大类。例如,汉语音节a(例如,一声:啊)只有一个音素,ai(例如四声:爱)有两个音素,dai(例如一声:呆)有三个音素等。
11、词向量(Embedding)
词向量也可以称为“词嵌入”、“向量化”、“向量映射”、“嵌入”等。从形式上讲,词向量是用一个稠密的向量表示一个对象,例如:用向量表示用户身份证标识号(identity document,ID)、物品ID等。
12、语音特征
语音特征:将经过处理的语音信号转换成一种简洁而有逻辑的表示形式,比实际信号更有鉴别性和可靠性。在获取一段语音信号后,可以从语音信号中提取语音特征。其中,提取方法通常为每个语音信号提取一个多维特征向量。语音信号的参数化表示方法有很多种,例如:感知线性预测(perceptual linear predictive,PLP)、线性预测编码(linear predictive coding,LPC)和频率倒谱系数(mel frequency cepstrum coefficient,MFCC)等。
目前,语音编辑的场景越来越多,例如,用户录制短视频、老师在录制授课语音等场景,为了修复由于口误带来的原始语音中的错误内容,通常会用到语音编辑。目前的语音编辑方式是从数据库中获取语音片段,并用该语音片段替换错误内容,进而生成校正后的语音。
然而,该种方式过于依赖数据库中存储的语音片段,若该语音片段与原始语音的音色、韵律、信噪比等相差较大,会导致校正后的语音前后不连贯、韵律不自然,导致校正后的语音听感较差。
为了解决上述问题,本申请提供一种语音编辑方法,通过参考待修改内容的上下文对应 的第一语音特征确定修改内容的第二语音特征,并基于第二语音特征生成第二文本对应目标编辑语音,进而实现目标编辑语音的听感与原始语音的听感类似,提升用户体验。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获取的所有其他实施例,都属于本申请保护的范围。
首先介绍本申请实施例提供的系统架构。
参见附图1,本申请实施例提供了一种系统架构10。如所述系统架构10所示,数据采集设备16用于采集训练数据,本申请实施例中训练数据包括训练语音以及与该训练语音对应的训练文本。并将训练数据存入数据库13,训练设备12基于数据库13中维护的训练数据训练得到目标模型/规则101。下面将更详细地描述训练设备12如何基于训练数据得到目标模型/规则101,该目标模型/规则101能够用于实现本申请实施例提供的语音处理方法,即,将文本通过相关预处理后输入该目标模型/规则101,即可得到该文本的语音特征。本申请实施例中的目标模型/规则101具体可以为神经网络。需要说明的是,在实际的应用中,所述数据库13中维护的训练数据不一定都来自于数据采集设备16的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备12也不一定完全基于数据库13维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备12训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图1所示的执行设备11,所述执行设备11可以是终端,如手机终端,平板电脑,笔记本电脑,AR/VR,车载终端等,还可以是服务器或者云端等。在附图1中,执行设备11配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备14向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:第一语音特征、目标文本以及标记信息,输入数据也可以包括第一语音特征与第二文本。另外,输入数据可以是用户输入的,也可以是用户通过其他设备上传的,当然还可以来自数据库,具体此处不做限定。
若输入数据包括第一语音特征、目标文本以及标记信息,则预处理模块113用于根据I/O接口112接收到的目标文本与标记信息进行预处理,在本申请实施例中,预处理模块113可以用于基于目标文本与标记信息确定目标文本中的目标编辑文本。若输入数据包括第一语音特征、第二文本,则预处理模块113用于根据I/O接口112接收到的目标文本与标记信息进行预处理,例如,将第二文本转化为音素等准备工作。
在执行设备11对输入数据进行预处理,或者在执行设备11的计算模块111执行计算等相关的处理过程中,执行设备11可以调用数据存储系统15中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统15中。
最后,I/O接口112将处理结果,如上述得到的第二语音特征返回给客户设备14,从而提供给用户。
值得说明的是,训练设备12可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果或为后续的其他处理提供输入。
在附图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备14可以自动地向I/O接口112发送输入数据,如果要求客户设备14自动发送输入数据需要获得用户的授权,则用户可以在客户设备14中设置相应权限。用户可以在客户设备14查看执行设备11输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备14也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库13。当然,也可以不经过客户设备14进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库13。
值得注意的是,附图1仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图1中,数据存储系统15相对执行设备11是外部存储器,在其它情况下,也可以将数据存储系统15置于执行设备11中。
如图1所示,根据训练设备12训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是神经网络,具体的,在本申请实施例提供的网络中,神经网络可以是循环神经网络、长短期记忆网络等。预测网络可以是卷积神经网络、循环神经网络等。
可选地,本申请实施例中的神经网络与预测网络可以是单独的两个网络,也可以是一个多任务的神经网络,其中一个任务是输出时长,另外一个任务是输出语音特征。
由于CNN是一种非常常见的神经网络,下面结合图2重点对CNN的结构进行详细的介绍。如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图2所示,卷积神经网络(CNN)100可以包括输入层110,卷积层/池化层120,以及神经网络层130其中池化层为可选的。
卷积层/池化层120:
卷积层:
如图2所示卷积层/池化层120可以包括如示例121-126层,在一种实现中,121层为卷积层,122层为池化层,123层为卷积层,124层为池化层,125为卷积层,126为池化层;在另一种实现方式中,121、122为卷积层,123为池化层,124、125为卷积层,126为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
以卷积层121为例,卷积层121可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的 整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化……该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取信息,从而帮助卷积神经网络100进行正确的预测。
当卷积神经网络100有多个卷积层的时候,初始的卷积层(例如121)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络100深度的加深,越往后的卷积层(例如126)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图2中120所示例的121-126各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像大小相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层130:
在经过卷积层/池化层120的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层120只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络100需要利用神经网络层130来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层130中可以包括多层隐含层(如图2所示的131、132至13n)以及输出层140,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在神经网络层130中的多层隐含层之后,也就是整个卷积神经网络100的最后层为输出层140,该输出层140具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络100的前向传播(如图2由110至140的传播为前向传播)完成,反向传播(如图2由140至110的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络100的损失及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图2所示的卷积神经网络100仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,如图3所示的多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层130进行处理。
下面介绍本申请实施例提供的一种芯片硬件结构。
图4为本申请实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器40。该芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图2所示的卷积神经网络中各层的算法均可在如图4所示的芯片中得以实现。
神经网络处理器40可以是神经网络处理器(neural-network processing unit,NPU),张量处理器(tensor processing unit,TPU),或者图形处理器(graphics processing unit,GPU)等一切适合用于大规模异或运算处理的处理器。以NPU为例:神经网络处理器NPU40作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路403,控制器404控制运算电路403提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路403内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路403是二维脉动阵列。运算电路403还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路403是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器402中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器401中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器408中。
向量计算单元407可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元407可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现种,向量计算单元能407将经处理的输出的向量存储到统一缓存器406。例如,向量计算单元407可以将非线性函数应用到运算电路403的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元407生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路403的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器406用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器405(direct memory access controller,DMAC)将外部存储器中的输入数据搬运到输入存储器401和/或统一存储器406、将外部存储器中的权重数据存入权重存储器402,以及将统一存储器506中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)410,用于通过总线实现主CPU、DMAC和取指存储器409之间进行交互。
与控制器404连接的取指存储器(instruction fetch buffer)409,用于存储控制器404使用的指令。
控制器404,用于调用指存储器409中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器406,输入存储器401,权重存储器402以及取指存储器409均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
其中,图2或图3所示的卷积神经网络中各层的运算可以由运算电路403或向量计算单元407执行。
首先,先对本申请实施例提供的语音处理方法所适用的应用场景进行描述。该语音处理方法可以应用于需要修改语音内容的场景,例如:用户录制短视频、老师在录制授课语音等场景。该语音处理方法可以适用于例如手机、计算机、可发声的拆戴式终端上的智能语音助手、智能音响等具有语音编辑功能的应用程序、软件或语音处理设备上。
其中,语音处理设备是一种用于服务用户的终端设备,或者云端设备。终端设备可以包括头戴显示设备(head mount display,HMD)、该头戴显示设备可以是虚拟现实(virtual reality,VR)盒子与终端的组合,VR一体机,个人计算机(personal computer,PC),增强现实(augmented reality,AR)设备,混合现实(mixed reality,MR)设备等,该终端设备还可以包括蜂窝电话(cellular phone)、智能电话(smart phone)、个人数字助理(personal digital assistant,PDA)、平板型电脑、膝上型电脑(laptop computer)、个人电脑(personal computer,PC)、车载终端等,具体此处不做限定。
下面结合附图对本申请实施例的神经网络、预测网络的训练方法、语音处理方法进行详细的介绍。
本申请实施例中的神经网络与预测网络可以是单独的两个网络,也可以是一个多任务的神经网络,其中一个任务是输出时长,另外一个任务是输出语音特征。
其次,结合图5对本申请实施例的神经网络的训练方法进行详细介绍。图5所示的训练方法可以由神经网络的训练装置来执行,该神经网络的训练装置可以是云服务设备,也可以是终端设备,例如,电脑、服务器等运算能力足以用来执行神经网络的训练方法的装置,也可以是由云服务设备和终端设备构成的系统。示例性地,训练方法可以由图1中的训练设备120、图4中的神经网络处理器40执行。
可选地,训练方法可以由CPU处理,也可以由CPU和GPU共同处理,也可以不用GPU,而使用其他适合用于神经网络计算的处理器,本申请不做限制。
图5所示的训练方法包括步骤501与步骤502。下面对步骤501与步骤502进行详细说明。
首先,先对预测网络的训练过程进行简单描述。本申请实施例中的预测网络可以是RNN、CNN等,具体此处不做限定。预测网络在训练阶段,输入是训练文本的向量,输出是训练文本中各个音素的时长。再不断缩小预测网络输出的训练文本中各个音素的时长与训练文本对应训练语音的实际时长之间的差异,进而得到训练好的预测网络。
步骤501,获取训练数据。
本申请实施例中的训练数据包括训练语音,或者包括训练语音以及与训练语音对应的训练文本。如果训练数据不包括训练文本,则可以通过识别训练语音的方式获取训练文本。
可选地,若发音者(或者用户)的数量为多个,为了后续预测的语音特征正确,训练数据中的训练语音特征还可以包括用户标识,或者包括训练语音的声纹特征,或者包括用于标识训练语音的声纹特征的向量。
可选地,训练数据还可以包括训练语音中各个音素的起止时长信息。
本申请实施例中获取训练数据可以是通过直接录制发声对象发声的方式获取,也可以是通过用户输入音频信息、视频信息的方式获取,还可以是通过接收采集设备发送的方式获取,在实际应用中,还有其他方式获取训练数据,对于训练数据的获取方式具体此处不做限定。
步骤502,以训练数据作为神经网络的输入,以损失函数的值小于第二阈值为目标对神经网络进行训练,得到训练好的神经网络。
可选地,训练数据可以进行一些预处理,例如上述所描述的如果训练数据包括训练语音,可以识别训练语音的方式获取训练文本,并将训练文本用音素表示输入神经网络。
在训练过程中,可以将整个训练文本当做目标编辑文本,并作为输入,以减小损失函数的值为目标对神经网络进行训练,也就是不断缩小神经网络输出的语音特征与训练语音对应的实际语音特征之间的差异。该训练过程可以理解为预测任务。损失函数可以理解为预测任务对应的损失函数。
本申请实施例中的神经网络具体可以是注意力机制模型,例如:transformer、tacotron2等。其中,注意力机制模型包括编码器-解码器,编码器或解码器的结构可以是循环神经网络、长短期记忆网络(long short-term memory,LSTM)等。
本申请实施例中的神经网络包括编码器(encoder)与解码器(decoder),编码器与解码器的结构类型可以是RNN、LSTM等,具体此处不做限定。编码器的作用是将训练文本编码为文本向量(以音素为单位的向量表示,每个输入对应一个向量),解码器的作用是根据文本向量得到文本对应的语音特征。解码器在训练过程中,每步的计算以上一步所对应的真实语音特征作为条件进行计算。
进一步的,为了保证前后语音的连贯,可以使用预测网络对文本向量对应的语音时长进行修正。即可以理解为根据训练语音中各个音素的时长对文本向量进行上采样(也可以理解为是对向量的帧数进行扩展),以得到对应帧数的向量。解码器的作用是根据上述对应帧数的向量得到文本对应的语音特征。
可选地,上述的解码器可以是单向解码器,也可以是双向解码器(即两个方向并行),具体此处不做限定。其中,两个方向是指训练文本的方向,也可以理解为是训练文本对应的向量的方向,还可以理解为是训练文本的正序或者反序,一个方向是训练文本的一侧指向训练文本的另一侧,另一个方向是训练文本的另一侧指向训练文本的一侧。
示例性的,若训练文本为:“中午吃饭了没”,则第一方向或正序可以是从“中”到“没”的方向,第二方向或反序可以是从“没”到“中”的方向。
若解码器是双向解码器,则两个方向(或者正反序)的解码器并行训练,且在训练过程中各自独立计算,不存在结果依赖。当然,如果预测网络与神经网络为一个多任务的网络,预测网络可以称为预测模块,则解码器可以根据训练文本对应的真实时长信息修正神经网络 输出的语音特征。
本申请实施例中的神经网络的架构可以参阅图6。其中,神经网络包括编码器与解码器。可选地,神经网络还可以包括预测模块与上采样模块。预测模块具体用于实现上述预测网络的功能,上采样模块具体用于实现上述根据训练语音中各个音素的时长对文本向量进行上采样的过程,具体此处不再赘述。
需要说明的是,训练过程也可以不采用前述训练方法而采用其他训练方法,此处不做限定。
下面结合附图对本申请实施例的语音处理方法进行详细的介绍。
首先,本申请实施例提供的语音处理方法可以应用于替换场景、插入场景或删除场景。上述场景可以理解为是对原始文本对应的原始语音进行替换、插入、删除等得到目标语音,实现目标语音与原始语音的听感类似和/或提升目标语音的流畅度。其中,原始语音可以认为是包括待修改的语音,目标语音为用户想修正原始语音后得到的语音。
为了方便理解,下面对上述场景的几种举例进行描述:
一、对于替换场景。
原始文本为“今天深圳天气很好”,目标文本为“今天广州天气很好”。其中,重叠文本为“今天天气很好”。原始文本中的非重叠文本为“深圳”,目标文本中的非重叠文本为“广州”。目标文本包括第一文本与第二文本,第一文本为重叠文本或重叠文本中的部分文本。第二文本为目标文本中除了第一文本以外的文本。例如:若第一文本为“今天天气很好”,则第二文本为“广州”。若第一文本为“今气很好”,则第二文本为“天广州天”。
二、对于插入场景。
原始文本为“今天深圳天气很好”,目标文本为“今天上午深圳天气很好”。其中,重叠文本为“今天深圳天气很好”。目标文本中的非重叠文本为“上午”。为了实现目标语音前后的连贯,可以将该插入场景看作为将原始语音中的“天深”替换为“天上午深”的替换场景。即第一文本为“今圳天气很好”,第二文本为“天上午深”。
三、对于删除场景。
原始文本为“今天深圳天气很好”,目标文本为“今天天气很好”。其中,重叠文本为“今天天气很好”。原始文本中的非重叠文本为“深圳”。为了实现目标语音前后的连贯,可以将该删除场景看作为将原始语音中的“天深圳天”替换为“天天”的替换场景。即第一文本为“今气很好”,第二文本为“天天”。
可选地,上述几种场景只是举例,在实际应用中,还有其他场景,具体此处不做限定。
由于上述的删除场景与插入场景都可以用替换场景进行代替,下面仅以替换场景为例对本申请实施例提供的语音处理方法进行描述。本申请实施例提供的语音处理方法可以由终端设备或云端设备单独执行,也可以由终端设备与云端设备共同完成,下面分别描述:
实施例一:终端设备或者云端设备单独执行该语音处理方法。
请参阅图7,本申请实施例提供的语音处理方法一个实施例,该方法可以由语音处理设备执行,也可以由语音处理设备的部件(例如处理器、芯片、或芯片系统等)执行,该语音处理设备可以是终端设备或云端设备,该实施例包括步骤701至步骤706。
步骤701,获取原始语音与第二文本。
本申请实施例中,语音处理设备可以直接获取原始语音、原始文本与第二文本。也可以先获取原始语音与第二文本,在识别原始语音得到与原始语音对应的原始文本。其中,第二文本为目标文本中除了第一文本以外的文本,且原始文本与目标文本含有第一文本。第一文本可以理解为是原始文本与目标文本的重叠文本中的部分或全部文本。
本申请实施例中,语音处理设备获取第二文本的方式有多种,下面分别描述:
第一种,语音处理设备可以通过其他设备或用户的输入直接获取第二文本。
第二种,语音处理设备获取目标文本,并根据目标文本与原始语音对应的原始文本得到重叠文本,再根据重叠文本确定第二文本。具体可以是将原始文本与目标文本中的字符一一对比或者输入对比模型,确定原始文本与目标文本的重叠文本和/或非重叠文本。再根据重叠文本确定第一文本。其中,第一文本可以是重叠文本,也可以是重叠文本中的部分文本。
本申请实施例中根据重叠文本确定第一文本的方式有多种,语音处理设备可以直接确定重叠文本为第一文本,还可以根据预设规则确定重叠文本中的第一文本,也可以根据用户的操作确定重叠文本中的第一文本。其中,预设规则可以是去掉重叠内容中的N个字符后得到第一文本,N为正整数。
可以理解的是,上述两种方式只是举例,在实际应用中,还有其他方式获取第二文本的方式,具体此处不做限定。
另外,语音处理设备可以将原始文本与原始语音对齐,确定原始文本中各个音素在原始语音中的起止位置,可以获知原始文本中各个音素的时长。进而获取第一文本对应的音素,也即是获取第一文本在原始语音中对应的语音(即非编辑语音)。
可选地,语音处理设备可以将原始文本与原始语音对齐采用的方式可以是采用强制对齐法,例如:蒙特利尔强制校准器(montreal forced aligner,MFA)、具有对齐功能的神经网络等对齐工具,具体此处不做限定。
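As a small illustration of how the per-phoneme start and end positions produced by forced alignment (for example, MFA) can be used to locate the non-edited speech, the sketch below assumes the aligner output has already been parsed into (phoneme, start, end) tuples; this data layout and the function name are assumptions for illustration only.

```python
def non_edited_spans(alignment, first_text_phonemes):
    """alignment: ordered list of (phoneme, start_s, end_s) for the original speech;
    first_text_phonemes: set of indices that belong to the first text.
    Returns the time spans (in seconds) covered by the non-edited speech."""
    spans = []
    for idx, (_, start, end) in enumerate(alignment):
        if idx in first_text_phonemes:
            # merge with the previous span when the phonemes are contiguous
            if spans and abs(spans[-1][1] - start) < 1e-6:
                spans[-1][1] = end
            else:
                spans.append([start, end])
    return spans

alignment = [("j", 0.0, 0.1), ("in", 0.1, 0.3), ("t", 0.3, 0.4), ("ian", 0.4, 0.6),
             ("sh", 0.6, 0.8), ("en", 0.8, 1.0)]          # toy example
print(non_edited_spans(alignment, first_text_phonemes={0, 1, 2, 3}))
# -> [[0.0, 0.6]]
```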
可选地,语音处理设备获取原始语音与原始文本之后,可以向用户展示用户界面,该用户界面包括原始语音以及原始文本。进一步的,用户通过用户界面对原始文本执行第一操作,语音处理设备响应用户的第一操作确定目标文本。其中,第一操作可以理解为是用户对原始文本的编辑,编辑具体可以是前述的替换、插入或删除等。
示例性的,延续上述替换场景中的举例。原始文本为“今天深圳天气很好”,目标文本为“今天广州天气很好”。示例性的,以语音处理设备是手机为例进行描述。语音处理设备获取原始文本与原始语音之后,向用户展示如图8所示的界面,该界面包括原始文本与原始语音。如图9所示,用户可以对原始文本执行第一操作901,例如将“深圳”修改为“广州”等前述的插入、删除、替换操作,这里仅以替换为例进行描述。
可选地,语音处理设备确定原始文本与目标文本的重叠文本后,向用户展示重叠文本,再根据用户的第二操作,从重叠文本中确定第一文本,进而确定第二文本。其中,第二操作可以是点击、拖拽、滑动等操作,具体此处不做限定。
示例性的,延续上述举例,第二文本为“广州”,第一文本为“今天天气很好”,非编辑语音为第一文本在原始语音中的语音。假设一个文字对应2帧,原始文本对应的原始语音包括16帧,则非编辑语音相当于原始语音中的第1帧至第4帧以及第9帧至第16帧。可以理解的是,在实际应用中,文字与语音帧的对应关系不一定是上述举例的1比2,上述举例只 是为了方便理解非编辑区域,原始文本对应的帧数具体此处不做限定。确定目标文本之后,语音处理设备可以显示如图10所示界面,该界面可以包括第二文本、目标文本、原始语音中的非编辑语音与编辑语音,其中,第二文本为“广州”,目标文本为“今天广州天气很好”,非编辑语音为“今天天气很好”对应的语音,编辑语音为“深圳”对应的语音。也可以理解为是,随着用户编辑的目标文本,进而语音处理设备基于目标文本、原始文本以及原始语音确定原始语音中的非编辑语音。
可选地,语音处理设备接收用户发送的编辑请求,该编辑请求中包括原始语音与第二文本。可选地,编辑请求还包括原始文本和/或发音者标识。当然,该编辑请求也可以包括原始语音与目标文本。
步骤702,基于非编辑语音获取第一语音特征。
本申请实施例中的语音特征可以用于表示语音的特征(例如:音色、韵律、情感或节奏等),语音特征的表现形式有多种,可以是语音帧、序列、向量等,具体此处不做限定。另外,本申请实施例中的语音特征具体可以是通过前述的PLP、LPC、MFCC等方法从上述表现形式中提取的参数。
可选地,从非编辑语音中选取至少一个语音帧作为第一语音特征。进一步的,为了第二语音特征更加结合了上下文的第一语音特征。至少一个语音帧对应的文本可以为第一文本中与第二文本相邻的文本。
可选地,将非编辑语音通过编码模型编码得到目标序列,将该目标序列作为第一语音特征。其中,编码模型可以是CNN、RNN等,具体此处不做限定。
另外,第一语音特征还可以携带有原始语音的声纹特征。其中,获取声纹特征的方式可以是直接获取,也可以是通过识别原始语音得到该声纹特征等。一方面,通过引入原始语音的声纹特征,使得后续生成的第二语音特征也携带有该原始语音的声纹特征,进而提升目标编辑语音与原始语音的相近程度。另一方面,在发音者(或者用户)的数量为多个的情况下,引入声纹特征可以提升后续预测的语音特征更加与原始语音的发音者的声纹相似。
可选地,语音处理设备还可以获取原始语音的发音者标识,以便于在发音者为多个时,可以匹配相应发音者对应的语音,提升后续目标编辑语音与原始语音的相似度。
下面仅以将语音帧作为语音特征(或者理解为是根据语音帧获取语音特征)为例进行描述。示例性的,延续上述举例,选择原始语音中的第1帧至第4帧以及第9帧至第16帧中的至少一帧作为第一语音特征。
示例性的,第一语音特征为梅尔频谱特征。
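As an illustration of the frame selection described above, the following sketch takes, from a pre-computed mel-spectrogram, a few non-edited frames adjacent to the edited span as the first speech feature; the frame indices follow the 16-frame toy example above, and the helper name is an assumption for illustration only.

```python
import numpy as np

def context_frames(mels, edit_start, edit_end, n=2):
    """Return n non-edited frames on each side of the edited span [edit_start, edit_end)
    (0-based) as the first speech feature, i.e. frames adjacent to the second text."""
    left = mels[max(0, edit_start - n):edit_start]
    right = mels[edit_end:edit_end + n]
    return np.concatenate([left, right], axis=0)

mels = np.random.rand(16, 80)          # 16 frames x 80 mel bins, as in the example
first_feature = context_frames(mels, edit_start=4, edit_end=8)
print(first_feature.shape)             # (4, 80): frames 3-4 and 9-10 (1-based)
```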
步骤703,基于第一语音特征、第二文本通过神经网络得到第二文本对应的第二语音特征。
语音处理设备获取第一语音特征之后,可以基于第一语音特征、第二文本通过神经网络得到第二文本对应的第二语音特征。该神经网络包括编码器与解码器。将第二文本输入编码器得到第二文本对应的第一向量,再基于第一语音特征通过解码器对第一向量进行解码得到第二语音特征。其中,第一语音特征可以与第二语音特征的韵律、音色和/或信噪比等相同或相近,韵律可以反映出发音者的情感状态或讲话形式等,韵律泛指语调、音调、重音强调、停顿或节奏等特征。
可选地,编码器与解码器之间可以引入注意力机制,用于调整输入与输出之间数量的对应关系。
可选地,在编码器编码过程中可以引入第二文本所在的目标文本,使得生成的第二文本的第一向量参考了目标文本,使得该第一向量描述的第二文本更加准确。即可以基于第一语音特征、目标文本、标记信息通过神经网络得到第二文本对应的第二语音特征。具体可以是将目标文本与标记信息输入编码器得到第二文本对应的第一向量,再基于第一语音特征通过解码器对第一向量进行解码得到第二语音特征。该标记信息用于标记目标文本中的第二文本。
另外,为了保证第二文本对应的目标编辑语音的时长与非编辑语音在语速上一致,可以对目标编辑语音的时长进行修正。在一种可能实现的方式中,修正的具体步骤可以包括:通过预测网络预测总时长,该总时长为目标文本对应所有音素的总时长,将总时长拆分为第一时长与第二时长,第一时长为第一文本在目标文本对应的音素时长,第二时长为第二文本在目标文本对应的音素时长。再根据第一时长与第三时长修正第二时长得到第一修正时长,第三时长为第一文本在原始语音中的音素时长。在另一种可能实现的方式中,修正的具体步骤可以包括:基于第二文本通过预测网络预测第四时长,第四时长为第二文本对应所有音素的总时长;获取原始语音的语速;基于语速修正第四时长,得到第二修正时长;并基于第一向量、第一语音特征与第二修正时长,通过解码器,获取第二语音特征。类似的操作可以参考上述一种可能实现的方式中的描述,此处不再赘述。
也可以理解为是,通过第一文本在原始语音中的音素时长与预测网络预测出的第一文本在目标文本中的音素时长的差异修正第二文本在目标文本中的音素时长。
可选地,通过下述公式一计算差异系数。
公式一:
$$s=\frac{\sum_{k=1}^{n}RP_{k}}{\sum_{k=1}^{n}LP_{k}}$$
其中，n为第一文本的音素数量，$RP_k$为第$k$个音素在原始语音中的时长（即第三时长），$LP_k$为预测网络预测出的第$k$个音素的时长（即第一时长），则第一修正时长$=s\times$第二时长。
可选地,通过解码器获取第一向量之后,可以使用修正时长(第一修正时长或第二修正时长)对第一向量进行上采样后得到第二向量,基于第一语音特征通过解码器,解码第二向量得到第二语音特征。其中,这里的上采样可以理解为是将第一向量对应的第二时长扩展或拉伸至第二向量对应的修正时长。另外,解码器也可以通过自回归的方式获取第二语音特征,即边生成第二语音特征,边对第二语音特征进行调整。
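A minimal sketch of Formula 1 and of the subsequent upsampling of the first vector into the second vector, with durations expressed in frames; the function names and toy numbers are assumptions for illustration only.

```python
import numpy as np

def ratio_s(rp, lp):
    """Formula 1: s = sum of real first-text phoneme durations (third duration)
    divided by the sum of predicted first-text phoneme durations (first duration)."""
    return sum(rp) / sum(lp)

def upsample(first_vector, corrected_frames):
    """Expand each phoneme vector to its corrected number of frames
    (the 'upsampling' of the first vector into the second vector)."""
    return np.repeat(first_vector, corrected_frames, axis=0)

rp = [13, 15, 14]                       # real durations in frames (third duration)
lp = [10, 12, 11]                       # predicted durations (first duration)
s = ratio_s(rp, lp)
second_text_pred = [10, 10]             # predicted second-text durations (second duration)
corrected = [max(1, round(d * s)) for d in second_text_pred]   # first corrected duration
first_vector = np.random.rand(2, 256)   # one encoder vector per second-text phoneme
second_vector = upsample(first_vector, corrected)
print(s, corrected, second_vector.shape)
```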
本申请实施例中的解码器可以是单向解码器,也可以是双向解码器,下面分别描述。
第一种,解码器是单向解码器。
解码器基于第一语音特征从目标文本的第一方向计算第一向量或第二向量得到的语音帧作为第二语音特征。其中,第一方向为从目标文本的一侧指向目标文本的另一侧的方向。另 外,该第一方向可以理解为是目标文本的正序或反序(相关描述可以参考前述图5所示实施例中关于正序反序的描述)。
可选地,将第一语音特征与第一向量输入解码器得到第二语音特征。或者将第一语音特征与第二向量输入解码器得到第二语音特征。
第二种,若第二文本在目标文本的中间区域,解码器可以是双向解码器(也可以理解为编码器包括第一编码器与第二编码器)。
上述的第二文本在目标文本的中间区域,可以理解为第二文本并不在目标文本的两端。
本申请实施例中的双向解码器有多种情况,下面分别描述:
1、双向解码器从第一方向输出的第三语音特征为第二文本对应的语音特征,双向解码器从第二方向输出的第四语音特征为第二文本对应的语音特征。
该种情况,可以理解为可以分别通过左右两侧(即正序反序)得到两种第二文本对应的完整语音特征,并根据两种语音特征得到第二语音特征。
第一解码器基于第一语音特征从目标文本的第一方向计算第一向量或第二向量得到第二文本的第三语音特征(以下称为LR)。第二解码器基于第一语音特征从目标文本的第二方向计算第一向量或第二向量得到第二文本的第四语音特征(以下称为RL)。并根据第三语音特征与第四语音特征生成第二语音特征。其中,第一方向为从目标文本的一侧指向目标文本的另一侧的方向,第二方向与第一方向相反(或者理解为第二方向为从目标文本的另一侧指向目标文本的一侧方向)。第一方向可以是上述的正序,第二方向可以是上述的反序。
对于双向解码器,第一解码器在第一方向解码第一向量或第二向量的第一帧时,可以将非编辑语音中与第二文本一侧(也可以称为左侧)相邻的语音帧作为条件进行解码得到N帧LR。第二解码器在第二方向解码第一向量或第二向量的第一帧时,可以将非编辑语音中与第二文本另一侧(也可以称为右侧)相邻的语音帧作为条件进行解码得到N帧RL。可选地,双向解码器的结构可以参考图11。获取N帧LR与N帧RL之后,可以将LR与RL中差值小于阈值的帧作为过渡帧(位置为m,m<N),或者将LR与RL中差值最小的帧作为过渡帧。则第二语音特征的N帧可以包括LR中的前m帧与RL中的后N-m帧,或者第二语音特征的N帧包括LR中的前N-m帧与RL中的后m帧。其中,LR与RL的差值可以理解为是向量与向量之间的距离。另外,若前述步骤701中获取了发音者标识,则本步骤中的第一向量或第二向量还可以包括用于标识发音者的第三向量。也可以理解为第三向量用于标识原始语音的声纹特征。
示例性的,延续上述举例,假设第一解码器得到“广州”对应的LR帧包括LR1、LR2、LR3、LR4。第二解码器得到“广州”对应的RL帧包括RL1、RL2、RL3、RL4。且LR2与RL2差值最小,则将LR1、LR2、RL3、RL4或者LR1、RL2、RL3、RL4作为第二语音特征。
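下面的PyTorch示意代码演示按帧间距离选取过渡帧并拼接LR帧与RL帧得到第二语音特征的过程;距离度量此处假设为欧氏距离,数据为随机示例。

```python
# 示意代码:在 LR 帧与 RL 帧中按帧间距离(此处假设为欧氏距离)选取过渡帧位置 m,
# 拼接 LR 的前 m 帧与 RL 的后 N-m 帧得到第二语音特征;数据为随机示例。
import torch

def splice_bidirectional(lr: torch.Tensor, rl: torch.Tensor) -> torch.Tensor:
    # lr, rl: [N, mel_dim],分别为第一解码器(正序)与第二解码器(反序)输出的语音特征
    dist = torch.norm(lr - rl, dim=-1)          # 各位置上 LR 与 RL 的向量距离
    m = int(torch.argmin(dist)) + 1             # 过渡帧位置(1 起始)
    return torch.cat([lr[:m], rl[m:]], dim=0)   # 例如 m=2 时得到 LR1、LR2、RL3、RL4

lr = torch.randn(4, 80)                         # "广州"对应的 4 帧 LR(随机示例)
rl = torch.randn(4, 80)                         # "广州"对应的 4 帧 RL(随机示例)
second_speech_feature = splice_bidirectional(lr, rl)
print(second_speech_feature.shape)              # torch.Size([4, 80])
```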
2、双向解码器从第一方向输出的第三语音特征为第二文本中第三文本对应的语音特征,双向解码器从第二方向输出的第四语音特征为第二文本中第四文本对应的语音特征。
该种情况,可以理解为可以分别通过左右两侧(即正序反序)得到第二文本对应的部分语音特征,并根据两个部分语音特征得到完整的第二语音特征。即从正序的方向上取一部分语音特征,从反序的方向上取另一部分语音特征,并拼接一部分语音特征与另一部分语音特征得到整体的语音特征。
示例性的,延续上述举例,假设第一解码器得到第三文本(“广”)对应的LR帧包括LR1与LR2。第二解码器得到第四文本(“州”)对应的RL帧包括RL3与RL4。则拼接LR1、LR2、RL3、RL4得到第二语音特征。
可以理解的是,上述两种方式只是举例,在实际应用中,还有其他方式获取第二语音特征,具体此处不做限定。
步骤704,基于第二语音特征生成与第二文本对应的目标编辑语音。
语音处理设备获取第二语音特征之后,可以根据声码器将第二语音特征转换为第二文本对应的目标编辑语音。其中,声码器可以是传统声码器(例如Griffin-lim算法),也可以是神经网络声码器(如使用音频训练数据预训练好的Melgan,或Hifigan等)等,具体此处不做限定。
示例性的,延续上述举例,“广州”对应的目标编辑语音如图12所示。
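作为示意,下面给出用传统声码器Griffin-Lim把第二语音特征还原为波形的代码草稿。此处假设使用librosa与soundfile两个第三方库,特征为随机示例数据;实际也可替换为Melgan、Hifigan等神经网络声码器。

```python
# 示意代码:用传统声码器 Griffin-Lim 将第二语音特征(此处假设为梅尔频谱)还原为波形;
# librosa 与 soundfile 为假设使用的第三方库,特征为随机示例数据。
import librosa
import numpy as np
import soundfile as sf

mel = np.abs(np.random.randn(80, 40)).astype(np.float32)        # 第二语音特征(示例数据)
linear = librosa.feature.inverse.mel_to_stft(mel, sr=16000)     # 梅尔谱 -> 线性幅度谱
wav = librosa.griffinlim(linear)                                # Griffin-Lim 迭代估计相位并重建波形
sf.write("target_edit.wav", wav, 16000)                         # 保存目标编辑语音
```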
步骤705,获取第二文本在目标文本中的位置。本步骤是可选地。
可选地,如果步骤701中获取的是原始语音与第二文本,则获取第二文本在目标文本中的位置。
可选地,如果步骤701中已获取目标文本,则可以通过前述步骤701中的对齐技术对齐原始语音与原始文本确定原始文本中各个音素在原始语音中的起止位置。并根据各音素的起止位置确定第二文本在目标文本中的位置。
步骤706,基于位置拼接目标编辑语音与非编辑语音生成与目标文本对应的目标语音。本步骤是可选地。
本申请实施例中的位置用于拼接非编辑语音与目标编辑语音,该位置可以是第二文本在目标文本中的位置,也可以是第一文本在目标文本中的位置,还可以是非编辑语音在原始语音中的位置,还可以是编辑语音在原始语音中的位置。
可选地,获取第二文本在目标文本中的位置之后,可以通过前述步骤701中的对齐技术对齐原始语音与原始文本确定原始文本中各个音素在原始语音中的起止位置。并根据第一文本在原始文本中的位置,确定原始语音中的非编辑语音或编辑语音位置。进而语音处理设备基于位置拼接目标编辑语音与非编辑语音得到目标语音。即用第二文本对应的目标编辑语音替换原始语音中的编辑区域,得到目标语音。
示例性的,延续上述举例,非编辑语音相当于原始语音中的第1帧至第4帧以及第9帧至第16帧。目标编辑语音为LR1、LR2、RL3、RL4或者LR1、RL2、RL3、RL4。拼接目标编辑语音与非编辑语音,可以理解为是将得到的四帧替换原始语音中的第5帧至第8帧,进而得到目标语音。即将“广州”对应的语音替换原始语音中“深圳”对应的语音,进而得到目标文本:“今天广州天气很好”对应的目标语音。“今天广州天气很好”对应的目标语音如图12所示。
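下面的示意代码按上文示例的帧索引演示拼接过程:用目标编辑语音的帧替换原始语音中的编辑区域得到目标语音;数据为随机示例,实际的拼接位置由对齐结果确定。

```python
# 示意代码:按上文示例的位置拼接,将原始语音特征中的编辑区域(第5~8帧)
# 替换为目标编辑语音的帧,得到目标语音对应的特征序列;数据为随机示例。
import numpy as np

original_frames = np.random.randn(16, 80)      # 原始语音的16帧特征(示例数据)
target_edit = np.random.randn(4, 80)           # "广州"对应的目标编辑语音帧(示例数据)
edit_start, edit_end = 4, 8                    # 编辑区域:第5~8帧(0 起始的半开区间)
target_speech = np.concatenate(
    [original_frames[:edit_start], target_edit, original_frames[edit_end:]], axis=0)
print(target_speech.shape)                     # (16, 80),编辑区域已被替换
```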
可选地,语音处理设备在获取目标编辑语音或目标语音之后,对目标编辑语音或目标语音进行播放。
一种可能实现的方式中,本申请实施例提供的语音处理方法包括步骤701至步骤704。另一种可能实现的方式中,本申请实施例提供的语音处理方法包括步骤701至步骤705。另一种可能实现的方式中,本申请实施例提供的语音处理方法包括步骤701至步骤706。另外,本申请实施例中图7所示的各个步骤不限定时序关系。例如:上述方法中的步骤705也可以在步骤704之后,也可以在步骤701之前,还可以与步骤701共同执行。
本申请实施例中,一方面,通过第一文本在原始语音中的第一语音特征获取目标文本中第二文本对应的第二语音特征,即通过参考原始文本中第一文本的第一语音特征生成目标文本中第二文本的第二语音特征,进而实现目标编辑语音/目标语音(即编辑语音)的听感与原始语音的听感类似,提升用户体验。另一方面,通过修正目标编辑语音的时长,使得目标语音与原始语音的语速类似,提升用户体验。另一方面,可以通过直接修改原始文本的方式修改原始语音,提升用户对于语音编辑的可操作性,并且编辑后目标编辑语音同原始语音在音色、韵律等维度高度相似。另一方面,生成目标语音时,并未修改非编辑语音,且目标编辑语音的第二语音特征与非编辑语音的第一语音特征类似,使得用户在听原始语音与目标语音时,很难听出原始语音与目标语音在语音特征上的差别。
上面对终端设备或云端设备单独实施的语音处理方法进行了描述,下面对终端设备与云端设备共同执行的语音处理方法进行描述。
实施例二:终端设备与云端设备共同执行语音处理方法。
请参阅图13,本申请实施例提供的语音处理方法的一个实施例,该方法可以由终端设备与云端设备共同执行,也可以由终端设备的部件(例如处理器、芯片、或芯片系统等)与云端设备的部件(例如处理器、芯片、或芯片系统等)执行,该实施例包括步骤1301至步骤1306。
步骤1301,终端设备获取原始语音与第二文本。
本实施例中终端设备执行的步骤1301与前述图7所示实施例中语音处理设备执行的步骤701类似,此处不再赘述。
步骤1302,终端设备向云端设备发送原始语音与第二文本。
终端设备获取原始语音与第二文本之后,可以向云端设备发送原始语音与第二文本。
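作为示意,下面给出终端设备向云端设备发送编辑请求的一种假设性实现;接口地址、字段名与所用的HTTP库(requests)均为示例,并非本申请限定的协议。

```python
# 示意代码(假设性的接口,并非本申请限定的协议):终端设备把原始语音与第二文本
# 作为编辑请求通过 HTTP 发送给云端设备,并接收云端返回的目标编辑语音;
# 接口地址、字段名与 requests 库均为示例。
import requests

with open("original.wav", "rb") as f:
    resp = requests.post(
        "https://cloud.example.com/speech/edit",                 # 示例地址
        files={"original_speech": f},
        data={"second_text": "广州", "speaker_id": "spk_001"},    # 字段名为假设
        timeout=30,
    )
resp.raise_for_status()
with open("target_edit.wav", "wb") as out:
    out.write(resp.content)                                      # 保存目标编辑语音
```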
可选地,若步骤1301中,终端设备获取的是原始语音与目标文本,则终端设备向云端设备发送原始语音与目标文本。
步骤1303,云端设备基于原始语音与第二文本获取非编辑语音。
本实施例中云端设备执行的步骤1303与前述图7所示实施例中语音处理设备执行的步骤701中确定非编辑语音的描述类似,此处不再赘述。
步骤1304,云端设备基于非编辑语音获取第一语音特征。
步骤1305,云端设备基于第一语音特征、第二文本通过神经网络得到第二文本对应的第二语音特征。
步骤1306,云端设备基于第二语音特征生成与第二文本对应的目标编辑语音。
本实施例中云端设备执行的步骤1304至步骤1306与前述图7所示实施例中语音处理设备执行的步骤702至步骤704类似,此处不再赘述。
步骤1307,云端设备向终端设备发送目标编辑语音。本步骤是可选地。
可选地,云端设备获取目标编辑语音之后,可以向终端设备发送目标编辑语音。
步骤1308,终端设备或云端设备获取第二文本在目标文本中的位置。本步骤是可选地。
步骤1309,终端设备或云端设备基于位置拼接目标编辑语音与非编辑语音生成与目标文本对应的目标语音。本步骤是可选地。
本实施例中的步骤1308、步骤1309与前述图7所示实施例中语音处理设备执行的步骤705至步骤706类似,此处不再赘述。本实施例中的步骤1308、步骤1309可以由终端设备或云端设备执行。
步骤1310,云端设备向终端设备发送目标语音。本步骤是可选地。
可选地,若步骤1308与步骤1309由云端设备执行,则云端设备获取目标语音后,向终端设备发送目标语音。若步骤1308与步骤1309由终端设备执行,则可以不执行本步骤。
可选地,终端设备在获取目标编辑语音或目标语音之后,对目标编辑语音或目标语音进行播放。
一种可能实现的方式中,本申请实施例提供的语音处理方法可以包括:云端设备生成目标编辑语音,并向终端设备发送目标编辑语音,即该方法包括步骤1301至步骤1307。另一种可能实现的方式中,本申请实施例提供的语音处理方法可以包括:云端设备生成目标编辑语音,并根据目标编辑语音与非编辑语音生成目标语音,向终端设备发送目标语音。即该方法包括步骤1301至步骤1306、步骤1308至步骤1310。另一种可能实现的方式中,本申请实施例提供的语音处理方法可以包括:云端设备生成目标编辑语音,向终端设备发送目标编辑语音。终端设备再根据目标编辑语音与非编辑语音生成目标语音。即该方法包括步骤1301至步骤1309。
本申请实施例中,一方面可以通过云端设备与终端设备的交互,由云端设备进行复杂的计算得到目标编辑语音或目标语音并返给终端设备,可以减少终端设备的算力与存储空间。另一方面,可以根据原始语音中非编辑区域的语音特征生成修改文本对应的目标编辑语音,进而与非编辑语音生成目标文本对应的目标语音。另一方面,用户可以通过对原始文本中的文本进行修改,得到修改文本(即第二文本)对应的目标编辑语音。提升用户基于文本进行语音编辑的编辑体验。另一方面,生成目标语音时,并未修改非编辑语音,且目标编辑语音的第二语音特征与非编辑语音的第一语音特征类似,使得用户在听原始语音与目标语音时,很难听出原始语音与目标语音在语音特征上的差别。
上面对本申请实施例中的语音处理方法进行了描述,下面对本申请实施例中的语音处理设备进行描述,请参阅图14,本申请实施例中语音处理设备的一个实施例包括:
获取单元1401,用于获取原始语音与第二文本,第二文本为目标文本中除了第一文本以外的文本,目标文本与原始语音对应的原始文本都包括第一文本,第一文本在原始语音中对应的语音为非编辑语音;
获取单元1401,还用于基于非编辑语音获取第一语音特征;
处理单元1402,用于基于第一语音特征与第二文本通过神经网络得到第二文本对应的第二语音特征;
生成单元1403,用于基于第二语音特征生成第二文本对应的目标编辑语音。
可选地,本实施例中的语音处理设备还包括:
拼接单元1404,用于基于位置拼接目标编辑语音与非编辑语音得到目标文本对应的目标语音。
第一预测单元1405,用于基于目标文本通过预测网络预测第一时长与第二时长,第一时长为第一文本在目标文本中对应的音素时长,第二时长为第二文本在目标文本中对应的音素时长;
第一修正单元1406,用于基于第一时长与第三时长修正第二时长,以得到第一修正时长,第三时长为第一文本在原始语音中的音素时长;
第二预测单元1407,用于基于第二文本通过预测网络预测第四时长,第四时长为第二文本对应所有音素的总时长;
第二修正单元1408,用于基于语速修正第四时长,得到第二修正时长;
在语音处理设备为云端设备时,云端设备还可以包括发送单元1409,用于向终端设备发送目标编辑语音或目标语音。
本实施例中,语音处理设备中各单元所执行的操作与前述图7至图12所示实施例中描述的类似,此处不再赘述。
本实施例中,一方面,处理单元1402通过第一文本在原始语音中的第一语音特征获取目标文本中第二文本对应的第二语音特征,即处理单元1402通过参考原始文本中第一文本的第一语音特征生成目标文本中第二文本的第二语音特征,进而实现生成单元1403生成的目标编辑语音/目标语音的听感与原始语音的听感类似,提升用户体验。另一方面,第一修正单元1406或者第二修正单元1408通过修正目标编辑语音的时长,使得目标语音与原始语音的语速类似,提升用户体验。另一方面,可以通过直接修改原始文本的方式修改原始语音,提升用户对于语音编辑的可操作性,并且编辑后目标编辑语音同原始语音在音色、韵律等维度高度相似。另一方面,生成目标语音时,并未修改非编辑语音,且目标编辑语音的第二语音特征与非编辑语音的第一语音特征类似,使得用户在听原始语音与目标语音时,很难听出原始语音与目标语音在语音特征上的差别。
请参阅图15,本申请实施例中语音处理设备的另一个实施例,其中,该语音处理设备可以是终端设备。该终端设备包括:
获取单元1501,用于获取原始语音与第二文本,第二文本为目标文本中除了第一文本以外的文本,目标文本与原始语音对应的原始文本都包括第一文本,第一文本在原始语音中对应的语音为非编辑语音;
发送单元1502,用于向云端设备发送原始语音与第二文本,原始语音与第二文本用于云端设备生成第二文本对应的目标编辑语音;
获取单元1501,还用于接收云端设备发送的目标编辑语音。
本实施例中,语音处理设备中各单元所执行的操作与前述图13所示实施例中终端设备执行步骤的描述类似,此处不再赘述。
本实施例中,一方面可以通过云端设备与终端设备的交互,由云端设备进行复杂的计算得到目标编辑语音或目标语音并返给终端设备,可以减少终端设备的算力与存储空间。另一方面,用户可以通过对原始文本中的文本进行修改,得到修改文本(即第二文本)对应的目标编辑语音。提升用户基于文本进行语音编辑的编辑体验。
请参阅图16,本申请实施例中语音处理设备的另一个实施例,其中,该语音处理设备可以是云端设备。该云端设备包括:
接收单元1601,用于接收终端设备发送的原始语音与第二文本,第二文本为目标文本中除了第一文本以外的文本,目标文本与原始语音对应的原始文本都包括第一文本,第一文本 在原始语音中对应的语音为非编辑语音;
获取单元1602,用于基于非编辑语音获取第一语音特征;
处理单元1603,用于基于第一语音特征与第二文本通过神经网络得到第二文本对应的第二语音特征;
生成单元1604,用于基于第二语音特征生成第二文本对应的目标编辑语音。
可选地,生成单元1604,还用于基于目标编辑语音与非编辑语音生成目标语音。
可选地,本实施例中的语音处理设备还包括:
发送单元1605,用于向终端设备发送目标编辑语音或者目标语音。
本实施例中,语音处理设备中各单元所执行的操作与前述图13所示实施例中云端设备执行步骤的描述类似,此处不再赘述。
本实施例中,一方面可以通过云端设备与终端设备的交互,由云端设备进行复杂的计算得到目标编辑语音或目标语音并返给终端设备,可以减少终端设备的算力与存储空间。另一方面,用户可以通过对原始文本中的文本进行修改,得到修改文本(即第二文本)对应的目标编辑语音。提升用户基于文本进行语音编辑的编辑体验。另一方面,生成单元1604生成目标语音时,并未修改非编辑语音,且目标编辑语音的第二语音特征与非编辑语音的第一语音特征类似,使得用户在听原始语音与目标语音时,很难听出原始语音与目标语音在语音特征上的差别。
请参阅图17,本申请实施例提供了另一种语音处理设备,为了便于说明,仅示出了与本申请实施例相关的部分,具体技术细节未揭示的,请参照本申请实施例方法部分。该语音处理设备可以为包括手机、平板电脑、个人数字助理(personal digital assistant,PDA)、销售终端设备(point of sales,POS)、车载电脑等任意终端设备,以语音处理设备为手机为例:
图17示出的是与本申请实施例提供的语音处理设备相关的手机的部分结构的框图。参考图17,手机包括:射频(radio frequency,RF)电路1710、存储器1720、输入单元1730、显示单元1740、传感器1750、音频电路1760、无线保真(wireless fidelity,WiFi)模块1770、处理器1780、以及电源1790等部件。本领域技术人员可以理解,图17中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图17对手机的各个构成部件进行具体的介绍:
RF电路1710可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给处理器1780处理;另外,将涉及上行的数据发送给基站。通常,RF电路1710包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(low noise amplifier,LNA)、双工器等。此外,RF电路1710还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(global system of mobile communication,GSM)、通用分组无线服务(general packet radio service,GPRS)、码分多址(code division multiple access,CDMA)、宽带码分多址(wideband code division multiple access,WCDMA)、长期演进(long term evolution,LTE)、电子邮件、短消息服务(short messaging service,SMS)等。
存储器1720可用于存储软件程序以及模块,处理器1780通过运行存储在存储器1720的软件程序以及模块,从而执行手机的各种功能应用以及数据处理。存储器1720可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器1720可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。
输入单元1730可用于接收输入的数字或字符信息,以及产生与手机的用户设置以及功能控制有关的键信号输入。具体地,输入单元1730可包括触控面板1731以及其他输入设备1732。触控面板1731,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板1731上或在触控面板1731附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触控面板1731可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器1780,并能接收处理器1780发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板1731。除了触控面板1731,输入单元1730还可以包括其他输入设备1732。具体地,其他输入设备1732可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元1740可用于显示由用户输入的信息或提供给用户的信息以及手机的各种菜单。显示单元1740可包括显示面板1741,可选的,可以采用液晶显示器(liquid crystal display,LCD)、有机发光二极管(organic light-emitting diode,OLED)等形式来配置显示面板1741。进一步的,触控面板1731可覆盖显示面板1741,当触控面板1731检测到在其上或附近的触摸操作后,传送给处理器1780以确定触摸事件的类型,随后处理器1780根据触摸事件的类型在显示面板1741上提供相应的视觉输出。虽然在图17中,触控面板1731与显示面板1741是作为两个独立的部件来实现手机的输入和输出功能,但是在某些实施例中,可以将触控面板1731与显示面板1741集成而实现手机的输入和输出功能。
手机还可包括至少一种传感器1750,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板1741的亮度,接近传感器可在手机移动到耳边时,关闭显示面板1741和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于手机还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。
音频电路1760、扬声器1761,传声器1762可提供用户与手机之间的音频接口。音频电路1760可将接收到的音频数据转换后的电信号,传输到扬声器1761,由扬声器1761转换为声音信号输出;另一方面,传声器1762将收集的声音信号转换为电信号,由音频电路1760接收后转换为音频数据,再将音频数据输出至处理器1780处理后,经RF电路1710以发送给比如另一手机,或者将音频数据输出至存储器1720以便进一步处理。
WiFi属于短距离无线传输技术,手机通过WiFi模块1770可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图17示出了WiFi模块1770,但是可以理解的是,其并不属于手机的必须构成。
处理器1780是手机的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器1720内的软件程序和/或模块,以及调用存储在存储器1720内的数据,执行手机的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器1780可包括一个或多个处理单元;优选的,处理器1780可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器1780中。
手机还包括给各个部件供电的电源1790(比如电池),优选的,电源可以通过电源管理系统与处理器1780逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。
尽管未示出,手机还可以包括摄像头、蓝牙模块等,在此不再赘述。
在本申请实施例中,该终端设备所包括的处理器1780可以执行前述图7实施例中语音处理设备的功能,或者执行前述图13所示实施例中终端设备的功能,此处不再赘述。
参阅图18,本申请提供的另一种语音处理设备的结构示意图。该语音处理设备可以是云端设备。该云端设备可以包括处理器1801、存储器1802和通信接口1803。该处理器1801、存储器1802和通信接口1803通过线路互联。其中,存储器1802中存储有程序指令和数据。
存储器1802中存储了前述图7对应的实施方式中,由语音处理设备执行的步骤对应的程序指令以及数据。或者存储了前述图13对应的实施方式中,由云端设备执行的步骤对应的程序指令以及数据。
处理器1801,用于执行前述图7所示实施例中任一实施例所示的由语音处理设备执行的步骤。或者用于执行前述图13所示实施例中任一实施例所示的由云端设备执行的步骤。
通信接口1803可以用于进行数据的接收和发送,用于执行前述图7或图13所示实施例中任一实施例中与获取、发送、接收相关的步骤。
一种实现方式中,云端设备可以包括相对于图18更多或更少的部件,本申请对此仅仅是示例性说明,并不作限定。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。
当使用软件实现所述集成的单元时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。

Claims (29)

  1. 一种语音处理方法,其特征在于,所述方法包括:
    获取原始语音与第二文本,所述第二文本为目标文本中除了第一文本以外的文本,所述目标文本与所述原始语音对应的原始文本都包括所述第一文本,所述第一文本在所述原始语音中对应的语音为非编辑语音;
    基于所述非编辑语音获取第一语音特征;
    基于所述第一语音特征与所述第二文本通过神经网络得到所述第二文本对应的第二语音特征;
    基于所述第二语音特征生成所述第二文本对应的目标编辑语音。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    获取所述第二文本在所述目标文本中的位置;
    基于所述位置拼接所述目标编辑语音与所述非编辑语音得到所述目标文本对应的目标语音。
  3. 根据权利要求1或2所述的方法,其特征在于,所述基于非编辑语音获取第一语音特征,包括:
    获取所述非编辑语音中的至少一个语音帧;
    基于所述至少一个语音帧获取所述第一语音特征,所述第一语音特征用于表示所述至少一个语音帧的特征,所述第一语音特征为特征向量或序列。
  4. 根据权利要求3所述的方法,其特征在于,所述至少一个语音帧对应的文本为所述第一文本中与所述第二文本相邻的文本。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述基于所述第一语音特征与第二文本通过神经网络得到所述第二文本对应的第二语音特征,包括:
    基于所述第一语音特征、所述目标文本以及标记信息通过神经网络得到第二文本对应的第二语音特征,所述标记信息用于标记所述目标文本中的所述第二文本。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述神经网络包括编码器与解码器,所述基于所述第一语音特征与第二文本通过神经网络得到所述第二文本对应的第二语音特征,包括:
    基于所述第二文本,通过所述编码器,获取所述第二文本对应的第一向量;
    基于所述第一向量与所述第一语音特征,通过所述解码器,获取所述第二语音特征。
  7. 根据权利要求6所述的方法,其特征在于,所述基于所述第二文本,通过所述编码器,获取所述第二文本对应的第一向量,包括:
    基于所述目标文本,通过所述编码器,获取所述第一向量。
  8. 根据权利要求6或7所述的方法,其特征在于,所述方法还包括:
    基于所述目标文本通过预测网络预测第一时长与第二时长,所述第一时长为所述第一文本在所述目标文本中对应的音素时长,所述第二时长为所述第二文本在所述目标文本中对应的音素时长;
    基于所述第一时长与第三时长修正所述第二时长,以得到第一修正时长,所述第三时长为所述第一文本在所述原始语音中的音素时长;
    所述基于所述第一向量与所述第一语音特征,通过所述解码器,获取所述第二语音特征,包括:
    基于所述第一向量、所述第一语音特征与所述第一修正时长,通过所述解码器,获取所述第二语音特征。
  9. 根据权利要求6或7所述的方法,其特征在于,所述方法还包括:
    基于所述第二文本通过预测网络预测第四时长,所述第四时长为所述第二文本对应所有音素的总时长;
    获取所述原始语音的语速;
    基于所述语速修正所述第四时长,得到第二修正时长;
    所述基于所述第一向量与所述第一语音特征,通过所述解码器,获取所述第二语音特征,包括:
    基于所述第一向量、所述第一语音特征与所述第二修正时长,通过所述解码器,获取所述第二语音特征。
  10. 根据权利要求6至9中任一项所述的方法,其特征在于,所述基于所述第一向量与所述第一语音特征,通过所述解码器,获取所述第二语音特征,包括:
    基于所述解码器与所述第一语音特征从所述目标文本的正序或反序解码所述第一向量得到所述第二语音特征。
  11. 根据权利要求6至9中任一项所述的方法,其特征在于,所述第二文本在所述目标文本的中间区域,所述基于所述第一向量与所述第一语音特征,通过所述解码器,获取所述第二语音特征,包括:
    基于所述解码器与所述第一语音特征从所述目标文本的正序解码所述第一向量得到第三语音特征;
    基于所述解码器与所述第一语音特征从所述目标文本的反序解码所述第一向量得到第四语音特征;
    基于所述第三语音特征与所述第四语音特征获取所述第二语音特征。
  12. 根据权利要求11所述的方法,其特征在于,所述第二文本包括第三文本和第四文本,所述第三语音特征为所述第三文本对应的语音特征,所述第四语音特征为所述第四文本对应的语音特征;
    所述基于所述第三语音特征与所述第四语音特征获取所述第二语音特征,包括:
    拼接所述第三语音特征与所述第四语音特征得到所述第二语音特征。
  13. 根据权利要求11所述的方法,其特征在于,所述第三语音特征为所述解码器基于所述正序得到的所述第二文本对应的语音特征,所述第四语音特征为所述解码器基于所述反序得到的所述第二文本对应的语音特征;
    所述基于所述第三语音特征与所述第四语音特征获取所述第二语音特征,包括:
    确定所述第三语音特征与所述第四语音特征中相似度大于第一阈值的语音特征为过渡语音特征;
    拼接第五语音特征与第六语音特征得到所述第二语音特征,所述第五语音特征为基于所述过渡语音特征在所述第三语音特征中的位置从所述第三语音特征中截取得到的,所述第六语音特征为基于所述过渡语音特征在所述第四语音特征中的位置从所述第四语音特征中截取得到的。
  14. 根据权利要求1至13中任一项所述的方法,其特征在于,所述基于所述第二语音特征生成所述第二文本对应的目标编辑语音,包括:
    基于所述第二语音特征,通过声码器,生成所述目标编辑语音。
  15. 根据权利要求1至14中任一项所述的方法,其特征在于,所述第一语音特征携带有所述原始语音的声纹特征。
  16. 一种语音处理设备,其特征在于,所述语音处理设备包括:
    获取单元,用于获取原始语音与第二文本,所述第二文本为目标文本中除了第一文本以外的文本,所述目标文本与所述原始语音对应的原始文本都包括所述第一文本,所述第一文本在所述原始语音中对应的语音为非编辑语音;
    所述获取单元,还用于基于所述非编辑语音获取第一语音特征;
    处理单元,用于基于所述第一语音特征与所述第二文本通过神经网络得到所述第二文本对应的第二语音特征;
    生成单元,用于基于所述第二语音特征生成所述第二文本对应的目标编辑语音。
  17. 根据权利要求16所述的设备,其特征在于,所述获取单元,还用于获取所述第二文本在所述目标文本中的位置;
    所述语音处理设备还包括:
    拼接单元,用于基于所述位置拼接所述目标编辑语音与所述非编辑语音得到所述目标文本对应的目标语音。
  18. 根据权利要求16或17所述的设备,其特征在于,所述获取单元,具体用于获取所述非编辑语音中的至少一个语音帧;
    所述获取单元,具体用于基于所述至少一个语音帧获取所述第一语音特征,所述第一语音特征用于表示所述至少一个语音帧的特征,所述第一语音特征为特征向量或序列。
  19. 根据权利要求18所述的设备,其特征在于,所述至少一个语音帧对应的文本为所述第一文本中与所述第二文本相邻的文本。
  20. 根据权利要求16至19中任一项所述的设备,其特征在于,所述神经网络包括编码器与解码器,所述处理单元,具体用于基于所述第二文本,通过所述编码器,获取所述第二文本对应的第一向量;
    所述处理单元,具体用于基于所述第一向量与所述第一语音特征,通过所述解码器,获取所述第二语音特征。
  21. 根据权利要求20所述的设备,其特征在于,所述语音处理设备还包括:
    第一预测单元,用于基于所述目标文本通过预测网络预测第一时长与第二时长,所述第一时长为所述第一文本在所述目标文本中对应的音素时长,所述第二时长为所述第二文本在所述目标文本中对应的音素时长;
    第一修正单元,用于基于所述第一时长与第三时长修正所述第二时长,以得到第一修正时长,所述第三时长为所述第一文本在所述原始语音中的音素时长;
    所述处理单元,具体用于基于所述第一向量、所述第一语音特征与所述第一修正时长,通过所述解码器,获取所述第二语音特征。
  22. 根据权利要求20或21所述的设备,其特征在于,所述处理单元,具体用于基于所述解码器与所述第一语音特征从所述目标文本的正序或反序解码所述第一向量得到所述第二语音特征。
  23. 根据权利要求20或21所述的设备,其特征在于,所述第二文本在所述目标文本的中间区域,所述处理单元,具体用于基于所述解码器与所述第一语音特征从所述目标文本的正序解码所述第一向量得到第三语音特征;
    所述处理单元,具体用于基于所述解码器与所述第一语音特征从所述目标文本的反序解码所述第一向量得到第四语音特征;
    所述处理单元,具体用于基于所述第三语音特征与所述第四语音特征获取所述第二语音特征。
  24. 根据权利要求23所述的设备,其特征在于,所述第二文本包括第三文本和第四文本,所述第三语音特征为所述第三文本对应的语音特征,所述第四语音特征为所述第四文本对应的语音特征;
    所述处理单元,具体用于拼接所述第三语音特征与所述第四语音特征得到所述第二语音特征。
  25. 根据权利要求23所述的设备,其特征在于,所述第三语音特征为所述解码器基于所述正序得到的所述第二文本对应的语音特征,所述第四语音特征为所述解码器基于所述反序得到的所述第二文本对应的语音特征;
    所述处理单元,具体用于确定所述第三语音特征与所述第四语音特征中相似度大于第一阈值的语音特征为过渡语音特征;
    所述处理单元,具体用于拼接第五语音特征与第六语音特征得到所述第二语音特征,所述第五语音特征为基于所述过渡语音特征在所述第三语音特征中的位置从所述第三语音特征中截取得到的,所述第六语音特征为基于所述过渡语音特征在所述第四语音特征中的位置从所述第四语音特征中截取得到的。
  26. 一种语音处理设备,其特征在于,包括:处理器,所述处理器与存储器耦合,所述存储器用于存储程序或指令,当所述程序或指令被所述处理器执行时,使得所述语音处理设备执行如权利要求1至15中任一项所述的方法。
  27. 根据权利要求26所述的设备,其特征在于,所述设备还包括:
    输入单元,用于接收第二文本;
    输出单元,用于播放所述第二文本对应的目标编辑语音或者目标文本对应的目标语音。
  28. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有指令,所述指令在计算机上执行时,使得所述计算机执行如权利要求1至15中任一项所述的方法。
  29. 一种计算机程序产品,其特征在于,所述计算机程序产品在计算机上执行时,使得所述计算机执行如权利要求1至15中任一项所述的方法。
PCT/CN2022/094838 2021-06-03 2022-05-25 一种语音处理方法及相关设备 WO2022253061A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22815107.2A EP4336490A1 (en) 2021-06-03 2022-05-25 Voice processing method and related device
US18/524,208 US20240105159A1 (en) 2021-06-03 2023-11-30 Speech processing method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110621213.6A CN113421547B (zh) 2021-06-03 2021-06-03 一种语音处理方法及相关设备
CN202110621213.6 2021-06-03

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/524,208 Continuation US20240105159A1 (en) 2021-06-03 2023-11-30 Speech processing method and related device

Publications (1)

Publication Number Publication Date
WO2022253061A1 true WO2022253061A1 (zh) 2022-12-08

Family

ID=77713755

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/094838 WO2022253061A1 (zh) 2021-06-03 2022-05-25 一种语音处理方法及相关设备

Country Status (4)

Country Link
US (1) US20240105159A1 (zh)
EP (1) EP4336490A1 (zh)
CN (1) CN113421547B (zh)
WO (1) WO2022253061A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421547B (zh) * 2021-06-03 2023-03-17 华为技术有限公司 一种语音处理方法及相关设备
CN113823260A (zh) * 2021-10-20 2021-12-21 科大讯飞股份有限公司 语音合成模型训练方法、语音合成方法和装置
CN113724686B (zh) * 2021-11-03 2022-04-01 中国科学院自动化研究所 编辑音频的方法、装置、电子设备及存储介质
CN114882862A (zh) * 2022-04-29 2022-08-09 华为技术有限公司 一种语音处理方法及相关设备
CN116189654A (zh) * 2023-02-23 2023-05-30 京东科技信息技术有限公司 语音编辑方法、装置、电子设备及存储介质

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178895A (zh) * 2007-12-06 2008-05-14 安徽科大讯飞信息科技股份有限公司 基于生成参数听感误差最小化的模型自适应方法
CN104517605A (zh) * 2014-12-04 2015-04-15 北京云知声信息技术有限公司 一种用于语音合成的语音片段拼接系统和方法
CN106328145A (zh) * 2016-08-19 2017-01-11 北京云知声信息技术有限公司 语音修正方法及装置
CN107644646A (zh) * 2017-09-27 2018-01-30 北京搜狗科技发展有限公司 语音处理方法、装置以及用于语音处理的装置
US20180268806A1 (en) * 2017-03-14 2018-09-20 Google Inc. Text-to-speech synthesis using an autoencoder
KR102057926B1 (ko) * 2019-03-19 2019-12-20 휴멜로 주식회사 음성 합성 장치 및 그 방법
CN111369968A (zh) * 2020-03-19 2020-07-03 北京字节跳动网络技术有限公司 声音复制方法、装置、可读介质及电子设备
CN112365879A (zh) * 2020-11-04 2021-02-12 北京百度网讯科技有限公司 语音合成方法、装置、电子设备和存储介质
CN112767910A (zh) * 2020-05-13 2021-05-07 腾讯科技(深圳)有限公司 音频信息合成方法、装置、计算机可读介质及电子设备
CN112802444A (zh) * 2020-12-30 2021-05-14 科大讯飞股份有限公司 语音合成方法、装置、设备及存储介质
CN113421547A (zh) * 2021-06-03 2021-09-21 华为技术有限公司 一种语音处理方法及相关设备

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293296B (zh) * 2017-06-28 2020-11-20 百度在线网络技术(北京)有限公司 语音识别结果纠正方法、装置、设备及存储介质
CN109036377A (zh) * 2018-07-26 2018-12-18 中国银联股份有限公司 一种语音合成方法及装置
CN110136691B (zh) * 2019-05-28 2021-09-28 广州多益网络股份有限公司 一种语音合成模型训练方法、装置、电子设备及存储介质
JP7432199B2 (ja) * 2019-07-05 2024-02-16 国立研究開発法人情報通信研究機構 音声合成処理装置、音声合成処理方法、および、プログラム
CN110534088A (zh) * 2019-09-25 2019-12-03 招商局金融科技有限公司 语音合成方法、电子装置及存储介质
CN111292720B (zh) * 2020-02-07 2024-01-23 北京字节跳动网络技术有限公司 语音合成方法、装置、计算机可读介质及电子设备
CN112885328A (zh) * 2021-01-22 2021-06-01 华为技术有限公司 一种文本数据处理方法及装置

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178895A (zh) * 2007-12-06 2008-05-14 安徽科大讯飞信息科技股份有限公司 基于生成参数听感误差最小化的模型自适应方法
CN104517605A (zh) * 2014-12-04 2015-04-15 北京云知声信息技术有限公司 一种用于语音合成的语音片段拼接系统和方法
CN106328145A (zh) * 2016-08-19 2017-01-11 北京云知声信息技术有限公司 语音修正方法及装置
US20180268806A1 (en) * 2017-03-14 2018-09-20 Google Inc. Text-to-speech synthesis using an autoencoder
CN107644646A (zh) * 2017-09-27 2018-01-30 北京搜狗科技发展有限公司 语音处理方法、装置以及用于语音处理的装置
KR102057926B1 (ko) * 2019-03-19 2019-12-20 휴멜로 주식회사 음성 합성 장치 및 그 방법
CN111369968A (zh) * 2020-03-19 2020-07-03 北京字节跳动网络技术有限公司 声音复制方法、装置、可读介质及电子设备
CN112767910A (zh) * 2020-05-13 2021-05-07 腾讯科技(深圳)有限公司 音频信息合成方法、装置、计算机可读介质及电子设备
CN112365879A (zh) * 2020-11-04 2021-02-12 北京百度网讯科技有限公司 语音合成方法、装置、电子设备和存储介质
CN112802444A (zh) * 2020-12-30 2021-05-14 科大讯飞股份有限公司 语音合成方法、装置、设备及存储介质
CN113421547A (zh) * 2021-06-03 2021-09-21 华为技术有限公司 一种语音处理方法及相关设备

Also Published As

Publication number Publication date
EP4336490A1 (en) 2024-03-13
CN113421547A (zh) 2021-09-21
US20240105159A1 (en) 2024-03-28
CN113421547B (zh) 2023-03-17

Similar Documents

Publication Publication Date Title
CN112487182B (zh) 文本处理模型的训练方法、文本处理方法及装置
WO2022253061A1 (zh) 一种语音处理方法及相关设备
WO2021135577A9 (zh) 音频信号处理方法、装置、电子设备及存储介质
CN109543195B (zh) 一种文本翻译的方法、信息处理的方法以及装置
CN112069309B (zh) 信息获取方法、装置、计算机设备及存储介质
US11769018B2 (en) System and method for temporal attention behavioral analysis of multi-modal conversations in a question and answer system
WO2023207541A1 (zh) 一种语音处理方法及相关设备
US11776269B2 (en) Action classification in video clips using attention-based neural networks
CN110209784B (zh) 消息交互方法、计算机设备及存储介质
CN113254684B (zh) 一种内容时效的确定方法、相关装置、设备以及存储介质
CN111967334B (zh) 一种人体意图识别方法、系统以及存储介质
CN110114765B (zh) 通过共享话语的上下文执行翻译的电子设备及其操作方法
WO2021057884A1 (zh) 语句复述方法、训练语句复述模型的方法及其装置
CN111581958A (zh) 对话状态确定方法、装置、计算机设备及存储介质
WO2023226239A1 (zh) 对象情绪的分析方法、装置和电子设备
CN113822076A (zh) 文本生成方法、装置、计算机设备及存储介质
CN115688937A (zh) 一种模型训练方法及其装置
CN113948060A (zh) 一种网络训练方法、数据处理方法及相关设备
CN115544227A (zh) 多模态数据的情感分析方法、装置、设备及存储介质
CN115866291A (zh) 一种数据处理方法及其装置
WO2021083312A1 (zh) 训练语句复述模型的方法、语句复述方法及其装置
CN113822084A (zh) 语句翻译方法、装置、计算机设备及存储介质
Guo et al. Sign-to-911: Emergency Call Service for Sign Language Users with Assistive AR Glasses
CN117877125B (zh) 动作识别及其模型训练方法、装置、电子设备、存储介质
WO2023207391A1 (zh) 虚拟人视频生成方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22815107

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022815107

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022815107

Country of ref document: EP

Effective date: 20231207

NENP Non-entry into the national phase

Ref country code: DE