WO2022142850A1 - Audio processing method, apparatus, vocoder, electronic device, computer-readable storage medium, and computer program product

Audio processing method, apparatus, vocoder, electronic device, computer-readable storage medium, and computer program product

Info

Publication number
WO2022142850A1
Authority
WO
WIPO (PCT)
Prior art keywords: time, prediction, values, sampling point, sub
Prior art date
Application number
PCT/CN2021/132024
Other languages
English (en)
French (fr)
Inventor
林诗伦
李新辉
卢鲤
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP21913592.8A (EP4210045B1)
Priority to JP2023518015A (JP2023542012A)
Publication of WO2022142850A1
Priority to US17/965,130 (US20230035504A1)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • The present application relates to audio and video processing technologies, and in particular to an audio processing method, apparatus, vocoder, electronic device, computer-readable storage medium, and computer program product.
  • Speech synthesis technology converts text into corresponding audio content through certain rules or model algorithms.
  • Traditional speech synthesis techniques are mainly based on concatenative (splicing) methods or statistical parametric methods.
  • Neural network-based vocoders (neural vocoders) have made great progress.
  • However, the current vocoder usually needs to loop many times over the multiple sampling time points in the audio feature signal to complete speech prediction and thereby speech synthesis, which leads to a slow processing speed of audio synthesis and reduces the efficiency of audio processing.
  • Embodiments of the present application provide an audio processing method, an apparatus, a vocoder, an electronic device, a computer-readable storage medium, and a computer program product, which can improve the speed and efficiency of audio processing.
  • An embodiment of the present application provides an audio processing method. The method is executed by an electronic device and includes:
  • performing voice feature conversion on text to be processed to obtain at least one acoustic feature frame; extracting, through a frame rate network, from each acoustic feature frame of the at least one acoustic feature frame, the conditional feature corresponding to that frame; and performing frequency band division and time-domain downsampling on the current frame among the acoustic feature frames to obtain n subframes corresponding to the current frame, wherein n is a positive integer greater than 1 and each of the n subframes contains a preset number of sampling points;
  • synchronously predicting, through a sampling prediction network in the i-th prediction round, the sampling values corresponding to the current m adjacent sampling points on the n subframes to obtain m×n sub-prediction values, and thereby obtaining n sub-prediction values corresponding to each sampling point of the preset number of sampling points, wherein i is a positive integer greater than or equal to 1 and m is a positive integer greater than or equal to 2 and less than or equal to the preset number;
  • obtaining an audio prediction signal corresponding to the current frame according to the n sub-prediction values corresponding to each sampling point, and performing audio synthesis on the audio prediction signals corresponding to the acoustic feature frames of the at least one acoustic feature frame to obtain target audio corresponding to the text to be processed.
  • An embodiment of the present application provides a vocoder, including:
  • a frame rate network configured to extract, from each acoustic feature frame of the at least one acoustic feature frame, the conditional feature corresponding to that frame;
  • a time-domain and frequency-domain processing module configured to perform frequency band division and time-domain downsampling on the current frame among the acoustic feature frames to obtain n subframes corresponding to the current frame, wherein n is a positive integer greater than 1 and each of the n subframes includes a preset number of sampling points;
  • a sampling prediction network configured to synchronously predict, during the i-th prediction round, the sampling values corresponding to the current m adjacent sampling points on the n subframes to obtain m×n sub-prediction values, and thereby obtain n sub-prediction values corresponding to each sampling point of the preset number of sampling points, wherein i is a positive integer greater than or equal to 1 and m is a positive integer greater than or equal to 2 and less than or equal to the preset number; and
  • a signal synthesis module configured to obtain the audio prediction signal corresponding to the current frame according to the n sub-prediction values corresponding to each sampling point, and to perform audio synthesis on the audio prediction signals corresponding to the acoustic feature frames to obtain the target audio corresponding to the text to be processed.
  • An embodiment of the present application provides an audio processing device, including:
  • a text-to-speech conversion model configured to perform voice feature conversion on the text to be processed to obtain at least one acoustic feature frame;
  • a frame rate network configured to extract, from each acoustic feature frame of the at least one acoustic feature frame, the conditional feature corresponding to that frame;
  • a time-domain and frequency-domain processing module configured to perform frequency band division and time-domain downsampling on the current frame among the acoustic feature frames to obtain n subframes corresponding to the current frame, wherein n is a positive integer greater than 1 and each of the n subframes includes a preset number of sampling points;
  • a sampling prediction network configured to synchronously predict, during the i-th prediction round, the sampling values corresponding to the current m adjacent sampling points on the n subframes to obtain m×n sub-prediction values, and thereby obtain n sub-prediction values corresponding to each sampling point of the preset number of sampling points, wherein i is a positive integer greater than or equal to 1 and m is a positive integer greater than or equal to 2 and less than or equal to the preset number; and
  • a signal synthesis module configured to obtain the audio prediction signal corresponding to the current frame according to the n sub-prediction values corresponding to each sampling point, and to perform audio synthesis on the audio prediction signals corresponding to the acoustic feature frames to obtain the target audio corresponding to the text to be processed.
  • An embodiment of the present application provides an electronic device, including:
  • a memory configured to store executable instructions; and
  • a processor configured to implement the audio processing method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
  • Embodiments of the present application provide a computer-readable storage medium storing executable instructions configured to cause a processor to execute the audio processing method provided by the embodiments of the present application.
  • Embodiments of the present application provide a computer program product including a computer program or instructions; when the computer program or instructions are executed by a processor, the audio processing method provided by the embodiments of the present application is implemented.
  • By dividing each acoustic feature frame into multiple subbands and downsampling them in the time domain, the overall number of sampling points that the sampling prediction network needs to process when predicting sampling values is reduced.
  • Furthermore, multiple sampling points at adjacent times are predicted at the same time, which realizes synchronous processing of multiple sampling points, thereby significantly reducing the number of loops the sampling prediction network requires to predict the audio signal, increasing the processing speed of audio synthesis and improving the efficiency of audio processing.
  • FIG. 1 is an optional structural schematic diagram of a current LPCNet vocoder provided by an embodiment of the present application
  • FIG. 2 is an optional structural schematic diagram 1 of an audio processing system architecture provided by an embodiment of the present application.
  • FIG. 3 is an optional structural schematic diagram 1 of the audio processing system provided by the embodiment of the present application in an in-vehicle application scenario;
  • FIG. 4 is an optional structural schematic diagram 2 of an audio processing system architecture provided by an embodiment of the present application.
  • FIG. 5 is an optional structural schematic diagram 2 of the audio processing system provided by the embodiment of the present application in an in-vehicle application scenario;
  • FIG. 6 is an optional schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 7 is an optional structural schematic diagram of a multi-band multi-time-domain vocoder provided by an embodiment of the present application.
  • FIG. 10 is an optional schematic flowchart 3 of an audio processing method provided by an embodiment of the present application.
  • FIG. 11 is an optional fourth schematic flowchart of the audio processing method provided by the embodiment of the present application.
  • FIG. 12 is an optional schematic diagram of a network architecture of a frame rate network and a sampling prediction network provided by an embodiment of the present application;
  • FIG. 13 is an optional schematic flowchart 5 of the audio processing method provided by the embodiment of the present application.
  • FIG. 14 is an optional schematic structural diagram of an electronic device provided by an embodiment of the present application applied in an actual scenario.
  • "First/second/third" is only used to distinguish similar objects and does not represent a specific ordering of objects. It should be understood that, where permitted, the specific order or sequence of "first/second/third" may be interchanged so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein.
  • Speech synthesis Also known as Text to Speech (TTS), its function is to convert the text information generated by the computer itself or input externally into comprehensible and fluent speech and read it out.
  • Spectrograms refer to the representation of a signal in the time domain in the frequency domain, which can be obtained by performing Fourier transform on the signal.
  • In speech synthesis applications, the phase information is often omitted and only the corresponding amplitude information at different frequencies is retained.
  • The fundamental frequency refers to the frequency of the fundamental tone in a complex tone and is represented by the symbol F0.
  • The fundamental tone has the lowest frequency and the highest intensity.
  • The fundamental frequency determines the pitch of a sound.
  • The so-called frequency of speech generally refers to the frequency of the fundamental tone.
  • Vocoder: derived from the abbreviation of "voice encoder", also known as a speech signal analysis and synthesis system; its function is to convert acoustic features into sound.
  • GMM: Gaussian Mixture Model, an extension of a single Gaussian probability density function that uses multiple Gaussian probability density functions to model the distribution of variables more accurately.
  • DNN: Deep Neural Network.
  • MLP: multi-layer perceptron neural network.
  • CNN: Convolutional Neural Network.
  • CNN is a feedforward neural network whose neurons can respond to units in the receptive field.
  • CNN usually consists of multiple convolutional layers and a fully connected layer at the top, which reduces the number of parameters of the model by sharing parameters, making it widely used in image and speech recognition.
  • RNN: Recurrent Neural Network.
  • An RNN is a recursive neural network that takes sequence data as input, recurses along the evolution direction of the sequence, and has all of its nodes (recurrent units) connected in a chain.
  • LSTM: Long Short-Term Memory network, a type of recurrent neural network that adds a cell to judge whether information is useful. An input gate, a forget gate and an output gate are placed in each cell; after information enters the LSTM, only the information that passes the gating rules is retained, and the information that does not is discarded through the forget gate.
  • This network is suitable for processing and predicting important events with relatively long intervals and delays in time series.
  • GRU: Gated Recurrent Unit, a type of recurrent neural network. Like the LSTM, it was proposed to address long-term dependency and the gradient problems of backpropagation. Compared with the LSTM, the GRU has one fewer gate and fewer parameters; in most cases it achieves an effect comparable to the LSTM while effectively reducing computation time.
  • Pitch: pitch period.
  • Speech signals can be roughly divided into two categories. One is voiced sound, which has short-term periodicity: when a person produces a voiced sound, the airflow through the glottis makes the vocal cords vibrate, producing a quasi-periodic pulse train that excites the vocal tract to produce voiced speech. Voiced speech carries most of the energy in speech, and its period is called the pitch period (Pitch).
  • The other category is unvoiced sound, which has random-noise properties and is produced by air compressed through the oral cavity while the glottis is closed.
  • LPC: Linear Predictive Coding.
  • The speech signal can be modeled as the output of a linear time-varying system whose input excitation is a periodic pulse train (during voiced segments) or random noise (during unvoiced segments).
  • A sample of the speech signal can be approximated by a linear combination of past samples, and a set of prediction coefficients, i.e., the LPC coefficients, can be obtained by minimizing the sum of squared differences between the actual samples and the linearly predicted samples.
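As a concrete illustration of linear predictive coding, the following minimal sketch (Python with NumPy; the prediction order of 16 and the test signal are illustrative assumptions, not values taken from the application) fits prediction coefficients by the autocorrelation method and uses them to predict one sample from its past:

```python
import numpy as np

def lpc_coefficients(signal: np.ndarray, order: int = 16) -> np.ndarray:
    # Autocorrelation method: solve R a = r for the prediction coefficients a.
    n = len(signal)
    r = np.array([np.dot(signal[:n - k], signal[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def lpc_predict(history: np.ndarray, coeffs: np.ndarray) -> float:
    # Predicted sample = weighted sum of the most recent `order` past samples.
    order = len(coeffs)
    return float(np.dot(coeffs, history[-order:][::-1]))

# Example: a slightly noisy sine wave is almost perfectly linearly predictable.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 200 * np.arange(1600) / 16000) + 0.01 * rng.standard_normal(1600)
a = lpc_coefficients(x)
print(lpc_predict(x[:100], a), x[100])   # the two values are close
```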
  • LPCNet: Linear Predictive Coding Network, a vocoder that ingeniously combines digital signal processing and neural networks for speech synthesis and can synthesize high-quality speech in real time on an ordinary CPU.
  • WaveNet, as the pioneering work on neural vocoders, provides an important reference for subsequent work in this field, but its autoregressive forward method (that is, predicting the current sampling point relies on the sampling points at previous times) makes it difficult to meet the real-time requirements of large-scale online applications.
  • To address this, flow-based neural vocoders such as Parallel WaveNet and ClariNet emerged.
  • This type of vocoder uses distillation to make the distributions (mixture-of-logistics or single Gaussian) predicted by the teacher model and the student model as close as possible; after distillation learning is complete, the parallelizable student model is used in the forward pass to improve the overall speed.
  • However, flow-based vocoders can only achieve real-time synthesis on GPUs, which are expensive, so the cost is too high.
  • Later, autoregressive models with simpler structures, such as WaveRNN and LPCNet, were successively introduced; quantization and sparse-matrix optimizations were further applied to these simpler structures so that relatively good real-time performance can be achieved on a single CPU. For large-scale online applications, however, even faster vocoders are required.
  • the LPCNet vocoder is mainly composed of a Frame Rate Network (FRN) and a Sample Rate Network (SRN).
  • The frame rate network 10 usually takes multi-dimensional audio features as input and extracts high-level speech features through multi-layer convolution processing as the conditional feature f for the subsequent sample rate network 20. LPC coefficients are calculated from the multi-dimensional audio features, and, based on these LPC coefficients combined with the sampling point prediction values S_{t-16}, ..., S_{t-1} predicted at the times before the current time, linear predictive coding outputs the current rough measurement value p_t corresponding to the sampling point at the current time.
  • The sample rate network 20 then takes as input the predicted value S_{t-1} of the sampling point at the previous time, the prediction error e_{t-1} of the sampling point at the previous time, the current rough measurement value p_t, and the conditional feature f output by the frame rate network 10, and outputs the prediction error e_t of the sampling point at the current time; after that, the sample rate network 20 adds the prediction error e_t to the current rough measurement value p_t to obtain the predicted value S_t at the current time.
  • The sample rate network 20 performs the same processing for every sampling point in the multi-dimensional audio features, running continuously in a loop until the sampling values of all sampling points have been predicted, and the entire target audio to be synthesized is obtained from the predicted value at each sampling point. Since audio usually contains a large number of sampling points (taking a sampling rate of 16 kHz as an example, 10 ms of audio contains 160 sampling points), the SRN in the current vocoder needs to loop 160 times to synthesize 10 ms of audio; the overall amount of computation is large, resulting in low speed and efficiency of audio processing.
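The per-sample loop described above can be summarized structurally as follows. This is a hedged sketch in Python: the trained sample rate network is replaced by a stub, and only the loop structure and the update S_t = p_t + e_t follow the description above.

```python
import numpy as np

def sample_rate_network(s_prev, p_t, e_prev, cond_f) -> float:
    # Stand-in for the trained sample rate network (SRN); it would map
    # (S_{t-1}, p_t, e_{t-1}, f) to the prediction error e_t.
    return 0.0

def lpcnet_like_frame(cond_f, lpc_coeffs, num_samples: int = 160, order: int = 16):
    s = np.zeros(num_samples + order)   # predicted samples, padded with zero history
    e = np.zeros(num_samples + order)   # prediction errors (excitation)
    for t in range(order, order + num_samples):              # 160 loops per 10 ms frame
        p_t = float(np.dot(lpc_coeffs, s[t - order:t][::-1]))  # rough value from past samples
        e_t = sample_rate_network(s[t - 1], p_t, e[t - 1], cond_f)
        s[t] = p_t + e_t                                      # S_t = p_t + e_t
        e[t] = e_t
    return s[order:]
```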
  • Embodiments of the present application provide an audio processing method, an apparatus, a vocoder, an electronic device, and a computer-readable storage medium, which can improve the speed and efficiency of audio processing.
  • Exemplary applications of the electronic devices provided by the embodiments of the present application are described below.
  • The electronic devices provided by the embodiments of the present application may be implemented as various types of user terminals, such as intelligent robots, smart speakers, notebook computers, tablet computers, desktop computers, set-top boxes, mobile devices (for example, mobile phones, portable music players, personal digital assistants, dedicated messaging devices, portable game devices), intelligent voice interaction devices, smart home appliances and vehicle-mounted terminals, and may also be implemented as servers.
  • Next, exemplary applications when the electronic device is implemented as a server will be described.
  • FIG. 2 is an optional schematic structural diagram of an audio processing system 100-1 provided by an embodiment of the present application.
  • The terminal 400 (terminal 400-1, terminal 400-2 and terminal 400-3 are exemplarily shown) is connected to the server 200 through a network, and the network may be a wide area network, a local area network, or a combination of the two.
  • The terminal 400 is installed with a client 410 of an intelligent speech application (client 410-1, client 410-2 and client 410-3 are exemplarily shown), and the client 410 can send the text to be processed for intelligent speech synthesis to the server.
  • The server 200 is configured to, after receiving the text to be processed, perform voice feature conversion on the text to be processed to obtain at least one acoustic feature frame, and to extract, through the frame rate network, the conditional feature corresponding to each acoustic feature frame from the at least one acoustic feature frame.
  • Frequency band division and time-domain downsampling are performed on the current frame among the acoustic feature frames to obtain n subframes corresponding to the current frame, where n is a positive integer greater than 1 and each of the n subframes contains a preset number of sampling points. Through the sampling prediction network, in the i-th prediction round, the sampling values corresponding to the current m adjacent sampling points on the n subframes are synchronously predicted to obtain m×n sub-prediction values, and thereby n sub-prediction values corresponding to each sampling point of the preset number of sampling points are obtained, where i is a positive integer greater than or equal to 1 and m is a positive integer greater than or equal to 2 and less than or equal to the preset number. The audio prediction signal corresponding to the current frame is obtained according to the n sub-prediction values corresponding to each sampling point, and audio synthesis is then performed on the audio prediction signals corresponding to the acoustic feature frames to obtain the target audio corresponding to the text to be processed.
  • the server 200 may further perform post-processing operations such as compression of the target audio, and return the processed target audio to the terminal 400 by way of streaming or returning the entire sentence.
  • the terminal 400 can perform smooth and natural voice playback in the client 410 .
  • In this way, the server 200 can simultaneously predict, through the sampling prediction network, the predicted values corresponding to multiple sub-band features at adjacent times, so fewer loops are required for audio prediction; the background speech synthesis service on the server therefore has little delay, and the client 410 can obtain the returned audio immediately. This enables the user of the terminal 400 to hear the voice content converted from the text to be processed within a short period of time, frees the eyes, and makes the interaction natural and convenient.
  • The server 200 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • the terminal 400 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
  • the terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment of the present application.
  • the terminal 400 may be an in-vehicle device 400-4.
  • the in-vehicle device 400-4 may be an in-vehicle computer installed inside the vehicle device, or may be installed outside the vehicle device.
  • the client 410 of the intelligent voice application can be an in-vehicle service client 410-4, which is used to display the relevant driving information of the vehicle, provide control of various devices on the vehicle, and provide other extended functions .
  • When the in-vehicle service client 410-4 receives a text message sent from outside, such as a news message, a road-condition message, or another message containing text information such as an emergency message, a voice broadcast instruction can be triggered based on the user's operation instruction, for example by voice, screen or button operations on the message pop-up interface shown at 410-5.
  • The in-vehicle service system responds to the voice broadcast instruction and sends the text message to the server 200; the server 200 extracts the text to be processed from the text message and performs the above audio processing on the text to be processed to generate the corresponding target audio.
  • the server 200 sends the target audio to the in-vehicle service client 410-4, calls the in-vehicle multimedia device to play the target audio through the in-vehicle service client 410-4, and displays the audio playing interface as shown in 410-6.
  • FIG. 4 is a schematic diagram of an optional architecture of the audio processing system 100-2 provided by the embodiment of the present application, which is used to realize customizable personalized speech synthesis applications in vertical fields such as novel reading and news broadcasting.
  • the terminal 500 is connected to the server 300 through a network, and the network may be a wide area network or a local area network, or a combination of the two.
  • The server 300 is configured to customize timbre requirements in advance, form a voice library by collecting audio of various timbres, such as voices of speakers of different genders or different timbre types, train the built-in initial speech synthesis model with the voice library to obtain a server-side model capable of speech synthesis, and deploy the trained server-side model on the terminal 500 as the background voice processing model 420 on the terminal 500.
  • An intelligent voice application 411 (such as a reading APP, a news client, etc.) is installed on the terminal 500. When the user needs to read a certain text through the intelligent voice application 411, the intelligent voice application 411 can obtain the text submitted by the user for the desired voice reading.
  • The text to be processed is subjected to voice feature conversion to obtain at least one acoustic feature frame. Through the frame rate network, the conditional feature corresponding to each acoustic feature frame is extracted from the at least one acoustic feature frame; the current frame among the acoustic feature frames is subjected to frequency band division and time-domain downsampling to obtain n subframes corresponding to the current frame, where n is a positive integer greater than 1 and each of the n subframes contains a preset number of sampling points. Through the sampling prediction network, in the i-th prediction round, the sampling values corresponding to the current m adjacent sampling points on the n subframes are synchronously predicted to obtain m×n sub-prediction values, and thereby n sub-prediction values corresponding to each sampling point are obtained; the audio prediction signal corresponding to the current frame is obtained from these sub-prediction values, and audio synthesis is then performed to obtain the target audio corresponding to the text to be processed.
  • Personalized customized speech synthesis puts forward higher requirements on the robustness, generalization, and real-time performance of the system.
  • The modular end-to-end audio processing system provided by the embodiments of the present application can be flexibly adjusted according to the actual situation, ensuring high adaptability of the system under different requirements while hardly affecting the synthesis quality.
  • the terminal 500 may be an in-vehicle device 500-1, which is connected to another user device 500-2 such as a mobile phone, a tablet computer, etc. in a wired or wireless manner.
  • the connection can be done by way of Bluetooth, or by way of USB, and so on.
  • the user equipment 500-2 can send its own text, such as short messages, documents, etc., to the intelligent voice application 411-1 on the in-vehicle equipment 500-1 through the connection.
  • the user equipment 500-2 may automatically forward the notification message to the intelligent voice application 411-1, or the user equipment 500-2 may also, based on the user's operation instruction on the user equipment application, Send the locally saved document to the intelligent voice application 411-1.
  • When the intelligent voice application 411-1 receives the pushed text, it can, in response to a voice broadcast instruction, use the text content as the text to be processed and perform the above audio processing on it through the background voice model to generate the corresponding target audio.
  • the intelligent voice application 411-1 then invokes the corresponding interface display and in-vehicle multimedia equipment to play the target audio.
  • FIG. 6 is a schematic structural diagram of an electronic device 600 provided by an embodiment of the present application.
  • the electronic device 600 shown in FIG. 6 includes: at least one processor 610 , memory 650 , at least one network interface 620 and user interface 630 .
  • the various components in electronic device 600 are coupled together by bus system 640 .
  • the bus system 640 is used to implement the connection communication between these components.
  • the bus system 640 also includes a power bus, a control bus, and a status signal bus.
  • the various buses are labeled as bus system 640 in FIG. 6 .
  • The processor 610 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the general-purpose processor may be a microprocessor or any conventional processor.
  • User interface 630 includes one or more output devices 631 that enable presentation of media content, including one or more speakers and/or one or more visual display screens.
  • User interface 630 also includes one or more input devices 632, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls.
  • Memory 650 may be removable, non-removable, or a combination thereof.
  • Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like.
  • Memory 650 optionally includes one or more storage devices that are physically remote from processor 610 .
  • Memory 650 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory).
  • the memory 650 described in the embodiments of the present application is intended to include any suitable type of memory.
  • memory 650 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
  • the operating system 651 includes system programs for processing various basic system services and performing hardware-related tasks, such as framework layer, core library layer, driver layer, etc., for implementing various basic services and processing hardware-based tasks;
  • a presentation module 653 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 631 (e.g., a display screen, speakers, etc.) associated with the user interface 630;
  • An input processing module 654 for detecting one or more user inputs or interactions from one of the one or more input devices 632 and translating the detected inputs or interactions.
  • In some embodiments, the apparatuses provided in the embodiments of the present application may be implemented in software.
  • FIG. 6 shows the audio processing apparatus 655 stored in the memory 650, which may be software in the form of programs and plug-ins, including the following software modules: a text-to-speech conversion model 6551, a frame rate network 6552, a time-domain and frequency-domain processing module 6553, a sampling prediction network 6554, and a signal synthesis module 6555. These modules are logical, so they can be combined arbitrarily or split further depending on the functions implemented.
  • the apparatus provided by the embodiments of the present application may be implemented in hardware.
  • As an example, the apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor programmed to execute the audio processing method provided by the embodiments of the present application; for example, a processor in the form of a hardware decoding processor may adopt one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
  • An embodiment of the present application provides a multi-band multi-time-domain vocoder, which can be combined with a text-to-speech conversion model to convert at least one acoustic feature frame output by the text-to-speech conversion model according to the text to be processed into a target Audio; it can also be combined with the audio feature extraction module in other audio processing systems to convert the audio features output by the audio feature extraction module into audio signals.
  • the specific selection is made according to the actual situation, which is not limited in the embodiment of the present application.
  • the vocoder provided by the embodiment of the present application includes a time domain and frequency domain processing module 51 , a frame rate network 52 , a sampling prediction network 53 , and a signal synthesis module 54 .
  • the frame rate network 52 may perform high-level abstraction on the input acoustic feature signal, and extract the conditional feature corresponding to the frame from each acoustic feature frame of at least one acoustic feature frame.
  • the vocoder can further predict the sampled signal value at each sampling point in the acoustic feature of each frame based on the conditional feature corresponding to the acoustic feature frame of the frame.
  • The time-domain and frequency-domain processing module 51 can perform frequency band division and time-domain downsampling on the current frame to obtain n subframes corresponding to the current frame; each of the n subframes includes a preset number of sampling points.
  • The sampling prediction network 53 is used to synchronously predict, during the i-th prediction round, the sampling values corresponding to the current m adjacent sampling points on the n subframes, obtaining m×n sub-prediction values and thereby n sub-prediction values corresponding to each sampling point of the preset number of sampling points, where i is a positive integer greater than or equal to 1 and m is a positive integer greater than or equal to 2 and less than or equal to the preset number. The signal synthesis module 54 is used to obtain the audio prediction signal corresponding to the current frame according to the n sub-prediction values corresponding to each sampling point; audio synthesis is then performed on the audio prediction signals corresponding to the acoustic feature frames to obtain the target audio corresponding to the text to be processed.
  • the sampling prediction network 53 may include a linear prediction coding module 53-1 and a sample rate network 53-2 as shown in FIG. 7 .
  • The linear prediction coding module 53-1 can calculate the sub-rough measurement values corresponding to each of the m sampling points on the n subframes as the channel response. Based on the conditional features extracted by the frame rate network 52, the sample rate network 53-2 takes m sampling points as the time span of forward prediction in one prediction round and simultaneously predicts, for each of the m adjacent sampling points, the residual values corresponding to the n subframes as the sound-source excitation, and then simulates the corresponding audio signal according to the channel response and the sound-source excitation.
  • For sampling point t, the linear prediction coding module 53-1 can perform linear coding prediction on the linear sampling values of sampling point t in the n subframes according to the n sub-prediction values corresponding to each historical sampling point in the at least one historical sampling point at time t corresponding to sampling point t, and obtain n sub-rough measurement values at time t, which are used as the channel response of sampling point t.
  • When predicting the residual values corresponding to sampling point t, since the prediction time span is 2 sampling points, the sample rate network 53-2 can, in the i-th prediction round, take the n residual values at time t-2 and the n sub-prediction values at time t-2 corresponding to sampling point t-2 as excitation values, and perform forward prediction on the residual values corresponding to sampling point t in the n subframes, obtaining n residual values at time t corresponding to sampling point t.
  • Forward prediction is likewise performed on the residual values corresponding to sampling point t+1 in the n subframes, and the n residual values at time t+1 corresponding to sampling point t+1 are obtained.
  • The sample rate network 53-2 can recursively perform residual prediction on the preset number of downsampled sampling points in the n subframes according to the above process, until the n residual values corresponding to each sampling point are obtained.
  • The sampling prediction network 53 may obtain the n sub-prediction values at time t corresponding to sampling point t according to the residual values at time t and the sub-rough measurement values at time t, and take sampling point t as one of the at least one historical sampling point at time t+1 corresponding to sampling point t+1. According to the sub-prediction values corresponding to each historical sampling point in the at least one historical sampling point at time t+1, linear coding prediction is performed on the linear sampling values of sampling point t+1 on the n subframes to obtain n sub-rough measurement values at time t+1 as the channel response of sampling point t+1; then the n sub-prediction values at time t+1 are obtained from the n sub-rough measurement values at time t+1 and the residual values at time t+1. The n sub-prediction values at time t and the n sub-prediction values at time t+1 serve as the 2n sub-prediction values obtained in the i-th prediction round.
  • After that, the sampling prediction network 53 updates the two adjacent sampling points t and t+1 and starts the (i+1)-th round of sampling value prediction, until all of the preset number of sampling points have been predicted.
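The recursion described above (two sampling points per round over n sub-bands) can be sketched structurally as follows. This is not the claimed implementation: the sample rate network is replaced by a stub, and the LPC order and coefficient handling are illustrative assumptions; only the two-points-per-round flow, the channel-response/excitation split, and the update "sub-prediction = sub-rough value + residual" follow the text.

```python
import numpy as np

N_BANDS = 4        # n subframes
LPC_ORDER = 16     # window threshold for the linear prediction history
lpc_coeffs = np.zeros((N_BANDS, LPC_ORDER))    # per-band LPC coefficients, assumed given

def coarse_value(band_history: np.ndarray, band: int) -> float:
    # "Channel response" p for one sub-band: linear combination of up to
    # LPC_ORDER historical sub-prediction values in that sub-band.
    past = band_history[-LPC_ORDER:]
    return float(np.dot(lpc_coeffs[band, :len(past)], past[::-1]))

def predict_residuals(coarse_t, history, cond_f):
    # Stand-in for the sample rate network: in the vocoder this is a gated
    # recurrent stack with 2n fully connected heads that outputs n residuals
    # for sampling point t and n residuals for t+1 in one forward pass.
    return np.zeros(N_BANDS), np.zeros(N_BANDS)

def predict_subframes(num_samples: int, cond_f):
    s = [np.zeros(num_samples) for _ in range(N_BANDS)]   # sub-prediction values
    e = [np.zeros(num_samples) for _ in range(N_BANDS)]   # residual values
    for t in range(0, num_samples, 2):                    # one round handles t and t+1
        p_t = np.array([coarse_value(s[b][:t], b) for b in range(N_BANDS)])
        e_t, e_t1 = predict_residuals(p_t, (s, e), cond_f)
        for b in range(N_BANDS):
            s[b][t] = p_t[b] + e_t[b]                     # sub-prediction at time t
            e[b][t] = e_t[b]
        if t + 1 < num_samples:
            # t+1 uses the freshly predicted s_t as part of its history
            p_t1 = np.array([coarse_value(s[b][:t + 1], b) for b in range(N_BANDS)])
            for b in range(N_BANDS):
                s[b][t + 1] = p_t1[b] + e_t1[b]           # sub-prediction at time t+1
                e[b][t + 1] = e_t1[b]
    return s                                              # n sub-prediction sequences
```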
  • the vocoder can obtain the signal waveform of the audio signal corresponding to the current frame through the signal synthesis module 54 .
  • The vocoder provided by the embodiments of the present application effectively reduces the amount of computation required to convert acoustic features into audio signals and realizes synchronous prediction of multiple sampling points, so it can output highly intelligible, natural and high-fidelity audio while ensuring a high real-time rate.
  • Setting the prediction time span of the vocoder to two sampling points, i.e., setting m to 2, is a preference based on comprehensive consideration of the vocoder's processing efficiency and audio synthesis quality; m can also be set to other time-span parameter values according to project needs.
  • FIG. 8 is a schematic flowchart of an optional audio processing method provided by an embodiment of the present application, which will be described in conjunction with the steps shown in FIG. 8 .
  • The audio processing method provided in the embodiments of the present application can be applied to cloud services of intelligent voice applications and thus serve users of those cloud services, such as bank intelligent customer service and learning software such as vocabulary-memorization software.
  • It can also be applied to intelligent voice scenarios running locally on a terminal, such as intelligent reading of books and news broadcasting, and to automatic driving or vehicle-mounted scenarios, such as voice-interaction-based Internet-of-Vehicles or smart transportation scenarios, which are not limited in the embodiments of the present application.
  • the electronic device may perform voice feature conversion on the text information to be converted by using a preset text-to-speech conversion model, and output at least one frame of acoustic feature frames.
  • the text-to-speech conversion model may be a sequence-to-sequence (Sequence to Sequence) model constructed by a CNN, a DNN network, or an RNN network, and the sequence-to-sequence model is mainly composed of an encoder and a decoder;
  • The encoder can abstract continuous data such as speech data, raw text, and video data into sequences, extract robust sequence representations from character sequences (such as sentences) in the raw text, and encode them into a fixed-length vector capable of representing the sentence content, thereby converting the natural language in the raw text into digital features that the neural network can recognize and process.
  • The decoder can map the fixed-length vector obtained by the encoder to acoustic features of the corresponding sequence, and the features over multiple sampling points are collected into one observation unit, i.e., one frame, so as to obtain at least one acoustic feature frame.
  • At least one frame of acoustic feature frame may be at least one frame of audio spectrum signal, which may be represented by a frequency-domain spectrogram.
  • Each acoustic feature frame contains a preset number of feature dimensions, and the feature dimension represents the number of vectors in the feature, and the vectors in the feature are used to describe various feature information, such as pitch, formant, spectrum, vocal domain function, etc.
  • The at least one acoustic feature frame may be a Mel-scale spectrogram, a linear logarithmic magnitude spectrogram, or a Bark-scale spectrogram, etc.; the extraction method and feature data form of the at least one acoustic feature frame are not limited in the embodiments of the present application.
  • each acoustic feature frame may include 18-dimensional BFCC features (Bark-Frequency Cepstral Coefficients) plus 2-dimensional pitch (Pitch) related features.
  • Each acoustic feature frame may correspond to 10 ms; then, for an audio signal with a sampling rate of 16 kHz, each acoustic feature frame may contain 160 sampling points.
  • the electronic device may perform multi-layer convolution processing on at least one acoustic feature frame through the frame rate network, and extract the high-level speech feature of each acoustic feature frame as the conditional feature corresponding to the acoustic feature frame.
  • the electronic device can convert the text to be processed into 100 frames of acoustic feature frames through S101, and then process the 100 frames of acoustic feature frames simultaneously through the frame rate network to obtain corresponding 100 frames of conditional features.
  • the frame rate network may comprise two convolutional layers and two fully connected layers in series.
  • the two convolutional layers may be two convolutional layers (conv3x1) with a filter size of 3.
  • The 20-dimensional features of each frame are first passed through the two convolutional layers, which, using the two preceding frames and the two following frames, produce a 5-frame receptive field; the output is combined with a residual connection and then passed through two fully connected layers to output a 128-dimensional conditional vector f as the conditional feature, which is used to assist the sample rate network in forward residual prediction.
  • For each acoustic feature frame, the conditional feature output by the frame rate network is calculated only once; that is, when the sample rate network recursively predicts the sampling values of the downsampled sampling points corresponding to the acoustic feature frame, the conditional feature of that frame remains unchanged throughout the recursive prediction of the frame.
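For illustration only, the frame rate network described above (two conv3x1 layers giving a 5-frame receptive field, a residual connection, and two fully connected layers producing a 128-dimensional conditional vector f) could be sketched in PyTorch as follows; the activation functions, hidden widths, and exact placement of the residual connection are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class FrameRateNetwork(nn.Module):
    def __init__(self, feat_dim: int = 20, cond_dim: int = 128):
        super().__init__()
        # Two conv3x1 layers: stacked, they cover the two preceding and the
        # two following frames, i.e. a 5-frame receptive field.
        self.conv1 = nn.Conv1d(feat_dim, cond_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(cond_dim, cond_dim, kernel_size=3, padding=1)
        self.skip = nn.Conv1d(feat_dim, cond_dim, kernel_size=1)  # residual path (assumed)
        self.fc1 = nn.Linear(cond_dim, cond_dim)
        self.fc2 = nn.Linear(cond_dim, cond_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_frames, feat_dim) acoustic feature frames
        x = feats.transpose(1, 2)                       # (batch, feat_dim, num_frames)
        h = torch.tanh(self.conv2(torch.tanh(self.conv1(x))))
        h = h + self.skip(x)                            # residual connection around the conv stack
        h = h.transpose(1, 2)                           # (batch, num_frames, cond_dim)
        return torch.tanh(self.fc2(torch.tanh(self.fc1(h))))  # one 128-dim vector f per frame

# Example: 100 acoustic feature frames of 20 dims -> 100 conditional vectors f.
f = FrameRateNetwork()(torch.randn(1, 100, 20))
print(f.shape)   # torch.Size([1, 100, 128])
```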
  • S103: Perform frequency band division and time-domain downsampling on the current frame among the acoustic feature frames to obtain n subframes corresponding to the current frame, wherein n is a positive integer greater than 1 and each of the n subframes contains a preset number of sampling points.
  • The electronic device may divide the current frame among the acoustic feature frames into frequency bands, and then downsample the time-domain sampling points contained in each divided frequency band to reduce the number of sampling points in each band, so as to obtain the n subframes corresponding to the current frame.
  • the frequency domain partitioning process may be implemented by a filter bank.
  • Taking n equal to 4 as an example, for a current frame whose frequency range is 0-8 kHz, the electronic device can use a filter bank containing four band-pass filters, such as a Pseudo-QMF (Pseudo Quadrature Mirror Filter) bank, to divide the current frame into bands of 2 kHz each, obtaining the features corresponding to the 0-2 kHz, 2-4 kHz, 4-6 kHz and 6-8 kHz frequency bands and thus 4 initial subframes corresponding to the current frame; each initial subframe still contains 160 sampling points.
  • The electronic device then downsamples each initial subframe through a downsampling filter to reduce the number of sampling points in each initial subframe to 40, thereby obtaining the 4 subframes corresponding to the current frame.
  • the electronic device may also divide the frequency band of the current frame by other software or hardware methods, and the specific selection is made according to the actual situation, which is not limited in the embodiment of the present application.
  • each frame may be regarded as a current frame, and the division and time-domain downsampling are performed in the same processing process.
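A minimal sketch of the band division and time-domain downsampling described above, assuming 16 kHz audio, n = 4 bands of 2 kHz each, and simple FIR band filters designed with SciPy; a real system would use a Pseudo-QMF analysis bank as mentioned above, so this is illustrative only.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def split_into_subframes(frame: np.ndarray, n: int = 4, fs: int = 16000, taps: int = 63):
    nyq = fs / 2                       # 8000 Hz
    width = nyq / n                    # 2000 Hz per band
    subframes = []
    for b in range(n):
        lo, hi = b * width, (b + 1) * width
        if b == 0:
            h = firwin(taps, hi, fs=fs)                         # low-pass 0-2 kHz
        elif b == n - 1:
            h = firwin(taps, lo, pass_zero=False, fs=fs)        # high-pass 6-8 kHz
        else:
            h = firwin(taps, [lo, hi], pass_zero=False, fs=fs)  # band-pass
        band = lfilter(h, [1.0], frame)
        subframes.append(band[::n])    # time-domain downsampling by n
    return subframes                   # n subframes, each len(frame) // n samples

# Example: a 10 ms frame (160 samples) yields 4 subframes of 40 samples each.
frame = np.random.randn(160)
subs = split_into_subframes(frame)
assert all(len(s) == 40 for s in subs)
```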
  • S104: Through the sampling prediction network, in the i-th prediction round, synchronously predict the sampling values corresponding to the current m adjacent sampling points on the n subframes to obtain m×n sub-prediction values, and thereby obtain n sub-prediction values corresponding to each sampling point of the preset number of sampling points, wherein i is a positive integer greater than or equal to 1 and m is a positive integer greater than or equal to 2 and less than or equal to the preset number.
  • After obtaining the at least one acoustic feature frame, the electronic device needs to convert it into a waveform representation of the audio signal. Therefore, for each acoustic feature frame, the electronic device needs to predict the spectral amplitude on the linear frequency scale corresponding to each sampling point in the frequency domain as the sampling prediction value of that sampling point, and then obtain the audio signal waveform corresponding to the frame from the prediction values of all sampling points.
  • Since the n subframes obtained by frequency band division share the same time-domain sampling points and each contains the same preset number of sampling points, the sampling values corresponding to the n subframes can be predicted at m sampling points at adjacent times to obtain m×n sub-prediction values, which greatly reduces the number of loops required to predict one acoustic feature frame.
  • The electronic device can process every group of m adjacent sampling points among the preset number of time-domain sampling points in the same way. Taking m equal to 2 as an example, in one prediction round the n sub-prediction values corresponding to the n subframes at sampling point t1 and the n sub-prediction values corresponding to the n subframes at sampling point t2 are predicted at the same time as 2n sub-prediction values; in the next prediction round, sampling points t3 and t4 are regarded as the two adjacent sampling points and are synchronously processed and predicted in the same way.
  • In this self-recursive manner, the electronic device completes the sampling value prediction of all of the preset number of sampling points through the sampling prediction network, and obtains the n sub-prediction values corresponding to each sampling point.
  • S105: Obtain the audio prediction signal corresponding to the current frame according to the n sub-prediction values corresponding to each sampling point, and then perform audio synthesis on the audio prediction signals corresponding to the acoustic feature frames of the at least one acoustic feature frame to obtain the target audio corresponding to the text to be processed.
  • the n sub-prediction values corresponding to each sampling point represent the predicted amplitudes of the audio signal of the sampling point in the n frequency bands.
  • the electronic device may combine the n sub-prediction values corresponding to the sampling point in the frequency domain to obtain the signal prediction value corresponding to the sampling point in the entire frequency band.
  • the electronic device further associates each sampling point in the current frame with the order in the preset time series, and combines the predicted signal values corresponding to each sampling point in the time domain to obtain an audio prediction signal corresponding to the current frame.
  • the sampling prediction network performs the same processing on each acoustic feature frame, and can predict all signal waveforms through at least one acoustic feature frame, thereby obtaining the target audio.
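For illustration, merging the n sub-prediction sequences back into a full-band frame signal could look like the following sketch, which mirrors the simple FIR bank from the band-splitting sketch above (upsample each subframe by n, filter it to its band, and sum); it is not a true Pseudo-QMF synthesis bank, and the filter design is an assumption.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def merge_subframes(subframes, fs: int = 16000, taps: int = 63):
    n = len(subframes)
    out_len = n * len(subframes[0])
    width = fs / 2 / n                          # 2000 Hz per band
    frame = np.zeros(out_len)
    for b, sub in enumerate(subframes):
        up = np.zeros(out_len)
        up[::n] = sub * n                       # zero-stuffing upsampling, gain-compensated
        lo, hi = b * width, (b + 1) * width
        if b == 0:
            h = firwin(taps, hi, fs=fs)
        elif b == n - 1:
            h = firwin(taps, lo, pass_zero=False, fs=fs)
        else:
            h = firwin(taps, [lo, hi], pass_zero=False, fs=fs)
        frame += lfilter(h, [1.0], up)          # keep only this band, then sum over bands
    return frame                                # audio prediction signal for the current frame
```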
  • In this way, the electronic device divides each acoustic feature frame into multiple subframes in the frequency domain and downsamples each subframe, which reduces the overall number of sampling points that the sampling prediction network has to process when predicting sampling values; furthermore, by simultaneously predicting multiple sampling points at adjacent times during one prediction round, synchronous processing of multiple sampling points is achieved, which significantly reduces the number of loops the sampling prediction network requires to predict the audio signal, increases the processing speed of audio synthesis and improves the efficiency of audio processing.
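A quick check of the resulting loop-count reduction, using the numbers given in this description (a 10 ms frame at 16 kHz, n = 4 sub-bands, m = 2 jointly predicted sampling points):

```python
frame_samples = 160                       # 10 ms at 16 kHz
n, m = 4, 2
baseline_loops = frame_samples            # per-sample autoregression: 160 loops
subframe_samples = frame_samples // n     # 40 sampling points per subframe after downsampling
multiband_loops = subframe_samples // m   # 20 prediction rounds per frame
print(baseline_loops, multiband_loops)    # 160 -> 20
```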
  • S103 may be implemented by executing S1031-S1032, as follows:
  • The sampling prediction network may include 2n independent fully connected layers, and the m adjacent sampling points include, in the i-th prediction round, the sampling point t corresponding to the current time t and the sampling point t+1 corresponding to the next time t+1, where t is a positive integer greater than or equal to 1.
  • S104 in FIG. 8 can be implemented through S1041-S1044, which will be described in conjunction with each step.
  • In the i-th prediction round, the electronic device first performs linear coding prediction, through the sampling prediction network, on the n linear sampling values corresponding to sampling point t of the n subframes at the current time, and obtains n sub-rough measurement values at time t.
  • the sampling prediction network needs to refer to the signal of at least one historical sampling point before the sampling point t when predicting the sub-rough measurement values at the time t corresponding to the sampling point t.
  • the predicted value of the signal at the time of sampling point t is solved by linear combination.
  • the maximum number of historical sampling points that the sampling prediction network needs to refer to is the preset window threshold.
  • the electronic device may determine at least one historical sampling point corresponding to the linear coding prediction of the sampling point t according to the order of the sampling point t in the preset time series and in combination with the preset window threshold of the sampling prediction network.
  • the electronic device may also perform S201 or S202 to determine at least one historical sampling point at time t corresponding to the sampling point t, as follows:
  • S201: When t is less than or equal to the preset window threshold, use all sampling points before sampling point t as the at least one historical sampling point at time t; the preset window threshold represents the maximum number of sampling points that can be processed by linear coding prediction.
  • For example, if the preset window threshold is 16, i.e., the maximum queue that the linear prediction module in the sampling prediction network can process in one prediction is 16 sampling points, then when linear coding prediction is performed on sampling point 15, the linear prediction module can use all sampling points before sampling point 15, that is, the 14 sampling points from sampling point 1 to sampling point 14, as the at least one historical sampling point at time t.
  • S202: When t is greater than the preset window threshold, the sampling points in the range from sampling point t-1 to sampling point t-k are used as the at least one historical sampling point at time t, where k is the preset window threshold.
  • In this way, the prediction window of the linear prediction module slides correspondingly and gradually along the preset time series of sampling points.
  • For example, when t is greater than 16 and the linear prediction module performs linear coding prediction on sampling point 18, the end of the prediction window slides to the position of sampling point 17, and the linear prediction module uses the 16 sampling points in the range from sampling point 17 to sampling point 2 as the at least one historical sampling point at time t.
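The window selection described in S201 and S202 can be captured in a few lines; the 1-based sampling-point indexing follows the examples above:

```python
def history_indices(t, k=16):
    # Sampling points are indexed from 1, as in the examples above.
    start = max(1, t - k)
    return list(range(start, t))

print(history_indices(15))   # [1, ..., 14]  -> 14 historical sampling points (t <= threshold)
print(history_indices(18))   # [2, ..., 17]  -> 16 points, the window has slid forward
```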
  • Through the linear prediction module, the electronic device may obtain the n sub-prediction values corresponding to each historical sampling point in the at least one historical sampling point at time t as the at least one historical sub-prediction value at time t, perform linear coding prediction on the linear values of the audio signal at sampling point t according to the at least one historical sub-prediction value at time t, and obtain the n sub-rough measurement values at time t corresponding to sampling point t.
  • When i is greater than 1, the electronic device can use the prediction results of the previous round as the excitation of the i-th prediction round, and the nonlinear error values of the audio signal are predicted by the sampling prediction network.
  • The historical prediction results include the n residual values and the n sub-prediction values corresponding to each of the two adjacent sampling points processed in the (i-1)-th prediction round.
  • Based on the historical prediction results of the (i-1)-th round and combined with the conditional features, the electronic device can simultaneously perform forward prediction, through the 2n fully connected layers, on the residual values corresponding to the n subframes at sampling point t and at sampling point t+1, obtaining the n residual values at time t corresponding to sampling point t and the n residual values at time t+1 corresponding to sampling point t+1.
  • S1042 may be implemented through S301-S303, which will be described in conjunction with each step.
  • When i is greater than 1, relative to the current time t of the i-th round, the sampling points processed in the (i-1)-th round of the prediction process are sampling point t-2 and sampling point t-1. The historical prediction results available from the (i-1)-th round therefore include: the n sub-coarse measurement values at time t-2, the n residual values at time t-2 and the n sub-predicted values at time t-2 corresponding to sampling point t-2; and the n sub-coarse measurement values at time t-1, the n residual values at time t-1 and the n sub-predicted values at time t-1 corresponding to sampling point t-1. From these historical prediction results, the sampling prediction network obtains the n sub-coarse measurement values at time t-1, the n residual values at time t-1, the n residual values at time t-2, the n sub-predicted values at time t-1 and the n sub-predicted values at time t-2, and on the basis of these data performs the i-th round prediction of the sample values at sampling point t and sampling point t+1.
  • In order to reduce the complexity of the network computation, the sampling prediction network needs to perform dimensionality reduction on the feature data to be processed, removing the feature data in dimensions that have little influence on the prediction result, so as to improve the efficiency of the network computation.
  • In some embodiments, the sampling prediction network includes a first gated recurrent network and a second gated recurrent network.
  • S302 may be implemented through S3021-S3023, which will be described in conjunction with each step.
  • The electronic device merges, along the feature dimension, the n sub-coarse measurement values at time t, the n sub-coarse measurement values at time t-1, the n residual values at time t-1, the n residual values at time t-2, the n sub-predicted values at time t-1 and the n sub-predicted values at time t-2, obtaining the full set of feature dimensions used for residual prediction as the initial feature vector set.
  • The first gated recurrent network can perform weight analysis on the feature vectors of different dimensions and, based on the results of that analysis, retain the feature data in dimensions that are important and effective for residual prediction while forgetting the feature data in invalid dimensions, thereby realizing dimensionality reduction of the initial feature vector set and obtaining the intermediate feature vector set.
  • The gated recurrent network may be a GRU network or an LSTM network, selected according to the actual situation, which is not limited in this embodiment of the present application.
  • Based on the conditional feature, the electronic device performs dimensionality reduction on the intermediate feature vector set again through the second gated recurrent network, so as to remove redundant information and reduce the workload of the subsequent prediction process; a sketch of this two-stage reduction is given below.
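The following sketch illustrates, under assumed dimensions, the feature merge followed by a two-stage gated-recurrent dimensionality reduction. The 384/16 hidden sizes follow the example given later in this document; the GRU cell itself is a plain textbook GRU with random weights, not the sparse production layer, and the 128-dimensional conditional feature is taken from the frame rate network description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyGRU:
    """Single-step GRU cell with randomly initialized weights (illustration only)."""
    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(in_dim + hidden_dim)
        self.W = rng.normal(scale=scale, size=(3, hidden_dim, in_dim))
        self.U = rng.normal(scale=scale, size=(3, hidden_dim, hidden_dim))
        self.h = np.zeros(hidden_dim)

    def step(self, x):
        z = sigmoid(self.W[0] @ x + self.U[0] @ self.h)           # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ self.h)           # reset gate
        h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * self.h))
        self.h = (1 - z) * self.h + z * h_tilde
        return self.h

n, cond_dim = 4, 128                                  # subframes, conditional feature size
rng = np.random.default_rng(1)
f = rng.normal(size=cond_dim)                         # conditional feature from the frame rate network

# initial feature vector set: concat of the six groups of n values listed in S3021
groups = [rng.normal(size=n) for _ in range(6)]       # coarse_t, coarse_t-1, resid_t-1, resid_t-2, pred_t-1, pred_t-2
initial_vec = np.concatenate(groups + [f])

gru_a = TinyGRU(initial_vec.size, 384)                # first gated recurrent network
gru_b = TinyGRU(384 + cond_dim, 16)                   # second gated recurrent network, also conditioned on f

intermediate = gru_a.step(initial_vec)
reduced = gru_b.step(np.concatenate([intermediate, f]))
print(initial_vec.shape, intermediate.shape, reduced.shape)   # (152,) -> (384,) -> (16,)
```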
  • S303 may be implemented by performing the process of S3031 - S3033 , which will be described in conjunction with each step.
  • The electronic device may use the n dimensionality-reduced residual values at time t-2 and the n dimensionality-reduced prediction values at time t-2 obtained in the (i-1)-th round of the prediction process as the vocal-tract excitation of the i-th round, so that the forward prediction capability of the sample rate network can be used to predict the residual value at time t.
  • Similarly, the electronic device may use the n dimensionality-reduced residual values at time t-1 and the n dimensionality-reduced prediction values at time t-1 in the dimensionality-reduced feature set as the excitation value at time t+1, so that the forward prediction capability of the sample rate network can be used to predict the residual value at time t+1.
  • In n of the 2n fully connected layers, based on the conditional feature and the excitation value at time t, each fully connected layer performs forward residual prediction for sampling point t according to the n dimensionality-reduced sub-coarse measurement values at time t-1, obtaining the n residual values at time t; at the same time, in the other n fully connected layers of the 2n fully connected layers, based on the conditional feature and the excitation value at time t+1, each fully connected layer performs forward residual prediction for sampling point t+1 according to the n dimensionality-reduced sub-coarse measurement values at time t, obtaining the n residual values at time t+1.
  • The 2n fully connected layers operate simultaneously and independently, n of which handle the prediction related to sampling point t.
  • Each of these n fully connected layers performs the residual value prediction of sampling point t on one of the n subframes: according to the dimensionality-reduced sub-coarse measurement value at time t-1 on that subframe, combined with the conditional feature and the excitation value at time t on that subframe (that is, the dimensionality-reduced residual value at time t-2 and the dimensionality-reduced prediction value at time t-2 corresponding to that subframe), it predicts the residual value of sampling point t on that subframe; the n fully connected layers thus yield the residual value of sampling point t on every subframe, that is, the n residual values at time t.
  • Similarly, the other n fully connected layers of the 2n fully connected layers each perform the residual value prediction of sampling point t+1 on one of the n subframes: according to the dimensionality-reduced sub-coarse measurement value at time t on that subframe, combined with the conditional feature and the excitation value at time t+1 on that subframe (that is, the dimensionality-reduced residual value at time t-1 and the dimensionality-reduced prediction value at time t-1 corresponding to that subframe), they predict the residual value of sampling point t+1 on that subframe, and thereby yield the residual value of sampling point t+1 on every subframe, that is, the n residual values at time t+1. A sketch of this parallel residual prediction is given below.
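The sketch below is illustrative only (assumed input layout, random weights, scalar outputs): n layers predict the time-t residual of each subframe from the t-1 coarse values and the time-t excitation (the t-2 results), while the other n layers predict the time-t+1 residuals from the time-t coarse values and the time-t+1 excitation (the t-1 results), all in one pass.

```python
import numpy as np

n, feat_dim, cond_dim = 4, 16, 128
rng = np.random.default_rng(2)

def fc(in_dim, out_dim=1):
    """One tiny fully connected layer with fixed random weights."""
    W = rng.normal(scale=0.05, size=(out_dim, in_dim))
    return lambda x: (W @ x).item()

in_dim = feat_dim + cond_dim + 1 + 2         # reduced features + f + coarse value + (residual, prediction) excitation
fc_t  = [fc(in_dim) for _ in range(n)]       # n layers for sampling point t
fc_t1 = [fc(in_dim) for _ in range(n)]       # the other n layers for sampling point t+1

reduced = rng.normal(size=feat_dim)          # dimensionality-reduced feature set
f = rng.normal(size=cond_dim)                # conditional feature
coarse_prev, coarse_t = rng.normal(size=n), rng.normal(size=n)   # coarse values at t-1 and t
exc_t  = rng.normal(size=(n, 2))             # per-subframe (residual, prediction) from time t-2
exc_t1 = rng.normal(size=(n, 2))             # per-subframe (residual, prediction) from time t-1

resid_t  = [fc_t[b](np.concatenate([reduced, f, [coarse_prev[b]], exc_t[b]]))  for b in range(n)]
resid_t1 = [fc_t1[b](np.concatenate([reduced, f, [coarse_t[b]],    exc_t1[b]])) for b in range(n)]
print(len(resid_t), len(resid_t1))           # n residuals for t and n for t+1 from one pass
```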
  • S1043 is the linear prediction performed when the prediction window of the linear prediction algorithm slides to sampling point t+1. Through a process similar to S1041, the electronic device obtains the at least one historical sub-predicted value at time t+1 corresponding to sampling point t+1, performs linear coding prediction on the linear sampling values corresponding to sampling point t+1 according to those historical sub-predicted values, and obtains the n sub-coarse measurement values at time t+1.
  • According to the n residual values at time t and the n sub-coarse measurement values at time t, the n sub-predicted values at time t corresponding to sampling point t are obtained; according to the n residual values at time t+1 and the n sub-coarse measurement values at time t+1, the n sub-predicted values at time t+1 are obtained; the n sub-predicted values at time t and the n sub-predicted values at time t+1 are taken as the 2n sub-predicted values.
  • For sampling point t, the electronic device can, by way of signal superposition on each of the n subframes, superimpose the signal amplitudes of the n sub-coarse measurement values at time t, which represent the linear information of the audio signal, and the n residual values at time t, which represent the nonlinear random-noise information, obtaining the n sub-predicted values at time t corresponding to sampling point t.
  • Likewise, the electronic device can perform signal superposition on the n residual values at time t+1 and the n sub-coarse measurement values at time t+1 to obtain the n sub-predicted values at time t+1, and then take the n sub-predicted values at time t and the n sub-predicted values at time t+1 as the 2n sub-predicted values, as in the sketch below.
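A small sketch of this superposition step with made-up values: per subframe, the sub-predicted value is the sum of the coarse (linear) estimate and the predicted residual (nonlinear noise) term, for both sampling points handled in the round.

```python
import numpy as np

def superpose(coarse, residual):
    """s = p + e, applied element-wise across the n subframes."""
    return np.asarray(coarse) + np.asarray(residual)

coarse_t,  resid_t  = np.array([0.12, -0.40, 0.05, 0.30]), np.array([0.01, 0.03, -0.02, 0.00])
coarse_t1, resid_t1 = np.array([0.10, -0.35, 0.07, 0.28]), np.array([-0.02, 0.04, 0.01, 0.01])

sub_pred_t, sub_pred_t1 = superpose(coarse_t, resid_t), superpose(coarse_t1, resid_t1)
print(sub_pred_t, sub_pred_t1)      # the 2n sub-predicted values of this round
```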
  • In some embodiments, based on the method flows of Figures 8-11, the network architecture of the frame rate network and the sampling prediction network in the electronic device may be as shown in FIG. 12, in which the sampling prediction network includes m×n dual fully connected layers used, in one round of prediction, to predict the sample values of the m sampling points in the time domain on each of the n subframes in the frequency domain.
  • Taking n=4 and m=2 as an example, dual fully connected layer 1 to dual fully connected layer 8 are the 2×4 independent fully connected layers included in the sampling prediction network 110; the frame rate network 111 extracts the conditional feature f from the current frame through two convolutional layers and two fully connected layers; and the band-pass downsampling filter bank 112 performs frequency-band division and time-domain downsampling on the current frame to obtain 4 subframes b1 to b4, each of which corresponds to 40 sampling points in the time domain (a rough stand-in for this filter bank is sketched below). Through a multi-round self-recursive cyclic prediction process, the sampling prediction network 110 can predict the sample values of the 40 sampling points in the time domain.
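The following is a rough stand-in, not a Pseudo-QMF design, for the band-pass downsampling filter bank: the frame is split into 4 equal bands with FFT masks and each band is then decimated by 4 in the time domain, so a 160-sample frame becomes 4 subframes of 40 sampling points each. The masking and naive decimation are simplifying assumptions for illustration.

```python
import numpy as np

def split_and_downsample(frame, n_bands=4):
    """Split one frame into n_bands frequency bands and decimate each by n_bands."""
    spec = np.fft.rfft(frame)
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    subframes = []
    for b in range(n_bands):
        mask = np.zeros_like(spec)
        mask[edges[b]:edges[b + 1]] = spec[edges[b]:edges[b + 1]]   # keep one band only
        band = np.fft.irfft(mask, n=len(frame))
        subframes.append(band[::n_bands])                           # naive time-domain decimation
    return subframes

frame = np.random.default_rng(3).normal(size=160)    # one 10 ms frame at 16 kHz
subs = split_and_downsample(frame)
print(len(subs), subs[0].shape)                      # 4 subframes, 40 samples each
```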
  • For the i-th round of the multi-round prediction process, the sampling prediction network 110 calculates the LPC coefficients and the LPC prediction value at time t, and obtains, from the historical sub-predicted values at time t corresponding to the historical sampling points at time t, the n sub-coarse measurement values at time t corresponding to the current sampling point t. It then obtains, from the (i-1)-th round of the prediction process, the corresponding n sub-coarse measurement values at time t-1, the n sub-predicted values at time t-2, the n residual values at time t-2, the n sub-predicted values at time t-1 and the n residual values at time t-1, and sends them together to the merging layer for feature-dimension merging to obtain the initial feature vector set.
  • The sampling prediction network 110 then performs dimensionality reduction on the initial feature vector set through the first gated recurrent network and the second gated recurrent network, combined with the conditional feature f, to obtain the dimensionality-reduced feature set used for prediction, and sends the dimensionality-reduced feature set to the 8 dual fully connected layers. Through 4 of the dual fully connected layers it predicts the residual values corresponding to sampling point t, obtaining the 4 residual values of sampling point t on the 4 subframes; at the same time, through the other 4 dual fully connected layers, it predicts the residual values corresponding to sampling point t+1, obtaining the 4 residual values of sampling point t+1 on the 4 subframes.
  • From the 4 residual values at time t and the 4 sub-coarse measurement values at time t, the sampling prediction network 110 then obtains the 4 sub-predicted values of sampling point t on the 4 subframes; on that basis, it obtains the at least one historical sub-predicted value at time t+1 corresponding to sampling point t+1 and, by calculating the LPC prediction value at time t+1, obtains the 4 sub-coarse measurement values of sampling point t+1 on the 4 subframes. From these and the 4 residual values at time t+1 it obtains the 4 sub-predicted values of sampling point t+1 on the 4 subframes. The i-th round of the prediction process is then complete; sampling point t and sampling point t+1 are updated for the next round, and the cyclic prediction continues in the same way until all 40 sampling points in the time domain have been predicted and the 4 sub-predicted values corresponding to each sampling point have been obtained.
  • Compared with looping once per sampling point, the method in the embodiment of the present application reduces the number of cycles of the sampling prediction network from 160 to 160/4 (the number of subframes)/2 (the number of adjacent sampling points), that is, to 20, thereby greatly reducing the number of loop iterations of the sampling prediction network and improving the processing speed and efficiency of audio processing.
  • In scenarios where m takes other values, the number of dual fully connected layers in the sampling prediction network 110 needs to be set to m×n accordingly, and in the prediction process the forward prediction time span for each sampling point is m; that is, when predicting the residual value of a sampling point, the historical prediction results of the m sampling points preceding it, obtained in the previous round of the prediction process, are used as the excitation values for residual prediction.
  • In some embodiments, when i is equal to 1 and there is no previous-round prediction result to serve as the excitation, S1045-S1047 may also be executed, which will be described in conjunction with each step. The electronic device combines the conditional feature with the preset excitation parameters and, through the 2n fully connected layers, performs forward residual prediction on sampling point t and sampling point t+1 simultaneously, obtaining the n residual values at time t corresponding to sampling point t and the n residual values at time t+1 corresponding to sampling point t+1.
  • the preset excitation parameter may be 0, or may be set to other values according to actual needs, and is specifically selected according to the actual situation, which is not limited in this embodiment of the present application.
  • According to the n residual values at time t and the n sub-coarse measurement values at time t, the n sub-predicted values at time t corresponding to sampling point t are obtained; according to the n residual values at time t+1 and the n sub-coarse measurement values at time t+1, the n sub-predicted values at time t+1 are obtained; the n sub-predicted values at time t and the n sub-predicted values at time t+1 are taken as the 2n sub-predicted values.
  • S105 may be implemented by executing S1051-1053 , which will be described in conjunction with each step.
  • Using the inverse of the frequency-band division, the electronic device superimposes the n sub-predicted values corresponding to each sampling point in the frequency domain, obtaining the signal prediction value corresponding to each sampling point; it then synthesizes the signal prediction values of the sampling points in time-domain order, obtaining the audio prediction signal corresponding to the current frame.
  • In a cyclic process, the electronic device takes each acoustic feature frame of the at least one acoustic feature frame as the current frame in turn and performs the above signal synthesis, thereby obtaining the audio signal corresponding to each frame of acoustic features.
  • S1053: Perform signal synthesis on the audio signals corresponding to each frame of acoustic features to obtain the target audio. That is, the electronic device performs signal synthesis on the audio signals corresponding to the frames of acoustic features to obtain the target audio; a sketch of the subband merge and time-domain assembly follows below.
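The companion sketch to the analysis step above: per sampling point the n sub-predicted values are merged back across the frequency bands, and the per-point signal values are then laid out in time order to form the frame's audio prediction signal. Zero-stuffing upsampling and simple summation stand in for the real synthesis filter bank.

```python
import numpy as np

def synthesize_frame(subframes, n_bands=4):
    """Merge n_bands downsampled subframes back into one full-rate frame signal."""
    length = len(subframes[0]) * n_bands
    out = np.zeros(length)
    for band in subframes:
        up = np.zeros(length)
        up[::n_bands] = band          # naive upsampling of one subband
        out += up                     # frequency-domain superposition per sampling point
    return out

subs = [np.random.default_rng(b).normal(size=40) for b in range(4)]
frame_signal = synthesize_frame(subs)                          # 160-sample audio prediction signal
target_audio = np.concatenate([frame_signal, frame_signal])    # frames joined in time order
print(frame_signal.shape, target_audio.shape)
```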
  • S101 may be implemented by executing S1011 to S1013 , which will be described in conjunction with each step.
  • the preprocessing of the text has a very important influence on the quality of the target audio that is finally generated.
  • The text to be processed that the electronic device obtains usually contains spaces and punctuation characters that can carry different semantics in different contexts, which may cause the text to be misread, or cause some words to be skipped or repeated. Therefore, the electronic device needs to preprocess the text to be processed first, so as to normalize the information of the text to be processed.
  • Preprocessing the text to be processed by the electronic device may include: converting all characters in the text to upper case; removing all internal punctuation; ending each sentence with a uniform terminator, such as a period or a question mark; replacing the spaces between words with a special delimiter character; and so on, selected according to the actual situation, which is not limited in this embodiment of the present application. A hedged example of this kind of normalization is sketched below.
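A hedged example of the kind of normalization described above: upper-casing, stripping internal punctuation, closing each sentence with a uniform terminator and replacing inter-word spaces with a delimiter character. The delimiter "_" and the exact punctuation set are assumptions for illustration only.

```python
import re

def normalize(text, terminator=".", delimiter="_"):
    """Normalize raw text before text-to-speech conversion (illustrative rules)."""
    sentences = re.split(r"[.!?]+", text)
    cleaned = []
    for s in sentences:
        s = s.strip().upper()
        if not s:
            continue
        s = re.sub(r"[^\w\s]", "", s)            # drop internal punctuation
        s = re.sub(r"\s+", delimiter, s)         # spaces -> special delimiter
        cleaned.append(s + terminator)           # uniform sentence terminator
    return " ".join(cleaned)

print(normalize("Hello, world! How are you?"))   # HELLO_WORLD. HOW_ARE_YOU.
```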
  • the text-to-speech conversion model is a neural network model that has been trained and can convert text information into acoustic features.
  • the electronic device uses a text-to-speech conversion model to correspondingly convert at least one text sequence in the text information to be converted into at least one acoustic feature frame, thereby implementing acoustic feature prediction of the text information to be converted.
  • By preprocessing the text to be processed, the audio quality of the target audio can be improved; moreover, the electronic device can take the original text to be processed as input data and, through the audio processing method, output the final data processing result of the text to be processed, that is, the target audio, which realizes end-to-end processing of the text to be processed, reduces transitional processing between system modules, and improves the overall cohesion.
  • In an exemplary application, the electronic device includes a text-to-speech conversion model 14-1 and a multi-band multi-time-domain vocoder 14-2.
  • The text-to-speech conversion model 14-1 adopts a sequence-to-sequence Tacotron-style model with an attention mechanism, including a CBHG (1-D Convolution Bank + Highway network + bidirectional GRU) encoder 141, an attention module 142, a decoder 143 and a CBHG smoothing module 144.
  • The CBHG encoder 141 is configured to treat the sentences in the original text as sequences, extract robust sequence representations from the sentences, and encode them into fixed-length vectors that can map the sentence content.
  • the attention module 142 is configured to pay attention to all words expressed by the robust sequence, and assist the encoder to perform better encoding by calculating an attention score.
  • The decoder 143 is configured to map the fixed-length vectors obtained by the encoder into the acoustic features of the corresponding sequence and to output smoother acoustic features through the CBHG smoothing module 144, thereby obtaining at least one acoustic feature frame. The at least one acoustic feature frame then enters the multi-band multi-time-domain vocoder 14-2, and the conditional feature f of each frame is calculated by the frame rate network 145 in the multi-band multi-time-domain vocoder.
  • Each acoustic feature frame is divided into 4 subframes by the band-pass downsampling filter bank 146, and after each subframe is downsampled in the time domain, the 4 subframes enter the self-recursive sampling prediction network 147. In the sampling prediction network 147, LPC coefficient calculation (Compute LPC) and LPC prediction calculation (Compute prediction) are used to predict, in the current round, the linear prediction value of sampling point t at the current time t on the 4 subframes, obtaining the 4 sub-coarse measurement values at time t. The sampling prediction network 147 takes two sampling points per round as the forward prediction step, and obtains from the historical prediction results of the previous round the 4 sub-predicted values of sampling point t-1 on the 4 subframes, the sub-coarse measurement values of sampling point t-1 on the 4 subframes, the residual values of sampling point t-1 on the 4 subframes and the residual values of the preceding sampling point on the 4 subframes; these, combined with the conditional feature f, are sent to the merging layer (concat layer) in the sampling prediction network for feature-dimension merging to obtain the initial feature vector. The initial feature vector is then passed through a 90%-sparse 384-dimensional first gated recurrent network (GRU-A) and an ordinary 16-dimensional second gated recurrent network (GRU-B) for feature dimensionality reduction to obtain the dimensionality-reduced feature set. The sampling prediction network 147 sends the dimensionality-reduced feature set into eight 256-dimensional dual fully connected (dual FC) layers and, through the eight 256-dimensional dual FC layers combined with the conditional feature f, predicts the sub-residual values of sampling point t and sampling point t+1 on the 4 subframes. The concrete sizes used in this example are collected in the configuration sketch below.
  • The audio synthesis module 148 combines the audio signals at the sampling points in the time domain to obtain the audio signal corresponding to the frame, and then combines the audio signals corresponding to each frame of the at least one acoustic feature frame to obtain a piece of audio corresponding to the at least one acoustic feature frame, that is, the target audio corresponding to the original text initially input to the electronic device.
  • Although the input matrix of the GRU-A layer becomes larger, a table-lookup operation can make the effect of this input overhead negligible, and the multi-band multi-time-domain strategy reduces the number of cycles required for the self-recursion of the sampling prediction network by a factor of 8 compared with a conventional vocoder, making the vocoder 2.75 times faster. Meanwhile, the target audio synthesized by the electronic device of the present application drops by only 3% in subjective quality scoring, so the processing speed and efficiency of audio processing are improved without materially affecting audio quality.
  • In some embodiments, the software modules in the audio processing apparatus 655 stored in the memory 650 may include:
  • a text-to-speech conversion model 6551 configured to perform voice feature conversion on the text to be processed to obtain at least one acoustic feature frame
  • Frame rate network 6552 configured to extract the conditional feature corresponding to each frame of acoustic feature frame from each frame of acoustic feature frame of the at least one frame of acoustic feature frame through the frame rate network;
  • the time-domain and frequency-domain processing module 6553 is configured to perform frequency band division and time-domain downsampling on the current frame in each frame of acoustic feature frames, to obtain n subframes corresponding to the current frame; wherein n is a positive integer greater than 1, and each subframe of the n subframes includes a preset number of sampling points;
  • the sampling prediction network 6554 is configured to perform synchronous prediction on the sampling values corresponding to the current m adjacent sampling points on the n subframes during the i-th round of prediction, to obtain m ⁇ n sub-predicted values, and then obtain the n sub-prediction values corresponding to each sampling point in the preset number of sampling points; wherein, i is a positive integer greater than or equal to 1; m is a positive integer greater than or equal to 2 and less than or equal to the preset number;
  • the signal synthesis module 6555 is configured to obtain the audio prediction signal corresponding to the current frame according to the n sub-prediction values corresponding to each sampling point; and then the audio prediction signal corresponding to each acoustic feature frame of at least one acoustic feature frame Perform audio synthesis to obtain target audio corresponding to the text to be processed.
  • the sampling prediction network, when m is equal to 2, includes 2n independent fully connected layers, and the two adjacent sampling points include: in the i-th round of the prediction process, the sampling point t corresponding to the current time t and the sampling point t+1 corresponding to the next time t+1, where t is a positive integer greater than or equal to 1;
  • the sampling prediction network 6554 is further configured to: in the i-th round of the prediction process, through the sampling prediction network, perform linear coding prediction on the linear sampling values of sampling point t on the n subframes based on the at least one historical sampling point at time t corresponding to sampling point t, to obtain n sub-coarse measurement values at time t; when i is greater than 1, based on the historical prediction results corresponding to the (i-1)-th round of the prediction process and combined with the conditional feature, synchronously perform, through the 2n fully connected layers, forward residual prediction on the residual values of sampling point t and sampling point t+1 on each of the n subframes, to obtain the n residual values at time t corresponding to sampling point t and the n residual values at time t+1 corresponding to sampling point t+1, where the historical prediction results include the n residual values and sub-predicted values corresponding to each of the two adjacent sampling points in the (i-1)-th round of the prediction process; based on the at least one historical sampling point at time t+1 corresponding to sampling point t+1, perform linear coding prediction on the linear sampling values of sampling point t+1 on the n subframes, to obtain n sub-coarse measurement values at time t+1; according to the n residual values at time t and the n sub-coarse measurement values at time t, obtain the n sub-predicted values at time t corresponding to sampling point t, and according to the n residual values at time t+1 and the n sub-coarse measurement values at time t+1, obtain the n sub-predicted values at time t+1; and take the n sub-predicted values at time t and the n sub-predicted values at time t+1 as the 2n sub-predicted values.
  • the sampling prediction network 6554 is further configured to: obtain the n sub-coarse measurement values at time t-1 corresponding to sampling point t-1, as well as the n residual values at time t-1, the n residual values at time t-2, the n sub-predicted values at time t-1 and the n sub-predicted values at time t-2 obtained in the (i-1)-th round of the prediction process; perform feature-dimension filtering on the n sub-coarse measurement values at time t, the n sub-coarse measurement values at time t-1, the n residual values at time t-1, the n residual values at time t-2, the n sub-predicted values at time t-1 and the n sub-predicted values at time t-2, to obtain a dimensionality-reduced feature set; and, through each of the 2n fully connected layers, combined with the conditional feature and based on the dimensionality-reduced feature set, synchronously perform forward residual prediction on the residual values of sampling point t and sampling point t+1 on each of the n subframes, to obtain the n residual values at time t and the n residual values at time t+1.
  • the sampling prediction network 6554 is further configured to: determine the n dimensionality-reduced residual values at time t-2 and the n dimensionality-reduced prediction values at time t-2 in the dimensionality-reduced feature set as the excitation value at time t, where the n dimensionality-reduced residual values at time t-2 are obtained by feature-dimension filtering of the n residual values at time t-2 and the n dimensionality-reduced prediction values at time t-2 are obtained by feature-dimension filtering of the n predicted values at time t-2; and determine the n dimensionality-reduced residual values at time t-1 and the n dimensionality-reduced prediction values at time t-1 in the dimensionality-reduced feature set as the excitation value at time t+1, where the n dimensionality-reduced residual values at time t-1 are obtained by feature-dimension filtering of the n residual values at time t-1 and the n dimensionality-reduced prediction values at time t-1 are obtained by feature-dimension filtering of the n predicted values at time t-1;
  • in some embodiments, the sampling prediction network includes a first gated recurrent network and a second gated recurrent network, and the sampling prediction network 6554 is further configured to: merge, along the feature dimension, the n sub-coarse measurement values at time t, the n sub-coarse measurement values at time t-1, the n residual values at time t-1, the n residual values at time t-2, the n sub-predicted values at time t-1 and the n sub-predicted values at time t-2, to obtain an initial feature vector set; based on the conditional feature, perform feature dimensionality reduction on the initial feature vector set through the first gated recurrent network to obtain an intermediate feature vector set; and, based on the conditional feature, perform feature dimensionality reduction on the intermediate feature vector set through the second gated recurrent network to obtain the dimensionality-reduced feature set.
  • the time-domain and frequency-domain processing module 6553 is further configured to perform frequency-domain division on the current frame to obtain n initial subframes, and to downsample the time-domain sampling points corresponding to the n initial subframes to obtain the n subframes.
  • the sampling prediction network 6554 is further configured to: in the i-th round of the prediction process, before performing linear coding prediction on the linear sampling values of sampling point t on the n subframes based on the at least one historical sampling point at time t corresponding to sampling point t to obtain the n sub-coarse measurement values at time t, take, when t is less than or equal to the preset window threshold, all sampling points before sampling point t as the at least one historical sampling point at time t, the preset window threshold representing the maximum number of sampling points that linear coding prediction can process; or, when t is greater than the preset window threshold, take the sampling points in the range from sampling point t-1 to sampling point t-k as the at least one historical sampling point at time t, where k is the preset window threshold.
  • the sampling prediction network 6554 is further configured to: in the i-th round of the prediction process, after performing linear coding prediction on the linear sampling values of sampling point t on the n subframes based on the at least one historical sampling point at time t corresponding to sampling point t to obtain the n sub-coarse measurement values at time t, when i is equal to 1, synchronously perform, through the 2n fully connected layers combined with the conditional feature and the preset excitation parameters, forward residual prediction on the residual values of sampling point t and sampling point t+1 on the n subframes, to obtain the n residual values at time t corresponding to sampling point t and the n residual values at time t+1 corresponding to sampling point t+1.
  • the signal synthesis module 6555 is further configured to superimpose the n sub-prediction values corresponding to each sampling point in the frequency domain to obtain the signal prediction value corresponding to each sampling point;
  • and further to synthesize the signal prediction values corresponding to the sampling points in time-domain order to obtain the audio prediction signal corresponding to the current frame, thereby obtaining the audio signal corresponding to each frame of acoustic features, and to perform signal synthesis on the audio signals corresponding to each frame of acoustic features to obtain the target audio.
  • the text-to-speech conversion model 6551 is further configured to obtain the text to be processed, to preprocess the text to be processed to obtain text information to be converted, and to perform acoustic feature prediction on the text information to be converted to obtain the at least one acoustic feature frame.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the audio processing method described above in the embodiment of the present application.
  • The embodiments of the present application provide a storage medium storing executable instructions, that is, a computer-readable storage medium in which executable instructions are stored. When the executable instructions are executed by a processor, the processor is caused to execute the audio processing method provided by the embodiments of the present application, for example, the methods shown in Figures 8-11 and Figure 13.
  • In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or a CD-ROM; it may also be any device including one of, or any combination of, the foregoing memories.
  • In some embodiments, the executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • As an example, the executable instructions may, but do not necessarily, correspond to files in a file system; they may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files that store one or more modules, subroutines, or code sections).
  • As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
  • By preprocessing the text to be processed, the audio quality of the target audio can be improved; and the most original text to be processed can be used as input data, with the audio processing method of the embodiments of the present application outputting the final data processing result of that text, namely the target audio, which realizes end-to-end processing of the text to be processed, reduces transitional processing between system modules, and improves the overall cohesion;
  • each frame of the acoustic feature signal is divided into multiple subframes in the frequency domain and each subframe is downsampled, which reduces the overall number of sampling points that the sampling prediction network needs to process when predicting sample values; furthermore, predicting multiple sampling points at adjacent times simultaneously in one round of prediction realizes synchronous processing of multiple sampling points, thereby significantly reducing the number of cycles required for the sampling prediction network to predict the audio signal, improving the processing speed of audio synthesis and the efficiency of audio processing;
  • by downsampling each subframe in the time domain, redundant information in each subframe can be removed and the number of processing cycles required for the sampling prediction network to perform recursive prediction on each subframe can be reduced, further improving the speed and efficiency of audio processing; in addition, by preprocessing the text to be processed, the audio quality of the target audio can be improved, and the most original text to be processed can be used as input data, with the audio processing method of the embodiments of the present application outputting the final data processing result of that text, namely the target audio, realizing end-to-end processing of the text to be processed, reducing transitional processing between system modules and improving the overall cohesion.
  • It can be understood that the vocoder provided by the embodiments of the present application effectively reduces the amount of computation required to convert acoustic features into audio signals and realizes synchronous prediction of multiple sampling points, so it can output highly intelligible, natural, high-fidelity audio while ensuring a high real-time rate.


Abstract

Embodiments of the present application provide an audio processing method, apparatus, vocoder, electronic device and computer-readable storage medium, including: performing speech feature conversion on text to be processed to obtain at least one acoustic feature frame; extracting, through a frame rate network, a conditional feature from each acoustic feature frame; performing frequency-band division and time-domain downsampling on the current frame to obtain n subframes each containing a preset number of sampling points; through a sampling prediction network, in the i-th round of prediction, synchronously predicting the sample values of the current m adjacent sampling points on the n subframes to obtain m×n sub-predicted values, and thereby obtaining the n sub-predicted values corresponding to each of the preset number of sampling points; obtaining the audio prediction signal corresponding to the current frame according to the n sub-predicted values corresponding to each sampling point; and performing audio synthesis for each of the at least one acoustic feature frame to obtain the target audio. The present application can improve the speed and efficiency of audio processing.

Description

音频处理方法、装置、声码器、电子设备、计算机可读存储介质及计算机程序产品
相关申请的交叉引用
本申请基于申请号为202011612387.8、申请日为2020年12月30日、申请名称为“音频处理方法、声码器、装置、设备及存储介质”的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及音视频处理技术,尤其涉及一种音频处理方法、装置、声码器、电子设备、计算机可读存储介质及计算机程序产品。
背景技术
随着智能设备(如智能手机、智能音箱等)的快速发展,语音交互技术作为一种自然的交互方式得到越来越多的应用。作为语音交互技术中重要的一环,语音合成技术也取得了长足的进步。语音合成技术通过一定的规则或模型算法将文本转换为对应的音频内容。传统的语音合成技术主要基于拼接方法或统计参数方法。随着深度学习在语音识别领域不断取得突破,深度学习逐渐被引入到语音合成领域。得益于此,基于神经网络的声码器(Neural vocoder)取得了很大的进展。然而,目前的声码器通常需要基于音频特征信号中的多个采样时间点进行多次循环才能完成语音预测,进而完成语音合成,从而导致音频合成的处理速度较慢,降低了音频处理的效率。
发明内容
本申请实施例提供一种音频处理方法、装置、声码器、电子设备、计算机可读存储介质及计算机程序产品,能够提高音频处理的速度与效率。
本申请实施例的技术方案是这样实现的:
本申请实施例提供一种音频处理方法,所述方法由电子设备执行,所述方法包括:
对待处理文本进行语音特征转换,得到至少一帧声学特征帧;
通过帧率网络,从所述至少一帧声学特征帧的每帧声学特征帧中,提取出所述每帧声学特征帧对应的条件特征;
对所述每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到所述当前帧对应的n个子帧;其中,n为大于1的正整数;所述n个子帧的每个子帧包含预设数量个采样点;
通过采样预测网络,在第i轮预测过程中,对当前m个相邻采样点在所述n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到所述预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于所述预设数量的正整数;
根据所述每个采样点对应的n个子预测值得到所述当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到所述待处理文本对应的目标音频。
本申请实施例提供一种声码器,包括:
帧率网络,配置为从所述至少一帧声学特征帧的每帧声学特征帧中,提取出所述每帧声学特征帧对应的条件特征;
时域频域处理模块,配置为对所述每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到所述当前帧对应的n个子帧;其中,n为大于1的正整数;所述n个子帧的每个子帧包含预设数量个采样点;
采样预测网络,配置为在第i轮预测过程中,对当前m个相邻采样点在所述n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到所述预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于所述预设数量的正整数;
信号合成模块,配置为根据所述每个采样点对应的n个子预测值得到所述当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到所述待处理文本对 应的目标音频。
本申请实施例提供一种音频处理装置,包括:
文本语音转换模型,配置为待处理文本进行语音特征转换,得到至少一帧声学特征帧;
帧率网络,配置为从所述至少一帧声学特征帧的每帧声学特征帧中,提取出所述每帧声学特征帧对应的条件特征;
时域频域处理模块,配置为对所述每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到所述当前帧对应的n个子帧;其中,n为大于1的正整数;所述n个子帧的每个子帧包含预设数量个采样点;
采样预测网络,配置为在第i轮预测过程中,对当前m个相邻采样点在所述n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到所述预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于所述预设数量的正整数;
信号合成模块,配置为根据所述每个采样点对应的n个子预测值得到所述当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到所述待处理文本对应的目标音频。
本申请实施例提供一种电子设备,包括:
存储器,配置为存储可执行指令;
处理器,配置为执行所述存储器中存储的可执行指令时,实现本申请实施例提供的音频处理方法。
本申请实施例提供一种计算机可读存储介质,存储有可执行指令,配置为引起处理器执行时,实现本申请实施例提供的音频处理方法。
本申请实施例提供一种计算机程序产品,包括计算机程序或指令,所述计算机程序或指令被处理器执行时,实现本申请实施例提供的音频处理方法。
本申请实施例具有以下有益效果:
通过将每帧声学特征信号划分为频域上的多个子帧并对每个子帧进行降采样,降低了采样预测网络进行采样值预测时所需处理的整体采样点数量,进而,通过在一轮预测过程中同时对多个相邻时间的采样点的进行预测,实现了对多个采样点的同步处理,从而显著减少了采样预测网络预测音频信号时所需的循环次数,提高了音频合成的处理速度,并提高了音频处理的效率。
附图说明
图1是本申请实施例提供的目前的LPCNet声码器的一个可选的结构示意图;
图2是本申请实施例提供的音频处理系统架构的一个可选的结构示意图一;
图3是本申请实施例提供的音频处理系统在车载应用场景下的一个可选的结构示意图一;
图4是本申请实施例提供的音频处理系统架构的一个可选的结构示意图二;
图5是本申请实施例提供的音频处理系统在车载应用场景下的一个可选的结构示意图二;
图6是本申请实施例提供的电子设备的一个可选的结构示意图;
图7是本申请实施例提供的多频带多时域的声码器的一个可选的结构示意图;
图8是本申请实施例提供的音频处理方法的一个可选的流程示意图一;
图9是本申请实施例提供的音频处理方法的一个可选的流程示意图二;
图10是本申请实施例提供的音频处理方法的一个可选的流程示意图三;
图11是本申请实施例提供的音频处理方法的一个可选的流程示意图四;
图12是本申请实施例提供的帧率网络与采样预测网络的网络架构的一个可选的示意图;
图13是本申请实施例提供的音频处理方法的一个可选的流程示意图五;
图14是本申请实施例提供的电子设备应用于实际场景中的一个可选的结构示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。
在以下的描述中,所涉及的术语“第一\第二\第三”仅仅是是区别类似的对象,不代表针对对象的特定排序,可以理解地,“第一\第二\第三”在允许的情况下可以互换特定的顺序或先后次序,以使这里描述 的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
对本申请实施例进行进一步详细说明之前,对本申请实施例中涉及的名词和术语进行说明,本申请实施例中涉及的名词和术语适用于如下的解释。
1)语音合成:也被称为文字转语音(Text to Speech,TTS),其作用是将计算机自己产生的或外部输入的文字信息转变为可以听得懂的、流利的语音并朗读出来。
2)频谱:频谱(Spectrograms)是指一个时域的信号在频域下的表示方式,可以针对信号进行傅里叶变换而得,所得的结果是分别以幅度及相位为纵轴,频率为横轴的两张图,语音合成技术应用中多会省略相位的信息,只保留不同频率下对应的幅度信息。
3)基频:在声音中,基频(Fundamental frequency)是指一个复音中基音的频率,用符号FO表示。在构成一个复音的若干个音中,基音的频率最低,强度最大。基频的高低决定一个音的高低。平常所谓语音的频率,一般指的是基音的频率。
4)声码器:声码器(Vocoder)源自人声编码器(Voice Encoder)的缩写,又称语音信号分析合成系统,其作用是将声学特征转换为声音。
5)GMM:混合高斯模型(Gaussian Mixture Model)是单一高斯概率密度函数的延伸,用多个高斯概率密度函数更为精确地对变量分布进行统计建模。
6)DNN:深度神经网络(Deep Neural Network)是一种判别模型,是包含超过两个隐藏层的多层感知机(Multi-layer perceptron neural networks,MLP),除了输入节点外,每个节点都是一个带有非线性激活函数的神经元,与MLP一样,DNN可以使用反向传播算法进行训练。
7)CNN:卷积神经网络(Convolutional Neural Network)是一种前馈神经网络,其神经元可对感受野内的单元进行响应。CNN通常由多个卷积层和顶端的全连接层组成,其通过共享参数降低模型的参数量,使之在图像和语音识别方面得到广泛应用。
8)RNN:循环神经网络(Recurrent Neural Network,RNN)是一类以序列(sequence)数据为输入,在序列的演进方向进行递归(recursion)且所有节点(循环单元)按链式连接的递归神经网络(Recursive Neural Network)。
9)LSTM:长短时记忆网络(Long Short-Term Memory),是一种循环神经网络,它在算法中加入了一个判断信息有用与否的Cell。一个Cell中放置了输入门、遗忘门和输出门。信息进入LSTM后,根据规则来判断是否有用。符合算法认证的信息才会留下,不符的信息则通过遗忘门被遗忘。该网络适合于处理和预测时间序列中间隔和延迟相对较长的重要事件。
10)GRU:循环门单元(Gate Recurrent Unit),是循环神经网络的一种。和LSTM一样,也是为了解决长期记忆和反向传播中的梯度等问题而提出的。与LSTM相比,GRU内部少了一个“门控”,参数比LSTM少,在多数情况下能够达到与LSTM相当的效果并有效降低计算耗时。
11)Pitch:基音周期。通常,语音信号简单的可以分为两类。一类是具有短时周期性的浊音,人在发浊音时,气流通过声门使声带产生张驰振荡式振动,产生一股准周期脉冲气流,这一气流激励声道就产生浊音,又称有声语音,它携带着语音中的大部分能量,其周期称为基音周期(Pitch)。另一类是具有随机噪声性质的清音,在声门关闭时由口腔压迫其中的空气发出。
12)LPC:线性预测编码(Linear Predictive Coding),语音信号可以被建模为一个线性时变系统的输出,该系统的输入激励信号为(浊音期间)周期脉冲或(清音期间)随机噪声。语音信号的采样可以用过去的采样线性拟合来逼近,然后通过局部最小化实际采样和线性预测采样之间的差值平方和,即可得出一组预测系数,即LPC。
13)LPCNet:线性预测编码网络是一个数字信号处理和神经网络巧妙结合应用于语音合成中的声码器的工作,可以在普通的CPU上实时合成高质量语音。
目前,在基于神经网络的声码器中,Wavenet作为神经声码器的开山之作,为后续的该领域的工作提供了重要参考,但由于其自递归(即预测当前的采样点需要依赖前面时刻的采样点)的前向方式,导致实时性上很难满足大规模线上应用的要求。针对Wavenet存在的问题,基于流的神经声码器如Parallel Wavenet、Clarinet应运而生。这类声码器通过蒸馏的方式,使教师模型与学生模型所预测的分布(混合逻辑斯特分布、单高斯分布)尽可能相近。蒸馏学习完成后,在前向时使用可并行的学生模型来提升整体的速度。但是由于基于流的声码器整体结构仍较为复杂,存在训练流程割裂,训练稳定性不佳的问题,因此基于流的声码器仅能在成本较高的GPU上实现实时合成,对于大规模线上应用而言,成本过高。随后,有着更简单结构的自递归模型,如Wavernn、LPCNet相继被推出。在原本较为简单的结构上进一步引入了量化优化及矩阵稀疏优化,使得其能在单CPU上达到一个相对不错的实时性。但是对于大规模线上应用而言,还需要更快的声码器。
目前,LPCNet声码器主要由帧率网络(Frame Rate Network,FRN)与采样率网络(Sample Rate Network,SRN)构成。如图1所示,其中,帧率网络10通常以多维音频特征作为输入,经过多层卷积的处理提取高层语音特征作为后续采样率网络20的条件特征f;采样率网络20可以基于多维音频特征进行LPC系数计算,并基于LPC系数,结合在当前时刻之前的多个时刻上预测得到的采样点预测值S t-16...S t-1,以线性预测编码输出当前时刻的采样点所对应的当前粗测值p t。采样率网络20将上一时刻的采样点对应的预测值S t-1,上一时刻的采样点对应的预测误差e t-1、当前粗测值p t以及帧率网络10输出的条件特征f作为输入,输出当前时刻的采样点对应的预测误差e t;之后,采样率网络20再使用当前粗测值p t加上当前时刻的采样点对应的预测误差e t,得到当前时刻的预测值S t。采样率网络20针对多维音频特征中的每个采样点进行同样的处理,不断循环运行,最终完成对所有采样点的采样值预测,根据每个采样点上的预测值得到整个需要合成的目标音频。由于通常音频采样点数量较多,以采样率为16Khz为例,10ms的音频包含160个采样点。因此,为了合成10ms音频,目前的声码器中的SRN需要循环160次,整体计算量较大,从而导致音频处理的速度和效率较低。
本申请实施例提供一种音频处理方法、装置、声码器、电子设备及计算机可读存储介质,能够提高音频处理的速度和效率。下面说明本申请实施例提供的电子设备的示例性应用,本申请实施例提供的电子设备可以实施为智能机器人、智能音箱、笔记本电脑,平板电脑,台式计算机,机顶盒,移动设备(例如,移动电话,便携式音乐播放器,个人数字助理,专用消息设备,便携式游戏设备)、智能语音交互设备、智能家电、车载终端等各种类型的用户终端,也可以实施为服务器。下面,将说明电子设备实施为服务器时示例性应用。
参见图2,图2是本申请实施例提供的音频处理系统100-1的一个可选的架构示意图,为实现支撑一个智能语音应用,终端400(示例性示出了终端400-1、终端400-2与终端400-3)通过网络连接服务器200,网络可以是广域网或者局域网,又或者是二者的组合。
终端400上安装有智能语音应用的客户端410(示例性示出了客户端410-1、客户端410-2和客户端410-3),客户端410可以将需要进行智能语音合成的待处理文本发送至服务端。服务器200配置为接收到待处理文本后,对待处理文本进行语音特征转换,得到至少一帧声学特征帧;通过帧率网络,从至少一帧声学特征帧的每帧声学特征帧中,提取出每帧声学特征帧对应的条件特征;对每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到当前帧对应的n个子帧;其中,n为大于1的正整数;n个子帧的每个子帧包含预设数量个采样点;通过采样预测网络,在第i轮预测过程中,对当前m个相邻采样点在n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于预设数量的正整数;根据每个采样点对应的n个子预测值得到当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到待处理文本对应的目标音频。服务器200还可以进一步进行对目标音频的压缩等后处理操作,将处理后的目标音频通过流式或整句返回的方式返回到终端400。终端400在接收到返回的音频之后即可在客户端410中进行流畅自然的语音播放。由于在音频处理系统100-1的整个处理过程中,服务器200能够通过采样预测网络同时预测相邻时间的多个子带特征对应的预测值,预测音频时所需的循环次数较少,从而使得服务器后台语音合成服务延迟很小,客户端410可立即获得返回的音频。使得终端400的用户可以在短时间内听到待处理文本所转化的语音内容,解放双眼,交互自然便捷。
在一些实施例中,服务器200可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。终端400可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表等,但并不局限于此。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接,本申请实施例中不做限制。
在一些实施例中,如图3所示,终端400可以是车载设备400-4,示例性的,车载设备400-4可以是安装在车辆设备内部的车载电脑,还可以是安装在车辆设备外部的用于控制车辆的控制设备等,智能语音应用的客户端410可以是车载服务客户端410-4,用于显示车辆的相关行驶信息并提供对车辆上各类设备的操控以及提供其他扩展功能。当车载服务客户端410-4接收到外部发送的文本类消息,如新闻消息、路况消息,或紧急消息等包含文本信息的消息时,可以基于用户的操作指令,如用户在410-5所示消息弹出界面上通过语音、屏幕或按键等操作触发语音播报指令后,车载服务系统响应于语音播报指令,将文本消息发送至服务器200,由服务器200从文本消息中提取出待处理文本,并对待处理文本进行上述的音频处理过程,生成对应的目标音频。服务器200将目标音频发送至车载服务客户端410-4,通过车载服务客户端410-4调用车载多媒体设备对目标音频进行播放,并显示如410-6所示的音频播放界面。
下面,将说明电子设备实施为终端时示例性应用。参见图4,图4是本申请实施例提供的音频处理系统100-2的一个可选的架构示意图,为实现支撑一个垂直领域下的可定制个性化语音合成应用,如在小说朗读、新闻播报等领域的专属音色语音合成服务,终端500通过网络连接服务器300,网络可以是广域网或者局域网,又或者是二者的组合。
服务器300配置为预先根据音色定制需求,通过采集各种音色的音频,如不同性别或不同音色类型的发音人音频形成语音库,通过语音库对内置的初始语音合成模型进行训练,得到具备语音合成功能的服务器端模型,并将训练完成的服务器端模型部署在终端500上,成为终端500上的后台语音处理模型420。终端500上安装有智能语音应用411(如阅读类APP,新闻客户端等),当用户需要通过智能语音应用411朗读某个文本时,智能语音应用411可以获取用户提交的所需语音朗读的文本,将该文本作为待处理文本,发送给后台语音模型420,通过后台语音模型420,对待处理文本进行语音特征转换,得到至少一帧声学特征帧;通过帧率网络,从至少一帧声学特征帧的每帧声学特征帧中,提取出每帧声学特征帧对应的条件特征;对每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到当前帧对应的n个子帧;其中,n为大于1的正整数;n个子帧中的每个子帧包含预设数量个采样点;通过采样预测网络,在第i轮预测过程中,对当前m个相邻采样点在n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于预设数量的正整数;根据每个采样点对应的n个子预测值得到当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到待处理文本对应的目标音频,并传递给智能语音应用411的前台交互界面进行播放。个性化定制语音合成对系统的鲁棒性、泛化性及实时性等提出了更高的要求,本申请实施例提供的可模块化的端到端的音频处理系统可根据实际情况灵活做出调整,在几乎不影响合成效果的前提下,保障了不同需求下系统的高适配性。
在一些实施例中,参见图5,终端500可以为车载设备500-1,车载设备500-1与另一用户设备500-2如手机、平板电脑等以有线或无线的方式相连接。示例性的,可以通过蓝牙方式连接,也可以通过USB方式连接等等。用户设备500-2可以将自身的文本,如短消息、文档等通过连接发送至车载设备500-1上的智能语音应用411-1。示例性的,用户设备500-2在接收到通知消息时,可以将通知消息自动转发至智能语音应用411-1,或者,用户设备500-2也可以基于用户在用户设备应用上的操作指令,将本地保存的文档发送至智能语音应用411-1。智能语音应用411-1收到推送的文本时,可以基于对语音播报指令的响应,将文本内容作为待处理文本,通过后台语音模型,对待处理文本进行上述的音频处理过程,生成对应的目标音频。智能语音应用411-1进而调用相应的界面显示和车载多媒体设备对目标音频进行播放。
参见图6,图6是本申请实施例提供的电子设备600的结构示意图,图6所示的电子设备600包括:至少一个处理器610、存储器650、至少一个网络接口620和用户接口630。电子设备600中的各个组件通过总线系统640耦合在一起。可理解,总线系统640用于实现这些组件之间的连接通信。总线系统640除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图6中将各种总线都标为总线系统640。
处理器410可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
用户接口630包括使得能够呈现媒体内容的一个或多个输出装置631,包括一个或多个扬声器和/或一个或多个视觉显示屏。用户接口630还包括一个或多个输入装置632,包括有助于用户输入的用户接口部件,比如键盘、鼠标、麦克风、触屏显示屏、摄像头、其他输入按钮和控件。
存储器650可以是可移除的,不可移除的或其组合。示例性的硬件设备包括固态存储器,硬盘驱动器,光盘驱动器等。存储器650可选地包括在物理位置上远离处理器610的一个或多个存储设备。
存储器650包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器650旨在包括任意适合类型的存储器。
在一些实施例中,存储器650能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统651,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;
网络通信模块652,用于经由一个或多个(有线或无线)网络接口620到达其他计算设备,示例性的网络接口620包括:蓝牙、无线相容性认证(WiFi)、和通用串行总线(USB,Universal Serial Bus)等;
呈现模块653,用于经由一个或多个与用户接口630相关联的输出装置631(例如,显示屏、扬声器等)使得能够呈现信息(例如,用于操作外围设备和显示内容和信息的用户接口);
输入处理模块654,用于对一个或多个来自一个或多个输入装置632之一的一个或多个用户输入或互 动进行检测以及翻译所检测的输入或互动。
在一些实施例中,本申请实施例提供的装置可以采用软件方式实现,图6中示出了存储在存储器650中的音频处理装置655,其可以是程序和插件等形式的软件,包括以下软件模块:文本语音转换模型6551、帧率网络6552、时域频域处理模块6553、采样预测网络6554和信号合成模块6555,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或进一步拆分。
将在下文中说明各个模块的功能。
在另一些实施例中,本申请实施例提供的装置可以采用硬件方式实现,作为示例,本申请实施例提供的装置可以是采用硬件译码处理器形式的处理器,其被编程以执行本申请实施例提供的音频处理方法,例如,硬件译码处理器形式的处理器可以采用一个或多个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他电子元件。
本申请实施例提供一种多频带多时域的声码器,该声码器可以与文本语音转换模型相结合,将文本语音转换模型根据待处理文本所输出的至少一帧声学特征帧转换为目标音频;也可以与其他音频处理系统中的音频特征提取模块相结合,起到将音频特征提取模块输出的音频特征转换为音频信号的作用。具体的根据实际情况进行选择,本申请实施例不作限定。
如图7所示,本申请实施例提供的声码器包含时域频域处理模块51、帧率网络52、采样预测网络53和信号合成模块54。其中,帧率网络52可以对输入的声学特征信号进行高层抽象,从至少一帧声学特征帧的每帧声学特征帧中提取出该帧对应的条件特征。声码器进而可以基于每帧声学特征帧对应的条件特征,对该帧声学特征中每个采样点上的采样信号值进行预测。以声码器对至少一帧声学特征帧中的当前帧进行处理为例,对于每帧声学特征帧中的当前帧,时域频域处理模块51可以对当前帧进行频带划分与时域降采样,得到当前帧对应的n个子帧;n个子帧中的每个子帧包含预设数量个采样点。采样预测网络53用于在第i轮预测过程中,对当前m个相邻采样点在n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于预设数量的正整数;信号合成模块54,用于根据每个采样点对应的n个子预测值得到当前帧对应的音频预测信号;进而对每帧声学特征帧对应的音频预测信号进行音频合成,得到待处理文本对应的目标音频。
人发声音是通过人肺部挤压出来的气流通过声带产生震动波,通过空气传播到耳朵,因此,采样预测网络可以通过声源激励(模拟从肺部发出气流)和声道响应(vocal tract response)系统来进行音频信号采样值的预测。在一些实施例中,采样预测网络53可以如图7所示,包括线性预测编码模块53-1与采样率网络53-2。其中,线性预测编码模块53-1可以计算出m个采样点中每个采样点在n个子帧上对应的子粗测值,作为声道响应;采样率网络53-2可以根据帧率网络52提取出的条件特征,在一轮预测过程中,以m个采样点为前向预测的时间跨度,同时完成m个相邻采样点中每个采样点分别在n个子帧上对应的残差值,作为声源激励(Excitation),进而根据声道响应与声源激励模拟出相应的音频信号。
在一些实施例中,以m等于2,即采样预测网络的预测时间跨度为2个采样点为例,在第i轮预测过程中,线性预测编码模块53-1可以根据当前时刻t上的采样点t所对应的至少一个t时刻历史采样点中,每个历史采样点对应的n个子预测值,对采样点t在n个子帧上的线性采样值进行线性编码预测,得到n个t时刻子粗测值,作为采样点t的声道响应;在对采样点t对应的残差值进行预测时,由于预测时间跨度为2个采样点,因此采样率网络53-2可以将第i-1轮预测过程中采样点t-2对应的n个t-2时刻残差值与n个t-2时刻子预测值作为激励值,结合条件特征与n个t-1时刻子粗测值,对采样点t在n个子帧中分别对应的残差值进行前向预测,得到采样点t对应的n个t时刻残差值。同时,在对采样点t对应的残差值进行预测时,将第i-1轮预测过程中采样点t-1对应的n个t-1时刻残差值与n个t-1时刻子预测值作为激励值,结合条件特征,对采样点t+1在n个子帧中分别对应的残差值进行前向预测,得到采样点t+1对应的n个t+1时刻残差值。采样率网络53-2可以根据上述过程,自递归地对n个子帧中的降采样后的预设数量个采样点进行残差预测,直至得到每个采样点对应的n个残差值。
本申请实施例中,采样预测网络53可以根据n个t时刻残差值与n个t时刻子粗测值得到采样点t对应的n个t时刻子预测值,并将采样点t作为采样点t+1对应的至少一个t+1时刻历史采样点之一,根据至少一个t+1时刻历史采样点中每个t+1时刻历史采样点对应的子预测值,对采样点t+1在n个子帧上对应的线性采样值进行线性编码预测,得到n个t+1时刻子粗测值,作为采样点t的声道响应;进而根据n个t+1时刻子粗测值与n个t+1时刻残差值得到n个t+1时刻子预测值,将n个t时刻子预测值与n个t+1时刻子预测值作为2n个子预测值,从而完成第i轮预测过程。第i轮预测过程结束后,采样预测网络53当前相邻两个采样点t与采样点t+1进行更新,开始第i+1轮采样值预测过程,直至预设数量个采样点全部预测完毕,声码器可以通过信号合成模块54得到当前帧对应的音频信号的信号波形。
可以理解的是,本申请实施例提供的声码器有效地降低了将声学特征转换为音频信号所需的计算量,并且实现了多个采样点的同步预测,可以在保证高实时率的同时,输出高可理解度、高自然度、高保真度的音频。
需要说明的是,上述实施例中,将声码器的预测时间跨度设置为两个采样点,即设置m为2是在综合考虑声码器处理效率与音频合成质量的基础上的一种优选的示例性应用,实际应用时也可以根据工程的需要将m设置为其他时间跨度参数值,具体的根据实际情况进行选择,本申请实施例不作限定。当m设置为其他值时,预测过程和每轮预测过程中每个采样点对应的激励值的选取与上述m=2时类似,此处不再赘述。
下面,结合本申请实施例提供的电子设备600的示例性应用和实施,说明本申请实施例提供的音频处理方法。
参见图8,图8是本申请实施例提供的音频处理方法的一个可选的流程示意图,将结合图8示出的步骤进行说明。
S101、对待处理文本进行语音特征转换,得到至少一帧声学特征帧。
本申请实施例提供的音频处理方法可应用在智能语音应用的云服务中,进而服务于使用该云服务的用户,如银行智能客服、以及背单词软件等学习类软件,也可以应用在终端本地应用的书籍智能朗读、新闻播报等智能语音场景,还可以应用于自动驾驶场景或车载场景,如基于语音交互的车联网场景或智慧交通场景等等,本申请实施例不作限定。
本申请实施例中,电子设备可以通过预设的文本语音转换模型,对待转换文本信息进行语音特征转换,输出的至少一帧声学特征帧。
本申请实施例中,文本语音转换模型可以是由CNN、DNN网络或RNN网络构建的序列到序列(Sequence to Sequence)模型,序列到序列模型主要由编码器和解码器两部分构成;其中,编码器可以将语音数据、原始文本、视频数据等一系列具有连续关系的数据抽象为序列,并从原始文本中的字符序列,如句子中提取鲁棒序列表达,将其编码为能够映射出句子内容的固定长度的向量,从而将原始文本中的自然语言转换为能够被神经网络识别和处理的数字特征;解码器可以将编码器得到的固定长度的向量映射为对应序列的声学特征,并将多个采样点上的特征集合成一个观测单位,即一帧,从而得到至少一帧声学特征帧。
本申请实施例中,至少一帧声学特征帧可以是至少一帧音频频谱信号,可以通过频域的频谱图进行表示。每个声学特征帧中包含预设数量的特征维度,特征维度表征特征中向量的数量,而特征中的向量用于对各类特征信息,如音调、共振峰、频谱、发音域函数等进行描述。示例性的,至少一帧声学特征帧可以是梅尔刻度声谱图,也可以是线性对数幅度谱图,或者是巴克刻度声谱图等等,本申请实施例对至少一帧声学特征帧的提取方法和特征的数据形式不作限定。
在一些实施例中,每帧声学特征帧中可以包含18维BFCC特征(Bark-Frequency Cepstral Coefficients)加上2维声调(Pitch)相关特征。
由于日常生活中的声音的模拟信号的频率一般都在8kHz以下,根据抽样定理,16kHz的采样率足以使得采样出来的音频数据包含大多数声音信息。16kHz意味着1秒的时间内采样16k个信号样本,在一些实施例中,每帧声学特征帧的帧长可以是10ms,则对于采样率为16KHZ的音频信号,每帧声学特征帧中可以包含160个采样点。
S102、通过帧率网络,从至少一帧声学特征帧的每帧声学特征帧中,提取出每帧声学特征帧对应的条件特征。
本申请实施例中,电子设备可以通过帧率网络对至少一帧声学特征帧进行多层卷积的处理,提取出每帧声学特征帧的高层语音特征作为该帧声学特征帧对应的条件特征。
在一些实施例中,电子设备可以通过S101将待处理文本转换为100帧声学特征帧,再通过帧率网络对100帧声学特征帧同时进行处理,得到对应的100帧条件特征。
在一些实施例中,帧率网络可以包含依次串联的两个卷积层与两个全连接层。示例性的,两个卷积层可以是两个filter尺寸为3的卷积层(conv3x1),对于包含18维BFCC特征加2维声调特征的声学特征帧,每帧中的20维特征首先通过两个卷积层,根据该帧前两帧与该帧后两帧的声学特征帧产生5帧的感受野,并将5帧的感受野添加到残差连接,然后通过两个全连接层输出一个128维条件向量f,作为条件特征,用于辅助采样率网络进行前向残差预测。
需要说明的是,本申请实施例中,对于每个声学特征帧,帧率网络对应的条件特征只计算一次。即当采样率网络在对该声学特征帧对应的降采样后的多个采样点对应的采样值进行递归式地预测时,该帧对应的条件特征在该帧对应的递归预测过程中保持不变。
S103、对每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到当前帧对应的n个子帧;其中,n为大于1的正整数;n个子帧中的每个子帧包含预设数量个采样点。
本申请实施例中,为了降低采样预测网络的循环预测次数,电子设备可以通过对每帧声学特征帧中的当前帧进行频带划分,然后对划分后的频带中所包含的时域上的采样点进行降采样,以减少每个划分后频带中包含的采样点数量,从而得到当前帧对应的n个子帧。
在一些实施例中,频域划分过程可以通过滤波器组来实现。示例性的,当n等于4时,对于频域范围为0-8k的当前帧,电子设备可以通过包含四个带通滤波器的滤波器组如Pseudo-QMF(Pseudo Quadratue Mirror Filter Bank)滤波器组,以2k带宽为单位,从当前帧中分别划分出0-2k、2-4k、4-6k、6-8k频带对应的特征,对应得到当前帧对应的4个初始子帧。
在一些实施例中,对于当前帧中包含160个采样点的情况,电子设备将当前帧划分为4个频域上的初始子帧后,由于频域划分仅是基于频带的划分,因此每个初始子帧中包含的采样点仍然是160个。电子设备进一步通过降采样滤波器对每个初始子帧进行降采样,将每个初始子帧中的采样点减少到40个,从而得到当前帧对应的4个子帧。
本申请实施例中,电子设备也可以通过其他软件或硬件的方法对当前帧进行频带划分,具体的根据实际情况进行选择,本申请实施例不作限定。电子设备在对至少一帧声学特征帧中的每一帧进行频带划分和时域降采样时,可以将每一帧作为当前帧,以同样的处理过程进行划分和时域降采样。
S104、通过采样预测网络,在第i轮预测过程中,对当前m个相邻采样点在n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于预设数量的正整数。
本申请实施例中,电子设备在得到至少一帧声学特征帧之后,需要将至少一帧声学特征帧转化为音频信号的波形表达。因此,对于一帧声学特征帧,电子设备需要预测在每个采样点在频域上对应的线性频率刻度上的频谱幅度,作为每个采样点的采样预测值,进而通过每个采样点的采样预测值得到该帧声学特征帧对应的音频信号波形。
本申请实施例中,频域的每个子帧在时域上对应的采样点是相同的,都包含有相同时刻的预设数量个采样点,电子设备可以在一轮预测过程中,同时对频域上的n个子帧在相邻时刻的m个采样点上所分别对应采样值进行预测,得到m×n个子预测值,从而可以大大缩短预测一个声学特征帧所需的循环次数。
本申请实施例中,电子设备可以通过同样的处理过程,对时域上的预设数量个采样点中的m个相邻的采样点进行预测,例如,预设数量个采样点中包含采样点t 1、t 2、t 3、t 4….t n,当m=2时,电子设备可以在一轮预测过程中,对采样点t 1以及采样点t 2进行同步处理,在一轮预测过程中同时预测出采样点t 1在频域上的n个子帧对应的n个子预测值、以及采样点t 2在n个子帧上对应的n个子预测值,作为2n个子预测值;并在下一轮预测过程中,将采样点t 3和t 4作为当前相邻两个采样点,以相同的方式对采样点t 3和t 4进行同步处理,同时预测出采样点t 3和采样点t 4对应的2n个子预测值。电子设备通过采样预测网络,以自递归的方式完成预设数量个采样点中的全部采样点的采样值预测,得到每个采样点对应的n个子预测值。
S105、根据每个采样点对应的n个子预测值得到当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到待处理文本对应的目标音频。
本申请实施例中,每个采样点对应的n个子预测值表征了该采样点在n个频带上的音频信号预测幅度。电子设备可以针对每个采样点,对该采样点对应的n个子预测值进行频域的合并,得到该采样点的在全频带上对应的信号预测值。电子设备进而将当前帧中的每个采样点对应在预设时间序列中的次序,对每个采样点对应的信号预测值进行时域的合并,得到当前帧对应的音频预测信号。
本申请实施例中,采样预测网络对每一帧声学特征帧进行相同的处理,可以通过至少一帧声学特征帧预测出全部信号波形,从而得到目标音频。
可以理解的是,本申请实施例中,电子设备通过将每帧声学特征信号划分为频域上的多个子帧并对每个子帧进行降采样,降低了采样预测网络进行采样值预测时所需处理的整体采样点数量,进而,通过在一轮预测过程中同时对多个相邻时间的采样点的进行预测,实现了对多个采样点的同步处理,从而显著减少了采样预测网络预测音频信号时所需的循环次数,提高了音频合成的处理速度,并提高了音频处理的效率。
在本申请的一些实施例中,S103可以通过执行S1031-S1032来实现,如下:
S1031、对当前帧进行频域划分,得到n个初始子帧;
S1032、对n个初始子帧所对应的时域采样点进行降采样,得到n个子帧。
可以理解的是,通过对每个子帧进行时域降采样,可以去除每个子帧中的冗余信息,减少采样预测网络基于每个进行递归预测时所需处理的循环次数,从而进一步提高了音频处理的速度和效率。
本申请实施例中,当m等于2时,采样预测网络可以包含独立的2n个全连接层,相邻m个采样点包括:第i轮预测过程中,当前时刻t对应的采样点t与下一时刻t+1对应的采样点t+1,其中,t为大于或 等于1的正整数。如图9所示,图8中的S104可以通过S1041-S1044来实现,将结合各步骤进行说明。
S1041、在第i轮预测过程中,通过采样预测网络,基于采样点t对应的至少一个t时刻历史采样点,对采样点t在n个子帧上的线性采样值进行线性编码预测,得到n个t时刻子粗测值。
本申请实施例中,在第i轮预测过程中,电子设备首先通过采样预测网络,对n个子帧在当前时刻的采样点t对应的n个线性采样值进行线性编码预测,得到n个t时刻子粗测值。
本申请实施例中,在第i轮预测过程中,采样预测网络在对采样点t对应的n个t时刻子粗测值进行预测时,需要参考采样点t之前的至少一个历史采样点的信号预测值,通过线性组合方式求解出采样点t时刻的信号预测值。采样预测网络所需参考的历史采样点的最大数量即为预设窗口阈值。电子设备可以根据采样点t在预设时间序列中的次序,结合采样预测网络的预设窗口阈值,确定对采样点t进行线性编码预测时所对应的至少一个历史采样点。
在一些实施例中,电子设备在S1041之前,还可以通过执行S201或S202,确定出采样点t对应的至少一个t时刻历史采样点,如下:
S201、当t小于或等于预设窗口阈值时,将采样点t之前的全部采样点作为至少一个t时刻历史采样点;预设窗口阈值表征线性编码预测可处理的最大采样点的数量。
在一些实施例中,当当前帧中包含160个采样点,预设窗口阈值为16,也即采样预测网络中的线性预测模块进行一次预测,所能处理的最大队列为16个采样点对应的全部子预测值时,对于采样点15,由于采样点15所在预设时间序列中的次序还未超出预设窗口阈值,因此线性预测模块可以将从采样点15之前的全部采样点,即采样点1到采样点14范围内的14个采样点作为至少一个t时刻历史采样点。
S202、当t大于预设窗口阈值时,将从采样点t-1至采样点t-k范围内对应的采样点,作为至少一个t时刻历史采样点;其中,k为预设窗口阈值。
本申请实施例中,随着采样值预测过程的逐轮递归,线性预测模块的预测窗口在多个采样点的预设时间序列上相应的逐步滑动。在一些实施例中,当t大于16时,如线性预测模块在对采样点18进行线性编码预测时,预测窗口的终点滑动至采样点17的位置,线性预测模块将采样点17至采样点2范围内的16个采样点作为至少一个t时刻历史采样点。
本申请实施例中,电子设备可以通过线性预测模块,在采样点t对应至少一个t时刻历史采样点中,获取每个t时刻历史采样点对应的n个子预测值,作为至少一个t时刻历史子预测值;根据至少一个t时刻历史子预测值对采样点t的音频信号线性值进行线性编码预测,得到采样点t对应的n个t时刻子粗测值。
需要说明的是,本申请实施例中,对于当前帧中的首个采样点,由于没有首个采样点对应的历史采样点上的子预测值可以参考,电子设备可以结合预设线性预测参数对首个采样点,即i=1,且t=1的采样点t进行线性编码预测,得到首个采样点对应的n个t时刻子粗测值。
S1042、当i大于1时,基于第i-1轮预测过程对应的历史预测结果,结合条件特征,通过2n个全连接层,同步对采样点t与采样点t+1分别在n个子帧的每个子帧上的残差值进行前向残差预测,得到采样点t对应的n个t时刻残差值与采样点t+1对应的n个t+1时刻残差值;历史预测结果包含第i-1轮预测过程中,相邻两个采样点各自对应的n个残差值与子预测值。
本申请实施例中,当i大于1时,说明电子设备可以获取到第i轮预测过程的上一轮预测结果作为第i轮预测过程的激励,通过采样预测网络进行音频信号的非线性误差值的预测。
本申请实施例中,历史预测结果包含第i-1轮预测过程中,相邻两个采样点各自对应的n个残差值与子预测值。电子设备可以基于第i-1轮的历史预测结果,结合条件特征,通过2n个全连接层,同时对n个子帧在采样点t与采样点t+1上分别对应的残差值进行前向残差预测,得到采样点t对应的n个t时刻残差值与采样点t+1对应的n个t+1时刻残差值。
在一些实施例中,如图10所示,S1042可以通过S301-S303来实现,将结合各步骤进行说明。
S301、当i大于1时,获取采样点t-1对应的n个t-1时刻子粗测值,以及在第i-1轮预测过程中得到的n个t-1时刻残差值、n个t-2时刻残差值、n个t-1时刻子预测值与n个t-2时刻子预测值。
本申请实施例中,当i大于1时,相对于第i轮预测过程中的当前时刻t,第i-1轮预测过程所处理的采样点为采样点t-2与采样点t-1,采样预测网络第i-1轮预测过程中可以得到的历史预测结果包括:采样点t-2对应的n个t-2时刻子粗测值、n个t-2时刻残差值与n个t-2时刻子预测值;以及采样点t-1对应的n个t-1时刻粗测值、n个t-1时刻残差值与n个t-1时刻子预测值。采样预测网络从第i-1轮预测过程对应的历史预测结果中,获取的n个t-1时刻子粗测值,以及n个t-1时刻残差值、n个t-2时刻残差值、n个t-1时刻子预测值与n个t-2时刻子预测值,以在上述数据基础上进行第i轮的中采样点t和采样点t+1上的采样值预测。
S302、对n个t时刻子粗测值、n个t-1时刻子粗测值、n个t-1时刻残差值、n个t-2时刻残差值、n个t-1时刻子预测值与n个t-2时刻预测值进行特征维度过滤,得到降维特征集合。
本申请实施例中,为了降低网络运算的复杂度,采样预测网络需要将所需处理的特征数据进行降维 处理,从中去掉对预测结果影响较小的维度上的特征数据,提高网络运算效率。
在一些实施例中,采样预测网络中包含第一门控循环网络与第二门控循环网络,S302可以通过S3021-S3023来实现,将结合各步骤进行说明。
S3021、将n个t时刻子粗测值、n个t-1时刻子粗测值、n个t-1时刻残差值、n个t-2时刻残差值、n个t-1时刻子预测值与n个t-2时刻预测值进行特征维度合并,得到初始特征向量集合。
本申请实施例中,电子设备将n个t时刻子粗测值、n个t-1时刻子粗测值、n个t-1时刻残差值、n个t-2时刻残差值、n个t-1时刻子预测值与n个t-2时刻预测值从特征维度的角度进行合并,得到用于残差预测的信息特征总维度集合,作为初始特征向量。
S3022、基于条件特征,通过第一门控循环网络,对初始特征向量集合进行特征降维处理,得到中间特征向量集合。
本申请实施例中,第一门控循环网络可以对不同维度的特征向量进行权重分析,并基于权重分析的结果,保留对残差预测重要且有效的维度上的特征数据,遗忘无效维度上的特征数据,从而实现对初始特征向量集合的降维处理,得到中间特征向量集合。
在一些实施例中,门控循环网络可以是GRU网络,也可以是LSTM网络,具体的根据实际情况进行选择,本申请实施例不作限定。
S3023、基于条件特征,通过第二门控循环网络,对中间特征向量进行特征降维处理,得到降维特征集合。
本申请实施例中,电子设备基于条件特征,通过第二门控循环网络对中间特征向量进行再次降维,以去除冗余信息,减少后续预测过程的工作量。
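作为S3021-S3023两级特征降维过程的一个示意性草图(以PyTorch的GRUCell为例,输入维度与隐藏维度等取值仅为示例假设,也可按实际情况替换为LSTM),可以参考如下代码:

```python
import torch
import torch.nn as nn

class FeatureReduce(nn.Module):
    """示意性草图:用两级门控循环网络对拼接后的初始特征向量做两次特征降维,得到降维特征集合。"""
    def __init__(self, feat_dim, cond_dim, hidden_a=384, hidden_b=16):
        super().__init__()
        self.gru_a = nn.GRUCell(feat_dim + cond_dim, hidden_a)   # 第一门控循环网络
        self.gru_b = nn.GRUCell(hidden_a + cond_dim, hidden_b)   # 第二门控循环网络

    def forward(self, init_feat, cond, h_a, h_b):
        # init_feat: 特征维度合并得到的初始特征向量; cond: 当前帧的条件特征
        h_a = self.gru_a(torch.cat([init_feat, cond], dim=-1), h_a)  # 第一次特征降维
        h_b = self.gru_b(torch.cat([h_a, cond], dim=-1), h_b)        # 第二次特征降维
        return h_a, h_b        # h_b 即可视为用于后续残差预测的降维特征集合

# 用法示例(batch=1,维度均为假设值):
model = FeatureReduce(feat_dim=48, cond_dim=16)
h_a, h_b = torch.zeros(1, 384), torch.zeros(1, 16)
h_a, h_b = model(torch.randn(1, 48), torch.randn(1, 16), h_a, h_b)
```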
S303、通过2n个全连接层中的每个全连接层,结合条件特征,基于降维特征集合,同步对所述采样点t与采样点t+1分别在所述n个子帧的每个子帧上的残差值进行前向残差预测,分别得到n个t时刻残差值与n个t+1时刻残差值。
在一些实施例中,基于图10,如图11所示,S303可以通过执行S3031-S3033的过程来实现,将结合各步骤进行说明。
S3031、将降维特征集合中的n个t-2时刻降维残差值与n个t-2时刻降维预测值确定为t时刻激励值;n个t-2时刻降维残差值为n个t-2时刻残差值经过特征维度过滤后得到的;n个t-2时刻降维预测值为n个t-2时刻预测值经过特征维度过滤后得到的。
本申请实施例中,电子设备可以将第i-1轮预测过程中得到的n个t-2时刻降维残差值与n个t-2时刻降维预测值作为第i轮预测过程的声道激励,以利用采样率网络的前向预测能力,对t时刻的残差值进行预测。
S3032、将降维特征集合中的n个t-1时刻降维残差值与n个t-1时刻降维子预测值确定为t+1时刻激励值;n个t-1时刻降维残差值为n个t-1时刻残差值经过特征维度过滤后得到的;n个t-1时刻降维预测值为n个t-1时刻预测值经过特征维度过滤后得到的。
本申请实施例中,电子设备可以将第i-1轮预测过程中得到的n个t-1时刻降维残差值与n个t-1时刻降维预测值作为第i轮预测过程的声道激励,以利用采样率网络的前向预测能力,对t+1时刻的残差值进行预测。
S3033、在2n个全连接层中的n个全连接层中,基于条件特征与t时刻激励值,采用n个全连接层中的每个全连接层,同时根据n个t-1时刻降维子粗测值对采样点t进行前向残差预测,得到n个t时刻残差值;同时,在2n个全连接层中的另外n个全连接层中,基于条件特征与t+1时刻激励值,采用另外n个全连接层中的每个全连接层,同时根据n个t时刻降维子粗测值对采样点t+1进行前向残差预测,得到n个t+1时刻残差值。
本申请实施例中,2n个全连接层同时并独立进行工作,其中的n个全连接层用于对采样点t的相关预测过程进行处理。在一些实施例中,该n个全连接层中的每个全连接层对应进行采样点t在n个子帧中的每个子帧上的残差值预测处理,根据一个子帧上的t-1时刻降维子粗测值,结合条件特征与该子帧上的t时刻激励值(即该子帧在n个t-2时刻降维残差值与n个t-2时刻降维预测值中对应的t-2时刻降维残差值与t-2时刻降维预测值),预测出采样点t在该子帧上对应的残差值,进而通过n个全连接层得到采样点t在每个子帧上的残差值,也即n个t时刻残差值。
同时,与上述过程类似地,2n个全连接层中的另外n个全连接层对应进行采样点t+1在n个子帧中的每个子帧上的残差值预测处理,根据一个子帧上的t时刻降维子粗测值,结合条件特征与该子帧上的t+1时刻激励值(即该子帧在n个t-1时刻降维残差值与n个t-1时刻降维预测值中对应的t-1时刻降维残差值与t-1时刻降维预测值),预测出采样点t+1在该子帧上的残差值,进而通过另外n个全连接层得到采样点t+1在每个子帧上的残差值,也即n个t+1时刻残差值。
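结合S3031-S3033,下面给出2n个相互独立的全连接层同步预测残差值、并与子粗测值叠加得到子预测值的示意性Python草图;其中输入向量的具体组成、各层维度与层内结构均为示例假设,并非本申请实施例的具体实现:

```python
import torch
import torch.nn as nn

class DualFCResidual(nn.Module):
    """示意性草图:2n 个独立全连接层,前 n 个用于采样点 t、后 n 个用于采样点 t+1,
    每个全连接层负责一个子帧上的残差值预测;输入由降维特征、条件特征、激励值与
    相应的降维子粗测值拼接而成(拼接方式为假设)。"""
    def __init__(self, n, in_dim, hidden=256):
        super().__init__()
        self.fc_t  = nn.ModuleList(nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                                 nn.Linear(hidden, 1)) for _ in range(n))
        self.fc_t1 = nn.ModuleList(nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                                 nn.Linear(hidden, 1)) for _ in range(n))

    def forward(self, feats_t, feats_t1, coarse_t, coarse_t1):
        # feats_t[i] / feats_t1[i]: 第 i 个子帧上用于预测采样点 t / t+1 残差的输入向量
        res_t  = torch.stack([fc(x).squeeze(-1) for fc, x in zip(self.fc_t,  feats_t)],  dim=-1)
        res_t1 = torch.stack([fc(x).squeeze(-1) for fc, x in zip(self.fc_t1, feats_t1)], dim=-1)
        pred_t  = coarse_t  + res_t     # 子粗测值与残差值叠加,得到 n 个 t 时刻子预测值
        pred_t1 = coarse_t1 + res_t1    # 同理得到 n 个 t+1 时刻子预测值
        return res_t, res_t1, pred_t, pred_t1
```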
S1043、基于采样点t+1对应的至少一个t+1时刻历史采样点,对采样点t+1在n个子帧上的线性采样值进行线性编码预测,得到n个t+1时刻子粗测值。
本申请实施例中,S1043为线性预测算法的预测窗口滑动至采样点t+1时的线性预测过程,电子设备可以通过与S1041类似的过程,获得采样点t+1对应的至少一个t+1时刻历史子预测值,根据至少一个t+1时刻历史子预测值对采样点t+1对应的线性采样值进行线性编码预测,得到n个t+1时刻子粗测值。
S1044、根据n个t时刻残差值与n个t时刻子粗测值,得到采样点t对应的n个t时刻子预测值,并根据n个t+1时刻残差值与n个t+1时刻子粗测值,得到n个t+1时刻子预测值;将n个t时刻子预测值与n个t+1时刻子预测值作为2n个子预测值。
本申请实施例中,对于采样点t,电子设备可以通过信号叠加的方式,结合n个子帧中的每个子帧,将表征音频信号的线性信息的n个t时刻子粗测值,与表征非线性随机噪声信息的n个t时刻残差值的信号幅度进行叠加处理,得到采样点t对应的n个t时刻子预测值。
同样地,电子设备可以将n个t+1时刻残差值与n个t+1时刻子粗测值进行信号叠加处理,得到n个t+1时刻子预测值。电子设备进而将n个t时刻子预测值与n个t+1时刻子预测值作为2n个子预测值。
在一些实施例中,基于上述图8-图11中方法流程,电子设备中的帧率网络与采样预测网络的网络架构图可以如图12所示,其中,采样预测网络中包含m×n个对偶全连接层,用于对一轮预测过程中时域上的m个采样点分别在频域上的n个子帧中的每个子帧上对应的采样值进行预测。以n=4,m=2为例,对偶全连接层1至对偶全连接层8为采样预测网络110中包含的2×4个独立的全连接层;帧率网络111可以通过两个卷积层与两个全连接层,从当前帧中提取出条件特征f;带通降采样滤波器组112对当前帧进行频域划分和时域降采样,得到b1至b4共4个子帧;每个子帧在时域上对应包含40个采样点。
图12中,采样预测网络110可以通过多轮自递归的循环预测过程,实现对时域上的40个采样点的采样值预测。对于多轮预测过程中的第i轮预测过程,采样预测网络110可以通过LPC系数计算与t时刻LPC预测值计算,根据至少一个t时刻历史采样点对应的至少一个t时刻历史子预测值,得到当前时刻的采样点t对应的n个t时刻子粗测值;进而获取第i-1轮预测过程中对应的n个t-1时刻子粗测值、n个t-2时刻子预测值、n个t-2时刻残差值、n个t-1时刻子预测值与n个t-1时刻残差值,共同送入合并层进行特征维度合并,得到初始特征向量集合。

采样预测网络110通过第一门控循环网络和第二门控循环网络,结合条件特征f对初始特征向量集合进行降维处理,得到用于预测的降维特征集合,进而将降维特征集合分别送入8个对偶全连接层,通过其中的4个对偶全连接层,对采样点t对应的n个残差值进行预测,得到采样点t在4个子帧上对应的4个t时刻残差值;同时,通过其中另外的4个对偶全连接层,对采样点t+1对应的4个残差值进行预测,得到采样点t+1在4个子帧上对应的4个t+1时刻残差值。

采样预测网络110进而可以根据4个t时刻残差值与4个t时刻子粗测值,得到采样点t在4个子帧上对应的4个t时刻子预测值;根据4个t时刻子预测值,更新采样点t+1对应的至少一个t+1时刻历史子预测值,并通过t+1时刻LPC预测值计算,得到采样点t+1在4个子帧上对应的4个t+1时刻子粗测值;采样预测网络110根据4个t+1时刻残差值与4个t+1时刻子粗测值,得到采样点t+1在4个子帧上对应的4个t+1时刻子预测值,从而完成第i轮预测过程,更新下一轮预测过程中的采样点t与采样点t+1,以同样的方式进行循环预测,直至时域上的40个采样点全部预测完毕时,得到每个采样点对应的4个子预测值。
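上述图12中提到帧率网络111由两个卷积层与两个全连接层组成,用于从当前帧中提取条件特征f。下面给出一个示意性的PyTorch草图,其中卷积核大小、通道数与条件特征维度均为示例假设,并非本申请实施例的具体网络参数:

```python
import torch
import torch.nn as nn

class FrameRateNet(nn.Module):
    """示意性草图:帧率网络由两个卷积层与两个全连接层组成,从声学特征帧中提取条件特征 f。"""
    def __init__(self, feat_dim=80, cond_dim=128):
        super().__init__()
        self.conv1 = nn.Conv1d(feat_dim, cond_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(cond_dim, cond_dim, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(cond_dim, cond_dim)
        self.fc2 = nn.Linear(cond_dim, cond_dim)

    def forward(self, acoustic_frames):
        # acoustic_frames: [batch, feat_dim, 帧数]
        x = torch.relu(self.conv1(acoustic_frames))
        x = torch.relu(self.conv2(x))
        x = x.transpose(1, 2)                 # -> [batch, 帧数, cond_dim]
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                    # 每帧声学特征帧对应一个条件特征 f

cond = FrameRateNet()(torch.randn(1, 80, 10))  # 10 帧声学特征 -> 10 个条件特征
```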
可以看出,上述实施例中,本申请实施例的方法将采样预测网络的循环次数从目前的160次减少到了160/4(子帧数)/2(相邻采样点个数)次,即20次,从而大大减少了采样预测网络的循环处理次数,继而提高了音频处理的处理速度和处理效率。
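上述循环次数的计算可以用公式概括如下(其中T为当前帧包含的采样点数,n为频域上划分的子帧数,m为每轮预测同步处理的相邻采样点数):

N = T / (n × m) = 160 / (4 × 2) = 20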
需要说明的是,本申请实施例中,当m为其他值时,采样预测网络110中的对偶全连接层的数量需要对应设置为m×n个,并且在预测过程中,对每个采样点进行前向预测的时间跨度为m个采样点,也即对每个采样点进行残差值预测时,使用上一轮预测过程中该采样点对应的前m个采样点的历史预测结果作为激励值进行残差预测。
在本申请的一些实施例中,基于图8-图11,S1041之后,还可以执行S1045-S1047,将结合各步骤进行说明。
S1045、当i等于1时,通过2n个全连接层,结合条件特征与预设激励参数,同时对采样点t与采样点t+1进行前向残差预测,得到采样点t对应的n个t时刻残差值与采样点t+1对应的n个t+1时刻残差值。
本申请实施例中,对于首轮预测过程,即i=1时,由于没有上一轮的历史预测结果作为激励值,电子设备可以通过2n个全连接层,结合条件特征与预设激励参数,同时对采样点t与采样点t+1进行前向残差预测,得到采样点t对应的n个t时刻残差值与采样点t+1对应的n个t+1时刻残差值。
在一些实施例中,预设激励参数可以是0,也可以根据实际需要设置为其他值,具体的根据实际情况进行选择,本申请实施例不作限定。
S1046、基于采样点t+1对应的至少一个t+1时刻历史采样点,对n个子帧在采样点t+1对应的线性采样值进行线性编码预测,得到n个t+1时刻子粗测值。
本申请实施例中,S1046的过程与S1043描述一致,此处不再赘述。
S1047、根据n个t时刻残差值与n个t时刻子粗测值,得到采样点t对应的n个t时刻子预测值,并根据n个t+1时刻残差值与n个t+1时刻子粗测值,得到n个t+1时刻子预测值;将n个t时刻子预测值与n个t+1时刻子预测值作为2n个子预测值。
本申请实施例中,S1047的过程与S1044描述一致,此处不再赘述。
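作为上述S1045中首轮预测(i=1)激励初始化的一个示意性草图(预设激励参数取0仅为示例,也可根据实际需要设置为其他值),可以参考如下代码:

```python
import torch

def first_round_excitation(n, batch=1, preset=0.0):
    """示意性草图:首轮预测(i=1)没有上一轮历史预测结果可用,
    将激励值初始化为预设激励参数(这里以 0 为例)。"""
    residual_prev = torch.full((batch, n), preset)   # 充当 n 个"上一时刻"残差值
    pred_prev     = torch.full((batch, n), preset)   # 充当 n 个"上一时刻"子预测值
    return residual_prev, pred_prev

res0, pred0 = first_round_excitation(n=4)            # i=1 时作为 t 时刻/t+1 时刻激励值使用
```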
在本申请的一些实施例中,基于图8-图11,如图13所示,S105可以通过执行S1051-S1053来实现,将结合各步骤进行说明。
S1051、将每个采样点对应的n个子预测值进行频域叠加,得到每个采样点对应的信号预测值;
本申请实施例中,由于n个子预测值表征了一个采样点在每个子帧所对应频带上的信号幅值,电子设备可以通过频域划分的反过程,将每个采样点对应的n个子预测值进行频域叠加,得到每个采样点对应的信号预测值。
S1052、将每个采样点对应的信号预测值进行时域信号合成,得到当前帧对应的音频预测信号;进而得到每帧声学特征对应的音频信号。
本申请实施例中,由于预设数量个采样点是按时间序列进行排列的,因此电子设备可以在时域上对每个采样点对应的信号预测值按顺序进行信号合成,得到当前帧对应的音频预测信号。电子设备可以通过循环处理的方式,在每轮循环过程中将至少一帧声学特征帧的每帧声学特征作为当前帧进行信号合成,进而得到每帧声学特征对应的音频信号。
S1053、将每帧声学特征对应的音频信号进行信号合成,得到目标音频。
本申请实施例中,电子设备将每帧声学特征对应的音频信号进行信号合成,得到目标音频。
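作为S1051-S1053合成过程的一个简化示意(采用零插值上采样加带通综合滤波的方式近似频域叠加与时域合成,滤波器参数为示例假设,并非本申请实施例所用合成方式的具体实现),可以参考如下Python草图:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def merge_subframes(sub_preds, n=4, num_taps=65):
    """示意性草图:将每个采样点在 n 个频带上的子预测值还原为全频带信号,
    再按时间顺序得到当前帧的音频预测信号;sub_preds[i] 为第 i 个子帧的子预测值序列。"""
    length = len(sub_preds[0]) * n
    frame = np.zeros(length)
    edges = np.linspace(0.0, 1.0, n + 1)
    for i, band in enumerate(sub_preds):
        up = np.zeros(length)
        up[::n] = band                                   # 时域上采样(零插值)
        if i == 0:
            taps = firwin(num_taps, edges[1])
        elif i == n - 1:
            taps = firwin(num_taps, edges[i], pass_zero=False)
        else:
            taps = firwin(num_taps, [edges[i], edges[i + 1]], pass_zero=False)
        frame += n * lfilter(taps, [1.0], up)            # 综合滤波后各频带叠加
    return frame                                         # 当前帧对应的音频预测信号

# 各帧的音频预测信号按帧顺序拼接,即可得到整段目标音频:
# audio = np.concatenate([merge_subframes(f) for f in all_frames])
```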
在本申请的一些实施例中,基于图8-图11以及图13,S101可以通过执行S1011-S1013来实现,将结合各步骤进行说明。
S1011、获取待处理文本;
S1012、对待处理文本进行预处理,得到待转换文本信息;
本申请实施例中,文本的预处理对最终生成的目标音频的质量影响至关重要。电子设备所获取的待处理文本,通常带有空格和标点符号等字符,这些字符可以在许多语境中产生不同语义,因此可能导致待处理文本被误读,或者可能会导致一些词语被跳过或重复。因此,电子设备需要先对待处理文本进行预处理,以规整化待处理文本的信息。
在一些实施例中,电子设备对待处理文本进行预处理可以包括:在待处理文本中大写所有字符;删除所有中间标点符号;用统一结束符,如句号或问号结束每一个句子;用特殊的分隔符替换单词之间的空格等等,具体的根据实际情况进行选择,本申请实施例不作限定。
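作为上述文本预处理步骤的一个示意性草图(具体的规整化规则仅为示例,可按实际情况调整),可以参考如下代码:

```python
import re

def preprocess_text(text):
    """示意性草图:对待处理文本进行规整化预处理。"""
    text = text.upper()                               # 统一为大写字符
    text = re.sub(r"[,,;;::()()]", "", text)         # 删除中间标点符号
    text = text.strip().rstrip("。.!!??") + "。"       # 用统一结束符结束句子
    text = re.sub(r"\s+", "|", text)                   # 用特殊分隔符替换单词之间的空格
    return text

print(preprocess_text("hello, world  this is a test"))  # -> "HELLO|WORLD|THIS|IS|A|TEST。"
```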
S1013、通过文本语音转换模型,对待转换文本信息进行声学特征预测,得到至少一帧声学特征帧。
本申请实施例中,文本语音转换模型为已经训练完成的,能够将文本信息转换为声学特征的神经网络模型。电子设备使用文本语音转换模型,根据待转换文本信息中的至少一个文本序列,对应转换为至少一个声学特征帧,从而实现对待转换文本信息的声学特征预测。
可以理解的是,本申请实施例中,通过对待处理文本进行预处理,可以提高目标音频的音频质量;并且,电子设备可以将最原始的待处理文本作为输入数据,通过本申请实施例中的音频处理方法输出待处理文本最终的数据处理结果,即目标音频,实现了对待处理文本端到端的处理过程,减少了系统模块间的过渡处理,并且增加了整体契合度。
下面,将说明本申请实施例在一个实际的应用场景中的示例性应用。
参见图14,本申请实施例中提出一种电子设备的示例性应用,包括文本语音转换模型14-1与多频带多时域声码器14-2。其中文本语音转换模型14-1采用带有注意力机制的序列到序列的Tacotron结构模型,包含CBHG(1-D Convolution Bank Highway network bidirectional GRU)编码器141、注意力模块142、解码器143与CBHG平滑模块144。其中,CBHG编码器141配置为将原始文本中的句子作为序列,从句子中提取鲁棒序列表达,将其编码为能够映射出固定长度的向量。注意力模块142配置为关注鲁棒序列表达的所有词语,通过计算注意力得分,协助编码器进行更好的编码。解码器143配置为将编码器得到的固定长度的向量映射为对应序列的声学特征,并通过CBHG平滑模块144输出更平滑的声学特征,从而得到至少一帧声学特征帧。至少一帧声学特征帧进入多频带多时域声码器14-2,通过多频带多时域声码器中的帧率网络145,计算出每一帧的条件特征f;同时,每帧声学特征帧被带通降采样滤波器组146划分为4个子帧,并对每个子帧进行时域降采样后,4个子帧进入自递归的采样预测网络147,在采样预测网络147中通过LPC系数计算(Compute LPC)与LPC当前预测值计算(Compute prediction),预测出当前轮中的当前时刻t的采样点t在4个子帧上的线性预测值,得到4个t时刻子粗测值。

并且,采样预测网络147以每轮两个采样点为前向预测步长,从上一轮预测的历史预测结果中,获取采样点t-1在4个子帧上对应的4个子预测值、采样点t-1在4个子帧上的子粗测值、采样点t-1在4个子帧上的残差值、采样点t-2在4个子帧上的子预测值、以及采样点t-2在4个子帧上的残差值,结合条件特征f,共同送入采样预测网络中的合并层(concat层)进行特征维度合并,得到初始特征向量;初始特征向量进而通过90%稀疏的384维第一门控循环网络(GRU-A)与一个正常的16维第二门控循环网络(GRU-B)进行特征降维,得到降维特征集合;采样预测网络147将降维特征集合送入8个256维的对偶全连接(双FC)层,通过其中4个256维的双FC层,结合条件特征f,基于t时刻激励值(即降维后的4个t-2时刻残差值与4个t-2时刻子预测值)以及降维后的4个t-1时刻子粗测值,预测出采样点t在4个子帧上的子残差值;同时,通过另外4个256维的双FC层,基于t+1时刻激励值(即降维后的4个t-1时刻残差值与4个t-1时刻子预测值)以及降维后的4个t时刻子粗测值,预测出采样点t+1在4个子帧上的子残差值。采样预测网络147可以通过叠加4个t时刻子粗测值与采样点t在4个子帧上的子残差值,得到采样点t在4个子帧上的子预测值;这样,采样预测网络147可以根据采样点t在4个子帧上的子预测值,采用预测窗口滑动的方式预测出采样点t+1在4个子帧上对应的子粗测值;采样预测网络147再通过叠加采样点t+1在4个子帧上的子粗测值与子残差值,得到采样点t+1对应的4个子预测值。

采样预测网络147将采样点t与采样点t+1各自对应的4个子残差值以及4个子预测值,作为下一轮即第i+1轮预测过程的激励值,并更新下一轮预测过程对应的当前相邻两个采样点进行循环处理,直至得到该帧声学特征帧在每个采样点上的4个子预测值。多频带多时域声码器14-2通过音频合成模块148,将每个采样点上的4个子预测值进行频域合并,得到每个采样点上的音频信号,并通过音频合成模块148,对每个采样点上的音频信号进行时域合并,得到该帧对应的音频信号。音频合成模块148将至少一帧声学特征帧中的每帧对应的音频信号进行合并,得到至少一帧声学特征帧对应的一段音频,也即最开始输入电子设备的原始文本对应的目标音频。
可以理解的是,本申请实施例所提供的示例性的电子设备的结构中,虽然增加了7个对偶全连接层,GRU-A层的输入矩阵将变大,但可以通过查表操作使得此输入开销的影响忽略不计,并且,相对于传统声码器,多频带多时域策略将采样预测网络自递归所需的周期数减少了8倍。因此,在没有其他计算优化的情况下,声码器的速度提高了2.75倍。并且,经过招募实验人员进行主观质量打分后,本申请的电子设备所合成的目标音频在主观质量评分上仅降低了3%,从而实现了在基本不影响音频处理质量的基础上提高了音频处理的速度和效率。
下面继续说明本申请实施例提供的音频处理装置655的实施为软件模块的示例性结构,在一些实施例中,如图6所示,存储在存储器650中的音频处理装置655中的软件模块可以包括:
文本语音转换模型6551,配置为对待处理文本进行语音特征转换,得到至少一帧声学特征帧;
帧率网络6552,配置为通过帧率网络,从所述至少一帧声学特征帧的每帧声学特征帧中,提取出所述每帧声学特征帧对应的条件特征;
时域频域处理模块6553,配置为对所述每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到所述当前帧对应的n个子帧;其中,n为大于1的正整数;所述n个子帧的每个子帧包含预设数量个采样点;
采样预测网络6554,配置为在第i轮预测过程中,对当前m个相邻采样点在所述n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到所述预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于所述预设数量的正整数;
信号合成模块6555,配置为根据所述每个采样点对应的n个子预测值得到所述当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到所述待处理文本对应的目标音频。
在一些实施例中,当m等于2时,所述采样预测网络中包含独立的2n个全连接层,所述相邻两个采样点包括:所述第i轮预测过程中,当前时刻t对应的采样点t与下一时刻t+1对应的采样点t+1,其中,t为大于或等于1的正整数;
所述采样预测网络6554,还配置为在第i轮预测过程中,通过采样预测网络,基于所述采样点t对应的至少一个t时刻历史采样点,对所述采样点t在所述n个子帧上的线性采样值进行线性编码预测,得到n个t时刻子粗测值;当i大于1时,基于第i-1轮预测过程对应的历史预测结果,结合所述条件特征,通过2n个全连接层,同步对所述采样点t与采样点t+1分别在所述n个子帧的每个子帧上的残差值进行前向残差预测,得到所述采样点t对应的n个t时刻残差值与所述采样点t+1对应的n个t+1时刻残差值;所述历史预测结果包含第i-1轮预测过程中,相邻两个采样点各自对应的n个残差值与子预测值;基于所述采样点t+1对应的至少一个t+1时刻历史采样点,对所述采样点t+1在所述n个子帧上的线性采样值进行线性 编码预测,得到n个t+1时刻子粗测值;根据所述n个t时刻残差值与所述n个t时刻子粗测值,得到所述采样点t对应的n个t时刻子预测值,并根据所述n个t+1时刻残差值与所述n个t+1时刻子粗测值,得到n个t+1时刻子预测值;将所述n个t时刻子预测值与所述n个t+1时刻子预测值作为2n个子预测值。
在一些实施例中,所述采样预测网络6554,还配置为获取采样点t-1对应的n个t-1时刻子粗测值,以及在所述第i-1轮预测过程中得到的n个t-1时刻残差值、n个t-2时刻残差值、n个t-1时刻子预测值与n个t-2时刻子预测值;对所述n个t时刻子粗测值、所述n个t-1时刻子粗测值、所述n个t-1时刻残差值、所述n个t-2时刻残差值、所述n个t-1时刻子预测值与所述n个t-2时刻预测值进行特征维度过滤,得到降维特征集合;通过所述2n个全连接层中的每个全连接层,结合所述条件特征,基于所述降维特征集合,同步对所述采样点t与采样点t+1分别在所述n个子帧的每个子帧上的残差值进行前向残差预测,分别得到所述n个t时刻残差值与所述n个t+1时刻残差值。
在一些实施例中,所述采样预测网络6554,还配置为将所述降维特征集合中的n个t-2时刻降维残差值与n个t-2时刻降维预测值确定为t时刻激励值;所述n个t-2时刻降维残差值为所述n个t-2时刻残差值经过特征维度过滤后得到的;所述n个t-2时刻降维预测值为所述n个t-2时刻预测值经过特征维度过滤后得到的;将所述降维特征集合中的n个t-1时刻降维残差值与所述n个t-1时刻降维子预测值确定为t+1时刻激励值;所述n个t-1时刻降维残差值为所述n个t-1时刻残差值经过特征维度过滤后得到的;所述n个t-1时刻降维预测值为所述n个t-1时刻预测值经过特征维度过滤后得到的;在所述2n个全连接层中的n个全连接层中,基于所述条件特征与所述t时刻激励值,采用所述n个全连接层中的每个全连接层,同步根据所述n个t-1时刻降维子粗测值对所述采样点t进行前向残差预测,得到所述n个t时刻残差值;并且,在所述2n个全连接层中的另外n个全连接层中,基于所述条件特征与所述t+1时刻激励值,采用所述另外n个全连接层中的每个全连接层,同步根据所述n个t时刻降维子粗测值对所述采样点t+1进行前向残差预测,得到所述n个t+1时刻残差值。
在一些实施例中,所述采样预测网络中包含第一门控循环网络与第二门控循环网络,所述采样预测网络6554,还配置为将所述n个t时刻子粗测值、所述n个t-1时刻子粗测值、所述n个t-1时刻残差值、所述n个t-2时刻残差值、所述n个t-1时刻子预测值与所述n个t-2时刻预测值进行特征维度合并,得到初始特征向量集合;基于所述条件特征,通过所述第一门控循环网络,对所述初始特征向量集合进行特征降维处理,得到中间特征向量集合;基于所述条件特征,通过所述第二门控循环网络,对所述中间特征向量进行特征降维处理,得到所述降维特征集合。
在一些实施例中,所述时域频域处理模块6553,还配置为对所述当前帧进行频域划分,得到n个初始子帧;对所述n个初始子帧所对应的时域采样点进行降采样,得到所述n个子帧。
在一些实施例中,所述采样预测网络6554,还配置为在第i轮预测过程中,通过采样预测网络,基于所述采样点t对应的至少一个t时刻历史采样点,对所述采样点t在所述n个子帧上的线性采样值进行线性编码预测,得到n个t时刻子粗测值之前,当t小于或等于预设窗口阈值时,将所述采样点t之前的全部采样点作为所述至少一个t时刻历史采样点;所述预设窗口阈值表征线性编码预测可处理的最大采样点的数量;或者,当t大于所述预设窗口阈值时,将从所述采样点t-1至采样点t-k范围内对应的采样点,作为所述至少一个t时刻历史采样点;其中,k为预设窗口阈值。
在一些实施例中,所述采样预测网络6554,还配置为所述在第i轮预测过程中,通过采样预测网络,基于所述采样点t对应的至少一个t时刻历史采样点,对所述采样点t在所述n个子帧上的线性采样值进行线性编码预测,得到n个t时刻子粗测值之后,当i等于1时,通过所述2n个全连接层,结合所述条件特征与预设激励参数,同步对所述采样点t与所述采样点t+1分别在所述n个子帧上的残差值进行前向残差预测,得到所述采样点t对应的n个t时刻残差值与所述采样点t+1对应的n个t+1时刻残差值;基于所述采样点t+1对应的至少一个t+1时刻历史采样点,对所述采样点t+1在所述n个子帧上的线性采样值进行线性编码预测,得到n个t+1时刻子粗测值;根据所述n个t时刻残差值与所述n个t时刻子粗测值,得到所述采样点t对应的n个t时刻子预测值,并根据所述n个t+1时刻残差值与所述n个t+1时刻子粗测值,得到n个t+1时刻子预测值;将所述n个t时刻子预测值与所述n个t+1时刻子预测值作为所述2n个子预测值。
在一些实施例中,所述信号合成模块6555,还配置为将所述每个采样点对应的n个子预测值进行频域叠加,得到所述每个采样点对应的信号预测值;将所述每个采样点对应的信号预测值进行时域信号合成,得到所述当前帧对应的音频预测信号;进而得到所述每帧声学特征对应的音频信号;将所述每帧声学特征对应的音频信号进行信号合成,得到所述目标音频。
在一些实施例中,所述文本语音转换模型6551,还配置为获取待处理文本;对所述待处理文本进行预处理,得到待转换文本信息;通过文本语音转换模型,对所述待转换文本信息进行声学特征预测,得到所述至少一帧声学特征帧。
需要说明的是,以上装置实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本申请装置实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行本申请实施例上述的音频处理方法。
本申请实施例提供一种存储有可执行指令的存储介质,即计算机可读存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的方法,例如,如图8-图11,以及图13中示出的方法。
在一些实施例中,计算机可读存储介质可以是FRAM、ROM、PROM、EPROM、EEPROM、闪存、磁表面存储器、光盘、或CD-ROM等存储器;也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例,可执行指令可以但不一定对应于文件系统中的文件,可以被存储在保存其它程序或数据的文件的一部分中,例如,存储在超文本标记语言(HTML,Hyper Text Markup Language)文档中的一个或多个脚本中,存储在专用于所讨论的程序的单个文件中,或者,存储在多个协同文件(例如,存储一个或多个模块、子程序或代码部分的文件)中。
作为示例,可执行指令可被部署为在一个计算设备上执行,或者在位于一个地点的多个计算设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算设备上执行。
综上所述,通过本申请实施例对待处理文本进行预处理,可以提高目标音频的音频质量;并且,可以将最原始的待处理文本作为输入数据,通过本申请实施例中的音频处理方法输出待处理文本最终的数据处理结果,即目标音频,实现了对待处理文本端到端的处理过程,减少了系统模块间的过渡处理,并且增加了整体契合度;并且本申请实施例通过将每帧声学特征信号划分为频域上的多个子帧并对每个子帧进行降采样,降低了采样预测网络进行采样值预测时所需处理的整体采样点数量,进而,通过在一轮预测过程中同时对多个相邻时间的采样点进行预测,实现了对多个采样点的同步处理,从而显著减少了采样预测网络预测音频信号时所需的循环次数,提高了音频合成的处理速度,并提高了音频处理的效率。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。
工业实用性
在本申请实施例中,通过将每帧声学特征信号划分为频域上的多个子帧并对每个子帧进行降采样,降低了采样预测网络进行采样值预测时所需处理的整体采样点数量,进而,通过在一轮预测过程中同时对多个相邻时间的采样点进行预测,实现了对多个采样点的同步处理,从而显著减少了采样预测网络预测音频信号时所需的循环次数,提高了音频合成的处理速度,并提高了音频处理的效率。进一步地,通过对每个子帧进行时域降采样,可以去除每个子帧中的冗余信息,减少采样预测网络基于每个子帧进行递归预测时所需处理的循环次数,从而进一步提高了音频处理的速度和效率;进一步地,通过对待处理文本进行预处理,可以提高目标音频的音频质量;并且,可以将最原始的待处理文本作为输入数据,通过本申请实施例中的音频处理方法输出待处理文本最终的数据处理结果,即目标音频,实现了对待处理文本端到端的处理过程,减少了系统模块间的过渡处理,并且增加了整体契合度。并且,本申请实施例提供的声码器有效地降低了将声学特征转换为音频信号所需的计算量,并且实现了多个采样点的同步预测,可以在保证高实时率的同时,输出高可理解度、高自然度、高保真度的音频。

Claims (15)

  1. 一种音频处理方法,所述方法由电子设备执行,所述方法包括:
    对待处理文本进行语音特征转换,得到至少一帧声学特征帧;
    通过帧率网络,从所述至少一帧声学特征帧的每帧声学特征帧中,提取出所述每帧声学特征帧对应的条件特征;
    对所述每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到所述当前帧对应的n个子帧;其中,n为大于1的正整数;所述n个子帧的每个子帧包含预设数量个采样点;
    通过采样预测网络,在第i轮预测过程中,对当前m个相邻采样点在所述n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到所述预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于所述预设数量的正整数;
    根据所述每个采样点对应的n个子预测值得到所述当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到所述待处理文本对应的目标音频。
  2. 根据权利要求1所述的方法,其中,当m等于2时,所述采样预测网络中包含独立的2n个全连接层,所述相邻两个采样点包括:所述第i轮预测过程中,当前时刻t对应的采样点t与下一时刻t+1对应的采样点t+1,其中,t为大于或等于1的正整数;
    所述对当前m个相邻采样点在所述n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,包括:
    在第i轮预测过程中,通过采样预测网络,基于所述采样点t对应的至少一个t时刻历史采样点,对所述采样点t在所述n个子帧上的线性采样值进行线性编码预测,得到n个t时刻子粗测值;
    当i大于1时,基于第i-1轮预测过程对应的历史预测结果,结合所述条件特征,通过2n个全连接层,同步对所述采样点t与采样点t+1分别在所述n个子帧的每个子帧上的残差值进行前向残差预测,得到所述采样点t对应的n个t时刻残差值与所述采样点t+1对应的n个t+1时刻残差值;所述历史预测结果包含第i-1轮预测过程中,相邻两个采样点各自对应的n个残差值与子预测值;
    基于所述采样点t+1对应的至少一个t+1时刻历史采样点,对所述采样点t+1在所述n个子帧上的线性采样值进行线性编码预测,得到n个t+1时刻子粗测值;
    根据所述n个t时刻残差值与所述n个t时刻子粗测值,得到所述采样点t对应的n个t时刻子预测值,并根据所述n个t+1时刻残差值与所述n个t+1时刻子粗测值,得到n个t+1时刻子预测值;将所述n个t时刻子预测值与所述n个t+1时刻子预测值作为2n个子预测值。
  3. 根据权利要求2所述的方法,其中,所述基于第i-1轮预测过程对应的历史预测结果,结合所述条件特征,通过2n个全连接层,同步对所述采样点t与采样点t+1分别在所述n个子帧的每个子帧上的残差值进行前向残差预测,得到所述采样点t对应的n个t时刻残差值与所述采样点t+1对应的n个t+1时刻残差值,包括:
    获取采样点t-1对应的n个t-1时刻子粗测值,以及在所述第i-1轮预测过程中得到的n个t-1时刻残差值、n个t-2时刻残差值、n个t-1时刻子预测值与n个t-2时刻子预测值;
    对所述n个t时刻子粗测值、所述n个t-1时刻子粗测值、所述n个t-1时刻残差值、所述n个t-2时刻残差值、所述n个t-1时刻子预测值与所述n个t-2时刻预测值进行特征维度过滤,得到降维特征集合;
    通过所述2n个全连接层中的每个全连接层,结合所述条件特征,基于所述降维特征集合,同步对所述采样点t与采样点t+1分别在所述n个子帧的每个子帧上的残差值进行前向残差预测,分别得到所述n个t时刻残差值与所述n个t+1时刻残差值。
  4. 根据权利要求3所述的方法,其中,所述通过所述2n个全连接层中的每个全连接层,结合所述条件特征,基于所述降维特征集合,同步对所述采样点t与采样点t+1分别在所述n个子帧的每个子帧上的残差值进行前向残差预测,分别得到所述n个t时刻残差值与所述n个t+1时刻残差值,包括:
    将所述降维特征集合中的n个t-2时刻降维残差值与n个t-2时刻降维预测值确定为t时刻激励值;所述n个t-2时刻降维残差值为所述n个t-2时刻残差值经过特征维度过滤后得到的;所述n个t-2时刻降维预测值为所述n个t-2时刻预测值经过特征维度过滤后得到的;
    将所述降维特征集合中的n个t-1时刻降维残差值与所述n个t-1时刻降维子预测值确定为t+1时刻激励值;所述n个t-1时刻降维残差值为所述n个t-1时刻残差值经过特征维度过滤后得到的;所述n个t-1时刻降维预测值为所述n个t-1时刻预测值经过特征维度过滤后得到的;
    在所述2n个全连接层中的n个全连接层中,基于所述条件特征与所述t时刻激励值,采用所述n个全连接层中的每个全连接层,同步根据所述n个t-1时刻降维子粗测值对所述采样点t进行前向残差预测,得到所述n个t时刻残差值;
    并且,在所述2n个全连接层中的另外n个全连接层中,基于所述条件特征与所述t+1时刻激励值,采用所述另外n个全连接层中的每个全连接层,同步根据所述n个t时刻降维子粗测值对所述采样点t+1进行前向残差预测,得到所述n个t+1时刻残差值。
  5. 根据权利要求3或4所述的方法,其中,所述采样预测网络中包含第一门控循环网络与第二门控循环网络,所述对所述n个t时刻子粗测值、所述n个t-1时刻子粗测值、所述n个t-1时刻残差值、所述n个t-2时刻残差值、所述n个t-1时刻子预测值与所述n个t-2时刻预测值进行特征维度过滤,得到降维特征集合,包括:
    将所述n个t时刻子粗测值、所述n个t-1时刻子粗测值、所述n个t-1时刻残差值、所述n个t-2时刻残差值、所述n个t-1时刻子预测值与所述n个t-2时刻预测值进行特征维度合并,得到初始特征向量集合;
    基于所述条件特征,通过所述第一门控循环网络,对所述初始特征向量集合进行特征降维处理,得到中间特征向量集合;
    基于所述条件特征,通过所述第二门控循环网络,对所述中间特征向量进行特征降维处理,得到所述降维特征集合。
  6. 根据权利要求1-5任一项所述的方法,其中,所述对所述每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到所述当前帧对应的n个子帧,包括:
    对所述当前帧进行频域划分,得到n个初始子帧;
    对所述n个初始子帧所对应的时域采样点进行降采样,得到所述n个子帧。
  7. 根据权利要求2-6任一项所述的方法,其中,所述在第i轮预测过程中,通过采样预测网络,基于所述采样点t对应的至少一个t时刻历史采样点,对所述采样点t在所述n个子帧上的线性采样值进行线性编码预测,得到n个t时刻子粗测值之前,所述方法还包括:
    当t小于或等于预设窗口阈值时,将所述采样点t之前的全部采样点作为所述至少一个t时刻历史采样点;所述预设窗口阈值表征线性编码预测可处理的最大采样点的数量;或者,
    当t大于所述预设窗口阈值时,将从所述采样点t-1至采样点t-k范围内对应的采样点,作为所述至少一个t时刻历史采样点;其中,k为预设窗口阈值。
  8. 根据权利要求2-7任一项所述的方法,其中,所述在第i轮预测过程中,通过采样预测网络,基于所述采样点t对应的至少一个t时刻历史采样点,对所述采样点t在所述n个子帧上的线性采样值进行线性编码预测,得到n个t时刻子粗测值之后,所述方法还包括:
    当i等于1时,通过2n个全连接层,结合所述条件特征与预设激励参数,同步对所述采样点t与所述采样点t+1分别在所述n个子帧上的残差值进行前向残差预测,得到所述采样点t对应的n个t时刻残差值与所述采样点t+1对应的n个t+1时刻残差值;
    基于所述采样点t+1对应的至少一个t+1时刻历史采样点,对所述采样点t+1在所述n个子帧上的线性采样值进行线性编码预测,得到n个t+1时刻子粗测值;
    根据所述n个t时刻残差值与所述n个t时刻子粗测值,得到所述采样点t对应的n个t时刻子预测值,并根据所述n个t+1时刻残差值与所述n个t+1时刻子粗测值,得到n个t+1时刻子预测值;将所述n个t时刻子预测值与所述n个t+1时刻子预测值作为所述2n个子预测值。
  9. 根据权利要求1-7任一项所述的方法,其中,所述根据所述每个采样点对应的n个子预测值得到所述当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到所述待处理文本对应的目标音频,包括:
    将所述每个采样点对应的n个子预测值进行频域叠加,得到所述每个采样点对应的信号预测值;
    将所述每个采样点对应的信号预测值进行时域信号合成,得到所述当前帧对应的音频预测信号;进而得到所述每帧声学特征对应的音频信号;
    将所述每帧声学特征对应的音频信号进行信号合成,得到所述目标音频。
  10. 根据权利要求1所述的方法,其中,所述对待处理文本进行语音特征转换,得到至少一帧声学特征帧,包括:
    获取待处理文本;
    对所述待处理文本进行预处理,得到待转换文本信息;
    通过文本语音转换模型,对所述待转换文本信息进行声学特征预测,得到所述至少一帧声学特征帧。
  11. 一种声码器,包括:
    帧率网络,配置为从至少一帧声学特征帧的每帧声学特征帧中,提取出所述每帧声学特征帧对应的条件特征;
    时域频域处理模块,配置为对所述每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到所述当前帧对应的n个子帧;其中,n为大于1的正整数;所述n个子帧的每个子帧包含预设数量个采样点;
    采样预测网络,配置为在第i轮预测过程中,对当前m个相邻采样点在所述n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到所述预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于所述预设数量的正整数;
    信号合成模块,配置为根据所述每个采样点对应的n个子预测值得到所述当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到目标音频。
  12. 一种音频处理装置,包括:
    文本语音转换模型,配置为对待处理文本进行语音特征转换,得到至少一帧声学特征帧;
    帧率网络,配置为从所述至少一帧声学特征帧的每帧声学特征帧中,提取出所述每帧声学特征帧对应的条件特征;
    时域频域处理模块,配置为对所述每帧声学特征帧中的当前帧进行频带划分与时域降采样,得到所述当前帧对应的n个子帧;其中,n为大于1的正整数;所述n个子帧的每个子帧包含预设数量个采样点;
    采样预测网络,配置为在第i轮预测过程中,对当前m个相邻采样点在所述n个子帧上对应的采样值进行同步预测,得到m×n个子预测值,进而得到所述预设数量个采样点中每个采样点对应的n个子预测值;其中,i为大于或等于1的正整数;m为大于或等于2且小于或等于所述预设数量的正整数;
    信号合成模块,配置为根据所述每个采样点对应的n个子预测值得到所述当前帧对应的音频预测信号;进而对至少一帧声学特征帧的每帧声学特征帧对应的音频预测信号进行音频合成,得到所述待处理文本对应的目标音频。
  13. 一种电子设备,包括:
    存储器,配置为存储可执行指令;
    处理器,配置为执行所述存储器中存储的可执行指令时,实现权利要求1至10任一项所述的方法。
  14. 一种计算机可读存储介质,存储有可执行指令,配置为被处理器执行时,实现权利要求1至10任一项所述的方法。
  15. 一种计算机程序产品,包括计算机程序或指令,所述计算机程序或指令被处理器执行时,实现权利要求1至10任一项所述的方法。
PCT/CN2021/132024 2020-12-30 2021-11-22 音频处理方法、装置、声码器、电子设备、计算机可读存储介质及计算机程序产品 WO2022142850A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21913592.8A EP4210045B1 (en) 2020-12-30 2021-11-22 Audio processing method and apparatus, vocoder, electronic device, computer readable storage medium, and computer program product
JP2023518015A JP2023542012A (ja) 2020-12-30 2021-11-22 オーディオ処理方法、装置、ボコーダ、電子機器、コンピュータープログラム
US17/965,130 US20230035504A1 (en) 2020-12-30 2022-10-13 Audio processing method and apparatus, vocoder, electronic device, computer-readable storage medium, and computer program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011612387.8A CN113539231B (zh) 2020-12-30 2020-12-30 音频处理方法、声码器、装置、设备及存储介质
CN202011612387.8 2020-12-30

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/965,130 Continuation US20230035504A1 (en) 2020-12-30 2022-10-13 Audio processing method and apparatus, vocoder, electronic device, computer-readable storage medium, and computer program product

Publications (1)

Publication Number Publication Date
WO2022142850A1 true WO2022142850A1 (zh) 2022-07-07

Family

ID=78094317

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/132024 WO2022142850A1 (zh) 2020-12-30 2021-11-22 音频处理方法、装置、声码器、电子设备、计算机可读存储介质及计算机程序产品

Country Status (5)

Country Link
US (1) US20230035504A1 (zh)
EP (1) EP4210045B1 (zh)
JP (1) JP2023542012A (zh)
CN (1) CN113539231B (zh)
WO (1) WO2022142850A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115985330A (zh) * 2022-12-29 2023-04-18 南京硅基智能科技有限公司 一种音频编解码的系统和方法

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539231B (zh) * 2020-12-30 2024-06-18 腾讯科技(深圳)有限公司 音频处理方法、声码器、装置、设备及存储介质
CN114242034A (zh) * 2021-12-28 2022-03-25 深圳市优必选科技股份有限公司 一种语音合成方法、装置、终端设备及存储介质
CN115578995B (zh) * 2022-12-07 2023-03-24 北京邮电大学 面向语音对话场景的语音合成方法、系统及存储介质
CN118571233A (zh) * 2023-02-28 2024-08-30 华为技术有限公司 音频信号的处理方法及相关装置
CN116712056B (zh) * 2023-08-07 2023-11-03 合肥工业大学 心电图数据的特征图像生成与识别方法、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4790015A (en) * 1982-04-30 1988-12-06 International Business Machines Corporation Multirate digital transmission method and device for implementing said method
CN101221763A (zh) * 2007-01-09 2008-07-16 上海杰得微电子有限公司 针对子带编码音频的三维声场合成方法
CN102623016A (zh) * 2012-03-26 2012-08-01 华为技术有限公司 宽带语音处理方法及装置
CN110473516A (zh) * 2019-09-19 2019-11-19 百度在线网络技术(北京)有限公司 语音合成方法、装置以及电子设备
CN111583903A (zh) * 2020-04-28 2020-08-25 北京字节跳动网络技术有限公司 语音合成方法、声码器训练方法、装置、介质及电子设备
CN113539231A (zh) * 2020-12-30 2021-10-22 腾讯科技(深圳)有限公司 音频处理方法、声码器、装置、设备及存储介质

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DK2242045T3 (da) * 2009-04-16 2012-09-24 Univ Mons Talesyntese og kodningsfremgangsmåder
US9607610B2 (en) * 2014-07-03 2017-03-28 Google Inc. Devices and methods for noise modulation in a universal vocoder synthesizer
CN108305612B (zh) * 2017-11-21 2020-07-31 腾讯科技(深圳)有限公司 文本处理、模型训练方法、装置、存储介质和计算机设备
CN110930975B (zh) * 2018-08-31 2023-08-04 百度在线网络技术(北京)有限公司 用于输出信息的方法和装置
CN110136690B (zh) * 2019-05-22 2023-07-14 平安科技(深圳)有限公司 语音合成方法、装置及计算机可读存储介质
CN110223705B (zh) * 2019-06-12 2023-09-15 腾讯科技(深圳)有限公司 语音转换方法、装置、设备及可读存储介质
CN111179961B (zh) * 2020-01-02 2022-10-25 腾讯科技(深圳)有限公司 音频信号处理方法、装置、电子设备及存储介质
CN111402908A (zh) * 2020-03-30 2020-07-10 Oppo广东移动通信有限公司 语音处理方法、装置、电子设备和存储介质
CN111968618B (zh) * 2020-08-27 2023-11-14 腾讯科技(深圳)有限公司 语音合成方法、装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4790015A (en) * 1982-04-30 1988-12-06 International Business Machines Corporation Multirate digital transmission method and device for implementing said method
CN101221763A (zh) * 2007-01-09 2008-07-16 上海杰得微电子有限公司 针对子带编码音频的三维声场合成方法
CN102623016A (zh) * 2012-03-26 2012-08-01 华为技术有限公司 宽带语音处理方法及装置
CN110473516A (zh) * 2019-09-19 2019-11-19 百度在线网络技术(北京)有限公司 语音合成方法、装置以及电子设备
CN111583903A (zh) * 2020-04-28 2020-08-25 北京字节跳动网络技术有限公司 语音合成方法、声码器训练方法、装置、介质及电子设备
CN113539231A (zh) * 2020-12-30 2021-10-22 腾讯科技(深圳)有限公司 音频处理方法、声码器、装置、设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4210045A4 *

Also Published As

Publication number Publication date
EP4210045A4 (en) 2024-03-13
EP4210045A1 (en) 2023-07-12
CN113539231A (zh) 2021-10-22
EP4210045B1 (en) 2024-08-07
US20230035504A1 (en) 2023-02-02
JP2023542012A (ja) 2023-10-04
EP4210045C0 (en) 2024-08-07
CN113539231B (zh) 2024-06-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913592

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023518015

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2021913592

Country of ref document: EP

Effective date: 20230406

NENP Non-entry into the national phase

Ref country code: DE