WO2021179717A1 - Speech recognition front-end processing method and apparatus, and terminal device - Google Patents

Speech recognition front-end processing method and apparatus, and terminal device

Info

Publication number
WO2021179717A1
Authority
WO
WIPO (PCT)
Prior art keywords: voice, feature parameter, distribution, speech, feature
Application number
PCT/CN2020/135511
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
贾雪丽
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021179717A1 publication Critical patent/WO2021179717A1/en


Classifications

    • G PHYSICS > G10 MUSICAL INSTRUMENTS; ACOUSTICS > G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the extracted parameters being spectral information of each sub-band

Definitions

  • This application belongs to the technical field of speech recognition, and in particular relates to a front-end processing method, device and terminal equipment for speech recognition.
  • Automatic Speech Recognition (ASR) converts the vocabulary content of human speech into computer-readable input; it differs from speaker recognition and speaker verification. With the development and application of deep learning technology, automatic speech recognition has improved significantly and is widely used in many fields of daily life.
  • However, the inventor realizes that when a speech signal contains a small amount of noise or undergoes subtle changes, such as the natural disturbances in human speech caused by psychological or physiological factors (including expressive speech signals of different emotions such as laughter, excitement, and frustration, or speech signals with incidental squeaking and breathing sounds produced by different voice qualities), the performance of automatic speech recognition is affected and degraded.
  • In view of this, the embodiments of the present application provide a front-end processing method, apparatus, and terminal device for speech recognition, to solve the problem that natural disturbances in human speech caused by psychological or physiological factors affect and degrade the performance of automatic speech recognition.
  • an embodiment of the present application provides a front-end processing method for speech recognition, including:
  • acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data; performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice; inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
  • an embodiment of the present application provides a front-end processing device for speech recognition, including:
  • the acquiring unit is configured to acquire an original voice signal, and preprocess the original voice signal according to a preset format to obtain source voice data;
  • a feature extraction unit configured to perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;
  • a data processing unit configured to input the first voice feature parameter into a voice conversion model, and output a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data;
  • the synthesis unit is configured to synthesize the target voice data according to the second voice feature parameter, and use the target voice data as the input of a voice recognition model to perform voice recognition.
  • In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:
  • acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data; performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice; inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
  • In a fourth aspect, the embodiments of the present application provide a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program that, when executed by a processor, implements:
  • acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data; performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice; inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
  • the embodiments of the present application provide a computer program product that, when the computer program product runs on a terminal device, causes the terminal device to execute the front-end processing method for speech recognition according to any one of the above-mentioned first aspects.
  • Compared with the prior art, the embodiments of this application have the following beneficial effects. Through the embodiments of this application, an original voice signal is obtained and preprocessed according to a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain a first voice feature parameter, an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input into a voice conversion model, which outputs a converted second voice feature parameter, the feature parameter of the target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a voice recognition model to perform voice recognition.
  • Before voice recognition is performed, the original voice signal is preprocessed and its voice feature parameters are converted. The voice conversion filters out natural disturbances in the original voice data, converting the feature parameters of disturbed source voice data into the feature parameters of undisturbed natural voice data, and synthesizes the corresponding undisturbed voice data as the input of voice recognition. Visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data realizes the non-parallel conversion of voice data and improves the robustness and accuracy of voice recognition.
  • FIG. 1 is a schematic diagram of an application scenario system provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a front-end processing method for speech recognition provided by an embodiment of the present application
  • FIG. 3 is a schematic flowchart of an iterative training method for an adversarial network model provided by another embodiment of the present application
  • FIG. 4 is a schematic diagram of the network structure of the adversarial network model provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a front-end processing device for speech recognition provided by an embodiment of the present application
  • Fig. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • The term "if" can be construed as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
  • Similarly, the phrase "if determined" or "if [the described condition or event] is detected" can be interpreted as "once determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]", depending on the context.
  • The front-end processing method for speech recognition provided by the embodiments of this application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, and personal digital assistants (PDAs).
  • FIG. 1 is a schematic diagram of an application scenario system provided by an embodiment of the present application.
  • The front-end processing method for speech recognition provided by the embodiment of the present application can be applied to mobile terminals or fixed devices, such as the smart phone 101, the laptop 102, and the desktop computer 103; the embodiment of the application does not impose any restrictions on the specific types of terminal devices.
  • The terminal device interacts with the server 104 in a wired or wireless manner. The voice assistant of the terminal device obtains an external voice signal and performs front-end processing on it to filter out certain interference factors, converting the disturbed voice signal into a natural voice signal with no disturbance or minimal disturbance. The processed signal is then transmitted to the server by wired or wireless means; the server performs speech recognition, natural language processing, and related business processing and feeds the results back to the terminal device, which executes corresponding actions according to the business processing information. Voice assistants such as Siri, Google Assistant, and Amazon Alexa are applications of the front-end processing method in automatic speech recognition (ASR) systems.
  • Wireless methods include the Internet, WiFi networks, or mobile networks. Mobile networks can include existing 2G (such as the Global System for Mobile Communication, GSM), 3G (such as the Universal Mobile Telecommunications System, UMTS), 4G (such as FDD LTE and TDD LTE), 4.5G, and 5G networks.
  • FIG. 2 shows a schematic flowchart of the front-end processing method of speech recognition provided by the present application, and the front-end processing method of speech recognition includes:
  • Step S201 Obtain an original voice signal, and preprocess the original voice signal according to a preset format to obtain source voice data.
  • The execution body of this embodiment may be a terminal device with a voice recognition function, which implements front-end processing of voice signals for application scenarios where voice recognition is performed. That is, before semantic recognition is performed on the voice, front-end processing is applied to the disturbed or noisy voice signal to obtain normal, noise-free voice data, which is then used as the input of the voice recognition system to improve the accuracy and robustness of voice recognition.
  • the original voice signal may be a voice signal with disturbance or noise, such as a voice signal with natural interference caused by psychology or physiology.
  • For example, the disturbed voice signals may include voice signals expressing different emotions such as laughter, excitement, or depression, or voice signals with squeaking and breathing sounds produced by different voice qualities.
  • Obtaining the original voice signal and preprocessing the original voice signal according to a preset format to obtain the source voice data includes:
  • A1. Perform filtering processing on the original voice signal;
  • A2. Periodically sample the filtered voice signal to obtain voice sampling data at a preset frequency; for example, the original speech signal is filtered and sampled at a frequency of 16 kHz;
  • A3. Perform windowing and framing processing on the voice sampling data to obtain the source voice data.
  • The voice sampling data is windowed. Since the voice signal is strongly time-varying in the time domain, it is divided into short-time segments of fixed duration whose characteristics remain approximately unchanged within that duration. The fixed duration is typically between 10 and 30 milliseconds and is realized by windowing, for example by multiplying the voice signal by a window function 20 milliseconds long; the spectral characteristics of the windowed voice signal are stable within the duration of the window (20 milliseconds).
  • The voice signal is then divided into frames. To ensure the continuity and reliability of the dynamically changing information in the voice signal, an overlap is set between two adjacent frames to maintain a smooth transition between frames.
  • Endpoint detection is performed on the voice signal to mark and determine the starting point and ending point of each frame, reducing the impact of bursts or discontinuities on voice signal analysis.
  • The acquired voice data frames are used as the source voice data to be analyzed, as illustrated in the sketch below.
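  • A minimal sketch of the preprocessing in steps A1-A3 follows. The patent publishes no reference code; the 16 kHz rate, 20 ms window, and overlapping frames are taken from the text, while the Butterworth low-pass filter and the Hamming window are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def preprocess(signal, fs=16000, win_ms=20, hop_ms=5):
    """Filter, window, and frame a speech signal (steps A1-A3).

    The 16 kHz rate, 20 ms window, and 5 ms hop follow the patent text;
    the 8th-order Butterworth low-pass filter and the Hamming window are
    assumptions the patent does not specify.
    """
    # A1: low-pass filtering (assumed design, normalized cutoff near Nyquist)
    b, a = butter(8, 0.95)
    filtered = lfilter(b, a, signal)

    # A2: the signal is assumed to already be sampled at fs = 16 kHz
    win = int(fs * win_ms / 1000)   # 320 samples per 20 ms window
    hop = int(fs * hop_ms / 1000)   # 80-sample hop, so adjacent frames overlap

    # A3: split into overlapping frames and window each one
    n_frames = max(0, 1 + (len(filtered) - win) // hop)
    frames = np.stack([filtered[i * hop:i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)  # short-time spectra are stable per frame
```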
  • The original voice signal may also be a normal, noise-free voice signal. In the front-end processing part of the voice recognition system, front-end processing of normal, undisturbed voice will not affect the subsequent recognition of the voice signal.
  • Step S202 Perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice.
  • the first voice feature parameter is an acoustic feature parameter describing the timbre of the voice extracted based on the voice data frame, such as a frequency spectrum parameter; the first voice feature parameter also includes a parameter for characterizing prosodic features of the voice, For example, the pitch frequency parameter.
  • performing voice feature extraction on the source voice data to obtain the first voice feature parameter of the source voice data includes:
  • Specifically, the first voice feature parameters are extracted every 5 milliseconds and include Mel spectrum feature parameters extracted with a Mel filter bank (MFB), logarithmic fundamental frequency (log F0) feature parameters, and aperiodic component (APs) features.
  • The Mel spectrum feature parameters and the aperiodic component (APs) features are each 24-dimensional voice feature parameters.
  • For the Mel spectrum feature parameters, feature extraction is performed every 5 milliseconds within the 20-millisecond voice data window of each frame. The time-domain signal of each frame of source voice data is padded to a sequence the same length as the window width, a discrete Fourier transform is applied to obtain the linear spectrum of each frame, and the linear spectrum is passed through the Mel frequency filter bank to obtain the Mel spectrum. The Mel filter bank generally includes 24 triangular band-pass filters; it smooths the acquired spectral features, effectively emphasizes the low-frequency information of the voice data, highlights useful information, and shields the interference of noise.
  • For the logarithmic fundamental frequency feature parameters, after windowing each preprocessed frame of source voice data, the cepstrum of the frame is calculated and a pitch-search length range is set. The maximum of the cepstrum within this range is queried: if the maximum is greater than the window threshold, the pitch frequency of voiced speech is calculated from it, and the logarithm of the pitch frequency is taken to reflect the characteristics of the voice data; if the maximum of the cepstrum is less than or equal to the window threshold, the frame of source voice data is silence or unvoiced. A sketch of this cepstral pitch decision follows.
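  • A minimal numpy sketch of the cepstral decision just described; the 50-400 Hz search range and the peak threshold value are illustrative assumptions not fixed by the patent.

```python
import numpy as np

def cepstral_log_f0(frame, fs=16000, f0_min=50.0, f0_max=400.0, threshold=0.08):
    """Return log F0 for a windowed frame, or None for silence/unvoiced.

    Implements the patent's cepstral decision: find the cepstral peak in a
    pitch-search range and compare it against a threshold. The search range
    (50-400 Hz) and the threshold value are assumptions for illustration.
    """
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)

    # A pitch period of P samples appears at quefrency P; search fs/f0_max..fs/f0_min
    lo, hi = int(fs / f0_max), min(int(fs / f0_min), len(cepstrum) - 1)
    peak_idx = lo + int(np.argmax(cepstrum[lo:hi]))

    if cepstrum[peak_idx] <= threshold:
        return None                 # silence or unvoiced frame
    f0 = fs / peak_idx              # voiced: pitch frequency from the cepstral peak
    return np.log(f0)
```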
  • For the aperiodic component feature parameters, an inverse Fourier transform is performed on the windowed signal of the source voice data to obtain the time-domain characteristics of the aperiodic components, and the frequency-domain characteristics of the aperiodic components are determined from the minimum phase of the windowed signal and the spectral features of the source voice data (see the combined extraction sketch below).
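  • Taken together, one way to obtain the three feature streams (a 24-band Mel spectrum at a 5 ms hop, log F0, and aperiodicity) is with off-the-shelf analysis libraries. The sketch below uses librosa and the WORLD vocoder wrapper pyworld as stand-ins for the patent's own extraction pipeline, which is an implementation assumption.

```python
import numpy as np
import librosa
import pyworld

def extract_features(x, fs=16000):
    """Extract the three feature streams described in the patent.

    librosa/pyworld are stand-ins for the patent's own extractors;
    the 24 Mel bands, 20 ms window, and 5 ms hop follow the text.
    """
    x = x.astype(np.float64)

    # 24-band Mel spectrum, 20 ms window (320 samples), 5 ms hop (80 samples)
    mel = librosa.feature.melspectrogram(
        y=x.astype(np.float32), sr=fs,
        n_fft=512, win_length=320, hop_length=80, n_mels=24)

    # Fundamental frequency and aperiodicity via WORLD (5 ms frame period)
    f0, t = pyworld.dio(x, fs, frame_period=5.0)
    f0 = pyworld.stonemask(x, f0, t, fs)     # refine the F0 estimates
    ap = pyworld.d4c(x, f0, t, fs)           # aperiodic components

    # log F0 on voiced frames; unvoiced/silent frames are flagged with 0
    log_f0 = np.where(f0 > 0, np.log(np.maximum(f0, 1e-10)), 0.0)
    return mel, log_f0, ap
```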
  • Step S203 Input the first voice feature parameter into the voice conversion model, and output the second voice feature parameter after the conversion.
  • the second voice feature parameter is the feature parameter of the target voice data.
  • The voice conversion model is a model obtained by training on a sample training data set using a cycle-consistent adversarial network model (CycleGAN).
  • The first voice feature parameter extracted from the source voice data is input into the voice conversion model, and after voice conversion the second voice feature parameter is output. The second voice feature parameter is the voice feature parameter most similar to an actual normal voice feature parameter, that is, the feature parameter of the target voice data, where the target voice data is voice data with minimal or no disturbance.
  • FIG. 3 is a schematic flowchart of a method for iteratively training the adversarial network model provided by another embodiment of the present application; the training of the voice conversion model includes the following steps:
  • Step S301 Obtain a random sample and an actual sample in the speech sample training data set, and extract the random sample feature parameter distribution of the random sample and the actual sample feature parameter distribution of the actual sample respectively;
  • For example, two spontaneous speech data sets, the AMI meeting corpus and the Buckeye corpus of conversational speech, are used to analyze the impact of natural disturbance. From the two speech data sets, voice data from 40 female speakers and 30 male speakers is obtained and used as the voice sample training data set: 210 utterances in total, covering each gender and each type (normal speech, laughing speech, and squeaky speech). Of these 210 utterances, 150 are used for training and 60 for testing, and the duration of each utterance is 1-2 seconds; this data is used to train the voice conversion model.
  • The cycle-consistent adversarial network model includes a generator and a discriminator. Specifically, random samples and actual samples are obtained from the voice sample training data set, the random sample feature parameters and the actual sample feature parameters are extracted, and the random sample feature parameter distribution is used as the input of the generator.
  • Step S302: Perform iterative training on the to-be-trained adversarial network model according to the random sample feature parameter distribution and the actual sample feature parameter distribution;
  • The cycle-consistent adversarial network model includes a generator and a discriminator; according to the random sample feature parameters, the generator generates a pseudo-sample feature parameter distribution similar to the actual sample feature parameter distribution.
  • The pseudo-sample feature parameter distribution is input into the discriminator, which distinguishes it from the actual sample feature parameter distribution.
  • Step S303: Calculate the output error of the adversarial network model in the iterative training process according to a preset loss function.
  • The adversarial network model uses a preset loss function to calculate the error during iterative training, and uses the error as the target training value of the adversarial network model.
  • Step S304: When the error is less than or equal to the preset error threshold, stop training to obtain the voice conversion model.
  • When the trained adversarial network model meets the conversion conditions, training is stopped to obtain the voice conversion model. Through the voice conversion model, disturbed speech feature parameters are converted into actual normal speech feature parameters, completing the conversion of non-parallel speech. A sketch of the training loop follows.
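  • A hedged PyTorch sketch of the alternating update loop with the threshold-based stop of steps S302-S304; the optimizer choice, learning rate, and threshold value are illustrative assumptions, and loss_fn is the preset loss of the form given later in this section.

```python
import torch

def train(G_xy, G_yx, D_x, D_y, loader, loss_fn, err_threshold=0.05, max_epochs=200):
    """Alternate discriminator/generator updates; stop when the monitored
    error (the preset-loss value) falls to or below the threshold (step S304).

    Adam with lr=2e-4, the threshold, and the epoch cap are assumptions.
    """
    g_opt = torch.optim.Adam(list(G_xy.parameters()) + list(G_yx.parameters()), lr=2e-4)
    d_opt = torch.optim.Adam(list(D_x.parameters()) + list(D_y.parameters()), lr=2e-4)
    for epoch in range(max_epochs):
        for x, y in loader:   # x: disturbed-speech features, y: normal-speech features
            # Discriminator step: real -> 1, generated -> 0 (least-squares form assumed)
            d_opt.zero_grad()
            d_loss = ((D_y(y) - 1) ** 2).mean() + (D_y(G_xy(x).detach()) ** 2).mean() \
                   + ((D_x(x) - 1) ** 2).mean() + (D_x(G_yx(y).detach()) ** 2).mean()
            d_loss.backward()
            d_opt.step()
            # Generator step: minimize the preset loss (adversarial + cycle + identity)
            g_opt.zero_grad()
            g_loss = loss_fn(G_xy, G_yx, D_y, D_x, x, y)
            g_loss.backward()
            g_opt.step()
        if g_loss.item() <= err_threshold:   # step S304: error at/below threshold
            return
```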
  • Performing iterative training on the to-be-trained adversarial network according to the random sample feature parameter distribution and the actual sample feature parameter distribution includes:
  • The conversion from disturbed voice features to normal voice features is modeled as follows. Voice feature parameters are extracted from the acquired random samples, and the extracted feature parameter distribution (x ∈ X) is input to the generator, which generates a pseudo-sample feature parameter distribution G_X→Y(x). Through the first adversarial loss function L_adv(G_X→Y(x), Y), the distance between the pseudo-sample feature parameter distribution G_X→Y(x) and the actual sample feature parameter distribution (y ∈ Y) is calculated; that is, the conversion from disturbed speech to normal speech is realized.
  • The discriminator distinguishes the generated pseudo-sample features from the actual sample features to obtain the discrimination result G_Y→X(y), and the second adversarial loss function L_adv(G_Y→X(y), X) calculates the distance between the discrimination result and the random sample features.
  • As shown in FIG. 4, the adversarial network model includes a generator G and a discriminator D.
  • The generator G generates a pseudo-sample feature parameter distribution G(x); the pseudo-sample feature parameter distribution and the actual sample feature distribution are input to the discriminator D, which discriminates between them to obtain a discrimination result. The discrimination result is then fed back to the generator G or the discriminator D to train the adversarial network model cyclically.
  • In this embodiment, the generator and discriminator networks in the voice conversion model are each composed of convolution blocks.
  • The generator network consists of 9 convolution blocks, including a stride-1 convolution block, a stride-2 convolution block, 5 residual blocks, a 1/2-stride convolution block, and a stride-1 convolution block;
  • All convolutional layers are one-dimensional; gated linear units serve as the activation function of the convolutional layers and have achieved state-of-the-art performance in language and speech modeling.
  • The discriminator network is composed of four two-dimensional convolution blocks, with gated linear units as the activation function of all convolution blocks; for the discriminator network, a 6×6 PatchGAN is used to classify each 6×6 patch as real or fake. A hedged architecture sketch follows.
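  • The PyTorch sketch below illustrates the stated generator layout (9 blocks: stride-1, stride-2, 5 residual, 1/2-stride, stride-1; 1-D convolutions; GLU activations). Kernel sizes and channel widths are illustrative assumptions, and the 1/2-stride block is read here as a transposed convolution; the patent fixes only the block layout and the GLU.

```python
import torch
import torch.nn as nn

class GLUConv1d(nn.Module):
    """1-D convolution gated by a GLU (the patent's stated activation)."""
    def __init__(self, c_in, c_out, k=5, stride=1, transpose=False):
        super().__init__()
        if transpose:  # "1/2-stride" block, read as a stride-2 transposed conv
            self.conv = nn.ConvTranspose1d(c_in, 2 * c_out, k, stride=2,
                                           padding=k // 2, output_padding=1)
        else:
            self.conv = nn.Conv1d(c_in, 2 * c_out, k, stride=stride, padding=k // 2)
        self.glu = nn.GLU(dim=1)  # halves 2*c_out channels back to c_out

    def forward(self, x):
        return self.glu(self.conv(x))

class ResBlock(nn.Module):
    def __init__(self, c, k=3):
        super().__init__()
        self.block = nn.Sequential(GLUConv1d(c, c, k),
                                   nn.Conv1d(c, c, k, padding=k // 2))
    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    """9 blocks as listed in the patent; widths (128/256) are assumptions."""
    def __init__(self, feat_dim=24, width=128):
        super().__init__()
        self.net = nn.Sequential(
            GLUConv1d(feat_dim, width, k=15, stride=1),        # stride-1 block
            GLUConv1d(width, 2 * width, k=5, stride=2),        # stride-2 downsample
            *[ResBlock(2 * width) for _ in range(5)],          # 5 residual blocks
            GLUConv1d(2 * width, width, k=5, transpose=True),  # 1/2-stride upsample
            nn.Conv1d(width, feat_dim, 15, padding=7),         # stride-1 output block
        )
    def forward(self, x):  # x: (batch, feat_dim, frames)
        return self.net(x)
```

The discriminator would analogously stack four 2-D GLU-gated convolution blocks over the feature map and emit one real/fake score per 6×6 patch.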
  • Calculating the error output by the adversarial network model during the iterative training process according to the preset loss function includes:
  • According to the first adversarial loss function and the second adversarial loss function, the cycle-consistency loss function and the identity-mapping loss function of the adversarial network model are obtained. Here the first adversarial loss function L_adv(G_X→Y(x), Y) calculates the distance between the pseudo-sample feature parameter distribution and the actual sample feature parameter distribution, and the second adversarial loss function L_adv(G_Y→X(y), X) calculates the distance between the discrimination-result feature distribution and the random sample feature distribution;
  • The cycle-consistency loss is L_cyc = E_x[||G_Y→X(G_X→Y(x)) − x||_1] + E_y[||G_X→Y(G_Y→X(y)) − y||_1], and the identity-mapping loss is L_id = E_x[||G_Y→X(x) − x||_1] + E_y[||G_X→Y(y) − y||_1];
  • The preset loss function of the adversarial network model is then obtained as L = L_adv(G_X→Y(x), Y) + L_adv(G_Y→X(y), X) + λ_cyc L_cyc + λ_id L_id, where λ_cyc and λ_id are hyperparameters controlling the weights of the cycle-consistency loss function and the identity-mapping loss function.
  • The adversarial network model outputs the error calculated by the preset loss function, and the error is used as the target training value.
  • The error is taken as the target training value; when the value of the full loss function is minimized, training is completed and the voice conversion model is obtained. A sketch of this objective follows.
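  • A minimal PyTorch sketch of the full objective under stated assumptions: a least-squares adversarial term is used for concreteness (the patent does not fix the adversarial loss form), and λ_cyc = 10, λ_id = 5 are illustrative values.

```python
import torch
import torch.nn.functional as F

def cyclegan_vc_loss(G_xy, G_yx, D_y, D_x, x, y, lam_cyc=10.0, lam_id=5.0):
    """Generator objective L = L_adv + L_adv + lam_cyc*L_cyc + lam_id*L_id.

    Least-squares adversarial terms and the lambda values are assumptions;
    the patent fixes only the overall form of the loss.
    """
    fake_y = G_xy(x)   # disturbed -> normal
    fake_x = G_yx(y)   # normal -> disturbed

    # Adversarial losses: generators try to make the discriminators output 1
    pred_y, pred_x = D_y(fake_y), D_x(fake_x)
    l_adv = F.mse_loss(pred_y, torch.ones_like(pred_y)) + \
            F.mse_loss(pred_x, torch.ones_like(pred_x))

    # Cycle consistency: X -> Y -> X should reconstruct x (and vice versa)
    l_cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)

    # Identity mapping: feeding a target-domain sample should change nothing
    l_id = F.l1_loss(G_yx(x), x) + F.l1_loss(G_xy(y), y)

    return l_adv + lam_cyc * l_cyc + lam_id * l_id
```

Training alternates between minimizing this generator objective and updating the discriminators; per step S304, training stops once the monitored error falls to or below the preset threshold.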
  • Step S204 Synthesize the target voice data according to the second voice feature parameter, and use the target voice data as an input of a voice recognition model to perform voice recognition.
  • synthesizing the target voice data according to the second voice feature parameter includes:
  • According to the second voice feature parameter, waveform concatenation and the time-domain pitch-synchronous overlap-add (TD-PSOLA) algorithm are used to synthesize target voice data with no disturbance or with minimal disturbance features.
  • Specifically, the target voice data is synthesized according to the second voice feature parameter: for example, based on the second voice feature parameter, waveform concatenation and a time-domain pitch-synchronous overlap-add algorithm are used to synthesize a voice signal containing the target feature parameters, as in the sketch below.
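  • The patent describes waveform concatenation with TD-PSOLA; as an illustrative stand-in that works directly with converted WORLD-style features from the earlier extraction sketch, the following uses pyworld's synthesizer. This substitution, and the (unshown) inversion from Mel-domain features back to a WORLD spectral envelope, are assumptions rather than the patent's own synthesis routine.

```python
import numpy as np
import pyworld

def synthesize(log_f0, sp, ap, fs=16000, frame_period=5.0):
    """Resynthesize a waveform from converted features (illustrative stand-in).

    pyworld.synthesize replaces the patent's waveform-concatenation/TD-PSOLA
    step; sp is a WORLD spectral envelope assumed to have been recovered
    from the converted Mel-domain features.
    """
    f0 = np.where(log_f0 > 0, np.exp(log_f0), 0.0)  # undo the log; unvoiced stays 0
    return pyworld.synthesize(f0, sp, ap, fs, frame_period)
```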
  • The synthesized voice data is used as the input of the speech recognition model to perform speech recognition. Specifically, in practical application, a given speech recognition system is tested with and without the front-end processing method proposed in this application, using laughing speech (speech disturbed by emotion) and squeaky speech (speech disturbed by voice quality) respectively.
  • Performance is evaluated by word error rate (WER) and sentence error rate (SER); lower WER and SER values indicate better performance, as computed in the sketch below.
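  • For reference, a minimal sketch of the two metrics (standard definitions, not code from the patent):

```python
def wer(ref_words, hyp_words):
    """Word error rate: word-level edit distance divided by reference length."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / max(n, 1)

def ser(ref_sentences, hyp_sentences):
    """Sentence error rate: fraction of sentences with any recognition error."""
    errors = sum(r != h for r, h in zip(ref_sentences, hyp_sentences))
    return errors / max(len(ref_sentences), 1)
```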
  • the ASR performance shown in Table 1 is affected by the strength of the language model used by each ASR system.
  • A deep speech model that converts speech into English character sequences is also tested.
  • The character error rate (CER) performance of this model with and without the front-end voice conversion model is tested.
  • The model is trained on 1000 hours of LibriSpeech data, and no language model is used for decoding. As can be seen from Table 2, front-end processing through the voice conversion model reduces the character error rate (CER) of the deep speech model.
  • The voice conversion model of this embodiment can capture the distribution of Mel filter bank outputs of normal and laughter-disturbed speech, and can convert laughter-disturbed speech into equivalent normal speech.
  • Through the embodiments of the present application, the original voice signal is obtained and preprocessed in a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain the first voice feature parameter of the source voice data, an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input to the voice conversion model, which outputs the converted second voice feature parameter, the feature parameter of the target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a voice recognition model to perform voice recognition.
  • Before voice recognition, the original voice signal is preprocessed and its voice feature parameters are converted. The voice conversion filters out the natural interference in the original voice data, converting the feature parameters of disturbed source voice data into the feature parameters of undisturbed natural voice data, and synthesizes the corresponding undisturbed voice data as the input of voice recognition.
  • Visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data realizes the non-parallel conversion of voice data and improves the robustness and accuracy of voice recognition.
  • FIG. 5 shows a structural block diagram of the front-end processing device for speech recognition provided in an embodiment of the present application; for ease of description, only the parts relevant to the embodiment are shown.
  • the device includes:
  • the obtaining unit 51 is configured to obtain an original voice signal, and preprocess the original voice signal according to a preset format to obtain source voice data;
  • the feature extraction unit 52 is configured to perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;
  • the data processing unit 53 is configured to input the first voice feature parameter into a voice conversion model, and output a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data;
  • the synthesis unit 54 is configured to synthesize the target voice data according to the second voice characteristic parameter, and use the target voice data as an input of a voice recognition model to perform voice recognition.
  • the acquiring unit includes:
  • the filtering module is used to perform filtering processing on the original speech signal
  • the sampling module is used to periodically sample the filtered voice signal to obtain voice sampling data with a preset frequency
  • the processing module is used to perform windowing and framing processing on the voice sample data to obtain the source voice data.
  • The feature extraction unit is further configured to extract the Mel spectrum feature parameters, logarithmic fundamental frequency feature parameters, and aperiodic component feature parameters of the source voice data through a Mel filter bank, to obtain the parameter distributions corresponding to the Mel spectrum feature parameters, the logarithmic fundamental frequency feature parameters, and the aperiodic component feature parameters of the source voice data.
  • the front-end processing device for speech recognition further includes:
  • the sample data acquisition unit is configured to acquire a random sample and an actual sample in the speech sample training data set, and extract the random sample feature parameter distribution of the random sample and the actual sample feature parameter distribution of the actual sample respectively;
  • the model training unit is configured to perform iterative training on the to-be-trained adversarial network model according to the random sample feature parameter distribution and the actual sample feature parameter distribution;
  • an error calculation unit, configured to calculate the error output by the adversarial network model in the iterative training process according to a preset loss function;
  • the model generating unit is used to stop training when the error is less than or equal to the preset error threshold to obtain the voice conversion model.
  • model training unit includes:
  • a generator network module, used to input the random sample feature parameter distribution to the generator network of the to-be-trained adversarial network model and generate a pseudo-sample feature parameter distribution corresponding to the actual sample feature parameter distribution;
  • a discriminator network module, used to discriminate the pseudo-sample feature parameter distribution from the actual sample feature parameter distribution through the discriminator network of the to-be-trained adversarial network model, to obtain the discrimination result feature distribution;
  • a cyclic training module, used to input the discrimination result feature distribution to the generator network again to regenerate the pseudo-sample feature parameter distribution corresponding to the actual sample feature parameter distribution, and to discriminate the regenerated pseudo-sample feature parameter distribution from the actual sample feature parameter distribution again through the discriminator network, obtaining the discrimination result feature distribution;
  • an iterative training module, configured to cyclically and iteratively train the to-be-trained adversarial network model according to the random sample feature parameter distribution, the actual sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination result feature distribution.
  • the error calculation unit includes:
  • a first calculation module, used to obtain the cycle-consistency loss function and the identity-mapping loss function of the adversarial network model according to the first adversarial loss function and the second adversarial loss function, where the first adversarial loss function is a loss function calculating the distance between the pseudo-sample feature parameter distribution and the actual sample feature parameter distribution, and the second adversarial loss function is a loss function calculating the distance between the discrimination result feature distribution and the random sample feature distribution;
  • a second calculation module, configured to obtain the preset loss function of the adversarial network model according to the cycle-consistency loss function and the identity-mapping loss function;
  • a target training value calculation module, configured to output the error calculated by the preset loss function of the adversarial network model and use the error as the target training value.
  • The synthesis unit is further configured to use waveform concatenation and the time-domain pitch-synchronous overlap-add algorithm, according to the second voice feature parameter, to synthesize target voice data with no disturbance or with minimal disturbance features.
  • Through the embodiments of the present application, the original voice signal is obtained and preprocessed according to a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain the first voice feature parameter of the source voice data, an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input to the voice conversion model, which outputs the converted second voice feature parameter, the feature parameter of the target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a voice recognition model to perform voice recognition.
  • Before voice recognition, the original voice signal is preprocessed and its voice feature parameters are converted. The voice conversion filters out the natural interference in the original voice data, converting the feature parameters of disturbed source voice data into the feature parameters of undisturbed natural voice data, and synthesizes the corresponding undisturbed voice data as the input of voice recognition.
  • Visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data realizes the non-parallel conversion of voice data and improves the robustness and accuracy of voice recognition.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be realized.
  • The embodiments of the present application provide a computer program product; when the computer program product runs on a mobile terminal, the steps in the foregoing method embodiments can be realized.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer program can be stored in a computer-readable storage medium. When executed by the processor, the steps of the foregoing method embodiments can be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a mobile hard disk, a floppy disk, or a CD-ROM.
  • In some jurisdictions, according to legislation and patent practice, computer-readable media cannot include electrical carrier signals and telecommunications signals.
  • FIG. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the application.
  • The terminal device 6 of this embodiment includes: at least one processor 60 (only one is shown in FIG. 6), a memory 61, and a computer program 62 stored in the memory 61 and executable on the at least one processor 60. When the processor 60 executes the computer program 62, the steps in any of the foregoing embodiments of the front-end processing method for speech recognition are implemented.
  • the terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the terminal device may include, but is not limited to, a processor 60 and a memory 61.
  • Those skilled in the art can understand that FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6; it may include more or fewer components than shown in the figure, a combination of certain components, or different components, and may, for example, also include input and output devices, network access devices, and so on.
  • The so-called processor 60 may be a central processing unit (CPU); the processor 60 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 61 may be an internal storage unit of the terminal device 6 in some embodiments, such as a hard disk or a memory of the terminal device 6. In other embodiments, the memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk equipped on the terminal device 6, a smart media card (SMC), a secure digital (Secure Digital, SD) card, Flash Card, etc. Further, the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device.
  • the memory 61 is used to store an operating system, an application program, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 61 can also be used to temporarily store data that has been output or will be output.
  • the disclosed apparatus/network equipment and method may be implemented in other ways.
  • the device/network device embodiments described above are only illustrative.
  • the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are a speech recognition front-end processing method and apparatus, and a terminal device. The method comprises: acquiring an original speech signal, and preprocessing the original speech signal according to a preset format to obtain source speech data (S201); performing speech feature extraction on the source speech data to obtain a first speech feature parameter of the source speech data, wherein the first speech feature parameter is an acoustic feature parameter describing the timbre and prosody of speech (S202); inputting the first speech feature parameter into a speech conversion model, and outputting a second speech feature parameter after conversion, wherein the second speech feature parameter is a feature parameter of target speech data (S203); and synthesising the target speech data according to the second speech feature parameter, and taking the target speech data as an input of a speech recognition model for performing speech recognition (S204). Source speech data with a first speech feature parameter is converted into speech data with a second speech feature parameter, and non-parallel conversion of speech data is realised, thereby improving the robustness and accuracy of speech recognition.

Description

[Corrected under Rule 91, 07.01.2021] Speech recognition front-end processing method and apparatus, and terminal device

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 11, 2020, with application number 202010165112.8 and entitled "Speech recognition front-end processing method and apparatus, and terminal device", the entire contents of which are incorporated herein by reference.
Technical Field

This application belongs to the technical field of speech recognition, and in particular relates to a front-end processing method, apparatus, and terminal device for speech recognition.
Background

Automatic Speech Recognition (ASR) converts the vocabulary content of human speech into computer-readable input; it differs from speaker recognition and speaker verification. With the development and application of deep learning technology, automatic speech recognition has improved significantly and is widely used in many fields of daily life.

However, the inventor realizes that when a speech signal contains a small amount of noise or undergoes subtle changes, such as the natural disturbances in human speech caused by psychological or physiological factors (including expressive speech signals of different emotions such as laughter, excitement, and frustration, or speech signals with incidental squeaking and breathing sounds produced by different voice qualities), the performance of automatic speech recognition is affected and degraded.
Technical Problem

In view of this, the embodiments of the present application provide a front-end processing method, apparatus, and terminal device for speech recognition, to solve the problem that natural disturbances in human speech caused by psychological or physiological factors affect and degrade the performance of automatic speech recognition.
Technical Solution

In a first aspect, an embodiment of the present application provides a front-end processing method for speech recognition, including:

acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data;

performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;

inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and

synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
In a second aspect, an embodiment of the present application provides a front-end processing device for speech recognition, including:

an acquiring unit, configured to acquire an original voice signal and preprocess the original voice signal according to a preset format to obtain source voice data;

a feature extraction unit, configured to perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;

a data processing unit, configured to input the first voice feature parameter into a voice conversion model and output a converted second voice feature parameter, where the second voice feature parameter is a feature parameter of target voice data; and

a synthesis unit, configured to synthesize the target voice data according to the second voice feature parameter and use the target voice data as the input of a voice recognition model to perform voice recognition.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:

acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data;

performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;

inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and

synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
In a fourth aspect, the embodiments of the present application provide a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program that, when executed by a processor, implements:

acquiring an original voice signal, and preprocessing the original voice signal according to a preset format to obtain source voice data;

performing voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, where the first voice feature parameter is an acoustic feature parameter describing the timbre and prosody of the voice;

inputting the first voice feature parameter into a voice conversion model, and outputting a second voice feature parameter after conversion, where the second voice feature parameter is a feature parameter of target voice data; and

synthesizing the target voice data according to the second voice feature parameter, and using the target voice data as an input of a voice recognition model to perform voice recognition.
In a fifth aspect, the embodiments of the present application provide a computer program product that, when run on a terminal device, causes the terminal device to execute the front-end processing method for speech recognition according to any one of the above first aspects.
Beneficial Effects

Compared with the prior art, the embodiments of this application have the following beneficial effects. Through the embodiments of this application, an original voice signal is obtained and preprocessed according to a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain a first voice feature parameter, an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input into a voice conversion model, which outputs a converted second voice feature parameter, the feature parameter of the target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a voice recognition model to perform voice recognition. Before voice recognition, the original voice signal is preprocessed and its voice feature parameters are converted; the voice conversion filters out natural disturbances in the original voice data, converts the feature parameters of disturbed source voice data into the feature parameters of undisturbed natural voice data, and synthesizes the corresponding undisturbed voice data as the input of voice recognition. Visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data realizes non-parallel conversion of voice data and improves the robustness and accuracy of voice recognition.
Description of the drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
FIG. 1 is a schematic diagram of an application scenario system provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a front-end processing method for speech recognition provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of an iterative training method for an adversarial network model provided by another embodiment of the present application;
FIG. 4 is a schematic diagram of the network structure of an adversarial network model provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a front-end processing apparatus for speech recognition provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
Embodiments of the present invention
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary details do not obscure the description of the present application.
It should be understood that, when used in the specification and the appended claims of the present application, the term "comprising" indicates the presence of the described features, wholes, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the term "and/or" used in the specification and the appended claims of the present application refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in the specification and the appended claims of the present application, the term "if" may be construed, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be construed, depending on the context, to mean "once it is determined" or "in response to determining" or "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
In addition, in the description of the specification and the appended claims of the present application, the terms "first", "second", "third", and the like are used only to distinguish the descriptions and cannot be understood as indicating or implying relative importance.
Reference to "one embodiment" or "some embodiments" in the specification of the present application means that one or more embodiments of the present application include a specific feature, structure, or characteristic described in combination with that embodiment. Therefore, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "including", "comprising", "having", and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
The front-end processing method for speech recognition provided by the embodiments of the present application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPCs), netbooks, and personal digital assistants (PDAs); the embodiments of the present application do not impose any restriction on the specific type of terminal device.
Referring to FIG. 1, which is a schematic diagram of an application scenario system provided by an embodiment of the present application, the front-end processing method for speech recognition provided by the embodiment of the present application can be applied to mobile terminals or fixed devices, such as a smartphone 101, a notebook computer 102, or a desktop computer 103; the embodiment of the present application does not impose any restriction on the specific type of terminal device. The terminal device exchanges data with the server 104 in a wired or wireless manner. The voice assistant of the terminal device acquires an external voice signal and performs front-end processing on it, filtering out interference factors in the voice signal and converting the disturbed voice signal into a natural voice signal with no disturbance or minimal disturbance; the processed signal is then transmitted to the server in a wired or wireless manner. The server performs speech recognition, natural language processing, and related business processing, and feeds the results back to the terminal device, which executes corresponding actions according to the business processing information. Voice assistants such as Siri, Google Assistant, and Amazon Alexa apply this front-end processing method for speech recognition within an automatic speech recognition (ASR) system. Wireless methods include the Internet, WiFi networks, and mobile networks, where mobile networks may include existing 2G (e.g., Global System for Mobile Communication, GSM), 3G (e.g., Universal Mobile Telecommunications System, UMTS), 4G (e.g., FDD LTE, TDD LTE), as well as 4.5G and 5G.
FIG. 2 shows a schematic flowchart of the front-end processing method for speech recognition provided by the present application. The front-end processing method for speech recognition includes the following steps.
Step S201: acquire an original voice signal, and preprocess the original voice signal in a preset format to obtain source voice data.
In a possible implementation, the execution body of this embodiment may be a terminal device with a speech recognition function, which implements front-end processing of voice signals in speech recognition application scenarios. That is, before semantic recognition is performed on the voice, front-end processing is performed on a voice signal carrying disturbance or noise to obtain normal, noise-free voice data, which is then used as the input of the speech recognition system, improving the accuracy and robustness of speech recognition.
The original voice signal may be a voice signal with disturbance or noise, for example, a voice signal with natural interference of psychological or physiological origin. Specifically, this may include voice signals expressing different emotions such as laughter, excitement, and frustration, or voice signals with incidental creaky or breathy sounds produced by different voice qualities.
In one embodiment, acquiring the original voice signal and preprocessing the original voice signal in a preset format to obtain the source voice data includes:
A1. Filtering the original voice signal.
A2. Periodically sampling the filtered voice signal to obtain voice sampling data at a preset frequency.
In a possible implementation, the original voice signal is filtered and sampled at a frequency of 16 kHz.
A3. Performing windowing and framing on the voice sampling data to obtain the source voice data.
In a possible implementation, the voice sampling data is windowed. Since the voice signal is strongly time-varying in the time domain, it is divided into short segments of fixed length, under the assumption that the characteristics of one short frame remain unchanged within a fixed time. This fixed time may be a period between 10 and 30 milliseconds and is realized by windowing, for example, multiplying the voice signal by a window function 20 milliseconds long; the spectral characteristics of the windowed voice signal are then stationary within the duration of the window (20 milliseconds).
In addition, after the voice data is windowed, the voice signal is divided into frames. To ensure the continuity and reliability of the dynamically changing information in the voice signal, an overlap is set between two adjacent frames so that the voice signal transitions smoothly from frame to frame. After framing, endpoint detection is performed on the voice signal to mark and determine the start point and end point of each frame, reducing the influence of burst pulses or speech interruptions on the analysis of the voice signal. Finally, the acquired voice data frames are used as the source voice data to be analyzed.
It should be noted that the original voice signal may also be a normal, noise-free voice signal; as the front-end processing part of the speech recognition system, front-end processing of an acquired undisturbed normal voice does not affect the subsequent recognition of the voice signal.
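As an illustration of step S201, the following is a minimal preprocessing sketch in Python (not part of the original disclosure); the filter order, cutoff margin, and hop length are assumptions chosen to match the 16 kHz sampling rate and 20 ms window described above.

```python
import numpy as np
from scipy.signal import butter, lfilter

def preprocess(signal, in_rate, target_rate=16000,
               win_ms=20.0, hop_ms=10.0):
    """Filter, resample, window, and frame a raw voice signal (sketch)."""
    # Low-pass filter below the Nyquist frequency of the target rate
    # (4th-order Butterworth; the order is an illustrative assumption).
    b, a = butter(4, (target_rate / 2 - 100) / (in_rate / 2), btype="low")
    filtered = lfilter(b, a, signal)

    # Periodic sampling at the preset 16 kHz frequency.
    idx = np.arange(0, len(filtered), in_rate / target_rate)
    sampled = np.interp(idx, np.arange(len(filtered)), filtered)

    # Windowing and framing: 20 ms frames with overlap between adjacent
    # frames, so the frame-to-frame transition stays smooth.
    win = int(target_rate * win_ms / 1000)
    hop = int(target_rate * hop_ms / 1000)
    window = np.hamming(win)
    frames = [sampled[s:s + win] * window
              for s in range(0, len(sampled) - win + 1, hop)]
    return np.array(frames)  # source voice data: one row per frame
```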
Step S202: perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, the first voice feature parameter being an acoustic feature parameter describing the timbre and prosody of the voice.
In a possible implementation, the first voice feature parameter is an acoustic feature parameter describing the timbre of the voice, extracted from the voice data frames, such as a spectral parameter; the first voice feature parameter also includes parameters characterizing the prosodic features of the voice, such as a pitch frequency parameter.
In one embodiment, performing voice feature extraction on the source voice data to obtain the first voice feature parameter of the source voice data includes:
B1. Extracting Mel spectrum feature parameters, logarithmic fundamental frequency feature parameters, and aperiodic component feature parameters of the source voice data through a Mel filter bank.
B2. Obtaining the parameter distributions corresponding to the Mel spectrum feature parameters, logarithmic fundamental frequency feature parameters, and aperiodic component feature parameters of the source voice data.
In a possible implementation, within each 20-millisecond frame of voice data, the first voice feature parameters are extracted every 5 milliseconds, including Mel spectrum feature parameters extracted with a Mel filter bank (MFB), logarithmic fundamental frequency (log F0) feature parameters, and aperiodic component (AP) features. The Mel spectrum feature parameters and the aperiodic component (AP) features are each 24-dimensional voice feature parameters.
For the Mel spectrum feature parameters, features are extracted every 5 milliseconds within each 20-millisecond window of voice data. The time-domain signal of each frame of source voice data is recorded and padded to a sequence whose length matches the window width; a discrete Fourier transform of this sequence yields the linear spectrum of each frame of voice data, and passing the linear spectrum through the Mel-frequency filter bank yields the Mel spectrum. The Mel filter bank generally includes 24 triangular band-pass filters, which smooth the obtained spectral features, effectively emphasize the low-frequency information of the voice data, highlight useful information, and shield against noise interference.
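A minimal sketch of this Mel filter bank extraction, assuming the librosa library and the 16 kHz / 20 ms / 5 ms / 24-filter settings given above (the function and parameter names belong to librosa, not to the original disclosure):

```python
import librosa
import numpy as np

def mel_features(signal, sr=16000):
    """24-dimensional Mel filter bank output, 20 ms window, 5 ms hop."""
    win = int(0.020 * sr)   # 20 ms analysis window
    hop = int(0.005 * sr)   # features extracted every 5 ms
    mel = librosa.feature.melspectrogram(
        y=signal.astype(np.float32), sr=sr,
        n_fft=win, win_length=win, hop_length=hop,
        n_mels=24)          # 24 triangular band-pass filters
    return np.log(mel + 1e-10).T  # frames x 24 log-Mel features
```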
For the logarithmic fundamental frequency (log F0) feature parameters: when a person produces voiced sounds, the airflow through the glottis causes the vocal cords to vibrate in a relaxation-oscillation manner, producing a quasi-periodic pulsed airflow; this airflow excites the vocal tract to produce voiced sounds, and the frequency of this vocal cord vibration is the pitch frequency. Specifically, after windowing each preprocessed frame of source voice data, the cepstrum of that frame is computed, a length range for the pitch search is set, and the maximum of the cepstrum of the frame within that range is found. If the maximum is greater than the window threshold, the pitch frequency of the voiced sound is calculated from the maximum, and the logarithm of the pitch frequency is taken to reflect the characteristics of the voice data; if the maximum of the cepstrum is less than or equal to the window threshold, the frame of source voice data is silence or unvoiced.
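The cepstral pitch search described here can be sketched as follows (a minimal illustration; the 50-400 Hz search range and the voicing threshold are assumptions, not values from the original disclosure):

```python
import numpy as np

def cepstral_log_f0(frame, sr=16000, fmin=50.0, fmax=400.0, thresh=0.1):
    """Cepstrum-based pitch detection for one windowed frame (sketch)."""
    spectrum = np.fft.rfft(frame, n=2 * len(frame))
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-10))
    # Pitch search range expressed in quefrency (lag) samples.
    lo, hi = int(sr / fmax), int(sr / fmin)
    peak = np.argmax(cepstrum[lo:hi]) + lo
    if cepstrum[peak] <= thresh:
        return None                # silence or unvoiced frame
    return np.log(sr / peak)       # log F0 of the voiced frame
```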
For the aperiodic component feature parameters, an inverse Fourier transform is performed on the windowed signal of the source voice data to obtain the time-domain characteristics of the aperiodic components, and the frequency-domain characteristics of the aperiodic components are determined from the windowed signal of the source voice data and the minimum phase of the spectral features.
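In practice, log F0 and aperiodicity are often extracted together with a vocoder analysis library; the following is a sketch using the pyworld package (a tooling assumption of this edit, not a tool named in the original disclosure), with the 5 ms frame period from above:

```python
import numpy as np
import pyworld as pw

def world_features(signal, sr=16000):
    """F0, spectral envelope, and aperiodicity via WORLD analysis (sketch)."""
    x = signal.astype(np.float64)            # pyworld requires float64
    f0, t = pw.dio(x, sr, frame_period=5.0)  # coarse F0, 5 ms hop
    f0 = pw.stonemask(x, f0, t, sr)          # F0 refinement
    sp = pw.cheaptrick(x, f0, t, sr)         # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, sr)                # aperiodic components
    log_f0 = np.where(f0 > 0, np.log(f0), 0.0)
    return log_f0, sp, ap
```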
Step S203: input the first voice feature parameter into a voice conversion model, and output a second voice feature parameter after conversion, the second voice feature parameter being a feature parameter of the target voice data.
In a possible implementation, the voice conversion model is a model obtained by training a cycle-consistent adversarial network model on a sample training data set. The first voice feature parameter extracted from the source voice data is input into the voice conversion model, and after voice conversion the second voice feature parameter is output. The second voice feature parameter is the voice feature parameter most similar to the feature parameter of actual normal voice, that is, the feature parameter of the target voice data, where the target voice data is voice data with minimal or no disturbance.
In one embodiment, as shown in FIG. 3, a schematic flowchart of an iterative training method for an adversarial network model provided by another embodiment of the present application, the training of the voice conversion model includes the following steps.
Step S301: obtain random samples and actual samples from a voice sample training data set, and extract the random-sample feature parameter distribution of the random samples and the actual-sample feature parameter distribution of the actual samples, respectively.
In a possible implementation, two spontaneous speech data sets, such as the AMI meeting corpus and the Buckeye corpus of conversational speech, are used to analyze the influence of natural disturbances. From the two data sets, voice data from 40 female speakers and 30 male speakers is obtained. This voice data is used as the voice sample training data set, totaling 210 utterances, covering each gender and each type (normal speech, laughing speech, and creaky speech). Of these 210 utterances, 150 are used for training and 60 for testing; each utterance lasts 1-2 seconds. This set is used to train the voice conversion model.
Specifically, during the training of the voice conversion model, the cycle-consistent adversarial network model includes a generator and a discriminator. Random samples and actual samples are obtained from the voice sample training data set, the random-sample feature parameters and the actual-sample feature parameters are extracted, and the distribution of the random-sample feature parameters is used as the input of the generator.
Step S302: iteratively train the adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution.
In a possible implementation, the cycle-consistent adversarial network model includes a generator and a discriminator. From the random-sample feature parameters, the generator generates a pseudo-sample feature parameter distribution similar to the actual-sample feature parameter distribution. The pseudo-sample feature parameter distribution is input into the discriminator, which distinguishes the pseudo-sample distribution from the actual-sample feature parameter distribution.
Step S303: calculate the error output by the adversarial network model during iterative training according to a preset loss function.
In a possible implementation, the adversarial network model uses a preset loss function to calculate the error during iterative training, and the error is used as the target training value of the adversarial network model.
Step S304: when the error is less than or equal to a preset error threshold, stop training to obtain the voice conversion model.
In a possible implementation, when the error is less than or equal to the preset error threshold, the trained adversarial network model satisfies the conversion conditions and training stops, yielding the voice conversion model. Through the voice conversion model, disturbed voice feature parameters are converted into actual normal voice feature parameters, completing the conversion of non-parallel voice.
In one embodiment, iteratively training the adversarial network to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution includes the following, with a training-loop sketch after FIG. 4 below:
C1. Inputting the random-sample feature parameter distribution into the generator network of the adversarial network model to be trained, and generating a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution.
Specifically, the conversion from disturbed voice features to normal voice features is modeled with a cycle-consistent adversarial network model. Voice feature parameters are extracted from the obtained random samples, and the distribution of the extracted voice feature parameters (x ∈ X) is input into the generator, which generates the pseudo-sample feature parameter distribution G_X→Y(x). The first adversarial loss function L_adv(G_X→Y(x), Y) computes the distance between the pseudo-sample feature parameter distribution G_X→Y(x) and the actual-sample feature parameter distribution (y ∈ Y), realizing the conversion from disturbed voice to normal voice.
C2. Using the discriminator network of the adversarial network model to be trained to discriminate the pseudo-sample feature parameter distribution from the actual-sample feature parameter distribution, obtaining a discrimination-result feature distribution.
Specifically, the discriminator distinguishes the generated pseudo-sample features from the actual-sample features to obtain the discriminated result G_Y→X(y), and the second adversarial loss function L_adv(G_Y→X(y), X) computes the distance between the discrimination result and the random-sample features.
C3. Inputting the discrimination-result feature distribution into the generator network again to generate once more a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution, and using the discriminator network to discriminate again between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, obtaining a discrimination-result feature distribution.
C4. Cyclically and iteratively training the adversarial network model to be trained according to the random-sample feature parameter distribution, the actual-sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination-result feature distribution.
As shown in FIG. 4, a schematic diagram of the network structure of the adversarial network model provided by an embodiment of the present application, the adversarial network model includes a generator G and a discriminator D. The generator G generates the pseudo-sample feature parameter distribution G(x); the pseudo-sample feature parameter distribution and the feature distribution of the actual samples are input into the discriminator, which performs the discrimination and obtains a discrimination result; the discrimination result is then fed back to the generator G or the discriminator D, so that the adversarial network model is trained cyclically.
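The following is a minimal sketch of one such training cycle in PyTorch (an illustrative assumption; the original disclosure does not name a framework). G_xy, G_yx, D_x, and D_y denote the two generators and two discriminators of a CycleGAN-style model; the loss weights are illustrative values:

```python
import torch
import torch.nn.functional as F

def train_step(G_xy, G_yx, D_x, D_y, x, y, opt_g, opt_d,
               lam_cyc=10.0, lam_id=5.0):
    """One cyclic training iteration (sketch); x: disturbed, y: normal."""
    # --- Generator update: adversarial + cycle + identity terms ---
    fake_y, fake_x = G_xy(x), G_yx(y)
    loss_adv = F.mse_loss(D_y(fake_y), torch.ones_like(D_y(fake_y))) \
             + F.mse_loss(D_x(fake_x), torch.ones_like(D_x(fake_x)))
    loss_cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)
    loss_id = F.l1_loss(G_yx(x), x) + F.l1_loss(G_xy(y), y)
    loss_g = loss_adv + lam_cyc * loss_cyc + lam_id * loss_id
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # --- Discriminator update: real samples vs. generated pseudo-samples ---
    loss_d = F.mse_loss(D_y(y), torch.ones_like(D_y(y))) \
           + F.mse_loss(D_y(fake_y.detach()),
                        torch.zeros_like(D_y(fake_y.detach()))) \
           + F.mse_loss(D_x(x), torch.ones_like(D_x(x))) \
           + F.mse_loss(D_x(fake_x.detach()),
                        torch.zeros_like(D_x(fake_x.detach())))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()
```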
The generator and discriminator networks of the voice conversion model are each composed of convolution blocks. The generator network consists of 9 convolution blocks: one stride-1 convolution block, one stride-2 convolution block, 5 residual blocks, one 1/2-stride convolution block, and one stride-1 convolution block. To preserve the temporal structure, all convolutional layers are one-dimensional. Gated linear units serve as the activation function of the convolutional layers and have achieved state-of-the-art performance in language and speech modeling. The discriminator network consists of four two-dimensional convolution blocks, with gated linear units as the activation function of all convolution blocks; for the discriminator network, a 6×6 patch GAN is used to classify each 6×6 patch as real or fake.
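A minimal PyTorch sketch of this architecture follows; the channel counts, kernel sizes, and up/downsampling details are assumptions for illustration, and only the block layout and GLU activations follow the description above:

```python
import torch
import torch.nn as nn

class GLUConv1d(nn.Module):
    """1-D convolution followed by a gated linear unit activation."""
    def __init__(self, c_in, c_out, k, stride=1):
        super().__init__()
        # Twice the output channels: half of them gate the GLU.
        self.conv = nn.Conv1d(c_in, 2 * c_out, k, stride, padding=k // 2)
    def forward(self, x):
        return nn.functional.glu(self.conv(x), dim=1)

class ResBlock1d(nn.Module):
    def __init__(self, c, k=3):
        super().__init__()
        self.body = GLUConv1d(c, c, k)
    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """9 blocks: stride-1, stride-2 (down), 5 residual, 1/2-stride (up), stride-1."""
    def __init__(self, feat_dim=24, c=128):
        super().__init__()
        self.down = nn.Sequential(GLUConv1d(feat_dim, c, 15, stride=1),
                                  GLUConv1d(c, 2 * c, 5, stride=2))
        self.res = nn.Sequential(*[ResBlock1d(2 * c) for _ in range(5)])
        self.up = nn.ConvTranspose1d(2 * c, c, 5, stride=2,
                                     padding=2, output_padding=1)
        self.out = nn.Conv1d(c, feat_dim, 15, stride=1, padding=7)
    def forward(self, x):            # x: (batch, feat_dim, frames)
        return self.out(self.up(self.res(self.down(x))))

class Discriminator(nn.Module):
    """Four 2-D convolution blocks with GLU, ending in a PatchGAN score map."""
    def __init__(self, c=64):
        super().__init__()
        def block(ci, co, stride):
            return nn.Sequential(
                nn.Conv2d(ci, 2 * co, 3, stride, padding=1), nn.GLU(dim=1))
        self.net = nn.Sequential(block(1, c, 2), block(c, 2 * c, 2),
                                 block(2 * c, 4 * c, 2),
                                 nn.Conv2d(4 * c, 1, 3, 1, padding=1))
    def forward(self, x):            # x: (batch, 1, feat_dim, frames)
        return self.net(x)           # patch-wise real/fake scores
```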
In one embodiment, calculating the error output by the adversarial network model during iterative training according to the preset loss function includes:
D1. Deriving the cycle-consistency loss function and the identity-mapping loss function of the adversarial network model from the first adversarial loss function and the second adversarial loss function, where the first adversarial loss function is the loss function computing the distance between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, and the second adversarial loss function is the loss function computing the distance between the discrimination-result feature distribution and the random-sample feature distribution.
In a possible implementation, the first adversarial loss function L_adv(G_X→Y(x), Y) computes the distance between the pseudo-sample feature parameter distribution G_X→Y(x) and the actual-sample feature parameter distribution (y ∈ Y), and the second adversarial loss function L_adv(G_Y→X(y), X) computes the distance between the discrimination result and the random-sample features. From the first adversarial loss function L_adv(G_X→Y(x), Y) and the second adversarial loss function L_adv(G_Y→X(y), X), the cycle-consistency loss function is obtained as L_cyc = E_x||G_Y→X(G_X→Y(x)) − x||_1 + E_y||G_X→Y(G_Y→X(y)) − y||_1, and the identity-mapping loss function as L_id = E_x||G_Y→X(x) − x||_1 + E_y||G_X→Y(y) − y||_1. The cycle-consistency loss function L_cyc preserves the contextual information in the voice features during computation, and the identity-mapping loss function L_id preserves the important voice information of the voice data during the conversion process.
D2. Obtaining the preset loss function of the adversarial network model from the cycle-consistency loss function and the identity-mapping loss function.
In a possible implementation, from the cycle-consistency loss function and the identity-mapping loss function, the preset loss function of the adversarial network model is obtained as L = L_adv(G_X→Y(x), Y) + L_adv(G_Y→X(y), X) + λ_cyc·L_cyc + λ_id·L_id, where λ_cyc and λ_id are hyperparameters that control the relative importance of the cycle-consistency and identity-mapping loss functions within the preset loss function.
D3. The adversarial network model outputs the error calculated by the preset loss function, and the error is used as the target training value.
In a possible implementation, the error is used as the target training value; when the value of the complete loss function is minimized, the training of the voice conversion model is completed, yielding the voice conversion model.
Step S204: synthesize the target voice data according to the second voice feature parameter, and use the target voice data as the input of a speech recognition model for speech recognition.
In one embodiment, synthesizing the target voice data according to the second voice feature parameter includes:
synthesizing, according to the second voice feature parameter, target voice data with no disturbance or with minimal disturbance features, using waveform concatenation and a time-domain pitch-synchronous overlap-add algorithm.
In a possible implementation, the target voice data is synthesized according to the second voice feature parameter; for example, based on the second voice feature parameter, waveform concatenation is used and a time-domain pitch-synchronous overlap-add (TD-PSOLA) algorithm is applied to synthesize a voice signal containing the target feature parameters.
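A minimal overlap-add sketch of this synthesis step follows; this is an illustration of plain windowed overlap-add, not the full TD-PSOLA algorithm of the disclosure, which additionally aligns the windows to pitch marks:

```python
import numpy as np

def overlap_add(frames, hop):
    """Reassemble windowed frames into a waveform by overlap-add (sketch)."""
    win = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + win)
    norm = np.zeros_like(out)
    window = np.hanning(win)
    for i, frame in enumerate(frames):
        s = i * hop
        out[s:s + win] += frame * window
        norm[s:s + win] += window ** 2   # track window overlap energy
    return out / np.maximum(norm, 1e-8)  # normalize for perfect overlap
```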
Further, the synthesized voice data is used as the input of the speech recognition model for speech recognition. Specifically, in actual application, a concrete speech recognition system was tested with laughing voice (voice disturbed by emotion) and creaky voice (voice disturbed by voice quality), both with and without the front-end processing method proposed in the present application. Performance is evaluated by word error rate (WER) and sentence error rate (SER); lower WER and SER values indicate better performance. As can be seen from the experimental test data in Table 1 below, modeling with spectral features and aperiodic components (i.e., MFB + AP) performs better in the proposed front-end than modeling MFB alone.
[Table 1: WER and SER of the tested ASR systems with and without the proposed front-end; rendered as an image in the original publication]
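For reference, WER is the word-level Levenshtein edit distance between the recognized and reference transcripts divided by the reference length; SER is the fraction of sentences containing any error, and the CER discussed below is the same computation at character level. A minimal sketch:

```python
def word_error_rate(ref, hyp):
    """Levenshtein distance over words divided by reference length (sketch)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```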
The ASR performance shown in Table 1 is affected by the strength of the language model used by each ASR system. To examine ASR performance without the influence of a language model, a Deep Speech model that converts speech into English character sequences was tested; Table 2 shows the character error rate (CER) performance with and without the front-end voice conversion model. The model was trained on 1,000 hours of LibriSpeech data, and no language model was used for decoding. As can be seen from Table 2, front-end processing with the voice conversion model reduces the character error rate (CER) of the Deep Speech model.
[Table 2: CER of the Deep Speech model with and without the front-end voice conversion model; rendered as an image in the original publication]
In addition, two-dimensional t-SNE projections of the Mel filter bank features were produced for normal voice, laughter-disturbed voice, and laughter-disturbed voice converted into normal voice by the front-end processing method of this embodiment based on the adversarial network model (CycleGAN). It can be seen that the filter bank output features of normal voice and of the converted voice are very similar, and differ significantly from the filter bank output features of laughing voice. The voice conversion model of this embodiment can therefore capture the distributions of the Mel filter bank outputs of normal and laughter-disturbed voice, and can convert laughter-disturbed voice into equivalent normal voice.
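The visualization described here can be reproduced with a sketch like the following (assuming scikit-learn and matplotlib, which are not named in the original disclosure):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_mfb_tsne(normal, laughter, converted):
    """2-D t-SNE projection of Mel filter bank features (sketch).

    Each argument is an array of shape (frames, 24) of MFB outputs."""
    feats = np.vstack([normal, laughter, converted])
    emb = TSNE(n_components=2, random_state=0).fit_transform(feats)
    splits = np.cumsum([len(normal), len(laughter)])
    for part, label in zip(np.split(emb, splits),
                           ["normal", "laughter", "converted"]):
        plt.scatter(part[:, 0], part[:, 1], s=4, label=label)
    plt.legend()
    plt.show()
```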
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Through this embodiment, an original voice signal is acquired and preprocessed in a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain a first voice feature parameter of the source voice data, the first voice feature parameter being an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input into a voice conversion model, which outputs a second voice feature parameter after conversion, the second voice feature parameter being a feature parameter of target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a speech recognition model for speech recognition. Before speech recognition is performed, the original voice signal is preprocessed and its voice feature parameters are converted. This voice conversion filters out natural disturbances in the original voice data: the feature parameters of source voice data carrying disturbance features are converted into the feature parameters of undisturbed natural voice data, and the corresponding undisturbed voice data is synthesized as the input for speech recognition. By visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data, non-parallel conversion of voice data is achieved, improving the robustness and accuracy of speech recognition.
Corresponding to the front-end processing method for speech recognition described in the foregoing embodiments, FIG. 5 shows a structural block diagram of the front-end processing apparatus for speech recognition provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 5, the apparatus includes:
an acquisition unit 51, configured to acquire an original voice signal and preprocess the original voice signal in a preset format to obtain source voice data;
a feature extraction unit 52, configured to perform voice feature extraction on the source voice data to obtain a first voice feature parameter of the source voice data, the first voice feature parameter being an acoustic feature parameter describing the timbre and prosody of the voice;
a data processing unit 53, configured to input the first voice feature parameter into a voice conversion model and output a second voice feature parameter after conversion, the second voice feature parameter being a feature parameter of target voice data; and
a synthesis unit 54, configured to synthesize the target voice data according to the second voice feature parameter and use the target voice data as the input of a speech recognition model for speech recognition.
Optionally, the acquisition unit includes:
a filtering module, configured to filter the original voice signal;
a sampling module, configured to periodically sample the filtered voice signal to obtain voice sampling data at a preset frequency; and
a processing module, configured to perform windowing and framing on the voice sampling data to obtain the source voice data.
Optionally, the feature extraction unit is further configured to extract Mel spectrum feature parameters, logarithmic fundamental frequency feature parameters, and aperiodic component feature parameters of the source voice data through a Mel filter bank, and to obtain the parameter distributions corresponding to the Mel spectrum feature parameters, logarithmic fundamental frequency feature parameters, and aperiodic component feature parameters of the source voice data.
Optionally, the front-end processing apparatus for speech recognition further includes:
a sample data acquisition unit, configured to obtain random samples and actual samples from a voice sample training data set, and to extract the random-sample feature parameter distribution of the random samples and the actual-sample feature parameter distribution of the actual samples, respectively;
a model training unit, configured to iteratively train the adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution;
an error calculation unit, configured to calculate the error output by the adversarial network model during iterative training according to a preset loss function; and
a model generation unit, configured to stop training when the error is less than or equal to a preset error threshold, obtaining the voice conversion model.
Optionally, the model training unit includes:
a generator network, configured to receive the random-sample feature parameter distribution input into the generator network of the adversarial network model to be trained and to generate a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution;
a discriminator network, configured to discriminate, through the discriminator network of the adversarial network model to be trained, the pseudo-sample feature parameter distribution from the actual-sample feature parameter distribution, obtaining a discrimination-result feature distribution;
a cyclic training module, configured to input the discrimination-result feature distribution into the generator network again, generate once more a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution, and discriminate again, through the discriminator network, between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, obtaining a discrimination-result feature distribution; and
an iterative training module, configured to cyclically and iteratively train the adversarial network model to be trained according to the random-sample feature parameter distribution, the actual-sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination-result feature distribution.
Optionally, the error calculation unit includes:
a first calculation module, configured to derive the cycle-consistency loss function and the identity-mapping loss function of the adversarial network model from the first adversarial loss function and the second adversarial loss function, where the first adversarial loss function is the loss function computing the distance between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, and the second adversarial loss function is the loss function computing the distance between the discrimination-result feature distribution and the random-sample feature distribution;
a second calculation module, configured to obtain the preset loss function of the adversarial network model from the cycle-consistency loss function and the identity-mapping loss function; and
a target training value calculation module, configured for the adversarial network model to output the error calculated by the preset loss function and to use the error as the target training value.
Optionally, the synthesis unit is further configured to synthesize, according to the second voice feature parameter, target voice data with no disturbance or with minimal disturbance features, using waveform concatenation and a time-domain pitch-synchronous overlap-add algorithm.
Through this embodiment, an original voice signal is acquired and preprocessed in a preset format to obtain source voice data; voice feature extraction is performed on the source voice data to obtain a first voice feature parameter of the source voice data, the first voice feature parameter being an acoustic feature parameter describing the timbre and prosody of the voice; the first voice feature parameter is input into a voice conversion model, which outputs a second voice feature parameter after conversion, the second voice feature parameter being a feature parameter of target voice data; and the target voice data is synthesized according to the second voice feature parameter and used as the input of a speech recognition model for speech recognition. Before speech recognition is performed, the original voice signal is preprocessed and its voice feature parameters are converted, so that natural disturbances in the original voice data are filtered out: the feature parameters of source voice data carrying disturbance features are converted into the feature parameters of undisturbed natural voice data, and the corresponding undisturbed voice data is synthesized as the input for speech recognition. By visualizing the first voice feature parameter of the disturbed source voice data and the second voice feature parameter of the converted voice data, non-parallel conversion of voice data is achieved, improving the robustness and accuracy of speech recognition.
It should be noted that, since the information exchange and execution processes between the foregoing apparatuses/units are based on the same concept as the method embodiments of the present application, reference may be made to the method embodiment section for their specific functions and technical effects, which are not repeated here.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the foregoing functional units and modules is used as an example. In practical applications, the foregoing functions can be allocated to different functional units and modules as needed, that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from one another, and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the foregoing system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
An embodiment of the present application also provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in each of the foregoing method embodiments are implemented.
An embodiment of the present application provides a computer program product which, when run on a mobile terminal, causes the mobile terminal to implement the steps in each of the foregoing method embodiments when executed.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the foregoing embodiments of the present application can be accomplished by instructing relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, can implement the steps of each of the foregoing method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electric carrier signals and telecommunications signals.
FIG. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 6, the terminal device 6 of this embodiment includes: at least one processor 60 (only one is shown in FIG. 6), a memory 61, and a computer program 62 stored in the memory 61 and executable on the at least one processor 60. When the processor 60 executes the computer program 62, the steps in any of the foregoing embodiments of the front-end processing method for speech recognition are implemented.
The terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art can understand that FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6, which may include more or fewer components than shown in the figure, a combination of certain components, or different components; for example, it may also include input and output devices, network access devices, and the like.
The so-called processor 60 may be a central processing unit (CPU), and the processor 60 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
In some embodiments, the memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or memory of the terminal device 6. In other embodiments, the memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the terminal device 6. Further, the memory 61 may include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used to store an operating system, application programs, a boot loader, data, and other programs, such as the program code of the computer program. The memory 61 may also be used to temporarily store data that has been output or will be output.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts not detailed or described in one embodiment, reference may be made to the related descriptions of other embodiments.
A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are only illustrative; for example, the division of the modules or units is only a logical functional division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
The foregoing embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of the technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and should all be included within the protection scope of the present application.

Claims (20)

  1. A front-end processing method for speech recognition, comprising:
    acquiring an original speech signal, and preprocessing the original speech signal according to a preset format to obtain source speech data;
    performing speech feature extraction on the source speech data to obtain a first speech feature parameter of the source speech data, the first speech feature parameter being an acoustic feature parameter describing the timbre and prosody of the speech;
    inputting the first speech feature parameter into a speech conversion model, and outputting, after conversion, a second speech feature parameter, the second speech feature parameter being a feature parameter of target speech data;
    synthesizing the target speech data according to the second speech feature parameter, and using the target speech data as an input of a speech recognition model to perform speech recognition.
  2. The front-end processing method for speech recognition according to claim 1, wherein acquiring the original speech signal and preprocessing the original speech signal according to the preset format to obtain the source speech data comprises:
    filtering the original speech signal;
    periodically sampling the filtered speech signal to obtain speech sample data at a preset frequency;
    performing windowing and framing on the speech sample data to obtain the source speech data.
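For illustration, a minimal Python sketch of the preprocessing chain of claim 2 follows. The concrete choices (a first-order pre-emphasis filter as the filtering step, a 16 kHz preset sampling frequency, 25 ms Hamming windows with a 10 ms hop) are the editor's assumptions; the claim itself fixes none of them.

```python
import numpy as np
from math import gcd
from scipy import signal

def preprocess(raw: np.ndarray, in_rate: int, target_rate: int = 16000,
               frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Filter -> periodically sample at a preset frequency -> window and frame."""
    # Filtering: a first-order pre-emphasis filter (one common choice).
    emphasized = signal.lfilter([1.0, -0.97], [1.0], raw)
    # Periodic sampling at the preset frequency via polyphase resampling.
    g = gcd(target_rate, in_rate)
    resampled = signal.resample_poly(emphasized, target_rate // g, in_rate // g)
    # Windowing and framing (assumes at least one full frame of audio).
    frame_len = int(target_rate * frame_ms / 1000)
    hop_len = int(target_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(resampled) - frame_len) // hop_len
    return np.stack([resampled[i * hop_len:i * hop_len + frame_len] * window
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)
```

Paired with the overlap-add sketch after claim 7, this framing is invertible up to the usual window normalization.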
  3. The front-end processing method for speech recognition according to claim 1, wherein performing speech feature extraction on the source speech data to obtain the first speech feature parameter of the source speech data comprises:
    extracting a Mel-spectrum feature parameter, a logarithmic fundamental-frequency feature parameter, and an aperiodic-component feature parameter of the source speech data through a Mel filter bank;
    obtaining the parameter distributions corresponding to the Mel-spectrum feature parameter, the logarithmic fundamental-frequency feature parameter, and the aperiodic-component feature parameter of the source speech data.
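A sketch of the feature extraction of claim 3, assuming WORLD-style analysis through the `pyworld` package and an 80-band Mel filter bank from `librosa`. The packages, the band count, and the log floor are the editor's assumptions, not identifiers from the application.

```python
import numpy as np
import librosa
import pyworld as pw  # pip install pyworld

def extract_features(wave: np.ndarray, rate: int = 16000):
    """Return the three feature parameters named in claim 3."""
    x = wave.astype(np.float64)
    f0, t = pw.harvest(x, rate)         # fundamental-frequency track
    sp = pw.cheaptrick(x, f0, t, rate)  # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, rate)         # aperiodic-component feature
    # Mel-spectrum feature: project the envelope through a Mel filter bank.
    mel_fb = librosa.filters.mel(sr=rate, n_fft=(sp.shape[1] - 1) * 2, n_mels=80)
    mel_spectrum = np.log(np.maximum(sp @ mel_fb.T, 1e-10))
    # Logarithmic fundamental-frequency feature (unvoiced frames left at 0).
    log_f0 = np.where(f0 > 0, np.log(f0), 0.0)
    return mel_spectrum, log_f0, ap
```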
  4. The front-end processing method for speech recognition according to claim 1, wherein the step of training the speech conversion model comprises:
    acquiring random samples and actual samples from a speech-sample training data set, and extracting a random-sample feature parameter distribution of the random samples and an actual-sample feature parameter distribution of the actual samples, respectively;
    iteratively training an adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution;
    calculating, according to a preset loss function, the error output by the adversarial network model during the iterative training;
    stopping the training when the error is less than or equal to a preset error threshold, to obtain the speech conversion model.
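A minimal PyTorch sketch of the training loop of claim 4: iterate over actual-sample batches, draw random-sample distributions, and stop once the error falls to the preset threshold. The toy fully connected networks, the binary cross-entropy stand-in for the preset loss function, and all hyperparameters are the editor's assumptions.

```python
import torch
import torch.nn as nn

def train_speech_conversion(real_batches, feat_dim: int,
                            error_threshold: float = 0.05) -> nn.Module:
    """Iteratively train a small GAN and stop at the preset error threshold."""
    gen = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
    disc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
    g_opt = torch.optim.Adam(gen.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    for real in real_batches:            # actual-sample feature distributions
        noise = torch.randn_like(real)   # random-sample feature distributions
        fake = gen(noise)                # pseudo-sample feature distributions

        # Discriminator step: separate actual samples from pseudo samples.
        d_loss = (bce(disc(real), torch.ones(len(real), 1)) +
                  bce(disc(fake.detach()), torch.zeros(len(real), 1)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator step: this error plays the role of the preset-loss output.
        g_loss = bce(disc(fake), torch.ones(len(real), 1))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()

        if g_loss.item() <= error_threshold:  # stop condition of claim 4
            break
    return gen  # the trained speech conversion model
```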
  5. The front-end processing method for speech recognition according to claim 4, wherein iteratively training the adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution comprises:
    inputting the random-sample feature parameter distribution into a generator network of the adversarial network model to be trained, to generate a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution;
    discriminating between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through a discriminator network of the adversarial network model to be trained, to obtain a discrimination-result feature distribution;
    inputting the discrimination-result feature distribution into the generator network again to generate another pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution, and discriminating again between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through the discriminator network, to obtain a new discrimination-result feature distribution;
    performing cyclic iterative training on the adversarial network model to be trained according to the random-sample feature parameter distribution, the actual-sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination-result feature distribution.
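Feeding the discrimination-result feature distribution back into the generator, as claim 5 describes, admits more than one reading. The sketch below takes one plausible interpretation, adding the discriminator's first-pass score onto the generator input before a second pass; the feedback rule and all names are the editor's assumptions.

```python
import torch
import torch.nn as nn

def cyclic_training_step(gen: nn.Module, disc: nn.Module,
                         random_feats: torch.Tensor):
    """One round trip of the cyclic scheme sketched in claim 5."""
    fake_1 = gen(random_feats)   # pseudo-sample distribution, first pass
    score_1 = disc(fake_1)       # discrimination-result distribution

    # Feed the discrimination result back into the generator: here the
    # per-sample score is broadcast onto the original random input.
    fake_2 = gen(random_feats + score_1)
    score_2 = disc(fake_2)       # discrimination result, second pass
    return fake_1, fake_2, score_1, score_2
```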
  6. The front-end processing method for speech recognition according to claim 5, wherein calculating, according to the preset loss function, the error output by the adversarial network model during the iterative training comprises:
    deriving a cycle-consistency loss function and an identity-mapping loss function of the adversarial network model from a first adversarial loss function and a second adversarial loss function, wherein the first adversarial loss function is a loss function measuring the distance between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, and the second adversarial loss function is a loss function measuring the distance between the discrimination-result feature distribution and the random-sample feature distribution;
    obtaining the preset loss function of the adversarial network model according to the cycle-consistency loss function and the identity-mapping loss function;
    outputting, by the adversarial network model, the error calculated by the preset loss function, and using the error as a target training value.
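The cycle-consistency and identity-mapping terms of claim 6 suggest a CycleGAN-style formulation. In the sketch below, the forward and inverse generators `g_xy`/`g_yx`, the discriminator `d_y`, the least-squares adversarial terms, and the lambda weights are assumptions; the claim names only the loss components, and in practice the generator and discriminator terms would be optimised separately rather than summed.

```python
import torch
import torch.nn as nn

l1, mse = nn.L1Loss(), nn.MSELoss()

def preset_loss(g_xy, g_yx, d_y, x, y, lam_cyc: float = 10.0, lam_id: float = 5.0):
    """Combined preset loss of claim 6 in a CycleGAN-style form."""
    fake_y = g_xy(x)
    # Generator-side adversarial term: pull the pseudo-sample distribution
    # toward the actual-sample distribution (least-squares GAN form).
    score_fake = d_y(fake_y)
    adv_gen = mse(score_fake, torch.ones_like(score_fake))
    # Discriminator-side adversarial term on actual and detached pseudo samples.
    score_real, score_fake_d = d_y(y), d_y(fake_y.detach())
    adv_disc = (mse(score_real, torch.ones_like(score_real)) +
                mse(score_fake_d, torch.zeros_like(score_fake_d)))
    # Cycle-consistency: mapping forward and back should recover the input.
    cyc = l1(g_yx(fake_y), x)
    # Identity mapping: a target-domain input should pass through unchanged.
    ident = l1(g_xy(y), y)
    return adv_gen + adv_disc + lam_cyc * cyc + lam_id * ident
```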
  7. The front-end processing method for speech recognition according to claim 1, wherein synthesizing the target speech data according to the second speech feature parameter comprises:
    synthesizing, according to the second speech feature parameter, target speech data that is free of perturbation or has minimal perturbation features, using waveform concatenation and a time-domain pitch-synchronous overlap-add (TD-PSOLA) algorithm.
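Claim 7 names waveform concatenation together with TD-PSOLA. The sketch below shows only the overlap-add core with a fixed hop; genuine TD-PSOLA additionally places each windowed segment at pitch-synchronous instants, which is omitted here. It inverts the framing of the preprocessing sketch after claim 2.

```python
import numpy as np

def overlap_add_synthesis(frames: np.ndarray, hop_len: int) -> np.ndarray:
    """Weighted overlap-add reconstruction of a framed signal."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop_len + frame_len)
    norm = np.zeros_like(out)
    window = np.hamming(frame_len)
    for i, frame in enumerate(frames):
        start = i * hop_len
        out[start:start + frame_len] += frame * window  # synthesis window
        norm[start:start + frame_len] += window ** 2    # analysis x synthesis
    return out / np.maximum(norm, 1e-8)  # undo the squared-window weighting
```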
  8. A front-end processing apparatus for speech recognition, comprising:
    an acquisition unit configured to acquire an original speech signal and preprocess the original speech signal according to a preset format to obtain source speech data;
    a feature extraction unit configured to perform speech feature extraction on the source speech data to obtain a first speech feature parameter of the source speech data, the first speech feature parameter being an acoustic feature parameter describing the timbre and prosody of the speech;
    a data processing unit configured to input the first speech feature parameter into a speech conversion model and output, after conversion, a second speech feature parameter, the second speech feature parameter being a feature parameter of target speech data;
    a synthesis unit configured to synthesize the target speech data according to the second speech feature parameter and use the target speech data as an input of a speech recognition model to perform speech recognition.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
    acquiring an original speech signal, and preprocessing the original speech signal according to a preset format to obtain source speech data;
    performing speech feature extraction on the source speech data to obtain a first speech feature parameter of the source speech data, the first speech feature parameter being an acoustic feature parameter describing the timbre and prosody of the speech;
    inputting the first speech feature parameter into a speech conversion model, and outputting, after conversion, a second speech feature parameter, the second speech feature parameter being a feature parameter of target speech data;
    synthesizing the target speech data according to the second speech feature parameter, and using the target speech data as an input of a speech recognition model to perform speech recognition.
  10. The terminal device according to claim 9, wherein the processor, when executing the computer program, further implements:
    filtering the original speech signal;
    periodically sampling the filtered speech signal to obtain speech sample data at a preset frequency;
    performing windowing and framing on the speech sample data to obtain the source speech data.
  11. The terminal device according to claim 9, wherein the processor, when executing the computer program, further implements:
    extracting a Mel-spectrum feature parameter, a logarithmic fundamental-frequency feature parameter, and an aperiodic-component feature parameter of the source speech data through a Mel filter bank;
    obtaining the parameter distributions corresponding to the Mel-spectrum feature parameter, the logarithmic fundamental-frequency feature parameter, and the aperiodic-component feature parameter of the source speech data.
  12. The terminal device according to claim 9, wherein the processor, when executing the computer program, further implements:
    acquiring random samples and actual samples from a speech-sample training data set, and extracting a random-sample feature parameter distribution of the random samples and an actual-sample feature parameter distribution of the actual samples, respectively;
    iteratively training an adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution;
    calculating, according to a preset loss function, the error output by the adversarial network model during the iterative training;
    stopping the training when the error is less than or equal to a preset error threshold, to obtain the speech conversion model.
  13. The terminal device according to claim 12, wherein the processor, when executing the computer program, further implements:
    inputting the random-sample feature parameter distribution into a generator network of the adversarial network model to be trained, to generate a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution;
    discriminating between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through a discriminator network of the adversarial network model to be trained, to obtain a discrimination-result feature distribution;
    inputting the discrimination-result feature distribution into the generator network again to generate another pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution, and discriminating again between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through the discriminator network, to obtain a new discrimination-result feature distribution;
    performing cyclic iterative training on the adversarial network model to be trained according to the random-sample feature parameter distribution, the actual-sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination-result feature distribution.
  14. The terminal device according to claim 13, wherein the processor, when executing the computer program, further implements:
    deriving a cycle-consistency loss function and an identity-mapping loss function of the adversarial network model from a first adversarial loss function and a second adversarial loss function, wherein the first adversarial loss function is a loss function measuring the distance between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution, and the second adversarial loss function is a loss function measuring the distance between the discrimination-result feature distribution and the random-sample feature distribution;
    obtaining the preset loss function of the adversarial network model according to the cycle-consistency loss function and the identity-mapping loss function;
    outputting, by the adversarial network model, the error calculated by the preset loss function, and using the error as a target training value.
  15. The terminal device according to claim 9, wherein the processor, when executing the computer program, further implements:
    synthesizing, according to the second speech feature parameter, target speech data that is free of perturbation or has minimal perturbation features, using waveform concatenation and a time-domain pitch-synchronous overlap-add (TD-PSOLA) algorithm.
  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements:
    acquiring an original speech signal, and preprocessing the original speech signal according to a preset format to obtain source speech data;
    performing speech feature extraction on the source speech data to obtain a first speech feature parameter of the source speech data, the first speech feature parameter being an acoustic feature parameter describing the timbre and prosody of the speech;
    inputting the first speech feature parameter into a speech conversion model, and outputting, after conversion, a second speech feature parameter, the second speech feature parameter being a feature parameter of target speech data;
    synthesizing the target speech data according to the second speech feature parameter, and using the target speech data as an input of a speech recognition model to perform speech recognition.
  17. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    filtering the original speech signal;
    periodically sampling the filtered speech signal to obtain speech sample data at a preset frequency;
    performing windowing and framing on the speech sample data to obtain the source speech data.
  18. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    extracting a Mel-spectrum feature parameter, a logarithmic fundamental-frequency feature parameter, and an aperiodic-component feature parameter of the source speech data through a Mel filter bank;
    obtaining the parameter distributions corresponding to the Mel-spectrum feature parameter, the logarithmic fundamental-frequency feature parameter, and the aperiodic-component feature parameter of the source speech data.
  19. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    acquiring random samples and actual samples from a speech-sample training data set, and extracting a random-sample feature parameter distribution of the random samples and an actual-sample feature parameter distribution of the actual samples, respectively;
    iteratively training an adversarial network model to be trained according to the random-sample feature parameter distribution and the actual-sample feature parameter distribution;
    calculating, according to a preset loss function, the error output by the adversarial network model during the iterative training;
    stopping the training when the error is less than or equal to a preset error threshold, to obtain the speech conversion model.
  20. The computer-readable storage medium according to claim 19, wherein the computer program, when executed by the processor, further implements:
    inputting the random-sample feature parameter distribution into a generator network of the adversarial network model to be trained, to generate a pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution;
    discriminating between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through a discriminator network of the adversarial network model to be trained, to obtain a discrimination-result feature distribution;
    inputting the discrimination-result feature distribution into the generator network again to generate another pseudo-sample feature parameter distribution corresponding to the actual-sample feature parameter distribution, and discriminating again between the pseudo-sample feature parameter distribution and the actual-sample feature parameter distribution through the discriminator network, to obtain a new discrimination-result feature distribution;
    performing cyclic iterative training on the adversarial network model to be trained according to the random-sample feature parameter distribution, the actual-sample feature parameter distribution, the pseudo-sample feature parameter distribution, and the discrimination-result feature distribution.
PCT/CN2020/135511 2020-03-11 2020-12-11 Speech recognition front-end processing method and apparatus, and terminal device WO2021179717A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010165112.8 2020-03-11
CN202010165112.8A CN111445900A (en) 2020-03-11 2020-03-11 Front-end processing method and device for voice recognition and terminal equipment

Publications (1)

Publication Number Publication Date
WO2021179717A1 (en)

Family

ID=71650573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/135511 WO2021179717A1 (en) 2020-03-11 2020-12-11 Speech recognition front-end processing method and apparatus, and terminal device

Country Status (2)

Country Link
CN (1) CN111445900A (en)
WO (1) WO2021179717A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445900A (en) * 2020-03-11 2020-07-24 平安科技(深圳)有限公司 Front-end processing method and device for voice recognition and terminal equipment
CN112397057A (en) * 2020-12-01 2021-02-23 平安科技(深圳)有限公司 Voice processing method, device, equipment and medium based on generation countermeasure network
CN112652318B (en) * 2020-12-21 2024-03-29 北京捷通华声科技股份有限公司 Tone color conversion method and device and electronic equipment
CN112767927A (en) * 2020-12-29 2021-05-07 平安科技(深圳)有限公司 Method, device, terminal and storage medium for extracting voice features
CN113555026B (en) * 2021-07-23 2024-04-19 平安科技(深圳)有限公司 Voice conversion method, device, electronic equipment and medium
CN115064177A (en) * 2022-06-14 2022-09-16 中国第一汽车股份有限公司 Voice conversion method, apparatus, device and medium based on voiceprint encoder

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170133005A1 (en) * 2015-11-10 2017-05-11 Paul Wendell Mason Method and apparatus for using a vocal sample to customize text to speech applications
CN106782504A (en) * 2016-12-29 2017-05-31 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN107346659A (en) * 2017-06-05 2017-11-14 百度在线网络技术(北京)有限公司 Audio recognition method, device and terminal based on artificial intelligence
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109979436A (en) * 2019-04-12 2019-07-05 南京工程学院 A kind of BP neural network speech recognition system and method based on frequency spectrum adaptive method
CN111445900A (en) * 2020-03-11 2020-07-24 平安科技(深圳)有限公司 Front-end processing method and device for voice recognition and terminal equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882867A (en) * 2022-04-13 2022-08-09 天津大学 Deep network waveform synthesis method and device based on filter bank frequency discrimination
CN114882867B (en) * 2022-04-13 2024-05-28 天津大学 Depth network waveform synthesis method and device based on filter bank frequency discrimination
CN115620748A (en) * 2022-12-06 2023-01-17 北京远鉴信息技术有限公司 Comprehensive training method and device for speech synthesis and false discrimination evaluation

Also Published As

Publication number Publication date
CN111445900A (en) 2020-07-24

Similar Documents

Publication Publication Date Title
WO2021179717A1 (en) Speech recognition front-end processing method and apparatus, and terminal device
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
US20200321008A1 (en) Voiceprint recognition method and device based on memory bottleneck feature
CN106486131B (en) A kind of method and device of speech de-noising
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
WO2020224217A1 (en) Speech processing method and apparatus, computer device, and storage medium
CN110880329B (en) Audio identification method and equipment and storage medium
JP2019522810A (en) Neural network based voiceprint information extraction method and apparatus
EP2363852A1 (en) Computer-based method and system of assessing intelligibility of speech represented by a speech signal
Vyas A Gaussian mixture model based speech recognition system using Matlab
WO2021042537A1 (en) Voice recognition authentication method and system
CN109036437A (en) Accents recognition method, apparatus, computer installation and computer readable storage medium
CN113658583B (en) Ear voice conversion method, system and device based on generation countermeasure network
WO2020073509A1 (en) Neural network-based speech recognition method, terminal device, and medium
US11810546B2 (en) Sample generation method and apparatus
Müller et al. Contextual invariant-integration features for improved speaker-independent speech recognition
CN113646833A (en) Voice confrontation sample detection method, device, equipment and computer readable storage medium
CN113782032B (en) Voiceprint recognition method and related device
CN114927126A (en) Scheme output method, device and equipment based on semantic analysis and storage medium
CN113409771B (en) Detection method for forged audio frequency, detection system and storage medium thereof
CN113112992A (en) Voice recognition method and device, storage medium and server
CN110838294B (en) Voice verification method and device, computer equipment and storage medium
CN116665649A (en) Synthetic voice detection method based on prosody characteristics
CN114582373A (en) Method and device for recognizing user emotion in man-machine conversation
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20923715

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20923715

Country of ref document: EP

Kind code of ref document: A1