WO2020168752A1 - Speech recognition and speech synthesis method and apparatus based on dual learning - Google Patents

Speech recognition and speech synthesis method and apparatus based on dual learning Download PDF

Info

Publication number
WO2020168752A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
speech
voice
model
speech recognition
Application number
PCT/CN2019/117567
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
程宁
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020168752A1 publication Critical patent/WO2020168752A1/en

Classifications

    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • Y02T10/40 Engine management systems

Definitions

  • This application relates to the field of speech processing technology, and in particular to a method and device for speech recognition and speech synthesis based on dual learning.
  • The embodiments of the application provide a method and device for speech recognition and speech synthesis based on dual learning, which can effectively use dual learning for speech recognition and speech synthesis, improve the training speed of speech recognition and speech generation, and improve the accuracy of their output results.
  • The embodiment of the present application provides a speech recognition and speech synthesis method based on dual learning. The method includes the following steps:
  • Initialize the labeled data set Φ_(x,y), the speech recognition parameter θ_xy, the speech synthesis parameter θ_yx, and the training data size N, where Φ_(x,y) = {(x^(j), y^(j))}_K; the labeled data set Φ_(x,y) contains K pairs of labeled data; (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), x^(j) being the voice data and y^(j) the text data of that pair; K is a positive integer and N is a positive integer less than or equal to K. The subsequent steps, detailed below, select N pairs from Φ_(x,y), generate text from speech and speech from text, and jointly optimize θ_xy and θ_yx; a data-handling sketch follows this item.
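  • As a minimal sketch of the data handling this step describes (the container types, the scalar stand-ins for θ_xy and θ_yx, and all names are illustrative assumptions, not the patent's implementation):

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DualParams:
    theta_xy: float = 1.0  # speech recognition parameter; the text later initializes both to 1
    theta_yx: float = 1.0  # speech synthesis parameter

def init_training(labeled_set: List[Tuple[list, str]], n: int):
    """Initialize parameters and sample N of the K labeled (voice, text) pairs."""
    k = len(labeled_set)
    assert 0 < n <= k, "N must be a positive integer less than or equal to K"
    return DualParams(), random.sample(labeled_set, n)

# Toy usage: x^(j) is a dummy waveform (list of samples), y^(j) its transcript.
toy_set = [([0.0, 0.1, -0.1], "hello"), ([0.2, 0.0, 0.3], "world")]
params, batch = init_training(toy_set, n=1)
```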
  • The embodiment of the present application also provides a device for speech recognition and speech synthesis based on dual learning, which can realize the beneficial effects of the aforementioned method. The function of the device can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above-mentioned functions. Optionally, the device includes an initialization unit, a selection unit, a processing unit, a first generation unit, a second generation unit, and an optimization unit.
  • The selection unit is used to select N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y).
  • The processing unit is used to extract the acoustic features of the voice data x^(i) and, according to those acoustic features, obtain the posterior probability of the phonemes corresponding to x^(i) and the transition probability of those phonemes.
  • The first generation unit is used to generate text data ŷ^(i) according to the posterior probability and the transition probability of the phonemes corresponding to x^(i), and to calculate the first log likelihood that ŷ^(i) equals the text data y^(i).
  • The second generation unit is used to obtain the sound feature sequence corresponding to the text data y^(i), generate voice data x̂^(i) according to the sound sequence features, and calculate the second log likelihood that x̂^(i) equals the voice data x^(i).
  • The optimization unit is used to take maximizing the first log likelihood and the second log likelihood as the objective function for the N pairs of labeled data, take the probabilistic duality of speech recognition and speech synthesis as the constraint condition, and optimize θ_xy and θ_yx.
  • The embodiment of the present application also provides a server that can realize the beneficial effects of the above-mentioned speech recognition and speech synthesis method based on dual learning. The function of the server can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above-mentioned functions. The server includes a memory, a processor, and a transceiver: the memory is used to store a computer program, including program instructions, that supports the server in executing the above method; the processor is used to control and manage the actions of the server according to the program instructions; and the transceiver is used to support communication between the server and other communication devices.
  • The embodiment of the present application also provides a computer non-volatile readable storage medium that stores instructions which, when run on a processor, cause the processor to execute the aforementioned dual learning-based speech recognition and speech synthesis method.
  • In the embodiments of the present application, the posterior probability and the transition probability of the phonemes corresponding to the voice data x^(i) are obtained to generate text data ŷ^(i), and the sound feature sequence corresponding to y^(i) is obtained to generate voice data x̂^(i). For the N pairs of labeled data, the goal is to maximize the log likelihood that ŷ^(i) equals y^(i) and the log likelihood that x̂^(i) equals x^(i), with the probabilistic duality of speech recognition and speech synthesis as a constraint condition, thereby optimizing the speech recognition and speech synthesis effect. Dual learning is thus used effectively for speech recognition and speech synthesis, improving the training speed of speech recognition and speech generation and the accuracy of their output results.
  • FIG. 1 is a schematic structural diagram of a server provided by an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a method for speech recognition and speech synthesis based on dual learning provided by an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of a speech recognition and speech synthesis device based on dual learning provided by an embodiment of the present application.
  • Dual learning is a learning scheme that uses the duality between a pair of dual tasks to establish a feedback signal and uses this signal to constrain training. Duality is widespread in artificial intelligence tasks. For example, machine translation lets machines translate natural language from one language to another; Chinese-to-English and English-to-Chinese translation are dual tasks. Image recognition and image synthesis are likewise dual tasks: image recognition determines the category and specific information of a given picture, while image generation produces a corresponding picture given a category and specific information. Similarly, speech recognition and speech synthesis are dual tasks. Speech recognition is a technology that lets machines convert speech signals into corresponding text or commands through a process of recognition and understanding; speech synthesis is a technology that converts text, generated by the computer itself or input from outside, into speech by mechanical and electronic means.
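  • The shared structure can be stated compactly. For a recognition model f with parameters θ_xy and a synthesis model g with parameters θ_yx, probabilistic duality (formalized here from the description; it reappears as the constraint used later in this document) requires both conditional factorizations to describe the same joint distribution:

```latex
P(x)\,P(y \mid x; \theta_{xy}) \;=\; P(x, y) \;=\; P(y)\,P(x \mid y; \theta_{yx})
```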
  • Speech recognition has a very wide range of applications. Common application systems include: voice input systems, which are more natural and efficient than keyboard input; voice control systems, which use voice to control the operation of a device and are faster and more convenient than manual control; and intelligent dialogue query systems, which operate according to the customer's voice and provide users with natural, friendly database retrieval services. Speech synthesis technology also has wide application in daily life, such as electronic reading, in-car voice navigation, bank and hospital queue-number systems, traffic announcements, and so on.
  • The speech recognition and speech synthesis method based on dual learning provided by the embodiments of this application can be applied to network devices with speech recognition and speech synthesis functions, such as terminal devices, servers, and in-vehicle network devices. The aforementioned terminal devices include smartphones, smart bracelets, e-reading devices, notebooks, and tablets; this application does not specifically limit this. The following takes a server as an example to introduce in detail the functions of the application device of the above-mentioned method.
  • FIG. 1 is a schematic diagram of the hardware structure of a server 100 provided by an embodiment of the application.
  • The server 100 includes a memory 101, a transceiver 102, and a processor 103 coupled to the memory 101 and the transceiver 102. The memory 101 is configured to store a computer program, which includes program instructions; the processor 103 is configured to execute the program instructions stored in the memory 101; and the transceiver 102 is configured to communicate with other devices under the control of the processor 103. When executing the instructions, the processor 103 can perform the speech recognition and speech synthesis method based on dual learning according to the program instructions.
  • The processor 103 may be a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of the embodiments of the present application.
  • The processor may also be a combination that implements computing functions, for example one or more microprocessors, or a DSP combined with a microprocessor.
  • The transceiver 102 may be a communication interface, a transceiver circuit, and so on; "communication interface" is a general term that may include one or more interfaces, such as an interface between the server and a terminal. The server 100 may further include a bus 104, through which the memory 101, the transceiver 102, and the processor 103 may be connected to one another. The bus 104 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 1, but this does not mean that there is only one bus or one type of bus. The server 100 in this embodiment may also include other hardware according to its actual function, which will not be described again here.
  • The embodiment of the present application provides a method for speech recognition and speech synthesis based on dual learning, as shown in FIG. 2. The method includes:
  • Initializing the labeled data set Φ_(x,y), the speech recognition parameter θ_xy, the speech synthesis parameter θ_yx, and the training data size N. The labeled data set Φ_(x,y) contains K pairs of labeled data; (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), where x^(j) is the voice data and y^(j) is the text data; K is a positive integer and N is a positive integer less than or equal to K. The training data size N is the number of labeled pairs in Φ_(x,y) that participate in the dual learning-based speech recognition and speech synthesis optimization training. The speech recognition parameter θ_xy is a parameter that affects the speech recognition effect, and the speech synthesis parameter θ_yx is a parameter that affects the speech synthesis effect.
  • In some embodiments, the contents of the K pieces of voice data in the labeled data set Φ_(x,y) are all different, and their lengths may be the same or different. The voice data can come from TV news reports, daily conversations, meeting recordings, and so on; the source scenes of the K pieces of voice data can be the same or different, and this application does not specifically limit this. The speech recognition parameter θ_xy and the speech synthesis parameter θ_yx are randomly initialized; for example, the initial values of θ_xy and θ_yx are both set to 1.
  • Optionally, the above dual learning-based speech recognition and speech synthesis method further includes: randomly selecting S pairs of labeled data from the labeled data set Φ_(x,y), pre-training the first speech recognition model to be trained to obtain a pre-trained second speech recognition model, and pre-training the first speech synthesis model to be trained to obtain a pre-trained second speech synthesis model. The second speech recognition model includes a Deep Neural Network (DNN)-Hidden Markov Model (HMM); the second speech synthesis model includes an encoder, a decoder, and a neural vocoder; S is a positive integer less than or equal to K.
  • The aforementioned dual learning-based speech recognition and synthesis device pre-trains the first speech recognition model to be trained to obtain the pre-trained second speech recognition model as follows: the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S is input into the first speech recognition model to be trained; the speech data x^(r) is preprocessed to obtain the frequency cepstral coefficient features corresponding to x^(r); and these features are used to pre-train the DNN-HMM model of the first speech recognition model, yielding a second speech recognition model that includes the trained DNN-HMM model.
  • Specifically, the dual learning-based speech recognition and synthesis device inputs the speech data x^(r) in the labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained, and preprocesses x^(r) to obtain its frequency cepstral coefficient features. The device then takes these frequency cepstral coefficient features as input data and trains an acoustic model composed of a Gaussian Mixture Model (GMM) and a Hidden Markov Model (HMM), obtaining the likelihood probability features of the phoneme states output by the pre-trained GMM and the transition probabilities of the phoneme states output by the pre-trained HMM. The device converts the likelihood probability features of the phoneme states into posterior probability features of the phoneme states through forced alignment and, according to the S pairs of labeled data and the posterior probability features of the phoneme states, obtains the matrix weight values and matrix offset values between the layer nodes of the Deep Neural Network (DNN) model, thereby generating the pre-trained DNN model. The second speech recognition model includes the aforementioned pre-trained DNN model and the aforementioned pre-trained HMM. A pipeline sketch follows this item.
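  • A condensed sketch of this pipeline, using hmmlearn and scikit-learn as stand-ins (the library choice, the data shapes, and the use of plain Viterbi decoding in place of true forced alignment are all assumptions for illustration, not the patent's implementation):

```python
# GMM-HMM acoustic model gives frame-level phoneme-state alignments,
# which then supervise a DNN that predicts state posteriors.
import numpy as np
from hmmlearn.hmm import GMMHMM
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 13))   # stand-in for MFCC frames of S utterances
lengths = [250, 250]                  # frames per utterance

# 1) Pre-train the GMM-HMM: EM fits state GMMs (likelihoods) and transitions.
ghmm = GMMHMM(n_components=5, n_mix=2, covariance_type="diag", n_iter=10)
ghmm.fit(feats, lengths)

# 2) "Forced alignment" stand-in: decode the most likely state per frame.
#    (Real forced alignment constrains decoding to the known transcript.)
states = ghmm.predict(feats, lengths)

# 3) Pre-train the DNN to map acoustic features to phoneme-state posteriors.
dnn = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=200)
dnn.fit(feats, states)
posteriors = dnn.predict_proba(feats)  # per-frame state posterior features
```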
  • The aforementioned dual learning-based speech recognition and synthesis device pre-trains the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model as follows: the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S is input into the first speech synthesis model to be trained, and the text data y^(r) is used to pre-train the encoder, decoder, and neural vocoder of the first speech synthesis model, yielding a second speech synthesis model that includes the trained encoder, the trained decoder, and the trained neural vocoder.
  • Specifically, the dual learning-based speech recognition and synthesis device randomly selects S pairs of labeled data {(x^(t), y^(t))}_S from the labeled data set Φ_(x,y) and uses them to pre-train the first speech synthesis model to be trained, obtaining the pre-trained second speech synthesis model. The device inputs the text data y^(t) into the first speech synthesis model to be trained. First, the encoder analyzes the text data to obtain the intermediate semantic vector corresponding to y^(t), which represents the semantics of the text. The device then inputs this intermediate semantic vector into the decoder to obtain the sound sequence features corresponding to y^(t). Finally, the sound sequence features are input into the neural vocoder, which outputs the speech data corresponding to y^(t). The aforementioned encoder, decoder, and neural vocoder all use a recurrent neural network (RNN) model, and the second speech synthesis model includes this encoder, decoder, and neural vocoder.
  • The GMM uses Gaussian probability density functions to quantify a distribution precisely: it is a model formed by decomposing a distribution into several weighted Gaussian probability density functions. The HMM is a probabilistic model for time series: it describes the process of randomly generating an unobservable random sequence of states from a hidden Markov chain and then generating an observation from each state, producing an observable random sequence.
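  • In standard notation (a textbook statement of the two definitions above, not a formula quoted from the patent):

```latex
% GMM: a density decomposed into M weighted Gaussian components
p(\mathbf{o}) = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(\mathbf{o} \mid \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m), \qquad \textstyle\sum_{m} \pi_m = 1
% HMM: hidden states s_t generated by a Markov chain, each emitting an observation o_t
P(\mathbf{o}_{1:T}, s_{1:T}) = \pi_{s_1}\, b_{s_1}(\mathbf{o}_1) \prod_{t=2}^{T} a_{s_{t-1} s_t}\, b_{s_t}(\mathbf{o}_t)
```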
  • The synthesis device extracts the acoustic features of the voice data x^(i) and, according to those acoustic features, obtains the posterior probability of the phonemes corresponding to x^(i) and the transition probability of those phonemes. Specifically, the dual learning-based speech recognition and synthesis device inputs the voice data x^(i) into the second speech recognition model, filters out unimportant information and background noise, and divides x^(i) into multiple frames of speech signal. Each frame of the speech signal is analyzed and processed, and the filter bank features of each frame corresponding to x^(i) are extracted as the acoustic features of x^(i).
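  • A self-contained sketch of this framing and filter-bank extraction, using numpy only (the frame length, hop, FFT size, and mel filter count are illustrative choices, not values from the patent):

```python
import numpy as np

def mel_filterbank_feats(signal, sr=16000, frame_len=400, hop=160,
                         n_fft=512, n_mels=40):
    # Split the waveform into overlapping frames and window them.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # per-frame power spectrum

    # Triangular mel filters spaced evenly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fbank.T + 1e-10)  # log filter-bank features

feats = mel_filterbank_feats(np.random.default_rng(0).normal(size=16000))
print(feats.shape)  # (n_frames, n_mels)
```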
  • The dual learning-based speech recognition and synthesis device inputs the acoustic features of the voice data x^(i) into the DNN model in the second speech recognition model to obtain the posterior probability of the phonemes corresponding to x^(i) output by the DNN model, and inputs the phonemes corresponding to x^(i) into the HMM in the second speech recognition model to obtain the transition probability of those phonemes. The phoneme transition probabilities output by the HMM include the probability of transitioning from a first phoneme state back to itself (a self-loop) and the probability of transitioning from the first phoneme state to a second phoneme state, where the second phoneme state is the state following the first.
  • The dual learning-based device determines w network paths and their probabilities according to the posterior probability of the phonemes corresponding to the voice data x^(i) and the transition probability of those phonemes, and acquires the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths, where w is a positive integer greater than zero.
  • The HMM establishes a statistical model of the time-series structure of the speech signal and can be regarded as a doubly stochastic process: one process is a Markov chain with a finite number of states that models the implicit statistical characteristics of the speech signal; the other is the random process of the externally visible observation sequence associated with each state of the Markov chain. The HMM contains the following elements: hidden states, an observation sequence, the initial probability distribution of the hidden states, the transition probability matrix of the hidden states, and the emission probabilities of the observations. Given an observation sequence, that is, the acoustic features of the speech data, recognition amounts to finding the optimal state sequence corresponding to that observation sequence, thereby converting the speech into text. The phonemes serve as hidden nodes, the change process of the phonemes constitutes the HMM state sequence, and each phoneme generates an observation vector with a certain probability density function.
  • The probability of each state generating the observed values is calculated according to the HMM state transition probabilities of each word; if the joint probability of a word's HMM state sequence is the largest, that segment of speech is determined to correspond to that word. For example, take the voice data of the word "five": the word "five" is formed by connecting the three phoneme states [f], [ay], and [v], and each hidden-node state corresponds to a single phoneme. Taking the words "one", "two", "three", and "five" as candidates, the forward algorithm is used to calculate the probability of the observation sequence under each word, and the word with the highest probability is chosen as the recognition result, as in the sketch below.
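  • A minimal sketch of this word-scoring step with the forward algorithm, using a toy two-word vocabulary and made-up probabilities (every number and shape here is illustrative; with equal word priors, the highest likelihood also gives the highest posterior):

```python
import numpy as np

def forward_log_prob(obs, pi, A, B):
    """Log probability of an observation sequence under one word's HMM.

    obs: discrete observation indices, shape (T,)
    pi:  initial state distribution, shape (S,)
    A:   state transition matrix, shape (S, S)
    B:   emission matrix B[s, o], shape (S, O)
    """
    alpha = pi * B[:, obs[0]]            # forward variable at t = 0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # recursion: transition, then emit
    return np.log(alpha.sum() + 1e-300)

# Toy word models, e.g. "five" ~ states [f], [ay], [v] (numbers are made up).
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],           # self-loop or advance to the next state
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
B_five = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]])
B_one = np.array([[0.3, 0.3, 0.4],
                  [0.4, 0.3, 0.3],
                  [0.3, 0.4, 0.3]])
obs = np.array([0, 0, 1, 1, 2])          # a frame-level observation sequence
scores = {w: forward_log_prob(obs, pi, A, B)
          for w, B in [("five", B_five), ("one", B_one)]}
print(max(scores, key=scores.get))       # highest-scoring word wins
```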
  • The first log likelihood represents the log likelihood function of the conditional probability distribution P_f(y^(i) | x^(i); θ_xy).
  • The dual learning-based speech recognition and synthesis device acquires the sound feature sequence corresponding to the text data y^(i), generates voice data x̂^(i) according to the sound sequence features, and calculates the second log likelihood that x̂^(i) equals the voice data x^(i). Specifically, the device inputs the text data y^(i) into the second speech synthesis model. First, y^(i) is split into the smallest unit words that carry semantics. The smallest unit words corresponding to y^(i) are input into the encoder of the second speech synthesis model, which performs semantic analysis and classification on them. The device then performs classification coding on the smallest unit words corresponding to y^(i) and outputs a fixed-length intermediate semantic vector corresponding to y^(i). This intermediate semantic vector is input into the decoder of the second speech synthesis model, which performs semantic analysis on it and generates the voice sequence features corresponding to y^(i). Optionally, the classification categories include: Chinese, English, Korean, numbers, pinyin, and place names. A shape-level sketch of this pipeline follows this item.
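  • A shape-level sketch of this encoder/decoder/vocoder chain in PyTorch (all layer sizes are invented, and a real system decodes autoregressively with attention; this only shows how the three stages connect):

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    def __init__(self, vocab=100, emb=32, hid=64, n_acoustic=40, hop=160):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)  # text -> semantic vectors
        self.decoder = nn.GRU(hid, hid, batch_first=True)  # semantics -> acoustic frames
        self.to_acoustic = nn.Linear(hid, n_acoustic)
        self.vocoder = nn.Linear(n_acoustic, hop)           # frames -> waveform chunks

    def forward(self, token_ids):
        sem, _ = self.encoder(self.embed(token_ids))        # intermediate semantic vectors
        dec, _ = self.decoder(sem)
        frames = self.to_acoustic(dec)                      # sound feature sequence
        wave = self.vocoder(frames).flatten(1)              # synthesized waveform
        return frames, wave

tts = TinyTTS()
frames, wave = tts(torch.tensor([[5, 17, 42]]))             # toy "smallest unit words"
print(frames.shape, wave.shape)                             # (1, 3, 40) (1, 480)
```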
  • The dual learning-based speech recognition and synthesis device, for the N pairs of labeled data, takes maximizing the first log likelihood and the second log likelihood as the objective function and takes the probabilistic duality of speech recognition and speech synthesis as the constraint condition, optimizing θ_xy and θ_yx. Specifically, for the N pairs of labeled data {(x^(i), y^(i))}_N in the labeled data set, the speech recognition model and the speech synthesis model should satisfy probabilistic duality, that is, P(x^(i)) P(y^(i) | x^(i); θ_xy) = P(y^(i)) P(x^(i) | y^(i); θ_yx), where P(x^(i)) and P(y^(i)) denote the marginal probabilities of the voice data x^(i) and the text data y^(i), respectively. The objective function and the constraint condition can then be written out as in the following formulas.
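  • Written out (my formalization of the optimization just described; the original formula images are not preserved in this copy of the document):

```latex
\max_{\theta_{xy},\,\theta_{yx}} \;\; \frac{1}{N} \sum_{i=1}^{N} \Big[ \log P_f\big(y^{(i)} \mid x^{(i)}; \theta_{xy}\big) + \log P_g\big(x^{(i)} \mid y^{(i)}; \theta_{yx}\big) \Big]
\quad \text{s.t.} \quad P\big(x^{(i)}\big)\, P_f\big(y^{(i)} \mid x^{(i)}; \theta_{xy}\big) = P\big(y^{(i)}\big)\, P_g\big(x^{(i)} \mid y^{(i)}; \theta_{yx}\big), \;\; i = 1, \dots, N
```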
  • In the embodiment of the present application, the posterior probability and the transition probability of the phonemes corresponding to the voice data x^(i) are obtained to generate text data ŷ^(i), and the sound feature sequence corresponding to y^(i) is obtained to generate voice data x̂^(i). For the N pairs of labeled data, the goal is to maximize the log likelihood that ŷ^(i) equals y^(i) and the log likelihood that x̂^(i) equals x^(i), with the probabilistic duality of speech recognition and speech synthesis as a constraint condition, thereby optimizing the speech recognition and speech synthesis effect. Dual learning is thus used effectively for speech recognition and speech synthesis, improving the training speed of speech recognition and speech generation and the accuracy of their output results.
  • The embodiment of the present application also provides a speech recognition and speech synthesis device based on dual learning, which can have the beneficial effects of the above-mentioned method. The function of the device can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above-mentioned functions.
  • FIG. 3 is a structural block diagram of a speech recognition and speech synthesis device 300 based on dual learning provided by an embodiment of the present application. The device includes an initialization unit 301, a selection unit 302, a processing unit 303, a first generation unit 304, a second generation unit 305, and an optimization unit 306.
  • The selection unit 302 is used to select N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y), where K is a positive integer and N is a positive integer less than or equal to K.
  • The processing unit 303 is used to extract the acoustic features of x^(i) and, according to those acoustic features, acquire the posterior probability of the phonemes corresponding to x^(i) and the transition probability of those phonemes.
  • The first generation unit 304 is configured to generate text data ŷ^(i) according to the posterior probability and the transition probability of the phonemes corresponding to x^(i), and to calculate the first log likelihood that ŷ^(i) equals y^(i).
  • The second generation unit 305 is configured to acquire the sound feature sequence corresponding to y^(i), generate voice data x̂^(i) according to the sound sequence features, and calculate the second log likelihood that x̂^(i) equals x^(i).
  • The optimization unit 306 is configured to take maximizing the first log likelihood and the second log likelihood as the objective function for the N pairs of labeled data, take the probabilistic duality of speech recognition and speech synthesis as the constraint condition, and optimize θ_xy and θ_yx.
  • Optionally, before the selection unit 302 selects the N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y), the device further includes a pre-training unit, used to randomly select S pairs of labeled data from Φ_(x,y), pre-train the first speech recognition model to obtain the second speech recognition model, and pre-train the first speech synthesis model to obtain the second speech synthesis model. The second speech recognition model includes a deep neural network and a hidden Markov model, and the second speech synthesis model includes an encoder, a decoder, and a neural vocoder.
  • The pre-training unit is used to pre-train the first speech recognition model to be trained to obtain the pre-trained second speech recognition model, which specifically includes:
  • An input unit, used to input the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained;
  • A preprocessing unit, used to preprocess the speech data x^(r) to obtain the frequency cepstral coefficient features corresponding to x^(r);
  • The pre-training unit is also used to pre-train the deep neural network-hidden Markov model of the first speech recognition model using the frequency cepstral coefficient features to obtain the second speech recognition model, which includes the trained deep neural network-hidden Markov model.
  • The pre-training unit is used to pre-train the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model, which specifically includes:
  • The input unit is also used to input the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech synthesis model to be trained;
  • The pre-training unit is also used to pre-train the encoder, decoder, and neural vocoder of the first speech synthesis model using the text data y^(r) to obtain the second speech synthesis model, which includes the trained encoder, the trained decoder, and the trained neural vocoder. Optionally, the encoder, decoder, and neural vocoder all adopt a recurrent neural network model.
  • Optionally, the processing unit 303 includes an extraction unit and an acquisition unit. The extraction unit is used to input the voice data x^(i) into the second speech recognition model and extract the acoustic features of x^(i) frame by frame. The acquisition unit is used to input the acoustic features of x^(i) into the deep neural network in the second speech recognition model to obtain the posterior probability of the phonemes corresponding to x^(i), and to obtain the transition probability of those phonemes through the hidden Markov model in the second speech recognition model.
  • Optionally, the first generation unit 304 generates the text data ŷ^(i) according to the posterior probability and the transition probability of the phonemes corresponding to the voice data x^(i) as follows: a determination unit is used to determine w network paths and their probabilities according to the posterior probability and the transition probability of the phonemes corresponding to x^(i), where w is a positive integer greater than zero; and the acquisition unit is also used to acquire the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths.
  • The second generation unit 305 is specifically configured to: input y^(i) into the encoder of the second speech synthesis model to generate a semantic sequence; input the semantic sequence into the decoder of the second speech synthesis model to generate the voice feature sequence; input the voice feature sequence into the neural vocoder of the second speech synthesis model to generate voice data x̂^(i); and calculate the second log likelihood that x̂^(i) equals x^(i).
  • The optimization unit 306 is specifically configured to: take maximizing the first log likelihood and the second log likelihood as the objective function, take the probabilistic duality of speech recognition and speech synthesis as the constraint condition, combine the objective function and the constraint condition into a single problem, and iteratively optimize θ_xy and θ_yx using a Lagrangian multiplier optimization algorithm; a toy sketch follows this item.
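  • A toy sketch of such a constrained update, relaxing the duality constraint into a penalty weighted by a fixed Lagrange multiplier (the model stubs, the marginal log-probabilities, and the loss shapes are illustrative assumptions only, not the patent's implementation):

```python
import torch

# Stub recognizer/synthesizer: each maps its input to a log-probability of the
# paired target given one trainable parameter vector (theta_xy / theta_yx).
theta_xy = torch.zeros(8, requires_grad=True)   # "recognition" parameters
theta_yx = torch.zeros(8, requires_grad=True)   # "synthesis" parameters

def log_p_y_given_x(x, y, theta):               # stand-in for log P_f(y|x; theta_xy)
    return -(x @ theta - y).pow(2).mean()

def log_p_x_given_y(y, x, theta):               # stand-in for log P_g(x|y; theta_yx)
    return -(y * theta.sum() - x.mean()).pow(2).mean()

log_p_x = torch.tensor(-3.0)                    # marginal log P(x), e.g. from a prior
log_p_y = torch.tensor(-4.0)                    # marginal log P(y), e.g. from an LM
lam = 0.1                                       # Lagrange multiplier / penalty weight
opt = torch.optim.SGD([theta_xy, theta_yx], lr=1e-2)

for x, y in [(torch.randn(8), torch.tensor(0.5))]:   # one toy labeled pair
    ll1 = log_p_y_given_x(x, y, theta_xy)            # first log likelihood
    ll2 = log_p_x_given_y(y, x, theta_yx)            # second log likelihood
    duality_gap = (log_p_x + ll1 - log_p_y - ll2) ** 2
    loss = -(ll1 + ll2) + lam * duality_gap          # maximize likelihoods s.t. duality
    opt.zero_grad(); loss.backward(); opt.step()
```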
  • The steps of the method or algorithm described in conjunction with the disclosure of the embodiments of the present application may be implemented in hardware, or by a processor executing software instructions. Software instructions can be composed of corresponding software modules, which can be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. The storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in a network device; the processor and the storage medium may also exist as discrete components in a network device. Computer non-volatile readable media include computer non-volatile storage media and communication media, where communication media include any media that facilitate the transfer of a computer program from one place to another. The storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

A speech recognition and speech synthesis method and apparatus based on dual learning. The method comprises: initializing a labeled data set Φ(x,y), a speech recognition parameter θxy, and a speech synthesis parameter θyx, wherein Φ(x,y) = {(x(j), y(j))}K, x(j) is speech data, and y(j) is text data; selecting, from Φ(x,y), N pairs of labeled data {(x(i), y(i))}N; extracting an acoustic feature of x(i) and, according to it, acquiring the posterior probability and the transition probability of the phonemes corresponding to x(i) to generate text data ŷ(i), and calculating the first log likelihood of ŷ(i) equaling y(i); acquiring a sound feature sequence corresponding to y(i) to generate speech data x̂(i), and calculating the second log likelihood of x̂(i) equaling x(i); and taking maximization of the first log likelihood and the second log likelihood as the objective function, and taking the probabilistic duality of speech recognition and speech synthesis as a constraint condition, to optimize θxy and θyx. According to the method, dual learning is effectively used to perform speech recognition and speech synthesis, thereby increasing the training speed of speech recognition and speech generation and improving the accuracy of the output results.

Description

Method and device for speech recognition and speech synthesis based on dual learning

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 22, 2019, with application number 201910135575.7 and the title "A method and device for speech recognition and speech synthesis based on dual learning", the entire contents of which are incorporated into this application by reference.

Technical Field
This application relates to the field of speech processing technology, and in particular to a method and device for speech recognition and speech synthesis based on dual learning.
Background

In recent years, artificial intelligence technology represented by deep learning and reinforcement learning has made considerable progress and achieved great success in many applications. However, deep learning depends on large-scale labeled data, and reinforcement learning depends on a persistent interactive environment, and both acquiring large-scale labeled data and maintaining an interactive environment are costly. For deep learning and reinforcement learning to succeed more widely, their dependence on large-scale labeled data and interactive environments needs to be reduced. To solve this problem, a new learning paradigm has emerged, which we call dual learning.

In supervised learning tasks, many problems are found to have a dual form: input and output appear in dual pairs, where the input and output of one task are the output and input of another. For example, in machine translation, translation between two languages in opposite directions forms a pair of dual tasks. The two tasks are internally related by probability and share a correlation model, but this connection is usually not exploited effectively, because the two models are typically trained independently. Dual learning therefore uses the correlation between the two models to train both at the same time, simplifying the training process; dual learning does not rely on large-scale labeled data.

Traditional technology usually trains speech recognition and speech generation separately, failing to make effective use of the duality between them. Using the duality between speech recognition and speech generation to combine speech recognition training and speech generation training into dual learning is a major development trend of speech recognition and speech generation technology. However, applying dual learning to actual scenarios still faces huge challenges; how to perform speech recognition and speech generation effectively based on dual learning, and how to improve the training speed of speech recognition and speech generation and the accuracy of their output results, are technical problems urgently in need of solution.
Summary of the Invention

The embodiments of the present application provide a method and device for speech recognition and speech synthesis based on dual learning, which can effectively use dual learning for speech recognition and speech synthesis, improve the training speed of speech recognition and speech generation, and improve the accuracy of their output results.

The embodiment of the present application provides a speech recognition and speech synthesis method based on dual learning, which includes the following steps:

initializing the labeled data set Φ_(x,y), the speech recognition parameter θ_xy, the speech synthesis parameter θ_yx, and the training data size N, where Φ_(x,y) = {(x^(j), y^(j))}_K; the labeled data set Φ_(x,y) contains K pairs of labeled data; (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y); x^(j) is the voice data and y^(j) is the text data in the j-th pair; K is a positive integer and N is a positive integer less than or equal to K;
selecting N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y);

extracting the acoustic features of the voice data x^(i), and according to those acoustic features, acquiring the posterior probability of the phonemes corresponding to x^(i) and the transition probability of those phonemes;

generating text data ŷ^(i) according to the posterior probability and the transition probability of the phonemes corresponding to x^(i), and calculating the first log likelihood that ŷ^(i) equals the text data y^(i);

acquiring the sound feature sequence corresponding to y^(i), generating voice data x̂^(i) according to the sound sequence features, and calculating the second log likelihood that x̂^(i) equals the voice data x^(i);

for the N pairs of labeled data, taking maximizing the first log likelihood and the second log likelihood as the objective function, taking the probabilistic duality of speech recognition and speech synthesis as the constraint condition, and optimizing θ_xy and θ_yx.
The embodiment of the present application also provides a device for speech recognition and speech synthesis based on dual learning, which can realize the beneficial effects of the above method. The function of the device can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above functions. Optionally, the device includes an initialization unit, a selection unit, a processing unit, a first generation unit, a second generation unit, and an optimization unit.

The initialization unit is used to initialize the labeled data set Φ_(x,y), the speech recognition parameter θ_xy, the speech synthesis parameter θ_yx, and the training data size N, where Φ_(x,y) = {(x^(j), y^(j))}_K contains K pairs of labeled data; (x^(j), y^(j)) denotes the j-th pair in Φ_(x,y), x^(j) being the voice data and y^(j) the text data of that pair; K is a positive integer and N is a positive integer less than or equal to K.

The selection unit is used to select N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y).

The processing unit is used to extract the acoustic features of the voice data x^(i) and, according to those acoustic features, acquire the posterior probability of the phonemes corresponding to x^(i) and the transition probability of those phonemes.

The first generation unit is used to generate text data ŷ^(i) according to the posterior probability and the transition probability of the phonemes corresponding to x^(i), and to calculate the first log likelihood that ŷ^(i) equals the text data y^(i).

The second generation unit is used to acquire the sound feature sequence corresponding to the text data y^(i), generate voice data x̂^(i) according to the sound sequence features, and calculate the second log likelihood that x̂^(i) equals the voice data x^(i).

The optimization unit is used to take maximizing the first log likelihood and the second log likelihood as the objective function for the N pairs of labeled data, take the probabilistic duality of speech recognition and speech synthesis as the constraint condition, and optimize θ_xy and θ_yx.
The embodiment of the present application also provides a server that can realize the beneficial effects of the above speech recognition and speech synthesis method based on dual learning. The function of the server can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above functions. The server includes a memory, a processor, and a transceiver: the memory is used to store a computer program, including program instructions, that supports the server in executing the above method; the processor is used to control and manage the actions of the server according to the program instructions; and the transceiver is used to support communication between the server and other communication devices.

The embodiment of the present application also provides a computer non-volatile readable storage medium that stores instructions which, when run on a processor, cause the processor to execute the above dual learning-based speech recognition and speech synthesis method.
In the embodiments of the present application, the posterior probability and the transition probability of the phonemes corresponding to the voice data x^(i) are obtained to generate text data ŷ^(i), and the sound feature sequence corresponding to y^(i) is obtained to generate voice data x̂^(i). For the N pairs of labeled data, the goal is to maximize the log likelihood that ŷ^(i) equals y^(i) and the log likelihood that x̂^(i) equals x^(i), with the probabilistic duality of speech recognition and speech synthesis as a constraint condition, thereby optimizing the speech recognition and speech synthesis effect. Dual learning is thus used effectively for speech recognition and speech synthesis, improving the training speed of speech recognition and speech generation and the accuracy of their output results.

Additional aspects and advantages of this application will be given in part in the following description; they will become obvious from the description or be understood through the practice of this application.
Brief Description of the Drawings

The above and/or additional aspects and advantages of this application will become obvious and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic structural diagram of a server provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a speech recognition and speech synthesis method based on dual learning provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a speech recognition and speech synthesis device based on dual learning provided by an embodiment of the present application.
Detailed Description

The technical solutions in the embodiments of the present application are described below in conjunction with the accompanying drawings. It should be understood that when used in this specification and the appended claims, the terms "include" and "comprise" indicate the existence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof. In addition, the terms "first", "second", "third", and so on are used to distinguish different objects, not to describe a specific order.

It should be noted that the terms used in the embodiments of the present application are only for the purpose of describing specific embodiments and are not intended to limit the application. The singular forms "a", "said", and "the" used in the embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
对偶学习是一种利用一组对偶任务之间的对偶性建立反馈信号,并用这个信号约束训练的学习方案。对偶性广泛存在于人工智能任务之中,例如,机器翻译就是让机器将自然语言从一种语言翻译到另一种语言,中文到英文和英文到中文互为对偶任务。图像识别和图像合成也互为对偶任务,图像识别指的是给定一张图片,判别它的类别和具体信息。图 像生成指的是给定一个类别和具体信息,生成一张对应的图片。同样,语音识别和语音合成也互为对偶任务,语音识别是让机器通过识别和理解过程把语音信号转变为相应的文本或命令的技术,语音合成是将计算机自己产生的、或外部输入的文字信息,通过机械的、电子的方法转变为语音的技术。Dual learning is a learning scheme that uses the duality between a set of dual tasks to establish a feedback signal, and uses this signal to constrain training. Duality exists widely in artificial intelligence tasks. For example, machine translation is to allow machines to translate natural language from one language to another. Chinese to English and English to Chinese are dual tasks. Image recognition and image synthesis are also dual tasks for each other. Image recognition refers to a given picture, its classification and specific information. Image generation refers to the generation of a corresponding image given a category and specific information. Similarly, speech recognition and speech synthesis are also dual tasks. Speech recognition is a technology that allows machines to convert speech signals into corresponding texts or commands through the process of recognition and understanding. Speech synthesis is a technology that converts text generated by the computer itself or input from outside. Information is transformed into voice technology through mechanical and electronic methods.
Speech recognition has a very wide range of applications. Common application systems include voice input systems, which are more natural and efficient than keyboard input; voice control systems, which use voice to control the operation of a device and are faster and more convenient than manual control; and intelligent dialogue query systems, which operate according to a customer's voice and provide users with natural, friendly database retrieval services. Speech synthesis is also widely used in daily life, for example in electronic reading, in-vehicle voice navigation, queue-number announcement systems in banks and hospitals, traffic broadcasts, and so on. The dual-learning-based speech recognition and speech synthesis method provided by the embodiments of this application can be applied to network devices with speech recognition and speech synthesis functions, such as terminal devices, servers, and in-vehicle network devices; the terminal devices include smartphones, smart bracelets, e-readers, notebooks, and tablet computers, which are not specifically limited in this application. The following takes a server as an example to describe in detail the functions of a device to which the above dual-learning-based speech recognition and speech synthesis method is applied.
Referring to FIG. 1, FIG. 1 is a schematic diagram of the hardware structure of a server 100 provided by an embodiment of this application. The server 100 includes a memory 101, a transceiver 102, and a processor 103 coupled to the memory 101 and the transceiver 102. The memory 101 is configured to store a computer program, the computer program including program instructions; the processor 103 is configured to execute the program instructions stored in the memory 101; and the transceiver 102 is configured to communicate with other devices under the control of the processor 103. When executing the instructions, the processor 103 can perform the dual-learning-based speech recognition and speech synthesis method according to the program instructions.
The processor 103 may be a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of the embodiments of this application. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The transceiver 102 may be a communication interface, a transceiver circuit, or the like, where "communication interface" is a collective term that may include one or more interfaces, for example an interface between the server and a terminal.
Optionally, the server 100 may further include a bus 104. The memory 101, the transceiver 102, and the processor 103 may be connected to one another through the bus 104. The bus 104 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 104 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is used in FIG. 1, but this does not mean that there is only one bus or only one type of bus.
In addition to the memory 101, the transceiver 102, the processor 103, and the bus 104 shown in FIG. 1, the server 100 in the embodiments may also include other hardware according to the actual functions of the server, which will not be described in detail here.
In the above operating environment, an embodiment of this application provides the dual-learning-based speech recognition and speech synthesis method shown in FIG. 2. Referring to FIG. 2, the dual-learning-based speech recognition and speech synthesis method includes:
S201. Initialize a labeled data set Φ_(x,y), a speech recognition parameter θ_xy, a speech synthesis parameter θ_yx, and a training data size N, where Φ_(x,y) = {(x^(j), y^(j))}_K, the labeled data set Φ_(x,y) contains K pairs of labeled data, x^(j) is speech data, and y^(j) is text data.
Specifically, K pairs of labeled data are selected to form the labeled data set Φ_(x,y) = {(x^(j), y^(j))}_K, where (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), x^(j) is the speech data in the j-th pair, y^(j) is the text data in the j-th pair, K is a positive integer, and N is a positive integer less than or equal to K. The training data size N is the number of labeled data pairs in Φ_(x,y) that participate in the dual-learning-based joint optimization of speech recognition and speech synthesis. The speech recognition parameter θ_xy is a parameter that affects the speech recognition result, and the speech synthesis parameter θ_yx is a parameter that affects the speech synthesis result.
It can be understood that the contents of the K speech data items in the labeled data set Φ_(x,y) are all different, and their lengths may or may not be identical. The speech data may come from television news broadcasts, daily conversations, meeting recordings, and so on; the source scenes of the K speech data items may be the same or different. Neither is specifically limited in this application.
Optionally, the speech recognition parameter θ_xy and the speech synthesis parameter θ_yx are randomly initialized; for example, the initial values of θ_xy and θ_yx are both set to 1.
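For illustration only, the following minimal Python sketch shows one way this initialization and the subsequent selection in step S202 could look in code; the waveform length, parameter dimensionality, and all variable names are assumptions, not part of the embodiment.

```python
import numpy as np

# Hypothetical in-memory form of the labeled data set Φ_(x,y): K pairs of
# (speech waveform, transcript). Names and sizes are illustrative.
rng = np.random.default_rng(0)
K, N = 1000, 800                       # K labeled pairs, training size N <= K
labeled_set = [(rng.standard_normal(16000), f"transcript {j}") for j in range(K)]

# Initialize the recognition and synthesis parameters; the embodiment
# mentions setting both initial values to 1 as one option.
theta_xy = np.ones(4096)               # speech recognition parameters θ_xy
theta_yx = np.ones(4096)               # speech synthesis parameters θ_yx

# Φ_(x,y)^N: the N pairs drawn for dual-learning training (step S202).
phi_n = [labeled_set[i] for i in rng.choice(K, size=N, replace=False)]
```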
S202. The dual-learning-based speech recognition and synthesis device selects N pairs of labeled data from the labeled data set Φ_(x,y) to form a labeled data set Φ_(x,y)^N = {(x^(i), y^(i))}_N.
Optionally, before the dual-learning-based speech recognition and synthesis device randomly selects N pairs of labeled data from the labeled data set Φ_(x,y), the method further includes: randomly selecting S pairs of labeled data from Φ_(x,y), pre-training a first speech recognition model to be trained to obtain a pre-trained second speech recognition model, and pre-training a first speech synthesis model to be trained to obtain a pre-trained second speech synthesis model. The second speech recognition model includes a deep neural network (DNN)-hidden Markov model (HMM); the second speech synthesis model includes an encoder, a decoder, and a neural vocoder. S is a positive integer less than or equal to K.
Optionally, pre-training the first speech recognition model to be trained to obtain the pre-trained second speech recognition model includes: inputting the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained; preprocessing the speech data x^(r) to obtain the frequency cepstral coefficient features corresponding to x^(r); and pre-training the DNN-HMM of the first speech recognition model with the frequency cepstral coefficient features to obtain the second speech recognition model, which includes the trained DNN-HMM.
Optionally, pre-training the first speech recognition model to be trained to obtain the pre-trained second speech recognition model specifically includes: the dual-learning-based speech recognition and synthesis device inputs the speech data x^(r) in the labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained. First, the speech data x^(r) is preprocessed to obtain its frequency cepstral coefficient features. Then, taking these features as input data, the device trains an acoustic model composed of a Gaussian mixture model (GMM) and a hidden Markov model (HMM), and obtains the likelihood features of the phoneme states output by the pre-trained GMM and the transition probabilities of the phoneme states output by the pre-trained HMM. The device converts the likelihood features of the phoneme states into posterior probability features of the phoneme states through forced alignment and, from the S pairs of labeled data and the posterior probability features of the phoneme states, obtains the matrix weight values and matrix bias values between nodes of the output layer of a deep neural network (DNN) model, generating the pre-trained DNN model. The second speech recognition model includes the pre-trained DNN model and the pre-trained HMM.
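The following non-limiting Python sketch illustrates one possible reading of this pre-training pipeline, using librosa MFCCs as a stand-in for the frequency cepstral coefficient features, hmmlearn's GMM-HMM for the acoustic model, and a small PyTorch network for the DNN; the file name, model sizes, and the choice of these libraries are all assumptions.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn
from hmmlearn import hmm

def cepstral_features(wav_path, n_coef=13):
    # Preprocess one utterance x^(r) into frame-level cepstral features
    # (MFCCs stand in for the embodiment's frequency cepstral coefficients).
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coef).T  # (frames, n_coef)

# GMM-HMM acoustic model: hidden states ~ phoneme states, emissions ~ GMMs.
feats = cepstral_features("utt0001.wav")           # hypothetical file name
gmm_hmm = hmm.GMMHMM(n_components=3, n_mix=4, covariance_type="diag")
gmm_hmm.fit(feats)                                 # pre-train on the S pairs
state_ids = gmm_hmm.predict(feats)                 # forced-alignment stand-in

# DNN pre-training: map features to aligned phoneme-state posteriors.
dnn = nn.Sequential(nn.Linear(13, 256), nn.ReLU(), nn.Linear(256, 3))
opt = torch.optim.SGD(dnn.parameters(), lr=0.1)
x = torch.tensor(feats, dtype=torch.float32)
t = torch.tensor(state_ids, dtype=torch.long)
for _ in range(10):                                # a few illustrative epochs
    opt.zero_grad()
    loss = nn.functional.cross_entropy(dnn(x), t)  # softmax -> state posteriors
    loss.backward()
    opt.step()
```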
Optionally, pre-training the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model includes: inputting the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech synthesis model to be trained; and pre-training the encoder, the decoder, and the neural vocoder of the first speech synthesis model with the text data y^(r) to obtain the second speech synthesis model, which includes the trained encoder, the trained decoder, and the trained neural vocoder.
Optionally, pre-training the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model specifically includes: the dual-learning-based speech recognition and synthesis device randomly selects S pairs of labeled data {(x^(t), y^(t))}_S from the labeled data set Φ_(x,y) and pre-trains the first speech synthesis model to be trained, obtaining the pre-trained second speech synthesis model. This specifically includes the following steps: the device inputs the text data y^(t) in the labeled data {(x^(t), y^(t))}_S into the first speech synthesis model to be trained. First, the encoder parses the text data to obtain an intermediate semantic vector, representing the semantics of the text, corresponding to y^(t). Then, the device inputs the intermediate semantic vector into the decoder to obtain the sound sequence features corresponding to y^(t). The sound sequence features are input into the neural vocoder, which outputs the speech data corresponding to y^(t). The encoder, the decoder, and the neural vocoder all adopt recurrent neural network (RNN) models; the second speech synthesis model includes the encoder, the decoder, and the neural vocoder.
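As a sketch of the encoder-decoder-vocoder structure described above, the toy PyTorch model below chains three recurrent networks; the dimensions, the GRU cell choice, and the direct frame-to-waveform vocoder are illustrative assumptions rather than the embodiment's actual architecture.

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    # Encoder -> decoder -> neural vocoder, each a recurrent network,
    # mirroring the three-stage synthesis model described above.
    def __init__(self, vocab=100, emb=64, hid=128, n_mels=80, hop=200):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)    # text -> semantics
        self.decoder = nn.GRU(hid, hid, batch_first=True)    # semantics -> features
        self.to_mel = nn.Linear(hid, n_mels)
        self.vocoder = nn.GRU(n_mels, hop, batch_first=True) # features -> waveform

    def forward(self, token_ids):
        enc, _ = self.encoder(self.embed(token_ids))  # intermediate semantic vectors
        dec, _ = self.decoder(enc)
        mel = self.to_mel(dec)                        # sound sequence features
        frames, _ = self.vocoder(mel)
        return frames.flatten(1)                      # synthesized waveform samples

tts = TinyTTS()
wave = tts(torch.randint(0, 100, (1, 12)))            # 12 hypothetical text tokens
print(wave.shape)                                     # (1, 12 * hop) samples
```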
It can be understood that a GMM quantifies things precisely with Gaussian probability density functions; it is a model that decomposes a phenomenon into several components formed from Gaussian probability density functions. An HMM is a probabilistic model of time series: it describes the process in which a hidden Markov chain randomly generates an unobservable random sequence of states, and each state then generates an observation, producing an observable random sequence.
S203. The dual-learning-based speech recognition and synthesis device extracts the acoustic features of the speech data x^(i), and obtains, from the acoustic features of x^(i), the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i).
Specifically, the dual-learning-based speech recognition and synthesis device inputs the speech data x^(i) into the second speech recognition model, filters out unimportant information and background noise, and divides x^(i) into multiple frames of speech signal. Each frame of speech signal is analyzed and processed, and the filter bank features of each frame corresponding to x^(i) are extracted as the acoustic features of x^(i). The device inputs the acoustic features of x^(i) into the DNN model in the second speech recognition model to obtain the posterior probabilities of the phonemes corresponding to x^(i) output by the DNN model, and inputs the phonemes corresponding to x^(i) into the HMM in the second speech recognition model to obtain the transition probabilities of the phonemes corresponding to x^(i).
It can be understood that the phoneme transition probabilities output by the HMM include the probability of transitioning from a first phoneme state back to the first phoneme state and the probability of transitioning from the first phoneme state to a second phoneme state, where the second phoneme state is the state following the first phoneme state.
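A minimal sketch of this feature-extraction step, assuming log mel filter-bank features computed with librosa and untrained stand-ins for the pre-trained DNN and HMM; the sizes and file name are hypothetical.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

N_MELS, N_STATES = 40, 120                          # illustrative sizes

def fbank_features(wav, sr=16000):
    # Frame the signal and extract log mel filter-bank features,
    # one vector per frame, as the acoustic features of x^(i).
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=400, hop_length=160, n_mels=N_MELS)
    return np.log(mel + 1e-6).T                     # (frames, N_MELS)

# Stand-in for the pre-trained DNN of the second speech recognition model.
dnn = nn.Sequential(nn.Linear(N_MELS, 256), nn.ReLU(), nn.Linear(256, N_STATES))

wav, _ = librosa.load("utt0001.wav", sr=16000)      # hypothetical utterance
feats = torch.tensor(fbank_features(wav), dtype=torch.float32)
posteriors = torch.softmax(dnn(feats), dim=-1)      # per-frame phoneme posteriors

# Phoneme-state transition probabilities come from the HMM; a uniform
# row-stochastic matrix stands in for the trained model's transitions.
trans = np.full((N_STATES, N_STATES), 1.0 / N_STATES)
```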
S204. The dual-learning-based speech recognition and synthesis device generates text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to x^(i), and computes the first log likelihood that the text data ŷ^(i) equals the text data y^(i).
Optionally, the dual-learning-based speech recognition and synthesis device determines the probabilities of w network paths from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), and obtains the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths, where w is a positive integer greater than zero.
Specifically, the device obtains the probabilities of different words from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i); different words form different network paths. The device obtains the probability of each network path, selects the path with the highest probability as the optimal network path, and generates the corresponding text data ŷ^(i) from that optimal network path.
An HMM builds a statistical model of the temporal structure of the speech signal and can be viewed mathematically as a doubly stochastic process: one process is a hidden Markov chain with a finite number of states that models the changing statistical characteristics of the speech signal; the other is the externally visible random process of observation sequences associated with each state of the Markov chain. An HMM contains the following elements: the hidden states, the observation sequence, the initial probability distribution of the hidden states, the transition probability matrix of the hidden states, and the emission probabilities of the observations. During speech recognition, given a trained HMM and an observation sequence (that is, the acoustic features of the speech data), the optimal state sequence corresponding to the observation sequence is found, thereby converting the speech into text. According to the pronunciation process of each word, phonemes serve as hidden nodes; the progression of phonemes constitutes the HMM state sequence, and each phoneme generates an observation vector according to a certain probability density function.
It can be understood that the probability of each state generating the observed value is computed from the HMM state transition probabilities of each word; if the joint probability of a word's HMM state sequence is the largest, the speech segment is judged to correspond to that word. For example, taking the speech data of the word "five": "five" is formed by connecting the three phoneme states [f], [ay], and [v], and each state of the hidden nodes corresponds to a single phoneme. Taking the words "one", "two", "three", and "five" as an example, the forward algorithm is used to compute the posterior probability of the observation sequence under each word, and the word with the highest probability is taken as the recognition result.
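The sketch below implements the forward algorithm over toy three-state word HMMs to score the candidate words, with random emission probabilities standing in for the acoustic model outputs; the state counts and probability values are illustrative.

```python
import numpy as np

def forward_log_prob(log_init, log_trans, log_emit):
    # Forward algorithm: log P(observations | word HMM).
    # log_init: (S,) initial state log-probs; log_trans: (S, S);
    # log_emit: (T, S) per-frame state emission log-probs.
    alpha = log_init + log_emit[0]
    for t in range(1, log_emit.shape[0]):
        alpha = log_emit[t] + np.logaddexp.reduce(
            alpha[:, None] + log_trans, axis=0)
    return np.logaddexp.reduce(alpha)

rng = np.random.default_rng(0)
T = 30                                   # frames in the utterance
words = ["one", "two", "three", "five"]  # e.g. "five" = [f]-[ay]-[v]
scores = {}
for w in words:
    S = 3                                # 3 phoneme states per word here
    log_init = np.log([1.0, 1e-9, 1e-9])            # start in first phoneme
    log_trans = np.log(np.full((S, S), 1e-9)
                       + np.eye(S) * 0.6
                       + np.eye(S, k=1) * 0.4)      # self-loop or advance
    log_emit = np.log(rng.dirichlet(np.ones(S), size=T))  # toy emissions
    scores[w] = forward_log_prob(log_init, log_trans, log_emit)

print(max(scores, key=scores.get))       # word with the highest likelihood
```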
Optionally, the device computes the first log likelihood that ŷ^(i) equals y^(i), that is, the log likelihood of recognizing y^(i) when x^(i) is input into the second speech recognition model. The first log likelihood represents the log likelihood function of the conditional probability distribution P_f(y^(i) | x^(i); θ_xy) and is computed as follows:

log P_f(y^(i) | x^(i); θ_xy) = log P{ f(x^(i)) = y^(i); θ_xy }
S205. The dual-learning-based speech recognition and synthesis device obtains the sound feature sequence corresponding to the text data y^(i), generates speech data x̂^(i) from the sound sequence features, and computes the second log likelihood that the speech data x̂^(i) equals the speech data x^(i).
To obtain the sound feature sequence corresponding to y^(i) and generate the speech data x̂^(i), the device specifically inputs the text data y^(i) into the second speech synthesis model. First, y^(i) is split into minimal semantic unit words. The minimal unit words corresponding to y^(i) are input into the encoder of the second speech synthesis model, which performs semantic analysis on them and classifies them. Then, the device encodes the minimal unit words by category and outputs a fixed-length intermediate semantic vector corresponding to y^(i). The intermediate semantic vector is input into the decoder of the second speech synthesis model, which performs semantic analysis on it and generates the sound sequence features corresponding to y^(i). The sound sequence features are input into the neural vocoder of the second speech synthesis model to generate the speech data x̂^(i).
Optionally, the minimal unit words corresponding to the text data y^(i) are semantically analyzed and classified; the categories include Chinese, English, Korean, digits, pinyin, place names, and the like. Different categories of minimal unit words have different encoding rules.
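One hedged illustration of category-dependent encoding: the sketch below classifies each minimal unit word and encodes it under its category's rule. The categories, regular expressions, and id scheme are invented for illustration and are not the embodiment's actual rules.

```python
import re

# Hypothetical category-specific encoding rules for minimal unit words:
# each category gets its own id range so the encoder can treat, e.g.,
# digits and Chinese characters differently.
CATEGORY_BASE = {"digit": 0, "english": 100, "chinese": 1000, "pinyin": 5000}

def categorize(token):
    if token.isdigit():
        return "digit"
    if re.fullmatch(r"[A-Za-z]+", token):
        return "english"
    if re.fullmatch(r"[\u4e00-\u9fff]+", token):
        return "chinese"
    return "pinyin"                      # fallback bucket in this sketch

def encode(tokens):
    # Classify each minimal unit word, then encode it under the rule of
    # its category (here: category base id plus a per-token hash).
    return [(t, categorize(t),
             CATEGORY_BASE[categorize(t)] + hash(t) % 100) for t in tokens]

print(encode(["你好", "GPS", "2019", "pīn"]))
```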
Optionally, the device computes the second log likelihood that x̂^(i) equals x^(i), that is, the log likelihood of generating x^(i) when y^(i) is input into the second speech synthesis model. The second log likelihood is computed as follows:

log P_g(x^(i) | y^(i); θ_yx) = log P{ g(y^(i)) = x^(i); θ_yx }
S206. For the N pairs of labeled data, the dual-learning-based speech recognition and synthesis device takes maximizing the first log likelihood and the second log likelihood as the objective function, takes the probabilistic duality of speech recognition and speech synthesis as the constraint, and optimizes θ_xy and θ_yx.
Optionally, for the N pairs of labeled data Φ_(x,y)^N in the labeled data set, the device takes maximizing the first log likelihood and the second log likelihood as the objective function and takes the probabilistic duality of speech recognition and speech synthesis as the constraint. Ideally, the speech recognition model and the speech synthesis model should satisfy probabilistic duality, that is, P(x^(i)) P(y^(i) | x^(i); θ_xy) = P(y^(i)) P(x^(i) | y^(i); θ_yx), where P(x^(i)) and P(y^(i)) denote the marginal probabilities of the speech data x^(i) and the text data y^(i), respectively. The objective function and the constraint can be expressed as:

max F(θ_xy, θ_yx) = Σ_{i=1}^{N} [ log P_f(y^(i) | x^(i); θ_xy) + log P_g(x^(i) | y^(i); θ_yx) ]
s.t. P(x^(i)) P_f(y^(i) | x^(i); θ_xy) = P(y^(i)) P_g(x^(i) | y^(i); θ_yx), i = 1, …, N
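The constraint can be monitored numerically as a "duality gap" between the two factorizations of the joint probability; the toy sketch below computes that gap for one pair, assuming the marginals P(x^(i)) and P(y^(i)) are supplied by pre-trained language and speech models, which the embodiment does not specify.

```python
import math

def duality_gap(log_px, log_py, log_p_y_given_x, log_p_x_given_y):
    # Gap between the two factorizations of log P(x, y); zero exactly
    # when the probabilistic duality constraint holds for this pair.
    return (log_px + log_p_y_given_x) - (log_py + log_p_x_given_y)

# Toy values standing in for marginal and model probabilities.
gap = duality_gap(math.log(1e-4), math.log(1e-6),
                  math.log(0.9), math.log(0.02))
print(f"duality gap: {gap:.4f}")         # penalized via the Lagrangian term
```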
Optionally, combining the objective function and the constraint, the Lagrange multiplier optimization method transforms the original objective function F(θ_xy, θ_yx) into F̃(θ_xy, θ_yx), expressed as follows:

F̃(θ_xy, θ_yx) = F(θ_xy, θ_yx) + λ Σ_{i=1}^{N} [ log P(x^(i)) + log P_f(y^(i) | x^(i); θ_xy) − log P(y^(i)) − log P_g(x^(i) | y^(i); θ_yx) ]
where λ is the Lagrange multiplier. Gradient-based iterative optimization is applied to θ_xy and θ_yx: the gradients of F̃(θ_xy, θ_yx) with respect to θ_xy and θ_yx, denoted ∇_θxy F̃(θ_xy, θ_yx) and ∇_θyx F̃(θ_xy, θ_yx), are computed, and θ_xy and θ_yx are updated along them, for example

θ_xy ← θ_xy + η ∇_θxy F̃(θ_xy, θ_yx),  θ_yx ← θ_yx + η ∇_θyx F̃(θ_xy, θ_yx),

with learning rate η, iterating until the objective function converges or a specified stopping condition is reached.
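A minimal sketch of this constrained optimization loop, assuming the recognition and synthesis models expose differentiable log likelihoods and that fixed marginals log P(x), log P(y) are available; it penalizes the squared duality gap with a fixed weight λ as a smooth surrogate for the equality constraint and uses plain gradient steps, all of which are simplifying assumptions.

```python
import torch

def dual_learning_step(rec_loglik, syn_loglik, log_px, log_py,
                       params, lam=0.01, lr=1e-3):
    # One iterative update of θ_xy and θ_yx on a batch of labeled pairs.
    # rec_loglik, syn_loglik: log P_f(y|x;θ_xy) and log P_g(x|y;θ_yx);
    # log_px, log_py: fixed marginal log-probabilities of x and y.
    gap = (log_px + rec_loglik) - (log_py + syn_loglik)
    objective = (rec_loglik + syn_loglik).sum() - lam * (gap ** 2).sum()
    loss = -objective                    # maximize F̃ by descending -F̃
    for p in params:
        p.grad = None
    loss.backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad             # gradient step on θ_xy, θ_yx
    return objective.item()

# Toy stand-ins: scalar "parameters" and likelihoods depending on them.
theta_xy = torch.zeros(1, requires_grad=True)
theta_yx = torch.zeros(1, requires_grad=True)
for step in range(100):
    rec = -((theta_xy - 1.0) ** 2)       # peaks where θ_xy = 1
    syn = -((theta_yx + 0.5) ** 2)       # peaks where θ_yx = -0.5
    obj = dual_learning_step(rec, syn,
                             torch.tensor(-9.0), torch.tensor(-11.0),
                             [theta_xy, theta_yx])
print(round(obj, 3), theta_xy.item(), theta_yx.item())
```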
In this embodiment of the application, the text data ŷ^(i) is generated from the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to x^(i); the sound feature sequence corresponding to y^(i) is obtained and the speech data x̂^(i) is generated; and, for the N pairs of labeled data, the log likelihood that the text data ŷ^(i) equals the text data y^(i) and the log likelihood that the speech data x̂^(i) equals the speech data x^(i) are maximized as the objective, with the probabilistic duality of speech recognition and speech synthesis as the constraint, thereby optimizing the speech recognition and speech synthesis results. Dual learning is thus used effectively for speech recognition and speech synthesis, improving the training speed of speech recognition and speech generation as well as the accuracy of their output results.
An embodiment of this application further provides a dual-learning-based speech recognition and speech synthesis apparatus, which achieves the beneficial effects of the dual-learning-based speech recognition and speech synthesis method described above. The functions of the apparatus may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes at least one module corresponding to the above functions.
Referring to FIG. 3, FIG. 3 is a structural block diagram of a dual-learning-based speech recognition and speech synthesis apparatus 300 provided by an embodiment of this application. The apparatus includes an initialization unit 301, a selection unit 302, a processing unit 303, a first generation unit 304, a second generation unit 305, and an optimization unit 306.
The initialization unit 301 is configured to initialize a labeled data set Φ_(x,y), a speech recognition parameter θ_xy, a speech synthesis parameter θ_yx, and a training data size N, where Φ_(x,y) = {(x^(j), y^(j))}_K, the labeled data set Φ_(x,y) contains K pairs of labeled data, (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), x^(j) is the speech data in the j-th pair, and y^(j) is the text data in the j-th pair.
The selection unit 302 is configured to select N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y), where K is a positive integer and N is a positive integer less than or equal to K.
The processing unit 303 is configured to extract the acoustic features of x^(i) and to obtain, from the acoustic features of x^(i), the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i).
The first generation unit 304 is configured to generate text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), and to compute the first log likelihood that ŷ^(i) equals y^(i).
The second generation unit 305 is configured to obtain the sound feature sequence corresponding to y^(i), to generate speech data x̂^(i) from the sound sequence features, and to compute the second log likelihood that x̂^(i) equals x^(i).
The optimization unit 306 is configured to, for the N pairs of labeled data, take maximizing the first log likelihood and the second log likelihood as the objective function, take the probabilistic duality of speech recognition and speech synthesis as the constraint, and optimize θ_xy and θ_yx.
Optionally, before the selection unit 302 selects the N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y), the apparatus further includes a pre-training unit, configured to randomly select S pairs of labeled data from Φ_(x,y), to pre-train the first speech recognition model to obtain the second speech recognition model, and to pre-train the first speech synthesis model to obtain the second speech synthesis model, where the second speech recognition model includes a deep neural network and a hidden Markov model, and the second speech synthesis model includes an encoder, a decoder, and a neural vocoder.
Optionally, the pre-training unit being used to pre-train the first speech recognition model to be trained to obtain the pre-trained second speech recognition model specifically includes:
an input unit, configured to input the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained;
a preprocessing unit, configured to preprocess the speech data x^(r) to obtain the frequency cepstral coefficient features corresponding to x^(r); and
the pre-training unit, further configured to pre-train the deep neural network-Gaussian mixture model of the first speech recognition model with the frequency cepstral coefficient features to obtain the second speech recognition model, which includes the trained deep neural network-Gaussian mixture model.
Optionally, the pre-training unit being used to pre-train the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model specifically includes:
the input unit, further configured to input the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech synthesis model to be trained; and
the pre-training unit, further configured to pre-train the encoder, the decoder, and the neural vocoder of the first speech synthesis model with the text data y^(r) to obtain the second speech synthesis model, which includes the trained encoder, the trained decoder, and the trained neural vocoder.
Optionally, the encoder, the decoder, and the neural vocoder all adopt recurrent neural network models.
Optionally, the processing unit 303 includes an extraction unit and an acquisition unit.
The extraction unit is configured to input the speech data x^(i) into the second speech recognition model and to extract the acoustic features of x^(i) frame by frame.
The acquisition unit is configured to input the acoustic features of x^(i) into the deep neural network in the second speech recognition model to obtain the posterior probabilities of the phonemes corresponding to x^(i), and to obtain the transition probabilities of the phonemes corresponding to x^(i) through the hidden Markov model in the second speech recognition model.
Optionally, the first generation unit 304 generating the text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to x^(i) specifically includes:
a determination unit, configured to determine the probabilities of w network paths from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), where w is a positive integer greater than zero; and
the acquisition unit, further configured to obtain the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths.
Optionally, the second generation unit 305 is specifically configured to: input y^(i) into the encoder of the second speech synthesis model to generate a semantic sequence; input the semantic sequence into the decoder of the second speech synthesis model to generate a sound feature sequence; input the sound sequence features into the neural vocoder of the second speech synthesis model to generate the speech data x̂^(i); and compute the second log likelihood that x̂^(i) equals x^(i).
Optionally, the optimization unit 306 is specifically configured to: take maximizing the first log likelihood and the second log likelihood as the objective function, take the probabilistic duality of speech recognition and speech synthesis as the constraint, combine the objective function and the constraint, and iteratively optimize θ_xy and θ_yx using the Lagrange multiplier optimization method.
The steps of the methods or algorithms described in connection with the disclosure of the embodiments of this application may be implemented in hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be a component of the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a network device. Of course, the processor and the storage medium may also exist in the network device as discrete components.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the embodiments of this application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in, or transmitted as one or more instructions or code on, a computer non-volatile readable medium. Computer non-volatile readable media include computer non-volatile storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium accessible to a general-purpose or special-purpose computer.
The specific implementations described above further describe the objectives, technical solutions, and beneficial effects of the embodiments of this application in detail. It should be understood that the above are only specific implementations of the embodiments of this application and are not intended to limit the protection scope of the embodiments of this application; any modification, equivalent replacement, improvement, or the like made on the basis of the technical solutions of the embodiments of this application shall be included within the protection scope of the embodiments of this application.

Claims (20)

  1. A dual-learning-based speech recognition and speech synthesis method, wherein the method comprises:
    initializing a labeled data set Φ_(x,y), a speech recognition parameter θ_xy, a speech synthesis parameter θ_yx, and a training data size N, wherein Φ_(x,y) = {(x^(j), y^(j))}_K, the labeled data set Φ_(x,y) contains K pairs of labeled data, (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), x^(j) is the speech data in the j-th pair of labeled data, y^(j) is the text data in the j-th pair of labeled data, K is a positive integer, and N is a positive integer less than or equal to K;
    selecting N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y);
    extracting acoustic features of the speech data x^(i), and obtaining, from the acoustic features of x^(i), posterior probabilities of phonemes corresponding to x^(i) and transition probabilities of the phonemes corresponding to x^(i);
    generating text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), and computing a first log likelihood that the text data ŷ^(i) equals the text data y^(i);
    obtaining a sound feature sequence corresponding to the text data y^(i), generating speech data x̂^(i) from the sound sequence features, and computing a second log likelihood that the speech data x̂^(i) equals the speech data x^(i); and
    for the N pairs of labeled data, taking maximizing the first log likelihood and the second log likelihood as an objective function, taking probabilistic duality of speech recognition and speech synthesis as a constraint, and optimizing θ_xy and θ_yx.
  2. The method according to claim 1, wherein before the randomly selecting N pairs of labeled data (x^(i), y^(i)) from the labeled data set Φ_(x,y), the method further comprises:
    randomly selecting S pairs of labeled data from the labeled data set Φ_(x,y), pre-training a first speech recognition model to be trained to obtain a pre-trained second speech recognition model, and pre-training a first speech synthesis model to be trained to obtain a pre-trained second speech synthesis model, wherein the second speech recognition model comprises a deep neural network-hidden Markov model, the second speech synthesis model comprises an encoder, a decoder, and a neural vocoder, and S is a positive integer less than or equal to K.
  3. The method according to claim 2, wherein the pre-training the first speech recognition model to be trained to obtain the pre-trained second speech recognition model comprises: inputting the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained;
    preprocessing the speech data x^(r) to obtain frequency cepstral coefficient features corresponding to x^(r); and
    pre-training a deep neural network-Gaussian mixture model of the first speech recognition model with the frequency cepstral coefficient features to obtain the second speech recognition model, wherein the second speech recognition model comprises the trained deep neural network-Gaussian mixture model.
  4. The method according to claim 2 or 3, wherein the pre-training the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model comprises:
    inputting the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech synthesis model to be trained; and
    pre-training the encoder, the decoder, and the neural vocoder of the first speech synthesis model with the text data y^(r) to obtain the second speech synthesis model, wherein the second speech synthesis model comprises the trained encoder, the trained decoder, and the trained neural vocoder.
  5. The method according to any one of claims 2 to 4, wherein the encoder, the decoder, and the neural vocoder all adopt recurrent neural network models.
  6. The method according to any one of claims 2 to 5, wherein the extracting the acoustic features of the speech data x^(i) and obtaining the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i) comprises:
    inputting the speech data x^(i) into the second speech recognition model, extracting the acoustic features of x^(i) frame by frame, inputting the acoustic features of x^(i) into the deep neural network in the second speech recognition model to obtain the posterior probabilities of the phonemes corresponding to x^(i), and inputting the phonemes corresponding to x^(i) into the hidden Markov model in the second speech recognition model to obtain the transition probabilities of the phonemes corresponding to x^(i).
  7. The method according to any one of claims 2 to 6, wherein the generating the text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to x^(i) comprises:
    determining probabilities of w network paths from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), wherein w is a positive integer greater than zero; and
    obtaining the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths.
  8. The method according to any one of claims 2 to 7, wherein the obtaining the sound feature sequence corresponding to the text data y^(i) and generating the speech data x̂^(i) from the sound sequence features comprises:
    inputting the text data y^(i) into the encoder of the second speech synthesis model to generate a semantic sequence; and
    inputting the semantic sequence into the decoder of the second speech synthesis model to generate a sound feature sequence, and inputting the sound sequence features into the neural vocoder of the second speech synthesis model to generate the speech data x̂^(i).
  9. The method according to any one of claims 2 to 8, wherein the taking maximizing the first log likelihood and the second log likelihood as the objective function, taking the probabilistic duality of speech recognition and speech synthesis as the constraint, and optimizing θ_xy and θ_yx comprises:
    taking maximizing the first log likelihood and the second log likelihood as the objective function, taking the probabilistic duality of speech recognition and speech synthesis as the constraint, combining the objective function and the constraint, and iteratively optimizing θ_xy and θ_yx using a Lagrange multiplier optimization method.
  10. A dual-learning-based speech recognition and speech synthesis apparatus, wherein the apparatus comprises:
    an initialization unit, configured to initialize a labeled data set Φ_(x,y), a speech recognition parameter θ_xy, a speech synthesis parameter θ_yx, and a training data size N, wherein Φ_(x,y) = {(x^(j), y^(j))}_K, (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), the labeled data set Φ_(x,y) contains K pairs of labeled data, x^(j) is the speech data in the j-th pair of labeled data, y^(j) is the text data in the j-th pair of labeled data, K is a positive integer, and N is a positive integer less than or equal to K;
    a selection unit, configured to select N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y);
    a processing unit, configured to extract acoustic features of the speech data x^(i), and to obtain, from the acoustic features of x^(i), posterior probabilities of phonemes corresponding to x^(i) and transition probabilities of the phonemes corresponding to x^(i);
    a first generation unit, configured to generate text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), and to compute a first log likelihood that the text data ŷ^(i) equals the text data y^(i);
    a second generation unit, configured to obtain a sound feature sequence corresponding to the text data y^(i), to generate speech data x̂^(i) from the sound sequence features, and to compute a second log likelihood that the speech data x̂^(i) equals the speech data x^(i); and
    an optimization unit, configured to, for the N pairs of labeled data, take maximizing the first log likelihood and the second log likelihood as an objective function, take probabilistic duality of speech recognition and speech synthesis as a constraint, and optimize θ_xy and θ_yx.
  11. The apparatus according to claim 10, wherein before the selection unit selects the N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ(x,y), the apparatus further comprises:
    a pre-training unit, configured to randomly select S pairs of labeled data from the labeled data set Φ(x,y), to pre-train a first speech recognition model to be trained so as to obtain a pre-trained second speech recognition model, and to pre-train a first speech synthesis model to be trained so as to obtain a pre-trained second speech synthesis model, wherein the second speech recognition model comprises a deep neural network and a hidden Markov model, the second speech synthesis model comprises an encoder, a decoder and a neural vocoder, and S is a positive integer less than or equal to K.
  12. The apparatus according to claim 11, wherein, for pre-training the first speech recognition model to be trained to obtain the pre-trained second speech recognition model, the apparatus specifically comprises:
    an input unit, configured to input the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained; and
    a preprocessing unit, configured to preprocess the speech data x^(r) to obtain the frequency cepstral coefficient features corresponding to the speech data x^(r);
    wherein the pre-training unit is further configured to pre-train the deep neural network-Gaussian mixture model of the first speech recognition model by using the frequency cepstral coefficient features to obtain the second speech recognition model, and the second speech recognition model comprises the trained deep neural network-Gaussian mixture model.
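    A sketch of the preprocessing unit of claim 12, under the assumption that the "frequency cepstral coefficient features" are standard MFCCs; the librosa calls are real, but the sample rate, coefficient count and frame geometry are illustrative choices, not taken from the application.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Preprocess speech data x(r) into MFCC features, one row per frame."""
    wave, sr = librosa.load(wav_path, sr=sr)
    # 25 ms analysis windows with a 10 ms hop, a common ASR front end.
    mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    return mfcc.T
```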
  13. The apparatus according to claim 11 or 12, wherein, for pre-training the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model:
    the input unit is further configured to input the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech synthesis model to be trained; and
    the pre-training unit is further configured to pre-train the encoder, the decoder and the neural vocoder of the first speech synthesis model by using the text data y^(r) to obtain the second speech synthesis model, wherein the second speech synthesis model comprises the trained encoder, the trained decoder and the trained neural vocoder.
  14. The apparatus according to any one of claims 11 to 13, wherein the encoder, the decoder and the neural vocoder each adopt a recurrent neural network model.
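    Claim 14 fixes only the architecture family; a minimal PyTorch sketch of a recurrent encoder, decoder and vocoder might look as follows. All layer sizes, the GRU cell type, and the one-sample-per-frame "vocoder" are assumptions made here for brevity.

```python
import torch.nn as nn

class RNNSynthesizer(nn.Module):
    """Encoder, decoder and neural vocoder, each a recurrent network."""
    def __init__(self, vocab=5000, emb=256, hid=512, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)     # text -> semantic sequence
        self.decoder = nn.GRU(hid, n_mels, batch_first=True)  # semantic -> acoustic features
        self.vocoder = nn.GRU(n_mels, 1, batch_first=True)    # features -> waveform samples

    def forward(self, text_ids):
        h, _ = self.encoder(self.embed(text_ids))  # semantic sequence h(i)
        feats, _ = self.decoder(h)                 # acoustic feature sequence
        wave, _ = self.vocoder(feats)              # toy vocoder: one sample per frame;
        return wave.squeeze(-1)                    # a real one would upsample to audio rate
```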
  15. The apparatus according to any one of claims 11 to 14, wherein the processing unit comprises:
    an extraction unit, configured to input the speech data x^(i) into the second speech recognition model and to extract the acoustic features of the speech data x^(i) frame by frame; and
    an obtaining unit, configured to input the acoustic features of the speech data x^(i) into the deep neural network in the second speech recognition model to obtain the posterior probabilities of the phonemes corresponding to the speech data x^(i), and to input the phonemes corresponding to the speech data x^(i) into the hidden Markov model in the second speech recognition model to obtain the transition probabilities of the phonemes corresponding to the speech data x^(i).
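    A rough numpy sketch of the two probability sources in claim 15, assuming the deep neural network emits per-frame phoneme scores and the hidden Markov model contributes a fixed transition matrix; the shapes and names are hypothetical.

```python
import numpy as np

def phoneme_probabilities(dnn_logits, hmm_transitions):
    """dnn_logits: (frames, n_phones); hmm_transitions: (n_phones, n_phones)."""
    # Posterior probability of each phoneme, frame by frame (softmax).
    e = np.exp(dnn_logits - dnn_logits.max(axis=1, keepdims=True))
    posteriors = e / e.sum(axis=1, keepdims=True)

    # Transition probabilities between the top phonemes of successive frames.
    best = posteriors.argmax(axis=1)
    transitions = hmm_transitions[best[:-1], best[1:]]
    return posteriors, transitions
```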
  16. The apparatus according to any one of claims 11 to 15, wherein, for generating the text data ŷ^(i) according to the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to the speech data x^(i), the first generating unit specifically comprises:
    a determining unit, configured to determine the probabilities of w network paths according to the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to the speech data x^(i), wherein w is a positive integer greater than zero;
    wherein the obtaining unit is further configured to obtain the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths.
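    The highest-probability path of claim 16 is conventionally found with Viterbi decoding; here is a compact log-domain sketch under the same assumed shapes as above (the mapping from the winning phoneme path back to text ŷ(i) via a lexicon is omitted).

```python
import numpy as np

def viterbi(log_post, log_trans):
    """log_post: (frames, n_phones) log posteriors; log_trans: (n_phones, n_phones)."""
    T, S = log_post.shape
    score = log_post[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans   # extend every previous state
        back[t] = cand.argmax(axis=0)       # best predecessor per state
        score = cand.max(axis=0) + log_post[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):           # backtrace the winning path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```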
  17. The apparatus according to any one of claims 11 to 16, wherein, for obtaining the acoustic feature sequence corresponding to the text data y^(i) and generating the speech data x̂^(i) according to the acoustic feature sequence, the second generating unit is specifically configured to:
    input the text data y^(i) into the encoder of the second speech synthesis model to generate a semantic sequence h^(i); and
    input the semantic sequence h^(i) into the decoder of the second speech synthesis model to generate an acoustic feature sequence, and input the acoustic feature sequence into the neural vocoder of the second speech synthesis model to generate the speech data x̂^(i).
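    Using the hypothetical `RNNSynthesizer` sketched under claim 14, the encoder-decoder-vocoder chain of claim 17 is simply its forward pass:

```python
import torch

model = RNNSynthesizer()
text_ids = torch.randint(0, 5000, (1, 32))  # a toy token sequence standing in for y(i)
wave = model(text_ids)                      # generated speech data x_hat(i)
print(wave.shape)                           # torch.Size([1, 32]) in this toy setup
```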
  18. The apparatus according to any one of claims 11 to 17, wherein the optimization unit is specifically configured to:
    take the maximization of the first log-likelihood and the second log-likelihood as the objective function, take the probabilistic duality of speech recognition and speech synthesis as the constraint condition, combine the objective function with the constraint condition, and iteratively optimize the θ_xy and the θ_yx by means of a Lagrange multiplier optimization algorithm.
  19. A computer non-volatile readable storage medium, on which a computer program is stored, wherein, when the program is executed by a processor, the speech recognition and speech synthesis method based on dual learning according to any one of claims 1 to 9 is implemented.
  20. A server, comprising: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to perform the speech recognition and speech synthesis method based on dual learning according to any one of claims 1 to 9.
PCT/CN2019/117567 2019-02-22 2019-11-12 Speech recognition and speech synthesis method and apparatus based on dual learning WO2020168752A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910135575.7A CN109887484B (en) 2019-02-22 2019-02-22 Dual learning-based voice recognition and voice synthesis method and device
CN201910135575.7 2019-02-22

Publications (1)

Publication Number Publication Date
WO2020168752A1 2020-08-27

Family

ID=66929081

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117567 WO2020168752A1 (en) 2019-02-22 2019-11-12 Speech recognition and speech synthesis method and apparatus based on dual learning

Country Status (2)

Country Link
CN (1) CN109887484B (en)
WO (1) WO2020168752A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887484B (en) * 2019-02-22 2023-08-04 平安科技(深圳)有限公司 Dual learning-based voice recognition and voice synthesis method and device
CN113412514A (en) 2019-07-09 2021-09-17 谷歌有限责任公司 On-device speech synthesis of text segments for training of on-device speech recognition models
CN110765784A (en) * 2019-09-12 2020-02-07 内蒙古工业大学 Mongolian Chinese machine translation method based on dual learning
CN113314096A (en) * 2020-02-25 2021-08-27 阿里巴巴集团控股有限公司 Speech synthesis method, apparatus, device and storage medium
CN113495943B (en) * 2020-04-02 2023-07-14 山东大学 Man-machine dialogue method based on knowledge tracking and transferring
CN111583913B (en) * 2020-06-15 2020-11-03 深圳市友杰智新科技有限公司 Model training method and device for speech recognition and speech synthesis and computer equipment
CN111444731B (en) * 2020-06-15 2020-11-03 深圳市友杰智新科技有限公司 Model training method and device and computer equipment
CN111428867B (en) * 2020-06-15 2020-09-18 深圳市友杰智新科技有限公司 Model training method and device based on reversible separation convolution and computer equipment
CN112634919B (en) * 2020-12-18 2024-05-28 平安科技(深圳)有限公司 Voice conversion method, device, computer equipment and storage medium
CN112599116B (en) * 2020-12-25 2022-07-08 思必驰科技股份有限公司 Speech recognition model training method and speech recognition federal training system
CN113160793A (en) * 2021-04-23 2021-07-23 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium based on low resource language
CN113284484B (en) * 2021-05-24 2022-07-26 百度在线网络技术(北京)有限公司 Model training method and device, voice recognition method and voice synthesis method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110218804A1 (en) * 2010-03-02 2011-09-08 Kabushiki Kaisha Toshiba Speech processor, a speech processing method and a method of training a speech processor
CN105810191A (en) * 2016-03-08 2016-07-27 江苏信息职业技术学院 Prosodic information-combined Chinese dialect identification method
CN108133705A (en) * 2017-12-21 2018-06-08 儒安科技有限公司 Speech recognition and phonetic synthesis model training method based on paired-associate learning
CN108847249A (en) * 2018-05-30 2018-11-20 苏州思必驰信息科技有限公司 Sound converts optimization method and system
CN109887484A (en) * 2019-02-22 2019-06-14 平安科技(深圳)有限公司 A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6343267B1 (en) * 1998-04-30 2002-01-29 Matsushita Electric Industrial Co., Ltd. Dimensionality reduction for speaker normalization and speaker and environment adaptation using eigenvoice techniques
CN101894548B (en) * 2010-06-23 2012-07-04 清华大学 Modeling method and modeling device for language identification
US20120158398A1 (en) * 2010-12-17 2012-06-21 John Denero Combining Model-Based Aligner Using Dual Decomposition
CN104464744A (en) * 2014-11-19 2015-03-25 河海大学常州校区 Cluster voice transforming method and system based on mixture Gaussian random process
CN105760852B (en) * 2016-03-14 2019-03-05 江苏大学 A kind of driver's emotion real-time identification method merging countenance and voice
CN105976812B (en) * 2016-04-28 2019-04-26 腾讯科技(深圳)有限公司 A kind of audio recognition method and its equipment
CN107633842B (en) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107331384B (en) * 2017-06-12 2018-05-04 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107680582B (en) * 2017-07-28 2021-03-26 平安科技(深圳)有限公司 Acoustic model training method, voice recognition method, device, equipment and medium
CN108369813B (en) * 2017-07-31 2022-10-25 深圳和而泰智能家居科技有限公司 Specific voice recognition method, apparatus and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571064A (en) * 2021-07-07 2021-10-29 肇庆小鹏新能源投资有限公司 Natural language understanding method and device, vehicle and medium
CN113793591A (en) * 2021-07-07 2021-12-14 科大讯飞股份有限公司 Speech synthesis method and related device, electronic equipment and storage medium
CN113571064B (en) * 2021-07-07 2024-01-30 肇庆小鹏新能源投资有限公司 Natural language understanding method and device, vehicle and medium
CN113793591B (en) * 2021-07-07 2024-05-31 科大讯飞股份有限公司 Speech synthesis method, related device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109887484A (en) 2019-06-14
CN109887484B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
WO2020168752A1 (en) Speech recognition and speech synthesis method and apparatus based on dual learning
US11468244B2 (en) Large-scale multilingual speech recognition with a streaming end-to-end model
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
Kheddar et al. Deep transfer learning for automatic speech recognition: Towards better generalization
WO2023160472A1 (en) Model training method and related device
JP2004362584A (en) Discrimination training of language model for classifying text and sound
JP2019159654A (en) Time-series information learning system, method, and neural network model
KR20230147685A (en) Word-level reliability learning for subword end-to-end automatic speech recognition
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
US10096317B2 (en) Hierarchical speech recognition decoder
CN112599128A (en) Voice recognition method, device, equipment and storage medium
US20230377564A1 (en) Proper noun recognition in end-to-end speech recognition
Ahmed et al. End-to-end lexicon free arabic speech recognition using recurrent neural networks
US20230104228A1 (en) Joint Unsupervised and Supervised Training for Multilingual ASR
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
Abdelmaksoud et al. Convolutional neural network for arabic speech recognition
US20220310080A1 (en) Multi-Task Learning for End-To-End Automated Speech Recognition Confidence and Deletion Estimation
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
EP4315319A1 (en) Supervised and unsupervised training with contrastive loss over sequences
Mehra et al. Deep fusion framework for speech command recognition using acoustic and linguistic features
WO2021174922A1 (en) Statement sentiment classification method and related device
WO2023116572A1 (en) Word or sentence generation method and related device
US20220310097A1 (en) Reducing Streaming ASR Model Delay With Self Alignment
KR20240065125A (en) Large-scale language model data selection for rare word speech recognition.
CN112951270B (en) Voice fluency detection method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19915605

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19915605

Country of ref document: EP

Kind code of ref document: A1