WO2020168752A1 - Speech recognition and speech synthesis method and apparatus based on dual learning - Google Patents

Speech recognition and speech synthesis method and apparatus based on dual learning Download PDF

Info

Publication number
WO2020168752A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
speech
voice
model
speech recognition
Application number
PCT/CN2019/117567
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
程宁
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020168752A1 publication Critical patent/WO2020168752A1/en

Classifications

    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • Y02T10/40 Engine management systems

Definitions

  • This application relates to the field of speech processing technology, and in particular to a method and device for speech recognition and speech synthesis based on dual learning.
  • The embodiments of the application provide a method and device for speech recognition and speech synthesis based on dual learning, which can effectively use dual learning for speech recognition and speech synthesis, improve the training speed of speech recognition and speech generation, and improve the accuracy of their output results.
  • The embodiment of the present application provides a speech recognition and speech synthesis method based on dual learning. The method includes the following steps:
  • Initialize the labeled data set Φ_(x,y), the speech recognition parameter θ_xy, the speech synthesis parameter θ_yx, and the training data size N, where Φ_(x,y) = {(x^(j), y^(j))}_K; the labeled data set Φ_(x,y) contains K pairs of labeled data; (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), x^(j) being the voice data and y^(j) the text data of that pair; K is a positive integer and N is a positive integer less than or equal to K. The subsequent steps, detailed below, select N pairs from Φ_(x,y), generate text from speech and speech from text, and jointly optimize θ_xy and θ_yx; a data-handling sketch follows this item.
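  • As a minimal sketch of the data handling this step describes (the container types, the scalar stand-ins for θ_xy and θ_yx, and all names are illustrative assumptions, not the patent's implementation):

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DualParams:
    theta_xy: float = 1.0  # speech recognition parameter; the text later initializes both to 1
    theta_yx: float = 1.0  # speech synthesis parameter

def init_training(labeled_set: List[Tuple[list, str]], n: int):
    """Initialize parameters and sample N of the K labeled (voice, text) pairs."""
    k = len(labeled_set)
    assert 0 < n <= k, "N must be a positive integer less than or equal to K"
    return DualParams(), random.sample(labeled_set, n)

# Toy usage: x^(j) is a dummy waveform (list of samples), y^(j) its transcript.
toy_set = [([0.0, 0.1, -0.1], "hello"), ([0.2, 0.0, 0.3], "world")]
params, batch = init_training(toy_set, n=1)
```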
  • The embodiment of the present application also provides a device for speech recognition and speech synthesis based on dual learning, which can realize the beneficial effects of the aforementioned method. The function of the device can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above-mentioned functions. Optionally, the device includes an initialization unit, a selection unit, a processing unit, a first generation unit, a second generation unit, and an optimization unit.
  • The selection unit is used to select N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y).
  • The processing unit is used to extract the acoustic features of the voice data x^(i) and, according to those acoustic features, obtain the posterior probability of the phonemes corresponding to x^(i) and the transition probability of those phonemes.
  • The first generation unit is used to generate text data ŷ^(i) according to the posterior probability and the transition probability of the phonemes corresponding to x^(i), and to calculate the first log likelihood that ŷ^(i) equals the text data y^(i).
  • The second generation unit is used to obtain the sound feature sequence corresponding to the text data y^(i), generate voice data x̂^(i) according to the sound sequence features, and calculate the second log likelihood that x̂^(i) equals the voice data x^(i).
  • The optimization unit is used to take maximizing the first log likelihood and the second log likelihood as the objective function for the N pairs of labeled data, take the probabilistic duality of speech recognition and speech synthesis as the constraint condition, and optimize θ_xy and θ_yx.
  • The embodiment of the present application also provides a server that can realize the beneficial effects of the above-mentioned speech recognition and speech synthesis method based on dual learning. The function of the server can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above-mentioned functions. The server includes a memory, a processor, and a transceiver: the memory is used to store a computer program, including program instructions, that supports the server in executing the above method; the processor is used to control and manage the actions of the server according to the program instructions; and the transceiver is used to support communication between the server and other communication devices.
  • The embodiment of the present application also provides a computer non-volatile readable storage medium that stores instructions which, when run on a processor, cause the processor to execute the aforementioned dual learning-based speech recognition and speech synthesis method.
  • In the embodiments of the present application, the posterior probability and the transition probability of the phonemes corresponding to the voice data x^(i) are obtained to generate text data ŷ^(i), and the sound feature sequence corresponding to y^(i) is obtained to generate voice data x̂^(i). For the N pairs of labeled data, the goal is to maximize the log likelihood that ŷ^(i) equals y^(i) and the log likelihood that x̂^(i) equals x^(i), with the probabilistic duality of speech recognition and speech synthesis as a constraint condition, thereby optimizing the speech recognition and speech synthesis effect. Dual learning is thus used effectively for speech recognition and speech synthesis, improving the training speed of speech recognition and speech generation and the accuracy of their output results.
  • FIG. 1 is a schematic structural diagram of a server provided by an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a method for speech recognition and speech synthesis based on dual learning provided by an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of a speech recognition and speech synthesis device based on dual learning provided by an embodiment of the present application.
  • Dual learning is a learning scheme that uses the duality between a pair of dual tasks to establish a feedback signal and uses this signal to constrain training. Duality is widespread in artificial intelligence tasks. For example, machine translation lets machines translate natural language from one language to another; Chinese-to-English and English-to-Chinese translation are dual tasks. Image recognition and image synthesis are likewise dual tasks: image recognition determines the category and specific information of a given picture, while image generation produces a corresponding picture given a category and specific information. Similarly, speech recognition and speech synthesis are dual tasks. Speech recognition is a technology that lets machines convert speech signals into corresponding text or commands through a process of recognition and understanding; speech synthesis is a technology that converts text, generated by the computer itself or input from outside, into speech by mechanical and electronic means.
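  • The shared structure can be stated compactly. For a recognition model f with parameters θ_xy and a synthesis model g with parameters θ_yx, probabilistic duality (formalized here from the description; it reappears as the constraint used later in this document) requires both conditional factorizations to describe the same joint distribution:

```latex
P(x)\,P(y \mid x; \theta_{xy}) \;=\; P(x, y) \;=\; P(y)\,P(x \mid y; \theta_{yx})
```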
  • Speech recognition has a very wide range of applications. Common application systems include: voice input systems, which are more natural and efficient than keyboard input; voice control systems, which use voice to control the operation of a device and are faster and more convenient than manual control; and intelligent dialogue query systems, which operate according to the customer's voice and provide users with natural, friendly database retrieval services. Speech synthesis technology also has wide application in daily life, such as electronic reading, in-car voice navigation, bank and hospital queue-number systems, traffic announcements, and so on.
  • The speech recognition and speech synthesis method based on dual learning provided by the embodiments of this application can be applied to network devices with speech recognition and speech synthesis functions, such as terminal devices, servers, and in-vehicle network devices. The aforementioned terminal devices include smartphones, smart bracelets, e-reading devices, notebooks, and tablets; this application does not specifically limit this. The following takes a server as an example to introduce in detail the functions of the application device of the above-mentioned method.
  • FIG. 1 is a schematic diagram of the hardware structure of a server 100 provided by an embodiment of the application.
  • The server 100 includes a memory 101, a transceiver 102, and a processor 103 coupled to the memory 101 and the transceiver 102. The memory 101 is configured to store a computer program, which includes program instructions; the processor 103 is configured to execute the program instructions stored in the memory 101; and the transceiver 102 is configured to communicate with other devices under the control of the processor 103. When executing the instructions, the processor 103 can perform the speech recognition and speech synthesis method based on dual learning according to the program instructions.
  • The processor 103 may be a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of the embodiments of the present application.
  • The processor may also be a combination that implements computing functions, for example one or more microprocessors, or a DSP combined with a microprocessor.
  • The transceiver 102 may be a communication interface, a transceiver circuit, and so on; "communication interface" is a general term that may include one or more interfaces, such as an interface between the server and a terminal. The server 100 may further include a bus 104, through which the memory 101, the transceiver 102, and the processor 103 may be connected to one another. The bus 104 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 1, but this does not mean that there is only one bus or one type of bus. The server 100 in this embodiment may also include other hardware according to its actual function, which will not be described again here.
  • The embodiment of the present application provides a method for speech recognition and speech synthesis based on dual learning, as shown in FIG. 2. The method includes:
  • Initializing the labeled data set Φ_(x,y), the speech recognition parameter θ_xy, the speech synthesis parameter θ_yx, and the training data size N. The labeled data set Φ_(x,y) contains K pairs of labeled data; (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), where x^(j) is the voice data and y^(j) is the text data; K is a positive integer and N is a positive integer less than or equal to K. The training data size N is the number of labeled pairs in Φ_(x,y) that participate in the dual learning-based speech recognition and speech synthesis optimization training. The speech recognition parameter θ_xy is a parameter that affects the speech recognition effect, and the speech synthesis parameter θ_yx is a parameter that affects the speech synthesis effect.
  • In some embodiments, the contents of the K pieces of voice data in the labeled data set Φ_(x,y) are all different, and their lengths may be the same or different. The voice data can come from TV news reports, daily conversations, meeting recordings, and so on; the source scenes of the K pieces of voice data can be the same or different, and this application does not specifically limit this. The speech recognition parameter θ_xy and the speech synthesis parameter θ_yx are randomly initialized; for example, the initial values of θ_xy and θ_yx are both set to 1.
  • Optionally, the above dual learning-based speech recognition and speech synthesis method further includes: randomly selecting S pairs of labeled data from the labeled data set Φ_(x,y), pre-training the first speech recognition model to be trained to obtain a pre-trained second speech recognition model, and pre-training the first speech synthesis model to be trained to obtain a pre-trained second speech synthesis model. The second speech recognition model includes a Deep Neural Network (DNN)-Hidden Markov Model (HMM); the second speech synthesis model includes an encoder, a decoder, and a neural vocoder; S is a positive integer less than or equal to K.
  • The aforementioned dual learning-based speech recognition and synthesis device pre-trains the first speech recognition model to be trained to obtain the pre-trained second speech recognition model as follows: the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S is input into the first speech recognition model to be trained; the speech data x^(r) is preprocessed to obtain the frequency cepstral coefficient features corresponding to x^(r); and these features are used to pre-train the DNN-HMM model of the first speech recognition model, yielding a second speech recognition model that includes the trained DNN-HMM model.
  • Specifically, the dual learning-based speech recognition and synthesis device inputs the speech data x^(r) in the labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained, and preprocesses x^(r) to obtain its frequency cepstral coefficient features. The device then takes these frequency cepstral coefficient features as input data and trains an acoustic model composed of a Gaussian Mixture Model (GMM) and a Hidden Markov Model (HMM), obtaining the likelihood probability features of the phoneme states output by the pre-trained GMM and the transition probabilities of the phoneme states output by the pre-trained HMM. The device converts the likelihood probability features of the phoneme states into posterior probability features of the phoneme states through forced alignment and, according to the S pairs of labeled data and the posterior probability features of the phoneme states, obtains the matrix weight values and matrix offset values between the layer nodes of the Deep Neural Network (DNN) model, thereby generating the pre-trained DNN model. The second speech recognition model includes the aforementioned pre-trained DNN model and the aforementioned pre-trained HMM. A pipeline sketch follows this item.
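  • A condensed sketch of this pipeline, using hmmlearn and scikit-learn as stand-ins (the library choice, the data shapes, and the use of plain Viterbi decoding in place of true forced alignment are all assumptions for illustration, not the patent's implementation):

```python
# GMM-HMM acoustic model gives frame-level phoneme-state alignments,
# which then supervise a DNN that predicts state posteriors.
import numpy as np
from hmmlearn.hmm import GMMHMM
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 13))   # stand-in for MFCC frames of S utterances
lengths = [250, 250]                  # frames per utterance

# 1) Pre-train the GMM-HMM: EM fits state GMMs (likelihoods) and transitions.
ghmm = GMMHMM(n_components=5, n_mix=2, covariance_type="diag", n_iter=10)
ghmm.fit(feats, lengths)

# 2) "Forced alignment" stand-in: decode the most likely state per frame.
#    (Real forced alignment constrains decoding to the known transcript.)
states = ghmm.predict(feats, lengths)

# 3) Pre-train the DNN to map acoustic features to phoneme-state posteriors.
dnn = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=200)
dnn.fit(feats, states)
posteriors = dnn.predict_proba(feats)  # per-frame state posterior features
```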
  • The aforementioned dual learning-based speech recognition and synthesis device pre-trains the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model as follows: the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S is input into the first speech synthesis model to be trained, and the text data y^(r) is used to pre-train the encoder, decoder, and neural vocoder of the first speech synthesis model, yielding a second speech synthesis model that includes the trained encoder, the trained decoder, and the trained neural vocoder.
  • Specifically, the dual learning-based speech recognition and synthesis device randomly selects S pairs of labeled data {(x^(t), y^(t))}_S from the labeled data set Φ_(x,y) and uses them to pre-train the first speech synthesis model to be trained, obtaining the pre-trained second speech synthesis model. The device inputs the text data y^(t) into the first speech synthesis model to be trained. First, the encoder analyzes the text data to obtain the intermediate semantic vector corresponding to y^(t), which represents the semantics of the text. The device then inputs this intermediate semantic vector into the decoder to obtain the sound sequence features corresponding to y^(t). Finally, the sound sequence features are input into the neural vocoder, which outputs the speech data corresponding to y^(t). The aforementioned encoder, decoder, and neural vocoder all use a recurrent neural network (RNN) model, and the second speech synthesis model includes this encoder, decoder, and neural vocoder.
  • The GMM uses Gaussian probability density functions to quantify a distribution precisely: it is a model formed by decomposing a distribution into several weighted Gaussian probability density functions. The HMM is a probabilistic model for time series: it describes the process of randomly generating an unobservable random sequence of states from a hidden Markov chain and then generating an observation from each state, producing an observable random sequence.
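  • In standard notation (a textbook statement of the two definitions above, not a formula quoted from the patent):

```latex
% GMM: a density decomposed into M weighted Gaussian components
p(\mathbf{o}) = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(\mathbf{o} \mid \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m), \qquad \textstyle\sum_{m} \pi_m = 1
% HMM: hidden states s_t generated by a Markov chain, each emitting an observation o_t
P(\mathbf{o}_{1:T}, s_{1:T}) = \pi_{s_1}\, b_{s_1}(\mathbf{o}_1) \prod_{t=2}^{T} a_{s_{t-1} s_t}\, b_{s_t}(\mathbf{o}_t)
```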
  • The synthesis device extracts the acoustic features of the voice data x^(i) and, according to those acoustic features, obtains the posterior probability of the phonemes corresponding to x^(i) and the transition probability of those phonemes. Specifically, the dual learning-based speech recognition and synthesis device inputs the voice data x^(i) into the second speech recognition model, filters out unimportant information and background noise, and divides x^(i) into multiple frames of speech signal. Each frame of the speech signal is analyzed and processed, and the filter bank features of each frame corresponding to x^(i) are extracted as the acoustic features of x^(i).
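  • A self-contained sketch of this framing and filter-bank extraction, using numpy only (the frame length, hop, FFT size, and mel filter count are illustrative choices, not values from the patent):

```python
import numpy as np

def mel_filterbank_feats(signal, sr=16000, frame_len=400, hop=160,
                         n_fft=512, n_mels=40):
    # Split the waveform into overlapping frames and window them.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # per-frame power spectrum

    # Triangular mel filters spaced evenly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fbank.T + 1e-10)  # log filter-bank features

feats = mel_filterbank_feats(np.random.default_rng(0).normal(size=16000))
print(feats.shape)  # (n_frames, n_mels)
```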
  • The dual learning-based speech recognition and synthesis device inputs the acoustic features of the voice data x^(i) into the DNN model in the second speech recognition model to obtain the posterior probability of the phonemes corresponding to x^(i) output by the DNN model, and inputs the phonemes corresponding to x^(i) into the HMM in the second speech recognition model to obtain the transition probability of those phonemes. The phoneme transition probabilities output by the HMM include the probability of transitioning from a first phoneme state back to itself (a self-loop) and the probability of transitioning from the first phoneme state to a second phoneme state, where the second phoneme state is the state following the first.
  • The dual learning-based device determines w network paths and their probabilities according to the posterior probability of the phonemes corresponding to the voice data x^(i) and the transition probability of those phonemes, and acquires the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths, where w is a positive integer greater than zero.
  • The HMM establishes a statistical model of the time-series structure of the speech signal and can be regarded as a doubly stochastic process: one process is a Markov chain with a finite number of states that models the implicit statistical characteristics of the speech signal; the other is the random process of the externally visible observation sequence associated with each state of the Markov chain. The HMM contains the following elements: hidden states, an observation sequence, the initial probability distribution of the hidden states, the transition probability matrix of the hidden states, and the emission probabilities of the observations. Given an observation sequence, that is, the acoustic features of the speech data, recognition amounts to finding the optimal state sequence corresponding to that observation sequence, thereby converting the speech into text. The phonemes serve as hidden nodes, the change process of the phonemes constitutes the HMM state sequence, and each phoneme generates an observation vector with a certain probability density function.
  • The probability of each state generating the observed values is calculated according to the HMM state transition probabilities of each word; if the joint probability of a word's HMM state sequence is the largest, that segment of speech is determined to correspond to that word. For example, take the voice data of the word "five": the word "five" is formed by connecting the three phoneme states [f], [ay], and [v], and each hidden-node state corresponds to a single phoneme. Taking the words "one", "two", "three", and "five" as candidates, the forward algorithm is used to calculate the probability of the observation sequence under each word, and the word with the highest probability is chosen as the recognition result, as in the sketch below.
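  • A minimal sketch of this word-scoring step with the forward algorithm, using a toy two-word vocabulary and made-up probabilities (every number and shape here is illustrative; with equal word priors, the highest likelihood also gives the highest posterior):

```python
import numpy as np

def forward_log_prob(obs, pi, A, B):
    """Log probability of an observation sequence under one word's HMM.

    obs: discrete observation indices, shape (T,)
    pi:  initial state distribution, shape (S,)
    A:   state transition matrix, shape (S, S)
    B:   emission matrix B[s, o], shape (S, O)
    """
    alpha = pi * B[:, obs[0]]            # forward variable at t = 0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # recursion: transition, then emit
    return np.log(alpha.sum() + 1e-300)

# Toy word models, e.g. "five" ~ states [f], [ay], [v] (numbers are made up).
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],           # self-loop or advance to the next state
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
B_five = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]])
B_one = np.array([[0.3, 0.3, 0.4],
                  [0.4, 0.3, 0.3],
                  [0.3, 0.4, 0.3]])
obs = np.array([0, 0, 1, 1, 2])          # a frame-level observation sequence
scores = {w: forward_log_prob(obs, pi, A, B)
          for w, B in [("five", B_five), ("one", B_one)]}
print(max(scores, key=scores.get))       # highest-scoring word wins
```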
  • The first log likelihood represents the log likelihood function of the conditional probability distribution P_f(y^(i) | x^(i); θ_xy).
  • The dual learning-based speech recognition and synthesis device acquires the sound feature sequence corresponding to the text data y^(i), generates voice data x̂^(i) according to the sound sequence features, and calculates the second log likelihood that x̂^(i) equals the voice data x^(i). Specifically, the device inputs the text data y^(i) into the second speech synthesis model. First, y^(i) is split into the smallest unit words that carry semantics. The smallest unit words corresponding to y^(i) are input into the encoder of the second speech synthesis model, which performs semantic analysis and classification on them. The device then performs classification coding on the smallest unit words corresponding to y^(i) and outputs a fixed-length intermediate semantic vector corresponding to y^(i). This intermediate semantic vector is input into the decoder of the second speech synthesis model, which performs semantic analysis on it and generates the voice sequence features corresponding to y^(i). Optionally, the classification categories include: Chinese, English, Korean, numbers, pinyin, and place names. A shape-level sketch of this pipeline follows this item.
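  • A shape-level sketch of this encoder/decoder/vocoder chain in PyTorch (all layer sizes are invented, and a real system decodes autoregressively with attention; this only shows how the three stages connect):

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    def __init__(self, vocab=100, emb=32, hid=64, n_acoustic=40, hop=160):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)  # text -> semantic vectors
        self.decoder = nn.GRU(hid, hid, batch_first=True)  # semantics -> acoustic frames
        self.to_acoustic = nn.Linear(hid, n_acoustic)
        self.vocoder = nn.Linear(n_acoustic, hop)           # frames -> waveform chunks

    def forward(self, token_ids):
        sem, _ = self.encoder(self.embed(token_ids))        # intermediate semantic vectors
        dec, _ = self.decoder(sem)
        frames = self.to_acoustic(dec)                      # sound feature sequence
        wave = self.vocoder(frames).flatten(1)              # synthesized waveform
        return frames, wave

tts = TinyTTS()
frames, wave = tts(torch.tensor([[5, 17, 42]]))             # toy "smallest unit words"
print(frames.shape, wave.shape)                             # (1, 3, 40) (1, 480)
```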
  • The dual learning-based speech recognition and synthesis device, for the N pairs of labeled data, takes maximizing the first log likelihood and the second log likelihood as the objective function and takes the probabilistic duality of speech recognition and speech synthesis as the constraint condition, optimizing θ_xy and θ_yx. Specifically, for the N pairs of labeled data {(x^(i), y^(i))}_N in the labeled data set, the speech recognition model and the speech synthesis model should satisfy probabilistic duality, that is, P(x^(i)) P(y^(i) | x^(i); θ_xy) = P(y^(i)) P(x^(i) | y^(i); θ_yx), where P(x^(i)) and P(y^(i)) denote the marginal probabilities of the voice data x^(i) and the text data y^(i), respectively. The objective function and the constraint condition can then be written out as in the following formulas.
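  • Written out (my formalization of the optimization just described; the original formula images are not preserved in this copy of the document):

```latex
\max_{\theta_{xy},\,\theta_{yx}} \;\; \frac{1}{N} \sum_{i=1}^{N} \Big[ \log P_f\big(y^{(i)} \mid x^{(i)}; \theta_{xy}\big) + \log P_g\big(x^{(i)} \mid y^{(i)}; \theta_{yx}\big) \Big]
\quad \text{s.t.} \quad P\big(x^{(i)}\big)\, P_f\big(y^{(i)} \mid x^{(i)}; \theta_{xy}\big) = P\big(y^{(i)}\big)\, P_g\big(x^{(i)} \mid y^{(i)}; \theta_{yx}\big), \;\; i = 1, \dots, N
```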
  • In the embodiment of the present application, the posterior probability and the transition probability of the phonemes corresponding to the voice data x^(i) are obtained to generate text data ŷ^(i), and the sound feature sequence corresponding to y^(i) is obtained to generate voice data x̂^(i). For the N pairs of labeled data, the goal is to maximize the log likelihood that ŷ^(i) equals y^(i) and the log likelihood that x̂^(i) equals x^(i), with the probabilistic duality of speech recognition and speech synthesis as a constraint condition, thereby optimizing the speech recognition and speech synthesis effect. Dual learning is thus used effectively for speech recognition and speech synthesis, improving the training speed of speech recognition and speech generation and the accuracy of their output results.
  • The embodiment of the present application also provides a speech recognition and speech synthesis device based on dual learning, which can have the beneficial effects of the above-mentioned method. The function of the device can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above-mentioned functions.
  • FIG. 3 is a structural block diagram of a speech recognition and speech synthesis device 300 based on dual learning provided by an embodiment of the present application. The device includes an initialization unit 301, a selection unit 302, a processing unit 303, a first generation unit 304, a second generation unit 305, and an optimization unit 306.
  • The selection unit 302 is used to select N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y), where K is a positive integer and N is a positive integer less than or equal to K.
  • The processing unit 303 is used to extract the acoustic features of x^(i) and, according to those acoustic features, acquire the posterior probability of the phonemes corresponding to x^(i) and the transition probability of those phonemes.
  • The first generation unit 304 is configured to generate text data ŷ^(i) according to the posterior probability and the transition probability of the phonemes corresponding to x^(i), and to calculate the first log likelihood that ŷ^(i) equals y^(i).
  • The second generation unit 305 is configured to acquire the sound feature sequence corresponding to y^(i), generate voice data x̂^(i) according to the sound sequence features, and calculate the second log likelihood that x̂^(i) equals x^(i).
  • The optimization unit 306 is configured to take maximizing the first log likelihood and the second log likelihood as the objective function for the N pairs of labeled data, take the probabilistic duality of speech recognition and speech synthesis as the constraint condition, and optimize θ_xy and θ_yx.
  • Optionally, before the selection unit 302 selects the N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y), the device further includes a pre-training unit, used to randomly select S pairs of labeled data from Φ_(x,y), pre-train the first speech recognition model to obtain the second speech recognition model, and pre-train the first speech synthesis model to obtain the second speech synthesis model. The second speech recognition model includes a deep neural network and a hidden Markov model, and the second speech synthesis model includes an encoder, a decoder, and a neural vocoder.
  • The pre-training unit is used to pre-train the first speech recognition model to be trained to obtain the pre-trained second speech recognition model, which specifically includes:
  • An input unit, used to input the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained;
  • A preprocessing unit, used to preprocess the speech data x^(r) to obtain the frequency cepstral coefficient features corresponding to x^(r);
  • The pre-training unit is also used to pre-train the deep neural network-hidden Markov model of the first speech recognition model using the frequency cepstral coefficient features to obtain the second speech recognition model, which includes the trained deep neural network-hidden Markov model.
  • The pre-training unit is used to pre-train the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model, which specifically includes:
  • The input unit is also used to input the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech synthesis model to be trained;
  • The pre-training unit is also used to pre-train the encoder, decoder, and neural vocoder of the first speech synthesis model using the text data y^(r) to obtain the second speech synthesis model, which includes the trained encoder, the trained decoder, and the trained neural vocoder. Optionally, the encoder, decoder, and neural vocoder all adopt a recurrent neural network model.
  • Optionally, the processing unit 303 includes an extraction unit and an acquisition unit. The extraction unit is used to input the voice data x^(i) into the second speech recognition model and extract the acoustic features of x^(i) frame by frame. The acquisition unit is used to input the acoustic features of x^(i) into the deep neural network in the second speech recognition model to obtain the posterior probability of the phonemes corresponding to x^(i), and to obtain the transition probability of those phonemes through the hidden Markov model in the second speech recognition model.
  • Optionally, the first generation unit 304 generates the text data ŷ^(i) according to the posterior probability and the transition probability of the phonemes corresponding to the voice data x^(i) as follows: a determination unit is used to determine w network paths and their probabilities according to the posterior probability and the transition probability of the phonemes corresponding to x^(i), where w is a positive integer greater than zero; and the acquisition unit is also used to acquire the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths.
  • The second generation unit 305 is specifically configured to: input y^(i) into the encoder of the second speech synthesis model to generate a semantic sequence; input the semantic sequence into the decoder of the second speech synthesis model to generate the voice feature sequence; input the voice feature sequence into the neural vocoder of the second speech synthesis model to generate voice data x̂^(i); and calculate the second log likelihood that x̂^(i) equals x^(i).
  • The optimization unit 306 is specifically configured to: take maximizing the first log likelihood and the second log likelihood as the objective function, take the probabilistic duality of speech recognition and speech synthesis as the constraint condition, combine the objective function and the constraint condition into a single problem, and iteratively optimize θ_xy and θ_yx using a Lagrangian multiplier optimization algorithm; a toy sketch follows this item.
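  • A toy sketch of such a constrained update, relaxing the duality constraint into a penalty weighted by a fixed Lagrange multiplier (the model stubs, the marginal log-probabilities, and the loss shapes are illustrative assumptions only, not the patent's implementation):

```python
import torch

# Stub recognizer/synthesizer: each maps its input to a log-probability of the
# paired target given one trainable parameter vector (theta_xy / theta_yx).
theta_xy = torch.zeros(8, requires_grad=True)   # "recognition" parameters
theta_yx = torch.zeros(8, requires_grad=True)   # "synthesis" parameters

def log_p_y_given_x(x, y, theta):               # stand-in for log P_f(y|x; theta_xy)
    return -(x @ theta - y).pow(2).mean()

def log_p_x_given_y(y, x, theta):               # stand-in for log P_g(x|y; theta_yx)
    return -(y * theta.sum() - x.mean()).pow(2).mean()

log_p_x = torch.tensor(-3.0)                    # marginal log P(x), e.g. from a prior
log_p_y = torch.tensor(-4.0)                    # marginal log P(y), e.g. from an LM
lam = 0.1                                       # Lagrange multiplier / penalty weight
opt = torch.optim.SGD([theta_xy, theta_yx], lr=1e-2)

for x, y in [(torch.randn(8), torch.tensor(0.5))]:   # one toy labeled pair
    ll1 = log_p_y_given_x(x, y, theta_xy)            # first log likelihood
    ll2 = log_p_x_given_y(y, x, theta_yx)            # second log likelihood
    duality_gap = (log_p_x + ll1 - log_p_y - ll2) ** 2
    loss = -(ll1 + ll2) + lam * duality_gap          # maximize likelihoods s.t. duality
    opt.zero_grad(); loss.backward(); opt.step()
```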
  • The steps of the method or algorithm described in conjunction with the disclosure of the embodiments of the present application may be implemented in hardware, or by a processor executing software instructions. Software instructions can be composed of corresponding software modules, which can be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. The storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in a network device; the processor and the storage medium may also exist as discrete components in a network device. Computer non-volatile readable media include computer non-volatile storage media and communication media, where communication media include any media that facilitate the transfer of a computer program from one place to another. The storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

A speech recognition and speech synthesis method and apparatus based on dual learning. The method comprises: initializing a labeled data set Φ(x,y), a speech recognition parameter θxy, and a speech synthesis parameter θyx, wherein Φ(x,y) = {(x(j), y(j))}K, x(j) is speech data, and y(j) is text data; selecting, from Φ(x,y), N pairs of labeled data {(x(i), y(i))}N; extracting an acoustic feature of x(i) and, according to it, acquiring the posterior probability and the transition probability of the phonemes corresponding to x(i) to generate text data ŷ(i), and calculating the first log likelihood of ŷ(i) equaling y(i); acquiring a sound feature sequence corresponding to y(i) to generate speech data x̂(i), and calculating the second log likelihood of x̂(i) equaling x(i); and taking maximization of the first log likelihood and the second log likelihood as the objective function, and taking the probabilistic duality of speech recognition and speech synthesis as a constraint condition, to optimize θxy and θyx. According to the method, dual learning is effectively used to perform speech recognition and speech synthesis, thereby increasing the training speed of speech recognition and speech generation and improving the accuracy of the output results.

Description

Method and device for speech recognition and speech synthesis based on dual learning

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 22, 2019, with application number 201910135575.7 and the title "A method and device for speech recognition and speech synthesis based on dual learning", the entire contents of which are incorporated into this application by reference.

Technical Field
This application relates to the field of speech processing technology, and in particular to a method and device for speech recognition and speech synthesis based on dual learning.
Background

In recent years, artificial intelligence technology represented by deep learning and reinforcement learning has made considerable progress and achieved great success in many applications. However, deep learning depends on large-scale labeled data, and reinforcement learning depends on a persistent interactive environment, and both acquiring large-scale labeled data and maintaining an interactive environment are costly. For deep learning and reinforcement learning to succeed more widely, their dependence on large-scale labeled data and interactive environments needs to be reduced. To solve this problem, a new learning paradigm has emerged, which we call dual learning.

In supervised learning tasks, many problems are found to have a dual form: input and output appear in dual pairs, where the input and output of one task are the output and input of another. For example, in machine translation, translation between two languages in opposite directions forms a pair of dual tasks. The two tasks are internally related by probability and share a correlation model, but this connection is usually not exploited effectively, because the two models are typically trained independently. Dual learning therefore uses the correlation between the two models to train both at the same time, simplifying the training process; dual learning does not rely on large-scale labeled data.

Traditional technology usually trains speech recognition and speech generation separately, failing to make effective use of the duality between them. Using the duality between speech recognition and speech generation to combine speech recognition training and speech generation training into dual learning is a major development trend of speech recognition and speech generation technology. However, applying dual learning to actual scenarios still faces huge challenges; how to perform speech recognition and speech generation effectively based on dual learning, and how to improve the training speed of speech recognition and speech generation and the accuracy of their output results, are technical problems urgently in need of solution.
Summary of the Invention

The embodiments of the present application provide a method and device for speech recognition and speech synthesis based on dual learning, which can effectively use dual learning for speech recognition and speech synthesis, improve the training speed of speech recognition and speech generation, and improve the accuracy of their output results.

The embodiment of the present application provides a speech recognition and speech synthesis method based on dual learning, which includes the following steps:

initializing the labeled data set Φ_(x,y), the speech recognition parameter θ_xy, the speech synthesis parameter θ_yx, and the training data size N, where Φ_(x,y) = {(x^(j), y^(j))}_K; the labeled data set Φ_(x,y) contains K pairs of labeled data; (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y); x^(j) is the voice data and y^(j) is the text data in the j-th pair; K is a positive integer and N is a positive integer less than or equal to K;
selecting N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y);

extracting the acoustic features of the voice data x^(i), and according to those acoustic features, acquiring the posterior probability of the phonemes corresponding to x^(i) and the transition probability of those phonemes;

generating text data ŷ^(i) according to the posterior probability and the transition probability of the phonemes corresponding to x^(i), and calculating the first log likelihood that ŷ^(i) equals the text data y^(i);

acquiring the sound feature sequence corresponding to y^(i), generating voice data x̂^(i) according to the sound sequence features, and calculating the second log likelihood that x̂^(i) equals the voice data x^(i);

for the N pairs of labeled data, taking maximizing the first log likelihood and the second log likelihood as the objective function, taking the probabilistic duality of speech recognition and speech synthesis as the constraint condition, and optimizing θ_xy and θ_yx.
The embodiment of the present application also provides a device for speech recognition and speech synthesis based on dual learning, which can realize the beneficial effects of the above method. The function of the device can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above functions. Optionally, the device includes an initialization unit, a selection unit, a processing unit, a first generation unit, a second generation unit, and an optimization unit.

The initialization unit is used to initialize the labeled data set Φ_(x,y), the speech recognition parameter θ_xy, the speech synthesis parameter θ_yx, and the training data size N, where Φ_(x,y) = {(x^(j), y^(j))}_K contains K pairs of labeled data; (x^(j), y^(j)) denotes the j-th pair in Φ_(x,y), x^(j) being the voice data and y^(j) the text data of that pair; K is a positive integer and N is a positive integer less than or equal to K.

The selection unit is used to select N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y).

The processing unit is used to extract the acoustic features of the voice data x^(i) and, according to those acoustic features, acquire the posterior probability of the phonemes corresponding to x^(i) and the transition probability of those phonemes.

The first generation unit is used to generate text data ŷ^(i) according to the posterior probability and the transition probability of the phonemes corresponding to x^(i), and to calculate the first log likelihood that ŷ^(i) equals the text data y^(i).

The second generation unit is used to acquire the sound feature sequence corresponding to the text data y^(i), generate voice data x̂^(i) according to the sound sequence features, and calculate the second log likelihood that x̂^(i) equals the voice data x^(i).

The optimization unit is used to take maximizing the first log likelihood and the second log likelihood as the objective function for the N pairs of labeled data, take the probabilistic duality of speech recognition and speech synthesis as the constraint condition, and optimize θ_xy and θ_yx.
The embodiment of the present application also provides a server that can realize the beneficial effects of the above speech recognition and speech synthesis method based on dual learning. The function of the server can be realized by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above functions. The server includes a memory, a processor, and a transceiver: the memory is used to store a computer program, including program instructions, that supports the server in executing the above method; the processor is used to control and manage the actions of the server according to the program instructions; and the transceiver is used to support communication between the server and other communication devices.

The embodiment of the present application also provides a computer non-volatile readable storage medium that stores instructions which, when run on a processor, cause the processor to execute the above dual learning-based speech recognition and speech synthesis method.
In the embodiments of the present application, the posterior probability and the transition probability of the phonemes corresponding to the voice data x^(i) are obtained to generate text data ŷ^(i), and the sound feature sequence corresponding to y^(i) is obtained to generate voice data x̂^(i). For the N pairs of labeled data, the goal is to maximize the log likelihood that ŷ^(i) equals y^(i) and the log likelihood that x̂^(i) equals x^(i), with the probabilistic duality of speech recognition and speech synthesis as a constraint condition, thereby optimizing the speech recognition and speech synthesis effect. Dual learning is thus used effectively for speech recognition and speech synthesis, improving the training speed of speech recognition and speech generation and the accuracy of their output results.

Additional aspects and advantages of this application will be given in part in the following description; they will become obvious from the description or be understood through the practice of this application.
Brief Description of the Drawings

The above and/or additional aspects and advantages of this application will become obvious and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic structural diagram of a server provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a speech recognition and speech synthesis method based on dual learning provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a speech recognition and speech synthesis device based on dual learning provided by an embodiment of the present application.
Detailed Description

The technical solutions in the embodiments of the present application are described below in conjunction with the accompanying drawings. It should be understood that when used in this specification and the appended claims, the terms "include" and "comprise" indicate the existence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof. In addition, the terms "first", "second", "third", and so on are used to distinguish different objects, not to describe a specific order.

It should be noted that the terms used in the embodiments of the present application are only for the purpose of describing specific embodiments and are not intended to limit the application. The singular forms "a", "said", and "the" used in the embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
对偶学习是一种利用一组对偶任务之间的对偶性建立反馈信号,并用这个信号约束训练的学习方案。对偶性广泛存在于人工智能任务之中,例如,机器翻译就是让机器将自然语言从一种语言翻译到另一种语言,中文到英文和英文到中文互为对偶任务。图像识别和图像合成也互为对偶任务,图像识别指的是给定一张图片,判别它的类别和具体信息。图 像生成指的是给定一个类别和具体信息,生成一张对应的图片。同样,语音识别和语音合成也互为对偶任务,语音识别是让机器通过识别和理解过程把语音信号转变为相应的文本或命令的技术,语音合成是将计算机自己产生的、或外部输入的文字信息,通过机械的、电子的方法转变为语音的技术。Dual learning is a learning scheme that uses the duality between a set of dual tasks to establish a feedback signal, and uses this signal to constrain training. Duality exists widely in artificial intelligence tasks. For example, machine translation is to allow machines to translate natural language from one language to another. Chinese to English and English to Chinese are dual tasks. Image recognition and image synthesis are also dual tasks for each other. Image recognition refers to a given picture, its classification and specific information. Image generation refers to the generation of a corresponding image given a category and specific information. Similarly, speech recognition and speech synthesis are also dual tasks. Speech recognition is a technology that allows machines to convert speech signals into corresponding texts or commands through the process of recognition and understanding. Speech synthesis is a technology that converts text generated by the computer itself or input from outside. Information is transformed into voice technology through mechanical and electronic methods.
Speech recognition has a very wide range of applications. Common application systems include voice input systems, which are more natural and efficient than keyboard input; voice control systems, which use voice to control the operation of a device and are faster and more convenient than manual control; and intelligent dialogue query systems, which operate according to a customer's voice and provide users with natural, friendly database retrieval services. Speech synthesis is also widely used in daily life, for example in electronic reading, in-vehicle voice navigation, queue-number announcement systems in banks and hospitals, traffic broadcasts, and so on. The dual-learning-based speech recognition and speech synthesis method provided by the embodiments of this application can be applied to network devices with speech recognition and speech synthesis functions, such as terminal devices, servers, and in-vehicle network devices; the terminal devices include smartphones, smart bracelets, e-readers, notebooks, and tablet computers, which are not specifically limited in this application. The following takes a server as an example to describe in detail the functions of a device to which the above dual-learning-based speech recognition and speech synthesis method is applied.
Referring to FIG. 1, FIG. 1 is a schematic diagram of the hardware structure of a server 100 provided by an embodiment of this application. The server 100 includes a memory 101, a transceiver 102, and a processor 103 coupled to the memory 101 and the transceiver 102. The memory 101 is configured to store a computer program, the computer program including program instructions; the processor 103 is configured to execute the program instructions stored in the memory 101; and the transceiver 102 is configured to communicate with other devices under the control of the processor 103. When executing the instructions, the processor 103 can perform the dual-learning-based speech recognition and speech synthesis method according to the program instructions.
The processor 103 may be a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of the embodiments of this application. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The transceiver 102 may be a communication interface, a transceiver circuit, or the like, where "communication interface" is a collective term that may include one or more interfaces, for example an interface between the server and a terminal.
Optionally, the server 100 may further include a bus 104. The memory 101, the transceiver 102, and the processor 103 may be connected to one another through the bus 104. The bus 104 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 104 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is used in FIG. 1, but this does not mean that there is only one bus or only one type of bus.
In addition to the memory 101, the transceiver 102, the processor 103, and the bus 104 shown in FIG. 1, the server 100 in the embodiments may also include other hardware according to the actual functions of the server, which will not be described in detail here.
In the above operating environment, an embodiment of this application provides the dual-learning-based speech recognition and speech synthesis method shown in FIG. 2. Referring to FIG. 2, the dual-learning-based speech recognition and speech synthesis method includes:
S201. Initialize a labeled data set Φ_(x,y), a speech recognition parameter θ_xy, a speech synthesis parameter θ_yx, and a training data size N, where Φ_(x,y) = {(x^(j), y^(j))}_K, the labeled data set Φ_(x,y) contains K pairs of labeled data, x^(j) is speech data, and y^(j) is text data.
Specifically, K pairs of labeled data are selected to form the labeled data set Φ_(x,y) = {(x^(j), y^(j))}_K, where (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), x^(j) is the speech data in the j-th pair, y^(j) is the text data in the j-th pair, K is a positive integer, and N is a positive integer less than or equal to K. The training data size N is the number of labeled data pairs in Φ_(x,y) that participate in the dual-learning-based joint optimization of speech recognition and speech synthesis. The speech recognition parameter θ_xy is a parameter that affects the speech recognition result, and the speech synthesis parameter θ_yx is a parameter that affects the speech synthesis result.
It can be understood that the contents of the K speech data items in the labeled data set Φ_(x,y) are all different, and their lengths may or may not be identical. The speech data may come from television news broadcasts, daily conversations, meeting recordings, and so on; the source scenes of the K speech data items may be the same or different. Neither is specifically limited in this application.
Optionally, the speech recognition parameter θ_xy and the speech synthesis parameter θ_yx are randomly initialized; for example, the initial values of θ_xy and θ_yx are both set to 1.
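For illustration only, the following minimal Python sketch shows one way this initialization and the subsequent selection in step S202 could look in code; the waveform length, parameter dimensionality, and all variable names are assumptions, not part of the embodiment.

```python
import numpy as np

# Hypothetical in-memory form of the labeled data set Φ_(x,y): K pairs of
# (speech waveform, transcript). Names and sizes are illustrative.
rng = np.random.default_rng(0)
K, N = 1000, 800                       # K labeled pairs, training size N <= K
labeled_set = [(rng.standard_normal(16000), f"transcript {j}") for j in range(K)]

# Initialize the recognition and synthesis parameters; the embodiment
# mentions setting both initial values to 1 as one option.
theta_xy = np.ones(4096)               # speech recognition parameters θ_xy
theta_yx = np.ones(4096)               # speech synthesis parameters θ_yx

# Φ_(x,y)^N: the N pairs drawn for dual-learning training (step S202).
phi_n = [labeled_set[i] for i in rng.choice(K, size=N, replace=False)]
```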
S202. The dual-learning-based speech recognition and synthesis device selects N pairs of labeled data from the labeled data set Φ_(x,y) to form a labeled data set Φ_(x,y)^N = {(x^(i), y^(i))}_N.
Optionally, before the dual-learning-based speech recognition and synthesis device randomly selects N pairs of labeled data from the labeled data set Φ_(x,y), the method further includes: randomly selecting S pairs of labeled data from Φ_(x,y), pre-training a first speech recognition model to be trained to obtain a pre-trained second speech recognition model, and pre-training a first speech synthesis model to be trained to obtain a pre-trained second speech synthesis model. The second speech recognition model includes a deep neural network (DNN)-hidden Markov model (HMM); the second speech synthesis model includes an encoder, a decoder, and a neural vocoder. S is a positive integer less than or equal to K.
Optionally, pre-training the first speech recognition model to be trained to obtain the pre-trained second speech recognition model includes: inputting the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained; preprocessing the speech data x^(r) to obtain the frequency cepstral coefficient features corresponding to x^(r); and pre-training the DNN-HMM of the first speech recognition model with the frequency cepstral coefficient features to obtain the second speech recognition model, which includes the trained DNN-HMM.
Optionally, pre-training the first speech recognition model to be trained to obtain the pre-trained second speech recognition model specifically includes: the dual-learning-based speech recognition and synthesis device inputs the speech data x^(r) in the labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained. First, the speech data x^(r) is preprocessed to obtain its frequency cepstral coefficient features. Then, taking these features as input data, the device trains an acoustic model composed of a Gaussian mixture model (GMM) and a hidden Markov model (HMM), and obtains the likelihood features of the phoneme states output by the pre-trained GMM and the transition probabilities of the phoneme states output by the pre-trained HMM. The device converts the likelihood features of the phoneme states into posterior probability features of the phoneme states through forced alignment and, from the S pairs of labeled data and the posterior probability features of the phoneme states, obtains the matrix weight values and matrix bias values between nodes of the output layer of a deep neural network (DNN) model, generating the pre-trained DNN model. The second speech recognition model includes the pre-trained DNN model and the pre-trained HMM.
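The following non-limiting Python sketch illustrates one possible reading of this pre-training pipeline, using librosa MFCCs as a stand-in for the frequency cepstral coefficient features, hmmlearn's GMM-HMM for the acoustic model, and a small PyTorch network for the DNN; the file name, model sizes, and the choice of these libraries are all assumptions.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn
from hmmlearn import hmm

def cepstral_features(wav_path, n_coef=13):
    # Preprocess one utterance x^(r) into frame-level cepstral features
    # (MFCCs stand in for the embodiment's frequency cepstral coefficients).
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coef).T  # (frames, n_coef)

# GMM-HMM acoustic model: hidden states ~ phoneme states, emissions ~ GMMs.
feats = cepstral_features("utt0001.wav")           # hypothetical file name
gmm_hmm = hmm.GMMHMM(n_components=3, n_mix=4, covariance_type="diag")
gmm_hmm.fit(feats)                                 # pre-train on the S pairs
state_ids = gmm_hmm.predict(feats)                 # forced-alignment stand-in

# DNN pre-training: map features to aligned phoneme-state posteriors.
dnn = nn.Sequential(nn.Linear(13, 256), nn.ReLU(), nn.Linear(256, 3))
opt = torch.optim.SGD(dnn.parameters(), lr=0.1)
x = torch.tensor(feats, dtype=torch.float32)
t = torch.tensor(state_ids, dtype=torch.long)
for _ in range(10):                                # a few illustrative epochs
    opt.zero_grad()
    loss = nn.functional.cross_entropy(dnn(x), t)  # softmax -> state posteriors
    loss.backward()
    opt.step()
```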
Optionally, pre-training the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model includes: inputting the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech synthesis model to be trained; and pre-training the encoder, the decoder, and the neural vocoder of the first speech synthesis model with the text data y^(r) to obtain the second speech synthesis model, which includes the trained encoder, the trained decoder, and the trained neural vocoder.
Optionally, pre-training the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model specifically includes: the dual-learning-based speech recognition and synthesis device randomly selects S pairs of labeled data {(x^(t), y^(t))}_S from the labeled data set Φ_(x,y) and pre-trains the first speech synthesis model to be trained, obtaining the pre-trained second speech synthesis model. This specifically includes the following steps: the device inputs the text data y^(t) in the labeled data {(x^(t), y^(t))}_S into the first speech synthesis model to be trained. First, the encoder parses the text data to obtain an intermediate semantic vector, representing the semantics of the text, corresponding to y^(t). Then, the device inputs the intermediate semantic vector into the decoder to obtain the sound sequence features corresponding to y^(t). The sound sequence features are input into the neural vocoder, which outputs the speech data corresponding to y^(t). The encoder, the decoder, and the neural vocoder all adopt recurrent neural network (RNN) models; the second speech synthesis model includes the encoder, the decoder, and the neural vocoder.
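As a sketch of the encoder-decoder-vocoder structure described above, the toy PyTorch model below chains three recurrent networks; the dimensions, the GRU cell choice, and the direct frame-to-waveform vocoder are illustrative assumptions rather than the embodiment's actual architecture.

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    # Encoder -> decoder -> neural vocoder, each a recurrent network,
    # mirroring the three-stage synthesis model described above.
    def __init__(self, vocab=100, emb=64, hid=128, n_mels=80, hop=200):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)    # text -> semantics
        self.decoder = nn.GRU(hid, hid, batch_first=True)    # semantics -> features
        self.to_mel = nn.Linear(hid, n_mels)
        self.vocoder = nn.GRU(n_mels, hop, batch_first=True) # features -> waveform

    def forward(self, token_ids):
        enc, _ = self.encoder(self.embed(token_ids))  # intermediate semantic vectors
        dec, _ = self.decoder(enc)
        mel = self.to_mel(dec)                        # sound sequence features
        frames, _ = self.vocoder(mel)
        return frames.flatten(1)                      # synthesized waveform samples

tts = TinyTTS()
wave = tts(torch.randint(0, 100, (1, 12)))            # 12 hypothetical text tokens
print(wave.shape)                                     # (1, 12 * hop) samples
```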
It can be understood that a GMM quantifies things precisely with Gaussian probability density functions; it is a model that decomposes a phenomenon into several components formed from Gaussian probability density functions. An HMM is a probabilistic model of time series: it describes the process in which a hidden Markov chain randomly generates an unobservable random sequence of states, and each state then generates an observation, producing an observable random sequence.
S203. The dual-learning-based speech recognition and synthesis device extracts the acoustic features of the speech data x^(i), and obtains, from the acoustic features of x^(i), the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i).
Specifically, the dual-learning-based speech recognition and synthesis device inputs the speech data x^(i) into the second speech recognition model, filters out unimportant information and background noise, and divides x^(i) into multiple frames of speech signal. Each frame of speech signal is analyzed and processed, and the filter bank features of each frame corresponding to x^(i) are extracted as the acoustic features of x^(i). The device inputs the acoustic features of x^(i) into the DNN model in the second speech recognition model to obtain the posterior probabilities of the phonemes corresponding to x^(i) output by the DNN model, and inputs the phonemes corresponding to x^(i) into the HMM in the second speech recognition model to obtain the transition probabilities of the phonemes corresponding to x^(i).
It can be understood that the phoneme transition probabilities output by the HMM include the probability of transitioning from a first phoneme state back to the first phoneme state and the probability of transitioning from the first phoneme state to a second phoneme state, where the second phoneme state is the state following the first phoneme state.
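A minimal sketch of this feature-extraction step, assuming log mel filter-bank features computed with librosa and untrained stand-ins for the pre-trained DNN and HMM; the sizes and file name are hypothetical.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

N_MELS, N_STATES = 40, 120                          # illustrative sizes

def fbank_features(wav, sr=16000):
    # Frame the signal and extract log mel filter-bank features,
    # one vector per frame, as the acoustic features of x^(i).
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=400, hop_length=160, n_mels=N_MELS)
    return np.log(mel + 1e-6).T                     # (frames, N_MELS)

# Stand-in for the pre-trained DNN of the second speech recognition model.
dnn = nn.Sequential(nn.Linear(N_MELS, 256), nn.ReLU(), nn.Linear(256, N_STATES))

wav, _ = librosa.load("utt0001.wav", sr=16000)      # hypothetical utterance
feats = torch.tensor(fbank_features(wav), dtype=torch.float32)
posteriors = torch.softmax(dnn(feats), dim=-1)      # per-frame phoneme posteriors

# Phoneme-state transition probabilities come from the HMM; a uniform
# row-stochastic matrix stands in for the trained model's transitions.
trans = np.full((N_STATES, N_STATES), 1.0 / N_STATES)
```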
S204. The dual-learning-based speech recognition and synthesis device generates text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to x^(i), and computes the first log likelihood that the text data ŷ^(i) equals the text data y^(i).
Optionally, the dual-learning-based speech recognition and synthesis device determines the probabilities of w network paths from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), and obtains the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths, where w is a positive integer greater than zero.
Specifically, the device obtains the probabilities of different words from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i); different words form different network paths. The device obtains the probability of each network path, selects the path with the highest probability as the optimal network path, and generates the corresponding text data ŷ^(i) from that optimal network path.
An HMM builds a statistical model of the temporal structure of the speech signal and can be viewed mathematically as a doubly stochastic process: one process is a hidden Markov chain with a finite number of states that models the changing statistical characteristics of the speech signal; the other is the externally visible random process of observation sequences associated with each state of the Markov chain. An HMM contains the following elements: the hidden states, the observation sequence, the initial probability distribution of the hidden states, the transition probability matrix of the hidden states, and the emission probabilities of the observations. During speech recognition, given a trained HMM and an observation sequence (that is, the acoustic features of the speech data), the optimal state sequence corresponding to the observation sequence is found, thereby converting the speech into text. According to the pronunciation process of each word, phonemes serve as hidden nodes; the progression of phonemes constitutes the HMM state sequence, and each phoneme generates an observation vector according to a certain probability density function.
It can be understood that the probability of each state generating the observed value is computed from the HMM state transition probabilities of each word; if the joint probability of a word's HMM state sequence is the largest, the speech segment is judged to correspond to that word. For example, taking the speech data of the word "five": "five" is formed by connecting the three phoneme states [f], [ay], and [v], and each state of the hidden nodes corresponds to a single phoneme. Taking the words "one", "two", "three", and "five" as an example, the forward algorithm is used to compute the posterior probability of the observation sequence under each word, and the word with the highest probability is taken as the recognition result.
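The sketch below implements the forward algorithm over toy three-state word HMMs to score the candidate words, with random emission probabilities standing in for the acoustic model outputs; the state counts and probability values are illustrative.

```python
import numpy as np

def forward_log_prob(log_init, log_trans, log_emit):
    # Forward algorithm: log P(observations | word HMM).
    # log_init: (S,) initial state log-probs; log_trans: (S, S);
    # log_emit: (T, S) per-frame state emission log-probs.
    alpha = log_init + log_emit[0]
    for t in range(1, log_emit.shape[0]):
        alpha = log_emit[t] + np.logaddexp.reduce(
            alpha[:, None] + log_trans, axis=0)
    return np.logaddexp.reduce(alpha)

rng = np.random.default_rng(0)
T = 30                                   # frames in the utterance
words = ["one", "two", "three", "five"]  # e.g. "five" = [f]-[ay]-[v]
scores = {}
for w in words:
    S = 3                                # 3 phoneme states per word here
    log_init = np.log([1.0, 1e-9, 1e-9])            # start in first phoneme
    log_trans = np.log(np.full((S, S), 1e-9)
                       + np.eye(S) * 0.6
                       + np.eye(S, k=1) * 0.4)      # self-loop or advance
    log_emit = np.log(rng.dirichlet(np.ones(S), size=T))  # toy emissions
    scores[w] = forward_log_prob(log_init, log_trans, log_emit)

print(max(scores, key=scores.get))       # word with the highest likelihood
```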
Optionally, the device computes the first log likelihood that ŷ^(i) equals y^(i), that is, the log likelihood of recognizing y^(i) when x^(i) is input into the second speech recognition model. The first log likelihood represents the log likelihood function of the conditional probability distribution P_f(y^(i) | x^(i); θ_xy) and is computed as follows:

log P_f(y^(i) | x^(i); θ_xy) = log P{ f(x^(i)) = y^(i); θ_xy }
S205. The dual-learning-based speech recognition and synthesis device obtains the sound feature sequence corresponding to the text data y^(i), generates speech data x̂^(i) from the sound sequence features, and computes the second log likelihood that the speech data x̂^(i) equals the speech data x^(i).
To obtain the sound feature sequence corresponding to y^(i) and generate the speech data x̂^(i), the device specifically inputs the text data y^(i) into the second speech synthesis model. First, y^(i) is split into minimal semantic unit words. The minimal unit words corresponding to y^(i) are input into the encoder of the second speech synthesis model, which performs semantic analysis on them and classifies them. Then, the device encodes the minimal unit words by category and outputs a fixed-length intermediate semantic vector corresponding to y^(i). The intermediate semantic vector is input into the decoder of the second speech synthesis model, which performs semantic analysis on it and generates the sound sequence features corresponding to y^(i). The sound sequence features are input into the neural vocoder of the second speech synthesis model to generate the speech data x̂^(i).
Optionally, the minimal unit words corresponding to the text data y^(i) are semantically analyzed and classified; the categories include Chinese, English, Korean, digits, pinyin, place names, and the like. Different categories of minimal unit words have different encoding rules.
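One hedged illustration of category-dependent encoding: the sketch below classifies each minimal unit word and encodes it under its category's rule. The categories, regular expressions, and id scheme are invented for illustration and are not the embodiment's actual rules.

```python
import re

# Hypothetical category-specific encoding rules for minimal unit words:
# each category gets its own id range so the encoder can treat, e.g.,
# digits and Chinese characters differently.
CATEGORY_BASE = {"digit": 0, "english": 100, "chinese": 1000, "pinyin": 5000}

def categorize(token):
    if token.isdigit():
        return "digit"
    if re.fullmatch(r"[A-Za-z]+", token):
        return "english"
    if re.fullmatch(r"[\u4e00-\u9fff]+", token):
        return "chinese"
    return "pinyin"                      # fallback bucket in this sketch

def encode(tokens):
    # Classify each minimal unit word, then encode it under the rule of
    # its category (here: category base id plus a per-token hash).
    return [(t, categorize(t),
             CATEGORY_BASE[categorize(t)] + hash(t) % 100) for t in tokens]

print(encode(["你好", "GPS", "2019", "pīn"]))
```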
Optionally, the device computes the second log likelihood that x̂^(i) equals x^(i), that is, the log likelihood of generating x^(i) when y^(i) is input into the second speech synthesis model. The second log likelihood is computed as follows:

log P_g(x^(i) | y^(i); θ_yx) = log P{ g(y^(i)) = x^(i); θ_yx }
S206. For the N pairs of labeled data, the dual-learning-based speech recognition and synthesis device takes maximizing the first log likelihood and the second log likelihood as the objective function, takes the probabilistic duality of speech recognition and speech synthesis as the constraint, and optimizes θ_xy and θ_yx.
Optionally, for the N pairs of labeled data Φ_(x,y)^N in the labeled data set, the device takes maximizing the first log likelihood and the second log likelihood as the objective function and takes the probabilistic duality of speech recognition and speech synthesis as the constraint. Ideally, the speech recognition model and the speech synthesis model should satisfy probabilistic duality, that is, P(x^(i)) P(y^(i) | x^(i); θ_xy) = P(y^(i)) P(x^(i) | y^(i); θ_yx), where P(x^(i)) and P(y^(i)) denote the marginal probabilities of the speech data x^(i) and the text data y^(i), respectively. The objective function and the constraint can be expressed as:

max F(θ_xy, θ_yx) = Σ_{i=1}^{N} [ log P_f(y^(i) | x^(i); θ_xy) + log P_g(x^(i) | y^(i); θ_yx) ]
s.t. P(x^(i)) P_f(y^(i) | x^(i); θ_xy) = P(y^(i)) P_g(x^(i) | y^(i); θ_yx), i = 1, …, N
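The constraint can be monitored numerically as a "duality gap" between the two factorizations of the joint probability; the toy sketch below computes that gap for one pair, assuming the marginals P(x^(i)) and P(y^(i)) are supplied by pre-trained language and speech models, which the embodiment does not specify.

```python
import math

def duality_gap(log_px, log_py, log_p_y_given_x, log_p_x_given_y):
    # Gap between the two factorizations of log P(x, y); zero exactly
    # when the probabilistic duality constraint holds for this pair.
    return (log_px + log_p_y_given_x) - (log_py + log_p_x_given_y)

# Toy values standing in for marginal and model probabilities.
gap = duality_gap(math.log(1e-4), math.log(1e-6),
                  math.log(0.9), math.log(0.02))
print(f"duality gap: {gap:.4f}")         # penalized via the Lagrangian term
```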
Optionally, combining the objective function and the constraint, the Lagrange multiplier optimization method transforms the original objective function F(θ_xy, θ_yx) into F̃(θ_xy, θ_yx), expressed as follows:

F̃(θ_xy, θ_yx) = F(θ_xy, θ_yx) + λ Σ_{i=1}^{N} [ log P(x^(i)) + log P_f(y^(i) | x^(i); θ_xy) − log P(y^(i)) − log P_g(x^(i) | y^(i); θ_yx) ]
where λ is the Lagrange multiplier. Gradient-based iterative optimization is applied to θ_xy and θ_yx: the gradients of F̃(θ_xy, θ_yx) with respect to θ_xy and θ_yx, denoted ∇_θxy F̃(θ_xy, θ_yx) and ∇_θyx F̃(θ_xy, θ_yx), are computed, and θ_xy and θ_yx are updated along them, for example

θ_xy ← θ_xy + η ∇_θxy F̃(θ_xy, θ_yx),  θ_yx ← θ_yx + η ∇_θyx F̃(θ_xy, θ_yx),

with learning rate η, iterating until the objective function converges or a specified stopping condition is reached.
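A minimal sketch of this constrained optimization loop, assuming the recognition and synthesis models expose differentiable log likelihoods and that fixed marginals log P(x), log P(y) are available; it penalizes the squared duality gap with a fixed weight λ as a smooth surrogate for the equality constraint and uses plain gradient steps, all of which are simplifying assumptions.

```python
import torch

def dual_learning_step(rec_loglik, syn_loglik, log_px, log_py,
                       params, lam=0.01, lr=1e-3):
    # One iterative update of θ_xy and θ_yx on a batch of labeled pairs.
    # rec_loglik, syn_loglik: log P_f(y|x;θ_xy) and log P_g(x|y;θ_yx);
    # log_px, log_py: fixed marginal log-probabilities of x and y.
    gap = (log_px + rec_loglik) - (log_py + syn_loglik)
    objective = (rec_loglik + syn_loglik).sum() - lam * (gap ** 2).sum()
    loss = -objective                    # maximize F̃ by descending -F̃
    for p in params:
        p.grad = None
    loss.backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad             # gradient step on θ_xy, θ_yx
    return objective.item()

# Toy stand-ins: scalar "parameters" and likelihoods depending on them.
theta_xy = torch.zeros(1, requires_grad=True)
theta_yx = torch.zeros(1, requires_grad=True)
for step in range(100):
    rec = -((theta_xy - 1.0) ** 2)       # peaks where θ_xy = 1
    syn = -((theta_yx + 0.5) ** 2)       # peaks where θ_yx = -0.5
    obj = dual_learning_step(rec, syn,
                             torch.tensor(-9.0), torch.tensor(-11.0),
                             [theta_xy, theta_yx])
print(round(obj, 3), theta_xy.item(), theta_yx.item())
```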
In this embodiment of the application, the text data ŷ^(i) is generated from the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to x^(i); the sound feature sequence corresponding to y^(i) is obtained and the speech data x̂^(i) is generated; and, for the N pairs of labeled data, the log likelihood that the text data ŷ^(i) equals the text data y^(i) and the log likelihood that the speech data x̂^(i) equals the speech data x^(i) are maximized as the objective, with the probabilistic duality of speech recognition and speech synthesis as the constraint, thereby optimizing the speech recognition and speech synthesis results. Dual learning is thus used effectively for speech recognition and speech synthesis, improving the training speed of speech recognition and speech generation as well as the accuracy of their output results.
An embodiment of this application further provides a dual-learning-based speech recognition and speech synthesis apparatus, which achieves the beneficial effects of the dual-learning-based speech recognition and speech synthesis method described above. The functions of the apparatus may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes at least one module corresponding to the above functions.
Referring to FIG. 3, FIG. 3 is a structural block diagram of a dual-learning-based speech recognition and speech synthesis apparatus 300 provided by an embodiment of this application. The apparatus includes an initialization unit 301, a selection unit 302, a processing unit 303, a first generation unit 304, a second generation unit 305, and an optimization unit 306.
The initialization unit 301 is configured to initialize a labeled data set Φ_(x,y), a speech recognition parameter θ_xy, a speech synthesis parameter θ_yx, and a training data size N, where Φ_(x,y) = {(x^(j), y^(j))}_K, the labeled data set Φ_(x,y) contains K pairs of labeled data, (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), x^(j) is the speech data in the j-th pair, and y^(j) is the text data in the j-th pair.
The selection unit 302 is configured to select N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y), where K is a positive integer and N is a positive integer less than or equal to K.
The processing unit 303 is configured to extract the acoustic features of x^(i) and to obtain, from the acoustic features of x^(i), the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i).
The first generation unit 304 is configured to generate text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), and to compute the first log likelihood that ŷ^(i) equals y^(i).
The second generation unit 305 is configured to obtain the sound feature sequence corresponding to y^(i), to generate speech data x̂^(i) from the sound sequence features, and to compute the second log likelihood that x̂^(i) equals x^(i).
The optimization unit 306 is configured to, for the N pairs of labeled data, take maximizing the first log likelihood and the second log likelihood as the objective function, take the probabilistic duality of speech recognition and speech synthesis as the constraint, and optimize θ_xy and θ_yx.
Optionally, before the selection unit 302 selects the N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y), the apparatus further includes a pre-training unit, configured to randomly select S pairs of labeled data from Φ_(x,y), to pre-train the first speech recognition model to obtain the second speech recognition model, and to pre-train the first speech synthesis model to obtain the second speech synthesis model, where the second speech recognition model includes a deep neural network and a hidden Markov model, and the second speech synthesis model includes an encoder, a decoder, and a neural vocoder.
Optionally, the pre-training unit being used to pre-train the first speech recognition model to be trained to obtain the pre-trained second speech recognition model specifically includes:
an input unit, configured to input the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained;
a preprocessing unit, configured to preprocess the speech data x^(r) to obtain the frequency cepstral coefficient features corresponding to x^(r); and
the pre-training unit, further configured to pre-train the deep neural network-Gaussian mixture model of the first speech recognition model with the frequency cepstral coefficient features to obtain the second speech recognition model, which includes the trained deep neural network-Gaussian mixture model.
Optionally, the pre-training unit being used to pre-train the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model specifically includes:
the input unit, further configured to input the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech synthesis model to be trained; and
the pre-training unit, further configured to pre-train the encoder, the decoder, and the neural vocoder of the first speech synthesis model with the text data y^(r) to obtain the second speech synthesis model, which includes the trained encoder, the trained decoder, and the trained neural vocoder.
Optionally, the encoder, the decoder, and the neural vocoder all adopt recurrent neural network models.
Optionally, the processing unit 303 includes an extraction unit and an acquisition unit.
The extraction unit is configured to input the speech data x^(i) into the second speech recognition model and to extract the acoustic features of x^(i) frame by frame.
The acquisition unit is configured to input the acoustic features of x^(i) into the deep neural network in the second speech recognition model to obtain the posterior probabilities of the phonemes corresponding to x^(i), and to obtain the transition probabilities of the phonemes corresponding to x^(i) through the hidden Markov model in the second speech recognition model.
Optionally, the first generation unit 304 generating the text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to x^(i) specifically includes:
a determination unit, configured to determine the probabilities of w network paths from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), where w is a positive integer greater than zero; and
the acquisition unit, further configured to obtain the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths.
Optionally, the second generation unit 305 is specifically configured to: input y^(i) into the encoder of the second speech synthesis model to generate a semantic sequence; input the semantic sequence into the decoder of the second speech synthesis model to generate a sound feature sequence; input the sound sequence features into the neural vocoder of the second speech synthesis model to generate the speech data x̂^(i); and compute the second log likelihood that x̂^(i) equals x^(i).
Optionally, the optimization unit 306 is specifically configured to: take maximizing the first log likelihood and the second log likelihood as the objective function, take the probabilistic duality of speech recognition and speech synthesis as the constraint, combine the objective function and the constraint, and iteratively optimize θ_xy and θ_yx using the Lagrange multiplier optimization method.
The steps of the methods or algorithms described in connection with the disclosure of the embodiments of this application may be implemented in hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be a component of the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a network device. Of course, the processor and the storage medium may also exist in the network device as discrete components.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the embodiments of this application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in, or transmitted as one or more instructions or code on, a computer non-volatile readable medium. Computer non-volatile readable media include computer non-volatile storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium accessible to a general-purpose or special-purpose computer.
The specific implementations described above further describe the objectives, technical solutions, and beneficial effects of the embodiments of this application in detail. It should be understood that the above are only specific implementations of the embodiments of this application and are not intended to limit the protection scope of the embodiments of this application; any modification, equivalent replacement, improvement, or the like made on the basis of the technical solutions of the embodiments of this application shall be included within the protection scope of the embodiments of this application.

Claims (20)

  1. A dual-learning-based speech recognition and speech synthesis method, wherein the method comprises:
    initializing a labeled data set Φ_(x,y), a speech recognition parameter θ_xy, a speech synthesis parameter θ_yx, and a training data size N, wherein Φ_(x,y) = {(x^(j), y^(j))}_K, the labeled data set Φ_(x,y) contains K pairs of labeled data, (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), x^(j) is the speech data in the j-th pair of labeled data, y^(j) is the text data in the j-th pair of labeled data, K is a positive integer, and N is a positive integer less than or equal to K;
    selecting N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y);
    extracting acoustic features of the speech data x^(i), and obtaining, from the acoustic features of x^(i), posterior probabilities of phonemes corresponding to x^(i) and transition probabilities of the phonemes corresponding to x^(i);
    generating text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), and computing a first log likelihood that the text data ŷ^(i) equals the text data y^(i);
    obtaining a sound feature sequence corresponding to the text data y^(i), generating speech data x̂^(i) from the sound sequence features, and computing a second log likelihood that the speech data x̂^(i) equals the speech data x^(i); and
    for the N pairs of labeled data, taking maximizing the first log likelihood and the second log likelihood as an objective function, taking probabilistic duality of speech recognition and speech synthesis as a constraint, and optimizing θ_xy and θ_yx.
  2. The method according to claim 1, wherein before the randomly selecting N pairs of labeled data (x^(i), y^(i)) from the labeled data set Φ_(x,y), the method further comprises:
    randomly selecting S pairs of labeled data from the labeled data set Φ_(x,y), pre-training a first speech recognition model to be trained to obtain a pre-trained second speech recognition model, and pre-training a first speech synthesis model to be trained to obtain a pre-trained second speech synthesis model, wherein the second speech recognition model comprises a deep neural network-hidden Markov model, the second speech synthesis model comprises an encoder, a decoder, and a neural vocoder, and S is a positive integer less than or equal to K.
  3. The method according to claim 2, wherein the pre-training the first speech recognition model to be trained to obtain the pre-trained second speech recognition model comprises: inputting the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained;
    preprocessing the speech data x^(r) to obtain frequency cepstral coefficient features corresponding to x^(r); and
    pre-training a deep neural network-Gaussian mixture model of the first speech recognition model with the frequency cepstral coefficient features to obtain the second speech recognition model, wherein the second speech recognition model comprises the trained deep neural network-Gaussian mixture model.
  4. The method according to claim 2 or 3, wherein the pre-training the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model comprises:
    inputting the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech synthesis model to be trained; and
    pre-training the encoder, the decoder, and the neural vocoder of the first speech synthesis model with the text data y^(r) to obtain the second speech synthesis model, wherein the second speech synthesis model comprises the trained encoder, the trained decoder, and the trained neural vocoder.
  5. The method according to any one of claims 2 to 4, wherein the encoder, the decoder, and the neural vocoder all adopt recurrent neural network models.
  6. The method according to any one of claims 2 to 5, wherein the extracting the acoustic features of the speech data x^(i) and obtaining the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i) comprises:
    inputting the speech data x^(i) into the second speech recognition model, extracting the acoustic features of x^(i) frame by frame, inputting the acoustic features of x^(i) into the deep neural network in the second speech recognition model to obtain the posterior probabilities of the phonemes corresponding to x^(i), and inputting the phonemes corresponding to x^(i) into the hidden Markov model in the second speech recognition model to obtain the transition probabilities of the phonemes corresponding to x^(i).
  7. The method according to any one of claims 2 to 6, wherein the generating the text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to x^(i) comprises:
    determining probabilities of w network paths from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), wherein w is a positive integer greater than zero; and
    obtaining the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths.
  8. The method according to any one of claims 2 to 7, wherein the obtaining the sound feature sequence corresponding to the text data y^(i) and generating the speech data x̂^(i) from the sound sequence features comprises:
    inputting the text data y^(i) into the encoder of the second speech synthesis model to generate a semantic sequence; and
    inputting the semantic sequence into the decoder of the second speech synthesis model to generate a sound feature sequence, and inputting the sound sequence features into the neural vocoder of the second speech synthesis model to generate the speech data x̂^(i).
  9. The method according to any one of claims 2 to 8, wherein the taking maximizing the first log likelihood and the second log likelihood as the objective function, taking the probabilistic duality of speech recognition and speech synthesis as the constraint, and optimizing θ_xy and θ_yx comprises:
    taking maximizing the first log likelihood and the second log likelihood as the objective function, taking the probabilistic duality of speech recognition and speech synthesis as the constraint, combining the objective function and the constraint, and iteratively optimizing θ_xy and θ_yx using a Lagrange multiplier optimization method.
  10. A dual-learning-based speech recognition and speech synthesis apparatus, wherein the apparatus comprises:
    an initialization unit, configured to initialize a labeled data set Φ_(x,y), a speech recognition parameter θ_xy, a speech synthesis parameter θ_yx, and a training data size N, wherein Φ_(x,y) = {(x^(j), y^(j))}_K, (x^(j), y^(j)) denotes the j-th pair of labeled data in Φ_(x,y), the labeled data set Φ_(x,y) contains K pairs of labeled data, x^(j) is the speech data in the j-th pair of labeled data, y^(j) is the text data in the j-th pair of labeled data, K is a positive integer, and N is a positive integer less than or equal to K;
    a selection unit, configured to select N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ_(x,y);
    a processing unit, configured to extract acoustic features of the speech data x^(i), and to obtain, from the acoustic features of x^(i), posterior probabilities of phonemes corresponding to x^(i) and transition probabilities of the phonemes corresponding to x^(i);
    a first generation unit, configured to generate text data ŷ^(i) from the posterior probabilities of the phonemes corresponding to x^(i) and the transition probabilities of the phonemes corresponding to x^(i), and to compute a first log likelihood that the text data ŷ^(i) equals the text data y^(i);
    a second generation unit, configured to obtain a sound feature sequence corresponding to the text data y^(i), to generate speech data x̂^(i) from the sound sequence features, and to compute a second log likelihood that the speech data x̂^(i) equals the speech data x^(i); and
    an optimization unit, configured to, for the N pairs of labeled data, take maximizing the first log likelihood and the second log likelihood as an objective function, take probabilistic duality of speech recognition and speech synthesis as a constraint, and optimize θ_xy and θ_yx.
  11. The apparatus according to claim 10, wherein before the selection unit selects the N pairs of labeled data {(x^(i), y^(i))}_N from the labeled data set Φ(x,y), the apparatus further comprises:
    a pre-training unit, configured to randomly select S pairs of labeled data from the labeled data set Φ(x,y), to pre-train a first speech recognition model to be trained so as to obtain a pre-trained second speech recognition model, and to pre-train a first speech synthesis model to be trained so as to obtain a pre-trained second speech synthesis model, wherein the second speech recognition model comprises a deep neural network and a hidden Markov model, the second speech synthesis model comprises an encoder, a decoder and a neural vocoder, and S is a positive integer less than or equal to K.
  12. The apparatus according to claim 11, wherein, for pre-training the first speech recognition model to be trained to obtain the pre-trained second speech recognition model, the apparatus specifically comprises:
    an input unit, configured to input the speech data x^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech recognition model to be trained; and
    a preprocessing unit, configured to preprocess the speech data x^(r) to obtain the frequency cepstral coefficient features corresponding to the speech data x^(r);
    wherein the pre-training unit is further configured to pre-train the deep neural network-Gaussian mixture model of the first speech recognition model by using the frequency cepstral coefficient features to obtain the second speech recognition model, and the second speech recognition model comprises the trained deep neural network-Gaussian mixture model.
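    A sketch of the preprocessing unit of claim 12, under the assumption that the "frequency cepstral coefficient features" are standard MFCCs; the librosa calls are real, but the sample rate, coefficient count and frame geometry are illustrative choices, not taken from the application.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Preprocess speech data x(r) into MFCC features, one row per frame."""
    wave, sr = librosa.load(wav_path, sr=sr)
    # 25 ms analysis windows with a 10 ms hop, a common ASR front end.
    mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    return mfcc.T
```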
  13. The apparatus according to claim 11 or 12, wherein, for pre-training the first speech synthesis model to be trained to obtain the pre-trained second speech synthesis model:
    the input unit is further configured to input the text data y^(r) in the S pairs of labeled data {(x^(r), y^(r))}_S into the first speech synthesis model to be trained; and
    the pre-training unit is further configured to pre-train the encoder, the decoder and the neural vocoder of the first speech synthesis model by using the text data y^(r) to obtain the second speech synthesis model, wherein the second speech synthesis model comprises the trained encoder, the trained decoder and the trained neural vocoder.
  14. The apparatus according to any one of claims 11 to 13, wherein the encoder, the decoder and the neural vocoder each adopt a recurrent neural network model.
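    Claim 14 fixes only the architecture family; a minimal PyTorch sketch of a recurrent encoder, decoder and vocoder might look as follows. All layer sizes, the GRU cell type, and the one-sample-per-frame "vocoder" are assumptions made here for brevity.

```python
import torch.nn as nn

class RNNSynthesizer(nn.Module):
    """Encoder, decoder and neural vocoder, each a recurrent network."""
    def __init__(self, vocab=5000, emb=256, hid=512, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)     # text -> semantic sequence
        self.decoder = nn.GRU(hid, n_mels, batch_first=True)  # semantic -> acoustic features
        self.vocoder = nn.GRU(n_mels, 1, batch_first=True)    # features -> waveform samples

    def forward(self, text_ids):
        h, _ = self.encoder(self.embed(text_ids))  # semantic sequence h(i)
        feats, _ = self.decoder(h)                 # acoustic feature sequence
        wave, _ = self.vocoder(feats)              # toy vocoder: one sample per frame;
        return wave.squeeze(-1)                    # a real one would upsample to audio rate
```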
  15. The apparatus according to any one of claims 11 to 14, wherein the processing unit comprises:
    an extraction unit, configured to input the speech data x^(i) into the second speech recognition model and to extract the acoustic features of the speech data x^(i) frame by frame; and
    an obtaining unit, configured to input the acoustic features of the speech data x^(i) into the deep neural network in the second speech recognition model to obtain the posterior probabilities of the phonemes corresponding to the speech data x^(i), and to input the phonemes corresponding to the speech data x^(i) into the hidden Markov model in the second speech recognition model to obtain the transition probabilities of the phonemes corresponding to the speech data x^(i).
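    A rough numpy sketch of the two probability sources in claim 15, assuming the deep neural network emits per-frame phoneme scores and the hidden Markov model contributes a fixed transition matrix; the shapes and names are hypothetical.

```python
import numpy as np

def phoneme_probabilities(dnn_logits, hmm_transitions):
    """dnn_logits: (frames, n_phones); hmm_transitions: (n_phones, n_phones)."""
    # Posterior probability of each phoneme, frame by frame (softmax).
    e = np.exp(dnn_logits - dnn_logits.max(axis=1, keepdims=True))
    posteriors = e / e.sum(axis=1, keepdims=True)

    # Transition probabilities between the top phonemes of successive frames.
    best = posteriors.argmax(axis=1)
    transitions = hmm_transitions[best[:-1], best[1:]]
    return posteriors, transitions
```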
  16. The apparatus according to any one of claims 11 to 15, wherein, for generating the text data ŷ^(i) according to the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to the speech data x^(i), the first generating unit specifically comprises:
    a determining unit, configured to determine the probabilities of w network paths according to the posterior probabilities of the phonemes corresponding to the speech data x^(i) and the transition probabilities of the phonemes corresponding to the speech data x^(i), wherein w is a positive integer greater than zero;
    wherein the obtaining unit is further configured to obtain the text data ŷ^(i) corresponding to the network path with the highest probability among the w network paths.
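    The highest-probability path of claim 16 is conventionally found with Viterbi decoding; here is a compact log-domain sketch under the same assumed shapes as above (the mapping from the winning phoneme path back to text ŷ(i) via a lexicon is omitted).

```python
import numpy as np

def viterbi(log_post, log_trans):
    """log_post: (frames, n_phones) log posteriors; log_trans: (n_phones, n_phones)."""
    T, S = log_post.shape
    score = log_post[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans   # extend every previous state
        back[t] = cand.argmax(axis=0)       # best predecessor per state
        score = cand.max(axis=0) + log_post[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):           # backtrace the winning path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```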
  17. The apparatus according to any one of claims 11 to 16, wherein, for obtaining the acoustic feature sequence corresponding to the text data y^(i) and generating the speech data x̂^(i) according to the acoustic feature sequence, the second generating unit is specifically configured to:
    input the text data y^(i) into the encoder of the second speech synthesis model to generate a semantic sequence h^(i); and
    input the semantic sequence h^(i) into the decoder of the second speech synthesis model to generate an acoustic feature sequence, and input the acoustic feature sequence into the neural vocoder of the second speech synthesis model to generate the speech data x̂^(i).
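    Using the hypothetical `RNNSynthesizer` sketched under claim 14, the encoder-decoder-vocoder chain of claim 17 is simply its forward pass:

```python
import torch

model = RNNSynthesizer()
text_ids = torch.randint(0, 5000, (1, 32))  # a toy token sequence standing in for y(i)
wave = model(text_ids)                      # generated speech data x_hat(i)
print(wave.shape)                           # torch.Size([1, 32]) in this toy setup
```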
  18. The apparatus according to any one of claims 11 to 17, wherein the optimization unit is specifically configured to:
    take the maximization of the first log-likelihood and the second log-likelihood as the objective function, take the probabilistic duality of speech recognition and speech synthesis as the constraint condition, combine the objective function with the constraint condition, and iteratively optimize the θ_xy and the θ_yx by means of a Lagrange multiplier optimization algorithm.
  19. A computer non-volatile readable storage medium, on which a computer program is stored, wherein, when the program is executed by a processor, the speech recognition and speech synthesis method based on dual learning according to any one of claims 1 to 9 is implemented.
  20. A server, comprising: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to perform the speech recognition and speech synthesis method based on dual learning according to any one of claims 1 to 9.
PCT/CN2019/117567 2019-02-22 2019-11-12 Speech recognition and speech synthesis method and apparatus based on dual learning WO2020168752A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910135575.7A CN109887484B (en) 2019-02-22 2019-02-22 Dual learning-based voice recognition and voice synthesis method and device
CN201910135575.7 2019-02-22

Publications (1)

Publication Number Publication Date
WO2020168752A1 2020-08-27

Family

ID=66929081

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117567 WO2020168752A1 (en) 2019-02-22 2019-11-12 Speech recognition and speech synthesis method and apparatus based on dual learning

Country Status (2)

Country Link
CN (1) CN109887484B (en)
WO (1) WO2020168752A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887484B (en) * 2019-02-22 2023-08-04 平安科技(深圳)有限公司 Dual learning-based voice recognition and voice synthesis method and device
CN113412514A (en) 2019-07-09 2021-09-17 谷歌有限责任公司 On-device speech synthesis of text segments for training of on-device speech recognition models
CN110765784A (en) * 2019-09-12 2020-02-07 内蒙古工业大学 Mongolian Chinese machine translation method based on dual learning
CN113314096A (en) * 2020-02-25 2021-08-27 阿里巴巴集团控股有限公司 Speech synthesis method, apparatus, device and storage medium
CN113495943B (en) * 2020-04-02 2023-07-14 山东大学 Man-machine dialogue method based on knowledge tracking and transferring
CN111583913B (en) * 2020-06-15 2020-11-03 深圳市友杰智新科技有限公司 Model training method and device for speech recognition and speech synthesis and computer equipment
CN111444731B (en) * 2020-06-15 2020-11-03 深圳市友杰智新科技有限公司 Model training method and device and computer equipment
CN111428867B (en) * 2020-06-15 2020-09-18 深圳市友杰智新科技有限公司 Model training method and device based on reversible separation convolution and computer equipment
CN112634919B (en) * 2020-12-18 2024-05-28 平安科技(深圳)有限公司 Voice conversion method, device, computer equipment and storage medium
CN112599116B (en) * 2020-12-25 2022-07-08 思必驰科技股份有限公司 Speech recognition model training method and speech recognition federal training system
CN113160793A (en) * 2021-04-23 2021-07-23 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium based on low resource language
CN113284484B (en) * 2021-05-24 2022-07-26 百度在线网络技术(北京)有限公司 Model training method and device, voice recognition method and voice synthesis method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110218804A1 (en) * 2010-03-02 2011-09-08 Kabushiki Kaisha Toshiba Speech processor, a speech processing method and a method of training a speech processor
CN105810191A (en) * 2016-03-08 2016-07-27 江苏信息职业技术学院 Prosodic information-combined Chinese dialect identification method
CN108133705A (en) * 2017-12-21 2018-06-08 儒安科技有限公司 Speech recognition and phonetic synthesis model training method based on paired-associate learning
CN108847249A (en) * 2018-05-30 2018-11-20 苏州思必驰信息科技有限公司 Sound converts optimization method and system
CN109887484A (en) * 2019-02-22 2019-06-14 平安科技(深圳)有限公司 A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6343267B1 (en) * 1998-04-30 2002-01-29 Matsushita Electric Industrial Co., Ltd. Dimensionality reduction for speaker normalization and speaker and environment adaptation using eigenvoice techniques
CN101894548B (en) * 2010-06-23 2012-07-04 清华大学 Modeling method and modeling device for language identification
US20120158398A1 (en) * 2010-12-17 2012-06-21 John Denero Combining Model-Based Aligner Using Dual Decomposition
CN104464744A (en) * 2014-11-19 2015-03-25 河海大学常州校区 Cluster voice transforming method and system based on mixture Gaussian random process
CN105760852B (en) * 2016-03-14 2019-03-05 江苏大学 A kind of driver's emotion real-time identification method merging countenance and voice
CN105976812B (en) * 2016-04-28 2019-04-26 腾讯科技(深圳)有限公司 A kind of audio recognition method and its equipment
CN107633842B (en) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107331384B (en) * 2017-06-12 2018-05-04 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107680582B (en) * 2017-07-28 2021-03-26 平安科技(深圳)有限公司 Acoustic model training method, voice recognition method, device, equipment and medium
CN108369813B (en) * 2017-07-31 2022-10-25 深圳和而泰智能家居科技有限公司 Specific voice recognition method, apparatus and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571064A (en) * 2021-07-07 2021-10-29 肇庆小鹏新能源投资有限公司 Natural language understanding method and device, vehicle and medium
CN113793591A (en) * 2021-07-07 2021-12-14 科大讯飞股份有限公司 Speech synthesis method and related device, electronic equipment and storage medium
CN113571064B (en) * 2021-07-07 2024-01-30 肇庆小鹏新能源投资有限公司 Natural language understanding method and device, vehicle and medium
CN113793591B (en) * 2021-07-07 2024-05-31 科大讯飞股份有限公司 Speech synthesis method, related device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109887484A (en) 2019-06-14
CN109887484B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
WO2020168752A1 (en) Speech recognition and speech synthesis method and apparatus based on dual learning
US11468244B2 (en) Large-scale multilingual speech recognition with a streaming end-to-end model
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
Kheddar et al. Deep transfer learning for automatic speech recognition: Towards better generalization
WO2023160472A1 (en) Model training method and related device
JP2004362584A (en) Discrimination training of language model for classifying text and sound
JP2019159654A (en) Time-series information learning system, method, and neural network model
KR20230147685A (en) Word-level reliability learning for subword end-to-end automatic speech recognition
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
US10096317B2 (en) Hierarchical speech recognition decoder
CN112599128A (en) Voice recognition method, device, equipment and storage medium
US20230377564A1 (en) Proper noun recognition in end-to-end speech recognition
Ahmed et al. End-to-end lexicon free arabic speech recognition using recurrent neural networks
US20230104228A1 (en) Joint Unsupervised and Supervised Training for Multilingual ASR
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
Abdelmaksoud et al. Convolutional neural network for arabic speech recognition
US20220310080A1 (en) Multi-Task Learning for End-To-End Automated Speech Recognition Confidence and Deletion Estimation
CN115312033A (en) Speech emotion recognition method, device, equipment and medium based on artificial intelligence
EP4315319A1 (en) Supervised and unsupervised training with contrastive loss over sequences
Mehra et al. Deep fusion framework for speech command recognition using acoustic and linguistic features
WO2021174922A1 (en) Statement sentiment classification method and related device
WO2023116572A1 (en) Word or sentence generation method and related device
US20220310097A1 (en) Reducing Streaming ASR Model Delay With Self Alignment
KR20240065125A (en) Large-scale language model data selection for rare word speech recognition.
CN112951270B (en) Voice fluency detection method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19915605

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19915605

Country of ref document: EP

Kind code of ref document: A1