CN113851113A

CN113851113A - Model training method and device and voice awakening method and device

Info

Publication number: CN113851113A
Application number: CN202111137419.8A
Authority: CN
Inventors: 石杨
Original assignee: Vivo Mobile Communication Co Ltd
Current assignee: Vivo Mobile Communication Co Ltd
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2021-12-28

Abstract

The application discloses a model training method and device, a voice awakening method and device, electronic equipment and a readable storage medium, and belongs to the technical field of data processing. The model training method comprises the following steps: acquiring first characteristic information of audio training data, wherein the audio training data comprises awakening audio and non-awakening audio; generating a confrontation network model through an acoustic model to be trained and the first characteristic information, and outputting phoneme information and semantic information of the audio training data; outputting second characteristic information of the audio training data through the generated confrontation network model to be trained, the phoneme information and the semantic information; and training the acoustic model and the generation countermeasure network model according to the first characteristic information and the second characteristic information.

Description

Model training method and device and voice awakening method and device

Technical Field

The application belongs to the technical field of data processing, and particularly relates to a model training method and device, a voice awakening method and device, electronic equipment and a readable storage medium.

Background

Currently, voice interaction has become an important form of human-computer interaction. The voice wake-up function is used as an entrance of voice interaction, and is successfully applied to various types of electronic devices, such as smart speakers, smart phones, smart home devices, and smart car-mounted devices.

For example, the user can successfully wake up the smart sound box through the designated wake-up word, so that the sound box can be controlled to play audio through voice; for another example, the user can successfully wake up the mobile phone by the designated wake-up word, so that the mobile phone can be controlled by voice to make a call.

In the prior art, the phenomena of awakening failure or mistaken awakening and the like often occur due to inaccurate voice judgment.

Disclosure of Invention

The embodiment of the application aims to provide a model training method, which can solve the problem that in the prior art, the phenomena of awakening failure or mistaken awakening and the like often occur due to inaccurate voice judgment.

In a first aspect, an embodiment of the present application provides a model training method, where the method includes: acquiring first characteristic information of audio training data, wherein the audio training data comprises awakening audio and non-awakening audio; generating a confrontation network model through an acoustic model to be trained and the first characteristic information, and outputting phoneme information and semantic information of the audio training data; outputting second characteristic information of the audio training data through the generated confrontation network model to be trained, the phoneme information and the semantic information; and training the acoustic model and the generation countermeasure network model according to the first characteristic information and the second characteristic information.

In a second aspect, an embodiment of the present application provides a voice wake-up method, where the method includes: acquiring third characteristic information of the first audio; outputting first phoneme information of the first audio through the acoustic model and the third feature information; outputting a wake-up instruction under the condition that the first phoneme information is matched with the preset phoneme information of the wake-up audio; wherein the acoustic model is obtained by training according to the model training method of the first aspect.

In a third aspect, an embodiment of the present application provides a model training apparatus, including: the first acquisition module is used for acquiring first characteristic information of audio training data, wherein the audio training data comprises awakening audio and non-awakening audio; the first output module is used for outputting phoneme information and semantic information of the audio training data through an acoustic model to be trained, a confrontation network model and the first characteristic information; the second output module is used for outputting second characteristic information of the audio training data through the generated confrontation network model to be trained, the phoneme information and the semantic information; and the training module is used for training the acoustic model and the generation confrontation network model according to the first characteristic information and the second characteristic information.

In a fourth aspect, an embodiment of the present application provides a voice wake-up apparatus, including: the second acquisition module is used for acquiring third characteristic information of the first audio; a third output module, configured to output first phoneme information of the first audio through the acoustic model and the third feature information; the fourth output module is used for outputting a wake-up instruction under the condition that the first phoneme information is matched with the preset phoneme information of the wake-up audio; wherein the acoustic model is obtained by training according to the model training method of the first aspect.

In a fifth aspect, the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect or the second aspect.

In a sixth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first or second aspect.

In a seventh aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect or the second aspect.

Thus, in the embodiment of the present application, in the voice wake-up function, the acoustic model needs to be trained to ensure that the accuracy of the acoustic model for audio judgment is high. Firstly, a large amount of audio including awakening audio and non-awakening audio is used as audio training data, first characteristic information is extracted, the first characteristic information is input into an acoustic model, and phoneme information of the audio training data is output. And secondly, inputting the first characteristic information into a generation confrontation network model, and outputting semantic information of the audio training data. Then, the phoneme information is input into a generation confrontation network model, the generation confrontation network model combines the semantic information and the phoneme information, and second characteristic information of the audio training data is output. Further, based on the output second feature information and the first feature information, the acoustic model and the generative confrontation network model are trained so as to minimize a difference between the second feature information and the first feature information. Therefore, in the embodiment of the application, the mode of combining the phoneme information and the semantic information is mainly adopted to enhance the representation of the audio semantic feature information to realize model training in the whole function, so that the aim of training the acoustic model is fulfilled, the accuracy of judging the audio by the acoustic model is higher, the accuracy of judging and awakening the audio is further improved, and the phenomenon of awakening without awakening or mistakenly awakening is avoided.

Drawings

FIG. 1 is a flow chart of a model training method of an embodiment of the present application;

FIG. 2 is a schematic diagram illustrating a model training method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a network structure of a model training method according to an embodiment of the present application;

FIG. 4 is a flowchart of a voice wake-up method according to an embodiment of the present application;

FIG. 5 is a block diagram of a model training apparatus according to an embodiment of the present application;

FIG. 6 is a block diagram of a voice wake-up unit according to an embodiment of the present application;

fig. 7 is one of the hardware configuration diagrams of the electronic device according to the embodiment of the present application;

fig. 8 is a second schematic diagram of a hardware structure of the electronic device according to the embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.

The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that embodiments of the application may be practiced in sequences other than those illustrated or described herein, and that the terms "first," "second," and the like are generally used herein in a generic sense and do not limit the number of terms, e.g., the first term can be one or more than one. In addition, "and/or" in the specification and claims means at least one of connected objects, a character "/" generally means that a preceding and succeeding related objects are in an "or" relationship.

The model training method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings by specific embodiments and application scenarios thereof.

Referring to fig. 1, a flowchart of a model training method according to an embodiment of the present application is shown, and the method is applied to an electronic device, and includes:

step 110: first characteristic information of audio training data is obtained, and the audio training data comprises awakening audio and non-awakening audio.

The wake-up audio is an audio for outputting the wake-up command, and the other audio except the wake-up audio is a non-wake-up audio.

In this embodiment, based on audio training data including a wake-up audio and a non-wake-up audio, a model in the wake-up function is trained to improve the accuracy of audio determination in the wake-up function.

The first feature information is a sum of feature information derived based on a large number of audios in the audio training data.

In the present embodiment, the first feature information is used to represent the Fbank feature of the audio training data.

Optionally, based on the audio training data, Fbank feature extraction is performed on the training corpus, generally, 80-dimensional features can be extracted, and the sampling rate is 16 KHz.

Step 120: and outputting phoneme information and semantic information of the audio training data through the acoustic model to be trained, the generation confrontation network model and the first characteristic information.

Wherein phoneme information of the audio training data is output through the acoustic model to be trained.

The acoustic model is a model for recognizing a sound in speech recognition or speech wakeup.

In this step, phoneme information of the audio training data is output through the acoustic model with the first feature information as an input.

Optionally, the phoneme information comprises a phoneme probability matrix. Wherein, for each audio in the audio training data, each frame corresponds to a set of phoneme probability sequences.

In addition, semantic information of the audio training data is output through the generated confrontation network model to be trained.

Alternatively, the generation of the countermeasure network model in the present embodiment is based on a variable automatic encoder (C-VAE), and it may be considered that the generation of the countermeasure network model includes an encoder.

Therefore, in this step, the semantic information of the audio training data is output by the encoder with the first feature information as input.

The semantic information is the sum of semantic information obtained by a large number of audios in the audio training data.

Illustratively, through the encoder, the respective audios in the audio training data, semantic information corresponding to each frame can be obtained.

The second feature information of this embodiment is a semantic representation hidden variable.

Step 130: and outputting second characteristic information of the audio training data through the generated confrontation network model to be trained, the phoneme information and the semantic information.

In this embodiment, a generative countermeasure (VAWGAN) network is utilized, which is a deep learning model and one of the most promising approaches to unsupervised learning in complex distributions in recent years. The model passes through at least two modules in the framework: the mutual game learning of the generation module (Generative Model) and the authentication module (Discriminative Model) produces a rather good output.

Optionally, the generation module comprises a generator and the authentication module comprises an authenticator.

In this step, the output of the acoustic model (i.e., the phoneme information of the audio training data) and the output of the encoder (i.e., the semantic information of the audio training data) are concatenated, input to the generator, and the second feature information of the audio training data is output.

In this embodiment, the second feature information is used to represent the fake feature of the audio training data.

Wherein the first feature information is a true feature derived based on the audio training data and the second feature information is a synthesized feature derived based on the model output.

Referring to fig. 2, the output z of the encoder represents a semantic representation hidden variable; the acoustic model output a (x) represents a phoneme posterior probability matrix with the horizontal axis being the time dimension and the vertical axis being the phoneme posterior probability sequence. Since both can represent semantic information characterization of audio, combining them can better enhance the characterization of audio semantic feature information, so that the acoustic model generated by modeling can better characterize phoneme probability of each frame of audio.

Step 140: and training the acoustic model and the generation countermeasure network model according to the first characteristic information and the second characteristic information.

In this step, the acoustic model and the VAWGAN network are trained to adjust parameters in each model, and finally the trained acoustic model and the VAWGAN network are obtained.

In the training process, an acoustic model parameter A, an encoder parameter phi, a generator parameter theta and a discriminator parameter psi are respectively optimized.

In this step, the training is performed to minimize the difference between the synthesized second feature information and the real first feature information, so that the audio identified by the acoustic model is closest to the real audio, and the accuracy of audio judgment can be improved.

In the flow of the model training method according to another embodiment of the present application, step 140 includes:

substep A1: the acoustic model and the generation countermeasure network model are trained until a first error rate between the first feature information and the second feature information satisfies a first preset condition.

In this step, the first feature information and the second feature information are input to the discriminator, and the difference between the first feature information and the second feature information is output.

Optionally, the difference between the first characteristic information and the second characteristic information is represented by a first error rate.

For the present embodiment, an explanation is that for the training purpose ultimately achieved by the present application, a first error rate between the first feature information and the second feature information is smaller than a certain threshold, and therefore, the first preset condition is: the first error rate is less than the threshold.

For the present embodiment, it is further explained that for the training purpose ultimately achieved by the present application, the preset number of iterations is reached, so that the first error rate between the first feature information and the second feature information does not change any more and reaches the minimum value, and therefore, the first preset condition is: and reaching the error rate under the preset iteration times.

Exemplarily, in one experiment, the number of iterations is chosen to be 200000.

In this embodiment, based on a first preset condition, a final training effect is achieved, so that the difference between the first feature information and the second feature information is minimized, and the accuracy of the acoustic model in determining the audio frequency is improved.

In the flow of the model training method according to another embodiment of the present application, before step 120, the method further includes:

step B1: and training the acoustic model until the matching rate of the phoneme information of the audio training data and the preset phoneme information meets a fourth preset condition.

Optionally, the audio training data further includes text labels corresponding to the respective audios.

Optionally, the trained speech recognition network with high accuracy is used, and the text labels corresponding to the audios are combined, so as to align the audio training data, and obtain a phoneme label corresponding to each frame of each audio in the audio training data, and further, all the phoneme labels form the preset phoneme information of this embodiment.

The preset phoneme information comprises a phoneme label corresponding to each frame of each audio.

The matching rate in this embodiment is a total matching rate obtained based on matching between the phoneme probability sequence of each frame in the audio training data and the phoneme label of the corresponding frame.

Optionally, the training process of this embodiment is:

and establishing a mapping relation between the Fbank characteristics x of the audio training data and preset phoneme information.

The method comprises the steps that firstly, a phoneme probability sequence of each frame is output through an acoustic model and first characteristic information; and obtaining the phoneme label of each frame through a speech recognition network.

Secondly, a phoneme probability sequence z is obtained by utilizing a cross entropy loss function and metric reasoning_pError loss function between phoneme label:

z_p＝[p_i1,p_i2,...,p_ic] (2)

wherein M is the sum of all phoneme labels; y is_icThe symbol function (0 or 1) of the phoneme label is adopted, if the phoneme label of the ith frame is equal to c, 1 is selected, otherwise 0 is selected; p is a radical of_icA prediction probability that the ith frame belongs to c; z is a radical of_pIs a phoneme probability sequence.

Wherein z is_pAnd (3) deducing by using an acoustic model according to the input Fbank characteristics:

z_p＝A(x) (3)

and A is a parameter of the acoustic model, and in the process of training the acoustic model, the cross entropy loss in the step (1) is minimized through continuous iteration, so that the acoustic model is continuously converged.

Wherein, the matching rate satisfies a fourth preset condition, which correspondingly is: phoneme probability sequence z_pThe error L between the phoneme labels is minimized.

In this embodiment, before the phoneme information of the output audio training data is used as the input of the encoder, the acoustic model may be preliminarily trained according to the above training method to minimize the difference between the phoneme information of the audio training data obtained by the acoustic model and the preset phoneme information. Therefore, on the basis of the training method provided by the embodiment, the training method provided by the previous embodiment is combined, so that the purpose of performing more precise training on the acoustic model can be achieved, and the accuracy of the acoustic model for audio judgment is ensured to be as high as possible.

In the flow of the model training method according to another embodiment of the present application, the generating the confrontation network model includes an identification module and a generation module, and step 140 includes:

substep C1: and training the generating module and the acoustic model until the second characteristic information output by the generating module meets a second preset condition.

Substep C2: and training the identification module until a second error rate between the first characteristic information and second characteristic information output by the generation module meets a third preset condition.

As can be seen from the foregoing embodiments, in the present application, the VAWGAN network is incorporated into the decoder based on the variational automatic encoder to improve the VAE effect. The VAWGAN includes two parts, one part is a generator for generating a synthesized spectrum, and the other part is a discriminator for judging whether the synthesized spectrum is a true spectrum. It can be understood that: the decoder includes a generator and a discriminator.

In a VAWGAN network, the objective function is:

J_vawgan＝L(x；φ,θ)+αJ_wgan (4)

where L (x; φ, θ) is the objective function of the encoder section:

wherein D is_KL(q_φ(z|x)||p_θ(z)) represents an authentication module q_φRelative entropy (KL Divergence) between (z | x) and the true posterior probability p (z | x). Prior probability p_θ(z) is a standard multidimensional gaussian distribution. q. q.s_φ(z | x) and p_θ(x | z) are encoder and decoder, respectively, subject to multi-dimensional Gaussian componentsCloth whose mean vector and covariance matrix are respectively (mu)_φ(z),σ_φ(z)) and (. mu.))_θ(x),σ_θ(x) ). Thus, the two terms on the right can be simplified as:

where K is the dimension of the intermediate variable z and L is the pair q_φ(z | x) number of samples. Since the sampling process is a discontinuous operation and cannot be derived, the network parameters of the encoder and the decoder cannot be updated by back propagation. Then, another random variable epsilon is introduced to re-parameterize the hidden variable z, let z^(l)＝μ_θ(x)+ε^(l)*σ_θ(x) ε (l) to N (0, I), then:

wherein D is the number of samples of x.

To this end, an objective loss function for an optimized VAWGAN network may be obtained.

The parameters of the acoustic model A (x) are dynamically changed according to the loss function along with the training process, so that the model is continuously converged; the output z of the encoder is dynamically varied according to the output of the encoder.

Based on the above, it is explained how the acoustic model a (x) and the VAWGAN are trained simultaneously, so that the acoustic model a (x) achieves better effect.

J_wganAn objective function representing the VAWGAN part:

wherein α isLoss factor of VAWGAN, D_ΨAnd outputting the judgment of the discriminator on the true and false characteristics. The A (x) combination z is sent to the generator and then judged by the discriminator. The second half of the above equation is the loss function of the generator two-dimensional convolutional neural network:

since the acoustic model a (x) needs to continually optimize the parameters in this process, the objective function of the optimization generator becomes:

wherein min represents the minimum generator and acoustic model loss, and the parameters of the optimal generator and acoustic model A are solved; the second half of the above equation is an acoustic model loss function, and a generator loss function needs to be combined, so that the overall loss reaches an optimal value.

Due to the loss function optimization of the acoustic model added in the generator, the loss function of the discriminator two-dimensional convolution neural network becomes:

the objective function of the optimization discriminator is:

where max represents the loss function that maximizes the discriminator, i.e. the goal of the discriminator is to maximize the difference between the true and fake features, thereby constantly optimizing the model parameters of the discriminator.

In the present embodiment, the decoder is composed of a generator and a discriminator. In the training process, firstly, the parameters of the discriminator are fixed, and the generator and the acoustic model are trained to lead the loss function L of the whole generator_GThe size is as small as possible, namely the second characteristic information meets a second preset condition, and the generated Fbank characteristic x' (namely the second characteristic information) is obtained; the generator and acoustic model parameters are then fixed, and the discriminator is trained such that the loss function L of the discriminator_DAs large as possible, i.e., -L_DAnd minimizing, namely that a second error rate between the first characteristic information and the second characteristic information satisfies a third preset condition.

It should be noted that, in one explanation, the two steps in the present embodiment are alternately repeated. For example, for the first time: based on the first parameters of the discriminator, step C1 is performed to train the generator and the acoustic model such that the overall loss function L of the generator is_GThe size is as small as possible; further, based on the parameters of the generator and the acoustic model after the training, performing step C2 to train the discriminator; secondly, since the parameters of the discriminator are adjusted in the first step C2, the generator and the acoustic model are trained based on the adjusted second parameters so that the loss function L of the entire generator is obtained_GThe size is as small as possible; further, based on the parameters of the generator and the acoustic model after the training, step C2 is performed to train the discriminator. And the rest is repeated until the iteration times are finished.

Correspondingly, the second error rate is used for representing the error rate obtained in one repeated step, and the first error rate is used for representing the error rate obtained in the final training.

In yet another explanation, the two steps in the present embodiment represent two types of steps, respectively. For example, step C1, representing all steps of training the generator and acoustic models, may be an overview of the multiple repetition steps; step C2, representing all steps in the training of the discriminator, may be an overview of the multiple repetition steps.

Correspondingly, the first error rate and the second error rate are used for representing the error rate obtained by the final training.

In this explanation, the sequence between step C1 and step C2 is not limited.

Wherein, the generator adopts a two-dimensional convolutional neural network and comprises 4 convolutional layers. The filter sizes of the 4 convolution layers are respectively 9 × 1, 7 × 1 and 1025 × 1, the step sizes are respectively 3, 3 and 1, the filter depths are respectively 32, 16, 8 and 1, and the LReLU function is adopted as the activation function.

The discriminator adopts a two-dimensional convolutional neural network, which comprises 3 convolutional layers and 1 full-connection layer. The filter size of 3 convolution layers is 7 x 1, 115 x 1 respectively, the step size is 3, the filter depth is 16, 32, 64 respectively, and the activation function adopts LReLU function.

The acoustic model is consistent with the encoder structure, and a two-dimensional convolutional neural network is adopted, wherein the two-dimensional convolutional neural network comprises 5 convolutional layers and 1 full-connection layer. The filter size of 5 convolution layers is 7 x 1, the step size is 3, the filter depth is 16, 32, 64, 128 and 256 respectively, and the LReLU function is adopted as the activation function. The network structure is shown in fig. 3. And updating network model parameters by using a Stochastic Gradient Descent (SGD) method in the training process.

In this embodiment, a method for training a voice awakening acoustic model based on a generated countermeasure network is provided, so as to improve a modeling effect of the acoustic model. The training of the voice wake-up system is realized by combining a generation countermeasure network and an acoustic model based on a variational self-encoder. The VAWGAN network is combined in the acoustic model, so that the modeling quality of the acoustic model can be better improved, and high-quality voice awakening is realized.

In the flow of the model training method according to another embodiment of the present application, step 120 includes:

substep D1: and outputting phoneme information corresponding to the target frame of each audio in the audio training data through the acoustic model to be trained and the first characteristic information.

Substep D2: and outputting semantic information corresponding to the target frames of the audios in the audio training data through the generated confrontation network model to be trained and the first characteristic information.

Optionally, the target frame comprises each frame of each audio in the audio training data.

Optionally, the target frames comprise partial frames of each audio in the audio training data. Wherein, the partial frames can be collected according to a certain frequency to ensure that the target frames are uniformly distributed in the audio training data.

Correspondingly, based on the acquired phoneme information of the target frame, semantic information of the corresponding frame is acquired, so that the phoneme information and the semantic information of any frame are combined to generate a fake feature corresponding to the frame, and the fake feature is compared with the Fbank feature corresponding to the frame.

Furthermore, each frame in the target frames is subjected to feature comparison in sequence, so that the feature comparison of the whole audio training data is completed.

In the present embodiment, a method for acquiring phoneme information and semantic information of audio training data is provided to describe the present embodiment in more detail. In the embodiment, for audio training data, corresponding phoneme information and semantic information are regularly acquired for a target frame in the audio training data to generate a synthetic feature of the frame, so that the synthetic feature is used for comparing with a real feature of the frame. Therefore, in the embodiment, based on the feature comparison of the target frame, the overall situation of the audio training data can be inferred, so as to be used for model training in the application.

Referring to fig. 4, a flowchart of a voice wake-up method according to another embodiment of the present application is shown, and applied to an electronic device, the method includes:

step 150: and acquiring third characteristic information of the first audio.

The model training method and the voice wake-up method provided by the present application are respectively and correspondingly applied to two stages, where the first stage is the training stage in the foregoing embodiment, and the other stage is the wake-up stage in the present embodiment.

In this step, the third feature information is used to represent the Fbank feature of the first audio.

In this embodiment, the first audio may be a segment of an audio stream. Therefore, a segment of audio is streamed into a memory buffer (buffer), typically 10ms long, and in order to reduce the amount of computation, it can be sent in a frame skipping manner (e.g. 1 frame every 3 frames), and then features are extracted.

Step 160: and outputting the first phoneme information of the first audio through the acoustic model and the third feature information.

The acoustic model is obtained by training the model training method in any one of the above embodiments.

And inputting the extracted Fbank characteristics into the acoustic model trained in the previous embodiment for reasoning to obtain the first phoneme information of the corresponding first audio.

Wherein the first phoneme information includes a phoneme probability matrix.

Step 170: and outputting the awakening instruction under the condition that the first phoneme information is matched with the preset phoneme information of the awakening audio.

The awakening instruction is used for awakening the terminal equipment and is applied to a voice awakening function.

This step corresponds to a viterbi decoding step.

In this step, the phoneme probability matrix obtained in step 160 is sent to a decoding graph of the wake-up audio, decoding is performed by using the viterbi algorithm, so that a score can be obtained, whether the score is greater than a certain threshold value or not is judged, if yes, the wake-up is performed, and if no, the next frame of data is continuously sent.

Wherein, the score here can be understood as: and the association degree of the first phoneme information and the preset phoneme information of the awakening audio is greater than a certain threshold value, and the first phoneme information is matched with the preset phoneme information of the awakening audio.

Illustratively, based on the first phoneme information, a phoneme label with the highest probability corresponding to each frame in the input audio stream can be obtained through decoding, the phoneme label is compared with a preset phoneme label corresponding to each frame in the wake-up audio, and if the similarity is greater than a certain set value, the degree of association between the first phoneme information and the preset phoneme information of the wake-up audio is greater than a certain threshold.

Thus, based on the foregoing embodiment, the model training in the whole function is implemented by combining the two audio features of the phoneme information and the semantic information to enhance the representation of the audio semantic feature information. Therefore, in the embodiment, in the wake-up stage, the trained acoustic model is used to perform inference calculation on the received first audio, so that more accurate phoneme information of the first audio can be obtained, and after the first audio is compared with the wake-up audio, terminal devices such as a mobile phone and the like can be accurately and timely woup.

In summary, the modeling process of voice wakeup generally trains an acoustic model to map the voice features and phonemes, and then decodes the voice features and phonemes by using an optimal path algorithm. However, due to the requirements of low power consumption and quick response, the resources of the acoustic model are limited, and the situation that the acoustic model is not accurately judged, so that the user cannot wake up or awaken by mistake is often easy to occur. Based on this, in this application, in the training phase of the voice awakening acoustic model, the generation countermeasure network based on the variational self-encoder is used, and the acoustic model is combined, so that the modeling quality of the acoustic model can be better improved, the phoneme reasoning is more accurate, the false awakening can be reduced, and the awakening rate is improved.

It should be noted that, in the model training method provided in the embodiment of the present application, the execution subject may be a model training apparatus, or a control module in the model training apparatus for executing the model training method. In the embodiment of the present application, a model training method executed by a model training device is taken as an example to describe the model training device provided in the embodiment of the present application.

FIG. 5 shows a block diagram of a model training apparatus according to another embodiment of the present application, the apparatus comprising:

the first obtaining module 10 is configured to obtain first feature information of audio training data, where the audio training data includes a wake-up audio and a non-wake-up audio;

a first output module 20, configured to output phoneme information and semantic information of the audio training data through the acoustic model to be trained, the generation confrontation network model, and the first feature information;

the second output module 30 is configured to output second feature information of the audio training data through the generated confrontation network model to be trained, the phoneme information and the semantic information;

and the training module 40 is used for training the acoustic model and the generation confrontation network model according to the first characteristic information and the second characteristic information.

Optionally, training module 40, comprising:

the first training unit is used for training the acoustic model and the generation confrontation network model until a first error rate between the first characteristic information and the second characteristic information meets a first preset condition.

Optionally, generating the antagonistic network model comprises an authentication module and a generation module; a training module 40 comprising:

the second training unit is used for training the generating module and the acoustic model until second characteristic information output by the generating module meets a second preset condition;

and the third training unit is used for training the identification module until a second error rate between the first characteristic information and the second characteristic information output by the generation module meets a third preset condition.

Optionally, the first output module 20 includes:

the first output unit is used for outputting phoneme information corresponding to target frames of various audios in the audio training data through the acoustic model to be trained and the first feature information;

and the second output unit is used for outputting semantic information corresponding to the target frames of the audios in the audio training data through the generated confrontation network model to be trained and the first characteristic information.

It should be noted that, in the voice wake-up method provided in the embodiment of the present application, the execution main body may be a voice wake-up device, or a control module in the voice wake-up device for executing the voice wake-up method. The voice wake-up apparatus provided in the embodiment of the present application is described with reference to an example in which the voice wake-up apparatus executes a voice wake-up method.

FIG. 6 shows a block diagram of a model training apparatus according to another embodiment of the present application, the apparatus comprising:

a second obtaining module 50, configured to obtain third feature information of the first audio;

a third output module 60, configured to output the first phoneme information of the first audio through the acoustic model and the third feature information;

a fourth output module 70, configured to output a wake-up instruction when the first phoneme information matches the preset phoneme information of the wake-up audio;

wherein, the acoustic model is obtained by training the model training method in any one of the foregoing embodiments.

The model training device/voice wake-up device in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.

The model training device/voice wake-up device in the embodiment of the present application may be a device having an action system. The action system may be an Android (Android) action system, an ios action system, or other possible action systems, and the embodiment of the present application is not particularly limited.

The model training device/voice wake-up device provided in the embodiment of the present application can implement each process implemented by the corresponding method embodiment, and is not described herein again to avoid repetition.

Optionally, as shown in fig. 7, an electronic device 100 is further provided in this embodiment of the present application, and includes a processor 101, a memory 102, and a program or an instruction stored in the memory 102 and executable on the processor 101, where the program or the instruction is executed by the processor 101 to implement each process of any one of the above embodiments of the model training method/voice wake-up method, and can achieve the same technical effect, and is not described herein again to avoid repetition.

It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.

Fig. 8 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.

The electronic device 1000 includes, but is not limited to: a radio frequency unit 1001, a network module 1002, an audio output unit 1003, an input unit 1004, a sensor 1005, a display unit 1006, a user input unit 1007, an interface unit 1008, a memory 1009, and a processor 1010.

Those skilled in the art will appreciate that the electronic device 1000 may further comprise a power source (e.g., a battery) for supplying power to various components, and the power source may be logically connected to the processor 1010 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in fig. 8 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.

In one scenario, the processor 1010 is configured to acquire first feature information of audio training data, where the audio training data includes a wake-up audio and a non-wake-up audio; generating a confrontation network model through an acoustic model to be trained and the first characteristic information, and outputting phoneme information and semantic information of the audio training data; outputting second characteristic information of the audio training data through the generated confrontation network model to be trained, the phoneme information and the semantic information; and training the acoustic model and the generation countermeasure network model according to the first characteristic information and the second characteristic information.

Optionally, the processor 1010 is further configured to train the acoustic model and the generated confrontation network model until a first error rate between the first feature information and the second feature information satisfies a first preset condition.

Optionally, the generating a confrontation network model comprises an authentication module and a generation module; the processor 1010 is further configured to train the generation module and the acoustic model until the second feature information output by the generation module meets a second preset condition; and training the identification module until a second error rate between the first characteristic information and the second characteristic information output by the generation module meets a third preset condition.

Optionally, the processor 1010 is further configured to output phoneme information corresponding to a target frame of each audio in the audio training data through the acoustic model to be trained and the first feature information; and outputting semantic information corresponding to the target frame of each audio in the audio training data through the generated confrontation network model to be trained and the first characteristic information.

In another scenario, the processor 1010 is configured to obtain third feature information of the first audio; outputting first phoneme information of the first audio through an acoustic model and the third feature information; outputting a wake-up instruction under the condition that the first phoneme information is matched with the preset phoneme information of the wake-up audio; wherein, the acoustic model is obtained by the scene training.

It should be understood that in the embodiment of the present application, the input Unit 1004 may include a Graphics Processing Unit (GPU) 10041 and a microphone 10042, and the Graphics Processing Unit 10041 processes still pictures or video-processed image data obtained by an image capturing device (such as a camera) in a video Processing capturing mode or an image capturing mode. The display unit 1006 may include a display panel 10061, and the display panel 10061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1007 includes a touch panel 10071 and other input devices 10072. The touch panel 10071 is also referred to as a touch screen. The touch panel 10071 may include two parts, a touch detection device and a touch controller. Other input devices 10072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and an action stick, which are not described in detail herein. The memory 1009 may be used to store software programs as well as various data, including but not limited to applications and action systems. The processor 1010 may integrate an application processor, which primarily handles motion systems, user pages, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 1010.

The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the above-mentioned embodiment of the model training method/voice wake-up method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.

The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.

The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the above embodiment of the model training method/voice wake-up method, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here.

It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.

While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method of model training, the method comprising:

acquiring first characteristic information of audio training data, wherein the audio training data comprises awakening audio and non-awakening audio;

generating a confrontation network model through an acoustic model to be trained and the first characteristic information, and outputting phoneme information and semantic information of the audio training data;

outputting second characteristic information of the audio training data through the generated confrontation network model to be trained, the phoneme information and the semantic information;

and training the acoustic model and the generation countermeasure network model according to the first characteristic information and the second characteristic information.

2. The method of claim 1, wherein the training the acoustic model and the generative confrontation network model based on the first feature information and the second feature information comprises:

training the acoustic model and the generated confrontation network model until a first error rate between the first feature information and the second feature information meets a first preset condition.

3. The method of claim 1, wherein generating the antagonistic network model comprises an authentication module and a generation module; the training the acoustic model and the generative confrontation network model according to the first feature information and the second feature information includes:

training the generating module and the acoustic model until the second characteristic information output by the generating module meets a second preset condition;

and training the identification module until a second error rate between the first characteristic information and the second characteristic information output by the generation module meets a third preset condition.

4. The method of claim 1, wherein outputting phoneme information and semantic information of the audio training data through the acoustic model to be trained, the generation of the confrontation network model, and the first feature information comprises:

outputting phoneme information corresponding to target frames of each audio in the audio training data through the acoustic model to be trained and the first feature information;

and outputting semantic information corresponding to the target frame of each audio in the audio training data through the generated confrontation network model to be trained and the first characteristic information.

5. A voice wake-up method, the method comprising:

acquiring third characteristic information of the first audio;

outputting first phoneme information of the first audio through an acoustic model and the third feature information;

outputting a wake-up instruction under the condition that the first phoneme information is matched with the preset phoneme information of the wake-up audio;

wherein the acoustic model is trained by the model training method of any one of claims 1-4.

6. A model training apparatus, the apparatus comprising:

the first acquisition module is used for acquiring first characteristic information of audio training data, wherein the audio training data comprises awakening audio and non-awakening audio;

the first output module is used for outputting phoneme information and semantic information of the audio training data through an acoustic model to be trained, a confrontation network model and the first characteristic information;

the second output module is used for outputting second characteristic information of the audio training data through the generated confrontation network model to be trained, the phoneme information and the semantic information;

and the training module is used for training the acoustic model and the generation confrontation network model according to the first characteristic information and the second characteristic information.

7. The apparatus of claim 6, wherein the training module comprises:

the first training unit is used for training the acoustic model and the generated confrontation network model until a first error rate between the first characteristic information and the second characteristic information meets a first preset condition.

8. The apparatus of claim 6, wherein the generating the antagonistic network model comprises an authentication module and a generation module; the training module comprises:

the second training unit is used for training the generating module and the acoustic model until the second characteristic information output by the generating module meets a second preset condition;

9. The apparatus of claim 6, wherein the first output module comprises:

and the second output unit is used for outputting semantic information corresponding to the target frame of each audio in the audio training data through the generated confrontation network model to be trained and the first characteristic information.

10. A voice wake-up apparatus, the apparatus comprising:

the second acquisition module is used for acquiring third characteristic information of the first audio;

a third output module, configured to output first phoneme information of the first audio through an acoustic model and the third feature information;

the fourth output module is used for outputting a wake-up instruction under the condition that the first phoneme information is matched with the preset phoneme information of the wake-up audio;

11. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the model training method of any one of claims 1-4 or the voice wake-up method of claim 5.

12. A readable storage medium, on which a program or instructions are stored, which program or instructions, when executed by a processor, carry out the steps of the model training method according to any one of claims 1 to 4 or the voice wake-up method according to claim 5.