CN110600008A - Voice wake-up optimization method and system - Google Patents

Voice wake-up optimization method and system Download PDF

Info

Publication number
CN110600008A
CN110600008A
Authority
CN
China
Prior art keywords
acoustic model
awakening
voice
phoneme
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910899791.9A
Other languages
Chinese (zh)
Inventor
徐俊峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN201910899791.9A priority Critical patent/CN110600008A/en
Publication of CN110600008A publication Critical patent/CN110600008A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a voice wake-up optimization method. The method comprises the following steps: constructing a two-stage wake-up acoustic model, wherein the two-stage wake-up acoustic model comprises a phoneme acoustic model and a word-level acoustic model; performing feature extraction on the received voice audio, inputting the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extracting the output features of the phoneme acoustic model; determining a wake-up word confidence by using the output features of the phoneme acoustic model as the input of the word-level acoustic model of the two-stage wake-up acoustic model; and when the confidence exceeds a preset wake-up threshold, determining that the voice audio is a wake-up word and performing voice wake-up. An embodiment of the invention also provides a voice wake-up optimization system. Embodiments of the invention directly reduce the dependence of the final classification result on the accuracy of the phoneme modeling unit, so that the wake-up word can still be judged correctly even when phoneme classification is inaccurate.

Description

Voice wake-up optimization method and system
Technical Field
The invention relates to the field of intelligent voice interaction, and in particular to a voice wake-up optimization method and system.
Background
Voice wake-up typically uses deep neural networks to model the underlying acoustic units acoustically; the modeling unit is usually the phoneme.
In the above voice wake-up technology, the modeling unit is the phoneme: the phonemes are first predicted, classified and post-processed; the similarity between the processed sequence and the wake-up word sequence is then computed, and the device wakes up if the similarity exceeds a certain threshold, and otherwise does not wake up.
In the process of implementing the invention, the inventor found at least the following problem in the related art:
this technique relies heavily on how accurately the acoustic model classifies the speech signal on the modeling units. At a low signal-to-noise ratio, the acoustic model classifies phonemes poorly, which lowers the wake-up rate in low signal-to-noise-ratio scenarios.
Disclosure of Invention
Embodiments of the present invention aim to at least solve the problem in the prior art of a low wake-up rate in low signal-to-noise-ratio scenarios.
In a first aspect, an embodiment of the present invention provides a voice wake-up optimization method, comprising:
constructing a two-stage wake-up acoustic model, wherein the two-stage wake-up acoustic model comprises a phoneme acoustic model and a word-level acoustic model;
performing feature extraction on the received voice audio, inputting the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extracting the output features of the phoneme acoustic model;
determining a wake-up word confidence by using the output features of the phoneme acoustic model as the input of the word-level acoustic model of the two-stage wake-up acoustic model;
and when the confidence exceeds a preset wake-up threshold, determining that the voice audio is a wake-up word and performing voice wake-up.
In a second aspect, an embodiment of the present invention provides a voice wake-up optimization system, comprising:
a model construction program module, configured to construct a two-stage wake-up acoustic model, wherein the two-stage wake-up acoustic model comprises a phoneme acoustic model and a word-level acoustic model;
a feature extraction program module, configured to perform feature extraction on the received voice audio, input the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extract the output features of the phoneme acoustic model;
a confidence determination program module, configured to determine a wake-up word confidence by using the output features of the phoneme acoustic model as the input of the word-level acoustic model of the two-stage wake-up acoustic model;
and a wake-up program module, configured to determine the voice audio as a wake-up word and perform voice wake-up when the confidence exceeds a preset wake-up threshold.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the voice wake-up optimization method of any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the voice wake-up optimization method of any embodiment of the present invention.
The embodiments of the invention have the following beneficial effect: on top of one acoustic model, the deep acoustic features extracted from a speech signal of a certain length are input into another classification model for direct classification. This directly reduces the dependence of the final classification result on the accuracy of the phoneme modeling unit, and the wake-up word can still be recognized correctly even when phoneme classification is inaccurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a voice wake-up optimization method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a voice wake-up optimization system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a voice wake-up optimization method according to an embodiment of the present invention, which comprises the following steps:
S11: constructing a two-stage wake-up acoustic model, wherein the two-stage wake-up acoustic model comprises a phoneme acoustic model and a word-level acoustic model;
S12: performing feature extraction on the received voice audio, inputting the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extracting the output features of the phoneme acoustic model;
S13: determining a wake-up word confidence by using the output features of the phoneme acoustic model as the input of the word-level acoustic model of the two-stage wake-up acoustic model;
S14: when the confidence exceeds a preset wake-up threshold, determining that the voice audio is a wake-up word and performing voice wake-up.
In the present embodiment, modeling is not done with a single acoustic model, nor are the output results of two ordinary acoustic models compared against each other, because at a low signal-to-noise ratio simply selecting several acoustic models does not noticeably improve phoneme classification accuracy.
For step S11, rather than a single modeled phoneme acoustic model, a two-stage wake-up acoustic model is constructed on that basis; it comprises a phoneme acoustic model and a word-level acoustic model. The task of an acoustic model is to compute P(O|W), i.e. the probability that the model generates the observed speech waveform. The acoustic model is an important component of a speech recognition system; it accounts for most of the computational overhead of speech recognition and determines the performance of the system. Conventional speech recognition systems commonly employ acoustic models based on GMM-HMM, where the GMM models the distribution of the acoustic features of speech and the HMM models the temporal structure of the speech signal. After the rise of deep learning in 2006, deep neural networks (DNNs) were applied to speech acoustic models. The phoneme acoustic model gives the probability of each phoneme given the speech waveform, and the word-level acoustic model gives the probability of each word given the speech waveform.
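The per-frame probabilities mentioned above can be illustrated with a minimal sketch: a DNN acoustic model typically ends in a softmax layer that turns each frame's raw scores (logits) into posterior probabilities over the unit set (phonemes for the first stage, words for the second). The logits below are illustrative values, not taken from the patent.

```python
import math

def softmax(logits):
    """Convert raw per-frame scores into a probability distribution
    over the modeling units (e.g. phonemes)."""
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One frame's logits over a tiny illustrative 3-phoneme set.
probs = softmax([2.0, 1.0, 0.1])
print(probs)  # probabilities summing to 1, most mass on the first phoneme
```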
For step S12, to support real-time voice wake-up, the smart device collects the voice audio in the environment in real time, performs feature extraction on the collected voice audio, inputs the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extracts the output features of the phoneme acoustic model, such as the phoneme sequence of the voice audio.
For step S13, the output features of the phoneme acoustic model (for example, the sequence output in step S12) are used as the input of the word-level acoustic model of the two-stage wake-up acoustic model. Classifying with a second acoustic model yields an explicit classification, so the confidence that the user audio is a wake-up word is determined more accurately.
For step S14, when the confidence exceeds the preset wake-up threshold, the voice audio is determined to be a wake-up word and voice wake-up is performed.
It can be seen from this embodiment that, on top of one acoustic model, the deep acoustic features extracted from a speech signal of a certain length are input into another classification model for direct classification. This directly reduces the dependence of the final classification result on the accuracy of the phoneme modeling unit, and the wake-up word can still be recognized correctly even when phoneme classification is inaccurate.
As one implementation, in this embodiment one of the models in the two-stage wake-up acoustic model is a phoneme acoustic model and the other is a word-level acoustic model.
Repeated experiments show that when wake-up word recognition is performed with the phoneme acoustic model alone, wake-up performance is poor at a low signal-to-noise ratio, and recognition performance depends heavily on how accurately the phoneme acoustic model classifies phonemes. By attaching a word-level acoustic model on top of the phoneme acoustic model to classify the wake-up word directly, the wake-up word recognition result can be improved through direct classification even when phoneme classification is inaccurate, overcoming the shortcoming of a single phoneme acoustic model.
As one implementation, in this embodiment, after the output features of the phoneme acoustic model are extracted, the method further comprises:
sending the output features of each frame to a feature accumulator;
when the number of accumulated voice audio frames in the feature accumulator reaches a preset threshold, splicing the output features in the feature accumulator into a one-dimensional feature;
and inputting the one-dimensional feature into the word-level acoustic model to complete the coupling of the two models.
In the present embodiment, after the output features of the phoneme acoustic model are extracted, the output features of each frame are sent to a feature accumulator. When a preset number of frames has accumulated, the features are spliced into one complete one-dimensional feature, and this one-dimensional feature is input into the word-level acoustic model, so that the two acoustic models are coupled and both models can be used.
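The accumulate-and-splice coupling described above can be sketched as follows; the class name, frame threshold and feature shapes are illustrative assumptions, not taken from the patent.

```python
class FeatureAccumulator:
    """Sketch of the frame accumulator: collect per-frame output features
    from the phoneme acoustic model and, once a preset frame count is
    reached, splice them into a single one-dimensional feature."""

    def __init__(self, frame_threshold):
        self.frame_threshold = frame_threshold
        self.frames = []

    def push(self, frame_features):
        """Add one frame; return the spliced 1-D feature when the
        threshold is reached, otherwise None."""
        self.frames.append(list(frame_features))
        if len(self.frames) < self.frame_threshold:
            return None
        # Concatenate all buffered frames into one flat feature vector,
        # then reset the buffer for the next window.
        spliced = [x for frame in self.frames for x in frame]
        self.frames.clear()
        return spliced

acc = FeatureAccumulator(frame_threshold=3)
print(acc.push([1, 2]))   # None: not enough frames yet
print(acc.push([3, 4]))   # None
print(acc.push([5, 6]))   # [1, 2, 3, 4, 5, 6]
```

The spliced vector is what would be fed to the word-level acoustic model, which is how the two stages are coupled without the second stage ever touching raw audio.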
As one implementation, in this embodiment, before feature extraction is performed on the received voice audio, the method further comprises:
receiving an audio signal in real time via an acoustic sensor, and determining through a voice endpoint detection model whether the audio signal is voice audio;
and when the audio signal is voice audio, performing acoustic feature extraction on the received speech.
Since voice wake-up requires real-time monitoring of received audio, running wake-up detection on every received signal is very resource-consuming. Therefore, before feature extraction is performed on the received voice audio, the audio signal is received in real time by an acoustic sensor in the smart device and it is determined whether the signal is the user's voice audio; detection is performed only after the user speaks. This avoids running voice wake-up detection whenever any signal is received and improves the efficiency of voice wake-up detection.
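As a rough illustration of this gating, a crude energy-based stand-in for the endpoint-detection model might look like the following. The patent does not specify the VAD internals, so both functions and the threshold value are assumptions.

```python
def is_speech(samples, energy_threshold=0.01):
    """Crude energy-based stand-in for a voice endpoint detection model:
    treat the window as speech when its mean energy clears a threshold."""
    energy = sum(s * s for s in samples) / len(samples)
    return energy > energy_threshold

def gated_feature_extraction(samples, extract):
    """Run feature extraction only when the signal looks like speech,
    so the wake-up models are not invoked on silence or noise floor."""
    if not is_speech(samples):
        return None          # skip: no speech detected
    return extract(samples)  # proceed to acoustic feature extraction

print(gated_feature_extraction([0.0] * 100, lambda s: len(s)))  # None
print(gated_feature_extraction([0.5] * 100, lambda s: len(s)))  # 100
```

A production system would use a trained endpoint-detection model rather than raw energy, but the control flow (detect speech first, extract features second) is the same.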
Fig. 2 is a schematic structural diagram of a voice wake-up optimization system according to an embodiment of the present invention; the system can execute the voice wake-up optimization method of any of the above embodiments and is configured in a terminal.
The voice wake-up optimization system provided by this embodiment comprises: a model construction program module 11, a feature extraction program module 12, a confidence determination program module 13 and a wake-up program module 14.
The model construction program module 11 is configured to construct a two-stage wake-up acoustic model, wherein the two-stage wake-up acoustic model comprises a phoneme acoustic model and a word-level acoustic model; the feature extraction program module 12 is configured to perform feature extraction on the received voice audio, input the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extract the output features of the phoneme acoustic model; the confidence determination program module 13 is configured to determine a wake-up word confidence by using the output features of the phoneme acoustic model as the input of the word-level acoustic model of the two-stage wake-up acoustic model; and the wake-up program module 14 is configured to determine the voice audio as a wake-up word and perform voice wake-up when the confidence exceeds the preset wake-up threshold.
Further, one of the acoustic models is a phoneme acoustic model, and the other acoustic model is a word-level acoustic model.
Further, after the feature extraction program module, the system further comprises a feature accumulation program module, configured to:
send the output features of each frame to a feature accumulator;
when the number of accumulated voice audio frames in the feature accumulator reaches a preset threshold, splice the output features in the feature accumulator into a one-dimensional feature;
and input the one-dimensional feature into the word-level acoustic model to complete the coupling of the two models.
Further, the feature extraction program module is further configured to:
receive an audio signal in real time via an acoustic sensor, and determine through a voice endpoint detection model whether the audio signal is voice audio;
and when the audio signal is voice audio, perform acoustic feature extraction on the received speech.
An embodiment of the present invention also provides a non-volatile computer storage medium storing computer-executable instructions that can execute the voice wake-up optimization method of any of the above method embodiments.
As one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
construct a two-stage wake-up acoustic model, wherein the two-stage wake-up acoustic model comprises a phoneme acoustic model and a word-level acoustic model;
perform feature extraction on the received voice audio, input the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extract the output features of the phoneme acoustic model;
determine a wake-up word confidence by using the output features of the phoneme acoustic model as the input of the word-level acoustic model of the two-stage wake-up acoustic model;
and when the confidence exceeds a preset wake-up threshold, determine that the voice audio is a wake-up word and perform voice wake-up.
The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the voice wake-up optimization method of any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: the device comprises at least one processor and a memory which is connected with the at least one processor in a communication mode, wherein the memory stores instructions which can be executed by the at least one processor, and the instructions are executed by the at least one processor so as to enable the at least one processor to execute the steps of the voice wake-up optimization method of any embodiment of the invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capability and are primarily aimed at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, for example tablet computers.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, smart toys and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A voice wake-up optimization method, comprising:
constructing a two-stage wake-up acoustic model, wherein the two-stage wake-up acoustic model comprises a phoneme acoustic model and a word-level acoustic model;
performing feature extraction on the received voice audio, inputting the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extracting the output features of the phoneme acoustic model;
determining a wake-up word confidence by using the output features of the phoneme acoustic model as the input of the word-level acoustic model of the two-stage wake-up acoustic model;
and when the confidence exceeds a preset wake-up threshold, determining that the voice audio is a wake-up word and performing voice wake-up.
2. The method of claim 1, wherein after said extracting the output features of the phoneme acoustic model, the method further comprises:
sending the output features of each frame to a feature accumulator;
when the number of accumulated voice audio frames in the feature accumulator reaches a preset threshold, splicing the output features in the feature accumulator into a one-dimensional feature;
and inputting the one-dimensional feature into the word-level acoustic model to complete the coupling of the two models.
3. The method of claim 1, wherein before said feature extraction on the received voice audio, the method further comprises:
receiving an audio signal in real time via an acoustic sensor, and determining through a voice endpoint detection model whether the audio signal is voice audio;
and when the audio signal is voice audio, performing acoustic feature extraction on the received speech.
4. A voice wake-up optimization system, comprising:
a model construction program module, configured to construct a two-stage wake-up acoustic model, wherein the two-stage wake-up acoustic model comprises a phoneme acoustic model and a word-level acoustic model;
a feature extraction program module, configured to perform feature extraction on the received voice audio, input the extracted acoustic features into the phoneme acoustic model of the two-stage wake-up acoustic model, and extract the output features of the phoneme acoustic model;
a confidence determination program module, configured to determine a wake-up word confidence by using the output features of the phoneme acoustic model as the input of the word-level acoustic model of the two-stage wake-up acoustic model;
and a wake-up program module, configured to determine the voice audio as a wake-up word and perform voice wake-up when the confidence exceeds a preset wake-up threshold.
5. The system of claim 4, wherein after the feature extraction program module, the system further comprises a feature accumulation program module, configured to:
send the output features of each frame to a feature accumulator;
when the number of accumulated voice audio frames in the feature accumulator reaches a preset threshold, splice the output features in the feature accumulator into a one-dimensional feature;
and input the one-dimensional feature into the word-level acoustic model to complete the coupling of the two models.
6. The system of claim 4, wherein the feature extraction program module is further configured to:
receive an audio signal in real time via an acoustic sensor, and determine through a voice endpoint detection model whether the audio signal is voice audio;
and when the audio signal is voice audio, perform acoustic feature extraction on the received speech.
7. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-3.
8. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 3.
CN201910899791.9A 2019-09-23 2019-09-23 Voice wake-up optimization method and system Pending CN110600008A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910899791.9A CN110600008A (en) 2019-09-23 2019-09-23 Voice wake-up optimization method and system


Publications (1)

Publication Number Publication Date
CN110600008A true CN110600008A (en) 2019-12-20

Family

ID=68862451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910899791.9A Pending CN110600008A (en) 2019-09-23 2019-09-23 Voice wake-up optimization method and system

Country Status (1)

Country Link
CN (1) CN110600008A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999161A (en) * 2012-11-13 2013-03-27 安徽科大讯飞信息科技股份有限公司 Implementation method and application of voice awakening module
CN103632667A (en) * 2013-11-25 2014-03-12 华为技术有限公司 Acoustic model optimization method and device, voice awakening method and device, as well as terminal
CN107123417A (en) * 2017-05-16 2017-09-01 上海交通大学 Optimization method and system are waken up based on the customized voice that distinctive is trained
CN107134279A (en) * 2017-06-30 2017-09-05 百度在线网络技术(北京)有限公司 A kind of voice awakening method, device, terminal and storage medium
CN108198548A (en) * 2018-01-25 2018-06-22 苏州奇梦者网络科技有限公司 A kind of voice awakening method and its system


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161714A (en) * 2019-12-25 2020-05-15 联想(北京)有限公司 Voice information processing method, electronic equipment and storage medium
CN111429901A (en) * 2020-03-16 2020-07-17 云知声智能科技股份有限公司 IoT chip-oriented multi-stage voice intelligent awakening method and system
CN113129873A (en) * 2021-04-27 2021-07-16 思必驰科技股份有限公司 Optimization method and system for stack type one-dimensional convolution network awakening acoustic model
CN113241059A (en) * 2021-04-27 2021-08-10 标贝(北京)科技有限公司 Voice wake-up method, device, equipment and storage medium
CN113450771A (en) * 2021-07-15 2021-09-28 维沃移动通信有限公司 Awakening method, model training method and device
CN113590207A (en) * 2021-07-30 2021-11-02 思必驰科技股份有限公司 Method and device for improving awakening effect
CN113707132A (en) * 2021-09-08 2021-11-26 北京声智科技有限公司 Awakening method and electronic equipment
CN113707132B (en) * 2021-09-08 2024-03-01 北京声智科技有限公司 Awakening method and electronic equipment
CN115862604A (en) * 2022-11-24 2023-03-28 镁佳(北京)科技有限公司 Voice wakeup model training and voice wakeup method, device and computer equipment
CN115862604B (en) * 2022-11-24 2024-02-20 镁佳(北京)科技有限公司 Voice awakening model training and voice awakening method and device and computer equipment

Similar Documents

Publication Publication Date Title
CN110600008A (en) Voice wake-up optimization method and system
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
US9966077B2 (en) Speech recognition device and method
CN108694940B (en) Voice recognition method and device and electronic equipment
CN110610707B (en) Voice keyword recognition method and device, electronic equipment and storage medium
EP2700071B1 (en) Speech recognition using multiple language models
US9799325B1 (en) Methods and systems for identifying keywords in speech signal
CN109036471B (en) Voice endpoint detection method and device
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN110503944B (en) Method and device for training and using voice awakening model
CN111462756B (en) Voiceprint recognition method and device, electronic equipment and storage medium
US9595261B2 (en) Pattern recognition device, pattern recognition method, and computer program product
CN111583912A (en) Voice endpoint detection method and device and electronic equipment
CN111081280A (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN111179915A (en) Age identification method and device based on voice
CN111832308A (en) Method and device for processing consistency of voice recognition text
CN112002349B (en) Voice endpoint detection method and device
CN111816216A (en) Voice activity detection method and device
CN111243604B (en) Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN113838462A (en) Voice wake-up method and device, electronic equipment and computer readable storage medium
CN112951219A (en) Noise rejection method and device
CN110706691B (en) Voice verification method and device, electronic equipment and computer readable storage medium
CN112418173A (en) Abnormal sound identification method and device and electronic equipment
CN115762500A (en) Voice processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20191220