CN110033758B - Voice wake-up implementation method based on small training set optimization decoding network - Google Patents


Info

Publication number
CN110033758B
CN110033758B
Authority
CN
China
Prior art keywords
awakening, word, network, frame, wake
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910334792.9A
Other languages
Chinese (zh)
Other versions
CN110033758A (en)
Inventor
赵升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Shuixiang Electronic Technology Co ltd
Original Assignee
Wuhan Shuixiang Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Shuixiang Electronic Technology Co ltd filed Critical Wuhan Shuixiang Electronic Technology Co ltd
Priority to CN201910334792.9A priority Critical patent/CN110033758B/en
Publication of CN110033758A publication Critical patent/CN110033758A/en
Application granted granted Critical
Publication of CN110033758B publication Critical patent/CN110033758B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/083: Recognition networks
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/183: Using context dependencies, e.g. language models
    • G10L 15/187: Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L 15/26: Speech-to-text systems
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units

Abstract

The invention discloses a voice wake-up implementation method based on a small training set and an optimized decoding network, comprising the following steps: S1, extract the intrinsic features of speech to obtain feature vectors that clearly distinguish wake words from non-wake words; S2, combine the feature vectors to obtain a feature-phoneme alignment file, select a time window according to the distribution of the wake-word phonemes, and classify the feature-to-phoneme mapping to obtain labeled acoustic data; S3, train a frame-by-frame posterior probability model on the labeled acoustic data; S4, use the resulting acoustic probability model to build a phoneme-level posterior-probability confidence calculation network; S5, reconfirm the wake word with a reconfirmation network. Through a simple model-training strategy, an optimized decoding network, and similar steps, the invention makes it easy to implement voice wake-up and related functions on common processors such as ARM and DSP.

Description

Voice wake-up implementation method based on small training set optimization decoding network
Technical Field
The invention relates to a voice wake-up implementation method based on a small training set and an optimized decoding network. Through optimization, the post-positioned decoding network reduces the offline false wake-up rate without increasing algorithm complexity.
Background
Speech is the most convenient means of human communication, and making machines understand speech and carry out related operations according to human instructions has long been a goal; speech recognition technology thus came into being. Speech recognition is currently an important means of human-computer interaction, and voice wake-up is an important entrance to it. Under normal conditions an intelligent voice device is in standby and does not respond to outside sound. Only after it is woken by a wake word does the system begin to process and analyze the input speech and give feedback, which greatly reduces the misrecognition rate of speech recognition.
Specifically, voice wake-up means that a system retrieves a preset wake-word instruction from a continuous speech stream in order to start speech recognition; it belongs to keyword detection in continuous speech. To achieve a good detection effect, current keyword detection models are trained on large-scale data and their algorithms are complex to implement, which is a major obstacle where data resources are scarce and for end-to-end deployment of wake-up.
To make the wake-word function easy to deploy on mobile terminals, a voice wake-up implementation method based on a small training sample set, with a simple strategy and fast operation, is urgently needed.
The existing solution combines a Gaussian mixture model (GMM) with a hidden Markov model (HMM). First, the original speech data is represented by more compact vectors, i.e. speech feature vectors; the spatial distribution of the feature vectors is then assumed to be Gaussian, so Gaussians with different means and variances over the feature space can be trained from the existing mass of data. This model supplies the observation probabilities required by the subsequent HMM, which maps features to the word or phoneme space through data training to form a decoding network. During recognition, speech enters the decoding network through feature extraction, and the decoder uses a dynamic-programming Viterbi beam search to find and confirm results in the decoding network.
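For orientation, the dynamic-programming Viterbi search mentioned above can be sketched in a few lines. This is a generic log-domain Viterbi over a toy two-state HMM, not code from the patent; the transition and emission values are illustrative assumptions.

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """Log-domain Viterbi: most likely state path for one utterance.

    log_trans: (S, S) log transition probabilities
    log_emit:  (T, S) frame-by-frame log emission scores
    log_init:  (S,)   log initial-state probabilities
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # (S, S): prev state -> next state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())
```

A full recognizer would run this search over the whole decoding network with beam pruning, which is exactly the cost the patent tries to avoid.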
This overall approach to voice wake-up is really a method for large-vocabulary speech recognition: a large training corpus is needed to achieve a good wake-up rate with few false wake-ups, and the subsequent dynamic search consumes enormous computation time when searching the decoding network globally, despite pruning. The trained model is large, the algorithm is complex to implement, and operation efficiency is low. Using such a model in mobile terminals such as home appliances, smart speakers, and vehicles demands high computing power and many hardware resources. Occupying more resources and computation time invisibly raises product cost, which limits the use of wake words on mobile terminals and prevents intelligent voice devices from being widely applied.
Therefore, the invention provides a decoder optimized on a small training set, which simplifies the algorithm and improves operation efficiency while achieving the same wake-up rate, so that voice wake-up can be conveniently ported.
Disclosure of Invention
The technical problem the invention aims to solve is that the existing overall approach to voice wake-up is a large-vocabulary speech recognition method that needs a large training corpus to achieve good wake-up with few false wake-ups. The invention provides a voice wake-up implementation method that is based on a small training set and optimizes the decoder, simplifying the algorithm and improving operation efficiency at the same wake-up rate, so that voice wake-up can be conveniently ported.
In order to solve the technical problems, the invention provides the following technical scheme:
a voice awakening implementation method based on a small training set optimization decoding network is characterized by comprising the following steps:
S1: extract the intrinsic features of speech
According to stationarity and correlation analysis of the wake-word corpus, a time window is designed to obtain framed feature signals; the time-window design involves the window length, the window shape, the amplitude at each point, and the weight between adjacent frame energies, thereby obtaining feature vectors that clearly distinguish wake words from non-wake words;
S2: combine the feature vectors to obtain the feature-phoneme alignment file
A time window is selected according to the distribution of the wake-word phonemes, and the feature-to-phoneme mapping is classified to obtain labeled acoustic data. The alignment between features and phonemes is obtained mainly with a context-dependent triphone model, and the per-phoneme lengths counted from the corpus are used to maximize the utilization of all wake-word phonemes within a fixed time window.
S3: train a frame-by-frame posterior probability model on the labeled acoustic data
The labeled acoustic data are fed into a neural network trained by forward and backward propagation with a cross-entropy loss function to train the acoustic model, yielding a frame-by-frame acoustic probability model of the wake word;
S4: obtain a phoneme-level posterior-probability confidence calculation network from the acoustic probability model
The confidence of the valid wake-word classes is calculated from the frame-by-frame posterior probabilities of the wake word and used to recognize it; for each frame of output, the N most probable class candidates are retained, finally forming the network search space for confirming the wake word. The wake-word confidence in this step is a dynamic acoustic confidence: the window length of the selected time-domain framing window serves as the confidence window length, the confidence window slides in time, per class, over the posterior probability matrix output by the neural network, and the probabilities of each valid class within the window are superposed according to weights. The weights are obtained from the phoneme entropy of each class counted in the wake-word corpus in step S2. The wake word is recognized against the dynamic acoustic confidence threshold obtained from testing. To guarantee the false wake-up index, a recognition that may be a false wake-up must enter the wake-word confirmation network, which confirms whether it is a wake word and thus ensures the reliability of the result.
S5: reconfirmation network for the wake word
The wake word is confirmed according to the maximum-entropy principle. First, the wake-word phonemes contained at each time point are retained, and time points with no wake-word phoneme are set to zero; if a state jump occurs in the middle, the phonemes at that time point are likewise set to zero. The reliability of the wake word is then confirmed from the information entropy of all valid phonemes.
In addition, in the data-preparation stage of the deep neural network, acoustic model training can also be realized by aligning features with syllables, and the wake-up function can then be completed through the steps above.
The invention has the following beneficial effects. Through a simple model-training strategy, an optimized decoding network, and similar measures, the invention makes it easy to implement voice wake-up and related functions on common processors such as ARM and DSP:
according to the method, the time domain time window is designed according to the stationarity and the relevance of the linguistic data of the awakening words, so that the difference between the feature vectors is effectively improved, a feature model does not need to be learned through a large amount of linguistic data, and the model volume is reduced;
Second, when classifying labels, the time-window length that maximizes effective phoneme utilization is counted and used as the length for feature classification to prepare the files required for training; a triphone model is selected for the feature-phoneme alignment file; and the dynamic acoustic confidence is computed and designed, which effectively improves the recognition network's adaptability to unknown speech and gives a good phoneme-recognition result even under noise. The information entropy of each wake-word phoneme, counted in advance, is used as the weight for calculating the dynamic acoustic confidence;
Third, the invention confirms wake words according to the maximum-entropy principle, providing a maximum-entropy matching algorithm with forbidden state hopping for judging the reliability of temporal modeling. This effectively improves the reliability of the recognition network, reduces false wake-ups, simplifies the algorithm strategy, and offers an effective way to port a front-end intelligent voice module with few hardware resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of the present invention;
fig. 2 is a flow chart of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below in conjunction with the accompanying drawings; it should be understood that they are described for illustration and explanation only, not to limit the invention.
Examples
A voice wake-up implementation method based on a small training set and an optimized decoding network comprises the following steps:
S1: extract the intrinsic features of speech
According to stationarity and correlation analysis of the wake-word corpus, a time window is designed to obtain framed feature signals; the time-window design involves the window length, the window shape, the amplitude at each point, and the weight between adjacent frame energies, thereby obtaining feature vectors that clearly distinguish wake words from non-wake words;
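The framing step can be illustrated as follows. This is a minimal sketch, not the patent's feature extractor: the 25 ms window, 10 ms shift, and Hamming shape are common defaults used here as assumptions, whereas the patent tunes the window length, shape, point amplitudes, and adjacent-frame energy weights to the wake word.

```python
import numpy as np

def frame_signal(x, sr, win_ms=25.0, hop_ms=10.0):
    """Split waveform x into overlapping frames and apply a Hamming window.

    sr is the sample rate in Hz. The window shapes the amplitude of each
    point inside the frame; the hop controls the overlap between frames.
    """
    win = int(sr * win_ms / 1000)            # samples per frame
    hop = int(sr * hop_ms / 1000)            # samples between frame starts
    n_frames = 1 + max(0, (len(x) - win) // hop)
    w = np.hamming(win)                      # per-point amplitude weighting
    frames = np.stack([x[i * hop : i * hop + win] * w
                       for i in range(n_frames)])
    return frames                            # shape (n_frames, win)
```

Each row of the returned matrix would then be turned into a feature vector (e.g. spectral features) before alignment and training.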
S2: combine the feature vectors to obtain the feature-phoneme alignment file
A time window is selected according to the distribution of the wake-word phonemes, and the feature-to-phoneme mapping is classified to obtain labeled acoustic data. The alignment between features and phonemes is obtained mainly with a context-dependent triphone model, and the per-phoneme lengths counted from the corpus are used to maximize the utilization of all wake-word phonemes within a fixed time window.
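The idea of choosing a fixed window that maximizes phoneme utilization can be sketched as below. The helper `best_window` and its inputs are hypothetical illustrations; the patent derives per-phoneme lengths from corpus statistics.

```python
def best_window(phoneme_ms, candidates):
    """Pick the candidate window length (in ms) that maximizes utilization
    of the wake word's phonemes.

    phoneme_ms: mean duration of each wake-word phoneme, from corpus counts.
    candidates: window lengths to consider; at least one must be long
    enough to hold every phoneme once (an assumption of this sketch).
    Utilization = phoneme time / window time, so among windows that fit
    all phonemes, the one wasting the least time wins.
    """
    total = sum(phoneme_ms)                       # time needed for all phonemes
    feasible = [w for w in candidates if w >= total]
    return min(feasible, key=lambda w: w - total)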
S3: train a frame-by-frame posterior probability model on the labeled acoustic data
The labeled acoustic data are fed into a neural network trained by forward and backward propagation with a cross-entropy loss function to train the acoustic model, yielding a frame-by-frame acoustic probability model of the wake word;
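As a stand-in for the neural acoustic model, a single-layer softmax classifier trained with the cross-entropy loss shows the frame-by-frame posterior idea. This is an illustrative NumPy sketch, not the patent's network; the function name and all hyperparameters are assumptions.

```python
import numpy as np

def train_frame_model(X, y, n_classes, lr=0.5, epochs=200, seed=0):
    """Train a softmax classifier on labeled acoustic frames with the
    cross-entropy loss; returns a function giving frame-by-frame
    posteriors P(class | frame)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.01, (X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                       # one-hot frame labels
    for _ in range(epochs):
        z = X @ W + b                              # forward pass
        z -= z.max(axis=1, keepdims=True)          # numerical stability
        p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
        g = (p - Y) / len(X)                       # cross-entropy gradient
        W -= lr * X.T @ g                          # backward pass / update
        b -= lr * g.sum(axis=0)
    def posteriors(F):
        z = F @ W + b
        z -= z.max(axis=1, keepdims=True)
        p = np.exp(z)
        return p / p.sum(axis=1, keepdims=True)    # (T, n_classes) matrix
    return posteriors
```

The returned posterior matrix is exactly the object the confidence window of step S4 slides over.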
S4: obtain a phoneme-level posterior-probability confidence calculation network from the acoustic probability model
The confidence of the valid wake-word classes is calculated from the frame-by-frame posterior probabilities of the wake word and used to recognize it; for each frame of output, the N most probable class candidates are retained, finally forming the network search space for confirming the wake word. The wake-word confidence in this step is a dynamic acoustic confidence: the window length of the selected time-domain framing window serves as the confidence window length, the confidence window slides in time, per class, over the posterior probability matrix output by the neural network, and the probabilities of each valid class within the window are superposed according to weights. The weights are obtained from the phoneme entropy of each class counted in the wake-word corpus in step S2. The wake word is recognized against the dynamic acoustic confidence threshold obtained from testing. To guarantee the false wake-up index, a recognition that may be a false wake-up must enter the wake-word confirmation network, which confirms whether it is a wake word and thus ensures the reliability of the result.
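The sliding-window dynamic acoustic confidence described above can be sketched as follows. The function is an illustrative reading of the step, with the per-class weight vector standing in for the phoneme-entropy weights counted in step S2.

```python
import numpy as np

def dynamic_confidence(post, weights, win):
    """Slide a window of length win over the (T, C) posterior matrix and,
    per class, superpose the in-window probabilities weighted by the class
    weight; return the best window confidence over the utterance.
    The wake word fires when this exceeds a threshold tuned by testing."""
    T, C = post.shape
    best = 0.0
    for s in range(T - win + 1):
        seg = post[s:s + win]                        # (win, C) slice
        score = float((seg.mean(axis=0) * weights).sum())
        best = max(best, score)
    return best
```

Comparing `best` against the tested threshold reproduces the recognize-then-confirm control flow of the step.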
S5: reconfirmation network for the wake word
The wake word is confirmed according to the maximum-entropy principle. First, the wake-word phonemes contained at each time point are retained, and time points with no wake-word phoneme are set to zero; if a state jump occurs in the middle, the phonemes at that time point are likewise set to zero. The reliability of the wake word is then confirmed from the information entropy of all valid phonemes.
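The reconfirmation step can be sketched as below: frames with no wake-word phoneme, or with a backward state jump, are dropped (set to zero), and the normalized information entropy of the surviving phonemes measures reliability. This is an illustrative interpretation, not the patent's exact algorithm; it assumes a wake word with at least two distinct phonemes.

```python
import math

def reconfirm(frame_phones, wake_phones):
    """Maximum-entropy reconfirmation sketch.

    frame_phones: best phoneme per frame; wake_phones: wake-word phoneme
    sequence in order. Frames whose phoneme is not in the wake word, or
    that jump backwards in the wake-word order (forbidden state hopping),
    are discarded. A genuine wake word spreads mass over all its phonemes,
    so high normalized entropy means a reliable detection.
    """
    order = {p: i for i, p in enumerate(wake_phones)}
    kept, last = [], -1
    for p in frame_phones:
        if p not in order:          # no wake-word phoneme: set to zero
            continue
        if order[p] < last:         # backward state jump: set to zero
            continue
        last = order[p]
        kept.append(p)
    if not kept:
        return 0.0
    n = len(kept)
    counts = {p: kept.count(p) for p in set(kept)}
    H = -sum(c / n * math.log2(c / n) for c in counts.values())
    return H / math.log2(len(wake_phones))  # 1.0 = all phonemes evenly present
```

A score near 1.0 confirms the wake word; a frame stream dominated by one phoneme or by noise scores near 0.0.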
In addition, in the data-preparation stage of the deep neural network, acoustic model training can also be realized by aligning features with syllables, and the wake-up function can then be completed through the steps above.
Through a simple model-training strategy, an optimized decoding network, and similar steps, the invention makes it easy to implement voice wake-up and related functions on common processors such as ARM and DSP.
Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the described technical solutions or substitute equivalents for some of their features without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (5)

1. A voice wake-up implementation method based on a small training set and an optimized decoding network, characterized by comprising the following steps:
S1: extract the intrinsic features of speech
According to stationarity and correlation analysis of the wake-word corpus, a time window is designed to obtain framed feature signals, thereby obtaining feature vectors that clearly distinguish wake words from non-wake words;
S2: combine the feature vectors to obtain the feature-phoneme alignment file
A time window is selected according to the distribution of the wake-word phonemes, and the feature-to-phoneme mapping is classified to obtain labeled acoustic data;
S3: train a frame-by-frame posterior probability model on the labeled acoustic data
The labeled acoustic data are fed into a neural network trained by forward and backward propagation with a cross-entropy loss function to train the acoustic model, yielding a frame-by-frame acoustic probability model of the wake word;
S4: obtain a phoneme-level posterior-probability confidence calculation network from the acoustic probability model
The confidence of the valid wake-word classes is calculated from the frame-by-frame posterior probabilities of the wake word and used to recognize it; for each frame of output, the N most probable class candidates are retained, finally forming the network search space for confirming the wake word;
the wake-word confidence is a dynamic acoustic confidence: the window length of the selected time-domain framing window serves as the confidence window length, the confidence window slides in time, per class, over the posterior probability matrix output by the neural network, and the probabilities of each valid class within the window are superposed according to weights;
S5: reconfirmation network for the wake word
The wake word is confirmed according to the maximum-entropy principle: first, the wake-word phonemes contained at each time point are retained, time points with no wake-word phoneme are set to zero, and if a state jump occurs in the middle the phonemes at that time point are likewise set to zero; the reliability of the wake word is then confirmed from the information entropy of all valid phonemes.
2. The method of claim 1, wherein the time-window design in S1 involves the window length, the window shape, the amplitude at each point, and the weight between adjacent frame energies.
3. The method of claim 1, wherein the alignment between features and phonemes in S2 is obtained mainly with a context-dependent triphone model, and the per-phoneme lengths counted from the corpus are used to maximize the utilization of all wake-word phonemes within a fixed time window.
4. The voice wake-up implementation method based on a small training set and an optimized decoding network as claimed in claim 1, wherein the weights in S4 are obtained from the phoneme entropy of each class in the wake-word corpus.
5. The voice wake-up implementation method based on a small training set and an optimized decoding network as claimed in claim 1 or 4, wherein in S5 the wake word is recognized according to the dynamic acoustic confidence threshold obtained from testing; if the result may be a false wake-up, the wake-word confirmation network must be entered to confirm whether it is a wake word, thereby ensuring the reliability of the result.
CN201910334792.9A 2019-04-24 2019-04-24 Voice wake-up implementation method based on small training set optimization decoding network Expired - Fee Related CN110033758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910334792.9A CN110033758B (en) 2019-04-24 2019-04-24 Voice wake-up implementation method based on small training set optimization decoding network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910334792.9A CN110033758B (en) 2019-04-24 2019-04-24 Voice wake-up implementation method based on small training set optimization decoding network

Publications (2)

Publication Number Publication Date
CN110033758A CN110033758A (en) 2019-07-19
CN110033758B true CN110033758B (en) 2021-09-24

Family

ID=67240130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910334792.9A Expired - Fee Related CN110033758B (en) 2019-04-24 2019-04-24 Voice wake-up implementation method based on small training set optimization decoding network

Country Status (1)

Country Link
CN (1) CN110033758B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110364143B (en) * 2019-08-14 2022-01-28 腾讯科技(深圳)有限公司 Voice awakening method and device and intelligent electronic equipment
CN110473536B (en) * 2019-08-20 2021-10-15 北京声智科技有限公司 Awakening method and device and intelligent device
CN110415699B (en) * 2019-08-30 2021-10-26 北京声智科技有限公司 Voice wake-up judgment method and device and electronic equipment
CN110610707B (en) * 2019-09-20 2022-04-22 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium
CN110838289B (en) * 2019-11-14 2023-08-11 腾讯科技(深圳)有限公司 Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN111210830B (en) * 2020-04-20 2020-08-11 深圳市友杰智新科技有限公司 Voice awakening method and device based on pinyin and computer equipment
CN112259108A (en) * 2020-09-27 2021-01-22 科大讯飞股份有限公司 Engine response time analysis method, electronic device and storage medium
CN112951211B (en) * 2021-04-22 2022-10-18 中国科学院声学研究所 Voice awakening method and device
CN113470646B (en) * 2021-06-30 2023-10-20 北京有竹居网络技术有限公司 Voice awakening method, device and equipment
CN113450771B (en) * 2021-07-15 2022-09-27 维沃移动通信有限公司 Awakening method, model training method and device
CN113763960B (en) * 2021-11-09 2022-04-26 深圳市友杰智新科技有限公司 Post-processing method and device for model output and computer equipment
CN114783438B (en) * 2022-06-17 2022-09-27 深圳市友杰智新科技有限公司 Adaptive decoding method, apparatus, computer device and storage medium
CN115862604B (en) * 2022-11-24 2024-02-20 镁佳(北京)科技有限公司 Voice awakening model training and voice awakening method and device and computer equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102376305A (en) * 2011-11-29 2012-03-14 安徽科大讯飞信息科技股份有限公司 Speech recognition method and system
US8229744B2 (en) * 2003-08-26 2012-07-24 Nuance Communications, Inc. Class detection scheme and time mediated averaging of class dependent models
CN102999161A (en) * 2012-11-13 2013-03-27 安徽科大讯飞信息科技股份有限公司 Implementation method and application of voice awakening module
US9672815B2 (en) * 2012-07-20 2017-06-06 Interactive Intelligence Group, Inc. Method and system for real-time keyword spotting for speech analytics
CN107123417A (en) * 2017-05-16 2017-09-01 上海交通大学 Optimization method and system are waken up based on the customized voice that distinctive is trained
US20170301347A1 (en) * 2016-04-13 2017-10-19 Malaspina Labs (Barbados), Inc. Phonotactic-Based Speech Recognition & Re-synthesis
CN107331384A (en) * 2017-06-12 2017-11-07 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8229744B2 (en) * 2003-08-26 2012-07-24 Nuance Communications, Inc. Class detection scheme and time mediated averaging of class dependent models
CN102376305A (en) * 2011-11-29 2012-03-14 安徽科大讯飞信息科技股份有限公司 Speech recognition method and system
US9672815B2 (en) * 2012-07-20 2017-06-06 Interactive Intelligence Group, Inc. Method and system for real-time keyword spotting for speech analytics
CN102999161A (en) * 2012-11-13 2013-03-27 安徽科大讯飞信息科技股份有限公司 Implementation method and application of voice awakening module
US20170301347A1 (en) * 2016-04-13 2017-10-19 Malaspina Labs (Barbados), Inc. Phonotactic-Based Speech Recognition & Re-synthesis
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN107123417A (en) * 2017-05-16 2017-09-01 上海交通大学 Optimization method and system are waken up based on the customized voice that distinctive is trained
CN107331384A (en) * 2017-06-12 2017-11-07 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Neural speech recognition: Continuous phoneme decoding using spatiotemporal representations of human cortical activity; David A Moses et al; J Neural Eng; 20160803; Vol. 13 (No. 5); full text *
Research on Speech Recognition Technology in Low-Resource Environments; 舒帆 (Shu Fan); China Master's Theses Full-text Database, Information Science and Technology; 20180615 (No. 06); full text *

Also Published As

Publication number Publication date
CN110033758A (en) 2019-07-19

Similar Documents

Publication Publication Date Title
CN110033758B (en) Voice wake-up implementation method based on small training set optimization decoding network
CN108305616B (en) Audio scene recognition method and device based on long-time and short-time feature extraction
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN106098059B (en) Customizable voice awakening method and system
CN110364171B (en) Voice recognition method, voice recognition system and storage medium
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
CN105529028A (en) Voice analytical method and apparatus
CN109036467B (en) TF-LSTM-based CFFD extraction method, voice emotion recognition method and system
CN103077708B (en) Method for improving rejection capability of speech recognition system
KR20140082157A (en) Apparatus for speech recognition using multiple acoustic model and method thereof
CN107403619A (en) A kind of sound control method and system applied to bicycle environment
CN110211595B (en) Speaker clustering system based on deep learning
CN106340297A (en) Speech recognition method and system based on cloud computing and confidence calculation
CN107871499A (en) Audio recognition method, system, computer equipment and computer-readable recording medium
CN111161726B (en) Intelligent voice interaction method, device, medium and system
KR101065188B1 (en) Apparatus and method for speaker adaptation by evolutional learning, and speech recognition system using thereof
CN112071308A (en) Awakening word training method based on speech synthesis data enhancement
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN108831447A (en) Audio recognition method, device and storage medium based on HMM and PNN
CN113609264B (en) Data query method and device for power system nodes
Kermanshahi et al. Transfer learning for end-to-end ASR to deal with low-resource problem in persian language
CN114187914A (en) Voice recognition method and system
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN111833852B (en) Acoustic model training method and device and computer readable storage medium
CN102237082B (en) Self-adaption method of speech recognition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210924