CN112562646A - Robot voice recognition method - Google Patents

Robot voice recognition method

Info

Publication number
CN112562646A
Authority
CN
China
Prior art keywords
probability
matching
voice
characteristic information
signal
Legal status
Pending
Application number
CN202011447106.8A
Other languages
Chinese (zh)
Inventor
马国军 (Ma Guojun)
张奕凡 (Zhang Yifan)
申佳玮 (Shen Jiawei)
Current Assignee
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date
2020-12-09
Filing date
2020-12-09
Publication date
2021-03-26
Application filed by Jiangsu University of Science and Technology
Priority to CN202011447106.8A
Publication of CN112562646A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142: Hidden Markov Models [HMMs]
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/26: Speech to text systems
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination

Abstract

The invention discloses a robot voice recognition method comprising the following steps: acquiring a voice signal; extracting static characteristic information and dynamic characteristic information from the voice signal; and performing voice matching on the static and dynamic characteristic information. When the voice signal is matched for the first time, the maximum probability is taken from the matching output probabilities of all corresponding paths and compared with a confidence probability: when the maximum probability is greater than the confidence probability, the output content of the corresponding path with the maximum probability is output; when the maximum probability is smaller than the confidence probability, the acquired voice signal is stripped of its tone and matched a second time. When the voice signal is matched for the second time, the output content of the corresponding path with the highest matching output probability is output. By introducing fuzzy recognition, the invention can still recognize part of the effective information of a signal whose pronunciation is merely similar, improving recognition accuracy.

Description

Robot voice recognition method
Technical Field
The invention relates to the field of robots, in particular to a robot voice recognition method.
Background
With the development of society, voice recognition technology has matured and is now widely applied in daily life. People are accustomed to completing all kinds of tasks through human-computer interaction, which enriches daily experience and brings great convenience; voice recognition can be said to be ubiquitous. Under most circumstances voice recognition achieves high accuracy and satisfies the demand for human-computer interaction, but in different environments its accuracy is affected to different degrees, and traditional voice recognition technology cannot handle all special cases. The data therefore need to be optimized during the recognition process.
Existing voice recognition technology mainly preprocesses the input signal, extracts features, matches them against an existing acoustic model, and outputs the final recognition result. In a complex environment, existing voice recognition is strongly affected and may even become unusable; for example, in a noisy public environment the recognition process is disturbed by noise. This is a major problem to be solved at present.
Disclosure of Invention
The invention aims to provide a robot voice recognition method that solves the technical problem in the prior art of voice recognition accuracy being degraded by external noise.
To achieve this purpose, the invention adopts the following technical scheme:
a robot voice recognition method includes:
step 1: acquiring a voice signal;
step 2: preprocessing a voice signal and extracting static characteristic information;
Step 3: acquiring dynamic characteristic information from the static characteristic information through a difference algorithm;
Step 4: using a hidden Markov model (HMM) to perform voice matching on the static characteristic information and the dynamic characteristic information, and using the Viterbi algorithm to obtain the matching output probabilities of all corresponding paths from the HMM,
when the voice signal is matched for the first time:
obtaining the maximum probability from the matching output probabilities of all corresponding paths and comparing it with the confidence probability,
when the maximum probability is greater than the confidence probability, outputting the output content of the corresponding path with the maximum probability, this content being the recognized content;
when the maximum probability is smaller than the confidence probability, removing the tone from the voice signal obtained in step 1, and repeating steps 2-4 on the tone-removed voice signal for a second matching;
when the voice signal is matched for the second time:
outputting the output content of the corresponding path with the highest matching output probability among all corresponding paths, this content being the recognized content.
Further, the confidence probability in step 4 is calculated as follows:
$$P_c = \frac{1}{M} \sum_{i=1}^{M} P_i$$
wherein $P_c$ is the confidence probability; $M$ is the total number of successful first matchings; $P_i$ is the maximum probability value of the $i$-th successful first matching.
Further, preprocessing the voice signal in step 2 includes performing pre-emphasis, framing, and windowing in sequence.
Further, in step 3, FFT, Mel filtering, and DCT processing are performed on the static characteristic information in sequence, and difference calculation is then performed on the DCT result to obtain the dynamic characteristic information.
Further, in step 4, an artificial neural network (ANN) is combined with the hidden Markov model (HMM) to perform voice matching on the static characteristic information and the dynamic characteristic information.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention can recognize part of the effective information of a signal even when the pronunciation of the voice signal is merely similar, improving recognition accuracy to a certain extent.
2. By combining the ANN with the HMM, the invention reduces the interference of noise on voice signals to a certain extent and improves the anti-interference capability of the whole system.
3. After the fuzzy recognition technology is introduced, the success rate of voice recognition in complex environments can be improved to a certain extent.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the invention in any way, and in which:
FIG. 1 is a flow chart of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a specific embodiment of the present invention provides a robot speech recognition method, including:
Step S1: acquiring a voice signal;
Step S2: performing pre-emphasis, framing and windowing on the voice signal in sequence, and extracting static characteristic information from the processed signal;
the pre-emphasis part is used for leading the signal to pass through a high-pass filter, the pre-emphasis aims to promote the high-frequency part, so that the frequency spectrum of the signal becomes flat, the signal is kept in the whole frequency band from low frequency to high frequency, and the frequency spectrum can be obtained by the same signal-to-noise ratio. Meanwhile, the method is also used for eliminating the vocal cords and lip effects in the generation process, compensating the high-frequency part of the voice signal which is restrained by the pronunciation system, and highlighting the formants of the high frequency.
Because the subsequent fast Fourier transform requires a stationary signal, the signal must be divided into frames so that each local section of the voice signal can be regarded as stationary. Framing, however, introduces discontinuities at the beginning and end of each frame, and the more frames there are, the larger the error relative to the original signal. Windowing solves this problem: it makes the framed signal continuous, so that each frame exhibits the characteristics of a periodic function. In voice signal processing, a Hamming window is usually applied.
Each frame is multiplied by a Hamming window to increase the continuity of its left and right ends. Assuming the framed signal is $S(n)$, $n = 0, 1, \ldots, N-1$, where $N$ is the frame size, the signal after multiplication by the Hamming window is $S'(n) = S(n) \times W(n)$, with $W(n)$ defined as:

$$W(n) = \begin{cases} (1-a) - a\cos\dfrac{2\pi n}{N-1}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
Different values of $a$ produce different Hamming windows; typically $a = 0.46$.
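As an illustration of this preprocessing chain, a minimal NumPy sketch follows; the pre-emphasis coefficient 0.97 and the frame length and shift of 400 and 160 samples (25 ms and 10 ms at 16 kHz) are assumed typical values, not parameters specified by the patent:

import numpy as np

def preprocess(signal, frame_len=400, frame_shift=160, alpha=0.97, a=0.46):
    """Pre-emphasize, frame, and window a speech signal (assumed parameter values).
    Assumes len(signal) >= frame_len."""
    # Pre-emphasis: high-pass filter y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split into overlapping frames of frame_len samples
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(num_frames)])
    # Hamming window: W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)), with a = 0.46
    n = np.arange(frame_len)
    window = (1 - a) - a * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * window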
Step S3: performing FFT, Mel filtering and DCT processing on the static characteristic information in sequence, and performing differential calculation on the output result of the DCT to obtain dynamic characteristic information;
since the signal is usually difficult to see by the transformation in the time domain, it is usually observed by transforming it into an energy distribution in the frequency domain, and different energy distributions can represent the characteristics of different voices. After multiplication by the hamming window, each frame must also undergo a fast fourier transform to obtain the energy distribution over the spectrum. And carrying out fast Fourier transform on each frame signal subjected to framing and windowing to obtain the frequency spectrum of each frame. And the power spectrum of the voice signal is obtained by taking the modulus square of the frequency spectrum of the voice signal.
The Mel filter bank mainly reduces the amplitude resolution of the frequency domain and removes redundancy in the spectrum. The amplitude spectrum obtained by the FFT is multiplied by each filter and the products are accumulated; the resulting value is the energy of the frame in the frequency band covered by that filter. If there are 22 filters, 22 energy values are obtained. The logarithm of each energy value is then taken to facilitate the subsequent cepstrum analysis.
The frequency response of the $m$-th Mel triangular filter is defined as:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$$

where $f(m)$ denotes the FFT bin of the center frequency of the $m$-th triangular filter; the center frequencies are equally spaced on the Mel scale.
the triangular band pass filter has two main purposes:
the method smoothes the frequency spectrum, eliminates the effect of harmonic wave, and highlights the formant of the original voice, so that the tone or pitch of a section of voice is not presented in the MFCC parameters, in other words, the voice recognition system using MFCC as the characteristic is not influenced by the tone difference of the input voice, and in addition, the operation amount can be reduced.
The DCT is widely used in signal and image processing for lossy compression because of its strong energy-compaction property: after a discrete cosine transform, the energy of most natural signals, including sound and images, is concentrated in the low-frequency part. In effect, the DCT performs a dimensionality reduction on each frame of data.
$$s(m) = \ln\left(\sum_{k=0}^{N-1} |X(k)|^2 \, H_m(k)\right), \quad 1 \le m \le M$$

$$C(n) = \sum_{m=1}^{M} s(m)\cos\left(\frac{\pi n\,(m - 0.5)}{M}\right), \quad n = 1, 2, \ldots, L$$
Here $X(k)$ is the spectrum of the frame obtained by the FFT. Substituting the log energies $s(m)$ into the discrete cosine transform yields the Mel-scale cepstrum parameters of order $L$, where $L$ is the order of the MFCC coefficients, usually 12-16, and $M$ is the number of triangular filters.
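A sketch of this step, keeping L = 13 coefficients (an assumed value within the 12-16 range mentioned above):

import numpy as np

def mfcc_from_log_energies(log_energies, L=13):
    """DCT of the log filterbank energies; keep the first L coefficients.
    log_energies has shape (num_frames, M), M = number of triangular filters."""
    M = log_energies.shape[1]
    n = np.arange(1, L + 1)[:, None]              # coefficient index n = 1..L
    m = np.arange(M)[None, :]                     # filter index, 0-based
    # C(n) = sum_m s(m) * cos(pi * n * (m + 0.5) / M)
    basis = np.cos(np.pi * n * (m + 0.5) / M)
    return log_energies @ basis.T                 # shape (num_frames, L)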
The standard cepstral parameters (MFCCs) reflect only the static characteristics of the voice; its dynamic characteristics can be described by the differential spectrum of these static features. Experiments show that combining dynamic and static features effectively improves the recognition performance of the system. The difference parameters can be computed with the following formula:
$$d_t = \begin{cases} C_{t+1} - C_t, & t < K \\[4pt] \dfrac{\sum_{k=1}^{K} k\,(C_{t+k} - C_{t-k})}{\sqrt{2\sum_{k=1}^{K} k^2}}, & K \le t \le Q - K \\[4pt] C_t - C_{t-1}, & t > Q - K \end{cases}$$
where $d_t$ denotes the $t$-th first-order difference, $C_t$ the $t$-th cepstral coefficient, $Q$ the order of the cepstral coefficients, and $K$ the time span of the first derivative, which may be 1 or 2. Substituting the result of this expression back into the same formula yields the second-order difference parameters. This completes the extraction of the dynamic feature information of the voice signal.
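A sketch of the difference computation; edge frames are handled here by replicating the first and last frames rather than by the piecewise boundary cases of the formula, a common simplification:

import numpy as np

def delta(coeffs, K=2):
    """First-order differences of cepstral coefficients along the time axis.
    coeffs has shape (num_frames, L)."""
    T = len(coeffs)
    padded = np.pad(coeffs, ((K, K), (0, 0)), mode="edge")   # replicate ends
    denom = np.sqrt(2 * sum(k * k for k in range(1, K + 1)))
    d = np.zeros_like(coeffs)
    for k in range(1, K + 1):
        d += k * (padded[K + k : K + k + T] - padded[K - k : K - k + T])
    return d / denom

# Second-order differences: apply the same formula to the first-order result,
# e.g. deltas2 = delta(delta(mfcc))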
Step S4: combining an artificial neural network (ANN) with a hidden Markov model (HMM) to perform voice matching on the static and dynamic characteristic information, and obtaining the matching output probabilities of all corresponding paths from the HMM through the Viterbi algorithm (a minimal sketch of the Viterbi computation is given after this step description),
when the voice signal is matched for the first time:
the maximum probability is obtained from the matching output probabilities of all corresponding paths and compared with the confidence probability, which is calculated as:
$$P_c = \frac{1}{M} \sum_{i=1}^{M} P_i$$
wherein $P_c$ is the confidence probability; $M$ is the total number of successful first matchings; $P_i$ is the maximum probability value of the $i$-th successful first matching;
when the maximum probability is greater than the confidence probability, the output content of the corresponding path with the maximum probability is output, this content being the recognized content;
when the maximum probability is smaller than the confidence probability, the tone is removed from the voice signal obtained in step S1, and steps S2-S4 are repeated on the tone-removed voice signal for a second matching;
when the voice signal is matched for the second time:
the output content of the corresponding path with the highest matching output probability among all corresponding paths is output, this content being the recognized content.
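For illustration, a minimal log-space Viterbi sketch over a discrete-state HMM is given below; the initial distribution, transition matrix, and per-frame state output scores (which, in the ANN/HMM combination of this embodiment, would come from the neural network) are assumed inputs rather than structures defined by the patent:

import numpy as np

def viterbi_log(log_pi, log_A, log_B):
    """Most likely state path and its log probability.
    log_pi: (S,) initial log probabilities; log_A: (S, S) transition log
    probabilities; log_B: (T, S) per-frame state output log scores."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]                 # best score ending in each state
    back = np.zeros((T, S), dtype=int)        # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[i, j]: prev state i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]            # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(np.max(delta))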
For the voice-matching process, the invention introduces the concept of a fuzzy algorithm. Because of external noise, the matching output probability of the corresponding path is generally low when voice signals are matched, and a high matching probability is difficult to achieve; a confidence probability is therefore introduced, calculated as follows:
$$P_c = \frac{1}{M} \sum_{i=1}^{M} P_i$$
wherein $P_c$ is the confidence probability; $M$ is the total number of successful first matchings; $P_i$ is the maximum probability value of the $i$-th successful first matching.
the confidence probability is a long-term accumulated value, when an initial value of the confidence probability is set at the initial stage of the robot voice recognition, the initial value is an average value of matching output probabilities of all corresponding paths in the first matching process, and the initial value is updated in real time according to the formula subsequently to adapt to the voice recognition in a noisy environment in turn. When the first matching of the voice information is unsuccessful, namely the maximum probability obtained from the matching output probabilities of all corresponding paths is smaller than the confidence probability, the voice signal is subjected to tone removal processing, and then the voice signal subjected to tone removal is subjected to the second matching in the step 2-4, so that external noise can possibly influence the related information of the voice information, including the influence on the tone of the voice information, thereby reducing the matching probability, removing the tone at the moment, and matching the voice signal again, thereby improving the probability and the accuracy of matching; when the speech signal is a second match: and outputting the output content of the corresponding path with the highest probability in the matching output probabilities of all the corresponding paths, wherein the content is the identified content. The second re-tone-removal instead of the first direct tone-removal can avoid the problem of reduced accuracy caused by the fact that the output probability of the corresponding path with the highest matching degree is reduced after tone removal.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (5)

1. A robot voice recognition method, comprising:
step 1: acquiring a voice signal;
step 2: preprocessing a voice signal and extracting static characteristic information;
Step 3: acquiring dynamic characteristic information from the static characteristic information through a difference algorithm;
Step 4: using a hidden Markov model (HMM) to perform voice matching on the static characteristic information and the dynamic characteristic information, and using the Viterbi algorithm to obtain the matching output probabilities of all corresponding paths from the HMM,
when the voice signal is matched for the first time:
obtaining the maximum probability from the matching output probabilities of all corresponding paths and comparing it with the confidence probability,
when the maximum probability is greater than the confidence probability, outputting the output content of the corresponding path with the maximum probability, this content being the recognized content;
when the maximum probability is smaller than the confidence probability, removing the tone from the voice signal obtained in step 1, and repeating steps 2-4 on the tone-removed voice signal for a second matching;
when the voice signal is matched for the second time:
outputting the output content of the corresponding path with the highest matching output probability among all corresponding paths, this content being the recognized content.
2. The robot speech recognition method according to claim 1, wherein the confidence probability in step 4 is calculated as follows:
$$P_c = \frac{1}{M} \sum_{i=1}^{M} P_i$$
wherein $P_c$ is the confidence probability; $M$ is the total number of successful first matchings; $P_i$ is the maximum probability value of the $i$-th successful first matching.
3. The robot speech recognition method of claim 1, wherein the pre-processing of the speech signal in step 2 comprises pre-emphasis, framing, and windowing in sequence.
4. The robot speech recognition method according to claim 1, wherein in step 3, the static feature information is subjected to FFT, mel filtering, and DCT processing in sequence, and then the processing result of the DCT is subjected to difference calculation to obtain the dynamic feature information.
5. The robot speech recognition method according to claim 1, wherein in step 4, the artificial neural network ANN is combined with a hidden markov model HMM to perform speech matching on the static feature information and the dynamic feature information.
CN202011447106.8A 2020-12-09 2020-12-09 Robot voice recognition method Pending CN112562646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011447106.8A CN112562646A (en) 2020-12-09 2020-12-09 Robot voice recognition method

Publications (1)

Publication Number Publication Date
CN112562646A 2021-03-26

Family

ID=75061414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011447106.8A Pending CN112562646A (en) 2020-12-09 2020-12-09 Robot voice recognition method

Country Status (1)

Country Link
CN (1) CN112562646A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945673A (en) * 2012-11-24 2013-02-27 安徽科大讯飞信息科技股份有限公司 Continuous speech recognition method with speech command range changed dynamically
CN103065629A (en) * 2012-11-20 2013-04-24 广东工业大学 Speech recognition system of humanoid robot
CN108182937A (en) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 Keyword recognition method, device, equipment and storage medium
CN109036381A (en) * 2018-08-08 2018-12-18 平安科技(深圳)有限公司 Method of speech processing and device, computer installation and readable storage medium storing program for executing
CN109872714A (en) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN110503952A (en) * 2019-07-29 2019-11-26 北京搜狗科技发展有限公司 A kind of method of speech processing, device and electronic equipment


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination