CN112562646A - Robot voice recognition method - Google Patents

Robot voice recognition method

Info

Publication number
CN112562646A
Authority
CN
China
Prior art keywords
probability
matching
voice
characteristic information
signal
Legal status
Pending
Application number
CN202011447106.8A
Other languages
Chinese (zh)
Inventor
马国军 (Ma Guojun)
张奕凡 (Zhang Yifan)
申佳玮 (Shen Jiawei)
Current Assignee
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date
2020-12-09
Filing date
2020-12-09
Publication date
2021-03-26
Application filed by Jiangsu University of Science and Technology
Priority to CN202011447106.8A
Publication of CN112562646A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142: Hidden Markov Models [HMMs]
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/26: Speech to text systems
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination

Abstract

The invention discloses a robot voice recognition method comprising the following steps: acquiring a voice signal; extracting static characteristic information and dynamic characteristic information from the voice signal; and performing voice matching on the static and dynamic characteristic information. When the voice signal is matched for the first time, the maximum probability is taken from the matching output probabilities of all corresponding paths and compared with a confidence probability: when the maximum probability is greater than the confidence probability, the output content of the corresponding path with the maximum probability is output; when the maximum probability is smaller than the confidence probability, the acquired voice signal is stripped of its tone and matched a second time. When the voice signal is matched for the second time, the output content of the corresponding path with the highest matching output probability is output. By introducing fuzzy recognition, the invention can still recognize part of the effective information of a signal whose pronunciation is merely similar, improving recognition accuracy.

Description

Robot voice recognition method
Technical Field
The invention relates to the field of robots, in particular to a robot voice recognition method.
Background
With the development of society, voice recognition technology has matured and is now widely applied in daily life. People are accustomed to completing all kinds of tasks through human-computer interaction, which enriches daily experience and brings great convenience; voice recognition can be said to be ubiquitous. Under most circumstances voice recognition achieves high accuracy and satisfies the demand for human-computer interaction, but in different environments its accuracy is affected to different degrees, and traditional voice recognition technology cannot handle all special cases. The data therefore need to be optimized during the recognition process.
Existing voice recognition technology mainly preprocesses the input signal, extracts features, matches them against an existing acoustic model, and outputs the final recognition result. In a complex environment, existing voice recognition is strongly affected and may even become unusable; for example, in a noisy public environment the recognition process is disturbed by noise. This is a major problem to be solved at present.
Disclosure of Invention
The invention aims to provide a robot voice recognition method that solves the technical problem in the prior art of voice recognition accuracy being degraded by external noise.
To achieve this purpose, the invention adopts the following technical scheme:
a robot voice recognition method includes:
step 1: acquiring a voice signal;
step 2: preprocessing a voice signal and extracting static characteristic information;
Step 3: acquiring dynamic characteristic information from the static characteristic information through a difference algorithm;
Step 4: using a hidden Markov model (HMM) to perform voice matching on the static characteristic information and the dynamic characteristic information, and using the Viterbi algorithm to obtain the matching output probabilities of all corresponding paths from the HMM,
when the voice signal is matched for the first time:
obtaining the maximum probability from the matching output probabilities of all corresponding paths and comparing it with the confidence probability,
when the maximum probability is greater than the confidence probability, outputting the output content of the corresponding path with the maximum probability, this content being the recognized content;
when the maximum probability is smaller than the confidence probability, removing the tone from the voice signal obtained in step 1, and repeating steps 2-4 on the tone-removed voice signal for a second matching;
when the voice signal is matched for the second time:
outputting the output content of the corresponding path with the highest matching output probability among all corresponding paths, this content being the recognized content.
Further, the confidence probability in step 4 is calculated as follows:
$$P_c = \frac{1}{M} \sum_{i=1}^{M} P_i$$
wherein $P_c$ is the confidence probability; $M$ is the total number of successful first matchings; $P_i$ is the maximum probability value of the $i$-th successful first matching.
Further, preprocessing the voice signal in step 2 includes performing pre-emphasis, framing, and windowing in sequence.
Further, in step 3, FFT, Mel filtering, and DCT processing are performed on the static characteristic information in sequence, and difference calculation is then performed on the DCT result to obtain the dynamic characteristic information.
Further, in step 4, an artificial neural network (ANN) is combined with the hidden Markov model (HMM) to perform voice matching on the static characteristic information and the dynamic characteristic information.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention can recognize part of the effective information of a signal even when the pronunciation of the voice signal is merely similar, improving recognition accuracy to a certain extent.
2. By combining the ANN with the HMM, the invention reduces the interference of noise on voice signals to a certain extent and improves the anti-interference capability of the whole system.
3. After the fuzzy recognition technology is introduced, the success rate of voice recognition in complex environments can be improved to a certain extent.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the invention in any way, and in which:
FIG. 1 is a flow chart of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a specific embodiment of the present invention provides a robot speech recognition method, including:
Step S1: acquiring a voice signal;
Step S2: performing pre-emphasis, framing and windowing on the voice signal in sequence, and extracting static characteristic information from the processed signal;
the pre-emphasis part is used for leading the signal to pass through a high-pass filter, the pre-emphasis aims to promote the high-frequency part, so that the frequency spectrum of the signal becomes flat, the signal is kept in the whole frequency band from low frequency to high frequency, and the frequency spectrum can be obtained by the same signal-to-noise ratio. Meanwhile, the method is also used for eliminating the vocal cords and lip effects in the generation process, compensating the high-frequency part of the voice signal which is restrained by the pronunciation system, and highlighting the formants of the high frequency.
Because the subsequent fast Fourier transform requires a stationary signal, the signal must be divided into frames so that each local section of the voice signal can be regarded as stationary. Framing, however, introduces discontinuities at the beginning and end of each frame, and the more frames there are, the larger the error relative to the original signal. Windowing solves this problem: it makes the framed signal continuous, so that each frame exhibits the characteristics of a periodic function. In voice signal processing, a Hamming window is usually applied.
Each frame is multiplied by a Hamming window to increase the continuity of its left and right ends. Assuming the framed signal is $S(n)$, $n = 0, 1, \ldots, N-1$, where $N$ is the frame size, the signal after multiplication by the Hamming window is $S'(n) = S(n) \times W(n)$, with $W(n)$ defined as:

$$W(n) = \begin{cases} (1-a) - a\cos\dfrac{2\pi n}{N-1}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
Different values of $a$ produce different Hamming windows; typically $a = 0.46$.
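As an illustration of this preprocessing chain, a minimal NumPy sketch follows; the pre-emphasis coefficient 0.97 and the frame length and shift of 400 and 160 samples (25 ms and 10 ms at 16 kHz) are assumed typical values, not parameters specified by the patent:

import numpy as np

def preprocess(signal, frame_len=400, frame_shift=160, alpha=0.97, a=0.46):
    """Pre-emphasize, frame, and window a speech signal (assumed parameter values).
    Assumes len(signal) >= frame_len."""
    # Pre-emphasis: high-pass filter y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split into overlapping frames of frame_len samples
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(num_frames)])
    # Hamming window: W(n) = (1 - a) - a * cos(2*pi*n / (N - 1)), with a = 0.46
    n = np.arange(frame_len)
    window = (1 - a) - a * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * window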
Step S3: performing FFT, Mel filtering and DCT processing on the static characteristic information in sequence, and performing differential calculation on the output result of the DCT to obtain dynamic characteristic information;
since the signal is usually difficult to see by the transformation in the time domain, it is usually observed by transforming it into an energy distribution in the frequency domain, and different energy distributions can represent the characteristics of different voices. After multiplication by the hamming window, each frame must also undergo a fast fourier transform to obtain the energy distribution over the spectrum. And carrying out fast Fourier transform on each frame signal subjected to framing and windowing to obtain the frequency spectrum of each frame. And the power spectrum of the voice signal is obtained by taking the modulus square of the frequency spectrum of the voice signal.
The Mel filter bank mainly reduces the amplitude resolution of the frequency domain and removes redundancy in the spectrum. The amplitude spectrum obtained by the FFT is multiplied by each filter and the products are accumulated; the resulting value is the energy of the frame in the frequency band covered by that filter. If there are 22 filters, 22 energy values are obtained. The logarithm of each energy value is then taken to facilitate the subsequent cepstrum analysis.
The frequency response of the $m$-th Mel triangular filter is defined as:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$$

where $f(m)$ denotes the FFT bin of the center frequency of the $m$-th triangular filter; the center frequencies are equally spaced on the Mel scale.
the triangular band pass filter has two main purposes:
the method smoothes the frequency spectrum, eliminates the effect of harmonic wave, and highlights the formant of the original voice, so that the tone or pitch of a section of voice is not presented in the MFCC parameters, in other words, the voice recognition system using MFCC as the characteristic is not influenced by the tone difference of the input voice, and in addition, the operation amount can be reduced.
The DCT is widely used in signal and image processing for lossy compression because of its strong energy-compaction property: after a discrete cosine transform, the energy of most natural signals, including sound and images, is concentrated in the low-frequency part. In effect, the DCT performs a dimensionality reduction on each frame of data.
$$s(m) = \ln\left(\sum_{k=0}^{N-1} |X(k)|^2 \, H_m(k)\right), \quad 1 \le m \le M$$

$$C(n) = \sum_{m=1}^{M} s(m)\cos\left(\frac{\pi n\,(m - 0.5)}{M}\right), \quad n = 1, 2, \ldots, L$$
Here $X(k)$ is the spectrum of the frame obtained by the FFT. Substituting the log energies $s(m)$ into the discrete cosine transform yields the Mel-scale cepstrum parameters of order $L$, where $L$ is the order of the MFCC coefficients, usually 12-16, and $M$ is the number of triangular filters.
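A sketch of this step, keeping L = 13 coefficients (an assumed value within the 12-16 range mentioned above):

import numpy as np

def mfcc_from_log_energies(log_energies, L=13):
    """DCT of the log filterbank energies; keep the first L coefficients.
    log_energies has shape (num_frames, M), M = number of triangular filters."""
    M = log_energies.shape[1]
    n = np.arange(1, L + 1)[:, None]              # coefficient index n = 1..L
    m = np.arange(M)[None, :]                     # filter index, 0-based
    # C(n) = sum_m s(m) * cos(pi * n * (m + 0.5) / M)
    basis = np.cos(np.pi * n * (m + 0.5) / M)
    return log_energies @ basis.T                 # shape (num_frames, L)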
The standard cepstral parameters (MFCCs) reflect only the static characteristics of the voice; its dynamic characteristics can be described by the differential spectrum of these static features. Experiments show that combining dynamic and static features effectively improves the recognition performance of the system. The difference parameters can be computed with the following formula:
$$d_t = \begin{cases} C_{t+1} - C_t, & t < K \\[4pt] \dfrac{\sum_{k=1}^{K} k\,(C_{t+k} - C_{t-k})}{\sqrt{2\sum_{k=1}^{K} k^2}}, & K \le t \le Q - K \\[4pt] C_t - C_{t-1}, & t > Q - K \end{cases}$$
where $d_t$ denotes the $t$-th first-order difference, $C_t$ the $t$-th cepstral coefficient, $Q$ the order of the cepstral coefficients, and $K$ the time span of the first derivative, which may be 1 or 2. Substituting the result of this expression back into the same formula yields the second-order difference parameters. This completes the extraction of the dynamic feature information of the voice signal.
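A sketch of the difference computation; edge frames are handled here by replicating the first and last frames rather than by the piecewise boundary cases of the formula, a common simplification:

import numpy as np

def delta(coeffs, K=2):
    """First-order differences of cepstral coefficients along the time axis.
    coeffs has shape (num_frames, L)."""
    T = len(coeffs)
    padded = np.pad(coeffs, ((K, K), (0, 0)), mode="edge")   # replicate ends
    denom = np.sqrt(2 * sum(k * k for k in range(1, K + 1)))
    d = np.zeros_like(coeffs)
    for k in range(1, K + 1):
        d += k * (padded[K + k : K + k + T] - padded[K - k : K - k + T])
    return d / denom

# Second-order differences: apply the same formula to the first-order result,
# e.g. deltas2 = delta(delta(mfcc))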
Step S4: combining an artificial neural network (ANN) with a hidden Markov model (HMM) to perform voice matching on the static and dynamic characteristic information, and obtaining the matching output probabilities of all corresponding paths from the HMM through the Viterbi algorithm (a minimal sketch of the Viterbi computation is given after this step description),
when the voice signal is matched for the first time:
the maximum probability is obtained from the matching output probabilities of all corresponding paths and compared with the confidence probability, which is calculated as:
$$P_c = \frac{1}{M} \sum_{i=1}^{M} P_i$$
wherein $P_c$ is the confidence probability; $M$ is the total number of successful first matchings; $P_i$ is the maximum probability value of the $i$-th successful first matching;
when the maximum probability is greater than the confidence probability, the output content of the corresponding path with the maximum probability is output, this content being the recognized content;
when the maximum probability is smaller than the confidence probability, the tone is removed from the voice signal obtained in step S1, and steps S2-S4 are repeated on the tone-removed voice signal for a second matching;
when the voice signal is matched for the second time:
the output content of the corresponding path with the highest matching output probability among all corresponding paths is output, this content being the recognized content.
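For illustration, a minimal log-space Viterbi sketch over a discrete-state HMM is given below; the initial distribution, transition matrix, and per-frame state output scores (which, in the ANN/HMM combination of this embodiment, would come from the neural network) are assumed inputs rather than structures defined by the patent:

import numpy as np

def viterbi_log(log_pi, log_A, log_B):
    """Most likely state path and its log probability.
    log_pi: (S,) initial log probabilities; log_A: (S, S) transition log
    probabilities; log_B: (T, S) per-frame state output log scores."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]                 # best score ending in each state
    back = np.zeros((T, S), dtype=int)        # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[i, j]: prev state i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]            # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(np.max(delta))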
For the voice-matching process, the invention introduces the concept of a fuzzy algorithm. Because of external noise, the matching output probability of the corresponding path is generally low when voice signals are matched, and a high matching probability is difficult to achieve; a confidence probability is therefore introduced, calculated as follows:
$$P_c = \frac{1}{M} \sum_{i=1}^{M} P_i$$
wherein $P_c$ is the confidence probability; $M$ is the total number of successful first matchings; $P_i$ is the maximum probability value of the $i$-th successful first matching.
the confidence probability is a long-term accumulated value, when an initial value of the confidence probability is set at the initial stage of the robot voice recognition, the initial value is an average value of matching output probabilities of all corresponding paths in the first matching process, and the initial value is updated in real time according to the formula subsequently to adapt to the voice recognition in a noisy environment in turn. When the first matching of the voice information is unsuccessful, namely the maximum probability obtained from the matching output probabilities of all corresponding paths is smaller than the confidence probability, the voice signal is subjected to tone removal processing, and then the voice signal subjected to tone removal is subjected to the second matching in the step 2-4, so that external noise can possibly influence the related information of the voice information, including the influence on the tone of the voice information, thereby reducing the matching probability, removing the tone at the moment, and matching the voice signal again, thereby improving the probability and the accuracy of matching; when the speech signal is a second match: and outputting the output content of the corresponding path with the highest probability in the matching output probabilities of all the corresponding paths, wherein the content is the identified content. The second re-tone-removal instead of the first direct tone-removal can avoid the problem of reduced accuracy caused by the fact that the output probability of the corresponding path with the highest matching degree is reduced after tone removal.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (5)

1. A robot voice recognition method, comprising:
step 1: acquiring a voice signal;
step 2: preprocessing a voice signal and extracting static characteristic information;
Step 3: acquiring dynamic characteristic information from the static characteristic information through a difference algorithm;
Step 4: using a hidden Markov model (HMM) to perform voice matching on the static characteristic information and the dynamic characteristic information, and using the Viterbi algorithm to obtain the matching output probabilities of all corresponding paths from the HMM,
when the voice signal is matched for the first time:
obtaining the maximum probability from the matching output probabilities of all corresponding paths and comparing it with the confidence probability,
when the maximum probability is greater than the confidence probability, outputting the output content of the corresponding path with the maximum probability, this content being the recognized content;
when the maximum probability is smaller than the confidence probability, removing the tone from the voice signal obtained in step 1, and repeating steps 2-4 on the tone-removed voice signal for a second matching;
when the voice signal is matched for the second time:
outputting the output content of the corresponding path with the highest matching output probability among all corresponding paths, this content being the recognized content.
2. The robot speech recognition method according to claim 1, wherein the confidence probability in step 4 is calculated as follows:
$$P_c = \frac{1}{M} \sum_{i=1}^{M} P_i$$
wherein $P_c$ is the confidence probability; $M$ is the total number of successful first matchings; $P_i$ is the maximum probability value of the $i$-th successful first matching.
3. The robot speech recognition method of claim 1, wherein the pre-processing of the speech signal in step 2 comprises pre-emphasis, framing, and windowing in sequence.
4. The robot speech recognition method according to claim 1, wherein in step 3, the static feature information is subjected to FFT, mel filtering, and DCT processing in sequence, and then the processing result of the DCT is subjected to difference calculation to obtain the dynamic feature information.
5. The robot speech recognition method according to claim 1, wherein in step 4, the artificial neural network ANN is combined with a hidden markov model HMM to perform speech matching on the static feature information and the dynamic feature information.
CN202011447106.8A 2020-12-09 2020-12-09 Robot voice recognition method Pending CN112562646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011447106.8A CN112562646A (en) 2020-12-09 2020-12-09 Robot voice recognition method

Publications (1)

Publication Number Publication Date
CN112562646A 2021-03-26

Family

ID=75061414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011447106.8A Pending CN112562646A (en) 2020-12-09 2020-12-09 Robot voice recognition method

Country Status (1)

Country Link
CN (1) CN112562646A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945673A (en) * 2012-11-24 2013-02-27 安徽科大讯飞信息科技股份有限公司 Continuous speech recognition method with speech command range changed dynamically
CN103065629A (en) * 2012-11-20 2013-04-24 广东工业大学 Speech recognition system of humanoid robot
CN108182937A (en) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 Keyword recognition method, device, equipment and storage medium
CN109036381A (en) * 2018-08-08 2018-12-18 平安科技(深圳)有限公司 Method of speech processing and device, computer installation and readable storage medium storing program for executing
CN109872714A (en) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN110503952A (en) * 2019-07-29 2019-11-26 北京搜狗科技发展有限公司 A kind of method of speech processing, device and electronic equipment


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination