CN107342074B - Speech and sound recognition method - Google Patents

Speech and sound recognition method

Info

Publication number
CN107342074B
CN107342074B (application CN201610273827.9A)
Authority
CN
China
Prior art keywords
voice
sound
recognized
array
pure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610273827.9A
Other languages
Chinese (zh)
Other versions
CN107342074A (en)
Inventor
Wang Rong (王荣)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201610273827.9A priority Critical patent/CN107342074B/en
Publication of CN107342074A publication Critical patent/CN107342074A/en
Application granted granted Critical
Publication of CN107342074B publication Critical patent/CN107342074B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit

Abstract

The invention provides a method for implementing speech recognition. The method is characterized in that sounds of small loudness are ignored, and that when the distance between the sound to be recognized and the pure speech is calculated, each per-element result is capped at the loudness of the pure speech. The method therefore recognizes speech better in noisy environments and for words or phrases with short pronunciations.

Description

Speech and sound recognition method
Technical Field
The invention belongs to the field of speech and sound recognition, and specifically relates to a method for implementing speech and sound recognition.
Background
Speech recognition is an important component of artificial intelligence with wide application, but current speech recognition performs poorly in noisy environments. One method for comparing the difference between two speech signals is described in the article "An Objective Measure for Predicting Subjective Quality of Speech Coders", IEEE Journal on Selected Areas in Communications, vol. 10, no. 5, June 1992 (hereinafter, document 1), but that method is not ideal when used directly for speech recognition. In addition, it requires the two signals to be perfectly aligned, whereas in practice speech can start and end at any time and is almost impossible to align in advance. Accordingly, the present invention proposes a solution that attempts to solve these problems.
Disclosure of Invention
A method for implementing speech recognition, in which a pure speech A is converted into a two-dimensional array F representing the loudness of the pure speech A on the Bark scale, and a sound G to be recognized is converted into a two-dimensional array H representing the loudness of the sound G to be recognized on the Bark scale, characterized in that:
when comparing the array F with the array H, the elements of F with small loudness, and the elements of H corresponding to them, are ignored.
A method for implementing speech recognition, in which a pure speech A2 is converted into a two-dimensional array F2 representing the loudness of the pure speech A2 on the Bark scale, and a sound G2 to be recognized is converted into a two-dimensional array H2 representing the loudness of the sound G2 to be recognized on the Bark scale, characterized in that:
when the distance between an element F2[x][y] of the array F2 and the corresponding element H2[x][y] of the array H2 is calculated, the result is capped so as not to exceed the value of the element F2[x][y].
Preferably, the sound G3 to be recognized has a length different from that of the pure speech A3, characterized in that:
a segment of sound G4 with the same length as the pure speech A3 is extracted frame by frame from the sound G3 to be recognized and compared with the pure speech A3.
Preferably, the pure speech A and the pure speech A2 are multiplied by a scale factor before being compared with the sound G and the sound G2 to be recognized.
Compared with the prior art, the invention has the advantage of better recognition in noisy environments and of words or phrases with short pronunciations.
Detailed Description
Example 1:
In speech, and more generally in sound, the distribution of power over frequency is not uniform and varies over time. It is this distribution of frequencies, and its variation, that allows one to discern various sounds. Assume that a 200 Hz and a 2000 Hz sinusoidal tone of constant intensity sound at the same time, and that the loudness of the 200 Hz tone is twice that of the 2000 Hz tone. In this case, a human can easily pick out the 2000 Hz tone from the mixture. However, if the method and formula of document 1 are used directly to recognize the 2000 Hz tone, by computing the distance between the mixture and a pure 2000 Hz tone, the mixture is judged far from the pure tone, and the 2000 Hz tone is not recognized. By contrast, a human who has first listened to a pure 2000 Hz sine wave knows that its loudness is zero at 200 Hz and every other frequency; he therefore ignores the 200 Hz component, considers only the 2000 Hz component, and still hears the 2000 Hz tone.
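The effect of ignoring bands that are silent in the pure sound can be sketched numerically. A minimal pure-Python sketch; the two band loudness values are hypothetical, chosen only to mirror the 200 Hz / 2000 Hz example above:

```python
# Hypothetical per-band loudness values (band 0 ~ 200 Hz, band 1 ~ 2000 Hz).
pure = [0.0, 1.0]    # pure 2000 Hz tone: loudness only in its own band
mixed = [2.0, 1.0]   # mixture: loud 200 Hz tone plus the same 2000 Hz tone

# Naive distance in the style of document 1: every band contributes,
# so the mixture looks "far" from the pure tone.
naive = sum(abs(m - p) for m, p in zip(mixed, pure))

# Ignoring bands where the pure sound is silent, as the text suggests,
# the mixture is at distance zero from the pure tone and is recognized.
masked = sum(abs(m - p) for m, p in zip(mixed, pure) if p > 0)
```

Here `naive` comes out to 2.0 while `masked` is 0.0, which is the gap the invention exploits.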
In addition, in a noisy environment, sounds of very small loudness are easily disturbed by interference; therefore, for speech recognition in noisy environments, the sounds of very small loudness in the pure speech need to be ignored.
Now assume there is a recorded speech sample, for example the word "north" of "Beijing" (hereinafter A). A lasts 0.5 seconds at a sampling rate of 8000 Hz, so it contains 4000 samples in total. First, A is divided into a number of overlapping or non-overlapping frames, and each frame is windowed with a window function (e.g., a Hamming, Hanning, or sin window). This application recommends at least 8x overlapping sampling and a sin window. For example, with 50-millisecond frames and 8x overlap, frame 1 of the speech consists of samples 1 to 400 of A, frame 2 of samples 51 to 450, frame 3 of samples 101 to 500, and so on. Each frame is then windowed with the sin window. Thus A is converted into a two-dimensional array E with elements E[n][m], where n runs from 1 to the total number of frames of A and m from 1 to 400, 400 being the number of samples per frame. Each row of E is then processed by the method of document 1 to produce the loudness on each Bark band of the human ear, so that E is converted into an array F with elements F[n][m], where n runs from 1 to the total number of frames of A and m from 1 to 24, 24 being the number of Bark bands of the human ear; F[x] herein denotes the row F[x][1] to F[x][24], i.e., the loudness of one frame of A on the 24 Bark bands, calculated per document 1. Other subdivisions are also possible; for example, splitting each Bark band equally in two gives 48 bands and better recognition. Now assume that when the speech A is played again at another time, A becomes G under the influence of noise. Similarly, G is converted by the method of document 1 into an array H with elements H[n][m], where n runs from 1 to the total number of frames and m from 1 to 24.
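The framing and windowing step above can be sketched as follows. This is a sketch under the example's stated assumptions (8000 Hz sampling, 50 ms frames, 8x overlap, sin window); the function name and the exact sin-window phase are mine, not the patent's:

```python
import math

def frames_with_sin_window(samples, frame_len=400, step=50):
    """Split a signal into 8x-overlapping frames of frame_len samples
    (step = frame_len / 8) and apply a sin window to each frame.
    The half-sample-shifted sin window used here is one common choice."""
    window = [math.sin(math.pi * (i + 0.5) / frame_len) for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

# 0.5 s at 8000 Hz -> 4000 samples -> (4000 - 400) / 50 + 1 = 73 frames.
frames = frames_with_sin_window([0.0] * 4000)
```

Each row of the result corresponds to one row of the array E in the text, ready for the per-frame Bark-loudness calculation of document 1.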
One row of H represents the loudness on the 24 Bark bands of the human ear, calculated by the method of document 1. To determine whether H contains the speech A, let the array P = abs(H − F), where abs is the absolute-value function. That is, each element of the array P equals the corresponding element of H minus the corresponding element of F, taken in absolute value.
For recognition in noisy environments, the elements of F whose loudness is too small must be ignored, because they are too easily disturbed by interference and become nearly unusable. As the threshold for "too small", this application recommends 1/4 to 1/2 of the maximum loudness value on any Bark band of the pure speech. For the human ear, at 1/4 of the loudness the acoustic power is only about 1/100, and even at 1/2 of the loudness the power is about 1/10; so although the loudness of these elements in the pure speech is not small, their actual acoustic power is, which makes them very susceptible to interference. In a quiet environment these sounds still aid recognition, but in a noisy environment they become unusable. Specifically, let mf be the value of the largest element in the array F; each element of F is checked, and if F[x][y] < mf/4, then P[x][y] = 0 and F[x][y] = 0 are set, so that these elements no longer influence the result in subsequent calculations; in other words, they are ignored.
Second, when calculating whether the sound to be recognized contains a certain speech, the per-element distance should be capped at the loudness of the corresponding Bark band of the pure speech. That is, each element P[x][y] of the array P is checked, and if P[x][y] > F[x][y], then P[x][y] = F[x][y]. For example, if P[2][5] equals 0.8 and F[2][5] equals 0.5, then P[2][5] is set to 0.5.
Then the sum of all elements in the array F is calculated to give sf, and the sum of all elements in the array P to give sp. Let d = sp/sf. If d is less than or equal to a small value, e.g. 0.2, the speech A is considered found in the sound G. Note that finding the speech A in the sound G does not exclude the possibility that G also contains other speech or sounds, such as the voices of other speakers talking at the same time, or background music.
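The whole comparison of Example 1 — ignore quiet elements of F, cap each per-element distance at the pure loudness, then form d = sp/sf — can be sketched in a few lines. A sketch only: the function name is mine and the two 2-frame, 2-band arrays are hypothetical illustration data:

```python
def bark_distance(F, H, ignore_ratio=0.25):
    """Compare pure-speech Bark-loudness array F with array H of the
    sound to be recognized, following Example 1: elements of F below
    ignore_ratio * max(F) are ignored, each per-element distance is
    capped at the pure loudness, and d = sp / sf is returned."""
    mf = max(max(row) for row in F)
    threshold = ignore_ratio * mf
    sf = sp = 0.0
    for frow, hrow in zip(F, H):
        for f, h in zip(frow, hrow):
            if f < threshold:        # too quiet in the pure speech: ignore
                continue
            p = min(abs(h - f), f)   # distance never exceeds the pure loudness
            sf += f
            sp += p
    return sp / sf

F = [[1.0, 0.1], [0.8, 0.05]]   # hypothetical pure-speech loudness
H = [[0.9, 2.0], [0.7, 1.5]]    # noisy version: junk lands in the quiet band
d = bark_distance(F, H)         # small d -> the speech is considered found
```

With these numbers the quiet second band is ignored entirely, so the noise dumped into it does not inflate d; the result is about 0.11, below the 0.2 threshold suggested in the text.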
Example 2:
Example 1 already gives good results, but some problems remain. For instance, suppose the pure speech is 0.5 seconds long and the sound to be recognized is 10 seconds long; the speech may start anywhere within those 10 seconds, whereas Example 1 assumes that the pure speech and the sound to be recognized have the same length and that the speech appears at the same position in both. The solution is to compare frame by frame. For example, assume both the sound to be recognized and the pure speech are sampled at 8000 Hz, the frame length is 50 milliseconds, and 8x overlapping sampling is used, so the frame step is 8000/(1000/50)/8 = 50 samples. If the pure speech A is 0.5 seconds long, it has 4000 samples. First, samples 1 to 4000 of the sound to be recognized are taken and tested for A by the method of Example 1; then the window advances one step to frame 2, i.e., samples 51 to 4050 are compared with the pure speech; then frame 3, frame 4, and so on. A problem is that the same speech may be recognized repeatedly, e.g., the 4000-sample windows starting at frames 2 and 3 may both recognize the speech A; so when the same pure speech is recognized at positions too close together, e.g., only 1 to 2 frames apart, the duplicates must be deleted.
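The frame-by-frame search with duplicate suppression can be sketched as follows. The function name is mine, and the match predicate is a stand-in for Example 1's d ≤ 0.2 test (here a toy marker check, purely for illustration):

```python
def find_speech(sound, pure_len=4000, step=50, is_match=None):
    """Slide a pure_len-sample window over the sound to be recognized in
    steps of one frame hop (step samples), testing each window with a
    match predicate. Hits within two hops of the previous hit are
    collapsed into one, as the text suggests for duplicate recognitions."""
    hits = []
    for start in range(0, len(sound) - pure_len + 1, step):
        if is_match(sound[start:start + pure_len]):
            if hits and start - hits[-1] <= 2 * step:
                continue  # same utterance recognized at an adjacent offset
            hits.append(start)
    return hits

# Toy predicate: "match" when the window begins with a marker value.
sound = [0.0] * 8000
sound[1000] = sound[1050] = 1.0   # would match at two adjacent hops
hits = find_speech(sound, is_match=lambda w: w[0] == 1.0)
```

The two adjacent raw matches at offsets 1000 and 1050 collapse into the single hit at 1000, which is the deduplication behavior described above.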
Furthermore, because of recording conditions and the like, the pure speech may appear softer or louder in the sound to be recognized. It is therefore also necessary to repeatedly multiply or divide the loudness of the pure speech by a small coefficient, e.g. 1.05, and compare it with the sound to be recognized each time, until the loudness of the pure speech and that of the sound to be recognized differ so much, e.g. by more than 10 times, that the sound to be recognized cannot plausibly contain the pure speech.
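The loudness-scale sweep described above can be sketched by generating the set of scale factors to try, assuming the coefficient 1.05 and the 10x cutoff from the text (the function name is mine):

```python
def scale_factors(factor=1.05, limit=10.0):
    """Generate the loudness scale factors to try: repeatedly multiply
    and divide by `factor` (1.05 here) until the scale exceeds `limit`
    (10x), beyond which the sound to be recognized is considered
    unlikely to contain the pure speech."""
    s = 1.0
    factors = [s]
    while s * factor <= limit:
        s *= factor
        factors.append(s)        # pure speech scaled louder
        factors.append(1.0 / s)  # pure speech scaled softer
    return sorted(factors)

facs = scale_factors()  # every factor lies within [1/10, 10]
```

At each factor, the pure-speech loudness array F is multiplied by the factor and compared against the sound to be recognized as in Example 1.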
In this application, "speech" and "sound" are almost always interchangeable. The above embodiments are only among the preferred embodiments of the invention; any changes or substitutions that a person skilled in the art can readily conceive within the scope of the invention shall fall within its scope of protection.

Claims (4)

1. A method for implementing speech recognition, in which a pure speech A is converted into a two-dimensional array F representing the loudness of the pure speech A on the Bark scale, with elements F[n][m], and a sound G to be recognized is converted into a two-dimensional array H representing the loudness of the sound G to be recognized on the Bark scale, with elements H[n][m], where n runs from 1 to the total number of frames of said G and m from 1 to 24, 24 being the number of Bark bands of the human ear, and in which an array P = abs(H − F) is formed, characterized in that:
when comparing the array F with the array H, mf being the value of the largest element in said F, each element in said F is checked, and if F[x][y] is less than a small threshold value, P[x][y] = 0 and F[x][y] = 0 are set;
then the sum of all elements in said F is calculated to obtain sf, and the sum of all elements in said P is calculated to obtain sp; letting d = sp/sf, if said d is less than or equal to a certain small value, said speech A is considered to be found in said sound G.
2. A method for implementing speech recognition, in which a pure speech A is converted into a two-dimensional array F representing the loudness of the pure speech A on the Bark scale, with elements F[n][m], and a sound G to be recognized is converted into a two-dimensional array H representing the loudness of the sound G to be recognized on the Bark scale, with elements H[n][m], where n runs from 1 to the total number of frames of said G and m from 1 to 24, 24 being the number of Bark bands of the human ear, and in which an array P = abs(H − F) is formed, characterized in that:
when calculating the distance between an element F[x][y] of the array F and the corresponding element H[x][y] of the array H, the calculated result is capped so as not to exceed the value of the element F[x][y];
then the sum of all elements in said F is calculated to obtain sf, and the sum of all elements in said P is calculated to obtain sp; letting d = sp/sf, if said d is less than or equal to a certain small value, said speech A is considered to be found in said sound G.
3. The method for implementing speech recognition according to claim 1 or claim 2, wherein, to calculate whether a sound G3 to be recognized, whose length differs from that of the pure speech A, contains the pure speech A, characterized in that:
a segment with the same length as the pure speech A is extracted frame by frame from the sound G3 to be recognized and, as the sound G to be recognized, compared with the pure speech A.
4. A method of implementing speech recognition according to claim 1 or claim 2, characterized in that:
the pure speech A is multiplied by a scale factor before being compared with the sound G to be recognized.
CN201610273827.9A 2016-04-29 2016-04-29 Speech and sound recognition method Active CN107342074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610273827.9A CN107342074B (en) 2016-04-29 2016-04-29 Speech and sound recognition method


Publications (2)

Publication Number Publication Date
CN107342074A CN107342074A (en) 2017-11-10
CN107342074B true CN107342074B (en) 2024-03-15

Family

ID=60221815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610273827.9A Active CN107342074B (en) 2016-04-29 2016-04-29 Speech and sound recognition method

Country Status (1)

Country Link
CN (1) CN107342074B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5864794A (en) * 1994-03-18 1999-01-26 Mitsubishi Denki Kabushiki Kaisha Signal encoding and decoding system using auditory parameters and bark spectrum
WO2003036621A1 (en) * 2001-10-22 2003-05-01 Motorola, Inc., A Corporation Of The State Of Delaware Method and apparatus for enhancing loudness of an audio signal
JP2004029215A (en) * 2002-06-24 2004-01-29 Auto Network Gijutsu Kenkyusho:Kk Method for evaluating voice recognition precision of voice recognition device
CN1655230A (en) * 2005-01-18 2005-08-17 中国电子科技集团公司第三十研究所 Noise masking threshold algorithm based Barker spectrum distortion measuring method in objective assessment of sound quality
CN102376306A (en) * 2010-08-04 2012-03-14 华为技术有限公司 Method and device for acquiring level of speech frame
CN103903612A (en) * 2014-03-26 2014-07-02 浙江工业大学 Method for performing real-time digital speech recognition

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US6701291B2 (en) * 2000-10-13 2004-03-02 Lucent Technologies Inc. Automatic speech recognition with psychoacoustically-based feature extraction, using easily-tunable single-shape filters along logarithmic-frequency axis
US20120233164A1 (en) * 2008-09-05 2012-09-13 Sourcetone, Llc Music classification system and method


Non-Patent Citations (3)

Title
K.K. Chu et al., "Perceptually non-uniform spectral compression for noisy speech recognition", 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003, pp. 404-407. *
Song Fangfang et al., "Research on the scoring mechanism of a spoken-English self-study system based on speech recognition technology", Computer Knowledge and Technology, 2009, vol. 5, no. 07, p. 1728. *
Yuan Xiugan et al., "The influence of noise on information transmission", in Human-Machine Engineering, Beihang University Press, 2002, p. 131. *

Also Published As

Publication number Publication date
CN107342074A (en) 2017-11-10

Similar Documents

Publication Publication Date Title
Spille et al. Predicting speech intelligibility with deep neural networks
US9916842B2 (en) Systems, methods and devices for intelligent speech recognition and processing
Ma et al. Efficient voice activity detection algorithm using long-term spectral flatness measure
Stern et al. Hearing is believing: Biologically inspired methods for robust automatic speech recognition
Shi et al. On the importance of phase in human speech recognition
Moritz et al. An auditory inspired amplitude modulation filter bank for robust feature extraction in automatic speech recognition
Zhang et al. Effects of telephone transmission on the performance of formant-trajectory-based forensic voice comparison–female voices
US9240190B2 (en) Formant based speech reconstruction from noisy signals
US10176824B2 (en) Method and system for consonant-vowel ratio modification for improving speech perception
Chuang et al. Speaker-Aware Deep Denoising Autoencoder with Embedded Speaker Identity for Speech Enhancement.
Hummersone A psychoacoustic engineering approach to machine sound source separation in reverberant environments
Mack et al. Single-Channel Dereverberation Using Direct MMSE Optimization and Bidirectional LSTM Networks.
CN107342074B (en) Speech and sound recognition method
Tchorz et al. Estimation of the signal-to-noise ratio with amplitude modulation spectrograms
Srinivasan et al. A model for multitalker speech perception
Dai et al. 2D Psychoacoustic modeling of equivalent masking for automatic speech recognition
JP3916834B2 (en) Extraction method of fundamental period or fundamental frequency of periodic waveform with added noise
Lee et al. Application of shape analysis techniques for improved CASA-based speech separation
Remes et al. Comparing human and automatic speech recognition in a perceptual restoration experiment
CN102222507B (en) Method and equipment for compensating hearing loss of Chinese language
Do et al. Combining cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition
KR20100056859A (en) Voice recognition apparatus and method
Amano et al. Acoustic features of pop-out voice in babble noise
Moritz et al. Amplitude modulation filters as feature sets for robust ASR: constant absolute or relative bandwidth?
Andrijašević Effect of phoneme variations on blind reverberation time estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
DD01 Delivery of document by public notice

Addressee: Wang Rong

Document name: Notification of Patent Invention Entering into Substantive Examination Stage

GR01 Patent grant