CN107342074B - Speech and sound recognition method - Google Patents

Speech and sound recognition method

Info

Publication number
CN107342074B
CN107342074B (application CN201610273827.9A)
Authority
CN
China
Prior art keywords
voice
sound
recognized
array
pure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610273827.9A
Other languages
Chinese (zh)
Other versions
CN107342074A (en)
Inventor
Wang Rong (王荣)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201610273827.9A priority Critical patent/CN107342074B/en
Publication of CN107342074A publication Critical patent/CN107342074A/en
Application granted granted Critical
Publication of CN107342074B publication Critical patent/CN107342074B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit

Abstract

The invention provides a method for implementing speech recognition. The method is characterized in that sounds of small loudness are ignored, and that when the distance between the sound to be recognized and the pure speech is calculated, each per-element result is capped at the loudness of the pure speech. The method therefore recognizes speech better in noisy environments and for words or phrases with short pronunciations.

Description

Speech and sound recognition method
Technical Field
The invention belongs to the field of speech and sound recognition, and specifically relates to a method for implementing speech and sound recognition.
Background
Speech recognition is an important component of artificial intelligence with wide application, but current speech recognition performs poorly in noisy environments. One method for comparing the difference between two speech signals is described in the article "An Objective Measure for Predicting Subjective Quality of Speech Coders", IEEE Journal on Selected Areas in Communications, vol. 10, no. 5, June 1992 (hereinafter, document 1), but that method is not ideal when used directly for speech recognition. In addition, it requires the two signals to be perfectly aligned, whereas in practice speech can start and end at any time and is almost impossible to align in advance. Accordingly, the present invention proposes a solution that attempts to solve these problems.
Disclosure of Invention
A method for implementing speech recognition, in which a pure speech A is converted into a two-dimensional array F representing the loudness of the pure speech A on the Bark scale, and a sound G to be recognized is converted into a two-dimensional array H representing the loudness of the sound G to be recognized on the Bark scale, characterized in that:
when comparing the array F with the array H, the elements of F with small loudness, and the elements of H corresponding to them, are ignored.
A method for implementing speech recognition, in which a pure speech A2 is converted into a two-dimensional array F2 representing the loudness of the pure speech A2 on the Bark scale, and a sound G2 to be recognized is converted into a two-dimensional array H2 representing the loudness of the sound G2 to be recognized on the Bark scale, characterized in that:
when the distance between an element F2[x][y] of the array F2 and the corresponding element H2[x][y] of the array H2 is calculated, the result is capped so as not to exceed the value of the element F2[x][y].
Preferably, the sound G3 to be recognized has a length different from that of the pure speech A3, characterized in that:
a segment of sound G4 with the same length as the pure speech A3 is extracted frame by frame from the sound G3 to be recognized and compared with the pure speech A3.
Preferably, the pure speech A and the pure speech A2 are multiplied by a scale factor before being compared with the sound G and the sound G2 to be recognized.
Compared with the prior art, the invention has the advantage of better recognition in noisy environments and of words or phrases with short pronunciations.
Detailed Description
Example 1:
In speech, and more generally in sound, the distribution of power over frequency is not uniform and varies over time. It is this distribution of frequencies, and its variation, that allows one to discern various sounds. Assume that a 200 Hz and a 2000 Hz sinusoidal tone of constant intensity sound at the same time, and that the loudness of the 200 Hz tone is twice that of the 2000 Hz tone. In this case, a human can easily pick out the 2000 Hz tone from the mixture. However, if the method and formula of document 1 are used directly to recognize the 2000 Hz tone, by computing the distance between the mixture and a pure 2000 Hz tone, the mixture is judged far from the pure tone, and the 2000 Hz tone is not recognized. By contrast, a human who has first listened to a pure 2000 Hz sine wave knows that its loudness is zero at 200 Hz and every other frequency; he therefore ignores the 200 Hz component, considers only the 2000 Hz component, and still hears the 2000 Hz tone.
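The effect of ignoring bands that are silent in the pure sound can be sketched numerically. A minimal pure-Python sketch; the two band loudness values are hypothetical, chosen only to mirror the 200 Hz / 2000 Hz example above:

```python
# Hypothetical per-band loudness values (band 0 ~ 200 Hz, band 1 ~ 2000 Hz).
pure = [0.0, 1.0]    # pure 2000 Hz tone: loudness only in its own band
mixed = [2.0, 1.0]   # mixture: loud 200 Hz tone plus the same 2000 Hz tone

# Naive distance in the style of document 1: every band contributes,
# so the mixture looks "far" from the pure tone.
naive = sum(abs(m - p) for m, p in zip(mixed, pure))

# Ignoring bands where the pure sound is silent, as the text suggests,
# the mixture is at distance zero from the pure tone and is recognized.
masked = sum(abs(m - p) for m, p in zip(mixed, pure) if p > 0)
```

Here `naive` comes out to 2.0 while `masked` is 0.0, which is the gap the invention exploits.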
In addition, in a noisy environment, sounds of very small loudness are easily disturbed by interference; therefore, for speech recognition in noisy environments, the sounds of very small loudness in the pure speech need to be ignored.
Now assume there is a recorded speech sample, for example the word "north" of "Beijing" (hereinafter A). A lasts 0.5 seconds at a sampling rate of 8000 Hz, so it contains 4000 samples in total. First, A is divided into a number of overlapping or non-overlapping frames, and each frame is windowed with a window function (e.g., a Hamming, Hanning, or sin window). This application recommends at least 8x overlapping sampling and a sin window. For example, with 50-millisecond frames and 8x overlap, frame 1 of the speech consists of samples 1 to 400 of A, frame 2 of samples 51 to 450, frame 3 of samples 101 to 500, and so on. Each frame is then windowed with the sin window. Thus A is converted into a two-dimensional array E with elements E[n][m], where n runs from 1 to the total number of frames of A and m from 1 to 400, 400 being the number of samples per frame. Each row of E is then processed by the method of document 1 to produce the loudness on each Bark band of the human ear, so that E is converted into an array F with elements F[n][m], where n runs from 1 to the total number of frames of A and m from 1 to 24, 24 being the number of Bark bands of the human ear; F[x] herein denotes the row F[x][1] to F[x][24], i.e., the loudness of one frame of A on the 24 Bark bands, calculated per document 1. Other subdivisions are also possible; for example, splitting each Bark band equally in two gives 48 bands and better recognition. Now assume that when the speech A is played again at another time, A becomes G under the influence of noise. Similarly, G is converted by the method of document 1 into an array H with elements H[n][m], where n runs from 1 to the total number of frames and m from 1 to 24.
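The framing and windowing step above can be sketched as follows. This is a sketch under the example's stated assumptions (8000 Hz sampling, 50 ms frames, 8x overlap, sin window); the function name and the exact sin-window phase are mine, not the patent's:

```python
import math

def frames_with_sin_window(samples, frame_len=400, step=50):
    """Split a signal into 8x-overlapping frames of frame_len samples
    (step = frame_len / 8) and apply a sin window to each frame.
    The half-sample-shifted sin window used here is one common choice."""
    window = [math.sin(math.pi * (i + 0.5) / frame_len) for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

# 0.5 s at 8000 Hz -> 4000 samples -> (4000 - 400) / 50 + 1 = 73 frames.
frames = frames_with_sin_window([0.0] * 4000)
```

Each row of the result corresponds to one row of the array E in the text, ready for the per-frame Bark-loudness calculation of document 1.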
One row of H represents the loudness on the 24 Bark bands of the human ear, calculated by the method of document 1. To determine whether H contains the speech A, let the array P = abs(H − F), where abs is the absolute-value function. That is, each element of the array P equals the corresponding element of H minus the corresponding element of F, taken in absolute value.
For recognition in noisy environments, the elements of F whose loudness is too small must be ignored, because they are too easily disturbed by interference and become nearly unusable. As the threshold for "too small", this application recommends 1/4 to 1/2 of the maximum loudness value on any Bark band of the pure speech. For the human ear, at 1/4 of the loudness the acoustic power is only about 1/100, and even at 1/2 of the loudness the power is about 1/10; so although the loudness of these elements in the pure speech is not small, their actual acoustic power is, which makes them very susceptible to interference. In a quiet environment these sounds still aid recognition, but in a noisy environment they become unusable. Specifically, let mf be the value of the largest element in the array F; each element of F is checked, and if F[x][y] < mf/4, then P[x][y] = 0 and F[x][y] = 0 are set, so that these elements no longer influence the result in subsequent calculations; in other words, they are ignored.
Second, when calculating whether the sound to be recognized contains a certain speech, the per-element distance should be capped at the loudness of the corresponding Bark band of the pure speech. That is, each element P[x][y] of the array P is checked, and if P[x][y] > F[x][y], then P[x][y] = F[x][y]. For example, if P[2][5] equals 0.8 and F[2][5] equals 0.5, then P[2][5] is set to 0.5.
Then the sum of all elements in the array F is calculated to give sf, and the sum of all elements in the array P to give sp. Let d = sp/sf. If d is less than or equal to a small value, e.g. 0.2, the speech A is considered found in the sound G. Note that finding the speech A in the sound G does not exclude the possibility that G also contains other speech or sounds, such as the voices of other speakers talking at the same time, or background music.
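The whole comparison of Example 1 — ignore quiet elements of F, cap each per-element distance at the pure loudness, then form d = sp/sf — can be sketched in a few lines. A sketch only: the function name is mine and the two 2-frame, 2-band arrays are hypothetical illustration data:

```python
def bark_distance(F, H, ignore_ratio=0.25):
    """Compare pure-speech Bark-loudness array F with array H of the
    sound to be recognized, following Example 1: elements of F below
    ignore_ratio * max(F) are ignored, each per-element distance is
    capped at the pure loudness, and d = sp / sf is returned."""
    mf = max(max(row) for row in F)
    threshold = ignore_ratio * mf
    sf = sp = 0.0
    for frow, hrow in zip(F, H):
        for f, h in zip(frow, hrow):
            if f < threshold:        # too quiet in the pure speech: ignore
                continue
            p = min(abs(h - f), f)   # distance never exceeds the pure loudness
            sf += f
            sp += p
    return sp / sf

F = [[1.0, 0.1], [0.8, 0.05]]   # hypothetical pure-speech loudness
H = [[0.9, 2.0], [0.7, 1.5]]    # noisy version: junk lands in the quiet band
d = bark_distance(F, H)         # small d -> the speech is considered found
```

With these numbers the quiet second band is ignored entirely, so the noise dumped into it does not inflate d; the result is about 0.11, below the 0.2 threshold suggested in the text.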
Example 2:
Example 1 already gives good results, but some problems remain. For instance, suppose the pure speech is 0.5 seconds long and the sound to be recognized is 10 seconds long; the speech may start anywhere within those 10 seconds, whereas Example 1 assumes that the pure speech and the sound to be recognized have the same length and that the speech appears at the same position in both. The solution is to compare frame by frame. For example, assume both the sound to be recognized and the pure speech are sampled at 8000 Hz, the frame length is 50 milliseconds, and 8x overlapping sampling is used, so the frame step is 8000/(1000/50)/8 = 50 samples. If the pure speech A is 0.5 seconds long, it has 4000 samples. First, samples 1 to 4000 of the sound to be recognized are taken and tested for A by the method of Example 1; then the window advances one step to frame 2, i.e., samples 51 to 4050 are compared with the pure speech; then frame 3, frame 4, and so on. A problem is that the same speech may be recognized repeatedly, e.g., the 4000-sample windows starting at frames 2 and 3 may both recognize the speech A; so when the same pure speech is recognized at positions too close together, e.g., only 1 to 2 frames apart, the duplicates must be deleted.
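The frame-by-frame search with duplicate suppression can be sketched as follows. The function name is mine, and the match predicate is a stand-in for Example 1's d ≤ 0.2 test (here a toy marker check, purely for illustration):

```python
def find_speech(sound, pure_len=4000, step=50, is_match=None):
    """Slide a pure_len-sample window over the sound to be recognized in
    steps of one frame hop (step samples), testing each window with a
    match predicate. Hits within two hops of the previous hit are
    collapsed into one, as the text suggests for duplicate recognitions."""
    hits = []
    for start in range(0, len(sound) - pure_len + 1, step):
        if is_match(sound[start:start + pure_len]):
            if hits and start - hits[-1] <= 2 * step:
                continue  # same utterance recognized at an adjacent offset
            hits.append(start)
    return hits

# Toy predicate: "match" when the window begins with a marker value.
sound = [0.0] * 8000
sound[1000] = sound[1050] = 1.0   # would match at two adjacent hops
hits = find_speech(sound, is_match=lambda w: w[0] == 1.0)
```

The two adjacent raw matches at offsets 1000 and 1050 collapse into the single hit at 1000, which is the deduplication behavior described above.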
Furthermore, because of recording conditions and the like, the pure speech may appear softer or louder in the sound to be recognized. It is therefore also necessary to repeatedly multiply or divide the loudness of the pure speech by a small coefficient, e.g. 1.05, and compare it with the sound to be recognized each time, until the loudness of the pure speech and that of the sound to be recognized differ so much, e.g. by more than 10 times, that the sound to be recognized cannot plausibly contain the pure speech.
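The loudness-scale sweep described above can be sketched by generating the set of scale factors to try, assuming the coefficient 1.05 and the 10x cutoff from the text (the function name is mine):

```python
def scale_factors(factor=1.05, limit=10.0):
    """Generate the loudness scale factors to try: repeatedly multiply
    and divide by `factor` (1.05 here) until the scale exceeds `limit`
    (10x), beyond which the sound to be recognized is considered
    unlikely to contain the pure speech."""
    s = 1.0
    factors = [s]
    while s * factor <= limit:
        s *= factor
        factors.append(s)        # pure speech scaled louder
        factors.append(1.0 / s)  # pure speech scaled softer
    return sorted(factors)

facs = scale_factors()  # every factor lies within [1/10, 10]
```

At each factor, the pure-speech loudness array F is multiplied by the factor and compared against the sound to be recognized as in Example 1.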
In this application, "speech" and "sound" are almost always interchangeable. The above embodiments are only among the preferred embodiments of the invention; any changes or substitutions that a person skilled in the art can readily conceive within the scope of the invention shall fall within its scope of protection.

Claims (4)

1. A method for implementing speech recognition, in which a pure speech A is converted into a two-dimensional array F representing the loudness of the pure speech A on the Bark scale, with elements F[n][m], and a sound G to be recognized is converted into a two-dimensional array H representing the loudness of the sound G to be recognized on the Bark scale, with elements H[n][m], where n runs from 1 to the total number of frames of said G and m from 1 to 24, 24 being the number of Bark bands of the human ear, and in which an array P = abs(H − F) is formed, characterized in that:
when comparing the array F with the array H, mf being the value of the largest element in said F, each element in said F is checked, and if F[x][y] is less than a small threshold value, P[x][y] = 0 and F[x][y] = 0 are set;
then the sum of all elements in said F is calculated to obtain sf, and the sum of all elements in said P is calculated to obtain sp; letting d = sp/sf, if said d is less than or equal to a certain small value, said speech A is considered to be found in said sound G.
2. A method for implementing speech recognition, in which a pure speech A is converted into a two-dimensional array F representing the loudness of the pure speech A on the Bark scale, with elements F[n][m], and a sound G to be recognized is converted into a two-dimensional array H representing the loudness of the sound G to be recognized on the Bark scale, with elements H[n][m], where n runs from 1 to the total number of frames of said G and m from 1 to 24, 24 being the number of Bark bands of the human ear, and in which an array P = abs(H − F) is formed, characterized in that:
when calculating the distance between an element F[x][y] of the array F and the corresponding element H[x][y] of the array H, the calculated result is capped so as not to exceed the value of the element F[x][y];
then the sum of all elements in said F is calculated to obtain sf, and the sum of all elements in said P is calculated to obtain sp; letting d = sp/sf, if said d is less than or equal to a certain small value, said speech A is considered to be found in said sound G.
3. The method for implementing speech recognition according to claim 1 or claim 2, wherein, to calculate whether a sound G3 to be recognized, whose length differs from that of the pure speech A, contains the pure speech A, characterized in that:
a segment with the same length as the pure speech A is extracted frame by frame from the sound G3 to be recognized and, as the sound G to be recognized, compared with the pure speech A.
4. A method of implementing speech recognition according to claim 1 or claim 2, characterized in that:
the pure speech A is multiplied by a scale factor before being compared with the sound G to be recognized.
CN201610273827.9A 2016-04-29 2016-04-29 Speech and sound recognition method Active CN107342074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610273827.9A CN107342074B (en) 2016-04-29 2016-04-29 Speech and sound recognition method


Publications (2)

Publication Number Publication Date
CN107342074A CN107342074A (en) 2017-11-10
CN107342074B true CN107342074B (en) 2024-03-15

Family

ID=60221815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610273827.9A Active CN107342074B (en) 2016-04-29 2016-04-29 Speech and sound recognition method

Country Status (1)

Country Link
CN (1) CN107342074B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5864794A (en) * 1994-03-18 1999-01-26 Mitsubishi Denki Kabushiki Kaisha Signal encoding and decoding system using auditory parameters and bark spectrum
WO2003036621A1 (en) * 2001-10-22 2003-05-01 Motorola, Inc., A Corporation Of The State Of Delaware Method and apparatus for enhancing loudness of an audio signal
JP2004029215A (en) * 2002-06-24 2004-01-29 Auto Network Gijutsu Kenkyusho:Kk Method for evaluating voice recognition precision of voice recognition device
CN1655230A (en) * 2005-01-18 2005-08-17 中国电子科技集团公司第三十研究所 Noise masking threshold algorithm based Barker spectrum distortion measuring method in objective assessment of sound quality
CN102376306A (en) * 2010-08-04 2012-03-14 华为技术有限公司 Method and device for acquiring level of speech frame
CN103903612A (en) * 2014-03-26 2014-07-02 浙江工业大学 Method for performing real-time digital speech recognition

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US6701291B2 (en) * 2000-10-13 2004-03-02 Lucent Technologies Inc. Automatic speech recognition with psychoacoustically-based feature extraction, using easily-tunable single-shape filters along logarithmic-frequency axis
US20120233164A1 (en) * 2008-09-05 2012-09-13 Sourcetone, Llc Music classification system and method


Non-Patent Citations (3)

Title
K.K. Chu et al., "Perceptually non-uniform spectral compression for noisy speech recognition", 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003, pp. 404-407. *
Song Fangfang et al., "Research on the scoring mechanism of a spoken-English self-study system based on speech recognition technology", Computer Knowledge and Technology, 2009, vol. 5, no. 07, p. 1728. *
Yuan Xiugan et al., "The influence of noise on information transmission", in Human-Machine Engineering, Beihang University Press, 2002, p. 131. *

Also Published As

Publication number Publication date
CN107342074A (en) 2017-11-10

Similar Documents

Publication Publication Date Title
Spille et al. Predicting speech intelligibility with deep neural networks
US9916842B2 (en) Systems, methods and devices for intelligent speech recognition and processing
Ma et al. Efficient voice activity detection algorithm using long-term spectral flatness measure
Stern et al. Hearing is believing: Biologically inspired methods for robust automatic speech recognition
Shi et al. On the importance of phase in human speech recognition
Moritz et al. An auditory inspired amplitude modulation filter bank for robust feature extraction in automatic speech recognition
Zhang et al. Effects of telephone transmission on the performance of formant-trajectory-based forensic voice comparison–female voices
US9240190B2 (en) Formant based speech reconstruction from noisy signals
US10176824B2 (en) Method and system for consonant-vowel ratio modification for improving speech perception
Chuang et al. Speaker-Aware Deep Denoising Autoencoder with Embedded Speaker Identity for Speech Enhancement.
Hummersone A psychoacoustic engineering approach to machine sound source separation in reverberant environments
Mack et al. Single-Channel Dereverberation Using Direct MMSE Optimization and Bidirectional LSTM Networks.
CN107342074B (en) Speech and sound recognition method
Tchorz et al. Estimation of the signal-to-noise ratio with amplitude modulation spectrograms
Srinivasan et al. A model for multitalker speech perception
Dai et al. 2D Psychoacoustic modeling of equivalent masking for automatic speech recognition
JP3916834B2 (en) Extraction method of fundamental period or fundamental frequency of periodic waveform with added noise
Lee et al. Application of shape analysis techniques for improved CASA-based speech separation
Remes et al. Comparing human and automatic speech recognition in a perceptual restoration experiment
CN102222507B (en) Method and equipment for compensating hearing loss of Chinese language
Do et al. Combining cepstral normalization and cochlear implant-like speech processing for microphone array-based speech recognition
KR20100056859A (en) Voice recognition apparatus and method
Amano et al. Acoustic features of pop-out voice in babble noise
Moritz et al. Amplitude modulation filters as feature sets for robust ASR: constant absolute or relative bandwidth?
Andrijašević Effect of phoneme variations on blind reverberation time estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
DD01 Delivery of document by public notice

Addressee: Wang Rong

Document name: Notification of Patent Invention Entering into Substantive Examination Stage

GR01 Patent grant