CN110890087A - Voice recognition method and device based on cosine similarity



Publication number
CN110890087A
Authority
CN
China
Prior art keywords
feature vector
detected
vector sequence
voice
cosine
Legal status
Pending
Application number
CN201811049146.XA
Other languages
Chinese (zh)
Inventor
吴威
张楠赓
Current Assignee
Canaan Bright Sight Co Ltd
Original Assignee
Canaan Creative Co Ltd
Application filed by Canaan Creative Co Ltd
Priority to CN201811049146.XA
Publication of CN110890087A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates

Abstract

An embodiment of the invention provides a method and system for voice recognition based on cosine similarity. The method comprises the following steps: acquiring a voice to be detected; performing framing processing on the voice to be detected; obtaining a feature vector sequence to be detected by extracting a feature vector from each frame of the voice to be detected; calculating the cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by using a similarity algorithm based on cosine values; and recognizing the voice to be detected based on the cosine similarity. The embodiment reduces the error caused by the difference in volume between the recorded voice template and the voice to be detected, and thereby improves the recognition rate.

Description

Voice recognition method and device based on cosine similarity
Technical Field
The invention relates to the field of voice recognition, in particular to a voice recognition method and device based on cosine similarity.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In recent years, with the development of man-machine information interaction technology, speech recognition technology has shown its importance. In speech recognition, the Euclidean distance is usually used in the prior art as the measure for calculating the similarity between two time series for further recognition.
For example, Dynamic Time Warping (DTW) is one of the key technologies in speech recognition. DTW measures the similarity between two time series by stretching and compressing them along the time axis. The conventional DTW algorithm dynamically calculates the Euclidean distance between every pair of vectors drawn from the feature vector sequence to be detected and the template feature vector sequence, and finds the warping path with the minimum accumulated distance as the optimal path.
However, in the process of implementing the present invention, the inventors found that when the similarity between two voice feature vector sequences is calculated based on the Euclidean distance (for example, when the time-series similarity is calculated with a dynamic time warping algorithm), the similarity between vectors belonging to the two sequences may carry a large error due to differences in volume, which in turn reduces the recognition rate.
Disclosure of Invention
To address the prior-art problem that the recognition rate drops when the volume of the speech to be recognized differs from that of the reference template, the embodiments of the invention provide a speech recognition method and device based on cosine similarity, which reduce the error caused by volume differences and thereby improve the recognition rate.
In a first aspect of an embodiment of the present invention, a method for speech recognition based on cosine similarity is provided, where the method includes:
acquiring a voice to be detected;
performing framing processing on the voice to be detected;
acquiring a feature vector sequence to be detected by extracting a feature vector of the voice to be detected of each frame;
calculating cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by utilizing a similarity algorithm based on cosine values;
and selecting one preset template feature vector sequence in the at least one preset template feature vector sequence based on the cosine similarity to identify the voice to be detected.
In one embodiment, the method further comprises:
and obtaining the at least one preset template feature vector sequence by extracting the feature vector sequence of the at least one recorded voice template.
In one embodiment, the feature vector is a mel-frequency cepstrum coefficient (MFCC) feature vector.
In one embodiment, the extracting the feature vector of the speech to be detected for each frame further includes:
sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum s (m) of each frame of voice to be detected;
and performing Discrete Cosine Transform (DCT) on the logarithmic energy spectrum s (m) of each frame of voice to be detected by a table lookup method to obtain Mel Frequency Cepstrum Coefficient (MFCC) of each frame of voice to be detected.
In one embodiment, the preprocessing includes fast fourier transform, Mel filtering, and logarithmic operations, among others.
In one embodiment, the performing a Discrete Cosine Transform (DCT) on the log energy spectrum s (m) of each frame of speech to be detected by using a table lookup method to obtain mel-frequency cepstrum coefficients (MFCC) of each frame of speech to be detected further includes:
before extracting the feature vector of each frame of the voice to be detected, calculating the data X(n, m) corresponding to each value of (n, m) according to formula (1), and constructing an L×M lookup table;
wherein formula (1) is:
$$X(n, m) = \cos\left(\frac{\pi n (m + 0.5)}{M}\right) \tag{1}$$
where n = 1, 2, …, L, L being the predetermined MFCC dimension, and m = 0, 1, …, M-1, M being the predetermined number of Mel filters.
In an embodiment, the obtaining a mel-frequency cepstrum coefficient (MFCC) of each frame of speech to be tested by a table lookup method based on a log energy spectrum s (m) of each frame of speech to be tested further includes:
after the logarithmic energy spectrum s(m) of each frame of voice to be detected is obtained, calculating formula (2) for each value of n in turn;
setting the L calculation results thus obtained as an L-dimensional MFCC;
wherein formula (2) is specifically:
$$C(n) = \sum_{m=0}^{M-1} s(m)\, X(n, m) \tag{2}$$
where s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is read from the lookup table.
In one embodiment, M takes the value of 24, and L takes the value of 12.
In one embodiment, the similarity algorithm is a dynamic time warping algorithm.
In an embodiment, the calculating the cosine similarity between the feature vector sequence to be measured and at least one preset template feature vector sequence by using a cosine-value-based similarity algorithm specifically includes:
calculating cosine values between each feature vector of the feature vector sequence to be detected and each feature vector of each preset template feature vector sequence aiming at each preset template feature vector sequence;
constructing the cosine values into a matrix grid by using a dynamic time warping algorithm;
selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
according to the optimal path, cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence is obtained;
and the accumulated cosine value under the optimal path is the cosine similarity.
In an embodiment, calculating the cosine value between each feature vector of the feature vector sequence to be detected and each feature vector of the preset template feature vector sequence specifically includes:
for any pair of feature vectors, calculating the cosine value using formula (3);
wherein formula (3) is specifically:
$$\cos(A, B) = \frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^2}\,\sqrt{\sum_{n=1}^{L} B_n^2}} \tag{3}$$
where A is any feature vector of the feature vector sequence to be detected, B is any feature vector of the preset template feature vector sequence, A_n is the value of the n-th dimension of A, B_n is the value of the n-th dimension of B, L is the total dimension of the vectors, and n is an integer between 1 and L.
In an embodiment, the selecting one of the at least one preset template feature vector sequence based on the cosine similarity to perform speech recognition on the speech to be detected specifically includes one or more of the following:
recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
and identifying the voice to be detected according to a preset template feature vector sequence with the global maximum cosine similarity.
In a second aspect of the present invention, a speech recognition apparatus based on cosine similarity is provided, where the apparatus includes:
the acquisition module is used for acquiring the voice to be detected;
the framing module is used for framing the voice to be detected;
the feature extraction module is used for extracting a feature vector of the voice to be detected of each frame to obtain a feature vector sequence to be detected;
the cosine similarity calculation module is used for calculating cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by utilizing a cosine value-based similarity calculation method;
and the recognition module is used for selecting one preset template feature vector sequence in the at least one preset template feature vector sequence based on the cosine similarity to recognize the voice to be detected.
In one embodiment, the apparatus further comprises:
and the template feature vector sequence acquisition module is used for acquiring the at least one preset template feature vector sequence by extracting the feature vector sequence of the at least one recorded voice template.
In one embodiment, the feature vector is a mel-frequency cepstrum coefficient feature vector.
In one embodiment, the feature extraction module is further configured to:
sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum s (m) of each frame of voice to be detected;
and performing Discrete Cosine Transform (DCT) on the logarithmic energy spectrum s (m) of each frame of voice to be detected by a table lookup method to obtain Mel Frequency Cepstrum Coefficient (MFCC) of each frame of voice to be detected.
In one embodiment, the preprocessing includes fast fourier transform, Mel filtering, and logarithmic operations, among others.
In one embodiment, the performing a Discrete Cosine Transform (DCT) on the log energy spectrum s (m) of each frame of speech to be detected by using a table lookup method to obtain mel-frequency cepstrum coefficients (MFCC) of each frame of speech to be detected further includes:
before extracting the feature vector of each frame of the voice to be detected, calculating the data X(n, m) corresponding to each value of (n, m) according to formula (1), and constructing an L×M lookup table;
wherein formula (1) is specifically:
$$X(n, m) = \cos\left(\frac{\pi n (m + 0.5)}{M}\right) \tag{1}$$
where n = 1, 2, …, L, L being the preset Mel-frequency cepstrum coefficient dimension, and m = 0, 1, …, M-1, M being the preset number of Mel filters.
In an embodiment, the obtaining a mel-frequency cepstrum coefficient (MFCC) of each frame of speech to be tested by a table lookup method based on a log energy spectrum s (m) of each frame of speech to be tested further includes:
after the logarithmic energy spectrum s(m) of each frame of voice to be detected is obtained, calculating formula (2) for each value of n in turn;
setting the L calculation results thus obtained as an L-dimensional MFCC;
wherein formula (2) is specifically:
$$C(n) = \sum_{m=0}^{M-1} s(m)\, X(n, m) \tag{2}$$
where s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is read from the lookup table.
In one embodiment, M takes the value of 24, and L takes the value of 12.
In one embodiment, the similarity algorithm is a dynamic time warping algorithm.
In an embodiment, the cosine similarity calculation module is specifically configured to:
calculating cosine values between each feature vector of the feature vector sequence to be detected and each feature vector of each preset template feature vector sequence aiming at each preset template feature vector sequence;
constructing the cosine values into a matrix grid by using a dynamic time warping algorithm;
selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
according to the optimal path, cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence is obtained;
and the accumulated cosine value under the optimal path is the cosine similarity.
In one embodiment, the cosine similarity calculation module is further configured to:
for any pair of feature vectors, calculating the cosine value using formula (3);
wherein formula (3) is specifically:
$$\cos(A, B) = \frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^2}\,\sqrt{\sum_{n=1}^{L} B_n^2}} \tag{3}$$
where A is any feature vector of the feature vector sequence to be detected, B is any feature vector of the preset template feature vector sequence, A_n is the value of the n-th dimension of A, B_n is the value of the n-th dimension of B, L is the total dimension of the vectors, and n is an integer between 1 and L.
In an embodiment, the identification module is specifically configured to:
recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
and identifying the voice to be detected according to a preset template feature vector sequence with the global maximum cosine similarity.
The cosine-similarity-based voice recognition method and device provided by the embodiments of the invention replace the Euclidean distance measure in the dynamic time warping algorithm with a vector cosine similarity measure, which reduces the error caused by the volume difference between the recorded voice template and the voice to be detected and thereby improves the recognition rate.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 is a flowchart of a method for speech recognition based on cosine similarity according to an embodiment of the present invention;
FIG. 2 is a flow chart of another method for recognizing a speech based on cosine similarity according to an embodiment of the present invention;
FIG. 3 is a flowchart of another method for recognizing a speech based on cosine similarity according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech recognition apparatus based on cosine similarity according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of another speech recognition apparatus based on cosine similarity according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Exemplary method
The embodiment of the invention provides a voice recognition method based on cosine similarity.
FIG. 1 is a schematic flowchart of a voice recognition method based on cosine similarity according to an embodiment of the present invention. As shown in FIG. 1, the method may include, but is not limited to, the following steps; optionally, the recognition is isolated-word speech recognition:
s110: acquiring a voice to be detected;
in some embodiments, after obtaining the speech to be tested, the sound needs to be preprocessed, and the preprocessing may include: analog-to-digital conversion and pre-emphasis.
In one embodiment, analog-to-digital conversion converts the analog signal into a digital signal, that is, it converts the continuous sound waveform into discrete data points at a certain sampling rate and bit depth. Pre-emphasis can be implemented with a high-pass filter and is used to boost the energy of the high-frequency part of the sound: in the spectrum of a sound signal, the energy of the low-frequency part is usually higher than that of the high-frequency part, so the high-frequency energy of the collected voice is strengthened in advance so that the high-frequency and low-frequency parts have similar amplitudes, which improves recognition accuracy.
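As an illustration, a minimal pre-emphasis sketch in Python with NumPy; the first-order filter form y[t] = x[t] - a·x[t-1] and the coefficient 0.97 are common choices assumed here, not values given by the patent:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """First-order high-pass pre-emphasis: y[t] = x[t] - coeff * x[t-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])
```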
S120: and performing framing processing on the voice to be detected.
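For concreteness, a framing sketch follows; the 25 ms frame length and 10 ms hop are typical assumptions, since the patent does not fix these parameters:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D signal into overlapping frames, one frame per row."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])
```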
And S130, acquiring a feature vector sequence to be detected by extracting the feature vector of the voice to be detected of each frame.
In one embodiment, since the time-domain waveform of the sound only represents the time-varying relationship of the sound pressure and does not represent the characteristics of the sound well, the sound waveform must be converted into an acoustic feature vector. There are many sound feature extraction methods, such as mel-frequency cepstral coefficients MFCC, linear prediction cepstral coefficients LPCC, multimedia content description interface MPEG7, etc.
In one embodiment, the present embodiment may use the MFCC parameter vector as the sound feature vector for subsequent operations.
As will be appreciated by those skilled in the art, in general, extracting MFCC features from preprocessed speech may include the steps of:
(1) performing a discrete Fourier transform on each frame of preprocessed voice to obtain the corresponding spectrum;
(2) passing the spectrum through a Mel filter bank to obtain the Mel spectrum;
(3) performing cepstral analysis on the Mel spectrum to obtain the MFCC feature vector of that frame of voice.
The present invention is not limited to speech recognition using MFCC sound features, and may also use sound features such as LPCC (linear predictive cepstrum coefficient) and HZCRR (high zero crossing frame rate) to perform subsequent speech recognition.
According to the embodiment of the invention, the MFCC feature vector is used as the sound feature vector. The MFCC is a nonlinear feature: because what the human ear hears is not linearly proportional to the frequency of the sound, using MFCC features comes closer to the auditory characteristics of the human ear, which further improves the recognition rate.
S140: and calculating cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by utilizing a similarity algorithm based on cosine values.
In an embodiment, the preset template feature vector sequence is an MFCC feature vector sequence extracted in advance from the template speech; its length and the length of the feature vector sequence to be detected may differ.
In an embodiment, the similarity algorithm is a dynamic time warping algorithm.
In an embodiment, the similarity algorithm may also be other algorithms capable of calculating the similarity between two sound feature vector sequences, for example, a voiceprint recognition algorithm and a dynamic time warping algorithm.
It can be understood by those skilled in the art that, in the prior art, when the DTW algorithm computes the similarity of two time sequences, it generally does so from all the Euclidean distances between vectors belonging to the two sequences, whereas the cosine-value-based DTW algorithm of the embodiment of the present application computes all the cosine values between vectors belonging to the two sequences and derives the similarity of the two sequences from them.
S150: and identifying the voice to be detected based on the cosine similarity.
Specifically, one preset template feature vector sequence in the at least one preset template feature vector sequence is selected based on the cosine similarity to identify the voice to be detected.
In an embodiment, the greater the cosine similarity, the higher the degree of match between the preset template feature vector sequence and the voice to be detected. Each preset template feature vector sequence corresponds to a predetermined recognition result.
For example: when a household smart speaker leaves the factory, certain preset voice segments are generally bound to control instructions. For instance, a preset template feature vector sequence for 'turn on the air conditioner' is built into the smart speaker and corresponds to the control instruction that actually turns on the air conditioner.
In an embodiment, the recognition of the speech to be detected is not limited to the above matching by setting a threshold, and any method that can match one or more preset template feature vector sequences and perform speech recognition may be adopted.
In one embodiment, the present invention may employ the following recognition strategies:
(1) recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
in an embodiment, when template feature vector sequences in the template library are sequentially matched and identified, once a template feature vector sequence with a similarity value exceeding a preset threshold is matched, the template feature vector sequence is determined as an identification result.
(2) Recognizing the voice to be detected according to a preset template feature vector sequence with global maximum cosine similarity;
in an embodiment, when template feature vector sequences in the template library are sequentially matched and identified, after all the template feature vector sequences are matched, the template feature vector sequence with the maximum similarity is output as an identification result.
The preset rules may be used alone or in combination, and are not limited herein.
For example: among the similarity values obtained for a number of different templates, select the maximum one and judge whether it exceeds the preset threshold; if so, output the semantics corresponding to that template as the recognition result, and if not, output a no-match result.
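A sketch combining the two strategies as in the example above; the plain-dict template store is an illustrative assumption, and `dtw_cosine_similarity` is the cosine-based DTW scorer sketched later in this section:

```python
def recognize(test_seq, templates, threshold):
    """templates: dict mapping a label (e.g. 'turn on air conditioner')
    to its preset template feature vector sequence."""
    best_label, best_score = None, float("-inf")
    for label, template_seq in templates.items():
        score = dtw_cosine_similarity(test_seq, template_seq)
        if score > best_score:
            best_label, best_score = label, score
    # Global maximum plus threshold check, as in the example above.
    return best_label if best_score >= threshold else None
```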
As shown in fig. 2, in an embodiment, in step S130, the MFCC feature vector may be obtained by the following method:
s210: sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum s (m) of each frame of voice to be detected;
in one embodiment, the preprocessing may specifically include Fast Fourier Transform (FFT), Mel-filtering, and logarithmic operations.
In an embodiment, the following formula (4) may specifically be adopted to obtain s(m):
$$s(m) = \ln\left(\sum_{k=0}^{N-1} |X_a(k)|^2\, H_m(k)\right), \quad m = 0, 1, \ldots, M-1 \tag{4}$$
where X_a(k) is the FFT of the frame and H_m(k) is the frequency response of the m-th of the M triangular Mel filters, M being an integer greater than 1; L is the dimension of the MFCC feature, and 12 dimensions are generally sufficient to represent the acoustic feature; N is the number of FFT frequency points.
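A minimal NumPy sketch of this preprocessing chain; the triangular-filter construction follows the usual HTK-style Mel recipe, which the patent does not spell out, so treat those details as assumptions:

```python
import numpy as np

def hz_to_mel(f):
    """Standard HTK-style Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def log_mel_energies(frame, sample_rate, n_fft=512, n_filters=24):
    """Formula (4): s(m) = ln(sum_k |X(k)|^2 * H_m(k)), m = 0..M-1."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # |X_a(k)|^2
    # Triangular Mel filterbank over the rfft bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return np.log(H @ power + 1e-10)   # s(m); epsilon added for numerical safety
```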
S220: and performing Discrete Cosine Transform (DCT) on the logarithmic energy spectrum s (m) of each frame of voice to be detected by a table lookup method to obtain Mel Frequency Cepstrum Coefficient (MFCC) of each frame of voice to be detected.
In an embodiment, the process of building the lookup table may include:
(a) calculating the data X(n, m) corresponding to each value of (n, m) according to the following formula (1), and constructing an L×M lookup table;
wherein the formula is:
$$X(n, m) = \cos\left(\frac{\pi n (m + 0.5)}{M}\right) \tag{1}$$
where n = 1, 2, …, L, L being an integer greater than 1, and m = 0, 1, …, M-1, M being an integer greater than 1.
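A sketch of step (a), building the L×M table from the cosine basis of formula (1) as reconstructed above:

```python
import numpy as np

def build_dct_table(L: int = 12, M: int = 24) -> np.ndarray:
    """L x M lookup table: table[n-1, m] = X(n, m) = cos(pi * n * (m + 0.5) / M)."""
    n = np.arange(1, L + 1)[:, None]   # MFCC dimension index, 1..L
    m = np.arange(M)[None, :]          # Mel filter index, 0..M-1
    return np.cos(np.pi * n * (m + 0.5) / M)
```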
In an embodiment, the calculating a mel-frequency cepstrum coefficient (MFCC) according to the lookup table and the obtained log energy spectrum s (m) of each frame of the speech to be detected may specifically include:
(b) taking each value of n in turn and calculating the following formula (2), obtaining the L calculation results C(1), C(2), …, C(L);
(c) setting the L calculation results thus obtained as an L-dimensional MFCC;
wherein the formula is specifically:
$$C(n) = \sum_{m=0}^{M-1} s(m)\, X(n, m) \tag{2}$$
where n = 1, 2, …, L, L being an integer greater than 1, and m = 0, 1, …, M-1, M being an integer greater than 1; s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is read from the lookup table.
The above process of calculating mel-frequency cepstrum coefficients is explained in detail below using a specific example:
For example, to compute C(1), n takes the value 1: first, X(1, 0) is read from the lookup table and multiplied by s(0); then X(1, 1) is read from the lookup table and multiplied by s(1); and so on. Taking m through the values 0 to M-1 in turn and accumulating all the products yields C(1), the first-dimension coefficient of the MFCC.
C(1), …, C(L) are calculated in sequence in this way, giving the L-dimensional MFCC.
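Steps (b) and (c) then reduce to a single matrix-vector product; a sketch reusing the `build_dct_table` and `log_mel_energies` helpers assumed above:

```python
import numpy as np

def mfcc_from_table(log_energies: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Formula (2): C(n) = sum over m of s(m) * X(n, m), for n = 1..L."""
    return table @ log_energies        # (L, M) @ (M,) -> (L,)

# Usage sketch for one frame:
# table = build_dct_table(L=12, M=24)
# s = log_mel_energies(frame, sample_rate=16000, n_filters=24)
# mfcc = mfcc_from_table(s, table)     # L-dimensional MFCC vector
```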
In the embodiment of the invention, pre-storing the cosine terms in a lookup table speeds up the extraction of MFCC features from the voice, which shortens the overall speech recognition time and improves efficiency. In addition, especially for embedded hardware recognition devices, the calculation becomes simpler, saving manufacturing cost and device space.
In an embodiment, M takes the value 24 and L takes the value 12; n takes the values 1, 2, 3, …, 12 in turn, the entries corresponding to each value of n are calculated in advance according to the above formula, and the dimension index n together with the calculated results is pre-stored in an array, so that subsequent calculations obtain the corresponding values directly by looking up the array.
The number of filters M, the MFCC dimension L, and the other related coefficients are not specifically limited and may be adjusted for the specific application scenario; for example, the MFCC dimension L can be set to 16. In the present application, M takes the value 24 and L takes the value 12, but the application is not limited thereto.
In the embodiment of the invention, practical verification shows that 12 dimensions express the characteristics of the sound fairly completely, and 24 filters give a more stable result. Therefore, considering the amount of calculation, a 12-dimensional MFCC with 24 filters is the better choice.
In an embodiment, the MFCC feature vector may instead be calculated by a prior-art cepstral analysis method, which is not limited herein.
In an embodiment of the present invention, after feature extraction, the MFCC feature vector sequence to be detected (A_1, A_2, …, A_Q) of the voice to be detected is obtained, where Q is an integer greater than 1 and A_1 through A_Q are arranged in time order.
As shown in FIG. 3, in one embodiment, the step S140 of the present invention may include steps S310 to S340.
In an embodiment, a preset template feature vector sequence may be (B_1, B_2, …, B_T), where T is an integer greater than 1 and B_1 through B_T are arranged in time order.
S310: and aiming at each preset template feature vector sequence, calculating a cosine value between each feature vector of the feature vector sequence to be detected and each feature vector of each preset template feature vector sequence.
For any vector A_q in the feature vector sequence to be detected and any vector B_t in the preset template feature vector sequence, where 1 ≤ q ≤ Q and 1 ≤ t ≤ T, write A_q = (A_1, A_2, …, A_L) and B_t = (B_1, B_2, …, B_L), where L is the vector dimension, which for sound feature vectors generally takes an integer value from 12 to 16. The cosine of the angle between the two vectors can be calculated using the following formula (5):
$$\cos(A_q, B_t) = \frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^2}\,\sqrt{\sum_{n=1}^{L} B_n^2}} \tag{5}$$
where n is an integer between 1 and L.
It will be understood by those skilled in the art that for two vectors, the closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees, indicating that the two vectors are more similar, i.e., cosine similar.
The cosine distance uses the cosine of the angle between two vectors as the measure of difference between two individuals. Whereas the Euclidean distance tends to reflect absolute differences in the numerical features of individuals, the cosine distance focuses on the difference in direction between the two vectors and is insensitive to absolute magnitude, thereby correcting for possibly inconsistent measurement scales between the compared time series.
Therefore, comparing the feature vectors of the feature vector sequence to be detected and the template feature vector sequence by vector cosine similarity overcomes the low recognition rate caused by unbalanced signal strength between the voice to be detected and the recorded template, such as an excessive volume difference.
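Formula (5) transcribes directly; a sketch with NumPy:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between feature vectors a and b (formula (5))."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```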
S320: and constructing the cosine values into a matrix grid by utilizing a dynamic time warping algorithm.
As shown in Table 1 below, the cosine values of the angles between all vectors of the feature vector sequence to be detected and all vectors of the preset template feature vector sequence are calculated and arranged into a matrix grid, with the cosine values as matrix elements. Because both sequences are time series, the matrix grid follows time order from left to right and from bottom to top.
B_T | cos(A_1, B_T)   cos(A_2, B_T)   …   cos(A_Q, B_T)
⋮   |                 cos(A_q, B_t)
B_2 | cos(A_1, B_2)   cos(A_2, B_2)   …   cos(A_Q, B_2)
B_1 | cos(A_1, B_1)   cos(A_2, B_1)   …   cos(A_Q, B_1)
    |  A_1             A_2            …    A_Q

TABLE 1
S330: selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
s340: and obtaining the cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence according to the optimal path.
In one embodiment, consistent with the conventional DTW algorithm, the selection of the optimal path must satisfy at least three constraints: 1) boundary condition: the speaking speed of any voice may vary, but the order of its parts cannot change, so the selected path must start at the lower-left corner and end at the upper-right corner; 2) continuity: the path cannot skip over a point to match; it can only align with adjacent points; 3) monotonicity: the path must be monotonic in time and therefore cannot move left or down.
In an embodiment, the embodiment of the present application calculates a path with the largest accumulated cosine value according to the cost of each element in the matrix grid.
In the matrix grid, the cost of a point is the cosine value of that point plus the largest of the accumulated values from the three directions below, to the left, and diagonally below-left; these values can be obtained recursively in sequence, back to the point (1, 1):
$$S(A_q, B_t) = \cos(A_q, B_t) + \max\left[\,S(A_{q-1}, B_t),\; S(A_q, B_{t-1}),\; S(A_{q-1}, B_{t-1})\,\right]$$
That is, the accumulated cosine value S(A_q, B_t) at the current grid point is the cosine value cos(A_q, B_t) of that point, i.e. the cosine similarity of the two vectors corresponding to the grid point, plus the largest accumulated cosine value among the adjacent grid points from which the current point can be reached. When the accumulation reaches the end point corresponding to vector A_Q and vector B_T, the resulting S(A_Q, B_T) is the accumulated cosine value, and the path with the globally largest accumulated cosine value is the optimal warping path. The accumulated cosine value along the optimal path is the cosine similarity.
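A sketch of the accumulation just described; this is the `dtw_cosine_similarity` assumed in the earlier recognition sketch, and it reuses the `cosine` helper above. Path-length normalization is omitted because the patent does not mention it:

```python
import numpy as np

def dtw_cosine_similarity(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """seq_a: (Q, L) sequence under test; seq_b: (T, L) template sequence.
    Returns S(A_Q, B_T), the maximal accumulated cosine value."""
    Q, T = len(seq_a), len(seq_b)
    S = np.full((Q + 1, T + 1), -np.inf)
    S[0, 0] = 0.0                      # boundary condition: path starts at (1, 1)
    for q in range(1, Q + 1):
        for t in range(1, T + 1):
            c = cosine(seq_a[q - 1], seq_b[t - 1])
            # Continuity and monotonicity: only lower, left, or diagonal moves.
            S[q, t] = c + max(S[q - 1, t], S[q, t - 1], S[q - 1, t - 1])
    return float(S[Q, T])
```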
According to the embodiment of the invention, by using the dynamic time warping algorithm based on the cosine similarity, the adverse effects caused by different time lengths and different sound energies between two sound characteristic vector sequences can be simultaneously overcome, and the improvement of the recognition rate is facilitated.
The method of computing the accumulated cosine value is not particularly limited; any optimal-path computation that satisfies the boundary, continuity, and monotonicity constraints may be adopted, whether a prior-art method that applies a secondary constraint to the path boundary to reduce the amount of computation, or a method that divides the matrix grid into several local matrix grids, computes them separately, and then sums the results. The present application is only illustrated by the computation used in the exemplary embodiment above, but is not limited thereto.
It can be understood by those skilled in the art that the above technical solution of calculating the similarity between time series with a cosine-similarity-based dynamic time warping algorithm can equally be applied to other recognition fields, for example semantic recognition, gesture recognition, action recognition, and any other recognition field whose data have time-series properties (i.e., can be converted into time series).
In summary, the method and device for speech recognition based on cosine similarity provided by the embodiments of the present invention replace the Euclidean distance measure in the dynamic time warping algorithm with a vector cosine similarity measure, thereby reducing the error caused by the difference in volume between the recorded voice template and the voice to be detected and further improving the recognition rate.
Exemplary device
The embodiment of the invention provides a voice recognition device based on cosine similarity.
Fig. 4 is a schematic block diagram of a voice recognition apparatus based on cosine similarity according to an embodiment of the present invention. As shown in Fig. 4, the apparatus 400 may include, but is not limited to, the following modules.
the obtaining module 410 is configured to obtain a voice to be detected.
In some embodiments, after obtaining the speech to be tested, the sound needs to be preprocessed, and the preprocessing may include: analog-to-digital conversion and pre-emphasis.
In one embodiment, analog-to-digital conversion converts the analog signal into a digital signal, that is, it converts the continuous sound waveform into discrete data points at a certain sampling rate and bit depth. Pre-emphasis may be implemented with a high-pass filter and is used to boost the energy of the high-frequency part of the sound: in the spectrum of a sound signal, the energy of the low-frequency part is usually higher than that of the high-frequency part, so to give the high-frequency and low-frequency parts similar amplitudes, the high-frequency energy of the collected voice is pre-emphasized, which improves recognition accuracy.
And a framing module 420, configured to perform framing processing on the speech to be detected.
And the feature extraction module 430 is configured to obtain a feature vector sequence to be detected by extracting a feature vector of the speech to be detected in each frame.
In one embodiment, since the time-domain waveform of the sound only represents the variation of sound pressure over time and does not represent the characteristics of the sound well, the sound waveform must be converted into an acoustic feature vector. There are many kinds of sound features that can be extracted, such as Mel-frequency cepstrum coefficients (MFCC), linear prediction cepstrum coefficients (LPCC), the multimedia content description interface MPEG-7, etc.
In one embodiment, the present embodiment uses the MFCC parameter vectors as acoustic feature vectors for subsequent speech recognition.
As will be appreciated by those skilled in the art, in general, extracting MFCC features from preprocessed speech may include the steps of: (1) performing discrete Fourier transform on each frame of preprocessed voice to obtain a corresponding frequency spectrum; (2) the spectrum passes through a Mel filter bank to obtain a Mel spectrum; (3) performing cepstrum analysis on the Mel spectrum to obtain MFCC feature vectors of the frame of speech.
The present invention is not limited to speech recognition using MFCC sound features, and may also use sound features such as LPCC (linear predictive cepstrum coefficient) and HZCRR (high zero crossing frame rate) to perform subsequent speech recognition.
The cosine similarity calculation module 440 is configured to calculate cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by using a cosine value-based dynamic time warping algorithm.
In an embodiment, the preset template feature vector sequence is an MFCC feature vector sequence extracted in advance from the template speech; its length and the length of the feature vector sequence to be detected may differ.
In the prior art, when the DTW algorithm calculates the similarity between two sequences, it usually analyzes and calculates from all the Euclidean distances between vectors belonging to the two sequences, whereas the cosine-value-based DTW algorithm of the embodiment of the present application calculates all the cosine values between vectors belonging to the two sequences and derives the similarity of the two sequences from them.
And the recognition module 450 is configured to recognize the speech to be detected based on the cosine similarity.
Specifically, the recognition module 450 selects one of the at least one preset template feature vector sequence to recognize the to-be-detected speech based on the cosine similarity.
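As an illustration only, the modules might be wired together as below; the class name and the helpers (`pre_emphasis`, `frame_signal`, `log_mel_energies`, `build_dct_table`, `mfcc_from_table`, `recognize`) come from the sketches in the method section above and are assumptions, not the patent's API:

```python
import numpy as np

class CosineSimilarityRecognizer:
    """Sketch of apparatus 400: acquisition -> framing -> feature extraction
    -> cosine-based DTW scoring -> recognition decision."""

    def __init__(self, templates, threshold, sample_rate=16000):
        self.templates = templates                 # label -> template feature sequence
        self.threshold = threshold
        self.sample_rate = sample_rate
        self.table = build_dct_table(L=12, M=24)   # lookup table module 431

    def extract_features(self, signal: np.ndarray) -> np.ndarray:
        frames = frame_signal(pre_emphasis(signal), self.sample_rate)
        return np.stack([mfcc_from_table(log_mel_energies(f, self.sample_rate),
                                         self.table) for f in frames])

    def recognize_speech(self, signal: np.ndarray):
        return recognize(self.extract_features(signal), self.templates, self.threshold)
```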
In an embodiment, the greater the cosine similarity, the higher the degree of match between the template feature vector sequence and the voice to be detected. Each preset template feature vector sequence corresponds to a predetermined recognition result.
For example: when a household smart speaker leaves the factory, certain preset voice segments are generally bound to control instructions. For instance, a preset template feature vector sequence for 'turn on the air conditioner' is built into the smart speaker and corresponds to the control instruction that actually turns on the air conditioner.
In the embodiment of the present invention, the recognition of the speech to be detected is not limited to the above matching by setting the threshold, and any method that can match one or more preset template feature vector sequences and perform speech recognition may be adopted.
In one embodiment, the recognition module 450 of the invention can be configured to:
(1) recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
in an embodiment, when template feature vector sequences in a template library are sequentially matched and identified, once a template feature vector sequence with a similarity value exceeding a preset threshold is matched, the template feature vector sequence is judged as an identification result.
(2) Recognizing the voice to be detected according to a preset template feature vector sequence with global maximum cosine similarity;
in an embodiment, when template feature vector sequences in the template library are sequentially matched and identified, after all the template feature vector sequences are matched, the template feature vector sequence with the maximum similarity is output as an identification result.
The preset rules may be used alone or in combination, and are not limited herein. For example: among the similarity values obtained for a number of different templates, select the maximum one and judge whether it exceeds the preset threshold; if so, output the semantics corresponding to that template as the recognition result, and if not, output a no-match result.
In one embodiment, the following method may be used to obtain the MFCC feature vector:
(1) sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum s (m) of each frame of voice to be detected;
in one embodiment, the preprocessing may specifically include Fast Fourier Transform (FFT), Mel-filtering, and logarithmic operations.
In an embodiment, the following formula (4) may be specifically adopted to obtain s (m):
$$s(m) = \ln\left(\sum_{k=0}^{N-1} |X_a(k)|^2\, H_m(k)\right), \quad m = 0, 1, \ldots, M-1 \tag{4}$$
where X_a(k) is the FFT of the frame and H_m(k) is the frequency response of the m-th of the M triangular Mel filters, M being an integer greater than 1; L is the dimension of the MFCC feature, and 12 dimensions are generally sufficient to represent the acoustic feature; N is the number of FFT frequency points.
(2) And performing Discrete Cosine Transform (DCT) on the logarithmic energy spectrum s (m) of each frame of voice to be detected by a table lookup method to obtain Mel Frequency Cepstrum Coefficient (MFCC) of each frame of voice to be detected.
In one embodiment, as shown in FIG. 5, the feature extraction module 430 may include the following sub-modules:
and the lookup table module 431 is configured to, for each frame of voice to be detected, directly read the pre-stored table values according to the MFCC feature vector dimension and obtain the MFCC feature vector.
In an embodiment, the data building and data using process of the lookup table module 431 may include:
(a) calculating the data X(n, m) corresponding to each value of (n, m) according to the following formula (1), and constructing an L×M lookup table;
wherein the formula is:
$$X(n, m) = \cos\left(\frac{\pi n (m + 0.5)}{M}\right) \tag{1}$$
where n = 1, 2, …, L, L being an integer greater than 1, and m = 0, 1, …, M-1, M being an integer greater than 1.
In an embodiment, the calculating a mel-frequency cepstrum coefficient (MFCC) according to the lookup table and the obtained log energy spectrum s (m) of each frame of the speech to be detected may specifically include:
(b) taking each value of n in turn and calculating the following formula (2), obtaining the L calculation results C(1), C(2), …, C(L);
(c) setting the L calculation results thus obtained as an L-dimensional MFCC;
wherein the formula is specifically:
$$C(n) = \sum_{m=0}^{M-1} s(m)\, X(n, m) \tag{2}$$
where n = 1, 2, …, L, L being an integer greater than 1, and m = 0, 1, …, M-1, M being an integer greater than 1; s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is read from the lookup table.
The above process of calculating mel-frequency cepstrum coefficients is explained in detail below using a specific example:
For example, to compute C(1), n takes the value 1: first, X(1, 0) is read from the lookup table and multiplied by s(0); then X(1, 1) is read from the lookup table and multiplied by s(1); and so on. Taking m through the values 0 to M-1 in turn and accumulating all the products yields C(1), the first-dimension coefficient of the MFCC.
C(1), …, C(L) are calculated in sequence in this way, giving the L-dimensional MFCC.
In an embodiment, M takes the value 24 and L takes the value 12; n takes the values 1, 2, 3, …, 12 in turn, the entries corresponding to each value of n are calculated in advance according to the above formula, and the dimension index n together with the calculated results is pre-stored in an array, so that subsequent calculations obtain the corresponding values directly by looking up the array.
In the embodiment of the invention, pre-storing the lookup table speeds up the extraction of MFCC features from the voice, which shortens the overall speech recognition time and improves efficiency. In addition, especially for embedded hardware recognition devices, the calculation becomes simpler, saving manufacturing cost and device space.
The number of filters M, the MFCC dimension L, and the other related coefficients are not specifically limited and may be adjusted for the specific application scenario; for example, the MFCC dimension L can be set to 16. In the present application, M takes the value 24 and L takes the value 12, but the application is not limited thereto.
In the embodiment of the invention, practical verification shows that 12 dimensions express the characteristics of the sound fairly completely, and 24 filters give a more stable result. Therefore, considering the amount of calculation, a 12-dimensional MFCC with 24 filters is the better choice.
In an embodiment, the above MFCC feature vector extraction may instead be implemented by a prior-art cepstral analysis device, which is not limited herein.
In an embodiment of the present invention, after feature extraction, the MFCC feature vector sequence to be detected (A_1, A_2, …, A_Q) of the voice to be detected is obtained, where Q is an integer greater than 1 and A_1 through A_Q are arranged in time order.
In an embodiment, the cosine similarity calculation module of the present invention may be specifically configured to:
S310: for each preset template feature vector sequence, calculating the cosine value between each feature vector of the feature vector sequence to be detected and each feature vector of that preset template feature vector sequence. In an embodiment, a preset template feature vector sequence may be (B_1, B_2, …, B_T), where T is an integer greater than 1 and B_1 through B_T are arranged in time order.
For any vector A_q in the feature vector sequence to be detected and any vector B_t in the preset template feature vector sequence, where 1 ≤ q ≤ Q and 1 ≤ t ≤ T, write A_q = (A_1, A_2, …, A_L) and B_t = (B_1, B_2, …, B_L), where L is the vector dimension, which for sound feature vectors generally takes a positive integer value from 12 to 16. The cosine of the angle between the two vectors is calculated using the following formula:
$$\cos(A_q, B_t) = \frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^2}\,\sqrt{\sum_{n=1}^{L} B_n^2}}$$
It will be understood by those skilled in the art that for two vectors, the closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees, indicating that the two vectors are more similar, i.e., cosine similar.
Because the cosine of the angle between two vectors is used as the measure of difference between two individuals, the cosine distance, compared with the Euclidean distance's tendency to reflect absolute differences in numerical features, focuses on the difference in direction between the two vectors and is insensitive to absolute magnitude, thereby correcting for possibly inconsistent measurement scales between the compared time series.
Therefore, comparing the feature vectors of the feature vector sequence to be detected and the template feature vector sequence by vector cosine similarity overcomes the low recognition rate caused by unbalanced signal strength between the voice to be detected and the recorded template, such as an excessive volume difference.
S320: and constructing the cosine values into a matrix grid by utilizing a dynamic time warping algorithm.
As shown in Table 1 below, the cosine values of the angles between all vectors of the feature vector sequence to be detected and all vectors of the preset template feature vector sequence are calculated and arranged into a matrix grid, with the cosine values as matrix elements. Because both sequences are time series, the matrix grid follows time order from left to right and from bottom to top.
B_T | cos(A_1, B_T)   cos(A_2, B_T)   …   cos(A_Q, B_T)
⋮   |                 cos(A_q, B_t)
B_2 | cos(A_1, B_2)   cos(A_2, B_2)   …   cos(A_Q, B_2)
B_1 | cos(A_1, B_1)   cos(A_2, B_1)   …   cos(A_Q, B_1)
    |  A_1             A_2            …    A_Q

TABLE 1
S330: selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
s340: and obtaining the cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence according to the optimal path.
In one embodiment, consistent with the conventional DTW algorithm, the selection of the optimal path must satisfy at least three constraints: 1) boundary condition: the speaking speed of any voice may vary, but the order of its parts cannot change, so the selected path must start at the lower-left corner and end at the upper-right corner; 2) continuity: the path cannot skip over a point to match; it can only align with adjacent points; 3) monotonicity: the path must be monotonic in time and therefore cannot move left or down.
In an embodiment, the embodiment of the present application calculates a path with the largest accumulated cosine value according to the cost of each element in the matrix grid.
In the matrix grid, the cost of a point is the cosine value of that point plus the largest of the accumulated values from the three directions below, to the left, and diagonally below-left; these values can be obtained recursively in sequence, back to the point (1, 1):
$$S(A_q, B_t) = \cos(A_q, B_t) + \max\left[\,S(A_{q-1}, B_t),\; S(A_q, B_{t-1}),\; S(A_{q-1}, B_{t-1})\,\right]$$
That is, the accumulated cosine value S(A_q, B_t) at the current grid point is the cosine value cos(A_q, B_t) of that point, i.e. the cosine similarity of the two vectors corresponding to the grid point, plus the largest accumulated cosine value among the adjacent grid points from which the current point can be reached. When the accumulation reaches the end point corresponding to vector A_Q and vector B_T, the resulting S(A_Q, B_T) is the accumulated cosine value, and the path with the globally largest accumulated cosine value is the optimal warping path. The accumulated cosine value along the optimal path is the cosine similarity.
According to the embodiment of the invention, by adopting the dynamic time warping algorithm based on the cosine similarity, the adverse effects caused by different time lengths and different sound energies between two sound characteristic vector sequences can be simultaneously overcome, and the improvement of the recognition rate is facilitated.
The method of computing the accumulated cosine value is not particularly limited; any optimal-path computation that satisfies the boundary, continuity, and monotonicity constraints may be adopted, whether a prior-art method that applies a secondary constraint to the path boundary to reduce the amount of computation, or a method that divides the matrix grid into several local matrix grids, computes them separately, and then sums the results. The present application is only illustrated by the computation used in the exemplary embodiment above, but is not limited thereto.
It can be understood by those skilled in the art that the above technical solution of calculating the similarity between time series with a cosine-similarity-based dynamic time warping algorithm can equally be applied to other recognition fields, for example semantic recognition, gesture recognition, action recognition, and any other recognition field whose data have time-series properties (i.e., can be converted into time series).
In summary, the speech recognition device based on cosine similarity provided by the embodiment of the present invention replaces the Euclidean distance measure in the dynamic time warping algorithm with a vector cosine similarity measure, thereby reducing the error caused by the difference in volume between the recorded voice template and the voice to be detected and further improving the recognition rate.

Claims (24)

1. A speech recognition method based on cosine similarity, the method comprising:
acquiring a voice to be detected;
performing framing processing on the voice to be detected;
acquiring a feature vector sequence to be detected by extracting a feature vector of the voice to be detected of each frame;
calculating cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by utilizing a similarity algorithm based on cosine values;
and selecting one preset template feature vector sequence in the at least one preset template feature vector sequence based on the cosine similarity to identify the voice to be detected.
2. The method of claim 1, further comprising:
and obtaining the at least one preset template feature vector sequence by extracting the feature vector sequence of the at least one recorded voice template.
3. The method of claim 1, wherein the feature vector is a mel-frequency cepstral coefficient feature vector.
4. The method of claim 3, wherein the extracting the feature vector of the speech to be detected for each frame further comprises:
sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum of each frame of voice to be detected;
and performing discrete cosine transform on the logarithmic energy spectrum of each frame of voice to be detected by a table lookup method to obtain the Mel frequency cepstrum coefficient of each frame of voice to be detected.
5. The method of claim 4, wherein the pre-processing comprises fast Fourier transform, Mel filtering, and logarithmic operation.
6. The method of claim 4, wherein the obtaining the mel-frequency cepstrum coefficients of each frame of the speech to be tested by performing discrete cosine transform on the logarithmic energy spectrum of each frame of the speech to be tested through a table lookup method further comprises:
before extracting the feature vector of the voice to be detected of each frame, calculating data X (n, M) corresponding to the value of each (n, M) according to a formula (1), and constructing an LxM lookup table;
wherein formula (1) is:

$$X(n,m)=\cos\left(\frac{\pi\,n\,(2m+1)}{2M}\right)\qquad(1)$$

wherein n = 1, 2, ..., L, L being a preset Mel-frequency cepstrum coefficient dimension, and m = 0, 1, 2, ..., M-1, M being a preset number of Mel filters.
7. The method of claim 6, wherein the obtaining the mel-frequency cepstrum coefficients of each frame of the speech to be tested by a table lookup method based on the log energy spectrum of each frame of the speech to be tested further comprises:
after the logarithmic energy spectrum of each frame of voice to be detected is obtained, sequentially evaluating formula (2) for each value of n;
obtaining an L-dimensional Mel frequency cepstrum coefficient from the L calculation results;
wherein formula (2) is specifically:

$$\mathrm{MFCC}(n)=\sum_{m=0}^{M-1} s(m)\,X(n,m)\qquad(2)$$

wherein s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is acquired through the lookup table.
8. The method of claim 6 or 7, wherein M takes the value 24, and wherein L takes the value 12.
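As a sketch of the lookup-table approach of claims 6 to 8, the table of formula (1) is built once and formula (2) then reduces to a single matrix-vector product per frame; note that the cosine-basis form of X(n, m) used below is the standard MFCC discrete cosine transform and is an assumption here:

```python
import numpy as np

L_DIM, M_FILT = 12, 24   # claim 8: L = 12 cepstral dimensions, M = 24 Mel filters

def build_dct_table(L: int = L_DIM, M: int = M_FILT) -> np.ndarray:
    """Precompute X(n, m) of formula (1) once, before any frame is processed."""
    n = np.arange(1, L + 1)[:, None]   # n = 1, ..., L
    m = np.arange(M)[None, :]          # m = 0, ..., M-1
    return np.cos(np.pi * n * (2 * m + 1) / (2 * M))   # L x M lookup table

DCT_TABLE = build_dct_table()

def mfcc_from_log_energy(s) -> np.ndarray:
    """Formula (2): MFCC(n) = sum over m of s(m) * X(n, m).

    s: length-M logarithmic energy spectrum of one frame (after the FFT,
    Mel filtering and logarithm of claim 5)."""
    return DCT_TABLE @ np.asarray(s, dtype=float)   # L-dimensional MFCC vector
```

Precomputing the table replaces the per-frame cosine evaluations of the discrete cosine transform with lookups and multiply-accumulates, which is presumably the motivation for the lookup method.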
9. The method according to claim 1, wherein the calculating the cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by using a similarity algorithm based on cosine values specifically comprises: calculating the cosine similarity between the feature vector sequence to be detected and the at least one preset template feature vector sequence by using a cosine-value-based dynamic time warping algorithm.
10. The method according to claim 9, wherein the calculating the cosine similarity between the feature vector sequence to be measured and at least one preset template feature vector sequence by using a cosine-value-based dynamic time warping algorithm specifically comprises:
calculating cosine values between each feature vector of the feature vector sequence to be detected and each feature vector of each preset template feature vector sequence aiming at each preset template feature vector sequence;
constructing the cosine values into a matrix grid by using a dynamic time warping algorithm;
selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
according to the optimal path, cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence is obtained;
and the accumulated cosine value under the optimal path is the cosine similarity.
11. The method according to claim 10, wherein calculating the cosine values between each feature vector of the feature vector sequence to be detected and each feature vector of the preset template feature vector sequence specifically comprises:
for any group of feature vectors, calculating cosine values by adopting a formula (3);
wherein formula (3) is specifically:

$$\cos(A,B)=\frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^{2}}\,\sqrt{\sum_{n=1}^{L} B_n^{2}}}\qquad(3)$$

wherein A is any one feature vector of the feature vector sequence to be detected, B is any one feature vector of the preset template feature vector sequence, A_n is the value of the nth dimension of that feature vector of the feature vector sequence to be detected, B_n is the value of the nth dimension of that feature vector of the preset template feature vector sequence, L is the total dimension of the vectors, and n is an integer between 1 and L.
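A minimal NumPy sketch of formula (3):

```python
import numpy as np

def cos_value(A, B) -> float:
    """Formula (3): the inner product of A and B divided by the product
    of their Euclidean norms, for two L-dimensional feature vectors."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return float(A @ B / (np.linalg.norm(A) * np.linalg.norm(B)))
```

Because the norms cancel any uniform scaling of either vector, the value is insensitive to overall sound energy, which is the property the claims rely on.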
12. The method according to claim 1, wherein the selecting one of the at least one preset template feature vector sequence based on the cosine similarity for performing speech recognition on the speech to be detected specifically includes one or more of the following:
recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
and recognizing the voice to be detected according to a preset template feature vector sequence with the globally maximum cosine similarity.
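An illustrative sketch of the two selection strategies of claim 12; the threshold value is a hypothetical tuning parameter, not one specified by the claims:

```python
from typing import Optional

def select_template(similarities: dict, threshold: Optional[float] = None):
    """similarities: {template label: cosine similarity} for each preset
    template feature vector sequence.

    With a threshold, the best template reaching it is accepted and the
    input is rejected (None) if none does; without one, the template with
    the globally maximum cosine similarity is selected."""
    if threshold is not None:
        hits = {k: v for k, v in similarities.items() if v >= threshold}
        return max(hits, key=hits.get) if hits else None
    return max(similarities, key=similarities.get)
```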
13. An apparatus for speech recognition based on cosine similarity, the apparatus comprising:
the acquisition module is used for acquiring the voice to be detected;
the framing module is used for framing the voice to be detected;
the feature extraction module is used for extracting a feature vector of the voice to be detected of each frame to obtain a feature vector sequence to be detected;
the cosine similarity calculation module is used for calculating cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by utilizing a cosine value-based similarity calculation method;
and the recognition module is used for selecting one preset template feature vector sequence in the at least one preset template feature vector sequence based on the cosine similarity to recognize the voice to be detected.
14. The apparatus of claim 13, further comprising:
and the template feature vector sequence acquisition module is used for acquiring the at least one preset template feature vector sequence by extracting the feature vector sequence of the at least one recorded voice template.
15. The apparatus of claim 13, wherein the feature vector is a mel-frequency cepstral coefficient feature vector.
16. The apparatus of claim 15, wherein the feature extraction module is further configured to:
sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum of each frame of voice to be detected;
and performing discrete cosine transform on the logarithmic energy spectrum of each frame of voice to be detected by a table lookup method to obtain the Mel frequency cepstrum coefficient of each frame of voice to be detected.
17. The apparatus of claim 16, wherein the pre-processing comprises fast Fourier transform, Mel filtering, and logarithmic operation.
18. The apparatus of claim 16, wherein the obtaining the mel-frequency cepstrum coefficients of each frame of the speech to be tested by performing discrete cosine transform on the log energy spectrum of each frame of the speech to be tested through a table lookup method further comprises:
before extracting the feature vector of the voice to be detected of each frame, calculating data X (n, M) corresponding to the value of each (n, M) according to a formula (1), and constructing an LxM lookup table;
wherein formula (1) is specifically:

$$X(n,m)=\cos\left(\frac{\pi\,n\,(2m+1)}{2M}\right)\qquad(1)$$

wherein n = 1, 2, ..., L, L being a preset Mel-frequency cepstrum coefficient dimension, and m = 0, 1, 2, ..., M-1, M being a preset number of Mel filters.
19. The apparatus of claim 18, wherein the obtaining the mel-frequency cepstrum coefficients of each frame of speech to be tested by a table lookup method based on the log energy spectrum of each frame of speech to be tested further comprises:
after the logarithmic energy spectrum of each frame of voice to be detected is obtained, sequentially evaluating formula (2) for each value of n;
obtaining an L-dimensional Mel frequency cepstrum coefficient from the L calculation results;
wherein formula (2) is specifically:

$$\mathrm{MFCC}(n)=\sum_{m=0}^{M-1} s(m)\,X(n,m)\qquad(2)$$

wherein s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is acquired through the lookup table.
20. The apparatus of claim 18 or 19, wherein M takes the value 24, and wherein L takes the value 12.
21. The apparatus according to claim 13, wherein the cosine similarity calculation module is specifically configured to calculate the cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by using a cosine-value-based dynamic time warping algorithm.
22. The apparatus of claim 13, wherein the cosine similarity calculation module is specifically configured to:
calculating cosine values between each feature vector of the feature vector sequence to be detected and each feature vector of each preset template feature vector sequence aiming at each preset template feature vector sequence;
constructing the cosine values into a matrix grid by using a dynamic time warping algorithm;
selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
according to the optimal path, cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence is obtained;
and the accumulated cosine value under the optimal path is the cosine similarity.
23. The apparatus of claim 22, wherein the cosine similarity calculation module is further configured to:
for any group of feature vectors, calculating cosine values by adopting a formula (3);
wherein formula (3) is specifically:

$$\cos(A,B)=\frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^{2}}\,\sqrt{\sum_{n=1}^{L} B_n^{2}}}\qquad(3)$$

wherein A is any one feature vector of the feature vector sequence to be detected, B is any one feature vector of the preset template feature vector sequence, A_n is the value of the nth dimension of that feature vector of the feature vector sequence to be detected, B_n is the value of the nth dimension of that feature vector of the preset template feature vector sequence, L is the total dimension of the vectors, and n is an integer between 1 and L.
24. The apparatus of claim 13, wherein the identification module is specifically configured to:
recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
and recognizing the voice to be detected according to a preset template feature vector sequence with the globally maximum cosine similarity.
CN201811049146.XA 2018-09-10 2018-09-10 Voice recognition method and device based on cosine similarity Pending CN110890087A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811049146.XA CN110890087A (en) 2018-09-10 2018-09-10 Voice recognition method and device based on cosine similarity


Publications (1)

Publication Number Publication Date
CN110890087A true CN110890087A (en) 2020-03-17

Family

ID=69744883


Country Status (1)

Country Link
CN (1) CN110890087A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010044241A (en) * 2008-08-13 2010-02-25 Kddi Corp Voice recognition device and control program of same
CN102982803A (en) * 2012-12-11 2013-03-20 华南师范大学 Isolated word speech recognition method based on HRSF and improved DTW algorithm
US20170076726A1 (en) * 2015-09-14 2017-03-16 Samsung Electronics Co., Ltd. Electronic device, method for driving electronic device, voice recognition device, method for driving voice recognition device, and non-transitory computer readable recording medium
CN107767863A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN106448663A (en) * 2016-10-17 2017-02-22 海信集团有限公司 Voice wakeup method and voice interaction device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN Qiang et al.: "Behavior Recognition and Intelligent Computing" (行为识别与智能计算), Xidian University Press, pages 187-188 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111489740A (en) * 2020-04-23 2020-08-04 北京声智科技有限公司 Voice processing method and device and elevator control method and device
CN111800145A (en) * 2020-07-20 2020-10-20 电子科技大学 Code length blind identification method of linear block code based on cosine similarity
CN112434722A (en) * 2020-10-23 2021-03-02 浙江智慧视频安防创新中心有限公司 Label smooth calculation method and device based on category similarity, electronic equipment and medium
CN112434722B (en) * 2020-10-23 2024-03-19 浙江智慧视频安防创新中心有限公司 Label smooth calculation method and device based on category similarity, electronic equipment and medium
CN112863487A (en) * 2021-01-15 2021-05-28 广东优碧胜科技有限公司 Voice recognition method and device and electronic equipment
CN112820278A (en) * 2021-01-23 2021-05-18 广东美她实业投资有限公司 Household doorbell automatic monitoring method, equipment and medium based on intelligent earphone

Similar Documents

Publication Publication Date Title
CN110890087A (en) Voice recognition method and device based on cosine similarity
CN106935248B (en) Voice similarity detection method and device
CN109034046B (en) Method for automatically identifying foreign matters in electric energy meter based on acoustic detection
Tiwari MFCC and its applications in speaker recognition
US8271283B2 (en) Method and apparatus for recognizing speech by measuring confidence levels of respective frames
KR100745976B1 (en) Method and apparatus for classifying voice and non-voice using sound model
CN106601230B (en) Logistics sorting place name voice recognition method and system based on continuous Gaussian mixture HMM model and logistics sorting system
CN105529028A (en) Voice analytical method and apparatus
CN110599987A (en) Piano note recognition algorithm based on convolutional neural network
JP6272433B2 (en) Method and apparatus for detecting pitch cycle accuracy
US9997168B2 (en) Method and apparatus for signal extraction of audio signal
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN106847267A (en) A kind of folded sound detection method in continuous speech stream
CN104103280A (en) Dynamic time warping algorithm based voice activity detection method and device
CN112542174A (en) VAD-based multi-dimensional characteristic parameter voiceprint identification method
CN116741148A (en) Voice recognition system based on digital twinning
US8942977B2 (en) System and method for speech recognition using pitch-synchronous spectral parameters
CN114234061A (en) Neural network-based intelligent judgment method for water leakage sound of pressurized operation water supply pipeline
KR100744288B1 (en) Method of segmenting phoneme in a vocal signal and the system thereof
Smolenski et al. Usable speech processing: A filterless approach in the presence of interference
Kumar et al. Text dependent voice recognition system using MFCC and VQ for security applications
CN111326161B (en) Voiceprint determining method and device
CN114093385A (en) Unmanned aerial vehicle detection method and device
Tahliramani et al. Performance analysis of speaker identification system with and without spoofing attack of voice conversion
CN110634473A (en) Voice digital recognition method based on MFCC

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20201202

Address after: Room 206, 2 / F, building C, phase I, Zhongguancun Software Park, No. 8, Dongbei Wangxi Road, Haidian District, Beijing 100094

Applicant after: Canaan Bright Sight Co.,Ltd.

Address before: 100094, No. 3, building 23, building 8, northeast Wang Xi Road, Beijing, Haidian District, 307

Applicant before: Canaan Creative Co.,Ltd.

SE01 Entry into force of request for substantive examination