CN110890087A - Voice recognition method and device based on cosine similarity



Publication number
CN110890087A
Authority
CN
China
Prior art keywords
feature vector
detected
vector sequence
voice
cosine
Legal status
Pending
Application number
CN201811049146.XA
Other languages
Chinese (zh)
Inventor
吴威
张楠赓
Current Assignee
Canaan Bright Sight Co Ltd
Original Assignee
Canaan Creative Co Ltd
Application filed by Canaan Creative Co Ltd
Priority to CN201811049146.XA
Publication of CN110890087A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates

Abstract

An embodiment of the invention provides a method and system for voice recognition based on cosine similarity. The method comprises the following steps: acquiring a voice to be detected; performing framing processing on the voice to be detected; obtaining a feature vector sequence to be detected by extracting a feature vector from each frame of the voice to be detected; calculating the cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by using a similarity algorithm based on cosine values; and recognizing the voice to be detected based on the cosine similarity. The embodiment reduces the error caused by the difference in volume between the recorded voice template and the voice to be detected, and thereby improves the recognition rate.

Description

Voice recognition method and device based on cosine similarity
Technical Field
The invention relates to the field of voice recognition, in particular to a voice recognition method and device based on cosine similarity.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In recent years, with the development of man-machine information interaction technology, speech recognition technology has shown its importance. In speech recognition, the Euclidean distance is usually used in the prior art as the measure for calculating the similarity between two time series for further recognition.
For example, Dynamic Time Warping (DTW) is one of the key technologies in speech recognition. DTW measures the similarity between two time series by stretching and compressing them along the time axis. The conventional DTW algorithm dynamically calculates the Euclidean distance between every pair of vectors drawn from the feature vector sequence to be detected and the template feature vector sequence, and finds the warping path with the minimum accumulated distance as the optimal path.
However, in the process of implementing the present invention, the inventors found that when the similarity between two voice feature vector sequences is calculated based on the Euclidean distance (for example, when the time-series similarity is calculated with a dynamic time warping algorithm), the similarity between vectors belonging to the two sequences may carry a large error due to differences in volume, which in turn reduces the recognition rate.
Disclosure of Invention
To address the prior-art problem that the recognition rate drops when the volume of the speech to be recognized differs from that of the reference template, the embodiments of the invention provide a speech recognition method and device based on cosine similarity, which reduce the error caused by volume differences and thereby improve the recognition rate.
In a first aspect of an embodiment of the present invention, a method for speech recognition based on cosine similarity is provided, where the method includes:
acquiring a voice to be detected;
performing framing processing on the voice to be detected;
acquiring a feature vector sequence to be detected by extracting a feature vector of the voice to be detected of each frame;
calculating cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by utilizing a similarity algorithm based on cosine values;
and selecting one preset template feature vector sequence in the at least one preset template feature vector sequence based on the cosine similarity to identify the voice to be detected.
In one embodiment, the method further comprises:
and obtaining the at least one preset template feature vector sequence by extracting the feature vector sequence of the at least one recorded voice template.
In one embodiment, the feature vector is a mel-frequency cepstrum coefficient (MFCC) feature vector.
In one embodiment, the extracting the feature vector of the speech to be detected for each frame further includes:
sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum s (m) of each frame of voice to be detected;
and performing Discrete Cosine Transform (DCT) on the logarithmic energy spectrum s (m) of each frame of voice to be detected by a table lookup method to obtain Mel Frequency Cepstrum Coefficient (MFCC) of each frame of voice to be detected.
In one embodiment, the preprocessing includes fast fourier transform, Mel filtering, and logarithmic operations, among others.
In one embodiment, the performing a Discrete Cosine Transform (DCT) on the log energy spectrum s (m) of each frame of speech to be detected by using a table lookup method to obtain mel-frequency cepstrum coefficients (MFCC) of each frame of speech to be detected further includes:
before extracting the feature vector of each frame of the voice to be detected, calculating the data X(n, m) corresponding to each value of (n, m) according to formula (1), and constructing an L×M lookup table;
wherein formula (1) is:
$$X(n, m) = \cos\left(\frac{\pi n (m + 0.5)}{M}\right) \tag{1}$$
where n = 1, 2, …, L, L being the predetermined MFCC dimension, and m = 0, 1, …, M-1, M being the predetermined number of Mel filters.
In an embodiment, the obtaining a mel-frequency cepstrum coefficient (MFCC) of each frame of speech to be tested by a table lookup method based on a log energy spectrum s (m) of each frame of speech to be tested further includes:
after the logarithmic energy spectrum s(m) of each frame of voice to be detected is obtained, calculating formula (2) for each value of n in turn;
setting the L calculation results thus obtained as an L-dimensional MFCC;
wherein formula (2) is specifically:
$$C(n) = \sum_{m=0}^{M-1} s(m)\, X(n, m) \tag{2}$$
where s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is read from the lookup table.
In one embodiment, M takes the value of 24, and L takes the value of 12.
In one embodiment, the similarity algorithm is a dynamic time warping algorithm.
In an embodiment, the calculating the cosine similarity between the feature vector sequence to be measured and at least one preset template feature vector sequence by using a cosine-value-based similarity algorithm specifically includes:
calculating cosine values between each feature vector of the feature vector sequence to be detected and each feature vector of each preset template feature vector sequence aiming at each preset template feature vector sequence;
constructing the cosine values into a matrix grid by using a dynamic time warping algorithm;
selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
according to the optimal path, cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence is obtained;
and the accumulated cosine value under the optimal path is the cosine similarity.
In an embodiment, calculating the cosine value between each feature vector of the feature vector sequence to be detected and each feature vector of the preset template feature vector sequence specifically includes:
for any pair of feature vectors, calculating the cosine value using formula (3);
wherein formula (3) is specifically:
$$\cos(A, B) = \frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^2}\,\sqrt{\sum_{n=1}^{L} B_n^2}} \tag{3}$$
where A is any feature vector of the feature vector sequence to be detected, B is any feature vector of the preset template feature vector sequence, A_n is the value of the n-th dimension of A, B_n is the value of the n-th dimension of B, L is the total dimension of the vectors, and n is an integer between 1 and L.
In an embodiment, the selecting one of the at least one preset template feature vector sequence based on the cosine similarity to perform speech recognition on the speech to be detected specifically includes one or more of the following:
recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
and identifying the voice to be detected according to a preset template feature vector sequence with the global maximum cosine similarity.
In a second aspect of the present invention, a speech recognition apparatus based on cosine similarity is provided, where the apparatus includes:
the acquisition module is used for acquiring the voice to be detected;
the framing module is used for framing the voice to be detected;
the feature extraction module is used for extracting a feature vector of the voice to be detected of each frame to obtain a feature vector sequence to be detected;
the cosine similarity calculation module is used for calculating cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by utilizing a cosine value-based similarity calculation method;
and the recognition module is used for selecting one preset template feature vector sequence in the at least one preset template feature vector sequence based on the cosine similarity to recognize the voice to be detected.
In one embodiment, the apparatus further comprises:
and the template feature vector sequence acquisition module is used for acquiring the at least one preset template feature vector sequence by extracting the feature vector sequence of the at least one recorded voice template.
In one embodiment, the feature vector is a mel-frequency cepstrum coefficient feature vector.
In one embodiment, the feature extraction module is further configured to:
sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum s (m) of each frame of voice to be detected;
and performing Discrete Cosine Transform (DCT) on the logarithmic energy spectrum s (m) of each frame of voice to be detected by a table lookup method to obtain Mel Frequency Cepstrum Coefficient (MFCC) of each frame of voice to be detected.
In one embodiment, the preprocessing includes fast fourier transform, Mel filtering, and logarithmic operations, among others.
In one embodiment, the performing a Discrete Cosine Transform (DCT) on the log energy spectrum s (m) of each frame of speech to be detected by using a table lookup method to obtain mel-frequency cepstrum coefficients (MFCC) of each frame of speech to be detected further includes:
before extracting the feature vector of each frame of the voice to be detected, calculating the data X(n, m) corresponding to each value of (n, m) according to formula (1), and constructing an L×M lookup table;
wherein formula (1) is specifically:
$$X(n, m) = \cos\left(\frac{\pi n (m + 0.5)}{M}\right) \tag{1}$$
where n = 1, 2, …, L, L being the preset Mel-frequency cepstrum coefficient dimension, and m = 0, 1, …, M-1, M being the preset number of Mel filters.
In an embodiment, the obtaining a mel-frequency cepstrum coefficient (MFCC) of each frame of speech to be tested by a table lookup method based on a log energy spectrum s (m) of each frame of speech to be tested further includes:
after the logarithmic energy spectrum s(m) of each frame of voice to be detected is obtained, calculating formula (2) for each value of n in turn;
setting the L calculation results thus obtained as an L-dimensional MFCC;
wherein formula (2) is specifically:
$$C(n) = \sum_{m=0}^{M-1} s(m)\, X(n, m) \tag{2}$$
where s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is read from the lookup table.
In one embodiment, M takes the value of 24, and L takes the value of 12.
In one embodiment, the similarity algorithm is a dynamic time warping algorithm.
In an embodiment, the cosine similarity calculation module is specifically configured to:
calculating cosine values between each feature vector of the feature vector sequence to be detected and each feature vector of each preset template feature vector sequence aiming at each preset template feature vector sequence;
constructing the cosine values into a matrix grid by using a dynamic time warping algorithm;
selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
according to the optimal path, cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence is obtained;
and the accumulated cosine value under the optimal path is the cosine similarity.
In one embodiment, the cosine similarity calculation module is further configured to:
for any pair of feature vectors, calculating the cosine value using formula (3);
wherein formula (3) is specifically:
$$\cos(A, B) = \frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^2}\,\sqrt{\sum_{n=1}^{L} B_n^2}} \tag{3}$$
where A is any feature vector of the feature vector sequence to be detected, B is any feature vector of the preset template feature vector sequence, A_n is the value of the n-th dimension of A, B_n is the value of the n-th dimension of B, L is the total dimension of the vectors, and n is an integer between 1 and L.
In an embodiment, the identification module is specifically configured to:
recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
and identifying the voice to be detected according to a preset template feature vector sequence with the global maximum cosine similarity.
The cosine-similarity-based voice recognition method and device provided by the embodiments of the invention replace the Euclidean distance measure in the dynamic time warping algorithm with a vector cosine similarity measure, which reduces the error caused by the volume difference between the recorded voice template and the voice to be detected and thereby improves the recognition rate.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 is a flowchart of a method for speech recognition based on cosine similarity according to an embodiment of the present invention;
FIG. 2 is a flow chart of another method for recognizing a speech based on cosine similarity according to an embodiment of the present invention;
FIG. 3 is a flowchart of another method for recognizing a speech based on cosine similarity according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech recognition apparatus based on cosine similarity according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of another speech recognition apparatus based on cosine similarity according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Exemplary method
The embodiment of the invention provides a voice recognition method based on cosine similarity.
FIG. 1 is a schematic flowchart of a voice recognition method based on cosine similarity according to an embodiment of the present invention. As shown in FIG. 1, the method may include, but is not limited to, the following steps; optionally, the recognition is isolated-word speech recognition:
s110: acquiring a voice to be detected;
in some embodiments, after obtaining the speech to be tested, the sound needs to be preprocessed, and the preprocessing may include: analog-to-digital conversion and pre-emphasis.
In one embodiment, analog-to-digital conversion converts the analog signal into a digital signal, that is, it converts the continuous sound waveform into discrete data points at a certain sampling rate and bit depth. Pre-emphasis can be implemented with a high-pass filter and is used to boost the energy of the high-frequency part of the sound: in the spectrum of a sound signal, the energy of the low-frequency part is usually higher than that of the high-frequency part, so the high-frequency energy of the collected voice is strengthened in advance so that the high-frequency and low-frequency parts have similar amplitudes, which improves recognition accuracy.
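As an illustration, a minimal pre-emphasis sketch in Python with NumPy; the first-order filter form y[t] = x[t] - a·x[t-1] and the coefficient 0.97 are common choices assumed here, not values given by the patent:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """First-order high-pass pre-emphasis: y[t] = x[t] - coeff * x[t-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])
```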
S120: and performing framing processing on the voice to be detected.
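For concreteness, a framing sketch follows; the 25 ms frame length and 10 ms hop are typical assumptions, since the patent does not fix these parameters:

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D signal into overlapping frames, one frame per row."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])
```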
And S130, acquiring a feature vector sequence to be detected by extracting the feature vector of the voice to be detected of each frame.
In one embodiment, since the time-domain waveform of the sound only represents the time-varying relationship of the sound pressure and does not represent the characteristics of the sound well, the sound waveform must be converted into an acoustic feature vector. There are many sound feature extraction methods, such as mel-frequency cepstral coefficients MFCC, linear prediction cepstral coefficients LPCC, multimedia content description interface MPEG7, etc.
In one embodiment, the present embodiment may use the MFCC parameter vector as the sound feature vector for subsequent operations.
As will be appreciated by those skilled in the art, in general, extracting MFCC features from preprocessed speech may include the steps of:
(1) performing a discrete Fourier transform on each frame of preprocessed voice to obtain the corresponding spectrum;
(2) passing the spectrum through a Mel filter bank to obtain the Mel spectrum;
(3) performing cepstral analysis on the Mel spectrum to obtain the MFCC feature vector of that frame of voice.
The present invention is not limited to speech recognition using MFCC sound features, and may also use sound features such as LPCC (linear predictive cepstrum coefficient) and HZCRR (high zero crossing frame rate) to perform subsequent speech recognition.
According to the embodiment of the invention, the MFCC feature vector is used as the sound feature vector. The MFCC is a nonlinear feature: because what the human ear hears is not linearly proportional to the frequency of the sound, using MFCC features comes closer to the auditory characteristics of the human ear, which further improves the recognition rate.
S140: and calculating cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by utilizing a similarity algorithm based on cosine values.
In an embodiment, the preset template feature vector sequence is an MFCC feature vector sequence extracted in advance from the template speech; its length and the length of the feature vector sequence to be detected may differ.
In an embodiment, the similarity algorithm is a dynamic time warping algorithm.
In an embodiment, the similarity algorithm may also be other algorithms capable of calculating the similarity between two sound feature vector sequences, for example, a voiceprint recognition algorithm and a dynamic time warping algorithm.
It can be understood by those skilled in the art that, in the prior art, when the DTW algorithm computes the similarity of two time sequences, it generally does so from all the Euclidean distances between vectors belonging to the two sequences, whereas the cosine-value-based DTW algorithm of the embodiment of the present application computes all the cosine values between vectors belonging to the two sequences and derives the similarity of the two sequences from them.
S150: and identifying the voice to be detected based on the cosine similarity.
Specifically, one preset template feature vector sequence in the at least one preset template feature vector sequence is selected based on the cosine similarity to identify the voice to be detected.
In an embodiment, the greater the cosine similarity, the higher the degree of match between the preset template feature vector sequence and the voice to be detected. Each preset template feature vector sequence corresponds to a predetermined recognition result.
For example: when a household smart speaker leaves the factory, certain preset voice segments are generally bound to control instructions. For instance, a preset template feature vector sequence for 'turn on the air conditioner' is built into the smart speaker and corresponds to the control instruction that actually turns on the air conditioner.
In an embodiment, the recognition of the speech to be detected is not limited to the above matching by setting a threshold, and any method that can match one or more preset template feature vector sequences and perform speech recognition may be adopted.
In one embodiment, the present invention may employ the following recognition strategies:
(1) recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
in an embodiment, when template feature vector sequences in the template library are sequentially matched and identified, once a template feature vector sequence with a similarity value exceeding a preset threshold is matched, the template feature vector sequence is determined as an identification result.
(2) Recognizing the voice to be detected according to a preset template feature vector sequence with global maximum cosine similarity;
in an embodiment, when template feature vector sequences in the template library are sequentially matched and identified, after all the template feature vector sequences are matched, the template feature vector sequence with the maximum similarity is output as an identification result.
The preset rules may be used alone or in combination, and are not limited herein.
For example: among the similarity values obtained for a number of different templates, select the maximum one and judge whether it exceeds the preset threshold; if so, output the semantics corresponding to that template as the recognition result, and if not, output a no-match result.
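A sketch combining the two strategies as in the example above; the plain-dict template store is an illustrative assumption, and `dtw_cosine_similarity` is the cosine-based DTW scorer sketched later in this section:

```python
def recognize(test_seq, templates, threshold):
    """templates: dict mapping a label (e.g. 'turn on air conditioner')
    to its preset template feature vector sequence."""
    best_label, best_score = None, float("-inf")
    for label, template_seq in templates.items():
        score = dtw_cosine_similarity(test_seq, template_seq)
        if score > best_score:
            best_label, best_score = label, score
    # Global maximum plus threshold check, as in the example above.
    return best_label if best_score >= threshold else None
```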
As shown in fig. 2, in an embodiment, in step S130, the MFCC feature vector may be obtained by the following method:
s210: sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum s (m) of each frame of voice to be detected;
in one embodiment, the preprocessing may specifically include Fast Fourier Transform (FFT), Mel-filtering, and logarithmic operations.
In an embodiment, the following formula (4) may specifically be adopted to obtain s(m):
$$s(m) = \ln\left(\sum_{k=0}^{N-1} |X_a(k)|^2\, H_m(k)\right), \quad m = 0, 1, \ldots, M-1 \tag{4}$$
where X_a(k) is the FFT of the frame and H_m(k) is the frequency response of the m-th of the M triangular Mel filters, M being an integer greater than 1; L is the dimension of the MFCC feature, and 12 dimensions are generally sufficient to represent the acoustic feature; N is the number of FFT frequency points.
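A minimal NumPy sketch of this preprocessing chain; the triangular-filter construction follows the usual HTK-style Mel recipe, which the patent does not spell out, so treat those details as assumptions:

```python
import numpy as np

def hz_to_mel(f):
    """Standard HTK-style Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def log_mel_energies(frame, sample_rate, n_fft=512, n_filters=24):
    """Formula (4): s(m) = ln(sum_k |X(k)|^2 * H_m(k)), m = 0..M-1."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # |X_a(k)|^2
    # Triangular Mel filterbank over the rfft bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return np.log(H @ power + 1e-10)   # s(m); epsilon added for numerical safety
```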
S220: and performing Discrete Cosine Transform (DCT) on the logarithmic energy spectrum s (m) of each frame of voice to be detected by a table lookup method to obtain Mel Frequency Cepstrum Coefficient (MFCC) of each frame of voice to be detected.
In an embodiment, the process of building the lookup table may include:
(a) calculating the data X(n, m) corresponding to each value of (n, m) according to the following formula (1), and constructing an L×M lookup table;
wherein the formula is:
$$X(n, m) = \cos\left(\frac{\pi n (m + 0.5)}{M}\right) \tag{1}$$
where n = 1, 2, …, L, L being an integer greater than 1, and m = 0, 1, …, M-1, M being an integer greater than 1.
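A sketch of step (a), building the L×M table from the cosine basis of formula (1) as reconstructed above:

```python
import numpy as np

def build_dct_table(L: int = 12, M: int = 24) -> np.ndarray:
    """L x M lookup table: table[n-1, m] = X(n, m) = cos(pi * n * (m + 0.5) / M)."""
    n = np.arange(1, L + 1)[:, None]   # MFCC dimension index, 1..L
    m = np.arange(M)[None, :]          # Mel filter index, 0..M-1
    return np.cos(np.pi * n * (m + 0.5) / M)
```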
In an embodiment, the calculating a mel-frequency cepstrum coefficient (MFCC) according to the lookup table and the obtained log energy spectrum s (m) of each frame of the speech to be detected may specifically include:
(b) taking each value of n in turn and calculating the following formula (2), obtaining the L calculation results C(1), C(2), …, C(L);
(c) setting the L calculation results thus obtained as an L-dimensional MFCC;
wherein the formula is specifically:
$$C(n) = \sum_{m=0}^{M-1} s(m)\, X(n, m) \tag{2}$$
where n = 1, 2, …, L, L being an integer greater than 1, and m = 0, 1, …, M-1, M being an integer greater than 1; s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is read from the lookup table.
The above process of calculating mel-frequency cepstrum coefficients is explained in detail below using a specific example:
For example, to compute C(1), n takes the value 1: first, X(1, 0) is read from the lookup table and multiplied by s(0); then X(1, 1) is read from the lookup table and multiplied by s(1); and so on. Taking m through the values 0 to M-1 in turn and accumulating all the products yields C(1), the first-dimension coefficient of the MFCC.
C(1), …, C(L) are calculated in sequence in this way, giving the L-dimensional MFCC.
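Steps (b) and (c) then reduce to a single matrix-vector product; a sketch reusing the `build_dct_table` and `log_mel_energies` helpers assumed above:

```python
import numpy as np

def mfcc_from_table(log_energies: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Formula (2): C(n) = sum over m of s(m) * X(n, m), for n = 1..L."""
    return table @ log_energies        # (L, M) @ (M,) -> (L,)

# Usage sketch for one frame:
# table = build_dct_table(L=12, M=24)
# s = log_mel_energies(frame, sample_rate=16000, n_filters=24)
# mfcc = mfcc_from_table(s, table)     # L-dimensional MFCC vector
```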
In the embodiment of the invention, pre-storing the cosine terms in a lookup table speeds up the extraction of MFCC features from the voice, which shortens the overall speech recognition time and improves efficiency. In addition, especially for embedded hardware recognition devices, the calculation becomes simpler, saving manufacturing cost and device space.
In an embodiment, M takes the value 24 and L takes the value 12; n takes the values 1, 2, 3, …, 12 in turn, the entries corresponding to each value of n are calculated in advance according to the above formula, and the dimension index n together with the calculated results is pre-stored in an array, so that subsequent calculations obtain the corresponding values directly by looking up the array.
The number of filters M, the MFCC dimension L, and the other related coefficients are not specifically limited and may be adjusted for the specific application scenario; for example, the MFCC dimension L can be set to 16. In the present application, M takes the value 24 and L takes the value 12, but the application is not limited thereto.
In the embodiment of the invention, practical verification shows that 12 dimensions express the characteristics of the sound fairly completely, and 24 filters give a more stable result. Therefore, considering the amount of calculation, a 12-dimensional MFCC with 24 filters is the better choice.
In an embodiment, the MFCC feature vector may instead be calculated by a prior-art cepstral analysis method, which is not limited herein.
In an embodiment of the present invention, after feature extraction, the MFCC feature vector sequence to be detected (A_1, A_2, …, A_Q) of the voice to be detected is obtained, where Q is an integer greater than 1 and A_1 through A_Q are arranged in time order.
As shown in FIG. 3, in one embodiment, the step S140 of the present invention may include steps S310 to S340.
In an embodiment, a preset template feature vector sequence may be (B_1, B_2, …, B_T), where T is an integer greater than 1 and B_1 through B_T are arranged in time order.
S310: and aiming at each preset template feature vector sequence, calculating a cosine value between each feature vector of the feature vector sequence to be detected and each feature vector of each preset template feature vector sequence.
For any vector A_q in the feature vector sequence to be detected and any vector B_t in the preset template feature vector sequence, where 1 ≤ q ≤ Q and 1 ≤ t ≤ T, write A_q = (A_1, A_2, …, A_L) and B_t = (B_1, B_2, …, B_L), where L is the vector dimension, which for sound feature vectors generally takes an integer value from 12 to 16. The cosine of the angle between the two vectors can be calculated using the following formula (5):
$$\cos(A_q, B_t) = \frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^2}\,\sqrt{\sum_{n=1}^{L} B_n^2}} \tag{5}$$
where n is an integer between 1 and L.
It will be understood by those skilled in the art that for two vectors, the closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees, indicating that the two vectors are more similar, i.e., cosine similar.
The cosine distance uses the cosine of the angle between two vectors as the measure of difference between two individuals. Whereas the Euclidean distance tends to reflect absolute differences in the numerical features of individuals, the cosine distance focuses on the difference in direction between the two vectors and is insensitive to absolute magnitude, thereby correcting for possibly inconsistent measurement scales between the compared time series.
Therefore, comparing the feature vectors of the feature vector sequence to be detected and the template feature vector sequence by vector cosine similarity overcomes the low recognition rate caused by unbalanced signal strength between the voice to be detected and the recorded template, such as an excessive volume difference.
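Formula (5) transcribes directly; a sketch with NumPy:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between feature vectors a and b (formula (5))."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```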
S320: and constructing the cosine values into a matrix grid by utilizing a dynamic time warping algorithm.
As shown in Table 1 below, the cosine values of the angles between all vectors of the feature vector sequence to be detected and all vectors of the preset template feature vector sequence are calculated and arranged into a matrix grid, with the cosine values as matrix elements. Because both sequences are time series, the matrix grid follows time order from left to right and from bottom to top.
B_T | cos(A_1, B_T)   cos(A_2, B_T)   …   cos(A_Q, B_T)
⋮   |                 cos(A_q, B_t)
B_2 | cos(A_1, B_2)   cos(A_2, B_2)   …   cos(A_Q, B_2)
B_1 | cos(A_1, B_1)   cos(A_2, B_1)   …   cos(A_Q, B_1)
    |  A_1             A_2            …    A_Q

TABLE 1
S330: selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
s340: and obtaining the cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence according to the optimal path.
In one embodiment, consistent with the conventional DTW algorithm, the selection of the optimal path must satisfy at least three constraints: 1) boundary condition: the speaking speed of any voice may vary, but the order of its parts cannot change, so the selected path must start at the lower-left corner and end at the upper-right corner; 2) continuity: the path cannot skip over a point to match; it can only align with adjacent points; 3) monotonicity: the path must be monotonic in time and therefore cannot move left or down.
In an embodiment, the embodiment of the present application calculates a path with the largest accumulated cosine value according to the cost of each element in the matrix grid.
In the matrix grid, the cost of a point is the cosine value of that point plus the largest of the accumulated values from the three directions below, to the left, and diagonally below-left; these values can be obtained recursively in sequence, back to the point (1, 1):
$$S(A_q, B_t) = \cos(A_q, B_t) + \max\left[\,S(A_{q-1}, B_t),\; S(A_q, B_{t-1}),\; S(A_{q-1}, B_{t-1})\,\right]$$
That is, the accumulated cosine value S(A_q, B_t) at the current grid point is the cosine value cos(A_q, B_t) of that point, i.e. the cosine similarity of the two vectors corresponding to the grid point, plus the largest accumulated cosine value among the adjacent grid points from which the current point can be reached. When the accumulation reaches the end point corresponding to vector A_Q and vector B_T, the resulting S(A_Q, B_T) is the accumulated cosine value, and the path with the globally largest accumulated cosine value is the optimal warping path. The accumulated cosine value along the optimal path is the cosine similarity.
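A sketch of the accumulation just described; this is the `dtw_cosine_similarity` assumed in the earlier recognition sketch, and it reuses the `cosine` helper above. Path-length normalization is omitted because the patent does not mention it:

```python
import numpy as np

def dtw_cosine_similarity(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """seq_a: (Q, L) sequence under test; seq_b: (T, L) template sequence.
    Returns S(A_Q, B_T), the maximal accumulated cosine value."""
    Q, T = len(seq_a), len(seq_b)
    S = np.full((Q + 1, T + 1), -np.inf)
    S[0, 0] = 0.0                      # boundary condition: path starts at (1, 1)
    for q in range(1, Q + 1):
        for t in range(1, T + 1):
            c = cosine(seq_a[q - 1], seq_b[t - 1])
            # Continuity and monotonicity: only lower, left, or diagonal moves.
            S[q, t] = c + max(S[q - 1, t], S[q, t - 1], S[q - 1, t - 1])
    return float(S[Q, T])
```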
According to the embodiment of the invention, by using the dynamic time warping algorithm based on the cosine similarity, the adverse effects caused by different time lengths and different sound energies between two sound characteristic vector sequences can be simultaneously overcome, and the improvement of the recognition rate is facilitated.
The method of computing the accumulated cosine value is not particularly limited; any optimal-path computation that satisfies the boundary, continuity, and monotonicity constraints may be adopted, whether a prior-art method that applies a secondary constraint to the path boundary to reduce the amount of computation, or a method that divides the matrix grid into several local matrix grids, computes them separately, and then sums the results. The present application is only illustrated by the computation used in the exemplary embodiment above, but is not limited thereto.
It can be understood by those skilled in the art that the above technical solution of calculating the similarity between time series with a cosine-similarity-based dynamic time warping algorithm can equally be applied to other recognition fields, for example semantic recognition, gesture recognition, action recognition, and any other recognition field whose data have time-series properties (i.e., can be converted into time series).
In summary, the method and device for speech recognition based on cosine similarity provided by the embodiments of the present invention replace the Euclidean distance measure in the dynamic time warping algorithm with a vector cosine similarity measure, thereby reducing the error caused by the difference in volume between the recorded voice template and the voice to be detected and further improving the recognition rate.
Exemplary device
The embodiment of the invention provides a voice recognition device based on cosine similarity.
Fig. 4 is a schematic block diagram of a voice recognition apparatus based on cosine similarity according to an embodiment of the present invention. As shown in Fig. 4, the apparatus 400 may include, but is not limited to, the following modules.
the obtaining module 410 is configured to obtain a voice to be detected.
In some embodiments, after obtaining the speech to be tested, the sound needs to be preprocessed, and the preprocessing may include: analog-to-digital conversion and pre-emphasis.
In one embodiment, analog-to-digital conversion converts the analog signal into a digital signal, that is, it converts the continuous sound waveform into discrete data points at a certain sampling rate and bit depth. Pre-emphasis may be implemented with a high-pass filter and is used to boost the energy of the high-frequency part of the sound: in the spectrum of a sound signal, the energy of the low-frequency part is usually higher than that of the high-frequency part, so to give the high-frequency and low-frequency parts similar amplitudes, the high-frequency energy of the collected voice is pre-emphasized, which improves recognition accuracy.
And a framing module 420, configured to perform framing processing on the speech to be detected.
And the feature extraction module 430 is configured to obtain a feature vector sequence to be detected by extracting a feature vector of the speech to be detected in each frame.
In one embodiment, since the time-domain waveform of the sound only represents the variation of sound pressure over time and does not represent the characteristics of the sound well, the sound waveform must be converted into an acoustic feature vector. There are many kinds of sound features that can be extracted, such as Mel-frequency cepstrum coefficients (MFCC), linear prediction cepstrum coefficients (LPCC), the multimedia content description interface MPEG-7, etc.
In one embodiment, the present embodiment uses the MFCC parameter vectors as acoustic feature vectors for subsequent speech recognition.
As will be appreciated by those skilled in the art, in general, extracting MFCC features from preprocessed speech may include the steps of: (1) performing discrete Fourier transform on each frame of preprocessed voice to obtain a corresponding frequency spectrum; (2) the spectrum passes through a Mel filter bank to obtain a Mel spectrum; (3) performing cepstrum analysis on the Mel spectrum to obtain MFCC feature vectors of the frame of speech.
The present invention is not limited to speech recognition using MFCC sound features, and may also use sound features such as LPCC (linear predictive cepstrum coefficient) and HZCRR (high zero crossing frame rate) to perform subsequent speech recognition.
The cosine similarity calculation module 440 is configured to calculate cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by using a cosine value-based dynamic time warping algorithm.
In an embodiment, the preset template feature vector sequence is an MFCC feature vector sequence extracted in advance from the template speech; its length and the length of the feature vector sequence to be detected may differ.
In the prior art, when the DTW algorithm calculates the similarity between two sequences, it usually analyzes and calculates from all the Euclidean distances between vectors belonging to the two sequences, whereas the cosine-value-based DTW algorithm of the embodiment of the present application calculates all the cosine values between vectors belonging to the two sequences and derives the similarity of the two sequences from them.
And the recognition module 450 is configured to recognize the speech to be detected based on the cosine similarity.
Specifically, the recognition module 450 selects one of the at least one preset template feature vector sequence to recognize the to-be-detected speech based on the cosine similarity.
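As an illustration only, the modules might be wired together as below; the class name and the helpers (`pre_emphasis`, `frame_signal`, `log_mel_energies`, `build_dct_table`, `mfcc_from_table`, `recognize`) come from the sketches in the method section above and are assumptions, not the patent's API:

```python
import numpy as np

class CosineSimilarityRecognizer:
    """Sketch of apparatus 400: acquisition -> framing -> feature extraction
    -> cosine-based DTW scoring -> recognition decision."""

    def __init__(self, templates, threshold, sample_rate=16000):
        self.templates = templates                 # label -> template feature sequence
        self.threshold = threshold
        self.sample_rate = sample_rate
        self.table = build_dct_table(L=12, M=24)   # lookup table module 431

    def extract_features(self, signal: np.ndarray) -> np.ndarray:
        frames = frame_signal(pre_emphasis(signal), self.sample_rate)
        return np.stack([mfcc_from_table(log_mel_energies(f, self.sample_rate),
                                         self.table) for f in frames])

    def recognize_speech(self, signal: np.ndarray):
        return recognize(self.extract_features(signal), self.templates, self.threshold)
```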
In an embodiment, the greater the cosine similarity, the higher the degree of match between the template feature vector sequence and the voice to be detected. Each preset template feature vector sequence corresponds to a predetermined recognition result.
For example: when a household smart speaker leaves the factory, certain preset voice segments are generally bound to control instructions. For instance, a preset template feature vector sequence for 'turn on the air conditioner' is built into the smart speaker and corresponds to the control instruction that actually turns on the air conditioner.
In the embodiment of the present invention, the recognition of the speech to be detected is not limited to the above matching by setting the threshold, and any method that can match one or more preset template feature vector sequences and perform speech recognition may be adopted.
In one embodiment, the recognition module 450 of the invention can be configured to:
(1) recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
in an embodiment, when template feature vector sequences in a template library are sequentially matched and identified, once a template feature vector sequence with a similarity value exceeding a preset threshold is matched, the template feature vector sequence is judged as an identification result.
(2) Recognizing the voice to be detected according to a preset template feature vector sequence with global maximum cosine similarity;
in an embodiment, when template feature vector sequences in the template library are sequentially matched and identified, after all the template feature vector sequences are matched, the template feature vector sequence with the maximum similarity is output as an identification result.
The preset rules may be used alone or in combination, and are not limited herein. For example: among the similarity values obtained for a number of different templates, select the maximum one and judge whether it exceeds the preset threshold; if so, output the semantics corresponding to that template as the recognition result, and if not, output a no-match result.
In one embodiment, the following method may be used to obtain the MFCC feature vector:
(1) sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum s (m) of each frame of voice to be detected;
in one embodiment, the preprocessing may specifically include Fast Fourier Transform (FFT), Mel-filtering, and logarithmic operations.
In an embodiment, the following formula (4) may be specifically adopted to obtain s (m):
$$s(m) = \ln\left(\sum_{k=0}^{N-1} |X_a(k)|^2\, H_m(k)\right), \quad m = 0, 1, \ldots, M-1 \tag{4}$$
where X_a(k) is the FFT of the frame and H_m(k) is the frequency response of the m-th of the M triangular Mel filters, M being an integer greater than 1; L is the dimension of the MFCC feature, and 12 dimensions are generally sufficient to represent the acoustic feature; N is the number of FFT frequency points.
(2) And performing Discrete Cosine Transform (DCT) on the logarithmic energy spectrum s (m) of each frame of voice to be detected by a table lookup method to obtain Mel Frequency Cepstrum Coefficient (MFCC) of each frame of voice to be detected.
In one embodiment, as shown in FIG. 5, the feature extraction module 430 may include the following sub-modules:
and the lookup table module 431 is configured to, for each frame of voice to be detected, directly read the pre-stored table values according to the MFCC feature vector dimension and obtain the MFCC feature vector.
In an embodiment, the data building and data using process of the lookup table module 431 may include:
(a) calculating the data X(n, m) corresponding to each value of (n, m) according to the following formula (1), and constructing an L×M lookup table;
wherein the formula is:
$$X(n, m) = \cos\left(\frac{\pi n (m + 0.5)}{M}\right) \tag{1}$$
where n = 1, 2, …, L, L being an integer greater than 1, and m = 0, 1, …, M-1, M being an integer greater than 1.
In an embodiment, the calculating a mel-frequency cepstrum coefficient (MFCC) according to the lookup table and the obtained log energy spectrum s (m) of each frame of the speech to be detected may specifically include:
(b) taking each value of n in turn and calculating the following formula (2), obtaining the L calculation results C(1), C(2), …, C(L);
(c) setting the L calculation results thus obtained as an L-dimensional MFCC;
wherein the formula is specifically:
$$C(n) = \sum_{m=0}^{M-1} s(m)\, X(n, m) \tag{2}$$
where n = 1, 2, …, L, L being an integer greater than 1, and m = 0, 1, …, M-1, M being an integer greater than 1; s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is read from the lookup table.
The above process of calculating mel-frequency cepstrum coefficients is explained in detail below using a specific example:
For example, to compute C(1), n takes the value 1: first, X(1, 0) is read from the lookup table and multiplied by s(0); then X(1, 1) is read from the lookup table and multiplied by s(1); and so on. Taking m through the values 0 to M-1 in turn and accumulating all the products yields C(1), the first-dimension coefficient of the MFCC.
C(1), …, C(L) are calculated in sequence in this way, giving the L-dimensional MFCC.
In an embodiment, M takes the value 24 and L takes the value 12; n takes the values 1, 2, 3, …, 12 in turn, the entries corresponding to each value of n are calculated in advance according to the above formula, and the dimension index n together with the calculated results is pre-stored in an array, so that subsequent calculations obtain the corresponding values directly by looking up the array.
In the embodiment of the invention, pre-storing the lookup table speeds up the extraction of MFCC features from the voice, which shortens the overall speech recognition time and improves efficiency. In addition, especially for embedded hardware recognition devices, the calculation becomes simpler, saving manufacturing cost and device space.
The number of filters M, the MFCC dimension L, and the other related coefficients are not specifically limited and may be adjusted for the specific application scenario; for example, the MFCC dimension L can be set to 16. In the present application, M takes the value 24 and L takes the value 12, but the application is not limited thereto.
In the embodiment of the invention, practical verification shows that 12 dimensions express the characteristics of the sound fairly completely, and 24 filters give a more stable result. Therefore, considering the amount of calculation, a 12-dimensional MFCC with 24 filters is the better choice.
In an embodiment, the above MFCC feature vector extraction may instead be implemented by a prior-art cepstral analysis device, which is not limited herein.
In an embodiment of the present invention, after feature extraction, the MFCC feature vector sequence to be detected (A_1, A_2, …, A_Q) of the voice to be detected is obtained, where Q is an integer greater than 1 and A_1 through A_Q are arranged in time order.
In an embodiment, the cosine similarity calculation module of the present invention may be specifically configured to:
S310: for each preset template feature vector sequence, calculating the cosine value between each feature vector of the feature vector sequence to be detected and each feature vector of that preset template feature vector sequence. In an embodiment, a preset template feature vector sequence may be (B_1, B_2, …, B_T), where T is an integer greater than 1 and B_1 through B_T are arranged in time order.
For any vector A_q in the feature vector sequence to be detected and any vector B_t in the preset template feature vector sequence, where 1 ≤ q ≤ Q and 1 ≤ t ≤ T, write A_q = (A_1, A_2, …, A_L) and B_t = (B_1, B_2, …, B_L), where L is the vector dimension, which for sound feature vectors generally takes a positive integer value from 12 to 16. The cosine of the angle between the two vectors is calculated using the following formula:
$$\cos(A_q, B_t) = \frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^2}\,\sqrt{\sum_{n=1}^{L} B_n^2}}$$
It will be understood by those skilled in the art that for two vectors, the closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees, indicating that the two vectors are more similar, i.e., cosine similar.
Because the cosine of the angle between two vectors is used as the measure of difference between two individuals, the cosine distance, compared with the Euclidean distance's tendency to reflect absolute differences in numerical features, focuses on the difference in direction between the two vectors and is insensitive to absolute magnitude, thereby correcting for possibly inconsistent measurement scales between the compared time series.
Therefore, comparing the feature vectors of the feature vector sequence to be detected and the template feature vector sequence by vector cosine similarity overcomes the low recognition rate caused by unbalanced signal strength between the voice to be detected and the recorded template, such as an excessive volume difference.
S320: and constructing the cosine values into a matrix grid by utilizing a dynamic time warping algorithm.
As shown in Table 1 below, the cosine values of the angles between all vectors of the feature vector sequence to be detected and all vectors of the preset template feature vector sequence are calculated and arranged into a matrix grid, with the cosine values as matrix elements. Because both sequences are time series, the matrix grid follows time order from left to right and from bottom to top.
B_T | cos(A_1, B_T)   cos(A_2, B_T)   …   cos(A_Q, B_T)
⋮   |                 cos(A_q, B_t)
B_2 | cos(A_1, B_2)   cos(A_2, B_2)   …   cos(A_Q, B_2)
B_1 | cos(A_1, B_1)   cos(A_2, B_1)   …   cos(A_Q, B_1)
    |  A_1             A_2            …    A_Q

TABLE 1
S330: selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
s340: and obtaining the cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence according to the optimal path.
In one embodiment, consistent with the conventional DTW algorithm, the selection of the optimal path must satisfy at least three constraints: 1) boundary condition: the speaking speed of any voice may vary, but the order of its parts cannot change, so the selected path must start at the lower-left corner and end at the upper-right corner; 2) continuity: the path cannot skip over a point to match; it can only align with adjacent points; 3) monotonicity: the path must be monotonic in time and therefore cannot move left or down.
In an embodiment, the embodiment of the present application calculates a path with the largest accumulated cosine value according to the cost of each element in the matrix grid.
In the matrix grid, the cost of a point is the cosine value of that point plus the largest of the accumulated values from the three directions below, to the left, and diagonally below-left; these values can be obtained recursively in sequence, back to the point (1, 1):
$$S(A_q, B_t) = \cos(A_q, B_t) + \max\left[\,S(A_{q-1}, B_t),\; S(A_q, B_{t-1}),\; S(A_{q-1}, B_{t-1})\,\right]$$
That is, the accumulated cosine value S(A_q, B_t) at the current grid point is the cosine value cos(A_q, B_t) of that point, i.e. the cosine similarity of the two vectors corresponding to the grid point, plus the largest accumulated cosine value among the adjacent grid points from which the current point can be reached. When the accumulation reaches the end point corresponding to vector A_Q and vector B_T, the resulting S(A_Q, B_T) is the accumulated cosine value, and the path with the globally largest accumulated cosine value is the optimal warping path. The accumulated cosine value along the optimal path is the cosine similarity.
According to the embodiment of the invention, by adopting the dynamic time warping algorithm based on the cosine similarity, the adverse effects caused by different time lengths and different sound energies between two sound characteristic vector sequences can be simultaneously overcome, and the improvement of the recognition rate is facilitated.
The method of computing the accumulated cosine value is not particularly limited; any optimal-path computation that satisfies the boundary, continuity, and monotonicity constraints may be adopted, whether a prior-art method that applies a secondary constraint to the path boundary to reduce the amount of computation, or a method that divides the matrix grid into several local matrix grids, computes them separately, and then sums the results. The present application is only illustrated by the computation used in the exemplary embodiment above, but is not limited thereto.
It can be understood by those skilled in the art that the above technical solution of calculating the similarity between time series with a cosine-similarity-based dynamic time warping algorithm can equally be applied to other recognition fields, for example semantic recognition, gesture recognition, action recognition, and any other recognition field whose data have time-series properties (i.e., can be converted into time series).
In summary, the speech recognition device based on cosine similarity provided by the embodiment of the present invention replaces the Euclidean distance measure in the dynamic time warping algorithm with a vector cosine similarity measure, thereby reducing the error caused by the difference in volume between the recorded voice template and the voice to be detected and further improving the recognition rate.

Claims (24)

1. A speech recognition method based on cosine similarity, the method comprising:
acquiring a voice to be detected;
performing framing processing on the voice to be detected;
acquiring a feature vector sequence to be detected by extracting a feature vector of the voice to be detected of each frame;
calculating cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by utilizing a similarity algorithm based on cosine values;
and selecting one preset template feature vector sequence in the at least one preset template feature vector sequence based on the cosine similarity to identify the voice to be detected.
2. The method of claim 1, further comprising:
and obtaining the at least one preset template feature vector sequence by extracting the feature vector sequence of the at least one recorded voice template.
3. The method of claim 1, wherein the feature vector is a mel-frequency cepstral coefficient feature vector.
4. The method of claim 3, wherein the extracting the feature vector of the speech to be detected for each frame further comprises:
sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum of each frame of voice to be detected;
and performing discrete cosine transform on the logarithmic energy spectrum of each frame of voice to be detected by a table lookup method to obtain the Mel frequency cepstrum coefficient of each frame of voice to be detected.
5. The method of claim 4, wherein the pre-processing comprises fast Fourier transform, Mel filtering, and logarithmic operation.
6. The method of claim 4, wherein the obtaining the mel-frequency cepstrum coefficients of each frame of the speech to be tested by performing discrete cosine transform on the logarithmic energy spectrum of each frame of the speech to be tested through a table lookup method further comprises:
before extracting the feature vector of the voice to be detected of each frame, calculating data X (n, M) corresponding to the value of each (n, M) according to a formula (1), and constructing an LxM lookup table;
wherein formula (1) is:

$$X(n,m)=\cos\left(\frac{\pi\,n\,(2m+1)}{2M}\right)\qquad(1)$$

wherein n = 1, 2, ..., L, L being a preset Mel-frequency cepstrum coefficient dimension, and m = 0, 1, 2, ..., M-1, M being a preset number of Mel filters.
7. The method of claim 6, wherein the obtaining the mel-frequency cepstrum coefficients of each frame of the speech to be tested by a table lookup method based on the log energy spectrum of each frame of the speech to be tested further comprises:
after the logarithmic energy spectrum of each frame of voice to be detected is obtained, sequentially evaluating formula (2) for each value of n;
obtaining an L-dimensional Mel frequency cepstrum coefficient from the L calculation results;
wherein formula (2) is specifically:

$$\mathrm{MFCC}(n)=\sum_{m=0}^{M-1} s(m)\,X(n,m)\qquad(2)$$

wherein s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is acquired through the lookup table.
8. The method of claim 6 or 7, wherein M takes the value 24, and wherein L takes the value 12.
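As a sketch of the lookup-table approach of claims 6 to 8, the table of formula (1) is built once and formula (2) then reduces to a single matrix-vector product per frame; note that the cosine-basis form of X(n, m) used below is the standard MFCC discrete cosine transform and is an assumption here:

```python
import numpy as np

L_DIM, M_FILT = 12, 24   # claim 8: L = 12 cepstral dimensions, M = 24 Mel filters

def build_dct_table(L: int = L_DIM, M: int = M_FILT) -> np.ndarray:
    """Precompute X(n, m) of formula (1) once, before any frame is processed."""
    n = np.arange(1, L + 1)[:, None]   # n = 1, ..., L
    m = np.arange(M)[None, :]          # m = 0, ..., M-1
    return np.cos(np.pi * n * (2 * m + 1) / (2 * M))   # L x M lookup table

DCT_TABLE = build_dct_table()

def mfcc_from_log_energy(s) -> np.ndarray:
    """Formula (2): MFCC(n) = sum over m of s(m) * X(n, m).

    s: length-M logarithmic energy spectrum of one frame (after the FFT,
    Mel filtering and logarithm of claim 5)."""
    return DCT_TABLE @ np.asarray(s, dtype=float)   # L-dimensional MFCC vector
```

Precomputing the table replaces the per-frame cosine evaluations of the discrete cosine transform with lookups and multiply-accumulates, which is presumably the motivation for the lookup method.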
9. The method according to claim 1, wherein the calculating the cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by using a similarity algorithm based on cosine values specifically comprises: calculating the cosine similarity between the feature vector sequence to be detected and the at least one preset template feature vector sequence by using a cosine-value-based dynamic time warping algorithm.
10. The method according to claim 9, wherein the calculating the cosine similarity between the feature vector sequence to be measured and at least one preset template feature vector sequence by using a cosine-value-based dynamic time warping algorithm specifically comprises:
calculating cosine values between each feature vector of the feature vector sequence to be detected and each feature vector of each preset template feature vector sequence aiming at each preset template feature vector sequence;
constructing the cosine values into a matrix grid by using a dynamic time warping algorithm;
selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
according to the optimal path, cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence is obtained;
and the accumulated cosine value under the optimal path is the cosine similarity.
11. The method according to claim 10, wherein calculating the cosine values between each feature vector of the feature vector sequence to be detected and each feature vector of the preset template feature vector sequence specifically comprises:
for any group of feature vectors, calculating cosine values by adopting a formula (3);
wherein formula (3) is specifically:

$$\cos(A,B)=\frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^{2}}\,\sqrt{\sum_{n=1}^{L} B_n^{2}}}\qquad(3)$$

wherein A is any one feature vector of the feature vector sequence to be detected, B is any one feature vector of the preset template feature vector sequence, A_n is the value of the nth dimension of that feature vector of the feature vector sequence to be detected, B_n is the value of the nth dimension of that feature vector of the preset template feature vector sequence, L is the total dimension of the vectors, and n is an integer between 1 and L.
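A minimal NumPy sketch of formula (3):

```python
import numpy as np

def cos_value(A, B) -> float:
    """Formula (3): the inner product of A and B divided by the product
    of their Euclidean norms, for two L-dimensional feature vectors."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return float(A @ B / (np.linalg.norm(A) * np.linalg.norm(B)))
```

Because the norms cancel any uniform scaling of either vector, the value is insensitive to overall sound energy, which is the property the claims rely on.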
12. The method according to claim 1, wherein the selecting one of the at least one preset template feature vector sequence based on the cosine similarity for performing speech recognition on the speech to be detected specifically includes one or more of the following:
recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
and recognizing the voice to be detected according to a preset template feature vector sequence with the globally maximum cosine similarity.
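An illustrative sketch of the two selection strategies of claim 12; the threshold value is a hypothetical tuning parameter, not one specified by the claims:

```python
from typing import Optional

def select_template(similarities: dict, threshold: Optional[float] = None):
    """similarities: {template label: cosine similarity} for each preset
    template feature vector sequence.

    With a threshold, the best template reaching it is accepted and the
    input is rejected (None) if none does; without one, the template with
    the globally maximum cosine similarity is selected."""
    if threshold is not None:
        hits = {k: v for k, v in similarities.items() if v >= threshold}
        return max(hits, key=hits.get) if hits else None
    return max(similarities, key=similarities.get)
```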
13. An apparatus for speech recognition based on cosine similarity, the apparatus comprising:
the acquisition module is used for acquiring the voice to be detected;
the framing module is used for framing the voice to be detected;
the feature extraction module is used for extracting a feature vector of the voice to be detected of each frame to obtain a feature vector sequence to be detected;
the cosine similarity calculation module is used for calculating cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by utilizing a cosine value-based similarity calculation method;
and the recognition module is used for selecting one preset template feature vector sequence in the at least one preset template feature vector sequence based on the cosine similarity to recognize the voice to be detected.
14. The apparatus of claim 13, further comprising:
and the template feature vector sequence acquisition module is used for acquiring the at least one preset template feature vector sequence by extracting the feature vector sequence of the at least one recorded voice template.
15. The apparatus of claim 13, wherein the feature vector is a mel-frequency cepstral coefficient feature vector.
16. The apparatus of claim 15, wherein the feature extraction module is further configured to:
sequentially preprocessing each frame of voice to be detected to obtain a logarithmic energy spectrum of each frame of voice to be detected;
and performing discrete cosine transform on the logarithmic energy spectrum of each frame of voice to be detected by a table lookup method to obtain the Mel frequency cepstrum coefficient of each frame of voice to be detected.
17. The apparatus of claim 16, wherein the pre-processing comprises fast Fourier transform, Mel filtering, and logarithmic operation.
18. The apparatus of claim 16, wherein the obtaining the mel-frequency cepstrum coefficients of each frame of the speech to be tested by performing discrete cosine transform on the log energy spectrum of each frame of the speech to be tested through a table lookup method further comprises:
before extracting the feature vector of the voice to be detected of each frame, calculating data X (n, M) corresponding to the value of each (n, M) according to a formula (1), and constructing an LxM lookup table;
wherein formula (1) is specifically:

$$X(n,m)=\cos\left(\frac{\pi\,n\,(2m+1)}{2M}\right)\qquad(1)$$

wherein n = 1, 2, ..., L, L being a preset Mel-frequency cepstrum coefficient dimension, and m = 0, 1, 2, ..., M-1, M being a preset number of Mel filters.
19. The apparatus of claim 18, wherein the obtaining the mel-frequency cepstrum coefficients of each frame of speech to be tested by a table lookup method based on the log energy spectrum of each frame of speech to be tested further comprises:
after the logarithmic energy spectrum of each frame of voice to be detected is obtained, sequentially evaluating formula (2) for each value of n;
obtaining an L-dimensional Mel frequency cepstrum coefficient from the L calculation results;
wherein formula (2) is specifically:

$$\mathrm{MFCC}(n)=\sum_{m=0}^{M-1} s(m)\,X(n,m)\qquad(2)$$

wherein s(m) is the acquired logarithmic energy spectrum of each frame of voice to be detected, and X(n, m) is acquired through the lookup table.
20. The apparatus of claim 18 or 19, wherein M takes the value 24, and wherein L takes the value 12.
21. The apparatus according to claim 13, wherein the cosine similarity calculation module is specifically configured to calculate the cosine similarity between the feature vector sequence to be detected and at least one preset template feature vector sequence by using a cosine-value-based dynamic time warping algorithm.
22. The apparatus of claim 13, wherein the cosine similarity calculation module is specifically configured to:
calculating cosine values between each feature vector of the feature vector sequence to be detected and each feature vector of each preset template feature vector sequence aiming at each preset template feature vector sequence;
constructing the cosine values into a matrix grid by using a dynamic time warping algorithm;
selecting an optimal path from the matrix grid by using a dynamic time warping algorithm;
according to the optimal path, cosine similarity between the feature vector sequence to be detected and each preset template feature vector sequence is obtained;
and the accumulated cosine value under the optimal path is the cosine similarity.
23. The apparatus of claim 22, wherein the cosine similarity calculation module is further configured to:
for any group of feature vectors, calculating cosine values by adopting a formula (3);
wherein formula (3) is specifically:

$$\cos(A,B)=\frac{\sum_{n=1}^{L} A_n B_n}{\sqrt{\sum_{n=1}^{L} A_n^{2}}\,\sqrt{\sum_{n=1}^{L} B_n^{2}}}\qquad(3)$$

wherein A is any one feature vector of the feature vector sequence to be detected, B is any one feature vector of the preset template feature vector sequence, A_n is the value of the nth dimension of that feature vector of the feature vector sequence to be detected, B_n is the value of the nth dimension of that feature vector of the preset template feature vector sequence, L is the total dimension of the vectors, and n is an integer between 1 and L.
24. The apparatus of claim 13, wherein the identification module is specifically configured to:
recognizing the voice to be detected according to the preset template feature vector sequence with the cosine similarity reaching a preset threshold;
and recognizing the voice to be detected according to a preset template feature vector sequence with the globally maximum cosine similarity.
CN201811049146.XA 2018-09-10 2018-09-10 Voice recognition method and device based on cosine similarity Pending CN110890087A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811049146.XA CN110890087A (en) 2018-09-10 2018-09-10 Voice recognition method and device based on cosine similarity


Publications (1)

Publication Number Publication Date
CN110890087A true CN110890087A (en) 2020-03-17

Family

ID=69744883


Country Status (1)

Country Link
CN (1) CN110890087A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010044241A (en) * 2008-08-13 2010-02-25 Kddi Corp Voice recognition device and control program of same
CN102982803A (en) * 2012-12-11 2013-03-20 华南师范大学 Isolated word speech recognition method based on HRSF and improved DTW algorithm
US20170076726A1 (en) * 2015-09-14 2017-03-16 Samsung Electronics Co., Ltd. Electronic device, method for driving electronic device, voice recognition device, method for driving voice recognition device, and non-transitory computer readable recording medium
CN107767863A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN106448663A (en) * 2016-10-17 2017-02-22 海信集团有限公司 Voice wakeup method and voice interaction device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN Qiang et al.: "Behavior Recognition and Intelligent Computing" (行为识别与智能计算), Xidian University Press, pages 187-188 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111489740A (en) * 2020-04-23 2020-08-04 北京声智科技有限公司 Voice processing method and device and elevator control method and device
CN111800145A (en) * 2020-07-20 2020-10-20 电子科技大学 Code length blind identification method of linear block code based on cosine similarity
CN112434722A (en) * 2020-10-23 2021-03-02 浙江智慧视频安防创新中心有限公司 Label smooth calculation method and device based on category similarity, electronic equipment and medium
CN112434722B (en) * 2020-10-23 2024-03-19 浙江智慧视频安防创新中心有限公司 Label smooth calculation method and device based on category similarity, electronic equipment and medium
CN112863487A (en) * 2021-01-15 2021-05-28 广东优碧胜科技有限公司 Voice recognition method and device and electronic equipment
CN112820278A (en) * 2021-01-23 2021-05-18 广东美她实业投资有限公司 Household doorbell automatic monitoring method, equipment and medium based on intelligent earphone

Similar Documents

Publication Publication Date Title
CN110890087A (en) Voice recognition method and device based on cosine similarity
CN106935248B (en) Voice similarity detection method and device
CN109034046B (en) Method for automatically identifying foreign matters in electric energy meter based on acoustic detection
Tiwari MFCC and its applications in speaker recognition
US8271283B2 (en) Method and apparatus for recognizing speech by measuring confidence levels of respective frames
KR100745976B1 (en) Method and apparatus for classifying voice and non-voice using sound model
CN106601230B (en) Logistics sorting place name voice recognition method and system based on continuous Gaussian mixture HMM model and logistics sorting system
CN105529028A (en) Voice analytical method and apparatus
CN110599987A (en) Piano note recognition algorithm based on convolutional neural network
JP6272433B2 (en) Method and apparatus for detecting pitch cycle accuracy
US9997168B2 (en) Method and apparatus for signal extraction of audio signal
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN106847267A (en) A kind of folded sound detection method in continuous speech stream
CN104103280A (en) Dynamic time warping algorithm based voice activity detection method and device
CN112542174A (en) VAD-based multi-dimensional characteristic parameter voiceprint identification method
CN116741148A (en) Voice recognition system based on digital twinning
US8942977B2 (en) System and method for speech recognition using pitch-synchronous spectral parameters
CN114234061A (en) Neural network-based intelligent judgment method for water leakage sound of pressurized operation water supply pipeline
KR100744288B1 (en) Method of segmenting phoneme in a vocal signal and the system thereof
Smolenski et al. Usable speech processing: A filterless approach in the presence of interference
Kumar et al. Text dependent voice recognition system using MFCC and VQ for security applications
CN111326161B (en) Voiceprint determining method and device
CN114093385A (en) Unmanned aerial vehicle detection method and device
Tahliramani et al. Performance analysis of speaker identification system with and without spoofing attack of voice conversion
CN110634473A (en) Voice digital recognition method based on MFCC

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20201202

Address after: Room 206, 2 / F, building C, phase I, Zhongguancun Software Park, No. 8, Dongbei Wangxi Road, Haidian District, Beijing 100094

Applicant after: Canaan Bright Sight Co.,Ltd.

Address before: 100094, No. 3, building 23, building 8, northeast Wang Xi Road, Beijing, Haidian District, 307

Applicant before: Canaan Creative Co.,Ltd.

SE01 Entry into force of request for substantive examination