CN112634869B - Command word recognition method, device and computer storage medium - Google Patents


Publication number
CN112634869B
CN112634869B (application CN202011431850.9A)
Authority
CN
China
Prior art keywords
feature vector
command word
voiceprint
feature
searching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011431850.9A
Other languages
Chinese (zh)
Other versions
CN112634869A (en
Inventor
束建钢
黄炜
张伟哲
卢梓杰
黄树佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202011431850.9A priority Critical patent/CN112634869B/en
Publication of CN112634869A publication Critical patent/CN112634869A/en
Application granted granted Critical
Publication of CN112634869B publication Critical patent/CN112634869B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/26 Speech to text systems
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/04 Training, enrolment or model building
    • G10L 2015/088 Word spotting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a command word recognition method, a device, and a computer storage medium. The method comprises the following steps: extracting voiceprint feature vectors and command word feature vectors based on SincNet; training the voiceprint feature vector and the command word feature vector with an improved triplet loss function; searching and matching the voiceprint feature vector based on a feature search database; searching and matching the command word feature vector based on the feature search database; and, when the voiceprint feature vector meets a first preset threshold and the command word feature vector meets a second preset threshold, recognition succeeds. The invention solves the problems of poor command word recognition performance and slow response, and improves both the recognition effect and the recognition speed.

Description

Command word recognition method, device and computer storage medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a command word recognition method, device, and computer storage medium.
Background
Voiceprint recognition and speech recognition have become increasingly important wherever voice is used, and the emergence of the Internet of Things has further expanded the application scenarios of artificial intelligence, which poses security challenges for speech recognition.
Command word recognition likewise faces security challenges: an industrial system that does not verify the identity of the user is more vulnerable to attack and abuse.
A slow system response degrades the user experience and may even damage the system; for example, if an emergency equipment system fails to shut down in time, the equipment may fail. The causes are generally: 1) feature vector retrieval is slow; 2) most current speech models operate in sentence units, which inflates the number of model parameters and further lowers the response speed.
In some large systems there are often hundreds of thousands of users and millions of custom commands. Generic feature vector retrieval methods are slow in this setting, which reduces the response speed.
Disclosure of Invention
To this end, the command word recognition method of the present application addresses the problems of poor command word recognition performance and slow response speed.
The embodiment of the application provides a command word recognition method, which comprises the following steps:
extracting voiceprint feature vectors and command word feature vectors based on SincNet;
training the voiceprint feature vector and the command word feature vector by using an improved triplet loss function; wherein the improved triplet loss function is still trainable when it is negative;
searching and matching the voiceprint feature vector based on a feature searching database;
searching and matching the command word feature vector based on the feature searching database;
and when the voiceprint feature vector meets a first preset threshold and the command word feature vector meets a second preset threshold, the identification is successful.
In an embodiment, the extracting voiceprint feature vectors and command word feature vectors based on SincNet includes:
extracting voice characteristics based on SincNet;
and generating the voiceprint feature vector and the command word feature vector from the speech features by using a residual network with a preset number of sequentially stacked layers.
In one embodiment, the SincNet is calculated as follows:
O[n] = x[n] * g[n, θ];
G[f, f1, f2] = rect(f / (2f2)) - rect(f / (2f1));
g[n, f1, f2] = 2f2·sinc(2πf2·n) - 2f1·sinc(2πf1·n);
sinc(x) = sin(x) / x;
wherein the O function is the convolution, the g function is a rectangular band-pass filter, and θ is a trainable parameter; G denotes the form of g in the frequency domain space, and G is converted into the time-domain form g through the inverse Fourier transform; f is the original frequency of the speech, f1 is the low cut-off frequency, f2 is the high cut-off frequency, and f1 and f2 are both learnable variables.
In an embodiment, the training the voiceprint feature vector and the command word feature vector using the improved triplet loss function includes:
storing the voiceprint feature vector and the command word feature vector corresponding to each segment of voice;
training with the anchor point feature vector together with the positive sample feature vector farthest from it and the negative sample feature vector closest to it.
In an embodiment, the formula for training using the anchor point feature vector with the positive sample feature vector farthest from it, and with the negative sample feature vector closest to it, is as follows:
Record Av as the feature vector group of the anchor points, Pv as the feature vector group of the positive samples, and Nv as the feature vector group of the negative samples; then:
distance(Av, Pv) = Av · Pv;
distance(Av, Nv) = Av · Nv;
diff = distance(Av, Pv) - distance(Av, Nv) + margin;
RTripletLoss = diff, if diff ≥ 0; a·diff, if diff < 0;
wherein the anchor points are samples within one class; the positive samples are samples of the same class as the anchor points, and the negative samples are samples of a different class from the anchor points; the distance function may also be mse(c, d) = (c - d)²; margin is a user-defined threshold; a is a learnable parameter drawn from U(l, u), the uniform distribution over [l, u].
In an embodiment, the searching and matching the voiceprint feature vector based on the feature search database includes:
extracting the voiceprint feature vector to be retrieved and storing it in a Pymagnitude feature search database;
calculating the cosine similarity between the voiceprint feature vector and the voiceprint feature vector to be retrieved, and obtaining the first cosine similarity with the highest matching degree; the voiceprint feature vector to be retrieved is obtained by K-dimensional tree retrieval in the Pymagnitude feature search database.
In an embodiment, the searching and matching the command word feature vector based on the feature search database includes:
extracting the command word feature vector to be retrieved and storing it in the Pymagnitude feature search database;
calculating the cosine similarity between the command word feature vector and the command word feature vector to be retrieved, and obtaining the second cosine similarity with the highest matching degree; the command word feature vector to be retrieved is obtained by K-dimensional tree retrieval in the Pymagnitude feature search database.
In an embodiment, when the voiceprint feature vector meets a first preset threshold and the command word feature vector meets a second preset threshold, the identifying is successful, including:
and if the first cosine similarity is smaller than or equal to the first preset threshold value and the second cosine similarity is smaller than or equal to the second preset threshold value, the identification is successful.
In an embodiment, the method further comprises:
terminating the recognition when the first cosine similarity is greater than the first preset threshold; or,
terminating the recognition when the second cosine similarity is greater than the second preset threshold.
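The dual-threshold decision above can be sketched as follows; the function name and the score convention (a score above its threshold terminates recognition) are illustrative, not part of the patent:

```python
def recognize(voiceprint_score: float, command_score: float,
              thresh1: float, thresh2: float) -> bool:
    """Dual-threshold rule: recognition succeeds only when both the
    voiceprint match and the command word match satisfy their thresholds;
    otherwise recognition is terminated as soon as either check fails."""
    if voiceprint_score > thresh1:   # first cosine similarity check fails
        return False
    if command_score > thresh2:      # second cosine similarity check fails
        return False
    return True                      # both conditions met: success
```

Requiring both checks to pass is what lets an attacker's correct command word be rejected on a voiceprint mismatch, and vice versa.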
In an embodiment, in the step of training the voiceprint feature vector and the command word feature vector by using the improved triplet loss function, performing a data enhancement operation on the voice training data includes:
performing a framing operation on the voice data; and
stretching the voice data in the time dimension to generate voice training data with different speech speeds, or adding noise to the voice data to generate voice training data with different noise.
In an embodiment, the performing a framing operation on the voice data includes:
uniformly sampling a preset number of observation points over the voice data by using a sliding window; and
taking a preset neighborhood of each observation point as one frame, wherein adjacent frames overlap by a preset portion.
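The framing step can be sketched with a sliding window over the raw signal; the function name and the frame/hop parameters below are illustrative assumptions:

```python
import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D signal into overlapping frames: each frame is the
    neighborhood of an evenly spaced observation point, and adjacent
    frames overlap by frame_len - hop samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

x = np.arange(10)
frames = frame_signal(x, frame_len=4, hop=2)  # adjacent frames share 2 samples
```

Time-stretching and noise injection would then be applied per frame or per utterance to generate the augmented training data.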
In order to achieve the above object, there is also provided a computer-readable storage medium having stored thereon a command word recognition method program which, when executed by a processor, implements the steps of any of the methods described above.
In order to achieve the above object, there is also provided a command word recognition apparatus including a memory, a processor, and a command word recognition method program stored on the memory and executable on the processor, the processor implementing the steps of any one of the above methods when executing the command word recognition method program.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages. Voiceprint feature vectors and command word feature vectors are extracted based on SincNet; the SincNet feature extraction network improves the portability of the system as a whole while increasing the speed and accuracy of feature extraction. The voiceprint feature vector and the command word feature vector are trained with an improved triplet loss function that remains trainable when its value is negative; the improved triplet loss function enlarges the differences between classes and increases the similarity within classes, thereby improving the command word recognition result. The voiceprint feature vector and the command word feature vector are retrieved and matched against a feature search database; retrieving them through the feature search database realizes a dimension-reduction operation on the feature vectors, and the K-dimensional tree retrieval method accelerates feature vector retrieval and response, improving the user experience. Recognition succeeds only when the voiceprint feature vector meets a first preset threshold and the command word feature vector meets a second preset threshold; requiring both conditions simultaneously improves the accuracy of command word recognition. The invention thus solves the problems of poor command word recognition performance and slow response, improving both the recognition effect and the recognition speed.
Drawings
Fig. 1 is a schematic diagram of a hardware architecture of a command word recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a command word recognition method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating steps for implementing step S110 in the first embodiment of the command word recognition method according to the present invention;
FIG. 4 is a flowchart illustrating steps performed in step S120 of the first embodiment of the command word recognition method according to the present invention;
FIG. 5 is a flowchart illustrating steps for implementing step S130 in the first embodiment of the command word recognition method according to the present invention;
FIG. 6 is a flowchart illustrating steps performed in step S140 of the first embodiment of the command word recognition method according to the present invention;
FIG. 7 is a flowchart illustrating steps performed in step S150 of the first embodiment of the command word recognition method according to the present invention;
FIG. 8 is a flowchart of a command word recognition method according to a second embodiment of the present invention;
FIG. 9 is a flowchart illustrating steps of performing data enhancement operations on voice training data in the command word recognition method of the present invention;
FIG. 10 is a flowchart illustrating steps performed in step S221 of the command word recognition method according to the present invention;
FIG. 11 is a flow chart of command word recognition according to the present invention.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The main solutions of the embodiments of the present invention are: extracting voiceprint feature vectors and command word feature vectors based on SincNet; training the voiceprint feature vector and the command word feature vector by using an improved triplet loss function; searching and matching the voiceprint feature vector based on a feature searching database; searching and matching the command word feature vector based on the feature searching database; and when the voiceprint feature vector meets a first preset threshold and the command word feature vector meets a second preset threshold, the identification is successful. The invention solves the problems of poor recognition effect and low response speed of the command words, and improves the recognition effect and recognition speed of the command words.
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the present application relates to a command word recognition apparatus 010, which includes: at least one processor 012 and a memory 011.
The processor 012 may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or instructions in software form in the processor 012. The processor 012 may be a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component. The disclosed methods, steps, and logic blocks in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory 011, and the processor 012 reads information in the memory 011 and performs the steps of the above method in combination with its hardware.
It is to be appreciated that memory 011 in embodiments of the present invention can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory, a programmable read-only memory, an erasable programmable read-only memory, an electrically erasable programmable read-only memory, or a flash memory. The volatile memory may be a random access memory that acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory, dynamic random access memory, synchronous dynamic random access memory, double data rate synchronous dynamic random access memory, enhanced synchronous dynamic random access memory, synchronous link dynamic random access memory, and direct memory bus random access memory. The memory 011 of the systems and methods described by embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
Referring to fig. 2, fig. 2 is a first embodiment of the command word recognition method of the present invention, the method includes:
step 110: voiceprint feature vectors and command word feature vectors are extracted based on SincNet.
SincNet is an interpretable convolution filter structure. The first layer of SincNet performs convolution based on the sinc function; the result then undergoes conventional CNN (convolutional neural network) operations (pooling, normalization, activation, dropout), is fed into a stack of several convolutional layers and a fully connected layer (or a recurrent layer), and is finally classified with a softmax classifier.
Voiceprint features can be divided into auditory features and acoustic features. The former are descriptions of sound qualities that the human ear can identify and describe, such as whether a speaking voice is breathy or full and resonant. The latter are a set of acoustic description parameters (vectors) {x1, x2, ..., xn} extracted from the sound signal by a computer algorithm (a mathematical method). In the present invention, the acoustic features may serve as the voiceprint feature vector.
The command word feature vector may be a feature vector corresponding to a command word in the voice data, and may identify the command word and perform an operation represented by the command word.
Step 120: training the voiceprint feature vector and the command word feature vector by using an improved triplet loss function; wherein the improved triplet loss function is still trainable when it is negative.
In this embodiment, training may be performed with the standard triplet loss function or with the improved triplet loss function. The purpose of triplet loss training is to pull feature vectors of the same class closer together while pushing feature vectors of different classes further apart.
Let the anchor point (anchor, abbreviated a) be a sample of a given class, the positive sample (abbreviated p) be a sample of the same class as the anchor, and the negative sample (abbreviated n) be a sample of a different class from the anchor. The loss function used is as follows:
diff = distance(a, p) - distance(a, n) + margin
TripletLoss = max(diff, 0)
In the above, distance may be mse(a, p) = (a - p)² or another distance function, and margin is a custom threshold.
The triplet loss function uses a ReLU-like form, max(diff, 0); improving the ReLU into an RReLU form yields the improved triplet loss function:
RTripletLoss = diff, if diff ≥ 0; a·diff, if diff < 0;
where a is a learnable parameter drawn from U(l, u), the uniform distribution over [l, u]. The purpose of RTripletLoss is to enable training to continue even when the loss function is negative.
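A minimal sketch of the RReLU-style triplet loss described above, assuming the mse distance from the text and sampling the negative-side slope a from U(l, u) randomly rather than learning it:

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(c, d):
    # distance(c, d) = mean squared error, as in the text
    return float(np.mean((c - d) ** 2))

def rtriplet_loss(anchor, positive, negative, margin=0.2, l=0.1, u=0.3):
    """RReLU-style triplet loss: unlike max(diff, 0), a negative diff is
    scaled by a ~ U(l, u) instead of being zeroed, so the loss still
    carries a training signal when the triplet is already satisfied."""
    diff = mse(anchor, positive) - mse(anchor, negative) + margin
    if diff >= 0:
        return diff
    a = rng.uniform(l, u)   # slope for the negative branch
    return a * diff
```

With a zeroed standard triplet loss, satisfied triplets contribute no gradient; the RReLU branch keeps pulling same-class vectors closer even then.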
Step S130: and searching and matching the voiceprint feature vector based on a feature searching database.
The feature retrieval database provides both storage and feature vector dimension reduction, which can accelerate feature vector retrieval; retrieval speed can be increased further by using K-dimensional tree retrieval.
Step S140: and carrying out retrieval matching on the command word feature vector based on the feature retrieval database.
The feature retrieval database is not described in detail herein.
Step S150: and when the voiceprint feature vector meets a first preset threshold and the command word feature vector meets a second preset threshold, the identification is successful.
The first preset threshold and the second preset threshold are adjusted according to training data, and are not limited herein.
The beneficial effects of the above embodiment are as follows. Voiceprint feature vectors and command word feature vectors are extracted based on SincNet; the SincNet feature extraction network improves the portability of the system as a whole while increasing the speed and accuracy of feature extraction. The voiceprint feature vector and the command word feature vector are trained with an improved triplet loss function that remains trainable when its value is negative; the improved triplet loss function enlarges the differences between classes and increases the similarity within classes, thereby improving the command word recognition result. The voiceprint feature vector and the command word feature vector are retrieved and matched against the feature retrieval database; this realizes a dimension-reduction operation on the feature vectors, and the K-dimensional tree retrieval method accelerates feature vector retrieval and response, improving the user experience. Recognition succeeds only when the voiceprint feature vector meets the first preset threshold and the command word feature vector meets the second preset threshold; requiring both conditions simultaneously improves the accuracy of command word recognition. The invention thus solves the problems of poor command word recognition performance and slow response, improving both the recognition effect and the recognition speed.
Referring to fig. 3, fig. 3 is a specific implementation step of step S110 in the first embodiment of the command word recognition method of the present invention, where extracting voiceprint feature vectors and command word feature vectors based on SincNet includes:
step S111: speech features are extracted based on SincNet.
The voice feature may be a fusion feature comprising voiceprint feature information and command word feature information.
Step S112: and generating the voiceprint feature vector and the command word feature vector by utilizing a residual network with a preset layer number which is sequentially overlapped according to the voice feature.
The residual network (ResNet) is a convolutional neural network proposed by four researchers from Microsoft Research; it won the image classification and object detection tasks of the 2015 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The residual network is easy to optimize and can gain accuracy from considerably increased depth. Its internal residual blocks use skip connections, which alleviate the vanishing gradient problem caused by increasing depth in deep neural networks.
The preset layer number can be dynamically adjusted according to the training effect, and is not limited herein.
The beneficial effects existing in the above embodiments are as follows: the embodiment specifically provides implementation steps of steps for extracting voiceprint feature vectors and command word feature vectors based on SincNet, and the addition of a residual network ensures that voiceprint feature vectors and command word feature vectors are extracted efficiently and correctly.
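A toy sketch of stacking residual blocks to map SincNet speech features to an embedding (the voiceprint or command word vector); the plain-numpy block relu(x + W·x) is a deliberate simplification of a real ResNet layer, and all names are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One residual block: the input skips over the transformation,
    so the identity path keeps gradients flowing in deep stacks."""
    return relu(x + w @ x)

def embed(features: np.ndarray, weights) -> np.ndarray:
    """Stack a preset number of residual blocks (one weight matrix per
    block) to turn speech features into a feature-vector embedding."""
    for w in weights:
        features = residual_block(features, w)
    return features
```

The "preset number of layers" in the text corresponds to `len(weights)` here, tuned against training results.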
In one embodiment, the SincNet is calculated as follows:
O[n] = x[n] * g[n, θ];
G[f, f1, f2] = rect(f / (2f2)) - rect(f / (2f1));
g[n, f1, f2] = 2f2·sinc(2πf2·n) - 2f1·sinc(2πf1·n);
sinc(x) = sin(x) / x;
wherein the O function is the convolution, the g function is a rectangular band-pass filter, and θ is a trainable parameter; G denotes the form of g in the frequency domain space, and G is converted into the time-domain form g through the inverse Fourier transform; f is the original frequency of the speech, f1 is the low cut-off frequency, f2 is the high cut-off frequency, and f1 and f2 are both learnable variables.
The beneficial effects of this embodiment are as follows: the embodiment provides the specific calculation formulas of SincNet, ensuring the correctness of the features SincNet extracts; the SincNet network has few parameters, which accelerates feature extraction and reduces resource consumption; and computation is efficient because the SincNet filter is symmetric, reducing the amount of parameter computation.
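The band-pass kernel g[n, f1, f2] above can be computed directly; the cut-off values below are illustrative, and the filter symmetry cited as a benefit can be checked numerically:

```python
import numpy as np

def sinc(x):
    # sinc(x) = sin(x) / x, with sinc(0) = 1
    return np.where(x == 0, 1.0, np.sin(x) / np.where(x == 0, 1.0, x))

def sincnet_kernel(n: np.ndarray, f1: float, f2: float) -> np.ndarray:
    """Time-domain band-pass kernel from the formulas above: the
    difference of two low-pass sinc filters with learnable cut-offs."""
    return 2 * f2 * sinc(2 * np.pi * f2 * n) - 2 * f1 * sinc(2 * np.pi * f1 * n)

n = np.arange(-50, 51)                     # symmetric support around n = 0
g = sincnet_kernel(n, f1=0.05, f2=0.15)    # normalized cut-off frequencies
```

Because only f1 and f2 are learned per filter (rather than every tap), the parameter count stays small, which is the efficiency argument made in the text.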
Referring to fig. 4, fig. 4 is a specific implementation step of step S120 in the first embodiment of the command word recognition method of the present invention, where the training of the voiceprint feature vector and the command word feature vector by using the improved triplet loss function includes:
step S121: and storing the voiceprint feature vector and the command word feature vector corresponding to each section of voice.
During training, the voiceprint feature vector and the command word feature vector corresponding to each section of voice are stored firstly so as to be used for training.
Step S122: the feature vector of the anchor point and the sample feature vector with the farthest positive sample distance are adopted, and the feature vector of the anchor point and the sample feature vector with the nearest negative sample distance are adopted for training.
The anchor point in the present invention can be either a voiceprint feature vector or a command word feature vector; that is, training uses the voiceprint feature vector together with the positive sample feature vector farthest from it and the negative sample feature vector closest to it.
Likewise, training uses the command word feature vector together with the positive sample feature vector farthest from it and the negative sample feature vector closest to it.
The beneficial effects existing in the embodiment are as follows: the embodiment specifically provides specific steps of training the voiceprint feature vector and the command word feature vector by using the improved triplet loss function, and ensures that the training process is performed correctly, thereby ensuring the recognition effect of the command word.
In one embodiment, the formula for training using the anchor point feature vector with the positive sample feature vector farthest from it, and with the negative sample feature vector closest to it, is as follows:
Record Av as the feature vector group of the anchor points, Pv as the feature vector group of the positive samples, and Nv as the feature vector group of the negative samples; then:
distance(Av, Pv) = Av · Pv;
distance(Av, Nv) = Av · Nv;
diff = distance(Av, Pv) - distance(Av, Nv) + margin;
RTripletLoss = diff, if diff ≥ 0; a·diff, if diff < 0;
wherein the anchor points are samples within one class; the positive samples are samples of the same class as the anchor points, and the negative samples are samples of a different class from the anchor points; the distance function may also be mse(c, d) = (c - d)²; margin is a user-defined threshold; a is a learnable parameter drawn from U(l, u), the uniform distribution over [l, u].
Referring to fig. 5, fig. 5 is a specific implementation step of step S130 in the first embodiment of the command word recognition method of the present invention, where the searching and matching the voiceprint feature vector based on the feature searching database includes:
step S131: and extracting the voiceprint feature vector to be retrieved and storing the voiceprint feature vector in a pyragnitide feature retrieval database.
The process of extracting the voiceprint feature vector to be retrieved is the same as the extraction process of the voiceprint feature vector described in the first embodiment.
The Pymagnitude feature retrieval database is a feature package and vector storage file format developed by Plasticity for fast, efficient and simple use of vector embeddings in machine learning models. It is mainly intended as a simpler/faster alternative to Gensim, but can also serve as generic key-vector storage in fields other than NLP.
Step S132: calculating the cosine similarity between the voiceprint feature vector and the voiceprint feature vector to be retrieved, and obtaining the first cosine similarity with the highest matching degree; the match for the voiceprint feature vector to be retrieved is obtained by a K-dimensional tree search in the Pymagnitude feature retrieval database.
Cosine similarity is evaluated by calculating the cosine of the angle between two vectors. When the two vectors point in the same direction, the cosine similarity is 1; when the angle between them is 90 degrees, the cosine similarity is 0; when they point in diametrically opposite directions, the cosine similarity is −1. That is, the larger the cosine similarity, the more similar the two vectors.
A K-dimensional tree (k-d tree) is commonly used for space partitioning and nearest-neighbour search, and is a special case of a binary space partitioning tree. Generally, for a data set of dimension k with N data points, a k-d tree is suitable when N ≫ 2^k.
The beneficial effects of the above embodiment are as follows: the embodiment provides a specific implementation of search matching of the voiceprint feature vector based on a feature retrieval database, where the introduction of the Pymagnitude feature retrieval database and the K-dimensional tree search accelerates search matching and thereby speeds up the response of command word recognition.
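The retrieval step can be sketched without Pymagnitude: L2-normalize the stored vectors and index them with a k-d tree, because for unit vectors the Euclidean nearest neighbour is also the highest-cosine match (‖u − v‖² = 2 − 2·cos(u, v)). SciPy's `cKDTree` stands in here for the Pymagnitude index; this substitution is illustrative, not the patent's code.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_index(vectors):
    """Normalize database feature vectors and index them in a k-d tree."""
    vecs = np.asarray(vectors, dtype=float)
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return cKDTree(unit), unit

def best_cosine_match(tree, unit_db, query):
    """Return (index, cosine) of the stored vector most similar to the
    query, i.e. the 'cosine similarity with highest matching degree'."""
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)
    _, idx = tree.query(q)                 # Euclidean NN == best cosine
    return idx, float(np.dot(unit_db[idx], q))
```

The same index-and-query pair serves both step S132 (voiceprint vectors, first cosine similarity) and step S142 (command word vectors, second cosine similarity).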
Referring to fig. 6, fig. 6 is a specific implementation step of step S140 in the first embodiment of the command word recognition method of the present invention, where the searching and matching the feature vector of the command word based on the feature searching database includes:
step S141: and extracting the command word feature vector to be retrieved and storing the command word feature vector in a pyragnitide feature retrieval database.
Step S142: calculating the cosine similarity between the command word feature vector and the command word feature vector to be retrieved, and obtaining the second cosine similarity with the highest matching degree; the match for the command word feature vector to be retrieved is obtained by a K-dimensional tree search in the Pymagnitude feature retrieval database.
The technical features involved in this embodiment are described in the previous embodiment and are not repeated here.
The beneficial effects of this embodiment are as follows: the embodiment provides a specific implementation of search matching of the command word feature vector based on the feature retrieval database, where the introduction of the Pymagnitude feature retrieval database and the K-dimensional tree search accelerates search matching and thereby speeds up the response of command word recognition.
Referring to fig. 7, fig. 7 is a specific implementation step of step S150 in the first embodiment of the command word recognition method according to the present invention, where when the voiceprint feature vector meets a first preset threshold and the command word feature vector meets a second preset threshold, the recognition is successful, and the method includes:
step S151: and if the first cosine similarity is smaller than or equal to the first preset threshold value and the second cosine similarity is smaller than or equal to the second preset threshold value, the identification is successful.
The first preset threshold and the second preset threshold are adjusted according to training data, and are not limited herein.
This embodiment gives a judgment condition for determining whether command word recognition succeeds; the judgment is not limited to the above method, and other judgment methods may be used.
The beneficial effects of this embodiment are as follows: the embodiment provides the judgment condition for determining whether recognition succeeds, ensuring the command word recognition effect.
Referring to fig. 8, fig. 8 is a second embodiment of the command word recognition method according to the present invention, the method further includes:
step 210: voiceprint feature vectors and command word feature vectors are extracted based on SincNet.
Step 220: training the voiceprint feature vector and the command word feature vector by using an improved triplet loss function; wherein the improved triplet loss function is still trainable when it is negative.
Step S230: and searching and matching the voiceprint feature vector based on a feature searching database.
Step S240: and carrying out retrieval matching on the command word feature vector based on the feature retrieval database.
Step S250: and if the first cosine similarity is smaller than or equal to the first preset threshold value and the second cosine similarity is smaller than or equal to the second preset threshold value, the identification is successful.
Step S260: terminating the recognition when the first cosine similarity is greater than the first preset threshold; or,
and when the second cosine similarity is larger than the second preset threshold value, terminating the identification.
Compared with the first embodiment, the second embodiment includes step S250 and step S260; the other steps are the same as in the first embodiment and are not described again. Step S250 is described in the above embodiment.
The first preset threshold and the second preset threshold are adjusted according to training data, and are not limited herein.
This embodiment gives a judgment condition for determining whether command word recognition is terminated; the judgment is not limited to the above method, and other judgment methods may be used.
The beneficial effects of this embodiment are as follows: the embodiment provides the judgment condition for determining whether recognition is terminated, ensuring that recognition is terminated in the special cases of command word recognition and thereby guaranteeing the command word recognition effect.
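Taken literally, steps S250 through S260 form the acceptance test sketched below. The comparison directions follow the patent text verbatim; the default threshold values are illustrative assumptions, since the patent tunes them on training data.

```python
def recognition_decision(first_cos, second_cos,
                         first_threshold=0.8, second_threshold=0.8):
    """Recognition succeeds only when BOTH retrieved similarities satisfy
    their preset thresholds (step S250); if either cosine similarity
    exceeds its threshold, recognition is terminated (step S260)."""
    if first_cos > first_threshold or second_cos > second_threshold:
        return "terminated"
    return "success"
```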
Referring to fig. 9, fig. 9 shows the step of performing a data enhancement operation on the voice training data in the command word recognition method of the present invention; in the step of training the voiceprint feature vector and the command word feature vector by using the improved triplet loss function, performing the data enhancement operation on the voice training data includes:
step S221: performing framing operation on the voice data;
Framing one section of voice data generates multiple different sections of voice data, thereby increasing the training data.
Step S222: stretching the voice data along the time dimension to generate voice training data with different speech speeds; or adding noise to the voice data to generate voice training data with different noise.
These two methods generate voice training data with different speech speeds and different noise, increasing the training data and ensuring the training effect of command word recognition.
The beneficial effects of the embodiment are as follows: the embodiment specifically provides an implementation method for executing data enhancement operation on voice training data, increases the training data and ensures the training effect of command word recognition.
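Step S222 can be sketched with NumPy: time-dimension stretching via linear-interpolation resampling for speed perturbation, and additive Gaussian noise for noisy copies. The stretch rates and noise amplitudes below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def augment_speech(wave, rates=(0.9, 1.1), noise_amps=(0.005, 0.02), rng=None):
    """Generate speed-perturbed and noisy variants of one utterance."""
    rng = rng or np.random.default_rng(0)
    wave = np.asarray(wave, dtype=float)
    n = len(wave)
    variants = []
    for rate in rates:                    # stretch along the time dimension
        new_idx = np.arange(0, n, rate)   # rate < 1 slows, rate > 1 speeds up
        variants.append(np.interp(new_idx, np.arange(n), wave))
    for amp in noise_amps:                # additive-noise versions
        variants.append(wave + rng.normal(0.0, amp, size=n))
    return variants
```

Each original utterance thus yields several training utterances with the same label, which is the enlargement effect the embodiment relies on.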
Referring to fig. 10, fig. 10 is a specific implementation step of step S221 in the command word recognition method according to the present invention, where the step of performing framing operation on voice data includes:
step S2211: and uniformly collecting a preset number of observation points on the voice data by utilizing a sliding window.
Step S2212: taking a preset neighborhood of each observation point as a frame; wherein adjacent frames overlap by a preset portion.
Framing enables a short-time Fourier transform with a sliding window, which facilitates processing in the frequency domain and improves processing efficiency. A frame is a set of a preset number of consecutive observation points.
The beneficial effects of this embodiment are as follows: the embodiment provides the specific steps of the framing operation on voice data; framing increases the amount of voice data and ensures the training effect of command word recognition.
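Steps S2211 and S2212 amount to the standard overlapping-frame split sketched below. The 400-sample frame with a 160-sample hop (25 ms / 10 ms at 16 kHz) is a common convention assumed here, not a value taken from the patent.

```python
import numpy as np

def frame_signal(wave, frame_len=400, hop=160):
    """Slide a window over the signal: each frame is frame_len consecutive
    observation points, and adjacent frames overlap by frame_len - hop
    samples (the 'preset portion')."""
    wave = np.asarray(wave)
    n_frames = 1 + max(0, len(wave) - frame_len) // hop
    return np.stack([wave[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```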
The present invention also provides a computer readable storage medium having stored thereon a command word recognition method program which, when executed by a processor, implements the steps of any of the methods described above.
The invention also provides command word recognition equipment, which comprises a memory, a processor and a command word recognition method program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of any one of the methods when executing the command word recognition method program.
Referring to fig. 11, fig. 11 is a command word recognition flowchart, in which a solid line represents the next step in the flowchart and a dotted line represents the corresponding feature layer input.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (13)

1. A method of command word recognition, the method comprising:
extracting voiceprint feature vectors and command word feature vectors based on SincNet;
training the voiceprint feature vector and the command word feature vector by using an improved triplet loss function; wherein the improved triplet loss function is still trainable when it is negative;
searching and matching the voiceprint feature vector based on a feature searching database;
searching and matching the command word feature vector based on the feature searching database;
and when the voiceprint feature vector meets a first preset threshold and the command word feature vector meets a second preset threshold, the identification is successful.
2. The command word recognition method of claim 1, wherein the extracting voiceprint feature vectors and command word feature vectors based on SincNet comprises:
extracting voice characteristics based on SincNet;
and generating the voiceprint feature vector and the command word feature vector by utilizing a residual network with a preset layer number which is sequentially overlapped according to the voice feature.
3. The command word recognition method of claim 2, wherein the SincNet has a calculation formula as follows:
O[n]=x[n]*g[n,θ];
Figure FDA0002826489360000011
g[n, f_1, f_2] = 2f_2*sinc(2πf_2 n) − 2f_1*sinc(2πf_1 n);
sinc(x) = sin(x)/x;
wherein the O function is a convolution function, the g function is a rectangular band-pass filter, and θ is a trainable parameter; G denotes the form of g in the frequency-domain space, and g is obtained by converting G to the time domain through the inverse Fourier transform; f is the original frequency of the speech, f_1 is the low cut-off frequency, f_2 is the high cut-off frequency, and both f_1 and f_2 are learnable variables.
4. The method of claim 1, wherein the training the voiceprint feature vector and the command word feature vector, respectively, using an improved triplet loss function, comprises:
storing the voiceprint feature vector and the command word feature vector corresponding to each section of voice;
training using the feature vector of the anchor point together with the positive-sample feature vector at the farthest distance, and together with the negative-sample feature vector at the closest distance.
5. The command word recognition method of claim 4, wherein the formulas for training using the feature vector of the anchor point together with the farthest positive-sample feature vector and together with the closest negative-sample feature vector are as follows:
let A_v be the feature vector group of the anchor point, P_v the feature vector group of the positive sample, and N_v the feature vector group of the negative sample; then:
distance(A_v, P_v) = A_v * P_v;
distance(A_v, N_v) = A_v * N_v;
diff = distance(A_v, P_v) − distance(A_v, N_v) + margin;
Figure FDA0002826489360000021
a ~ U(l, u), l < u and l, u ∈ [0, 1);
wherein the anchor point is a sample within a class; the positive samples are samples of the same class as the anchor point, and the negative samples are samples of a different class from the anchor point; the distance function is mse(c, d) = (c − d)²; margin is a user-defined threshold; a is a learnable parameter, and U(l, u) obeys the uniform distribution on [l, u).
6. The command word recognition method according to claim 1, wherein the search matching of the voiceprint feature vector based on a feature search database comprises:
extracting a voiceprint feature vector to be retrieved and storing it in a Pymagnitude feature retrieval database;
calculating the cosine similarity between the voiceprint feature vector and the voiceprint feature vector to be retrieved, and obtaining the first cosine similarity with the highest matching degree; the match for the voiceprint feature vector to be retrieved is obtained by a K-dimensional tree search in the Pymagnitude feature retrieval database.
7. The command word recognition method of claim 6, wherein the search matching the command word feature vector based on the feature search database comprises:
extracting a command word feature vector to be retrieved and storing it in a Pymagnitude feature retrieval database;
calculating the cosine similarity between the command word feature vector and the command word feature vector to be retrieved, and obtaining the second cosine similarity with the highest matching degree; the match for the command word feature vector to be retrieved is obtained by a K-dimensional tree search in the Pymagnitude feature retrieval database.
8. The method of claim 7, wherein when the voiceprint feature vector meets a first predetermined threshold and the command word feature vector meets a second predetermined threshold, the identifying is successful, comprising:
and if the first cosine similarity is smaller than or equal to the first preset threshold value and the second cosine similarity is smaller than or equal to the second preset threshold value, the identification is successful.
9. The command word recognition method of claim 8, wherein the method further comprises:
terminating the identification when the first cosine similarity is greater than the first preset threshold; or alternatively, the first and second heat exchangers may be,
and when the second cosine similarity is larger than the second preset threshold value, terminating the identification.
10. The command word recognition method of claim 1, wherein in the step of training the voiceprint feature vector and the command word feature vector using the modified triplet loss function, respectively, a data enhancement operation is performed on voice training data, comprising:
performing framing operation on the voice data;
stretching the voice data based on a time dimension to generate the voice training data with different speech speeds; or adding the voice data into noise to generate the voice training data with different noise.
11. The command word recognition method of claim 10, wherein the performing a framing operation on the voice data comprises:
uniformly collecting a preset number of observation points on the voice data by utilizing a sliding window;
taking the preset neighborhood of the observation point as a framing; wherein, the frame division and the frame division are overlapped by a preset part.
12. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a command word recognition method program, which, when executed by a processor, implements the steps of the method of any of claims 1-11.
13. A command word recognition device comprising a memory, a processor and a command word recognition method program stored on said memory and executable on said processor, said processor implementing the steps of the method of any one of claims 1-11 when said command word recognition method program is executed.
CN202011431850.9A 2020-12-09 2020-12-09 Command word recognition method, device and computer storage medium Active CN112634869B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011431850.9A CN112634869B (en) 2020-12-09 2020-12-09 Command word recognition method, device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011431850.9A CN112634869B (en) 2020-12-09 2020-12-09 Command word recognition method, device and computer storage medium

Publications (2)

Publication Number Publication Date
CN112634869A CN112634869A (en) 2021-04-09
CN112634869B true CN112634869B (en) 2023-05-26

Family

ID=75308976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011431850.9A Active CN112634869B (en) 2020-12-09 2020-12-09 Command word recognition method, device and computer storage medium

Country Status (1)

Country Link
CN (1) CN112634869B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331673B (en) * 2022-10-14 2023-01-03 北京师范大学 Voiceprint recognition household appliance control method and device in complex sound scene

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179941A (en) * 2020-01-06 2020-05-19 科大讯飞股份有限公司 Intelligent device awakening method, registration method and device
US10706857B1 (en) * 2020-04-20 2020-07-07 Kaizen Secure Voiz, Inc. Raw speech speaker-recognition

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2573809B (en) * 2018-05-18 2020-11-04 Emotech Ltd Speaker Recognition
US20200322377A1 (en) * 2019-04-08 2020-10-08 Pindrop Security, Inc. Systems and methods for end-to-end architectures for voice spoofing detection

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179941A (en) * 2020-01-06 2020-05-19 科大讯飞股份有限公司 Intelligent device awakening method, registration method and device
US10706857B1 (en) * 2020-04-20 2020-07-07 Kaizen Secure Voiz, Inc. Raw speech speaker-recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Comparison of Metric Learning Loss Functions for End-To-End Speaker Verification; Juan M. Coria et al.; arXiv; 1-7 *
Additive Margin SincNet for Speaker Recognition; João Antônio Chagas Nunes et al.; International Joint Conference on Neural Networks, 2019; 1-5 *

Also Published As

Publication number Publication date
CN112634869A (en) 2021-04-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant