CN118072715A - Voice keyword detection method, system, storage medium and electronic equipment - Google Patents
- Publication number
- CN118072715A (application number CN202410143591.1A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- vector
- embedded
- voice
- acoustic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a voice keyword detection method, system, storage medium and electronic device. The voice keyword detection method comprises the following steps: preprocessing a plurality of keywords to be detected to obtain embedded keywords; acquiring acoustic embedding features of the speech; passing the embedded keywords and the acoustic embedding features through a preset number of sequentially connected Transformer encoding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector; calculating similarity scores between the keywords and the acoustic embedding features based on the keyword start embedding vector, the keyword end embedding vector and the acoustic output vector; acquiring a keyword bias for each keyword based on the similarity scores; and element-wise multiplying the acoustic output vector by the keyword bias and then decoding to obtain a keyword detection result. The voice keyword detection method, system, storage medium and electronic device of the invention can effectively improve the recall rate of voice keyword detection.
Description
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a voice keyword detection method, a voice keyword detection system, a storage medium and electronic equipment.
Background
With the progress of data processing technology and the rapid popularization of the mobile internet, computer technology has been widely applied in all fields of society, and massive amounts of data are generated as a result. Among these, voice data is receiving increasing attention. Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the lexical content of human speech into computer-readable input, such as key presses, binary codes or character sequences.
In recent years, speech recognition technology has made remarkable progress and has entered many fields such as industry, home appliances, communications, automotive electronics, medical care, home services and consumer electronics. Speech recognition is an important branch of artificial intelligence; it involves multiple disciplines including signal processing, computer science, linguistics, acoustics, physiology and psychology, and is a key link in natural human-machine interaction.
Keyword detection (keyword spotting) is a sub-field of speech recognition whose purpose is to detect all positions at which specified words occur in a speech signal. Commonly used keyword detection approaches in the prior art include keyword detection based on a keyword/filler model, example-based keyword detection, and keyword detection based on a large-vocabulary continuous speech recognition system. However, the keyword recall rate of existing voice keyword detection methods is not high and cannot meet practical application requirements.
Disclosure of Invention
In view of the above drawbacks of the prior art, the object of the present invention is to provide a voice keyword detection method, system, storage medium and electronic device that can effectively improve the recall rate of voice keyword detection and meet practical application requirements.
In a first aspect, the present invention provides a voice keyword detection method, comprising the following steps: preprocessing a plurality of keywords to be detected to obtain embedded keywords; acquiring acoustic embedding features of the speech; passing the embedded keywords and the acoustic embedding features through a preset number of sequentially connected Transformer encoding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector; calculating similarity scores between the keywords and the acoustic embedding features based on the keyword start embedding vector, the keyword end embedding vector and the acoustic output vector; acquiring a keyword bias for each keyword based on the similarity scores; and element-wise multiplying the acoustic output vector by the keyword bias and then decoding to obtain a keyword detection result.
In one implementation of the first aspect, preprocessing the plurality of keywords to be detected includes the following steps:
for each keyword, adding a start character and an end character to obtain a standard keyword;
and embedding each standard keyword and reshaping it to a preset size to obtain the embedded keywords.
In one implementation of the first aspect, acquiring the acoustic embedding features of the speech includes the following steps:
Acquiring fbank features of the voice;
Inputting the fbank features into a one-dimensional convolution module to obtain embedded features;
Inputting the embedded features into a preset number of sequentially connected tri-modal attention blocks to obtain the acoustic embedding features; the tri-modal attention block is used to implement attention processing based on semantics, position and region.
In one implementation of the first aspect, calculating the similarity score between the keyword and the acoustic embedding features based on the keyword start embedding vector, the keyword end embedding vector and the acoustic output vector includes the following steps:
element-wise adding the keyword start embedding vector and the keyword end embedding vector, and inputting the sum into a multi-layer perceptron to obtain a keyword library;
passing the acoustic output vector through a preset number of tri-modal attention blocks and then performing a matrix dot-product operation with the keyword library to obtain a dot-product result; the tri-modal attention block is used to implement attention processing based on semantics, position and region;
and passing the dot-product result through a sigmoid function to obtain the similarity score.
In one implementation of the first aspect, acquiring the keyword bias of each keyword based on the similarity score includes the following steps:
selecting the maximum similarity score corresponding to each keyword;
acquiring the specific keywords whose maximum similarity score is greater than a preset value;
and keeping the position elements corresponding to the specific keywords in the dot-product result unchanged, setting all other positions to 1, and then inputting the result into a multi-layer perceptron and a sigmoid function to obtain the keyword bias.
In one implementation of the first aspect, the tri-modal attention block performs the following operations:
dividing an input vector into 4 sub-vectors and shaping them into shaped sub-vectors of a preset shape;
inputting the shaped sub-vectors into a semantic attention module to obtain a semantic attention map;
multiplying the shaped sub-vectors by the semantic attention map to obtain a matrix A;
inputting the matrix A into a position attention module to obtain a position attention map;
multiplying the matrix A by the position attention map to obtain a matrix B;
inputting the matrix B into a region attention module to obtain a region attention map;
and multiplying the matrix B by the region attention map, then shaping the result and inputting it into a one-dimensional convolution module to obtain an output vector.
In one implementation of the first aspect, a CTC decoder is used to decode the result of element-wise multiplying the acoustic output vector by the keyword bias.
In a second aspect, the invention provides a voice keyword detection system, which comprises a first acquisition module, a second acquisition module, a third acquisition module, a calculation module, a fourth acquisition module and a detection module;
the first acquisition module is used for preprocessing a plurality of keywords to be detected to acquire embedded keywords;
The second acquisition module is used for acquiring the acoustic embedded characteristics of the voice;
the third acquisition module is used for passing the embedded keywords and the acoustic embedding features through a preset number of sequentially connected Transformer encoding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector;
The computing module is used for computing similarity scores of the keywords and the acoustic embedded features based on the keyword start embedded vector, the keyword end embedded vector and the acoustic output vector;
The fourth obtaining module is used for obtaining keyword bias of each keyword based on the similarity score;
and the detection module is used for element-wise multiplying the acoustic output vector by the keyword bias and then decoding to obtain a keyword detection result.
In a third aspect, the present invention provides an electronic device comprising: a processor and a memory;
the memory is used for storing a computer program;
The processor is configured to execute the computer program stored in the memory, so that the electronic device executes the voice keyword detection method.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by an electronic device, implements the above-described voice keyword detection method.
As described above, the voice keyword detection method, system, storage medium and electronic device of the present invention have the following beneficial effects:
(1) The recall rate of voice keyword detection can be effectively improved;
(2) The degree of intelligence is high and the practicability is strong.
Drawings
FIG. 1 is a schematic view of an electronic device according to an embodiment of the invention;
FIG. 2 is a flowchart of a method for detecting a voice keyword according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a tri-modal attention block according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a voice keyword detection system according to an embodiment of the invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the invention by way of specific examples. The invention can also be implemented or applied through other, different embodiments, and the details in this specification can be modified or changed based on different viewpoints and applications without departing from the spirit of the invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other where no conflict arises.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
The following embodiments of the present invention provide a voice keyword detection method, which can be applied to an electronic device as shown in fig. 1. The electronic device in the present invention may include a mobile phone 11, a tablet computer 12, a notebook computer 13, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, an Ultra-Mobile Personal Computer (UMPC), a netbook, a Personal Digital Assistant (PDA), etc. with a wireless charging function; the specific type of the electronic device is not limited in the embodiments of the present invention.
For example, the electronic device may be a Station (ST) in a wireless-charging-enabled WLAN, a wireless-charging-enabled cellular telephone, a cordless telephone, a Session Initiation Protocol (SIP) telephone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a wireless-charging-enabled handheld device, a computing device or other processing device, a computer, a laptop computer, a handheld communication device, a handheld computing device, and/or other devices for communicating over a wireless system, as well as a mobile terminal in a next-generation communication system, such as a 5G network, a future evolved Public Land Mobile Network (PLMN), or a future evolved Non-Terrestrial Network (NTN), etc.
For example, the electronic device may communicate with networks and other devices via wireless communications. The wireless communications may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), BT, GNSS, WLAN, NFC, FM and/or IR techniques, among others. The GNSS may include the Global Positioning System (GPS), the Global Navigation Satellite System (GLONASS), the BeiDou Navigation Satellite System (BDS), the Quasi-Zenith Satellite System (QZSS) and/or Satellite Based Augmentation Systems (SBAS).
The following describes the technical solution in the embodiment of the present invention in detail with reference to the drawings in the embodiment of the present invention.
As shown in fig. 2 and 3, in an embodiment, the voice keyword detection method of the present invention includes steps S1 to S6.
Step S1, preprocessing a plurality of keywords to be detected to obtain embedded keywords.
Specifically, N keywords to be detected are acquired and preprocessed.
In one embodiment, preprocessing the plurality of keywords to be detected includes the following steps:
(1) For each keyword, a start symbol 'SOS' and an end symbol 'EOS' are added to obtain a standard keyword. L_w denotes the number of characters of the longest keyword (after adding the start and end symbols).
(2) Each standard keyword is embedded and reshaped to a preset size to obtain the embedded keywords. All standard keywords are jointly passed through an embedding layer to obtain the embedded keyword embeddings_w of shape (N, L_w, D), where D is the embedding dimension; a reshape is then performed, giving the shape (N×L_w, D).
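As a minimal sketch of this preprocessing step, the fragment below builds the padded SOS/.../EOS character sequences and embeds them; the example keywords, the character vocabulary and the embedding dimension D = 256 are illustrative assumptions, not values fixed by the invention.

```python
import torch
import torch.nn as nn

keywords = ["hello", "ok go", "stop"]                 # example keywords (assumed)
N = len(keywords)
chars = sorted({c for w in keywords for c in w})
vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, **{c: i + 3 for i, c in enumerate(chars)}}

L_w = max(len(w) for w in keywords) + 2               # longest keyword + SOS/EOS
ids = torch.zeros(N, L_w, dtype=torch.long)           # zero-padded id matrix
for i, w in enumerate(keywords):
    seq = [vocab["<sos>"]] + [vocab[c] for c in w] + [vocab["<eos>"]]
    ids[i, :len(seq)] = torch.tensor(seq)

D = 256                                               # embedding dimension (assumed)
embed = nn.Embedding(len(vocab), D)
embeddings_w = embed(ids)                             # (N, L_w, D)
embeddings_w = embeddings_w.reshape(N * L_w, D)       # reshape to (N*L_w, D)
```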
And S2, acquiring the acoustic embedding features of the speech.
Specifically, acquiring the acoustic embedding features of the speech includes the following steps:
(1) The fbank features of the speech are acquired.
(2) The fbank features are input into a one-dimensional convolution module to obtain embedded features.
In the one-dimensional convolution module, kernel size = 2, stride = 2 and channels = D. The embedded features have shape (L_f, D).
(3) The embedded features are input into a preset number of sequentially connected tri-modal attention blocks to obtain the acoustic embedding features; the tri-modal attention block is used to implement attention processing based on semantics, position and region.
The acoustic embedding feature embeddings_a has shape (L_f, D).
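A minimal sketch of this acoustic front end is given below; the 80-dimensional fbank input and the 200-frame utterance length are illustrative assumptions, and the tri-modal attention stack (sketched later in this description) is represented by a placeholder identity module.

```python
import torch
import torch.nn as nn

D = 256                                           # embedding dimension (assumed)
conv = nn.Conv1d(in_channels=80, out_channels=D,  # 80 mel bins assumed
                 kernel_size=2, stride=2)         # kernel size = 2, stride = 2

fbank = torch.randn(1, 80, 200)                   # (batch, mel bins, frames), assumed
emb = conv(fbank).transpose(1, 2).squeeze(0)      # embedded features, (L_f, D), L_f = 100

tri_modal_blocks = nn.Identity()                  # placeholder for the tri-modal stack
embeddings_a = tri_modal_blocks(emb)              # acoustic embedding, (L_f, D)
```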
And S3, the embedded keywords and the acoustic embedding features are passed through a preset number of sequentially connected Transformer encoding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector.
Specifically, the embedded keyword embeddings_w and the acoustic embedding feature embeddings_a are concatenated along the time dimension as input, with shape (N×L_w + L_a, D), where L_a = L_f denotes the acoustic sequence length. Inside each Transformer encoding block are the conventional self-attention mechanism, FFN, layer-norm and similar operations; the preset number is 32. The Transformer encoding blocks let the embedded keywords and the acoustic embedding features attend to each other, and the output vector has the same shape as the input. In the invention, however, only three parts of the output participate in the subsequent operations (the middle part between the SOS and EOS embeddings is not used): the keyword start embedding vector (the embeddings of SOS), the keyword end embedding vector (the embeddings of EOS) and the acoustic output vector, with shapes (N, 1, D), (N, 1, D) and (L_a, D), respectively. Note that in the two (N, 1, D) shapes, N represents the N keywords, 1 represents the SOS/EOS position and D represents the embedding dimension; the SOS/EOS vectors summarize the information of the N keywords.
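The following sketch shows the concatenation and the extraction of the three output parts; the sizes N, L_w, L_a and D, the head count and the use of torch.nn.TransformerEncoder are illustrative assumptions (the patent only specifies 32 sequentially connected encoding blocks), and for simplicity the EOS vector is taken from the last slot, which assumes each keyword fills its padded length.

```python
import torch
import torch.nn as nn

N, L_w, L_a, D = 10, 8, 50, 256                       # assumed sizes
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=32)  # preset number = 32

embeddings_w = torch.randn(1, N * L_w, D)             # reshaped keyword embeddings
embeddings_a = torch.randn(1, L_a, D)                 # acoustic embedding features

out = encoder(torch.cat([embeddings_w, embeddings_a], dim=1))  # (1, N*L_w + L_a, D)
kw = out[0, :N * L_w].view(N, L_w, D)                 # per-keyword view of the output
sos_vec = kw[:, :1, :]                                # SOS embeddings, (N, 1, D)
eos_vec = kw[:, -1:, :]                               # EOS embeddings, (N, 1, D)
acoustic_out = out[0, N * L_w:]                       # acoustic output vector, (L_a, D)
```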
And S4, calculating the similarity score of the keyword and the acoustic embedded feature based on the keyword initial embedded vector, the keyword end embedded vector and the acoustic output vector.
Specifically, calculating the similarity score of the keyword and the acoustic embedding feature based on the keyword start embedding vector, the keyword end embedding vector, and the acoustic output vector includes the steps of:
(1) The keyword start embedding vector and the keyword end embedding vector are added element-wise, and the sum is input into a multi-layer perceptron (MLP) to obtain a keyword library.
The keyword library (gallery of keywords) has shape (N, D): it represents the N keywords as vectors of length D.
(2) The acoustic output vector is passed through a preset number of tri-modal attention blocks, and a matrix dot-product operation is then performed with the keyword library to obtain a dot-product result.
The dot-product result is denoted A and has shape (L_a, N).
(3) The dot-product result is passed through a sigmoid function to obtain the similarity score.
The sigmoid function compresses the dot-product result to values between 0 and 1, giving the similarity score of shape (L_a, N), which represents the degree of similarity between the L_a acoustic output vectors and the N keywords.
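A minimal sketch of this scoring step, under the same assumed sizes as above; the MLP widths are illustrative, and the pass of the acoustic output through the tri-modal attention blocks is again represented by a placeholder.

```python
import torch
import torch.nn as nn

N, L_a, D = 10, 50, 256                          # assumed sizes
mlp = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))  # assumed widths

sos_vec = torch.randn(N, 1, D)                   # keyword start embedding vectors
eos_vec = torch.randn(N, 1, D)                   # keyword end embedding vectors
acoustic_out = torch.randn(L_a, D)               # acoustic output vector

gallery = mlp((sos_vec + eos_vec).squeeze(1))    # keyword library, (N, D)
attended = nn.Identity()(acoustic_out)           # placeholder for tri-modal blocks
A = attended @ gallery.t()                       # dot-product result, (L_a, N)
score = torch.sigmoid(A)                         # similarity score in (0, 1)
```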
And S5, acquiring the keyword bias of each keyword based on the similarity score.
Specifically, acquiring the keyword bias of each keyword based on the similarity score includes the following steps:
(1) The maximum similarity score corresponding to each keyword is selected.
Specifically, top-1 filtering is performed on the similarity scores along the N dimension, selecting L_a keyword indexes in total, such as 56 (0), 12 (1), 70 (2) and 0 (L_a-1). Here 56 means that after top-1 selection the 56th keyword has the highest similarity score, and (0) means that it corresponds to the vector at position 0 of the acoustic output. In this way, each vector of the acoustic output corresponds to one keyword.
(2) The specific keywords whose maximum similarity score is greater than a preset value are acquired.
The similarity scores correspond one-to-one with the elements of the dot-product result A; all positions whose top-1 similarity score is greater than a preset value, such as 0.95, are recorded.
(3) The position elements corresponding to the specific keywords in the dot-product result are kept unchanged, all other positions are set to 1, and the result is input into a multi-layer perceptron and a sigmoid function to obtain the keyword bias.
For all positions in the similarity scores whose top-1 value is greater than the preset value (e.g. 0.95), the element values at the same positions in the dot-product result A are kept unchanged, while all other parts of A are replaced with 1; A is then passed through a multi-layer perceptron (MLP) and a sigmoid function, outputting the keyword bias of shape (L_a, 1). This assigns a weight to each vector of the acoustic output, with larger weights given where the similarity score exceeds 0.95.
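The masking and bias computation can be sketched as follows; the threshold 0.95 comes from the text, while the MLP widths and the reduction over the keyword dimension N (implied by the stated (L_a, 1) output shape) are assumptions.

```python
import torch
import torch.nn as nn

N, L_a = 10, 50                                      # assumed sizes
A = torch.randn(L_a, N)                              # dot-product result from step S4
score = torch.sigmoid(A)                             # similarity scores, (L_a, N)

top_val, top_idx = score.max(dim=1)                  # top-1 keyword per acoustic frame
keep = torch.zeros(L_a, N, dtype=torch.bool)
keep[torch.arange(L_a), top_idx] = top_val > 0.95    # keep only confident top-1 hits
A_masked = torch.where(keep, A, torch.ones_like(A))  # all other positions set to 1

bias_mlp = nn.Sequential(nn.Linear(N, N), nn.ReLU(), nn.Linear(N, 1))  # assumed widths
keyword_bias = torch.sigmoid(bias_mlp(A_masked))     # keyword bias, (L_a, 1)
```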
And S6, element-wise multiplying the acoustic output vector by the keyword bias, and then decoding to obtain a keyword detection result.
A CTC decoder is used to decode the result of element-wise multiplying the acoustic output vector by the keyword bias.
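The patent does not specify the decoding strategy, so the sketch below uses the simplest greedy CTC decode; the vocabulary size, the output projection and blank id 0 are assumptions.

```python
import torch
import torch.nn as nn

L_a, D, V, blank_id = 50, 256, 4000, 0        # assumed sizes; 0 as the CTC blank
acoustic_out = torch.randn(L_a, D)            # acoustic output vector (step S3)
keyword_bias = torch.rand(L_a, 1)             # keyword bias (step S5)
proj = nn.Linear(D, V)                        # assumed projection to the vocabulary

logits = proj(acoustic_out * keyword_bias)    # element-wise multiply, then project
path = logits.argmax(dim=-1).tolist()         # greedy best path over frames
hyp = [p for i, p in enumerate(path)          # collapse repeats and drop blanks
       if p != blank_id and (i == 0 or p != path[i - 1])]
```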
As shown in fig. 3, in the present invention, the tri-modal attention block performs the following operations:
A) The input vector embeddings of shape (L_a, D) is split along D into 4 sub-vectors of shape (L_a, D/4), which are then shaped into shaped sub-vectors of a preset shape. The preset shape is (4, 2, L_a/2, D/4).
B) The shaped sub-vectors are input into the semantic attention module to obtain a semantic attention map.
The semantic attention module takes an input of shape (4, L_a, D/4); after a reshape, the shape is (4, 2, L_a/2, D/4), where 2 represents h, L_a/2 represents w and D/4 represents c. A max pooling operation and an average pooling operation are then performed (over all dimensions except c), each followed by its own multi-layer perceptron, yielding a max feature map and an average feature map, both of shape (1, 1, 1, c). The two feature maps are added element-wise and finally passed through a sigmoid function to obtain the semantic attention map of shape (1, 1, 1, c).
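A sketch of this module is given below; it follows the shapes stated in the text, with the MLP widths assumed, and the pooling implemented over every axis except the channel axis.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Channel-wise attention over c for input of shape (4, h, w, c)."""
    def __init__(self, c):
        super().__init__()
        self.mlp_max = nn.Sequential(nn.Linear(c, c), nn.ReLU(), nn.Linear(c, c))
        self.mlp_avg = nn.Sequential(nn.Linear(c, c), nn.ReLU(), nn.Linear(c, c))

    def forward(self, x):                            # x: (4, h, w, c)
        mx = x.amax(dim=(0, 1, 2), keepdim=True)     # max pool, all dims but c
        av = x.mean(dim=(0, 1, 2), keepdim=True)     # avg pool, all dims but c
        return torch.sigmoid(self.mlp_max(mx) + self.mlp_avg(av))  # (1, 1, 1, c)
```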
C) The shaped sub-vectors are multiplied by the semantic attention map to obtain a matrix A (given the attention map's shape, this is a broadcast multiplication).
D) The matrix A is input into the position attention module to obtain a position attention map.
The position attention module takes an input of shape (4, 2, L_a/2, D/4), i.e. (4, h, w, c), and performs a max pooling operation and an average pooling operation over all dimensions except h and w, obtaining a max feature map of shape (1, h, w, 1) and an average feature map of shape (1, h, w, 1). The two feature maps are concatenated and then passed sequentially through a two-dimensional convolution module and a sigmoid function to obtain the position attention map of shape (1, h, w, 1).
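A sketch of the position attention module; the translation is ambiguous about the module applied after concatenation ("two-dimensional average value module"), so a 2D convolution with an assumed kernel size of 7 is used here.

```python
import torch
import torch.nn as nn

class PositionalAttention(nn.Module):
    """Spatial attention over (h, w) for input of shape (4, h, w, c)."""
    def __init__(self, kernel_size=7):               # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                            # x: (4, h, w, c)
        mx = x.amax(dim=(0, 3), keepdim=True)        # pool all dims except h, w
        av = x.mean(dim=(0, 3), keepdim=True)        # both maps are (1, h, w, 1)
        m = torch.cat([mx, av], dim=3)               # concatenate -> (1, h, w, 2)
        m = self.conv(m.permute(0, 3, 1, 2))         # NCHW conv -> (1, 1, h, w)
        return torch.sigmoid(m).permute(0, 2, 3, 1)  # position map, (1, h, w, 1)
```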
E) The matrix A is multiplied by the position attention map to obtain a matrix B.
F) The matrix B is input into the region attention module (slice attention module) to obtain a region attention map.
The region attention module takes an input of shape (4, 2, L_a/2, D/4) and performs a max pooling operation and an average pooling operation over all dimensions except the first (the 4 slices), obtaining, through the multi-layer perceptron, a max feature map of shape (4, 1, 1, 1) and an average feature map of shape (4, 1, 1, 1). The two are added element-wise, and finally a tanh function performs a nonlinear mapping to the range -1 to 1, giving the region attention map of shape (4, 1, 1, 1).
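A sketch of the region (slice) attention module; the shared MLP over the pooled scalars is an assumed reading of "through the multi-layer perceptron".

```python
import torch
import torch.nn as nn

class SliceAttention(nn.Module):
    """Per-slice attention over the 4 sub-vectors, input shape (4, h, w, c)."""
    def __init__(self, hidden=4):                    # hidden width is an assumption
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):                            # x: (4, h, w, c)
        mx = x.amax(dim=(1, 2, 3), keepdim=True)     # pool all dims except slice
        av = x.mean(dim=(1, 2, 3), keepdim=True)     # both maps are (4, 1, 1, 1)
        return torch.tanh(self.mlp(mx) + self.mlp(av))  # map to (-1, 1), (4, 1, 1, 1)
```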
G) The matrix B is multiplied by the region attention map, and the result is reshaped and input into a one-dimensional convolution module to obtain the output vector. In this one-dimensional convolution module, kernel size = 1.
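Finally, the whole tri-modal attention block can be assembled from the three module sketches above; the even L_a and the divisibility of D by 4 are requirements implied by the stated shapes, and the attention maps are applied by broadcast multiplication.

```python
import torch
import torch.nn as nn

class TriModalAttentionBlock(nn.Module):
    """Assembles the three attention module sketches defined above."""
    def __init__(self, L_a, D):
        super().__init__()
        assert L_a % 2 == 0 and D % 4 == 0           # implied by the stated shapes
        self.L_a, self.D = L_a, D
        self.semantic = SemanticAttention(D // 4)
        self.positional = PositionalAttention()
        self.region = SliceAttention()
        self.conv = nn.Conv1d(D, D, kernel_size=1)   # kernel size = 1 per the text

    def forward(self, x):                            # x: (L_a, D)
        s = torch.stack(x.chunk(4, dim=1))           # 4 sub-vectors, (4, L_a, D/4)
        s = s.view(4, 2, self.L_a // 2, self.D // 4) # shape to (4, 2, L_a/2, D/4)
        a = s * self.semantic(s)                     # matrix A
        b = a * self.positional(a)                   # matrix B
        o = b * self.region(b)                       # apply the region attention map
        o = torch.cat(list(o.view(4, self.L_a, self.D // 4)), dim=1)  # back to (L_a, D)
        return self.conv(o.t().unsqueeze(0)).squeeze(0).t()           # (L_a, D)

block = TriModalAttentionBlock(L_a=50, D=256)        # example usage with assumed sizes
y = block(torch.randn(50, 256))                      # y has shape (50, 256)
```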
The protection scope of the voice keyword detection method according to the embodiment of the present invention is not limited to the execution sequence of the steps listed in the embodiment, and all the schemes implemented by adding or removing steps and replacing steps according to the prior art made by the principles of the present invention are included in the protection scope of the present invention.
The embodiment of the invention also provides a voice keyword detection system, which can realize the voice keyword detection method of the invention, but the implementation device of the voice keyword detection system of the invention comprises but is not limited to the structure of the voice keyword detection system listed in the embodiment, and all structural variations and substitutions of the prior art according to the principles of the invention are included in the protection scope of the invention.
As shown in fig. 4, in an embodiment, the voice keyword detection system of the present invention includes a first obtaining module 41, a second obtaining module 42, a third obtaining module 43, a calculating module 44, a fourth obtaining module 45, and a detecting module 46.
The first obtaining module 41 is configured to perform preprocessing on a plurality of keywords to be detected, and obtain embedded keywords.
The second acquisition module 42 is configured to acquire an acoustic embedded feature of the speech.
The third obtaining module 43 is connected to the first obtaining module 41 and the second obtaining module 42, and is configured to obtain a keyword start embedding vector, a keyword end embedding vector, and an acoustic output vector by using a preset number of sequentially connected transducer encoding blocks for embedding the keyword and the acoustic embedding feature.
The calculating module 44 is connected to the third obtaining module 43, and is configured to calculate a similarity score of the keyword and the acoustic embedding feature based on the keyword start embedding vector, the keyword end embedding vector, and the acoustic output vector.
The fourth obtaining module 45 is connected to the calculating module 44, and is configured to obtain a keyword bias of each keyword based on the similarity score.
The detection module 46 is connected to the third acquisition module 43 and the fourth acquisition module 45, and is configured to multiply the acoustic output vector and the keyword bias element, and then decode the multiplied acoustic output vector and the keyword bias element, to obtain a keyword detection result.
The structures and principles of the first acquisition module 41, the second acquisition module 42, the third acquisition module 43, the calculation module 44, the fourth acquisition module 45 and the detection module 46 correspond one-to-one with the steps of the above voice keyword detection method, so the description is not repeated here.
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus, or method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention. For example, functional modules/units in various embodiments of the invention may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The embodiment of the invention also provides a computer-readable storage medium. Those of ordinary skill in the art will appreciate that all or part of the steps in the methods implementing the above embodiments may be completed by a program instructing a processor, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disc, or any combination thereof. The storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing an integration of one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a Solid State Drive (SSD)), etc.
The embodiment of the invention also provides electronic equipment. The electronic device includes a processor and a memory.
The memory is used for storing a computer program.
The memory includes: various media capable of storing program codes, such as ROM, RAM, magnetic disk, U-disk, memory card, or optical disk.
The processor is connected with the memory and is used for executing the computer program stored in the memory so as to enable the electronic equipment to execute the voice keyword detection method.
Preferably, the processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
As shown in FIG. 5, the electronic device of the present invention is embodied in the form of a general purpose computing device. Components of an electronic device may include, but are not limited to: one or more processors or processing units 51, a memory 52, a bus 53 that connects the various system components, including the memory 52 and the processing unit 51.
Bus 53 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic devices typically include a variety of computer system readable media. Such media can be any available media that can be accessed by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 52 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 521 and/or cache memory 522. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 523 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be coupled to bus 53 through one or more data medium interfaces. Memory 52 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 524 having a set (at least one) of program modules 5241 may be stored in, for example, memory 52, such program modules 5241 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 5241 generally perform the functions and/or methods in the described embodiments of the invention.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, display, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., network card, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 54. And, the electronic device may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through the network adapter 55. As shown in fig. 5, the network adapter 55 communicates with other modules of the electronic device over the bus 53. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with an electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit the invention. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed in the present invention shall still be covered by the claims of the invention.
Claims (10)
1. A method for detecting a voice keyword, the method comprising the steps of:
preprocessing a plurality of keywords to be detected to obtain embedded keywords;
acquiring acoustic embedding features of the speech;
passing the embedded keywords and the acoustic embedding features through a preset number of sequentially connected Transformer encoding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector;
Calculating a similarity score of the keyword and the acoustic embedding feature based on the keyword start embedding vector, the keyword end embedding vector, and the acoustic output vector;
acquiring keyword bias of each keyword based on the similarity score;
and element-wise multiplying the acoustic output vector by the keyword bias, and then decoding to obtain a keyword detection result.
2. The voice keyword detection method of claim 1, wherein: the preprocessing of the keywords to be detected comprises the following steps:
for each keyword, adding a start character and an end character to obtain a standard keyword;
and embedding each standard keyword and reshaping it to a preset size to obtain the embedded keywords.
3. The voice keyword detection method of claim 1, wherein: acquiring the acoustic embedded features of speech includes the steps of:
Acquiring fbank features of the voice;
Inputting the fbank features into a one-dimensional convolution module to obtain embedded features;
Inputting the embedded features into a preset number of tri-modal attention blocks which are sequentially connected to obtain the acoustic embedded features;
The tri-modal attention block is used to implement attention processing based on semantics, position and region.
4. The voice keyword detection method of claim 1, wherein: calculating a similarity score of the keyword and the acoustic embedding feature based on the keyword start embedding vector, the keyword end embedding vector, and the acoustic output vector comprises the steps of:
element-wise adding the keyword start embedding vector and the keyword end embedding vector, and inputting the sum into a multi-layer perceptron to obtain a keyword library;
passing the acoustic output vector through a preset number of tri-modal attention blocks and then performing a matrix dot-product operation with the keyword library to obtain a dot-product result; the tri-modal attention block is used to implement attention processing based on semantics, position and region;
and passing the dot-product result through a sigmoid function to obtain the similarity score.
5. The voice keyword detection method of claim 4, wherein: calculating keyword bias of each keyword based on the similarity score includes the steps of:
Selecting the maximum similarity score corresponding to each keyword;
acquiring a specific keyword with the maximum similarity score larger than a preset value;
and keeping the position elements corresponding to the specific keywords in the dot-product result unchanged, setting all other positions to 1, and then inputting the result into a multi-layer perceptron and a sigmoid function to obtain the keyword bias.
6. The voice keyword detection method of claim 3 or 4, wherein: the tri-modal attention block performs the following operations:
dividing an input vector into 4 sub-vectors and shaping them into shaped sub-vectors of a preset shape;
inputting the shaped sub-vectors into a semantic attention module to obtain a semantic attention map;
multiplying the shaped sub-vectors by the semantic attention map to obtain a matrix A;
inputting the matrix A into a position attention module to obtain a position attention map;
multiplying the matrix A by the position attention map to obtain a matrix B;
inputting the matrix B into a region attention module to obtain a region attention map;
and multiplying the matrix B by the region attention map, then shaping the result and inputting it into a one-dimensional convolution module to obtain an output vector.
7. The voice keyword detection method of claim 1, wherein: a CTC decoder is used to decode the result of element-wise multiplying the acoustic output vector by the keyword bias.
8. The voice keyword detection system is characterized by comprising a first acquisition module, a second acquisition module, a third acquisition module, a calculation module, a fourth acquisition module and a detection module;
the first acquisition module is used for preprocessing a plurality of keywords to be detected to acquire embedded keywords;
The second acquisition module is used for acquiring the acoustic embedded characteristics of the voice;
the third acquisition module is used for passing the embedded keywords and the acoustic embedding features through a preset number of sequentially connected Transformer encoding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector;
The computing module is used for computing similarity scores of the keywords and the acoustic embedded features based on the keyword start embedded vector, the keyword end embedded vector and the acoustic output vector;
The fourth obtaining module is used for obtaining keyword bias of each keyword based on the similarity score;
and the detection module is used for element-wise multiplying the acoustic output vector by the keyword bias and then decoding to obtain a keyword detection result.
9. An electronic device, the electronic device comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory, so that the electronic device executes the voice keyword detection method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that the program, when executed by an electronic device, implements the voice keyword detection method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410143591.1A CN118072715A (en) | 2024-02-01 | 2024-02-01 | Voice keyword detection method, system, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410143591.1A CN118072715A (en) | 2024-02-01 | 2024-02-01 | Voice keyword detection method, system, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118072715A true CN118072715A (en) | 2024-05-24 |
Family
ID=91094820
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410143591.1A Pending CN118072715A (en) | 2024-02-01 | 2024-02-01 | Voice keyword detection method, system, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118072715A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||