CN118072715A - Voice keyword detection method, system, storage medium and electronic equipment - Google Patents

Voice keyword detection method, system, storage medium and electronic equipment Download PDF

Info

Publication number
CN118072715A
Authority
CN
China
Prior art keywords
keyword
vector
embedded
voice
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410143591.1A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request (请求不公布姓名)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mido Digital Technology Co ltd
Original Assignee
Shanghai Mido Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mido Digital Technology Co ltd filed Critical Shanghai Mido Digital Technology Co ltd
Priority to CN202410143591.1A priority Critical patent/CN118072715A/en
Publication of CN118072715A publication Critical patent/CN118072715A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0635 Training updating or merging of old and new templates; Mean values; Weighting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a voice keyword detection method, system, storage medium and electronic device. The voice keyword detection method comprises the following steps: preprocessing a plurality of keywords to be detected to obtain embedded keywords; acquiring acoustic embedding features of the voice; passing the embedded keywords and the acoustic embedding features through a preset number of sequentially connected Transformer coding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector; calculating a similarity score between the keywords and the acoustic embedding features based on the keyword start embedding vector, the keyword end embedding vector and the acoustic output vector; acquiring a keyword bias for each keyword based on the similarity score; and element-wise multiplying the acoustic output vector by the keyword bias and then decoding to obtain a keyword detection result. The voice keyword detection method, system, storage medium and electronic device can effectively improve the recall rate of voice keyword detection.

Description

Voice keyword detection method, system, storage medium and electronic equipment
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a voice keyword detection method, a voice keyword detection system, a storage medium and electronic equipment.
Background
With advances in data processing technology and the rapid spread of the mobile internet, computer technology is widely applied across society, and massive amounts of data are generated as a result. Among these, voice data is receiving increasing attention. Speech recognition technology, also known as automatic speech recognition (Automatic Speech Recognition, ASR), aims to convert the lexical content of human speech into computer-readable input, such as key presses, binary codes or character sequences.
In recent years, speech recognition technology has made remarkable progress and now reaches into industry, home appliances, communications, automotive electronics, medical care, home services, consumer electronics and many other fields. As an important branch of artificial intelligence, it involves disciplines such as signal processing, computer science, linguistics, acoustics, physiology and psychology, and is a key link in natural human-machine interaction.
Keyword detection (keyword spotting) is a sub-field of speech recognition whose purpose is to detect all locations at which specified words occur in a speech signal. Keyword detection approaches commonly used in the prior art include keyword detection based on filler models, sample-based (query-by-example) keyword detection, and keyword detection based on large-vocabulary continuous speech recognition systems. However, the keyword recall rate of existing voice keyword detection methods is not high and cannot meet practical application requirements.
Disclosure of Invention
In view of the above drawbacks of the prior art, the present invention aims to provide a voice keyword detection method, system, storage medium and electronic device, which can effectively improve the recall rate of voice keyword detection and meet practical application requirements.
In a first aspect, the present invention provides a voice keyword detection method comprising the following steps: preprocessing a plurality of keywords to be detected to obtain embedded keywords; acquiring acoustic embedding features of the voice; passing the embedded keywords and the acoustic embedding features through a preset number of sequentially connected Transformer coding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector; calculating a similarity score between the keywords and the acoustic embedding features based on the keyword start embedding vector, the keyword end embedding vector and the acoustic output vector; acquiring a keyword bias for each keyword based on the similarity score; and element-wise multiplying the acoustic output vector by the keyword bias and then decoding to obtain a keyword detection result.
In one implementation manner of the first aspect, preprocessing the plurality of keywords to be detected includes the following steps:
for each keyword, adding a start character and an end character to obtain a standard keyword;
and embedding and reconstructing each standard keyword into a preset size to obtain the embedded keywords.
In one implementation manner of the first aspect, acquiring the acoustic embedded feature of the speech includes the steps of:
Acquiring fbank features of the voice;
Inputting the fbank features into a one-dimensional convolution module to obtain embedded features;
Inputting the embedded features into a preset number of sequentially connected tri-modal attention blocks to obtain the acoustic embedded features; the tri-modal attention block is used to implement attention processing based on semantics, position and region.
In one implementation manner of the first aspect, calculating the similarity score of the keyword and the acoustic embedding feature based on the keyword start embedding vector, the keyword end embedding vector, and the acoustic output vector includes the steps of:
Element-wise adding the keyword start embedding vector and the keyword end embedding vector, and inputting the sum into a multi-layer perceptron to obtain a keyword library;
Performing matrix dot product operation on the acoustic output vector and the keyword library after passing through a preset number of tri-modal attention blocks, and obtaining a dot product operation result; the tri-modal attention block is used for realizing attention processing based on semantics, position and region;
and obtaining the similarity score by passing the dot product operation result through a sigmoid function.
In one implementation manner of the first aspect, calculating the keyword bias of each keyword based on the similarity score includes the following steps:
Selecting the maximum similarity score corresponding to each keyword;
acquiring a specific keyword with the maximum similarity score larger than a preset value;
and keeping the position elements corresponding to the specific keywords in the dot product operation result unchanged, setting the other positions to 1, and then inputting the result into a multi-layer perceptron and a sigmoid function to acquire the keyword bias.
In one implementation manner of the first aspect, the trimodal attention block performs the following operations:
dividing an input vector into 4 sub-vectors, and shaping the sub-vectors into shaped sub-vectors with preset shapes;
inputting the shaping sub-vector into a semantic attention module to obtain a semantic attention map;
performing matrix multiplication on the shaping sub-vector and the semantic attention map to obtain a matrix A;
Inputting the matrix A into a position attention module to obtain a position attention map;
Multiplying the matrix A by the position attention map matrix to obtain a matrix B;
Inputting the matrix B into a regional attention module to obtain a regional attention map;
And multiplying the matrix B by the regional attention map, then shaping the result and inputting it into a one-dimensional convolution module to obtain an output vector.
In one implementation manner of the first aspect, a CTC decoder is used to decode the result of the element-wise multiplication of the acoustic output vector and the keyword bias.
In a second aspect, the invention provides a voice keyword detection system, which comprises a first acquisition module, a second acquisition module, a third acquisition module, a calculation module, a fourth acquisition module and a detection module;
the first acquisition module is used for preprocessing a plurality of keywords to be detected to acquire embedded keywords;
The second acquisition module is used for acquiring the acoustic embedded characteristics of the voice;
the third acquisition module is used for passing the embedded keywords and the acoustic embedded features through a preset number of sequentially connected Transformer coding blocks to acquire a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector;
The computing module is used for computing similarity scores of the keywords and the acoustic embedded features based on the keyword start embedded vector, the keyword end embedded vector and the acoustic output vector;
The fourth obtaining module is used for obtaining keyword bias of each keyword based on the similarity score;
and the detection module is used for element-wise multiplying the acoustic output vector by the keyword bias and then decoding to obtain a keyword detection result.
In a third aspect, the present invention provides an electronic device comprising: a processor and a memory;
the memory is used for storing a computer program;
The processor is configured to execute the computer program stored in the memory, so that the electronic device executes the voice keyword detection method.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by an electronic device, implements the above-described voice keyword detection method.
As described above, the voice keyword detection method, system, storage medium and electronic device of the present invention have the following beneficial effects:
(1) The recall rate of voice keyword detection can be effectively improved;
(2) A high degree of intelligence and strong practicality.
Drawings
FIG. 1 is a schematic view of an electronic device according to an embodiment of the invention;
FIG. 2 is a flowchart of a method for detecting a voice keyword according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a tri-modal attention block according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a voice keyword detection system according to an embodiment of the invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure below, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details in this description may be modified or varied in various ways without departing from the spirit of the invention. It should be noted that, provided there is no conflict, the following embodiments and the features within them may be combined with one another.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the invention schematically; the drawings show only the components related to the invention rather than the number, shape and size of components in an actual implementation, where the form, quantity and proportion of each component may change arbitrarily and the component layout may be more complicated.
The following embodiments of the present invention provide a voice keyword detection method that can be applied to an electronic device as shown in fig. 1. The electronic device in the present invention may include a mobile phone 11, a tablet computer 12, a notebook computer 13, a wearable device, a vehicle-mounted device, an augmented reality (Augmented Reality, AR)/virtual reality (Virtual Reality, VR) device, an ultra-mobile personal computer (Ultra-Mobile Personal Computer, UMPC), a netbook, a personal digital assistant (Personal Digital Assistant, PDA) and the like with a wireless charging function; the embodiments of the present invention do not limit the specific type of the electronic device.
For example, the electronic device may be a station (ST) in a WLAN with a wireless charging function, a cellular telephone with a wireless charging function, a cordless telephone, a Session Initiation Protocol (SIP) telephone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with a wireless charging function, a computing device or other processing device, a computer, a laptop computer, a handheld communication device, a handheld computing device, and/or other devices for communicating over a wireless system, as well as a mobile terminal in a next-generation communication system such as a 5G network, a mobile terminal in a future evolved Public Land Mobile Network (PLMN), or a mobile terminal in a future evolved Non-Terrestrial Network (NTN), etc.
For example, the electronic device may communicate with networks and other devices via wireless communications. The wireless communications may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), BT, GNSS, WLAN, NFC, FM and/or IR techniques, among others. The GNSS may include the Global Positioning System (GPS), the Global Navigation Satellite System (GLONASS), the BeiDou Navigation Satellite System (BDS), the Quasi-Zenith Satellite System (QZSS) and/or Satellite Based Augmentation Systems (SBAS).
The following describes the technical solution in the embodiment of the present invention in detail with reference to the drawings in the embodiment of the present invention.
As shown in fig. 2 and 3, in an embodiment, the voice keyword detection method of the present invention includes steps S1 to S6.
Step S1, preprocessing a plurality of keywords to be detected to obtain embedded keywords.
Specifically, N keywords to be detected are acquired and preprocessed.
In one embodiment, preprocessing the plurality of keywords to be detected includes the following steps:
11) For each keyword, a start symbol 'SOS' and an end symbol 'EOS' are added to obtain a standard keyword, where L_w denotes the number of characters of the longest keyword (after the start and end symbols are added).
12) Each standard keyword is embedded and reconstructed to a preset size to obtain the embedded keywords. All standard keywords are jointly passed through an embedding layer to obtain the embedded keyword embeddings_w of shape (N, L_w, D), where D is the embedding dimension; a reshape then gives the shape (N*L_w, D).
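As an illustration of this preprocessing step, the following is a minimal sketch assuming PyTorch; the token ids (SOS_ID, EOS_ID, PAD_ID) and the helper names (pad_to_longest, KeywordEmbedder) are assumptions for the example, not names from the patent.

```python
import torch
import torch.nn as nn

SOS_ID, EOS_ID, PAD_ID = 1, 2, 0  # assumed special-token ids

def pad_to_longest(keywords: list[list[int]]) -> torch.Tensor:
    """Add SOS/EOS to each keyword and pad all to the longest length L_w."""
    framed = [[SOS_ID] + kw + [EOS_ID] for kw in keywords]
    l_w = max(len(kw) for kw in framed)
    framed = [kw + [PAD_ID] * (l_w - len(kw)) for kw in framed]
    return torch.tensor(framed)            # standard keywords, (N, L_w)

class KeywordEmbedder(nn.Module):
    def __init__(self, vocab_size: int, d: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d, padding_idx=PAD_ID)

    def forward(self, kw_ids: torch.Tensor) -> torch.Tensor:
        n, l_w = kw_ids.shape
        e = self.embed(kw_ids)             # embeddings_w, (N, L_w, D)
        return e.reshape(n * l_w, -1)      # reshaped to (N*L_w, D)
```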
And S2, acquiring acoustic embedded characteristics of the voice.
Specifically, acquiring the acoustic embedded features of speech includes the steps of:
21 Fbank features of the speech are acquired.
22 Inputting the fbank features into a one-dimensional convolution module to obtain embedded features.
In the one-dimensional convolution module, kernel size = 2, stride = 2 and the number of channels is D. The embedded features have shape (L_f, D).
23) The embedded features are input into a preset number of sequentially connected tri-modal attention blocks to obtain the acoustic embedded features; the tri-modal attention block is used to implement attention processing based on semantics, position and region.
The acoustic embedded feature embeddings_a has shape (L_f, D); below, its length is also written L_a.
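A sketch of the acoustic embedding pipeline under stated assumptions: torchaudio's Kaldi-compatible fbank supplies the features, the 80 mel bins and the block count are illustrative, and the tri-modal attention blocks are stubbed with nn.Identity (their internals are sketched later, after the block itself is described).

```python
import torch
import torch.nn as nn
import torchaudio

D, NUM_BLOCKS = 256, 4  # assumed embedding size and tri-modal block count

class AcousticEmbedder(nn.Module):
    def __init__(self):
        super().__init__()
        # kernel size = 2, stride = 2, D output channels, as described above
        self.conv = nn.Conv1d(80, D, kernel_size=2, stride=2)
        self.blocks = nn.ModuleList(nn.Identity() for _ in range(NUM_BLOCKS))

    def forward(self, wav: torch.Tensor, sr: int) -> torch.Tensor:
        # 80-bin fbank features of the speech, (num_frames, 80)
        fb = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=80,
                                               sample_frequency=sr)
        x = self.conv(fb.T.unsqueeze(0))   # (1, D, L_f)
        x = x.squeeze(0).T                 # embedded features, (L_f, D)
        for blk in self.blocks:            # tri-modal attention blocks (stubbed)
            x = blk(x)
        return x                           # acoustic embedding embeddings_a
```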
Step S3: the embedded keywords and the acoustic embedded features are passed through a preset number of sequentially connected Transformer coding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector.
Specifically, the embedded keyword embeddings_w and the acoustic embedded feature embeddings_a are concatenated along the time dimension as input, with shape (N*L_w + L_a, D). Inside each Transformer coding block are the conventional self-attention, FFN and layer-norm operations; the preset number is 32. The Transformer coding blocks let the embedded keywords and the acoustic embedded features attend to one another, and the output vector has the same shape as the input. In the present invention, however, only 3 parts of the output participate in subsequent operations: the keyword start embedding vector (the embeddings at the SOS positions), the keyword end embedding vector (the embeddings at the EOS positions) and the acoustic output vector (the middle part between the SOS and EOS embeddings is not used). Their shapes are (N, 1, D), (N, 1, D) and (L_a, D), respectively. Note that in the two (N, 1, D) shapes, N denotes the N keywords, 1 denotes the SOS/EOS position and D denotes the embedding dimension; the SOS/EOS vectors summarize the information of the N keywords.
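A minimal sketch of step S3, assuming PyTorch's standard nn.TransformerEncoder stands in for the Transformer coding blocks and that all keywords share the same padded length L_w so that the EOS embedding sits at the last position; the names and sizes are illustrative.

```python
import torch
import torch.nn as nn

def encode_and_split(emb_w: torch.Tensor,      # (N*L_w, D) embedded keywords
                     emb_a: torch.Tensor,      # (L_a, D) acoustic embedding
                     n: int, l_w: int,
                     encoder: nn.TransformerEncoder):
    x = torch.cat([emb_w, emb_a], dim=0)       # (N*L_w + L_a, D)
    y = encoder(x.unsqueeze(0)).squeeze(0)     # output keeps the input shape
    kw = y[: n * l_w].reshape(n, l_w, -1)      # per-keyword outputs
    sos = kw[:, :1, :]                         # (N, 1, D) start vectors
    eos = kw[:, -1:, :]                        # (N, 1, D) end vectors (EOS assumed last)
    acoustic_out = y[n * l_w:]                 # (L_a, D) acoustic output
    return sos, eos, acoustic_out

# illustrative encoder: 32 blocks as stated above, d_model and nhead assumed
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=32)
```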
And S4, calculating the similarity score of the keyword and the acoustic embedded feature based on the keyword initial embedded vector, the keyword end embedded vector and the acoustic output vector.
Specifically, calculating the similarity score of the keyword and the acoustic embedding feature based on the keyword start embedding vector, the keyword end embedding vector, and the acoustic output vector includes the steps of:
41) The keyword start embedding vector and the keyword end embedding vector are added element-wise, and the sum is input into a multi-layer perceptron (MLP) to obtain a keyword library.
The keyword library (gallery of keywords) has shape (N, D): it represents the N keywords as vectors of length D.
42) The acoustic output vector, after passing through a preset number of tri-modal attention blocks (which implement attention processing based on semantics, position and region), undergoes a matrix dot-product operation with the keyword library to obtain a dot-product result.
The dot-product result is denoted A, with shape (L_a, N).
43) The dot-product result is passed through a sigmoid function to obtain the similarity score.
The sigmoid function compresses the dot-product result to values between 0 and 1, yielding the similarity score of shape (L_a, N), which represents the degree of similarity between each of the L_a acoustic output vectors and each of the N keywords.
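A sketch of the similarity-score computation (steps 41 to 43), assuming PyTorch; the MLP width and the Identity stand-in for the tri-modal attention blocks are assumptions.

```python
import torch
import torch.nn as nn

def similarity_scores(sos: torch.Tensor,           # (N, 1, D)
                      eos: torch.Tensor,           # (N, 1, D)
                      acoustic_out: torch.Tensor,  # (L_a, D)
                      mlp: nn.Module,
                      tri_modal_blocks: nn.Module):
    gallery = mlp((sos + eos).squeeze(1))      # keyword library, (N, D)
    a_feat = tri_modal_blocks(acoustic_out)    # (L_a, D)
    dots = a_feat @ gallery.T                  # dot-product result A, (L_a, N)
    return torch.sigmoid(dots), dots           # scores in (0, 1), plus raw A

# illustrative MLP; nn.Identity() can stand in for the tri-modal blocks here
mlp = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
```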
And S5, acquiring the keyword deviation of each keyword based on the similarity score.
Specifically, calculating the keyword bias of each keyword based on the similarity score includes the steps of:
51 Selecting the maximum similarity score corresponding to each keyword.
Specifically, a top-1 selection is performed on the similarity score along the N dimension, yielding L_a keyword indices in total, e.g. 56 (0), 12 (1), 70 (2), ..., 0 (L_a-1), where 56 means that the 56th keyword has the highest similarity score after the top-1 selection and (0) means that it corresponds to the vector at position 0 of the acoustic output vector. In this way, each vector of the acoustic output corresponds to one keyword.
52 Acquiring the specific keywords with the maximum similarity score larger than a preset value.
The similarity score corresponds one-to-one with the dot-product result A; the positions in the similarity score whose top-1 value is greater than a preset value, e.g. 0.95, are recorded.
53) The position elements in the dot-product result that correspond to the specific keywords are kept unchanged, the other positions are set to 1, and the result is then input into a multi-layer perceptron and a sigmoid function to acquire the keyword bias.
For all positions in the similarity score whose top-1 value is greater than the preset value, e.g. 0.95, the element values at the same positions in the dot-product result A are kept unchanged, while all other parts of A are replaced with 1. A is then passed through a multi-layer perceptron and a sigmoid function (applied along the last dimension), outputting the keyword bias of shape (L_a, 1). This assigns a weight to each vector of the acoustic output, with a larger weight given where the similarity score exceeds 0.95.
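A sketch of the keyword-bias computation (steps 51 to 53) under the 0.95 threshold exemplified above; the bias MLP sizes and the keyword count are assumptions.

```python
import torch
import torch.nn as nn

def keyword_bias(scores: torch.Tensor,   # (L_a, N) similarity scores
                 dots: torch.Tensor,     # (L_a, N) dot-product result A
                 bias_mlp: nn.Module,
                 threshold: float = 0.95) -> torch.Tensor:
    top1 = scores.max(dim=1, keepdim=True).values       # best score per frame
    keep = (scores == top1) & (top1 > threshold)        # confident top-1 cells
    masked = torch.where(keep, dots, torch.ones_like(dots))  # others set to 1
    return torch.sigmoid(bias_mlp(masked))              # keyword bias, (L_a, 1)

# illustrative bias MLP; N = 100 keywords assumed
bias_mlp = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1))
```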
Step S6: the acoustic output vector and the keyword bias are multiplied element-wise, and the product is decoded to obtain the keyword detection result.
A CTC decoder is used to decode the result of the element-wise multiplication of the acoustic output vector and the keyword bias.
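A sketch of step S6 with a minimal greedy CTC collapse; the projection layer, vocabulary size and blank id are assumptions, and a full CTC beam-search decoder could be substituted.

```python
import torch
import torch.nn as nn

BLANK_ID = 0  # assumed CTC blank id

def detect_keywords(acoustic_out: torch.Tensor,  # (L_a, D)
                    bias: torch.Tensor,          # (L_a, 1) keyword bias
                    proj: nn.Linear) -> list[int]:
    logits = proj(acoustic_out * bias)      # element-wise multiply, then project
    path = logits.argmax(dim=-1).tolist()   # best token per frame
    out, prev = [], BLANK_ID
    for t in path:                          # CTC collapse: drop repeats and blanks
        if t != prev and t != BLANK_ID:
            out.append(t)
        prev = t
    return out                              # token ids forming the detection result

proj = nn.Linear(256, 5000)  # D = 256 and a 5000-token vocabulary assumed
```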
As shown in fig. 3, in the present invention, the tri-modal attention block performs the following operations:
A) The input vector embeddings (L_a, D) is split along D into 4 sub-vectors of shape (L_a, D/4), which are then shaped into shaped sub-vectors of a preset shape, namely (4, 2, L_a/2, D/4).
B) The shaped sub-vectors are input into a semantic attention module to obtain a semantic attention map.
The semantic attention module takes a vector of shape (4, L_a, D/4); after a reshape it has shape (4, 2, L_a/2, D/4), where 2 corresponds to h, L_a/2 to w and D/4 to c. A max pooling operation and an average pooling operation are then applied (each pooling over all dimensions except c), and the two results pass through 2 multi-layer perceptrons respectively to obtain a max feature map and an avg feature map, both of shape (1, c). The two maps are added element-wise and finally passed through a sigmoid function, giving the semantic attention map of shape (1, c).
C) The shaped sub-vectors are matrix-multiplied with the semantic attention map to obtain a matrix A.
D) The matrix A is input into a position attention module to obtain a position attention map.
The position attention module takes a vector of shape (4, 2, L_a/2, D/4), i.e. (4, h, w, c). A max pooling operation and an average pooling operation are applied (each pooling over all dimensions except h and w), giving a max feature map of shape (1, h, w, 1) and an avg feature map of shape (1, h, w, 1). The two maps are concatenated and then passed sequentially through a two-dimensional averaging module and a sigmoid function to obtain the position attention map of shape (1, h, w, 1).
E) The matrix A is matrix-multiplied with the position attention map to obtain a matrix B.
F) The matrix B is input into the regional (slice) attention module to obtain a regional attention map.
The regional (slice) attention module takes vectors of shape (4, 2, L_a/2, D/4). A max pooling operation and an average pooling operation are applied (each pooling over all dimensions except the first, of size 4), and the results pass through the multi-layer perceptron to obtain a max feature map of shape (4, 1, 1, 1) and an avg feature map of shape (4, 1, 1, 1). The two maps are added element-wise and finally passed through a tanh function, which maps the values nonlinearly into the range -1 to 1, giving the regional attention map of shape (4, 1, 1, 1).
G) The matrix B is matrix-multiplied with the regional attention map, and the result is shaped and input into a one-dimensional convolution module to obtain the output vector, where kernel size = 1 in the one-dimensional convolution module. A code sketch of the whole block is given below.
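A minimal sketch of one tri-modal attention block, assuming PyTorch. The pooling dimensions follow the shapes described above; broadcast multiplications stand in for the matrix multiplications, and the "two-dimensional averaging module" is read as a mean of the two pooled maps; these are assumptions, not the patent's definitive implementation.

```python
import torch
import torch.nn as nn

class TriModalAttention(nn.Module):
    """One tri-modal attention block for input of shape (L_a, D)."""
    def __init__(self, d: int, hidden: int = 16):
        super().__init__()
        c = d // 4
        self.sem_mlp = nn.Sequential(nn.Linear(c, hidden), nn.ReLU(),
                                     nn.Linear(hidden, c))
        self.slice_mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1))
        self.out_conv = nn.Conv1d(d, d, kernel_size=1)  # kernel size = 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        l_a, d = x.shape
        v = x.reshape(l_a, 4, d // 4).permute(1, 0, 2)   # 4 sub-vectors
        v = v.reshape(4, 2, l_a // 2, d // 4)            # (4, h, w, c)

        # semantic attention: pool over every dimension except c
        sem = torch.sigmoid(self.sem_mlp(v.amax(dim=(0, 1, 2))) +
                            self.sem_mlp(v.mean(dim=(0, 1, 2))))  # (c,)
        a = v * sem                                      # matrix A

        # position attention: pool over every dimension except h and w;
        # the averaging module is read as a mean of the two maps (assumption)
        mx = a.amax(dim=(0, 3), keepdim=True)            # (1, h, w, 1)
        av = a.mean(dim=(0, 3), keepdim=True)
        pos = torch.sigmoid((mx + av) / 2)               # (1, h, w, 1)
        b = a * pos                                      # matrix B

        # slice (regional) attention: pool over everything except dim 0
        sl = torch.tanh(self.slice_mlp(b.amax(dim=(1, 2, 3), keepdim=True)) +
                        self.slice_mlp(b.mean(dim=(1, 2, 3), keepdim=True)))
        y = b * sl                                       # (4, h, w, c)

        y = y.reshape(4, l_a, d // 4).permute(1, 0, 2).reshape(l_a, d)
        return self.out_conv(y.T.unsqueeze(0)).squeeze(0).T  # (L_a, D)
```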
The protection scope of the voice keyword detection method according to the embodiments of the present invention is not limited to the execution order of the steps listed in the embodiments; all schemes obtained by adding, removing or replacing steps according to the prior art under the principles of the present invention are included in the protection scope of the present invention.
The embodiments of the present invention also provide a voice keyword detection system that can implement the voice keyword detection method of the present invention; however, the implementation apparatus of the voice keyword detection system includes, but is not limited to, the structure listed in this embodiment, and all structural variations and substitutions made according to the principles of the present invention under the prior art are included in the protection scope of the present invention.
As shown in fig. 4, in an embodiment, the voice keyword detection system of the present invention includes a first obtaining module 41, a second obtaining module 42, a third obtaining module 43, a calculating module 44, a fourth obtaining module 45, and a detecting module 46.
The first obtaining module 41 is configured to perform preprocessing on a plurality of keywords to be detected, and obtain embedded keywords.
The second acquisition module 42 is configured to acquire an acoustic embedded feature of the speech.
The third obtaining module 43 is connected to the first obtaining module 41 and the second obtaining module 42, and is configured to pass the embedded keywords and the acoustic embedded features through a preset number of sequentially connected Transformer coding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector.
The calculating module 44 is connected to the third obtaining module 43, and is configured to calculate a similarity score of the keyword and the acoustic embedding feature based on the keyword start embedding vector, the keyword end embedding vector, and the acoustic output vector.
The fourth obtaining module 45 is connected to the calculating module 44, and is configured to obtain a keyword bias of each keyword based on the similarity score.
The detection module 46 is connected to the third obtaining module 43 and the fourth obtaining module 45, and is configured to element-wise multiply the acoustic output vector by the keyword bias and then decode the product to obtain a keyword detection result.
The structures and principles of the first obtaining module 41, the second obtaining module 42, the third obtaining module 43, the calculating module 44, the fourth obtaining module 45 and the detecting module 46 correspond one-to-one with the steps of the above voice keyword detection method, and are therefore not described again here.
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus, or method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention. For example, functional modules/units in various embodiments of the invention may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The embodiment of the invention also provides a computer-readable storage medium. Those of ordinary skill in the art will appreciate that all or part of the steps in the methods implementing the above embodiments may be completed by a program instructing a processor, and the program may be stored in a computer-readable storage medium, which is a non-transitory medium such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid-state disk, a magnetic tape, a floppy disk, an optical disc, or any combination thereof. The storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center containing one or more integrated available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk or magnetic tape), an optical medium (e.g., a digital video disc (DVD)) or a semiconductor medium (e.g., a solid-state drive (SSD)).
The embodiment of the invention also provides electronic equipment. The electronic device includes a processor and a memory.
The memory is used for storing a computer program.
The memory includes: various media capable of storing program codes, such as ROM, RAM, magnetic disk, U-disk, memory card, or optical disk.
The processor is connected with the memory and is used for executing the computer program stored in the memory so as to enable the electronic equipment to execute the voice keyword detection method.
Preferably, the processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP) and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
As shown in FIG. 5, the electronic device of the present invention is embodied in the form of a general purpose computing device. Components of an electronic device may include, but are not limited to: one or more processors or processing units 51, a memory 52, a bus 53 that connects the various system components, including the memory 52 and the processing unit 51.
Bus 53 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
Electronic devices typically include a variety of computer system readable media. Such media can be any available media that can be accessed by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 52 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 521 and/or cache memory 522. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 523 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be coupled to bus 53 through one or more data medium interfaces. Memory 52 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 524 having a set (at least one) of program modules 5241 may be stored in, for example, memory 52, such program modules 5241 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 5241 generally perform the functions and/or methods in the described embodiments of the invention.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, display, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., network card, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 54. And, the electronic device may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through the network adapter 55. As shown in fig. 5, the network adapter 55 communicates with other modules of the electronic device over the bus 53. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with an electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall still be covered by the claims of the invention.

Claims (10)

1. A method for detecting a voice keyword, the method comprising the steps of:
preprocessing a plurality of keywords to be detected to obtain embedded keywords;
acquiring acoustic embedding characteristics of voice;
passing the embedded keywords and the acoustic embedding features through a preset number of sequentially connected Transformer coding blocks to obtain a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector;
Calculating a similarity score of the keyword and the acoustic embedding feature based on the keyword start embedding vector, the keyword end embedding vector, and the acoustic output vector;
acquiring keyword bias of each keyword based on the similarity score;
and element-wise multiplying the acoustic output vector by the keyword bias and then decoding to obtain a keyword detection result.
2. The voice keyword detection method of claim 1, wherein: the preprocessing of the keywords to be detected comprises the following steps:
for each keyword, adding a start character and an end character to obtain a standard keyword;
and embedding and reconstructing each standard keyword into a preset size to obtain the embedded keywords.
3. The voice keyword detection method of claim 1, wherein: acquiring the acoustic embedded features of speech includes the steps of:
Acquiring fbank features of the voice;
Inputting the fbank features into a one-dimensional convolution module to obtain embedded features;
Inputting the embedded features into a preset number of tri-modal attention blocks which are sequentially connected to obtain the acoustic embedded features;
the tri-modal attention block is used to implement attention processing based on semantics, position and region.
4. The voice keyword detection method of claim 1, wherein: calculating a similarity score of the keyword and the acoustic embedding feature based on the keyword start embedding vector, the keyword end embedding vector, and the acoustic output vector comprises the steps of:
element-wise adding the keyword start embedding vector and the keyword end embedding vector, and inputting the sum into a multi-layer perceptron to obtain a keyword library;
Performing matrix dot product operation on the acoustic output vector and the keyword library after passing through a preset number of tri-modal attention blocks, and obtaining a dot product operation result; the tri-modal attention block is used for realizing attention processing based on semantics, position and region;
and obtaining the similarity score by passing the dot product operation result through a sigmoid function.
5. The voice keyword detection method of claim 4, wherein: calculating keyword bias of each keyword based on the similarity score includes the steps of:
Selecting the maximum similarity score corresponding to each keyword;
acquiring a specific keyword with the maximum similarity score larger than a preset value;
and keeping the position elements corresponding to the specific keywords in the dot product operation result unchanged, setting the other positions to 1, and then inputting the result into a multi-layer perceptron and a sigmoid function to acquire the keyword bias.
6. The voice keyword detection method of claim 3 or 4, wherein: the tri-modal attention block performs the following operations:
dividing an input vector into 4 sub-vectors, and shaping the sub-vectors into shaped sub-vectors with preset shapes;
inputting the shaping sub-vector into a semantic attention module to obtain a semantic attention map;
performing matrix multiplication on the shaping sub-vector and the semantic attention map to obtain a matrix A;
Inputting the matrix A into a position attention module to obtain a position attention map;
Multiplying the matrix A by the position attention map matrix to obtain a matrix B;
inputting the matrix B into a regional attention module to obtain a regional attention map;
and multiplying the matrix B by the regional attention map, then shaping the result and inputting it into a one-dimensional convolution module to obtain an output vector.
7. The voice keyword detection method of claim 1, wherein: and decoding the result of multiplying the acoustic output vector and the keyword bias element by adopting a CTC decoder.
8. The voice keyword detection system is characterized by comprising a first acquisition module, a second acquisition module, a third acquisition module, a calculation module, a fourth acquisition module and a detection module;
the first acquisition module is used for preprocessing a plurality of keywords to be detected to acquire embedded keywords;
The second acquisition module is used for acquiring the acoustic embedded characteristics of the voice;
the third acquisition module is used for passing the embedded keywords and the acoustic embedded features through a preset number of sequentially connected Transformer coding blocks to acquire a keyword start embedding vector, a keyword end embedding vector and an acoustic output vector;
The computing module is used for computing similarity scores of the keywords and the acoustic embedded features based on the keyword start embedded vector, the keyword end embedded vector and the acoustic output vector;
The fourth obtaining module is used for obtaining keyword bias of each keyword based on the similarity score;
and the detection module is used for element-wise multiplying the acoustic output vector by the keyword bias and then decoding to obtain a keyword detection result.
9. An electronic device, the electronic device comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory, so that the electronic device executes the voice keyword detection method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that the program, when executed by an electronic device, implements the voice keyword detection method of any one of claims 1 to 7.
CN202410143591.1A 2024-02-01 2024-02-01 Voice keyword detection method, system, storage medium and electronic equipment Pending CN118072715A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410143591.1A CN118072715A (en) 2024-02-01 2024-02-01 Voice keyword detection method, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410143591.1A CN118072715A (en) 2024-02-01 2024-02-01 Voice keyword detection method, system, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN118072715A (en) 2024-05-24

Family

ID=91094820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410143591.1A Pending CN118072715A (en) 2024-02-01 2024-02-01 Voice keyword detection method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN118072715A (en)

Similar Documents

Publication Publication Date Title
CN112071300B (en) Voice conversation method, device, computer equipment and storage medium
CN116912353B (en) Multitasking image processing method, system, storage medium and electronic device
CN118135571B (en) Image semantic segmentation method, system, storage medium and electronic equipment
CN117275461B (en) Multitasking audio processing method, system, storage medium and electronic equipment
CN118072715A (en) Voice keyword detection method, system, storage medium and electronic equipment
CN118314409B (en) Multi-mode image classification method, system, storage medium and electronic equipment
CN117975941A (en) Multi-attention multi-feature voice recognition method, system, storage medium and electronic equipment
CN118366082A (en) Audiovisual emotion recognition method, system, storage medium and electronic equipment
CN117746866B (en) Multilingual voice conversion text method, multilingual voice conversion text system, storage medium and electronic equipment
CN118338098B (en) Multi-mode video generation method, system, storage medium and electronic equipment
CN118314445B (en) Image multitasking method, system, storage medium and electronic device
CN118396120A (en) Structured information reasoning method, system, storage medium and electronic equipment
CN118609557A (en) Visual speech recognition method, system, storage medium and electronic equipment
CN118484559A (en) Image description screening method, system, storage medium and electronic equipment
CN117351973A (en) Tone color conversion method, system, storage medium and electronic equipment
CN118296186B (en) Video advertisement detection method, system, storage medium and electronic equipment
CN117079643A (en) Speech recognition method, system, storage medium and electronic equipment
CN118587526A (en) Image classification method, system, storage medium and electronic equipment based on text classifier
CN118196775A (en) Target detection method, target detection system, storage medium and electronic equipment
CN118279611A (en) Image difference description method, system, storage medium and electronic equipment
CN118398026A (en) Method and system for detecting voice position, storage medium and electronic equipment
CN116701708B (en) Multi-mode enhanced video classification method, system, storage medium and electronic equipment
CN118506339A (en) Video subtitle identification method, system, storage medium and electronic equipment
CN116108147A (en) Cross-modal retrieval method, system, terminal and storage medium based on feature fusion
CN118154883B (en) Target semantic segmentation method, system, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination