WO2023029615A1 - Voice wake-up method, apparatus, device, storage medium, and program product - Google Patents
Voice wake-up method, apparatus, device, storage medium, and program product
- Publication number
- WO2023029615A1 (PCT/CN2022/095443)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords: signal, posterior probability, bone conduction, wake word
- Prior art date
Classifications
- G10L15/187: Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
- G10L25/78: Detection of presence or absence of voice signals
- G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
- G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/24: Speech recognition using non-acoustical features
- G10L2015/025: Phonemes, fenemes or fenones being the recognition units
- G10L2015/088: Word spotting
- G10L2015/223: Execution procedure of a spoken command
Definitions
- The embodiments of the present application relate to the technical field of speech recognition, and in particular to a voice wake-up method, apparatus, device, storage medium, and program product.
- A smart device must first be woken up by a wake-up word in the user's voice before it can receive instructions and complete tasks.
- Bone conduction microphones are widely applied to wearable devices, such as wireless earphones, smart glasses, and smart watches, so that smart devices can be woken up through the wearable device.
- The sensor in a bone conduction microphone is a non-acoustic sensor: it collects the vibration signal of the vocal cords when a person speaks and converts that vibration signal into an electrical signal, which is called a bone conduction signal.
- a bone conduction microphone and an air microphone are installed on a wearable device.
- The air microphone remains in a dormant state until the smart device is woken up. Because the bone conduction microphone has low power consumption, it can be used to continuously collect bone conduction signals and perform voice detection (for example, voice activity detection (VAD)) based on them, thereby controlling the switching of the air microphone to reduce power consumption.
- However, because the air microphone is turned on only after voice is detected, the head of the spoken command word is truncated; that is, the collected air conduction signal may lose its head and therefore not contain the complete information of the command word input by the sound source. As a result, the recognition accuracy of the wake-up word is low, and the accuracy of voice wake-up is low.
- Embodiments of the present application provide a voice wake-up method, apparatus, device, storage medium, and program product, which can improve the accuracy of voice wake-up. The technical scheme is as follows:
- In a first aspect, a voice wake-up method is provided, which includes:
- Voice detection is performed according to a bone conduction signal collected by a bone conduction microphone, where the bone conduction signal contains the command word information input by a sound source; when a voice input is detected, the wake-up word is detected based on the bone conduction signal; and when it is detected that the command word includes the wake-up word, voice wake-up is performed on the device to be woken up.
- In this solution, a bone conduction microphone is used to collect bone conduction signals for voice detection, which ensures low power consumption.
- Although the collected air conduction signal may lose its head due to the delay of voice detection and therefore not contain the complete information of the command word input by the sound source, the bone conduction signal collected by the bone conduction microphone does contain that information; that is, the bone conduction signal has not lost its head. This solution therefore detects the wake-up word based on the bone conduction signal, so the recognition accuracy of the wake-up word is higher and the accuracy of voice wake-up is higher.
- detecting the wake-up word based on the bone conduction signal includes: determining a fusion signal based on the bone conduction signal; and detecting the wake-up word on the fusion signal.
- the fusion signal determined based on the bone conduction signal also includes command word information input by the sound source.
- In a possible implementation, before the fusion signal is determined based on the bone conduction signal, the method further includes: turning on the air microphone and collecting an air conduction signal through the air microphone. Determining the fusion signal based on the bone conduction signal then includes one of the following: fusing the initial part of the bone conduction signal with the air conduction signal to obtain the fusion signal, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection; or generating an enhanced initial signal based on the initial part of the bone conduction signal and fusing the enhanced initial signal with the air conduction signal to obtain the fusion signal, where the initial part is again determined according to the detection delay of the voice detection; or directly fusing the bone conduction signal with the air conduction signal to obtain the fusion signal.
- In other words, the embodiment of the present application provides three methods of using the bone conduction signal to compensate for the head loss of the air conduction signal, that is, performing head-loss compensation on the air conduction signal directly through explicit signal fusion.
- Optionally, signal fusion is performed by signal splicing, as in the sketch below.
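As an illustration of the first, splicing-based fusion method, the following is a minimal sketch: the head segment of the bone conduction signal, whose length equals the VAD detection delay, is prepended to the air conduction signal. The function name and parameters are hypothetical, and the sketch assumes both signals already share one sampling rate and comparable gain (the embodiments separately describe resampling and gain adjustment of the bone conduction signal).

```python
import numpy as np

def splice_fusion(bone_signal: np.ndarray, air_signal: np.ndarray,
                  vad_delay_s: float, sample_rate: int = 16000) -> np.ndarray:
    """Head-loss compensation by splicing (hypothetical helper).

    The first vad_delay_s seconds of the bone conduction signal, i.e. the
    part the air microphone missed while voice detection was still deciding,
    are prepended to the air conduction signal.
    """
    head_len = int(vad_delay_s * sample_rate)
    bone_head = bone_signal[:head_len]  # initial part, set by the VAD detection delay
    return np.concatenate([bone_head, air_signal])

# Example: a 120 ms VAD delay at 16 kHz splices 1920 bone conduction samples
# onto the head of the air conduction signal.
fused = splice_fusion(np.zeros(32000), np.zeros(48000), vad_delay_s=0.12)
```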
- determining the fusion signal based on the bone conduction signal includes: determining the bone conduction signal as the fusion signal. That is, the embodiment of the present application may also directly use the bone conduction signal to detect the wake-up word.
- In a possible implementation, detecting the wake-up word on the fusion signal includes: inputting multiple audio frames included in the fusion signal into a first acoustic model to obtain multiple posterior probability vectors output by the first acoustic model, where the multiple posterior probability vectors correspond one-to-one to the multiple audio frames, and a first posterior probability vector among them is used to indicate the probability that the phoneme of a first audio frame among the multiple audio frames belongs to each of multiple specified phonemes; and detecting the wake-up word based on the multiple posterior probability vectors.
- That is, the fusion signal is first processed by the first acoustic model to obtain multiple posterior probability vectors corresponding one-to-one to the multiple audio frames included in the fusion signal, and the wake-up word is then detected based on these posterior probability vectors, for example by decoding them. A sketch of this frame-wise computation follows.
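The sketch below shows the frame-wise posterior computation described above, assuming the acoustic model is any callable that returns per-frame scores over the specified phoneme set; the softmax normalization and the toy stand-in model are illustrative assumptions, not the patent's model.

```python
import numpy as np

def frame_posteriors(frames: np.ndarray, acoustic_model) -> np.ndarray:
    """Return one posterior probability vector per audio frame.

    frames has shape (num_frames, feat_dim); the result has shape
    (num_frames, num_phonemes), each row summing to 1.
    """
    logits = acoustic_model(frames)                       # (num_frames, num_phonemes)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)       # per-frame softmax

# Toy stand-in model: 40-dim features mapped to a 100-phoneme set.
rng = np.random.default_rng(0)
toy_model = lambda f: f @ rng.standard_normal((40, 100))
posteriors = frame_posteriors(rng.standard_normal((50, 40)), toy_model)
```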
- In a possible implementation, before the wake-up word is detected based on the bone conduction signal, the method further includes: turning on the air microphone and collecting an air conduction signal through the air microphone. Detecting the wake-up word based on the bone conduction signal then includes: determining multiple posterior probability vectors based on the bone conduction signal and the air conduction signal, where the multiple posterior probability vectors correspond to the multiple audio frames included in the bone conduction signal and the air conduction signal, and the first posterior probability vector among them is used to indicate the probability that the phoneme of the first audio frame belongs to each of the multiple specified phonemes; and detecting the wake-up word based on the multiple posterior probability vectors.
- That is, signal fusion may be skipped, and the posterior probability vectors corresponding to each audio frame are instead determined directly based on the bone conduction signal and the air conduction signal, so that the resulting posterior probability vectors implicitly include the command word information input by the sound source in the form of phoneme probabilities; in other words, the bone conduction signal is implicitly used to compensate the air conduction signal for its head loss.
- In a possible implementation, determining the multiple posterior probability vectors based on the bone conduction signal and the air conduction signal includes: inputting the initial part of the bone conduction signal and the air conduction signal into a second acoustic model to obtain a first number of bone conduction posterior probability vectors and a second number of air conduction posterior probability vectors output by the second acoustic model, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection, the first number of bone conduction posterior probability vectors correspond one-to-one to the audio frames included in the initial part of the bone conduction signal, and the second number of air conduction posterior probability vectors correspond one-to-one to the audio frames included in the air conduction signal; and fusing a first bone conduction posterior probability vector with a first air conduction posterior probability vector to obtain a second posterior probability vector, where the first bone conduction posterior probability vector corresponds to the last audio frame of the initial part of the bone conduction signal, whose duration is less than the frame duration, and the first air conduction posterior probability vector corresponds to the first audio frame of the air conduction signal, whose duration is also less than the frame duration.
- That is, the initial part of the bone conduction signal and the air conduction signal can each be processed by the second acoustic model to obtain the corresponding bone conduction posterior probability vectors and air conduction posterior probability vectors, and the first bone conduction posterior probability vector is then fused with the first air conduction posterior probability vector, so that the bone conduction signal is implicitly used to perform head-loss compensation on the air conduction signal; a sketch of this boundary-frame fusion follows.
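The patent states only that the two boundary-frame vectors are fused; the duration-weighted average below is one plausible choice, labeled as an assumption of this sketch.

```python
import numpy as np

def fuse_boundary_posteriors(bone_post: np.ndarray, air_post: np.ndarray,
                             bone_tail_dur: float, air_head_dur: float) -> np.ndarray:
    """Fuse the posterior of the last (partial) bone conduction frame with
    the posterior of the first (partial) air conduction frame, then return
    the full posterior sequence.

    The fusion operation is not specified by the source; weighting by the
    two partial frames' durations is assumed here.
    """
    last_bone = bone_post[-1]  # last audio frame of the bone signal's initial part
    first_air = air_post[0]    # first audio frame of the air conduction signal
    w = bone_tail_dur / (bone_tail_dur + air_head_dur)
    second_vector = w * last_bone + (1.0 - w) * first_air
    # Result: intact bone posteriors, the fused boundary vector, intact air posteriors.
    return np.vstack([bone_post[:-1], second_vector, air_post[1:]])
```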
- In a possible implementation, determining the multiple posterior probability vectors includes: inputting the initial part of the bone conduction signal and the air conduction signal into a third acoustic model to obtain the multiple posterior probability vectors output by the third acoustic model, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection; or inputting the bone conduction signal and the air conduction signal into the third acoustic model to obtain the multiple posterior probability vectors output by the third acoustic model.
- That is, the initial part of the bone conduction signal and the air conduction signal may each be input into the third acoustic model, and the multiple posterior probability vectors are obtained directly from the third acoustic model. In the course of the third acoustic model processing the two parts of the signal, they are implicitly fused; in other words, the bone conduction signal is implicitly used to perform head-loss compensation on the air conduction signal.
- In a possible implementation, detecting the wake-up word based on the multiple posterior probability vectors includes: determining, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, a confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and when the confidence exceeds a confidence threshold, determining that the command word includes the wake-up word.
- That is, the confidence is obtained by decoding the multiple posterior probability vectors, and the confidence threshold is then used to judge whether the command word includes the wake-up word; when the confidence condition is satisfied, it is determined that the command word contains the wake-up word.
- In a possible implementation, detecting the wake-up word based on the multiple posterior probability vectors includes: determining, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and when the confidence exceeds the confidence threshold and the multiple posterior probability vectors satisfy a distance condition with respect to multiple template vectors, determining that the command word includes the wake-up word, where the multiple template vectors indicate the probability that the phonemes of a speech signal containing the complete information of the wake-up word belong to the multiple specified phonemes. That is, only when both the confidence condition is met and the template matches is it determined that the command word contains the wake-up word, so as to avoid false wake-ups as much as possible.
- In a possible implementation, the distance condition includes: the average of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than a distance threshold. That is, whether the template matches can be judged based on the average distance between the vectors, as in the sketch below.
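A minimal sketch of the two-stage decision (confidence threshold plus average-distance template matching). The Euclidean distance, the externally supplied confidence, and the assumption that the posterior and template sequences are already aligned frame-for-frame are all choices of this sketch, not requirements stated by the source.

```python
import numpy as np

def wake_word_decision(posteriors: np.ndarray, templates: np.ndarray,
                       confidence: float, conf_threshold: float,
                       dist_threshold: float) -> bool:
    """Return True only if the decoding confidence exceeds its threshold AND
    the average frame-wise distance to the template vectors is below the
    distance threshold, mirroring the combined condition described above.
    """
    if confidence <= conf_threshold:
        return False
    dists = np.linalg.norm(posteriors - templates, axis=1)  # one distance per frame
    return float(dists.mean()) < dist_threshold
```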
- In a possible implementation, the method further includes: acquiring a bone conduction registration signal, where the bone conduction registration signal includes the complete information of the wake-up word; and determining the confidence threshold and the multiple template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word. That is, during the registration of the wake-up word, the embodiment of the present application can determine the confidence threshold and the multiple template vectors based on a bone conduction registration signal containing the complete information of the wake-up word, and then use them to detect the wake-up word in the subsequent voice wake-up process, which improves the accuracy of wake-up word detection and thereby reduces false wake-ups.
- In a possible implementation, determining the confidence threshold and the multiple template vectors includes: determining a fusion registration signal based on the bone conduction registration signal; and determining the confidence threshold and the multiple template vectors based on the fusion registration signal and the phoneme sequence corresponding to the wake-up word. That is, during the registration of the wake-up word, the fused registration signal can first be obtained through signal fusion, so that it contains the complete information of the wake-up word input by the sound source, and the confidence threshold and the multiple template vectors are then determined from it.
- In a possible implementation, determining the confidence threshold and the multiple template vectors includes: inputting multiple registration audio frames included in the fused registration signal into the first acoustic model to obtain multiple registration posterior probability vectors output by the first acoustic model, where the multiple registration posterior probability vectors correspond one-to-one to the multiple registration audio frames, and a first registration posterior probability vector among them indicates the probability that the phoneme of the first registration audio frame belongs to each of the multiple specified phonemes; determining the multiple registration posterior probability vectors as the multiple template vectors; and determining the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- That is, the fused registration signal can also be processed by the first acoustic model to obtain the multiple registration posterior probability vectors; these are decoded to determine the confidence threshold, and they may themselves be determined as the multiple template vectors.
- In a possible implementation, determining the confidence threshold includes: determining multiple registration posterior probability vectors based on the bone conduction registration signal and an air conduction registration signal, where the multiple registration posterior probability vectors correspond one-to-one to the multiple registration audio frames included in the bone conduction registration signal and the air conduction registration signal, and the first registration posterior probability vector among them indicates the probability that the phoneme of the first registration audio frame belongs to each of the multiple specified phonemes; and determining the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word. That is, during the registration of the wake-up word, signal fusion need not be performed first. A sketch of the registration flow follows.
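The sketch below ties the registration pieces together: the enrollment posteriors become the template vectors, and the confidence threshold is derived from the confidence obtained by decoding the enrollment utterance. Deriving the threshold by scaling that confidence with a margin is an assumption of this sketch; the source says only that the threshold is determined from the registration signal and the wake-up word's phoneme sequence.

```python
import numpy as np

def register_wake_word(enroll_posteriors: np.ndarray,
                       enroll_confidence: float,
                       margin: float = 0.9):
    """Hypothetical wake-word registration flow.

    enroll_posteriors: frame-wise posteriors of a registration signal that
    contains the complete wake word (bone conduction or fused).
    enroll_confidence: confidence from decoding the enrollment utterance
    against the wake word's phoneme sequence.
    """
    templates = enroll_posteriors.copy()         # one template vector per frame
    conf_threshold = margin * enroll_confidence  # assumed threshold rule
    return templates, conf_threshold
```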
- In a second aspect, a voice wake-up device is provided, which has the function of implementing the behavior of the voice wake-up method in the first aspect.
- the voice wake-up device includes one or more modules, and the one or more modules are used to implement the voice wake-up method provided in the first aspect above.
- In a possible implementation, the voice wake-up device includes:
- the voice detection module is used to perform voice detection according to the bone conduction signal collected by the bone conduction microphone, and the bone conduction signal contains command word information input by the sound source;
- the wake-up word detection module is used to detect the wake-up word based on the bone conduction signal when a voice input is detected;
- the voice wake-up module is configured to perform voice wake-up on the device to be woken up when it is detected that the command word includes a wake-up word.
- the wake-up word detection module includes:
- the first determining submodule is used to determine the fusion signal based on the bone conduction signal;
- the wake-up word detection submodule is used to detect the wake-up word on the fusion signal.
- the device also includes:
- the processing module is used to turn on the air microphone, and collect the air conduction signal through the air microphone;
- the first determining submodule is used to: fuse the initial part of the bone conduction signal with the air conduction signal to obtain the fusion signal, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection; or generate an enhanced initial signal based on the initial part of the bone conduction signal and fuse the enhanced initial signal with the air conduction signal to obtain the fusion signal; or directly fuse the bone conduction signal with the air conduction signal to obtain the fusion signal.
- the wake-up word detection submodule is used to: input the multiple audio frames included in the fusion signal into the first acoustic model to obtain the multiple posterior probability vectors output by the first acoustic model, where the multiple posterior probability vectors correspond one-to-one to the multiple audio frames, and the first posterior probability vector among them is used to indicate the probability that the phoneme of the first audio frame belongs to each of the multiple specified phonemes; and perform the wake-up word detection based on the multiple posterior probability vectors.
- the device also includes:
- the processing module is used to turn on the air microphone, and collect the air conduction signal through the air microphone;
- the wake word detection module includes:
- the second determining submodule is used to determine multiple posterior probability vectors based on the bone conduction signal and the air conduction signal, where the multiple posterior probability vectors correspond one-to-one to the multiple audio frames included in the bone conduction signal and the air conduction signal, and the first posterior probability vector among them is used to indicate the probability that the phoneme of the first audio frame belongs to each of the multiple specified phonemes;
- the wake-up word detection submodule is configured to detect wake-up words based on the plurality of posterior probability vectors.
- the second determining submodule is used to: input the initial part of the bone conduction signal and the air conduction signal into the second acoustic model to obtain the first number of bone conduction posterior probability vectors and the second number of air conduction posterior probability vectors output by the second acoustic model, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection, the first number of bone conduction posterior probability vectors correspond one-to-one to the audio frames included in the initial part of the bone conduction signal, and the second number of air conduction posterior probability vectors correspond one-to-one to the audio frames included in the air conduction signal; and fuse the first bone conduction posterior probability vector with the first air conduction posterior probability vector to obtain the second posterior probability vector, where the first bone conduction posterior probability vector corresponds to the last audio frame of the initial part of the bone conduction signal, whose duration is less than the frame duration, and the first air conduction posterior probability vector corresponds to the first audio frame of the air conduction signal, whose duration is less than the frame duration; the multiple posterior probability vectors include the second posterior probability vector, the vectors in the first number of bone conduction posterior probability vectors other than the first bone conduction posterior probability vector, and the vectors in the second number of air conduction posterior probability vectors other than the first air conduction posterior probability vector.
- the second determining submodule is used to: input the initial part of the bone conduction signal and the air conduction signal into the third acoustic model to obtain the multiple posterior probability vectors output by the third acoustic model, where the initial part of the bone conduction signal is determined according to the detection delay of the voice detection; or input the bone conduction signal and the air conduction signal into the third acoustic model to obtain the multiple posterior probability vectors output by the third acoustic model.
- the wake-up word detection submodule is used to: determine, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and when the confidence exceeds the confidence threshold, determine that the command word includes the wake-up word.
- the wake-up word detection submodule is used to: determine, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and when the confidence exceeds the confidence threshold and the multiple posterior probability vectors satisfy the distance condition with respect to the multiple template vectors, determine that the command word includes the wake-up word, where the multiple template vectors indicate the probability that the phonemes of a speech signal containing the complete information of the wake-up word belong to the multiple specified phonemes.
- the distance condition includes: the average value of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than the distance threshold.
- the device also includes:
- the obtaining module is used to obtain the bone conduction registration signal, and the bone conduction registration signal includes complete information of the wake-up word;
- the determination module is configured to determine a confidence threshold and a plurality of template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word.
- the determination module includes:
- the third determining submodule is used to determine the fusion registration signal based on the bone conduction registration signal;
- the fourth determining submodule is configured to determine a confidence threshold and a plurality of template vectors based on the fused registration signal and the phoneme sequence corresponding to the wake-up word.
- the fourth determining submodule is used for:
- input the multiple registration audio frames included in the fused registration signal into the first acoustic model to obtain the multiple registration posterior probability vectors output by the first acoustic model, where the multiple registration posterior probability vectors correspond one-to-one to the multiple registration audio frames, and the first registration posterior probability vector among them indicates the probability that the phoneme of the first registration audio frame belongs to each of the multiple specified phonemes; determine the multiple registration posterior probability vectors as the multiple template vectors; and determine the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- the device also includes:
- the acquisition module is further configured to acquire an air conduction registration signal;
- the determination module includes:
- the fifth determining submodule is used to determine multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal, where the multiple registration posterior probability vectors correspond one-to-one to the multiple registration audio frames included in the bone conduction registration signal and the air conduction registration signal, and the first registration posterior probability vector among them indicates the probability that the phoneme of the first registration audio frame belongs to each of the multiple specified phonemes;
- the sixth determining submodule is configured to determine a confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- In a third aspect, an electronic device is provided, which includes a processor and a memory, where the memory is used to store a program for executing the voice wake-up method provided in the first aspect and to store the data involved in implementing the voice wake-up method provided in the first aspect.
- the processor is configured to execute programs stored in the memory.
- The electronic device may further include a communication bus used to establish a connection between the processor and the memory.
- In a fourth aspect, a computer-readable storage medium is provided, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer is caused to execute the voice wake-up method described in the first aspect.
- In a fifth aspect, a computer program product containing instructions is provided, which, when run on a computer, causes the computer to execute the voice wake-up method described in the first aspect.
- FIG. 1 is a schematic structural diagram of an acoustic model provided by an embodiment of the present application.
- FIG. 2 is a system architecture diagram involved in a voice wake-up method provided by an embodiment of the present application.
- FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- FIG. 4 is a flow chart of a voice wake-up method provided by an embodiment of the present application.
- FIG. 5 is a schematic diagram of the principle of bone conduction signal and air conduction signal generation provided by an embodiment of the present application.
- FIG. 6 is a signal timing diagram provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of a signal splicing method provided by an embodiment of the present application.
- FIG. 8 is a schematic diagram of downsampling a bone conduction signal provided by an embodiment of the present application.
- FIG. 9 is a schematic diagram of gain adjustment for bone conduction signals provided by an embodiment of the present application.
- FIG. 10 is a schematic diagram of a method for training and generating a network model provided by an embodiment of the present application.
- FIG. 11 is a schematic structural diagram of another acoustic model provided by an embodiment of the present application.
- FIG. 12 is a schematic structural diagram of another acoustic model provided by an embodiment of the present application.
- FIG. 13 is a flow chart of another voice wake-up method provided by an embodiment of the present application.
- FIG. 14 is a flow chart of another voice wake-up method provided by an embodiment of the present application.
- FIG. 15 is a flow chart of another voice wake-up method provided by an embodiment of the present application.
- FIG. 16 is a flow chart of another voice wake-up method provided by an embodiment of the present application.
- FIG. 17 is a flow chart of another voice wake-up method provided by an embodiment of the present application.
- FIG. 18 is a flow chart of another voice wake-up method provided by an embodiment of the present application.
- FIG. 19 is a flow chart of a method for registering a wake-up word provided by an embodiment of the present application.
- FIG. 20 is a flow chart of another wake-up word registration method provided by an embodiment of the present application.
- FIG. 21 is a flow chart of another wake-up word registration method provided by an embodiment of the present application.
- FIG. 22 is a flow chart of another wake-up word registration method provided by an embodiment of the present application.
- FIG. 23 is a flow chart of another wake-up word registration method provided by an embodiment of the present application.
- FIG. 24 is a flow chart of another wake-up word registration method provided by an embodiment of the present application.
- FIG. 25 is a schematic diagram of a method for training a first acoustic model provided by an embodiment of the present application.
- FIG. 26 is a schematic diagram of another method for training the first acoustic model provided by an embodiment of the present application.
- FIG. 27 is a schematic diagram of another method for training the first acoustic model provided by an embodiment of the present application.
- FIG. 28 is a schematic diagram of another method for training the first acoustic model provided by an embodiment of the present application.
- FIG. 29 is a schematic diagram of a method for training a second acoustic model provided by an embodiment of the present application.
- FIG. 30 is a schematic diagram of a method for training a third acoustic model provided by an embodiment of the present application.
- FIG. 31 is a schematic structural diagram of a voice wake-up device provided by an embodiment of the present application.
- Speech recognition: also known as automatic speech recognition (ASR), it refers to a computer recognizing the vocabulary content contained in a speech signal.
- Voice wake-up: also known as keyword spotting (KWS), wake-up word detection, or wake-up word recognition. Voice wake-up refers to detecting wake-up words in a continuous voice stream in real time and waking up the smart device when the command word input by the sound source is detected to be a wake-up word.
- Deep learning (DL): a learning algorithm in machine learning based on data representations.
- Voice activity detection (VAD): VAD is used to judge when there is voice input and when there is silence, and to intercept the valid segments that contain voice input. Subsequent speech recognition operations are performed on the valid segments intercepted by VAD, which reduces both the noise misrecognition rate of the speech recognition system and system power consumption (a minimal energy-based example is sketched below). VAD is one implementation of voice detection; this embodiment of the present application uses VAD as an example to introduce voice detection, and in other embodiments voice detection may be performed in other ways.
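A minimal energy-based VAD sketch: voice input is reported as soon as one frame's mean energy exceeds a fixed threshold. Real VADs add smoothing, hangover, and noise tracking; the threshold and frame length here are illustrative assumptions.

```python
import numpy as np

def simple_vad(signal: np.ndarray, sample_rate: int = 16000,
               frame_ms: float = 25.0, energy_threshold: float = 1e-3) -> bool:
    """Return True if any frame's mean energy exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        if float(np.mean(frame ** 2)) > energy_threshold:
            return True   # voice detected in this frame
    return False          # silence throughout
```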
- The first step of speech recognition is to detect whether there is speech input, i.e., to perform voice activity detection (VAD).
- the recognition system mainly includes feature extraction, recognition modeling and decoding to obtain recognition results.
- model training includes acoustic model training, language model training, and the like.
- Speech recognition is essentially the process of converting an audio sequence into a text sequence, that is, finding the text sequence with the highest probability given a speech input.
- By Bayes' rule, the speech recognition problem can be decomposed into the conditional probability of the speech occurring given a text sequence and the prior probability of the text sequence.
- the model obtained by modeling the conditional probability is the acoustic model.
- the model obtained by modeling the prior probability of the text sequence is the language model.
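The decomposition described in the preceding items can be written compactly as follows (a standard formulation, with X the acoustic observation sequence and W a candidate text sequence; the constant denominator P(X) does not affect the argmax):

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)}
        = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```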
- Phoneme: the pronunciation of a word is composed of phonemes, which are a kind of basic sound unit.
- An English phoneme set is, in effect, a pronunciation dictionary. In some implementations the phoneme set includes more phonemes; in the examples of the present application, the phoneme set includes 100 phonemes.
- State: a state can be regarded as a unit of speech more fine-grained than a phoneme, and a phoneme is usually divided into three states. In the embodiments of the present application, one audio frame corresponds to one phoneme, and several phonemes form a word.
- Through the acoustic model, the probability that the phoneme corresponding to each audio frame is each phoneme in the phoneme set can be obtained; that is, the posterior probability vector corresponding to each audio frame can be obtained.
- A decoding graph (also called a state network, search space, etc.) is constructed based on the language model, the pronunciation dictionary, and so on.
- The posterior probability vectors are used as the input of the decoding graph, and the optimal path is searched in the decoding graph; along this path, the probability of the phoneme sequence corresponding to the speech is the largest.
- In this way, the phoneme corresponding to each audio frame can be known; that is, the best word string for speech recognition can be obtained.
- The process of searching the optimal path in the state network to obtain the word string can be regarded as a kind of decoding; decoding determines what the word string corresponding to the speech signal is.
- For wake-up word detection, the probabilities of each phoneme on the decoding path are found in the decoding graph, and the found probabilities are added to obtain a path score, where the decoding path refers to the phoneme sequence corresponding to the wake-up word. If the path score is large enough, it is determined that the command word includes the wake-up word. That is, decoding in the embodiments of the present application judges, based on the decoding graph, whether the word string corresponding to the voice signal is the wake-up word. A sketch of this path scoring follows.
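The sketch below scores the wake-up word's decoding path by accumulating per-frame phoneme probabilities along a given frame-to-phoneme alignment. The alignment is supplied directly here (in practice it comes from the search over the decoding graph), and log-probabilities are used for numerical stability; both are assumptions of this sketch.

```python
import numpy as np

def path_score(posteriors: np.ndarray, wake_phoneme_ids: list,
               frame_alignment: list) -> float:
    """Sum the (log) probabilities of the wake word's phonemes along the path.

    posteriors: (num_frames, num_phonemes) frame-wise posterior vectors.
    wake_phoneme_ids: the wake word's phoneme sequence as phoneme indices.
    frame_alignment: for each frame, the position in wake_phoneme_ids that
    the frame is aligned to.
    """
    score = 0.0
    for frame_idx, pos in enumerate(frame_alignment):
        ph_id = wake_phoneme_ids[pos]  # phoneme expected at this frame
        score += float(np.log(posteriors[frame_idx, ph_id] + 1e-10))
    return score
```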
- the acoustic model involved in the embodiment of the present application is further introduced here first.
- the acoustic model is a model capable of recognizing a single phoneme, which can be modeled using a hidden Markov model (HMM).
- the acoustic model is a trained model, and the acoustic model can be trained by using the acoustic features of the sound signal and corresponding labels.
- Through training, the corresponding probability distribution between the acoustic signal and the modeling unit is established, where the modeling unit (also called the pronunciation unit) is, for example, an HMM state, a phoneme, a syllable, or a word.
- Typical acoustic model structures include GMM-HMM, DNN-HMM, and DNN-CTC, where GMM stands for Gaussian mixture model, DNN for deep neural network, and CTC for connectionist temporal classification.
- In the embodiments of the present application, the modeling unit is a phoneme, and a DNN-HMM acoustic model is taken as an example for introduction.
- The acoustic model processes the audio frame by frame and outputs the probability that the phoneme of each audio frame belongs to each of multiple specified phonemes, where the multiple specified phonemes are determined according to the pronunciation dictionary. For example, if the pronunciation dictionary includes 100 phonemes, the multiple specified phonemes are those 100 phonemes.
- FIG. 1 is a schematic structural diagram of an acoustic model provided by an embodiment of the present application. The acoustic model is a DNN-HMM model in which the dimension of the input layer is 3, the dimension of each of the two hidden layers is 5, and the dimension of the output layer is 3. The dimension of the input layer represents the feature dimension of the input signal, and the dimension of the output layer represents three state dimensions, each of which includes the probabilities corresponding to the multiple specified phonemes. A toy forward pass with these dimensions is sketched below.
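A toy numpy forward pass mirroring the layer sizes of FIG. 1 (input 3, two hidden layers of 5, output 3); real acoustic models are far larger, and the random weights and tanh activations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Layer sizes from FIG. 1: input dim 3, two hidden layers of dim 5, output dim 3.
W1, b1 = rng.standard_normal((3, 5)), np.zeros(5)
W2, b2 = rng.standard_normal((5, 5)), np.zeros(5)
W3, b3 = rng.standard_normal((5, 3)), np.zeros(3)

def dnn_forward(x: np.ndarray) -> np.ndarray:
    """Forward pass of the toy DNN; the softmax turns the output layer into
    a per-frame probability vector over the (here, 3) output states."""
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    logits = h2 @ W3 + b3
    e = np.exp(logits - logits.max())
    return e / e.sum()

print(dnn_forward(np.array([0.1, -0.3, 0.7])))  # probabilities summing to 1
```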
- Decoding in speech recognition can be divided into dynamic decoding and static decoding.
- In dynamic decoding, the language score is searched dynamically in the language model, centered on the dictionary tree.
- Static decoding means that the language model is statically compiled into the decoding graph in advance, and decoding efficiency is improved through a series of optimization operations such as determinization, weight pushing, and minimization.
- Static decoding is adopted in the embodiments of the present application, for example using a weighted finite state transducer (WFST), and redundant information is eliminated in static decoding based on the HCLG network.
- The generation of the HCLG network requires the language model, pronunciation dictionary, and acoustic model to be expressed in the corresponding FST format and then compiled into one large decoding graph through operations such as composition, determinization, and minimization, following the standard recipe HCLG = asl(min(rds(det(H' ∘ min(det(C ∘ min(det(L ∘ G)))))))), where asl means adding self-loops, min means minimization, rds means removing disambiguation symbols, det means determinization, H' means the HMM with self-loops removed, and ∘ denotes composition.
- The Viterbi algorithm is used to find the optimal path in the decoding graph, and there will be no two identical paths in the decoding graph.
- Accumulated beam pruning is adopted during decoding: the beam value is subtracted from the current maximum path score to obtain a threshold, and paths whose scores are smaller than the threshold are pruned.
- The frame-synchronous decoding algorithm first finds the starting node of the decoding graph, creates the token of the corresponding node, expands from the token corresponding to the starting node along empty edges (that is, edges whose input does not correspond to a real modeling unit), binds a token to each reachable node, and prunes to keep only the active tokens.
- Then a token is taken from the current active tokens, and its corresponding node is expanded along the subsequent non-empty edges (that is, edges whose input corresponds to a real physical modeling unit); all active tokens are traversed, and pruning keeps the tokens active in the current frame. These steps are repeated until all audio frames have been processed; the token with the highest score is then found, and the final recognition result is obtained by backtracking. A simplified sketch follows.
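A simplified token-passing sketch with accumulated beam pruning. Epsilon-edge handling, backpointers, and arc weights are omitted; the graph representation and names are assumptions of this sketch.

```python
import math

def token_passing(frames_logprob, graph, start_node, beam=10.0):
    """Frame-synchronous token passing over a decoding graph.

    frames_logprob[t][p]: log posterior of phoneme p at frame t.
    graph: dict mapping a node to a list of (next_node, phoneme_id) arcs.
    Tokens scoring below (best - beam) are pruned each frame, matching the
    accumulated beam pruning described above.
    """
    tokens = {start_node: 0.0}                    # node -> best path score
    for frame in frames_logprob:
        new_tokens = {}
        for node, score in tokens.items():
            for next_node, phoneme_id in graph.get(node, []):
                s = score + frame[phoneme_id]     # extend along a non-empty edge
                if s > new_tokens.get(next_node, -math.inf):
                    new_tokens[next_node] = s
        if not new_tokens:
            return -math.inf                      # all paths died
        best = max(new_tokens.values())
        tokens = {n: s for n, s in new_tokens.items() if s >= best - beam}
    return max(tokens.values())                   # score of the best token
```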
- the network model refers to the above-mentioned acoustic model.
- Network models include, for example, the hidden Markov model (HMM), the Gaussian mixture model (GMM), the deep neural network (DNN), the deep belief network-hidden Markov model (DBN-HMM), the recurrent neural network (RNN), the long short-term memory (LSTM) network, and the convolutional neural network (CNN).
- CNN and HMM are used in the embodiment of this application.
- the hidden Markov model is a statistical model, which is currently mostly used in the field of speech signal processing.
- In this model, whether a state in the Markov chain transitions to another state depends on the state transition probability, and the observation produced by a given state depends on the state generation probability.
- When performing speech recognition, an HMM first establishes a vocalization model for each recognition unit, obtains a state transition probability matrix and an output probability matrix through long-term training, and makes judgments during recognition based on the maximum probability in the state transition process.
- the basic structure of the convolutional neural network includes two parts, one part is the feature extraction layer, the input of each neuron is connected to the local receptive field of the previous layer, and the local features are extracted.
- the other part is the feature map layer.
- Each calculation layer of the network is composed of multiple feature maps. Each feature map is a plane, and the weights of all neurons on the plane are equal.
- The feature map structure uses a function with a small influence function kernel (such as the sigmoid) as the activation function of the convolutional network, so that the feature map has displacement invariance.
- neurons on a mapping plane share weights, the number of free parameters of the network is reduced.
- Each convolutional layer in the convolutional neural network can be followed by a calculation layer for local averaging and secondary extraction; this two-stage feature extraction structure reduces the feature resolution.
- The loss function is the basis on which the network model iterates during training.
- the loss function is used to evaluate the degree of difference between the predicted value of the network model and the real value, and the choice of the loss function affects the performance of the network model.
- the loss functions used by different network models are generally different. Loss functions can be divided into empirical risk loss functions and structural risk loss functions.
- the empirical risk loss function refers to the difference between the predicted result and the actual result, and the structural risk loss function refers to the empirical risk loss function plus a regular term.
- the embodiment of the present application uses a cross-entropy loss function (cross-entropy loss function), that is, a CE loss function.
- The cross-entropy loss function is essentially a log-likelihood function and can be used in binary and multi-class classification tasks. It is often used instead of the mean squared error loss function because, with sigmoid outputs, cross-entropy avoids the problem of the squared loss updating the weights too slowly: the weight update is fast when the error is large and slow when the error is small, which is a desirable property.
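A small numeric check of this point for a single sigmoid output unit: with the squared loss, the gradient with respect to the pre-activation is scaled by the sigmoid's derivative and nearly vanishes when the unit saturates, whereas the cross-entropy gradient stays proportional to the error itself. This is a textbook illustration, not the patent's training code.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid neuron, target y = 1, badly wrong prediction (z very negative).
z, y = -4.0, 1.0
a = sigmoid(z)

grad_mse = (a - y) * a * (1 - a)  # squared loss: scaled by sigmoid'(z)
grad_ce = (a - y)                 # cross-entropy: proportional to the error

print(grad_mse, grad_ce)  # ~ -0.017 vs ~ -0.98: cross-entropy keeps learning fast
```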
- the error is backpropagated, and the network parameters are adjusted using the loss function and the gradient descent method.
- the gradient descent method is an optimization algorithm, the central idea is to update the parameter values along the direction of the gradient of the objective function in order to achieve the minimum (or maximum) of the objective function.
- Gradient descent is a commonly used optimization algorithm in deep learning. Gradient descent is aimed at the loss function, and the purpose is to find the weight and bias corresponding to the minimum value of the loss function as soon as possible.
- The core of the backpropagation algorithm is to define a special variable, the neuron error: starting from the output layer, the neuron error is backpropagated layer by layer, and the partial derivatives of the weights and biases are then calculated from the neuron error via the chain-rule formulas. Gradient descent is a way to solve the minimization problem, while backpropagation is a way to compute the gradients. A minimal gradient descent loop is sketched below.
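A minimal illustration of the gradient descent update rule on a one-dimensional quadratic objective (purely didactic; the loss and learning rate are arbitrary assumptions):

```python
# Gradient descent on f(w) = (w - 3)^2: repeatedly step against the gradient
# until the minimum at w = 3 is approached.
w, lr = 0.0, 0.1
for _ in range(50):
    grad = 2 * (w - 3)  # f'(w)
    w -= lr * grad      # update along the negative gradient direction
print(round(w, 4))       # ~ 3.0
```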
- FIG. 2 is a system architecture diagram involved in a voice wake-up method provided by an embodiment of the present application.
- the system architecture includes a wearable device 201 and a smart device 202 .
- the wearable device 201 and the smart device 202 are connected in a wired or wireless manner for communication.
- the smart device 202 is the device to be woken up in the embodiment of the present application.
- the wearable device 201 is configured to receive a voice signal, and send an instruction to the smart device 202 based on the received voice signal.
- the smart device 202 is configured to receive instructions sent by the wearable device 201, and perform corresponding operations based on the received instructions.
- the wearable device 201 is used to collect voice signals, detect command words included in the collected voice signals, and send a wake-up instruction to the smart device 202 to wake up the smart device 202 if it is detected that the command words include a wake-up word.
- the smart device 202 is configured to enter a working state from a sleep state after receiving a wake-up instruction.
- the wearable device 201 is equipped with a bone conduction microphone, and due to the low power consumption of the bone conduction microphone, the bone conduction microphone can always work.
- the bone conduction microphone is used to collect bone conduction signals in the working state.
- the processor in the wearable device 201 performs voice activation detection based on the bone conduction signal, so as to detect whether there is voice input.
- the processor detects the wake-up word based on the bone conduction signal, so as to detect whether the command word input by the sound source includes the wake-up word.
- If it is detected that the command word includes the wake-up word, voice wake-up is performed; that is, the wearable device 201 sends a wake-up instruction to the smart device 202.
- the wearable device 201 is such as a wireless earphone, smart glasses, a smart watch, a smart bracelet, and the like.
- the smart device 202 (that is, the device to be woken up) includes a smart speaker, a smart home appliance, a smart toy, a smart robot, and the like.
- In some embodiments, the wearable device 201 and the smart device 202 are the same device.
- FIG. 3 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
- the electronic device is the wearable device 201 shown in FIG. 2 .
- the electronic device includes one or more processors 301 , a communication bus 302 , a memory 303 , one or more communication interfaces 304 , a bone conduction microphone 308 and an air microphone 309 .
- The processor 301 is a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits for implementing the solution of the present application, for example an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
- The above-mentioned PLD is a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
- the communication bus 302 is used to transfer information between the aforementioned components.
- The communication bus 302 is divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or only one type of bus.
- The memory 303 is a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable read-only memory (EEPROM), an optical disc (including a compact disc read-only memory (CD-ROM), a compact disc, a laser disc, a digital versatile disc, a Blu-ray disc, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
- the memory 303 exists independently and is connected to the processor 301 through the communication bus 302 , or, the memory 303 and the processor 301 are integrated together.
- the communication interface 304 uses any transceiver-type device for communicating with other devices or a communication network.
- the communication interface 304 includes a wired communication interface, and optionally, also includes a wireless communication interface.
- the wired communication interface is, for example, an Ethernet interface.
- the Ethernet interface is an optical interface, an electrical interface or a combination thereof.
- the wireless communication interface is a wireless local area network (wireless local area networks, WLAN) interface, a cellular network communication interface, or a combination thereof.
- the electronic device includes multiple processors, such as the processor 301 and the processor 305 shown in FIG. 3 .
- Each of these processors is a single-core processor, or a multi-core processor.
- a processor here refers to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
- the electronic device further includes an output device 306 and an input device 307 .
- Output device 306 is in communication with processor 301 and can display information in a variety of ways.
- the output device 306 is a liquid crystal display (liquid crystal display, LCD), a light emitting diode (light emitting diode, LED) display device, a cathode ray tube (cathode ray tube, CRT) display device, or a projector (projector).
- the input device 307 communicates with the processor 301 and can receive user input in various ways.
- the input device 307 includes one or more of a mouse, a keyboard, a touch screen device, or a sensing device.
- the input device 307 includes a bone conduction microphone 308 and an air microphone 309, and the bone conduction microphone 308 and the air microphone 309 are used to collect bone conduction signals and air conduction signals respectively.
- the processor 301 is configured to wake up the smart device based on the bone conduction signal or based on the bone conduction signal and the air conduction signal through the voice wake-up method provided in the embodiment of the present application.
- the processor 301 is further configured to control the smart device to perform tasks based on the bone conduction signal, or the air conduction signal, or both the bone conduction signal and the air conduction signal.
- the memory 303 is used to store the program code 310 for implementing the solution of the present application, and the processor 301 can execute the program code 310 stored in the memory 303 .
- the program code 310 includes one or more software modules, and the electronic device can implement the voice wake-up method provided in the embodiment of FIG. 4 below through the processor 301 and the program code 310 in the memory 303 .
- FIG. 4 is a flowchart of a voice wake-up method provided by an embodiment of the present application; the method is applied to a wearable device. Referring to FIG. 4, the method includes the following steps.
- Step 401: Perform voice detection based on the bone conduction signal collected by the bone conduction microphone, where the bone conduction signal includes the command word information input by the sound source.
- the bone conduction microphone can be used to collect the bone conduction signal.
- Voice detection (such as voice activation detection, VAD) is performed on the bone conduction signal to detect whether there is a voice input.
- the components of the wearable device other than the bone conduction microphone can be in a sleep state to reduce power consumption, and the other components of the wearable device are then controlled to turn on when a voice input is detected.
- When the wearable device is equipped with an air microphone, since the air microphone is a high-power-consumption device, for a portable wearable device the air microphone is switched on and off under control in order to reduce power consumption: when a voice input is detected (for example, the user is speaking), the air microphone is turned on for the sound pickup operation (that is, air conduction signal collection), so that the power consumption of the wearable device can be reduced. That is, before the smart device is woken up, the air microphone is in a dormant state to reduce power consumption, and when a voice input is detected, the air microphone is turned on.
- the wearable device may perform voice activation detection according to the bone conduction signal collected by the bone conduction microphone, which is not limited in this embodiment of the present application.
- voice activation detection is mainly used to detect whether there is a human voice signal in the current input signal.
- voice activation detection distinguishes speech segments from non-speech segments (such as segments with only various background noise signals) by judging the input signal, so that different processing methods can be adopted for each segment of the signal.
- the voice activation detection detects whether there is a voice input by extracting features of the input signal, for example, by extracting the short-time energy (short time energy, STE) and short-time zero-crossing rate (zero cross counter, ZCC) features of each frame of the input signal, that is, by performing voice activation detection based on energy features.
- the short-term energy refers to the energy of a frame of signal
- the zero-crossing rate refers to the number of times that a frame of time-domain signal crosses 0 (time axis).
- some high-precision VADs will extract multiple features such as energy-based features, frequency-domain features, cepstrum features, harmonic features, and long-term features for comprehensive detection.
- threshold comparison, or statistical methods or machine learning methods may also be combined to determine whether a frame of input signal is a speech signal or a non-speech signal.
- Next, energy-based features, frequency-domain features, cepstrum features, harmonic features, long-term features, and other features are briefly introduced.
- VAD is performed based on the two features of STE and ZCC.
- In a high signal-to-noise ratio (signal-noise ratio, SNR) scenario, the STE of a speech segment is relatively large and its ZCC is relatively small, while the STE of a non-speech segment is relatively small and its ZCC is relatively large.
- the speech signal with human voice usually has high energy, and most of the energy is contained in the low frequency band, while the noise signal usually has low energy and contains more information in the high frequency band. Therefore, the speech signal and the non-speech signal can be distinguished by extracting these two features of the input signal.
- the method for calculating the STE may be to calculate the sum of squares of the energy of the input signal of each frame through the spectrogram.
- the method of calculating the short-time zero-crossing rate may be to count the number of zero crossings of each frame of the input signal in the time domain. For example, in the time domain, all the sampling points in the frame are shifted to the left or right by one point, the amplitude of each shifted sampling point is multiplied by the amplitude of the corresponding sampling point before shifting, and each negative product indicates that the corresponding sampling point crosses zero; counting the number of negative products in the frame yields the short-time zero-crossing rate.
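- To make the STE and ZCC computations above concrete, the following is a minimal NumPy sketch (illustrative only; the framing and any decision thresholds are assumptions not fixed by this application):

```python
import numpy as np

def frame_ste_zcc(frame: np.ndarray) -> tuple[float, int]:
    """Short-time energy: sum of squared amplitudes of one frame.
    Zero-crossing rate: shift the frame by one sample, multiply with the
    unshifted frame, and count the negative products, as described above."""
    ste = float(np.sum(frame ** 2))
    products = frame[1:] * frame[:-1]   # shifted sample times original sample
    zcc = int(np.sum(products < 0))     # each negative product is one zero crossing
    return ste, zcc

# A frame may be judged as speech when STE is large and ZCC is small,
# e.g. by comparing both against empirically chosen thresholds.
```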
- Frequency-domain features: through short-time Fourier transform or another time-frequency transform method, the time-domain signal of the input signal is converted into a frequency-domain signal to obtain a spectrogram, and frequency-domain features are extracted based on the spectrogram, such as the envelope features of frequency bands. In some experiments, when the SNR is 0 dB, the long-term envelopes of some frequency bands can distinguish speech segments from noise segments.
- Cepstral features include, for example, energy cepstral peaks, where the energy cepstrum peak determines the pitch of the speech signal. Mel-frequency cepstral coefficients (Mel-frequency cepstral coefficients, MFCC) may also be used as cepstral features.
- Harmonic-based features: an obvious feature of a speech signal is that it contains the fundamental frequency and its multiple harmonic frequencies; even in strong-noise scenes, this harmonic feature exists.
- the fundamental frequency of the speech signal can be found using the method of autocorrelation.
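- As an illustration of the autocorrelation method mentioned above, a minimal sketch might look as follows (the sampling rate and the 80-400 Hz search range are assumptions, and the frame is assumed to be longer than fs/fmin samples):

```python
import numpy as np

def fundamental_frequency(frame: np.ndarray, fs: int,
                          fmin: float = 80.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency of a voiced frame by locating the
    autocorrelation peak within a plausible pitch-lag range."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag range for fmax..fmin
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag
```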
- Speech signals are non-stationary signals.
- the normal speech rate usually emits 10 to 15 phonemes per second.
- the spectral distribution between phonemes is different, which leads to changes in the statistical characteristics of speech over time.
- Most daily noise is steady-state, that is, it changes relatively slowly; white noise is an example. Based on this, long-term features can be extracted to judge whether the input signal is a speech signal or a non-speech signal.
- In this embodiment of the present application, the input signal used for voice activation detection is the bone conduction signal collected by the bone conduction microphone, and voice activation detection is performed on each frame of the received bone conduction signal to detect whether there is a voice input.
- Since the bone conduction microphone is always in the working state, the bone conduction signal continuously collected by the bone conduction microphone contains the complete information of the command word input by the sound source; that is, the head of the bone conduction signal is not lost.
- the sampling rate of the bone conduction signal is 32 kHz (kilohertz), 48 kHz, etc., which is not limited in this embodiment of the present application.
- the sensor in the bone conduction microphone is a non-acoustic sensor, which can shield the influence of ambient noise and has strong anti-noise performance.
- Step 402: When a voice input is detected, detect the wake-up word based on the bone conduction signal.
- When a voice input is detected, the wearable device detects the wake-up word based on the bone conduction signal, so as to detect whether the command word includes the wake-up word. It should be noted that there are many implementations for the wearable device to detect the wake-up word based on the bone conduction signal; two of them are introduced next.
- the wearable device detects the wake-up word based on the bone conduction signal by determining a fusion signal based on the bone conduction signal, and detecting the wake-up word on the fusion signal.
- the wearable device determines the fusion signal based on the bone conduction signal. It should be noted that there are many ways for wearable devices to determine fusion signals based on bone conduction signals, and four of them will be introduced next.
- Method 1 of determining the fusion signal based on the bone conduction signal: before determining the fusion signal based on the bone conduction signal, turn on the air microphone and collect the air conduction signal through the air microphone. For example, when a voice input is detected, the air microphone is turned on, and the air conduction signal is collected through the air microphone.
- the wearable device fuses the initial part of the bone conduction signal with the air conduction signal to obtain a fusion signal.
- the initial part of the bone conduction signal is determined according to the detection time delay of voice detection (such as VAD).
- the wearable device collects the bone conduction signal and the air conduction signal, and uses the initial part of the bone conduction signal to compensate for head loss on the air conduction signal, so that the obtained fusion signal also includes the command word information input by the sound source.
- the length of the fused signal is relatively short, which can reduce the amount of data processing to a certain extent.
- In the embodiments of the present application, signal fusion is performed by signal splicing; in some embodiments, signal fusion may also be performed by signal superposition. In the following embodiments, signal fusion by signal splicing is used as an example for introduction.
- the bone conduction signal and the air conduction signal are signals generated by the same sound source, and the transmission paths of the bone conduction signal and the air conduction signal are different.
- the bone conduction signal is a signal formed by the vibration signal (excitation signal) transmitted through the bones and tissues inside the human body, and the air conduction signal is formed by the sound wave transmitted through the air.
- FIG. 6 is a signal timing diagram provided by an embodiment of the present application.
- the signal timing diagram shows the timing relationship of the bone conduction signal, the air conduction signal, the VAD control signal and the user's voice signal.
- When the sound source emits a voice signal, the bone conduction signal immediately becomes a high-level signal. After the VAD determines that there is a voice input, a VAD control signal is generated; the VAD control signal controls the air microphone to turn on and collect the air conduction signal, that is, the air conduction signal becomes a high-level signal at this time.
- the bone conduction signal changes synchronously with the user's voice signal, while the air conduction signal has a delay of Δt relative to the bone conduction signal, which is caused by the detection delay of the VAD.
- Δt represents the detection delay of the voice activation detection, that is, the time difference between the moment when a voice input is detected and the moment when the user actually starts to input voice.
- the VAD can detect the speech segment and the non-speech segment in the bone conduction signal
- the endpoint detection can detect the speech segment and the non-speech segment in the air conduction signal.
- the time range of the speech segment extracted from the bone conduction signal is [0, t]; the time range of the initial part of the bone conduction signal (that is, the initial part intercepted from the speech segment) is [0, Δt]; and the time range of the speech segment intercepted from the air conduction signal is [Δt, t].
- After the initial part of the bone conduction signal (that is, the speech segment from 0 to Δt) is fused with the air conduction signal (that is, the speech segment from Δt to t), the duration of the obtained fusion signal is t, where Δt represents the detection delay of the voice activation detection and t represents the total duration of the actual voice input.
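- The head-loss compensation of method 1 can be illustrated with a minimal sketch (assuming both signals share the sampling rate fs and are time-aligned as in FIG. 6; function and parameter names are illustrative):

```python
import numpy as np

def fuse_head(bone: np.ndarray, air: np.ndarray, fs: int, delta_t: float) -> np.ndarray:
    """Splice the [0, delta_t] head of the bone conduction speech segment in
    front of the air conduction segment [delta_t, t] to compensate for the
    lost head; the result covers [0, t]."""
    head = bone[: int(delta_t * fs)]    # speech segment from 0 to delta_t
    return np.concatenate([head, air])  # fusion signal of duration t
```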
- the wearable device performs preprocessing on the air conduction signal, and the preprocessing includes front-end enhancement.
- the front-end enhancement can eliminate part of the noise and the influence of different sound sources, etc., so that the air conduction signal after the front-end enhancement can better reflect the essential characteristics of the voice, so as to improve the accuracy of voice wake-up.
- front-end enhancement of air conduction signals such as endpoint detection and speech enhancement, and speech enhancement such as echo cancellation, beamforming algorithm, noise cancellation, automatic gain control, reverberation, etc.
- the endpoint detection can distinguish the speech segment and the non-speech segment of the air conduction signal, that is, accurately determine the starting point of the speech segment.
- After endpoint detection, only the speech segment of the air conduction signal needs to be processed subsequently, which can improve the accuracy and recall rate of speech recognition.
- Speech enhancement is to remove the influence of environmental noise on speech clips.
- Echo cancellation uses an effective echo cancellation algorithm to suppress the interference of far-end signals and mainly includes double-talk detection and delay estimation. For example, the current speech mode (such as near-talk mode, far-talk mode, or double-talk mode) is judged and a corresponding strategy is used to adjust the filter; the far-end interference in the air conduction signal is then filtered out through the filter, and on this basis, the residual noise interference is eliminated through a post-filtering algorithm.
- The automatic gain algorithm is used to quickly bring the signal to an appropriate volume. This solution can multiply all sampling points of the air conduction signal by the corresponding gain factor through rigid gain processing, and at the same time multiply each frequency in the frequency domain by the corresponding gain factor.
- the frequency of the air conduction signal may be weighted according to the equal loudness curve, and the loudness gain factor is mapped onto the equal loudness curve, thereby determining the gain factor of each frequency.
- the wearable device before the initial part of the bone conduction signal is fused with the air conduction signal, the wearable device performs preprocessing on the bone conduction signal, and the preprocessing includes down-sampling and/or gain adjustment.
- downsampling can reduce the data volume of the bone conduction signal and improve the efficiency of data processing.
- Gain adjustment is used to enhance the energy of the bone conduction signal so that the average energy of the adjusted bone conduction signal is the same as that of the air conduction signal.
- downsampling refers to reducing the sampling frequency (also referred to as sampling rate) of a signal, and is a way of resampling a signal.
- the sampling frequency refers to the number of samples of the sound wave amplitude extracted per second after the analog sound waveform is digitized.
- one sampling point is extracted after every M−1 sampling points of the signal, to obtain the downsampled signal y[m].
- downsampling may cause spectral aliasing of the signal, so the signal can be processed with a low-pass anti-aliasing filter before downsampling; that is, anti-aliasing filtering is performed to reduce the spectral aliasing brought about by the subsequent downsampling.
- the gain adjustment refers to adjusting the amplitude value of the sampling point of the bone conduction signal through the gain factor, or adjusting the energy value of the frequency point of the bone conduction signal.
- the gain factor may be determined according to a gain function, or may be determined according to statistical information of the air conduction signal and the bone conduction signal, which is not limited in this embodiment of the present application.
- FIG. 8 is a schematic diagram of downsampling a bone conduction signal provided by an embodiment of the present application.
- the sampling rate of the bone conduction signal is 48kHz
- the collected bone conduction signal x[n] is first sent to the anti-aliasing filter H(z) to prevent signal aliasing.
- v[n] represents the bone conduction signal after the anti-aliasing filter, and the sampling rate remains unchanged.
- 3× downsampling is performed on v[n] to obtain the bone conduction signal y[m], with the sampling rate reduced to 16 kHz.
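- The downsampling chain of FIG. 8 can be sketched with SciPy's decimate, which applies an anti-aliasing low-pass filter (the H(z) stage) before extracting every third sample (illustrative only; the input here is a random placeholder signal):

```python
import numpy as np
from scipy.signal import decimate

fs_in = 48_000
x = np.random.randn(fs_in)   # placeholder for the bone conduction signal x[n]
# decimate() applies an anti-aliasing low-pass filter (the H(z) stage)
# before keeping every 3rd sample, yielding y[m] at 16 kHz.
y = decimate(x, q=3)
print(len(x), len(y))        # 48000 -> 16000
```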
- FIG. 9 is a schematic diagram of gain adjustment for a bone conduction signal provided by an embodiment of the present application, where x[n] represents the bone conduction signal and f(g) represents the gain function.
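- One plausible instantiation of the gain function f(g), assuming the energy-matching goal stated above (the function name and the epsilon guard are illustrative), is:

```python
import numpy as np

def match_average_energy(bone: np.ndarray, air: np.ndarray) -> np.ndarray:
    """One possible gain function f(g): scale the bone conduction signal so
    that its average energy equals that of the air conduction signal."""
    g = np.sqrt(np.mean(air ** 2) / (np.mean(bone ** 2) + 1e-12))
    return g * bone
```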
- Method 2 of determining the fusion signal based on the bone conduction signal: before determining the fusion signal based on the bone conduction signal, turn on the air microphone and collect the air conduction signal through the air microphone.
- the wearable device generates an enhanced initial signal based on the initial portion of the bone conduction signal, and fuses the enhanced initial signal with the air conduction signal to obtain a fusion signal.
- the initial part of the bone conduction signal is determined according to the detection time delay of the voice detection. That is to say, the wearable device uses the initial part of the bone conduction signal to generate an enhanced initial signal, and uses the enhanced initial signal to perform head-loss compensation on the collected air conduction signal, so that the obtained fusion signal also includes the command word information input by the sound source.
- the length of the fusion signal is short, which can reduce the amount of data processing to a certain extent.
- The difference from the above method 1 of determining the fusion signal based on the bone conduction signal is that, in method 2, the initial part of the bone conduction signal is used to generate an enhanced initial signal, and the enhanced initial signal, rather than the initial part of the bone conduction signal itself, is fused with the air conduction signal.
- the other contents introduced in the above method 1 are applicable to this method 2.
- Speech segment detection can also be performed on the bone conduction signal and the air conduction signal to extract the speech segments, and the signals are then spliced based on the intercepted speech segments, thereby reducing the amount of data processing.
- the wearable device can also perform preprocessing on the bone conduction signal and the air conduction signal, such as performing down-sampling and/or gain adjustment on the bone conduction signal, and performing speech enhancement on the air conduction signal.
- the wearable device may input the initial part of the bone conduction signal into the generation network model, so as to obtain the enhanced initial signal output by the generation network model.
- the generative network model is a model trained based on a deep learning algorithm, and the generative network model can be regarded as a signal generator, which can generate a speech signal that contains information about the input signal and is close to real speech based on the input signal.
- the enhanced initial signal includes signal information of the initial part of the bone conduction signal, and the enhanced initial signal is close to the real speech signal. It should be noted that the embodiment of the present application does not limit the network structure, training method, training equipment, etc. for generating the network model. Next, a training method for generating a network model is exemplarily introduced.
- the computer device obtains a first training data set, and the first training data set includes a plurality of first sample signal pairs.
- the computer device inputs the initial part of the bone conduction sample signal in the plurality of first sample signal pairs into the initial generation network model to obtain a plurality of enhanced initial sample signals output by the initial generation network model.
- the computer device inputs the plurality of enhanced initial sample signals and the initial part of the air conduction sample signal in the plurality of first sample signal pairs into the initial decision network model to obtain a decision result output by the initial decision network model.
- the computer device adjusts network parameters of the initial generated network model based on the decision result to obtain a trained generated network model.
- a first sample signal pair includes the initial part of a bone conduction sample signal and the initial part of an air conduction sample signal, and a first sample signal pair corresponds to one command word; the bone conduction sample signal and the air conduction sample signal contain the complete information of the corresponding command word.
- the first sample signal pair acquired by the computer device includes a bone conduction sample signal and an air conduction sample signal
- the computer device intercepts the initial part of the bone conduction sample signal and the initial part of the air conduction sample signal to obtain the input data of the initial generation network model and the initial decision network model. That is, the computer device first obtains the complete speech signals and then intercepts the initial parts to obtain the training data.
- the first sample signal pair acquired by the computer device only includes the initial part of the bone conduction sample signal and the initial part of the air conduction sample signal.
- the first training data set includes directly collected voice data, public voice data and/or voice data purchased from a third party.
- the computer device can preprocess the acquired first training data set to obtain a preprocessed first training data set, which can simulate real voice data so as to be closer to the speech of real scenes and increase the diversity of training samples.
- the first training data set is backed up, that is, an additional piece of data is added, and the backed up data is preprocessed.
- the backup data is divided into multiple parts, and a preprocessing is performed on each part of the data. The preprocessing for each part of the data can be different, which can double the total training data and ensure the comprehensiveness of the data.
- the method of preprocessing each piece of data may include one or more of adding noise (noise addition), volume enhancement, adding reverberation (add reverb), time shifting (time shifting), pitch shifting (pitch shifting), time stretching (time stretching), and the like.
- adding noise refers to mixing one or more types of background noise into the speech signal, so that the training data can cover more types of noise.
- For example, office environment noise, canteen environment noise, street environment noise, and other background noise.
- the SNR can be selected according to a normal distribution, so that the selected SNR values are concentrated around the average value.
- the average value can be 10dB, 20dB, etc., and the SNR can range from 10dB to 30dB.
- the volume enhancement refers to increasing or decreasing the volume of the speech signal according to the variation coefficient of the volume, and the value range of the variation coefficient of the volume may be 0.5 to 1.5, or other value ranges.
- Adding reverberation refers to adding reverberation processing to the speech signal, and the reverberation is caused by the reflection of the sound signal by the space environment.
- Pitch shifting, such as pitch correction, changes the pitch of the voice without affecting the speech rate.
- Time stretching refers to changing the speed or duration of the speech signal without affecting the pitch, that is, changing the speech rate, so that the training data can cover different speech rates; the speech rate coefficient can vary from 0.9 to 1.1 or within other ranges.
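- The noise-addition and volume-enhancement steps above can be sketched as follows (a minimal illustration; the SNR mixing formula is standard practice, and the function names are not taken from this application):

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix background noise into a speech signal at the requested SNR (dB)."""
    noise = np.resize(noise, speech.shape)        # tile/trim to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def change_volume(speech: np.ndarray, coeff: float) -> np.ndarray:
    """Volume enhancement with a variation coefficient, e.g. in [0.5, 1.5]."""
    return coeff * speech
```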
- FIG. 10 is a schematic diagram of a method for training and generating a network model provided by an embodiment of the present application.
- the generator (that is, the initial generation network model) is a network for generating speech signals. The initial part of the bone conduction sample signal in the first training data set is input into the generator; optionally, random noise is superimposed on the bone conduction sample signal before it is input into the generator.
- the generator processes the input bone conduction sample signal to generate an enhanced initial sample signal.
- the decision device (that is, the initial decision network model) is a decision network for judging whether the input signal is a real speech signal, and the decision result output by the decision device indicates whether the input signal is a real speech.
- If the output decision result is 1, it means the decider determines that the input signal is a real speech signal; if the output decision result is 0, it means the decider determines that the input signal is not a real speech signal. The parameters of the generator and the decider are adjusted according to whether the decision result is accurate, so as to train the generator and the decider.
- the goal of the generator is to generate fake speech signals to fool the decider, and the goal of the decider is to distinguish whether the input signal is real or generated. The generator and the decider essentially play a game through the training data, and the capabilities of both improve during the game; under ideal conditions, the accuracy of the trained decider is close to 0.5.
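- A minimal adversarial training loop in the spirit of FIG. 10 might look like the following PyTorch sketch (the network shapes, learning rates, and toy data loader are illustrative placeholders; the application does not fix the network structure):

```python
import torch
import torch.nn as nn

FEAT = 256  # placeholder feature size; the application does not fix this
# Toy data loader of (bone initial part, air initial part) feature pairs.
loader = [(torch.randn(8, FEAT), torch.randn(8, FEAT)) for _ in range(10)]

G = nn.Sequential(nn.Linear(FEAT, 512), nn.ReLU(), nn.Linear(512, FEAT))
D = nn.Sequential(nn.Linear(FEAT, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

for bone_init, air_init in loader:
    # Optionally superimpose random noise on the bone conduction sample.
    fake = G(bone_init + 0.01 * torch.randn_like(bone_init))
    ones = torch.ones(air_init.size(0), 1)
    zeros = torch.zeros(fake.size(0), 1)
    # Train the decider: real air-conduction initial parts -> 1, generated -> 0.
    d_loss = bce(D(air_init), ones) + bce(D(fake.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Train the generator to make the decider output 1 for generated signals.
    g_loss = bce(D(fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```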
- the computer device may also use other methods to generate the enhanced initial signal based on the initial signal of the bone conduction signal, which is not limited in this embodiment of the present application.
- Method 3 of determining the fusion signal based on the bone conduction signal: before determining the fusion signal based on the bone conduction signal, turn on the air microphone and collect the air conduction signal through the air microphone. The wearable device directly fuses the bone conduction signal and the air conduction signal to obtain the fusion signal. In this way, the obtained fusion signal also contains the command word information input by the sound source. In addition, the fusion signal contains both the complete speech information in the bone conduction signal and the complete speech information in the air conduction signal, so the speech features contained in the fusion signal are more abundant, which improves the accuracy of speech recognition to a certain extent.
- Except that the wearable device directly fuses the bone conduction signal and the air conduction signal, the other content introduced in the above method 1 is applicable to method 3 and will not be repeated here.
- the bone conduction signal and the air conduction signal may also be detected for speech segments to extract the speech segments, and the intercepted speech segments may be fused, thereby reducing the amount of data processing. It is also possible to perform preprocessing on the bone conduction signal and the air conduction signal, such as performing down-sampling and/or gain adjustment on the bone conduction signal, and performing endpoint detection and speech enhancement on the air conduction signal.
- x1[n] represents the bone conduction signal
- x2[n] represents the air conduction signal
- f(x): b[n] over [0, 2t−Δt] = concat[ x1[n] over [0, t], x2[n] over [Δt, t] ]. That is, the bone conduction signal (the speech segment from 0 to t) and the air conduction signal (the signal segment from Δt to t) are spliced by f(x) to obtain the fusion signal b[n].
- the wearable device determines the bone conduction signal as the fusion signal. That is, it is also possible to detect the wake-up word only by using the bone conduction signal.
- the difference from the above method 1 of determining the fusion signal based on the bone conduction signal is that in the method 4 of determining the fusion signal based on the bone conduction signal, the bone conduction signal is directly used as the fusion signal.
- the other content introduced in the above-mentioned method 1 is applicable to the method 4, and will not be described in detail in the method 4 one by one.
- the bone conduction signal may also be detected for the speech segment to extract the speech segment, and the intercepted speech segment is used as the fusion signal, thereby reducing the amount of data processing. Preprocessing may also be performed on the bone conduction signal, for example, performing down-sampling and/or gain adjustment on the bone conduction signal.
- the wearable device inputs the multiple audio frames included in the fusion signal into the first acoustic model, so as to obtain multiple posterior probability vectors output by the first acoustic model.
- the wearable device detects the wake-up word based on the plurality of posterior probability vectors.
- the plurality of posterior probability vectors correspond one-to-one to the plurality of audio frames included in the fusion signal; that is, one posterior probability vector corresponds to one audio frame included in the fusion signal. Among the plurality of posterior probability vectors, the first posterior probability vector is used to indicate the probability that the phoneme of the first audio frame in the plurality of audio frames belongs to multiple designated phonemes; that is, a posterior probability vector indicates the probability that the phoneme of the corresponding audio frame belongs to the multiple designated phonemes. In other words, the wearable device processes the fusion signal through the first acoustic model to obtain the phoneme information contained in the fusion signal, so as to detect the wake-up word based on the phoneme information.
- the first acoustic model may be the network model as described above, or a model of other structures.
- the first acoustic model processes each audio frame included in the fusion signal to obtain the posterior probability vector corresponding to each audio frame output by the first acoustic model .
- After the wearable device obtains the multiple posterior probability vectors output by the first acoustic model, it determines, based on the multiple posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word input by the sound source includes the phoneme sequence corresponding to the wake-up word. If the confidence exceeds the confidence threshold, it is determined that the command word includes the wake-up word. That is, the wearable device decodes the plurality of posterior probability vectors to determine the confidence.
- the phoneme sequence corresponding to the wake-up word is called a decoding path
- the determined confidence level may be called a path score
- the confidence level threshold may be called a wake-up threshold.
- The wearable device searches the constructed decoding graph for the probability of each phoneme on the decoding path, and adds up the probabilities of the phonemes found to obtain the confidence.
- the decoding path refers to the phoneme sequence corresponding to the wake-up word. If the confidence is greater than the confidence threshold, the wearable device determines that the command word input by the sound source is detected to include the wake-up word.
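- As a toy illustration of scoring a decoding path from per-frame posterior probability vectors (one plausible rule among several; the application does not fix the exact scoring):

```python
import numpy as np

def path_confidence(posteriors: np.ndarray, decoding_path: list[int]) -> float:
    """posteriors: [T, P] per-frame posterior probability vectors.
    decoding_path: phoneme indices of the wake-up word, in order.
    One plausible rule: take each path phoneme's best frame-level posterior
    and average the scores into a single confidence value."""
    scores = [float(posteriors[:, p].max()) for p in decoding_path]
    return float(np.mean(scores))

# Wake-up is triggered when the confidence exceeds the confidence threshold.
```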
- the multiple template vectors indicate probabilities that the phonemes of the speech signal including the complete information of the wake-up word belong to multiple specified phonemes. That is, the current input speech not only needs to satisfy the confidence condition, but also needs to match the template.
- the confidence threshold can be preset, for example, based on experience, or determined according to the bone conduction registration signal and/or air conduction registration signal containing complete information of the wake-up word when registering the wake-up word. The method is described below.
- the multiple template vectors are registration posterior probability vectors determined according to the bone conduction registration signal and/or the air conduction registration signal, and the specific implementation will be introduced below.
- the distance condition includes: the average value of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than the distance threshold.
- the wearable device can directly calculate the distances between the multiple posterior probability vectors and the corresponding template vectors and calculate an average value. For example, when the duration of the voice input by the current sound source is consistent with the duration of the voice input by the user when the wake-up word is registered, there may be a one-to-one correspondence between the plurality of posterior probability vectors and the plurality of template vectors.
- When the durations differ, the wearable device can adopt a dynamic time warping (dynamic time warping, DTW) method to establish the mapping relationship between the multiple posterior probability vectors and the multiple template vectors, so that it can calculate the distances between the multiple posterior probability vectors and the corresponding template vectors.
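- The DTW-based distance check can be sketched as follows (a minimal O(T1·T2) implementation; the normalization by the longer sequence length is an assumption about how the average is taken):

```python
import numpy as np

def dtw_mean_distance(post: np.ndarray, templ: np.ndarray) -> float:
    """Align posterior vectors [T1, P] with template vectors [T2, P] by DTW
    and return an average distance along the optimal alignment."""
    T1, T2 = len(post), len(templ)
    d = np.linalg.norm(post[:, None, :] - templ[None, :, :], axis=-1)
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = d[i - 1, j - 1] + min(acc[i - 1, j],
                                              acc[i, j - 1],
                                              acc[i - 1, j - 1])
    return float(acc[T1, T2] / max(T1, T2))  # normalize by the longer sequence

# Wake-up fires only if confidence > threshold AND mean distance < distance threshold.
```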
- In the first implementation, the wearable device first determines the fusion signal based on the bone conduction signal (in one of the four methods above), and then processes the fusion signal through the acoustic model to obtain the posterior probability vectors. The wearable device then decodes the obtained posterior probability vectors based on the decoding path corresponding to the wake-up word, so as to obtain the confidence corresponding to the command word currently input by the sound source. If the confidence is greater than the confidence threshold, the wearable device determines that the detected command word includes the wake-up word.
- Next, the second implementation in which the wearable device detects the wake-up word based on the bone conduction signal is introduced.
- Before the wearable device detects the wake-up word based on the bone conduction signal, the air microphone is turned on, and the air conduction signal is collected through the air microphone. For example, when a voice input is detected, the air microphone is turned on, and the air conduction signal is collected through the air microphone.
- the wearable device determines multiple posterior probability vectors based on the bone conduction signal and the air conduction signal, and detects the wake-up word based on the multiple posterior probability vectors.
- the plurality of posterior probability vectors correspond one-to-one to the plurality of audio frames included in the bone conduction signal and the air conduction signal
- the first posterior probability vector in the plurality of posterior probability vectors is used to indicate the probability that the phoneme of the first audio frame in the plurality of audio frames belongs to the multiple specified phonemes.
- the multiple audio frames include the audio frames included in the bone conduction signal and the audio frames included in the air conduction signal. That is, each of the plurality of posterior probability vectors corresponds to an audio frame included in the bone conduction signal or the air conduction signal, and one posterior probability vector indicates the probability that the phoneme of the corresponding audio frame belongs to the multiple specified phonemes.
- For the bone conduction signal and the air conduction signal, reference may be made to the content in the first implementation above, including the generation principle of the bone conduction signal and the air conduction signal, the preprocessing of the bone conduction signal and the air conduction signal, and the like, which will not be repeated here.
- Next, the manners in which the wearable device determines multiple posterior probability vectors based on the bone conduction signal and the air conduction signal are described. It should be noted that there are many such manners; three of them are introduced next.
- Method 1 of determining multiple posterior probability vectors based on the bone conduction signal and the air conduction signal: the wearable device inputs the initial part of the bone conduction signal and the air conduction signal into the second acoustic model to obtain the first number of bone conduction posterior probability vectors and the second number of air conduction posterior probability vectors output by the second acoustic model.
- the initial part of the bone conduction signal is determined according to the detection time delay of voice detection
- the first number of bone conduction posterior probability vectors correspond one-to-one to the audio frames included in the initial part of the bone conduction signal, and the second number of air conduction posterior probability vectors correspond one-to-one to the audio frames included in the air conduction signal.
- the wearable device fuses the first bone conduction posterior probability vector and the first air conduction posterior probability vector to obtain a second posterior probability vector.
- the first bone conduction posterior probability vector corresponds to the last audio frame of the initial part of the bone conduction signal, where the duration of the last audio frame is less than the frame duration; the first air conduction posterior probability vector corresponds to the first audio frame of the air conduction signal, where the duration of the first audio frame is less than the frame duration.
- the multiple posterior probability vectors finally determined by the wearable device include the second posterior probability vector, the vectors of the first number of bone conduction posterior probability vectors other than the first bone conduction posterior probability vector, and the vectors of the second number of air conduction posterior probability vectors other than the first air conduction posterior probability vector.
- the first quantity and the second quantity may be the same or different.
- the last audio frame of the initial part of the bone conduction signal may not be a complete audio frame, that is, the duration of the last audio frame is less than the frame duration, for example, the initial part of the bone conduction signal includes half a frame Duration of audio frames.
- the first audio frame of the air conduction signal may not be a complete audio frame, that is, the duration of the first audio frame is less than the frame duration, for example, the first audio frame of the air conduction signal includes half audio frames with frame duration.
- the sum of the duration of the last audio frame of the initial part of the bone conduction signal and the duration of the first audio frame of the air conduction signal may be equal to the frame duration.
- Due to the detection delay of the voice detection (such as VAD), the last frame of the initial part of the bone conduction signal and the first frame of the air conduction signal may be incomplete, and the two are combined to represent the information of one complete audio frame. It should be noted that this complete audio frame is a potential audio frame, not an actual one.
- The wearable device adds the first bone conduction posterior probability vector and the first air conduction posterior probability vector to obtain the second posterior probability vector, and the obtained second posterior probability vector indicates the probability that the phoneme of the above complete audio frame belongs to the multiple specified phonemes.
- the wearable device needs to fuse (such as add) the second bone conduction posterior probability vector and the second air conduction posterior probability vector to obtain multiple posterior probability vectors.
- If the duration of the last audio frame of the initial part of the bone conduction signal is equal to the frame duration and the duration of the first audio frame of the air conduction signal is equal to the frame duration, the wearable device directly uses the obtained first number of bone conduction posterior probability vectors and second number of air conduction posterior probability vectors as the plurality of posterior probability vectors and performs subsequent processing.
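- The boundary-frame fusion described above can be sketched as follows (illustrative; it follows the addition-based fusion of the two boundary posterior vectors described above):

```python
import numpy as np

def merge_posteriors(bone_pvs: np.ndarray, air_pvs: np.ndarray,
                     boundary_incomplete: bool) -> np.ndarray:
    """bone_pvs: [N1, P] posteriors for the initial part of the bone signal.
    air_pvs: [N2, P] posteriors for the air conduction signal.
    When the boundary frames are incomplete, the last bone vector and the
    first air vector are added into one vector for the potential full frame."""
    if boundary_incomplete:
        merged = bone_pvs[-1] + air_pvs[0]
        return np.vstack([bone_pvs[:-1], merged, air_pvs[1:]])
    return np.vstack([bone_pvs, air_pvs])
```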
- Fig. 11 is a schematic structural diagram of another acoustic model provided by an embodiment of the present application.
- the acoustic model shown in FIG. 11 is the second acoustic model in the embodiment of the present application.
- the second acoustic model in the embodiment of the present application includes two input layers (not shown), one shared network layer and two output layers.
- the two input layers are used to respectively input the initial part of the bone conduction signal and the air conduction signal.
- the shared network layer is used to process the input data of the two input layers separately, so as to extract the initial part of the bone conduction signal and the characteristics of the air conduction signal respectively.
- These two output layers are used to respectively receive the two outputs of the shared network layer and process them to output the first number of bone conduction posterior probability vectors corresponding to the initial part of the bone conduction signal and the second number of air conduction posterior probability vectors corresponding to the air conduction signal. That is, the wearable device processes the initial part of the bone conduction signal and the air conduction signal through the second acoustic model to obtain two sets of posterior probability vectors corresponding to the two parts of the signal, except that the shared network layer in the acoustic model allows the two parts of the signal to share some network parameters.
- Afterwards, the wearable device fuses the obtained first bone conduction posterior probability vector with the first air conduction posterior probability vector to obtain the second posterior probability vector, so that the multiple bone conduction posterior probability vectors and the multiple air conduction posterior probability vectors are fused into the plurality of posterior probability vectors; that is, the wearable device fuses the posterior probabilities of the two parts of the signal so that the obtained plurality of posterior probability vectors contain the command word information input by the sound source.
- The scheme of processing the initial part of the bone conduction signal and the air conduction signal based on the second acoustic model can be considered a multi-task (multi-task) scheme; that is, the initial part of the bone conduction signal and the air conduction signal are treated as two tasks, and shared network parameters are used to determine the corresponding posterior probability vectors, thereby implicitly fusing the initial part of the bone conduction signal with the air conduction signal.
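- A minimal PyTorch sketch of such a multi-task acoustic model, with one shared layer and two output heads (all layer types and sizes are illustrative assumptions; the application does not fix the architecture), might look as follows:

```python
import torch
import torch.nn as nn

class SharedTwoHeadModel(nn.Module):
    """Sketch of the second acoustic model: two inputs, one shared network
    layer, two softmax output layers."""
    def __init__(self, feat_dim: int = 40, hidden: int = 128, num_phonemes: int = 64):
        super().__init__()
        self.shared = nn.GRU(feat_dim, hidden, batch_first=True)  # shared parameters
        self.bone_head = nn.Linear(hidden, num_phonemes)
        self.air_head = nn.Linear(hidden, num_phonemes)

    def forward(self, bone_feats: torch.Tensor, air_feats: torch.Tensor):
        b, _ = self.shared(bone_feats)   # both tasks pass through the shared layer
        a, _ = self.shared(air_feats)
        bone_pvs = torch.softmax(self.bone_head(b), dim=-1)
        air_pvs = torch.softmax(self.air_head(a), dim=-1)
        return bone_pvs, air_pvs
```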
- Method 2 of determining multiple posterior probability vectors based on the bone conduction signal and the air conduction signal: the wearable device inputs the initial part of the bone conduction signal and the air conduction signal into the third acoustic model to obtain multiple posterior probability vectors output by the third acoustic model.
- the initial part of the bone conduction signal is determined according to the detection time delay of the voice detection. It should be noted that for the relevant introduction about the initial part of the bone conduction signal, reference may also be made to the content in the aforementioned first implementation manner, which will not be repeated here.
- the third acoustic model includes two input layers (for example, input layers including DNN and CNN layers), a splicing layer (concat layer), a network parameter layer (for example, including an RNN), and an output layer (for example, including softmax).
- the two input layers are used to input the bone conduction signal and the air conduction signal respectively
- the splicing layer is used to splice the output data of the two input layers
- the network parameter layer is used to process the output data of the splicing layer
- the output layer is used to output a set of posterior probability vectors.
- the wearable device simultaneously inputs the initial part of the bone conduction signal and the air conduction signal into the third acoustic model, and the splicing layer in the third acoustic model implicitly fuses the initial part of the bone conduction signal with the air conduction signal, so that a set of posterior probability vectors is obtained and the obtained multiple posterior probability vectors contain the command word information input by the sound source. This can also be regarded as a method of head-loss compensation for the air conduction signal based on the bone conduction signal, except that the compensation is not performed by directly fusing the signals.
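- A corresponding sketch of the third acoustic model (two input branches, a splicing layer, an RNN parameter layer, and a single softmax output; the layer choices are illustrative, and both branches are assumed to produce frame sequences of equal length):

```python
import torch
import torch.nn as nn

class ConcatAcousticModel(nn.Module):
    """Sketch of the third acoustic model: two input branches, a splicing
    (concat) layer, an RNN parameter layer, and one softmax output layer."""
    def __init__(self, feat_dim: int = 40, hidden: int = 128, num_phonemes: int = 64):
        super().__init__()
        self.bone_in = nn.Linear(feat_dim, hidden)                 # e.g. DNN branch
        self.air_in = nn.Conv1d(feat_dim, hidden, 3, padding=1)    # e.g. CNN branch
        self.rnn = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_phonemes)

    def forward(self, bone_feats: torch.Tensor, air_feats: torch.Tensor):
        b = self.bone_in(bone_feats)                               # [B, T, H]
        a = self.air_in(air_feats.transpose(1, 2)).transpose(1, 2) # [B, T, H]
        x = torch.cat([b, a], dim=-1)                              # splicing layer
        h, _ = self.rnn(x)
        return torch.softmax(self.out(h), dim=-1)                  # one set of posteriors
```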
- Method 3 of determining multiple posterior probability vectors based on the bone conduction signal and the air conduction signal: the wearable device inputs the bone conduction signal and the air conduction signal into the third acoustic model to obtain multiple posterior probability vectors output by the third acoustic model. That is, the wearable device directly inputs the bone conduction signal and the air conduction signal into the third acoustic model at the same time, and the third acoustic model outputs a set of posterior probability vectors, so that the obtained multiple posterior probability vectors include the command word information input by the sound source. This can also be regarded as a method of head-loss compensation for the air conduction signal based on the bone conduction signal, except that the compensation is not performed by directly fusing the signals.
- The wearable device determines, based on the plurality of posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word input by the sound source includes the phoneme sequence corresponding to the wake-up word. If the confidence exceeds the confidence threshold, it is determined that the wake-up word is detected.
- Optionally, the wearable device determines that the command word includes the wake-up word only when the confidence exceeds the confidence threshold and the distance condition is satisfied between the plurality of posterior probability vectors and the plurality of template vectors.
- the distance condition includes: the average value of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than the distance threshold.
- Step 403: When it is detected that the command word includes the wake-up word, wake up the device to be woken up by voice.
- When it is detected that the command word input by the sound source includes the wake-up word, the wearable device performs voice wake-up. For example, the wearable device sends a wake-up instruction to the smart device (that is, the device to be woken up) to wake it up. Alternatively, when the wearable device itself is a smart device, the wearable device wakes up the components or modules other than the bone conduction microphone; that is, the wearable device as a whole enters the working state.
- The voice wake-up method provided by the embodiments of the present application has multiple implementations, such as the first implementation and the second implementation described above, and each of these two implementations includes multiple specific implementation ways. Next, referring to FIG. 13 to FIG. 18, the several specific implementations introduced above are explained again.
- FIG. 13 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
- FIG. 13 corresponds to manner 1 in the above-mentioned first implementation manner.
- the wearable device collects the bone conduction signal through the bone conduction microphone, and performs VAD on the bone conduction signal through the VAD control module. When a voice input is detected, the VAD control module outputs a high-level VAD control signal; otherwise, the VAD control module outputs a low-level VAD control signal.
- the VAD control module sends the VAD control signal to the air microphone control module, the front-end enhancement module and the recognition engine respectively.
- the VAD control signal is used to control the switch of the air microphone control module, the front-end enhancement module and the recognition engine.
- When the VAD control signal is at a high level, the air microphone control module controls the air microphone to turn on to collect the air conduction signal, the front-end enhancement module is turned on to perform front-end enhancement on the air conduction signal, and the recognition engine is turned on to perform wake-up word detection based on the bone conduction signal and the air conduction signal.
- the fusion module performs preprocessing such as downsampling and/or gain adjustment on the bone conduction signal, and uses the initial part of the preprocessed bone conduction signal to perform head loss compensation on the front-end enhanced air conduction signal to obtain the fusion signal.
- the fusion module sends the fusion signal to the recognition engine, and the recognition engine recognizes the fusion signal through the first acoustic model to obtain the detection result of the wake word.
- the recognition engine sends the obtained detection result to the processor (such as the illustrated micro-controller unit (MCU)), and the processor determines whether to wake up the smart device based on the detection result. If the detection result indicates that it is detected that the command word input by the sound source includes a wake-up word, the processor wakes up the smart device by voice. If the detection result indicates that no wake word is detected, the processor does not wake up the smart device.
- FIG. 14 to FIG. 16 are flowcharts of three other voice wake-up methods provided by the embodiment of the present application.
- the fusion module generates an enhanced initial signal based on the initial part of the preprocessed bone conduction signal, and uses the enhanced initial signal to perform head-loss compensation on the front-end enhanced air conduction signal to obtain the fusion signal.
- the fusion module directly splices the preprocessed bone conduction signal and the front-end enhanced air conduction signal to perform head loss compensation on the air conduction signal, thereby obtaining a fusion signal.
- the VAD control signal does not need to be sent to the air microphone control module, so there is no need to collect the air conduction signal.
- the recognition engine directly determines the preprocessed bone conduction signal as the fusion signal.
- FIG. 17 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
- the recognition engine inputs the initial part of the preprocessed bone conduction signal and the front-end enhanced air conduction signal into the second acoustic model, and the two output layers of the second acoustic model respectively output the bone conduction posterior probability vectors and the air conduction posterior probability vectors; that is, a posterior probability pair is obtained.
- the recognition engine fuses the bone conduction posterior probability vector and the air conduction posterior probability vector to obtain multiple posterior probability vectors, and decodes the multiple posterior probability vectors to obtain the detection result of the wake word.
- Fig. 18 is a flow chart of another voice wake-up method provided by the embodiment of the present application.
- the difference between FIG. 18 and FIG. 17 is that, in the method shown in FIG. 18, the recognition engine inputs the initial part of the preprocessed bone conduction signal and the front-end enhanced air conduction signal into the third acoustic model, or inputs the preprocessed bone conduction signal and the front-end enhanced air conduction signal into the third acoustic model, so as to obtain a plurality of posterior probability vectors output by the single output layer of the third acoustic model.
- In the embodiments of the present application, the bone conduction microphone is used to collect bone conduction signals for voice detection, which ensures low power consumption. While ensuring low power consumption, it is considered that, due to the delay of voice detection, the collected air conduction signal may lose its head and thus not contain the complete information of the command word input by the sound source, whereas the bone conduction signal collected by the bone conduction microphone contains the command word information input by the sound source, that is, the head of the bone conduction signal is not lost; therefore, this solution detects the wake-up word based on the bone conduction signal. In this way, the recognition accuracy of the wake-up word is higher, and the accuracy of voice wake-up is higher.
- head loss compensation may be performed directly or implicitly on the air conduction signal based on the bone conduction signal, or the wake-up word detection may be performed directly based on the bone conduction signal.
- the wake-up word can also be registered in the wearable device.
- the confidence threshold and the multiple template vectors in the above-mentioned embodiment can also be determined while registering the wake-up word. Next, the registration process of the wake-up word will be introduced.
- the wearable device first determines the phoneme sequence corresponding to the wake-up word. Afterwards, the wearable device obtains the bone conduction registration signal, and the bone conduction registration signal contains the complete information of the wake-up word. The wearable device determines a confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word. Optionally, the wearable device may also determine multiple template vectors based on the bone conduction registration signal.
- the wearable device obtains the input wake-up word, and determines the phoneme sequence corresponding to the wake-up word according to the pronunciation dictionary. Taking the user inputting the wake-up word text to the wearable device as an example, the wearable device obtains the wake-up word text input by the user, and determines the phoneme sequence corresponding to the wake-up word according to the pronunciation dictionary.
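- A minimal sketch of this dictionary lookup is shown below; the entries and the token segmentation are illustrative only (a real system would use a full pronunciation dictionary and proper word segmentation):

```python
# Toy pronunciation dictionary: token -> phoneme list (illustrative entries).
PRONUNCIATION_DICT = {
    "ni": ["n", "i"],
    "hao": ["h", "ao"],
}

def text_to_phoneme_sequence(tokens, lexicon):
    """Concatenate the phonemes of each token of the wake-up word text;
    the resulting sequence later serves as the decoding path."""
    phonemes = []
    for token in tokens:
        if token not in lexicon:
            raise KeyError(f"no pronunciation entry for {token!r}")
        phonemes.extend(lexicon[token])
    return phonemes

# e.g. text_to_phoneme_sequence(["ni", "hao"], PRONUNCIATION_DICT) -> ['n', 'i', 'h', 'ao']
```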
- the wearable device can also detect whether the input wake-up word text meets the text registration conditions after the user enters the wake-up word text, and, for text that meets the conditions, determines the phoneme sequence corresponding to the wake-up word text according to the pronunciation dictionary.
- the text registration conditions include a text input times requirement, a character requirement and so on. Taking the text input times requirement as requiring the user to input the wake-up word text one or more times as an example, every time the wearable device detects a wake-up word text entered by the user, it performs text verification and analysis on the input text to check whether the currently entered wake-up word text meets the character requirements. If the wake-up word text entered by the user does not meet the character requirements, the wearable device prompts the user with text or sound to indicate the reason for the non-compliance and requires re-input. If the wake-up word texts input by the user one or more times all meet the character requirements and are identical, the wearable device determines the phoneme sequence corresponding to the wake-up word text according to the pronunciation dictionary.
- the wearable device detects whether the currently input wake word text meets character requirements through text verification.
- the character requirements include one or more of the following: the text must be Chinese (non-Chinese characters do not meet the requirement); it must contain 4 to 6 characters (fewer than 4 or more than 6 do not meet the requirement); it must contain no modal particles (if any exist, the requirement is not met); it must contain no more than 3 repeated words with the same pronunciation (otherwise the requirement is not met); it must differ from existing command words (if it duplicates one, the requirement is not met); the phoneme overlap ratio with existing command words must not exceed 70% (more than 70% does not meet the requirement, which is used to prevent accidental triggering); and the corresponding phonemes must belong to the phonemes in the pronunciation dictionary (if they do not, the requirement is not met; this is an exceptional situation).
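- The following sketch mirrors these character checks; it is illustrative only, the parameter names are hypothetical, and the repeated-pronunciation check is simplified to repeated characters (a real implementation would compare pronunciations):

```python
MODAL_PARTICLES = {"吗", "呢", "吧", "啊"}  # illustrative set of modal particles

def meets_character_requirements(text, existing_words, lexicon, known_phonemes):
    """Return True only if the wake-up word text passes the checks above."""
    if not all("\u4e00" <= c <= "\u9fff" for c in text):
        return False  # must be Chinese
    if not 4 <= len(text) <= 6:
        return False  # must contain 4 to 6 characters
    if any(c in MODAL_PARTICLES for c in text):
        return False  # no modal particles
    if max(text.count(c) for c in set(text)) > 3:
        return False  # no more than 3 repeats (simplified to identical characters)
    if text in existing_words:
        return False  # must differ from existing command words
    phonemes = [p for c in text for p in lexicon.get(c, [c])]
    for word in existing_words:
        word_ph = [p for c in word for p in lexicon.get(c, [c])]
        overlap = len(set(phonemes) & set(word_ph)) / max(len(set(phonemes)), 1)
        if overlap > 0.7:
            return False  # phoneme overlap with an existing command word above 70%
    return all(p in known_phonemes for p in phonemes)  # phonemes must be in the dictionary
```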
- the text registration can determine the phoneme sequence corresponding to the wake-up word.
- the phoneme sequence can subsequently be used as a decoding path for the wake-up word, and the decoding path is used to detect the wake-up word during the voice wake-up process.
- Voice registration is required in addition to text registration.
- after the text registration is completed, the wearable device also needs to acquire the bone conduction registration signal, which includes the complete information of the wake-up word.
- the wearable device also acquires the air conduction registration signal while acquiring the bone conduction registration signal.
- after obtaining the input bone conduction registration signal and air conduction registration signal, the wearable device verifies whether the bone conduction registration signal and the air conduction registration signal meet the voice registration conditions; if the voice registration conditions are met, the wearable device performs subsequent processing to determine the confidence threshold.
- the voice registration condition includes the voice input frequency requirement, the signal-to-noise ratio requirement, the path score requirement, and the like.
- for example, the user needs to input the wake-up word voice three times (each input includes a bone conduction registration signal and an air conduction registration signal). Every time the wearable device detects a wake-up word voice input by the user, it performs verification and analysis on the input to check whether the currently input wake-up word voice meets the signal-to-noise ratio requirement and the path score requirement. If the wake-up word voice entered by the user does not meet these requirements, the wearable device prompts the user with text or sound to indicate the reason for the non-compliance and requires re-input. If the wake-up word voices input by the user three times all meet the SNR requirement and the path score requirement, the wearable device determines that the wake-up word voices input by the user meet the voice registration conditions, and performs subsequent processing.
- the wearable device can first detect whether the input wake-up word speech meets the SNR requirement, and then detect whether the input wake-up word speech meets the path score requirement after determining that the input wake-up word speech meets the SNR requirement.
- the signal-to-noise ratio requirement includes that the signal-to-noise ratio is not lower than a signal-to-noise ratio threshold (below the threshold, the requirement is not met); for example, it is required that the signal-to-noise ratio of the bone conduction registration signal is not lower than a first signal-to-noise ratio threshold, and/or that the signal-to-noise ratio of the air conduction registration signal is not lower than a second signal-to-noise ratio threshold.
- the first SNR threshold is greater than the second SNR threshold. If the voice of the wake-up word input by the user does not meet the SNR requirements, the wearable device will prompt the user that the current environment is noisy and not suitable for registration, and the user needs to find a quiet environment to re-enter the voice of the wake-up word.
- the path score requirements include: the path score obtained from each input wake-up word voice is not less than a calibration threshold; the average of the three path scores obtained from the three input wake-up word voices is not less than the calibration threshold; and, for any two input wake-up word voices, the two resulting path scores differ by no more than 100 points (or another value). The implementation process of obtaining a path score based on the wake-up word voice will be introduced below; it is essentially similar to the process of obtaining the confidence level based on the bone conduction signal in the aforementioned voice wake-up process.
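- A sketch of this registration check is given below; the concrete threshold values are placeholders, since the embodiment fixes no numbers beyond the 100-point gap:

```python
from itertools import combinations

def voice_registration_ok(snr_bone, snr_air, path_scores,
                          snr_bone_min=20.0,   # first SNR threshold (assumed value)
                          snr_air_min=15.0,    # second SNR threshold (assumed value)
                          calib_threshold=0.0, # calibration threshold (assumed value)
                          max_gap=100.0):
    """Check the SNR requirement and the path score requirements for the
    three wake-up word inputs collected during voice registration."""
    if snr_bone < snr_bone_min or snr_air < snr_air_min:
        return False  # SNR requirement not met
    if any(s < calib_threshold for s in path_scores):
        return False  # every single path score must reach the calibration threshold
    if sum(path_scores) / len(path_scores) < calib_threshold:
        return False  # the average path score must reach the calibration threshold
    return all(abs(a - b) <= max_gap for a, b in combinations(path_scores, 2))
```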
- the wearable device determines the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word. Similar to the confidence level obtained based on the bone conduction signal in the aforementioned voice wake-up process, the wearable device can determine the confidence level threshold in a variety of implementation ways, and two of the implementation ways will be introduced next.
- the wearable device determines a fusion registration signal based on the bone conduction registration signal, and determines a confidence threshold and a plurality of template vectors based on the fusion registration signal and the phoneme sequence corresponding to the wake-up word.
- the wearable device determines the fusion registration signal based on the bone conduction registration signal. It should be noted that there are multiple ways for the wearable device to determine the fusion registration signal based on the bone conduction registration signal, four of which will be introduced next.
- Method 1 of determining the fusion registration signal based on the bone conduction registration signal: before determining the fusion registration signal, the air conduction registration signal is acquired. The wearable device fuses the initial part of the bone conduction registration signal with the air conduction registration signal to obtain the fusion registration signal, where the initial part of the bone conduction registration signal is determined according to the detection time delay of the voice detection. Optionally, signal fusion is performed by signal splicing in the embodiment of the present application.
- the wearable device can also detect the voice segment of the bone conduction registration signal and the air conduction registration signal to intercept the voice segment, and perform signal splicing based on the intercepted voice segment, thereby reducing the amount of data processing. It is also possible to perform preprocessing on the bone conduction registration signal and the air conduction registration signal, such as performing down-sampling and/or gain adjustment on the bone conduction registration signal, performing speech enhancement on the air conduction signal, and the like.
- the specific implementation manner is similar to the principle of the relevant content in the foregoing embodiments, please refer to the foregoing embodiments, and no detailed description will be given here.
- Method 2 of determining the fusion registration signal based on the bone conduction registration signal: before determining the fusion registration signal, the air conduction registration signal is acquired.
- the wearable device generates an enhanced initial registration signal based on the initial part of the bone conduction registration signal, and fuses the enhanced initial registration signal and the air conduction registration signal to obtain a fused registration signal.
- the initial part of the bone conduction registration signal is determined according to the detection time delay of the voice detection.
- that is, the wearable device uses the initial part of the bone conduction registration signal to generate an enhanced initial registration signal, and fuses the enhanced initial registration signal with the air conduction registration signal, instead of fusing the initial part of the bone conduction registration signal with the air conduction registration signal directly.
- the wearable device can also detect the voice segment of the bone conduction registration signal and the air conduction registration signal to intercept the voice segment, and perform signal fusion based on the intercepted voice segment, thereby reducing the amount of data processing.
- the wearable device can also perform preprocessing on the bone conduction registration signal and the air conduction registration signal, such as performing down-sampling and/or gain adjustment on the bone conduction registration signal, performing voice enhancement on the air conduction signal, and the like.
- the wearable device may input the initial part of the bone conduction registration signal into the generating network model, so as to obtain the enhanced initial registration signal output by the generating network model.
- the generated network model may be the same as the generated network model described above, or may be another generated network model, which is not limited in this embodiment of the present application.
- the embodiment of the present application also does not limit the network structure, training method, training equipment, etc. of the generated network model.
- Method 3 of determining the fusion registration signal based on the bone conduction registration signal: before determining the fusion registration signal, the air conduction registration signal is acquired.
- the wearable device directly fuses the bone conduction registration signal and the air conduction registration signal to obtain a fusion registration signal.
- the wearable device can also detect the voice segment of the bone conduction registration signal and the air conduction registration signal to intercept the voice segment, and perform signal fusion based on the intercepted voice segment, thereby reducing the amount of data processing.
- the wearable device can also perform preprocessing on the bone conduction registration signal and the air conduction registration signal, such as performing down-sampling and/or gain adjustment on the bone conduction registration signal, performing voice enhancement on the air conduction signal, and the like.
- Method 4 of determining the fusion registration signal based on the bone conduction registration signal: the wearable device determines the bone conduction registration signal as the fusion registration signal.
- the wearable device directly uses the bone conduction registration signal as the fused registration signal.
- the wearable device can also detect the voice segment of the bone conduction registration signal to extract the voice segment, and perform subsequent processing based on the extracted voice segment, thereby reducing the amount of data processing.
- the wearable device may also perform preprocessing on the bone conduction registration signal, for example, perform down-sampling and/or gain adjustment on the bone conduction registration signal.
- the wearable device inputs the multiple registration audio frames included in the fused registration signal into the first acoustic model, so as to obtain multiple registration posterior probability vectors output by the first acoustic model.
- the plurality of registration posterior probability vectors correspond one-to-one to the plurality of registration audio frames, and the first registration posterior probability vector among the plurality of registration posterior probability vectors indicates the probability that the phoneme of the first registration audio frame among the plurality of registration audio frames belongs to a plurality of specified phonemes. That is, each registration posterior probability vector corresponds to one registration audio frame included in the fused registration signal, and a registration posterior probability vector indicates the probability that the phoneme of the corresponding registration audio frame belongs to the plurality of specified phonemes.
- the wearable device determines the plurality of enrollment posterior probability vectors as a plurality of template vectors.
- the wearable device determines a confidence threshold based on the plurality of registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word. That is, the wearable device processes the fused registration signal through the first acoustic model to obtain the information of the phonemes contained in the fused registration signal, that is, obtains the registration posterior probability vectors, uses the registration posterior probability vectors as template vectors, and stores the template vectors.
- the wearable device also decodes the registration posterior probability vector based on the phoneme sequence (that is, the decoding path) corresponding to the wake-up word to determine a path score, uses the path score as a confidence threshold, and stores the confidence threshold.
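- The embodiment does not spell out the decoding algorithm, so the following sketch stands in with a simple uniform forced alignment that spreads the frames over the phonemes of the decoding path and accumulates log posteriors; the path score obtained at registration is then stored as the confidence threshold:

```python
import numpy as np

def path_score(posteriors: np.ndarray, decoding_path) -> float:
    """posteriors: (T, P) registration posterior probability vectors;
    decoding_path: phoneme indices of the wake-up word's phoneme sequence."""
    log_post = np.log(posteriors + 1e-10)
    # Uniform alignment of the T frames over the phonemes of the decoding path.
    bounds = np.linspace(0, posteriors.shape[0], len(decoding_path) + 1).astype(int)
    return float(sum(log_post[bounds[k]:bounds[k + 1], ph].sum()
                     for k, ph in enumerate(decoding_path)))
```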
- in short, to determine the confidence threshold based on the bone conduction registration signal, the wearable device first fuses the registration signals (in one of the four ways above), and then processes the fused registration signal through an acoustic model to obtain the registration posterior probability vectors.
- the wearable device decodes the obtained registration posterior probability vector based on the decoding path corresponding to the wake-up word to obtain the confidence threshold.
- the wearable device stores the obtained registration posterior probability vector as a template vector.
- next, the second implementation of determining the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word will be introduced.
- the wearable device obtains the air conduction registration signal before determining the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word.
- the wearable device determines multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal.
- the multiple registration posterior probability vectors correspond one-to-one to the multiple registration audio frames included in the bone conduction registration signal and the air conduction registration signal, and the first registration posterior probability vector among the multiple registration posterior probability vectors indicates the probability that the phoneme of the first registration audio frame among the multiple registration audio frames belongs to multiple specified phonemes.
- the multiple registration audio frames include registration audio frames included in the bone conduction registration signal and registration audio frames included in the air conduction registration signal.
- that is, each of the plurality of registration posterior probability vectors corresponds to a registration audio frame included in the bone conduction registration signal or the air conduction registration signal, and a registration posterior probability vector indicates the probability that the phoneme of the corresponding registration audio frame belongs to the multiple specified phonemes.
- the wearable device determines a confidence threshold based on the plurality of registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- the wearable device determines the multiple enrollment posterior probability vectors as multiple template vectors.
- for the bone conduction registration signal and the air conduction registration signal, reference may be made to the content in the first implementation above, including the generation principle of the bone conduction registration signal and the air conduction registration signal and the preprocessing of these signals, which will not be described here one by one.
- the wearable device determines multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal. It should be noted that there are multiple ways for the wearable device to determine multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal, and three of them will be introduced next.
- Method 1 of determining multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal: the wearable device inputs the initial part of the bone conduction registration signal and the air conduction registration signal into the second acoustic model to obtain a third number of bone conduction registration posterior probability vectors and a fourth number of air conduction registration posterior probability vectors output by the second acoustic model. The wearable device fuses the first bone conduction registration posterior probability vector with the first air conduction registration posterior probability vector to obtain a second registration posterior probability vector.
- the initial part of the bone conduction registration signal is determined according to the detection time delay of voice detection
- the third number of bone conduction registration posterior probability vectors correspond one-to-one to the registration audio frames included in the initial part of the bone conduction registration signal, and the fourth number of air conduction registration posterior probability vectors correspond one-to-one to the registration audio frames included in the air conduction registration signal. That is, a bone conduction registration posterior probability vector corresponds to a registration audio frame included in the initial part of the bone conduction registration signal, and an air conduction registration posterior probability vector corresponds to a registration audio frame included in the air conduction registration signal.
- the first bone conduction registration posterior probability vector corresponds to the last registration audio frame of the initial part of the bone conduction registration signal, and the duration of that last registration audio frame is less than the frame duration; the first air conduction registration posterior probability vector corresponds to the first registration audio frame of the air conduction registration signal, and the duration of that first registration audio frame is less than the frame duration.
- the multiple registration posterior probability vectors finally determined by the wearable device include the second registration posterior probability vector, the vectors in the third number of bone conduction registration posterior probability vectors other than the first bone conduction registration posterior probability vector, and the vectors in the fourth number of air conduction registration posterior probability vectors other than the first air conduction registration posterior probability vector.
- the third quantity and the fourth quantity may be the same or different
- the third quantity may be the same or different from the aforementioned first quantity
- the fourth quantity may be the same or different from the aforementioned second quantity.
- the wearable device adds the first bone conduction registration posterior probability vector and the first air conduction registration posterior probability vector to obtain the second registration posterior probability vector.
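- In code, the fusion of the two boundary vectors and the assembly of the final vector sequence could look like the following sketch (the renormalisation step is an assumption; the text only states that the two vectors are added):

```python
import numpy as np

def fuse_registration_posteriors(bone_vecs: np.ndarray, air_vecs: np.ndarray) -> np.ndarray:
    """bone_vecs: (N3, P) bone conduction registration posterior vectors;
    air_vecs: (N4, P) air conduction registration posterior vectors."""
    merged = bone_vecs[-1] + air_vecs[0]   # add the two partial-frame vectors
    merged = merged / merged.sum()         # renormalise to a probability vector (assumption)
    return np.vstack([bone_vecs[:-1], merged, air_vecs[1:]])
```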
- the principle by which the wearable device obtains the third number of bone conduction registration posterior probability vectors and the fourth number of air conduction registration posterior probability vectors through the second acoustic model is the same as the principle, in the foregoing embodiment, of obtaining the first number of bone conduction posterior probability vectors and the second number of air conduction posterior probability vectors through the second acoustic model, and will not be described in detail here.
- Method 2 of determining multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal: the wearable device inputs the initial part of the bone conduction registration signal and the air conduction registration signal into the third acoustic model to obtain the multiple registration posterior probability vectors output by the third acoustic model.
- the initial part of the bone conduction registration signal is determined according to the detection time delay of the voice detection.
- Method 3 of determining multiple registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal: the wearable device inputs the bone conduction registration signal and the air conduction registration signal into the third acoustic model to obtain the multiple registration posterior probability vectors. That is, the wearable device directly inputs the bone conduction registration signal and the air conduction registration signal into the third acoustic model at the same time, and the third acoustic model outputs one set of registration posterior probability vectors, so that the obtained multiple registration posterior probability vectors include the complete information of the wake-up word input by the sound source.
- after the wearable device determines the multiple registration posterior probability vectors, it determines the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- the principle is similar to that of the wearable device, described above, determining the confidence level based on multiple posterior probability vectors and the phoneme sequence of the wake-up word.
- FIG. 19 to FIG. 24 are flow charts of six wake-up word registration methods provided by the embodiments of the present application. Next, the registration process of the wake-up word in the embodiment of the present application will be explained again with reference to FIG. 19 to FIG. 24 .
- FIG. 19 corresponds to mode 1 in the first implementation mode of the above-mentioned wake-up word registration, and the registration process of the wake-up word includes text registration and voice registration.
- the wearable device first performs text registration.
- the text registration module of the wearable device obtains the user-defined wake-up word text, performs text verification and text analysis on the input wake-up word text, and determines the phoneme sequence corresponding to the wake-up word text that meets the text registration requirements according to the pronunciation dictionary.
- the phoneme sequence is determined as a decoding path, and the text registration module sends the decoding path to the recognition engine.
- the recognition engine stores the decoding path.
- then, the wearable device performs voice registration.
- the voice registration module of the wearable device acquires voice registration signals, including bone conduction registration signals and air conduction registration signals.
- the wearable device acquires the bone conduction registration signal and the air conduction registration signal through the VAD, and may also perform preprocessing on the acquired bone conduction registration signal and the air conduction registration signal.
- the voice registration module performs pronunciation verification on the bone conduction registration signal and the air conduction registration signal
- the fusion module fuses the bone conduction registration signal and the air conduction registration signal that meet the voice registration requirements after verification to obtain a fusion registration signal.
- the fused registration signal in FIG. 19 is referred to as fused registration signal 1 here.
- the voice registration module processes the fused registration signal 1 through the first acoustic model to obtain multiple registration posterior probability vectors, determines a path score by decoding the multiple registration posterior probability vectors, and sends the path score to the recognition engine as the wake-up threshold (that is, the confidence threshold). The recognition engine stores the wake-up threshold, which is used for first-level false wake-up suppression in the user's subsequent voice wake-up.
- the voice registration module also sends the obtained multiple registration posterior probability vectors to the recognition engine as multiple template vectors, and the recognition engine stores the multiple template vectors, which are used for second-level false wake-up suppression in the user's subsequent voice wake-up.
- FIG. 20 to FIG. 22 respectively correspond to mode 2, mode 3 and mode 4 in the first implementation mode of the above-mentioned wake-up word registration.
- in Fig. 20, the voice registration module of the wearable device generates an enhanced initial registration signal based on the initial part of the bone conduction registration signal, and fuses the enhanced initial registration signal with the air conduction registration signal to obtain a fused registration signal.
- the fused registration signal in FIG. 20 is referred to as fused registration signal 2 here.
- in Fig. 21, the voice registration module directly fuses the bone conduction registration signal and the air conduction registration signal to obtain a fused registration signal.
- the fused registration signal in FIG. 21 is referred to as fused registration signal 3 here.
- in Fig. 22, the voice registration module may directly determine the bone conduction registration signal as the fused registration signal without acquiring the air conduction registration signal.
- the fused registration signal in FIG. 22 is referred to as fused registration signal 4 here.
- FIG. 23 corresponds to mode 1 in the second implementation mode of the above-mentioned wake-up word registration.
- in Fig. 23, the voice registration module of the wearable device inputs the initial part of the bone conduction registration signal and the air conduction registration signal into the second acoustic model respectively, and the two output layers of the second acoustic model respectively output the third number of bone conduction registration posterior probability vectors and the fourth number of air conduction registration posterior probability vectors.
- the speech registration module fuses the third number of bone conduction registration posterior probability vectors and the fourth number of air conduction registration posterior probability vectors to obtain multiple registration posterior probability vectors.
- FIG. 24 corresponds to mode 2 and mode 3 in the second implementation mode of the above-mentioned wake-up word registration.
- in Fig. 24, the voice registration module of the wearable device inputs the initial part of the bone conduction registration signal and the air conduction registration signal into the third acoustic model respectively, or inputs the bone conduction registration signal and the air conduction registration signal into the third acoustic model, to obtain the multiple registration posterior probability vectors output by the third acoustic model.
- the processing flow of the bone conduction registration signal and the air conduction registration signal in the wake-up word registration process is similar to the processing flow of the bone conduction signal and the air conduction signal in the voice wake-up process; the difference is that the wake-up word registration process obtains the wake-up threshold and the template vectors, while the voice wake-up process detects the wake-up word.
- the template vector can improve the accuracy and robustness of this scheme.
- this solution uses the bone conduction signal to directly or implicitly perform head loss compensation on the air conduction signal, or directly detects the wake-up word based on the bone conduction signal. Since the bone conduction signal contains the command word information input by the sound source, the bone conduction signal has not lost its head; therefore, the recognition accuracy of the wake-up word is higher, and the accuracy of voice wake-up is higher.
- the voice wake-up process and the wake-up word registration process are introduced.
- the acoustic model in the embodiment of the present application needs to be pre-trained.
- the first acoustic model, the second acoustic model and the third acoustic model all need to be pre-trained.
- taking a computer device training the acoustic model as an example, the following introduces the training process of the acoustic model.
- the computer device first obtains the second training data set, the second training data set includes a plurality of second sample signal pairs, one second sample signal pair includes a bone conduction sample signal and an air conduction sample signal, A second sample signal pair corresponds to a command word.
- the second training data set includes directly collected voice data, public voice data and/or voice data purchased from a third party.
- the computer device can preprocess the acquired second training data set to obtain a preprocessed second training data set, which can simulate real speech data so as to be closer to the speech of real scenes and increase the diversity of training samples.
- for example, the second training data set is backed up, that is, an additional copy of the data is made, and the backed-up data is preprocessed. The backed-up data is divided into multiple parts, and one kind of preprocessing is performed on each part; the preprocessing applied to each part can be different. This can double the total training data while ensuring the comprehensiveness of the data, achieving a balance between performance and training overhead, so that the accuracy and robustness of speech recognition can be improved to a certain extent.
- the method for preprocessing each piece of data may include one or more of adding noise, increasing volume, adding reverberation, time shifting, changing pitch, and time stretching.
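- The sketch below applies one such perturbation per backed-up sample; reverberation and pitch changes are omitted, and the time stretch is a crude linear resampling, so this is only the shape of the idea rather than production-grade signal processing:

```python
import numpy as np

def augment(x: np.ndarray, rng: np.random.Generator, sr: int = 16000) -> np.ndarray:
    choice = rng.integers(4)
    if choice == 0:
        return x + 0.01 * rng.standard_normal(x.shape)      # add noise
    if choice == 1:
        return np.clip(1.5 * x, -1.0, 1.0)                  # increase volume
    if choice == 2:
        return np.roll(x, sr // 10)                         # time shift by 100 ms
    idx = np.linspace(0, len(x) - 1, int(len(x) * 1.1))     # time stretch (linear resample)
    return np.interp(idx, np.arange(len(x)), x)
```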
- the computer device determines a plurality of fusion sample signals based on the second training data set, and there are four ways in total. It should be noted that these four ways correspond one-to-one to the four ways in which the wearable device determines the fusion signal based on the bone conduction signal in the recognition process (that is, the voice wake-up process) in the above-mentioned embodiment.
- if, during the recognition process, the wearable device fuses the initial part of the bone conduction signal with the air conduction signal to obtain the fusion signal, then during the training process the computer device fuses, for each second sample signal pair among the multiple second sample signal pairs, the initial part of the included bone conduction sample signal with the included air conduction sample signal, to obtain multiple fused sample signals.
- if, during the recognition process, the wearable device generates an enhanced initial signal based on the initial part of the bone conduction signal and fuses the enhanced initial signal with the air conduction signal to obtain the fusion signal, then during the training process the computer device generates an enhanced initial sample signal from the initial part of the bone conduction sample signal included in each second sample signal pair, and fuses each enhanced initial sample signal with the corresponding air conduction sample signal, to obtain multiple fused sample signals.
- if, during the recognition process, the wearable device directly fuses the bone conduction signal and the air conduction signal to obtain the fusion signal, then during the training process the computer device directly fuses the bone conduction sample signal and the air conduction sample signal included in each second sample signal pair among the multiple second sample signal pairs, to obtain multiple fused sample signals.
- if, during the recognition process, the wearable device determines the bone conduction signal as the fusion signal, then during the training process the computer device determines the bone conduction sample signals included in the multiple second sample signal pairs as the multiple fused sample signals.
- the initial part of the bone conduction sample signal is determined according to the detection time delay of the voice detection or set according to experience.
- the computer device trains the first initial acoustic model by using the plurality of fused sample signals, so as to obtain the first acoustic model in the embodiment of the present application.
- the network structure of the first initial acoustic model is the same as the network structure of the first acoustic model.
- optionally, before the computer device determines the multiple fused sample signals based on the second training data set, it preprocesses the bone conduction sample signals and air conduction sample signals included in the second training data set; for example, front-end enhancement is performed on the air conduction sample signals, and down-sampling and gain adjustment are performed on the bone conduction sample signals.
- the computer device inputs the initial part of the bone conduction sample signal included in each of the plurality of second sample signal pairs into the generating network model, and obtains the enhanced initial sample signal output by the generating network model. The generating network model may be the same as the generating network model in the foregoing embodiments, or may be a different model, which is not limited in this embodiment of the present application.
- FIG. 25 to FIG. 28 are four schematic diagrams of the first acoustic model respectively trained based on the above four methods provided by the embodiment of the present application.
- the second training data set acquired by the computer device includes bone conduction data (bone conduction sample signals) and air conduction data (air conduction sample signals); the computer device performs down-sampling and/or gain adjustment on the bone conduction data through the fusion module, and performs front-end enhancement on the air conduction data through the front-end enhancement module.
- Fig. 25 to Fig. 27 correspond to the first three of the four methods, and the fusion module uses the corresponding method to perform head loss compensation on the air conduction signal through the bone conduction data, so as to obtain the training input data.
- Fig. 28 corresponds to the fourth method: air conduction data is not needed, and the fusion module directly uses the bone conduction data as the training input data. Then, the computer device trains the network model (namely, the first initial acoustic model) with the training input data, and adjusts the network model through the loss function, a gradient descent algorithm and error backpropagation, so as to obtain the trained first acoustic model.
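- A generic loop of this kind is sketched below in PyTorch; the frame-level cross-entropy loss, the SGD optimiser and the data layout are assumptions, since the text only names a loss function, a gradient descent algorithm and error backpropagation:

```python
import torch
from torch import nn

def train_acoustic_model(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
    """loader yields (frames, phoneme_labels) built from the fused sample signals."""
    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()          # assumed frame-level phoneme loss
    for _ in range(epochs):
        for frames, labels in loader:
            optimiser.zero_grad()
            loss = criterion(model(frames), labels)
            loss.backward()                    # error backpropagation
            optimiser.step()                   # gradient descent update
```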
- during the training process, the computer device uses the initial part of the bone conduction sample signal and the air conduction sample signal included in each of the second sample signal pairs as the input of the second initial acoustic model, to train the second initial acoustic model and obtain the second acoustic model.
- the network structure of the second initial acoustic model is the same as the network structure of the second acoustic model. That is, the second initial acoustic model also includes two input layers, one shared network layer and two output layers.
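- The structure just described (two input layers, one shared network layer and two output layers) could be realised as in the following PyTorch sketch; the feature and layer dimensions are illustrative:

```python
import torch
from torch import nn

class SecondAcousticModel(nn.Module):
    def __init__(self, feat_dim: int = 40, hidden: int = 128, n_phonemes: int = 100):
        super().__init__()
        self.in_bone = nn.Linear(feat_dim, hidden)     # bone conduction input layer
        self.in_air = nn.Linear(feat_dim, hidden)      # air conduction input layer
        self.shared = nn.GRU(hidden, hidden, batch_first=True)  # shared network layer
        self.out_bone = nn.Linear(hidden, n_phonemes)  # bone conduction output layer
        self.out_air = nn.Linear(hidden, n_phonemes)   # air conduction output layer

    def forward(self, bone_feats: torch.Tensor, air_feats: torch.Tensor):
        hb, _ = self.shared(torch.relu(self.in_bone(bone_feats)))
        ha, _ = self.shared(torch.relu(self.in_air(air_feats)))
        # Posterior probability vectors for each bone/air audio frame.
        return self.out_bone(hb).softmax(-1), self.out_air(ha).softmax(-1)
```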
- Fig. 29 is a schematic diagram of a second acoustic model obtained through training provided by an embodiment of the present application.
- the second training data set acquired by the computer device includes bone conduction data (bone conduction sample signals) and air conduction data (air conduction sample signals); the computer device performs down-sampling and/or gain adjustment on the bone conduction data, and performs front-end enhancement on the air conduction data.
- the computer equipment uses bone conduction data as training input data 1 and air conduction data as training input data 2 .
- the computer equipment trains the network model (that is, the second initial acoustic model) through the training input data 1 and the training input data 2, and adjusts the network model through the loss function, gradient descent algorithm and error backpropagation, thereby obtaining the trained second acoustic model.
- the training input data 1 and the training input data 2 may correspond to the same loss function or different loss functions, which is not limited in this embodiment of the present application.
- taking the training of the third acoustic model as an example: corresponding to Method 2, in which the wearable device determines multiple posterior probability vectors based on the bone conduction signal and the air conduction signal during the voice wake-up process, the computer device uses, during training, the initial part of the bone conduction sample signal and the air conduction sample signal included in each of the second sample signal pairs as the input of the third initial acoustic model, to train the third initial acoustic model and obtain the third acoustic model. Corresponding to Method 3, the computer device uses the bone conduction sample signal and the air conduction sample signal included in each second sample signal pair among the multiple second sample signal pairs as the input of the third initial acoustic model, to train the third initial acoustic model and obtain the third acoustic model.
- the network structure of the third initial acoustic model is the same as the network structure of the third acoustic model. That is, the third initial acoustic model also includes two input layers, a concatenation layer, a network parameter layer and an output layer.
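- Analogously, the third initial acoustic model (two input layers, a concatenation layer, a network parameter layer and one output layer) might look as follows; concatenating the two encoded segments along the time axis is an assumption consistent with the single set of output posterior vectors:

```python
import torch
from torch import nn

class ThirdAcousticModel(nn.Module):
    def __init__(self, feat_dim: int = 40, hidden: int = 128, n_phonemes: int = 100):
        super().__init__()
        self.in_bone = nn.Linear(feat_dim, hidden)             # bone conduction input layer
        self.in_air = nn.Linear(feat_dim, hidden)              # air conduction input layer
        self.trunk = nn.GRU(hidden, hidden, batch_first=True)  # network parameter layer
        self.out = nn.Linear(hidden, n_phonemes)               # single output layer

    def forward(self, bone_feats: torch.Tensor, air_feats: torch.Tensor) -> torch.Tensor:
        # Concatenation layer: join the two encoded segments along the time axis.
        h = torch.cat([torch.relu(self.in_bone(bone_feats)),
                       torch.relu(self.in_air(air_feats))], dim=1)
        h, _ = self.trunk(h)
        return self.out(h).softmax(-1)   # one set of posterior probability vectors
```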
- FIG. 30 is a schematic diagram of a third acoustic model obtained through training according to an embodiment of the present application.
- the second training data set acquired by the computer device includes bone conduction data (bone conduction sample signals) and air conduction data (air conduction sample signals); the computer device performs down-sampling and/or gain adjustment on the bone conduction data, and performs front-end enhancement on the air conduction data.
- the computer equipment uses the bone conduction data or the initial part of the bone conduction data as training input data 1 , and uses the air conduction data as training input data 2 .
- the computer equipment trains the network model (that is, the third initial acoustic model) through the training input data 1 and the training input data 2, and adjusts the network model through the loss function, gradient descent algorithm and error backpropagation, thereby obtaining the trained third acoustic model.
- that is, during training, head loss compensation is also performed directly or implicitly on the air conduction sample signal through the bone conduction sample signal, so as to construct the training input data to train the initial acoustic model and obtain a trained acoustic model.
- during voice wake-up, the bone conduction signal is used in the same way to directly or implicitly compensate the air conduction signal for head loss. Since the bone conduction signal contains the command word information input by the sound source, that is, the bone conduction signal has not lost its head, the recognition accuracy of wake-up word detection based on the bone conduction signal is higher, the accuracy of voice wake-up is higher, and the robustness is also improved.
- Fig. 31 is a schematic structural diagram of a voice wake-up device 3100 provided by an embodiment of the present application.
- the voice wake-up device 3100 can be implemented by software, hardware or a combination of the two to become part or all of an electronic device.
- the electronic device can be the wearable device shown in Fig. 2.
- the device 3100 includes: a voice detection module 3101 , a wake-up word detection module 3102 and a voice wake-up module 3103 .
- the voice detection module 3101 is used to perform voice detection according to the bone conduction signal collected by the bone conduction microphone, and the bone conduction signal includes command word information input by the sound source;
- a wake-up word detection module 3102 configured to detect a wake-up word based on a bone conduction signal when a voice input is detected
- the voice wake-up module 3103 is configured to wake up the device to be woken up by voice when it is detected that the command word includes a wake-up word.
- the wake-up word detection module 3102 includes:
- the first determining submodule is used to determine the fusion signal based on the bone conduction signal
- the wake-up word detection submodule is used to detect the wake-up word on the fusion signal.
- the device 3100 also includes:
- the processing module is used to turn on the air microphone, and collect the air conduction signal through the air microphone;
- the first determining submodule is used for: fusing the initial part of the bone conduction signal with the air conduction signal to obtain the fusion signal, where the initial part of the bone conduction signal is determined according to the detection time delay of the voice detection; or, generating an enhanced initial signal based on the initial part of the bone conduction signal and fusing the enhanced initial signal with the air conduction signal to obtain the fusion signal, where the initial part of the bone conduction signal is determined according to the detection time delay of the voice detection; or, directly fusing the bone conduction signal with the air conduction signal to obtain the fusion signal.
- the wake-up word detection submodule is used for: inputting the multiple audio frames included in the fusion signal into the first acoustic model to obtain multiple posterior probability vectors output by the first acoustic model, where the multiple posterior probability vectors correspond one-to-one to the multiple audio frames, and the first posterior probability vector among the multiple posterior probability vectors is used to indicate the probability that the phoneme of the first audio frame among the multiple audio frames belongs to multiple specified phonemes; and performing the wake-up word detection based on the multiple posterior probability vectors.
- the device 3100 also includes:
- the processing module is used to turn on the air microphone, and collect the air conduction signal through the air microphone;
- Wake-up word detection module 3102 includes:
- the second determining submodule is used to determine a plurality of posterior probability vectors based on the bone conduction signal and the air conduction signal, where the plurality of posterior probability vectors correspond one-to-one to the multiple audio frames included in the bone conduction signal and the air conduction signal, and the first posterior probability vector among the plurality of posterior probability vectors is used to indicate the probability that the phoneme of the first audio frame among the multiple audio frames belongs to a plurality of specified phonemes;
- the wake-up word detection submodule is configured to detect wake-up words based on the plurality of posterior probability vectors.
- the second determining submodule is used for: inputting the initial part of the bone conduction signal and the air conduction signal into the second acoustic model to obtain the first number of bone conduction posterior probability vectors and the second number of air conduction posterior probability vectors output by the second acoustic model, where the initial part of the bone conduction signal is determined according to the detection time delay of the voice detection, the first number of bone conduction posterior probability vectors correspond one-to-one to the audio frames included in the initial part of the bone conduction signal, and the second number of air conduction posterior probability vectors correspond one-to-one to the audio frames included in the air conduction signal; and fusing the first bone conduction posterior probability vector with the first air conduction posterior probability vector to obtain a second posterior probability vector, where the first bone conduction posterior probability vector corresponds to the last audio frame of the initial part of the bone conduction signal, the duration of that last audio frame is less than the frame duration, the first air conduction posterior probability vector corresponds to the first audio frame of the air conduction signal, and the duration of that first audio frame is less than the frame duration; the plurality of posterior probability vectors include the second posterior probability vector, the vectors in the first number of bone conduction posterior probability vectors other than the first bone conduction posterior probability vector, and the vectors in the second number of air conduction posterior probability vectors other than the first air conduction posterior probability vector.
- optionally, the second determining submodule is used for: inputting the initial part of the bone conduction signal and the air conduction signal into the third acoustic model to obtain the plurality of posterior probability vectors output by the third acoustic model, where the initial part of the bone conduction signal is determined according to the detection time delay of the voice detection; or, inputting the bone conduction signal and the air conduction signal into the third acoustic model to obtain the plurality of posterior probability vectors output by the third acoustic model.
- the wake-up word detection submodule is used for: determining, based on the plurality of posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and, when the confidence exceeds a confidence threshold, determining that it is detected that the command word includes the wake-up word.
- optionally, the wake-up word detection submodule is used for: determining, based on the plurality of posterior probability vectors and the phoneme sequence corresponding to the wake-up word, the confidence that the phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and, when the confidence exceeds the confidence threshold and the distance condition between the plurality of posterior probability vectors and multiple template vectors is satisfied, determining that it is detected that the command word includes the wake-up word, where the multiple template vectors indicate the probability that the phonemes of a voice signal containing the complete information of the wake-up word belong to the multiple specified phonemes.
- the distance condition includes: the average value of the distances between the multiple posterior probability vectors and the corresponding template vectors is less than the distance threshold.
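- A sketch of this check is given below; the Euclidean metric is an assumption, as the embodiment does not name the distance measure:

```python
import numpy as np

def distance_condition_met(posteriors: np.ndarray, templates: np.ndarray,
                           distance_threshold: float) -> bool:
    """Both arrays are (T, P) and correspond frame by frame; the wake-up is
    confirmed only if the mean distance stays below the threshold."""
    distances = np.linalg.norm(posteriors - templates, axis=1)
    return float(distances.mean()) < distance_threshold
```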
- the device 3100 also includes:
- the obtaining module is used to obtain the bone conduction registration signal, and the bone conduction registration signal includes complete information of the wake-up word;
- the determination module is configured to determine a confidence threshold and a plurality of template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word.
- the determination module includes:
- the third determining submodule is used to determine the fusion registration signal based on the bone conduction registration signal
- the fourth determining submodule is configured to determine a confidence threshold and a plurality of template vectors based on the fused registration signal and the phoneme sequence corresponding to the wake-up word.
- the fourth determining submodule is used for: inputting the multiple registration audio frames included in the fused registration signal into the first acoustic model to obtain multiple registration posterior probability vectors output by the first acoustic model, where the multiple registration posterior probability vectors correspond one-to-one to the multiple registration audio frames, and the first registration posterior probability vector among the multiple registration posterior probability vectors indicates the probability that the phoneme of the first registration audio frame among the multiple registration audio frames belongs to the multiple specified phonemes; determining the multiple registration posterior probability vectors as the multiple template vectors; and determining the confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- the device 3100 also includes:
- An acquisition module configured to acquire an air conduction registration signal
- the determination module includes:
- the fifth determination sub-module is used to determine a plurality of registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal, the plurality of registration posterior probability vectors and the plurality of registration audio signals included in the bone conduction registration signal and the air conduction registration signal Frame one-to-one correspondence, the first registration posterior probability vector in the plurality of registration posterior probability vectors indicates the probability that the phoneme of the first registration audio frame in the plurality of registration audio frames belongs to a plurality of specified phonemes;
- the sixth determining submodule is configured to determine a confidence threshold based on the multiple registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- in the embodiment of the present application, a bone conduction microphone is used to collect bone conduction signals for voice detection, which can ensure low power consumption. While low power consumption is ensured, it is considered that the collected air conduction signal may lose its head due to the delay of voice detection and thus not contain the complete information of the command word input by the sound source, whereas the bone conduction signal collected by the bone conduction microphone contains the command word information input by the sound source, that is, the bone conduction signal has not lost its head; therefore, this solution detects the wake-up word based on the bone conduction signal. In this way, the recognition accuracy of the wake-up word is higher, and the accuracy of voice wake-up is higher.
- it should be noted that, when the voice wake-up device provided by the above embodiment performs voice wake-up, the division into the above functional modules is used only as an example for illustration; in practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
- the voice wake-up device and the voice wake-up method embodiments provided in the above embodiments belong to the same idea, and the specific implementation process thereof is detailed in the method embodiments, and will not be repeated here.
- all or part may be implemented by software, hardware, firmware or any combination thereof.
- when implemented using software, it may be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
- the computer can be a general purpose computer, a special purpose computer, a computer network or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) means.
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or may be a data storage device such as a server or a data center integrated with one or more available media.
- the available medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a digital versatile disc (DVD)) or a semiconductor medium (for example, a solid state disk (SSD)), etc.
- it should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in the embodiments of the present application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data need to comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the voice signals involved in the embodiments of the present application are all obtained under the condition of sufficient authorization.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Circuit For Audible Band Transducer (AREA)
- Manipulator (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
Description
Claims (31)
- A speech wake-up method, wherein the method comprises: performing voice detection based on a bone conduction signal collected by a bone conduction microphone, wherein the bone conduction signal contains command word information input by a sound source; performing wake-up word detection based on the bone conduction signal when voice input is detected; and performing speech wake-up on a to-be-woken device when it is detected that the command word includes the wake-up word.
- The method according to claim 1, wherein the performing wake-up word detection based on the bone conduction signal comprises: determining a fusion signal based on the bone conduction signal; and performing the wake-up word detection on the fusion signal.
- The method according to claim 2, wherein before the determining a fusion signal based on the bone conduction signal, the method further comprises: turning on an air microphone, and collecting an air conduction signal through the air microphone; and the determining a fusion signal based on the bone conduction signal comprises: fusing a start part of the bone conduction signal and the air conduction signal to obtain the fusion signal, wherein the start part of the bone conduction signal is determined based on a detection delay of the voice detection; or generating an enhanced start signal based on the start part of the bone conduction signal, and fusing the enhanced start signal and the air conduction signal to obtain the fusion signal, wherein the start part of the bone conduction signal is determined based on the detection delay of the voice detection; or directly fusing the bone conduction signal and the air conduction signal to obtain the fusion signal.
- The method according to claim 2 or 3, wherein the performing the wake-up word detection on the fusion signal comprises: inputting a plurality of audio frames included in the fusion signal into a first acoustic model to obtain a plurality of posterior probability vectors output by the first acoustic model, wherein the plurality of posterior probability vectors are in one-to-one correspondence with the plurality of audio frames, and a first posterior probability vector in the plurality of posterior probability vectors is used to indicate probabilities that a phoneme of a first audio frame in the plurality of audio frames belongs to a plurality of specified phonemes; and performing the wake-up word detection based on the plurality of posterior probability vectors.
- The method according to claim 1, wherein before the performing wake-up word detection based on the bone conduction signal, the method further comprises: turning on an air microphone, and collecting an air conduction signal through the air microphone; and the performing wake-up word detection based on the bone conduction signal comprises: determining a plurality of posterior probability vectors based on the bone conduction signal and the air conduction signal, wherein the plurality of posterior probability vectors are in one-to-one correspondence with a plurality of audio frames included in the bone conduction signal and the air conduction signal, and a first posterior probability vector in the plurality of posterior probability vectors is used to indicate probabilities that a phoneme of a first audio frame in the plurality of audio frames belongs to a plurality of specified phonemes; and performing the wake-up word detection based on the plurality of posterior probability vectors.
- The method according to claim 5, wherein the determining a plurality of posterior probability vectors based on the bone conduction signal and the air conduction signal comprises: inputting a start part of the bone conduction signal and the air conduction signal into a second acoustic model to obtain a first quantity of bone conduction posterior probability vectors and a second quantity of air conduction posterior probability vectors output by the second acoustic model, wherein the start part of the bone conduction signal is determined based on a detection delay of the voice detection, the first quantity of bone conduction posterior probability vectors are in one-to-one correspondence with audio frames included in the start part of the bone conduction signal, and the second quantity of air conduction posterior probability vectors are in one-to-one correspondence with audio frames included in the air conduction signal; and fusing a first bone conduction posterior probability vector and a first air conduction posterior probability vector to obtain a second posterior probability vector, wherein the first bone conduction posterior probability vector corresponds to the last audio frame of the start part of the bone conduction signal, a duration of the last audio frame is less than a frame duration, the first air conduction posterior probability vector corresponds to the first audio frame of the air conduction signal, a duration of the first audio frame is less than the frame duration, and the plurality of posterior probability vectors include the second posterior probability vector, vectors in the first quantity of bone conduction posterior probability vectors other than the first bone conduction posterior probability vector, and vectors in the second quantity of air conduction posterior probability vectors other than the first air conduction posterior probability vector.
- The method according to claim 5, wherein the determining a plurality of posterior probability vectors based on the bone conduction signal and the air conduction signal comprises: inputting a start part of the bone conduction signal and the air conduction signal into a third acoustic model to obtain the plurality of posterior probability vectors output by the third acoustic model, wherein the start part of the bone conduction signal is determined based on a detection delay of the voice detection; or inputting the bone conduction signal and the air conduction signal into the third acoustic model to obtain the plurality of posterior probability vectors output by the third acoustic model.
- The method according to any one of claims 4 to 7, wherein the performing the wake-up word detection based on the plurality of posterior probability vectors comprises: determining, based on the plurality of posterior probability vectors and a phoneme sequence corresponding to the wake-up word, a confidence that a phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and determining, when the confidence exceeds a confidence threshold, that it is detected that the command word includes the wake-up word.
- The method according to any one of claims 4 to 7, wherein the performing the wake-up word detection based on the plurality of posterior probability vectors comprises: determining, based on the plurality of posterior probability vectors and a phoneme sequence corresponding to the wake-up word, a confidence that a phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and determining, when the confidence exceeds a confidence threshold and a distance condition is satisfied between the plurality of posterior probability vectors and a plurality of template vectors, that it is detected that the command word includes the wake-up word, wherein the plurality of template vectors indicate probabilities that phonemes of a voice signal containing complete information of the wake-up word belong to the plurality of specified phonemes.
- The method according to claim 9, wherein when the plurality of posterior probability vectors are in one-to-one correspondence with the plurality of template vectors, the distance condition comprises: a mean of distances between the plurality of posterior probability vectors and the corresponding template vectors is less than a distance threshold.
- The method according to claim 9 or 10, wherein the method further comprises: obtaining a bone conduction registration signal, wherein the bone conduction registration signal contains complete information of the wake-up word; and determining the confidence threshold and the plurality of template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word.
- The method according to claim 11, wherein the determining the confidence threshold and the plurality of template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word comprises: determining a fusion registration signal based on the bone conduction registration signal; and determining the confidence threshold and the plurality of template vectors based on the fusion registration signal and the phoneme sequence corresponding to the wake-up word.
- The method according to claim 12, wherein the determining the confidence threshold and the plurality of template vectors based on the fusion registration signal and the phoneme sequence corresponding to the wake-up word comprises: inputting a plurality of registration audio frames included in the fusion registration signal into a first acoustic model to obtain a plurality of registration posterior probability vectors output by the first acoustic model, wherein the plurality of registration posterior probability vectors are in one-to-one correspondence with the plurality of registration audio frames, and a first registration posterior probability vector in the plurality of registration posterior probability vectors indicates probabilities that a phoneme of a first registration audio frame in the plurality of registration audio frames belongs to the plurality of specified phonemes; determining the plurality of registration posterior probability vectors as the plurality of template vectors; and determining the confidence threshold based on the plurality of registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- The method according to claim 11, wherein before the determining the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word, the method further comprises: obtaining an air conduction registration signal; and the determining the confidence threshold based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word comprises: determining a plurality of registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal, wherein the plurality of registration posterior probability vectors are in one-to-one correspondence with a plurality of registration audio frames included in the bone conduction registration signal and the air conduction registration signal, and a first registration posterior probability vector in the plurality of registration posterior probability vectors indicates probabilities that a phoneme of a first registration audio frame in the plurality of registration audio frames belongs to the plurality of specified phonemes; and determining the confidence threshold based on the plurality of registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- A speech wake-up apparatus, wherein the apparatus comprises: a voice detection module, configured to perform voice detection based on a bone conduction signal collected by a bone conduction microphone, wherein the bone conduction signal contains command word information input by a sound source; a wake-up word detection module, configured to perform wake-up word detection based on the bone conduction signal when voice input is detected; and a speech wake-up module, configured to perform speech wake-up on a to-be-woken device when it is detected that the command word includes the wake-up word.
- The apparatus according to claim 15, wherein the wake-up word detection module comprises: a first determining submodule, configured to determine a fusion signal based on the bone conduction signal; and a wake-up word detection submodule, configured to perform the wake-up word detection on the fusion signal.
- The apparatus according to claim 16, wherein the apparatus further comprises: a processing module, configured to turn on an air microphone and collect an air conduction signal through the air microphone; and the first determining submodule is configured to: fuse a start part of the bone conduction signal and the air conduction signal to obtain the fusion signal, wherein the start part of the bone conduction signal is determined based on a detection delay of the voice detection; or generate an enhanced start signal based on the start part of the bone conduction signal, and fuse the enhanced start signal and the air conduction signal to obtain the fusion signal, wherein the start part of the bone conduction signal is determined based on the detection delay of the voice detection; or directly fuse the bone conduction signal and the air conduction signal to obtain the fusion signal.
- The apparatus according to claim 16 or 17, wherein the wake-up word detection submodule is configured to: input a plurality of audio frames included in the fusion signal into a first acoustic model to obtain a plurality of posterior probability vectors output by the first acoustic model, wherein the plurality of posterior probability vectors are in one-to-one correspondence with the plurality of audio frames, and a first posterior probability vector in the plurality of posterior probability vectors is used to indicate probabilities that a phoneme of a first audio frame in the plurality of audio frames belongs to a plurality of specified phonemes; and perform the wake-up word detection based on the plurality of posterior probability vectors.
- The apparatus according to claim 15, wherein the apparatus further comprises: a processing module, configured to turn on an air microphone and collect an air conduction signal through the air microphone; and the wake-up word detection module comprises: a second determining submodule, configured to determine a plurality of posterior probability vectors based on the bone conduction signal and the air conduction signal, wherein the plurality of posterior probability vectors are in one-to-one correspondence with a plurality of audio frames included in the bone conduction signal and the air conduction signal, and a first posterior probability vector in the plurality of posterior probability vectors is used to indicate probabilities that a phoneme of a first audio frame in the plurality of audio frames belongs to a plurality of specified phonemes; and a wake-up word detection submodule, configured to perform the wake-up word detection based on the plurality of posterior probability vectors.
- The apparatus according to claim 19, wherein the second determining submodule is configured to: input a start part of the bone conduction signal and the air conduction signal into a second acoustic model to obtain a first quantity of bone conduction posterior probability vectors and a second quantity of air conduction posterior probability vectors output by the second acoustic model, wherein the start part of the bone conduction signal is determined based on a detection delay of the voice detection, the first quantity of bone conduction posterior probability vectors are in one-to-one correspondence with audio frames included in the start part of the bone conduction signal, and the second quantity of air conduction posterior probability vectors are in one-to-one correspondence with audio frames included in the air conduction signal; and fuse a first bone conduction posterior probability vector and a first air conduction posterior probability vector to obtain a second posterior probability vector, wherein the first bone conduction posterior probability vector corresponds to the last audio frame of the start part of the bone conduction signal, a duration of the last audio frame is less than a frame duration, the first air conduction posterior probability vector corresponds to the first audio frame of the air conduction signal, a duration of the first audio frame is less than the frame duration, and the plurality of posterior probability vectors include the second posterior probability vector, vectors in the first quantity of bone conduction posterior probability vectors other than the first bone conduction posterior probability vector, and vectors in the second quantity of air conduction posterior probability vectors other than the first air conduction posterior probability vector.
- The apparatus according to claim 19, wherein the second determining submodule is configured to: input a start part of the bone conduction signal and the air conduction signal into a third acoustic model to obtain the plurality of posterior probability vectors output by the third acoustic model, wherein the start part of the bone conduction signal is determined based on a detection delay of the voice detection; or input the bone conduction signal and the air conduction signal into the third acoustic model to obtain the plurality of posterior probability vectors output by the third acoustic model.
- The apparatus according to any one of claims 18 to 21, wherein the wake-up word detection submodule is configured to: determine, based on the plurality of posterior probability vectors and a phoneme sequence corresponding to the wake-up word, a confidence that a phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and determine, when the confidence exceeds a confidence threshold, that it is detected that the command word includes the wake-up word.
- The apparatus according to any one of claims 18 to 21, wherein the wake-up word detection submodule is configured to: determine, based on the plurality of posterior probability vectors and a phoneme sequence corresponding to the wake-up word, a confidence that a phoneme sequence corresponding to the command word includes the phoneme sequence corresponding to the wake-up word; and determine, when the confidence exceeds a confidence threshold and a distance condition is satisfied between the plurality of posterior probability vectors and a plurality of template vectors, that it is detected that the command word includes the wake-up word, wherein the plurality of template vectors indicate probabilities that phonemes of a voice signal containing complete information of the wake-up word belong to the plurality of specified phonemes.
- The apparatus according to claim 23, wherein when the plurality of posterior probability vectors are in one-to-one correspondence with the plurality of template vectors, the distance condition comprises: a mean of distances between the plurality of posterior probability vectors and the corresponding template vectors is less than a distance threshold.
- The apparatus according to claim 23 or 24, wherein the apparatus further comprises: an obtaining module, configured to obtain a bone conduction registration signal, wherein the bone conduction registration signal contains complete information of the wake-up word; and a determining module, configured to determine the confidence threshold and the plurality of template vectors based on the bone conduction registration signal and the phoneme sequence corresponding to the wake-up word.
- The apparatus according to claim 25, wherein the determining module comprises: a third determining submodule, configured to determine a fusion registration signal based on the bone conduction registration signal; and a fourth determining submodule, configured to determine the confidence threshold and the plurality of template vectors based on the fusion registration signal and the phoneme sequence corresponding to the wake-up word.
- The apparatus according to claim 26, wherein the fourth determining submodule is configured to: input a plurality of registration audio frames included in the fusion registration signal into a first acoustic model to obtain a plurality of registration posterior probability vectors output by the first acoustic model, wherein the plurality of registration posterior probability vectors are in one-to-one correspondence with the plurality of registration audio frames, and a first registration posterior probability vector in the plurality of registration posterior probability vectors indicates probabilities that a phoneme of a first registration audio frame in the plurality of registration audio frames belongs to the plurality of specified phonemes; determine the plurality of registration posterior probability vectors as the plurality of template vectors; and determine the confidence threshold based on the plurality of registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- The apparatus according to claim 25, wherein the apparatus further comprises: an obtaining module, configured to obtain an air conduction registration signal; and the determining module comprises: a fifth determining submodule, configured to determine a plurality of registration posterior probability vectors based on the bone conduction registration signal and the air conduction registration signal, wherein the plurality of registration posterior probability vectors are in one-to-one correspondence with a plurality of registration audio frames included in the bone conduction registration signal and the air conduction registration signal, and a first registration posterior probability vector in the plurality of registration posterior probability vectors indicates probabilities that a phoneme of a first registration audio frame in the plurality of registration audio frames belongs to the plurality of specified phonemes; and a sixth determining submodule, configured to determine the confidence threshold based on the plurality of registration posterior probability vectors and the phoneme sequence corresponding to the wake-up word.
- An electronic device, wherein the electronic device comprises a memory and a processor; the memory is configured to store a computer program; and the processor is configured to execute the computer program to implement the steps of the method according to any one of claims 1 to 14.
- A computer-readable storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 14 are implemented.
- A computer program product, comprising computer instructions, wherein when the computer instructions are executed by a processor, the steps of the method according to any one of claims 1 to 14 are implemented.
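As an informal illustration of the detection logic in claims 8 to 10 (a confidence that the command word's phoneme sequence includes the wake-up word's phoneme sequence, optionally gated by a distance condition against template vectors), the following is a minimal Python sketch. It is not taken from the patent: the greedy monotonic alignment, the geometric-mean scoring, the Euclidean distance, and all names (`posteriors`, `wake_phoneme_ids`, `templates`, the thresholds) are illustrative assumptions, and the acoustic model that produces the per-frame posterior probability vectors is presumed to exist upstream.

```python
import numpy as np

def wake_word_confidence(posteriors, wake_phoneme_ids):
    """Confidence that the frames contain the wake-up word's phonemes in order.

    posteriors: (num_frames, num_phonemes) array; each row is one posterior
    probability vector output by the acoustic model for one audio frame.
    The greedy left-to-right alignment below is an assumed scoring rule;
    the claims only require *a* confidence over the phoneme sequence.
    """
    frame = 0
    scores = []
    for ph in wake_phoneme_ids:
        if frame >= len(posteriors):
            return 0.0  # ran out of frames before matching every phoneme
        tail = posteriors[frame:, ph]
        best = int(np.argmax(tail))  # best remaining frame for this phoneme
        scores.append(tail[best])
        frame += best + 1            # keep the alignment monotonic
    # Geometric mean keeps confidences comparable across wake-word lengths.
    return float(np.exp(np.mean(np.log(np.clip(scores, 1e-9, 1.0)))))

def distance_condition(posteriors, templates, distance_threshold):
    """Claim 10's condition: with posterior vectors and template vectors in
    one-to-one correspondence, the mean distance between corresponding
    vectors must be below a threshold (Euclidean distance assumed here)."""
    dists = np.linalg.norm(posteriors - templates, axis=1)
    return float(np.mean(dists)) < distance_threshold

def detect_wake_word(posteriors, wake_phoneme_ids, templates,
                     confidence_threshold, distance_threshold):
    """Two-stage decision sketched from claim 9: confidence gate first,
    then the template-distance gate."""
    conf = wake_word_confidence(posteriors, wake_phoneme_ids)
    return (conf > confidence_threshold
            and distance_condition(posteriors, templates, distance_threshold))
```

In the enrollment flow of claims 11 to 13, `templates` would be the registration posterior probability vectors produced by the same acoustic model from the (fused) registration signal containing the complete wake-up word, and `confidence_threshold` would be derived from those vectors and the wake-up word's phoneme sequence. The one-to-one correspondence assumed by `distance_condition` would in practice require the detection and registration signals to be aligned to the same number of frames.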
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22862757.6A EP4379712A4 (en) | 2021-08-30 | 2022-05-27 | VOICE WAKE-UP METHOD AND APPARATUS, DEVICE, STORAGE MEDIUM AND PROGRAM PRODUCT |
JP2024513453A JP2024534198A (ja) | 2021-08-30 | 2022-05-27 | 音声ウェイクアップ方法および装置、デバイス、記憶媒体、ならびにプログラム製品 |
US18/591,853 US20240203408A1 (en) | 2021-08-30 | 2024-02-29 | Speech Wakeup Method and Apparatus, Device, Storage Medium, and Program Product |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111005443.6 | 2021-08-30 | ||
CN202111005443.6A CN115731927A (zh) | 2021-08-30 | 2021-08-30 | Speech wake-up method and apparatus, device, storage medium, and program product |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/591,853 Continuation US20240203408A1 (en) | 2021-08-30 | 2024-02-29 | Speech Wakeup Method and Apparatus, Device, Storage Medium, and Program Product |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023029615A1 (zh) | 2023-03-09 |
Family
ID=85290866
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/095443 WO2023029615A1 (zh) | Speech wake-up method and apparatus, device, storage medium, and program product | 2021-08-30 | 2022-05-27 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20240203408A1 (zh) |
EP (1) | EP4379712A4 (zh) |
JP (1) | JP2024534198A (zh) |
CN (1) | CN115731927A (zh) |
WO (1) | WO2023029615A1 (zh) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9043211B2 (en) * | 2013-05-09 | 2015-05-26 | Dsp Group Ltd. | Low power activation of a voice activated device |
EP3790006A4 (en) * | 2018-06-29 | 2021-06-09 | Huawei Technologies Co., Ltd. | VOICE COMMAND PROCESS, PORTABLE DEVICE AND TERMINAL |
- 2021-08-30: CN application CN202111005443.6A filed; published as CN115731927A (active, pending)
- 2022-05-27: JP application JP2024513453A filed; published as JP2024534198A (active, pending)
- 2022-05-27: EP application EP22862757.6A filed; published as EP4379712A4 (active, pending)
- 2022-05-27: WO application PCT/CN2022/095443 filed; published as WO2023029615A1 (active, application filing)
- 2024-02-29: US application US18/591,853 filed; published as US20240203408A1 (active, pending)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140229184A1 (en) * | 2013-02-14 | 2014-08-14 | Google Inc. | Waking other devices for additional data |
CN106098059A (zh) * | 2016-06-23 | 2016-11-09 | 上海交通大学 | Customizable voice wake-up method and system |
CN109036412A (zh) * | 2018-09-17 | 2018-12-18 | 苏州奇梦者网络科技有限公司 | Voice wake-up method and system |
JP2020122819A (ja) * | 2019-01-29 | 2020-08-13 | オンキヨー株式会社 | Electronic device and control method thereof |
CN110010143A (zh) * | 2019-04-19 | 2019-07-12 | 出门问问信息科技有限公司 | Voice signal enhancement system, method, and storage medium |
CN112581970A (zh) * | 2019-09-12 | 2021-03-30 | 深圳市韶音科技有限公司 | System and method for audio signal generation |
CN113053371A (zh) * | 2019-12-27 | 2021-06-29 | 阿里巴巴集团控股有限公司 | Voice control system and method, voice kit, and bone conduction and voice processing apparatus |
CN113259793A (zh) * | 2020-02-07 | 2021-08-13 | 杭州智芯科微电子科技有限公司 | Intelligent microphone and signal processing method thereof |
Non-Patent Citations (1)
Title |
---|
See also references of EP4379712A4 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115862604A (zh) * | 2022-11-24 | 2023-03-28 | 镁佳(北京)科技有限公司 | Speech wake-up model training and speech wake-up method and apparatus, and computer device |
CN115985323A (zh) * | 2023-03-21 | 2023-04-18 | 北京探境科技有限公司 | Speech wake-up method and apparatus, electronic device, and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
EP4379712A4 (en) | 2024-10-09 |
JP2024534198A (ja) | 2024-09-18 |
US20240203408A1 (en) | 2024-06-20 |
EP4379712A1 (en) | 2024-06-05 |
CN115731927A (zh) | 2023-03-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Karpagavalli et al. | A review on automatic speech recognition architecture and approaches | |
O’Shaughnessy | Automatic speech recognition: History, methods and challenges | |
Arora et al. | Automatic speech recognition: a review | |
US8930196B2 (en) | System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands | |
WO2023029615A1 (zh) | Speech wake-up method and apparatus, device, storage medium, and program product | |
US10650306B1 (en) | User representation using a generative adversarial network | |
CN109036381A (zh) | Speech processing method and apparatus, computer apparatus, and readable storage medium | |
US11302329B1 (en) | Acoustic event detection | |
CN113012686A (zh) | Neural speech to meaning | |
US11308946B2 (en) | Methods and apparatus for ASR with embedded noise reduction | |
KR20210132615A (ko) | Acoustic model conditioning on sound characteristics | |
Mistry et al. | Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann) | |
CN107039035A (zh) | Method for detecting a speech start point and end point | |
CN115176309A (zh) | Speech processing system | |
Herbig et al. | Self-learning speaker identification: a system for enhanced speech recognition | |
Fu et al. | A survey on Chinese speech recognition | |
Adnene et al. | Design and implementation of an automatic speech recognition based voice control system | |
CN112259077B (zh) | Speech recognition method and apparatus, terminal, and storage medium | |
US11735178B1 (en) | Speech-processing system | |
Nguyen et al. | Vietnamese voice recognition for home automation using MFCC and DTW techniques | |
Oprea et al. | An artificial neural network-based isolated word speech recognition system for the Romanian language | |
Hao et al. | Denoi-spex+: a speaker extraction network based speech dialogue system | |
Bohouta | Improving wake-up-word and general speech recognition systems | |
Hirsch | Speech Assistant System With Local Client and Server Devices to Guarantee Data Privacy | |
CN107039046A (zh) | Feature-fusion-based speech sound effect mode detection method | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22862757; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | WIPO information: entry into national phase | Ref document number: 2024513453; Country of ref document: JP |
| WWE | WIPO information: entry into national phase | Ref document number: 202447016333; Country of ref document: IN |
| ENP | Entry into the national phase | Ref document number: 2022862757; Country of ref document: EP; Effective date: 2024-03-01 |
| NENP | Non-entry into the national phase | Ref country code: DE |