WO2022199405A1 - A voice control method and device - Google Patents
A voice control method and device
- Publication number
- WO2022199405A1 — PCT/CN2022/080436 (application CN2022080436W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- voiceprint
- user
- feature
- component
- Prior art date
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
  - G10L17/00—Speaker identification or verification techniques
    - G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    - G10L17/06—Decision making techniques; Pattern matching strategies
      - G10L17/10—Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
    - G10L17/20—Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    - G10L17/22—Interactive procedures; Man-machine interfaces
  - G10L15/00—Speech recognition
    - G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
      - G10L2015/223—Execution procedure of a spoken command
  - G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    - G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
      - G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being power information
    - G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
      - G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Definitions
- the present application relates to the technical field of audio processing, and in particular, to a voice control method and device.
- A bone vibration sensor is a common voice sensor. When sound propagates through bone, it causes the bone to vibrate; the bone vibration sensor senses this vibration and converts the vibration signal into an electrical signal, thereby collecting the sound.
- the present application provides a voice control method and device, which can solve the problem of inaccurate voiceprint recognition caused by loss of high-frequency components when a bone vibration sensor is used.
- The present application provides a voice control method, including: acquiring voice information of a user, the voice information including a first voice component, a second voice component and a third voice component, where the first voice component is collected by the in-ear voice sensor, the second voice component is collected by the out-of-ear voice sensor, and the third voice component is collected by the bone vibration sensor; performing voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; obtaining the user's identity information according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component in the voice information; and, when the user's identity information matches the preset information, executing an operation instruction, wherein the operation instruction is determined according to the voice information.
- Because the wearable device also uses the in-ear voice sensor when collecting sound, it can compensate for the distortion caused by the loss of some high-frequency signal components when the bone vibration sensor collects voice information. It can therefore improve the wearable device's overall voiceprint collection quality and the accuracy of voiceprint recognition, thereby improving the user experience.
- Before voiceprint recognition is performed, the voice components need to be obtained separately.
- Acquiring multi-channel voice components improves the accuracy and anti-interference capability of voiceprint recognition; a minimal acquisition sketch follows below.
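- For illustration only, the multi-channel acquisition described above can be pictured as reading one synchronized frame from each sensor into a simple structure. This is a minimal sketch, not the claimed implementation; the sensor objects, their `read()` method and the sampling parameters are assumptions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class VoiceInfo:
    """One frame of the user's voice information, split into the three components."""
    first_component: np.ndarray   # collected by the in-ear voice sensor
    second_component: np.ndarray  # collected by the out-of-ear voice sensor
    third_component: np.ndarray   # collected by the bone vibration sensor


def acquire_voice_info(in_ear_sensor, out_ear_sensor, bone_sensor,
                       frame_ms: int = 20, sample_rate: int = 16000) -> VoiceInfo:
    """Read one synchronized frame from each (hypothetical) sensor driver."""
    n = sample_rate * frame_ms // 1000
    return VoiceInfo(
        first_component=np.asarray(in_ear_sensor.read(n), dtype=np.float32),
        second_component=np.asarray(out_ear_sensor.read(n), dtype=np.float32),
        third_component=np.asarray(bone_sensor.read(n), dtype=np.float32),
    )
```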
- Before voiceprint recognition is performed on the first voice component, the second voice component and the third voice component, the method further includes: performing keyword detection on the voice information, or detecting user input.
- When the voice information includes a preset keyword, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively; or, when a preset operation input by the user is received, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively. Otherwise, the user does not need voiceprint recognition at that moment, and the terminal or wearable device does not need to enable the voiceprint recognition function, thereby reducing its power consumption.
- Before keyword detection is performed on the voice information or user input is detected, the method further includes: acquiring a wearing state detection result of the wearable device.
- If the wearing state detection passes, keyword detection is performed on the voice information, or user input is detected; a gating sketch is given below. Otherwise, the user is not wearing the wearable device and voiceprint recognition is unnecessary, so the terminal or wearable device does not need to enable the keyword detection function, thereby reducing its power consumption.
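- As a rough illustration of this gating (wearing-state check first, then keyword detection or a preset user operation, and only then voiceprint recognition), a minimal sketch follows; the helper names and the keyword-matching strategy are assumptions, not the patented logic.

```python
def should_run_voiceprint(is_worn: bool,
                          transcript: str,
                          preset_keywords: set[str],
                          preset_operation_received: bool) -> bool:
    """Enable voiceprint recognition only when it can actually be needed.

    If the device is not worn, keyword detection is skipped entirely; otherwise
    recognition is enabled when a preset keyword appears in the detected speech
    or when a preset user operation (e.g. a tap on the earphone) is received."""
    if not is_worn:
        return False
    keyword_hit = any(keyword in transcript for keyword in preset_keywords)
    return keyword_hit or preset_operation_received
```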
- The specific process of performing voiceprint recognition on the first voice component is: performing feature extraction on the first voice component to obtain a first voiceprint feature, and calculating a first similarity between the first voiceprint feature and the user's first registered voiceprint feature. The first registered voiceprint feature is obtained by feature extraction of the first registered voice through the first voiceprint model, and is used to reflect the user's preset audio features as collected by the in-ear voice sensor. Performing voiceprint recognition by calculating the similarity improves the accuracy of voiceprint recognition.
- The specific process of performing voiceprint recognition on the second voice component is: performing feature extraction on the second voice component to obtain a second voiceprint feature, and calculating a second similarity between the second voiceprint feature and the user's second registered voiceprint feature. The second registered voiceprint feature is obtained by feature extraction of the second registered voice through the second voiceprint model, and is used to reflect the user's preset audio features as collected by the out-of-ear voice sensor. Performing voiceprint recognition by calculating the similarity improves the accuracy of voiceprint recognition.
- The specific process of performing voiceprint recognition on the third voice component is: performing feature extraction on the third voice component to obtain a third voiceprint feature, and calculating a third similarity between the third voiceprint feature and the user's third registered voiceprint feature. The third registered voiceprint feature is obtained by feature extraction of the third registered voice through the third voiceprint model, and is used to reflect the user's preset audio features as collected by the bone vibration sensor. Performing voiceprint recognition by calculating the similarity improves the accuracy of voiceprint recognition; a similarity-scoring sketch follows below.
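- The per-component recognition above amounts to extracting a voiceprint feature with that channel's voiceprint model and scoring it against the corresponding registered voiceprint feature. A minimal sketch, assuming the features are embedding vectors compared with cosine similarity; `voiceprint_model.extract(...)` is a placeholder for whichever feature extractor the implementation uses.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def component_similarity(voice_component: np.ndarray,
                         voiceprint_model,
                         registered_feature: np.ndarray) -> float:
    """Score one voice component against the registered voiceprint feature
    for the same channel (in-ear, out-of-ear, or bone vibration)."""
    feature = voiceprint_model.extract(voice_component)  # hypothetical extractor API
    return cosine_similarity(feature, registered_feature)
```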
- The voiceprint recognition results are used to obtain the user's identity information.
- The user's identity information can be obtained by fusing the voiceprint recognition results by means of dynamic fusion coefficients, specifically: determining a first fusion coefficient corresponding to the first similarity, a second fusion coefficient corresponding to the second similarity, and a third fusion coefficient corresponding to the third similarity; fusing the first similarity, the second similarity and the third similarity according to the first, second and third fusion coefficients to obtain a fusion similarity score; and, if the fusion similarity score is greater than the first threshold, determining that the user's identity information matches the preset identity information.
- When determining the first, second and third fusion coefficients, the decibel level of the ambient sound may be obtained from the sound pressure sensor, and the playback volume may be determined from the playback signal of the speaker. The first, second and third fusion coefficients are then determined from the ambient-sound decibel level and the playback volume, where: the second fusion coefficient is negatively correlated with the ambient-sound decibel level, the first and third fusion coefficients are each negatively correlated with the playback volume, and the sum of the first, second and third fusion coefficients is a fixed value.
- The above-mentioned sound pressure sensor and speaker are the sound pressure sensor and speaker of the wearable device.
- Using dynamic fusion coefficients to fuse the voiceprint recognition results obtained from voice signals with different attributes exploits the complementarity of those signals, which improves the robustness and accuracy of voiceprint recognition. For example, recognition accuracy can be significantly improved in a noisy environment or while music is playing through the headphones.
- Voice signals with different attributes can also be understood as voice signals obtained by different sensors (the in-ear voice sensor, the out-of-ear voice sensor and the bone vibration sensor). A worked sketch of the coefficient computation and score fusion is given below.
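- A worked sketch of the fusion described above, assuming a weighted sum as the fusion rule, simple reciprocal decay functions for the negative correlations, a fixed coefficient sum of 1.0 and an illustrative first threshold of 0.7; none of these constants or functional forms are specified by the application.

```python
def fusion_coefficients(ambient_db: float, playback_volume: float,
                        coeff_sum: float = 1.0) -> tuple[float, float, float]:
    """Derive the three dynamic fusion coefficients.

    The out-of-ear coefficient falls as ambient noise rises, the in-ear and
    bone-vibration coefficients fall as playback volume rises, and the three
    are normalised so that they always sum to a fixed value."""
    w1 = 1.0 / (1.0 + playback_volume)      # in-ear channel, degraded by playback
    w3 = 1.0 / (1.0 + playback_volume)      # bone channel, degraded by playback
    w2 = 1.0 / (1.0 + ambient_db / 10.0)    # out-of-ear channel, degraded by ambient noise
    total = w1 + w2 + w3
    return coeff_sum * w1 / total, coeff_sum * w2 / total, coeff_sum * w3 / total


def identity_matches(sim1: float, sim2: float, sim3: float,
                     ambient_db: float, playback_volume: float,
                     first_threshold: float = 0.7) -> bool:
    """Fuse the three similarities with the dynamic coefficients and compare
    the fusion similarity score against the first threshold."""
    c1, c2, c3 = fusion_coefficients(ambient_db, playback_volume)
    fusion_score = c1 * sim1 + c2 * sim2 + c3 * sim3
    return fusion_score > first_threshold
```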
- the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an application program opening instruction or a call instruction.
- the user only needs to input voice information once to complete a series of operations such as user identity authentication and execution of a certain function, thereby greatly improving the user's control efficiency and user experience.
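- Purely for illustration, once the identity check passes, the operation instruction determined from the voice information might be dispatched as below; the phrase-to-instruction mapping is a hypothetical example, not a mapping defined by the application.

```python
# Hypothetical mapping from recognized command phrases to operation instructions.
OPERATION_INSTRUCTIONS = {
    "unlock": "UNLOCK",
    "pay": "PAYMENT",
    "shut down": "SHUTDOWN",
    "open": "OPEN_APPLICATION",
    "call": "CALL",
}


def dispatch(command_text: str, identity_ok: bool) -> str | None:
    """Return the operation instruction to execute, but only when the user's
    identity information matched the preset information."""
    if not identity_ok:
        return None
    for phrase, instruction in OPERATION_INSTRUCTIONS.items():
        if phrase in command_text.lower():
            return instruction
    return None
```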
- The present application also provides a voice control method that is applied to a wearable device; that is, the execution subject of the voice control method is the wearable device. The method is specifically as follows: the wearable device acquires the user's voice information, the voice information including a first voice component, a second voice component and a third voice component, where the first voice component is collected by the in-ear voice sensor, the second voice component is collected by the out-of-ear voice sensor, and the third voice component is collected by the bone vibration sensor; the wearable device performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; the wearable device obtains the user's identity information according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component in the voice information; and, when the user's identity information matches the preset information, the operation instruction is executed, wherein the operation instruction is determined according to the voice information.
- Because the wearable device also uses the in-ear voice sensor when collecting sound, it can compensate for the distortion caused by the loss of some high-frequency signal components when the bone vibration sensor collects voice information. It can therefore improve the wearable device's overall voiceprint collection quality and the accuracy of voiceprint recognition, thereby improving the user experience.
- Before the wearable device performs voiceprint recognition, it needs to obtain the voice components separately. The wearable device obtains the three voice components through different sensors (the in-ear voice sensor, the out-of-ear voice sensor and the bone vibration sensor), which improves the accuracy and anti-interference capability of the wearable device's voiceprint recognition.
- Before the wearable device performs voiceprint recognition on the first voice component, the second voice component and the third voice component, the method further includes: the wearable device performing keyword detection on the voice information, or detecting user input.
- When the voice information includes a preset keyword, the wearable device performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; or, when a preset operation input by the user is received, the wearable device performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively. Otherwise, the user does not need voiceprint recognition at that moment, and the wearable device does not need to enable the voiceprint recognition function, thereby reducing the power consumption of the wearable device.
- Before the wearable device performs keyword detection on the voice information or detects user input, the method further includes: acquiring a wearing state detection result of the wearable device.
- If the wearing state detection passes, keyword detection is performed on the voice information, or user input is detected. Otherwise, the user is not wearing the wearable device and voiceprint recognition is unnecessary, so the wearable device does not need to enable the keyword detection function, thereby reducing the power consumption of the wearable device.
- The specific process of the wearable device performing voiceprint recognition on the first voice component is as follows: the wearable device performs feature extraction on the first voice component to obtain a first voiceprint feature, and calculates a first similarity between the first voiceprint feature and the user's first registered voiceprint feature. The first registered voiceprint feature is obtained by feature extraction of the first registered voice through the first voiceprint model, and is used to reflect the user's preset audio features as collected by the in-ear voice sensor. Performing voiceprint recognition by calculating the similarity improves the accuracy of voiceprint recognition.
- The specific process of the wearable device performing voiceprint recognition on the second voice component is as follows: the wearable device performs feature extraction on the second voice component to obtain a second voiceprint feature, and calculates a second similarity between the second voiceprint feature and the user's second registered voiceprint feature. The second registered voiceprint feature is obtained by feature extraction of the second registered voice through the second voiceprint model, and is used to reflect the user's preset audio features as collected by the out-of-ear voice sensor. Performing voiceprint recognition by calculating the similarity improves the accuracy of voiceprint recognition.
- The specific process of the wearable device performing voiceprint recognition on the third voice component is as follows: the wearable device performs feature extraction on the third voice component to obtain a third voiceprint feature, and calculates a third similarity between the third voiceprint feature and the user's third registered voiceprint feature. The third registered voiceprint feature is obtained by feature extraction of the third registered voice through the third voiceprint model, and is used to reflect the user's preset audio features as collected by the bone vibration sensor. Performing voiceprint recognition by calculating the similarity improves the accuracy of voiceprint recognition.
- The wearable device obtains the user's identity information according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component in the voice information.
- The identity information of the user can be obtained by fusing the voiceprint recognition results by means of dynamic fusion coefficients, specifically: the wearable device determines a first fusion coefficient corresponding to the first similarity, a second fusion coefficient corresponding to the second similarity, and a third fusion coefficient corresponding to the third similarity; the wearable device fuses the first similarity, the second similarity and the third similarity according to the first, second and third fusion coefficients to obtain a fusion similarity score; and, if the fusion similarity score is greater than the first threshold, it determines that the user's identity information matches the preset identity information. By fusing multiple similarities into a fusion similarity score before making the judgment, the accuracy of voiceprint recognition can be effectively improved.
- When the wearable device determines the first fusion coefficient, the second fusion coefficient and the third fusion coefficient, specifically, the decibel level of the ambient sound can be obtained from the sound pressure sensor, and the playback volume can be determined from the playback signal of the speaker; the first, second and third fusion coefficients are then determined from the ambient-sound decibel level and the playback volume, where: the second fusion coefficient is negatively correlated with the ambient-sound decibel level, the first and third fusion coefficients are each negatively correlated with the playback volume, and the sum of the first, second and third fusion coefficients is a fixed value.
- The above-mentioned sound pressure sensor and speaker are the sound pressure sensor and speaker of the wearable device.
- Using dynamic fusion coefficients to fuse the voiceprint recognition results obtained from voice signals with different attributes exploits the complementarity of those signals, which improves the robustness and accuracy of voiceprint recognition. For example, recognition accuracy can be significantly improved in a noisy environment or while music is playing through the headphones.
- Voice signals with different attributes can also be understood as voice signals obtained by different sensors (the in-ear voice sensor, the out-of-ear voice sensor and the bone vibration sensor).
- The wearable device sends an instruction to the terminal, and the terminal executes the operation instruction corresponding to the voice information; the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an application-opening instruction or a call instruction.
- the user only needs to input voice information once to complete a series of operations such as user identity authentication and execution of a certain function, thereby greatly improving the user's control efficiency and user experience of the wearable device.
- The present application also provides a voice control method that is applied to a terminal; that is, the execution subject of the voice control method is the terminal. The method is specifically as follows: acquiring the user's voice information, the voice information including a first voice component, a second voice component and a third voice component, where the first voice component is collected by the in-ear voice sensor, the second voice component is collected by the out-of-ear voice sensor, and the third voice component is collected by the bone vibration sensor; the terminal performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; the terminal obtains the user's identity information according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component in the voice information; and, when the user's identity information matches the preset information, the terminal executes the operation instruction, wherein the operation instruction is determined according to the voice information.
- Because the wearable device also uses the in-ear voice sensor when collecting voice, it can compensate for the distortion caused by the loss of some high-frequency signal components when the bone vibration sensor collects voice information. It can therefore improve the terminal's overall voiceprint collection quality and the accuracy of voiceprint recognition, thereby improving the user experience.
- After acquiring the voice information input by the user, the wearable device sends the voice components corresponding to the voice information to the terminal, so that the terminal can perform voiceprint recognition on those voice components.
- Executing the voice control method on the terminal side can effectively utilize the computing power of the terminal, and can still ensure the accuracy of identity authentication when the computing power of the wearable device is insufficient.
- Before the terminal performs voiceprint recognition, it needs to obtain the voice components separately. The wearable device obtains the three voice components through different sensors (the in-ear voice sensor, the out-of-ear voice sensor and the bone vibration sensor) and sends them to the terminal, which improves the accuracy and anti-interference capability of the terminal's voiceprint recognition.
- Before the terminal performs voiceprint recognition on the first voice component, the second voice component and the third voice component, the method further includes: performing keyword detection on the voice information, or detecting user input.
- When the voice information includes a preset keyword, the wearable device sends the voice components corresponding to the voice information to the terminal, and the terminal performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; or, when a preset operation input by the user is received, the terminal performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively. Otherwise, the user does not need voiceprint recognition at that moment, and the terminal does not need to enable the voiceprint recognition function, thereby reducing the power consumption of the terminal.
- Before the wearable device performs keyword detection on the voice information or detects user input, the method further includes: acquiring a wearing state detection result of the wearable device.
- If the wearing state detection passes, keyword detection is performed on the voice information, or user input is detected. Otherwise, the user is not wearing the wearable device and voiceprint recognition is unnecessary, so the wearable device does not need to enable the keyword detection function, thereby reducing the power consumption of the wearable device.
- The specific process for the terminal to perform voiceprint recognition on the first voice component is as follows: the terminal performs feature extraction on the first voice component to obtain a first voiceprint feature, and calculates a first similarity between the first voiceprint feature and the user's first registered voiceprint feature. The first registered voiceprint feature is obtained by feature extraction of the first registered voice through the first voiceprint model, and is used to reflect the user's preset audio features as collected by the in-ear voice sensor. Performing voiceprint recognition by calculating the similarity improves the accuracy of voiceprint recognition.
- The specific process for the terminal to perform voiceprint recognition on the second voice component is as follows: the terminal performs feature extraction on the second voice component to obtain a second voiceprint feature, and calculates a second similarity between the second voiceprint feature and the user's second registered voiceprint feature. The second registered voiceprint feature is obtained by feature extraction of the second registered voice through the second voiceprint model, and is used to reflect the user's preset audio features as collected by the out-of-ear voice sensor. Performing voiceprint recognition by calculating the similarity improves the accuracy of voiceprint recognition.
- The specific process for the terminal to perform voiceprint recognition on the third voice component is as follows: the terminal performs feature extraction on the third voice component to obtain a third voiceprint feature, and calculates a third similarity between the third voiceprint feature and the user's third registered voiceprint feature. The third registered voiceprint feature is obtained by feature extraction of the third registered voice through the third voiceprint model, and is used to reflect the user's preset audio features as collected by the bone vibration sensor. Performing voiceprint recognition by calculating the similarity improves the accuracy of voiceprint recognition.
- The terminal obtains the user's identity information according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component in the voice information.
- The identity information of the user can be obtained by fusing the voiceprint recognition results by means of dynamic fusion coefficients, specifically: the terminal determines a first fusion coefficient corresponding to the first similarity, a second fusion coefficient corresponding to the second similarity, and a third fusion coefficient corresponding to the third similarity; the terminal fuses the first similarity, the second similarity and the third similarity according to the first, second and third fusion coefficients to obtain a fusion similarity score; and, if the fusion similarity score is greater than the first threshold, it determines that the user's identity information matches the preset identity information.
- When the terminal determines the first fusion coefficient, the second fusion coefficient and the third fusion coefficient, specifically, the decibel level of the ambient sound may be obtained from the sound pressure sensor and the playback volume may be determined from the playback signal of the speaker. After the wearable device detects the ambient-sound decibel level and the playback volume, it sends the data to the terminal, and the terminal determines the first, second and third fusion coefficients from the ambient-sound decibel level and the playback volume, where: the second fusion coefficient is negatively correlated with the ambient-sound decibel level, the first and third fusion coefficients are each negatively correlated with the playback volume, and the sum of the first, second and third fusion coefficients is a fixed value.
- The above-mentioned sound pressure sensor and speaker are the sound pressure sensor and speaker of the wearable device.
- Using dynamic fusion coefficients to fuse the voiceprint recognition results obtained from voice signals with different attributes exploits the complementarity of those signals, which improves the robustness and accuracy of voiceprint recognition. For example, recognition accuracy can be significantly improved in a noisy environment or while music is playing through the headphones.
- Voice signals with different attributes can also be understood as voice signals obtained by different sensors (the in-ear voice sensor, the out-of-ear voice sensor and the bone vibration sensor).
- the terminal executes an operation instruction corresponding to the voice information, and the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an application program opening instruction or a call instruction.
- The user only needs to input voice information once to complete a series of operations such as user identity authentication and execution of a certain function of the wearable device, thereby greatly improving the user's control efficiency and user experience.
- In a fourth aspect, the present application provides a voice control device, comprising: a voice information acquisition unit configured to acquire voice information of a user, the voice information including a first voice component, a second voice component and a third voice component, where the first voice component is collected by the in-ear voice sensor, the second voice component is collected by the out-of-ear voice sensor, and the third voice component is collected by the bone vibration sensor; an identification unit configured to perform voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; an identity information acquisition unit configured to obtain the user's identity information according to the voiceprint recognition result of the first voice component, the voiceprint recognition result of the second voice component and the voiceprint recognition result of the third voice component; and an execution unit configured to execute an operation instruction when the user's identity information matches the preset information, wherein the operation instruction is determined according to the voice information.
- Because the wearable device also uses the in-ear voice sensor when collecting sound, it can compensate for the distortion caused by the loss of some high-frequency signal components when the bone vibration sensor collects voice information, so it can improve the device's overall voiceprint collection quality and the accuracy of voiceprint recognition, thereby improving the user experience. Before the voiceprint recognition results are obtained, the voice components need to be obtained separately; acquiring multi-channel voice components improves the accuracy and anti-interference capability of voiceprint recognition.
- the voice information acquiring unit is further configured to: perform keyword detection on the voice information, or detect user input.
- When the voice information includes a preset keyword, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively; or, when a preset operation input by the user is received, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively. Otherwise, the user does not need voiceprint recognition at that moment, and the terminal or wearable device does not need to enable the voiceprint recognition function, thereby reducing its power consumption.
- The voice information acquisition unit is further configured to: acquire the wearing state detection result of the wearable device.
- If the wearing state detection passes, keyword detection is performed on the voice information, or user input is detected. Otherwise, the user is not wearing the wearable device and voiceprint recognition is unnecessary, so the terminal or wearable device does not need to enable the keyword detection function, thereby reducing its power consumption.
- The identification unit is specifically configured to: perform feature extraction on the first voice component to obtain a first voiceprint feature, and calculate a first similarity between the first voiceprint feature and the user's first registered voiceprint feature, where the first registered voiceprint feature is obtained by feature extraction of the first registered voice through the first voiceprint model and is used to reflect the user's preset audio features as collected by the in-ear voice sensor; perform feature extraction on the second voice component to obtain a second voiceprint feature, and calculate a second similarity between the second voiceprint feature and the user's second registered voiceprint feature, where the second registered voiceprint feature is obtained by feature extraction of the second registered voice through the second voiceprint model and is used to reflect the user's preset audio features as collected by the out-of-ear voice sensor; and perform feature extraction on the third voice component to obtain a third voiceprint feature, and calculate a third similarity between the third voiceprint feature and the user's third registered voiceprint feature, where the third registered voiceprint feature is obtained by feature extraction of the third registered voice through the third voiceprint model and is used to reflect the user's preset audio features as collected by the bone vibration sensor.
- Voiceprint recognition is performed by calculating the similarities, which improves the accuracy of voiceprint recognition.
- The identity information acquisition unit may obtain the identity information by means of dynamic fusion coefficients, and is specifically configured to: determine a first fusion coefficient corresponding to the first similarity, a second fusion coefficient corresponding to the second similarity, and a third fusion coefficient corresponding to the third similarity; fuse the first similarity, the second similarity and the third similarity according to the first, second and third fusion coefficients to obtain a fusion similarity score; and, if the fusion similarity score is greater than the first threshold, determine that the user's identity information matches the preset identity information.
- The identity information acquisition unit is specifically configured to: obtain the decibel level of the ambient sound from the sound pressure sensor; determine the playback volume from the playback signal of the speaker; and determine the first fusion coefficient, the second fusion coefficient and the third fusion coefficient from the ambient-sound decibel level and the playback volume, where: the second fusion coefficient is negatively correlated with the ambient-sound decibel level, the first and third fusion coefficients are each negatively correlated with the playback volume, and the sum of the first, second and third fusion coefficients is a fixed value.
- Using dynamic fusion coefficients to fuse the voiceprint recognition results obtained from voice signals with different attributes exploits the complementarity of those signals, which improves the robustness and accuracy of voiceprint recognition. For example, recognition accuracy can be significantly improved in a noisy environment or while music is playing through the headphones.
- Voice signals with different attributes can also be understood as voice signals obtained by different sensors (the in-ear voice sensor, the out-of-ear voice sensor and the bone vibration sensor).
- The execution unit is specifically configured to: execute an operation instruction corresponding to the voice information, where the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an application-opening instruction or a call instruction.
- the user only needs to input voice information once to complete a series of operations such as user identity authentication and execution of a certain function, thereby greatly improving the user's control efficiency and user experience.
- The voice control device provided in the fourth aspect of the present application can be understood as a terminal or a wearable device, depending on the execution subject of the voice control method; this is not limited in the present application.
- In a fifth aspect, the present application provides a wearable device, comprising: an in-ear voice sensor, an out-of-ear voice sensor, a bone vibration sensor, a memory and a processor. The in-ear voice sensor is used to collect the first voice component of the voice information, the out-of-ear voice sensor is used to collect the second voice component of the voice information, and the bone vibration sensor is used to collect the third voice component of the voice information. The memory is coupled to the processor; the memory is used to store computer program code, and the computer program code includes computer instructions. When the processor executes the computer instructions, the wearable device executes the voice control method of any one of the first aspect or its possible implementations, or of the third aspect or its possible implementations.
- In a sixth aspect, the present application provides a terminal, comprising a memory and a processor. The memory is coupled to the processor; the memory is used to store computer program code, and the computer program code includes computer instructions. When the processor executes the computer instructions, the terminal executes the voice control method of any one of the first aspect or its possible implementations, or of the third aspect or its possible implementations.
- In a seventh aspect, the present application provides a chip system, which is applied to an electronic device. The chip system includes one or more interface circuits and one or more processors; the interface circuits and the processors are interconnected through lines. The interface circuit is used to receive a signal from the memory of the electronic device and send the signal to the processor, where the signal includes the computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device executes the voice control method of the first aspect or any one of its possible implementations.
- In an eighth aspect, the present application provides a computer storage medium comprising computer instructions that, when executed on a voice control device, cause the voice control device to perform the voice control method of the first aspect or any one of its possible implementations.
- In a ninth aspect, the present application provides a computer program product comprising computer instructions that, when run on a voice control device, cause the voice control device to perform the voice control method of the first aspect or any one of its possible implementations.
- The wearable device of the fifth aspect, the terminal of the sixth aspect, the chip system of the seventh aspect, the computer storage medium of the eighth aspect and the computer program product of the ninth aspect are all used to execute the corresponding methods provided above; therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding methods, which are not repeated here.
- FIG. 1 is a schematic diagram of the hardware structure of a mobile phone according to an embodiment of the present application.
- FIG. 2 is a schematic structural diagram of a mobile phone software provided by an embodiment of the present application.
- FIG. 3 is a schematic structural diagram of a wearable device according to an embodiment of the present application.
- FIG. 4 is a schematic diagram of a voice control system provided by an embodiment of the present application.
- FIG. 5 is a schematic structural diagram of a server according to an embodiment of the present application.
- FIG. 6 is a schematic flowchart of a voiceprint recognition provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of a voice control method provided by an embodiment of the present application.
- FIG. 8 is a schematic diagram of a sensor setting area provided by an embodiment of the present application.
- FIG. 9 is a schematic diagram of a payment interface provided by an embodiment of the present application.
- FIG. 10 is a schematic diagram of another voice control method provided by an embodiment of the present application.
- FIG. 11 is a schematic diagram of a mobile phone setting interface provided by an embodiment of the application.
- FIG. 12 is a schematic diagram of a voice control device according to an embodiment of the present application.
- FIG. 13 is a schematic diagram of a wearable device provided by an embodiment of the present application.
- FIG. 14 is a schematic diagram of a terminal according to an embodiment of the present application.
- FIG. 15 is a schematic diagram of a chip system provided by an embodiment of the present application.
- The terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implying the number of technical features indicated.
- Features defined as "first" or "second" may expressly or implicitly include one or more of such features. It should be understood that the data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in sequences other than those illustrated or described herein.
- "Plurality" means two or more.
- A voiceprint is the spectrum of sound waves carrying speech information, as displayed by electroacoustic instruments.
- A voiceprint has the characteristics of stability, measurability and uniqueness. After adulthood, the human voice can remain relatively stable for a long time.
- The vocal organs people use when speaking differ greatly in size and shape from person to person, so any two people have different voiceprints, and different people's voices show different formant distributions in the spectrogram.
- Voiceprint recognition judges whether two utterances come from the same person by comparing the speakers' voices on the same phonemes, thereby realizing the function of "recognizing a person by their voice".
- Voiceprint recognition extracts voiceprint information from the speech signal produced by the speaker. From the application perspective, it can be divided into: speaker identification (SI), which determines which one of several people said a certain piece of speech and is therefore a "multiple choice" problem; and speaker verification (SV), which confirms whether a certain piece of speech was spoken by a designated person and is therefore a "one-to-one discrimination" problem. This application is primarily concerned with speaker verification techniques; a small sketch contrasting the two tasks follows below.
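- A small sketch contrasting the two tasks, assuming unit-normalised voiceprint feature vectors compared by inner product and an illustrative acceptance threshold; the values and data layout are placeholders.

```python
import numpy as np


def speaker_identification(test_feature: np.ndarray,
                           enrolled: dict[str, np.ndarray]) -> str:
    """SI, the "multiple choice" problem: pick the enrolled speaker whose
    registered feature is most similar to the test feature."""
    return max(enrolled, key=lambda name: float(np.dot(test_feature, enrolled[name])))


def speaker_verification(test_feature: np.ndarray,
                         claimed_feature: np.ndarray,
                         threshold: float = 0.7) -> bool:
    """SV, the "one-to-one discrimination" problem: accept only if the
    similarity to the claimed speaker's registered feature exceeds a threshold."""
    return float(np.dot(test_feature, claimed_feature)) > threshold
```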
- the voiceprint recognition technology can be applied to end user identification scenarios, and can also be applied to household head identification scenarios for home security, which is not limited in this application.
- Conventional voiceprint recognition technology performs voiceprint recognition on one or two channels of collected voice signals; that is, a preset user is identified only if the voiceprint recognition results of the two channels of voice components both match.
- Such voiceprint recognition has two problems. First, in a multi-speaker scene or against a background of strong interfering environmental noise, the collected voice components interfere with the voiceprint recognition result, resulting in inaccurate or even wrong identity authentication. Second, voiceprint recognition performance degrades and the identity authentication result may be misjudged. That is, existing voiceprint recognition technology cannot suppress noise from all directions well, which reduces the accuracy of voiceprint recognition.
- an embodiment of the present application provides a voice control method.
- The subject executing the method of this embodiment may be a terminal; the terminal establishes a connection with a wearable device, can obtain the voice information collected by the wearable device, and performs voiceprint recognition on the voice information.
- the subject performing the method of this embodiment may also be the wearable device itself, and the wearable device itself includes a processor with computing capability, which can directly perform voiceprint recognition on the collected voice information.
- the main body executing the method of this embodiment may also be a server, and the server establishes a connection with the wearable device, can obtain the voice information collected by the wearable device, and performs voiceprint recognition on the voice information.
- The subject that executes the method of this embodiment may be determined according to the computing power of the wearable device's chip. When the computing power of the wearable device's chip is high, the wearable device itself can perform the method of this embodiment; when the computing power of the wearable device's chip is low, the method of this embodiment may be performed by a terminal device connected to the wearable device, or by a server connected to the wearable device.
- The embodiments of the present application are described in detail below by taking as examples the terminal connected to the wearable device, the wearable device itself, or the server connected to the wearable device as the execution subject of the method of this embodiment.
- Terminal equipment, also called user equipment (UE), mobile station (MS), mobile terminal (MT), etc., is a device that provides voice and/or data connectivity to the user.
- For example, handheld devices and in-vehicle devices with wireless connectivity.
- At present, examples of terminal devices are: mobile phones, tablet computers, notebook computers, PDAs, mobile internet devices (MID), wearable devices, virtual reality (VR) devices, augmented reality (AR) devices, wireless terminals in industrial control, wireless terminals in self-driving, wireless terminals in remote medical surgery, smart grid, etc.
- the voice control method may be implemented by an application program installed on the terminal for recognizing voiceprints.
- the above-mentioned application program for recognizing voiceprint may be an embedded application program installed in the terminal (ie, a system application of the terminal) or a downloadable application program.
- An embedded application is an application provided as part of the implementation of the terminal (such as a mobile phone).
- A downloadable application is an application that can provide its own internet protocol multimedia subsystem (IMS) connection; it may be pre-installed in the terminal, or it may be a third-party application downloaded and installed in the terminal by the user.
- FIG. 1 shows a hardware structure of the mobile phone.
- the mobile phone 10 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, Antenna 1, Antenna 2, Mobile Communication Module 150, Wireless Communication Module 160, Audio Module 170, Speaker 170A, Receiver 170B, Microphone 170C, Headphone Interface 170D, Sensor Module 180, Key 190, Motor 191, Indicator 192, Camera 193, Display screen 194, and subscriber identification module (subscriber identification module, SIM) card interface 195 and so on.
- the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, and an environment Light sensor 180L, bone conduction sensor 180M, etc.
- the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the mobile phone.
- the mobile phone may include more or less components than shown, or some components may be combined, or some components may be separated, or different component arrangements.
- the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
- the processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
- the processor 110 can execute the voiceprint recognition algorithm provided by the embodiment of the present application.
- the controller can be the nerve center and command center of the mobile phone.
- the controller can generate operation control signals according to the instruction opcode and timing signal, and complete the control of fetching and executing instructions.
- a memory may also be provided in the processor 110 for storing instructions and data.
- the memory in processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving the efficiency of the system.
- the processor 110 may include one or more interfaces.
- the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
- the terminal can establish a wired communication connection with the wearable device through the interface.
- through the interface, the terminal can obtain the first voice component collected by the wearable device through the in-ear voice sensor, the second voice component collected through the out-of-ear voice sensor, and the third voice component collected through the bone vibration sensor.
- the I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL).
- the I2S interface can be used for audio communication.
- the PCM interface can also be used for audio communications, sampling, quantizing and encoding analog signals.
- the UART interface is a universal serial data bus used for asynchronous communication.
- the bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication.
- the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 . MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc.
- the GPIO interface can be configured by software.
- the GPIO interface can be configured as a control signal or as a data signal.
- the USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
- the USB interface 130 can be used to connect a charger to charge the mobile phone, and can also be used to transfer data between the mobile phone and peripheral devices. It can also be used to connect headphones to play audio through the headphones.
- the interface can also be used to connect other electronic devices, such as AR devices.
- the interface connection relationship between the modules illustrated in the embodiments of the present application is only a schematic illustration, and does not constitute a structural limitation of the mobile phone.
- the mobile phone may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
- the charging management module 140 is used to receive charging input from the charger.
- the charger may be a wireless charger or a wired charger.
- the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
- the power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110 , the internal memory 121 , the external memory, the display screen 194 , the camera 193 , and the wireless communication module 160 .
- the power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance).
- the wireless communication function of the mobile phone can be realized by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
- Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
- Each antenna in a cell phone can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
- the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
- the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the mobile phone.
- the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
- the mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
- the mobile communication module 150 can also amplify the signal modulated by the modulation and demodulation processor, and then turn it into an electromagnetic wave for radiation through the antenna 1 .
- at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110 .
- at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110 .
- the modem processor may include a modulator and a demodulator.
- the wireless communication module 160 can provide wireless communication solutions applied on the mobile phone, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) technology, and the like.
- the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
- the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
- the wireless communication module 160 can also receive the signal to be sent from the processor 110, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2.
- the terminal can establish a communication connection with the wearable device through the wireless communication module 160 .
- through the wireless communication module 160, the terminal may acquire the first voice component collected by the wearable device through the in-ear voice sensor, the second voice component collected through the out-of-ear voice sensor, and the third voice component collected through the bone vibration sensor.
- the GNSS in this embodiment of the present application may include GPS, GLONASS, BDS, QZSS, SBAS, and/or GALILEO, and the like.
- the mobile phone realizes the display function through the GPU, the display screen 194, and the application processor.
- the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
- the GPU is used to perform mathematical and geometric calculations for graphics rendering.
- Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
- Display screen 194 is used to display images, videos, and the like. Display screen 194 includes a display panel.
- the mobile phone can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
- the ISP is used to process the data fed back by the camera 193 .
- the camera 193 is used to obtain still images or videos.
- an object is projected through the lens to generate an optical image on the photosensitive element.
- a digital signal processor is used to process digital signals, in addition to processing digital image signals, it can also process other digital signals.
- Video codecs are used to compress or decompress digital video.
- the NPU is a neural-network (NN) computing processor.
- through the NPU, applications such as intelligent cognition of the mobile phone can be realized, for example image recognition, face recognition, speech recognition, and text understanding.
- the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone.
- the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example to save files such as music and videos in the external memory card.
- Internal memory 121 may be used to store computer executable program code, which includes instructions.
- the processor 110 executes various functional applications and data processing of the mobile phone by executing the instructions stored in the internal memory 121 .
- the code stored in the internal memory 121 can execute a voice control method provided by the embodiment of the present application. For example, when the user inputs voice information to the wearable device, the wearable device collects the first voice component through the in-ear voice sensor, the second voice component through the out-of-ear voice sensor, and the third voice component through the bone vibration sensor; the mobile phone obtains the first voice component, the second voice component and the third voice component from the wearable device through the communication connection, and performs voiceprint recognition on each of them respectively; the user's identity is authenticated according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component, and the third voiceprint recognition result of the third voice component; if the user's identity authentication result is a preset user, the mobile phone executes the operation instruction corresponding to the voice information.
- the mobile phone can implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, and an application processor. Such as music playback, recording, etc.
- the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal.
- Speaker 170A also referred to as a "speaker” is used to convert audio electrical signals into sound signals.
- the receiver 170B also referred to as “earpiece”, is used to convert audio electrical signals into sound signals.
- the microphone 170C, also referred to as a "mike" or a "mic", is used to convert sound signals into electrical signals.
- the earphone jack 170D is used to connect wired earphones.
- the earphone interface 170D can be the USB interface 130, or can be a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
- the keys 190 include a power-on key, a volume key, and the like. Keys 190 may be mechanical keys. It can also be a touch key.
- the cell phone can receive key input and generate key signal input related to user settings and function control of the cell phone.
- Motor 191 can generate vibrating cues.
- the motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback.
- the indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.
- the SIM card interface 195 is used to connect a SIM card.
- the SIM card can be inserted into the SIM card interface 195 or pulled out from the SIM card interface 195 to achieve contact and separation with the mobile phone.
- the mobile phone can support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
- the SIM card interface 195 can support Nano SIM card, Micro SIM card, SIM card and so on.
- the mobile phone may further include a camera, a flash, a micro-projection device, a near field communication (NFC) device, etc., which will not be repeated here.
- the software system of the mobile phone can adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
- the embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of a mobile phone.
- FIG. 2 is a block diagram of a software structure of a mobile phone according to an embodiment of the present application.
- the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
- the Android system is divided into four layers, from top to bottom: an application layer, an application framework layer, an Android runtime (Android runtime) and system libraries, and a kernel layer.
- the application layer can include a series of application packages.
- the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and so on.
- An application program for voiceprint recognition may also be included, and the application program for voiceprint recognition may be built into the terminal or downloaded through an external website.
- the application framework layer provides an application programming interface (API) and a programming framework for applications in the application layer.
- the application framework layer includes some predefined functions.
- the application framework layer may include window managers, content providers, view systems, telephony managers, resource managers, notification managers, and the like.
- a window manager is used to manage window programs.
- the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
- Content providers are used to store and retrieve data and make these data accessible to applications.
- the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
- the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
- a display interface can consist of one or more views.
- the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
- the phone manager is used to provide the communication functions of the mobile phone. For example, the management of call status (including connecting, hanging up, etc.).
- the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
- the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
- the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, and the indicator light flashes.
- Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
- the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
- the application layer and the application framework layer run in virtual machines.
- the virtual machine executes the java files of the application layer and the application framework layer as binary files.
- the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
- a system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
- the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
- the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
- the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
- the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
- 2D graphics engine is a drawing engine for 2D drawing.
- the kernel layer is the layer between hardware and software.
- the kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.
- when the touch sensor receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer.
- the kernel layer processes touch operations into raw input events (including touch coordinates, timestamps of touch operations, etc.). Raw input events are stored at the kernel layer.
- the application framework layer obtains the original input event from the kernel layer, and identifies the control corresponding to the input event. Taking the touch operation being a touch click operation and the control corresponding to the click operation being the control of the camera application icon as an example, the camera application calls the interface of the application framework layer to start the camera application, and then starts the camera driver by calling the kernel layer.
- the camera 193 captures still images or video.
- the voice control method of the embodiment of the present application can be applied to a wearable device, in other words, the wearable device can be used as the execution subject of the voice control method of the embodiment of the present application.
- the wearable device may be a device with a voice collection function, such as a wireless headset, a wired headset, smart glasses, a smart helmet, or a smart watch, which is not limited in this embodiment of the present application.
- the wearable device provided by the embodiment of the present application may be a TWS (True Wireless Stereo, true wireless stereo) headset, and the TWS technology is based on the development of the Bluetooth chip technology. According to its working principle, it means that the mobile phone is connected to the main earphone, and then the main earphone is quickly connected to the auxiliary earphone by wireless means, so as to realize the true wireless separation of the left and right channels of Bluetooth.
- TWS smart earphones have begun to play a role in the fields of wireless connection, voice interaction, intelligent noise reduction, health monitoring and hearing enhancement/protection, and noise reduction, hearing protection, intelligent translation, health monitoring, bone vibration ID, anti-loss, etc. will be the key technology trends of TWS headsets.
- the wearable device 30 may specifically include an in-ear voice sensor 301 , an out-of-ear voice sensor 302 and a bone vibration sensor 303 .
- the in-ear voice sensor 301 and the out-of-ear voice sensor may be air conduction microphones
- the bone vibration sensor may be a bone conduction microphone, an optical vibration sensor, an acceleration sensor, or an air conduction microphone and other sensors that can collect vibration signals generated when a user utters a voice.
- the air conduction microphone collects voice information by having the vibration signal generated when the sound is produced transmitted to the microphone through the air, where the sound signal is collected and converted into an electrical signal;
- the bone conduction microphone collects voice information by using the slight vibration of the bones in the head and neck caused by human speech to transmit the vibration signal of the sound to the microphone through the bone, where the sound signal is collected and converted into an electrical signal.
- the voice control method provided in the embodiments of the present application needs to be applied to a wearable device with a voiceprint recognition function, in other words, the wearable device 30 needs to have a voiceprint recognition function.
- the in-ear voice sensor 301 of the wearable device 30 means that, when the wearable device is in a state of being used by a user, the in-ear voice sensor is located inside the user's ear canal, or in other words, the sound detection direction of the in-ear voice sensor is the inside of the ear canal.
- the in-ear voice sensor is used to collect the sound transmitted by the vibration of the outside air and the air in the ear canal when the user makes a sound, and the sound is the in-ear voice signal component.
- the out-of-ear voice sensor 302 means that, when the wearable device is in the state of being used by the user, the out-of-ear voice sensor is located outside the user's ear canal, or in other words, the sound detection direction of the out-of-ear voice sensor is all directions other than the inside of the ear canal, i.e. the whole outside air direction.
- the out-of-ear voice sensor is exposed to the environment, and is used for collecting the sound transmitted by the user through the vibration of the outside air, and the sound is an out-of-ear voice signal component or an ambient sound component.
- the bone vibration sensor 303 means that, when the wearable device is in the state of being used by the user, the bone vibration sensor is in contact with the user's skin and is used to collect the vibration signal transmitted by the user's bones, or, in other words, to collect the component of the voice information that is transmitted through bone vibration when the user makes a sound. In addition, microphones with different directivities, such as cardioid, omnidirectional or figure-8, can be selected according to the positions of the microphones, so as to obtain voice signals in different directions.
- the external auditory canal and the middle ear canal form a closed cavity, and the sound has a certain amplification effect in the cavity, that is, the cavity effect. Therefore, the sound collected by the in-ear voice sensor is clearer, especially for high-frequency sound signals; this can make up for the distortion caused by the loss of the high-frequency signal components of part of the voice information when the bone vibration sensor collects voice information, improve the overall voiceprint collection effect of the headset, and improve the accuracy of voiceprint recognition, thereby improving the user experience.
- when the in-ear voice sensor 301 picks up the in-ear voice signal, it is usually accompanied by residual noise in the ear, and when the out-of-ear voice sensor 302 picks up the out-of-ear voice signal, it is usually accompanied by extra-ear noise.
- when the user wears the wearable device 30 and speaks, the wearable device 30 can not only collect the voice information transmitted by the user through the air through the in-ear voice sensor 301 and the out-of-ear voice sensor 302, but can also collect the voice information transmitted through the user's bones through the bone vibration sensor 303.
- there may be multiple in-ear voice sensors 301, out-of-ear voice sensors 302 and bone vibration sensors 303 in the wearable device 30, which is not limited in this application.
- the in-ear voice sensor 301 , the out-of-ear voice sensor 302 and the bone vibration sensor 303 may be built into the wearable device 30 .
- the wearable device 30 may further include components such as a communication module 304 , a speaker 305 , a computing module 306 , a storage module 307 , and a power supply 309 .
- the communication module 304 can establish a communication connection with the terminal or the server.
- the communication module 304 may include a communication interface, and the communication interface may be in a wired or wireless manner, and the wireless manner may be through bluetooth or wifi.
- the communication module 304 can be used to transmit the first voice component collected by the wearable device 30 through the in-ear voice sensor 301, the second voice component collected through the out-of-ear voice sensor 302, and the third voice component collected through the bone vibration sensor 303 to the terminal or the server.
- the computing module 306 can execute the voice control method provided in the embodiment of the present application. For example, when the user inputs voice information to the wearable device, the in-ear voice sensor 301 collects the first voice component, the out-of-ear voice sensor 302 collects the second voice component, and the bone vibration sensor 303 collects the third voice component, and voiceprint recognition is performed on each of them respectively; the user's identity is authenticated according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component; if the user's identity authentication result is a preset user, the wearable device executes the operation instruction corresponding to the voice information.
- the storage module 307 is used for storing the application program code for executing the method of the embodiment of the present application, and the execution is controlled by the computing module 306 .
- the code stored in the storage module 307 can execute a voice control method provided by this embodiment of the present application, for example: when the user inputs voice information to the wearable device, the wearable device 30 collects the first voice component through the in-ear voice sensor 301, The extra-ear voice sensor 302 collects the second voice component, and the bone vibration sensor 303 collects the third voice component to perform voiceprint recognition respectively; according to the first voiceprint recognition result of the first voice component and the second voiceprint of the second voice component The identification result and the third voiceprint identification result of the third voice component are used to authenticate the user's identity; if the user's identity authentication result is a preset user, the wearable device executes the operation instruction corresponding to the voice information.
- the above-mentioned wearable device 30 may also include pressure sensors, acceleration sensors, optical sensors, etc.
- the wearable device 30 may also have more or fewer components than those shown in FIG. 3, may combine two or more components, or may have different component configurations.
- the various components shown in Figure 3 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing or application specific integrated circuits.
- a voice control method provided in this embodiment of the present application can be applied to a voice control system composed of a wearable device 30 and a terminal 10 , and the voice control system is shown in FIG. 4 .
- the wearable device 30 can collect the first voice component through the in-ear voice sensor 301, the second voice component through the out-of-ear voice sensor 302, and the third voice component through the bone vibration sensor 303
- the terminal 10 obtains the first voice component, the second voice component, and the third voice component from the wearable device, and then performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; the user's identity is authenticated according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component; if the user's identity authentication result is a preset user, the terminal 10 executes an operation instruction corresponding to the voice information.
- the voice control method of the embodiment of the present application may also be applied to the server, in other words, the server may serve as the execution body of the voice control method of the embodiment of the present application.
- the server may be a desktop server, a rack server, a cabinet server, a blade server, or other types of servers, and the server may also be a cloud server such as a public cloud or a private cloud, which is not limited in this embodiment of the present application.
- the server 50 includes at least one processor 501 , at least one memory 502 and at least one communication interface 503 .
- the processor 501, the memory 502, and the communication interface 503 are connected through a communication bus 504 and communicate with each other.
- the processor 501 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the above programs.
- the memory 502 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and capable of being accessed by a computer, without limitation.
- the memory can exist independently and be connected to the processor through a bus.
- the memory can also be integrated with the processor.
- the memory 502 is used for storing application program codes for executing the methods of the embodiments of the present application, and the execution is controlled by the processor 501 .
- the code stored in the memory 502 can execute a voice control method provided by the embodiment of the present application. For example, when the user inputs voice information to the wearable device, the wearable device collects the first voice component through the in-ear voice sensor, the second voice component through the out-of-ear voice sensor, and the third voice component through the bone vibration sensor; the server obtains the first voice component, the second voice component and the third voice component from the wearable device through the communication connection, and performs voiceprint recognition on each of them respectively; the user's identity is authenticated according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component, and the third voiceprint recognition result of the third voice component; if the user's identity authentication result is a preset user, the server executes the operation instruction corresponding to the voice information.
- the communication interface 503 is used to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), wireless local area network (Wireless Local Area Networks, WLAN).
- the following summarizes the specific implementation manner when the voice control method of the present application is applied to a terminal.
- the method first acquires the voice information of the user, and the voice information includes a first voice component, a second voice component and a third voice component.
- for example, the user can input voice information into the Bluetooth headset while wearing the Bluetooth headset; at this time, based on the voice information input by the user, the Bluetooth headset can collect the first voice component through the in-ear voice sensor, the second voice component through the out-of-ear voice sensor, and the third voice component through the bone vibration sensor. That is, the Bluetooth headset obtains the first voice component, the second voice component, and the third voice component from the voice information, and the mobile phone obtains the first voice component, the second voice component and the third voice component from the Bluetooth headset through a Bluetooth connection with the Bluetooth headset.
- in a possible implementation manner, the mobile phone may perform keyword detection on the voice information input by the user to the Bluetooth headset, or the mobile phone may detect a user input; when it is detected that the voice information input by the user includes a preset keyword, or when the user input is detected, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively.
- the user input may be the user's input to the mobile phone through a touch screen or keys, for example, the user clicks an unlock key of the mobile phone.
- the wearing state detection result may also be acquired from the Bluetooth headset.
- the mobile phone performs keyword detection on the voice information, or detects user input.
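- As a hedged illustration of this triggering logic, the following Python sketch shows how voiceprint recognition could be gated on keyword detection or an explicit user input; the helper name should_run_voiceprint and the wake word list are hypothetical and not defined by this application.

```python
# Hypothetical wake word taken from the example utterance later in this document.
PRESET_KEYWORDS = ("Xiao E",)

def should_run_voiceprint(recognized_text: str, user_input_detected: bool) -> bool:
    """Trigger the three-way voiceprint recognition only when the voice
    information contains a preset keyword or an explicit user input
    (e.g. pressing the unlock key) has been detected."""
    has_keyword = any(keyword in recognized_text for keyword in PRESET_KEYWORDS)
    return has_keyword or user_input_detected
```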
- a first voiceprint recognition result corresponding to the first voice component, a second voiceprint recognition result corresponding to the second voice component, and a third voiceprint recognition result corresponding to the third voice component are obtained.
- the mobile phone can use a certain algorithm to calculate the first matching degree between the first voiceprint feature and the first registered voiceprint feature, the second matching degree between the second voiceprint feature and the second registered voiceprint feature, and the third matching degree between the third voiceprint feature and the third registered voiceprint feature.
- the higher the matching degree, the more consistent the voiceprint feature is with the corresponding registered voiceprint feature, and the higher the possibility that the voicing user is a preset user.
- the mobile phone can determine that the first voiceprint feature matches the first registered voiceprint feature, the second voiceprint feature matches the second registered voiceprint feature, and the third voiceprint feature matches the third registered voiceprint feature.
- the first registered voiceprint feature is obtained by feature extraction through the first voiceprint model, and is used to reflect the preset user's voiceprint feature collected by the in-ear voice sensor; the second registered voiceprint feature is obtained by feature extraction through the second voiceprint model, and is used to reflect the preset user's voiceprint feature collected by the out-of-ear voice sensor; the third registered voiceprint feature is obtained by feature extraction through the third voiceprint model, and is used to reflect the preset user's voiceprint feature collected by the bone vibration sensor.
- the mobile phone can execute an operation instruction corresponding to the voice information, for example, an unlock instruction, a payment instruction, a shutdown instruction, an application program opening instruction, or a call instruction.
- the mobile phone can perform the corresponding operation according to the operation instruction, so as to realize the function of the user controlling the mobile phone by voice.
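- Purely for illustration, the sketch below shows one way an authenticated voice command could be mapped to the operation instructions listed above; the function and instruction names are placeholders rather than interfaces defined by this application.

```python
def dispatch_operation(command_text: str, is_preset_user: bool) -> str:
    """Map an authenticated voice command to one of the operation
    instructions mentioned above; the mapping itself is illustrative."""
    if not is_preset_user:
        return "reject"  # identity authentication did not pass
    instruction_table = {
        "unlock": "unlock_instruction",
        "pay": "payment_instruction",
        "shut down": "shutdown_instruction",
        "call": "call_instruction",
        "open": "open_application_instruction",
    }
    text = command_text.lower()
    for keyword, instruction in instruction_table.items():
        if keyword in text:
            return instruction
    return "no_matching_instruction"
```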
- the conditions of identity authentication are not limited. For example, when the first matching degree, the second matching degree and the third matching degree are all greater than a certain threshold, it can be considered that the identity authentication has passed, and the voicing user is the preset user.
- the identity authentication in this embodiment of the present application refers to obtaining the user's identity information to determine whether the identity information matches the preset identity information; if they match, the identity authentication passes, otherwise the identity authentication fails.
- the above-mentioned preset users refer to users who can pass the preset identity authentication measures of the mobile phone.
- the preset identity authentication measures of the terminal are inputting a password, fingerprint recognition and voiceprint recognition.
- the user who stores the fingerprint information and the registered voiceprint feature that has been authenticated by the user can be regarded as the preset user of the terminal.
- the preset users of a terminal may include one or more, and any user other than the preset users may be regarded as an illegal user of the terminal.
- an illegal user can also be transformed into a preset user after passing certain identity authentication measures, which is not limited in this embodiment of the present application.
- the first registered voiceprint feature is obtained by feature extraction through the first voiceprint model, and is used to reflect the preset user's voiceprint feature collected by the in-ear voice sensor; the second registered voiceprint feature is obtained by feature extraction through the second voiceprint model, and is used to reflect the preset user's voiceprint feature collected by the out-of-ear voice sensor; the third registered voiceprint feature is obtained by feature extraction through the third voiceprint model, and is used to reflect the preset user's voiceprint feature collected by the bone vibration sensor.
- the above algorithm for calculating the matching degree may be calculating the similarity.
- the mobile phone performs feature extraction on the first voice component to obtain the first voiceprint feature, and similarly obtains the second voiceprint feature and the third voiceprint feature; it respectively calculates the first similarity between the first voiceprint feature and the pre-stored first registered voiceprint feature of the preset user, the second similarity between the second voiceprint feature and the pre-stored second registered voiceprint feature of the preset user, and the third similarity between the third voiceprint feature and the pre-stored third registered voiceprint feature of the preset user, and authenticates the user based on the first similarity, the second similarity and the third similarity.
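- One common choice for the similarity mentioned above is the cosine similarity between the extracted voiceprint feature vector and the registered voiceprint feature vector; the sketch below assumes the features are numpy vectors and is only one possible realization, not an algorithm mandated by this application.

```python
import numpy as np

def cosine_similarity(feature: np.ndarray, registered_feature: np.ndarray) -> float:
    """Similarity between an extracted voiceprint feature and a registered one."""
    denom = np.linalg.norm(feature) * np.linalg.norm(registered_feature) + 1e-12
    return float(np.dot(feature, registered_feature) / denom)

# s1, s2, s3 would be the first, second and third similarities computed from the
# in-ear, out-of-ear and bone-vibration voiceprint features respectively, e.g.:
# s1 = cosine_similarity(f_in_ear, r_in_ear)
# s2 = cosine_similarity(f_out_ear, r_out_ear)
# s3 = cosine_similarity(f_bone, r_bone)
```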
- the way of authenticating the user may be: the mobile phone determines, according to the decibel number of the ambient sound and the playback volume of the wearable device, the first fusion coefficient corresponding to the first similarity, the second fusion coefficient corresponding to the second similarity, and the third fusion coefficient corresponding to the third similarity; according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient, the first similarity, the second similarity and the third similarity are fused to obtain a fusion similarity score; if the fusion similarity score is greater than the first threshold, the mobile phone determines that the user who inputs the voice information to the Bluetooth headset is a preset user.
- the decibel number of the ambient sound can be detected by the sound pressure sensor of the Bluetooth headset and sent to the mobile phone; the playback volume can be detected by the speaker of the Bluetooth headset and sent to the mobile phone, or it can be obtained by the mobile phone itself calling its own data, that is, obtained through the volume program interface of the underlying system.
- the second fusion coefficient is negatively correlated with the decibel number of the ambient sound, and the first fusion coefficient and the third fusion coefficient are respectively negatively correlated with the decibel number of the playback volume.
- the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value. That is to say, when the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a preset fixed value, the larger the decibel number of the ambient sound, the smaller the second fusion coefficient; correspondingly, the first fusion coefficient and the third fusion coefficient will increase adaptively, so as to keep the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient unchanged.
- the above-mentioned variable fusion coefficients can take into account the recognition accuracy in different application scenarios (for example, in a noisy environment or when the headset is playing music).
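- The sketch below illustrates one possible realization of the variable fusion coefficients described above: the second coefficient decreases as the ambient noise grows, the first and third coefficients decrease as the playback volume grows, and the three coefficients are normalized so that their sum stays a fixed value; the specific weighting functions (and the 30 dB scaling constant) are assumptions for illustration only, not the algorithm defined by this application.

```python
def fusion_coefficients(ambient_db: float, playback_db: float,
                        total: float = 1.0) -> tuple:
    """Return (a1, a2, a3) whose sum is the fixed value `total`.

    a2 (out-of-ear) shrinks as ambient noise grows; a1 (in-ear) and a3 (bone)
    shrink as the playback volume grows. Inputs are non-negative decibel values."""
    w1 = 1.0 / (1.0 + playback_db / 30.0)   # in-ear weight, reduced by playback
    w3 = 1.0 / (1.0 + playback_db / 30.0)   # bone weight, reduced by playback
    w2 = 1.0 / (1.0 + ambient_db / 30.0)    # out-of-ear weight, reduced by noise
    norm = w1 + w2 + w3
    return (total * w1 / norm, total * w2 / norm, total * w3 / norm)

def fused_similarity(s1: float, s2: float, s3: float,
                     ambient_db: float, playback_db: float) -> float:
    """Fuse the three similarities with the adaptive coefficients."""
    a1, a2, a3 = fusion_coefficients(ambient_db, playback_db)
    return a1 * s1 + a2 * s2 + a3 * s3

# The user is taken to be the preset user when
# fused_similarity(s1, s2, s3, ambient_db, playback_db) > FIRST_THRESHOLD.
```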
- after the mobile phone determines that the user who inputs the voice information to the Bluetooth headset is the preset user, the mobile phone can automatically execute an operation instruction corresponding to the voice information, for example, the mobile phone unlocking operation or the payment confirmation operation.
- when the user inputs voice information to the wearable device in order to control the terminal, the wearable device can collect the voice information generated in the ear canal when the user makes a sound, the voice information generated outside the ear canal, and the bone vibration information. At this time, three channels of voice information (that is, the above-mentioned first voice component, second voice component, and third voice component) are generated in the wearable device. In this way, the terminal (or the wearable device itself, or the server) can perform voiceprint recognition on the three channels of voice information respectively.
- the triple voiceprint recognition process of this three-channel voice information can significantly improve the accuracy and security of user identity authentication compared with the voiceprint recognition process of one-channel voice information or the voiceprint recognition process of two-channel voice information.
- compared with the voiceprint recognition process that uses only the two channels of voice information from the out-of-ear voice sensor and the bone vibration sensor, adding a microphone in the ear can solve the problem that the high-frequency signal components of the voice signal collected by the bone vibration sensor are lost.
- the wearable device can collect the voice information input by the user through bone conduction. Therefore, when the wearable device performs voiceprint recognition on the voice information collected through bone conduction, it also shows that the above voice information was generated by the voice of the preset user wearing the wearable device, so as to avoid the situation where an illegal user maliciously controls the terminal of the preset user by using a recording of the preset user.
- a voice control method provided by the embodiments of the present application will be specifically introduced below with reference to the accompanying drawings.
- a mobile phone is used as a terminal, and a Bluetooth headset is used as an example for illustration.
- the general voiceprint recognition application process is shown in Figure 6.
- in the registration process, the registration voice 601 is first collected; after preprocessing by the preprocessing module 602, it is input into the pre-trained voiceprint model 603 for feature extraction to obtain a registered voiceprint feature 604, which can also be understood as the preset user's registered voiceprint feature.
- the registered speech can be picked up by different types of sensors, eg out-of-ear speech sensors, in-ear speech sensors or bone vibration sensors.
- the voiceprint model 603 is obtained through training data in advance.
- the voiceprint model 603 may be built-in before the terminal leaves the factory, or may be trained by an application to guide the user.
- the training method may use the method of the prior art, which is not limited in this application.
- in the verification process part, the test voice 605 of the voicing user in a certain voiceprint recognition process is first collected; after preprocessing by the preprocessing module 606, it is input into the pre-trained voiceprint model 607 for feature extraction to obtain the test voice's voiceprint feature 608, which can also be understood as the voiceprint feature to be verified.
- identity authentication passing 6010 means that the voicing user of the test voice 605 and the voicing user of the registered voice 601 are the same person, in other words, the voicing user of the test voice 605 is the preset user; identity authentication failure 6011 means that the voicing user of the test voice 605 and the voicing user of the registered voice 601 are not the same person, in other words, the voicing user of the test voice 605 is an illegal user.
- depending on the specific voiceprint recognition algorithm, the voice preprocessing, the feature extraction, and the training process of the voiceprint model may differ to different degrees; the preprocessing module is an optional module, and the preprocessing may be filtering, noise reduction or enhancement, which is not limited in this application.
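- A minimal sketch of the registration/verification flow of FIG. 6, assuming a generic pre-trained voiceprint model with an extract() method and cosine scoring; the class and method names are hypothetical, and the actual preprocessing, model and scoring are not limited by this application.

```python
import numpy as np

class VoiceprintPipeline:
    """Skeleton of the registration/verification flow of FIG. 6.

    `model` is any pre-trained voiceprint model exposing an extract(audio)
    method that returns a fixed-length feature vector (assumed interface)."""

    def __init__(self, model, threshold: float):
        self.model = model
        self.threshold = threshold
        self.registered_feature = None

    def preprocess(self, audio: np.ndarray) -> np.ndarray:
        # Optional step: filtering, noise reduction or enhancement.
        return audio

    def register(self, enrollment_audio: np.ndarray) -> None:
        """Registration process: store the registered voiceprint feature."""
        self.registered_feature = self.model.extract(self.preprocess(enrollment_audio))

    def verify(self, test_audio: np.ndarray) -> bool:
        """Verification process: compare the test feature with the registered one."""
        feature = self.model.extract(self.preprocess(test_audio))
        denom = (np.linalg.norm(feature) *
                 np.linalg.norm(self.registered_feature) + 1e-12)
        score = float(np.dot(feature, self.registered_feature) / denom)
        return score > self.threshold  # True -> same person as the registered voice
```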
- FIG. 7 shows a schematic flowchart of a voice control method provided by an embodiment of the present application, taking the terminal being a mobile phone and the wearable device being a Bluetooth headset as an example.
- the Bluetooth headset includes an in-ear voice sensor, an out-of-ear voice sensor and a bone vibration sensor.
- the voice control method may include:
- a mobile phone establishes a connection with a Bluetooth headset.
- the connection method can be bluetooth connection, wifi connection or wired connection.
- the Bluetooth function of the Bluetooth headset can be turned on.
- the Bluetooth headset can send a pairing broadcast to the outside world. If the mobile phone does not have the bluetooth function turned on, the user needs to turn on the bluetooth function of the mobile phone. If the mobile phone has turned on the bluetooth function, the mobile phone can receive the pairing broadcast and prompt the user that the relevant bluetooth device has been scanned. After the user selects the Bluetooth headset on the mobile phone, the mobile phone can be paired with the Bluetooth headset and a Bluetooth connection can be established. Subsequently, the mobile phone and the Bluetooth headset can communicate through the Bluetooth connection. Of course, if the mobile phone and the Bluetooth headset have been successfully paired before the current Bluetooth connection is established, the mobile phone can automatically establish a Bluetooth connection with the scanned Bluetooth headset.
- if the headset the user wants to use has the Wi-Fi function, the user can also operate the mobile phone to establish a Wi-Fi connection with the headset; if the earphone the user wishes to use is a wired earphone, the user can insert the plug of the earphone cable into the corresponding earphone port of the mobile phone to establish a wired connection, which is not limited in this embodiment of the present application.
- the Bluetooth headset detects whether it is in a wearing state.
- the wearing detection may sense the wearing state of the user by photoelectric detection based on the principle of optical sensing: when the earphone is worn, the light detected by the photoelectric sensor inside the earphone is blocked and a switch control signal is output, from which it is judged that the user is wearing the earphone.
- a proximity light sensor and an acceleration sensor may be provided in the Bluetooth headset, wherein the proximity light sensor is provided on the side that is in contact with the user when worn by the user.
- the proximity light sensor and acceleration sensor can be activated periodically to obtain currently detected measurements.
- since the user wears the Bluetooth headset, the light entering the proximity light sensor will be blocked; therefore, when the light intensity detected by the proximity light sensor is less than the preset light intensity threshold, the Bluetooth headset can determine that it is in the wearing state at this time. Also, because the Bluetooth headset moves with the user after the user wears it, when the acceleration value detected by the acceleration sensor is greater than the preset acceleration threshold, the Bluetooth headset can determine that it is in the wearing state at this time. Or, when the light intensity detected by the proximity light sensor is less than the preset light intensity threshold and the acceleration value detected by the acceleration sensor at this time is greater than the preset acceleration threshold, the Bluetooth headset can determine that it is in the wearing state.
- the Bluetooth headset is also provided with a sensor that collects voice information by means of bone conduction, such as a bone vibration sensor or an optical vibration sensor; therefore, in a possible implementation, the Bluetooth headset can further collect the vibration signal generated in the current environment through the bone vibration sensor. When the Bluetooth headset is in direct contact with the user, the vibration signal collected by the bone vibration sensor is stronger than that in the non-wearing state; then, if the energy of the vibration signal collected by the bone vibration sensor is greater than the energy threshold, the Bluetooth headset can determine that it is in the wearing state.
- the above two situations can be understood as the user's wearing state detection result passing. This can reduce the probability that the Bluetooth headset cannot accurately detect the wearing state through the proximity light sensor or acceleration sensor in scenarios such as the user putting the Bluetooth headset into a pocket.
- the above-mentioned energy threshold or preset spectral characteristics may be obtained by capturing the various vibration signals generated by a large number of users wearing Bluetooth headsets, such as when uttering sounds or exercising, since the energy or spectral characteristics of such speech vibration signals show significant differences from those of other signals.
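- For illustration only, the sketch below combines the proximity-light, acceleration and bone-vibration checks described above into a single wearing-state decision; the threshold values, the sensor-reading interface and the exact combination logic are assumptions rather than requirements of this embodiment.

```python
def is_worn(light_intensity: float, acceleration: float, vibration_energy: float,
            light_threshold: float, accel_threshold: float,
            energy_threshold: float) -> bool:
    """One possible combination of the wearing-state checks described above."""
    optical_check = light_intensity < light_threshold      # light is blocked when worn
    motion_check = acceleration > accel_threshold          # headset moves with the user
    bone_check = vibration_energy > energy_threshold       # bone vibration is present
    # Pass if the optical and motion checks both pass, or if the
    # bone-vibration energy check passes on its own.
    return (optical_check and motion_check) or bone_check
```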
- since a voice sensor such as an air conduction microphone consumes a certain amount of power, the in-ear voice sensor, the out-of-ear voice sensor and/or the bone vibration sensor may be turned on to collect the voice information generated by the user's voice only after the wearing state detection result passes, so as to reduce the power consumption of the Bluetooth headset.
- when the Bluetooth headset detects that it is currently in the wearing state, or in other words, after the wearing state detection result passes, the following steps S703-S707 may be continued; otherwise, the Bluetooth headset may enter the sleep state until it detects that it is currently in the wearing state.
- the above step S702 is an optional step, that is, regardless of whether the user wears a Bluetooth headset, the Bluetooth headset can continue to perform the following steps S703-S707, which is not limited in this embodiment of the present application.
- In addition, if the Bluetooth headset has collected a voice signal before detecting whether it is in the wearing state, then when the Bluetooth headset detects that it is currently in the wearing state, or in other words, after the wearing state detection result passes, the voice signal collected by the Bluetooth headset is stored and the following steps S703-S707 are continued; when the Bluetooth headset does not detect that it is currently in the wearing state, or in other words, after the wearing state detection result fails, the Bluetooth headset deletes the voice signal just collected.
- In S703, the Bluetooth headset acquires the first voice component of the voice information input by the user by collecting it through the in-ear voice sensor, collects the second voice component of the above-mentioned voice information through the out-of-ear voice sensor, and collects the third voice component of the voice information through the bone vibration sensor.
- The Bluetooth headset can start the voice detection module and use the above-mentioned in-ear voice sensor, out-of-ear voice sensor and bone vibration sensor to collect the voice information input by the user, and obtain the first voice component, the second voice component and the third voice component of the voice information. Taking the in-ear voice sensor and the out-of-ear voice sensor as air conduction microphones and the bone vibration sensor as a bone conduction microphone as an example, the user can input the voice information "Xiao E, use WeChat to pay" when using the Bluetooth headset.
- The Bluetooth headset can use the air conduction microphones to receive the vibration signal generated by the vibration of air after the user speaks (that is, the first voice component and the second voice component of the above voice information). At the same time, the Bluetooth headset can use the bone conduction microphone to receive the vibration signal generated by the vibration of the ear bone and the skin after the user speaks (that is, the third voice component of the above voice information).
- FIG. 8 is a schematic diagram of a sensor setting area.
- the Bluetooth headset provided by the embodiment of the present application includes an in-ear voice sensor, an out-of-ear voice sensor, and a bone vibration sensor.
- The in-ear voice sensor means that, when the headset is in the state of being used by the user, the in-ear voice sensor is located inside the user's ear canal, or the sound detection direction of the in-ear voice sensor points into the ear canal; the in-ear voice sensor is set in the in-ear voice sensor setting area 801.
- the in-ear speech sensor is used to collect the sound transmitted by the vibration of the outside air and the air in the ear canal when the user makes a sound, and the sound is the in-ear speech signal component.
- The out-of-ear voice sensor means that, when the headset is in the state of being used by the user, the out-of-ear voice sensor is located outside the user's ear canal, or the sound detection direction of the out-of-ear voice sensor covers directions other than the inside of the ear canal, that is, the direction of the outside air; the out-of-ear voice sensor is arranged in the out-of-ear voice sensor setting area 802.
- the out-of-ear voice sensor is exposed to the environment, and is used for collecting the sound transmitted by the user through the vibration of the outside air, and the sound is an out-of-ear voice signal component or an ambient sound component.
- The bone vibration sensor means that, when the headset is in the state of being used by the user, the bone vibration sensor is in contact with the user's skin and is used to collect the vibration signal transmitted by the user's bones, or, in other words, to collect the component of the user's voice information conveyed by bone vibration.
- The setting area of the bone vibration sensor is not limited, as long as the user's bone vibration can be detected when the user wears the earphone. It can be understood that the in-ear voice sensor can be set at any position in the area 801, and the out-of-ear voice sensor can be set at any position in the area 802, which is not limited in this application. It should be noted that the area division in FIG. 8 is only an example; in fact, any setting position from which the in-ear voice sensor can detect sound inside the ear canal, and any setting position from which the out-of-ear voice sensor can detect sound from the direction of the outside air, will do.
- Then, the Bluetooth headset can respectively input the first voice component, the second voice component and the third voice component of the above voice information into corresponding VAD (voice activity detection) algorithms, to obtain a first VAD value corresponding to the first voice component, a second VAD value corresponding to the second voice component and a third VAD value corresponding to the third voice component.
- the value of VAD can be used to reflect whether the above-mentioned speech information is a normal speech signal of the speaker or a noise signal.
- The value range of VAD can be set in the interval of 0 to 100. When the value of VAD is greater than or equal to a certain VAD threshold, it indicates that the voice information is a normal voice signal of the speaker; when the value of VAD is less than the VAD threshold, it indicates that the voice information is a noise signal.
- the value of VAD can be set to 0 or 1. When the value of VAD is 1, it indicates that the voice information is a normal voice signal of the speaker, and when the value of VAD is 0, it indicates that the voice information is a noise signal.
- the Bluetooth headset can determine whether the above-mentioned voice information is a noise signal in combination with the above-mentioned three VAD values of the first VAD value, the second VAD value and the third VAD value. For example, when the first VAD value, the second VAD value and the third VAD value are all 1, the Bluetooth headset can determine that the above voice information is not a noise signal, but a normal voice signal of the speaker. For another example, when the first VAD value, the second VAD value and the third VAD value are respectively greater than the preset values, the Bluetooth headset can determine that the above voice information is not a noise signal, but a normal voice signal of the speaker.
- The Bluetooth headset can also determine whether the above-mentioned voice information is a noise signal according to only the first VAD value or only the second VAD value, or according to any two of the first VAD value, the second VAD value and the third VAD value.
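- A minimal sketch of combining the three VAD values as described above, assuming binary VAD outputs; the function name and threshold handling are illustrative only.

```python
def is_speech(vad1: int, vad2: int, vad3: int, threshold: int = 1) -> bool:
    """Decide whether the collected voice information is a normal speech signal.

    vad1/vad2/vad3 are the VAD values of the first, second and third voice
    components. With binary VAD (0 or 1) the threshold is 1; with a 0-100
    score a larger threshold would be used. The values are illustrative.
    """
    return vad1 >= threshold and vad2 >= threshold and vad3 >= threshold
```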
- If the Bluetooth headset determines that the voice information is a noise signal, it can discard the voice information; if the voice information is not a noise signal, the Bluetooth headset can continue to perform the following steps S704-S707. That is, only when the user inputs valid voice information into the Bluetooth headset will the Bluetooth headset be triggered to perform subsequent processes such as voiceprint recognition, thereby reducing the power consumption of the Bluetooth headset.
- In addition, a noise estimation algorithm, such as the minimum statistics algorithm or the minima-controlled recursive averaging algorithm, can also be used to measure the noise value in the above voice information.
- a Bluetooth headset may set a storage space dedicated to storing noise values, and each time the Bluetooth headset calculates a new noise value, the new noise value may be updated in the above-mentioned storage space. That is, the recently measured noise value is always stored in the storage space.
- Subsequently, the noise value in the above-mentioned storage space can be used to perform noise reduction processing on the above-mentioned first voice component, second voice component and third voice component respectively, so that the recognition result when the Bluetooth headset (or mobile phone) subsequently performs voiceprint recognition on the first voice component, the second voice component and the third voice component is more accurate.
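- The following sketch illustrates the idea of keeping the most recently measured noise value in a dedicated storage space, using a simplified minimum-statistics style update; the smoothing constants and class structure are assumptions for illustration.

```python
import numpy as np

class NoiseTracker:
    """Keep the most recent noise estimate in a dedicated buffer (illustrative)."""

    def __init__(self):
        self.noise_power = None  # the "storage space" holding the latest noise value

    def update(self, frame: np.ndarray) -> float:
        # Minimum-statistics style idea: quiet frames bound the noise floor,
        # so track the smallest recent frame power as the noise estimate.
        frame_power = float(np.mean(frame ** 2))
        if self.noise_power is None:
            self.noise_power = frame_power
        else:
            # Follow decreases quickly, increases slowly (simplified update rule).
            self.noise_power = min(frame_power,
                                   0.98 * self.noise_power + 0.02 * frame_power)
        return self.noise_power
```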
- the Bluetooth headset sends the first voice component, the second voice component and the third voice component to the mobile phone through the Bluetooth connection.
- the Bluetooth headset can send the first voice component, second voice component and third voice component to the mobile phone, and then the mobile phone performs the following steps S705- S707, to implement operations such as voiceprint recognition and user identity authentication on the voice information input by the user.
- In S705, the mobile phone performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, to obtain a first voiceprint recognition result corresponding to the first voice component, a second voiceprint recognition result corresponding to the second voice component and a third voiceprint recognition result corresponding to the third voice component.
- the principle of voiceprint recognition is to compare the registered voiceprint features of the preset user with the voiceprint features extracted from the voice information input by the user, and make a judgment through a certain algorithm, and the judgment result is the voiceprint recognition result.
- the registered voiceprint features of one or more preset users may be pre-stored in the mobile phone.
- Each preset user has three registered voiceprint features: the first registered voiceprint feature is obtained by feature extraction based on the user's first registered voice collected while the in-ear voice sensor is working; the second registered voiceprint feature is obtained by feature extraction based on the user's second registered voice collected while the out-of-ear voice sensor is working; and the third registered voiceprint feature is obtained by feature extraction based on the user's third registered voice collected while the bone conduction microphone is working.
- the acquisition of the first registered voiceprint feature, the second registered voiceprint feature and the third registered voiceprint feature needs to go through two stages.
- the first stage is the background model training stage.
- the developer can capture the speech of the relevant text (eg, "Hello, little E", etc.) generated by a large number of speakers wearing the above-mentioned Bluetooth headset.
- the mobile phone can perform preprocessing (such as filtering, noise reduction, etc.) on the speech of these related texts, and then extract the voiceprint features in the speech.
- The voiceprint features can be a spectrogram (time-frequency spectrogram), fbank (filter banks, filter-bank-based features), MFCC (mel-frequency cepstral coefficients), PLP (perceptual linear prediction) or CQCC (constant Q cepstral coefficients), and so on.
- The mobile phone can also extract two or more of the above-mentioned voiceprint features and obtain a fused voiceprint feature by means such as splicing.
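- As a rough illustration of extracting and splicing two of the listed voiceprint features (MFCC and log-mel fbank), the sketch below uses the librosa library; the sampling rate, feature dimensions and function name are illustrative assumptions.

```python
import numpy as np
import librosa

def fused_voiceprint_features(wav_path: str) -> np.ndarray:
    """Extract MFCC and log-mel fbank features and splice them (illustrative)."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)            # shape (20, frames)
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))    # shape (40, frames)
    # Splice (concatenate) along the feature dimension to obtain a fused feature.
    return np.concatenate([mfcc, fbank], axis=0)                  # shape (60, frames)
```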
- Then, a machine learning algorithm such as GMM (Gaussian mixture model), SVM (support vector machine) or a deep neural network framework is used to establish a background model for voiceprint recognition. The above machine learning algorithms include, but are not limited to, the DNN (deep neural network) algorithm, the RNN (recurrent neural network) algorithm, the LSTM (long short-term memory) algorithm, the TDNN (time-delay neural network) algorithm, Resnet (deep residual network), and so on.
- After obtaining the background model, the mobile phone stores the obtained background model.
- the storage location may be a mobile phone, a wearable device or a server.
- a single or multiple background models can be stored, and the stored multiple background models can be obtained by the same or different algorithms.
- The stored multiple background models can realize fusion at the voiceprint model level, for example, fusing background models obtained by Resnet (a deep residual network), TDNN (a time-delay neural network) and RNN (a recurrent neural network).
- a mobile phone or a Bluetooth headset can establish multiple voiceprint models respectively by combining the characteristics of different voice sensors in the wearable device connected to the mobile phone. For example, a first voiceprint model corresponding to the in-ear voice sensor of the Bluetooth headset, a second voiceprint model corresponding to the out-of-ear voice sensor of the Bluetooth headset, and a third voiceprint model corresponding to the bone vibration sensor of the Bluetooth headset are established.
- the mobile phone can save the first voiceprint model, the second voiceprint model and the third voiceprint model locally on the mobile phone, or send the first voiceprint model, the second voiceprint model and the third voiceprint model to the Bluetooth headset for processing. save.
- The second stage is that, when the user uses the voiceprint recognition function on the mobile phone for the first time, the user enters a registered voice, and the mobile phone extracts the user's first registered voiceprint feature, second registered voiceprint feature and third registered voiceprint feature through the in-ear voice sensor, out-of-ear voice sensor and bone vibration sensor of the Bluetooth headset connected to the mobile phone. The registration process can be carried out through the voiceprint recognition option in the built-in biometric function of the mobile phone system, or a downloaded APP can call the system program to carry out the registration process.
- the voice assistant APP can prompt the user to wear a Bluetooth headset and say a registration voice of "Hello, Little E".
- Since the Bluetooth headset includes an in-ear voice sensor, an out-of-ear voice sensor and a bone vibration sensor, the Bluetooth headset can obtain the first registered voice component of the registered voice collected by the in-ear voice sensor, the second registered voice component collected by the out-of-ear voice sensor and the third registered voice component collected by the bone vibration sensor.
- Furthermore, the mobile phone can extract features from the first registered voice component through the first voiceprint model to obtain the first registered voiceprint feature, extract features from the second registered voice component through the second voiceprint model to obtain the second registered voiceprint feature, and extract features from the third registered voice component through the third voiceprint model to obtain the third registered voiceprint feature.
- The mobile phone can save the preset first registered voiceprint feature, second registered voiceprint feature and third registered voiceprint feature of user 1 locally on the mobile phone, or can send these registered voiceprint features of preset user 1 to the Bluetooth headset for storage.
- the mobile phone may also use the Bluetooth headset connected at this time as the preset Bluetooth device.
- the mobile phone may store the preset identifier of the Bluetooth device (eg, the MAC address of the Bluetooth headset, etc.) locally in the mobile phone.
- the mobile phone can receive and execute the relevant operation instructions sent by the preset Bluetooth device, and when the illegal Bluetooth device sends the operation instruction to the mobile phone, the mobile phone can discard the operation instruction to improve security.
- A phone can manage one or more preset Bluetooth devices. As shown in (a) of FIG. 11, the user can enter the setting interface 1101 of the voiceprint recognition function from the setting function. After the user clicks the setting button 1105, the user can enter the preset device management interface 1106 shown in (b) of FIG. 11. The user can add or delete preset Bluetooth devices in the preset device management interface 1106.
- In step S705, after acquiring the first voice component, the second voice component and the third voice component of the above voice information, the mobile phone can extract the voiceprint feature of the first voice component to obtain the first voiceprint feature, extract the voiceprint feature of the second voice component to obtain the second voiceprint feature, and extract the voiceprint feature of the third voice component to obtain the third voiceprint feature. Then the first registered voiceprint feature of preset user 1 is used to match the first voiceprint feature, the second registered voiceprint feature of preset user 1 is used to match the second voiceprint feature, and the third registered voiceprint feature of preset user 1 is used to match the third voiceprint feature.
- For example, the mobile phone can use a certain algorithm to calculate a first matching degree between the first registered voiceprint feature and the first voice component (that is, the first voiceprint recognition result), a second matching degree between the second registered voiceprint feature and the second voice component (that is, the second voiceprint recognition result), and a third matching degree between the third registered voiceprint feature and the third voice component (that is, the third voiceprint recognition result).
- The higher the matching degree, the more similar the voiceprint feature in the voice information is to the voiceprint feature of preset user 1, and the higher the probability that the user who inputs the voice information is preset user 1.
- When the first matching degree, the second matching degree and the third matching degree are respectively greater than their corresponding preset thresholds, the mobile phone can determine that the first voiceprint feature matches the first registered voiceprint feature, the second voiceprint feature matches the second registered voiceprint feature, and the third voiceprint feature matches the third registered voiceprint feature.
- The first registered voiceprint feature is obtained by feature extraction through the first voiceprint model and is used to reflect the preset user's voiceprint feature collected by the in-ear voice sensor; the second registered voiceprint feature is obtained by feature extraction through the second voiceprint model and is used to reflect the preset user's voiceprint feature collected by the out-of-ear voice sensor; the third registered voiceprint feature is obtained by feature extraction through the third voiceprint model and is used to reflect the preset user's voiceprint feature collected by the bone vibration sensor. It can be understood that the function of a voiceprint model is to extract the voiceprint features of the input voice.
- During registration, the voiceprint model extracts the registered voiceprint features from the registered voice; during recognition, the voiceprint model extracts the voiceprint features from the voice to be recognized.
- the acquisition method of the voiceprint feature may also be a fusion method, including a voiceprint model fusion method and a voiceprint feature-level fusion method.
- the above algorithm for calculating the matching degree may be calculating the similarity.
- For example, the mobile phone performs feature extraction on the first voice component to obtain the first voiceprint feature, and respectively calculates a first similarity between the first voiceprint feature and the pre-stored first registered voiceprint feature of the preset user, a second similarity between the second voiceprint feature and the pre-stored second registered voiceprint feature of the preset user, and a third similarity between the third voiceprint feature and the pre-stored third registered voiceprint feature of the preset user.
- In addition, the mobile phone can also calculate, one by one according to the above method, the matching degrees between the first voice component, the second voice component and the third voice component and the registered voiceprint features of other preset users (such as preset user 2 and preset user 3).
- the Bluetooth headset may determine the preset user with the highest matching degree (eg, preset user A) as the sounding user at this time.
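- A minimal sketch of the matching-degree computation using cosine similarity (one of the similarity measures mentioned later) and of picking the preset user with the highest matching degree; the data structures are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def best_matching_user(voiceprint: np.ndarray, enrolled: dict) -> tuple:
    """enrolled maps a preset user name to a registered voiceprint feature (illustrative)."""
    scores = {user: cosine_similarity(voiceprint, reg) for user, reg in enrolled.items()}
    user = max(scores, key=scores.get)   # preset user with the highest matching degree
    return user, scores[user]
```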
- The judgment method may be to perform keyword detection on the voice information: when the voice information includes preset keywords, the mobile phone performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively. Alternatively, the judgment method may be to detect user input: when receiving a preset operation input by the user, the mobile phone performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively.
- A specific method of keyword detection may be that voice recognition is performed on the voice information, and when the similarity with the preset keyword is greater than a preset threshold, the keyword detection is considered to have passed.
- If the Bluetooth headset or the mobile phone recognizes preset keywords in the voice information input by the user, for example, "transfer", "payment", "**bank", "chat record" or other keywords related to user privacy or financial behavior, it indicates that the user has high security requirements for controlling the mobile phone by voice at this time. Therefore, the mobile phone can perform step S705 to perform voiceprint recognition.
- If the Bluetooth headset detects a preset operation input by the user for enabling the voiceprint recognition function, for example, tapping the Bluetooth headset or pressing the volume + and volume - buttons at the same time, it means that the user needs to verify the user identity through voiceprint recognition at this time. Therefore, the Bluetooth headset can notify the mobile phone to perform step S705 for voiceprint recognition.
- keywords corresponding to different security levels may also be preset in the mobile phone.
- For example, keywords with the highest security level include "payment", "pay", etc.; keywords with a higher security level include "photography", "calling", etc.; and keywords with the lowest security level include "listening to songs", "navigation", etc.
- When it is detected that the collected voice information contains a keyword with the highest security level, the mobile phone can be triggered to perform voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, that is, voiceprint recognition is performed on all three audio sources of the collected voice, so as to improve the security of voice control of the mobile phone.
- When it is detected that the collected voice information contains a keyword with a higher security level, since the user's security requirement for controlling the mobile phone by voice is ordinary, the mobile phone can be triggered to perform voiceprint recognition on only the first voice component, the second voice component or the third voice component.
- When the voice information collected by the Bluetooth headset does not contain keywords, it means that the voice information collected at this time may only be voice uttered by the user during a normal conversation, so the mobile phone does not need to perform voiceprint recognition on the first voice component, the second voice component and the third voice component, thereby reducing the power consumption of the mobile phone.
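- The keyword-to-security-level policy described above could be sketched as follows; the keyword lists, level names and return values are illustrative assumptions rather than values fixed by this application.

```python
# Illustrative mapping of keywords to security levels and the corresponding
# number of voice components to verify; all keyword lists are example values.
SECURITY_LEVELS = {
    "highest": {"payment", "transfer"},
    "higher":  {"photography", "calling"},
    "lowest":  {"listening to songs", "navigation"},
}

def components_to_verify(text: str) -> int:
    """Return how many of the three voice components should undergo voiceprint recognition."""
    words = text.lower()
    if any(k in words for k in SECURITY_LEVELS["highest"]):
        return 3   # verify all three audio sources
    if any(k in words for k in SECURITY_LEVELS["higher"]):
        return 1   # verify a single component
    return 0       # no sensitive keyword: skip voiceprint recognition
```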
- the mobile phone may also preset one or more wake-up words to wake the mobile phone to turn on the voiceprint recognition function.
- the wake word can be "Hello, Little E”.
- the Bluetooth headset or the mobile phone can identify whether the voice information is a wake-up voice containing a wake-up word.
- For example, the Bluetooth headset can send the first voice component, the second voice component and the third voice component of the collected voice information to the mobile phone. If the mobile phone further recognizes that the voice information contains the above wake-up word, the mobile phone can turn on the voiceprint recognition function (for example, power on the voiceprint recognition chip). Subsequently, if the above-mentioned keywords are included in the voice information collected by the Bluetooth headset, the mobile phone can use the already enabled voiceprint recognition function to perform voiceprint recognition according to the method of step S705.
- the Bluetooth headset can further identify whether the voice information contains the above wake-up word. If the above wake-up word is included, it means that subsequent users may need to use the voiceprint recognition function. Then, the Bluetooth headset can send a start command to the mobile phone, so that the mobile phone can turn on the voiceprint recognition function in response to the start command.
- the mobile phone authenticates the user identity according to the first voiceprint recognition result, the second voiceprint recognition result and the third voiceprint recognition result.
- In step S706, after the mobile phone obtains, through voiceprint recognition, the first voiceprint recognition result corresponding to the first voice component, the second voiceprint recognition result corresponding to the second voice component and the third voiceprint recognition result corresponding to the third voice component, the three voiceprint recognition results can be integrated to authenticate the identity of the user who inputs the above voice information, thereby improving the accuracy and security of user identity authentication.
- For example, the first matching degree between the preset user's first registered voiceprint feature and the above-mentioned first voiceprint feature is the first voiceprint recognition result, the second matching degree between the preset user's second registered voiceprint feature and the above-mentioned second voiceprint feature is the second voiceprint recognition result, and the third matching degree between the preset user's third registered voiceprint feature and the above-mentioned third voiceprint feature is the third voiceprint recognition result.
- When the first matching degree is greater than a first threshold, the second matching degree is greater than a second threshold and the third matching degree is greater than a third threshold, the mobile phone determines that the user who utters the first voice component, the second voice component and the third voice component is a preset user; otherwise, the mobile phone may determine that the user who utters the first voice component, the second voice component and the third voice component is an illegal user.
- Alternatively, the mobile phone can calculate a weighted average of the first matching degree and the second matching degree, and when the weighted average is greater than a preset threshold, the mobile phone can determine that the user who utters the first voice component, the second voice component and the third voice component is a preset user; otherwise, the mobile phone may determine that the user who utters the first voice component, the second voice component and the third voice component is an illegal user.
- the mobile phone can use different authentication strategies in different voiceprint recognition scenarios. For example, when the collected voice information contains a keyword with the highest security level, the mobile phone may set the above-mentioned first threshold, second threshold and third threshold to 99 points. In this way, only when the first matching degree, the second matching degree and the third matching degree are all greater than 99 points, the mobile phone determines that the current uttering user is the preset user. When the collected voice information contains keywords with a lower security level, the mobile phone can set the above-mentioned first threshold, second threshold and third threshold to 85 points. In this way, when the first matching degree, the second matching degree and the third matching degree are all greater than 85 points, the mobile phone can determine that the current uttering user is the preset user. That is to say, for voiceprint recognition scenarios of different security levels, the mobile phone can use authentication policies of different security levels to authenticate the user's identity.
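- A minimal sketch of the security-level dependent authentication strategy described above; the 99-point and 85-point thresholds follow the example in the text, while the weighted-average variant and its weights are illustrative assumptions.

```python
def authenticate(m1: float, m2: float, m3: float, level: str) -> bool:
    """Require every matching degree to exceed a security-level dependent threshold."""
    threshold = 99 if level == "highest" else 85
    return m1 > threshold and m2 > threshold and m3 > threshold

def authenticate_weighted(m1: float, m2: float, m3: float,
                          weights=(0.4, 0.3, 0.3), threshold: float = 90) -> bool:
    """Alternative policy: weighted average compared against a single preset threshold."""
    score = weights[0] * m1 + weights[1] * m2 + weights[2] * m3
    return score > threshold
```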
- If the registered voiceprint features of one or more preset users are stored in the mobile phone, for example, the registered voiceprint features of preset user A, preset user B and preset user C are stored in the mobile phone, and each set of registered voiceprint features includes a first registered voiceprint feature, a second registered voiceprint feature and a third registered voiceprint feature, then the mobile phone can match the collected first voice component, second voice component and third voice component with the registered voiceprint features of each preset user respectively according to the above method. Furthermore, the mobile phone may determine the preset user (e.g., preset user A) that satisfies the above-mentioned authentication policy and has the highest matching degree as the uttering user at this time.
- Optionally, after the mobile phone receives the first voice component, the second voice component and the third voice component of the voice information sent by the Bluetooth headset, it can fuse the first voice component, the second voice component and the third voice component for voiceprint recognition, for example, calculate the matching degree between the fused first voice component, second voice component and third voice component and the preset user's voiceprint model. Furthermore, the mobile phone can authenticate the user identity according to this matching degree. Since the preset user's voiceprint model in this identity authentication method is integrated into one, the complexity of the voiceprint model and the required storage space are correspondingly reduced; and since the voiceprint feature information of the second voice component is still used, it also has dual voiceprint protection and liveness detection functions.
- the above algorithm for calculating the matching degree may be calculating the similarity.
- For example, the mobile phone performs feature extraction on the first voice component to obtain the first voiceprint feature, and respectively calculates a first similarity between the first voiceprint feature and the pre-stored first registered voiceprint feature of the preset user, a second similarity between the second voiceprint feature and the pre-stored second registered voiceprint feature of the preset user, and a third similarity between the third voiceprint feature and the pre-stored third registered voiceprint feature of the preset user, and then authenticates the user based on the first similarity, the second similarity and the third similarity.
- Similarity calculation methods include Euclidean distance, cosine similarity, Pearson correlation coefficient, adjusted cosine similarity, Hamming distance, Manhattan distance, etc., which are not limited in this application.
- The way of authenticating the user may be that the mobile phone determines, according to the decibel number of the ambient sound and the playback volume, a first fusion coefficient corresponding to the first similarity, a second fusion coefficient corresponding to the second similarity, and a third fusion coefficient corresponding to the third similarity; the first similarity, the second similarity and the third similarity are fused according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient to obtain a fusion similarity score. If the fusion similarity score is greater than a first threshold, the mobile phone determines that the user who inputs the voice information to the Bluetooth headset is a preset user.
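- The fused similarity score described above can be sketched as a weighted sum; the coefficient notation a1/b1/a2 follows Table 1-1, and the threshold value is an illustrative assumption.

```python
def fusion_score(s1: float, s2: float, s3: float,
                 a1: float, b1: float, a2: float) -> float:
    """Weighted fusion of the three similarities.

    a1, b1, a2 are the fusion coefficients of the in-ear, out-of-ear and
    bone-vibration similarities (notation as in Table 1-1).
    """
    return a1 * s1 + b1 * s2 + a2 * s3

def is_preset_user(s1: float, s2: float, s3: float,
                   coeffs: tuple, first_threshold: float = 0.8) -> bool:
    """Authentication passes when the fusion similarity score exceeds the first threshold."""
    return fusion_score(s1, s2, s3, *coeffs) > first_threshold
```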
- The decibel number of the ambient sound can be detected by the sound pressure sensor of the Bluetooth headset and sent to the mobile phone; the playback volume can be detected by the speaker of the Bluetooth headset and sent to the mobile phone, or it can be obtained by the mobile phone itself by calling its own data.
- Optionally, the second fusion coefficient is negatively correlated with the decibel number of the ambient sound, the first fusion coefficient and the third fusion coefficient are respectively negatively correlated with the decibel number of the playback volume, and the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value. That is to say, when the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a preset fixed value, the larger the decibel number of the ambient sound, the smaller the second fusion coefficient; correspondingly, the first fusion coefficient and the third fusion coefficient will increase adaptively, so as to keep the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient unchanged.
- The fusion coefficients in this implementation can be understood as dynamic. In other words, the fusion coefficients change dynamically with the ambient sound and the playback volume: the fusion coefficients are dynamically determined according to the decibel number of the ambient sound detected by the microphone and the detected playback volume.
- If the decibel number of the ambient sound is large, it means that the environment is noisy and the out-of-ear sensor of the Bluetooth headset is greatly affected by environmental noise, so the voice control method provided in this application needs to reduce the fusion coefficient corresponding to the out-of-ear sensor of the Bluetooth headset, so that the result of the fusion similarity score depends more on the in-ear sensor and the bone vibration sensor, which are less affected by environmental noise; on the contrary, if the playback volume is large, it means that the noise level of the playback sound in the ear canal is high and the in-ear sensor of the Bluetooth headset is greatly affected by the playback sound, so the voice control method provided by the present application needs to reduce the fusion coefficient corresponding to the in-ear sensor, so that the result of the fusion similarity score depends more on the out-of-ear sensor and the bone vibration sensor, which are less affected by the playback sound.
- Therefore, a look-up table can be set according to the above principles during system design, and in specific use, the fusion coefficients can be determined by looking up the table according to the monitored playback volume and the decibel number of the ambient sound.
- Table 1-1 shows an example.
- the fusion coefficients of the similarity scores of the speech signals collected by the in-ear speech sensor and the bone vibration sensor are denoted by a1 and a2, respectively, and the fusion coefficient of the similarity scores obtained by the speech signals collected by the out-of-ear speech sensor is denoted by b1.
- If the decibel number of the ambient sound is large, the external environment at this time can be considered noisy, the voice signal collected by the out-of-ear voice sensor will be mixed with more ambient noise, and the fusion coefficient corresponding to the voice signal collected by the out-of-ear voice sensor can be set to a lower value or directly set to 0.
- If the playback volume of the internal speaker of the headset exceeds 80% of the total volume, it can be considered that the volume inside the headset is too large, and the fusion coefficient corresponding to the voice signal collected by the in-ear voice sensor can be set to a lower value or directly set to 0.
- In Table 1-1, "volume 20%" refers to "volume 10%-30%", "volume 40%" refers to "volume 30%-50%", "ambient sound 20dB" refers to "ambient sound 10dB-30dB", and "ambient sound 40dB" refers to "ambient sound 30dB-50dB".
- It should be noted that the above-mentioned specific design is only an example, and the specific parameter settings, threshold settings and coefficients corresponding to different ambient sound decibels and speaker volumes can be designed and modified according to the actual situation, which is not limited in this application.
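- A coefficient look-up keyed by quantized playback volume and ambient sound level, in the spirit of Table 1-1, might look like the following sketch; all table entries and bucket boundaries are made-up examples, not the actual values of Table 1-1.

```python
# Illustrative fusion-coefficient look-up; every entry is an invented example.
FUSION_TABLE = {
    # (volume bucket, ambient bucket): (a1, b1, a2) for in-ear, out-of-ear, bone sensors
    ("volume 20%", "ambient 20dB"): (0.4, 0.4, 0.2),
    ("volume 20%", "ambient 40dB"): (0.5, 0.2, 0.3),
    ("volume 80%", "ambient 20dB"): (0.0, 0.6, 0.4),   # in-ear coefficient set to 0
    ("volume 80%", "ambient 40dB"): (0.1, 0.4, 0.5),
}

def lookup_coefficients(volume_pct: float, ambient_db: float) -> tuple:
    """Quantize the monitored playback volume and ambient sound, then look up coefficients."""
    vol_key = "volume 80%" if volume_pct >= 80 else "volume 20%"
    amb_key = "ambient 40dB" if ambient_db >= 30 else "ambient 20dB"
    return FUSION_TABLE[(vol_key, amb_key)]
```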
- the fusion coefficient provided in the embodiment of the present application can be understood as a "dynamic fusion coefficient", that is, the fusion coefficient can be dynamically adjusted according to different decibels of ambient sound and speaker volume.
- In a possible implementation, the strategy in S706 of performing identity authentication on the user based on the fusion of the first voiceprint recognition result, the second voiceprint recognition result and the third voiceprint recognition result may be changed to: directly fuse the audio features, extract a voiceprint feature from the fused audio feature through the voiceprint model, calculate the similarity between this voiceprint feature and the pre-stored registered voiceprint feature of the preset user, and then perform identity authentication.
- the audio features feaE1 and feaE2 of each frame are extracted from the speech signal of the current user collected by the in-ear speech sensor and the out-of-ear speech sensor.
- the audio feature feaB1 of each frame is extracted from the speech signal of the current user collected by the bone voiceprint sensor.
- The above audio features feaE1, feaE2 and feaB1 are fused by methods including but not limited to the following: normalize feaE1, feaE2 and feaB1 to obtain feaE1', feaE2' and feaB1', and then splice them into a feature vector fea = [feaE1', feaE2', feaB1']. The voiceprint feature of the feature vector fea is extracted through the voiceprint model to obtain the voiceprint feature of the current user. Similarly, for the registered voice of the registered user, the voiceprint feature of the registered user can be obtained by referring to the above method. The voiceprint feature of the current user is compared with the voiceprint feature of the registered user to obtain a similarity score, and the relationship between the similarity score and a preset threshold is determined to obtain an authentication result.
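- A minimal sketch of the feature-level fusion described above (normalize feaE1, feaE2 and feaB1, splice them, then score against the registered user); the voiceprint_model argument is a placeholder for the trained embedding extractor and the threshold is an illustrative assumption.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / (np.linalg.norm(x) + 1e-12)

def fuse_audio_features(feaE1: np.ndarray, feaE2: np.ndarray, feaB1: np.ndarray) -> np.ndarray:
    """Normalize per-sensor frame features and splice them: fea = [feaE1', feaE2', feaB1']."""
    return np.concatenate([l2_normalize(feaE1), l2_normalize(feaE2), l2_normalize(feaB1)])

def authenticate(fea_current: np.ndarray, fea_registered: np.ndarray,
                 voiceprint_model, threshold: float = 0.7) -> bool:
    """voiceprint_model is a placeholder for the trained voiceprint (embedding) model."""
    emb_cur = voiceprint_model(fea_current)
    emb_reg = voiceprint_model(fea_registered)
    score = float(np.dot(emb_cur, emb_reg) /
                  (np.linalg.norm(emb_cur) * np.linalg.norm(emb_reg) + 1e-12))
    return score > threshold
```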
- In another possible implementation, the strategy in S706 of performing identity authentication on the user based on the fusion of the first similarity, the second similarity and the third similarity may be changed to: the first voiceprint feature, the second voiceprint feature and the third voiceprint feature are fused to obtain a fused voiceprint feature, the similarity between the fused voiceprint feature and the pre-stored registered fused voiceprint feature of the preset user is calculated, and then identity authentication is performed.
- features are extracted from the voice signal of the current user collected from the in-ear voice sensor and the out-of-ear voice sensor through a voiceprint model, to obtain voiceprint features e1 and e2.
- the voiceprint feature b1 is obtained by extracting the features of the current user's voice signal collected from the bone voiceprint sensor through the voiceprint model.
- The voiceprint features e1, e2 and b1 are spliced to obtain the fused voiceprint feature of the current user. Similarly, for the registered voice of the registered user, the spliced voiceprint feature of the registered user can be obtained by referring to the above method. The spliced voiceprint feature of the current user is compared with the spliced voiceprint feature of the registered user to obtain a similarity score, and the relationship between the similarity score and a preset threshold is determined to obtain an authentication result.
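- The voiceprint-feature-level fusion described above (splicing e1, e2 and b1 and comparing with the registered fused feature) could be sketched as follows; the threshold is an illustrative assumption.

```python
import numpy as np

def fuse_embeddings(e1: np.ndarray, e2: np.ndarray, b1: np.ndarray) -> np.ndarray:
    """Splice the per-sensor voiceprint features into one fused voiceprint feature."""
    return np.concatenate([e1, e2, b1])

def verify(current_fused: np.ndarray, registered_fused: np.ndarray,
           threshold: float = 0.7) -> bool:
    """Compare the current user's fused feature with the registered fused feature."""
    score = float(np.dot(current_fused, registered_fused) /
                  (np.linalg.norm(current_fused) * np.linalg.norm(registered_fused) + 1e-12))
    return score > threshold
```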
- the mobile phone executes the operation instruction corresponding to the above-mentioned voice information.
- If the mobile phone determines that the user who input the above voice information is the preset user, the mobile phone can execute the operation instruction corresponding to the above voice information; if the authentication fails, the subsequent operation instruction is not executed.
- the operation instruction includes, but is not limited to, the unlocking operation of the mobile phone or the confirming payment operation. For example, when the above voice message is "Little E, use WeChat to pay", the corresponding operation instruction is to open the payment interface of the WeChat APP. In this way, after the mobile phone generates an operation instruction for opening the payment interface in the WeChat APP, the WeChat APP can be automatically opened, and the payment interface in the WeChat APP can be displayed.
- In addition, since the mobile phone has determined that the above-mentioned user is the preset user, as shown in FIG. 9, if the mobile phone is currently in a locked state, the mobile phone can also unlock the screen first, then execute the operation instruction to open the payment interface in the WeChat APP, and display the payment interface 901 in the WeChat APP.
- the voice control method provided in the above steps S701-S707 may be a function provided by the voice assistant APP.
- When the Bluetooth headset interacts with the mobile phone, if it is determined through voiceprint recognition that the uttering user at this time is the preset user, the mobile phone can send the generated operation instruction, or data such as the voice information, to the voice assistant APP running at the application layer. Furthermore, the voice assistant APP invokes the relevant interface or service of the application framework layer to execute the operation instruction corresponding to the above voice information.
- the voice control method provided in the embodiments of the present application can unlock the mobile phone and execute the relevant operation instructions in the voice information while using the voiceprint to identify the user's identity. That is, a user only needs to input a voice message once to complete a series of operations such as user identity authentication, unlocking the mobile phone, and opening a certain function of the mobile phone, thereby greatly improving the user's control efficiency and user experience on the mobile phone.
- the voice control method may include:
- a mobile phone and a Bluetooth headset establish a Bluetooth connection.
- the Bluetooth headset detects whether it is in a wearing state.
- The Bluetooth headset acquires the first voice component of the voice information input by the user by collecting it through the first voice sensor, collects the second voice component of the above voice information through the second voice sensor, and collects the third voice component of the voice information through the bone vibration sensor.
- For the specific methods by which the Bluetooth headset establishes a Bluetooth connection with the mobile phone, detects whether the Bluetooth headset is in a wearing state, and collects the first voice component, the second voice component and the third voice component of the voice information, refer to the related descriptions of the above steps S701-S703, which will not be repeated here.
- the Bluetooth headset can also perform operations such as enhancement, noise reduction or filtering on the detected first voice component and second voice component.
- In addition, since the Bluetooth headset has an audio playback function, when the speaker of the Bluetooth headset is working, the air conduction microphone and the bone conduction microphone on the Bluetooth headset may receive the echo signal of the sound source played by the speaker. Therefore, after the Bluetooth headset obtains the above-mentioned first voice component and second voice component, an adaptive echo cancellation (AEC) algorithm can also be used to eliminate the echo signals in the first voice component and the second voice component, so as to improve the accuracy of subsequent voiceprint recognition.
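- As a rough illustration of echo cancellation, the sketch below uses a small NLMS adaptive filter to subtract an estimate of the speaker echo from the microphone signal; this is a generic textbook formulation, not the specific AEC algorithm used by the headset, and the filter length and step size are illustrative.

```python
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     taps: int = 128, mu: float = 0.1) -> np.ndarray:
    """Subtract an adaptive estimate of the speaker echo ('ref') from 'mic'.

    mic and ref are float sample arrays of the same length (illustrative).
    """
    w = np.zeros(taps)                       # adaptive filter coefficients
    out = np.zeros(len(mic))                 # echo-cancelled output
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]            # most recent reference samples
        echo_est = np.dot(w, x)              # estimated echo at sample n
        e = mic[n] - echo_est                # error = echo-cancelled sample
        w += mu * e * x / (np.dot(x, x) + 1e-8)   # normalized LMS update
        out[n] = e
    return out
```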
- The Bluetooth headset performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, to obtain a first voiceprint recognition result corresponding to the first voice component, a second voiceprint recognition result corresponding to the second voice component and a third voiceprint recognition result corresponding to the third voice component.
- one or more voiceprint models and preset registered voiceprint features of the user may be pre-stored in the Bluetooth headset.
- In this way, the Bluetooth headset can use the voiceprint models stored locally in the Bluetooth headset to perform feature extraction on the first voice component, the second voice component and the third voice component respectively, obtain the voiceprint features corresponding to each voice component, and compare the obtained voiceprint features with the corresponding registered voiceprint features, thereby performing voiceprint recognition.
- For details, refer to the specific method in step S705 by which the mobile phone performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, which is not repeated here.
- the Bluetooth headset authenticates the user identity according to the first voiceprint recognition result, the second voiceprint recognition result and the third voiceprint recognition result.
- For the process in which the Bluetooth headset authenticates the user identity according to the first voiceprint recognition result, the second voiceprint recognition result and the third voiceprint recognition result, refer to the description in the above step S706 in which the mobile phone authenticates the user identity according to the first voiceprint recognition result, the second voiceprint recognition result and the third voiceprint recognition result, which will not be repeated here.
- the Bluetooth headset sends an operation instruction corresponding to the above-mentioned voice information to the mobile phone through a Bluetooth connection.
- If the Bluetooth headset determines that the user who inputs the voice information is a preset user, the Bluetooth headset can generate an operation instruction corresponding to the voice information.
- For examples of the operation instruction, refer to the examples of the operation instruction executed by the mobile phone in the above step S707, and details are not repeated here.
- The Bluetooth headset can also send a message indicating that the user's identity has been authenticated, or an unlocking instruction, to the mobile phone, so that the mobile phone can unlock the screen first and then execute the operation instruction corresponding to the above voice information.
- the Bluetooth headset can also send the collected voice information to the mobile phone, and the mobile phone generates a corresponding operation instruction according to the voice information, and executes the operation instruction.
- When the Bluetooth headset sends the above-mentioned voice information or the corresponding operation instruction to the mobile phone, it can also send its own device identification (e.g., MAC address) to the mobile phone. Since the identification of the preset Bluetooth device that has passed authentication is stored in the mobile phone, the mobile phone can determine whether the currently connected Bluetooth headset is the preset Bluetooth device according to the received device identification. If the Bluetooth headset is a preset Bluetooth device, the mobile phone can further execute the operation instruction sent by the Bluetooth headset, or perform voice recognition and other operations on the voice information sent by the Bluetooth headset; otherwise, the mobile phone can discard the operation instruction (or voice information) sent by the Bluetooth headset, to avoid security problems caused by an illegal Bluetooth device maliciously manipulating the mobile phone.
- Alternatively, the mobile phone and the preset Bluetooth device may agree in advance on a password or passcode for transmitting the above-mentioned operation instruction.
- In this way, when the Bluetooth headset sends the above-mentioned voice information or the corresponding operation instruction to the mobile phone, it can also send the pre-agreed password or passcode to the mobile phone, so that the mobile phone can determine whether the currently connected Bluetooth headset is a preset Bluetooth device.
- Alternatively, the mobile phone and the preset Bluetooth device may agree in advance on the encryption and decryption algorithms used when transmitting the above operation instruction.
- Before the Bluetooth headset sends the operation instruction, the operation instruction can be encrypted using the agreed encryption algorithm.
- After the mobile phone receives the encrypted operation instruction, if the operation instruction can be decrypted using the agreed decryption algorithm, it means that the currently connected Bluetooth headset is a preset Bluetooth device, and the mobile phone can further execute the operation instruction sent by the Bluetooth headset; otherwise, it indicates that the currently connected Bluetooth headset is an illegal Bluetooth device, and the mobile phone can discard the operation instruction sent by the Bluetooth headset.
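- As an illustration of an agreed encryption and decryption scheme, the sketch below uses symmetric encryption (Fernet from the Python cryptography package); the actual algorithm and key-provisioning method agreed between the mobile phone and the preset Bluetooth device are not specified here, so this is only an assumed example.

```python
from cryptography.fernet import Fernet, InvalidToken

# In practice, the shared key would be provisioned to both devices in advance.
shared_key = Fernet.generate_key()

def headset_encrypt(instruction: bytes) -> bytes:
    """Encrypt the operation instruction with the agreed (assumed) algorithm."""
    return Fernet(shared_key).encrypt(instruction)

def phone_try_decrypt(token: bytes):
    """Decryptable: preset Bluetooth device; not decryptable: discard the instruction."""
    try:
        return Fernet(shared_key).decrypt(token)
    except InvalidToken:
        return None
```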
- steps S701-S707 and steps S1001-S1007 are only two implementation manners of the voice control method provided in this application. It can be understood that those skilled in the art can set which steps in the foregoing embodiments are performed by the Bluetooth headset and which steps are performed by the mobile phone according to actual application scenarios or actual experience, which is not limited in this embodiment of the present application.
- the voice control method provided by the present application may also use a server as an execution subject, that is, the Bluetooth headset establishes a connection with the server, and the server implements the functions of the mobile phone in the above embodiment, and the specific process is not repeated here.
- For example, after the Bluetooth headset performs voiceprint recognition on the first voice component, the second voice component and the third voice component, the obtained first voiceprint recognition result, second voiceprint recognition result and third voiceprint recognition result are sent to the mobile phone, and the mobile phone performs user identity authentication and other operations based on the voiceprint recognition results.
- The Bluetooth headset can also determine, after acquiring the first voice component, the second voice component and the third voice component, whether voiceprint recognition needs to be performed on them. If voiceprint recognition needs to be performed on the first voice component, the second voice component and the third voice component, the Bluetooth headset can send the first voice component, the second voice component and the third voice component to the mobile phone, and the mobile phone then completes subsequent operations such as voiceprint recognition and user identity authentication; otherwise, the Bluetooth headset does not need to send the first voice component, the second voice component and the third voice component to the mobile phone, so as to avoid increasing the power consumption of the mobile phone for processing the first voice component, the second voice component and the third voice component.
- the user can also enter the setting interface 1101 of the mobile phone to enable or disable the above-mentioned voice control function.
- The user can set the keywords that trigger voice control through the setting button 1102, such as "Xiao E", "payment", etc. The user can also manage the preset user's voiceprint model through the setting button 1103, for example, adding or deleting a preset user's voiceprint model. In addition, the user can use the setting button 1104 to set the operation instructions that the voice assistant can support, such as payment, making a phone call, ordering a meal, and so on. In this way, users can obtain a customized voice control experience.
- the embodiments of the present application disclose a voice control device.
- the voice control device includes a voice information acquisition unit 1201 , an identification unit 1202 , an identity information acquisition unit 1203 and an execution unit 1204.
- The voice control device itself can be a terminal or a wearable device; the voice control device can be fully integrated into the wearable device, or the wearable device and the terminal can be combined into a voice control system, that is, some of the units are located in the wearable device and some of the units are located in the terminal.
- the voice control device may be fully integrated into a Bluetooth headset.
- the voice information obtaining unit 1201 is used to obtain the voice information of the user.
- the user can input voice information into the Bluetooth headset when wearing the Bluetooth headset.
- the in-ear voice sensor collects the first voice component
- the out-of-ear voice sensor collects the second voice component
- the bone vibration sensor collects the third voice component.
- The recognition unit 1202 is configured to perform voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, to obtain a first voiceprint recognition result corresponding to the first voice component, a second voiceprint recognition result corresponding to the second voice component and a third voiceprint recognition result corresponding to the third voice component.
- the identification unit 1202 may also be used to perform keyword detection on the voice information input by the user to the Bluetooth headset.
- When the voice information includes preset keywords, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively; alternatively, the recognition unit 1202 may be configured to detect user input, and when a preset operation input by the user is received, perform voiceprint recognition on the first voice component, the second voice component and the third voice component respectively.
- the user input may be the user's input to the Bluetooth headset through a touch screen or a key, for example, the user clicks an unlock key of the Bluetooth headset.
- Optionally, the acquisition unit 1201 may also acquire the wearing status detection result, and when the wearing status detection result passes, the recognition unit 1202 performs keyword detection on the voice information, or detects user input.
- The identification unit 1202 is specifically configured to: perform feature extraction on the first voice component to obtain the first voiceprint feature, and calculate a first similarity between the first voiceprint feature and the preset user's first registered voiceprint feature, where the first registered voiceprint feature is obtained by feature extraction of the first registered voice through the first voiceprint model and is used to reflect the audio features of the preset user collected by the in-ear voice sensor; perform feature extraction on the second voice component to obtain the second voiceprint feature, and calculate a second similarity between the second voiceprint feature and the preset user's second registered voiceprint feature, where the second registered voiceprint feature is obtained by feature extraction of the second registered voice through the second voiceprint model and is used to reflect the audio features of the preset user collected by the out-of-ear voice sensor; and perform feature extraction on the third voice component to obtain the third voiceprint feature, and calculate a third similarity between the third voiceprint feature and the preset user's third registered voiceprint feature, where the third registered voiceprint feature is obtained by feature extraction of the third registered voice through the third voiceprint model and is used to reflect the audio features of the preset user collected by the bone vibration sensor.
- the first registered voiceprint feature is obtained by feature extraction through the first voiceprint model, and the first registered voiceprint feature is used to reflect the preset user's voiceprint collected by the in-ear voice sensor feature;
- the second registered voiceprint feature is obtained by feature extraction through the second voiceprint model, and the second registered voiceprint feature is used to reflect the voiceprint feature of the preset user collected by the out-of-ear voice sensor;
- the third registered voiceprint feature is obtained by feature extraction through the third voiceprint model, and the third registered voiceprint feature is used to reflect the voiceprint feature of the preset user collected by the bone vibration sensor.
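- As an illustrative sketch of how one of these per-sensor similarities could be computed, assuming the voiceprint model outputs a fixed-length embedding and that cosine similarity is used as the scoring function (the application does not prescribe a specific similarity measure):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between a test voiceprint feature and a registered one."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical usage: `model_in_ear` stands for the first voiceprint model.
# test_feature = model_in_ear.extract(voice_info.in_ear)
# sim1 = cosine_similarity(test_feature, registered_feature_in_ear)
```

- The same computation would then be repeated with the second and third voiceprint models for the out-of-ear and bone-vibration components to obtain the second and third similarities.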
- the identity information obtaining unit 1203 is used to obtain user identity information for user identity authentication. Specifically, according to the decibel level of the ambient sound and the playback volume, it determines the first fusion coefficient corresponding to the first similarity, the second fusion coefficient corresponding to the second similarity, and the third fusion coefficient corresponding to the third similarity; according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient, the first similarity, the second similarity and the third similarity are fused to obtain a fusion similarity score. If the fusion similarity score is greater than the first threshold, the mobile phone determines that the user who inputs the voice information to the Bluetooth headset is the preset user.
- the decibel level of the ambient sound is detected by the sound pressure sensor of the Bluetooth headset, and the playback volume can be determined from the playback signal of the Bluetooth headset's speaker.
- the second fusion coefficient is negatively correlated with the decibel level of the ambient sound
- the first fusion coefficient and the third fusion coefficient are each negatively correlated with the decibel level of the playback volume
- the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value. That is to say, when the sum of the three fusion coefficients is a preset fixed value, the larger the decibel level of the ambient sound, the smaller the second fusion coefficient
- correspondingly, the first fusion coefficient and the third fusion coefficient will increase adaptively, so as to keep the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient unchanged;
- the above-mentioned variable fusion coefficients can take recognition accuracy in different application scenarios into account (for example, in a high-noise environment or when the earphones are playing music), as illustrated by the sketch below.
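- The following sketch shows one possible realization of the dynamic fusion described above. The linear mapping from decibel levels to raw weights and the threshold value are assumptions chosen only to satisfy the stated monotonic relationships and the fixed-sum constraint; the embodiment does not specify concrete formulas:

```python
import numpy as np

def fusion_coefficients(ambient_db: float, playback_db: float,
                        total: float = 1.0) -> tuple[float, float, float]:
    """Illustrative dynamic fusion coefficients (linear weighting is an assumption)."""
    # Raw weights: the out-of-ear weight (w2) falls as ambient noise rises,
    # while the in-ear (w1) and bone-vibration (w3) weights fall as playback volume rises.
    w1 = 1.0 / (1.0 + playback_db / 40.0)
    w2 = 1.0 / (1.0 + ambient_db / 40.0)
    w3 = 1.0 / (1.0 + playback_db / 40.0)
    s = w1 + w2 + w3
    # Normalize so the three coefficients always sum to the fixed value `total`.
    return (total * w1 / s, total * w2 / s, total * w3 / s)

def fused_score(sims, coeffs) -> float:
    """Weighted fusion of the three per-sensor similarities."""
    return float(np.dot(sims, coeffs))

# Example: noisy street, low playback volume.
c = fusion_coefficients(ambient_db=80.0, playback_db=20.0)
score = fused_score([0.83, 0.55, 0.78], c)
is_preset_user = score > 0.7   # 0.7 is a hypothetical first threshold
```

- With this normalization, a louder environment shrinks the out-of-ear coefficient while the in-ear and bone-vibration coefficients grow correspondingly, keeping the sum fixed, which matches the behavior described above.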
- the execution unit 1204 is configured to execute the operation instruction corresponding to the voice information, and the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an instruction to open an application, or a call instruction.
- the voice control method provided by the above-mentioned embodiments of the present application adds a method for collecting voiceprint features through an in-ear voice sensor.
- when the headset is worn, the ear canal forms a closed cavity, and sound is amplified to some extent in the cavity, that is, the cavity effect. Therefore, the sound collected by the in-ear voice sensor is clearer, especially for high-frequency sound signals. This compensates for the distortion caused by the loss of high-frequency signal components when the bone vibration sensor collects voice information, improves the headset's overall voiceprint collection effect and the accuracy of voiceprint recognition, and thereby improves user experience.
- dynamic fusion coefficients are used to fuse the voiceprint recognition results obtained from speech signals with different attributes.
- the complementarity of speech signals with different attributes can improve the robustness and accuracy of voiceprint recognition. For example, recognition accuracy can be significantly improved in a noisy environment or when music is playing through the earphones.
- voice signals with different attributes can also be understood as voice signals obtained by different sensors (in-ear voice sensor, out-of-ear voice sensor, bone vibration sensor).
- FIG. 13 is a schematic diagram of a wearable device 1300 provided by an embodiment of the present application.
- the wearable device shown in FIG. 13 includes a memory 1301 , a processor 1302 , a communication interface 1303 , a bus 1304 , an in-ear voice sensor 1305 , an out-of-ear voice sensor 1306 , and a bone vibration sensor 1307 .
- the memory 1301 , the processor 1302 , and the communication interface 1303 are connected to each other through the bus 1304 for communication.
- the memory 1301 is coupled to the processor 1302, and the memory 1301 is used to store computer program codes.
- the computer program codes include computer instructions. When the processor 1302 executes the computer instructions, the wearable device can execute the voice control method described in the above embodiments.
- the in-ear voice sensor 1305 is used to collect the first voice component of the voice information
- the out-of-ear voice sensor 1306 is used to collect the second voice component of the voice information
- the bone vibration sensor 1307 is used to collect the third voice component of the voice information.
- the memory 1301 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
- the memory 1301 may store a program. When the program stored in the memory 1301 is executed by the processor 1302, the processor 1302 and the communication interface 1303 are used to execute each step of the voice control method of the embodiment of the present application.
- the processor 1302 may be a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a graphics processing unit (graphics processing unit, GPU), or one or more integrated circuits, configured to execute relevant programs to realize the functions required to be performed by the units in the voice control apparatus of the embodiments of the present application, or to execute the voice control method of the method embodiments of the present application.
- the processor 1302 can also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the voice control method of the present application can be completed by an integrated logic circuit of hardware in the processor 1302 or an instruction in the form of software.
- the above-mentioned processor 1302 may also be a general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
- the storage medium is located in the memory 1301; the processor 1302 reads the information in the memory 1301 and, in combination with its hardware, completes the functions required to be performed by the units included in the voice control apparatus of the embodiments of the present application, or executes the voice control method of the method embodiments of the present application.
- the communication interface 1303 uses a transceiver apparatus such as, but not limited to, a transceiver, and can perform wired or wireless communication, so as to implement communication between the wearable device 1300 and other devices or a communication network.
- the wearable device can establish a communication connection with the terminal device through the communication interface 1303 .
- Bus 1304 may include a pathway for communicating information between various components of device 1300 (eg, memory 1301, processor 1302, communication interface 1303).
- FIG. 14 is a schematic diagram of a terminal provided by an embodiment of the present application.
- the terminal shown in FIG. 14 includes a touch screen 1401 , a processor 1402 , a memory 1403 , one or more computer programs 1404 , a bus 1405 , and a communication interface 1408 .
- the touch screen 1401 includes a touch-sensitive surface 1406 and a display screen 1407, and the terminal may also include one or more application programs (not shown).
- the various devices described above may be connected by one or more communication buses 1405 .
- the memory 1403 is coupled to the processor 1402, and the memory 1403 is used for storing computer program codes.
- the computer program codes include computer instructions. When the processor 1402 executes the computer instructions, the terminal can execute the voice control method described in the above embodiments.
- the touch screen 1401 is used to interact with the user, and can receive input information from the user.
- the user provides input to the phone through the touch-sensitive surface 1406, for example, by clicking an unlock key displayed on the touch-sensitive surface 1406 of the phone.
- the memory 1403 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
- the memory 1403 may store a program. When the program stored in the memory 1403 is executed by the processor 1402, the processor 1402 and the communication interface 1408 are used to execute each step of the voice control method of the embodiment of the present application.
- the processor 1402 may be a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a graphics processing unit (graphics processing unit, GPU), or one or more integrated circuits, configured to execute relevant programs to realize the functions required to be performed by the units in the voice control apparatus of the embodiments of the present application, or to execute the voice control method of the method embodiments of the present application.
- the processor 1402 can also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the voice control method of the present application can be completed by an integrated logic circuit of hardware in the processor 1402 or an instruction in the form of software.
- the above-mentioned processor 1402 may also be a general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
- the storage medium is located in the memory 1403; the processor 1402 reads the information in the memory 1403 and, in combination with its hardware, completes the functions required to be performed by the units included in the voice control apparatus of the embodiments of the present application, or executes the voice control method of the method embodiments of the present application.
- the communication interface 1408 uses a transceiver apparatus such as, but not limited to, a transceiver, and can perform wired or wireless communication, so as to realize communication between the terminal 1400 and other devices or a communication network.
- the terminal may establish a communication connection with the wearable device through the communication interface 1408 .
- Bus 1405 may include a pathway for communicating information between various components of device 1400 (eg, touch screen 1401, memory 1403, processor 1402, communication interface 1408).
- although the wearable device 1300 and the terminal 1400 shown in FIG. 13 and FIG. 14 only show a memory, a processor, a communication interface, and the like, in a specific implementation process those skilled in the art should understand that the wearable device 1300 and the terminal 1400 also include other components necessary for proper operation. Meanwhile, according to specific needs, those skilled in the art should understand that the wearable device 1300 and the terminal 1400 may further include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the wearable device 1300 and the terminal 1400 may also include only the devices necessary for implementing the embodiments of the present application, and need not include all the devices shown in FIG. 13 or FIG. 14.
- FIG. 15 is a schematic diagram of the chip system.
- the chip system includes at least one processor 1501 , at least one interface circuit 1502 and a bus 1503 .
- the processor 1501 and the interface circuit 1502 may be interconnected by wires.
- interface circuit 1502 may be used to receive signals from other devices, such as the memory of a voice-controlled device.
- the interface circuit 1502 may be used to send signals to other devices (eg, the processor 1501).
- the interface circuit 1502 may read the instructions stored in the memory and send the instructions to the processor 1501 .
- when the processor 1501 executes the instructions, the voice control device can be made to execute the steps in the above embodiments.
- the chip system may also include other discrete devices, which are not specifically limited in this embodiment of the present application.
- Another embodiment of the present application further provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium, and when the computer instructions are run on the voice control device, the voice control device executes each step performed by the recognition device in the method flow shown in the above method embodiments.
- Another embodiment of the present application further provides a computer program product, where computer instructions are stored in the computer program product, and when the instructions are run on the voice control device, the recognition device executes each step performed by the recognition device in the method flow shown in the above method embodiments.
- the disclosed methods may be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium or on other non-transitory media or articles of manufacture.
- the computer program product is provided using a signal bearing medium.
- the signal bearing medium may include one or more program instructions, which, when executed by one or more processors, may implement the functions of the voice control method of the embodiments of the present application.
- for example, one or more features of S701-S707 in FIG. 7 may be undertaken by one or more instructions associated with the signal bearing medium.
- the signal bearing medium may include a computer readable medium such as, but not limited to, a hard disk drive, a compact disc (CD), a digital video disc (DVD), a digital tape, a memory, a read-only memory (read-only memory, ROM), or a random access memory (random access memory, RAM), etc.
- the signal bearing medium may include computer recordable media such as, but not limited to, memory, read/write (R/W) CDs, R/W DVDs, and the like.
- signal bearing media may include communication media such as, but not limited to, digital and/or analog communication media (eg, fiber optic cables, waveguides, wired communication links, wireless communication links, etc.).
- a signal-bearing medium may be conveyed by a wireless form of communication medium (e.g., one that conforms to the IEEE 802.16 standard or other transmission protocol).
- the one or more program instructions may be, for example, computer-executable instructions or logic-implemented instructions.
- Each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
- the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
- a computer-readable storage medium includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage medium includes: flash memory, removable hard disk, read-only memory, random access memory, magnetic disk or optical disk and other media that can store program codes.
- the disclosed system, apparatus and method may be implemented in other manners.
- the apparatus embodiments described above are only illustrative.
- the division of the units is only a logical function division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- the above is only a specific implementation of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto; any changes or substitutions within the technical scope disclosed in the embodiments of the present application shall be covered by the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Signal Processing (AREA)
- Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- User Interface Of Digital Computer (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
Description
Claims (23)
- A voice control method, characterized by comprising: acquiring voice information of a user, where the voice information includes a first voice component, a second voice component and a third voice component, the first voice component is collected by an in-ear voice sensor, the second voice component is collected by an out-of-ear voice sensor, and the third voice component is collected by a bone vibration sensor; performing voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; obtaining identity information of the user according to a voiceprint recognition result of the first voice component, a voiceprint recognition result of the second voice component and a voiceprint recognition result of the third voice component; and when the identity information of the user matches preset information, executing an operation instruction, where the operation instruction is determined according to the voice information.
- The voice control method according to claim 1, characterized in that before performing voiceprint recognition on the first voice component, the second voice component and the third voice component, the method further comprises: performing keyword detection on the voice information, or detecting user input.
- The voice control method according to claim 2, characterized in that before performing keyword detection on the voice information or detecting user input, the method further comprises: acquiring a wearing status detection result of the wearable device.
- The voice control method according to any one of claims 1-3, characterized in that performing voiceprint recognition on the first voice component specifically comprises: performing feature extraction on the first voice component to obtain a first voiceprint feature, and calculating a first similarity between the first voiceprint feature and a first registered voiceprint feature of the user, where the first registered voiceprint feature is obtained by performing feature extraction on a first registered voice through a first voiceprint model, and the first registered voiceprint feature is used to reflect a preset audio feature of the user collected by the in-ear voice sensor.
- The voice control method according to any one of claims 1-3, characterized in that performing voiceprint recognition on the second voice component specifically comprises: performing feature extraction on the second voice component to obtain a second voiceprint feature, and calculating a second similarity between the second voiceprint feature and a second registered voiceprint feature of the user, where the second registered voiceprint feature is obtained by performing feature extraction on a second registered voice through a second voiceprint model, and the second registered voiceprint feature is used to reflect a preset audio feature of the user collected by the out-of-ear voice sensor.
- The voice control method according to any one of claims 1-3, characterized in that performing voiceprint recognition on the third voice component specifically comprises: performing feature extraction on the third voice component to obtain a third voiceprint feature, and calculating a third similarity between the third voiceprint feature and a third registered voiceprint feature of the user, where the third registered voiceprint feature is obtained by performing feature extraction on a third registered voice through a third voiceprint model, and the third registered voiceprint feature is used to reflect a preset audio feature of the user collected by the bone vibration sensor.
- The voice control method according to any one of claims 1-6, characterized in that obtaining the identity information of the user according to the voiceprint recognition result of the first voice component, the voiceprint recognition result of the second voice component and the voiceprint recognition result of the third voice component specifically comprises: determining a first fusion coefficient corresponding to the first similarity, a second fusion coefficient corresponding to the second similarity, and a third fusion coefficient corresponding to the third similarity; fusing the first similarity, the second similarity and the third similarity according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient to obtain a fusion similarity score; and if the fusion similarity score is greater than a first threshold, determining that the identity information of the user matches preset identity information.
- The voice control method according to claim 7, characterized in that determining the first fusion coefficient, the second fusion coefficient and the third fusion coefficient specifically comprises: obtaining a decibel level of ambient sound according to a sound pressure sensor; determining a playback volume according to a playback signal of a speaker; and determining the first fusion coefficient, the second fusion coefficient and the third fusion coefficient respectively according to the decibel level of the ambient sound and the playback volume, where: the second fusion coefficient is negatively correlated with the decibel level of the ambient sound, the first fusion coefficient and the third fusion coefficient are each negatively correlated with the decibel level of the playback volume, and the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value.
- The voice control method according to any one of claims 1-8, characterized in that the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an instruction to open an application, or a call instruction.
- A voice control apparatus, characterized by comprising: a voice information acquisition unit, configured to acquire voice information of a user, where the voice information includes a first voice component, a second voice component and a third voice component, the first voice component is collected by an in-ear voice sensor, the second voice component is collected by an out-of-ear voice sensor, and the third voice component is collected by a bone vibration sensor; a recognition unit, configured to perform voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; an identity information obtaining unit, configured to obtain identity information of the user according to a voiceprint recognition result of the first voice component, a voiceprint recognition result of the second voice component and a voiceprint recognition result of the third voice component; and an execution unit, configured to execute an operation instruction when the identity information of the user matches preset information, where the operation instruction is determined according to the voice information.
- The voice control apparatus according to claim 10, characterized in that the voice information acquisition unit is further configured to: perform keyword detection on the voice information, or detect user input.
- The voice control apparatus according to claim 11, characterized in that the voice information acquisition unit is further configured to: acquire a wearing status detection result of the wearable device.
- The voice control apparatus according to any one of claims 10-12, characterized in that the recognition unit is specifically configured to: perform feature extraction on the first voice component to obtain a first voiceprint feature, and calculate a first similarity between the first voiceprint feature and a first registered voiceprint feature of the user, where the first registered voiceprint feature is obtained by performing feature extraction on a first registered voice through a first voiceprint model, and the first registered voiceprint feature is used to reflect a preset audio feature of the user collected by the in-ear voice sensor;
- The voice control apparatus according to any one of claims 10-12, characterized in that the recognition unit is specifically configured to: perform feature extraction on the second voice component to obtain a second voiceprint feature, and calculate a second similarity between the second voiceprint feature and a second registered voiceprint feature of the user, where the second registered voiceprint feature is obtained by performing feature extraction on a second registered voice through a second voiceprint model, and the second registered voiceprint feature is used to reflect a preset audio feature of the user collected by the out-of-ear voice sensor;
- The voice control apparatus according to any one of claims 10-12, characterized in that the recognition unit is specifically configured to: perform feature extraction on the third voice component to obtain a third voiceprint feature, and calculate a third similarity between the third voiceprint feature and a third registered voiceprint feature of the user, where the third registered voiceprint feature is obtained by performing feature extraction on a third registered voice through a third voiceprint model, and the third registered voiceprint feature is used to reflect a preset audio feature of the user collected by the bone vibration sensor.
- The voice control apparatus according to any one of claims 10-15, characterized in that the identity information obtaining unit is specifically configured to: determine a first fusion coefficient corresponding to the first similarity, a second fusion coefficient corresponding to the second similarity, and a third fusion coefficient corresponding to the third similarity; fuse the first similarity, the second similarity and the third similarity according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient to obtain a fusion similarity score; and if the fusion similarity score is greater than a first threshold, determine that the identity information of the user matches preset identity information.
- The voice control apparatus according to claim 16, characterized in that the identity information obtaining unit is specifically configured to: obtain a decibel level of ambient sound according to a sound pressure sensor; determine a playback volume according to a playback signal of a speaker; and determine the first fusion coefficient, the second fusion coefficient and the third fusion coefficient respectively according to the decibel level of the ambient sound and the playback volume, where: the second fusion coefficient is negatively correlated with the decibel level of the ambient sound, the first fusion coefficient and the third fusion coefficient are each negatively correlated with the decibel level of the playback volume, and the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value.
- The voice control apparatus according to any one of claims 10-17, characterized in that the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an instruction to open an application, or a call instruction.
- A wearable device, characterized in that the wearable device comprises an in-ear voice sensor, an out-of-ear voice sensor, a bone vibration sensor, a memory and a processor; the in-ear voice sensor is configured to collect a first voice component of voice information, the out-of-ear voice sensor is configured to collect a second voice component of the voice information, and the bone vibration sensor is configured to collect a third voice component of the voice information; the memory is coupled to the processor; the memory is configured to store computer program code, and the computer program code includes computer instructions; when the processor executes the computer instructions, the wearable device executes the voice control method according to any one of claims 1-9.
- A terminal, characterized in that the terminal comprises a memory and a processor; the memory is coupled to the processor; the memory is configured to store computer program code, and the computer program code includes computer instructions; when the processor executes the computer instructions, the terminal executes the voice control method according to any one of claims 1-9.
- A chip system, characterized in that the chip system is applied to an electronic device; the chip system comprises one or more interface circuits and one or more processors; the interface circuit and the processor are interconnected by wires; the interface circuit is configured to receive a signal from a memory of the electronic device and send the signal to the processor, where the signal includes computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device executes the voice control method according to any one of claims 1-9.
- A computer-readable storage medium, characterized by comprising computer instructions, which, when run on a voice control apparatus, cause the voice control apparatus to execute the voice control method according to any one of claims 1-9.
- A computer program product, characterized by comprising computer instructions, which, when run on a voice control apparatus, cause the voice control apparatus to execute the voice control method according to any one of claims 1-9.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22774067.7A EP4297023A4 (en) | 2021-03-24 | 2022-03-11 | VOICE CONTROL METHOD AND APPARATUS |
JP2023558328A JP2024510779A (ja) | 2021-03-24 | 2022-03-11 | 音声制御方法及び装置 |
US18/471,702 US20240013789A1 (en) | 2021-03-24 | 2023-09-21 | Voice control method and apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110313304.3 | 2021-03-24 | ||
CN202110313304.3A CN115132212A (zh) | 2021-03-24 | 2021-03-24 | 一种语音控制方法和装置 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/471,702 Continuation US20240013789A1 (en) | 2021-03-24 | 2023-09-21 | Voice control method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022199405A1 true WO2022199405A1 (zh) | 2022-09-29 |
Family
ID=83373864
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/080436 WO2022199405A1 (zh) | 2021-03-24 | 2022-03-11 | 一种语音控制方法和装置 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20240013789A1 (zh) |
EP (1) | EP4297023A4 (zh) |
JP (1) | JP2024510779A (zh) |
CN (1) | CN115132212A (zh) |
WO (1) | WO2022199405A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117116258A (zh) * | 2023-04-12 | 2023-11-24 | 荣耀终端有限公司 | 一种语音唤醒方法及电子设备 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117133281B (zh) * | 2023-01-16 | 2024-06-28 | 荣耀终端有限公司 | 语音识别方法和电子设备 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102084668A (zh) * | 2008-05-22 | 2011-06-01 | 伯恩同通信有限公司 | 处理信号的方法和系统 |
CN106713569A (zh) * | 2016-12-27 | 2017-05-24 | 广东小天才科技有限公司 | 一种可穿戴设备的操作控制方法及可穿戴设备 |
US20180324518A1 (en) * | 2017-05-04 | 2018-11-08 | Apple Inc. | Automatic speech recognition triggering system |
CN111432303A (zh) * | 2020-03-19 | 2020-07-17 | 清华大学 | 单耳耳机、智能电子设备、方法和计算机可读介质 |
CN111916101A (zh) * | 2020-08-06 | 2020-11-10 | 大象声科(深圳)科技有限公司 | 一种融合骨振动传感器和双麦克风信号的深度学习降噪方法及系统 |
CN112017696A (zh) * | 2020-09-10 | 2020-12-01 | 歌尔科技有限公司 | 耳机的语音活动检测方法、耳机及存储介质 |
CN112420035A (zh) * | 2018-06-29 | 2021-02-26 | 华为技术有限公司 | 一种语音控制方法、可穿戴设备及终端 |
-
2021
- 2021-03-24 CN CN202110313304.3A patent/CN115132212A/zh active Pending
-
2022
- 2022-03-11 JP JP2023558328A patent/JP2024510779A/ja active Pending
- 2022-03-11 EP EP22774067.7A patent/EP4297023A4/en active Pending
- 2022-03-11 WO PCT/CN2022/080436 patent/WO2022199405A1/zh active Application Filing
-
2023
- 2023-09-21 US US18/471,702 patent/US20240013789A1/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102084668A (zh) * | 2008-05-22 | 2011-06-01 | 伯恩同通信有限公司 | 处理信号的方法和系统 |
CN106713569A (zh) * | 2016-12-27 | 2017-05-24 | 广东小天才科技有限公司 | 一种可穿戴设备的操作控制方法及可穿戴设备 |
US20180324518A1 (en) * | 2017-05-04 | 2018-11-08 | Apple Inc. | Automatic speech recognition triggering system |
CN112420035A (zh) * | 2018-06-29 | 2021-02-26 | 华为技术有限公司 | 一种语音控制方法、可穿戴设备及终端 |
CN111432303A (zh) * | 2020-03-19 | 2020-07-17 | 清华大学 | 单耳耳机、智能电子设备、方法和计算机可读介质 |
CN111916101A (zh) * | 2020-08-06 | 2020-11-10 | 大象声科(深圳)科技有限公司 | 一种融合骨振动传感器和双麦克风信号的深度学习降噪方法及系统 |
CN112017696A (zh) * | 2020-09-10 | 2020-12-01 | 歌尔科技有限公司 | 耳机的语音活动检测方法、耳机及存储介质 |
Non-Patent Citations (1)
Title |
---|
See also references of EP4297023A4 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117116258A (zh) * | 2023-04-12 | 2023-11-24 | 荣耀终端有限公司 | 一种语音唤醒方法及电子设备 |
Also Published As
Publication number | Publication date |
---|---|
JP2024510779A (ja) | 2024-03-11 |
EP4297023A1 (en) | 2023-12-27 |
EP4297023A4 (en) | 2024-07-10 |
CN115132212A (zh) | 2022-09-30 |
US20240013789A1 (en) | 2024-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102525294B1 (ko) | 음성 제어 방법, 웨어러블 디바이스 및 단말 | |
WO2022033556A1 (zh) | 电子设备及其语音识别方法和介质 | |
CN111131601B (zh) | 一种音频控制方法、电子设备、芯片及计算机存储介质 | |
WO2022199405A1 (zh) | 一种语音控制方法和装置 | |
CN110070863A (zh) | 一种语音控制方法及装置 | |
WO2021114953A1 (zh) | 语音信号的采集方法、装置、电子设备以及存储介质 | |
US20190147890A1 (en) | Audio peripheral device | |
WO2022022585A1 (zh) | 电子设备及其音频降噪方法和介质 | |
US20200278832A1 (en) | Voice activation for computing devices | |
US20230239800A1 (en) | Voice Wake-Up Method, Electronic Device, Wearable Device, and System | |
CN113643707A (zh) | 一种身份验证方法、装置和电子设备 | |
CN113299309A (zh) | 语音翻译方法及装置、计算机可读介质和电子设备 | |
CN114360206B (zh) | 一种智能报警方法、耳机、终端和系统 | |
WO2023124248A1 (zh) | 声纹识别方法和装置 | |
WO2023207185A1 (zh) | 声纹识别方法、图形界面及电子设备 | |
CN113506566B (zh) | 声音检测模型训练方法、数据处理方法以及相关装置 | |
WO2022007757A1 (zh) | 跨设备声纹注册方法、电子设备及存储介质 | |
CN115731923A (zh) | 命令词响应方法、控制设备及装置 | |
US11393449B1 (en) | Methods and apparatus for obtaining biometric data | |
WO2022252858A1 (zh) | 一种语音控制方法及电子设备 | |
WO2022233239A1 (zh) | 一种升级方法、装置及电子设备 | |
CN116530944B (zh) | 声音处理方法及电子设备 | |
US20220261218A1 (en) | Electronic device including speaker and microphone and method for operating the same | |
CN117953872A (zh) | 语音唤醒模型更新方法、存储介质、程序产品及设备 | |
CN116935858A (zh) | 声纹识别方法和装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22774067 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2022774067 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2023558328 Country of ref document: JP |
|
ENP | Entry into the national phase |
Ref document number: 2022774067 Country of ref document: EP Effective date: 20230920 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |