WO2022199405A1 - 一种语音控制方法和装置 (A voice control method and apparatus) - Google Patents

一种语音控制方法和装置 (A voice control method and apparatus)

Info

Publication number
WO2022199405A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
voiceprint
user
feature
component
Prior art date
Application number
PCT/CN2022/080436
Other languages
English (en)
French (fr)
Inventor
徐嘉明
郎玥
萨出荣贵
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP22774067.7A (published as EP4297023A4)
Priority to JP2023558328A (published as JP2024510779A)
Publication of WO2022199405A1
Priority to US18/471,702 (published as US20240013789A1)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/10: Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
    • G10L 17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/21: Speech or voice analysis techniques where the extracted parameters are power information

Definitions

  • the present application relates to the technical field of audio processing, and in particular, to a voice control method and device.
  • A bone vibration sensor is a common voice sensor. When sound propagates through bone, it causes the bone to vibrate; the bone vibration sensor senses this vibration and converts the vibration signal into an electrical signal, thereby collecting sound.
  • the present application provides a voice control method and device, which can solve the problem of inaccurate voiceprint recognition caused by loss of high-frequency components when a bone vibration sensor is used.
  • the present application provides a voice control method, including: acquiring voice information of a user, the voice information including a first voice component, a second voice component and a third voice component, where the first voice component is collected by the in-ear voice sensor, the second voice component is collected by the out-of-ear voice sensor, and the third voice component is collected by the bone vibration sensor; performing voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; obtaining the user's identity information according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component in the voice information; and, when the user's identity information matches the preset information, executing the operation instruction, wherein the operation instruction is determined according to the voice information.
  • Since the wearable device uses the in-ear voice sensor when collecting sound, it can compensate for the distortion caused by the loss of some high-frequency signal components when the bone vibration sensor collects voice information, which improves the wearable device's overall voiceprint collection effect and the accuracy of voiceprint recognition, thereby improving user experience.
  • Before voiceprint recognition is performed, the voice components need to be obtained separately; acquiring multiple channels of voice components can improve the accuracy and anti-interference capability of voiceprint recognition. An illustrative sketch of one way to hold the three components follows.
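  • As an illustration only (this structure is not part of the patent text), a minimal Python sketch of how the three voice components of one utterance might be represented, assuming each sensor yields a mono PCM buffer resampled to a common rate:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VoiceInfo:
    """Hypothetical container for the three voice components of one utterance."""
    in_ear: np.ndarray        # first voice component, from the in-ear voice sensor
    out_of_ear: np.ndarray    # second voice component, from the out-of-ear voice sensor
    bone: np.ndarray          # third voice component, from the bone vibration sensor
    sample_rate: int = 16000  # assumed common sample rate after resampling

    def components(self):
        """Return the three channels in the order used throughout this description."""
        return (self.in_ear, self.out_of_ear, self.bone)
```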
  • Before performing voiceprint recognition on the first voice component, the second voice component and the third voice component, the method further includes: performing keyword detection on the voice information, or detecting user input.
  • When the voice information includes a preset keyword, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively; or, when a preset operation input by the user is received, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively. Otherwise, it means that the user does not need voiceprint recognition at this time, and the terminal or wearable device does not need to enable the voiceprint recognition function, thereby reducing the power consumption of the terminal or wearable device.
  • Before performing keyword detection on the voice information or detecting user input, the method further includes: acquiring a wearing state detection result of the wearable device. When the wearing state detection result is passed, keyword detection is performed on the voice information, or user input is detected. Otherwise, it means that the user is not wearing the wearable device at this time and there is of course no need for voiceprint recognition, so the terminal or wearable device does not need to enable the keyword detection function, thereby reducing the power consumption of the terminal or wearable device. A sketch of this gating logic follows.
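  • A minimal sketch of the gating described above; the helper callables `is_worn`, `detect_keyword` and `preset_operation_received` are hypothetical names introduced here for illustration, not interfaces defined by the patent:

```python
def should_run_voiceprint_recognition(voice_info,
                                      is_worn,
                                      detect_keyword,
                                      preset_operation_received) -> bool:
    """Decide whether the comparatively expensive voiceprint recognition should run.

    The checks mirror the order described above: the wearing-state check gates
    keyword detection, and a detected keyword (or a preset user operation)
    gates voiceprint recognition.
    """
    if not is_worn():                       # wearing state detection result
        return False                        # device not worn: skip keyword detection entirely
    if detect_keyword(voice_info):          # voice information contains a preset keyword
        return True
    if preset_operation_received():         # e.g. a preset button press or touch gesture
        return True
    return False                            # keep the voiceprint recognition function disabled
```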
  • the specific process of performing voiceprint recognition on the first voice component is: performing feature extraction on the first voice component to obtain a first voiceprint feature, and calculating a first similarity between the first voiceprint feature and the user's first registered voiceprint feature, where the first registered voiceprint feature is obtained by feature extraction of the first registered voice through the first voiceprint model and is used to reflect the user's preset audio features as collected by the in-ear voice sensor. Voiceprint recognition is performed by calculating the similarity, which can improve the accuracy of voiceprint recognition.
  • the specific process of performing voiceprint recognition on the second voice component is: performing feature extraction on the second voice component to obtain a second voiceprint feature, and calculating a second similarity between the second voiceprint feature and the user's second registered voiceprint feature, where the second registered voiceprint feature is obtained by feature extraction of the second registered voice through the second voiceprint model and is used to reflect the user's preset audio features as collected by the out-of-ear voice sensor. Voiceprint recognition is performed by calculating the similarity, which can improve the accuracy of voiceprint recognition.
  • the specific process of performing voiceprint recognition on the third voice component is: performing feature extraction on the third voice component to obtain a third voiceprint feature, and calculating a third similarity between the third voiceprint feature and the user's third registered voiceprint feature, where the third registered voiceprint feature is obtained by feature extraction of the third registered voice through the third voiceprint model and is used to reflect the user's preset audio features as collected by the bone vibration sensor. Voiceprint recognition is performed by calculating the similarity, which can improve the accuracy of voiceprint recognition. A sketch of this per-channel similarity computation follows.
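  • A minimal sketch of the per-channel similarity computation, assuming the (unspecified) voiceprint model is available as a callable `extract_voiceprint` that returns an embedding vector, and using cosine similarity as one possible similarity measure; both choices are assumptions rather than details given in the patent:

```python
import numpy as np

def voiceprint_similarity(voice_component: np.ndarray,
                          registered_feature: np.ndarray,
                          extract_voiceprint) -> float:
    """Extract a voiceprint feature from one voice component and score it against
    the user's registered voiceprint feature for the same sensor channel."""
    feature = extract_voiceprint(voice_component)   # e.g. an embedding vector
    num = float(np.dot(feature, registered_feature))
    den = float(np.linalg.norm(feature) * np.linalg.norm(registered_feature)) + 1e-12
    return num / den                                 # cosine similarity, roughly in [-1, 1]
```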
  • The voiceprint recognition results are used to obtain the user's identity information.
  • The user's identity information can be obtained by fusing the voiceprint recognition results by means of dynamic fusion coefficients, specifically: determining a first fusion coefficient corresponding to the first similarity, a second fusion coefficient corresponding to the second similarity, and a third fusion coefficient corresponding to the third similarity; and fusing the first similarity, the second similarity and the third similarity according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient to obtain a fusion similarity score. If the fusion similarity score is greater than the first threshold, it is determined that the user's identity information matches the preset identity information. A sketch of this fusion step follows.
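  • A sketch of the fusion step, taking a weighted sum as one natural reading of "fusing the similarities according to the fusion coefficients"; the threshold value is purely illustrative:

```python
def identity_matches(sim1: float, sim2: float, sim3: float,
                     w1: float, w2: float, w3: float,
                     first_threshold: float = 0.8) -> bool:
    """Fuse the per-channel similarities with their fusion coefficients and
    compare the fused score against the first threshold."""
    fusion_score = w1 * sim1 + w2 * sim2 + w3 * sim3
    return fusion_score > first_threshold
```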
  • The first fusion coefficient, the second fusion coefficient and the third fusion coefficient may be determined as follows: the decibel level of the ambient sound is obtained from the sound pressure sensor; the playback volume is determined from the playback signal of the speaker; and the first fusion coefficient, the second fusion coefficient and the third fusion coefficient are determined respectively according to the ambient sound decibel level and the playback volume, where the second fusion coefficient is negatively correlated with the ambient sound decibel level, the first fusion coefficient and the third fusion coefficient are each negatively correlated with the playback volume, and the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value. A sketch of one possible weighting scheme follows.
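  • The patent only states the qualitative constraints (the out-of-ear coefficient falls as ambient noise rises, the in-ear and bone coefficients fall as playback volume rises, and the three coefficients sum to a fixed value); the mapping below is one hypothetical way to satisfy those constraints, not the patented formula:

```python
def fusion_coefficients(ambient_db: float, playback_db: float,
                        total: float = 1.0) -> tuple:
    """Return (w1, w2, w3) for the in-ear, out-of-ear and bone-vibration channels.

    w2 shrinks as ambient noise grows; w1 and w3 shrink as playback volume grows;
    the weights are renormalised so that w1 + w2 + w3 == total (a fixed value).
    """
    raw2 = 1.0 / (1.0 + max(ambient_db, 0.0) / 40.0)    # negatively correlated with ambient dB
    raw1 = 1.0 / (1.0 + max(playback_db, 0.0) / 40.0)   # negatively correlated with playback dB
    raw3 = raw1                                          # bone channel treated like the in-ear channel here
    s = raw1 + raw2 + raw3
    return (total * raw1 / s, total * raw2 / s, total * raw3 / s)
```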
  • The above-mentioned sound pressure sensor and speaker are the sound pressure sensor and speaker of the wearable device.
  • Dynamic fusion coefficients are used to fuse the voiceprint recognition results obtained from speech signals with different attributes; exploiting the complementarity of these signals for voiceprint recognition can improve the robustness and accuracy of voiceprint recognition. For example, the recognition accuracy can be significantly improved in a noisy environment or while music is being played through the headphones.
  • voice signals with different attributes can also be understood as voice signals obtained by different sensors (in-ear voice sensor, out-of-ear voice sensor, bone vibration sensor).
  • the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an application program opening instruction or a call instruction.
  • the user only needs to input voice information once to complete a series of operations such as user identity authentication and execution of a certain function, thereby greatly improving the user's control efficiency and user experience.
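  • A toy mapping from recognized command text to one of the operation instructions listed above; the command phrases and instruction labels are hypothetical placeholders, not values from the patent:

```python
from typing import Optional

# Hypothetical mapping from transcribed command text to an operation instruction.
OPERATION_INSTRUCTIONS = {
    "unlock my phone": "UNLOCK",
    "pay now": "PAYMENT",
    "power off": "SHUTDOWN",
    "open the camera": "OPEN_APP:camera",
    "call mom": "CALL:mom",
}

def operation_instruction_for(recognized_text: str) -> Optional[str]:
    """Map the transcribed voice command to an operation instruction.

    This lookup would only be executed after the fused voiceprint score has
    confirmed that the user's identity matches the preset information.
    """
    return OPERATION_INSTRUCTIONS.get(recognized_text.strip().lower())
```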
  • the present application provides a voice control method.
  • the voice control method is applied to a wearable device.
  • the execution subject of the voice control method is a wearable device.
  • the method is specifically as follows: the wearable device acquires the user's voice information, where the voice information includes a first voice component, a second voice component and a third voice component; the first voice component is collected by the in-ear voice sensor, the second voice component is collected by the out-of-ear voice sensor, and the third voice component is collected by the bone vibration sensor. The wearable device performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, and obtains the user's identity information according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component in the voice information. When the user's identity information matches the preset information, the operation instruction is executed, wherein the operation instruction is determined according to the voice information.
  • Since the wearable device uses the in-ear voice sensor when collecting sound, it can compensate for the distortion caused by the loss of some high-frequency signal components when the bone vibration sensor collects voice information, which improves the wearable device's overall voiceprint collection effect and the accuracy of voiceprint recognition, thereby improving user experience.
  • Before the wearable device performs voiceprint recognition, it needs to obtain the voice components separately.
  • The wearable device obtains the three voice components through different sensors, namely the in-ear voice sensor, the out-of-ear voice sensor and the bone vibration sensor, which can improve the accuracy and anti-interference capability of the wearable device's voiceprint recognition.
  • Before the wearable device performs voiceprint recognition on the first voice component, the second voice component and the third voice component, the method further includes: the wearable device performs keyword detection on the voice information, or detects user input.
  • When the voice information includes a preset keyword, the wearable device performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; or, when a preset operation input by the user is received, the wearable device performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively. Otherwise, it means that the user does not need voiceprint recognition at this time, and the wearable device does not need to enable the voiceprint recognition function, thereby reducing the power consumption of the wearable device.
  • Before the wearable device performs keyword detection on the voice information or detects user input, the method further includes: acquiring a wearing state detection result of the wearable device. When the wearing state detection result is passed, keyword detection is performed on the voice information, or user input is detected. Otherwise, it means that the user is not wearing the wearable device at this time and there is of course no need for voiceprint recognition, so the wearable device does not need to enable the keyword detection function, thereby reducing the power consumption of the wearable device.
  • the specific process of the wearable device performing voiceprint recognition on the first voice component is as follows:
  • the wearable device performs feature extraction on the first voice component to obtain the first voiceprint feature, and the wearable device calculates the first similarity between the first voiceprint feature and the user's first registered voiceprint feature, where the first registered voiceprint feature is obtained by feature extraction of the first registered voice through the first voiceprint model and is used to reflect the preset audio features of the user as collected by the in-ear voice sensor.
  • Voiceprint recognition is performed by calculating the similarity, which can improve the accuracy of voiceprint recognition.
  • the specific process of the wearable device performing voiceprint recognition on the second voice component is as follows:
  • the wearable device performs feature extraction on the second voice component to obtain the second voiceprint feature, and the wearable device calculates the second similarity between the second voiceprint feature and the user's second registered voiceprint feature, where the second registered voiceprint feature is obtained by feature extraction of the second registered voice through the second voiceprint model and is used to reflect the preset audio features of the user as collected by the out-of-ear voice sensor.
  • Voiceprint recognition is performed by calculating the similarity, which can improve the accuracy of voiceprint recognition.
  • the specific process of performing voiceprint recognition on the third voice component by the wearable device is as follows:
  • the wearable device performs feature extraction on the third voice component to obtain the third voiceprint feature, and the wearable device calculates the third similarity between the third voiceprint feature and the user's third registered voiceprint feature, where the third registered voiceprint feature is obtained by feature extraction of the third registered voice through the third voiceprint model and is used to reflect the preset audio features of the user as collected by the bone vibration sensor.
  • Voiceprint recognition is performed by calculating the similarity, which can improve the accuracy of voiceprint recognition.
  • The wearable device obtains the user's identity information according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component, and the third voiceprint recognition result of the third voice component in the voice information.
  • the identity information of the user can be obtained by fusing each voiceprint recognition result by means of a dynamic fusion coefficient, which can be specifically:
  • the wearable device determines the first fusion coefficient corresponding to the first similarity, the second fusion coefficient corresponding to the second similarity, and the third fusion coefficient corresponding to the third similarity; the wearable device fuses the first similarity, the second similarity and the third similarity according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient to obtain a fusion similarity score. If the fusion similarity score is greater than the first threshold, it is determined that the user's identity information matches the preset identity information. By fusing multiple similarities into a fusion similarity score and making the judgment on it, the accuracy of voiceprint recognition can be effectively improved.
  • The wearable device determines the first fusion coefficient, the second fusion coefficient and the third fusion coefficient. Specifically, the decibel level of the ambient sound may be obtained from the sound pressure sensor; the playback volume is determined from the playback signal of the speaker; and the first fusion coefficient, the second fusion coefficient and the third fusion coefficient are determined respectively according to the ambient sound decibel level and the playback volume, where the second fusion coefficient is negatively correlated with the ambient sound decibel level, the first fusion coefficient and the third fusion coefficient are each negatively correlated with the playback volume, and the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value.
  • The above-mentioned sound pressure sensor and speaker are the sound pressure sensor and speaker of the wearable device.
  • Dynamic fusion coefficients are used to fuse the voiceprint recognition results obtained from speech signals with different attributes; exploiting the complementarity of these signals for voiceprint recognition can improve the robustness and accuracy of voiceprint recognition. For example, the recognition accuracy can be significantly improved in a noisy environment or while music is being played through the headphones.
  • voice signals with different attributes can also be understood as voice signals obtained by different sensors (in-ear voice sensor, out-of-ear voice sensor, bone vibration sensor).
  • The wearable device sends an indication to the terminal, and the terminal executes an operation instruction corresponding to the voice information; the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an application program opening instruction or a call instruction.
  • the user only needs to input voice information once to complete a series of operations such as user identity authentication and execution of a certain function, thereby greatly improving the user's control efficiency and user experience of the wearable device.
  • the present application provides a voice control method.
  • the voice control method is applied to a terminal.
  • the execution subject of the voice control method is a terminal.
  • the method is specifically as follows: acquiring the user's voice information, where the voice information includes a first voice component, a second voice component and a third voice component; the first voice component is collected by the in-ear voice sensor, the second voice component is collected by the out-of-ear voice sensor, and the third voice component is collected by the bone vibration sensor; the terminal performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; the terminal obtains the user's identity information according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component in the voice information; and, when the user's identity information matches the preset information, the terminal executes the operation instruction, wherein the operation instruction is determined according to the voice information.
  • Since the wearable device uses the in-ear voice sensor when collecting voice, it can compensate for the distortion caused by the loss of some high-frequency signal components when the bone vibration sensor collects voice information, which improves the terminal's overall voiceprint collection effect and the accuracy of voiceprint recognition, thereby improving user experience.
  • After acquiring the voice information input by the user, the wearable device sends the voice components corresponding to the voice information to the terminal, so that the terminal can perform voiceprint recognition on the voice components.
  • Executing the voice control method on the terminal side can effectively utilize the computing power of the terminal, and can still ensure the accuracy of identity authentication when the computing power of the wearable device is insufficient.
  • Before the terminal performs voiceprint recognition, the terminal needs to obtain the voice components separately.
  • The wearable device obtains the three voice components through different sensors, namely the in-ear voice sensor, the out-of-ear voice sensor and the bone vibration sensor, and sends them to the terminal, which can improve the accuracy and anti-interference capability of the terminal's voiceprint recognition.
  • Before the terminal performs voiceprint recognition on the first voice component, the second voice component and the third voice component, the method further includes: performing keyword detection on the voice information, or detecting user input.
  • When the voice information includes a preset keyword, the wearable device sends the voice components corresponding to the voice information to the terminal, and the terminal performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; or, when a preset operation input by the user is received, the terminal performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively. Otherwise, it means that the user does not need voiceprint recognition at this time, and the terminal does not need to enable the voiceprint recognition function, thereby reducing the power consumption of the terminal.
  • Before the wearable device performs keyword detection on the voice information or detects user input, the method further includes: acquiring a wearing state detection result of the wearable device. When the wearing state detection result is passed, keyword detection is performed on the voice information, or user input is detected. Otherwise, it means that the user is not wearing the wearable device at this time and there is of course no need for voiceprint recognition, so the wearable device does not need to enable the keyword detection function, thereby reducing the power consumption of the wearable device.
  • the specific process for the terminal to perform voiceprint recognition on the first voice component is as follows:
  • the terminal performs feature extraction on the first voice component to obtain the first voiceprint feature, and the terminal calculates the first similarity between the first voiceprint feature and the user's first registered voiceprint feature, where the first registered voiceprint feature is obtained by feature extraction of the first registered voice through the first voiceprint model and is used to reflect the user's preset audio features as collected by the in-ear voice sensor.
  • Voiceprint recognition is performed by calculating the similarity, which can improve the accuracy of voiceprint recognition.
  • the specific process for the terminal to perform voiceprint recognition on the second voice component is as follows:
  • the terminal performs feature extraction on the second voice component to obtain the second voiceprint feature, and the terminal calculates the second similarity between the second voiceprint feature and the user's second registered voiceprint feature, where the second registered voiceprint feature is obtained by feature extraction of the second registered voice through the second voiceprint model and is used to reflect the user's preset audio features as collected by the out-of-ear voice sensor.
  • Voiceprint recognition is performed by calculating the similarity, which can improve the accuracy of voiceprint recognition.
  • the specific process for the terminal to perform voiceprint recognition on the third voice component is as follows:
  • the terminal performs feature extraction on the third voice component to obtain the third voiceprint feature, and the terminal calculates the third similarity between the third voiceprint feature and the user's third registered voiceprint feature, where the third registered voiceprint feature is obtained by feature extraction of the third registered voice through the third voiceprint model and is used to reflect the user's preset audio features as collected by the bone vibration sensor.
  • Voiceprint recognition is performed by calculating the similarity, which can improve the accuracy of voiceprint recognition.
  • The terminal obtains the user's identity information according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component, and the third voiceprint recognition result of the third voice component in the voice information. The user's identity information can be obtained by fusing the voiceprint recognition results by means of dynamic fusion coefficients, specifically:
  • the terminal determines the first fusion coefficient corresponding to the first similarity, the second fusion coefficient corresponding to the second similarity, and the third fusion coefficient corresponding to the third similarity; the terminal fuses the first similarity, the second similarity and the third similarity according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient to obtain a fusion similarity score. If the fusion similarity score is greater than the first threshold, it is determined that the user's identity information matches the preset identity information.
  • The terminal determines the first fusion coefficient, the second fusion coefficient and the third fusion coefficient. Specifically, the decibel level of the ambient sound may be obtained from the sound pressure sensor, and the playback volume may be determined from the playback signal of the speaker; after the wearable device detects the ambient sound decibel level and the playback volume, it sends the data to the terminal, and the terminal determines the first fusion coefficient, the second fusion coefficient and the third fusion coefficient respectively according to the ambient sound decibel level and the playback volume, where the second fusion coefficient is negatively correlated with the ambient sound decibel level, the first fusion coefficient and the third fusion coefficient are each negatively correlated with the playback volume, and the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value.
  • The above-mentioned sound pressure sensor and speaker are the sound pressure sensor and speaker of the wearable device.
  • Dynamic fusion coefficients are used to fuse the voiceprint recognition results obtained from speech signals with different attributes; exploiting the complementarity of these signals for voiceprint recognition can improve the robustness and accuracy of voiceprint recognition. For example, the recognition accuracy can be significantly improved in a noisy environment or while music is being played through the headphones.
  • voice signals with different attributes can also be understood as voice signals obtained by different sensors (in-ear voice sensor, out-of-ear voice sensor, bone vibration sensor).
  • the terminal executes an operation instruction corresponding to the voice information, and the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an application program opening instruction or a call instruction.
  • the user only needs to input voice information once to complete a series of operations such as user identity authentication and executing a certain function of the wearable device, thereby greatly improving the user's control efficiency and user experience on the wearable terminal.
  • the present application provides a voice control device, comprising: a voice information acquisition unit, configured to acquire the voice information of a user, the voice information including a first voice component, a second voice component and a third voice component, where the first voice component is collected by the in-ear voice sensor, the second voice component is collected by the out-of-ear voice sensor, and the third voice component is collected by the bone vibration sensor; a recognition unit, configured to perform voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; an identity information acquisition unit, configured to obtain the user's identity information according to the voiceprint recognition result of the first voice component, the voiceprint recognition result of the second voice component and the voiceprint recognition result of the third voice component; and an execution unit, configured to execute an operation instruction when the user's identity information matches the preset information, wherein the operation instruction is determined according to the voice information.
  • Since the wearable device uses the in-ear voice sensor when collecting sound, it can compensate for the distortion caused by the loss of some high-frequency signal components when the bone vibration sensor collects voice information, which improves the overall voiceprint collection effect and the accuracy of voiceprint recognition, thereby improving user experience. Before the voiceprint recognition results are obtained, the voice components need to be obtained separately; acquiring multiple channels of voice components can improve the accuracy and anti-interference capability of voiceprint recognition.
  • the voice information acquiring unit is further configured to: perform keyword detection on the voice information, or detect user input. When the voice information includes a preset keyword, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively; or, when a preset operation input by the user is received, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively. Otherwise, it means that the user does not need voiceprint recognition at this time, and the terminal or wearable device does not need to enable the voiceprint recognition function, thereby reducing the power consumption of the terminal or wearable device.
  • the voice information acquisition unit is further configured to: acquire the wearing state detection result of the wearable device.
  • the wearing state detection result is passed, keyword detection is performed on the voice information, or user input is detected. Otherwise, it means that the user is not wearing the wearable device at this time, and of course there is no need for voiceprint recognition, so the terminal or wearable device does not need to enable the keyword detection function, thereby reducing the power consumption of the terminal or wearable device.
  • The recognition unit is specifically configured to: perform feature extraction on the first voice component to obtain a first voiceprint feature, and calculate a first similarity between the first voiceprint feature and the user's first registered voiceprint feature, where the first registered voiceprint feature is obtained by feature extraction of the first registered voice through the first voiceprint model and is used to reflect the user's preset audio features collected by the in-ear voice sensor; perform feature extraction on the second voice component to obtain a second voiceprint feature, and calculate a second similarity between the second voiceprint feature and the user's second registered voiceprint feature, where the second registered voiceprint feature is obtained by feature extraction of the second registered voice through the second voiceprint model and is used to reflect the user's preset audio features collected by the out-of-ear voice sensor; and perform feature extraction on the third voice component to obtain a third voiceprint feature, and calculate a third similarity between the third voiceprint feature and the user's third registered voiceprint feature, where the third registered voiceprint feature is obtained by feature extraction of the third registered voice through the third voiceprint model and is used to reflect the user's preset audio features collected by the bone vibration sensor.
  • Voiceprint recognition is performed by calculating the similarity, which can improve the accuracy of voiceprint recognition.
  • the identity information acquisition unit may obtain the identity information by means of dynamic fusion coefficients, and is specifically configured to: determine the first fusion coefficient corresponding to the first similarity, the second fusion coefficient corresponding to the second similarity, and the third fusion coefficient corresponding to the third similarity; fuse the first similarity, the second similarity and the third similarity according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient to obtain a fusion similarity score; and, if the fusion similarity score is greater than the first threshold, determine that the user's identity information matches the preset identity information.
  • the identity information acquisition unit is specifically configured to: obtain the decibel level of the ambient sound from the sound pressure sensor; determine the playback volume from the playback signal of the speaker; and determine the first fusion coefficient, the second fusion coefficient and the third fusion coefficient respectively according to the ambient sound decibel level and the playback volume, where the second fusion coefficient is negatively correlated with the ambient sound decibel level, the first fusion coefficient and the third fusion coefficient are each negatively correlated with the playback volume, and the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value.
  • Dynamic fusion coefficients are used to fuse the voiceprint recognition results obtained from speech signals with different attributes; exploiting the complementarity of these signals for voiceprint recognition can improve the robustness and accuracy of voiceprint recognition. For example, the recognition accuracy can be significantly improved in a noisy environment or while music is being played through the headphones.
  • voice signals with different attributes can also be understood as voice signals obtained by different sensors (in-ear voice sensor, out-of-ear voice sensor, bone vibration sensor).
  • the execution unit is specifically configured to: execute an operation instruction corresponding to the voice information, where the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an application program opening instruction or a call instruction.
  • the user only needs to input voice information once to complete a series of operations such as user identity authentication and execution of a certain function, thereby greatly improving the user's control efficiency and user experience.
  • the voice control device provided in the fourth aspect of the present application can be understood as a terminal or a wearable device, which depends on the execution subject of the voice control method, which is not limited in the present application.
  • the present application provides a wearable device, comprising: an in-ear voice sensor, an out-of-ear voice sensor, a bone vibration sensor, a memory, and a processor; the in-ear voice sensor is used to collect a first voice component of voice information, the out-of-ear voice sensor is used to collect a second voice component of the voice information, and the bone vibration sensor is used to collect a third voice component of the voice information; the memory is coupled to the processor; the memory is used to store computer program code, and the computer program code includes computer instructions; when the processor executes the computer instructions, the wearable device executes the voice control method of any one of the first aspect or its possible implementations, or the third aspect or its possible implementations.
  • the present application provides a terminal, comprising a memory and a processor; the memory and the processor are coupled; the memory is used to store computer program code, and the computer program code includes computer instructions; when the processor executes the computer instructions, the terminal executes the voice control method of any one of the first aspect or its possible implementations, or the third aspect or its possible implementations.
  • the present application provides a chip system, which is applied to an electronic device; the chip system includes one or more interface circuits and one or more processors; the interface circuits and the processors are interconnected through lines; the interface circuit is used to receive a signal from the memory of the electronic device and send the signal to the processor, where the signal includes the computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device executes the voice control method of the first aspect or any of its possible implementations.
  • the present application provides a computer storage medium, comprising computer instructions which, when run on a voice control device, cause the voice control device to perform the voice control method of any one of the first aspect or its possible implementations.
  • the present application provides a computer program product comprising computer instructions which, when run on a voice control device, cause the voice control device to perform the voice control method of any one of the first aspect or its possible implementations.
  • the wearable device of the fifth aspect, the terminal of the sixth aspect, the chip system of the seventh aspect, the computer storage medium of the eighth aspect, and the computer program product of the ninth aspect are all used to execute the corresponding methods provided above; therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding methods provided above, which will not be repeated here.
  • FIG. 1 is a schematic diagram of the hardware structure of a mobile phone according to an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a mobile phone software provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a wearable device according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a voice control system provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a server according to an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a voiceprint recognition provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a voice control method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a sensor setting area provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a payment interface provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another voice control method provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a mobile phone setting interface provided by an embodiment of the application.
  • FIG. 12 is a schematic diagram of a voice control device according to an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a wearable device provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a terminal according to an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a chip system provided by an embodiment of the present application.
  • The terms "first" and "second" are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features. Features defined as "first" or "second" may explicitly or implicitly include one or more of such features. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments described herein can be implemented in sequences other than those illustrated or described herein.
  • “plurality” means two or more.
  • A voiceprint is the spectrum of sound waves carrying speech information, as displayed by electroacoustic instruments.
  • Voiceprint has the characteristics of stability, measurability and uniqueness. After adulthood, the human voice can remain relatively stable for a long time.
  • the vocal organs that people use when speaking are very different in size and shape, so any two people have different voiceprints, and different people's voices have different distributions of formants in the spectrogram.
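  • Since a voiceprint is described above as a spectrum of the speech signal, the following sketch shows one common way such a time-frequency representation can be computed (a magnitude spectrogram via a short-time Fourier transform); the window and hop sizes are illustrative only:

```python
import numpy as np
from scipy.signal import spectrogram

def magnitude_spectrogram(waveform: np.ndarray, sample_rate: int = 16000):
    """Compute a magnitude spectrogram of a mono speech waveform.

    Formants (the spectral peaks mentioned above) appear as horizontal bands
    of energy in the returned time-frequency matrix.
    """
    freqs, times, sxx = spectrogram(
        waveform,
        fs=sample_rate,
        window="hann",
        nperseg=400,     # 25 ms analysis window at 16 kHz
        noverlap=240,    # 15 ms overlap, i.e. a 10 ms hop
    )
    return freqs, times, np.sqrt(sxx)   # sxx is a power spectrogram; sqrt gives magnitude
```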
  • Voiceprint recognition judges whether two utterances come from the same person by comparing the speakers' voices on the same phonemes, thereby realizing the function of "recognizing a person by their voice".
  • Voiceprint recognition extracts voiceprint information from the speech signal produced by the speaker. From an application perspective, it can be divided into: speaker identification (SI, Speaker Identification), which determines which one of several people spoke a given piece of speech (a "multiple-choice" problem); and speaker verification (SV, Speaker Verification), which confirms whether a given piece of speech was spoken by a designated person (a "one-to-one" discrimination problem). This application is primarily concerned with speaker verification techniques.
  • the voiceprint recognition technology can be applied to end user identification scenarios, and can also be applied to household head identification scenarios for home security, which is not limited in this application.
  • Conventional voiceprint recognition technology performs recognition on one or two channels of collected voice signals; that is, a preset user is determined only if the voiceprint recognition results of the two channels of voice components both match.
  • Such voiceprint recognition has two problems. First, the voice components collected in a multi-speaker scene or against a background of strong interfering environmental noise will interfere with the voiceprint recognition result, resulting in inaccurate or even wrong identity authentication. Second, the voiceprint recognition performance can be degraded and the identity authentication result misjudged. That is, the existing voiceprint recognition technology cannot suppress noise from all directions well, which reduces the accuracy of voiceprint recognition.
  • an embodiment of the present application provides a voice control method.
  • the subject executing the method of this embodiment may be a terminal; the terminal establishes a connection with a wearable device, can obtain the voice information collected by the wearable device, and performs voiceprint recognition on the voice information.
  • the subject performing the method of this embodiment may also be the wearable device itself, and the wearable device itself includes a processor with computing capability, which can directly perform voiceprint recognition on the collected voice information.
  • the main body executing the method of this embodiment may also be a server, and the server establishes a connection with the wearable device, can obtain the voice information collected by the wearable device, and performs voiceprint recognition on the voice information.
  • the main body that executes the method of this embodiment may be determined according to the computing power of the wearable device chip.
  • When the computing power of the wearable device chip is high, the wearable device can perform the method of this embodiment; when the computing power of the wearable device chip is low, the method of this embodiment may be performed by the terminal device connected to the wearable device, or by a server connected to the wearable device.
  • The embodiments of the present application are described in detail below by taking, as examples of the execution body of the method in this embodiment, the terminal connected to the wearable device, the wearable device itself, and the server connected to the wearable device.
  • A terminal device, also called user equipment (UE), mobile station (MS), mobile terminal (MT), etc., is a device that provides voice and/or data connectivity to the user.
  • Examples include handheld devices, in-vehicle devices and other devices with wireless connectivity.
  • Some examples of terminal devices are: mobile phones, tablet computers, notebook computers, PDAs, mobile internet devices (MID), wearable devices, virtual reality (VR) devices, augmented reality (AR) devices, wireless terminals in industrial control, wireless terminals in self-driving, wireless terminals in remote medical surgery, and wireless terminals in smart grid.
  • the voice control method may be implemented by an application program installed on the terminal for recognizing voiceprints.
  • the above-mentioned application program for recognizing voiceprint may be an embedded application program installed in the terminal (ie, a system application of the terminal) or a downloadable application program.
  • an embedded application is an application provided as a part of the realization of a terminal (such as a mobile phone).
  • A downloadable application is an application that can provide its own internet protocol multimedia subsystem (IMS) connection; it may be pre-installed in the terminal, or it may be a third-party application downloaded and installed by the user in the terminal.
  • FIG. 1 shows a hardware structure of the mobile phone. As shown in FIG. 1:
  • the mobile phone 10 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, Antenna 1, Antenna 2, Mobile Communication Module 150, Wireless Communication Module 160, Audio Module 170, Speaker 170A, Receiver 170B, Microphone 170C, Headphone Interface 170D, Sensor Module 180, Key 190, Motor 191, Indicator 192, Camera 193, Display screen 194, and subscriber identification module (subscriber identification module, SIM) card interface 195 and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the mobile phone.
  • the mobile phone may include more or less components than shown, or some components may be combined, or some components may be separated, or different component arrangements.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the processor 110 can execute the voiceprint recognition algorithm provided by the embodiment of the present application.
  • the controller can be the nerve center and command center of the mobile phone.
  • the controller can generate operation control signals according to the instruction opcode and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have just been used or recycled by the processor 110 . If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby increasing the efficiency of the system.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the terminal can establish a wired communication connection with the wearable device through the interface.
  • the terminal can obtain through the interface that the wearable device collects the first voice component through the in-ear voice sensor, collects the second voice component through the out-of-ear voice sensor, and collects the third voice component through the bone vibration sensor.
  • the I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL).
  • the I2S interface can be used for audio communication.
  • the PCM interface can also be used for audio communications, sampling, quantizing and encoding analog signals.
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication.
  • the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 . MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc.
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
  • the USB interface 130 can be used to connect a charger to charge the mobile phone, and can also be used to transfer data between the mobile phone and peripheral devices. It can also be used to connect headphones to play audio through the headphones.
  • the interface can also be used to connect other electronic devices, such as AR devices.
  • the interface connection relationship between the modules illustrated in the embodiments of the present application is only a schematic illustration, and does not constitute a structural limitation of the mobile phone.
  • the mobile phone may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger may be a wireless charger or a wired charger.
  • the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110 , the internal memory 121 , the external memory, the display screen 194 , the camera 193 , and the wireless communication module 160 .
  • the power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance).
  • the wireless communication function of the mobile phone can be realized by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in a cell phone can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G etc. applied on the mobile phone.
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
  • the mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modulation and demodulation processor, and then turn it into an electromagnetic wave for radiation through the antenna 1 .
  • at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110 .
  • the modem processor may include a modulator and a demodulator.
  • the wireless communication module 160 can provide wireless communication solutions applied on the mobile phone, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), GNSS, frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2.
  • the terminal can establish a communication connection with the wearable device through the wireless communication module 160 .
  • the terminal may acquire, through the wireless communication module 160, the first voice component collected by the wearable device through the in-ear voice sensor, the second voice component collected through the out-of-ear voice sensor, and the third voice component collected through the bone vibration sensor.
  • the GNSS in this embodiment of the present application may include GPS, GLONASS, BDS, QZSS, SBAS, and/or GALILEO, and the like.
  • the mobile phone realizes the display function through the GPU, the display screen 194, and the application processor.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 194 is used to display images, videos, and the like. Display screen 194 includes a display panel.
  • the mobile phone can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
  • the ISP is used to process the data fed back by the camera 193 .
  • the camera 193 is used to obtain still images or videos.
  • the object is projected through the lens to generate an optical image onto the photosensitive element.
  • a digital signal processor is used to process digital signals, in addition to processing digital image signals, it can also process other digital signals.
  • Video codecs are used to compress or decompress digital video.
  • the NPU is a neural-network (NN) computing processor.
  • with the NPU, intelligent cognition applications of the mobile phone can be realized, for example image recognition, face recognition, speech recognition, and text understanding.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example, saving files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes various functional applications and data processing of the mobile phone by executing the instructions stored in the internal memory 121 .
  • the code stored in the internal memory 121 can execute a voice control method provided by the embodiments of the present application. For example, when the user inputs voice information to the wearable device, the wearable device collects the first voice component through the in-ear voice sensor, collects the second voice component through the out-of-ear voice sensor, and collects the third voice component through the bone vibration sensor; the mobile phone obtains the first voice component, the second voice component and the third voice component from the wearable device through the communication connection and performs voiceprint recognition on each of them; the mobile phone then authenticates the user according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component, and the third voiceprint recognition result of the third voice component; if the user's identity authentication result is a preset user, the mobile phone executes the operation instruction corresponding to the voice information.
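  • As an illustration only, the terminal-side flow described above can be summarized by the following Python sketch; the helper names (get_voice_components, extract_feature, matching_degree, parse_instruction, execute_instruction) and the threshold value are hypothetical placeholders, not APIs defined by this application.

```python
# Minimal sketch of the terminal-side flow; all helper names are hypothetical.
def voice_control(wearable, registered, extract_feature, matching_degree,
                  parse_instruction, execute_instruction, threshold=0.8):
    # 1. Obtain the three voice components collected by the wearable device.
    c1, c2, c3 = wearable.get_voice_components()   # in-ear, out-of-ear, bone

    # 2. Perform voiceprint recognition on each component separately.
    m1 = matching_degree(extract_feature(c1, "in_ear"), registered["in_ear"])
    m2 = matching_degree(extract_feature(c2, "out_of_ear"), registered["out_of_ear"])
    m3 = matching_degree(extract_feature(c3, "bone"), registered["bone"])

    # 3. Authenticate the user from the three recognition results
    #    (here: a simple rule that all three must exceed the threshold).
    if min(m1, m2, m3) >= threshold:
        # 4. Preset user confirmed: execute the instruction in the voice info.
        execute_instruction(parse_instruction(c2))
    # Otherwise the identity authentication fails and nothing is executed.
```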
  • the mobile phone can implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, and an application processor. Such as music playback, recording, etc.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal.
  • the speaker 170A, also referred to as the "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the receiver 170B, also referred to as the "earpiece", is used to convert audio electrical signals into sound signals.
  • the microphone 170C, also called the "mic", is used to convert sound signals into electrical signals.
  • the earphone jack 170D is used to connect wired earphones.
  • the earphone interface 170D can be the USB interface 130, or can be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys.
  • the cell phone can receive key input and generate key signal input related to user settings and function control of the cell phone.
  • Motor 191 can generate vibrating cues.
  • the motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback.
  • the indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.
  • the SIM card interface 195 is used to connect a SIM card.
  • the SIM card can be inserted into the SIM card interface 195 or pulled out from the SIM card interface 195 to achieve contact and separation with the mobile phone.
  • the mobile phone can support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
  • the SIM card interface 195 can support Nano SIM card, Micro SIM card, SIM card and so on.
  • the mobile phone 100 may further include a camera, a flash, a micro-projection device, a near field communication (near field communication, NFC) device, etc., which will not be repeated here.
  • the software system of the mobile phone can adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of a mobile phone.
  • FIG. 2 is a block diagram of a software structure of a mobile phone according to an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, from top to bottom: an application layer, an application framework layer, an Android runtime (Android runtime) and system libraries, and a kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and so on.
  • An application program for voiceprint recognition may also be included, and the application program for voiceprint recognition may be built into the terminal or downloaded through an external website.
  • the application framework layer provides an application programming interface (API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include window managers, content providers, view systems, telephony managers, resource managers, notification managers, and the like.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide the communication functions of the mobile phone. For example, the management of call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, and the indicator light flashes.
  • Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
  • the core library consists of two parts: one part is the function libraries that the Java language needs to call, and the other part is the core libraries of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
  • a system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.
  • when a touch operation is received, a corresponding hardware interrupt is sent to the kernel layer.
  • the kernel layer processes touch operations into raw input events (including touch coordinates, timestamps of touch operations, etc.). Raw input events are stored at the kernel layer.
  • the application framework layer obtains the original input event from the kernel layer and identifies the control corresponding to the input event. Taking the touch operation being a tap operation whose corresponding control is the camera application icon as an example, the camera application calls the interface of the application framework layer to start the camera application, and then starts the camera driver by calling the kernel layer.
  • the camera 193 captures still images or video.
  • the voice control method of the embodiment of the present application can be applied to a wearable device, in other words, the wearable device can be used as the execution subject of the voice control method of the embodiment of the present application.
  • the wearable device may be a device with a voice collection function, such as a wireless headset, a wired headset, smart glasses, a smart helmet, or a smart watch, which is not limited in this embodiment of the present application.
  • the wearable device provided by the embodiment of the present application may be a TWS (True Wireless Stereo, true wireless stereo) headset, and the TWS technology is based on the development of the Bluetooth chip technology. According to its working principle, it means that the mobile phone is connected to the main earphone, and then the main earphone is quickly connected to the auxiliary earphone by wireless means, so as to realize the true wireless separation of the left and right channels of Bluetooth.
  • TWS smart earphones have begun to play a role in the fields of wireless connection, voice interaction, intelligent noise reduction, health monitoring and hearing enhancement/protection. And noise reduction, hearing protection, intelligent translation, health monitoring, bone vibration ID, anti-lost, etc. will be the key technology trends of TWS headsets.
  • the wearable device 30 may specifically include an in-ear voice sensor 301 , an out-of-ear voice sensor 302 and a bone vibration sensor 303 .
  • the in-ear voice sensor 301 and the out-of-ear voice sensor may be air conduction microphones
  • the bone vibration sensor may be a bone conduction microphone, an optical vibration sensor, an acceleration sensor, or an air conduction microphone and other sensors that can collect vibration signals generated when a user utters a voice.
  • the air conduction microphone collects voice information in that the vibration signal generated when the user speaks is transmitted to the microphone through the air, where the sound signal is collected and converted into an electrical signal;
  • the bone conduction microphone collects voice information by using the slight vibration of the bones in the head and neck caused by human speech: the vibration signal of the sound is transmitted to the microphone through the bones, where the sound signal is collected and converted into an electrical signal.
  • the voice control method provided in the embodiments of the present application needs to be applied to a wearable device with a voiceprint recognition function, in other words, the wearable device 30 needs to have a voiceprint recognition function.
  • the in-ear voice sensor 301 of the wearable device 30 refers to that, when the wearable device is in the state of being used by a user, the in-ear voice sensor is located inside the user's ear canal, or in other words, the sound detection direction of the in-ear voice sensor is toward the inside of the ear canal.
  • the in-ear voice sensor is used to collect the sound transmitted by the vibration of the outside air and the air in the ear canal when the user makes a sound, and the sound is the in-ear voice signal component.
  • the out-of-ear voice sensor 302 refers to that, when the wearable device is in the state of being used by the user, the out-of-ear voice sensor is located outside the user's ear canal, or in other words, the sound detection direction of the out-of-ear voice sensor covers all directions other than the inside of the ear canal, that is, the outside air.
  • the out-of-ear voice sensor is exposed to the environment, and is used for collecting the sound transmitted by the user through the vibration of the outside air, and the sound is an out-of-ear voice signal component or an ambient sound component.
  • the bone vibration sensor 303 refers to that, when the wearable device is in the state of being used by the user, the bone vibration sensor is in contact with the user's skin and is used to collect the vibration signal transmitted by the user's bones, or, in other words, to collect the component of the voice information that is conveyed by bone vibration when the user speaks.
  • microphones with different pickup patterns, such as cardioid, omnidirectional, or figure-8, can be selected according to the positions of the microphones, so as to obtain voice signals from different directions.
  • when the wearable device is worn, the external auditory canal and the middle ear canal form a closed cavity, and sound is amplified to a certain extent in this cavity, that is, the cavity effect. Therefore, the sound collected by the in-ear voice sensor is clearer, especially for high-frequency sound signals. This can compensate for the distortion caused by the loss of high-frequency components when the bone vibration sensor collects voice information, improve the overall voiceprint collection effect of the headset, and thereby improve the accuracy of voiceprint recognition and the user experience.
  • when the in-ear voice sensor 301 picks up the in-ear voice signal, it is usually accompanied by residual noise in the ear; when the out-of-ear voice sensor 302 picks up the out-of-ear voice signal, it is usually accompanied by noise outside the ear.
  • when the user wears the wearable device 30 and speaks, the wearable device 30 can not only collect, through the in-ear voice sensor 301 and the out-of-ear voice sensor 302, the voice information transmitted by the user through the air, but can also collect, through the bone vibration sensor 303, the voice information transmitted by the user through the bones.
  • there may be multiple in-ear voice sensors 301, out-of-ear voice sensors 302 and bone vibration sensors 303 in the wearable device 30, which is not limited in this application.
  • the in-ear voice sensor 301 , the out-of-ear voice sensor 302 and the bone vibration sensor 303 may be built into the wearable device 30 .
  • the wearable device 30 may further include components such as a communication module 304 , a speaker 305 , a computing module 306 , a storage module 307 , and a power supply 309 .
  • the communication module 304 can establish a communication connection with the terminal or the server.
  • the communication module 304 may include a communication interface, and the communication interface may be in a wired or wireless manner, and the wireless manner may be through bluetooth or wifi.
  • the communication module 304 can be used to transmit, to the terminal or the server, the first voice component collected by the wearable device 30 through the in-ear voice sensor 301, the second voice component collected through the out-of-ear voice sensor 302, and the third voice component collected through the bone vibration sensor 303.
  • the computing module 306 can execute the voice control method provided in the embodiment of the present application.
  • for example, when the user inputs voice information to the wearable device, the in-ear voice sensor 301 collects the first voice component, the out-of-ear voice sensor 302 collects the second voice component, and the bone vibration sensor 303 collects the third voice component, and voiceprint recognition is performed on each of them; the user is then authenticated according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component; if the user's identity authentication result is a preset user, the wearable device executes the operation instruction corresponding to the voice information.
  • the storage module 307 is used for storing the application program code for executing the method of the embodiment of the present application, and the execution is controlled by the computing module 306 .
  • the code stored in the storage module 307 can execute a voice control method provided by this embodiment of the present application, for example: when the user inputs voice information to the wearable device, the wearable device 30 collects the first voice component through the in-ear voice sensor 301, The extra-ear voice sensor 302 collects the second voice component, and the bone vibration sensor 303 collects the third voice component to perform voiceprint recognition respectively; according to the first voiceprint recognition result of the first voice component and the second voiceprint of the second voice component The identification result and the third voiceprint identification result of the third voice component are used to authenticate the user's identity; if the user's identity authentication result is a preset user, the wearable device executes the operation instruction corresponding to the voice information.
  • the above-mentioned wearable device 30 may also include pressure sensors, acceleration sensors, optical sensors, etc.
  • the wearable device 30 may also have more or fewer components than those shown in FIG. 3, may combine two or more components, or may have a different component configuration.
  • the various components shown in Figure 3 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing or application specific integrated circuits.
  • a voice control method provided in this embodiment of the present application can be applied to a voice control system composed of a wearable device 30 and a terminal 10 , and the voice control system is shown in FIG. 4 .
  • the wearable device 30 can collect the first voice component through the in-ear voice sensor 301, the second voice component through the out-of-ear voice sensor 302, and the third voice component through the bone vibration sensor 303; the terminal 10 obtains the first voice component, the second voice component and the third voice component from the wearable device and performs voiceprint recognition on each of them; the terminal 10 authenticates the user according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component; if the user's identity authentication result is a preset user, the terminal 10 executes the operation instruction corresponding to the voice information.
  • the voice control method of the embodiment of the present application may also be applied to the server, in other words, the server may serve as the execution body of the voice control method of the embodiment of the present application.
  • the server may be a desktop server, a rack server, a cabinet server, a blade server, or other types of servers, and the server may also be a cloud server such as a public cloud or a private cloud, which is not limited in this embodiment of the present application.
  • the server 50 includes at least one processor 501 , at least one memory 502 and at least one communication interface 503 .
  • the processor 501, the memory 502, and the communication interface 503 are connected through a communication bus 504 and communicate with each other.
  • the processor 501 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the above programs.
  • the memory 502 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation.
  • the memory can exist independently and be connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory 502 is used for storing application program codes for executing the methods of the embodiments of the present application, and the execution is controlled by the processor 501 .
  • the code stored in the memory 502 can execute a voice control method provided by the embodiments of the present application. For example, when the user inputs voice information to the wearable device, the wearable device collects the first voice component through the in-ear voice sensor, collects the second voice component through the out-of-ear voice sensor, and collects the third voice component through the bone vibration sensor; the server obtains the first voice component, the second voice component and the third voice component from the wearable device through the communication connection and performs voiceprint recognition on each of them; the server authenticates the user according to the first voiceprint recognition result of the first voice component, the second voiceprint recognition result of the second voice component and the third voiceprint recognition result of the third voice component; if the user's identity authentication result is a preset user, the server executes the operation instruction corresponding to the voice information.
  • the communication interface 503 is used to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), wireless local area network (Wireless Local Area Networks, WLAN).
  • the specific implementation manner when the voice control method of the present application is applied to a terminal is summarized.
  • the method first acquires the voice information of the user, and the voice information includes a first voice component, a second voice component and a third voice component.
  • the user can input voice information into the Bluetooth headset while wearing it. At this time, based on the voice information input by the user, the Bluetooth headset can collect the first voice component through the in-ear voice sensor, the second voice component through the out-of-ear voice sensor, and the third voice component through the bone vibration sensor; in other words, the Bluetooth headset obtains the first voice component, the second voice component and the third voice component from the voice information. The mobile phone then obtains the first voice component, the second voice component and the third voice component from the Bluetooth headset through the Bluetooth connection with the Bluetooth headset.
  • in a possible implementation manner, the mobile phone may perform keyword detection on the voice information input by the user to the Bluetooth headset, or the mobile phone may detect a user input; when it is detected that the voice information includes a preset keyword, or when the user input is detected, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively.
  • the user input may be the user's input to the mobile phone through a touch screen or keys, for example, the user clicks an unlock key of the mobile phone.
  • in a possible implementation manner, the mobile phone may also acquire the wearing state detection result from the Bluetooth headset before performing keyword detection on the voice information or detecting the user input.
  • performing voiceprint recognition yields a first voiceprint recognition result corresponding to the first voice component, a second voiceprint recognition result corresponding to the second voice component, and a third voiceprint recognition result corresponding to the third voice component.
  • the mobile phone can use a certain algorithm to calculate the first matching degree between the first voiceprint feature and the first registered voiceprint feature, the second matching degree between the second voiceprint feature and the second registered voiceprint feature, and the third matching degree between the third voiceprint feature and the third registered voiceprint feature.
  • the higher the matching degree, the more consistent the voiceprint feature is with the corresponding registered voiceprint feature, and the higher the possibility that the speaking user is the preset user.
  • the mobile phone can determine that the first voiceprint feature matches the first registered voiceprint feature, that the second voiceprint feature matches the second registered voiceprint feature, and that the third voiceprint feature matches the third registered voiceprint feature.
  • the first registered voiceprint feature is obtained by feature extraction through the first voiceprint model and is used to reflect the preset user's voiceprint feature as collected by the in-ear voice sensor; the second registered voiceprint feature is obtained by feature extraction through the second voiceprint model and is used to reflect the preset user's voiceprint feature as collected by the out-of-ear voice sensor; the third registered voiceprint feature is obtained by feature extraction through the third voiceprint model and is used to reflect the preset user's voiceprint feature as collected by the bone vibration sensor.
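  • The present application does not fix the algorithm used to compute the matching degree; as one possibility only, the matching degree between an extracted voiceprint feature vector and a registered voiceprint feature vector could be a cosine similarity, as in the illustrative NumPy sketch below.

```python
import numpy as np

def matching_degree(feature: np.ndarray, registered: np.ndarray) -> float:
    """Cosine similarity between an extracted voiceprint feature and a
    registered voiceprint feature; higher values mean a closer match."""
    return float(np.dot(feature, registered)
                 / (np.linalg.norm(feature) * np.linalg.norm(registered)))

# Example usage (the feature vectors would come from the voiceprint models):
# m1 = matching_degree(first_voiceprint, first_registered_voiceprint)
# m2 = matching_degree(second_voiceprint, second_registered_voiceprint)
# m3 = matching_degree(third_voiceprint, third_registered_voiceprint)
```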
  • the mobile phone can execute an operation instruction corresponding to the voice information, for example, an unlock instruction, a payment instruction, a shutdown instruction, an application program opening instruction, or a call instruction.
  • the mobile phone can perform the corresponding operation according to the operation instruction, so as to realize the function of the user controlling the mobile phone by voice.
  • the conditions of identity authentication are not limited. For example, when the first matching degree, the second matching degree and the third matching degree are all greater than a certain threshold, it can be considered that the identity authentication has passed and the speaking user is the preset user.
  • the identity authentication in this embodiment of the present application refers to obtaining the user's identity information and determining whether the identity information matches preset identity information; if it matches, the identity authentication passes, and otherwise the identity authentication fails.
  • the above-mentioned preset users refer to users who can pass the preset identity authentication measures of the mobile phone.
  • the preset identity authentication measures of the terminal are inputting a password, fingerprint recognition and voiceprint recognition.
  • a user whose password, fingerprint information and registered voiceprint feature are stored in the terminal can be regarded as a preset user of the terminal.
  • the preset users of a terminal may include one or more, and any user other than the preset users may be regarded as an illegal user of the terminal.
  • an illegal user can also become a preset user after passing certain identity authentication measures, which is not limited in this embodiment of the present application.
  • the first registered voiceprint feature is obtained by feature extraction through the first voiceprint model and is used to reflect the preset user's voiceprint feature as collected by the in-ear voice sensor; the second registered voiceprint feature is obtained by feature extraction through the second voiceprint model and is used to reflect the preset user's voiceprint feature as collected by the out-of-ear voice sensor; the third registered voiceprint feature is obtained by feature extraction through the third voiceprint model and is used to reflect the preset user's voiceprint feature as collected by the bone vibration sensor.
  • the above algorithm for calculating the matching degree may be calculating the similarity.
  • the mobile phone performs feature extraction on the first voice component to obtain the first voiceprint feature (and likewise obtains the second and third voiceprint features), and then calculates the first similarity between the first voiceprint feature and the pre-stored first registered voiceprint feature of the preset user, the second similarity between the second voiceprint feature and the pre-stored second registered voiceprint feature of the preset user, and the third similarity between the third voiceprint feature and the pre-stored third registered voiceprint feature of the preset user, and authenticates the user based on the first similarity, the second similarity and the third similarity.
  • the way of authenticating the user may be that the mobile phone determines, according to the decibel level of the ambient sound and the playback volume of the wearable device, a first fusion coefficient corresponding to the first similarity, a second fusion coefficient corresponding to the second similarity, and a third fusion coefficient corresponding to the third similarity; the first similarity, the second similarity and the third similarity are then fused according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient to obtain a fused similarity score. If the fused similarity score is greater than the first threshold, the mobile phone determines that the user who inputs the voice information to the Bluetooth headset is the preset user.
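  • Read as a weighted sum, the fusion described above could look like the following sketch; the weighted-sum interpretation and the threshold value are assumptions for illustration, not requirements of this application.

```python
def fused_score(s1, s2, s3, w1, w2, w3):
    """Fuse the three similarities with the three fusion coefficients."""
    return w1 * s1 + w2 * s2 + w3 * s3

def is_preset_user(s1, s2, s3, w1, w2, w3, first_threshold=0.7):
    """The user is taken to be the preset user when the fused similarity
    score exceeds the first threshold (threshold value is illustrative)."""
    return fused_score(s1, s2, s3, w1, w2, w3) > first_threshold
```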
  • the decibel number of the ambient sound is detected by the sound pressure sensor of the Bluetooth headset and sent to the mobile phone
  • the playback volume can be detected by the Bluetooth headset and sent to the mobile phone, or it can be obtained by the mobile phone itself by calling its own data, that is, obtained through the volume application program interface of the underlying system.
  • the second fusion coefficient is negatively correlated with the decibel level of the ambient sound, the first fusion coefficient and the third fusion coefficient are each negatively correlated with the decibel level of the playback volume, and the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value. That is to say, when the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a preset fixed value, the larger the decibel level of the ambient sound, the smaller the second fusion coefficient, and correspondingly the first fusion coefficient and the third fusion coefficient increase adaptively, so as to keep the sum of the three fusion coefficients unchanged; similarly, the larger the playback volume, the smaller the first fusion coefficient and the third fusion coefficient, and correspondingly the second fusion coefficient increases adaptively.
  • the above-mentioned variable fusion coefficients can take the recognition accuracy in different application scenarios into account (for example, in a noisy environment or when the headset is playing music).
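  • One way to realise such variable fusion coefficients, given only as a hedged sketch, is to shrink the second coefficient as the ambient sound gets louder, shrink the first and third coefficients as the playback volume gets louder, and rescale so that the sum of the three coefficients stays at a fixed value; the reference decibel values below are assumptions, not values from this application.

```python
def fusion_coefficients(ambient_db, playback_db, total=1.0,
                        ambient_ref=60.0, playback_ref=60.0):
    """Illustrative adaptation of the three fusion coefficients.
    Louder ambient sound lowers the out-of-ear coefficient w2; louder
    playback lowers the in-ear and bone coefficients w1 and w3; the
    coefficients are then rescaled so that w1 + w2 + w3 == total."""
    w1 = 1.0 / (1.0 + playback_db / playback_ref)   # in-ear
    w2 = 1.0 / (1.0 + ambient_db / ambient_ref)     # out-of-ear
    w3 = 1.0 / (1.0 + playback_db / playback_ref)   # bone vibration
    scale = total / (w1 + w2 + w3)
    return w1 * scale, w2 * scale, w3 * scale
```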
  • after the mobile phone determines that the user who inputs the voice information to the Bluetooth headset is the preset user, the mobile phone can automatically execute the operation instruction corresponding to the voice information, for example, an unlock operation or a payment confirmation operation.
  • when the user inputs voice information to the wearable device in order to control the terminal, the wearable device can collect the voice information generated inside the ear canal, the voice information generated outside the ear canal, and the bone vibration information produced when the user speaks. At this time, three channels of voice information (that is, the above-mentioned first voice component, second voice component and third voice component) are generated in the wearable device. In this way, the terminal (or the wearable device itself, or the server) can perform voiceprint recognition on each of the three channels of voice information.
  • the triple voiceprint recognition process of this three-channel voice information can significantly improve the accuracy and security of user identity authentication compared with the voiceprint recognition process of one-channel voice information or the voiceprint recognition process of two-channel voice information.
  • adding a microphone in the ear can solve the problem, present in a voiceprint recognition process that uses only the two channels of voice information from the out-of-ear voice sensor and the bone vibration sensor, that the high-frequency part of the speech signal collected by the bone vibration sensor is lost.
  • only when the user actually wears the wearable device and speaks can the wearable device collect the voice information input by the user through bone conduction. Therefore, when the voice information collected through bone conduction passes voiceprint recognition, it also shows that the above voice information was produced by the voice of the preset user wearing the wearable device, thereby avoiding the situation where an illegal user maliciously controls the preset user's terminal by playing a recording of the preset user.
  • a voice control method provided by the embodiments of the present application will be specifically introduced below with reference to the accompanying drawings.
  • a mobile phone is used as a terminal, and a Bluetooth headset is used as an example for illustration.
  • the general voiceprint recognition application process is shown in Figure 6.
  • in the registration process, the registration voice 601 is first collected; after preprocessing by the preprocessing module 602, it is input into the pre-trained voiceprint model 603 for feature extraction to obtain a registered voiceprint feature 604, which can also be understood as the preset user's registered voiceprint feature.
  • the registered speech can be picked up by different types of sensors, eg out-of-ear speech sensors, in-ear speech sensors or bone vibration sensors.
  • the voiceprint model 603 is obtained through training data in advance.
  • the voiceprint model 603 may be built in before the terminal leaves the factory, or may be obtained by an application guiding the user through a training process.
  • the training method may use the method of the prior art, which is not limited in this application.
  • in the verification process, the test voice 605 of the speaking user in a certain voiceprint recognition process is first collected; after preprocessing by the preprocessing module 606, it is input into the pre-trained voiceprint model 607 for feature extraction to obtain a test voiceprint feature 608, which is then compared with the registered voiceprint feature. Identity authentication passing 6010 means that the speaking user of the test voice 605 and the speaking user of the registered voice 601 are the same person, in other words, the speaking user of the test voice 605 is the preset user; identity authentication failure 6011 means that the speaking user of the test voice 605 and the speaking user of the registered voice 601 are not the same person, in other words, the speaking user of the test voice 605 is an illegal user.
  • for different types of sensors, the preprocessing of the voice, the feature extraction, and the training process of the voiceprint model may differ to various degrees. The preprocessing module is optional, and the preprocessing, such as filtering, noise reduction or enhancement, is not limited in this application.
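  • The generic registration and verification flow of Figure 6 can be summarised by the sketch below; `preprocess`, `voiceprint_model` and `similarity` are placeholders standing in for the preprocessing module, the pre-trained voiceprint model and the comparison step, not concrete APIs defined by this application.

```python
def enroll(registration_voice, preprocess, voiceprint_model):
    """Registration: preprocess the registration voice and extract the
    registered voiceprint feature with the pre-trained voiceprint model."""
    return voiceprint_model(preprocess(registration_voice))

def verify(test_voice, registered_feature, preprocess, voiceprint_model,
           similarity, threshold=0.8):
    """Verification: extract the test voiceprint feature and compare it with
    the registered voiceprint feature; identity authentication passes when
    the similarity exceeds the (illustrative) threshold."""
    test_feature = voiceprint_model(preprocess(test_voice))
    return similarity(test_feature, registered_feature) > threshold
```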
  • FIG. 7 shows a schematic flowchart of a voice control method provided by an embodiment of the present application, taking the terminal being a mobile phone and the wearable device being a Bluetooth headset as an example.
  • the Bluetooth headset includes an in-ear voice sensor, an out-of-ear voice sensor and a bone vibration sensor.
  • the voice control method may include:
  • a mobile phone establishes a connection with a Bluetooth headset.
  • the connection method can be bluetooth connection, wifi connection or wired connection.
  • the Bluetooth function of the Bluetooth headset can be turned on.
  • the Bluetooth headset can send a pairing broadcast to the outside world. If the mobile phone does not have the bluetooth function turned on, the user needs to turn on the bluetooth function of the mobile phone. If the mobile phone has turned on the bluetooth function, the mobile phone can receive the pairing broadcast and prompt the user that the relevant bluetooth device has been scanned. After the user selects the Bluetooth headset on the mobile phone, the mobile phone can be paired with the Bluetooth headset and a Bluetooth connection can be established. Subsequently, the mobile phone and the Bluetooth headset can communicate through the Bluetooth connection. Of course, if the mobile phone and the Bluetooth headset have been successfully paired before the current Bluetooth connection is established, the mobile phone can automatically establish a Bluetooth connection with the scanned Bluetooth headset.
  • if the headset the user wants to use has the Wi-Fi function, the user can also operate the mobile phone to establish a Wi-Fi connection with the headset.
  • if the earphone the user wishes to use is a wired earphone, the user can insert the plug of the earphone cable into the corresponding earphone port of the mobile phone to establish a wired connection, which is not limited in this embodiment of the present application.
  • the Bluetooth headset detects whether it is in a wearing state.
  • the wearing detection may sense the user's wearing state by means of photoelectric detection, based on the principle of optical sensing.
  • when the user wears the earphone, the light detected by the photoelectric sensor inside the earphone is blocked and a switch control signal is output, from which it is judged that the user is in the state of wearing the earphone.
  • a proximity light sensor and an acceleration sensor may be provided in the Bluetooth headset, wherein the proximity light sensor is provided on the side that is in contact with the user when worn by the user.
  • the proximity light sensor and acceleration sensor can be activated periodically to obtain currently detected measurements.
  • since the user wears the Bluetooth headset, the light entering the proximity light sensor will be blocked. Therefore, when the light intensity detected by the proximity light sensor is less than the preset light intensity threshold, the Bluetooth headset can determine that it is in the wearing state at this time. Also, because the Bluetooth headset will move with the user after the user wears it, when the acceleration value detected by the acceleration sensor is greater than the preset acceleration threshold, the Bluetooth headset can determine that it is in the wearing state at this time. Or, when the light intensity detected by the proximity light sensor is less than the preset light intensity threshold and the acceleration value detected by the acceleration sensor at this time is greater than the preset acceleration threshold, the Bluetooth headset can determine that it is in the wearing state.
  • since the Bluetooth headset is also provided with a sensor that collects voice information by means of bone conduction, such as a bone vibration sensor or an optical vibration sensor, in a possible implementation the Bluetooth headset can further collect, through the bone vibration sensor, the vibration signals generated in the current environment. When the Bluetooth headset is in direct contact with the user, the vibration signal collected by the bone vibration sensor is stronger than in the non-wearing state. Then, if the energy of the vibration signal collected by the bone vibration sensor is greater than the energy threshold, the Bluetooth headset can determine that it is being worn.
  • alternatively, if the vibration signal collected by the bone vibration sensor matches preset spectral characteristics, the Bluetooth headset can determine that it is in a wearing state.
  • the above two situations can be understood as the user's wearing state detection result passing. This can reduce the probability that the Bluetooth headset cannot accurately detect the wearing state through the proximity light sensor or acceleration sensor in scenarios such as the user putting the Bluetooth headset into a pocket.
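  • The wearing-state check described above might be combined as in the sketch below; the threshold values are placeholders and the exact combination of optical, motion and bone-vibration conditions is only one possible reading of these embodiments.

```python
def is_worn(light_intensity, acceleration, bone_vibration_energy,
            light_threshold=10.0, accel_threshold=0.5, energy_threshold=1e-3):
    """Return True when the headset appears to be worn: the proximity light
    sensor is blocked (low light), or the headset moves with the user (high
    acceleration), or the bone vibration sensor picks up enough vibration
    energy from direct skin contact. All thresholds are illustrative."""
    return (light_intensity < light_threshold
            or acceleration > accel_threshold
            or bone_vibration_energy > energy_threshold)
```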
  • the above-mentioned energy threshold or preset spectral characteristics may be obtained by collecting, in advance, the various vibration signals generated when a large number of users wear Bluetooth headsets, for example when speaking or exercising; the energy or spectral characteristics of such speech-related vibration signals differ significantly from those in the non-wearing state.
  • since the power consumption of a voice sensor such as an air conduction microphone is relatively high, the in-ear voice sensor, the out-of-ear voice sensor and/or the bone vibration sensor may be turned on to collect the voice information generated by the user's voice only after the wearing state is detected, so as to reduce the power consumption of the Bluetooth headset.
  • when the Bluetooth headset detects that it is currently in the wearing state, or in other words, after the wearing state detection result passes, the following steps S703-S707 may be continued; otherwise, the Bluetooth headset may enter the sleep state until it detects that it is currently in the wearing state.
  • the above step S702 is an optional step, that is, regardless of whether the user wears a Bluetooth headset, the Bluetooth headset can continue to perform the following steps S703-S707, which is not limited in this embodiment of the present application.
  • if the Bluetooth headset has already collected a voice signal before detecting whether it is in the wearing state, then when the Bluetooth headset detects that it is currently in the wearing state, or in other words after the wearing state detection result passes, the voice signal collected by the Bluetooth headset is kept and the following steps S703-S707 are continued; when the Bluetooth headset does not detect that it is currently in the wearing state, or in other words after the wearing state detection result fails, the Bluetooth headset deletes the voice signal it has just collected.
  • the Bluetooth headset collects the first voice component in the voice information input by the user through the in-ear voice sensor, collects the second voice component in the above-mentioned voice information through the out-of-ear voice sensor, and collects the third voice component in the voice information through the bone vibration sensor.
  • the Bluetooth headset can start the voice detection module, and use the above-mentioned in-ear voice sensor, out-of-ear voice sensor and bone vibration sensor to collect the voice information input by the user, obtaining the first voice component, the second voice component and the third voice component in the voice information.
  • taking the in-ear voice sensor and the out-of-ear voice sensor being air conduction microphones and the bone vibration sensor being a bone conduction microphone as an example, the user can input the voice information "Xiao E, use WeChat payment" when using the Bluetooth headset.
  • the Bluetooth headset can use the air conduction microphones to receive the vibration signals generated by the air vibration after the user speaks (that is, the first voice component and the second voice component in the above voice information).
  • the Bluetooth headset can use the bone conduction microphone to receive the vibration signal generated by the vibration of the ear bone and the skin after the user's voice (that is, the third voice component in the above voice information) .
  • FIG. 8 is a schematic diagram of a sensor setting area.
  • the Bluetooth headset provided by the embodiment of the present application includes an in-ear voice sensor, an out-of-ear voice sensor, and a bone vibration sensor.
  • the in-ear voice sensor refers to that when the headset is in the state of being used by the user, the in-ear voice sensor is located inside the user's ear canal, or the sound detection direction of the in-ear voice sensor is inside the ear canal , the in-ear voice sensor is set in the in-ear voice sensor setting area 801 .
  • the in-ear speech sensor is used to collect the sound transmitted by the vibration of the outside air and the air in the ear canal when the user makes a sound, and the sound is the in-ear speech signal component.
  • the out-of-ear voice sensor means that when the headset is in the state of being used by the user, the out-of-ear voice sensor is located outside the user's ear canal, or the sound detection direction of the out-of-ear voice sensor covers all directions other than the inside of the ear canal, that is, the outside air; the out-of-ear voice sensor is arranged in the out-of-ear voice sensor setting area 802.
  • the out-of-ear voice sensor is exposed to the environment, and is used for collecting the sound transmitted by the user through the vibration of the outside air, and the sound is an out-of-ear voice signal component or an ambient sound component.
  • the bone vibration sensor refers to that when the headset is in the state of being used by the user, the bone vibration sensor is in contact with the user's skin and is used to collect the vibration signal transmitted by the user's bones, or, in other words, to collect the component of the user's speech information conveyed by bone vibration.
  • the setting area of the bone vibration sensor is not limited, as long as the user's bone vibration can be detected when the user wears the earphone. It can be understood that the in-ear voice sensor can be set at any position in the area 801, and the out-of-ear voice sensor can be set at any position in the area 802, which is not limited in this application. It should be noted that the area division in Figure 8 is only an example; in fact, it suffices that the setting position of the in-ear voice sensor allows it to detect sound inside the ear canal and the setting position of the out-of-ear voice sensor allows it to detect sound from the direction of the outside air.
  • the Bluetooth headset can respectively input the first voice component, the second voice component and the third voice component in the above voice information into the corresponding VAD (voice activity detection) algorithm, to obtain the first VAD value corresponding to the first voice component, the second VAD value corresponding to the second voice component and the third VAD value corresponding to the third voice component.
  • the value of VAD can be used to reflect whether the above-mentioned speech information is a normal speech signal of the speaker or a noise signal.
  • the value range of VAD can be set in the interval of 0 to 100. When the value of VAD is greater than a certain VAD threshold, it indicates that the voice information is a normal voice signal of the speaker; when the value of VAD is less than that VAD threshold, it indicates that the voice information is a noise signal.
  • the value of VAD can be set to 0 or 1. When the value of VAD is 1, it indicates that the voice information is a normal voice signal of the speaker, and when the value of VAD is 0, it indicates that the voice information is a noise signal.
  • the Bluetooth headset can determine whether the above-mentioned voice information is a noise signal in combination with the above-mentioned three VAD values of the first VAD value, the second VAD value and the third VAD value. For example, when the first VAD value, the second VAD value and the third VAD value are all 1, the Bluetooth headset can determine that the above voice information is not a noise signal, but a normal voice signal of the speaker. For another example, when the first VAD value, the second VAD value and the third VAD value are respectively greater than the preset values, the Bluetooth headset can determine that the above voice information is not a noise signal, but a normal voice signal of the speaker.
  • the Bluetooth headset can also determine whether the above-mentioned voice information is a noise signal only according to the first VAD value or the second VAD value, or according to any two of the first VAD value, the second VAD value and the third VAD value.
  • if the Bluetooth headset determines that the voice information is a noise signal, the Bluetooth headset can discard the voice information; if the voice information is not a noise signal, the Bluetooth headset can continue to perform the following steps S704-S707. That is, only when the user inputs valid voice information into the Bluetooth headset will the Bluetooth headset be triggered to perform subsequent voiceprint recognition and other processes, thereby reducing the power consumption of the Bluetooth headset.
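  • as an illustrative aid, the gating logic described above can be sketched as follows; this is a minimal sketch assuming each sensor path already yields a VAD value on a 0-100 scale, and the function names and the threshold value are assumptions rather than values from this application:

```python
# Minimal sketch of combining per-sensor VAD values to decide whether the
# captured voice information is speech or noise; the threshold is illustrative.

VAD_THRESHOLD = 50  # assumed threshold on a 0-100 VAD scale

def is_speech(vad_first: int, vad_second: int, vad_third: int,
              threshold: int = VAD_THRESHOLD) -> bool:
    """True only when all three voice components look like normal speech."""
    return all(v >= threshold for v in (vad_first, vad_second, vad_third))

def handle_voice_information(first, second, third, vad_values):
    # Discard the frame when it is judged to be noise, so the later
    # voiceprint-recognition steps (S704-S707) are never triggered needlessly.
    if not is_speech(*vad_values):
        return None
    return first, second, third
```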
  • a noise estimation algorithm, for example the minimum value statistics algorithm or the minima-controlled recursive averaging algorithm, can also be used to respectively measure the noise values in the above voice information.
  • a Bluetooth headset may set a storage space dedicated to storing noise values, and each time the Bluetooth headset calculates a new noise value, the new noise value may be updated in the above-mentioned storage space. That is, the recently measured noise value is always stored in the storage space.
  • the noise value in the above-mentioned storage space can be used to perform noise reduction processing on the above-mentioned first voice component, second voice component and third voice component respectively, so that the recognition result is more accurate when the Bluetooth headset (or the mobile phone) subsequently performs voiceprint recognition on the first voice component, the second voice component and the third voice component.
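  • a minimal sketch of the noise-value bookkeeping described above, in the spirit of minimum-statistics noise estimation; the window length, smoothing factor and class name are assumptions, not values given in this application:

```python
import numpy as np

class NoiseTracker:
    """Keeps the most recently measured noise value in a dedicated buffer."""

    def __init__(self, window: int = 50, alpha: float = 0.9):
        self.window = window        # number of recent frames considered
        self.alpha = alpha          # smoothing factor for frame power
        self.smoothed = None
        self.history = []           # recent smoothed powers
        self.noise_value = 0.0      # "storage space" holding the latest estimate

    def update(self, frame: np.ndarray) -> float:
        power = float(np.mean(frame ** 2))
        self.smoothed = power if self.smoothed is None else (
            self.alpha * self.smoothed + (1 - self.alpha) * power)
        self.history = (self.history + [self.smoothed])[-self.window:]
        self.noise_value = min(self.history)   # minimum-statistics estimate
        return self.noise_value                # used later for noise reduction
```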
  • the Bluetooth headset sends the first voice component, the second voice component and the third voice component to the mobile phone through the Bluetooth connection.
  • the Bluetooth headset can send the first voice component, the second voice component and the third voice component to the mobile phone, and then the mobile phone performs the following steps S705-S707 to implement operations such as voiceprint recognition and user identity authentication on the voice information input by the user.
  • the mobile phone performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, to obtain a first voiceprint recognition result corresponding to the first voice component, a second voiceprint recognition result corresponding to the second voice component and a third voiceprint recognition result corresponding to the third voice component.
  • the principle of voiceprint recognition is to compare the registered voiceprint features of the preset user with the voiceprint features extracted from the voice information input by the user, and make a judgment through a certain algorithm, and the judgment result is the voiceprint recognition result.
  • the registered voiceprint features of one or more preset users may be pre-stored in the mobile phone.
  • each preset user has three registered voiceprint features: the first registered voiceprint feature, obtained by feature extraction from the user's first registered voice collected while the in-ear voice sensor is working; the second registered voiceprint feature, obtained by feature extraction from the user's second registered voice collected while the out-of-ear voice sensor is working; and the third registered voiceprint feature, obtained by feature extraction from the user's third registered voice collected while the bone conduction microphone is working.
  • the acquisition of the first registered voiceprint feature, the second registered voiceprint feature and the third registered voiceprint feature needs to go through two stages.
  • the first stage is the background model training stage.
  • the developer can capture the speech of the relevant text (eg, "Hello, little E", etc.) generated by a large number of speakers wearing the above-mentioned Bluetooth headset.
  • the mobile phone can perform preprocessing (such as filtering, noise reduction, etc.) on the speech of these related texts, and then extract the voiceprint features in the speech.
  • the voiceprint features can be a spectrogram (time-frequency spectrogram), fbank (filter-bank features), MFCC (mel-frequency cepstral coefficients), PLP (perceptual linear prediction) or CQCC (constant Q cepstral coefficients), and so on.
  • the mobile phone can also extract two or more of the above-mentioned voiceprint features and obtain a fused voiceprint feature by means of splicing or the like.
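  • for illustration only, the feature extraction and splicing described above might look like the following sketch; librosa is assumed to be available, and the feature dimensions are arbitrary choices, not requirements of this application:

```python
import numpy as np
import librosa  # assumed available for MFCC / mel filter-bank extraction

def extract_voiceprint_features(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Extract MFCC and log-mel (fbank-like) features, then fuse by splicing."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)            # (20, T)
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=40))    # (40, T)
    # Feature-level fusion: concatenate along the feature axis.
    return np.concatenate([mfcc, fbank], axis=0)                       # (60, T)
```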
  • a machine learning algorithm such as GMM (Gaussian mixture model), SVM (support vector machine) or a deep neural network framework can then be used to establish a background model for voiceprint recognition, where the above machine learning algorithms include but are not limited to the DNN (deep neural network) algorithm, the RNN (recurrent neural network) algorithm, the LSTM (long short-term memory) algorithm, the TDNN (time-delay neural network) algorithm, Resnet (deep residual network), and so on.
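  • a minimal sketch of building a GMM-based background model from pooled multi-speaker features; the use of scikit-learn and the 512-component size are assumptions for illustration, not details disclosed here:

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # assumed available

def train_background_model(per_utterance_features):
    """per_utterance_features: iterable of (T_i, D) arrays of voiceprint features."""
    pooled = np.vstack(list(per_utterance_features))
    ubm = GaussianMixture(n_components=512, covariance_type="diag", max_iter=100)
    ubm.fit(pooled)            # trained on speech collected from many speakers
    return ubm                 # stored on the phone, the wearable or a server
```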
  • after obtaining the background model, the mobile phone stores the obtained background model.
  • the storage location may be a mobile phone, a wearable device or a server.
  • a single or multiple background models can be stored, and the stored multiple background models can be obtained by the same or different algorithms.
  • the stored multiple background models can realize the fusion at the voiceprint model level.
  • a mobile phone or a Bluetooth headset can establish multiple voiceprint models respectively by combining the characteristics of different voice sensors in the wearable device connected to the mobile phone. For example, a first voiceprint model corresponding to the in-ear voice sensor of the Bluetooth headset, a second voiceprint model corresponding to the out-of-ear voice sensor of the Bluetooth headset, and a third voiceprint model corresponding to the bone vibration sensor of the Bluetooth headset are established.
  • the mobile phone can save the first voiceprint model, the second voiceprint model and the third voiceprint model locally on the mobile phone, or send the first voiceprint model, the second voiceprint model and the third voiceprint model to the Bluetooth headset for processing. save.
  • the second stage is that, when the user uses the voiceprint recognition function on the mobile phone for the first time, the user enters a registered voice, and the mobile phone extracts the user's first registered voiceprint feature, second registered voiceprint feature and third registered voiceprint feature through the in-ear voice sensor, out-of-ear voice sensor and bone vibration sensor of the Bluetooth headset connected to the mobile phone. The registration can be carried out through the voiceprint recognition option in the built-in device biometric function of the mobile phone system, or a downloaded APP can call the system program to carry out the registration process.
  • the voice assistant APP can prompt the user to wear a Bluetooth headset and say a registration voice of "Hello, Little E".
  • the Bluetooth headset includes an in-ear voice sensor, an out-of-ear voice sensor, and a bone vibration sensor
  • the Bluetooth headset can obtain the first registered voice component of the registered voice collected by the in-ear voice sensor, the second registered voice component collected by the out-of-ear voice sensor and the third registered voice component collected by the bone vibration sensor.
  • the mobile phone can extract features from the first registered voice component through the first voiceprint model to obtain the first registered voiceprint feature, extract features from the second registered voice component through the second voiceprint model to obtain the second registered voiceprint feature, and extract features from the third registered voice component through the third voiceprint model to obtain the third registered voiceprint feature.
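  • a sketch of this registration step, where `embed` stands in for whatever extractor each voiceprint model uses (GMM, DNN, TDNN and so on); the names are illustrative:

```python
def enroll_preset_user(model_in_ear, model_out_ear, model_bone,
                       reg_voice_1, reg_voice_2, reg_voice_3):
    """Turn the three registered voice components into registered voiceprint features."""
    reg1 = model_in_ear.embed(reg_voice_1)    # first registered voiceprint feature
    reg2 = model_out_ear.embed(reg_voice_2)   # second registered voiceprint feature
    reg3 = model_bone.embed(reg_voice_3)      # third registered voiceprint feature
    # Saved locally on the phone, or sent to the Bluetooth headset for storage.
    return {"reg1": reg1, "reg2": reg2, "reg3": reg3}
```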
  • the mobile phone can save the preset first registered voiceprint feature, second registered voiceprint feature and third registered voiceprint feature of user 1 locally on the mobile phone, or send the preset first registered voiceprint feature, second registered voiceprint feature and third registered voiceprint feature of user 1 to the Bluetooth headset for storage.
  • the mobile phone may also use the Bluetooth headset connected at this time as the preset Bluetooth device.
  • the mobile phone may store the preset identifier of the Bluetooth device (eg, the MAC address of the Bluetooth headset, etc.) locally in the mobile phone.
  • the mobile phone can receive and execute the relevant operation instructions sent by the preset Bluetooth device, and when the illegal Bluetooth device sends the operation instruction to the mobile phone, the mobile phone can discard the operation instruction to improve security.
  • a phone can manage one or more preset Bluetooth devices. As shown in (a) of FIG. 11, the user can enter the setting interface 1101 of the voiceprint recognition function from the setting function. After the user clicks the setting button 1105, the user can enter the preset device management interface 1106 shown in (b) of FIG. 11. The user can add or delete preset Bluetooth devices in the preset device management interface 1106.
  • in step S705, after acquiring the first voice component, the second voice component and the third voice component in the above voice information, the mobile phone can extract the voiceprint feature of the first voice component to obtain the first voiceprint feature, extract the voiceprint feature of the second voice component to obtain the second voiceprint feature, and extract the voiceprint feature of the third voice component to obtain the third voiceprint feature; then the preset first registered voiceprint feature of user 1 is used to match the first voiceprint feature, the preset second registered voiceprint feature of user 1 is used to match the second voiceprint feature, and the preset third registered voiceprint feature of user 1 is used to match the third voiceprint feature.
  • the mobile phone can use a certain algorithm to calculate the first matching degree between the first registered voiceprint feature and the first voice component (that is, the first voiceprint recognition result), the second matching degree between the second registered voiceprint feature and the second voice component (that is, the second voiceprint recognition result), and the third matching degree between the third registered voiceprint feature and the third voice component (that is, the third voiceprint recognition result).
  • the higher the matching degree, the more similar the voiceprint feature in the voice information is to the voiceprint feature of the preset user 1, and the higher the probability that the user who inputs the voice information is the preset user 1.
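  • one common way to realize such a matching degree is cosine similarity between the extracted voiceprint feature and the registered one; the 0-100 rescaling below is only an assumption so the score can be compared against the point thresholds mentioned later:

```python
import numpy as np

def matching_degree(voiceprint: np.ndarray, registered: np.ndarray) -> float:
    """Cosine similarity mapped to a 0-100 matching-degree scale."""
    cos = float(np.dot(voiceprint, registered) /
                (np.linalg.norm(voiceprint) * np.linalg.norm(registered)))
    return 50.0 * (cos + 1.0)   # map [-1, 1] to [0, 100]
```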
  • when the first matching degree, the second matching degree and the third matching degree satisfy the preset conditions (for example, each exceeds its corresponding threshold), the mobile phone can determine that the first voiceprint feature matches the first registered voiceprint feature, the second voiceprint feature matches the second registered voiceprint feature, and the third voiceprint feature matches the third registered voiceprint feature.
  • the first registered voiceprint feature is obtained by feature extraction through the first voiceprint model and is used to reflect the voiceprint feature of the preset user collected by the in-ear voice sensor; the second registered voiceprint feature is obtained by feature extraction through the second voiceprint model and is used to reflect the voiceprint feature of the preset user collected by the out-of-ear voice sensor; the third registered voiceprint feature is obtained by feature extraction through the third voiceprint model and is used to reflect the voiceprint feature of the preset user collected by the bone vibration sensor. It can be understood that the function of the voiceprint model is to extract the voiceprint features of the input voice.
  • the voiceprint model can extract the registered voiceprint features of the registered voice.
  • the voiceprint model can extract the voiceprint features of the speech.
  • the acquisition method of the voiceprint feature may also be a fusion method, including a voiceprint model fusion method and a voiceprint feature-level fusion method.
  • the above algorithm for calculating the matching degree may be calculating the similarity.
  • the mobile phone performs feature extraction on the first voice component to obtain the first voiceprint feature, and respectively calculates the first similarity between the first voiceprint feature and the pre-stored first registered voiceprint feature of the preset user, the second similarity between the second voiceprint feature and the pre-stored second registered voiceprint feature of the preset user, and the third similarity between the third voiceprint feature and the pre-stored third registered voiceprint feature of the preset user.
  • the mobile phone can also calculate, one by one according to the above method, the matching degrees between the first voice component, the second voice component and the third voice component and the registered voiceprint features of other preset users (such as preset user 2 and preset user 3).
  • the Bluetooth headset may determine the preset user with the highest matching degree (eg, preset user A) as the sounding user at this time.
  • the judgment method may be to perform keyword detection on the voice information: when the voice information includes preset keywords, the mobile phone performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively; or, the judgment method may be to detect user input: when receiving a preset operation input by the user, the mobile phone performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively.
  • the specific method of keyword detection may be that after voice recognition is performed on the keyword, the similarity is greater than a preset threshold, and the keyword detection is considered to be passed.
  • if the Bluetooth headset or the mobile phone recognizes preset keywords in the voice information input by the user, for example "transfer", "payment", "** bank", "chat record" or other keywords related to user privacy or financial behavior, it indicates that the user has high security requirements for controlling the mobile phone by voice at this time; therefore, the mobile phone can perform step S705 to perform voiceprint recognition.
  • if the Bluetooth headset detects a preset operation input by the user for enabling the voiceprint recognition function, for example tapping the Bluetooth headset or pressing the volume + and volume - buttons at the same time, it means that the user needs to verify the user identity through voiceprint recognition at this time; therefore, the Bluetooth headset can notify the mobile phone to perform step S705 for voiceprint recognition.
  • keywords corresponding to different security levels may also be preset in the mobile phone.
  • keywords with the highest security level include "payment", "pay", etc.
  • keywords with higher security levels include “photography”, “calling”, etc.
  • keywords with the lowest security level include "listening to songs", "navigation", etc.
  • the mobile phone can be triggered to perform voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, that is, voiceprint recognition is performed on all three audio sources of the collected voice, so as to improve the security of voice control of the mobile phone.
  • when it is detected that the collected voice information contains keywords with a higher security level, since the security requirements of the user for controlling the mobile phone through voice are normal in this case, the mobile phone can be triggered to perform voiceprint recognition only on the first voice component, the second voice component or the third voice component.
  • when the voice information collected by the Bluetooth headset does not contain keywords, it means that the voice information collected at this time may only be the voice information sent by the user during a normal conversation, and the mobile phone does not need to perform voiceprint recognition on the first voice component, the second voice component and the third voice component, thereby reducing the power consumption of the mobile phone.
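  • the keyword-to-security-level mapping could be realized as a simple table like the sketch below; the keyword lists follow the examples above, while the number of components verified at each level is an assumption for illustration:

```python
# Security level per keyword (3 = highest, 1 = higher, 0 = lowest / none).
KEYWORD_LEVELS = {
    "payment": 3, "transfer": 3,
    "photograph": 1, "call": 1,
    "listen to songs": 0, "navigation": 0,
}

def components_to_verify(recognized_text: str) -> int:
    """Return how many voice components should undergo voiceprint recognition."""
    levels = [lvl for kw, lvl in KEYWORD_LEVELS.items() if kw in recognized_text]
    level = max(levels) if levels else 0
    return {3: 3, 1: 1, 0: 0}[level]   # verify all three, one, or none
```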
  • the mobile phone may also preset one or more wake-up words to wake the mobile phone to turn on the voiceprint recognition function.
  • the wake word can be "Hello, Little E”.
  • the Bluetooth headset or the mobile phone can identify whether the voice information is a wake-up voice containing a wake-up word.
  • the Bluetooth headset can send the first voice component, the second voice component and the third voice component in the collected voice information to the mobile phone. If the mobile phone further recognizes that the voice information contains the above wake-up word, the mobile phone can turn on the voiceprint recognition function (for example, power on the voiceprint recognition chip). Subsequently, if the above-mentioned keywords are included in the voice information collected by the Bluetooth headset, the mobile phone can use the voiceprint recognition function that has already been enabled to perform voiceprint recognition according to the method of step S705.
  • the Bluetooth headset can further identify whether the voice information contains the above wake-up word. If the above wake-up word is included, it means that subsequent users may need to use the voiceprint recognition function. Then, the Bluetooth headset can send a start command to the mobile phone, so that the mobile phone can turn on the voiceprint recognition function in response to the start command.
  • the mobile phone authenticates the user identity according to the first voiceprint recognition result, the second voiceprint recognition result and the third voiceprint recognition result.
  • in step S706, the mobile phone obtains, through voiceprint recognition, a first voiceprint recognition result corresponding to the first voice component, a second voiceprint recognition result corresponding to the second voice component and a third voiceprint recognition result corresponding to the third voice component; the three voiceprint recognition results can then be integrated to authenticate the identity of the user inputting the above voice information, thereby improving the accuracy and security of the user identity authentication.
  • the first matching degree between the preset user's first registered voiceprint feature and the above-mentioned first voiceprint feature is the first voiceprint recognition result, the second matching degree between the preset user's second registered voiceprint feature and the above-mentioned second voiceprint feature is the second voiceprint recognition result, and the third matching degree between the preset user's third registered voiceprint feature and the above-mentioned third voiceprint feature is the third voiceprint recognition result.
  • for example, when the first matching degree is greater than a first threshold, the second matching degree is greater than a second threshold and the third matching degree is greater than a third threshold, the mobile phone determines that the user who uttered the first voice component, the second voice component and the third voice component is a preset user; otherwise, the mobile phone may determine that the user who uttered the first voice component, the second voice component and the third voice component is an illegal user.
  • alternatively, the mobile phone can calculate the weighted average of the first matching degree and the second matching degree, and when the weighted average is greater than a preset threshold, the mobile phone can determine that the user who uttered the first voice component, the second voice component and the third voice component is a preset user; otherwise, the mobile phone may determine that the user who uttered the first voice component, the second voice component and the third voice component is an illegal user.
  • the mobile phone can use different authentication strategies in different voiceprint recognition scenarios. For example, when the collected voice information contains a keyword with the highest security level, the mobile phone may set the above-mentioned first threshold, second threshold and third threshold to 99 points. In this way, only when the first matching degree, the second matching degree and the third matching degree are all greater than 99 points, the mobile phone determines that the current uttering user is the preset user. When the collected voice information contains keywords with a lower security level, the mobile phone can set the above-mentioned first threshold, second threshold and third threshold to 85 points. In this way, when the first matching degree, the second matching degree and the third matching degree are all greater than 85 points, the mobile phone can determine that the current uttering user is the preset user. That is to say, for voiceprint recognition scenarios of different security levels, the mobile phone can use authentication policies of different security levels to authenticate the user's identity.
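  • a sketch of this level-dependent authentication policy, reusing the 99-point and 85-point thresholds from the example above; everything else is illustrative:

```python
THRESHOLDS = {"highest": 99, "lower": 85}   # per-security-level score thresholds

def authenticate(m1: float, m2: float, m3: float, level: str) -> bool:
    """All three matching degrees must exceed the threshold of the given level."""
    t = THRESHOLDS[level]
    return m1 > t and m2 > t and m3 > t
```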
  • the voiceprint models of one or more preset users are stored in the mobile phone, for example, the registered voiceprint features of preset user A, preset user B, and preset user C are stored in the mobile phone, the The registered voiceprint features all include a first registered voiceprint feature, a second registered voiceprint feature, and a third registered voiceprint feature. Then, the mobile phone can match the collected first voice component, second voice component and third voice component with the registered voiceprint feature of each preset user respectively according to the above method. Furthermore, the mobile phone may determine the preset user (eg preset user A) that satisfies the above-mentioned authentication policy and has the highest matching degree as the uttering user at this time.
  • after the mobile phone receives the first voice component, the second voice component and the third voice component in the voice information sent by the Bluetooth headset, it can fuse the first voice component, the second voice component and the third voice component for voiceprint recognition, for example, calculate the matching degree between the fused first, second and third voice components and the preset user's voiceprint model. Furthermore, the mobile phone can authenticate the user identity according to the matching degree. Since the preset user's voiceprint models are integrated into one in this identity authentication method, the complexity of the voiceprint model and the required storage space are correspondingly reduced; and since the voiceprint feature information of the second voice component is still used, it also retains the dual voiceprint protection and liveness detection functions.
  • the above algorithm for calculating the matching degree may be calculating the similarity.
  • the mobile phone performs feature extraction on the first voice component to obtain the first voiceprint feature, respectively calculates the first similarity between the first voiceprint feature and the pre-stored first registered voiceprint feature of the preset user, the second similarity between the second voiceprint feature and the pre-stored second registered voiceprint feature of the preset user, and the third similarity between the third voiceprint feature and the pre-stored third registered voiceprint feature of the preset user, and authenticates the user based on the first similarity, the second similarity and the third similarity.
  • similarity calculation methods include Euclidean distance, cosine similarity, Pearson correlation coefficient, adjusted cosine similarity, Hamming distance, Manhattan distance, etc., which are not limited in this application.
  • the way of authenticating the user may be that the mobile phone determines the first fusion coefficient corresponding to the first similarity, the second fusion coefficient corresponding to the second similarity and the third fusion coefficient corresponding to the third similarity; the fusion similarity score is obtained by fusing the first similarity, the second similarity and the third similarity according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient. If the fusion similarity score is greater than the first threshold, the mobile phone determines that the user who inputs voice information to the Bluetooth headset is a preset user.
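  • a minimal sketch of the fusion similarity score; the variable names and the example threshold are assumptions:

```python
def fusion_similarity_score(sim1: float, sim2: float, sim3: float,
                            c1: float, c2: float, c3: float) -> float:
    """Weighted fusion of the three similarities with their fusion coefficients."""
    return c1 * sim1 + c2 * sim2 + c3 * sim3

def is_preset_user(sim1, sim2, sim3, c1, c2, c3, first_threshold: float = 0.7) -> bool:
    return fusion_similarity_score(sim1, sim2, sim3, c1, c2, c3) > first_threshold
```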
  • the decibel level of the ambient sound is detected by the sound pressure sensor of the Bluetooth headset and sent to the mobile phone; the playback volume can be detected by the speaker of the Bluetooth headset and sent to the mobile phone, or can be obtained by the mobile phone itself by calling its own data.
  • optionally, the second fusion coefficient is negatively correlated with the decibel level of the ambient sound, the first fusion coefficient is negatively correlated with the decibel level of the playback volume, and the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value. That is to say, when the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a preset fixed value, the larger the decibel level of the ambient sound, the smaller the second fusion coefficient, and correspondingly the first fusion coefficient and the third fusion coefficient will increase adaptively, so as to keep the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient unchanged; similarly, the larger the playback volume, the smaller the first fusion coefficient, and correspondingly the second fusion coefficient and the third fusion coefficient will increase adaptively.
  • the fusion coefficients in this implementation can be understood as dynamic. In other words, the fusion coefficients change dynamically according to the ambient sound and the playback volume: they are determined dynamically according to the decibel level of the ambient sound detected by the microphone and the playback volume detected by the in-ear sensor.
  • if the decibel level of the ambient sound is large, it means that the environmental noise is high and the out-of-ear sensor of the Bluetooth headset can be considered to be more affected by the environmental noise; therefore, the voice control method provided in this application needs to reduce the fusion coefficient corresponding to the out-of-ear sensor of the Bluetooth headset, and the result of the fusion similarity score depends more on the in-ear sensor and the bone vibration sensor, which are less affected by environmental noise; on the contrary, if the playback volume is large, it means that the noise level of the playback sound in the ear canal is high, and it can be considered that the in-ear sensor of the Bluetooth headset is more affected by the playback sound, so the voice control method provided by the present application needs to reduce the fusion coefficient corresponding to the in-ear sensor, and the result of the fusion similarity score depends more on the out-of-ear sensor and the bone vibration sensor, which are less affected by the playback sound.
  • a look-up table can be set according to the above principles during system design, and in specific use, the fusion coefficient can be determined by looking up the table according to the monitored self-volume and ambient sound decibels.
  • Table 1-1 shows an example.
  • the fusion coefficients of the similarity scores of the speech signals collected by the in-ear speech sensor and the bone vibration sensor are denoted by a1 and a2, respectively, and the fusion coefficient of the similarity scores obtained by the speech signals collected by the out-of-ear speech sensor is denoted by b1.
  • when the detected ambient sound is large, the external environment at this time can be considered noisy, the voice signal collected by the out-of-ear voice sensor will be mixed with more ambient noise, and the fusion coefficient corresponding to the voice signal collected by the out-of-ear voice sensor can be set to a lower value or directly set to 0.
  • if the playback volume of the internal speaker of the headset exceeds 80% of the total volume, it can be considered that the volume inside the headset is too large, and the fusion coefficient corresponding to the voice signal collected by the in-ear voice sensor can be set to a lower value or directly set to 0.
  • in Table 1-1, "volume 20%" refers to "volume 10%-30%", "volume 40%" refers to "volume 30%-50%"; "ambient sound 20dB" refers to "ambient sound 10dB-30dB", and "ambient sound 40dB" refers to "ambient sound 30dB-50dB".
  • the above-mentioned specific design is only an example, and the specific parameter settings, threshold settings and coefficients corresponding to different ambient sound decibel levels and speaker volumes can be designed and modified according to the actual situation, which is not limited in this application.
  • the fusion coefficient provided in the embodiment of the present application can be understood as a "dynamic fusion coefficient", that is, the fusion coefficient can be dynamically adjusted according to different decibels of ambient sound and speaker volume.
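  • the look-up described around Table 1-1 might be sketched as follows, with a1 (in-ear), b1 (out-of-ear) and a2 (bone vibration) shrinking when their sensor is judged unreliable; every number below is an assumption, not a value taken from the table:

```python
def lookup_fusion_coefficients(volume_pct: float, ambient_db: float):
    """Return (a1, b1, a2) so that noisy surroundings shrink b1 and loud
    in-ear playback shrinks a1, while the coefficients still sum to 1."""
    a1, b1, a2 = 0.4, 0.3, 0.3          # defaults: in-ear, out-of-ear, bone
    if ambient_db >= 50:                 # very noisy environment
        b1 = 0.0
    elif ambient_db >= 30:
        b1 = 0.15
    if volume_pct >= 80:                 # playback volume above 80% of total
        a1 = 0.0
    elif volume_pct >= 50:
        a1 = 0.2
    total = a1 + b1 + a2                 # a2 stays > 0, so total never hits 0
    return a1 / total, b1 / total, a2 / total
```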
  • the strategy of performing identity authentication on the user based on the fusion of the first voiceprint recognition result, the second voiceprint recognition result and the third voiceprint recognition result may be changed to directly fusing the audio features: the voiceprint feature is extracted based on the fused audio feature and the voiceprint model, the similarity between the voiceprint feature and the pre-stored registered voiceprint feature of the preset user is calculated, and then identity authentication is performed.
  • the audio features feaE1 and feaE2 of each frame are extracted from the speech signal of the current user collected by the in-ear speech sensor and the out-of-ear speech sensor.
  • the audio feature feaB1 of each frame is extracted from the voice signal of the current user collected by the bone vibration sensor.
  • the above audio features feaE1, feaE2 and feaB1 are fused by methods including but not limited to the following: normalize feaE1, feaE2 and feaB1 to obtain feaE1', feaE2' and feaB1', and then splice them into a feature vector fea = [feaE1', feaE2', feaB1']. The voiceprint feature of the feature vector fea is then extracted through the voiceprint model to obtain the voiceprint feature of the current user. Similarly, for the registered voice of the registered user, the voiceprint feature of the registered user can be obtained by referring to the above method. The similarity between the voiceprint feature of the current user and the voiceprint feature of the registered user is computed to obtain a similarity score, and the relationship between the similarity score and the preset threshold is determined to obtain an authentication result.
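  • a sketch of this normalize-and-splice step; the zero-mean/unit-variance normalization is one possible choice, not the only one:

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / (x.std() + 1e-8)

def fuse_frame_features(feaE1: np.ndarray, feaE2: np.ndarray,
                        feaB1: np.ndarray) -> np.ndarray:
    """Splice the normalized per-frame features from the three sensors."""
    fea = np.concatenate([normalize(feaE1), normalize(feaE2), normalize(feaB1)])
    return fea   # fed to the voiceprint model to obtain the user's voiceprint feature
```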
  • the strategy of performing identity authentication on the user based on the fusion of the first similarity, the second similarity and the third similarity in S706 may be changed to: the first voiceprint feature, the second voiceprint feature and the third voiceprint feature are fused to obtain a fused voiceprint feature, the similarity between the fused voiceprint feature and the pre-stored registered fused voiceprint feature of the preset user is calculated, and then identity authentication is performed.
  • features are extracted from the voice signal of the current user collected from the in-ear voice sensor and the out-of-ear voice sensor through a voiceprint model, to obtain voiceprint features e1 and e2.
  • the voiceprint feature b1 is obtained by extracting features from the current user's voice signal collected by the bone vibration sensor through the voiceprint model.
  • the voiceprint features e1, e2 and b1 are spliced to obtain the spliced voiceprint feature of the current user; similarly, for the registered voice of the registered user, the spliced voiceprint feature of the registered user can be obtained by referring to the above method. The spliced voiceprint feature of the current user is compared with the spliced voiceprint feature of the registered user to obtain a similarity score, and the relationship between the similarity score and a preset threshold is determined to obtain an authentication result.
  • the mobile phone executes the operation instruction corresponding to the above-mentioned voice information.
  • if the mobile phone determines that the user who inputs the voice information in step S702 is the preset user, the mobile phone can execute the operation instruction corresponding to the above voice information; if the authentication fails, the subsequent operation instructions are not executed.
  • the operation instruction includes, but is not limited to, the unlocking operation of the mobile phone or the confirming payment operation. For example, when the above voice message is "Little E, use WeChat to pay", the corresponding operation instruction is to open the payment interface of the WeChat APP. In this way, after the mobile phone generates an operation instruction for opening the payment interface in the WeChat APP, the WeChat APP can be automatically opened, and the payment interface in the WeChat APP can be displayed.
  • since the mobile phone has determined that the above-mentioned user is the preset user, as shown in Figure 9, if the mobile phone is currently in a locked state, the mobile phone can also unlock the screen first, then execute the operation instruction to open the payment interface in the WeChat APP, and display the payment interface 901 in the WeChat APP.
  • the voice control method provided in the above steps S701-S707 may be a function provided by the voice assistant APP.
  • when the Bluetooth headset interacts with the mobile phone, if it is determined through voiceprint recognition that the uttering user at this time is the preset user, the mobile phone can send the generated operation instructions, voice information and other data to the voice assistant APP running at the application layer. Furthermore, the voice assistant APP invokes the relevant interface or service of the application framework layer to execute the operation instruction corresponding to the above voice information.
  • the voice control method provided in the embodiments of the present application can unlock the mobile phone and execute the relevant operation instructions in the voice information while using the voiceprint to identify the user's identity. That is, a user only needs to input a voice message once to complete a series of operations such as user identity authentication, unlocking the mobile phone, and opening a certain function of the mobile phone, thereby greatly improving the user's control efficiency and user experience on the mobile phone.
  • the voice control method may include:
  • a mobile phone and a Bluetooth headset establish a Bluetooth connection.
  • the Bluetooth headset detects whether it is in a wearing state.
  • the Bluetooth headset acquires the first voice component in the voice information input by the user by collecting through the first voice sensor, collects the second voice component in the above voice information through the second voice sensor, and vibrates through the bone The sensor collects the third voice component in the voice information.
  • for the specific method by which the Bluetooth headset establishes a Bluetooth connection with the mobile phone, detects whether the Bluetooth headset is in a wearing state, and collects the first voice component, the second voice component and the third voice component in the voice information, please refer to the related descriptions of the above steps S701-S703, which will not be repeated here.
  • the Bluetooth headset can also perform operations such as enhancement, noise reduction or filtering on the detected first voice component and second voice component.
  • since the Bluetooth headset has an audio playback function, when the speaker of the Bluetooth headset is working, the air conduction microphone and the bone conduction microphone on the Bluetooth headset may receive the echo signal of the sound source played by the speaker. Therefore, after the Bluetooth headset obtains the above-mentioned first voice component and second voice component, an adaptive echo cancellation (AEC) algorithm can also be used to eliminate the echo signals in the first voice component and the second voice component, so as to improve the accuracy of subsequent voiceprint recognition.
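  • echo cancellation is commonly implemented with an adaptive filter such as NLMS; the sketch below is a simplified single-channel version with illustrative filter length and step size, assuming the speaker (far-end) signal is available as a reference of the same length as the microphone signal:

```python
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     taps: int = 128, mu: float = 0.1, eps: float = 1e-6) -> np.ndarray:
    """Subtract an adaptively estimated echo of `ref` from `mic`."""
    w = np.zeros(taps)                    # adaptive filter weights
    buf = np.zeros(taps)                  # most recent reference samples
    out = np.zeros(len(mic), dtype=float)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_est = np.dot(w, buf)
        e = mic[n] - echo_est             # echo-cancelled sample
        out[n] = e
        w += (mu / (np.dot(buf, buf) + eps)) * e * buf   # NLMS weight update
    return out
```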
  • the Bluetooth headset performs voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, to obtain a first voiceprint recognition result corresponding to the first voice component, a second voiceprint recognition result corresponding to the second voice component and a third voiceprint recognition result corresponding to the third voice component.
  • one or more voiceprint models and preset registered voiceprint features of the user may be pre-stored in the Bluetooth headset.
  • the Bluetooth headset can use the voiceprint models locally stored in the Bluetooth headset to perform voiceprint recognition on the first voice component, the second voice component and the third voice component, obtain the voiceprint features corresponding to each voice component respectively, and compare the obtained voiceprint features with the corresponding registered voiceprint features, thereby performing voiceprint recognition.
  • for the specific method, refer to the description in step S705 of the mobile phone performing voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, which is not repeated here.
  • the Bluetooth headset authenticates the user identity according to the first voiceprint recognition result, the second voiceprint recognition result and the third voiceprint recognition result.
  • for the process of the Bluetooth headset authenticating the user identity according to the first voiceprint recognition result, the second voiceprint recognition result and the third voiceprint recognition result, refer to the description in the above step S706 of the mobile phone authenticating the user identity according to the first voiceprint recognition result, the second voiceprint recognition result and the third voiceprint recognition result, which will not be repeated here.
  • the Bluetooth headset sends an operation instruction corresponding to the above-mentioned voice information to the mobile phone through a Bluetooth connection.
  • if the Bluetooth headset determines that the user who inputs the voice information is a preset user, the Bluetooth headset can generate an operation instruction corresponding to the voice information.
  • for examples of the operation instruction, refer to the examples of the operation instructions of the mobile phone in the above step S707, and details are not repeated here.
  • the Bluetooth headset can also send a message or an unlocking instruction to the mobile phone that the user's identity has been authenticated, so that the mobile phone can unlock the screen first, and then Execute the operation instruction corresponding to the above voice information.
  • the Bluetooth headset can also send the collected voice information to the mobile phone, and the mobile phone generates a corresponding operation instruction according to the voice information, and executes the operation instruction.
  • when the Bluetooth headset sends the above-mentioned voice information or corresponding operation instructions to the mobile phone, it can also send its own device identifier (eg, MAC address) to the mobile phone. Since the identifier of the authenticated preset Bluetooth device is stored in the mobile phone, the mobile phone can determine whether the currently connected Bluetooth headset is the preset Bluetooth device according to the received device identifier. If the Bluetooth headset is a preset Bluetooth device, the mobile phone can further execute the operation instructions sent by the Bluetooth headset, or perform voice recognition and other operations on the voice information sent by the Bluetooth headset; otherwise, the mobile phone can discard the data sent by the Bluetooth headset, so as to avoid security problems caused by illegal Bluetooth devices maliciously manipulating the mobile phone.
  • the mobile phone and the preset Bluetooth device may agree in advance on a password or passcode for transmitting the above-mentioned operation instruction.
  • when the Bluetooth headset sends the above-mentioned voice information or corresponding operation instructions to the mobile phone, it can also send the pre-agreed password or passcode to the mobile phone, so that the mobile phone can determine whether the currently connected Bluetooth headset is a preset Bluetooth device.
  • the mobile phone and the preset Bluetooth device may also agree in advance on the encryption and decryption algorithms used when transmitting the above operation instruction.
  • the operation instruction can be encrypted by using an agreed encryption algorithm.
  • the mobile phone receives the encrypted operation command, if the above-mentioned operation command can be decrypted using the agreed decryption algorithm, it means that the currently connected Bluetooth headset is a preset Bluetooth device, and the mobile phone can further execute the operation command sent by the Bluetooth headset; Otherwise, it indicates that the currently connected Bluetooth headset is an illegal Bluetooth device, and the mobile phone can discard the operation command sent by the Bluetooth headset.
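  • as an illustration of the pre-agreed encryption scheme, a symmetric-key sketch is shown below using the cryptography package's Fernet primitive; the key handling is simplified (in practice the key would be agreed and provisioned in advance, not generated on the spot):

```python
from cryptography.fernet import Fernet, InvalidToken  # assumed available

SHARED_KEY = Fernet.generate_key()   # placeholder for the pre-agreed key

def headset_send(operation_instruction: str) -> bytes:
    """Headset side: encrypt the operation instruction with the agreed key."""
    return Fernet(SHARED_KEY).encrypt(operation_instruction.encode())

def phone_receive(token: bytes):
    """Phone side: decryptable -> preset device; otherwise discard as illegal."""
    try:
        return Fernet(SHARED_KEY).decrypt(token).decode()
    except InvalidToken:
        return None   # illegal Bluetooth device: discard the operation instruction
```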
  • steps S701-S707 and steps S1001-S1007 are only two implementation manners of the voice control method provided in this application. It can be understood that those skilled in the art can set which steps in the foregoing embodiments are performed by the Bluetooth headset and which steps are performed by the mobile phone according to actual application scenarios or actual experience, which is not limited in this embodiment of the present application.
  • the voice control method provided by the present application may also use a server as an execution subject, that is, the Bluetooth headset establishes a connection with the server, and the server implements the functions of the mobile phone in the above embodiment, and the specific process is not repeated here.
  • the Bluetooth headset performs voiceprint recognition on the first voice component, the second voice component and the third voice component, sends the obtained first voiceprint recognition result, second voiceprint recognition result and third voiceprint recognition result to the mobile phone, and the mobile phone performs user identity authentication and other operations based on the voiceprint recognition results.
  • the Bluetooth headset can also determine, after acquiring the first voice component, the second voice component and the third voice component, whether voiceprint recognition needs to be performed on them. If voiceprint recognition needs to be performed on the first voice component, the second voice component and the third voice component, the Bluetooth headset can send the first voice component, the second voice component and the third voice component to the mobile phone, and the mobile phone then completes the subsequent voiceprint recognition, user identity authentication and other operations; otherwise, the Bluetooth headset does not need to send the first voice component, the second voice component and the third voice component to the mobile phone, so as to avoid increasing the power consumption of the mobile phone for processing the first voice component to the third voice component.
  • the user can also enter the setting interface 1101 of the mobile phone to enable or disable the above-mentioned voice control function.
  • the user can set the keywords that trigger voice control through the setting button 1102, such as "Xiao E", "payment", etc.; the user can also manage the preset user's voiceprint model through the setting button 1103, for example, adding or deleting a preset user's voiceprint model; and the user can use the setting button 1104 to set the operation instructions that the voice assistant can support, such as payment, making a phone call, ordering a meal, and so on. In this way, users can get a customized voice control experience.
  • the embodiments of the present application disclose a voice control device.
  • the voice control device includes a voice information acquisition unit 1201 , an identification unit 1202 , an identity information acquisition unit 1203 and an execution unit 1204.
  • the voice control device itself can be a terminal or a wearable device; the voice control device can be fully integrated into the wearable device, or the wearable device and the terminal can be combined into a voice control system, that is, part of the units are located in the wearable device and part of the units are located in the terminal.
  • the voice control device may be fully integrated into a Bluetooth headset.
  • the voice information obtaining unit 1201 is used to obtain the voice information of the user.
  • the user can input voice information into the Bluetooth headset when wearing the Bluetooth headset.
  • the in-ear voice sensor collects the first voice component
  • the out-of-ear voice sensor collects the second voice component
  • the bone vibration sensor collects the third voice component.
  • the recognition unit 1202 is configured to perform voiceprint recognition on the first voice component, the second voice component and the third voice component respectively, to obtain a first voiceprint recognition result corresponding to the first voice component, a second voiceprint recognition result corresponding to the second voice component and a third voiceprint recognition result corresponding to the third voice component.
  • the identification unit 1202 may also be used to perform keyword detection on the voice information input by the user to the Bluetooth headset.
  • when the voice information includes preset keywords, voiceprint recognition is performed on the first voice component, the second voice component and the third voice component respectively; or, the recognition unit 1202 may be configured to detect user input, and when receiving a preset operation input by the user, perform voiceprint recognition on the first voice component, the second voice component and the third voice component respectively.
  • the user input may be the user's input to the Bluetooth headset through a touch screen or a key, for example, the user clicks an unlock key of the Bluetooth headset.
  • the acquisition unit 1201 may also acquire the wearing state detection result, and when the wearing state detection passes, the recognition unit 1202 performs keyword detection on the voice information, or detects user input.
  • the recognition unit 1202 is specifically configured to: perform feature extraction on the first voice component to obtain the first voiceprint feature, and calculate the first similarity between the first voiceprint feature and the preset user's first registered voiceprint feature, where the first registered voiceprint feature is obtained by feature extraction of the first registered voice through the first voiceprint model and is used to reflect the audio features of the preset user collected by the in-ear voice sensor; perform feature extraction on the second voice component to obtain the second voiceprint feature, and calculate the second similarity between the second voiceprint feature and the preset user's second registered voiceprint feature, where the second registered voiceprint feature is obtained by feature extraction of the second registered voice through the second voiceprint model and is used to reflect the audio features of the preset user collected by the out-of-ear voice sensor; perform feature extraction on the third voice component to obtain the third voiceprint feature, and calculate the third similarity between the third voiceprint feature and the preset user's third registered voiceprint feature, where the third registered voiceprint feature is obtained by feature extraction of the third registered voice through the third voiceprint model and is used to reflect the audio features of the preset user collected by the bone vibration sensor.
  • the first registered voiceprint feature is obtained by feature extraction through the first voiceprint model, and the first registered voiceprint feature is used to reflect the preset user's voiceprint collected by the in-ear voice sensor feature;
  • the second registered voiceprint feature is obtained by feature extraction through the second voiceprint model, and the second registered voiceprint feature is used to reflect the voiceprint feature of the preset user collected by the out-of-ear voice sensor;
  • the third registered voiceprint feature is obtained by feature extraction through the third voiceprint model, and the third registered voiceprint feature is used to reflect the voiceprint feature of the preset user collected by the bone vibration sensor.
  • the identity information obtaining unit 1203 is used to obtain user identity information for user identity authentication; specifically, according to the decibel level of the ambient sound and the playback volume, it respectively determines the first fusion coefficient corresponding to the first similarity, the second fusion coefficient corresponding to the second similarity and the third fusion coefficient corresponding to the third similarity; the first similarity, the second similarity and the third similarity are fused according to the first fusion coefficient, the second fusion coefficient and the third fusion coefficient to obtain a fusion similarity score. If the fusion similarity score is greater than the first threshold, the mobile phone determines that the user who inputs voice information to the Bluetooth headset is a preset user.
  • the decibel number of the ambient sound is detected by the sound pressure sensor of the Bluetooth headset, and the playback volume can be detected by the speaker of the Bluetooth headset by detecting the playback signal.
  • optionally, the second fusion coefficient is negatively correlated with the decibel level of the ambient sound, the first fusion coefficient is negatively correlated with the decibel level of the playback volume, and the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a fixed value. That is to say, when the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient is a preset fixed value, the larger the decibel level of the ambient sound, the smaller the second fusion coefficient, and correspondingly the first fusion coefficient and the third fusion coefficient will increase adaptively, so as to keep the sum of the first fusion coefficient, the second fusion coefficient and the third fusion coefficient unchanged; similarly, the larger the playback volume, the smaller the first fusion coefficient, and correspondingly the second fusion coefficient and the third fusion coefficient will increase adaptively.
  • the above-mentioned variable fusion coefficient can take into account the recognition accuracy in different application scenarios (in the case of a large noise environment or when the headphones are playing music).
  • the execution unit 1204 is configured to execute the operation instruction corresponding to the voice information, and the operation instruction includes an unlock instruction, a payment instruction, a shutdown instruction, an application opening instruction or a call instruction.
  • the voice control method provided by the above-mentioned embodiments of the present application adds a method for collecting voiceprint features through an in-ear voice sensor.
  • the ear canal will form a closed cavity, and the sound will have a certain amplification effect in the cavity, that is, the cavity effect. Therefore, the sound collected by the in-ear voice sensor will be clearer, especially for high-frequency sound signals. It can make up for the distortion problem caused by the loss of high-frequency signal components of part of the voice information when the bone vibration sensor collects voice information, improve the overall voiceprint collection effect of the headset and the accuracy of voiceprint recognition, thereby improving user experience.
  • dynamic fusion coefficients are used to fuse the voiceprint recognition results obtained from speech signals with different attributes.
  • the complementarity of speech signals with different attributes can improve the robustness and accuracy of voiceprint recognition. For example, the recognition accuracy can be significantly improved in a noisy environment or when the earphones are playing music.
  • voice signals with different attributes can also be understood as voice signals obtained by different sensors (in-ear voice sensor, out-of-ear voice sensor, bone vibration sensor).
  • FIG. 13 is a schematic diagram of a wearable device 1300 provided by an embodiment of the present application.
  • the wearable device shown in FIG. 13 includes a memory 1301 , a processor 1302 , a communication interface 1303 , a bus 1304 , an in-ear voice sensor 1305 , an out-of-ear voice sensor 1306 , and a bone vibration sensor 1307 .
  • the memory 1301 , the processor 1302 , and the communication interface 1303 are connected to each other through the bus 1304 for communication.
  • the memory 1301 is coupled to the processor 1302, and the memory 1301 is used to store computer program codes.
  • the computer program codes include computer instructions. When the processor 1302 executes the computer instructions, the wearable device can execute the voice control method described in the above embodiments.
  • the in-ear voice sensor 1305 is used to collect the first voice component of the voice information
  • the out-of-ear voice sensor 1306 is used to collect the second voice component of the voice information
  • the bone vibration sensor 1307 is used to collect the third voice component of the voice information.
  • the memory 1301 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 1301 may store a program. When the program stored in the memory 1301 is executed by the processor 1302, the processor 1302 and the communication interface 1303 are used to execute each step of the voice control method of the embodiment of the present application.
  • the processor 1302 may be a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a graphics processing unit (graphics processing unit, GPU) or one or more integrated circuits, and is used to execute the relevant program to realize the functions required to be performed by the units in the voice control apparatus of the embodiments of the present application, or to execute the voice control method of the method embodiments of the present application.
  • the processor 1302 can also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the voice control method of the present application can be completed by an integrated logic circuit of hardware in the processor 1302 or an instruction in the form of software.
  • the above-mentioned processor 1302 can also be a general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 1301; the processor 1302 reads the information in the memory 1301 and, in combination with its hardware, completes the functions required to be performed by the units included in the voice control apparatus of the embodiments of the present application, or executes the voice control method of the method embodiments of the present application.
  • the communication interface 1303 uses a transceiver apparatus such as, but not limited to, a transceiver, and can perform wired communication or wireless communication, so as to implement communication between the wearable device 1300 and other devices or a communication network.
  • the wearable device can establish a communication connection with the terminal device through the communication interface 1303 .
  • Bus 1304 may include a pathway for communicating information between various components of device 1300 (eg, memory 1301, processor 1302, communication interface 1303).
  • FIG. 14 is a schematic diagram of a terminal provided by an embodiment of the present application.
  • the terminal shown in FIG. 14 includes a touch screen 1401 , a processor 1402 , a memory 1403 , one or more computer programs 1404 , a bus 1405 , and a communication interface 1408 .
  • the touch screen 1401 includes a touch-sensitive surface 1406 and a display screen 1407, and the terminal may also include one or more application programs (not shown).
  • the various devices described above may be connected by one or more communication buses 1405 .
  • the memory 1403 is coupled to the processor 1402, and the memory 1403 is used for storing computer program codes.
  • the computer program codes include computer instructions.
  • when the processor 1402 executes the computer instructions, the terminal can execute the voice control method described in the above embodiments.
  • the touch screen 1401 is used to interact with the user, and can receive input information from the user.
  • the user enters input to the phone through the touch-sensitive surface 1406, eg, the user clicks an unlock key displayed on the touch-sensitive surface 1406 of the phone.
  • the memory 1403 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 1403 may store a program. When the program stored in the memory 1403 is executed by the processor 1402, the processor 1402 and the communication interface 1408 are used to execute each step of the voice control method of the embodiment of the present application.
  • the processor 1402 may be a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a graphics processing unit (graphics processing unit, GPU) or one or more integrated circuits, and is used to execute the relevant program to realize the functions required to be performed by the units in the voice control apparatus of the embodiments of the present application, or to execute the voice control method of the method embodiments of the present application.
  • the processor 1402 can also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the voice control method of the present application can be completed by an integrated logic circuit of hardware in the processor 1402 or an instruction in the form of software.
  • the above-mentioned processor 1402 can also be a general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 1403; the processor 1402 reads the information in the memory 1403 and, in combination with its hardware, completes the functions required to be performed by the units included in the voice control apparatus of the embodiments of the present application, or executes the voice control method of the method embodiments of the present application.
  • the communication interface 1408 uses a transceiver device such as but not limited to a transceiver, and can perform wired communication or wireless communication, so as to realize communication between the terminal 1400 and other devices or a communication network.
  • the terminal may establish a communication connection with the wearable device through the communication interface 1408 .
  • Bus 1405 may include a pathway for communicating information between various components of device 1400 (eg, touch screen 1401, memory 1403, processor 1402, communication interface 1408).
  • although the wearable device 1300 and the terminal 1400 shown in FIG. 13 and FIG. 14 only show a memory, a processor, a communication interface and the like, in a specific implementation process those skilled in the art should understand that the wearable device 1300 and the terminal 1400 also include other components necessary for proper operation. Meanwhile, according to specific needs, those skilled in the art should understand that the wearable device 1300 and the terminal 1400 may further include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the wearable device 1300 and the terminal 1400 may also include only the devices necessary for implementing the embodiments of the present application, and need not include all the devices shown in FIG. 13 or FIG. 14.
  • FIG. 15 is a schematic diagram of the chip system.
  • the chip system includes at least one processor 1501 , at least one interface circuit 1502 and a bus 1503 .
  • the processor 1501 and the interface circuit 1502 may be interconnected by wires.
  • interface circuit 1502 may be used to receive signals from other devices, such as the memory of a voice-controlled device.
  • the interface circuit 1502 may be used to send signals to other devices (eg, the processor 1501).
  • the interface circuit 1502 may read the instructions stored in the memory and send the instructions to the processor 1501 .
  • when the processor 1501 executes the instructions, the voice control apparatus can be made to execute the steps in the above embodiments.
  • the chip system may also include other discrete devices, which are not specifically limited in this embodiment of the present application.
  • Another embodiment of the present application further provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium; when the computer instructions run on the voice control apparatus, the voice control apparatus is caused to execute each step performed by the recognition apparatus in the method flow shown in the above method embodiments.
  • Another embodiment of the present application further provides a computer program product, where computer instructions are stored in the computer program product; when the instructions run on the voice control apparatus, the voice control apparatus is caused to execute each step performed by the recognition apparatus in the method flow shown in the above method embodiments.
  • the disclosed methods may be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium or on other non-transitory media or articles of manufacture.
  • the computer program product is provided using a signal bearing medium.
  • the signal bearing medium may include one or more program instructions, which, when executed by one or more processors, may implement the functions of the voice control method of the embodiments of the present application.
  • one or more features of S701-S707 described with reference to FIG. 7 may be undertaken by one or more instructions associated with the signal bearing medium.
  • signal bearing media may include computer readable media such as, but not limited to, hard drives, compact discs (CDs), digital video discs (DVDs), digital tapes, memories, read-only memory (ROM) or random access memory (RAM), etc.
  • the signal bearing medium may include computer recordable media such as, but not limited to, memory, read/write (R/W) CDs, R/W DVDs, and the like.
  • signal bearing media may include communication media such as, but not limited to, digital and/or analog communication media (eg, fiber optic cables, waveguides, wired communication links, wireless communication links, etc.).
  • a signal-bearing medium may be conveyed by a wireless form of communication medium (e.g., one that conforms to the IEEE 802.16 standard or other transmission protocol).
  • the one or more program instructions may be, for example, computer-executable instructions or logic-implemented instructions.
  • Each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • a computer-readable storage medium includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: flash memory, removable hard disk, read-only memory, random access memory, magnetic disk or optical disk and other media that can store program codes.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the above is only a specific implementation of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto; any change or substitution within the technical scope disclosed in the embodiments of the present application shall be covered by the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • User Interface Of Digital Computer (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The present application provides a voice control method and apparatus, a wearable device and a terminal, which can improve the voiceprint collection effect and the accuracy of voiceprint recognition when a user uses the voice control apparatus. The method includes: obtaining voice information of a user; obtaining identity information of the user according to a first voiceprint recognition result of a first voice component in the voice information, a second voiceprint recognition result of a second voice component in the voice information, and a third voiceprint recognition result of a third voice component in the voice information, where the first voice component is collected by an in-ear voice sensor of a wearable device, the second voice component is collected by an out-of-ear voice sensor of the wearable device, and the third voice component is collected by a bone vibration sensor of the wearable device; and when the identity information of the user matches preset information, executing an operation instruction.

Description

一种语音控制方法和装置 技术领域
本申请涉及音频处理技术领域,尤其涉及一种语音控制方法和装置。
背景技术
现有技术通常采用两个语音传感器采集两路声音信号进行声纹识别,从而对发声用户进行身份鉴权。即需要两路语音分量的声纹识别结果都为匹配才会判定为预设用户。骨振动传感器是一种常见的语音传感器,声音在骨头中传播时会引起骨头的振动,骨振动传感器感应骨头的振动,并将振动信号转换为电信号来实现声音的收集。
如果两个语音传感器中有一路使用了骨振动传感器,由于当前的骨振动传感器,往往只能采集到说话人发音信号的低频成分(通常1KHz以下),高频成分会损失掉,这对声纹识别是不利的,会导致声纹识别不准确的问题。
发明内容
本申请提供一种语音控制方法及装置,可能够解决当使用骨振动传感器时,高频成分会损失掉,导致声纹识别不准确的问题。
为达到上述目的,本申请采用如下技术方案:
第一方面,本申请提供一种语音控制方法,包括:获取用户的语音信息,该语音信息包括第一语音分量,第二语音分量和第三语音分量,第一语音分量是由耳内语音传感器采集到的,第二语音分量是由耳外语音传感器采集到的,第三语音分量是由骨振动传感器采集到的;分别对第一语音分量,第二语音分量和第三语音分量进行声纹识别;根据语音信息中第一语音分量的第一声纹识别结果、语音信息中第二语音分量的第二声纹识别结果和语音信息中第三语音分量的第三声纹识别结果,得到用户的身份信息;当用户的身份信息与预设的信息匹配时,执行操作指令,其中,操作指令是根据语音信息确定的。
其中,由于当用户佩戴可穿戴设备之后外耳道与中耳道会形成一个封闭的腔室,声音在腔室里有一定的放大作用,即空腔效应,因此,耳内语音传感器采集到的声音会更加清晰,尤其对于高频声音信号具有明显的增强作用。由于可穿戴设备在采集声音时用到了耳内语音传感器,能够弥补骨振动传感器在采集语音信息时,会丢失部分语音信息的高频信号分量所造成的失真问题,因此能够提升可穿戴设备整体的声纹采集效果和声纹识别的准确度,从而提升用户体验。
在进行声纹识别之前,需要先分别获取语音分量,多路语音分量的获取,能够提升声纹识别的准确性与抗干扰能力。
在一种可能的实现方式中,对第一语音分量、第二语音分量和第三语音分量进行声纹识别之前,还包括:对语音信息进行关键词检测,或者,对用户输入进行检测。可选的,当语音信息中包括预设的关键词时,分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别;或者,当接收到用户输入的预设操作时,分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别。否则,说明用户此时没有进行声纹识别的需求,则终端或可穿戴设备无需开启声纹识别功能,从而降低终端或可穿戴设备的功耗。
在一种可能的实现方式中,对语音信息进行关键词检测或者对用户输入进行检测之前,还包括:获取穿戴设备的佩戴状态检测结果。可选的,当佩戴状态检测结果通过时,对语音 信息进行关键词检测,或者,对用户输入进行检测。否则,说明用户此时没有佩戴可穿戴设备,当然也就没有进行声纹识别的需求,则终端或可穿戴设备无需开启关键词检测功能,从而降低终端或可穿戴设备的功耗。
在一种可能的实现方式中,对第一语音分量进行声纹识别的具体过程为:
对第一语音分量进行特征提取,得到第一声纹特征,计算第一声纹特征与用户的第一注册声纹特征的第一相似度,第一注册声纹特征是第一注册语音经过第一声纹模型进行特征提取得到的,第一注册声纹特征用于反映耳内语音传感器采集到的用户的预设音频特征。通过计算相似度的方法来进行声纹识别,能够提升声纹识别的准确性。
在一种可能的实现方式中,对第二语音分量进行声纹识别的具体过程为:
对第二语音分量进行特征提取,得到第二声纹特征,计算第二声纹特征与用户的第二注册声纹特征的第二相似度,第二注册声纹特征是第二注册语音经过第二声纹模型进行特征提取得到的,第二注册声纹特征用于反映耳外语音传感器采集到的用户的预设音频特征。通过计算相似度的方法来进行声纹识别,能够提升声纹识别的准确性。
在一种可能的实现方式中,对第三语音分量进行声纹识别的具体过程为:
对第三语音分量进行特征提取,得到第三声纹特征,计算第三声纹特征与用户的第三注册声纹特征的第三相似度,第三注册声纹特征是第三注册语音经过第三声纹模型进行特征提取得到的,第三注册声纹特征用于反映骨振动传感器采集到的用户的预设音频特征。通过计算相似度的方法来进行声纹识别,能够提升声纹识别的准确性。
在一种可能的实现方式中,根据语音信息中第一语音分量的第一声纹识别结果、语音信息中第二语音分量的第二声纹识别结果和语音信息中第三语音分量的第三声纹识别结果,得到所述用户的身份信息,具体可以通过动态融合系数的方式,融合各个声纹识别结果,来得到所述用户的身份信息,具体可以为:
确定第一相似度对应的第一融合系数,第二相似度对应的第二融合系数,第三相似度对应的第三融合系数;根据第一融合系数、第二融合系数和第三融合系数融合第一相似度、第二相似度和第三相似度,得到融合相似度得分,若融合相似度得分大于第一阈值,则确定该用户的身份信息与预设身份信息匹配。通过融合多个相似度得到融合相似度得分并进行判断的方法,能够有效提升声纹识别的准确性。
在一种可能的实现方式中,确定第一融合系数、第二融合系数和第三融合系数,具体可以根据声压传感器得到环境声的分贝数;根据扬声器的播放信号,确定播放音量;根据环境声的分贝数和播放音量,分别确定第一融合系数、第二融合系数和第三融合系数,其中:第二融合系数与环境声的分贝数呈负相关,第一融合系数、第三融合系数分别与播放音量的分贝数呈负相关,第一融合系数、第二融合系数和第三融合系数的和为固定值。可选的,上述声压传感器和扬声器为可穿戴设备的声压传感器和扬声器。
由于本申请实施例在相似度融合时采用了动态融合系数,针对不同的应用环境,采用动态的融合系数对具有不同属性的语音信号获得的声纹识别结果进行融合,利用这些不同属性的语音信号的互补性可以提升声纹识别的鲁棒性和准确率。例如,在噪声环境较大或耳机播放音乐的情况下能够显著提升的识别准确率。其中,不同属性的语音信号也可以理解为通过不同的传感器(耳内语音传感器、耳外语音传感器、骨振动传感器)获取到的语音信号。
在一种可能的实现方式中,操作指令包括解锁指令、支付指令、关机指令、打开应用程序指令或呼叫指令。这样,用户只需要输入一次语音信息即可完成用户身份鉴权、以及执行某一功能等一些列操作,从而大大提高了用户的操控效率和用户体验。
第二方面,本申请提供一种语音控制方法,该语音控制方法应用于可穿戴设备,换句话说,该语音控制方法的执行主体为可穿戴设备,该方法具体如下:包括:可穿戴设备获取用户语音信息,该语音信息包括第一语音分量,第二语音分量和第三语音分量,第一语音分量是由耳内语音传感器采集到的,第二语音分量是由耳外语音传感器采集到的,第三语音分量是由骨振动传感器采集到的;分别对第一语音分量,第二语音分量和第三语音分量进行声纹识别;可穿戴设备根据语音信息中第一语音分量的第一声纹识别结果、语音信息中第二语音分量的第二声纹识别结果和语音信息中第三语音分量的第三声纹识别结果,得到用户的身份信息;当用户的身份信息与预设的信息匹配时,执行操作指令,其中,操作指令是根据语音信息确定的。
其中,由于当用户佩戴可穿戴设备之后外耳道与中耳道会形成一个封闭的腔室,声音在腔室里有一定的放大作用,即空腔效应,因此,耳内语音传感器采集到的声音会更加清晰,尤其对于高频声音信号具有明显的增强作用。由于可穿戴设备在采集声音时用到了耳内语音传感器,能够弥补骨振动传感器在采集语音信息时,会丢失部分语音信息的高频信号分量所造成的失真问题,因此能够提升可穿戴设备整体的声纹采集效果和声纹识别的准确度,从而提升用户体验。
在可穿戴设备进行声纹识别之前,可穿戴设备需要先分别获取语音分量,可穿戴设备通过耳内语音传感器、耳外语音传感器和骨振动传感器这种不同的传感器获取三路语音分量,能够提升声纹识别的准确性与抗干扰能力。
在一种可能的实现方式中,可穿戴设备对第一语音分量、第二语音分量和第三语音分量进行声纹识别之前,还包括:可穿戴设备对语音信息进行关键词检测,或者,对用户输入进行检测。可选的,当语音信息中包括预设的关键词时,可穿戴设备分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别;或者,当接收到用户输入的预设操作时,可穿戴设备分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别。否则,说明用户此时没有进行声纹识别的需求,则可穿戴设备无需开启声纹识别功能,从而降低了可穿戴设备的功耗。
在一种可能的实现方式中,可穿戴设备对语音信息进行关键词检测或者对用户输入进行检测之前,还包括:获取可穿戴设备的佩戴状态检测结果。可选的,当佩戴状态检测结果通过时,对语音信息进行关键词检测,或者,对用户输入进行检测。否则,说明用户此时没有佩戴可穿戴设备,当然也就没有进行声纹识别的需求,则可穿戴设备无需开启关键词检测功能,从而降低了可穿戴设备的功耗。
在一种可能的实现方式中,可穿戴设备对第一语音分量进行声纹识别的具体过程为:
可穿戴设备对第一语音分量进行特征提取,得到第一声纹特征,可穿戴设备计算第一声纹特征与用户的第一注册声纹特征的第一相似度,第一注册声纹特征是第一注册语音经过第一声纹模型进行特征提取得到的,第一注册声纹特征用于反映耳内语音传感器采集到的用户的预设音频特征。通过计算相似度的方法来进行声纹识别,能够提升声纹识别的准确性。
在一种可能的实现方式中,可穿戴设备对第二语音分量进行声纹识别的具体过程为:
可穿戴设备对第二语音分量进行特征提取,得到第二声纹特征,可穿戴设备计算第二声纹特征与用户的第二注册声纹特征的第二相似度,第二注册声纹特征是第二注册语音经过第二声纹模型进行特征提取得到的,第二注册声纹特征用于反映耳外语音传感器采集到的用户的预设音频特征。通过计算相似度的方法来进行声纹识别,能够提升声纹识别的准确性。
在一种可能的实现方式中,可穿戴设备对第三语音分量进行声纹识别的具体过程为:
可穿戴设备对第三语音分量进行特征提取,得到第三声纹特征,可穿戴设备计算第三声纹特征与用户的第三注册声纹特征的第三相似度,第三注册声纹特征是第三注册语音经过第三声纹模型进行特征提取得到的,第三注册声纹特征用于反映骨振动传感器采集到的用户的预设音频特征。通过计算相似度的方法来进行声纹识别,能够提升声纹识别的准确性。
在一种可能的实现方式中,可穿戴设备根据语音信息中第一语音分量的第一声纹识别结果、语音信息中第二语音分量的第二声纹识别结果和语音信息中第三语音分量的第三声纹识别结果,得到所述用户的身份信息,具体可以通过动态融合系数的方式,融合各个声纹识别结果,来得到所述用户的身份信息,具体可以为:
可穿戴设备确定第一相似度对应的第一融合系数,第二相似度对应的第二融合系数,第三相似度对应的第三融合系数;可穿戴设备根据第一融合系数、第二融合系数和第三融合系数融合第一相似度、第二相似度和第三相似度,得到融合相似度得分,若融合相似度得分大于第一阈值,则确定该用户的身份信息与预设身份信息匹配。通过融合多个相似度得到融合相似度得分并进行判断的方法,能够有效提升声纹识别的准确性。
在一种可能的实现方式中,可穿戴设备确定第一融合系数、第二融合系数和第三融合系数,具体可以根据声压传感器得到环境声的分贝数;根据扬声器的播放信号,确定播放音量;根据环境声的分贝数和播放音量,分别确定第一融合系数、第二融合系数和第三融合系数,其中:第二融合系数与环境声的分贝数呈负相关,第一融合系数、第三融合系数分别与播放音量的分贝数呈负相关,第一融合系数、第二融合系数和第三融合系数的和为固定值。可选的,上述声压传感器和扬声器为可穿戴设备的声压传感器和扬声器。
由于本申请实施例在相似度融合时采用了动态融合系数,针对不同的应用环境,采用动态的融合系数对具有不同属性的语音信号获得的声纹识别结果进行融合,利用这些不同属性的语音信号的互补性可以提升声纹识别的鲁棒性和准确率。例如,在噪声环境较大或耳机播放音乐的情况下能够显著提升的识别准确率。其中,不同属性的语音信号也可以理解为通过不同的传感器(耳内语音传感器、耳外语音传感器、骨振动传感器)获取到的语音信号。
在一种可能的实现方式中,可穿戴设备发送指示指令给终端,终端执行与语音信息对应的操作指令,操作指令包括解锁指令、支付指令、关机指令、打开应用程序指令或呼叫指令。这样,用户只需要输入一次语音信息即可完成用户身份鉴权、以及执行某一功能等一些列操作,从而大大提高了用户对可穿戴设备的操控效率和用户体验。
第三方面,本申请提供一种语音控制方法,该语音控制方法应用于终端,换句话说,该语音控制方法的执行主体为终端,该方法具体如下:包括:获取用户语音信息,该语音信息包括第一语音分量,第二语音分量和第三语音分量,第一语音分量是由耳内语音传感器采集到的,第二语音分量是由耳外语音传感器采集到的,第三语音分量是由骨振动传感器采集到的;终端分别对第一语音分量,第二语音分量和第三语音分量进行声纹识别;终端根据语音信息中第一语音分量的第一声纹识别结果、语音信息中第二语音分量的第二声纹识别结果和语音信息中第三语音分量的第三声纹识别结果,得到用户的身份信息;当用户的身份信息与预设的信息匹配时,终端执行操作指令,其中,操作指令是根据语音信息确定的。
其中,由于当用户佩戴可穿戴设备之后外耳道与中耳道会形成一个封闭的腔室,声音在腔室里有一定的放大作用,即空腔效应,因此,耳内语音传感器采集到的声音会更加清晰,尤其对于高频声音信号具有明显的增强作用。由于可穿戴设备在采集声音时用到了耳内语音传感器,能够弥补骨振动传感器在采集语音信息时,会丢失部分语音信息的高频信号分量所造成的失真问题,因此能够提升终端整体的声纹采集效果和声纹识别的准确度,从而提升用 户体验。
在一种可能的实现方式中,可穿戴设备获取用户输入的语音信息后,会发送该语音信息对应的语音分量给终端,以使得终端根据语音分量进行声纹识别。在终端侧执行该语音控制方法,能够有效利用终端算力,在可穿戴设备算力不够的情况下,依然能够保障身份鉴权的准确性。
在终端进行声纹识别之前,终端需要先分别获取语音分量,可穿戴设备通过耳内语音传感器、耳外语音传感器和骨振动传感器这种不同的传感器获取三路语音分量并发送给终端,能够提升终端声纹识别的准确性与抗干扰能力。
在一种可能的实现方式中,终端对第一语音分量、第二语音分量和第三语音分量进行声纹识别之前,还包括:对语音信息进行关键词检测,或者,对用户输入进行检测。可选的,当语音信息中包括预设的关键词时,可穿戴设备会发送该语音信息对应的语音分量给终端,终端分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别;或者,当接收到用户输入的预设操作时,终端分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别。否则,说明用户此时没有进行声纹识别的需求,则可穿戴设备无需开启声纹识别功能,从而降低了终端的功耗。
在一种可能的实现方式中,可穿戴设备对语音信息进行关键词检测或者对用户输入进行检测之前,还包括:获取可穿戴设备的佩戴状态检测结果。可选的,当佩戴状态检测结果通过时,对语音信息进行关键词检测,或者,对用户输入进行检测。否则,说明用户此时没有佩戴可穿戴设备,当然也就没有进行声纹识别的需求,则可穿戴设备无需开启关键词检测功能,从而降低了可穿戴设备的功耗。
在一种可能的实现方式中,终端对第一语音分量进行声纹识别的具体过程为:
终端对第一语音分量进行特征提取,得到第一声纹特征,终端计算第一声纹特征与用户的第一注册声纹特征的第一相似度,第一注册声纹特征是第一注册语音经过第一声纹模型进行特征提取得到的,第一注册声纹特征用于反映耳内语音传感器采集到的用户的预设音频特征。通过计算相似度的方法来进行声纹识别,能够提升声纹识别的准确性。
在一种可能的实现方式中,终端对第二语音分量进行声纹识别的具体过程为:
终端对第二语音分量进行特征提取,得到第二声纹特征,终端计算第二声纹特征与用户的第二注册声纹特征的第二相似度,第二注册声纹特征是第二注册语音经过第二声纹模型进行特征提取得到的,第二注册声纹特征用于反映耳外语音传感器采集到的用户的预设音频特征。通过计算相似度的方法来进行声纹识别,能够提升声纹识别的准确性。
在一种可能的实现方式中,终端对第三语音分量进行声纹识别的具体过程为:
终端对第三语音分量进行特征提取,得到第三声纹特征,终端计算第三声纹特征与用户的第三注册声纹特征的第三相似度,第三注册声纹特征是第三注册语音经过第三声纹模型进行特征提取得到的,第三注册声纹特征用于反映骨振动传感器采集到的用户的预设音频特征。通过计算相似度的方法来进行声纹识别,能够提升声纹识别的准确性。
在一种可能的实现方式中,终端根据语音信息中第一语音分量的第一声纹识别结果、语音信息中第二语音分量的第二声纹识别结果和语音信息中第三语音分量的第三声纹识别结果,得到所述用户的身份信息,具体可以通过动态融合系数的方式,融合各个声纹识别结果,来得到所述用户的身份信息,具体可以为:
终端确定第一相似度对应的第一融合系数,第二相似度对应的第二融合系数,第三相似度对应的第三融合系数;终端根据第一融合系数、第二融合系数和第三融合系数融合第一相 似度、第二相似度和第三相似度,得到融合相似度得分,若融合相似度得分大于第一阈值,则确定该用户的身份信息与预设身份信息匹配。通过融合多个相似度得到融合相似度得分并进行判断的方法,能够有效提升声纹识别的准确性。
在一种可能的实现方式中,终端确定第一融合系数、第二融合系数和第三融合系数,具体可以根据声压传感器得到环境声的分贝数;根据扬声器的播放信号,确定播放音量;可穿戴设备检测到环境声的分贝数和播放音量后将数据发送给终端,终端根据环境声的分贝数和播放音量,分别确定第一融合系数、第二融合系数和第三融合系数,其中:第二融合系数与环境声的分贝数呈负相关,第一融合系数、第三融合系数分别与播放音量的分贝数呈负相关,第一融合系数、第二融合系数和第三融合系数的和为固定值。可选的,上述声压传感器和扬声器为可穿戴设备的声压传感器和扬声器。
由于本申请实施例在相似度融合时采用了动态融合系数,针对不同的应用环境,采用动态的融合系数对具有不同属性的语音信号获得的声纹识别结果进行融合,利用这些不同属性的语音信号的互补性可以提升声纹识别的鲁棒性和准确率。例如,在噪声环境较大或耳机播放音乐的情况下能够显著提升的识别准确率。其中,不同属性的语音信号也可以理解为通过不同的传感器(耳内语音传感器、耳外语音传感器、骨振动传感器)获取到的语音信号。
在一种可能的实现方式,终端执行与语音信息对应的操作指令,操作指令包括解锁指令、支付指令、关机指令、打开应用程序指令或呼叫指令。这样,用户只需要输入一次语音信息即可完成用户身份鉴权、以及执行可穿戴设备的某一功能等一些列操作,从而大大提高了用户对可终端的操控效率和用户体验。
第四方面,本申请提供一种语音控制装置,包括:语音信息获取单元,语音信息获取单元用于获取用户的语音信息,语音信息包括第一语音分量,第二语音分量和第三语音分量,第一语音分量是由耳内语音传感器采集到的,第二语音分量是由耳外语音传感器采集到的,第三语音分量是由骨振动传感器采集到的;识别单元,识别单元用于分别对第一语音分量,第二语音分量和第三语音分量进行声纹识别;身份信息获取单元,身份信息获取单元用于根据第一语音分量的声纹识别结果、第二语音分量的声纹识别结果和第三语音分量的声纹识别结果,得到用户的身份信息;执行单元,执行单元用于当用户的身份信息与预设的信息匹配时,执行操作指令,其中,操作指令是根据语音信息确定的。
其中,由于当用户佩戴可穿戴设备之后外耳道与中耳道会形成一个封闭的腔室,声音在腔室里有一定的放大作用,即空腔效应,因此,耳内语音传感器采集到的声音会更加清晰,尤其对于高频声音信号具有明显的增强作用。由于可穿戴设备在采集声音时用到了耳内语音传感器,能够弥补骨振动传感器在采集语音信息时,会丢失部分语音信息的高频信号分量所造成的失真问题,因此能够提升可穿戴设备整体的声纹采集效果和声纹识别的准确度,从而提升用户体验。在获取声纹识别结果之前,需要先分别获取语音分量,多路语音分量的获取,能够提升声纹识别的准确性与抗干扰能力。
在一种可能的实现方式中,语音信息获取单元还用于:对语音信息进行关键词检测,或者,对用户输入进行检测。可选的,当语音信息中包括预设的关键词时,分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别;当接收到用户输入的预设操作时,分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别。否则,说明用户此时没有进行声纹识别的需求,则终端或可穿戴设备无需开启声纹识别功能,从而降低终端或可穿戴设备的功耗。
在一种可能的实现方式中,语音信息获取单元还用于:获取可穿戴设备的佩戴状态检测 结果。可选的,当佩戴状态检测结果通过时,对语音信息进行关键词检测,或者,对用户输入进行检测。否则,说明用户此时没有佩戴可穿戴设备,当然也就没有进行声纹识别的需求,则终端或可穿戴设备无需开启关键词检测功能,从而降低终端或可穿戴设备的功耗。
在一种可能的实现方式中,识别单元具体用于:对第一语音分量进行特征提取,得到第一声纹特征,计算第一声纹特征与用户的第一注册声纹特征的第一相似度,第一注册声纹特征是第一注册语音经过第一声纹模型进行特征提取得到的,第一注册声纹特征用于反映耳内语音传感器采集到的用户的预设音频特征;对第二语音分量进行特征提取,得到第二声纹特征,计算第二声纹特征与用户的第二注册声纹特征的第二相似度,第二注册声纹特征是第二注册语音经过第二声纹模型进行特征提取得到的,第二注册声纹特征用于反映耳外语音传感器采集到的用户的预设音频特征;对第三语音分量进行特征提取,得到第三声纹特征,计算第三声纹特征与用户的第三注册声纹特征的第三相似度,第三注册声纹特征是第三注册语音经过第三声纹模型进行特征提取得到的,第三注册声纹特征用于反映骨振动传感器采集到的用户的预设音频特征。通过计算相似度的方法来进行声纹识别,能够提升声纹识别的准确性。
在一种可能的实现方式中,身份信息获取单元可以通过动态融合系数的方式获取身份信息,身份信息获取单元具体用于:确定第一相似度对应的第一融合系数,第二相似度对应的第二融合系数,第三相似度对应的第三融合系数;根据第一融合系数、第二融合系数和第三融合系数融合第一相似度、第二相似度和第三相似度,得到融合相似度得分,若融合相似度得分大于第一阈值,则确定用户的身份信息与预设身份信息匹配。通过融合多个相似度得到融合相似度得分并进行判断的方法,能够有效提升声纹识别的准确性。
在一种可能的实现方式中,身份信息获取单元具体用于:根据声压传感器得到环境声的分贝数;根据扬声器的播放信号,确定播放音量;根据环境声的分贝数和播放音量,分别确定第一融合系数、第二融合系数和第三融合系数,其中:第二融合系数与环境声的分贝数呈负相关,第一融合系数、第三融合系数分别与播放音量的分贝数呈负相关,第一融合系数、第二融合系数和第三融合系数的和为固定值。
由于本申请实施例在相似度融合时采用了动态融合系数,针对不同的应用环境,采用动态的融合系数对具有不同属性的语音信号获得的声纹识别结果进行融合,利用这些不同属性的语音信号的互补性可以提升声纹识别的鲁棒性和准确率。例如,在噪声环境较大或耳机播放音乐的情况下能够显著提升的识别准确率。其中,不同属性的语音信号也可以理解为通过不同的传感器(耳内语音传感器、耳外语音传感器、骨振动传感器)获取到的语音信号。
在一种可能的实现方式中,若该用户为预设用户,则执行单元具体用于:执行与语音信息对应的操作指令,操作指令包括解锁指令、支付指令、关机指令、打开应用程序指令或呼叫指令。这样,用户只需要输入一次语音信息即可完成用户身份鉴权、以及执行某一功能等一些列操作,从而大大提高了用户的操控效率和用户体验。
可以理解的是,本申请第四方面提供的语音控制装置,可以理解为终端或可穿戴设备,具体视语音控制方法的执行主体而定,本申请对此不做限制。
第五方面,本申请提供一种可穿戴设备,包括:耳内语音传感器,耳外语音传感器,骨振动传感器,存储器和处理器;耳内语音传感器用于采集语音信息的第一语音分量,耳外语音传感器用于采集语音信息的第二语音分量,骨振动传感器用于采集语音信息的第三语音分量;存储器和处理器耦合;存储器用于存储计算机程序代码,计算机程序代码包括计算机指令;当处理器执行计算机指令时,可穿戴设备执行上述第一方面或第一方面的可能的实现方式或第三方面或第三方面的可能的实现方式中任一项的语音控制方法。
第六方面,本申请提供一种终端,包括:包括存储器和处理器;存储器和处理器耦合;存储器用于存储计算机程序代码,计算机程序代码包括计算机指令;当处理器执行计算机指令时,终端执行上述第一方面或第一方面的可能的实现方式或第三方面或第三方面的可能的实现方式中任一项的语音控制方法。
第七方面,本申请提供一种芯片系统,芯片系统应用于电子设备;芯片系统包括一个或多个接口电路,以及一个或多个处理器;接口电路和处理器通过线路互联;接口电路用于从电子设备的存储器接收信号,并向处理器发送信号,信号包括存储器中存储的计算机指令;当处理器执行计算机指令时,电子设备执行上述第一方面或第一方面的可能的实现方式中任一项的语音控制方法。
第八方面,本申请提供一种计算机存储介质,包括计算机指令,当计算机指令在语音控制装置上运行时,使得该语音控制装置执行如第一方面或第一方面的可能的实现方式中任一项的语音控制方法。
第九方面,本申请提供一种计算机程序产品,该计算机程序产品包括计算机指令,当该计算机指令在语音控制装置上运行时,使得语音控制装置执行如第一方面或第一方面的可能的实现方式中任一项的语音控制方法。
可以理解地,上述提供的第五方面的可穿戴设备、第六方面的终端、第七方面的芯片系统、第八方面的计算机存储介质,以及第九方面的计算机程序产品均用于执行上文所提供的对应的方法,因此,其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果,此处不再赘述。
附图说明
图1为本申请实施例提供的一种手机硬件结构示意图;
图2为本申请实施例提供的一种手机软件结构示意图;
图3为本申请实施例提供的一种可穿戴设备结构示意图;
图4为本申请实施例提供的一种语音控制系统示意图;
图5为本申请实施例提供的一种服务器的结构示意图;
图6为本申请实施例提供的一种声纹识别流程示意图;
图7为本申请实施例提供的一种语音控制方法示意图;
图8为本申请实施例提供的一种传感器设置区域示意图;
图9是本申请实施例提供的一种支付界面示意图;
图10是本申请实施例提供的另一种语音控制方法示意图;
图11为本申请实施例提供的一种手机设置界面示意图;
图12为本申请实施例提供的一种语音控制装置示意图;
图13为本申请实施例提供的一种可穿戴设备示意图;
图14为本申请实施例提供的一种终端示意图;
图15是本申请实施例提供的一种芯片系统示意图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
以下,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者 隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征,应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。在本实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或模块的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或模块。在本申请中出现的对步骤进行的命名或者编号,并不意味着必须按照命名或者编号所指示的时间/逻辑先后顺序执行方法流程中的步骤,已经命名或者编号的流程步骤可以根据要实现的技术目的变更执行次序,只要能达到相同或者相类似的技术效果即可。
随着音频处理技术的日益发展,声纹识别方法在音频处理领域中已成为一个重要的热点问题。声纹(Voiceprint),是用电声学仪器显示的携带言语信息的声波频谱。声纹具有稳定性、可测量性、唯一性等特点。成年以后,人的声音可保持长期相对稳定不变。人在讲话时使用的发声器官在尺寸和形态方面每个人的差异很大,所以任何两个人的声纹图谱都有差异,不同人的声音在语谱图中共振峰的分布情况不同。声纹识别正是通过比对两段语音的说话人在相同音素上的发声来判断是否为同一个人,从而实现“闻声识人”的功能。
声纹识别(VR)作为生物识别技术的一种,也称为说话人识别,是从说话人发出的语音信号中提取声纹信息,从应用上看,可分为:说话人辨认(SI,Speaker Identification):用以判断某段语音是若干人中的哪一个人所说的,是“多选一”问题。说话人确认(SV,Speaker Verification):用以确认某段语音是否是指定的某个人所说的,是“一对一判别”问题。本申请主要涉及说话人确认技术。
声纹识别技术可以应用于终端用户识别场景中,也可以应用于家庭安防的户主识别场景中,本申请对此不做限制。
通常的声纹识别技术通过一路或两路语音信号的采集,进行声纹识别,即需要两路语音分量的声纹识别结果都为匹配才会判定为预设用户。但是会存在两个问题,其一,面对多人说话场景或者强干扰环境噪音的背景下采集的语音分量会对声纹识别结果进行干扰,导致身份鉴权不准确甚至错误。只要有任意一路语音分量的采集在上述干扰环境下完成,会导致声纹识别性能下降,使身份鉴权结果出现误判。即现有声纹识别技术无法很好地对来自各个方向的噪声进行抑制,降低了声纹识别准确性。
其二,如果两个语音传感器中有一路使用了骨振动传感器,由于当前的骨振动传感器,往往只能采集到说话人发音信号的低频成分(通常1KHz以下),高频成分会损失掉,这对声纹识别是不利的,会导致声纹识别不准确甚至错误,因为声纹识别需要描述说话人在各个频带的发音特性。
有鉴于此,本申请实施例提供了一种语音控制方法,可以理解的是,执行本实施例方法的主体可以是终端,该终端与可穿戴设备建立连接,能够获取到可穿戴设备采集到的语音信息,并对语音信息进行声纹识别。执行本实施例方法的主体也可以是可穿戴设备本身,该可穿戴设备本身包括具备计算能力的处理器,能够直接对采集到的语音信息进行声纹识别。执行本实施例方法的主体也可以是服务器,该服务器与可穿戴设备建立连接,能够获取到可穿戴设备采集到的语音信息,并对语音信息进行声纹识别。在实际应用过程中,可以根据可穿戴设备芯片的算力来决定执行本实施例方法的主体。例如,在可穿戴设备芯片的算力较高的情况下,可以由可穿戴设备来执行本实施例方法;在可穿戴设备芯片算力较低的情况下,则 可以由与可穿戴设备连接的终端设备来执行本实施例方法,或者,可以由与可穿戴设备连接的服务器来执行本实施例方法。为便于叙述,以下将分别以与可穿戴设备连接的终端为本实施例方法的执行主体为例,以可穿戴设备为本实施例方法的执行主体为例,以与可穿戴设备连接的服务器为本实施例方法的执行主体为例对本申请实施例进行详细介绍。
其中,终端设备又称之为用户设备(user equipment,UE)、移动台(mobile station,MS)、移动终端(mobile terminal,MT)等,是一种能够与可穿戴设备进行有线连接或无线连接,以向用户提供语音和/或数据连通性的设备。例如,无线连接功能允许的手持式设备、车载设备等。目前,一些终端设备的举例为:手机(mobile phone)、平板电脑、笔记本电脑、掌上电脑、移动互联网设备(mobile internet device,MID)、可穿戴设备,虚拟现实(virtual reality,VR)设备、增强现实(augmented reality,AR)设备、工业控制(industrial control)中的无线终端、无人驾驶(self driving)中的无线终端、远程手术(remote medical surgery)中的无线终端、智能电网(smart grid)中的无线终端、运输安全(transportation safety)中的无线终端、智慧城市(smart city)中的无线终端、智慧家庭(smart home)中的无线终端等,本申请实施例对此不做任何限制。
当上述语音控制方法为终端时,所述语音控制方法可以通过安装在终端上的用于识别声纹的应用程序实现。
上述用于识别声纹的应用程序可以是安装在终端中的嵌入式应用程序(即终端的系统应用)或者可下载应用程序。其中,嵌入式应用程序是作为终端(如手机)实现的一部分提供的应用程序。可下载应用程序是一个可以提供自己的因特网协议多媒体子系统(internet protocol multimedia subsystem,IMS)连接的应用程序,该可下载应用程序是可以预先安装在终端中的应用或可以由用户下载并安装在终端中的第三方应用。
为了便于理解,以下先介绍本申请实施例方法应用的终端、可穿戴设备和服务器。请参考图1,以终端是手机为例,图1示出了手机的一种硬件结构。如图1所示,手机10可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。
其中,传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本申请实施例示意的结构并不构成对手机的具体限定。在本申请另一些实施例中,手机可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。处理器110能够执行本申请实施例提供的声纹识别算法。
其中,控制器可以是手机的神经中枢和指挥中心。控制器可以根据指令操作码和时序信 号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。终端可以通过接口与可穿戴设备建立有线通信连接。终端可以通过接口获取穿戴设备分别通过耳内语音传感器采集第一语音分量,通过耳外语音传感器采集第二语音分量,通过骨振动传感器采集第三语音分量。
I2C接口是一种双向同步串行总线,包括一根串行数据线(serial data line,SDA)和一根串行时钟线(derail clock line,SCL)。I2S接口可以用于音频通信。PCM接口也可以用于音频通信,将模拟信号抽样,量化和编码。UART接口是一种通用串行数据总线,用于异步通信。该总线可以为双向通信总线。它将要传输的数据在串行通信与并行通信之间转换。MIPI接口可以被用于连接处理器110与显示屏194,摄像头193等外围器件。MIPI接口包括摄像头串行接口(camera serial interface,CSI),显示屏串行接口(display serial interface,DSI)等。GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号,也可被配置为数据信号。USB接口130是符合USB标准规范的接口,具体可以是Mini USB接口,Micro USB接口,USB Type C接口等。USB接口130可以用于连接充电器为手机充电,也可以用于手机与外围设备之间传输数据。也可以用于连接耳机,通过耳机播放音频。该接口还可以用于连接其他电子设备,例如AR设备等。
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对手机的结构限定。在本申请另一些实施例中,手机也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,外部存储器,显示屏194,摄像头193,和无线通信模块160等供电。电源管理模块141还可以用于监测电池容量,电池循环次数,电池健康状态(漏电,阻抗)等参数。
手机的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。手机中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在手机上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行 滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。调制解调处理器可以包括调制器和解调器。
无线通信模块160可以提供应用在手机上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),GNSS,调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。终端可以通过无线通信模块160与可穿戴设备建立通信连接。终端可以无线通信模块160获取穿戴设备分别通过耳内语音传感器采集第一语音分量,通过耳外语音传感器采集第二语音分量,通过骨振动传感器采集第三语音分量。
示例性的,本申请实施例中的GNSS可以包括:GPS,GLONASS,BDS,QZSS,SBAS,和/或GALILEO等。
手机通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。显示屏194用于显示图像,视频等。显示屏194包括显示面板。
手机可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。ISP用于处理摄像头193反馈的数据。摄像头193用于获取静态图像或视频。物体通过镜头生成光学图像投射到感光元件。数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。视频编解码器用于对数字视频压缩或解压缩。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现手机的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展手机的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器110通过运行存储在内部存储器121的指令,从而执行手机的各种功能应用以及数据处理。内部存储器121存储的代码可执行本申请实施例提供的一种语音控制方法,比如:当用户向可穿戴设备输入语音信息时,可穿戴设备通过耳内语音传感器采集第一语音分量,通过耳外语音传感器采集第二语音分量,通过骨振动传感器采集第三语音分量,手机通过通信连接从可穿戴设备获取第一语音分量、第二语音分量和第三语音分量,并分别进行声纹识别;根据第一语音分量的第一声纹识别结果、第二语音分量的第二声纹识别结果和第三语音分量的第三声纹识别结果,对用户进行身份鉴权;若用户的身份鉴权结果为预设用户,则手机执行与语音信息对应的操作指令。
手机可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D, 以及应用处理器等实现音频功能。例如音乐播放,录音等。终端可以通过无线通信模块160与可穿戴设备建立通信连接。终端可以无线通信模块160获取穿戴设备分别通过耳内语音传感器采集第一语音分量,通过耳外语音传感器采集第二语音分量,通过骨振动传感器采集第三语音分量。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.2mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。手机可以接收按键输入,产生与手机的用户设置以及功能控制有关的键信号输入。马达191可以产生振动提示。马达191可以用于来电振动提示,也可以用于触摸振动反馈。指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。SIM卡接口195用于连接SIM卡。SIM卡可以通过插入SIM卡接口195,或从SIM卡接口195拔出,实现和手机的接触和分离。手机可以支持1个或N个SIM卡接口,N为大于1的正整数。SIM卡接口195可以支持Nano SIM卡,Micro SIM卡,SIM卡等。
尽管图1未示出,手机100还可以包括摄像头、闪光灯、微型投影装置、近场通信(near field communication,NFC)装置等,在此不予赘述。
手机的软件系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。本申请实施例以分层架构的安卓(Android)系统为例,示例性说明手机的软件结构。
图2是本申请实施例的手机的软件结构框图。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为:应用程序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。
应用程序层可以包括一系列应用程序包。
如图2所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,短信息等应用程序。还可以包括用于声纹识别的应用程序,该用于声纹识别应用程序可以是终端内置的,也可以是通过外部网站下载的。
应用程序框架层为应用程序层中的应用程序提供应用编程接口(application programming interface,API)和编程框架。
应用程序框架层包括一些预先定义的函数。
如图2所示,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供手机的通信功能。例如通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
Android Runtime包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。
2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
下面结合捕获拍照场景,示例性说明手机软件以及硬件的工作流程。
当触摸传感器180K接收到触摸操作,相应的硬件中断被发给内核层。内核层将触摸操作加工成原始输入事件(包括触摸坐标,触摸操作的时间戳等信息)。原始输入事件被存储在内核层。应用程序框架层从内核层获取原始输入事件,识别该输入事件所对应的控件。以该触摸操作是触摸单击操作,该单击操作所对应的控件为相机应用图标的控件为例,相机应用调用应用框架层的接口,启动相机应用,进而通过调用内核层启动摄像头驱动,通过摄像头193捕获静态图像或视频。
本申请实施例的语音控制方法可以应用于可穿戴设备,换句话说,可穿戴设备可以作为本申请实施例语音控制方法的执行主体。其中,可穿戴设备可以是无线耳机、有线耳机、智能眼镜、智能头盔或者智能腕表等具有语音采集功能的设备,本申请实施例对此不做任何限制。
例如,本申请实施例提供的可穿戴设备可以是TWS(True Wireless Stereo,真正无线立体声)耳机,TWS技术基于蓝牙芯片技术的发展。按其工作原理来说是指手机通过连接主耳机,再由主耳机通过无线方式快速连接副耳机,实现真正的蓝牙左右声道无线分离使用。
随着TWS技术和人工智能技术的发展,TWS智能耳机开始在无线连接、语音交互、智 能降噪、健康监测和听力增强/保护等领域发挥作用。而降噪、听力保护、智能翻译、健康监测、骨振动ID、防丢等将是TWS耳机关键技术的趋势。
请参考图3,图3示出了可穿戴设备的一种结构图,可穿戴设备30具体可以包括耳内语音传感器301,耳外语音传感器302和骨振动传感器303。上述耳内语音传感器301和耳外语音传感器可以是气传导麦克风,上述骨振动传感器可以是骨传导麦克风、光学振动传感器、加速度传感器或气传导麦克风等能够采集用户发声时产生的振动信号的传感器。其中,气传导麦克风采集语音信息的方式是通过空气将发生时的振动信号传至麦克风,继而将声音信号收集起来转为电信号;骨传导麦克风采集语音信息的方式是利用人讲话时引起的头颈部骨骼的轻微振动,通过骨头将发声时的振动信号传至麦克风,继而将声音信号收集起来转为电信号。
可以理解的是,本申请实施例提供的语音控制方法需要应用于具有声纹识别功能的可穿戴设备,换句话说,可穿戴设备30需要具备声纹识别功能。
本申请实施例提供的可穿戴设备30的耳内语音传感器301指的是,当该可穿戴设备处于被用户使用的状态时,该耳内语音传感器位于用户的耳道内部,或者说,该耳内语音传感器的声音侦测方向为耳道内部。该耳内语音传感器用于采集用户发声时经过外界空气和耳道内空气的振动传播的声音,该声音为耳内语音信号分量。耳外语音传感器302指的是,当该可穿戴设备处于被用户使用的状态时,该耳外语音传感器位于用户的耳道外部,或者说,该耳外语音传感器的声音侦测方向为除耳道内部的其他方向,即整个外部空气方向。该耳外语音传感器暴露于环境中,用于采集用户发出的经过外界空气的振动传播的声音,该声音为耳外语音信号分量或环境声分量。骨振动传感器303指的是,当该可穿戴设备处于被用户使用的状态时,该骨振动传感器与用户的皮肤接触,用于采集用户骨头传递的振动信号,或者说,用于采集用户某次发声时,通过骨头振动所传递的语音信息分量。可选的,耳内麦克风和耳外麦克风均可以根据麦克风的位置,可以选择不同方向性的麦克风,如心型、全向型、8字型等,从而获取不同方向的语音信号。
其中,由于当用户佩戴耳机之后外耳道与中耳道会形成一个封闭的腔室,声音在腔室里有一定的放大作用,即空腔效应,因此,耳内语音传感器采集到的声音会更加清晰,尤其对于高频声音信号具有明显的增强作用,能够弥补骨振动传感器在采集语音信息时,会丢失部分语音信息的高频信号分量所造成的失真问题,提升耳机整体的声纹采集效果和声纹识别的准确度,从而提升用户体验。
可以理解的是,耳内语音传感器301在拾取耳内语音信号时,通常伴有耳内残余噪声,耳外语音传感器302在拾取耳外语音信号时,通常伴有耳外噪声。
在本申请实施例中,用户佩戴可穿戴设备30说话时,可穿戴设备30既可以通过耳内语音传感器301和耳外语音传感器302采集经空气传播后用户发出的语音信息,还可以通过骨振动传感器303采集经骨头传播后用户发出的语音信息。
可以理解的是,可穿戴设备30中的耳内语音传感器301、耳外语音传感器302和骨振动传感器303均可以有多个,本申请对此不做限制。耳内语音传感器301、耳外语音传感器302和骨振动传感器303可以是内置于可穿戴设备30中的。
仍如图3所示,可穿戴设备30中还可以包括通信模块304、扬声器305、计算模块306、存储模块307以及电源309等部件。
当终端或服务器作为本申请实施例语音控制方法的执行主体时,通信模块304能够与终端或服务器建立通信连接。其中,通信模块304可以包括通信接口,通信接口有线或无线的 方式,无线方式可以是通过蓝牙或者wifi方式。通信模块304可以用于将可穿戴设备30分别通过耳内语音传感器301采集第一语音分量,通过耳外语音传感器302采集第二语音分量,通过骨振动传感器303采集第三语音分量,传送给终端或服务器。
当可穿戴设备30作为本申请实施例语音控制方法的执行主体时,计算模块306能够执行本申请实施例提供的语音控制方法,当用户向可穿戴设备输入语音信息时,可穿戴设备30通过耳内语音传感器301采集第一语音分量,通过耳外语音传感器302采集第二语音分量,通过骨振动传感器303采集第三语音分量,分别进行声纹识别;根据第一语音分量的第一声纹识别结果、第二语音分量的第二声纹识别结果和第三语音分量的第三声纹识别结果,对用户进行身份鉴权;若用户的身份鉴权结果为预设用户,则可穿戴设备执行与语音信息对应的操作指令。
其中,存储模块307用于存储执行本申请实施例方法的应用程序代码,并由计算模块306来控制执行。
存储模块307存储的代码可执行本申请实施例提供的一种语音控制方法,比如:当用户向可穿戴设备输入语音信息时,可穿戴设备30通过耳内语音传感器301采集第一语音分量,通过耳外语音传感器302采集第二语音分量,通过骨振动传感器303采集第三语音分量,分别进行声纹识别;根据第一语音分量的第一声纹识别结果、第二语音分量的第二声纹识别结果和第三语音分量的第三声纹识别结果,对用户进行身份鉴权;若用户的身份鉴权结果为预设用户,则可穿戴设备执行与语音信息对应的操作指令。
可以理解的是,麦克风和骨振动传感器可以任意组合。上述可穿戴设备30还可以包括压力传感器、加速度传感器、光学传感器等,可穿戴设备30还可以具有比图3中所示出的更多的或者更少的部件,可以组合两个或更多的部件,或者可以具有不同的部件配置。图3中所示出的各种部件可以在包括一个或多个信号处理或专用集成电路在内的硬件、软件、或硬件和软件的组合中实现。
本申请实施例提供的一种语音控制方法可以应用于可穿戴设备30与终端10组成的语音控制系统中,该语音控制系统如图4所示。在该语音控制系统中,当用户向可穿戴设备输入语音信息时,可穿戴设备30可以分别通过耳内语音传感器301采集第一语音分量,通过耳外语音传感器302采集第二语音分量,通过骨振动传感器303采集第三语音分量,终端10从所述可穿戴设备获取所述第一语音分量、所述第二语音分量和所述第三语音分量,继而分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别;根据第一语音分量的第一声纹识别结果、第二语音分量的第二声纹识别结果和第三语音分量的第三声纹识别结果,对用户进行身份鉴权;若用户的身份鉴权结果为预设用户,则终端10执行与语音信息对应的操作指令。
本申请实施例的语音控制方法可以还可以应用于服务器,换句话说,服务器可以作为本申请实施例语音控制方法的执行主体。
服务器可以为台式服务器、机架式服务器、机柜式服务器、刀片式服务器或者其他类型的服务器,服务器还可以为公用云、私有云等云端服务器,本申请实施例对此不做任何限制。
请参考图5,图5示出了服务器的一种结构图,该服务器50包括至少一个处理器501,至少一个存储器502以及至少一个通信接口503。处理器501、存储器502、和通信接口503通过通信总线504连接并完成相互间的通信。
处理器501可以是通用中央处理器(CPU),微处理器,特定应用集成电路(application-specific integrated circuit,ASIC),或一个或多个用于控制以上方案程序执行的集 成电路。
存储器502可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,EEPROM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器可以是独立存在,通过总线与处理器相连接。存储器也可以和处理器集成在一起。
其中,存储器502用于存储执行本申请实施例方法的应用程序代码,并由处理器501来控制执行。
存储器502存储的代码可执行本申请实施例提供的一种语音控制方法,比如:当用户向可穿戴设备输入语音信息时,可穿戴设备通过耳内语音传感器采集第一语音分量,通过耳外语音传感器采集第二语音分量,通过骨振动传感器采集第三语音分量,服务器通过通信连接从可穿戴设备获取第一语音分量、第二语音分量和第三语音分量,并分别进行声纹识别;根据第一语音分量的第一声纹识别结果、第二语音分量的第二声纹识别结果和第三语音分量的第三声纹识别结果,对用户进行身份鉴权;若用户的身份鉴权结果为预设用户,则服务器执行与语音信息对应的操作指令。
通信接口503,用于与其他设备或通信网络通信,如以太网,无线接入网(RAN),无线局域网(Wireless Local Area Networks,WLAN)等。
结合上述图1-图5,以可穿戴设备为蓝牙耳机、终端为手机举例,概述本申请的语音控制方法应用于终端时的具体实施方式。该方法首先获取用户的语音信息,语音信息包括第一语音分量,第二语音分量和第三语音分量,在本申请实施例中,用户可在佩戴蓝牙耳机时向蓝牙耳机输入语音信息,此时,蓝牙耳机可以基于用户输入的语音信息,通过耳内语音传感器采集第一语音分量,通过耳外语音传感器采集第二语音分量,通过骨振动传感器采集第三语音分量。
蓝牙耳机从所述语音信息中获取所述第一语音分量、所述第二语音分量和所述第三语音分量,手机通过与蓝牙耳机的蓝牙连接,从蓝牙耳机获取第一语音分量、第二语音分量和第三语音分量。在一种可能的实现方式中,手机可以对用户向蓝牙耳机输入的语音信息进行关键词检测,或者,手机可以对用户输入进行检测。可选的,当语音信息中包括预设的关键词时,分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别。当接收到用户输入的预设操作时,分别对所述第一语音分量、所述第二语音分量和所述第三语音分量进行声纹识别。用户输入可以为用户通过触摸屏或按键对手机的输入,例如,用户点击手机的解锁键。可选的,手机对语音信息进行关键词检测或者对用户输入进行检测之前,还可以从蓝牙耳机获取佩戴状态检测结果。可选的,当佩戴状态检测结果通过时,手机对语音信息进行关键词检测,或者,对用户输入进行检测。
手机分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别后得到与第一语音分量对应的第一声纹识别结果、与第二语音分量对应的第二声纹识别结果以及与第三语音分量对应的第三声纹识别结果。
当上述第一声纹特征与第一注册声纹特征匹配,第二声纹特征与第二注册声纹特征匹配,且第三声纹特征与第三注册声纹特征匹配时,说明蓝牙耳机此时采集到的语音信息为预设用 户输入的。例如,手机可通过一定算法计算第一声纹特征与第一注册声纹特征的第一匹配度,第二声纹特征与第二注册声纹特征的第二匹配度,以及第三声纹特征与第三注册声纹特征的第三匹配度。当匹配度越高时,说明该声纹特征与对应的注册声纹特征越吻合,此时发声用户为预设用户的可能性越高。例如,当第一匹配度、第二匹配度与第三匹配度的平均值大于80分时,手机可确定第一声纹特征与第一注册声纹特征匹配,第二声纹特征与第二注册声纹特征匹配,且第三声纹特征与第三注册声纹特征匹配。又或者,当第一匹配度、第二匹配度与第三匹配度分别大于85分时,手机可确定第一声纹特征与第一注册声纹特征匹配,第二声纹特征与第二注册声纹特征匹配,且第三声纹特征与第三注册声纹特征匹配。其中,第一注册声纹特征是通过第一声纹模型进行特征提取得到的,第一注册声纹特征用于反映耳内语音传感器采集到的预设用户的声纹特征;第二注册声纹特征是通过第二声纹模型进行特征提取得到的,第二注册声纹特征用于反映耳外语音传感器采集到的所述预设用户的声纹特征;第三注册声纹特征是通过第三声纹模型进行特征提取得到的,第三注册声纹特征用于反映骨振动传感器采集到的所述预设用户的声纹特征。
可以理解的是,这里的算法类型不限,判断条件不限,只要能达到本申请实施例的技术效果即可。进而,手机可以执行与该语音信息对应的操作指令,例如,解锁指令、支付指令、关机指令、打开应用程序指令或者呼叫等指令。使得手机可以根据该操作指令执行对应的操作,实现用户通过语音操控手机的功能。可以理解的是,身份鉴权的条件不做限制,例如,当第一匹配度、第二匹配度与第三匹配度均大于某一阈值的时候,可以认为身份鉴权通过,发声用户为预设用户;或者,当第一匹配度、第二匹配度与第三匹配度以一定方式进行匹配度融合得到的融合匹配度大于某一阈值的时候,可以认为身份鉴权通过,发声用户为预设用户。本申请实施例中的身份鉴权,指的是通过获得用户的身份信息,判断该身份信息与预设的身份信息是否匹配,若匹配,则认为鉴权通过,若不匹配,则认为鉴权不通过。
其中,上述预设用户是指能够通过手机预设的身份认证措施的用户,例如,终端预设的身份认证措施为输入密码、指纹识别和声纹识别,那么,通过密码输入或者预先在终端内存储有经过用户身份认证的指纹信息和注册声纹特征的用户可认为是该终端的预设用户。当然,一个终端的预设用户可以包括一个或多个,除预设用户之外的任意用户都可以视为该终端的非法用户。非法用户通过一定的身份认证措施后也可转变为预设用户,本申请实施例对此不做任何限制。
在一种可能的实现方式中,第一注册声纹特征是通过第一声纹模型进行特征提取得到的,第一注册声纹特征用于反映耳内语音传感器采集到的预设用户的声纹特征;第二注册声纹特征是通过第二声纹模型进行特征提取得到的,第二注册声纹特征用于反映耳外语音传感器采集到的所述预设用户的声纹特征;第三注册声纹特征是通过第三声纹模型进行特征提取得到的,第三注册声纹特征用于反映骨振动传感器采集到的所述预设用户的声纹特征。
在一种可能的实现方式中,上述计算匹配度的算法可以为计算相似度。手机对第一语音分量进行特征提取,得到第一声纹特征,分别计算第一声纹特征与预先存储的预设用户的第一注册声纹特征的第一相似度,第二声纹特征与预先存储的预设用户的第二注册声纹特征的第二相似度,第三声纹特征与预先存储的预设用户的第三注册声纹特征的第三相似度,基于第一相似度、第二相似度和第三相似度,对用户进行身份鉴权。
在一种可能的实现方式中,对用户进行身份鉴权的方式可以为手机根据环境声的分贝数和可穿戴设备的播放音量,分别确定第一相似度对应的第一融合系数,第二相似度对应的第二融合系数,第三相似度对应的第三融合系数;根据第一融合系数、第二融合系数和第三融 合系数融合第一相似度、第二相似度和第三相似度,得到融合相似度得分。若融合相似度得分大于第一阈值,则手机确定向蓝牙耳机输入语音信息的用户为预设用户。
在一种可能的实现方式中,环境声的分贝数是蓝牙耳机的声压传感器检测得到的并发送给手机的,播放音量可以是蓝牙耳机的扬声器检测播放信号得到的并发送给手机的,也可以是手机本身调用自身数据得到的,即通过底层系统的音量接口程序接口获得。
在一种可能的实现方式中,第二融合系数与环境声的分贝数呈负相关,第一融合系数、第三融合系数分别与播放音量的分贝数呈负相关,第一融合系数、第二融合系数和第三融合系数的和为固定值。也就是说,在第一融合系数、第二融合系数和第三融合系数的和为预设的固定值的情况下,环境声的分贝数越大,第二融合系数越小,此时,相应的,第一融合系数和第三融合系数会适应性增大,以维持在第一融合系数、第二融合系数和第三融合系数的和不变化;播放音量越大,第一融合系数和第三融合系数越小,此时,相应的,第二融合系数会适应性增大,以维持在第一融合系数、第二融合系数和第三融合系数的和不变化。可以理解的是,上述可变的融合系数能够兼顾不同的应用场景(噪声环境较大或耳机播放音乐的情况下)下的识别准确率。
当手机确定向蓝牙耳机输入语音信息的用户为预设用户后,手机可以自动执行与所述语音信息对应的操作指令,例如,手机解锁操作或者确认支付操作。
可以看出,在本申请实施例中,当用户通过向可穿戴设备输入语音信息以达到控制终端的目的时,可穿戴设备可采集用户发声时在耳道内产生的语音信息、在耳道外产生的语音信息以及骨振动信息,此时可穿戴设备内产生了三路语音信息(即上述第一语音分量、第二语音分量和第三语音分量)。这样,终端(或可穿戴设备本身,或服务器)可针对这三路语音信息分别进行声纹识别,当这三路语音信息的声纹识别结果均与预设用户的注册声纹特征匹配时,可确认此时输入语音信息的用户为预设用户,或者,当这三路语音信息的声纹识别结果进行加权融合后的融合结果大于某一阈值时,可确认此时输入语音信息的用户为预设用户。显然,这种三路语音信息的三重声纹识别过程相比于一路语音信息的声纹识别过程或两路语音信息的声纹识别过程,能够显著提高用户身份鉴权时的准确性和安全性。尤其是在耳内增加一个麦克风可以解决,在耳外语音传感器和骨振动传感器两路语音信息的声纹识别过程中,骨振动传感器采集的语音信号的高频信号丢失的问题。
并且,由于用户必须佩戴该可穿戴设备后,可穿戴设备才能通过骨传导这种方式采集到用户输入的语音信息,因此,当可穿戴设备通过骨传导这种方式采集到的语音信息能够通过声纹识别时,也说明了上述语音信息的来源是佩戴可穿戴设备的预设用户发声产生的,从而避免非法用户使用预设用户的录音恶意控制预设用户的终端的情况。
为了便于理解,以下结合附图对本申请实施例提供的一种语音控制方法进行具体介绍。以下实施例中均以手机作为终端,以蓝牙耳机作为可穿戴设备举例说明。
首先对声纹识别技术进行简单介绍。
声纹识别技术实际应用中一般分注册和验证两个流程,一般的声纹识别应用流程如图6所示,在注册流程部分中,首先采集注册语音601,经过预处理模块602预处理后,输入到预先训练好的声纹模型603进行特征提取后,得到注册语音声纹特征604,该注册语音声纹特征也可以理解为预设用户注册声纹特征。可以理解的是,注册语音可以被不同类型的传感器提取,例如,耳外语音传感器,耳内语音传感器或骨振动传感器。其中,声纹模型603是预先通过训练数据训练得到的。声纹模型603可以是终端出厂前内置的,也可以是通过应用程序指导用户训练的,训练方法可以利用现有技术的方法,本申请对此不做限制。在验证流 程部分中,首先采集在某一次声纹识别过程中发声用户的测试语音605,经过预处理模块606预处理后,输入到预先训练好的声纹模型607进行特征提取后,得到测试语音声纹特征608,该测试语音声纹特征也可以理解为预设用户注册声纹特征。通过对基于注册语音声纹特征604和测试语音声纹特征608进行声纹识别来进行身份鉴权609后,基于声纹识别结果得到身份鉴权通过6010和身份鉴权不通过6011,身份鉴权通过6010指的是测试语音605的发声用户与注册语音601的发声用户为同一人,换句话说,测试语音605的发声用户为预设用户;身份鉴权不通过6011指的是测试语音605的发声用户与注册语音601的发声用户不为同一人,换句话说,测试语音605的发声用户为非法用户。可以理解的是,根据不同应用场景,声音的预处理、特征提取、以及声纹模型的训练过程会存在不同程度的差异,并且,预处理模块为可选的模块,预处理包括对语音信号的滤波、降噪或增强,本申请对此不做限制。
图7以终端是手机,可穿戴设备是蓝牙耳机为例,展示了本申请实施例提供的一种语音控制方法的流程示意图。其中,该蓝牙耳机包括耳内语音传感器,耳外语音传感器和骨振动传感器。如图7所示,该语音控制方法可以包括:
S701、手机与蓝牙耳机建立连接。
连接的方式可以为蓝牙连接、wifi连接或有线连接。手机与蓝牙耳机建立蓝牙连接的情况下,当用户希望使用蓝牙耳机时,可打开蓝牙耳机的蓝牙功能。此时,蓝牙耳机可对外发送配对广播。如果手机未打开蓝牙功能,则用户需要打开手机的蓝牙功能,如果手机已经打开蓝牙功能,则手机可以接收到该配对广播并提示用户已经扫描到相关的蓝牙设备。当用户在手机上选中蓝牙耳机后,手机可与蓝牙耳机进行配对并建立蓝牙连接。后续,手机与蓝牙耳机之间可通过该蓝牙连接进行通信。当然,如果手机与蓝牙耳机在建立本次蓝牙连接之前已经成功配对,则手机可自动与扫描到的蓝牙耳机建立蓝牙连接。
另外,如果用户希望使用的耳机具有Wi-Fi功能,用户也可操作手机与该耳机建立Wi-Fi连接。又或者,如果用户希望使用的耳机为有线耳机,用户也将耳机线的插头插入手机相应的耳机接口中建立有线连接,本申请实施例对此不做任何限制。
S702(可选的)、蓝牙耳机检测是否处于佩戴状态。
佩戴检测方法可以通过光电探测的方式,利用光学感应原理感知用户的佩戴状态。当用户佩戴耳机时,耳机内部光电传感器检测到的光被遮挡,输出一个开关控制信号,从而判断用户处于佩戴耳机状态。
具体的,蓝牙耳机中可设置接近光传感器和加速度传感器,其中,接近光传感器设置在用户佩戴时与用户接触的一侧。该接近光传感器和加速度传感器可定期启动以获取当前检测到的测量值。
由于用户佩戴蓝牙耳机后会挡住射入接近光传感器的光线,因此,当接近光传感器检测到的光强小于预设的光强阈值时,蓝牙耳机可确定此时自身处于佩戴状态。又因为,用户佩戴蓝牙耳机后蓝牙耳机会随用户一起运动,因此,当加速度传感器检测到的加速度值大于预设的加速度阈值时,蓝牙耳机可确定此时自身处于佩戴状态。或者,当接近光传感器检测到的光强小于预设的光强阈值时,如果检测到此时加速度传感器检测到的加速度值是否大于预设的加速度阈值,则蓝牙耳机可确定此时自身处于佩戴状态。
进一步地,由于蓝牙耳机内还设置有通过骨传导的方式采集语音信息的传感器,例如骨振动传感器或光学振动传感器等,因此,在一种可能的实现方式中,蓝牙耳机可进一步通过骨振动传感器采集当前环境中产生的振动信号。当蓝牙耳机处于佩戴状态时与用户直接接触,因此骨振动传感器采集到的振动信号相较于未佩戴状态下较为强烈,那么,如果骨振动传感 器采集到的振动信号的能量大于能量阈值,则蓝牙耳机可确定出自身处于佩戴状态。又或者,由于用户佩戴蓝牙耳机时采集到的振动信号中的谐波、共振等频谱特征与蓝牙耳机未被佩戴时采集到的频谱特征具有显著区别,因此,如果骨振动传感器采集到的振动信号满足预设频谱特征,则蓝牙耳机可确定出自身处于佩戴状态。上述两种情况均可以理解为用户的佩戴状态检测结果通过。这样可以减少用户将蓝牙耳机放入口袋等场景下,蓝牙耳机无法通过接近光传感器或加速度传感器准确检测佩戴状态的几率。
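As a rough illustration of the wear-state decision just described, the sketch below combines the proximity light sensor, acceleration sensor and bone vibration cues; the sensor readings are assumed to be plain numbers, and every threshold value and name is a hypothetical placeholder rather than a value from this application.

```python
def is_worn(light_intensity, acceleration, bone_signal_energy,
            light_threshold=50.0, accel_threshold=1.2, energy_threshold=0.01):
    """Illustrative wear detection for the Bluetooth headset.

    light_intensity    : proximity light sensor reading (blocked by the ear -> low value)
    acceleration       : magnitude reported by the acceleration sensor
    bone_signal_energy : short-term energy of the bone vibration signal
    All thresholds are placeholder values.
    """
    light_says_worn = light_intensity < light_threshold    # incident light is blocked
    motion_says_worn = acceleration > accel_threshold       # headset moves with the user
    bone_says_worn = bone_signal_energy > energy_threshold  # skin contact passes vibration

    # Any of the cues described in the text is treated as "worn";
    # a stricter policy could require the light and motion cues together.
    return light_says_worn or motion_says_worn or bone_says_worn
```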
其中,上述能量阈值或者预设频谱特征可以是通过抓取大量用户佩戴蓝牙耳机后发声或者运动等方式产生的各种振动信号后统计得到的,与用户没有佩戴蓝牙耳机时骨振动传感器检测到的语音信号的能量或频谱特征具有明显差异。另外,由于蓝牙耳机外部的语音传感器(例如气传导麦克风)的功耗一般较大,因此,在蓝牙耳机检测出当前处于佩戴状态之前,无需开启耳内语音传感器、耳外语音传感器和/或骨振动传感器。当蓝牙耳机检测出当前处于佩戴状态后,可开启耳内语音传感器、耳外语音传感器和/或骨振动传感器采集用户发声时产生的语音信息,以降低蓝牙耳机的功耗。
当蓝牙耳机检测出当前处于佩戴状态后,或者说,佩戴状态检测结果通过后,可继续执行下述步骤S703-S707;否则,蓝牙耳机可进入休眠状态,直到检测出当前处于佩戴状态后继续执行下述步骤S703-S707。也就是说,蓝牙耳机可在检测出用户佩戴了蓝牙耳机,即用户对蓝牙耳机具有使用意图时,才会触发蓝牙耳机采集从而获取用户输入的语音信息以及声纹识别等过程,从而降低蓝牙耳机的功耗。当然,上述步骤S702为可选步骤,即无论用户是否佩戴了蓝牙耳机,蓝牙耳机均可续执行下述步骤S703-S707,本申请实施例对此不做任何限制。
在一种可能的实现方式中,若蓝牙耳机检测是否处于佩戴状态前已经采集了语音信号,这种情况下,当蓝牙耳机检测出当前处于佩戴状态后,或者说,佩戴状态检测结果通过后,蓝牙耳机采集的语音信号存储并继续执行下述步骤S703-S707;当蓝牙耳机没有检测出当前处于佩戴状态,或者说,佩戴状态检测结果不通过后,则蓝牙耳机删除刚刚采集的语音信号。
S703、若处于佩戴状态,则蓝牙耳机通过耳内语音传感器采集从而获取用户输入的语音信息中的第一语音分量,通过耳外语音传感器采集上述语音信息中的第二语音分量,并通过骨振动传感器采集上述语音信息中的第三语音分量。
当确定出蓝牙耳机处于佩戴状态时,蓝牙耳机可启动语音检测模块,分别使用上述耳内语音传感器、耳外语音传感器和骨振动传感器采集从而获取用户输入的语音信息,得到该语音信息中的第一语音分量、第二语音分量和第三语音分量。以耳内语音传感器和耳外语音传感器为气传导麦克风,骨振动传感器为骨传导麦克风举例,用户在使用蓝牙耳机的过程中可以输入语音信息“小E,使用微信支付”。此时,由于气传导麦克风暴露在空气中,因此,蓝牙耳机可使用气传导麦克风接收用户发声后由空气振动产生的振动信号(即上述语音信息中的第一语音分量、第二语音分量和第三语音分量)。同时,由于骨传导麦克风能够通过皮肤与用户耳骨接触,因此,蓝牙耳机可使用骨传导麦克风接收用户发声后由耳骨和皮肤振动产生的振动信号(即上述语音信息中的第三语音分量)。
如图8所示为传感器设置区域示意图,本申请实施例提供的蓝牙耳机包括耳内语音传感器、耳外语音传感器和骨振动传感器。其中,耳内语音传感器指的是,当该耳机处于被用户使用的状态时,该耳内语音传感器位于用户的耳道内部,或者说,该耳内语音传感器的声音侦测方向为耳道内部,该耳内语音传感器的设置于耳内语音传感器设置区域801。该耳内语音传感器用于采集用户发声时经过外界空气和耳道内空气的振动传播的声音,该声音为耳内 语音信号分量。耳外语音传感器指的是,当该耳机处于被用户使用的状态时,该耳外语音传感器位于用户的耳道外部,或者说,该耳外语音传感器的声音侦测方向为除耳道内部的其他方向,即整个外部空气方向,该耳外语音传感器的设置于耳外语音传感器设置区域802。该耳外语音传感器暴露于环境中,用于采集用户发出的经过外界空气的振动传播的声音,该声音为耳外语音信号分量或环境声分量。骨振动传感器指的是,当该耳机处于被用户使用的状态时,该骨振动传感器与用户的皮肤接触,用于采集用户骨头传递的振动信号,或者说,用于采集用户某次发声时,通过骨头振动所传递的语音信息分量。该骨振动传感器的设置区域不做限定,只要在用户佩戴该耳机时,能够检测到用户的骨骼振动即可。可以理解的是,耳内语音传感器可以设置于区域801中的任意位置,耳外语音传感器可以设置于区域802中的任意位置,本申请对此不做限制。需要注意的是,图8中的区域划分方式只是一种实例,实际上,耳内语音传感器的设置位置能够侦测到耳道内部的声音即可,耳外语音传感器的设置位置能够侦测到外部空气方向的声音即可。
在本申请的一些实施例中,当蓝牙耳机检测到用户输入的语音信息后,还可以通过VAD(voice activity detection,语音活动检测)算法区分上述语音信息中的语音信号和背景噪音。具体的,蓝牙耳机可以分别将上述语音信息中的第一语音分量、第二语音分量和第三语音分量输入至相应的VAD算法中,得到与第一语音分量对应的第一VAD取值、与第二语音分量对应的第二VAD取值以及与第三语音分量对应的第三VAD取值。其中,VAD取值可用于反映上述语音信息是说话人正常的语音信号还是噪音信号。例如,可将VAD取值范围设置在0至100的区间内,当VAD取值大于某一VAD阈值时可说明该语音信息是说话人正常的语音信号,当VAD取值小于某一VAD阈值时可说明该语音信息是噪音信号。又例如,可将VAD取值设置为0或1,当VAD取值为1时,说明该语音信息是说话人正常的语音信号,当VAD取值为0时,说明该语音信息是噪音信号。
那么,蓝牙耳机可结合上述第一VAD取值、第二VAD取值和第三VAD取值这三个VAD取值确定上述语音信息是否为噪音信号。例如,当第一VAD取值、第二VAD取值和第三VAD取值均为1时,蓝牙耳机可确定上述语音信息不是噪音信号,而是说话人正常的语音信号。又例如,当第一VAD取值、第二VAD取值和第三VAD取值分别大于预设取值时,蓝牙耳机可确定上述语音信息不是噪音信号,而是说话人正常的语音信号。
另外,当第三VAD取值为1或者第三VAD取值大于预设取值时,可一定程度上说明此时采集到的语音信息为活体用户发出的,因此,蓝牙耳机也可以仅根据第三VAD取值确定上述语音信息是否为噪音信号。可以理解的是,在一些情况下,蓝牙耳机也可以仅根据第一VAD取值或第二VAD取值确定上述语音信息是否为噪音信号,蓝牙耳机也可以根据第一VAD取值、第二VAD取值和第三VAD取值中的任意两个确定上述语音信息是否为噪音信号。
通过对上述第一语音分量、第二语音分量和第三语音分量分别进行语音活动检测,如果蓝牙耳机确定出上述语音信息是噪音信号,则蓝牙耳机可丢弃该语音信息;如果蓝牙耳机确定出上述语音信息不是噪音信号,则蓝牙耳机可继续执行下述步骤S704-S707。即用户向蓝牙耳机输入有效的语音信息时,才会触发蓝牙耳机进行后续声纹识别等过程,从而降低蓝牙耳机的功耗。
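A minimal sketch of this VAD-based gating follows; it assumes binary per-channel VAD decisions (1 = speech, 0 = noise) have already been computed for the first, second and third voice components, and the three gating modes correspond to the alternatives mentioned above (all channels, bone channel only, any two channels).

```python
def is_valid_speech(vad_in_ear, vad_out_ear, vad_bone, mode="all"):
    """Decide whether the captured voice information is speech or noise.

    vad_* are binary VAD decisions for the three voice components.
    """
    decisions = [vad_in_ear, vad_out_ear, vad_bone]
    if mode == "all":        # all three channels must indicate speech
        return all(v == 1 for v in decisions)
    if mode == "bone_only":  # the bone channel alone suggests a live speaker
        return vad_bone == 1
    if mode == "any_two":    # any two of the three channels agree
        return sum(decisions) >= 2
    raise ValueError("unknown mode")


# Only voice components judged to be valid speech are denoised and passed on
# for voiceprint recognition; otherwise they are discarded to save power.
if is_valid_speech(1, 1, 1, mode="all"):
    pass  # forward the first/second/third voice components to the next step
```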
另外,当蓝牙耳机获取到与第一语音分量、第二语音分量和第三语音分量分别对应的第一VAD取值、第二VAD取值和第三VAD取值后,还可以使用噪声估计算法(例如,最小值统计算法或最小值控制递归平均算法等)分别测算上述语音信息中的噪声值。例如,蓝牙耳机可以设置专门用于存储噪声值的存储空间,蓝牙耳机每次计算出新的噪声值后,可以将新的 噪声值更新在上述存储空间中。即该存储空间中一直保存有最近测算出的噪声值。
这样,蓝牙耳机通过上述VAD算法确定出上述语音信息为有效的语音信息后,可使用上述存储空间中的噪声值分别对上述第一语音分量、第二语音分量和第三语音分量进行降噪处理,使得后续蓝牙耳机(或手机)分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别时的识别结果更加准确。
S704、蓝牙耳机通过蓝牙连接向手机发送第一语音分量、第二语音分量和第三语音分量。
蓝牙耳机获取到上述第一语音分量、第二语音分量和第三语音分量后,可将第一语音分量、第二语音分量和第三语音分量发送给手机,进而由手机执行下述步骤S705-S707,以实现对用户输入的语音信息的声纹识别、用户身份鉴权等操作。
S705、手机分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别,得到与第一语音分量对应的第一声纹识别结果、与第二语音分量对应的第二声纹识别结果以及与第三语音分量对应的第三声纹识别结果。
声纹识别的原理是通过比对预设用户的注册声纹特征与从用户输入的语音信息中提取到的声纹特征,通过一定的算法进行判断,判断的结果即为声纹识别结果。
具体的,手机内可预先存储一个或多个预设用户的注册声纹特征。其中,每个预设用户均具有三个注册声纹特征,一个是根据耳内语音传感器工作时采集到的用户的第一注册语音进行特征提取得到的第一注册声纹特征,一个是根据耳外语音传感器工作时采集到的用户的第二注册语音进行特征提取得到的第二注册声纹特征,还有一个是根据骨传导麦克风工作时采集到的用户的第三注册语音进行特征提取得到的第三注册声纹特征。
其中,第一注册声纹特征、第二注册声纹特征和第三注册声纹特征的获取需要经过两个阶段。第一阶段是背景模型训练阶段。在第一阶段中,开发人员可采集大量说话人佩戴上述蓝牙耳机发声时产生的相关文本的语音(例如,“你好,小E”等)。进而,手机可对这些相关文本的语音进行预处理(例如滤波、降噪等)后可提取语音中的声纹特征。其中,声纹特征具体可以为spectrogram(时频语谱图),fbank(filter banks,基于滤波器组的特征),mfcc(mel-frequency cepstral coefficients,梅尔频率倒谱系数),plp(Perceptual Linear Prediction,感知线性预测)或CQCC(Constant Q Cepstral Coefficients,常数Q倒谱系数)等。手机提取的声纹特征除了直接提取上述声纹特征以外,还可以提取两个或两个以上的上述声纹特征,并通过拼接等方式获得融合后的声纹特征。手机提起到声纹特征后,使用GMM(gaussian mixed model,高斯混合模型)、SVM(support vector machines,支持向量机)或者深度神经网络类框架等机器学习算法建立声纹识别的背景模型,其中,上述机器学习算法包括但不限于DNN(deep neural network,深度神经网络)算法,RNN(recurrent neural network,循环神经网络)算法,LSTM(long short term memory,长短时记忆)算法,TDNN(Time Delay Neural Network,时延神经网络),Resnet(深度残差网络)等。可以理解为,上述步骤是通过大量训练语音构建的UBM(Universal Background Model,通用背景模型),其中,UBM本身是可以自适应进行训练的,UBM的参数是可以根据不同的厂商需求或用户需求进行调整的。
手机在得到背景模型后将得到的背景模型进行存储,可以理解的是,根据该方法的执行主体的不同,存储的位置可以是手机、可穿戴设备或服务器。需要说明的是,可以存储单个或多个背景模型,存储的多个背景模型可以用相同或不同的算法得到。存储的多个背景模型可以实现声纹模型层面的融合。例如,可以使用Resnet(即深度残差网络)来训练得到第一背景说话人声纹模型,使用TDNN(即时延神经网络)来训练得到第二背景说话人声纹模型, 使用RNN(即循环神经网络)来训练得到第三背景说话人声纹模型。可以理解的是,本申请实施例可以对空气麦克风和骨振动麦克风分别建模,并进行多模型融合。手机或蓝牙耳机可基于这些背景模型,结合与该手机相连接的可穿戴设备中,不同的语音传感器的特性,分别建立多个声纹模型。例如,建立与蓝牙耳机的耳内语音传感器对应的第一声纹模型、与蓝牙耳机的耳外语音传感器对应第二声纹模型和与蓝牙耳机的骨振动传感器对应第三声纹模型。手机可以将第一声纹模型、第二声纹模型和第三声纹模型保存在手机本地,也可以将第一声纹模型、第二声纹模型和第三声纹模型发送给蓝牙耳机进行保存。
第二阶段是用户在手机上首次使用声纹识别功能时,通过输入注册语音,手机分别通过与手机相连接的蓝牙耳机的耳内语音传感器、耳外语音传感器和骨振动传感器,提取到该用户的第一注册声纹特征、第二注册声纹特征和第三注册声纹特征的过程。该阶段可以通过手机系统内置的设备生物识别功能中的声纹识别选项进行注册过程,也可以通过下载的APP调用系统程序进行注册过程。例如,预设用户1首次使用手机内安装的语音助手APP时,语音助手APP可提示用户佩戴蓝牙耳机并说出“你好,小E”的注册语音。同样,由于蓝牙耳机上包括耳内语音传感器、耳外语音语音传感器和骨振动传感器,因此,蓝牙耳机可获取到该注册语音中通过耳内语音传感器采集到的第一注册语音分量、通过耳外语音传感器采集到的第二注册语音分量以及通过骨振动传感器采集到的第三注册语音分量。进而,蓝牙耳机将第一注册语音分量、第二注册语音分量和第三注册语音分量发送给手机后,手机可分别通过第一声纹模型对第一注册语音分量进行特征提取得到第一注册声纹特征,通过第二声纹模型对第二注册语音分量进行特征提取得到第二注册声纹特征,通过第三声纹模型对第三注册语音分量进行特征提取得到第三注册声纹特征。手机可以将预设用户1的第一注册声纹特征、第二注册声纹特征和第三注册声纹特征保存在手机本地,也可以将预设用户1的第一注册声纹特征、第二注册声纹特征和第三注册声纹特征发送给蓝牙耳机进行保存。
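The enrollment described above can be summarized by the following sketch: each registration voice component is passed through its own voiceprint model to obtain a registered voiceprint feature, and the three features are stored for the preset user. The `embed` method and the storage layout are assumed interfaces for illustration only.

```python
def enroll_preset_user(user_id, reg_in_ear, reg_out_ear, reg_bone,
                       model_in_ear, model_out_ear, model_bone, store):
    """Illustrative enrollment producing one registered voiceprint feature per channel.

    reg_*   : registration voice components captured by the three sensors
    model_* : per-channel voiceprint models exposing a hypothetical embed(waveform) method
    store   : dict-like storage kept on the mobile phone or on the headset
    """
    store[user_id] = {
        "in_ear":  model_in_ear.embed(reg_in_ear),    # first registered voiceprint feature
        "out_ear": model_out_ear.embed(reg_out_ear),  # second registered voiceprint feature
        "bone":    model_bone.embed(reg_bone),        # third registered voiceprint feature
    }
```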
可选地,在提取预设用户1的第一注册声纹特征、第二注册声纹特征和第三注册声纹特征时,手机还可以将此时连接的蓝牙耳机作为预设蓝牙设备。例如,手机可以将该预设蓝牙设备的标识(例如蓝牙耳机的MAC地址等)保存在手机本地。这样,手机可以接收和执行预设蓝牙设备发来的相关操作指令,而当非法蓝牙设备向手机发送操作指令时,手机可丢弃该操作指令以提高安全性。一个手机可以管理一个或多个预设蓝牙设备。如图11中的(a)所示,用户可以从设置功能中进入声纹识别功能的设置界面1101,用户点击设置按钮1105后可进入如图11中的(b)所示的预设设备管理界面1106。用户在预设设备管理界面1106中可以添加或删除预设蓝牙设备。
在步骤S705中,手机获取到上述语音信息中的第一语音分量、第二语音分量和第三语音分量后,可分别提取第一语音分量声纹特征得到第一声纹特征、提取第二语音分量声纹特征得到第二声纹特征以及提取第三语音分量声纹特征得到第三声纹特征,进而使用预设用户1的第一注册声纹特征与第一声纹特征进行匹配,使用预设用户1的第二注册声纹特征与第二声纹特征进行匹配,使用预设用户1的第三注册声纹特征与第三声纹特征进行匹配。例如,手机可通过一定算法计算上述第一注册声纹特征与第一语音分量的第一匹配度(即第一声纹识别结果),上述第二注册声纹特征与第二语音分量的第二匹配度(即第二声纹识别结果)以及上述第三注册声纹特征与第三语音分量的第三匹配度(即第三声纹识别结果)。一般,当匹配度越高时,说明上述语音信息中的声纹特征与预设用户1的声纹特征越相似,输入该语音信息的用户是预设用户1的概率越高。
例如,当第一匹配度、第二匹配度与第三匹配度的平均值大于80分时,手机可确定第一 声纹特征与第一注册声纹特征匹配,第二声纹特征与第二注册声纹特征匹配,且第三声纹特征与第三注册声纹特征匹配。又或者,当第一匹配度、第二匹配度与第三匹配度分别大于85分时,手机可确定第一声纹特征与第一注册声纹特征匹配,第二声纹特征与第二注册声纹特征匹配,且第三声纹特征与第三注册声纹特征匹配。
其中,第一注册声纹特征是通过第一声纹模型进行特征提取得到的,第一注册声纹特征用于反映耳内语音传感器采集到的预设用户的声纹特征;第二注册声纹特征是通过第二声纹模型进行特征提取得到的,第二注册声纹特征用于反映耳外语音传感器采集到的所述预设用户的声纹特征;第三注册声纹特征是通过第三声纹模型进行特征提取得到的,第三注册声纹特征用于反映骨振动传感器采集到的所述预设用户的声纹特征。可以理解的是,声纹模型的功能是提取输入语音的声纹特征,输入语音为注册语音时,声纹模型能够提取注册语音的注册声纹特征,输入语音为用户某次说话的语音时,声纹模型能够提取该语音的声纹特征。可选地,声纹特征的获取方式还可以为融合方式,包括声纹模型融合方式和声纹特征层面的融合方式。
在一种可能的实现方式中,上述计算匹配度的算法可以为计算相似度。手机对第一语音分量进行特征提取,得到第一声纹特征,分别计算第一声纹特征与预先存储的预设用户的第一注册声纹特征的第一相似度,第二声纹特征与预先存储的预设用户的第二注册声纹特征的第二相似度,第三声纹特征与预先存储的预设用户的第三注册声纹特征的第三相似度。
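As an illustration of this per-channel similarity computation, the sketch below uses cosine similarity, which is only one of the possible measures; the channel keys and function names are assumptions made for the example.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def channel_similarities(test_features, registered_features):
    """First/second/third similarities between the voiceprint features extracted from the
    current voice components and the registered voiceprint features of one preset user."""
    return {channel: cosine_similarity(test_features[channel], registered_features[channel])
            for channel in ("in_ear", "out_ear", "bone")}
```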
如果手机内存储有多个预设用户的注册声纹特征,则手机还可以按照上述方法逐一计算上述第一语音分量与其他预设用户(例如预设用户2、预设用户3)的第一匹配度,以及上述第二语音分量与其他预设用户的第二匹配度。进而,蓝牙耳机可以将匹配度最高的预设用户(例如预设用户A)确定为此时的发声用户。
另外,在手机对第一语音分量、第二语音分量和第三语音分量进行声纹识别之前,还可以先判断是否需要对第一语音分量、第二语音分量和第三语音分量进行声纹识别。判断的方式可以为对语音信息进行关键词检测,当语音信息中包括预设的关键词时,手机分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别;或者;判断的方式还可以为对用户输入进行检测,当接收到用户输入的预设操作时,手机分别对所述第一语音分量、所述第二语音分量和所述第三语音分量进行声纹识别。其中,关键词检测的具体方式可以为对于关键词进行语音识别后相似度大于预设阈值,则认为关键词检测通过。
在一种可能的实现方式中,如果蓝牙耳机或者手机可以从用户输入的语音信息中识别出预设的关键词,例如,“转账”、“支付”、“**银行”或者“聊天记录”等涉及用户隐私或资金行为的关键词,说明用户此时通过语音控制手机所需的安全需求较高,因此,手机可执行步骤S705进行声纹识别。又例如,如果蓝牙耳机检测接收到用户输入所执行的预先设置的用于开启声纹识别功能的操作,例如,敲击蓝牙耳机或者同时按下音量+和音量-按键等操作,说明用户此时需要通过声纹识别验证用户身份,因此,蓝牙耳机可通知手机执行步骤S705进行声纹识别。
又或者,还可以在手机内预先设置与不同安全等级对应的关键词。例如,安全等级最高的关键词包括“支付”、“付款”等,安全等级较高的关键词包括“拍照”、“打电话”等,安全等级最低的关键词包括“听歌”、“导航”等。这样,当检测到上述采集到的语音信息中包含安全等级最高的关键词时,可触发手机分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别,即对采集到的三路音源均进行声纹识别以提高语音控制手机时的安全性。当检测到上述采集到的语音信息中包含安全等级较高的关键词时,由于此时用户通过语音控 制手机的安全性需求一般,因此可触发手机仅对第一语音分量、第二语音分量或第三语音分量进行声纹识别。当检测到上述采集到的语音信息中包含安全等级最低的关键词时,手机无需对第一语音分量、第二语音分量和第三语音分量进行声纹识别。
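The keyword security levels just described can be sketched as a simple mapping from the detected keywords to the set of channels that must pass voiceprint recognition; the keyword lists reuse the examples given above, and the returned channel sets are illustrative.

```python
# Illustrative keyword groups per security level (examples taken from the text).
HIGH_SECURITY_KEYWORDS = {"支付", "付款", "转账"}
MEDIUM_SECURITY_KEYWORDS = {"拍照", "打电话"}
LOW_SECURITY_KEYWORDS = {"听歌", "导航"}


def channels_to_verify(recognized_text):
    """Return the voice components that should undergo voiceprint recognition."""
    if any(k in recognized_text for k in HIGH_SECURITY_KEYWORDS):
        return ["in_ear", "out_ear", "bone"]  # verify all three channels
    if any(k in recognized_text for k in MEDIUM_SECURITY_KEYWORDS):
        return ["in_ear"]                     # a single channel is considered sufficient
    return []                                 # low security: skip voiceprint recognition
```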
当然,如果蓝牙耳机采集到的语音信息中没有包含关键词,说明此时采集到的语音信息可能只是用户在正常交谈时发出的语音信息,因此,手机无需对第一语音分量、第二语音分量和第三语音分量进行声纹识别,从而可降低手机的功耗。
又或者,手机还可以预先设置一个或多个唤醒词用于唤醒手机打开声纹识别功能。例如,该唤醒词可以为“你好,小E”。当用户向蓝牙耳机输入语音信息后,蓝牙耳机或手机可识别该语音信息是否是包含唤醒词的唤醒语音。例如,蓝牙耳机可将采集到的语音信息中的第一语音分量、第二语音分量和第三语音分量发送给手机,如果手机进一步识别出该语音信息中包含上述唤醒词,则手机可打开声纹识别功能(例如为声纹识别芯片上电)。后续如果蓝牙耳机采集到的语音信息中包含上述关键词,则手机可使用已开启的声纹识别功能按照步骤S705的方法进行声纹识别。
又例如,蓝牙耳机采集到语音信息后也可进一步识别该语音信息中是否包含上述唤醒词。如果包含上述唤醒词,则说明后续用户可能需要使用声纹识别功能,那么,蓝牙耳机可向手机发送启动指令,使得手机响应于该启动指令打开声纹识别功能。
S706、手机根据第一声纹识别结果、第二声纹识别结果和第三声纹识别结果对用户身份鉴权。
在步骤S706中,手机通过声纹识别得到与第一语音分量对应的第一声纹识别结果、与第二语音分量对应的第二声纹识别结果以及与第三语音分量对应的第三声纹识别结果后,可综合这三个声纹识别结果对输入上述语音信息的用户身份鉴权,从而提高用户身份鉴权时的准确性和安全性。
示例性的,预设用户的第一注册声纹特征与上述第一声纹特征的第一匹配度为第一声纹识别结果,预设用户的第二注册声纹特征与上述第二声纹特征的第二匹配度为第二声纹识别结果,预设用户的第三注册声纹特征与上述第三声纹特征的第三匹配度为第三声纹识别结果。在对用户身份鉴权时,如果上述第一匹配度、第二匹配度和第三匹配度满足预设的鉴权策略,例如,鉴权策略为当上述第一匹配度大于第一阈值、上述第二匹配度大于第二阈值且上述第三匹配度大于第三阈值时(第三阈值、第二阈值与第一阈值互相可以相同或不同),手机确定发出该第一语音分量、第二语音分量和第三语音分量的用户为预设用户;否则,手机可确定发出该第一语音分量、第二语音分量和第三语音分量的用户为非法用户。
又例如，手机可计算上述第一匹配度、第二匹配度和第三匹配度的加权平均值，当该加权平均值大于预设阈值时，手机可确定发出该第一语音分量、第二语音分量和第三语音分量的用户为预设用户；否则，手机可确定发出上述第一语音分量、第二语音分量和第三语音分量的用户为非法用户。
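作为对上述两种鉴权策略（逐一比较阈值、加权平均后比较阈值）的极简示意，可参考下面的Python代码。其中的阈值85分与权重数值均为便于说明而假设：

def authenticate(m1, m2, m3, t1=85.0, t2=85.0, t3=85.0):
    """鉴权策略一：三个匹配度分别大于各自阈值时，判定发声用户为预设用户。
    阈值可以相同也可以不同，85分仅为示例。"""
    return m1 > t1 and m2 > t2 and m3 > t3

def authenticate_weighted(matches, weights=(0.4, 0.3, 0.3), threshold=85.0):
    """鉴权策略二：对匹配度做加权平均后与预设阈值比较，权重为假设值。"""
    fused = sum(m * w for m, w in zip(matches, weights))
    return fused > threshold

print(authenticate(90, 88, 86))              # True：三路匹配度均超过阈值
print(authenticate_weighted((90, 80, 70)))   # False：加权平均 81 分未超过 85 分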
又或者,手机可以在不同的声纹识别场景下使用不同的鉴权策略。例如,当采集到的语音信息中包含安全等级最高的关键词时,手机可将上述第一阈值、第二阈值和第三阈值均设置为99分。这样,只有当第一匹配度、第二匹配度和第三匹配度均大于99分时,手机确定当前的发声用户为预设用户。而当采集到的语音信息中包含安全等级较低的关键词时,手机可将上述第一阈值、第二阈值和第三阈值均设置为85分。这样,当第一匹配度、第二匹配度和第三匹配度均大于85分时,手机便可确定当前的发声用户为预设用户。也就是说,对于不同安全等级的声纹识别场景,手机可使用不同安全等级的鉴权策略对用户身份鉴权。
另外，如果手机内存储有一个或多个预设用户的声纹模型，例如，手机内存储有预设用户A、预设用户B和预设用户C的注册声纹特征，每个预设用户的注册声纹特征均包括第一注册声纹特征、第二注册声纹特征和第三注册声纹特征。那么，手机可以按照上述方法将采集到的第一语音分量、第二语音分量和第三语音分量分别与每个预设用户的注册声纹特征进行匹配。进而，手机可以将满足上述鉴权策略，且匹配度最高的预设用户（例如预设用户A）确定为此时的发声用户。
在另一种可能的实现方式中，手机接收到蓝牙耳机发送的语音信息中的第一语音分量、第二语音分量和第三语音分量后，可将第一语音分量、第二语音分量和第三语音分量融合后进行声纹识别，例如，计算第一语音分量、第二语音分量和第三语音分量融合后与预设用户的声纹模型之间的匹配度。进而，手机根据该匹配度也能够对用户身份鉴权。由于这种身份鉴权方法中预设用户的声纹模型被融合为一个，因此声纹模型的复杂度和所需的存储空间都相应降低；同时，由于利用了第二语音分量的声纹特征信息，因此也具有双重声纹保障和活体检测功能。
又例如,上述计算匹配度的算法可以为计算相似度。手机对第一语音分量进行特征提取,得到第一声纹特征,分别计算第一声纹特征与预先存储的预设用户的第一注册声纹特征的第一相似度,第二声纹特征与预先存储的预设用户的第二注册声纹特征的第二相似度,第三声纹特征与预先存储的预设用户的第三注册声纹特征的第三相似度,基于第一相似度、第二相似度和第三相似度,对用户进行身份鉴权。相似度计算的方法包括:欧氏距离(Euclidean Distance)、余弦相似度(Cosine)、皮尔逊相关系数(Pearson)、修正余弦相似度(Adjusted Cosine)、汉明距离(Hamming Distance)、曼哈顿距离(Manhattan Distance)等,本申请对此不作限制。
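以上文列举的余弦相似度和欧氏距离为例，下面的Python代码给出计算声纹特征相似度的一种常见写法，特征向量的64维仅为假设值，仅作示意：

import numpy as np

def cosine_similarity(a, b):
    """余弦相似度：取值范围[-1, 1]，越大越相似。"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def euclidean_distance(a, b):
    """欧氏距离：越小越相似。"""
    return float(np.linalg.norm(a - b))

# 用法示意：比较当前声纹特征与预先存储的注册声纹特征
rng = np.random.default_rng(1)
enrolled, current = rng.standard_normal(64), rng.standard_normal(64)
print(cosine_similarity(enrolled, current), euclidean_distance(enrolled, current))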
对用户进行身份鉴权的方式可以为手机根据环境声的分贝数和蓝牙耳机的播放音量,分别确定第一相似度对应的第一融合系数,第二相似度对应的第二融合系数,第三相似度对应的第三融合系数;根据第一融合系数、第二融合系数和第三融合系数融合第一相似度、第二相似度和第三相似度,得到融合相似度得分。若融合相似度得分大于第一阈值,则手机确定向蓝牙耳机输入语音信息的用户为预设用户。
在一种可能的实现方式中,环境声的分贝数是蓝牙耳机的声压传感器检测得到的并发送给手机的,播放音量可以是蓝牙耳机的扬声器检测播放信号得到的并发送给手机的,也可以是手机本身调用自身数据得到的。
在一种可能的实现方式中，第二融合系数与环境声的分贝数呈负相关，第一融合系数、第三融合系数分别与播放音量的分贝数呈负相关，第一融合系数、第二融合系数和第三融合系数的和为固定值。也就是说，在第一融合系数、第二融合系数和第三融合系数的和为预设的固定值的情况下，环境声的分贝数越大，第二融合系数越小，此时，相应的，第一融合系数和第三融合系数会适应性增大，以维持第一融合系数、第二融合系数和第三融合系数的和不变；播放音量越大，第一融合系数和第三融合系数越小，此时，相应的，第二融合系数会适应性增大，以维持第一融合系数、第二融合系数和第三融合系数的和不变。该实现方式中的融合系数可以理解为动态的，换句话说，融合系数是根据环境声和播放音量动态变化的，即根据麦克风检测到的周围环境声音的分贝数、耳内传感器检测到的播放音量来动态决定融合系数。若环境声的分贝数较高，说明环境噪声水平较高，可以认为蓝牙耳机受环境噪声影响较大，因此本申请提供的语音控制方法需要降低蓝牙耳机耳外传感器和骨振动传感器对应的融合系数，融合相似度得分的结果更加依赖受环境噪声影响比较小的耳内传感器；反之，若播放音量较大，说明在耳道内的播放声音的噪声水平较高，可以认为蓝牙耳机的耳内传感器受播放声音的影响较大，因此本申请提供的语音控制方法需要降低耳内传感器对应的融合系数，融合相似度得分的结果更加依赖受播放声音影响比较小的耳外传感器和骨振动传感器。
具体的，在系统设计时可以根据以上原则设置查找表，在具体使用时，可以根据监测到的自身音量和环境声分贝数，通过查表的方式，确定融合系数。表1-1所示为一个示例，其中耳内语音传感器和骨振动传感器采集的语音信号的相似度得分的融合系数分别用a1和a2表示，耳外语音传感器采集的语音信号得到的相似度得分的融合系数用b1表示。当环境音超过60dB的时候，此时的外界环境可以认为比较嘈杂，耳外语音传感器采集的语音信号会夹杂较多的环境噪音，耳外语音传感器采集的语音信号对应的融合系数可以使用较低数值或者直接置成0。耳机内部扬声器播放音量超过总音量的80%时，可以认为耳机内部的音量过大，耳内语音传感器采集的语音信号对应的融合系数可以使用较低数值或者直接置成0。当外界环境噪声过大（例如，环境音超过60dB）并且扬声器音量过高（例如，耳机扬声器音量超过总音量的60%）时，采集到的语音信号干扰太大，声纹识别失效。可以理解的是，具体应用中，“音量20%”、“音量40%”和“环境音20dB”、“环境音40dB”可以代表一个范围，例如，“音量20%”指的是“音量10%-30%”，“音量40%”指的是“音量30%-50%”；“环境音20dB”指的是“环境音10dB-30dB”，“环境音40dB”指的是“环境音30dB-50dB”。
表1-1（以图像形式给出，示出不同扬声器音量与环境音分贝数组合下融合系数a1、a2、b1的取值示例）
可以理解的是,上述具体设计仅为一种实例,具体的参数设置、阈值设置以及不同的环境音分贝数和扬声器音量对应怎样的系数,可以根据实际情况进行设计和更改,本申请对此不做限制。需要注意的是,本申请实施例提供的融合系数可以理解为“动态融合系数”,即融合系数可以根据不同的环境音分贝数和扬声器音量进行动态调整。
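下面的Python代码把上述“动态融合系数”的原则写成一个极简示意：第二融合系数随环境声分贝数增大而减小，第一、第三融合系数随播放音量增大而减小，并归一化使三者之和保持为固定值。其中的线性衰减形式、60dB上限等常数均为便于说明的假设，实际也可以按上文所述的查找表方式实现：

def fusion_coefficients(ambient_db, play_volume, total=1.0):
    """根据环境声分贝数与播放音量动态计算融合系数（示意实现）。
    ambient_db：环境声分贝数；play_volume：扬声器播放音量占总音量的比例(0~1)。
    返回 (c1, c2, c3)，分别对应耳内、耳外、骨振动三路相似度，且 c1 + c2 + c3 = total。"""
    raw1 = max(0.0, 1.0 - play_volume)           # 第一融合系数与播放音量负相关
    raw2 = max(0.0, 1.0 - ambient_db / 60.0)     # 第二融合系数与环境声分贝数负相关，60dB为假设上限
    raw3 = max(0.0, 1.0 - play_volume)           # 第三融合系数与播放音量负相关
    s = raw1 + raw2 + raw3
    if s == 0.0:                                 # 干扰过大，对应上文"声纹识别失效"的情形
        return None
    return tuple(total * r / s for r in (raw1, raw2, raw3))

def fused_score(sims, coeffs):
    """按融合系数对三路相似度加权求和，得到融合相似度得分。"""
    return sum(s * c for s, c in zip(sims, coeffs))

coeffs = fusion_coefficients(ambient_db=40.0, play_volume=0.2)
print(coeffs, fused_score((0.9, 0.6, 0.8), coeffs))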
示例性的，在另一种可能的实现方式中，S706中基于第一声纹识别结果、第二声纹识别结果和第三声纹识别结果进行融合来对用户进行身份鉴权的策略，可以变更为直接对音频特征进行融合，基于融合音频特征和声纹模型提取得到声纹特征，计算该声纹特征与预先存储的预设用户的注册声纹特征的相似度，继而进行身份鉴权。具体的，从耳内语音传感器和耳外语音传感器采集的当前用户的语音信号中提取各帧的音频特征feaE1、feaE2，从骨振动传感器采集的当前用户的语音信号中提取各帧的音频特征feaB1。对上述音频特征feaE1、feaE2、feaB1进行融合，包括但不限于下述方法：对feaE1、feaE2和feaB1进行归一化处理得到feaE1’、feaE2’和feaB1’，然后拼接成一个特征矢量fea=[feaE1’,feaE2’,feaB1’]。将特征矢量fea通过声纹模型进行声纹特征提取，获得当前用户的声纹特征。同理，注册用户的注册语音可以参照上述方法获得注册用户的声纹特征。将当前用户的声纹特征和注册用户的声纹特征进行相似度比对，从而得到相似度得分，判断相似度得分与预设阈值的关系，从而获得鉴权结果。
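下面的Python代码示意上述音频特征层面的融合：对feaE1、feaE2、feaB1归一化后在特征维上拼接为一个特征矢量fea。其中的帧数、特征维度与归一化方式均为便于说明的假设：

import numpy as np

def normalize(feat):
    """按特征维做零均值、单位方差归一化（归一化方式为示例之一）。"""
    return (feat - feat.mean(axis=0)) / (feat.std(axis=0) + 1e-9)

def fuse_audio_features(feaE1, feaE2, feaB1):
    """对三路逐帧音频特征归一化后在特征维上拼接：fea = [feaE1', feaE2', feaB1']。"""
    return np.concatenate([normalize(feaE1), normalize(feaE2), normalize(feaB1)], axis=1)

# 用法示意：假设每路各100帧、每帧40维音频特征（帧数与维度均为假设值）
rng = np.random.default_rng(2)
fea = fuse_audio_features(rng.standard_normal((100, 40)),
                          rng.standard_normal((100, 40)),
                          rng.standard_normal((100, 40)))
print(fea.shape)   # (100, 120)，随后将该特征矢量送入声纹模型提取当前用户的声纹特征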
示例性的，在另一种可能的实现方式中，S706中基于第一相似度、第二相似度和第三相似度进行融合来对用户进行身份鉴权的策略，可以变更为对第一声纹特征、第二声纹特征和第三声纹特征进行融合得到融合声纹特征，计算融合声纹特征与预先存储的预设用户的注册融合声纹特征的相似度，继而进行身份鉴权。具体的，将从耳内语音传感器和耳外语音传感器采集的当前用户的语音信号通过声纹模型进行特征提取，得到声纹特征e1、e2；将从骨振动传感器采集的当前用户的语音信号通过声纹模型进行特征提取，得到声纹特征b1。对上述声纹特征e1、e2、b1进行拼接融合，得到拼接后的当前用户的声纹特征m1=[e1,e2,b1]。同理，注册用户的注册语音可以参照上述方法获得拼接后的注册用户的声纹特征。将拼接后的当前用户的声纹特征和拼接后的注册用户的声纹特征进行相似度比对，从而得到相似度得分，判断相似度得分与预设阈值的关系，从而获得鉴权结果。
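与之对应，下面的Python代码示意声纹特征层面的融合：将e1、e2、b1拼接为m1，再与注册侧按同样方式拼接得到的声纹特征比较相似度。向量维度与阈值0.7均为便于说明的假设值：

import numpy as np

def fuse_voiceprints(e1, e2, b1):
    """声纹特征层面的融合：拼接得到 m1 = [e1, e2, b1]。"""
    return np.concatenate([e1, e2, b1])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

rng = np.random.default_rng(3)
current = fuse_voiceprints(rng.standard_normal(64), rng.standard_normal(64), rng.standard_normal(64))
enrolled = fuse_voiceprints(rng.standard_normal(64), rng.standard_normal(64), rng.standard_normal(64))
THRESHOLD = 0.7   # 预设阈值，数值仅为示例
print("鉴权通过" if cosine(current, enrolled) > THRESHOLD else "鉴权不通过")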
S707、若上述用户为预设用户,则手机执行与上述语音信息对应的操作指令。
通过上述步骤S706的鉴权过程,如果鉴权通过,手机确定出步骤S702中输入语音信息的发声用户为预设用户,则手机可执行与上述语音信息对应的操作指令,若鉴权不通过,则不执行后续的操作指令。可以理解的是,操作指令包括但不限于手机解锁操作或者确认支付操作。例如,当上述语音信息为“小E,使用微信支付”时,与其对应的操作指令为打开微信APP的支付界面。这样,手机生成打开微信APP中支付界面的操作指令后,可自动打开微信APP,并显示微信APP中的支付界面。
另外，由于手机已经确定出上述用户为预设用户，因此，如图9所示，如果当前手机处于锁定状态，手机还可以先解锁屏幕，再执行打开微信APP中支付界面的操作指令，显示微信APP中的支付界面901。
示例性的,上述步骤S701-S707提供的语音控制方法可以是语音助手APP提供的一项功能。蓝牙耳机与手机交互时,如果通过声纹识别确定此时的发声用户为预设用户,手机可将生成的操作指令或语音信息等数据发送给应用程序层运行的语音助手APP。进而,由语音助手APP调用应用程序框架层的相关接口或服务执行与上述语音信息对应的操作指令。
可以看出，本申请实施例中提供的语音控制方法可以在利用声纹识别用户身份的同时，对手机解锁并执行语音信息中的相关操作指令。即用户只需要输入一次语音信息即可完成用户身份鉴权、手机解锁以及打开手机某一功能等一系列操作，从而大大提高了用户对手机的操控效率和用户体验。
在上述步骤S701-S707中,是以手机作为执行主体进行声纹识别以及用户身份鉴权等操作。可以理解的是,上述步骤S701-S707中的部分或全部内容也可以由蓝牙耳机完成,这可以降低手机的实现复杂度以及手机的功耗。如图10所示,该语音控制方法可以包括:
S1001、手机与蓝牙耳机建立蓝牙连接。
S1002(可选的)、蓝牙耳机检测是否处于佩戴状态。
S1003、若处于佩戴状态,则蓝牙耳机通过第一语音传感器采集从而获取用户输入的语音信息中的第一语音分量,通过第二语音传感器采集上述语音信息中的第二语音分量,并通过骨振动传感器采集上述语音信息中的第三语音分量。
其中,步骤S1001-S1003中蓝牙耳机与手机建立蓝牙连接,检测蓝牙耳机是否处于佩戴状态,以及检测语音信息中的第一语音分量、第二语音分量和第三语音分量的具体方法可参见上述步骤S701-S703的相关描述,故此处不再赘述。
需要说明的是，蓝牙耳机获取到上述第一语音分量、第二语音分量和第三语音分量后，还可以对检测到的第一语音分量和第二语音分量进行增强、降噪或滤波等操作，本申请实施例对此不做任何限制。
在本申请的一些实施例中,由于蓝牙耳机具有音频播放功能,而当蓝牙耳机的扬声器在工作时,蓝牙耳机上的气传导麦克风和骨传导麦克风可能会接收到扬声器所播放的音源的回声信号。因此,当蓝牙耳机获取到上述第一语音分量和第二语音分量后,还可以使用回声消除算法(adaptive echo cancellation,AEC)消除第一语音分量和第二语音分量中的回声信号,以提高后续声纹识别的准确性。
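回声消除可以有多种实现方式，下面给出一种基于归一化最小均方（NLMS）自适应滤波的极简Python示意，仅说明“利用扬声器播放的参考信号估计并消除麦克风信号中的回声”这一思路。滤波器阶数、步长等参数均为假设值，并非本申请限定的回声消除算法：

import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-6):
    """NLMS 自适应滤波示意：mic 为含回声的麦克风信号，ref 为扬声器参考信号，
    返回消除回声后的信号。"""
    w = np.zeros(taps)                      # 自适应滤波器系数
    buf = np.zeros(taps)                    # 最近 taps 个参考采样
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_est = w @ buf                  # 估计的回声
        e = mic[n] - echo_est               # 误差信号 = 近端语音 + 残余回声
        w += mu * e * buf / (buf @ buf + eps)   # NLMS 系数更新
        out[n] = e
    return out

# 用法示意：人工构造"回声 = 参考信号经衰减并延迟10个采样"的场景
rng = np.random.default_rng(4)
ref = rng.standard_normal(4000)
echo = 0.6 * np.concatenate([np.zeros(10), ref[:-10]])
near = 0.1 * rng.standard_normal(4000)      # 近端语音，这里用小幅噪声代替
cleaned = nlms_echo_cancel(near + echo, ref)
print(float(np.mean(echo ** 2)), float(np.mean((cleaned - near) ** 2)))  # 后者应明显小于前者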
S1004、蓝牙耳机分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别,得到与第一语音分量对应的第一声纹识别结果,与第二语音分量对应的第二声纹识别结果以及与第三语音分量对应的第三声纹识别结果。
与上述步骤S701-S707不同的是，在步骤S1004中，蓝牙耳机内可预先存储一个或多个声纹模型和预设用户的注册声纹特征。这样，蓝牙耳机获取到上述第一语音分量、第二语音分量和第三语音分量后，可使用蓝牙耳机本地存储的声纹模型对第一语音分量、第二语音分量和第三语音分量进行特征提取，以分别获取各语音分量对应的声纹特征，再将获取到的各语音分量对应的声纹特征与对应的注册声纹特征进行比对，从而完成声纹识别。其中，蓝牙耳机分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别的具体方法，可参见上述步骤S705中手机分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别的具体方法，故此处不再赘述。
S1005、蓝牙耳机根据第一声纹识别结果、第二声纹识别结果和第三声纹识别结果对用户身份鉴权。
其中,蓝牙耳机根据第一声纹识别结果、第二声纹识别结果和第三声纹识别结果对用户身份鉴权的过程可参见上述步骤S706中手机根据第一声纹识别结果、第二声纹识别结果和第三声纹识别结果对用户身份鉴权的相关描述,故此处不再赘述。
S1006、若上述用户为预设用户,则蓝牙耳机通过蓝牙连接向手机发送与上述语音信息对应的操作指令。
S1007、手机执行上述操作指令。
如果蓝牙耳机确定出输入上述语音信息的发声用户为预设用户，则蓝牙耳机可生成与上述语音信息对应的操作指令。操作指令可以参见上述步骤S707中手机的操作指令例子，此处不再赘述。
另外,由于蓝牙耳机已经确定出上述用户为预设用户,因此,当手机处于锁定状态时,蓝牙耳机还可以向手机发送用户身份鉴权通过的消息或者解锁指令,使得手机可以先解锁屏幕,再执行与上述语音信息对应的操作指令。当然,蓝牙耳机也可以将采集到的语音信息发送给手机,由手机根据该语音信息生成对应的操作指令,并执行该操作指令。
在本申请的一些实施例中,蓝牙耳机向手机发送上述语音信息或对应的操作指令时,还可以将自身的设备标识(例如MAC地址)发送给手机。由于手机内存储有已经通过鉴权的预设蓝牙设备的标识,因此,手机可根据接收到的设备标识确定当前连接的蓝牙耳机是否为预设蓝牙设备。如果该蓝牙耳机是预设蓝牙设备,则手机可进一步执行该蓝牙耳机发送来的操作指令,或者对该蓝牙耳机发送来的语音信息进行语音识别等操作,否则,手机可丢弃该蓝牙耳机发来的操作指令,从而避免非法蓝牙设备恶意操控手机导致的安全性问题。
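下面用几行Python代码示意手机侧“根据设备标识判断是否为预设蓝牙设备，进而执行或丢弃操作指令”的逻辑，其中的MAC地址为虚构示例：

# 手机侧根据设备标识判断操作指令是否来自预设蓝牙设备（MAC地址为虚构示例）。
PRESET_DEVICES = {"11:22:33:44:55:66"}

def handle_instruction(device_id, instruction):
    if device_id in PRESET_DEVICES:
        print(f"执行来自预设蓝牙设备的操作指令：{instruction}")
    else:
        print("非预设蓝牙设备，丢弃该操作指令")

handle_instruction("11:22:33:44:55:66", "打开微信APP中的支付界面")
handle_instruction("aa:bb:cc:dd:ee:ff", "打开微信APP中的支付界面")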
或者,手机与预设蓝牙设备可以预先约定传输上述操作指令时的口令或密码。这样,蓝牙耳机向手机发送上述语音信息或对应的操作指令时,还可以向手机发送预先约定的口令或密码,使得手机确定当前连接的蓝牙耳机是否为预设蓝牙设备。
又或者,手机与预设蓝牙设备可以预先约定传输上述操作指令时使用的加密和解密算法。这样,蓝牙耳机向手机发送上述语音信息或对应的操作指令前,可使用约定的加密算法对该操作指令进行加密。手机接收到加密后的操作指令后,如果使用约定的解密算法能够解密出上述操作指令,则说明当前连接的蓝牙耳机为预设蓝牙设备,则手机可进一步执行该蓝牙耳机发送来的操作指令;否则,说明当前连接的蓝牙耳机为非法蓝牙设备,手机可丢弃该蓝牙耳机发来的操作指令。
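作为“预先约定加解密算法”的一种可能实现的示意，下面的Python代码借用第三方库cryptography中的Fernet对称加密：耳机侧用约定密钥加密操作指令，手机侧能够解密则视为预设蓝牙设备并执行指令，否则丢弃。具体采用何种加解密算法并非本申请所限定，此处仅为假设：

from cryptography.fernet import Fernet, InvalidToken

shared_key = Fernet.generate_key()   # 预先约定、分别保存在耳机侧与手机侧的对称密钥

def headset_send(instruction):
    """耳机侧：用约定算法加密操作指令后再发送。"""
    return Fernet(shared_key).encrypt(instruction.encode("utf-8"))

def phone_receive(token):
    """手机侧：能用约定算法解密则执行指令，否则视为非法蓝牙设备并丢弃。"""
    try:
        instruction = Fernet(shared_key).decrypt(token).decode("utf-8")
        print(f"执行操作指令：{instruction}")
    except InvalidToken:
        print("解密失败，丢弃该操作指令")

phone_receive(headset_send("解锁屏幕并打开支付界面"))
phone_receive(b"not-a-valid-token")   # 非预设设备发来的数据无法通过约定算法解密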
需要说明的是,上述步骤S701-S707以及步骤S1001-S1007仅为在本申请提供的语音控制方法的两种实现方式。可以理解的是,本领域技术人员可以根据实际应用场景或实际经验设置上述实施例中哪些步骤由蓝牙耳机执行,哪些步骤由手机执行,本申请实施例对此不做任何限制。另外,本申请提供的语音控制方法还可以以服务器作为执行主体,即蓝牙耳机与服务器建立连接,服务器实现上述实施例中手机的功能,具体过程此处不再赘述。
例如,蓝牙耳机也可以在对第一语音分量、第二语音分量和第三语音分量进行声纹识别之后,将得到的第一声纹识别结果、第二声纹识别结果和第三声纹识别结果发送给手机,后续由手机根据该声纹识别结果进行用户身份鉴权等操作。
又例如,蓝牙耳机也可以在获取到上述第一语音分量、第二语音分量和第三语音分量后,先判断是否需要对第一语音分量、第二语音分量和第三语音分量进行声纹识别。如果需要对第一语音分量、第二语音分量和第三语音分量进行声纹识别,则蓝牙耳机可向手机发送该第一语音分量、第二语音分量和第三语音分量,进而由手机完成后续声纹识别、用户身份鉴权等操作;否则,蓝牙耳机无需向手机发送该第一语音分量、第二语音分量和第三语音分量,避免增加手机处理该第一语音分量、第二语音分量和第三语音分量的功耗。
另外，如图11中的(a)所示，用户还可以进入手机的设置界面1101中开启或关闭上述语音控制功能。如果用户开启上述语音控制功能，用户可通过设置按钮1102设置触发该语音控制的关键词，例如“小E”、“支付”等，用户也可以通过设置按钮1103管理预设用户的声纹模型，例如添加或删除预设用户的声纹模型，用户还可以通过设置按钮1104设置语音助手能够支持的操作指令，例如支付、拨打电话、订餐等。这样，用户可以获得定制化的语音控制体验。
在本申请的一些实施例中,本申请实施例公开了一种语音控制装置,如图12所示,该语音控制装置包括语音信息获取单元1201、识别单元1202、身份信息获取单元1203以及执行单元1204。可以理解的是,该语音控制装置本身可以为一个终端或者可穿戴设备,该语音控制装置可以全部集成于可穿戴设备中,也可以将可穿戴设备与终端组成一套语音控制系统,即部分单元位于可穿戴设备中,部分单元位于终端中。
在一种可能的实现方式中，以该语音控制装置全部集成于蓝牙耳机中为例。其中，语音信息获取单元1201用于获取用户的语音信息，在本申请实施例中，用户可在佩戴蓝牙耳机时向蓝牙耳机输入语音信息，此时，蓝牙耳机可以基于用户输入的语音信息，通过耳内语音传感器采集第一语音分量，通过耳外语音传感器采集第二语音分量，通过骨振动传感器采集第三语音分量。
识别单元1202用于分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别,得到与第一语音分量对应的第一声纹识别结果、与第二语音分量对应的第二声纹识别结果以及与第三语音分量对应的第三声纹识别结果。
在一种可能的实现方式中，识别单元1202还可以用于对用户向蓝牙耳机输入的语音信息进行关键词检测，当语音信息中包括预设的关键词时，分别对第一语音分量、第二语音分量和第三语音分量进行声纹识别；或者，识别单元1202可以用于对用户输入进行检测，当接收到用户输入的预设操作时，分别对所述第一语音分量、所述第二语音分量和所述第三语音分量进行声纹识别。用户输入可以为用户通过触摸屏或按键对蓝牙耳机的输入，例如，用户点击蓝牙耳机的解锁键。可选的，识别单元1202对语音信息进行关键词检测或者对用户输入进行检测之前，语音信息获取单元1201还可以获取佩戴状态检测结果，当佩戴状态检测结果通过时，识别单元1202对语音信息进行关键词检测，或者，对用户输入进行检测。
在一种可能的实现方式中,识别单元1202具体用于:对第一语音分量进行特征提取,得到第一声纹特征,计算第一声纹特征与预设用户的第一注册声纹特征的第一相似度,第一注册声纹特征是第一注册语音经过第一声纹模型进行特征提取得到的,第一注册声纹特征用于反映耳内语音传感器采集到的预设用户的音频特征;对第二语音分量进行特征提取,得到第二声纹特征,计算第二声纹特征与预设用户的第二注册声纹特征的第二相似度,第二注册声纹特征是第二注册语音经过第二声纹模型进行特征提取得到的,第二注册声纹特征用于反映耳外语音传感器采集到的预设用户的音频特征;对第三语音分量进行特征提取,得到第三声纹特征,计算第三声纹特征与预设用户的第三注册声纹特征的第三相似度,第三注册声纹特征是第三注册语音经过第三声纹模型进行特征提取得到的,第三注册声纹特征用于反映骨振动传感器采集到的预设用户的音频特征。
在一种可能的实现方式中,第一注册声纹特征是通过第一声纹模型进行特征提取得到的,第一注册声纹特征用于反映耳内语音传感器采集到的预设用户的声纹特征;第二注册声纹特征是通过第二声纹模型进行特征提取得到的,第二注册声纹特征用于反映耳外语音传感器采集到的所述预设用户的声纹特征;第三注册声纹特征是通过第三声纹模型进行特征提取得到的,第三注册声纹特征用于反映骨振动传感器采集到的所述预设用户的声纹特征。
身份信息获取单元1203用于获取用户身份信息以进行用户身份鉴权，具体地，根据环境声的分贝数和播放音量，分别确定第一相似度对应的第一融合系数，第二相似度对应的第二融合系数，第三相似度对应的第三融合系数；根据第一融合系数、第二融合系数和第三融合系数融合第一相似度、第二相似度和第三相似度，得到融合相似度得分。若融合相似度得分大于第一阈值，则确定向蓝牙耳机输入语音信息的用户为预设用户。其中，环境声的分贝数是蓝牙耳机的声压传感器检测得到的，播放音量可以是蓝牙耳机的扬声器检测播放信号得到的。
在一种可能的实现方式中，第二融合系数与环境声的分贝数呈负相关，第一融合系数、第三融合系数分别与播放音量的分贝数呈负相关，第一融合系数、第二融合系数和第三融合系数的和为固定值。也就是说，在第一融合系数、第二融合系数和第三融合系数的和为预设的固定值的情况下，环境声的分贝数越大，第二融合系数越小，此时，相应的，第一融合系数和第三融合系数会适应性增大，以维持第一融合系数、第二融合系数和第三融合系数的和不变；播放音量越大，第一融合系数和第三融合系数越小，此时，相应的，第二融合系数会适应性增大，以维持第一融合系数、第二融合系数和第三融合系数的和不变。可以理解的是，上述可变的融合系数能够兼顾不同的应用场景（噪声环境较大或耳机播放音乐的情况下）下的识别准确率。
当手机确定向蓝牙耳机输入语音信息的用户为预设用户后,或者说,鉴权通过后,执行单元1204用于执行与所述语音信息对应的操作指令,所述操作指令包括解锁指令、支付指令、关机指令、打开应用程序指令或呼叫指令。
上述本申请实施例提供的语音控制方法，相比于现有技术，增加了通过耳内语音传感器采集声纹特征的方法。当用户佩戴包括耳内语音传感器的耳机之后，用户的外耳道与中耳道会形成一个封闭的腔室，声音在腔室里有一定的放大作用，即空腔效应，因此，耳内语音传感器采集到的声音会更加清晰，尤其对于高频声音信号具有明显的增强作用，能够弥补骨振动传感器在采集语音信息时会丢失部分语音信息的高频信号分量所造成的失真问题，提升耳机整体的声纹采集效果和声纹识别的准确度，从而提升用户体验。并且，由于本申请实施例在相似度融合时采用了动态融合系数，针对不同的应用环境和应用场景，采用动态的融合系数对具有不同属性的语音信号获得的声纹识别结果进行融合，利用这些不同属性的语音信号的互补性可以提升声纹识别的鲁棒性和准确率，例如，在噪声环境较大或耳机播放音乐的情况下能够显著提升识别的准确率。其中，不同属性的语音信号也可以理解为通过不同的传感器（耳内语音传感器、耳外语音传感器、骨振动传感器）获取到的语音信号。
本申请另一实施例还提供一种可穿戴设备，图13是本申请实施例提供的一种可穿戴设备1300的示意图。图13所示的可穿戴设备包括存储器1301、处理器1302、通信接口1303、总线1304、耳内语音传感器1305，耳外语音传感器1306，骨振动传感器1307。其中，存储器1301、处理器1302、通信接口1303通过总线1304实现彼此之间的通信连接。存储器1301和处理器1302耦合，存储器1301用于存储计算机程序代码，计算机程序代码包括计算机指令，当处理器1302执行该计算机指令时，能够使可穿戴设备执行上述实施例中描述的语音控制方法。
耳内语音传感器1305用于采集语音信息的第一语音分量,耳外语音传感器1306用于采集语音信息的第二语音分量,骨振动传感器1307用于采集语音信息的第三语音分量。
存储器1301可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)。存储器1301可以存储程序,当存储器1301中存储的程序被处理器1302执行时,处理器1302和通信接口1303用于执行本申请实施例的语音控制方法的各个步骤。
处理器1302可以采用通用的中央处理器(Central Processing Unit,CPU),微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的语音控制装置中的单元所需执行的功能,或者执行本申请方法实施例的语音控制方法。
处理器1302还可以是一种集成电路芯片，具有信号的处理能力。在实现过程中，本申请的语音控制方法的各个步骤可以通过处理器1302中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1302还可以是通用处理器、数字信号处理器（Digital Signal Processing，DSP）、专用集成电路（ASIC）、现场可编程门阵列（Field Programmable Gate Array，FPGA）或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件，可以实现或者执行本申请实施例中公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器，该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成，或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器、闪存、只读存储器、可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1301，处理器1302读取存储器1301中的信息，结合其硬件完成本申请实施例的语音控制装置中包括的单元所需执行的功能，或者执行本申请方法实施例的语音控制方法。
通信接口1303使用例如但不限于收发器一类的收发装置,能够进行有线通信或无线通信,从而实现可穿戴设备1300与其他设备或通信网络之间的通信。例如,可穿戴设备可以通过通信接口1303与终端设备建立通信连接。
总线1304可包括在装置1300各个部件(例如,存储器1301、处理器1302、通信接口1303)之间传送信息的通路。
本申请另一实施例还提供一种终端,图14是本申请实施例提供的一种终端示意图。图14所示的终端包括触摸屏1401、处理器1402、存储器1403、一个或多个计算机程序1404、总线1405、通信接口1408。其中,触摸屏1401包括触敏表面1406和显示屏1407,该终端还可以包括一个或多个应用程序(未示出)。上述各器件可以通过一个或多个通信总线1405连接。
存储器1403和处理器1402耦合,存储器1403用于存储计算机程序代码,计算机程序代码包括计算机指令,当处理器1402执行该计算机指令时,能够使终端执行上述实施例中描述的语音控制方法。
触摸屏1401用于与用户进行交互,能够接收到用户的输入信息。用户通过触敏表面1406对手机进行输入,例如,用户点击手机触敏表面1406上显示的解锁键。
存储器1403可以是只读存储器（Read Only Memory，ROM），静态存储设备，动态存储设备或者随机存取存储器（Random Access Memory，RAM）。存储器1403可以存储程序，当存储器1403中存储的程序被处理器1402执行时，处理器1402和通信接口1408用于执行本申请实施例的语音控制方法的各个步骤。
处理器1402可以采用通用的中央处理器(Central Processing Unit,CPU),微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的语音控制装置中的单元所需执行的功能,或者执行本申请方法实施例的语音控制方法。
处理器1402还可以是一种集成电路芯片，具有信号的处理能力。在实现过程中，本申请的语音控制方法的各个步骤可以通过处理器1402中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1402还可以是通用处理器、数字信号处理器（Digital Signal Processing，DSP）、专用集成电路（ASIC）、现场可编程门阵列（Field Programmable Gate Array，FPGA）或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件，可以实现或者执行本申请实施例中公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器，该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成，或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器、闪存、只读存储器、可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1403，处理器1402读取存储器1403中的信息，结合其硬件完成本申请实施例的语音控制装置中包括的单元所需执行的功能，或者执行本申请方法实施例的语音控制方法。
通信接口1408使用例如但不限于收发器一类的收发装置,能够进行有线通信或无线通信,从而实现终端1400与其他设备或通信网络之间的通信。例如,终端可以通过通信接口1408与可穿戴设备建立通信连接。
总线1405可包括在装置1400各个部件(例如,触摸屏1401、存储器1403、处理器1402、通信接口1408)之间传送信息的通路。
应注意,尽管图13和图14所示的可穿戴设备1300和终端1400仅仅示出了存储器、处理器、通信接口等,但是在具体实现过程中,本领域的技术人员应当理解,可穿戴设备1300和终端1400还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,可穿戴设备1300和终端1400还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,可穿戴设备1300和终端1400也可仅仅包括实现本申请实施例所必须的器件,而不必包括图13或图14中所示的全部器件。
本申请另一实施例还提供一种芯片系统,如图15所示为该芯片系统示意图,该芯片系统包括至少一个处理器1501、至少一个接口电路1502和总线1503。处理器1501和接口电路1502可通过线路互联。例如,接口电路1502可用于从其它装置(例如语音控制装置的存储器)接收信号。又例如,接口电路1502可用于向其它装置(例如处理器1501)发送信号。示例性的,接口电路1502可读取存储器中存储的指令,并将该指令发送给处理器1501。当所述指令被处理器1501执行时,可使得语音控制装置执行上述实施例中的各个步骤。当然,该芯片系统还可以包含其他分立器件,本申请实施例对此不作具体限定。
本申请另一实施例还提供一种计算机可读存储介质，该计算机可读存储介质中存储有计算机指令，当该计算机指令在语音控制装置上运行时，该语音控制装置执行上述方法实施例所示的方法流程中的各个步骤。
本申请另一实施例还提供一种计算机程序产品，该计算机程序产品中存储有计算机指令，当该计算机指令在语音控制装置上运行时，该语音控制装置执行上述方法实施例所示的方法流程中的各个步骤。
在一些实施例中,所公开的方法可以实施为以机器可读格式被编码在计算机可读存储介质上的或者被编码在其它非瞬时性介质或者制品上的计算机程序指令。
在一个实施例中,计算机程序产品是使用信号承载介质来提供的。所述信号承载介质可以包括一个或多个程序指令,其当被一个或多个处理器运行时可以实现本申请实施例的语音控制方法的功能。因此,例如,参考图7中S701~S707的一个或多个特征可以由与信号承载介质相关联的一个或多个指令来承担。
在一些示例中,信号承载介质可以包含计算机可读介质,诸如但不限于,硬盘驱动器、紧密盘(CD)、数字视频光盘(DVD)、数字磁带、存储器、只读存储记忆体(read-only memory,ROM)或随机存储记忆体(random access memory,RAM)等等。
在一些实施方式中,信号承载介质可以包含计算机可记录介质,诸如但不限于,存储器、读/写(R/W)CD、R/W DVD、等等。
在一些实施方式中,信号承载介质可以包含通信介质,诸如但不限于,数字和/或模拟通信介质(例如,光纤电缆、波导、有线通信链路、无线通信链路、等等)。
信号承载介质可以由无线形式的通信介质（例如，遵守IEEE 802.16标准或者其它传输协议的无线通信介质）来传达。一个或多个程序指令可以是，例如，计算机可执行指令或者逻辑实施指令。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请实施例各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:快闪存储器、移动硬盘、只读存储器、随机存取存储器、磁碟或者光盘等各种可以存储程序代码的介质。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。以上所述，仅为本申请实施例的具体实施方式，但本申请实施例的保护范围并不局限于此，任何在本申请实施例揭露的技术范围内的变化或替换，都应涵盖在本申请实施例的保护范围之内。因此，本申请实施例的保护范围应以所述权利要求的保护范围为准。

Claims (23)

  1. 一种语音控制方法,其特征在于,包括:
    获取用户的语音信息,所述语音信息包括第一语音分量,第二语音分量和第三语音分量,所述第一语音分量是由耳内语音传感器采集到的,所述第二语音分量是由耳外语音传感器采集到的,所述第三语音分量是由骨振动传感器采集到的;
    分别对所述第一语音分量,所述第二语音分量和所述第三语音分量进行声纹识别;
    根据所述第一语音分量的声纹识别结果、所述第二语音分量的声纹识别结果和所述第三语音分量的声纹识别结果,得到所述用户的身份信息;
    当所述用户的身份信息与预设的信息匹配时,执行操作指令,其中,所述操作指令是根据所述语音信息确定的。
  2. 根据权利要求1所述的语音控制方法,其特征在于,所述对所述第一语音分量、所述第二语音分量和所述第三语音分量进行声纹识别之前,还包括:
    对所述语音信息进行关键词检测,或者,对用户输入进行检测。
  3. 根据权利要求2所述的语音控制方法,其特征在于,所述对所述语音信息进行关键词检测或者对用户输入进行检测之前,还包括:
    获取所述可穿戴设备的佩戴状态检测结果。
  4. 根据权利要求1-3任一所述的语音控制方法,其特征在于,所述对所述第一语音分量进行声纹识别,具体包括:
    对所述第一语音分量进行特征提取,得到第一声纹特征,计算所述第一声纹特征与所述用户的第一注册声纹特征的第一相似度,所述第一注册声纹特征是第一注册语音经过第一声纹模型进行特征提取得到的,所述第一注册声纹特征用于反映所述耳内语音传感器采集到的所述用户的预设音频特征。
  5. 根据权利要求1-3任一所述的语音控制方法,其特征在于,所述对所述第二语音分量进行声纹识别,具体包括:
    对所述第二语音分量进行特征提取,得到第二声纹特征,计算所述第二声纹特征与所述用户的第二注册声纹特征的第二相似度,所述第二注册声纹特征是第二注册语音经过第二声纹模型进行特征提取得到的,所述第二注册声纹特征用于反映所述耳外语音传感器采集到的所述用户的预设音频特征。
  6. 根据权利要求1-3任一所述的语音控制方法,其特征在于,所述对所述第三语音分量进行声纹识别,具体包括:
    对所述第三语音分量进行特征提取,得到第三声纹特征,计算所述第三声纹特征与所述用户的第三注册声纹特征的第三相似度,所述第三注册声纹特征是第三注册语音经过第三声纹模型进行特征提取得到的,所述第三注册声纹特征用于反映所述骨振动传感器采集到的所述用户的预设音频特征。
  7. 根据权利要求1-6任一所述的语音控制方法,其特征在于,所述根据所述第一语音分量的声纹识别结果、所述第二语音分量的声纹识别结果和所述第三语音分量的声纹识别结果,得到所述用户的身份信息,具体包括:
    确定所述第一相似度对应的第一融合系数,所述第二相似度对应的第二融合系数,所述第三相似度对应的第三融合系数;
    根据所述第一融合系数、所述第二融合系数和所述第三融合系数融合所述第一相似度、第二相似度和第三相似度，得到融合相似度得分，若所述融合相似度得分大于第一阈值，则确定所述用户的身份信息与预设身份信息匹配。
  8. 根据权利要求7所述的语音控制方法,其特征在于,确定所述第一融合系数、所述第二融合系数和所述第三融合系数,具体包括:
    根据声压传感器得到环境声的分贝数;
    根据扬声器的播放信号,确定播放音量;
    根据所述环境声的分贝数和所述播放音量,分别确定所述第一融合系数、所述第二融合系数和所述第三融合系数,其中:
    所述第二融合系数与所述环境声的分贝数呈负相关,所述第一融合系数、所述第三融合系数分别与所述播放音量的分贝数呈负相关,所述第一融合系数、第二融合系数和第三融合系数的和为固定值。
  9. 根据权利要求1-8中任一项所述的语音控制方法，其特征在于，所述操作指令包括解锁指令、支付指令、关机指令、打开应用程序指令或呼叫指令。
  10. 一种语音控制装置,其特征在于,包括:
    语音信息获取单元,所述语音信息获取单元用于获取用户的语音信息,所述语音信息包括第一语音分量,第二语音分量和第三语音分量,所述第一语音分量是由耳内语音传感器采集到的,所述第二语音分量是由耳外语音传感器采集到的,所述第三语音分量是由骨振动传感器采集到的;
    识别单元,所述识别单元用于分别对所述第一语音分量,所述第二语音分量和所述第三语音分量进行声纹识别;
    身份信息获取单元,所述身份信息获取单元用于根据所述第一语音分量的声纹识别结果、所述第二语音分量的声纹识别结果和所述第三语音分量的声纹识别结果,得到所述用户的身份信息;
    执行单元,所述执行单元用于当所述用户的身份信息与预设的信息匹配时,执行操作指令,其中,所述操作指令是根据所述语音信息确定的。
  11. 根据权利要求10所述的语音控制装置,其特征在于,所述语音信息获取单元还用于:
    对所述语音信息进行关键词检测,或者,对用户输入进行检测。
  12. 根据权利要求11所述的语音控制装置,其特征在于,所述语音信息获取单元还用于:
    获取所述可穿戴设备的佩戴状态检测结果。
  13. 根据权利要求10-12任一所述的语音控制装置，其特征在于，所述识别单元具体用于：
    对所述第一语音分量进行特征提取，得到第一声纹特征，计算所述第一声纹特征与所述用户的第一注册声纹特征的第一相似度，所述第一注册声纹特征是第一注册语音经过第一声纹模型进行特征提取得到的，所述第一注册声纹特征用于反映所述耳内语音传感器采集到的所述用户的预设音频特征。
  14. 根据权利要求10-12任一所述的语音控制装置,其特征在于,所述识别单元具体用于:
    对所述第二语音分量进行特征提取，得到第二声纹特征，计算所述第二声纹特征与所述用户的第二注册声纹特征的第二相似度，所述第二注册声纹特征是第二注册语音经过第二声纹模型进行特征提取得到的，所述第二注册声纹特征用于反映所述耳外语音传感器采集到的所述用户的预设音频特征。
  15. 根据权利要求10-12任一所述的语音控制装置,其特征在于,所述识别单元具体用于:
    对所述第三语音分量进行特征提取,得到第三声纹特征,计算所述第三声纹特征与所述用户的第三注册声纹特征的第三相似度,所述第三注册声纹特征是第三注册语音经过第三声纹模型进行特征提取得到的,所述第三注册声纹特征用于反映所述骨振动传感器采集到的所述用户的预设音频特征。
  16. 根据权利要求10-15任一所述的语音控制装置,其特征在于,所述身份信息获取单元具体用于:
    确定所述第一相似度对应的第一融合系数,所述第二相似度对应的第二融合系数,所述第三相似度对应的第三融合系数;
    根据所述第一融合系数、所述第二融合系数和所述第三融合系数融合所述第一相似度、第二相似度和第三相似度,得到融合相似度得分,若所述融合相似度得分大于第一阈值,则确定所述用户的身份信息与预设身份信息匹配。
  17. 根据权利要求16所述的语音控制装置,其特征在于,所述身份信息获取单元具体用于:
    根据声压传感器得到环境声的分贝数;
    根据扬声器的播放信号,确定播放音量;
    根据所述环境声的分贝数和所述播放音量,分别确定所述第一融合系数、所述第二融合系数和所述第三融合系数,其中:
    所述第二融合系数与所述环境声的分贝数呈负相关,所述第一融合系数、所述第三融合系数分别与所述播放音量的分贝数呈负相关,所述第一融合系数、第二融合系数和第三融合系数的和为固定值。
  18. 根据权利要求10-17中任一项所述的语音控制装置,其特征在于,
    所述操作指令包括解锁指令、支付指令、关机指令、打开应用程序指令或呼叫指令。
  19. 一种可穿戴设备,其特征在于,所述可穿戴设备包括耳内语音传感器,耳外语音传感器,骨振动传感器,存储器和处理器;
    所述耳内语音传感器用于采集语音信息的第一语音分量,所述耳外语音传感器用于采集语音信息的第二语音分量,所述骨振动传感器用于采集语音信息的第三语音分量;
    所述存储器和所述处理器耦合;所述存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令;当所述处理器执行所述计算机指令时,所述可穿戴设备执行如权利要求1-9中任意一项所述的语音控制方法。
  20. 一种终端,其特征在于,所述终端包括存储器和处理器;所述存储器和所述处理器耦合;所述存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令;当所述处理器执行所述计算机指令时,所述终端执行如权利要求1-9中任意一项所述的语音控制方法。
  21. 一种芯片系统,其特征在于,所述芯片系统应用于电子设备;所述芯片系统包括一个或多个接口电路,以及一个或多个处理器;所述接口电路和所述处理器通过线路互联;所述接口电路用于从所述电子设备的存储器接收信号,并向所述处理器发送所述信号,所述信号包括所述存储器中存储的计算机指令;当所述处理器执行所述计算机指令时,所述电子设备执行如权利要求1-9中任意一项所述的语音控制方法。
  22. 一种计算机可读存储介质，其特征在于，包括计算机指令，当所述计算机指令在语音控制装置上运行时，使得所述语音控制装置执行如权利要求1-9中任意一项所述的语音控制方法。
  23. 一种计算机程序产品,其特征在于,包括计算机指令,当所述计算机指令在语音控制装置上运行时,使得所述语音控制装置执行如权利要求1-9中任意一项所述的语音控制方法。
PCT/CN2022/080436 2021-03-24 2022-03-11 一种语音控制方法和装置 WO2022199405A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP22774067.7A EP4297023A4 (en) 2021-03-24 2022-03-11 VOICE CONTROL METHOD AND APPARATUS
JP2023558328A JP2024510779A (ja) 2021-03-24 2022-03-11 音声制御方法及び装置
US18/471,702 US20240013789A1 (en) 2021-03-24 2023-09-21 Voice control method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110313304.3 2021-03-24
CN202110313304.3A CN115132212A (zh) 2021-03-24 2021-03-24 一种语音控制方法和装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/471,702 Continuation US20240013789A1 (en) 2021-03-24 2023-09-21 Voice control method and apparatus

Publications (1)

Publication Number Publication Date
WO2022199405A1 true WO2022199405A1 (zh) 2022-09-29

Family

ID=83373864

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/080436 WO2022199405A1 (zh) 2021-03-24 2022-03-11 一种语音控制方法和装置

Country Status (5)

Country Link
US (1) US20240013789A1 (zh)
EP (1) EP4297023A4 (zh)
JP (1) JP2024510779A (zh)
CN (1) CN115132212A (zh)
WO (1) WO2022199405A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117116258A (zh) * 2023-04-12 2023-11-24 荣耀终端有限公司 一种语音唤醒方法及电子设备

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117133281B (zh) * 2023-01-16 2024-06-28 荣耀终端有限公司 语音识别方法和电子设备

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102084668A (zh) * 2008-05-22 2011-06-01 伯恩同通信有限公司 处理信号的方法和系统
CN106713569A (zh) * 2016-12-27 2017-05-24 广东小天才科技有限公司 一种可穿戴设备的操作控制方法及可穿戴设备
US20180324518A1 (en) * 2017-05-04 2018-11-08 Apple Inc. Automatic speech recognition triggering system
CN111432303A (zh) * 2020-03-19 2020-07-17 清华大学 单耳耳机、智能电子设备、方法和计算机可读介质
CN111916101A (zh) * 2020-08-06 2020-11-10 大象声科(深圳)科技有限公司 一种融合骨振动传感器和双麦克风信号的深度学习降噪方法及系统
CN112017696A (zh) * 2020-09-10 2020-12-01 歌尔科技有限公司 耳机的语音活动检测方法、耳机及存储介质
CN112420035A (zh) * 2018-06-29 2021-02-26 华为技术有限公司 一种语音控制方法、可穿戴设备及终端

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102084668A (zh) * 2008-05-22 2011-06-01 伯恩同通信有限公司 处理信号的方法和系统
CN106713569A (zh) * 2016-12-27 2017-05-24 广东小天才科技有限公司 一种可穿戴设备的操作控制方法及可穿戴设备
US20180324518A1 (en) * 2017-05-04 2018-11-08 Apple Inc. Automatic speech recognition triggering system
CN112420035A (zh) * 2018-06-29 2021-02-26 华为技术有限公司 一种语音控制方法、可穿戴设备及终端
CN111432303A (zh) * 2020-03-19 2020-07-17 清华大学 单耳耳机、智能电子设备、方法和计算机可读介质
CN111916101A (zh) * 2020-08-06 2020-11-10 大象声科(深圳)科技有限公司 一种融合骨振动传感器和双麦克风信号的深度学习降噪方法及系统
CN112017696A (zh) * 2020-09-10 2020-12-01 歌尔科技有限公司 耳机的语音活动检测方法、耳机及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4297023A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117116258A (zh) * 2023-04-12 2023-11-24 荣耀终端有限公司 一种语音唤醒方法及电子设备

Also Published As

Publication number Publication date
JP2024510779A (ja) 2024-03-11
EP4297023A1 (en) 2023-12-27
EP4297023A4 (en) 2024-07-10
CN115132212A (zh) 2022-09-30
US20240013789A1 (en) 2024-01-11

Similar Documents

Publication Publication Date Title
KR102525294B1 (ko) 음성 제어 방법, 웨어러블 디바이스 및 단말
WO2022033556A1 (zh) 电子设备及其语音识别方法和介质
CN111131601B (zh) 一种音频控制方法、电子设备、芯片及计算机存储介质
WO2022199405A1 (zh) 一种语音控制方法和装置
CN110070863A (zh) 一种语音控制方法及装置
WO2021114953A1 (zh) 语音信号的采集方法、装置、电子设备以及存储介质
US20190147890A1 (en) Audio peripheral device
WO2022022585A1 (zh) 电子设备及其音频降噪方法和介质
US20200278832A1 (en) Voice activation for computing devices
US20230239800A1 (en) Voice Wake-Up Method, Electronic Device, Wearable Device, and System
CN113643707A (zh) 一种身份验证方法、装置和电子设备
CN113299309A (zh) 语音翻译方法及装置、计算机可读介质和电子设备
CN114360206B (zh) 一种智能报警方法、耳机、终端和系统
WO2023124248A1 (zh) 声纹识别方法和装置
WO2023207185A1 (zh) 声纹识别方法、图形界面及电子设备
CN113506566B (zh) 声音检测模型训练方法、数据处理方法以及相关装置
WO2022007757A1 (zh) 跨设备声纹注册方法、电子设备及存储介质
CN115731923A (zh) 命令词响应方法、控制设备及装置
US11393449B1 (en) Methods and apparatus for obtaining biometric data
WO2022252858A1 (zh) 一种语音控制方法及电子设备
WO2022233239A1 (zh) 一种升级方法、装置及电子设备
CN116530944B (zh) 声音处理方法及电子设备
US20220261218A1 (en) Electronic device including speaker and microphone and method for operating the same
CN117953872A (zh) 语音唤醒模型更新方法、存储介质、程序产品及设备
CN116935858A (zh) 声纹识别方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22774067

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022774067

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2023558328

Country of ref document: JP

ENP Entry into the national phase

Ref document number: 2022774067

Country of ref document: EP

Effective date: 20230920

NENP Non-entry into the national phase

Ref country code: DE