WO2023124248A9 - Voiceprint recognition method and device - Google Patents

Publication number
WO2023124248A9
Authority
WO
WIPO (PCT)
Prior art keywords
terminal device
voiceprint
threshold
value
vector
Prior art date
Application number
PCT/CN2022/118924
Other languages
English (en)
French (fr)
Other versions
WO2023124248A1 (zh
Inventor
王耀光
Original Assignee
荣耀终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 荣耀终端有限公司
Publication of WO2023124248A1
Publication of WO2023124248A9

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/12 Score normalisation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2250/00 Details of telephonic subscriber devices
    • H04M2250/74 Details of telephonic subscriber devices with voice recognition means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • the present application relates to the field of terminal technology, and in particular to a voiceprint recognition method and device.
  • A terminal device can let the user wake up the device, or wake up certain functions of the device, by voice. Because the voiceprint data of different users is unique, the terminal device can use voiceprint data to determine whether a received voice belongs to the registered user (or the owner of the terminal device).
  • The terminal device can score the registered user's voiceprint data against the received speaker's voiceprint data based on a voiceprint model. When the score exceeds a preset threshold, the terminal device can be woken up; when the score is below the preset threshold, the terminal device cannot be woken up.
  • However, the above voiceprint recognition method has a high false positive rate and may pose a threat to user privacy.
  • Embodiments of the present application provide a voiceprint recognition method and device.
  • The terminal device can be provided with a voiceprint blacklist database.
  • When the score of the received speaker's voiceprint data against the registered user's voiceprint data is greater than the first threshold, and the score of the speaker's voiceprint data against the voiceprint blacklist database is less than the second threshold, the terminal device is woken up. In this way the terminal device can accurately identify the user's voice, reducing the false positive rate while improving the security of voiceprint recognition.
  • Embodiments of the present application provide a voiceprint recognition method, which is applied to a terminal device.
  • The terminal device is provided with a preset database.
  • The preset database includes a voiceprint vector of at least one second user; a voiceprint vector is used to characterize a user's voice characteristics. The method includes: the terminal device collects a first voice, the first voice corresponding to a first voiceprint vector; if the terminal device determines that the first voice is a preset voice, the terminal device obtains a similarity score between the first voiceprint vector and a preset voiceprint vector to obtain a first value, the preset voiceprint vector being the voiceprint vector of a first user; the terminal device obtains the highest score among the similarity scores between the first voiceprint vector and each voiceprint vector in the preset database to obtain a second value.
  • When the terminal device determines that the first value is greater than a first threshold and the second value is less than a second threshold, the terminal device determines that the first user's voiceprint recognition is successful; the second threshold is greater than the first threshold.
  • In this way, the terminal device can be provided with a voiceprint blacklist database.
  • When the score of the received speaker's voiceprint data against the registered user's voiceprint data is greater than the first threshold, and the score of the speaker's voiceprint data against the voiceprint blacklist database is less than the second threshold, the terminal device is woken up, so that the terminal device can accurately recognize the user's voice, which reduces the false alarm rate and improves the security of voiceprint recognition.
  • the preset database can be the voiceprint blacklist database in the embodiment of the present application;
  • the first value can be the registration template score in the embodiment of the present application;
  • the second value can be the blacklist score in the embodiment of the present application;
  • the first threshold may be T2 in the embodiment of the present application;
  • the second threshold may be T1 in the embodiment of the present application; and
  • the first user may be a registered user in the embodiment of the present application.
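  • The two-threshold decision described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the use of cosine similarity, the 0 to 100 score scale, and the names `cosine_similarity`, `score` and `recognize` are assumptions; the source only states that a first value is compared with T2 and the highest blacklist score with T1, where T1 is greater than T2. The wake-word check on the first voice is not modelled.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score(a, b):
    """Map similarity onto a 0-100 scale (the scale is an assumption)."""
    return 100.0 * max(0.0, cosine_similarity(a, b))


def recognize(probe, enrolled, blacklist, t2, t1):
    """Return True when the first value exceeds T2 and the second value is below T1."""
    first_value = score(probe, enrolled)  # registration template score
    # Blacklist score: highest similarity to any blacklisted vector (0 if empty).
    second_value = max((score(probe, v) for v in blacklist), default=0.0)
    return first_value > t2 and second_value < t1
```

  • With hypothetical thresholds T2 = 80 and T1 = 90, a probe that scores about 90 against the enrolled vector passes when the blacklist holds only dissimilar vectors, and is rejected once a near-identical vector is in the blacklist.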
  • The method further includes: when the terminal device determines that the first value is greater than a third threshold and the second value is less than the first threshold, the terminal device adds the first voiceprint vector to the preset database.
  • The first threshold is greater than the third threshold.
  • In this way, the terminal device can add to the voiceprint blacklist database voiceprint vectors that pose a threat to the system and have low similarity to the entries already in the voiceprint blacklist database.
  • the third threshold may be T3 in the embodiment of this application.
  • The terminal device adding the first voiceprint vector to the preset database when it determines that the first value is greater than the third threshold and the second value is less than the first threshold includes: when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the signal-to-noise ratio corresponding to the first voiceprint vector is greater than a fourth threshold, the terminal device adds the first voiceprint vector to the preset database.
  • In this way, the terminal device can use the signal-to-noise ratio check to extract higher-quality voiceprint vectors and avoid misjudging the user's own voice in a noisy environment as the voice of an impostor.
  • the fourth threshold may be the signal-to-noise ratio threshold N in the embodiment of the present application.
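  • The blacklist enrolment condition above, including the signal-to-noise ratio check, can be written as a single predicate. The function name `should_add_to_blacklist` and the numeric values in the usage note are hypothetical; the source gives only the comparisons against T3, T2 and the signal-to-noise ratio threshold N.

```python
def should_add_to_blacklist(first_value, second_value, snr, t3, t2, n):
    """Return True when a probe qualifies for the voiceprint blacklist.

    first_value:  registration template score of the probe
    second_value: highest score of the probe against the current blacklist
    snr:          signal-to-noise ratio of the probe audio
    t3, t2, n:    third threshold, first threshold, SNR threshold (T2 > T3)
    """
    close_to_owner = first_value > t3      # similar enough to be a threat
    not_yet_listed = second_value < t2     # no similar entry in the blacklist
    clean_recording = snr > n              # skip noisy recordings
    return close_to_owner and not_yet_listed and clean_recording
```

  • For example, assuming T3 = 60, T2 = 80 and N = 15, a probe with scores (70, 30) and a signal-to-noise ratio of 20 would be added, while the same probe recorded at a ratio of 10 would not.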
  • Each voiceprint vector in the preset database records the time at which it was stored in the preset database and its number of uses.
  • The number of uses is the number of times the second value is calculated. The terminal device adding the first voiceprint vector to the preset database includes: the terminal device evicts the voiceprint vector with the longest storage time in the preset database, and/or evicts the voiceprint vector with the fewest uses in the preset database; the terminal device then adds the first voiceprint vector to the preset database.
  • In this way, the terminal device can keep the voiceprint blacklist database effective by dynamically adjusting its entries, and can prevent an excessive amount of stored data in the voiceprint blacklist database from slowing down the voiceprint recognition method.
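  • The eviction behaviour above can be sketched as a small bounded store. The fixed `capacity`, the `policy` argument and the names `BlacklistEntry` and `add_with_eviction` are assumptions made for illustration; the source says only that the longest-stored and/or least-used vector is evicted before the new one is added.

```python
from dataclasses import dataclass


@dataclass
class BlacklistEntry:
    vector: list          # voiceprint vector
    stored_at: float      # time the entry was stored
    use_count: int = 0    # number of uses recorded for the entry


def add_with_eviction(db, vector, now, capacity, policy="oldest"):
    """Add a vector to db, evicting one entry first if db is full."""
    if len(db) >= capacity:
        if policy == "oldest":
            # Longest storage time means the smallest stored_at timestamp.
            idx = min(range(len(db)), key=lambda i: db[i].stored_at)
        elif policy == "least_used":
            idx = min(range(len(db)), key=lambda i: db[i].use_count)
        else:
            raise ValueError(f"unknown policy: {policy}")
        db.pop(idx)
    db.append(BlacklistEntry(vector=vector, stored_at=now))
```

  • Keeping the store bounded keeps the number of similarity computations per recognition attempt bounded as well, which matches the stated aim of not slowing down recognition.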
  • The terminal device adding the first voiceprint vector to the preset database when it determines that the first value is greater than the third threshold and the second value is less than the first threshold includes: when the terminal device determines that the first value is greater than the third threshold and the second value is less than the first threshold, the terminal device displays a first interface, where the first interface includes: prompt information asking whether to add the first voiceprint vector to the preset database, a first control for adding the first voiceprint vector to the preset database, and a second control for refusing to add the first voiceprint vector to the preset database; when the terminal device receives a trigger on the first control, or receives no trigger on any control in the first interface within a preset time threshold, the terminal device adds the first voiceprint vector to the preset database. In this way, when the user's voice differs because of the user's vocal state or surroundings, the terminal device avoids mistakenly adding that voice directly to the voiceprint blacklist database.
  • The method further includes: when the terminal device receives an operation for setting the voiceprint recognition mode, the terminal device displays a second interface; the second interface includes a third control for turning on a first recognition mode; when the terminal device receives an operation on the third control, the terminal device displays a third interface; the third interface includes a fourth control for turning on the prompt information. The terminal device displaying the first interface when it determines that the first value is greater than the third threshold and the second value is less than the first threshold includes: when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the fourth control is on, the terminal device displays the first interface. In this way, users can flexibly configure the voiceprint blacklist database according to their own needs, which improves the experience of using the voice wake-up function.
  • The method further includes: the terminal device obtains a similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain a third value; when the first value is greater than the second threshold and the third value is greater than the first threshold, the terminal device deletes the voiceprint vector in the preset database corresponding to the first voiceprint vector. In this way, the terminal device can delete voiceprint vectors that were mistakenly entered into the voiceprint blacklist database, thereby improving the accuracy of the voiceprint recognition method.
  • The method further includes: when the terminal device determines that the first value is greater than the second threshold, the terminal device obtains a similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain a third value; when the third value is greater than the first threshold, the terminal device deletes the voiceprint vector in the preset database corresponding to the first voiceprint vector. In this way, the terminal device can delete voiceprint vectors that were mistakenly entered into the voiceprint blacklist database, thereby improving the accuracy of the voiceprint recognition method.
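  • One possible reading of the clean-up step above is sketched below: the third value is taken per blacklist entry, and every entry whose score against the probe exceeds T2 is deleted once the probe itself scores above T1. The per-entry interpretation and the name `entries_to_delete` are assumptions; the source does not spell out how the "corresponding" entry is selected.

```python
def entries_to_delete(first_value, blacklist_scores, t1, t2):
    """Return indices of blacklist entries to delete for this probe.

    first_value:      registration template score of the probe
    blacklist_scores: score of the probe against each blacklist entry
                      (each one is treated as a "third value" here)
    t1, t2:           second threshold and first threshold (T1 > T2)
    """
    if first_value <= t1:
        # The probe is not confidently the registered user: delete nothing.
        return []
    return [i for i, third_value in enumerate(blacklist_scores)
            if third_value > t2]
```

  • For example, with hypothetical thresholds T1 = 90 and T2 = 80, a probe scoring 95 against the registration template and (85, 40) against two blacklist entries would cause only the first entry to be deleted.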
  • The terminal device determining that the first user's voiceprint recognition is successful includes: when the terminal device determines that the first value is greater than the first threshold and the second value is less than the second threshold, or when the terminal device determines that the first value is greater than the second threshold, the terminal device determines that the first user's voiceprint recognition is successful.
  • In this way, the terminal device can set a higher threshold so that only a voice highly similar to the registered user's voice, for example the registered user's own voice, passes voiceprint recognition. The terminal device can thus accurately recognize the user's voice, reducing the system's false alarm rate.
  • The method further includes: when the terminal device determines that the first value is less than or equal to the first threshold, and/or the second value is greater than or equal to the second threshold, the terminal device determines that the first user's voiceprint recognition has failed. In this way, the terminal device is not woken up when the voice of a non-registered user is recognized, which ensures the security of the device.
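  • The success rule with the high-threshold shortcut can be sketched on precomputed scores as follows. The function name `voiceprint_decision` is hypothetical, and giving the shortcut (first value above T1) priority over the blacklist check is one way to combine the success and failure conditions stated above, not necessarily the applicant's.

```python
def voiceprint_decision(first_value, second_value, t2, t1):
    """Return True for recognition success, False for failure.

    first_value:  registration template score
    second_value: blacklist score (highest score against the blacklist)
    t2, t1:       first and second thresholds, with t1 > t2
    """
    if first_value > t1:
        # Very close match to the registered user: accept outright.
        return True
    # Otherwise require a passable match that is not close to any blacklisted voice.
    return first_value > t2 and second_value < t1
```

  • With hypothetical thresholds T2 = 80 and T1 = 90, a probe scoring 85 is accepted only if its blacklist score stays below 90, while a probe scoring 95 is accepted regardless of the blacklist score.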
  • Embodiments of the present application provide a voiceprint recognition device.
  • The device is provided with a preset database.
  • The preset database includes a voiceprint vector of at least one second user; a voiceprint vector is used to characterize a user's voice characteristics. The device includes: a processing unit, configured to collect a first voice, the first voice corresponding to a first voiceprint vector; if the terminal device determines that the first voice is a preset voice, the processing unit is further configured to obtain a similarity score between the first voiceprint vector and a preset voiceprint vector to obtain a first value, the preset voiceprint vector being the voiceprint vector of a first user.
  • The processing unit is further configured to obtain the highest score among the similarity scores between the first voiceprint vector and each voiceprint vector in the preset database to obtain a second value; when the terminal device determines that the first value is greater than a first threshold and the second value is less than a second threshold, the processing unit is further configured to determine that the first user's voiceprint recognition is successful; the second threshold is greater than the first threshold.
  • When the terminal device determines that the first value is greater than a third threshold and the second value is less than the first threshold, the processing unit is further configured to add the first voiceprint vector to the preset database; the first threshold is greater than the third threshold.
  • When the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the signal-to-noise ratio corresponding to the first voiceprint vector is greater than a fourth threshold, the processing unit is specifically configured to add the first voiceprint vector to the preset database.
  • Each voiceprint vector in the preset database records the time at which it was stored in the preset database and its number of uses.
  • The number of uses is the number of times the second value is calculated.
  • The processing unit is specifically configured to evict the voiceprint vector with the longest storage time in the preset database, and/or evict the voiceprint vector with the fewest uses in the preset database; the processing unit is further specifically configured to add the first voiceprint vector to the preset database.
  • When the terminal device determines that the first value is greater than the third threshold and the second value is less than the first threshold, the display unit is configured to display a first interface, where the first interface includes: prompt information asking whether to add the first voiceprint vector to the preset database, a first control for adding the first voiceprint vector to the preset database, and a second control for refusing to add the first voiceprint vector to the preset database; when the terminal device receives a trigger on the first control, or receives no trigger on any control in the first interface within a preset time threshold, the processing unit is specifically configured to add the first voiceprint vector to the preset database.
  • When the terminal device receives an operation for setting the voiceprint recognition mode, the display unit is further configured to display a second interface; the second interface includes a third control for turning on a first recognition mode.
  • When the terminal device receives an operation on the third control, the display unit is further configured to display a third interface; the third interface includes a fourth control for turning on the prompt information; when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the fourth control is on, the processing unit is further configured to display the first interface.
  • The processing unit is further configured to obtain a similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain a third value; the processing unit is further configured to delete, when the first value is greater than the second threshold and the third value is greater than the first threshold, the voiceprint vector in the preset database corresponding to the first voiceprint vector.
  • When the terminal device determines that the first value is greater than the second threshold, the processing unit is further configured to obtain a similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain a third value; the processing unit is further configured to delete the voiceprint vector in the preset database corresponding to the first voiceprint vector when the third value is greater than the first threshold.
  • When the terminal device determines that the first value is greater than the first threshold and the second value is less than the second threshold, or when the terminal device determines that the first value is greater than the second threshold, the processing unit is specifically configured to determine that the first user's voiceprint recognition is successful.
  • When the terminal device determines that the first value is less than or equal to the first threshold, and/or the second value is greater than or equal to the second threshold, the processing unit is further configured to determine that the first user's voiceprint recognition has failed.
  • embodiments of the present application provide a terminal device, including a processor and a memory.
  • The memory is used to store code instructions; the processor is used to run the code instructions, so that the terminal device executes the voiceprint recognition method described in the first aspect or any implementation of the first aspect.
  • embodiments of the present application provide a computer-readable storage medium.
  • The computer-readable storage medium stores instructions.
  • When the instructions are executed, the computer performs the voiceprint recognition method described in the first aspect or any implementation of the first aspect.
  • a fifth aspect is a computer program product, including a computer program that, when executed, causes the computer to perform the voiceprint recognition method described in the first aspect or any implementation of the first aspect.
  • Figure 1 is a schematic diagram of a scenario provided by an embodiment of the present application.
  • Figure 2 is a schematic flow chart of a voiceprint recognition method
  • Figure 3 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application.
  • Figure 4 is a schematic flow chart of another voiceprint recognition method provided by an embodiment of the present application.
  • Figure 5 is a schematic flowchart of determining a registration template score provided by an embodiment of the present application.
  • Figure 6 is a schematic flowchart of obtaining the first voiceprint blacklist provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of an interface for setting a voiceprint recognition mode provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of another interface for setting a voiceprint recognition mode provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of an interface for displaying prompt information provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of a voiceprint recognition device provided by an embodiment of the present application.
  • Figure 11 is a schematic diagram of the hardware structure of a control device provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • words such as “first” and “second” are used to distinguish the same or similar items with basically the same functions and effects.
  • the first value and the second value are only used to distinguish different values, and their order is not limited.
  • Words such as “first” and “second” do not limit the number or the execution order.
  • At least one refers to one or more, and “plurality” refers to two or more.
  • “And/or” describes the association of associated objects, indicating that there can be three relationships, for example, A and/or B, which can mean: A exists alone, A and B exist simultaneously, and B exists alone, where A, B can be singular or plural.
  • the character “/” generally indicates that the related objects are in an “or” relationship.
  • “At least one of the following” or similar expressions refers to any combination of these items, including any combination of a single item or plural items.
  • At least one of a, b, or c can represent: a, b, c, a and b, a and c, b and c, or a, b and c, where a, b, c can be single or multiple.
  • the voiceprint can be the sound wave spectrum carrying speech information displayed by electroacoustic instruments, and the voiceprint can be used to characterize the speaker's voice characteristics.
  • Voiceprints are not only specific, but also relatively stable. It is understandable that whether the speaker deliberately imitates the voice and tone of others, or speaks softly in a whisper, even if the imitation is lifelike, the voiceprint will always be different from the real voiceprint of the person being imitated. Therefore, voiceprint recognition can be widely used in speaker recognition scenarios.
  • the terminal device can use the voiceprint to determine whether the received voice is the voice of a registered user, and wake up the terminal device when it is determined that the received voice is the voice of a registered user.
  • Figure 1 is a schematic diagram of a scenario provided by an embodiment of the present application.
  • the terminal device is a mobile phone as an example for illustration. This example does not constitute a limitation on the embodiment of the present application.
  • this scenario may include user 101, user 102 and mobile phone 103.
  • User 101 and user 102 may be twins with very similar voices.
  • User 101 may be a registered user of mobile phone 103 (that is, user 101 may be the owner of mobile phone 103).
  • Because user 101 is a registered user of mobile phone 103, the voiceprint data of user 101 can be registered in mobile phone 103. User 101 can therefore wake up mobile phone 103 using the voiceprint recognition method shown in Figure 2, and use other voice commands to instruct mobile phone 103 to perform various functions.
  • Figure 2 is a schematic flow chart of a voiceprint recognition method. As shown in Figure 2, the voiceprint recognition method may include the following steps:
  • the terminal device obtains microphone (microphone, MIC) data.
  • the MIC data may be collected based on the microphone of the terminal device.
  • the MIC data may be an electrical signal corresponding to the user's voice data.
  • the MIC data may also be called speaker voiceprint data.
  • the speaker's voiceprint data will be used as an example for explanation below.
  • the terminal device performs wake word detection.
  • the wake-up word may be an instruction used to instruct the terminal device to perform a corresponding function.
  • For example, the wake-up word may be an instruction used to wake up a terminal device that is in a sleep state (or a low-power state).
  • the terminal device calculates the speaker's voiceprint vector and registration template score based on the voiceprint model.
  • the speaker's voiceprint vector can be used to characterize the speaker's voice characteristics.
  • The speaker's voiceprint vector is obtained by extracting and computing the acoustic features of the speaker's voiceprint data from step S201; the registration template score indicates the similarity between the speaker's voice and the registered user's voice. For example, the higher the registration template score, the more similar the speaker's voice is to the registered user's voice.
  • the terminal device determines whether the registration template score is greater than T2.
  • When the terminal device determines that the registration template score is greater than (or greater than or equal to) T2, the terminal device may perform the step shown in S205; when the terminal device determines that the registration template score is less than or equal to (or less than) T2, the terminal device may perform the step shown in S206.
  • the threshold T2 can be used to determine whether the speaker's voice belongs to the registered user's voice. For example, when the highest value of the registration template score is 100 points, the value of T2 can be 80 points.
  • the terminal device determines that the judgment is successful and wakes up the terminal device.
  • the terminal device determines that the judgment has failed.
  • To let the user wake up the terminal device by voice in a variety of scenarios, the terminal device usually sets relatively loose judgment conditions, for example a lower threshold T2 such as 80 points, to ensure a higher wake-up rate.
  • the user 101 can successfully wake up the mobile phone 103 based on the voiceprint recognition method in the embodiment corresponding to FIG. 2 .
  • When user 102 uses the voiceprint recognition method in the embodiment corresponding to Figure 2, because user 102 and user 101 are twins with very similar voices, mobile phone 103 may recognize that the voice of user 102 differs from that of user 101 but, because of the looser judgment conditions, still wake up.
  • For example, the registration template score corresponding to user 102 may be 81 points, which exceeds the 80 points of threshold T2, so user 102 wakes up mobile phone 103. This results in a high false acceptance rate and may threaten the privacy of user 101's device.
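  • The single-threshold check of Figure 2, and why the twin in this example gets through, can be shown in a few lines. The value T2 = 80 and the score of 81 come from the example above; the function name `baseline_wake` and the choice of a strict comparison (the text also allows "greater than or equal to") are assumptions.

```python
T2 = 80  # threshold from the example in the text


def baseline_wake(registration_template_score, t2=T2):
    """Figure 2 style check: wake the device when the score exceeds T2."""
    return registration_template_score > t2


# User 102 (the twin) scores 81 against user 101's registration template.
twin_wakes_phone = baseline_wake(81)
```

  • Here `twin_wakes_phone` is true, because the baseline has no second check that could distinguish a near miss from the registered user.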
  • embodiments of the present application provide a voiceprint recognition method.
  • the terminal device can be set with a voiceprint blacklist database.
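  • When the score of the received speaker's voiceprint data against the registered user's voiceprint data is greater than the first threshold, and the score of the speaker's voiceprint data against the voiceprint blacklist database is less than the second threshold, the terminal device is woken up, so that the terminal device can accurately identify the user's voice, reducing the false positive rate while improving the security of voiceprint recognition.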
  • the first threshold may be T2 described in the embodiment of this application
  • the second threshold may be T1 described in the embodiment of this application.
  • The voiceprint recognition method provided by the embodiments of this application can be used not only in the device wake-up scenario shown in Figure 1 but also in other identity authentication scenarios, such as payment scenarios.
  • The embodiments of this application do not specifically limit this.
  • the above-mentioned terminal equipment can also be called a terminal (terminal), user equipment (user equipment, UE), mobile station (mobile station, MS), mobile terminal (mobile terminal, MT), etc.
  • The terminal device can be a mobile phone with a microphone, a smart TV, a wearable device, a tablet (Pad), a computer with wireless transceiver functions, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote medical surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, etc.
  • the embodiments of this application do not limit the specific technology and specific equipment form used by the terminal equipment.
  • FIG. 3 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the terminal device may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, a button 190, an indicator 192, a camera 193, a display screen 194, etc.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the terminal device.
  • the terminal device may include more or fewer components than shown in the figures, or some components may be combined, separated, or arranged differently.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units. Different processing units can be independent devices or can be integrated in one or more processors.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the USB interface 130 is an interface that complies with the USB standard specification, and may be a Mini USB interface, a Micro USB interface, a USB Type C interface, etc.
  • the USB interface 130 can be used to connect a charger to charge the terminal device, and can also be used to transmit data between the terminal device and peripheral devices. It can also be used to connect headphones to play audio through them. This interface can also be used to connect other devices, such as AR devices, etc.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the power management module 141 is used to connect the charging management module 140 and the processor 110 .
  • the wireless communication function of the terminal device can be implemented through antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, modem processor and baseband processor, etc.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • antennas in the terminal device can be used to cover a single communication band or multiple communication bands. Different antennas can also be multiplexed to improve antenna utilization.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied to terminal devices.
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the wireless communication module 160 can provide wireless communication solutions applied to the terminal device, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), and frequency modulation (FM).
  • the terminal device implements display functions through the GPU, the display screen 194, and the application processor.
  • the GPU is an image processing microprocessor and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • the display screen 194 is used to display images, videos, etc.
  • Display 194 includes a display panel.
  • the terminal device may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the terminal device can realize the shooting function through the ISP, camera 193, video codec, GPU, display screen 194 and application processor.
  • Camera 193 is used to capture still images or video.
  • the terminal device may include 1 or N cameras 193, where N is a positive integer greater than 1.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving files such as music and videos on the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a program storage area and a data storage area.
  • the terminal device can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signals.
  • Speaker 170A also called “speaker”
  • Receiver 170B also called “earpiece”
  • the headphone interface 170D is used to connect wired headphones.
  • the microphone 170C is used to convert sound signals into electrical signals.
  • the terminal device can receive a sound signal for waking up the terminal device based on the microphone 170C, and convert the sound signal into an electrical signal that can be subsequently processed.
  • the terminal device can have at least one microphone 170C.
  • the sensor module 180 may include one or more of the following sensors, for example: a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, or a bone conduction sensor (not shown in Figure 3).
  • the buttons 190 include a power button, a volume button, etc.
  • the button 190 may be a mechanical button or a touch button.
  • the terminal device can receive key input and generate key signal input related to user settings and function control of the terminal device.
  • the indicator 192 may be an indicator light, which may be used to indicate charging status, power changes, or may be used to indicate messages, missed calls, notifications, etc.
  • the software system of the terminal device can adopt a layered architecture, event-driven architecture, micro-kernel architecture, micro-service architecture, or cloud architecture, etc., which will not be described again here.
  • FIG. 4 is a schematic flowchart of another voiceprint recognition method provided by an embodiment of the present application.
  • the terminal device may be provided with a voiceprint blacklist database for voiceprint verification of impersonators (or understood as unfamiliar users, or unregistered users).
  • the voiceprint recognition method may include the following steps:
  • S401: the terminal device obtains MIC data.
  • the MIC data can be called the speaker's voiceprint data.
  • S402: the terminal device performs wake-up word detection.
  • in a scenario where a wake-up word is used to wake up a terminal device in a sleeping state, the wake-up word can be Hello Yoyo; in a scenario where a wake-up word is used for payment, the wake-up word can be confirmation of payment. It can be understood that the wake-up word can be set according to the actual application scenario, which is not limited in the embodiments of the present application.
  • the terminal device can obtain the speaker's voiceprint data in real time and perform wake-up word detection on the speaker's voiceprint data. When the wake-up word is detected, the terminal device can perform the steps shown in S403.
  • S403: the terminal device calculates the speaker's voiceprint vector, registration template score, and blacklist score based on the voiceprint model.
  • the speaker's voiceprint vector can be used to characterize the speaker's voice characteristics; the registration template score indicates the similarity between the speaker's voice and the registered user's voice; the blacklist score indicates the similarity between the speaker's voice and an impersonator's voice.
  • the terminal device may obtain the blacklist score corresponding to the speaker's voiceprint data based on the voiceprint blacklist database used to store the impersonator's voiceprint vector.
  • the impersonator's voiceprint vector stored in the voiceprint blacklist database can be used to characterize the voice characteristics of the impersonator.
  • the terminal device can calculate the speaker's voiceprint vector and the registration template score based on the voiceprint model.
  • FIG. 5 is a schematic flowchart of determining a registration template score provided by an embodiment of the present application.
  • a possible implementation of the terminal device calculating the registration template score based on the voiceprint model can be: the terminal device can obtain the speaker's voiceprint data and the registered user's voiceprint data respectively; and extract the speaker's voiceprint respectively.
  • probabilistic linear discriminant analysis (PLDA)
  • the terminal device can store the registered user's voiceprint vector, so that the registered user's voiceprint vector does not have to be recalculated when registration template scores are later calculated for other speakers.
  • the terminal device can calculate the blacklist score based on the voiceprint model.
  • a possible implementation of the terminal device calculating the blacklist score based on the voiceprint model may be: the terminal device is provided with a voiceprint blacklist library that stores at least one voiceprint blacklist, and each voiceprint blacklist corresponds to the voiceprint vector of one impostor.
  • the voiceprint blacklist database can store voiceprint blacklist 1, voiceprint blacklist 2,..., and voiceprint blacklist M, where M is a positive integer.
  • the terminal device can use the voiceprint model to perform similarity discrimination on the speaker's voiceprint vector and the voiceprint vectors in the voiceprint blacklist database, and use the score with the highest similarity as the blacklist score.
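The two scores described above can be sketched as follows. This is a minimal illustration, not the scoring defined by the application: it assumes voiceprint vectors are plain lists of floats, uses cosine similarity as the similarity measure, and maps the result to a 0 to 100 point scale. The function names `cosine_score`, `registration_template_score`, and `blacklist_score` are hypothetical.

```python
import math
from typing import Sequence


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity mapped from [-1, 1] to a 0-100 point scale (assumed scale)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # a zero vector carries no voice information
    return 50.0 * (dot / norm + 1.0)


def registration_template_score(speaker: Sequence[float],
                                registered: Sequence[float]) -> float:
    """Similarity between the speaker and the stored registered-user vector."""
    return cosine_score(speaker, registered)


def blacklist_score(speaker: Sequence[float],
                    blacklist: Sequence[Sequence[float]]) -> float:
    """Highest similarity against any stored impostor vector (0 if the library is empty)."""
    return max((cosine_score(speaker, v) for v in blacklist), default=0.0)
```

Taking the maximum means that one close match in the blacklist library is enough to flag the speaker, which matches the "score with the highest similarity" rule in the text.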
  • the voiceprint model described above may include one or more of the following, for example: a Gaussian mixture model (GMM), a Gaussian mixture model-universal background model (GMM-UBM), a Gaussian mixture model-support vector machine (GMM-SVM), joint factor analysis (JFA), the GMM-based i-vector method, the deep neural network (DNN)-based d-vector method, or the neural network (NNET)-based x-vector method.
  • the terminal device may use one or more of the following methods to extract acoustic features, for example: Mel-scale frequency cepstral coefficients (MFCC), filter bank (FBank) features, or linear prediction coefficients (LPC). The embodiments of this application do not specifically limit the method for extracting acoustic features.
  • S404: the terminal device determines whether the registration template score is greater than T1.
  • when the terminal device determines that the registration template score is greater than (or greater than or equal to) T1, the terminal device may perform the step shown in S410; or, when the terminal device determines that the registration template score is less than or equal to (or less than) T1, the terminal device can perform the step shown in S405.
  • the terminal device can set a higher threshold T1 so that only a voice highly similar to the registered user's voice, for example the registered user's own voice, can pass voiceprint recognition; this lets the terminal device identify the user's voice accurately and reduces the system's false alarm rate.
  • S405: the terminal device determines whether the registration template score is greater than T2 and the blacklist score is less than T1.
  • when the terminal device determines that the registration template score is greater than (or greater than or equal to) T2 and the blacklist score is less than (or less than or equal to) T1, the terminal device can perform the step shown in S410; or, when the terminal device determines that this condition is not satisfied, the terminal device can perform the steps shown in S406 and S409.
  • the condition "the registration template score is greater than (or greater than or equal to) T2 and the blacklist score is less than (or less than or equal to) T1" is not satisfied when the registration template score is less than or equal to (or less than) T2, or the blacklist score is greater than or equal to (or greater than) T1, or both.
  • the terminal device can determine whether the registered template score is greater than T2 and whether the blacklist score is less than T1, thereby reducing the false entry rate and improving the success rate of the voiceprint recognition method.
  • for example, when the terminal device determines that the registration template score corresponding to the received speaker's voiceprint data is 81 points, which is greater than the 80 points corresponding to T2, the voiceprint recognition method corresponding to Figure 2 would determine that the judgment is successful and wake up the terminal device. Because this judgment condition is loose, a speaker's voice that scores close to the threshold T2 is likely to be the voice of an impostor that resembles the registered user's voice, and letting such a voice wake up the terminal device leads to a higher rate of false alarms.
  • the terminal device can further check the blacklist score corresponding to the speaker's voiceprint data against T1. For example, requiring the blacklist score to be less than T1 ensures that the current speaker's voice is not one of the impersonator voices recorded by the terminal device, thereby improving the success rate of voiceprint recognition while reducing the rate of false positives.
  • when the terminal device uses the higher threshold T1 to identify the voice accurately in the step shown in S404, the recognition is relatively strict, so the terminal device may fail to recognize the user's voice in different scenarios or in different voice states. For example, the terminal device may not recognize the user's voice when the user has a cold, resulting in a lower success rate.
  • the terminal device can ensure a higher success rate by setting the lower threshold T2, and can check the blacklist score corresponding to the speaker's voiceprint data against T1 (for example, requiring the blacklist score to be less than T1) to ensure that the current speaker's voice is not one of the impersonator voices recorded by the terminal device, thereby improving the success rate of voiceprint recognition while keeping the rate of false identification low.
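The two-stage decision in S404 and S405 can be sketched as below. This is an illustrative sketch only: the value 80 for T2 comes from the example in the text, while the value 90 for T1 and the function name `verify` are assumptions made for the example, and strict inequalities are used where the text also allows the "or equal" variants.

```python
T1 = 90.0  # assumed value for the stricter threshold (not given in the text)
T2 = 80.0  # 80 points, as in the example in the text


def verify(registration_score: float, blacklist_score: float,
           t1: float = T1, t2: float = T2) -> bool:
    """Return True if the speaker passes voiceprint verification."""
    # S404: a very high registration template score passes on its own.
    if registration_score > t1:
        return True
    # S405: a moderately high score passes only if the voice does not
    # resemble any impostor stored in the voiceprint blacklist library.
    return registration_score > t2 and blacklist_score < t1
```

With these values, a speaker scoring 81 points is rejected when the blacklist score is 95 (`verify(81, 95)` is `False`) but accepted when the blacklist score is 40 (`verify(81, 40)` is `True`).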
  • S406: the terminal device determines whether the registration template score is greater than T3 and the blacklist score is less than T2.
  • when the terminal device determines that the registration template score is greater than (or greater than or equal to) T3 and the blacklist score is less than (or less than or equal to) T2, the terminal device can perform the step shown in S407; or, when this condition is not satisfied, the terminal device can end the step of adding the current speaker's voiceprint vector to the voiceprint blacklist database.
  • the condition "the registration template score is greater than (or greater than or equal to) T3 and the blacklist score is less than (or less than or equal to) T2" is not satisfied when the registration template score is less than or equal to (or less than) T3, or the blacklist score is greater than or equal to (or greater than) T2, or both.
  • the terminal device can determine whether to add sounds that threaten the terminal device to the voiceprint blacklist library by determining whether the registered template score is greater than T3 and whether the blacklist score is less than T2.
  • the terminal device determines that the registration template score is greater than T3, it can be understood that the similarity between the currently received voice and the registered user's voice is low.
  • the received voice may be a voice that threatens the device.
  • when the terminal device determines that the blacklist score is less than T2, it can be understood that the currently received voice does not belong to the impersonator voices stored in the voiceprint blacklist database. Therefore, the terminal device can further ensure the security of the voiceprint recognition method by adding to the voiceprint blacklist database the voiceprint vector of a speaker whose voice threatens the terminal device and has not yet been added. When the terminal device determines that the blacklist score is greater than or equal to T2, it can be understood that the voiceprint vector corresponding to the current speaker's voice is already in the voiceprint blacklist library, so there is no need to add it again.
  • S407: the terminal device determines whether the signal-to-noise ratio is greater than N dB.
  • the signal-to-noise ratio is used to indicate the ratio of the user's voice signal to the noise signal in the environment.
  • when the terminal device determines that the signal-to-noise ratio is greater than (or greater than or equal to) N dB, the terminal device can perform the step shown in S408; or, when the terminal device determines that the signal-to-noise ratio is less than or equal to (or less than) N dB, the terminal device can end the step of adding the current speaker's voiceprint vector to the voiceprint blacklist library.
  • the terminal device can extract higher-quality voiceprint vectors based on signal-to-noise ratio judgment to avoid misjudging the user's voice in a noisy environment as the voice of an impostor.
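The enrollment checks in S406 and S407, which run only after verification has failed, can be sketched as follows. The sketch assumes the speech and noise segments are already separated into sample lists, and the threshold values (T2 = 80 from the text; T3 = 60 and N = 15 dB chosen arbitrarily) and the names `snr_db` and `should_add_to_blacklist` are hypothetical.

```python
import math
from typing import Sequence


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB from the mean power of two sample segments."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        return math.inf   # no measurable noise
    if p_speech == 0.0:
        return -math.inf  # no measurable speech
    return 10.0 * math.log10(p_speech / p_noise)


def should_add_to_blacklist(registration_score: float, blacklist_score: float,
                            snr: float, t2: float = 80.0, t3: float = 60.0,
                            n_db: float = 15.0) -> bool:
    """Decide whether a rejected speaker's vector is added to the blacklist library."""
    # S406: similar enough to the registered user to be a threat (> T3),
    # and not already represented in the blacklist library (< T2).
    if not (registration_score > t3 and blacklist_score < t2):
        return False
    # S407: only enroll vectors extracted from sufficiently clean audio.
    return snr > n_db
```

Checking the signal-to-noise ratio last keeps noisy recordings of the legitimate user from being stored as impostor vectors.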
  • S408: the terminal device obtains the current speaker's voiceprint vector, and adds the current speaker's voiceprint vector to the voiceprint blacklist library.
  • the voiceprint blacklist database can store multiple voiceprint blacklists, such as voiceprint blacklist 1, voiceprint blacklist 2, ..., and voiceprint blacklist M.
  • when the voiceprint blacklist database can store only M entries and the voiceprint vector of an (M+1)-th speaker needs to be added, the terminal device can determine which voiceprint blacklist to remove according to the time when each voiceprint blacklist was added and/or the number of times each voiceprint blacklist has been used.
  • for example, the terminal device can remove, from the M voiceprint blacklists, the voiceprint blacklist that was added the longest time ago; or, the terminal device can remove the voiceprint blacklist with the fewest uses among the M voiceprint blacklists; or, the terminal device can remove, from the P voiceprint blacklists with the fewest uses among the M voiceprint blacklists, the voiceprint blacklist that was added the longest time ago, where M is greater than (or greater than or equal to) P.
  • the terminal device can also automatically clean the voiceprint blacklist database periodically, such as every other day, or every 4 hours, based on the number of times the voiceprint blacklist is used and the time when the voiceprint blacklist is added.
  • the terminal device can keep the voiceprint blacklist database effective by dynamically adjusting the voiceprint blacklists in it, and can avoid the slowdown of the voiceprint recognition method that storing too much data in the voiceprint blacklist database would cause.
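The removal policies described above can be sketched with a single function. This is a hedged illustration: the `BlacklistEntry` fields, the timestamp representation, and the function names are assumptions. Setting `p` to 1 gives the "fewest uses" policy, setting `p` to the library size gives the "added earliest" policy, and values in between give the combined policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlacklistEntry:
    vector: List[float]
    added_at: float     # time the entry was added (e.g. seconds since epoch)
    use_count: int = 0  # number of times the entry has been used


def pick_eviction(entries: List[BlacklistEntry], p: int) -> int:
    """Index of the entry to remove: the oldest among the p least-used entries.

    Assumes `entries` is non-empty.
    """
    p = max(1, min(p, len(entries)))
    # sorted() is stable, so ties on use_count keep their stored order.
    least_used = sorted(range(len(entries)),
                        key=lambda i: entries[i].use_count)[:p]
    return min(least_used, key=lambda i: entries[i].added_at)


def add_entry(entries: List[BlacklistEntry], new_entry: BlacklistEntry,
              capacity: int, p: int = 2) -> None:
    """Add an entry, evicting one first if the library already holds `capacity` entries."""
    if len(entries) >= capacity:
        entries.pop(pick_eviction(entries, p))
    entries.append(new_entry)
```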
  • S409: the terminal device determines that the judgment has failed.
  • this round of verification fails. For example, when using voice to wake up a terminal device in a sleeping state, when the terminal device determines that the judgment fails, it can continue to maintain the sleeping state.
  • S410: the terminal device determines that the judgment is successful and wakes up the terminal device.
  • when voice is used to wake up a terminal device in a sleeping state and the terminal device determines that the judgment is successful, the terminal device can be woken up; for example, the terminal device can turn on the screen and play a voice message. For example, when the user wakes up the terminal device through Hello Yoyo, the terminal device can play a voice message such as I am here after the judgment is successful.
  • the terminal device can verify the voiceprint blacklist database based on the steps shown in S411-S413.
  • S411: the terminal device determines whether the registration template score is greater than T1.
  • when the terminal device determines that the registration template score is greater than (or greater than or equal to) T1, the terminal device may perform the step shown in S412; or, when the terminal device determines that the registration template score is less than or equal to (or less than) T1, the terminal device can end the verification step for the voiceprint blacklist database.
  • the terminal device can filter out the voices of registered users by determining whether the registration template score is greater than T1.
  • S412: the terminal device determines whether the blacklist score is greater than T2.
  • the blacklist score here can be the score of the speaker's voiceprint vector against each voiceprint blacklist in the voiceprint blacklist database (or understood as the similarity score between the speaker's voice and the impersonator's voice corresponding to each voiceprint blacklist), rather than the maximum blacklist score over the voiceprint blacklist database. For example, when 5 of the speaker's blacklist scores against the voiceprint blacklist database are greater than T2, the terminal device can extract the 5 corresponding voiceprint blacklists.
  • when the terminal device determines that a blacklist score is greater than (or greater than or equal to) T2, the terminal device may perform the step shown in S413; or, when the terminal device determines that the blacklist score is less than or equal to (or less than) T2, the terminal device may end the verification step for the voiceprint blacklist database.
  • the blacklist score can be calculated by the terminal device in the step shown in S403, and saved in the device, so that the terminal device can call it in the step shown in S412.
  • the terminal device can calculate, based on the voiceprint model, the M blacklist scores of the speaker's voiceprint vector against the M voiceprint blacklists in the voiceprint blacklist database and store them in the device; the M blacklist scores are then called in the step shown in S412 to determine the voiceprint blacklists whose blacklist score is greater than T2.
  • the blacklist score can also be calculated based on the voiceprint blacklist database and the speaker's voiceprint vector in the step shown in S412.
  • for example, the terminal device can first wake up the device when the registration template score is greater than T1, or when the registration template score is greater than T2 and the blacklist score is less than T1; then, in the step shown in S412, the terminal device can calculate, based on the voiceprint model, the M blacklist scores of the speaker's voiceprint vector against the M voiceprint blacklists in the voiceprint blacklist database, and obtain the voiceprint blacklists whose blacklist score is greater than T2. It can be understood that performing the blacklist score calculation in the step shown in S412 can increase the speed of waking up the device based on voiceprint data.
  • the terminal device can filter out the voiceprint vectors of registered users who have mistakenly entered the voiceprint blacklist database by determining whether the registration template score is > T1 and the blacklist score is > T2.
  • S413: the terminal device deletes the corresponding voiceprint blacklist.
  • the terminal device can delete from the voiceprint blacklist library all voiceprint blacklists for which the registration template score is greater than T1 and the blacklist score is greater than T2.
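The blacklist verification in S411 to S413 can be sketched as below. The sketch is illustrative only: `dot_score` is a stand-in scorer that assumes unit-length vectors, the threshold values (T1 = 90 assumed, T2 = 80 from the text) are examples, and `prune_blacklist` is a hypothetical name.

```python
from typing import Callable, List, Sequence


def dot_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in similarity score on a 0-100 scale; assumes unit-length vectors."""
    return 100.0 * sum(x * y for x, y in zip(a, b))


def prune_blacklist(speaker: Sequence[float],
                    registration_score: float,
                    blacklist: List[Sequence[float]],
                    score_fn: Callable[[Sequence[float], Sequence[float]], float] = dot_score,
                    t1: float = 90.0,
                    t2: float = 80.0) -> List[Sequence[float]]:
    """Return the blacklist with entries that match a verified registered user removed."""
    # S411: only a speaker confidently identified as the registered user
    # triggers the clean-up.
    if registration_score <= t1:
        return list(blacklist)
    # S412/S413: score the speaker against every entry (not just the best
    # match) and drop every entry whose score exceeds T2.
    return [v for v in blacklist if score_fn(speaker, v) <= t2]
```

Scoring against every entry rather than only the maximum allows several mistakenly stored vectors of the registered user to be removed in a single pass.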
  • the terminal device can be set up with a voiceprint blacklist library, and use the voiceprint vectors of registered users and the voiceprint vectors in the voiceprint blacklist library to score the speaker's voiceprint vectors received by the terminal device, so that the terminal device It can achieve accurate recognition of the user's voice and improve the security of voiceprint recognition while reducing the rate of false alarms.
  • FIG. 6 is a schematic flowchart of obtaining the first voiceprint blacklist provided by an embodiment of the present application.
  • the method of obtaining the first voiceprint blacklist may include the following steps:
  • S601: the terminal device obtains MIC data.
  • the MIC data may be the speaker's voiceprint data.
  • S602: the terminal device performs wake-up word detection.
  • S603: the terminal device calculates the speaker's voiceprint vector and registration template score based on the voiceprint model.
  • S604: the terminal device determines whether the registration template score is greater than T2.
  • when the terminal device determines that the registration template score is greater than (or greater than or equal to) T2, the terminal device may perform the step shown in S605; or, when the terminal device determines that the registration template score is less than or equal to (or less than) T2, the terminal device can perform the step shown in S606.
  • S605: the terminal device determines that the judgment is successful and wakes up the terminal device.
  • S606: the terminal device determines whether the registration template score is greater than T3.
  • when the terminal device determines that the registration template score is greater than (or greater than or equal to) T3, the terminal device may perform the step shown in S607; or, when the terminal device determines that the registration template score is less than or equal to (or less than) T3, the terminal device can end the step of adding the current voiceprint to the voiceprint blacklist database.
  • the terminal device can filter out the sounds that threaten the system through the threshold T3 and add them to the voiceprint blacklist library.
  • S607: the terminal device determines whether the signal-to-noise ratio is greater than N dB.
  • when the terminal device determines that the signal-to-noise ratio is greater than (or greater than or equal to) N dB, the terminal device can perform the step shown in S608; or, when the signal-to-noise ratio is less than or equal to (or less than) N dB, the terminal device can end the step of adding the current voiceprint to the voiceprint blacklist library.
  • S608: the terminal device obtains the current speaker's voiceprint vector, and adds the current speaker's voiceprint vector to the voiceprint blacklist library.
  • the voiceprint blacklist database stores the voiceprint blacklist 1 corresponding to the current speaker's voiceprint vector.
  • the terminal device can add sounds that threaten the device to the voiceprint blacklist library, so that the voiceprint blacklist library can be used for subsequent voiceprint recognition.
  • the terminal device may support voiceprint recognition in different modes, such as a high recognition rate mode and a low recognition rate mode.
  • the high recognition rate mode can be understood as a mode used to provide accurate recognition. In this mode, only a voice that is very similar to the registered user's voice, or a voice that does not belong to the impersonator voices in the voiceprint blacklist stored in the terminal device, can be recognized, so the recognition accuracy is higher.
  • the high recognition rate mode may correspond to the voiceprint recognition method described in the corresponding embodiment of FIG. 4 .
  • the low recognition rate mode can be understood as a mode used to provide a higher recognition success rate. In this mode, the user's voice recognition can be realized in different scenes or different sound states, and the recognition success rate is higher.
  • the low recognition rate mode may correspond to the voiceprint recognition method described in the corresponding embodiment of FIG. 2 .
  • FIG. 7 is a schematic diagram of an interface for setting a voiceprint recognition mode provided by an embodiment of the present application.
  • the following uses a mobile phone as an example of the terminal device for illustration. This example does not constitute a limitation on the embodiments of the present application.
  • when the mobile phone receives the user's operation to set the voice wake-up function, the mobile phone can display the interface shown in a in Figure 7. This interface can display a control for setting user information, a control for setting power key wake-up, a control 701 for setting voice wake-up, a control for the user to view more functions, and so on.
  • when the mobile phone receives the user's operation to trigger the control 701 for setting voice wake-up, the mobile phone can display the interface shown in b in Figure 7.
  • the interface shown in b in Figure 7 includes a control 702 for enabling voice wake-up and so on.
  • when the mobile phone receives the user's operation to trigger the control 702 for turning on voice wake-up, the mobile phone can display the interface shown in c in Figure 7.
  • the interface shown in c in Figure 7 may include: a control for turning off voice wake-up, a control 703 for setting a high recognition rate mode, a control 704 for setting a low recognition rate mode, a control for setting the wake-up command, and so on.
  • the wake-up command can be: Hello Yoyo.
  • When the mobile phone receives the user's operation triggering the control 703 for setting the high recognition rate mode, the mobile phone can perform voiceprint recognition on the received speaker voiceprint data based on the voiceprint blacklist library and the registered user's voiceprint data.
  • When the mobile phone receives the user's operation triggering the control 704 for setting the low recognition rate mode, the mobile phone can perform voiceprint recognition on the received speaker voiceprint data based on the registered user's voiceprint data.
  • In this way, users can flexibly set the voiceprint recognition mode according to their own needs, which improves the experience of using the voice wake-up function.
  • FIG. 8 is a schematic diagram of another interface for setting the voiceprint recognition mode provided by an embodiment of the present application.
  • When the mobile phone receives the user's operation triggering the control 703 for setting the high recognition rate mode, the mobile phone can display the interface shown in b in FIG. 8.
  • The interface shown in b in FIG. 8 may include: a control 801, corresponding to the high recognition rate mode, for turning on the prompt for adding to the voiceprint blacklist library.
  • The prompt for adding to the voiceprint blacklist library can be understood as follows: when the mobile phone recognizes a voice that does not belong to the registered user (which can be understood as recognizing an impostor's voice), it displays a prompt asking whether to add that voice to the voiceprint blacklist library.
  • The interface shown in a in FIG. 8 is similar to the interface shown in c in FIG. 7, and is not described again here.
  • When the mobile phone receives the user's operation triggering the control 801 for turning on the prompt for adding to the voiceprint blacklist library, the mobile phone can display the prompt information when it recognizes a voice that does not belong to the registered user; or, when the mobile phone does not receive the user's operation triggering the control 801, the mobile phone can by default add the detected voice that does not belong to the registered user to the voiceprint blacklist library.
  • FIG. 9 is a schematic diagram of an interface for displaying prompt information provided by an embodiment of the present application.
  • In the sleep state of the mobile phone (or the screen-off state of the mobile phone), when the mobile phone receives speaker voiceprint data in the high recognition rate mode of the voice wake-up function and determines that the registration template score corresponding to the speaker voiceprint data is greater than (or greater than or equal to) T3, the blacklist score is less than (or less than or equal to) T2, and the signal-to-noise ratio is greater than (or greater than or equal to) N dB, the mobile phone can obtain the speaker voiceprint vector corresponding to the speaker voiceprint data and display the interface shown in FIG. 9.
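The three conditions above can be sketched as a single predicate. This is a minimal illustration only: the function name and the numeric values in the usage lines (T3 = 60, N = 20 dB) are assumptions made for the example, and only T2 = 80 is taken from the example value given elsewhere in this application. The strict comparisons follow the "greater than" / "less than" variant of the text.

```python
def is_blacklist_candidate(template_score: float,
                           blacklist_score: float,
                           snr_db: float,
                           t3: float,
                           t2: float,
                           snr_threshold_db: float) -> bool:
    """Return True when the speaker voiceprint vector qualifies as a
    candidate for the voiceprint blacklist library.

    Hypothetical helper: the voice is close enough to the registered user
    to be a threat (> T3), not yet similar to any blacklisted voice (< T2),
    and clean enough to be trusted (SNR > N dB).
    """
    return (template_score > t3
            and blacklist_score < t2
            and snr_db > snr_threshold_db)


# Illustrative values only (T3 and N are assumed, not from the application).
print(is_blacklist_candidate(70, 10, 25, t3=60, t2=80, snr_threshold_db=20))
print(is_blacklist_candidate(70, 10, 12, t3=60, t2=80, snr_threshold_db=20))
```

The second call differs only in the signal-to-noise ratio, which shows how a noisy recording is kept out of the library even when both score conditions hold.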
  • The interface shown in FIG. 9 can display: prompt information 901, a confirmation control 902 for adding the current speaker's voiceprint vector to the voiceprint blacklist library, and a cancel control 903 for refusing to add the current speaker's voiceprint vector to the voiceprint blacklist library.
  • The prompt information 901 may be: An impostor's voice has been detected, please confirm whether to add the voice to the voiceprint blacklist library.
  • When the mobile phone receives no user operation on either the confirmation control 902 or the cancel control 903 within a certain period of time while the prompt information 901 is displayed, the mobile phone can by default perform the step of adding the current voiceprint data to the voiceprint blacklist library.
  • In this way, the terminal device can avoid the misoperation of directly adding the voice to the voiceprint blacklist library.
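The prompt behaviour described for FIG. 8 and FIG. 9 can be summarised as a small decision function. This is a sketch under stated assumptions: the function name, the string values "confirm" and "cancel", and the use of `None` to represent "no operation within the time limit" are all hypothetical choices for illustration, not part of the application.

```python
from typing import Optional


def should_add_after_prompt(prompt_enabled: bool,
                            user_choice: Optional[str]) -> bool:
    """Decide whether a candidate voiceprint vector is added to the
    voiceprint blacklist library.

    prompt_enabled: state of the prompt switch (control 801).
    user_choice: "confirm" (control 902), "cancel" (control 903), or
                 None when no control was operated within the time limit.
    """
    if not prompt_enabled:
        # Prompt switched off: the voice is added by default.
        return True
    if user_choice is None:
        # Prompt shown but no response in time: added by default.
        return True
    return user_choice == "confirm"


print(should_add_after_prompt(True, "cancel"))
print(should_add_after_prompt(True, None))
```

Only an explicit cancel keeps the vector out of the library, which matches the default-add behaviour described above.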
  • FIG. 10 is a schematic structural diagram of a voiceprint recognition device provided by an embodiment of the present application.
  • The voiceprint recognition device may be the terminal device in the embodiments of the present application, or may be a chip or a chip system in the terminal device.
  • The voiceprint recognition device 100 can be used in communication equipment, circuits, hardware components or chips.
  • The voiceprint recognition device includes: a display unit 1001 and a processing unit 1002.
  • The display unit 1001 is used to support the display steps performed by the voiceprint recognition device 100.
  • The processing unit 1002 is used to support the information processing steps performed by the voiceprint recognition device 100.
  • An embodiment of the present application provides a voiceprint recognition device 100. A preset database is provided in the device, and the preset database includes a voiceprint vector of at least one second user; a voiceprint vector is used to represent a user's voice characteristics. The device includes: a processing unit 1002, used to collect a first voice, where the first voice corresponds to a first voiceprint vector; if the terminal device determines that the first voice is a preset voice, the processing unit 1002 is also used to obtain the similarity score between the first voiceprint vector and a preset voiceprint vector to obtain a first value, where the preset voiceprint vector is the voiceprint vector of a first user; the processing unit 1002 is also used to obtain the highest of the similarity scores between the first voiceprint vector and each voiceprint vector in the preset database to obtain a second value; when the terminal device determines that the first value is greater than a first threshold and the second value is less than a second threshold, the processing unit 1002 is also used to determine that the voiceprint recognition of the first user is successful; the second threshold is greater than the first threshold.
  • In a possible implementation, when the terminal device determines that the first value is greater than a third threshold and the second value is less than the first threshold, the processing unit 1002 is also used to add the first voiceprint vector to the preset database; the first threshold is greater than the third threshold.
  • In a possible implementation, when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the signal-to-noise ratio value corresponding to the first voiceprint vector is greater than a fourth threshold, the processing unit 1002 is specifically used to add the first voiceprint vector to the preset database.
  • In a possible implementation, each voiceprint vector in the preset database is recorded with its storage time in the preset database and with its number of uses, where the number of uses is the number of times the vector was used to calculate the second value. The processing unit 1002 is specifically used to remove the voiceprint vector with the longest storage time in the preset database, and/or to remove the voiceprint vector with the fewest uses in the preset database; the processing unit 1002 is also specifically used to add the first voiceprint vector to the preset database.
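The dynamic adjustment of the preset database can be sketched as follows. This is a minimal illustration, not the implementation of the application: the `BlacklistEntry` type, the `capacity` limit that triggers removal, and the `policy` argument are assumptions introduced for the example, since the text only states which vectors are removed, not when.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlacklistEntry:
    vector: List[float]   # impostor voiceprint vector
    stored_at: float      # time at which the vector was stored
    use_count: int = 0    # times this vector produced the second value


def add_with_eviction(library: List[BlacklistEntry],
                      new_entry: BlacklistEntry,
                      capacity: int,
                      policy: str = "oldest") -> List[BlacklistEntry]:
    """Add new_entry in place, first removing one entry if the library is full.

    policy "oldest" removes the entry stored for the longest time;
    policy "least_used" removes the entry with the fewest uses.
    """
    if len(library) >= capacity:
        if policy == "oldest":
            key = lambda i: library[i].stored_at
        elif policy == "least_used":
            key = lambda i: library[i].use_count
        else:
            raise ValueError(f"unknown policy: {policy}")
        victim = min(range(len(library)), key=key)
        del library[victim]
    library.append(new_entry)
    return library


library = [BlacklistEntry([1.0], stored_at=10.0, use_count=5),
           BlacklistEntry([2.0], stored_at=20.0, use_count=1)]
add_with_eviction(library, BlacklistEntry([3.0], stored_at=30.0), capacity=2)
print([entry.vector for entry in library])
```

Bounding the library size in this way keeps the number of similarity comparisons per wake-up attempt constant, which is the speed concern the application raises.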
  • In a possible implementation, when the terminal device determines that the first value is greater than the third threshold and the second value is less than the first threshold, the display unit 1001 is used to display a first interface, where the first interface includes: prompt information for prompting whether to add the first voiceprint vector to the preset database, a first control for adding the first voiceprint vector to the preset database, and a second control for refusing to add the first voiceprint vector to the preset database; when the terminal device receives a trigger for the first control, or does not receive a trigger for any control in the first interface within a preset time threshold, the processing unit 1002 is specifically used to add the first voiceprint vector to the preset database.
  • In a possible implementation, when the terminal device receives an operation for setting the voiceprint recognition mode, the display unit 1001 is also used to display a second interface; the second interface includes a third control for turning on a first recognition mode; when the terminal device receives an operation on the third control, the display unit 1001 is also used to display a third interface; the third interface includes a fourth control for turning on the prompt information; when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the fourth control is in the on state, the processing unit 1002 is also used to display the first interface.
  • In a possible implementation, the processing unit 1002 is also used to obtain the similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain a third value; the processing unit 1002 is also used to delete, when the first value is greater than the second threshold and the third value is greater than the first threshold, the voiceprint vector in the preset database corresponding to the first voiceprint vector.
  • In a possible implementation, when the terminal device determines that the first value is greater than the second threshold, the processing unit 1002 is also used to obtain the similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain a third value; the processing unit 1002 is also used to delete, when the third value is greater than the first threshold, the voiceprint vector in the preset database corresponding to the first voiceprint vector.
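The deletion of mistakenly blacklisted vectors can be sketched as a filter over the preset database. This is an illustration under assumptions: the function name is hypothetical, the third values are taken as already computed (one per stored vector, in the same order), and T1 = 90 in the usage line is an assumed value, while the application identifies the second threshold as T1 and the first threshold as T2.

```python
from typing import List, Sequence, TypeVar

V = TypeVar("V")


def prune_blacklist(blacklist: Sequence[V],
                    third_values: Sequence[float],
                    template_score: float,
                    t1: float,
                    t2: float) -> List[V]:
    """Return the blacklist without entries that match a trusted speaker.

    template_score is the first value; third_values[i] is the similarity
    between the speaker vector and blacklist[i]. Entries are removed only
    when the speaker is almost certainly the registered user (> T1) and
    the entry is similar to that speaker (> T2).
    """
    if len(blacklist) != len(third_values):
        raise ValueError("one third value is required per blacklist entry")
    if template_score <= t1:
        return list(blacklist)
    return [entry for entry, score in zip(blacklist, third_values)
            if score <= t2]


print(prune_blacklist(["entry_a", "entry_b"], [85, 40], 95, t1=90, t2=80))
```

Here `entry_a` is removed because the speaker scored above T1 against the registered template while also scoring above T2 against that entry, which suggests the entry was the registered user's own voice.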
  • In a possible implementation, when the terminal device determines that the first value is greater than the first threshold and the second value is less than the second threshold, or when the terminal device determines that the first value is greater than the second threshold, the processing unit 1002 is specifically used to determine that the voiceprint recognition of the first user is successful.
  • In a possible implementation, when the terminal device determines that the first value is less than or equal to the first threshold, and/or the second value is greater than or equal to the second threshold, the processing unit 1002 is also used to determine that the voiceprint recognition of the first user has failed.
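The success and failure conditions can be combined into one decision function. This is a sketch, not the claimed implementation: the function name is hypothetical, T1 = 90 in the usage lines is an assumed value, and the order of the checks (the high-confidence test on T1 first, as in step S404 of FIG. 4) is an interpretation of how the two success conditions fit together. The application identifies the first threshold as T2 and the second threshold as T1, with T1 greater than T2.

```python
def voiceprint_recognized(template_score: float,
                          blacklist_score: float,
                          t1: float,
                          t2: float) -> bool:
    """Return True when voiceprint recognition of the first user succeeds.

    template_score: first value (similarity to the registered user).
    blacklist_score: second value (highest similarity to any blacklisted vector).
    t1: second threshold; t2: first threshold; t1 must exceed t2.
    """
    if t1 <= t2:
        raise ValueError("the second threshold (T1) must exceed the first (T2)")
    if template_score > t1:
        # Extremely close to the registered user: accept directly.
        return True
    # Otherwise accept only if close to the registered user and
    # not close to any blacklisted impostor.
    return template_score > t2 and blacklist_score < t1


print(voiceprint_recognized(85, 50, t1=90, t2=80))
print(voiceprint_recognized(85, 92, t1=90, t2=80))
```

With these assumed values, a speaker scoring 85 against the registered template is accepted unless the same voice also scores 90 or more against the blacklist.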
  • The voiceprint recognition device 100 may also include a communication unit 1003. Specifically, the communication unit is used to support the voiceprint recognition device 100 in performing the steps of sending data and receiving data.
  • The communication unit 1003 may be an input or output interface, a pin, a circuit, etc.
  • The voiceprint recognition device may also include a storage unit 1004.
  • The processing unit 1002 and the storage unit 1004 are connected through lines.
  • The storage unit 1004 may include one or more memories, which may be devices in one or more devices or circuits that are used to store programs or data.
  • The storage unit 1004 may exist independently and be connected to the processing unit 1002 of the voiceprint recognition device through a communication line.
  • The storage unit 1004 may also be integrated with the processing unit 1002.
  • The storage unit 1004 may store computer-executable instructions for the method in the terminal device, so that the processing unit 1002 executes the method in the above embodiments.
  • The storage unit 1004 may be a register, a cache, a RAM, etc., and may be integrated with the processing unit 1002.
  • The storage unit 1004 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, and may be independent of the processing unit 1002.
  • FIG. 11 is a schematic diagram of the hardware structure of a control device provided by an embodiment of the present application.
  • The control device includes a processor 1101, a communication line 1104 and at least one communication interface (the communication interface 1103 is used as an example in FIG. 11).
  • The processor 1101 can be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the programs of the solution of the present application.
  • The communication line 1104 may include circuitry that communicates information between the components described above.
  • The communication interface 1103 uses any device such as a transceiver to communicate with other devices or communication networks, such as Ethernet or wireless local area networks (WLAN).
  • The control device may also include a memory 1102.
  • The memory 1102 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions. It can also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • The memory may exist independently and be connected to the processor through the communication line 1104. The memory can also be integrated with the processor.
  • The memory 1102 is used to store computer-executable instructions for executing the solution of the present application, and execution is controlled by the processor 1101.
  • The processor 1101 is used to execute the computer-executable instructions stored in the memory 1102, thereby implementing the voiceprint recognition method provided by the embodiments of the present application.
  • The computer-executable instructions in the embodiments of the present application may also be called application program code, which is not specifically limited in the embodiments of the present application.
  • The processor 1101 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 11.
  • The control device may include multiple processors, such as the processor 1101 and the processor 1105 in FIG. 11.
  • Each of these processors may be a single-CPU processor or a multi-CPU processor.
  • A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
  • FIG. 12 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • The chip 120 includes one or more (including two) processors 1220 and a communication interface 1230.
  • In some implementations, the memory 1240 stores the following elements: executable modules or data structures, or subsets thereof, or extensions thereof.
  • The memory 1240 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1220.
  • A portion of the memory 1240 may also include non-volatile random access memory (NVRAM).
  • The processor 1220, the communication interface 1230 and the memory 1240 are coupled together through the bus system 1210.
  • In addition to a data bus, the bus system 1210 may also include a power bus, a control bus, a status signal bus, etc.
  • For ease of description, the various buses are all labeled as the bus system 1210 in FIG. 12.
  • The methods described in the above embodiments of the present application can be applied to the processor 1220 or implemented by the processor 1220.
  • The processor 1220 may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 1220 or by instructions in the form of software.
  • The above-mentioned processor 1220 can be a general-purpose processor (for example, a microprocessor or a conventional processor), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA). The processor 1220 can implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of the present application.
  • The steps of the method disclosed in conjunction with the embodiments of the present application can be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor.
  • The software module can be located in a storage medium that is mature in this field, such as random access memory, read-only memory, programmable read-only memory, or electrically erasable programmable read-only memory (EEPROM).
  • The storage medium is located in the memory 1240.
  • The processor 1220 reads the information in the memory 1240 and completes the steps of the above method in combination with its hardware.
  • In the above embodiments, the instructions stored in the memory for execution by the processor may be implemented in the form of a computer program product.
  • The computer program product may be written in the memory in advance, or may be downloaded and installed in the memory in the form of software.
  • A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
  • Computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
  • The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media.
  • Available media may include magnetic media (e.g., floppy disks, hard disks, or tapes), optical media (e.g., digital versatile disc (DVD)), or semiconductor media (e.g., solid state disk (SSD)), etc.
  • Computer-readable media may include computer storage media and communication media, and may include any medium that can transfer a computer program from one place to another.
  • The storage medium can be any target medium that can be accessed by the computer.
  • The computer-readable medium may include compact disc read-only memory (CD-ROM), RAM, ROM, EEPROM or other optical disk storage; the computer-readable medium may include magnetic disk storage or other magnetic disk storage devices.
  • Any connection line is also properly termed a computer-readable medium, for example coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio and microwave.
  • Disk and disc, as used here, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks typically reproduce data magnetically, while discs reproduce data optically using lasers.


Abstract

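A voiceprint recognition method and device (100), relating to the field of terminal technologies and applied to a terminal device. The method includes: the terminal device collects a first voice; if the terminal device determines that the first voice is a preset voice, the terminal device obtains the similarity score between a first voiceprint vector and a preset voiceprint vector to obtain a first value; the terminal device obtains the highest of the similarity scores between the first voiceprint vector and each voiceprint vector in a preset database to obtain a second value; when the terminal device determines that the first value is greater than a first threshold and the second value is less than a second threshold, the terminal device determines that the voiceprint recognition of a first user is successful; the second threshold is greater than the first threshold. In this way, the terminal device is woken up based on the first threshold and the second threshold, so that the terminal device can recognize voices accurately, reducing the false acceptance rate while improving the security of voiceprint recognition.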

Description

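Voiceprint recognition method and device
This application claims priority to Chinese patent application No. 202111627924.0, entitled "Voiceprint recognition method and device", filed with the China National Intellectual Property Administration on December 28, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of terminal technologies, and in particular to a voiceprint recognition method and device.
Background
With the popularization and development of the Internet, people's functional requirements for terminal devices have become increasingly diverse. For example, to simplify the way users operate a terminal device, the terminal device may allow the user to wake up the device, or certain functions of the device, by voice. Because each user's voiceprint data is unique, the terminal device can use voiceprint data to determine whether a received voice is that of the registered user (which can be understood as the owner of the terminal device).
Typically, the terminal device can score the registered user's voiceprint data and the received speaker voiceprint data based on a voiceprint model. When the score exceeds a preset threshold, the terminal device can be woken up; when the score is less than the preset threshold, the terminal device cannot be woken up.
However, the false acceptance rate of the above voiceprint recognition method is relatively high, which may threaten the user's privacy.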
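Summary
Embodiments of the present application provide a voiceprint recognition method and device. The terminal device can be provided with a voiceprint blacklist library. When the score of the received speaker voiceprint data against the registered user's voiceprint data is greater than a first threshold, and the score of the speaker voiceprint data against the voiceprint blacklist library is less than a second threshold, the terminal device is woken up, so that the terminal device can recognize the user's voice accurately, reducing the false acceptance rate while improving the security of voiceprint recognition.
In a first aspect, an embodiment of the present application provides a voiceprint recognition method applied to a terminal device. The terminal device is provided with a preset database, and the preset database includes a voiceprint vector of at least one second user; a voiceprint vector is used to represent a user's voice characteristics. The method includes: the terminal device collects a first voice, where the first voice corresponds to a first voiceprint vector; if the terminal device determines that the first voice is a preset voice, the terminal device obtains the similarity score between the first voiceprint vector and a preset voiceprint vector to obtain a first value, where the preset voiceprint vector is the voiceprint vector of a first user; the terminal device obtains the highest of the similarity scores between the first voiceprint vector and each voiceprint vector in the preset database to obtain a second value; when the terminal device determines that the first value is greater than a first threshold and the second value is less than a second threshold, the terminal device determines that the voiceprint recognition of the first user is successful; the second threshold is greater than the first threshold. In this way, the terminal device can be provided with a voiceprint blacklist library; when the score of the received speaker voiceprint data against the registered user's voiceprint data is greater than the first threshold, and the score of the speaker voiceprint data against the voiceprint blacklist library is less than the second threshold, the terminal device is woken up, so that the terminal device can recognize the user's voice accurately, reducing the false acceptance rate while improving the security of voiceprint recognition.
Here, the preset database may be the voiceprint blacklist library in the embodiments of the present application; the first value may be the registration template score in the embodiments of the present application; the second value may be the blacklist score in the embodiments of the present application; the first threshold may be T2 in the embodiments of the present application; the second threshold may be T1 in the embodiments of the present application; and the first user may be the registered user in the embodiments of the present application.
In a possible implementation, the method further includes: when the terminal device determines that the first value is greater than a third threshold and the second value is less than the first threshold, the terminal device adds the first voiceprint vector to the preset database; the first threshold is greater than the third threshold. In this way, the terminal device can add to the voiceprint blacklist library a voiceprint vector that poses a threat to the system and has low similarity to the entries already in the voiceprint blacklist library. The third threshold may be T3 in the embodiments of the present application.
In a possible implementation, the terminal device adding the first voiceprint vector to the preset database when it determines that the first value is greater than the third threshold and the second value is less than the first threshold includes: when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the signal-to-noise ratio value corresponding to the first voiceprint vector is greater than a fourth threshold, the terminal device adds the first voiceprint vector to the preset database. In this way, the terminal device can use the signal-to-noise ratio check to extract higher-quality voiceprint vectors and avoid misjudging the user's voice in a noisy environment as an impostor's voice. The fourth threshold may be the signal-to-noise ratio threshold N in the embodiments of the present application.
In a possible implementation, each voiceprint vector in the preset database is recorded with its storage time in the preset database and with its number of uses, where the number of uses is the number of times the vector was used to calculate the second value. The terminal device adding the first voiceprint vector to the preset database includes: the terminal device removes the voiceprint vector with the longest storage time in the preset database, and/or removes the voiceprint vector with the fewest uses in the preset database; the terminal device adds the first voiceprint vector to the preset database. In this way, by dynamically adjusting the entries in the voiceprint blacklist library, the terminal device can keep the voiceprint blacklist library effective and prevent an excessive amount of stored data in the library from slowing down the voiceprint recognition method.
In a possible implementation, the terminal device adding the first voiceprint vector to the preset database when it determines that the first value is greater than the third threshold and the second value is less than the first threshold includes: when the terminal device determines that the first value is greater than the third threshold and the second value is less than the first threshold, the terminal device displays a first interface, where the first interface includes: prompt information for prompting whether to add the first voiceprint vector to the preset database, a first control for adding the first voiceprint vector to the preset database, and a second control for refusing to add the first voiceprint vector to the preset database; when the terminal device receives a trigger for the first control, or does not receive a trigger for any control in the first interface within a preset time threshold, the terminal device adds the first voiceprint vector to the preset database. In this way, when the user's voice differs because of the user's voice state or the current scene, the terminal device can avoid the misoperation of directly adding that voice to the voiceprint blacklist library.
In a possible implementation, the method further includes: when the terminal device receives an operation for setting the voiceprint recognition mode, the terminal device displays a second interface; the second interface includes a third control for turning on a first recognition mode; when the terminal device receives an operation on the third control, the terminal device displays a third interface; the third interface includes a fourth control for turning on the prompt information. The terminal device displaying the first interface when it determines that the first value is greater than the third threshold and the second value is less than the first threshold includes: when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the fourth control is in the on state, the terminal device displays the first interface. In this way, users can flexibly configure additions to the voiceprint blacklist library according to their own needs, which improves the experience of using the voice wake-up function.
In a possible implementation, the method further includes: the terminal device obtains the similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain a third value; the terminal device deletes, when the first value is greater than the second threshold and the third value is greater than the first threshold, the voiceprint vector in the preset database corresponding to the first voiceprint vector. In this way, the terminal device can delete entries that were mistakenly added to the voiceprint blacklist library for some reason, thereby improving the accuracy of the voiceprint recognition method.
In a possible implementation, the method further includes: when the terminal device determines that the first value is greater than the second threshold, the terminal device obtains the similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain a third value; the terminal device deletes, when the third value is greater than the first threshold, the voiceprint vector in the preset database corresponding to the first voiceprint vector. In this way, the terminal device can delete entries that were mistakenly added to the voiceprint blacklist library for some reason, thereby improving the accuracy of the voiceprint recognition method.
In a possible implementation, the terminal device determining that the voiceprint recognition of the first user is successful when it determines that the first value is greater than the first threshold and the second value is less than the second threshold includes: when the terminal device determines that the first value is greater than the first threshold and the second value is less than the second threshold, or the terminal device determines that the first value is greater than the second threshold, the terminal device determines that the voiceprint recognition of the first user is successful. In this way, by setting a higher threshold, the terminal device can ensure that only a voice extremely similar to the registered user's voice, for example the voice of the registered user in person, passes voiceprint recognition, so that the terminal device can recognize the user's voice accurately and reduce the false acceptance rate of the system.
In a possible implementation, the method further includes: when the terminal device determines that the first value is less than or equal to the first threshold, and/or the second value is greater than or equal to the second threshold, the terminal device determines that the voiceprint recognition of the first user has failed. In this way, the terminal device is not woken up when the voice of a non-registered user is recognized, which safeguards the security of the device.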
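In a second aspect, an embodiment of the present application provides a voiceprint recognition device. A preset database is provided in the device, and the preset database includes a voiceprint vector of at least one second user; a voiceprint vector is used to represent a user's voice characteristics. The device includes: a processing unit, used to collect a first voice, where the first voice corresponds to a first voiceprint vector; if the terminal device determines that the first voice is a preset voice, the processing unit is also used to obtain the similarity score between the first voiceprint vector and a preset voiceprint vector to obtain a first value, where the preset voiceprint vector is the voiceprint vector of a first user; the processing unit is also used to obtain the highest of the similarity scores between the first voiceprint vector and each voiceprint vector in the preset database to obtain a second value; when the terminal device determines that the first value is greater than a first threshold and the second value is less than a second threshold, the processing unit is also used to determine that the voiceprint recognition of the first user is successful; the second threshold is greater than the first threshold.
In a possible implementation, when the terminal device determines that the first value is greater than a third threshold and the second value is less than the first threshold, the processing unit is also used to add the first voiceprint vector to the preset database; the first threshold is greater than the third threshold.
In a possible implementation, when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the signal-to-noise ratio value corresponding to the first voiceprint vector is greater than a fourth threshold, the processing unit is specifically used to add the first voiceprint vector to the preset database.
In a possible implementation, each voiceprint vector in the preset database is recorded with its storage time in the preset database and with its number of uses, where the number of uses is the number of times the vector was used to calculate the second value. The processing unit is specifically used to remove the voiceprint vector with the longest storage time in the preset database, and/or to remove the voiceprint vector with the fewest uses in the preset database; the processing unit is also specifically used to add the first voiceprint vector to the preset database.
In a possible implementation, when the terminal device determines that the first value is greater than the third threshold and the second value is less than the first threshold, the display unit is used to display a first interface, where the first interface includes: prompt information for prompting whether to add the first voiceprint vector to the preset database, a first control for adding the first voiceprint vector to the preset database, and a second control for refusing to add the first voiceprint vector to the preset database; when the terminal device receives a trigger for the first control, or does not receive a trigger for any control in the first interface within a preset time threshold, the processing unit is specifically used to add the first voiceprint vector to the preset database.
In a possible implementation, when the terminal device receives an operation for setting the voiceprint recognition mode, the display unit is also used to display a second interface; the second interface includes a third control for turning on a first recognition mode; when the terminal device receives an operation on the third control, the display unit is also used to display a third interface; the third interface includes a fourth control for turning on the prompt information; when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the fourth control is in the on state, the processing unit is also used to display the first interface.
In a possible implementation, the processing unit is also used to obtain the similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain a third value; the processing unit is also used to delete, when the first value is greater than the second threshold and the third value is greater than the first threshold, the voiceprint vector in the preset database corresponding to the first voiceprint vector.
In a possible implementation, when the terminal device determines that the first value is greater than the second threshold, the processing unit is also used to obtain the similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain a third value; the processing unit is also used to delete, when the third value is greater than the first threshold, the voiceprint vector in the preset database corresponding to the first voiceprint vector.
In a possible implementation, when the terminal device determines that the first value is greater than the first threshold and the second value is less than the second threshold, or the terminal device determines that the first value is greater than the second threshold, the processing unit is specifically used to determine that the voiceprint recognition of the first user is successful.
In a possible implementation, when the terminal device determines that the first value is less than or equal to the first threshold, and/or the second value is greater than or equal to the second threshold, the processing unit is also used to determine that the voiceprint recognition of the first user has failed.
In a third aspect, an embodiment of the present application provides a terminal device, including a processor and a memory, where the memory is used to store code instructions, and the processor is used to run the code instructions so that the terminal device executes the voiceprint recognition method described in the first aspect or any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing instructions that, when executed, cause a computer to execute the voiceprint recognition method described in the first aspect or any implementation of the first aspect.
In a fifth aspect, a computer program product includes a computer program that, when run, causes a computer to execute the voiceprint recognition method described in the first aspect or any implementation of the first aspect.
It should be understood that the second to fifth aspects of the present application correspond to the technical solution of the first aspect of the present application; the beneficial effects achieved by each aspect and its corresponding feasible implementations are similar and are not described again.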
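Brief Description of the Drawings
FIG. 1 is a schematic diagram of a scene provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a voiceprint recognition method;
FIG. 3 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of another voiceprint recognition method provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of determining the registration template score provided by an embodiment of the present application;
FIG. 6 is a schematic flowchart of obtaining the first voiceprint blacklist entry provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of an interface for setting a voiceprint recognition mode provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of another interface for setting the voiceprint recognition mode provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of an interface for displaying prompt information provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a voiceprint recognition device provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of the hardware structure of a control device provided by an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a chip provided by an embodiment of the present application.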
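Detailed Description
To describe the technical solutions of the embodiments of the present application clearly, words such as "first" and "second" are used in the embodiments of the present application to distinguish between identical or similar items whose functions and effects are basically the same. For example, a first value and a second value are merely different values, and no order between them is implied. Those skilled in the art will understand that words such as "first" and "second" do not limit quantity or execution order, and that items labeled "first" and "second" are not necessarily different.
It should be noted that, in this application, words such as "exemplary" or "for example" are used to indicate an example, illustration or explanation. Any embodiment or design described as "exemplary" or "for example" in this application should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present the relevant concepts in a concrete manner.
In this application, "at least one" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B can mean: A alone, both A and B, or B alone, where A and B can be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following items" or a similar expression refers to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c can mean: a, b, c, a and b, a and c, b and c, or a, b and c, where a, b and c can each be single or multiple.
A voiceprint can be the sound wave spectrum carrying speech information as displayed by an electro-acoustic instrument, and it can be used to represent a speaker's voice characteristics. A voiceprint is not only specific to a person but also relatively stable. It can be understood that, whether a speaker deliberately imitates another person's voice and tone or whispers softly, and however lifelike the imitation, the speaker's voiceprint always differs from the true voiceprint of the person being imitated. Voiceprint recognition can therefore be widely used in speaker recognition scenarios. In the embodiments of the present application, the terminal device can use the voiceprint to determine whether a received voice is the voice of the registered user, and wake up the terminal device when it determines that the received voice is the voice of the registered user.
Exemplarily, FIG. 1 is a schematic diagram of a scene provided by an embodiment of the present application. In the embodiment corresponding to FIG. 1, a mobile phone is used as an example of the terminal device. This example does not constitute a limitation on the embodiments of the present application.
As shown in FIG. 1, the scene may include a user 101, a user 102 and a mobile phone 103. The user 101 and the user 102 may be twins with extremely similar voices, and the user 101 may be the registered user of the mobile phone 103 (which can be understood as the user 101 being the owner of the mobile phone 103).
In the scene corresponding to FIG. 1, the user 101 is the registered user of the mobile phone 103, so the voiceprint data of the user 101 may be registered in the mobile phone 103. The user 101 can therefore wake up the mobile phone 103 using the voiceprint recognition method shown in FIG. 2, and use other voice commands to instruct the mobile phone 103 to perform various functions.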
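Exemplarily, FIG. 2 is a schematic flowchart of a voiceprint recognition method. As shown in FIG. 2, the voiceprint recognition method may include the following steps:
S201. The terminal device obtains microphone (MIC) data.
In the embodiments of the present application, the MIC data may be collected by the microphone of the terminal device. For example, the MIC data may be the electrical signal corresponding to the user's voice data. The MIC data may also be called speaker voiceprint data, and the term speaker voiceprint data is used in the description below.
S202. The terminal device performs wake-up word detection.
In the embodiments of the present application, the wake-up word (or command word) may be an instruction used to instruct the terminal device to perform a corresponding function. For example, the wake-up word may be an instruction used to wake up a terminal device that is in the sleep state (or low power consumption state).
S203. The terminal device calculates the speaker voiceprint vector and the registration template score based on the voiceprint model.
In the embodiments of the present application, the speaker voiceprint vector can be used to represent the speaker's voice characteristics; for example, the speaker voiceprint vector is obtained by extracting and computing acoustic features from the speaker voiceprint data in step S201. The registration template score indicates the similarity between the speaker's voice and the registered user's voice; for example, the higher the registration template score, the more similar the speaker's voice is to the registered user's voice.
S204. The terminal device determines whether the registration template score is greater than T2.
In the embodiments of the present application, when the terminal device determines that the registration template score is greater than (or greater than or equal to) T2, the terminal device can perform step S205; or, when the terminal device determines that the registration template score is less than or equal to (or less than) T2, the terminal device can perform step S206.
It can be understood that the threshold T2 can be used to determine whether the speaker's voice belongs to the registered user. For example, when the maximum value of the registration template score is 100 points, T2 can be 80 points.
S205. The terminal device determines that the decision succeeds and wakes up the terminal device.
S206. The terminal device determines that the decision fails.
It can be understood that, in the above voiceprint recognition method, to allow the user to wake up the terminal device by voice in a variety of scenes, the terminal device usually sets a relatively lenient decision condition, for example a relatively low threshold T2 such as 80 points, to ensure a high wake-up rate.
With reference to the embodiments corresponding to FIG. 1 and FIG. 2, the user 101 can successfully wake up the mobile phone 103 based on the voiceprint recognition method in the embodiment corresponding to FIG. 2. When the user 102 attempts to wake up the mobile phone 103 by voice based on the same method, because the user 102 and the user 101 are twins with extremely similar voices, the mobile phone 103 may recognize that the voice of the user 102 differs from that of the user 101 but still be woken up because of the lenient decision condition. For example, the registration template score corresponding to the user 102 may be 81 points, which exceeds the 80 points of the threshold T2, so the user 102 wakes up the mobile phone 103. This results in a high false acceptance rate and may threaten the device privacy of the user 101.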
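The single-threshold decision of FIG. 2 and the twin example can be reproduced in a few lines. This is a sketch for illustration: the function name is hypothetical, the values 80 and 81 are the example values from the text, and the score of 93 for the registered user is an assumed value.

```python
def baseline_wake(template_score: float, t2: float = 80.0) -> bool:
    """Single-threshold decision of steps S204 to S206: wake when score > T2."""
    return template_score > t2


registered_user_score = 93.0  # assumed score for user 101
twin_score = 81.0             # example score for user 102 from the text

print(baseline_wake(registered_user_score))
print(baseline_wake(twin_score))
```

Both calls return the same decision, which is the weakness described above: a single lenient threshold cannot separate the registered user from a speaker whose voice is only slightly less similar.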
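In view of this, an embodiment of the present application provides a voiceprint recognition method. The terminal device can be provided with a voiceprint blacklist library. When the score of the received speaker voiceprint data against the registered user's voiceprint data is greater than a first threshold, and the score of the speaker voiceprint data against the voiceprint blacklist library is less than a second threshold, the terminal device is woken up, so that the terminal device can recognize the user's voice accurately, reducing the false acceptance rate while improving the security of voiceprint recognition. The first threshold may be T2 as described in the embodiments of the present application, and the second threshold may be T1 as described in the embodiments of the present application.
It can be understood that the voiceprint recognition method provided by the embodiments of the present application can be used not only in the device wake-up scene shown in FIG. 1 but also in other identity authentication scenes such as payment, which is not specifically limited in the embodiments of the present application.
It can be understood that the above terminal device may also be called a terminal, user equipment (UE), a mobile station (MS), a mobile terminal (MT), etc. The terminal device may be a mobile phone with a microphone, a smart TV, a wearable device, a tablet computer (Pad), a computer with wireless transceiver functions, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote medical surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, and so on. The embodiments of the present application do not limit the specific technology or specific device form used by the terminal device.
Therefore, to better understand the embodiments of the present application, the structure of the terminal device of the embodiments of the present application is introduced below. Exemplarily, FIG. 3 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
The terminal device may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, an indicator 192, a camera 193, a display screen 194, etc.
It can be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the terminal device. In other embodiments of the present application, the terminal device may include more or fewer components than shown, combine some components, split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.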
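The processor 110 may include one or more processing units. Different processing units may be independent devices or may be integrated in one or more processors. The processor 110 may also be provided with a memory for storing instructions and data.
The USB interface 130 is an interface that complies with the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, etc. The USB interface 130 can be used to connect a charger to charge the terminal device, or to transfer data between the terminal device and peripheral devices. It can also be used to connect headphones and play audio through them. The interface can also be used to connect other devices, such as AR devices.
The charging management module 140 is used to receive charging input from a charger. The charger may be a wireless charger or a wired charger. The power management module 141 is used to connect the charging management module 140 and the processor 110.
The wireless communication function of the terminal device can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, etc.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. The antennas in the terminal device can be used to cover a single communication band or multiple communication bands. Different antennas can also be multiplexed to improve antenna utilization.
The mobile communication module 150 can provide wireless communication solutions applied to the terminal device, including 2G/3G/4G/5G. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc. The mobile communication module 150 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation.
The wireless communication module 160 can provide wireless communication solutions applied to the terminal device, including wireless local area networks (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), etc.
The terminal device implements the display function through a GPU, the display screen 194, an application processor, etc. The GPU is a microprocessor for image processing and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering.
The display screen 194 is used to display images, videos, etc. The display screen 194 includes a display panel. In some embodiments, the terminal device may include 1 or N display screens 194, where N is a positive integer greater than 1.
The terminal device can implement the shooting function through an ISP, the camera 193, a video codec, the GPU, the display screen 194, the application processor, etc.
The camera 193 is used to capture still images or video. In some embodiments, the terminal device may include 1 or N cameras 193, where N is a positive integer greater than 1.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device. The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example saving files such as music and videos in the external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area.
The terminal device can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone jack 170D, the application processor, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert analog audio input into a digital audio signal. The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. The terminal device can play music or a hands-free call through the speaker 170A. The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals. When the terminal device answers a call or a voice message, the voice can be heard by holding the receiver 170B close to the ear. The headphone jack 170D is used to connect wired headphones.
The microphone 170C, also called a "mic" or "sound transducer", is used to convert sound signals into electrical signals. In the embodiments of the present application, the terminal device can receive, through the microphone 170C, the sound signal used to wake up the terminal device and convert it into an electrical signal for subsequent processing. The terminal device may have at least one microphone 170C.
The sensor module 180 may include one or more of the following sensors: a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc. (not shown in FIG. 3).
The buttons 190 include a power button, volume buttons, etc. The buttons 190 may be mechanical buttons or touch buttons. The terminal device can receive button input and generate key signal input related to user settings and function control of the terminal device. The indicator 192 may be an indicator light, which can be used to indicate charging status and battery level changes, or to indicate messages, missed calls, notifications, etc.
The software system of the terminal device can adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, a cloud architecture, etc., which is not described again here.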
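The technical solutions of this application, and how they solve the above technical problems, are described in detail below with specific embodiments. The following specific embodiments may be implemented independently or combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
By way of example, FIG. 4 is a schematic flowchart of another voiceprint recognition method provided by an embodiment of this application. In the embodiment corresponding to FIG. 4, the terminal device may be provided with a voiceprint blacklist library used for voiceprint verification of impostors (which may also be understood as strangers or non-registered users).
As shown in FIG. 4, the voiceprint recognition method may include the following steps: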
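S401. The terminal device acquires MIC data.
The MIC data may be referred to as speaker voiceprint data.
S402. The terminal device performs wake-word detection.
By way of example, in a scenario where a wake word is used to wake up a terminal device in the sleep state, the wake word may be 你好悠悠 (roughly "Hello, Youyou"); or, in a scenario where a wake word is used for payment, the wake word may be "confirm payment". It can be understood that the wake word may be set according to the actual application scenario, which is not limited in the embodiments of this application.
By way of example, the terminal device may acquire speaker voiceprint data in real time and perform wake-word detection on it. When the wake word is detected, the terminal device may perform the step shown in S403.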
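S403. The terminal device calculates the speaker voiceprint vector, the registration template score, and the blacklist score based on a voiceprint model.
In the embodiments of this application, the speaker voiceprint vector may be used to represent the voice features of the speaker; the registration template score indicates the similarity between the speaker's voice and the registered user's voice; and the blacklist score indicates the similarity between the speaker's voice and the impostor voices. By way of example, the terminal device may obtain the blacklist score corresponding to the speaker voiceprint data based on the voiceprint blacklist library that stores impostor voiceprint vectors. The impostor voiceprint vectors stored in the voiceprint blacklist library may be used to represent the voice features of impostors.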
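In the embodiments of this application, the terminal device may calculate the speaker voiceprint vector and the registration template score based on the voiceprint model. By way of example, FIG. 5 is a schematic flowchart of determining the registration template score provided by an embodiment of this application.
As shown in FIG. 5, one possible implementation of calculating the registration template score based on the voiceprint model is as follows: the terminal device separately acquires the speaker voiceprint data and the registered user voiceprint data, and extracts the speaker acoustic features corresponding to the speaker voiceprint data and the registered user acoustic features corresponding to the registered user voiceprint data. The terminal device inputs the speaker acoustic features and the registered user acoustic features into the voiceprint model to obtain the speaker voiceprint vector and the registered user voiceprint vector. Further, the terminal device may use methods such as cosine scoring and probabilistic linear discriminant analysis (PLDA) to compare the speaker voiceprint vector with the registered user voiceprint vector and obtain the registration template score corresponding to the speaker voiceprint vector.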
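The cosine scoring mentioned above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `cosine_score` and the use of plain Python lists as voiceprint vectors are assumptions, and PLDA scoring is not shown.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two voiceprint vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector carries no direction, so treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


# Registration template score: speaker vector vs. the stored registered-user vector.
registered_vec = [0.2, 0.9, 0.4]   # assumed to be cached after enrollment
speaker_vec = [0.25, 0.85, 0.35]
registration_score = cosine_score(speaker_vec, registered_vec)
```

The result lies in [-1, 1]; how the device maps it onto the score scale that T1, T2 and T3 are compared against is not specified in the text.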
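It can be understood that, after the registered user voiceprint vector is first calculated based on the voiceprint model, the terminal device may store it, so that it does not have to be recalculated when registration template scores are later calculated for other speakers.
In the embodiments of this application, the terminal device may calculate the blacklist score based on the voiceprint model. By way of example, one possible implementation is as follows: the terminal device may be provided with a voiceprint blacklist library that stores at least one voiceprint blacklist entry, and each entry may correspond to the voiceprint vector of one impostor. As shown in FIG. 4, the voiceprint blacklist library may store voiceprint blacklist entry 1, voiceprint blacklist entry 2, ..., and voiceprint blacklist entry M, where M is a positive integer. Further, the terminal device may use the voiceprint model to compare the speaker voiceprint vector with each voiceprint vector in the voiceprint blacklist library, and take the highest similarity score as the blacklist score.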
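As a rough sketch of "take the highest similarity score as the blacklist score", assuming cosine similarity as the comparison and lists of floats as vectors (both are illustrative assumptions, as are all function names):

```python
import math


def cosine_score(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def per_entry_scores(speaker_vec, blacklist):
    """Similarity of the speaker against each of the M blacklist entries."""
    return [cosine_score(speaker_vec, entry) for entry in blacklist]


def blacklist_score(speaker_vec, blacklist):
    """Highest per-entry similarity, or None when the library is empty."""
    scores = per_entry_scores(speaker_vec, blacklist)
    return max(scores) if scores else None
```

Keeping the per-entry scores around is convenient because step S412 later needs every entry's score, not only the maximum.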
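In the embodiments of this application, the voiceprint model described above may include one or more of the following, for example: a Gaussian mixture model (GMM), a GMM-universal background model (GMM-UBM), a GMM-support vector machine (GMM-SVM), joint factor analysis (JFA), the GMM-based i-vector method, the d-vector method based on deep neural networks (DNN), or the x-vector method based on neural networks (NNET). The embodiments of this application do not specifically limit the voiceprint model used.
In the embodiments of this application, the terminal device may extract acoustic features using one or more of the following methods, for example: mel-scale frequency cepstral coefficients (MFCC), filterbank (FBank) features, or linear prediction coefficients (LPC). The embodiments of this application do not specifically limit the method for extracting acoustic features.
It can be understood that the voiceprint model and the method for extracting acoustic features are not limited to those described above, and the embodiments of this application impose no limitation in this respect.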
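S404. The terminal device determines whether the registration template score > T1.
In the embodiments of this application, when the terminal device determines that the registration template score is greater than (or greater than or equal to) T1, the terminal device may perform the step shown in S410; or, when the terminal device determines that the registration template score is less than or equal to (or less than) T1, the terminal device may perform the step shown in S405.
The relationship between T1 and the T2 in the step shown in S206 may be T1 > T2, for example T1 = T2 × N, where N may range from 1.5 to 2. This is not specifically limited in the embodiments of this application.
It can be understood that, by setting a high threshold T1, the terminal device ensures that only a voice that is extremely similar to the registered user's voice, such as the registered user's own voice, can pass voiceprint recognition. This allows the terminal device to recognize the user's voice precisely and lowers the system's false acceptance rate.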
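S405. The terminal device determines whether the registration template score > T2 and the blacklist score < T1.
In the embodiments of this application, when the terminal device determines that the registration template score is greater than (or greater than or equal to) T2 and the blacklist score is less than (or less than or equal to) T1, the terminal device may perform the step shown in S410; or, when the terminal device determines that this condition is not met, the terminal device may perform the steps shown in S406 and S409.
The condition is not met when the terminal device determines that the registration template score is less than or equal to (or less than) T2, that the blacklist score is greater than or equal to (or greater than) T1, or both.
It can be understood that, by checking whether the registration template score is greater than T2 and whether the blacklist score is less than T1, the terminal device can lower the false acceptance rate while raising the success rate of the voiceprint recognition method.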
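On the one hand, when the terminal device, based on the voiceprint recognition method corresponding to FIG. 2, determines that the registration template score of the received speaker voiceprint data is 81, which is greater than a T2 of 80, the terminal device may determine that the decision succeeds and wake up the terminal device. Because this decision condition is fairly loose, a speaker's voice scoring close to the threshold T2 may well be the voice of an impostor that resembles the registered user's voice, and an impostor waking up the terminal device leads to a higher false acceptance rate. Therefore, the terminal device may further check the relationship between the blacklist score of the speaker voiceprint data and T1, for example that the blacklist score is less than T1, to ensure that the current speaker's voice does not belong to an impostor recorded by the terminal device, thereby lowering the false acceptance rate while raising the success rate of voiceprint recognition.
On the other hand, when the terminal device uses the high threshold T1 in the step shown in S404 to recognize the voice precisely, the recognition is strict, so the terminal device may fail to recognize the user's voice in different scenarios or vocal conditions. For example, it may fail to recognize the user's voice when the user has a cold, which lowers the success rate. Therefore, the terminal device may set the lower threshold T2 to ensure a high success rate, and use the relationship between the blacklist score of the speaker voiceprint data and T1, for example that the blacklist score is less than T1, to ensure that the current speaker's voice does not belong to an impostor recorded by the terminal device, thereby raising the success rate of voiceprint recognition while keeping the false acceptance rate low.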
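The two-stage decision in S404 and S405 can be summarized in a few lines. The sketch below is illustrative only: the function name `verify` is hypothetical, strict inequalities are used (the text also allows the non-strict variants), and the numeric values are assumptions chosen so that T1 = T2 × 1.5, reusing the T2 = 80 example from the text. The score scale itself is not specified in the source.

```python
def verify(registration_score, blacklist_score, t1, t2):
    """Return True if the decision succeeds (S410), False otherwise (S409)."""
    if t1 <= t2:
        raise ValueError("expected T1 > T2")
    # S404: a very high registration score passes on its own.
    if registration_score > t1:
        return True
    # S405: a moderate registration score passes only if the speaker
    # does not resemble any recorded impostor.
    return registration_score > t2 and blacklist_score < t1


T2 = 80.0
T1 = T2 * 1.5  # N = 1.5, the lower end of the 1.5-2 range given in the text

accepted = verify(81.0, 50.0, T1, T2)   # near-threshold voice, not blacklisted
rejected = verify(81.0, 125.0, T1, T2)  # same score, but close to an impostor
```

Under these assumptions the first call succeeds and the second fails, which is exactly the case the blacklist check is meant to separate.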
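S406. The terminal device determines whether the registration template score > T3 and the blacklist score < T2.
In the embodiments of this application, when the terminal device determines that the registration template score is greater than (or greater than or equal to) T3 and the blacklist score is less than (or less than or equal to) T2, the terminal device may perform the step shown in S407; or, when this condition is not met, the terminal device may end the procedure of adding the current speaker voiceprint vector to the voiceprint blacklist library.
The condition is not met when the terminal device determines that the registration template score is less than or equal to (or less than) T3, that the blacklist score is greater than or equal to (or greater than) T2, or both.
The relationship between T2 and T3 may be T2 > T3, for example T3 = T2 × Q, where Q may range from 0.5 to 0.9. This is not specifically limited in the embodiments of this application.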
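It can be understood that, by checking whether the registration template score is greater than T3 and whether the blacklist score is less than T2, the terminal device can decide whether to add a voice that poses a threat to the terminal device to the voiceprint blacklist library.
Specifically, when the terminal device determines that the registration template score is greater than T3 (having already failed the checks above), the similarity between the currently received voice and the registered user's voice is relatively low but not negligible; for example, the received voice may be one that poses a threat to the device.
When the terminal device determines that the blacklist score is less than T2, the currently received voice does not belong to any impostor stored in the voiceprint blacklist library. Therefore, the terminal device may add the speaker voiceprint vector of this voice, which poses a threat to the terminal device and is not yet in the voiceprint blacklist library, to the voiceprint blacklist library, further safeguarding the security of the voiceprint recognition method. When the terminal device determines that the blacklist score is greater than or equal to T2, the speaker voiceprint vector of the current speaker's voice is already in the voiceprint blacklist library and does not need to be added again.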
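The threshold relationships (T1 = T2 × N, T3 = T2 × Q) and the S406 admission check can be sketched as below. The helper names and the default multipliers are assumptions for illustration; the text only gives the ranges 1.5 to 2 for N and 0.5 to 0.9 for Q.

```python
def derive_thresholds(t2, n=1.5, q=0.8):
    """Return (T1, T2, T3) with T1 = T2 * N and T3 = T2 * Q."""
    if not 1.5 <= n <= 2.0:
        raise ValueError("N is expected in the range 1.5-2")
    if not 0.5 <= q <= 0.9:
        raise ValueError("Q is expected in the range 0.5-0.9")
    return t2 * n, t2, t2 * q


def is_blacklist_candidate(registration_score, blacklist_score, t2, t3):
    """S406: somewhat similar to the user, and not already blacklisted."""
    return registration_score > t3 and blacklist_score < t2


t1, t2, t3 = derive_thresholds(80.0, n=2.0, q=0.5)
candidate = is_blacklist_candidate(70.0, 30.0, t2, t3)
```

Because S406 is only reached after S404 and S405 have failed, and T1 > T2, a voice that reaches S406 with a blacklist score below T2 must have a registration score of at most T2, so candidates effectively sit in the band between T3 and T2.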
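S407. The terminal device determines whether the signal-to-noise ratio > N dB.
In the embodiments of this application, the signal-to-noise ratio indicates the ratio of the user's voice signal to the noise signal in the environment. When the terminal device determines that the signal-to-noise ratio is greater than (or greater than or equal to) N dB, the terminal device may perform the step shown in S408; or, when the terminal device determines that the signal-to-noise ratio is less than or equal to (or less than) N dB, the terminal device may end the procedure of adding the current speaker voiceprint vector to the voiceprint blacklist library.
It can be understood that the signal-to-noise check lets the terminal device extract only high-quality voiceprint vectors and avoids misjudging the user's voice in a noisy environment as an impostor's voice.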
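For reference, a signal-to-noise ratio in decibels is conventionally computed as 10·log10 of the power ratio. The sketch below assumes the device already has separate sample sequences for the voice segment and for background noise; how that separation is done is not described in the text, and the 15 dB threshold is a made-up value.

```python
import math


def snr_db(signal, noise):
    """SNR in dB from mean power of voice samples and noise samples."""
    if not signal or not noise:
        raise ValueError("both sample sequences must be non-empty")
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        return math.inf
    if p_signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(p_signal / p_noise)


SNR_THRESHOLD_DB = 15.0  # hypothetical value for "N dB"
clean_enough = snr_db([1.0, -1.0], [0.1, -0.1]) > SNR_THRESHOLD_DB
```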
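S408. The terminal device obtains the current speaker voiceprint vector and adds it to the voiceprint blacklist library.
In the embodiments of this application, the voiceprint blacklist library may store multiple voiceprint blacklist entries, for example voiceprint blacklist entry 1, voiceprint blacklist entry 2, ..., voiceprint blacklist entry M. When the library can store only M entries and the current, (M+1)-th speaker voiceprint vector needs to be added, the terminal device may determine which entry to evict according to the time each entry was added and/or the number of times each entry has been used.
By way of example, when the (M+1)-th speaker voiceprint vector needs to be added, the terminal device may evict the entry that was added longest ago among the M entries; or the terminal device may evict the least-used entry among the M entries; or the terminal device may evict, among the P least-used entries of the M entries, the one that was added longest ago. Here M is greater than (or greater than or equal to) P.
In a possible implementation, the terminal device may also clean up the voiceprint blacklist library automatically at regular intervals, for example once a day or every 4 hours, based on the use count and the time added of each entry.
It can be understood that, by dynamically adjusting the entries in the voiceprint blacklist library, the terminal device keeps the library effective and avoids the slowdown that storing too much data in the library would cause for the voiceprint recognition method.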
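The three eviction options described above can be expressed with a single parameter P: with P = 1 the least-used entry is evicted, with P = M the oldest entry is evicted, and values in between give the combined rule. The data layout (`BlacklistEntry`, `added_at`, `use_count`) and the function names are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class BlacklistEntry:
    vector: list          # impostor voiceprint vector
    added_at: float       # timestamp when the entry was stored
    use_count: int = 0    # times this entry produced the blacklist score


def pick_victim(entries, p):
    """Index of the oldest entry among the p least-used entries."""
    if not entries:
        raise ValueError("blacklist is empty")
    p = max(1, min(p, len(entries)))
    by_use = sorted(range(len(entries)), key=lambda i: entries[i].use_count)
    return min(by_use[:p], key=lambda i: entries[i].added_at)


def add_entry(entries, new_entry, capacity, p=1):
    """Append new_entry, evicting first if the library is full."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    while len(entries) >= capacity:
        del entries[pick_victim(entries, p)]
    entries.append(new_entry)
```

A periodic cleanup job, as mentioned in the text, could reuse `pick_victim` with whatever schedule the device chooses.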
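S409. The terminal device determines that the decision fails.
It can be understood that, when the terminal device determines that the currently received voice is that of a non-registered user, this round of verification fails. By way of example, when voice is used to wake up a terminal device in the sleep state and the terminal device determines that the decision fails, it may remain in the sleep state.
S410. The terminal device determines that the decision succeeds and wakes up the terminal device.
By way of example, when voice is used to wake up a terminal device in the sleep state and the terminal device determines that the decision succeeds, the terminal device may be woken up; for example, it may turn on the screen and play a voice message. When the user wakes up the terminal device with 你好悠悠, the terminal device may, after the decision succeeds, play a message such as "I'm here" or another voice message.
In a possible implementation, after S410, the terminal device may verify the voiceprint blacklist library based on the steps shown in S411-S413.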
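S411. The terminal device determines whether the registration template score > T1.
In the embodiments of this application, when the terminal device determines that the registration template score is greater than (or greater than or equal to) T1, the terminal device may perform the step shown in S412; or, when the terminal device determines that the registration template score is less than or equal to (or less than) T1, the terminal device may end the verification of the voiceprint blacklist library.
It can be understood that, by checking whether the registration template score is greater than T1, the terminal device filters out the registered user's voice.
S412. The terminal device determines whether the blacklist score > T2.
In this step, the blacklist score may be the score of the speaker voiceprint vector against each entry in the voiceprint blacklist library (which may be understood as the similarity score between the speaker's voice and each impostor voice in the library), rather than the maximum of those scores. For example, when five blacklist scores of the speaker voiceprint vector in the voiceprint blacklist library are greater than T2, the terminal device may extract the five blacklist entries whose scores are greater than T2.
When the terminal device determines that a blacklist score is greater than (or greater than or equal to) T2, the terminal device may perform the step shown in S413; or, when the terminal device determines that the blacklist score is less than or equal to (or less than) T2, the terminal device may end the verification of the voiceprint blacklist library.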
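In one implementation, the blacklist scores may be calculated by the terminal device in the step shown in S403 and saved on the device, so that the terminal device can retrieve them in the step shown in S412. By way of example, in the step shown in S403 the terminal device may calculate, based on the voiceprint model, the M blacklist scores of the speaker voiceprint vector against the M entries in the voiceprint blacklist library and store them on the device; when performing the step shown in S412, it retrieves the M blacklist scores and identifies the entries whose blacklist scores are greater than T2.
In another implementation, the blacklist scores may instead be calculated in the step shown in S412 based on the voiceprint blacklist library and the speaker voiceprint vector. By way of example, in the step shown in S410 the terminal device may be woken up when the registration template score is greater than T1, or when the registration template score is greater than T2 and the blacklist score is less than T1; then, in the step shown in S412, the terminal device calculates, based on the voiceprint model, the M blacklist scores of the speaker voiceprint vector against the M entries in the voiceprint blacklist library, and obtains the entries whose blacklist scores are greater than T2. It can be understood that calculating the blacklist scores in the step shown in S412 can speed up waking the device based on voiceprint data.
It can be understood that, by checking whether the registration template score > T1 and the blacklist score > T2, the terminal device can identify voiceprint vectors of the registered user that were mistakenly added to the voiceprint blacklist library.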
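S413. The terminal device deletes the corresponding voiceprint blacklist entries.
By way of example, the terminal device may delete all entries in the voiceprint blacklist library for which the registration template score > T1 and the blacklist score > T2.
It can be understood that S411-S413 are used to delete entries that entered the voiceprint blacklist library by mistake for some reason, thereby improving the accuracy of the voiceprint recognition method.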
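Steps S411 to S413 amount to a small filter over the per-entry blacklist scores. The following sketch uses a hypothetical function name and strict inequalities; the threshold values in the example are illustrative assumptions.

```python
def entries_to_purge(registration_score, per_entry_scores, t1, t2):
    """Indices of blacklist entries that look like the registered user.

    S411: only act when the speaker is confidently the registered user.
    S412: compare each entry's own score (not just the maximum) with T2.
    S413: the caller deletes the returned entries.
    """
    if registration_score <= t1:
        return []
    return [i for i, score in enumerate(per_entry_scores) if score > t2]


to_delete = entries_to_purge(130.0, [50.0, 90.0, 85.0], t1=120.0, t2=80.0)
```

When deleting from a list by index, removing the indices in descending order avoids shifting the positions of entries that are still to be removed.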
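Based on this, the terminal device may be provided with a voiceprint blacklist library and score the received speaker voiceprint vector against both the registered user voiceprint vector and the voiceprint vectors in the blacklist library, so that the terminal device can recognize the user's voice precisely, lowering the false acceptance rate while improving the security of voiceprint recognition.
In a possible implementation, when no voiceprint blacklist library has been set up on the terminal device, the terminal device may obtain the first voiceprint blacklist entry based on the embodiment corresponding to FIG. 6 below. By way of example, FIG. 6 is a schematic flowchart of obtaining the first voiceprint blacklist entry provided by an embodiment of this application.
As shown in FIG. 6, the method for obtaining the first voiceprint blacklist entry may include the following steps: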
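S601. The terminal device acquires MIC data.
The MIC data may be speaker voiceprint data.
S602. The terminal device performs wake-word detection.
S603. The terminal device calculates the speaker voiceprint vector and the registration template score based on the voiceprint model.
For the process of calculating the speaker voiceprint vector and the registration template score, refer to the step shown in S403; details are not repeated here.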
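S604. The terminal device determines whether the registration template score > T2.
In the embodiments of this application, when the terminal device determines that the registration template score is greater than (or greater than or equal to) T2, the terminal device may perform the step shown in S605; or, when the terminal device determines that the registration template score is less than or equal to (or less than) T2, the terminal device may perform the step shown in S606.
S605. The terminal device determines that the decision succeeds and wakes up the terminal device.
S606. The terminal device determines whether the registration template score > T3.
In the embodiments of this application, when the terminal device determines that the registration template score is greater than (or greater than or equal to) T3, the terminal device may perform the step shown in S607; or, when the terminal device determines that the registration template score is less than or equal to (or less than) T3, the terminal device may end the procedure of adding the current voiceprint to the voiceprint blacklist library.
It can be understood that the terminal device may use the threshold T3 to pick out voices that pose a threat to the system and add them to the voiceprint blacklist library.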
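S607. The terminal device determines whether the signal-to-noise ratio > N dB.
When the terminal device determines that the signal-to-noise ratio is greater than (or greater than or equal to) N dB, the terminal device may perform the step shown in S608; or, when the signal-to-noise ratio is less than or equal to (or less than) N dB, the terminal device may end the procedure of adding the current voiceprint to the voiceprint blacklist library.
S608. The terminal device obtains the current speaker voiceprint vector and adds it to the voiceprint blacklist library.
It can be understood that the voiceprint blacklist library then stores voiceprint blacklist entry 1, corresponding to the current speaker voiceprint vector.
Based on this, when the registration template score is greater than T3, the terminal device can add a voice that poses a threat to the device to the voiceprint blacklist library, so that the library can be used for subsequent voiceprint recognition.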
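The FIG. 6 flow (S604 to S608), which runs before any blacklist exists, can be condensed as follows. The function name, the returned pair, and the numeric thresholds are illustrative assumptions; strict inequalities are used although the text also allows the non-strict variants.

```python
def first_blacklist_step(registration_score, snr_db, t2, t3, snr_threshold_db):
    """Return (wake_device, add_to_blacklist) for one wake-word attempt."""
    # S604/S605: without a blacklist, T2 alone decides success.
    if registration_score > t2:
        return True, False
    # S606/S607: a failed but somewhat similar voice, recorded cleanly,
    # becomes the first blacklist entry (S608).
    add = registration_score > t3 and snr_db > snr_threshold_db
    return False, add


# Assumed values: T2 = 80, T3 = 64 (Q = 0.8), SNR threshold 15 dB.
result = first_blacklist_step(70.0, 30.0, t2=80.0, t3=64.0, snr_threshold_db=15.0)
```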
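On the basis of the embodiment corresponding to FIG. 4, in a possible implementation, the terminal device may support voiceprint recognition in different modes, such as a high recognition rate mode and a low recognition rate mode.
In the embodiments of this application, the high recognition rate mode may be understood as a mode that provides precise recognition. In this mode, only a voice that is extremely similar to the registered user's voice, or one that does not belong to any impostor in the voiceprint blacklist library stored on the terminal device, can pass recognition, so recognition accuracy is high. The high recognition rate mode may correspond to the voiceprint recognition method described in the embodiment corresponding to FIG. 4.
The low recognition rate mode may be understood as a mode that provides a high recognition success rate. In this mode, the user's voice can be recognized in different scenarios or vocal conditions, so the recognition success rate is high. The low recognition rate mode may correspond to the voiceprint recognition method described in the embodiment corresponding to FIG. 2.
By way of example, FIG. 7 is a schematic diagram of an interface for setting the voiceprint recognition mode provided by an embodiment of this application. In the embodiment corresponding to FIG. 7, the terminal device is a mobile phone by way of example, and this example does not constitute a limitation on the embodiments of this application.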
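When the phone receives a user operation for setting the voice wake-up function, the phone may display the interface shown in a of FIG. 7. The interface may display a control for setting user information, a control for setting power-key wake-up, a control 701 for setting voice wake-up, a control for viewing more functions, and the like.
In the interface shown in a of FIG. 7, when the phone receives a user operation triggering the control 701 for setting voice wake-up, the phone may display the interface shown in b of FIG. 7. The interface shown in b of FIG. 7 includes a control 702 for turning on voice wake-up, and the like.
In the interface shown in b of FIG. 7, when the phone receives a user operation triggering the control 702 for turning on voice wake-up, the phone may display the interface shown in c of FIG. 7. The interface shown in c of FIG. 7 may include: a control for turning off voice wake-up, a control 703 for setting the high recognition rate mode, a control 704 for setting the low recognition rate mode, a control for setting the wake-up command, and the like. The wake-up command may be 你好悠悠.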
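In a possible implementation, in the interface shown in c of FIG. 7, when the phone receives a user operation triggering the control 703 for setting the high recognition rate mode, the phone may perform voiceprint recognition on the received speaker voiceprint data based on the voiceprint blacklist library and the registered user voiceprint data.
In a possible implementation, in the interface shown in c of FIG. 7, when the phone receives a user operation triggering the control 704 for setting the low recognition rate mode, the phone may perform voiceprint recognition on the received speaker voiceprint data based on the registered user voiceprint data.
Based on this, the user can flexibly set the voiceprint recognition mode according to their own needs, which improves the experience of using the voice wake-up function.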
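One way to read the two modes is as a dispatch between the FIG. 2 decision (registration score against T2 only) and the FIG. 4 decision (T1, T2 and the blacklist score). The sketch below encodes that reading; the mode strings and function name are hypothetical, and the FIG. 2 rule is inferred from the "81 versus 80" example in the text rather than from a full description of FIG. 2.

```python
def recognize(mode, registration_score, blacklist_score, t1, t2):
    """Return True if the speaker passes in the selected mode."""
    if mode == "low":
        # Low recognition rate mode: registered-user data only.
        return registration_score > t2
    if mode == "high":
        # High recognition rate mode: registered-user data plus blacklist.
        return registration_score > t1 or (
            registration_score > t2 and blacklist_score < t1
        )
    raise ValueError(f"unknown mode: {mode!r}")


# The same impostor-like voice passes in low mode but not in high mode.
low_result = recognize("low", 81.0, 125.0, t1=120.0, t2=80.0)
high_result = recognize("high", 81.0, 125.0, t1=120.0, t2=80.0)
```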
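Further, on the basis of the embodiment corresponding to FIG. 7, the user may also, by turning on the high recognition rate mode, configure whether a prompt is shown when an impostor recognized in that mode is to be added to the voiceprint blacklist library. By way of example, FIG. 8 is a schematic diagram of another interface for setting the voiceprint recognition mode provided by an embodiment of this application.
In the interface shown in a of FIG. 8, when the phone receives a user operation triggering the control 703 for setting the high recognition rate mode, the phone may display the interface shown in b of FIG. 8. The interface shown in b of FIG. 8 may include a control 801, associated with the high recognition rate mode, for turning on the add-to-blacklist prompt. The add-to-blacklist prompt may be understood as follows: when the phone recognizes a voice that does not belong to the registered user (that is, an impostor's voice), it shows a prompt asking whether to add that voice to the voiceprint blacklist library. The interface shown in a of FIG. 8 is similar to the interface shown in c of FIG. 7 and is not described again here.
In the interface shown in b of FIG. 8, when the phone receives a user operation triggering the control 801 for turning on the add-to-blacklist prompt, the phone shows prompt information whenever it recognizes a voice that does not belong to the registered user; or, when the phone has not received a user operation triggering the control 801, the phone may by default add any detected voice that does not belong to the registered user to the voiceprint blacklist library.
Based on this, the user can flexibly configure how voices are added to the voiceprint blacklist library according to their own needs, which improves the experience of using the voice wake-up function.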
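On the basis of the embodiment corresponding to FIG. 8, when the user turns on the add-to-blacklist prompt, the terminal device may display prompt information. By way of example, FIG. 9 is a schematic diagram of an interface displaying prompt information provided by an embodiment of this application.
With the phone in the sleep state (or the screen-off state), when the phone, in the high recognition rate mode of the voice wake-up function, receives speaker voiceprint data and determines that the corresponding registration template score is greater than (or greater than or equal to) T3, the blacklist score is less than (or less than or equal to) T2, and the signal-to-noise ratio is greater than (or greater than or equal to) N dB, the phone may obtain the speaker voiceprint vector corresponding to the speaker voiceprint data and display the interface shown in FIG. 9. The interface shown in FIG. 9 may display: prompt information 901, a confirm control 902 for adding the current speaker voiceprint vector to the voiceprint blacklist library, and a cancel control 903 for declining to add the current speaker voiceprint vector to the voiceprint blacklist library. The prompt information 901 may be: "An impostor's voice was detected. Please confirm whether to add this voice to the voiceprint blacklist library."
In a possible implementation, when the phone receives no user operation on the confirm control 902 or the cancel control 903 within a time threshold after displaying the prompt information 901, the phone may by default add the current voiceprint data to the voiceprint blacklist library.
Based on this, when the user's voice differs because of their vocal condition or the scenario they are in, the terminal device can avoid the mistake of adding that voice directly to the voiceprint blacklist library.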
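The prompt behaviour described for FIG. 8 and FIG. 9 reduces to a small decision table, sketched below. The function name and the string values for the user's choice are assumptions; `None` stands for "no operation received within the time threshold".

```python
def should_add_to_blacklist(prompt_enabled, user_choice=None):
    """Decide whether a detected impostor voice is added to the blacklist.

    user_choice: "confirm" (control 902), "cancel" (control 903),
    or None if the prompt timed out without any user operation.
    """
    if not prompt_enabled:
        # Prompt switched off: detected voices are added by default.
        return True
    if user_choice == "cancel":
        return False
    if user_choice in ("confirm", None):
        # Explicit confirmation, or timeout treated as confirmation.
        return True
    raise ValueError(f"unexpected choice: {user_choice!r}")
```

Only an explicit cancel keeps the voice out of the library, which is how a registered user with a cold can stop their own voice from being blacklisted.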
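It can be understood that the interfaces provided in the above embodiments are only examples and do not constitute a limitation on the embodiments of this application.
The method provided by the embodiments of this application has been described above with reference to FIG. 4 to FIG. 9. The apparatus provided by the embodiments of this application for performing the above method is described below. FIG. 10 is a schematic structural diagram of a voiceprint recognition apparatus provided by an embodiment of this application. The voiceprint recognition apparatus may be the terminal device in the embodiments of this application, or a chip or chip system in the terminal device.
As shown in FIG. 10, the voiceprint recognition apparatus 100 may be used in a communication device, a circuit, a hardware component, or a chip, and includes a display unit 1001 and a processing unit 1002. The display unit 1001 supports the display steps performed by the voiceprint recognition apparatus 100; the processing unit 1002 supports the information processing steps performed by the voiceprint recognition apparatus 100.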
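An embodiment of this application provides a voiceprint recognition apparatus 100. The apparatus is provided with a preset database that includes a voiceprint vector of at least one second user; a voiceprint vector is used to represent the voice features of a user. The apparatus includes: a processing unit 1002, configured to collect a first voice, the first voice corresponding to a first voiceprint vector; if the terminal device determines that the first voice is a preset voice, the processing unit 1002 is further configured to obtain the similarity score between the first voiceprint vector and a preset voiceprint vector to obtain a first value, the preset voiceprint vector being the voiceprint vector of a first user; the processing unit 1002 is further configured to obtain the highest of the similarity scores between the first voiceprint vector and each voiceprint vector in the preset database to obtain a second value; when the terminal device determines that the first value is greater than a first threshold and the second value is less than a second threshold, the processing unit 1002 is further configured to determine that voiceprint recognition of the first user succeeds; the second threshold is greater than the first threshold.
In a possible implementation, when the terminal device determines that the first value is greater than a third threshold and the second value is less than the first threshold, the processing unit 1002 is further configured to add the first voiceprint vector to the preset database; the first threshold is greater than the third threshold.
In a possible implementation, when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the signal-to-noise ratio value corresponding to the first voiceprint vector is greater than a fourth threshold, the processing unit 1002 is specifically configured to add the first voiceprint vector to the preset database.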
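In a possible implementation, each voiceprint vector in the preset database records its storage time in the preset database and a use count, the use count being the number of times the vector produced the second value. The processing unit 1002 is specifically configured to evict the voiceprint vector with the longest storage time in the preset database, and/or evict the voiceprint vector with the lowest use count in the preset database; the processing unit 1002 is further specifically configured to add the first voiceprint vector to the preset database.
In a possible implementation, when the terminal device determines that the first value is greater than the third threshold and the second value is less than the first threshold, the display unit 1001 is configured to display a first interface. The first interface includes: prompt information asking whether to add the first voiceprint vector to the preset database, a first control for adding the first voiceprint vector to the preset database, and a second control for declining to add the first voiceprint vector to the preset database. When the terminal device receives a trigger on the first control, or receives no trigger on any control in the first interface within a preset time threshold, the processing unit 1002 is specifically configured to add the first voiceprint vector to the preset database.
In a possible implementation, when the terminal device receives an operation for setting the voiceprint recognition mode, the display unit 1001 is further configured to display a second interface; the second interface includes a third control for turning on a first recognition mode. When the terminal device receives an operation on the third control, the display unit 1001 is further configured to display a third interface; the third interface includes a fourth control for turning on the prompt information. When the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the fourth control is on, the processing unit 1002 is further configured to display the first interface.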
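In a possible implementation, the processing unit 1002 is further configured to obtain the similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain third values; the processing unit 1002 is further configured to delete, when the first value is greater than the second threshold and a third value is greater than the first threshold, the voiceprint vector in the preset database that corresponds to the first voiceprint vector.
In a possible implementation, when the terminal device determines that the first value is greater than the second threshold, the processing unit 1002 is further configured to obtain the similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain third values; the processing unit 1002 is further configured to delete, when a third value is greater than the first threshold, the voiceprint vector in the preset database that corresponds to the first voiceprint vector.
In a possible implementation, when the terminal device determines that the first value is greater than the first threshold and the second value is less than the second threshold, or the terminal device determines that the first value is greater than the second threshold, the processing unit 1002 is specifically configured to determine that voiceprint recognition of the first user succeeds.
In a possible implementation, when the terminal device determines that the first value is less than or equal to the first threshold, and/or the second value is greater than or equal to the second threshold, the processing unit 1002 is further configured to determine that voiceprint recognition of the first user fails.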
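In a possible implementation, the voiceprint recognition apparatus 100 may also include a communication unit 1003. Specifically, the communication unit supports the data sending and data receiving steps performed by the voiceprint recognition apparatus 100. The communication unit 1003 may be an input or output interface, a pin, a circuit, or the like.
In a possible embodiment, the voiceprint recognition apparatus may further include a storage unit 1004. The processing unit 1002 and the storage unit 1004 are connected by a line. The storage unit 1004 may include one or more memories, and a memory may be a component in one or more devices or circuits that is used to store programs or data. The storage unit 1004 may exist independently and be connected through a communication line to the processing unit 1002 of the voiceprint recognition apparatus. The storage unit 1004 may also be integrated with the processing unit 1002.
The storage unit 1004 may store computer-executable instructions of the method in the terminal device, so that the processing unit 1002 performs the method in the above embodiments. The storage unit 1004 may be a register, a cache, a RAM, or the like, in which case it may be integrated with the processing unit 1002. The storage unit 1004 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, in which case it may be independent of the processing unit 1002.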
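FIG. 11 is a schematic diagram of the hardware structure of a control device provided by an embodiment of this application. As shown in FIG. 11, the control device includes a processor 1101, a communication line 1104, and at least one communication interface (the communication interface 1103 is used as an example in FIG. 11).
The processor 1101 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control execution of the programs of the solutions of this application.
The communication line 1104 may include circuitry for transferring information between the above components.
The communication interface 1103 uses any transceiver-like apparatus to communicate with other devices or communication networks, such as Ethernet or wireless local area networks (WLAN).
Optionally, the control device may further include a memory 1102.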
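The memory 1102 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory may exist independently and be connected to the processor through the communication line 1104. The memory may also be integrated with the processor.
The memory 1102 is used to store the computer-executable instructions for executing the solutions of this application, and execution is controlled by the processor 1101. The processor 1101 is used to execute the computer-executable instructions stored in the memory 1102, thereby implementing the voiceprint recognition method provided by the embodiments of this application.
Optionally, the computer-executable instructions in the embodiments of this application may also be referred to as application program code, which is not specifically limited in the embodiments of this application.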
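In a specific implementation, as an embodiment, the processor 1101 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 11.
In a specific implementation, as an embodiment, the control device may include multiple processors, such as the processor 1101 and the processor 1105 in FIG. 11. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores used to process data (such as computer program instructions).
By way of example, FIG. 12 is a schematic structural diagram of a chip provided by an embodiment of this application. The chip 120 includes one or at least two (including two) processors 1220 and a communication interface 1230.
In some implementations, the memory 1240 stores the following elements: executable modules or data structures, or a subset of them, or an extended set of them.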
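In the embodiments of this application, the memory 1240 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1220. A part of the memory 1240 may also include a non-volatile random access memory (NVRAM).
In the embodiments of this application, the processor 1220, the communication interface 1230, and the memory 1240 are coupled together through a bus system 1210. In addition to a data bus, the bus system 1210 may also include a power bus, a control bus, a status signal bus, and the like. For ease of description, the various buses are all labeled as the bus system 1210 in FIG. 12.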
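The method described in the above embodiments of this application may be applied to the processor 1220 or implemented by the processor 1220. The processor 1220 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 1220 or by instructions in the form of software. The processor 1220 may be a general-purpose processor (for example, a microprocessor or a conventional processor), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate, a transistor logic device, or a discrete hardware component. The processor 1220 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
The steps of the method disclosed in the embodiments of this application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable read-only memory (EEPROM). The storage medium is located in the memory 1240, and the processor 1220 reads the information in the memory 1240 and completes the steps of the above method in combination with its hardware.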
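In the above embodiments, the instructions stored in the memory for execution by the processor may be implemented in the form of a computer program product. The computer program product may be written into the memory in advance, or may be downloaded and installed in the memory in the form of software.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a server or data center that integrates one or more available media. For example, the available media may include magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, digital versatile discs (DVD)), or semiconductor media (for example, solid state disks (SSD)).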
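An embodiment of this application also provides a computer-readable storage medium. The methods described in the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. Computer-readable media may include computer storage media and communication media, and may also include any medium that can transfer a computer program from one place to another. A storage medium may be any target medium that can be accessed by a computer.
As a possible design, the computer-readable medium may include a compact disc read-only memory (CD-ROM), RAM, ROM, EEPROM, or other optical disc storage; the computer-readable medium may include magnetic disk storage or other magnetic storage devices. Moreover, any connection may properly be termed a computer-readable medium. For example, if software is transmitted from a website, server, or other remote source using coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies (such as infrared, radio, and microwave), then the coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
Combinations of the above should also be included within the scope of computer-readable media. The above are only specific implementations of the present invention, but the scope of protection of the present invention is not limited to them. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.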

Claims (13)

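  1. A voiceprint recognition method, characterized in that it is applied to a terminal device, the terminal device is provided with a preset database, and the preset database includes a voiceprint vector of at least one second user; the voiceprint vector is used to represent voice features of a user, and the method comprises:
    the terminal device collects a first voice, the first voice corresponding to a first voiceprint vector;
    if the terminal device determines that the first voice is a preset voice, the terminal device obtains a similarity score between the first voiceprint vector and a preset voiceprint vector to obtain a first value; the preset voiceprint vector is a voiceprint vector of a first user;
    the terminal device obtains the highest of the similarity scores between the first voiceprint vector and each voiceprint vector in the preset database to obtain a second value;
    when the terminal device determines that the first value is greater than a first threshold and the second value is less than a second threshold, the terminal device determines that voiceprint recognition of the first user succeeds; the second threshold is greater than the first threshold.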
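  2. The method according to claim 1, characterized in that the method further comprises:
    when the terminal device determines that the first value is greater than a third threshold and the second value is less than the first threshold, the terminal device adds the first voiceprint vector to the preset database; the first threshold is greater than the third threshold.
  3. The method according to claim 2, characterized in that, when the terminal device determines that the first value is greater than a third threshold and the second value is less than the first threshold, the terminal device adding the first voiceprint vector to the preset database comprises:
    when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and a signal-to-noise ratio value corresponding to the first voiceprint vector is greater than a fourth threshold, the terminal device adds the first voiceprint vector to the preset database.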
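  4. The method according to claim 2 or 3, characterized in that each voiceprint vector in the preset database records its storage time in the preset database and a use count, the use count being the number of times the second value was obtained from it, and the terminal device adding the first voiceprint vector to the preset database comprises:
    the terminal device evicts the voiceprint vector with the longest storage time in the preset database, and/or evicts the voiceprint vector with the lowest use count in the preset database;
    the terminal device adds the first voiceprint vector to the preset database.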
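  5. The method according to claim 2 or 3, characterized in that, when the terminal device determines that the first value is greater than a third threshold and the second value is less than the first threshold, the terminal device adding the first voiceprint vector to the preset database comprises:
    when the terminal device determines that the first value is greater than the third threshold and the second value is less than the first threshold, the terminal device displays a first interface; wherein the first interface includes: prompt information asking whether to add the first voiceprint vector to the preset database, a first control for adding the first voiceprint vector to the preset database, and a second control for declining to add the first voiceprint vector to the preset database;
    when the terminal device receives a trigger on the first control, or receives no trigger on any control in the first interface within a preset time threshold, the terminal device adds the first voiceprint vector to the preset database.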
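  6. The method according to claim 5, characterized in that the method further comprises:
    when the terminal device receives an operation for setting a voiceprint recognition mode, the terminal device displays a second interface; the second interface includes a third control for turning on a first recognition mode;
    when the terminal device receives an operation on the third control, the terminal device displays a third interface; the third interface includes a fourth control for turning on the prompt information;
    when the terminal device determines that the first value is greater than the third threshold and the second value is less than the first threshold, the terminal device displaying the first interface comprises: when the terminal device determines that the first value is greater than the third threshold, the second value is less than the first threshold, and the fourth control is on, the terminal device displays the first interface.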
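  7. The method according to any one of claims 1-6, characterized in that the method further comprises:
    the terminal device obtains the similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain third values;
    when the first value is greater than the second threshold and a third value is greater than the first threshold, the terminal device deletes the voiceprint vector in the preset database that corresponds to the first voiceprint vector.
  8. The method according to any one of claims 1-6, characterized in that the method further comprises:
    when the terminal device determines that the first value is greater than the second threshold, the terminal device obtains the similarity score between the first voiceprint vector and each voiceprint vector in the preset database to obtain third values;
    when a third value is greater than the first threshold, the terminal device deletes the voiceprint vector in the preset database that corresponds to the first voiceprint vector.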
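  9. The method according to any one of claims 1-8, characterized in that, when the terminal device determines that the first value is greater than a first threshold and the second value is less than a second threshold, the terminal device determining that voiceprint recognition of the first user succeeds comprises:
    when the terminal device determines that the first value is greater than the first threshold and the second value is less than the second threshold, or the terminal device determines that the first value is greater than the second threshold, the terminal device determines that voiceprint recognition of the first user succeeds.
  10. The method according to any one of claims 1-9, characterized in that the method further comprises:
    when the terminal device determines that the first value is less than or equal to the first threshold, and/or the second value is greater than or equal to the second threshold, the terminal device determines that voiceprint recognition of the first user fails.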
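  11. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the terminal device is caused to perform the method according to any one of claims 1 to 10.
  12. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, a computer is caused to perform the method according to any one of claims 1 to 10.
  13. A computer program product, characterized in that it includes a computer program, and when the computer program is run, a computer is caused to perform the method according to any one of claims 1 to 10.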
PCT/CN2022/118924 2021-12-28 2022-09-15 Voiceprint recognition method and apparatus WO2023124248A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111627924.0A CN115019806A (zh) 2021-12-28 2021-12-28 Voiceprint recognition method and apparatus
CN202111627924.0 2021-12-28

Publications (2)

Publication Number Publication Date
WO2023124248A1 WO2023124248A1 (zh) 2023-07-06
WO2023124248A9 true WO2023124248A9 (zh) 2023-10-26

Family

ID=83064298

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/118924 WO2023124248A1 (zh) 2021-12-28 2022-09-15 Voiceprint recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN115019806A (zh)
WO (1) WO2023124248A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019806A (zh) * 2021-12-28 2022-09-06 北京荣耀终端有限公司 Voiceprint recognition method and apparatus

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7240007B2 (en) * 2001-12-13 2007-07-03 Matsushita Electric Industrial Co., Ltd. Speaker authentication by fusion of voiceprint match attempt results with additional information
EP2897076B8 (en) * 2014-01-17 2018-02-07 Cirrus Logic International Semiconductor Ltd. Tamper-resistant element for use in speaker recognition
CN108806695A (zh) * 2018-04-17 2018-11-13 平安科技(深圳)有限公司 Self-updating anti-fraud method, apparatus, computer device, and storage medium
CN108985776A (zh) * 2018-09-13 2018-12-11 南京硅基智能科技有限公司 Credit card security monitoring method based on multiple information verification
US10659588B1 (en) * 2019-03-21 2020-05-19 Capital One Services, Llc Methods and systems for automatic discovery of fraudulent calls using speaker recognition
CN110246503A (zh) * 2019-05-20 2019-09-17 平安科技(深圳)有限公司 Blacklist voiceprint library construction method, apparatus, computer device, and storage medium
CN115019806A (zh) * 2021-12-28 2022-09-06 北京荣耀终端有限公司 Voiceprint recognition method and apparatus

Also Published As

Publication number Publication date
CN115019806A (zh) 2022-09-06
WO2023124248A1 (zh) 2023-07-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22913516

Country of ref document: EP

Kind code of ref document: A1