WO2022007757A1 - Cross-device voiceprint registration method, electronic device, and storage medium

Cross-device voiceprint registration method, electronic device, and storage medium

Info

Publication number
WO2022007757A1
Authority
WIPO (PCT)
Prior art keywords
terminal device
voice
voiceprint
registered
registration
Application number
PCT/CN2021/104585
Other languages
English (en)
French (fr)
Inventor
芦宇
李卓龙
胡伟湘
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022007757A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/08: Network architectures or network communication protocols for network security for authentication of entities
    • H04L63/0861: Network architectures or network communication protocols for network security for authentication of entities using biometrical features, e.g. fingerprint, retina-scan

Definitions

  • the present application belongs to the technical field of terminals, and in particular, relates to a cross-device voiceprint registration method, an electronic device, and a computer-readable storage medium.
  • Voiceprint recognition, that is, speaker recognition, is a technology that automatically recognizes and confirms a speaker's identity through speech, and is widely used in mobile phones, smart speakers, and other terminal devices.
  • before performing voiceprint recognition, the user needs to register a voiceprint in the terminal device; that is, the user needs to input a registration voice in the terminal device, so that a voiceprint template is generated according to the input registration voice and user identification can be performed according to the voiceprint template.
  • however, users often have multiple terminal devices, and to realize voiceprint recognition on each terminal device, the user needs to perform voice input on each terminal device to register its voiceprint; the number of voice inputs is large, which affects the user experience.
  • in view of this, the embodiments of the present application provide a cross-device voiceprint registration method, an electronic device, and a computer-readable storage medium, which can migrate registered voices so that voiceprints can be registered on multiple terminal devices with one voice input, reducing the number of registered-voice inputs for voiceprint registration on multiple terminal devices and improving user experience.
  • in a first aspect, an embodiment of the present application provides a cross-device voiceprint registration method, which is applied to a second terminal device.
  • the method may include: acquiring a first registered voice corresponding to a first terminal device; converting the first registered voice to obtain a second registered voice corresponding to the second terminal device; and generating a voiceprint template corresponding to the second terminal device according to the second registered voice.
  • in this way, the second terminal device can convert the first registered voice acquired by the first terminal device to generate the second registered voice corresponding to the second terminal device, and use it for the voiceprint registration of the second terminal device, so as to realize the purpose of registering voiceprints on multiple terminal devices with one registered-voice input, reduce the number of voice inputs for voiceprint registration on multiple terminal devices, and improve user experience.
  • the converting the first registration voice to obtain the second registration voice corresponding to the second terminal device may include:
  • the first registered voice is converted through the first channel model corresponding to the first terminal device to obtain the original voice corresponding to the first registered voice, and the first channel model is used to represent the first terminal The mapping relationship between the voice corresponding to the device and the original voice;
  • the original voice is converted through a second channel model corresponding to the second terminal device to obtain the second registered voice corresponding to the second terminal device, where the second channel model is used to represent the mapping relationship between the voice corresponding to the second terminal device and the original voice.
  • the channel model corresponding to a terminal device may be a channel model of the terminal device relative to the original voice signal, which is used to represent the mapping relationship between the voice signal acquired by the terminal device and the original voice signal; the original voice signal is a voice signal to which no terminal device's channel information has been appended.
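  • as a minimal illustrative sketch of this two-step conversion, assuming the channel models are available as per-frequency gain curves H1 and H2 over STFT bins (the function names, frame parameters, and gain-curve representation are assumptions, not details from the patent):

```python
# Convert a voice recorded by the first device into the voice the second
# device would have recorded: divide out device 1's channel to recover the
# original voice, then multiply by device 2's channel.
import numpy as np
from scipy.signal import stft, istft

def convert_registration_voice(voice_dev1, H1, H2, fs=16000, nperseg=512):
    """voice_dev1: waveform from device 1; H1, H2: channel gains per STFT bin
    (arrays of length nperseg // 2 + 1)."""
    _, _, Z = stft(voice_dev1, fs=fs, nperseg=nperseg)
    eps = 1e-8
    original = Z / (H1[:, None] + eps)   # remove device 1's channel information
    Z2 = original * H2[:, None]          # append device 2's channel information
    _, voice_dev2 = istft(Z2, fs=fs, nperseg=nperseg)
    return voice_dev2
```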
  • the converting the first registration voice to obtain the second registration voice corresponding to the second terminal device may include:
  • the first registered voice is converted through a third channel model to obtain the second registered voice corresponding to the second terminal device, where the third channel model is used to represent the mapping relationship between the voice corresponding to the first terminal device and the voice corresponding to the second terminal device.
  • the channel model corresponding to the terminal device may also be a channel model between two terminal devices, which is used to represent the mapping relationship between the voice signals obtained by the two terminal devices.
  • the first channel model and the second channel model are channel models constructed based on frequency response curves, or channel models constructed based on spectral characteristics.
  • the third channel model is a channel model constructed based on a frequency response curve, or a channel model constructed based on spectral characteristics.
  • in some embodiments, when establishing the channel model of a terminal device relative to the original voice signal, an original frequency sweep signal can be played to the terminal device, the frequency response curve St of the sound signal received by the terminal device can be measured, and the frequency response curve S of the original frequency sweep signal can be measured; each frequency response gain value can then be calculated according to the frequency response curve St and the frequency response curve S, and the channel model of the terminal device relative to the original voice signal can be established according to the frequency response gain values.
  • the original frequency sweep signal may be the original sound signal output by a frequency sweep signal generator, and each frequency response gain value may be the ratio between the values of the frequency response curve St and the frequency response curve S at the same frequency; that is, the channel model of the terminal device relative to the original voice signal can be St/S.
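  • as an illustrative sketch of the St/S construction described above (the single-FFT response measurement and the smoothing constant are assumptions, not details from the patent):

```python
# Build a frequency-response channel model from a swept-sine measurement.
# The model is the per-frequency gain St/S between the device recording
# and the source sweep, as described in the text above.
import numpy as np

def magnitude_response(signal, nfft=4096):
    """Magnitude spectrum of a (recorded) sweep signal."""
    return np.abs(np.fft.rfft(signal, n=nfft))

def build_channel_model(original_sweep, recorded_sweep, nfft=4096):
    S = magnitude_response(original_sweep, nfft)   # response of the source sweep
    St = magnitude_response(recorded_sweep, nfft)  # response at the device microphone
    eps = 1e-8                                     # avoid division by zero
    return St / (S + eps)                          # channel model: St / S
```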
  • in some embodiments, when establishing the channel model of a terminal device relative to the original voice signal, the original voice signal can be played to the terminal device, the voice signal received by the terminal device can be obtained, and the original voice signal and the received voice signal can be sent to a preset neural network model; the neural network model can extract the spectral feature A corresponding to the original voice signal and the spectral feature B corresponding to the received voice signal, and learn the mapping relationship between the spectral feature A and the spectral feature B, thereby obtaining the channel model of the terminal device relative to the original voice signal.
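  • a minimal sketch of learning such a spectral mapping; the patent only states that a preset network learns the mapping between spectral features A and B, so this architecture and training loop are illustrative assumptions:

```python
# Learn a channel model as a frame-wise spectral mapping A -> B with a small
# neural network trained by mean-squared error.
import torch
import torch.nn as nn

class ChannelMapper(nn.Module):
    def __init__(self, n_bins=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_bins),
        )

    def forward(self, spec_a):        # frames of spectral feature A
        return self.net(spec_a)       # predicted spectral feature B

def fit_channel_model(feats_a, feats_b, epochs=200, lr=1e-3):
    """feats_a, feats_b: (num_frames, n_bins) tensors of aligned spectra."""
    model = ChannelMapper(n_bins=feats_a.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(feats_a), feats_b)  # learn the A -> B mapping
        loss.backward()
        opt.step()
    return model
```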
  • in some embodiments, when establishing a channel model between two terminal devices, the original frequency sweep signal can be played to the first terminal device and the second terminal device respectively, where the original frequency sweep signal played to the two devices is the same; the frequency response curve St1 of the sound signal received by the first terminal device and the frequency response curve St2 of the sound signal received by the second terminal device are measured; each frequency response gain value can then be calculated according to the frequency response curves St1 and St2, and the channel model of the first terminal device relative to the second terminal device, and/or the channel model of the second terminal device relative to the first terminal device, can be established according to the frequency response gain values.
  • for example, the channel model of the first terminal device relative to the second terminal device may be St1/St2, and the channel model of the second terminal device relative to the first terminal device may be St2/St1.
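  • a small sketch of this device-to-device construction, again using a single-FFT response measurement as an assumed stand-in for the measurement procedure:

```python
# Device-to-device channel model from two recordings of the same sweep.
# Multiplying a device-1 recording's spectrum by St2/St1 maps it to the
# corresponding device-2 recording, and St1/St2 maps the other way.
import numpy as np

def device_to_device_models(sweep_dev1, sweep_dev2, nfft=4096):
    St1 = np.abs(np.fft.rfft(sweep_dev1, n=nfft))  # response at device 1's mic
    St2 = np.abs(np.fft.rfft(sweep_dev2, n=nfft))  # response at device 2's mic
    eps = 1e-8
    H_1to2 = St2 / (St1 + eps)  # converts a device-1 voice to a device-2 voice
    H_2to1 = St1 / (St2 + eps)  # converts a device-2 voice to a device-1 voice
    return H_1to2, H_2to1
```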
  • in some embodiments, when establishing a channel model between two terminal devices, such as between the first terminal device and the second terminal device, the original voice signal can be played to the first terminal device and the second terminal device respectively; the voice signal C received by the first terminal device and the voice signal D received by the second terminal device can be obtained, and the voice signal C and the voice signal D can be sent to a preset neural network model; the neural network model can extract the spectral feature C corresponding to the voice signal C and the spectral feature D corresponding to the voice signal D, and learn the mapping relationship between the spectral feature C and the spectral feature D, thereby obtaining the channel model of the first terminal device relative to the second terminal device, and/or the channel model of the second terminal device relative to the first terminal device.
  • the generating a voiceprint template corresponding to the second terminal device according to the second registered voice may include:
  • a voiceprint template corresponding to the second terminal device is generated according to the voiceprint recognition model corresponding to the second terminal device and the second registered voice, where the voiceprint recognition model corresponding to the second terminal device is a voiceprint recognition model obtained by training based on training voices acquired by the second terminal device.
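  • the patent does not pin down a specific voiceprint recognition model, so the sketch below assumes a generic speaker-embedding function (`embed`) and stores the normalized mean embedding of the converted registration voices as the device's voiceprint template:

```python
# Generate a voiceprint template from converted registration voices.
import numpy as np

def make_voiceprint_template(registration_voices, embed):
    """registration_voices: waveforms converted for this device.
    embed: maps a waveform to a fixed-size speaker embedding."""
    embs = np.stack([embed(v) for v in registration_voices])
    template = embs.mean(axis=0)                 # average over enrollment utterances
    return template / np.linalg.norm(template)   # unit-normalize for cosine scoring
```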
  • in a possible implementation manner, after generating the voiceprint template corresponding to the second terminal device according to the second registered voice, the method may further include: acquiring an authentication voice corresponding to the second terminal device, generating an authentication template corresponding to the authentication voice according to the voiceprint recognition model corresponding to the second terminal device and the authentication voice, and determining the similarity between the authentication template and the voiceprint template; when the similarity is greater than a preset similarity threshold, the voiceprint recognition model corresponding to the second terminal device is updated according to the authentication voice.
  • in a possible implementation manner, the method may further include: converting the authentication voice to obtain a training voice corresponding to the first terminal device, and sending the training voice to the first terminal device, where the training voice is used to update the voiceprint recognition model corresponding to the first terminal device.
  • in this way, the embodiments of the present application can obtain high-quality authentication voices during users' daily use to update the voiceprint recognition model corresponding to each terminal device, thereby improving the matching between the voiceprint recognition model corresponding to each terminal device and the actual use scenario, improving the robustness of voiceprint recognition in each terminal device, and thus improving the accuracy of voiceprint recognition of each terminal device.
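  • a sketch of this verification-then-update gating, assuming cosine similarity between unit-normalized embeddings; the metric and the threshold value are illustrative assumptions:

```python
# Score an authentication voice against the stored template and, only when
# the similarity clears a preset threshold, keep the utterance as new
# training material for the model update.
import numpy as np

def verify_and_collect(auth_voice, template, embed, threshold=0.75):
    auth_emb = embed(auth_voice)
    auth_emb = auth_emb / np.linalg.norm(auth_emb)
    similarity = float(np.dot(auth_emb, template))   # cosine similarity
    if similarity > threshold:
        # high-quality authentication voice: usable to update this device's
        # voiceprint recognition model (and, after conversion, other devices')
        return similarity, auth_voice
    return similarity, None
```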
  • in a second aspect, an embodiment of the present application provides a cross-device voiceprint registration method, which is applied to a first terminal device or a server, and the method may include: acquiring a first registered voice corresponding to the first terminal device; converting the first registered voice to obtain a second registered voice corresponding to a second terminal device; and sending the second registered voice to the second terminal device, where the second registered voice is used to generate a voiceprint template corresponding to the second terminal device.
  • in this way, the first terminal device or the server can convert the first registered voice acquired by the first terminal device to generate the second registered voice corresponding to the second terminal device, and send the second registered voice to the second terminal device; the second terminal device can then directly perform its voiceprint registration based on the received second registered voice, realizing the purpose of registering voiceprints on multiple terminal devices with one registered-voice input, reducing the number of voice inputs for voiceprint registration on multiple terminal devices, and improving user experience.
  • in addition, the calculation amount of the second terminal device can be reduced, and the use performance of the second terminal device can be ensured.
  • the converting the first registration voice to obtain the second registration voice corresponding to the second terminal device may include:
  • the first registered voice is converted through the first channel model corresponding to the first terminal device to obtain the original voice corresponding to the first registered voice, and the first channel model is used to represent the first terminal The mapping relationship between the voice corresponding to the device and the original voice;
  • the original voice is converted through a second channel model corresponding to the second terminal device to obtain the second registered voice corresponding to the second terminal device, where the second channel model is used to represent the mapping relationship between the voice corresponding to the second terminal device and the original voice.
  • the converting the first registration voice to obtain the second registration voice corresponding to the second terminal device may include:
  • the first registered voice is converted through a third channel model to obtain the second registered voice corresponding to the second terminal device, where the third channel model is used to represent the mapping relationship between the voice corresponding to the first terminal device and the voice corresponding to the second terminal device.
  • in a possible implementation manner, when the method is applied to the server, after the server obtains the second registered voice corresponding to the second terminal device through conversion processing, the server can also directly generate a voiceprint template corresponding to the second terminal device by using the voiceprint recognition model corresponding to the second terminal device and the second registered voice, and send the generated voiceprint template to the second terminal device. Generating the voiceprint template directly on the server and sending it to the second terminal device reduces the calculation amount of the second terminal device and lowers the performance requirements for the second terminal device.
  • in a possible implementation manner, the method may further include: converting an acquired authentication voice to obtain training voices corresponding to each terminal device, where the training voices are used to update the voiceprint recognition models corresponding to those terminal devices.
  • in this way, when the method provided in the embodiments of the present application is applied to the server, the server can obtain high-quality authentication voices during the user's daily use and convert them into training voices corresponding to each terminal device to update the voiceprint recognition model corresponding to each terminal device, improving the matching between the voiceprint recognition model corresponding to each terminal device and the actual use scenario, improving the robustness of voiceprint recognition in each terminal device, and improving the accuracy of voiceprint recognition of each terminal device.
  • an embodiment of the present application provides a cross-device voiceprint registration method, which may include:
  • the first terminal device acquires the first registered voice corresponding to the first terminal device
  • the first terminal device converts the first registered voice to obtain a first original voice corresponding to the first registered voice, and sends the first original voice to the second terminal device;
  • the second terminal device receives the first original voice from the first terminal device, and performs conversion processing on the first original voice to obtain a second registered voice corresponding to the second terminal device;
  • the second terminal device generates a voiceprint template corresponding to the second terminal device according to the second registered voice.
  • in this way, the process of converting the first registered voice into the second registered voice can be decomposed into a process of converting the first registered voice into the original voice and a process of converting the original voice into the second registered voice, where the former is performed by the first terminal device and the latter by the second terminal device. Splitting the conversion between the first terminal device and the second terminal device in this way reduces the calculation amount of each terminal device, thereby ensuring the use performance of each terminal device.
  • acquiring, by the first terminal device, the first registration voice corresponding to the first terminal device may include:
  • the first terminal device acquires the interactive voice between the first terminal device and the user, and acquires the target voice in the interactive voice, where the target voice is the voice corresponding to the user;
  • the first terminal device acquires the first registration voice corresponding to the first terminal device from the target voice according to the signal-to-noise ratio and/or voice energy level corresponding to the target voice.
  • in this way, the first terminal device can also obtain the first registered voice from the user's daily voice interaction with the first terminal device, so that the voiceprint registration of each terminal device is performed in a self-learning, registration-free manner, thereby simplifying the operation process of voiceprint registration and improving user experience.
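  • a sketch of such SNR/energy screening of the target voice; the frame length, noise-floor estimate, and threshold values are illustrative assumptions:

```python
# Screen a user's interactive voice for registration-quality audio using
# signal-to-noise ratio and voice energy, as described above.
import numpy as np

def select_registration_voice(target_voice, fs=16000, snr_min_db=20.0,
                              energy_min_db=-35.0, frame_ms=25):
    frame = int(fs * frame_ms / 1000)
    n = len(target_voice) // frame
    if n == 0:
        return None
    power = np.mean(target_voice[:n * frame].reshape(n, frame) ** 2, axis=1) + 1e-12
    noise_floor = np.percentile(power, 10)        # quietest frames approximate noise
    snr_db = 10 * np.log10(np.mean(power) / noise_floor)
    energy_db = 10 * np.log10(np.mean(power))     # relative to full scale (1.0)
    if snr_db >= snr_min_db and energy_db >= energy_min_db:
        return target_voice   # good enough to serve as the first registered voice
    return None
```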
  • in a possible implementation manner, after the second terminal device generates the voiceprint template corresponding to the second terminal device according to the second registered voice, the method may further include:
  • the second terminal device acquires the authentication voice corresponding to the second terminal device, generates an authentication template corresponding to the authentication voice according to the voiceprint recognition model corresponding to the second terminal device and the authentication voice, and determines the similarity between the authentication template and the voiceprint template;
  • when the similarity is greater than a preset similarity threshold, the second terminal device updates the voiceprint recognition model corresponding to the second terminal device according to the authentication voice, converts the authentication voice through the second channel model corresponding to the second terminal device to obtain a second original voice corresponding to the authentication voice, and sends the second original voice to the first terminal device;
  • the first terminal device receives the second original voice from the second terminal device, converts the second original voice through the first channel model corresponding to the first terminal device to obtain a training voice corresponding to the first terminal device, and updates the voiceprint recognition model corresponding to the first terminal device according to the training voice.
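  • a sketch of this round trip, reusing the per-bin gain-curve channel-model representation assumed in the earlier sketches (names are illustrative):

```python
# Turn a verified authentication voice captured on device 2 into a training
# voice for device 1: strip device 2's channel to recover the original voice,
# then apply device 1's channel.
import numpy as np
from scipy.signal import stft, istft

def to_original_voice(auth_voice, H2, fs=16000, nperseg=512):
    _, _, Z = stft(auth_voice, fs=fs, nperseg=nperseg)
    _, original = istft(Z / (H2[:, None] + 1e-8), fs=fs, nperseg=nperseg)
    return original   # the second original voice sent to the first device

def to_training_voice(original_voice, H1, fs=16000, nperseg=512):
    _, _, Z = stft(original_voice, fs=fs, nperseg=nperseg)
    _, training = istft(Z * H1[:, None], fs=fs, nperseg=nperseg)
    return training   # used to update device 1's voiceprint recognition model
```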
  • an embodiment of the present application provides an apparatus for cross-device voiceprint registration, which is applied to a second terminal device.
  • the apparatus may include:
  • a registered voice acquisition module configured to acquire the first registered voice corresponding to the first terminal device
  • a conversion processing module configured to perform conversion processing on the first registered voice to obtain a second registered voice corresponding to the second terminal device
  • a voiceprint registration module configured to generate a voiceprint template corresponding to the second terminal device according to the second registered voice.
  • the conversion processing module may include:
  • a first conversion processing unit, configured to convert the first registered voice through a first channel model corresponding to the first terminal device to obtain the original voice corresponding to the first registered voice, where the first channel model is used to characterize the mapping relationship between the voice corresponding to the first terminal device and the original voice;
  • a second conversion processing unit, configured to convert the original voice through a second channel model corresponding to the second terminal device to obtain the second registered voice corresponding to the second terminal device, where the second channel model is used to characterize the mapping relationship between the voice corresponding to the second terminal device and the original voice.
  • the conversion processing module may include:
  • a third conversion processing unit, configured to convert the first registered voice through a third channel model to obtain the second registered voice corresponding to the second terminal device, where the third channel model is used to characterize the mapping relationship between the voice corresponding to the first terminal device and the voice corresponding to the second terminal device.
  • the first channel model and the second channel model are channel models constructed based on frequency response curves, or channel models constructed based on spectral characteristics.
  • the third channel model is a channel model constructed based on a frequency response curve, or a channel model constructed based on spectral characteristics.
  • the voiceprint registration module is specifically configured to generate a voiceprint template corresponding to the second terminal device according to the voiceprint recognition model corresponding to the second terminal device and the second registered voice, where the voiceprint recognition model corresponding to the second terminal device is a voiceprint recognition model obtained by training based on training voices acquired by the second terminal device.
  • the apparatus may further include:
  • an authentication voice acquisition module configured to acquire the authentication voice corresponding to the second terminal device
  • an authentication template generation module configured to generate an authentication template corresponding to the authentication voice according to the voiceprint recognition model corresponding to the second terminal device and the authentication voice;
  • a similarity determination module configured to determine the similarity between the authentication template and the voiceprint template
  • a model updating module configured to update the voiceprint recognition model corresponding to the second terminal device according to the authentication voice when the similarity is greater than a preset similarity threshold.
  • the apparatus may further include:
  • a training voice acquisition module, configured to convert the authentication voice to obtain a training voice corresponding to the first terminal device, and send the training voice to the first terminal device, where the training voice is used to update the voiceprint recognition model corresponding to the first terminal device.
  • an embodiment of the present application provides an apparatus for cross-device voiceprint registration, which is applied to a first terminal device or a server, and the apparatus may include:
  • a first registered voice acquisition module configured to acquire the first registered voice corresponding to the first terminal device
  • a conversion processing module configured to perform conversion processing on the first registered voice to obtain a second registered voice corresponding to the second terminal device
  • the second registration voice sending module is configured to send the second registration voice to the second terminal device, where the second registration voice is used to generate a voiceprint template corresponding to the second terminal device.
  • the conversion processing module may include:
  • a first conversion processing unit, configured to convert the first registered voice through a first channel model corresponding to the first terminal device to obtain the original voice corresponding to the first registered voice, where the first channel model is used to characterize the mapping relationship between the voice corresponding to the first terminal device and the original voice;
  • a second conversion processing unit, configured to convert the original voice through a second channel model corresponding to the second terminal device to obtain the second registered voice corresponding to the second terminal device, where the second channel model is used to characterize the mapping relationship between the voice corresponding to the second terminal device and the original voice.
  • the conversion processing module may further include:
  • a third conversion processing unit, configured to convert the first registered voice through a third channel model to obtain the second registered voice corresponding to the second terminal device, where the third channel model is used to characterize the mapping relationship between the voice corresponding to the first terminal device and the voice corresponding to the second terminal device.
  • an embodiment of the present application provides a cross-device voiceprint registration system, which may include a first terminal device and a second terminal device, where the first terminal device includes a registered voice acquisition module and a first conversion processing module, and the second terminal device includes a second conversion processing module and a voiceprint registration module;
  • the registration voice acquisition module configured to acquire the first registration voice corresponding to the first terminal device
  • the first conversion processing module is configured to perform conversion processing on the first registered voice, obtain a first original voice corresponding to the first registered voice, and send the first original voice to the second terminal device;
  • the second conversion processing module is configured to receive the first original voice from the first terminal device, and perform conversion processing on the first original voice to obtain a second registration corresponding to the second terminal device voice;
  • the voiceprint registration module is configured to generate a voiceprint template corresponding to the second terminal device according to the second registered voice.
  • the registered voice acquisition module may include:
  • a target voice obtaining unit configured to obtain the interactive voice between the first terminal device and the user, and obtain the target voice in the interactive voice, where the target voice is the voice corresponding to the user;
  • a registered voice obtaining unit configured to obtain the first registered voice corresponding to the first terminal device from the target voice according to the signal-to-noise ratio and/or voice energy level corresponding to the target voice.
  • the second terminal device may further include an authentication voice acquisition module, an authentication template generation module, a similarity determination module, and a first model update module; the first terminal device may further include a training voice acquisition module and a second model update module:
  • the authentication voice acquisition module configured to acquire the authentication voice corresponding to the second terminal device
  • the authentication template generation module is configured to generate an authentication template corresponding to the authentication voice according to the voiceprint recognition model corresponding to the second terminal device and the authentication voice;
  • the similarity determination module configured to determine the similarity between the authentication template and the voiceprint template
  • the first model updating module is configured to: when the similarity is greater than a preset similarity threshold, update the voiceprint recognition model corresponding to the second terminal device according to the authentication voice, convert the authentication voice through the second channel model corresponding to the second terminal device to obtain a second original voice corresponding to the authentication voice, and send the second original voice to the first terminal device;
  • the training voice acquisition module is configured to receive the second original voice from the second terminal device, and convert the second original voice according to the first channel model corresponding to the first terminal device to obtain the training voice corresponding to the first terminal device;
  • the second model updating module is configured to update the voiceprint recognition model corresponding to the first terminal device according to the training voice corresponding to the first terminal device.
  • an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, when the processor executes the computer program , enabling the electronic device to implement the cross-device voiceprint registration method according to any one of the first aspect or the second aspect.
  • an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the computer is enabled to implement the cross-device voiceprint registration method according to any one of the first aspect or the second aspect.
  • an embodiment of the present application provides a computer program product, which, when run on an electronic device, enables the electronic device to perform the cross-device voiceprint registration method according to any one of the first aspect or the second aspect.
  • FIG. 1 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a software architecture of a terminal device provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 4 is a first schematic diagram of an application interface provided by an embodiment of the present application.
  • FIG. 5 is a second schematic diagram of an application interface provided by an embodiment of the present application.
  • FIG. 6 is a third schematic diagram of an application interface provided by an embodiment of the present application.
  • FIG. 7 is a fourth schematic diagram of an application interface provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a cross-device voiceprint registration method provided in Embodiment 1 of the present application.
  • FIG. 9 is a schematic diagram of another application scenario provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a cross-device voiceprint registration method provided in Embodiment 2 of the present application.
  • FIG. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • depending on the context, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting".
  • similarly, the phrases "if it is determined" or "if the [described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to the determination", "once the [described condition or event] is detected", or "in response to detection of the [described condition or event]".
  • references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases "in one embodiment", "in some embodiments", "in other embodiments", and the like in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise.
  • the terms "including", "having", and their variants mean "including but not limited to" unless specifically emphasized otherwise.
  • Voiceprint recognition is a technology that automatically recognizes and confirms the identity of the speaker through speech, and can be applied to terminal devices such as mobile phones, smart watches, and smart speakers.
  • voiceprint recognition includes two stages: voiceprint registration and voiceprint verification.
  • in the voiceprint registration stage, the user needs to input the registration voice to the terminal device, and the terminal device can generate a voiceprint template according to the obtained registration voice;
  • in the voiceprint verification stage, the terminal device can perform similarity scoring between the authentication voice input by the user and the voiceprint template generated in the registration stage, so as to identify the user's identity.
  • the registered voice acquired by the terminal device will be affected by the channel corresponding to the terminal device, that is, the registered voice acquired by the terminal device will have channel information corresponding to the terminal device appended.
  • Different terminal devices are often composed of different hardware, so different terminal devices have different channel information, and the registration voices they acquire for voiceprint registration therefore differ. Because the channels differ, the registered voice obtained by one terminal device can only be applied to the voiceprint registration of that terminal device.
  • the user needs to perform voice input on each terminal device to register the voiceprint of each terminal device, and the number of voice input is large, which affects the user experience.
  • to this end, the embodiments of the present application provide a cross-device voiceprint registration method, an electronic device, and a computer-readable storage medium, which can convert the registration voice acquired by one terminal device to generate registration voices corresponding to other terminal devices and register voiceprints on those devices, realizing the purpose of registering voiceprints on multiple terminal devices by inputting one registration voice, reducing the number of voice inputs for voiceprint registration on multiple terminal devices, and improving user experience.
  • the terminal devices involved in the embodiments of the present application may be mobile phones, tablet computers, wearable devices (such as smart earphones and smart bracelets), smart speakers, smart home devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, laptops, ultra-mobile personal computers (UMPCs), personal digital assistants (PDAs), desktop computers, and so on.
  • FIG. 1 is a schematic structural diagram of a terminal device 100 provided by an embodiment of the present application.
  • the terminal device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and so on.
  • the terminal device 100 may include more or fewer components than shown, or combine some components, or split some components, or use a different arrangement of components.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units; for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or used cyclically. If the processor 110 needs to use the instructions or data again, they can be called directly from the memory, which avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the I2C interface is a bidirectional synchronous serial bus that includes a serial data line (SDA) and a serial clock line (SCL).
  • the processor 110 may contain multiple sets of I2C buses.
  • the processor 110 can be respectively coupled to the touch sensor 180K, the charger, the flash, the camera 193 and the like through different I2C bus interfaces.
  • the processor 110 may couple the touch sensor 180K through the I2C interface, so that the processor 110 and the touch sensor 180K communicate with each other through the I2C bus interface, so as to realize the touch function of the terminal device 100 .
  • the I2S interface can be used for audio communication.
  • the processor 110 may contain multiple sets of I2S buses.
  • the processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170 .
  • the audio module 170 can transmit audio signals to the wireless communication module 160 through the I2S interface, so as to realize the function of answering calls through a Bluetooth headset.
  • the PCM interface can also be used for audio communication, to sample, quantize, and encode analog signals.
  • the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface.
  • the audio module 170 can also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to realize the function of answering calls through the Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication.
  • a UART interface is typically used to connect the processor 110 with the wireless communication module 160 .
  • the processor 110 communicates with the Bluetooth module in the wireless communication module 160 through the UART interface to implement the Bluetooth function.
  • the audio module 170 can transmit audio signals to the wireless communication module 160 through the UART interface, so as to realize the function of playing music through the Bluetooth headset.
  • the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 .
  • MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc.
  • the processor 110 communicates with the camera 193 through the CSI interface, so as to realize the shooting function of the terminal device 100 .
  • the processor 110 communicates with the display screen 194 through the DSI interface to implement the display function of the terminal device 100 .
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like.
  • the GPIO interface can also be configured as I2C interface, I2S interface, UART interface, MIPI interface, etc.
  • the USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
  • the USB interface 130 can be used to connect a charger to charge the terminal device 100, and can also be used to transmit data between the terminal device 100 and peripheral devices. It can also be used to connect headphones to play audio through the headphones. This interface can also be used to connect other terminal devices, such as AR devices.
  • the interface connection relationship between the modules illustrated in the embodiments of the present application is only a schematic illustration, and does not constitute a structural limitation of the terminal device 100 .
  • the terminal device 100 may also adopt an interface connection manner different from those illustrated in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger may be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
  • the charging management module 140 may receive wireless charging input through the wireless charging coil of the terminal device 100 . While the charging management module 140 charges the battery 142 , it can also supply power to the terminal device through the power management module 141 .
  • the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160.
  • the power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance).
  • the power management module 141 may also be provided in the processor 110 .
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the terminal device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in terminal device 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G, etc. applied on the terminal device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
  • the mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modulation and demodulation processor, and then turn it into an electromagnetic wave for radiation through the antenna 1 .
  • at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110 .
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low frequency baseband signal is processed by the baseband processor and passed to the application processor.
  • the application processor outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194 .
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent of the processor 110, and may be provided in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 can provide wireless communication solutions applied on the terminal device 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110 , perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2 .
  • the antenna 1 of the terminal device 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the terminal device 100 can communicate with the network and other devices through wireless communication technology.
  • the wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • the GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS).
  • the terminal device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 194 is used to display images, videos, and the like.
  • Display screen 194 includes a display panel.
  • the display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like.
  • the terminal device 100 may include one or N display screens 194 , where N is a positive integer greater than one.
  • the terminal device 100 can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
  • the ISP is used to process the data fed back by the camera 193 .
  • when shooting, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, which converts it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin tone.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 193 .
  • Camera 193 is used to capture still images or video.
  • the object is projected through the lens to generate an optical image onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • the DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV.
  • the terminal device 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
  • the digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform on the frequency-point energy, and the like.
  • Video codecs are used to compress or decompress digital video.
  • the terminal device 100 may support one or more video codecs.
  • the terminal device 100 can play or record videos in multiple encoding formats, for example, moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, and MPEG-4.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the terminal device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device 100 .
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area may store data (such as audio data, phone book, etc.) created during the use of the terminal device 100 and the like.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the terminal device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the terminal device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playback, recording, etc.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the terminal device 100 can listen to music through the speaker 170A, or listen to a hands-free call.
  • the receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
  • when the terminal device 100 answers a call or plays a voice message, the voice can be heard by placing the receiver 170B close to the ear.
  • the microphone 170C, also called a "mic" or "mike", is used to convert sound signals into electrical signals.
  • the user can make a sound close to the microphone 170C, so that the sound signal is input into the microphone 170C.
  • the terminal device 100 may be provided with at least one microphone 170C.
  • the terminal device 100 may be provided with two microphones 170C, which may implement a noise reduction function in addition to collecting sound signals.
  • the terminal device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the earphone jack 170D is used to connect wired earphones.
  • the earphone interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals.
  • the pressure sensor 180A may be provided on the display screen 194 .
  • the capacitive pressure sensor may consist of at least two parallel plates made of conductive material. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes.
  • the terminal device 100 determines the intensity of the pressure according to the change in capacitance. When a touch operation acts on the display screen 194, the terminal device 100 detects the intensity of the touch operation according to the pressure sensor 180A.
  • the terminal device 100 may also calculate the touched position according to the detection signal of the pressure sensor 180A.
  • touch operations acting on the same touch position but with different touch operation intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than the first pressure threshold acts on the short message application icon, the instruction for viewing the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, the instruction to create a new short message is executed.
  • the gyro sensor 180B may be used to determine the motion attitude of the terminal device 100 .
  • the angular velocity of the terminal device 100 about three axes (i.e., the x, y, and z axes) can be determined through the gyro sensor 180B.
  • the gyro sensor 180B can be used for image stabilization.
  • the gyro sensor 180B detects the shaking angle of the terminal device 100, calculates the distance to be compensated by the lens module according to the angle, and allows the lens to offset the shaking of the terminal device 100 through reverse motion to achieve anti-shake.
  • the gyro sensor 180B can also be used for navigation and somatosensory game scenarios.
  • the air pressure sensor 180C is used to measure air pressure.
  • the terminal device 100 calculates the altitude through the air pressure value measured by the air pressure sensor 180C to assist in positioning and navigation.
  • the magnetic sensor 180D includes a Hall sensor.
  • the terminal device 100 can detect the opening and closing of the flip holster using the magnetic sensor 180D.
  • the terminal device 100 can detect the opening and closing of the flip cover according to the magnetic sensor 180D, and can further set characteristics such as automatic unlocking of the flip cover according to the detected opening or closing state of the holster or flip cover.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the terminal device 100 in various directions (generally three axes).
  • the magnitude and direction of gravity can be detected when the terminal device 100 is stationary. The acceleration sensor can also be used to identify the posture of the terminal device, and can be applied to horizontal/vertical screen switching, pedometers, and the like.
  • the distance sensor 180F is used to measure distance; the terminal device 100 can measure distance through infrared or laser. In some embodiments, when shooting a scene, the terminal device 100 can use the distance sensor 180F to measure distance to achieve fast focusing.
  • Proximity light sensor 180G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes.
  • the light emitting diodes may be infrared light emitting diodes.
  • the terminal device 100 emits infrared light to the outside through the light emitting diode.
  • the terminal device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it can be determined that there is an object near the terminal device 100 . When insufficient reflected light is detected, the terminal device 100 may determine that there is no object near the terminal device 100 .
  • the terminal device 100 can use the proximity light sensor 180G to detect that the user holds the terminal device 100 close to the ear to talk, so as to automatically turn off the screen to save power.
  • the proximity light sensor 180G can also be used in holster mode and pocket mode to automatically unlock and lock the screen.
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • the terminal device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the terminal device 100 is in a pocket, so as to prevent accidental touch.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the terminal device 100 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking photos with fingerprints, answering incoming calls with fingerprints, and the like.
  • the temperature sensor 180J is used to detect the temperature.
  • the terminal device 100 uses the temperature detected by the temperature sensor 180J to execute the temperature processing strategy. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold value, the terminal device 100 reduces the performance of the processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection.
  • in some other embodiments, when the temperature is lower than another threshold, the terminal device 100 heats the battery 142 to avoid an abnormal shutdown of the terminal device 100 caused by the low temperature.
  • the terminal device 100 boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.
  • the touch sensor 180K is also called a "touch device".
  • the touch sensor 180K may be disposed on the display screen 194 , and the touch sensor 180K and the display screen 194 form a touch screen, also called a “touch screen”.
  • the touch sensor 180K is used to detect a touch operation on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output associated with touch operations may be provided via display screen 194.
  • the touch sensor 180K may also be disposed on the surface of the terminal device 100 , which is different from the position where the display screen 194 is located.
  • the bone conduction sensor 180M can acquire vibration signals.
  • the bone conduction sensor 180M can acquire the vibration signal of the vibrating bone mass of the human voice.
  • the bone conduction sensor 180M can also contact the pulse of the human body and receive the blood pressure beating signal.
  • the bone conduction sensor 180M can also be disposed in the earphone, combined with the bone conduction earphone.
  • the audio module 170 can analyze the voice signal based on the vibration signal of the vocal vibration bone block obtained by the bone conduction sensor 180M, so as to realize the voice function.
  • the application processor can analyze the heart rate information based on the blood pressure beat signal obtained by the bone conduction sensor 180M, and realize the function of heart rate detection.
  • the keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys.
  • the terminal device 100 may receive key input and generate key signal input related to user settings and function control of the terminal device 100 .
  • Motor 191 can generate vibrating cues.
  • the motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback.
  • touch operations acting on different applications can correspond to different vibration feedback effects.
  • the motor 191 can also correspond to different vibration feedback effects for touch operations on different areas of the display screen 194 .
  • touch operations in different application scenarios (for example, time reminders, receiving messages, alarm clocks, and games) can also correspond to different vibration feedback effects.
  • the touch vibration feedback effect can also support customization.
  • the indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.
  • the SIM card interface 195 is used to connect a SIM card.
  • the SIM card can be brought into contact with or separated from the terminal device 100 by being inserted into or pulled out of the SIM card interface 195.
  • the terminal device 100 may support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
  • the SIM card interface 195 can support Nano SIM card, Micro SIM card, SIM card and so on. Multiple cards can be inserted into the same SIM card interface 195 at the same time. The types of the plurality of cards may be the same or different.
  • the SIM card interface 195 can also be compatible with different types of SIM cards.
  • the SIM card interface 195 is also compatible with external memory cards.
  • the terminal device 100 interacts with the network through the SIM card to realize functions such as calls and data communication.
  • the terminal device 100 adopts an eSIM, that is, an embedded SIM card.
  • the eSIM card can be embedded in the terminal device 100 and cannot be separated from the terminal device 100 .
  • the software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of the terminal device 100 .
  • FIG. 2 is a block diagram of a software structure of a terminal device 100 according to an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and so on.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include window managers, content providers, view systems, telephony managers, resource managers, notification managers, and the like.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, a phone book, and the like.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the telephony manager is used to provide the communication function of the terminal device 100 .
  • for example, the management of call status (including connecting, hanging up, and the like).
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the terminal device vibrates, and the indicator light flashes.
  • Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core libraries of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
  • a system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.
  • when the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer.
  • the kernel layer processes touch operations into raw input events (including touch coordinates, timestamps of touch operations, etc.). Raw input events are stored at the kernel layer.
  • the application framework layer obtains the original input event from the kernel layer and identifies the control corresponding to the input event. Taking the example that the touch operation is a click operation and the corresponding control is the camera application icon, the camera application calls the interface of the application framework layer to start the camera application, and then starts the camera driver by calling the kernel layer.
  • the camera 193 captures still images or video.
  • the method can obtain the registered voice input by the user in a terminal device, and can convert the registered voice according to the channel model corresponding to each terminal device to obtain the registered voice corresponding to each of the user's other terminal devices, so as to register the voiceprint of the user's other terminal devices and obtain a voiceprint template corresponding to each terminal device.
  • the channel model corresponding to a terminal device may be the channel model of the terminal device relative to the original voice signal, which represents the mapping relationship between the voice signal obtained by the terminal device and the original voice signal, where the original voice signal is a voice signal without channel information attached by any terminal device. Alternatively, it may be a channel model between two terminal devices, which represents the mapping relationship between the voice signals obtained by the two terminal devices.
  • a channel model corresponding to a terminal device may be constructed based on a frequency response curve of a voice signal, or a channel model corresponding to a terminal device may be constructed based on a spectral feature of the voice signal.
  • when establishing the channel model of a terminal device relative to the original voice signal, an original frequency sweep signal can be played to the terminal device, the frequency response curve St of the sound signal received by the terminal device can be measured, and the frequency response curve S of the original frequency sweep signal can be measured; then each frequency response gain value can be calculated from the frequency response curves St and S, and the channel model of the terminal device relative to the original voice signal can be established from these gain values. The original frequency sweep signal may be the original sound signal output by a frequency sweep signal generator, and each frequency response gain value may be the ratio between the values of St and S at the same frequency; that is, the channel model of the terminal device relative to the original voice signal can be St/S.
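As a minimal sketch of this ratio-based construction (the variable names and the `eps` guard are illustrative, not from the patent), the per-frequency gain can be computed directly from the two measured responses:

```python
import numpy as np

def estimate_channel_model(measured_response, source_response, eps=1e-12):
    """Ratio-style channel model H = St / S.

    measured_response: magnitude response St of the sweep as received
    by the terminal device's microphone.
    source_response: magnitude response S of the original sweep signal.
    Both are 1-D arrays sampled at the same frequency bins.
    """
    return measured_response / (source_response + eps)
```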
  • alternatively, the original voice signal can be played to the terminal device, the voice signal received by the terminal device can be obtained, and the original voice signal and the received voice signal can be fed into a preset neural network model; the neural network model can separately extract the spectral feature A corresponding to the original voice signal and the spectral feature B corresponding to the received voice signal, and learn the mapping relationship between spectral feature A and spectral feature B, thereby obtaining the channel model of the terminal device relative to the original voice signal.
  • the original voice signal may be a voice signal collected by a standard sound acquisition device (such as a microphone), and the preset neural network model may be a neural network model obtained by training on a large number of voice signal data pairs, where each pair includes an original voice signal and the voice signal received via the terminal device.
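A toy version of such a spectral-mapping network, written as a hedged sketch (the patent does not specify an architecture; the MLP, loss, and bin count below are assumptions), could be trained on (original, device-received) spectral feature pairs as follows:

```python
import torch
import torch.nn as nn

class ChannelMapper(nn.Module):
    """Illustrative MLP that maps a source spectrum to the spectrum the
    device would receive, i.e. it learns the channel model implicitly."""
    def __init__(self, n_bins=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, 512), nn.ReLU(),
            nn.Linear(512, n_bins),
        )

    def forward(self, source_spectrum):
        return self.net(source_spectrum)

def train_step(model, optimizer, source_spec, device_spec):
    """One gradient step on a pair of spectral features."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(source_spec), device_spec)
    loss.backward()
    optimizer.step()
    return loss.item()
```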
  • when establishing a channel model between two terminal devices, the same original frequency sweep signal can be played to the first terminal device and the second terminal device respectively, and the frequency response curve St1 of the sound signal received by the first terminal device and the frequency response curve St2 of the sound signal received by the second terminal device can be measured; then each frequency response gain value can be calculated from St1 and St2, and the channel model of the first terminal device relative to the second terminal device, and/or the channel model of the second terminal device relative to the first terminal device, can be established from these gain values. For example, the channel model of the first terminal device relative to the second terminal device may be St1/St2, and the channel model of the second terminal device relative to the first terminal device may be St2/St1.
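Under the same illustrative conventions as the earlier sketch (stand-in data, names not from the patent), the device-to-device model is again just a ratio, and applying it maps a spectrum captured by one device onto the other:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 257

# Stand-in measured responses of the two devices to the same sweep.
st1 = rng.uniform(0.5, 1.5, n_bins)  # first terminal device
st2 = rng.uniform(0.5, 1.5, n_bins)  # second terminal device

h12 = st1 / st2  # model of the first device relative to the second

# Convert a magnitude spectrum captured by the second device into an
# approximation of what the first device would have captured.
spectrum_dev2 = rng.uniform(0.0, 1.0, n_bins)
spectrum_dev1_approx = h12 * spectrum_dev2
```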
  • alternatively, the original voice signal can be played to the first terminal device and the second terminal device respectively, the voice signal C received by the first terminal device and the voice signal D received by the second terminal device can be obtained, and the voice signals C and D can be fed into a preset neural network model; the neural network model can separately extract the spectral feature C corresponding to voice signal C and the spectral feature D corresponding to voice signal D, and learn the mapping relationship between spectral feature C and spectral feature D, thereby obtaining the channel model of the first terminal device relative to the second terminal device and/or the channel model of the second terminal device relative to the first terminal device.
  • the original voice signal may be a voice signal collected by a standard sound acquisition device (such as a microphone), and the preset neural network model may be a neural network model obtained by training on a large number of voice signal data pairs, where each pair includes the voice signal corresponding to the first terminal device and the voice signal corresponding to the second terminal device.
  • the voiceprint template corresponding to each terminal device may be generated according to the voiceprint recognition model corresponding to each terminal device. Therefore, in this embodiment of the present application, a voiceprint recognition model corresponding to each terminal device can be obtained by training in advance, so as to generate a voiceprint template corresponding to each terminal device according to the voiceprint recognition model corresponding to each terminal device and the registered voice.
  • the voiceprint template may be a feature vector output by the voiceprint recognition model, etc., that is, the voiceprint template may be a feature vector composed of voiceprint features extracted from the registered speech by the voiceprint recognition model, and the like.
  • the voiceprint recognition model may be a voiceprint recognition model based on a Gaussian mixture model-universal background model (GMM-UBM), a voiceprint recognition model based on a support vector machine (SVM), a voiceprint recognition model based on joint factor analysis (JFA), a voiceprint recognition model based on the full factor space (identity vector, i-vector), or a voiceprint recognition model based on a time-delay neural network (TDNN).
  • the voiceprint recognition models corresponding to each terminal device may be the same or different.
  • the voiceprint recognition models corresponding to each terminal device may all be voiceprint recognition models based on GMM-UBM, or may all be voiceprint recognition models based on TDNN.
  • alternatively, the voiceprint recognition model corresponding to terminal device A may be a GMM-UBM-based voiceprint recognition model, the voiceprint recognition model corresponding to terminal device B may be an SVM-based voiceprint recognition model, the voiceprint recognition model corresponding to terminal device C may be a JFA-based voiceprint recognition model, and so on.
  • voice features may be extracted from the registered voice, where the extracted voice features may be Mel-frequency cepstral coefficients (MFCC) or filter bank (FBank) features. Then, the extracted voice features can be processed through the voiceprint recognition model to obtain a voiceprint template corresponding to the registered voice.
  • for example, the speech features can be processed by a GMM-UBM model to obtain a Gaussian mean supervector as the voiceprint template corresponding to the registered speech; or the speech features can be processed by an i-vector model to obtain an i-vector as the voiceprint template; or the speech features can be processed through a deep neural network (DNN) to obtain a d-vector as the voiceprint template; or the speech features can be processed through a TDNN network to obtain an x-vector as the voiceprint template corresponding to the registered voice, and so on.
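As a hedged illustration of this features-then-template flow (assuming the librosa library is available; the mean-pooled vector below is a simplistic stand-in for a trained GMM-UBM, i-vector, d-vector, or x-vector extractor):

```python
import numpy as np
import librosa

def make_voiceprint_template(wav_path, n_mfcc=20):
    """Toy voiceprint template: MFCC features mean-pooled over time.

    A real system would feed the features to a trained voiceprint
    recognition model instead of simply averaging them.
    """
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    template = mfcc.mean(axis=1)
    return template / (np.linalg.norm(template) + 1e-12)  # unit-normalize
```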
  • the voiceprint recognition model corresponding to each terminal device may be obtained by training based on the training voice set obtained by each terminal device.
  • for example, the voiceprint recognition model corresponding to terminal device A can be trained using the training voice set A obtained by terminal device A, the voiceprint recognition model corresponding to terminal device B can be trained using the training voice set B obtained by terminal device B, the voiceprint recognition model corresponding to terminal device C can be trained using the training voice set C obtained by terminal device C, and so on.
  • the training speech acquired by the terminal device is appended with channel information corresponding to the terminal device.
  • each voiceprint recognition model is trained on the training voice set obtained by the corresponding terminal device, so that each voiceprint recognition model is better matched with its terminal device, improving the accuracy of the voiceprint template corresponding to each terminal device and thereby improving the recognition accuracy of voiceprint recognition on each terminal device.
  • the embodiments of the present application do not specifically limit the training process of the voiceprint recognition model.
  • the existing training method may be used to train the voiceprint recognition model.
  • FIG. 3 is a schematic diagram of an application scenario of the cross-device voiceprint registration method provided in Embodiment 1 of the present application.
  • the application scenario may include multiple terminal devices 100 , and the terminal devices 100 may be interconnected through short-range communication or network.
  • each terminal device 100 or a storage device communicatively connected to each terminal device 100 stores the channel model of the terminal device relative to the original voice signal, or stores the channel model between the terminal device and any other terminal device.
  • the terminal device 100 is a terminal device having a function of cross-device voiceprint registration.
  • when performing cross-device voiceprint registration, the first terminal device may obtain the first registered voice input by the user into the first terminal device; each second terminal device can then convert the first registered voice using the channel model of the first terminal device relative to the original voice signal and the channel model of the second terminal device relative to the original voice signal, or using the channel model between the first terminal device and each second terminal device, to obtain the second registered voice corresponding to each second terminal device, so as to generate the voiceprint template corresponding to each second terminal device according to each second registered voice and reduce the number of voice inputs for voiceprint registration across multiple terminal devices.
  • the cross-device voiceprint registration function of any terminal device 100 can be manually enabled by the user.
  • the terminal device 100 may display an icon and/or a menu bar for cross-device voiceprint registration for the user to manually enable the cross-device voiceprint registration function of the terminal device 100 .
  • for example, the terminal device 100 may display an icon for cross-device voiceprint registration in a shortcut control interface; the shortcut control interface may also include icons for conventional functions such as Bluetooth, airplane mode, mobile data, wireless local area network, flashlight, and brightness, to enable quick operation of these functions.
  • for another example, the terminal device 100 may display a menu bar for cross-device voiceprint registration in a settings interface; the settings interface may also include setting menu bars for other general functions such as Bluetooth, airplane mode, mobile data, wireless local area network, and brightness, to enable setting operations for these functions.
  • when the terminal device 100 detects a user's related operation on the shortcut control interface or the settings interface (for example, a touch operation or a click operation), it can determine that the cross-device voiceprint registration function of the terminal device 100 needs to be enabled; the terminal device 100 can then directly enable the function, or ask the user through a pop-up window whether to confirm enabling it. For example, the terminal device 100 detects that the icon for cross-device voiceprint registration is lit as shown in (a) of FIG. 4, or detects that the cross-device voiceprint registration option in the menu bar shown in (b) of FIG. 4 is turned on.
  • in this case, the terminal device 100 can directly enable the cross-device voiceprint registration function, or can display a pop-up window as shown in (c) of FIG. 4, showing the query "Are you sure to enable cross-device voiceprint registration?" together with "Yes" and "No" selection keys.
  • when the terminal device 100 detects that the user clicks "Yes", the terminal device 100 may enable the cross-device voiceprint registration function.
  • in other embodiments, the terminal device 100 may ask the user through a pop-up window whether to enable the cross-device voiceprint registration function. For example, when the terminal device 100 detects that the user is inputting a registration voice into the terminal device 100 to register a voiceprint on the terminal device 100, the terminal device 100 may display a pop-up window as shown in FIG. 5; when the terminal device 100 detects that the user clicks "Yes", the terminal device 100 can enable the cross-device voiceprint registration function.
  • the terminal device 100 can provide a voiceprint registration management interface as shown in (a) in FIG. 6 for the user to select a terminal device for cross-device voiceprint registration .
  • the voiceprint registration management interface may include: a selected device bar 60, a candidate device bar 61, and an add control 62 for adding a new device.
  • the selected device column 60 is used to display the terminal devices selected by the user for cross-device voiceprint registration. If no terminal device has been selected, a prompt message indicating this can be displayed, for example, the prompt message "No terminal device has been selected" shown in (a) of FIG. 6.
  • the candidate device column 61 is used to display terminal devices with cross-device voiceprint registration function.
  • the terminal devices displayed in the candidate device column 61 can be associated with the user's account; after the user logs in to the account, all terminal devices with the cross-device voiceprint registration function associated with the account can be displayed in the candidate device column 61. It should be understood that what is displayed in the selected device column 60 and the candidate device column 61 may be the device name and/or device identification of each terminal device, for example, the device names and/or device identifications "AAA", "BBB", "CCC", and "DDD" shown in FIG. 6 and FIG. 7.
  • the user may directly select the device name and/or device identifier in the candidate device column 61 to select a terminal device for cross-device voiceprint registration.
  • after the user selects a device name and/or device identification in the candidate device column 61, the terminal device 100 may directly add it to the selected device column 60; alternatively, the terminal device 100 may ask the user through a pop-up window whether to confirm the selection of this terminal device. For example, as shown in (a) of FIG. 6, when the terminal device 100 detects that the user performs a selection operation (e.g., a click operation or a touch operation) on the device name "AAA" in the candidate device column 61, the terminal device 100 may display a pop-up window as shown in (b) of FIG. 6, showing the query "Are you sure to select the terminal device AAA?" together with "Confirm" and "Cancel" selection keys; when the terminal device 100 detects that the user clicks "Confirm", as shown in (c) of FIG. 6, the device name "AAA" is added to the selected device column 60.
  • similarly, to delete a terminal device, the user can select the device name and/or device identification to be deleted in the selected device column 60 (for example, through a click operation or a touch operation), and the terminal device 100 can directly delete the device name and/or device identification from the selected device column 60; alternatively, the user may be asked through a pop-up window whether to confirm deleting the terminal device.
  • for example, when the terminal device 100 detects that the user selects the device name "BBB" in the selected device column 60, the terminal device 100 may display a pop-up window as shown in (b) of FIG. 7, showing the query "Are you sure to delete the terminal device BBB?" together with "Confirm" and "Cancel" selection keys.
  • when the terminal device 100 detects that the user clicks "Confirm", the terminal device 100 can delete the device name "BBB" from the selected device column 60, and can add the device name "BBB" to the candidate device column 61 to facilitate re-selection by the user.
  • the added terminal device may be a terminal device that has enabled the cross-device voiceprint registration function, or may be a terminal device that has not enabled the cross-device voiceprint registration function.
  • the terminal device 100 may directly add the device name and/or device identifier of the added terminal device to the selected device column 60 .
  • if the added terminal device has not enabled the cross-device voiceprint registration function, the terminal device 100 may send an enabling request to the added terminal device, and a relevant pop-up window may pop up on the added terminal device to prompt the user to enable the cross-device voiceprint registration function of the added terminal device.
  • after the added terminal device enables the cross-device voiceprint registration function, the terminal device 100 can add the device name and/or device identification of the added terminal device to the selected device column 60.
  • FIG. 8 shows a schematic flowchart of a cross-device voiceprint registration method provided by this embodiment. As shown in Figure 8, the method may include:
  • S801: A user inputs a first registration voice to a first terminal device.
  • S802: The first terminal device generates a voiceprint template corresponding to the first terminal device according to the first registered voice.
  • S803: The first terminal device sends the first registration voice to the second terminal device.
  • the user can input the first registration voice into the first terminal device.
  • the first registered voice refers to the voice received by the first terminal device, that is, the first registered voice is appended with channel information corresponding to the first terminal device.
  • after obtaining the first registered voice, the first terminal device can process the first registered voice according to the voiceprint recognition model corresponding to the first terminal device to obtain the voiceprint template corresponding to the first registered voice, so as to complete the voiceprint registration of the first terminal device. The first terminal device can also send the first registration voice to each second terminal device, and each second terminal device can obtain its corresponding second registration voice according to the first registration voice, so as to perform voiceprint registration on each second terminal device.
  • the first terminal device can also obtain the first registration voice from the user's daily voice interaction with the first terminal device, so as to perform voiceprint registration of each terminal device in a self-learning, registration-free manner, thereby simplifying the operation process of voiceprint registration and improving user experience.
  • specifically, the first terminal device can screen out the user's voice through self-learning, for example, through a clustering method, and can select a voice with good quality from the screened voices as the first registered voice to perform voiceprint registration on each terminal device. The voice with good quality can be selected by evaluating the signal-to-noise ratio and/or the speech energy level of each voice, as in the sketch below.
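A minimal sketch of such quality screening (the thresholds and helper names below are illustrative assumptions, not values from the patent):

```python
import numpy as np

def signal_energy(x):
    """Mean power of a mono waveform."""
    return float(np.mean(x ** 2))

def snr_db(x, noise_floor_power):
    """Crude SNR estimate against a measured noise-floor power."""
    return 10.0 * np.log10(signal_energy(x) / (noise_floor_power + 1e-12))

def pick_registration_voice(candidates, noise_floor_power,
                            min_snr_db=20.0, min_energy=1e-4):
    """Return the highest-SNR candidate passing both quality gates."""
    good = [c for c in candidates
            if snr_db(c, noise_floor_power) >= min_snr_db
            and signal_energy(c) >= min_energy]
    return max(good, key=lambda c: snr_db(c, noise_floor_power)) if good else None
```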
  • S804: The second terminal device converts the first registered voice according to the first channel model of the first terminal device relative to the original voice signal and the second channel model of the second terminal device relative to the original voice signal, or according to the channel model between the first terminal device and the second terminal device, to obtain the second registered voice.
  • S805: The second terminal device generates a voiceprint template corresponding to the second terminal device according to the second registered voice.
  • when the channel model is the channel model of a terminal device relative to the original voice signal, after the second terminal device receives the first registration voice sent by the first terminal device, it can determine the first channel model of the first terminal device relative to the original voice signal according to the device name and/or device identification of the first terminal device, and determine the second channel model of the second terminal device relative to the original voice signal according to the device name and/or device identification of the second terminal device.
  • the channel information corresponding to the first terminal device in the first registered voice can be removed by using the first channel model, so as to obtain the original voice that does not contain the channel information.
  • then, the channel information corresponding to the second terminal device can be added to the original voice through the second channel model, so as to obtain the second registered voice including the channel information corresponding to the second terminal device, so that the voiceprint template corresponding to the second terminal device can be generated according to the second registered voice.
  • the second terminal device may perform frequency domain conversion on the first registered voice to obtain a frequency domain signal St1' corresponding to the first registered voice.
  • the frequency domain transformation of the first registered voice may be performed through a fast Fourier transform (fast fourier transform, FFT) to obtain a frequency domain signal St1' corresponding to the first registered voice.
  • then, the frequency domain signal St1' can be converted through the first channel model of the first terminal device relative to the original voice signal, that is, the channel information corresponding to the first terminal device can be removed from the frequency domain signal St1' according to the mapping relationship between the voice signal corresponding to the first terminal device and the original voice signal, to obtain the original frequency domain signal S'.
  • next, the original frequency domain signal S' can be converted through the second channel model of the second terminal device relative to the original voice signal, that is, the channel information corresponding to the second terminal device can be added to the original frequency domain signal S' according to the mapping relationship between the voice signal corresponding to the second terminal device and the original voice signal, to obtain the frequency domain signal St2' corresponding to the second terminal device.
  • finally, an inverse FFT can be performed on the frequency domain signal St2' to obtain the second registered voice corresponding to the second terminal device.
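A hedged sketch of this FFT-based conversion, reusing the ratio-style channel models from the earlier sketches (`h1` and `h2` are illustrative names for the first and second channel models sampled at the rfft bins; a real implementation would likely work frame-by-frame on a short-time spectrum):

```python
import numpy as np

def convert_registered_voice(voice, h1, h2, eps=1e-12):
    """Approximate the first registered voice as the second device
    would have recorded it.

    voice: time-domain waveform captured by the first terminal device.
    h1, h2: frequency responses of the first and second devices
    relative to the original voice signal.
    """
    st1 = np.fft.rfft(voice)    # frequency domain signal St1'
    s = st1 / (h1 + eps)        # remove device-1 channel -> original S'
    st2 = s * h2                # add device-2 channel -> St2'
    return np.fft.irfft(st2, n=len(voice))  # second registered voice

# With a between-device model of the second device relative to the first,
# h21 = h2 / h1, the conversion collapses to a single step:
# np.fft.irfft(np.fft.rfft(voice) * h21, n=len(voice))
```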
  • in other embodiments, when the channel model is a channel model between terminal devices, after receiving the first registration voice sent by the first terminal device, the second terminal device can determine the channel model between the first terminal device and the second terminal device according to the device names and/or device identifications of the two devices, and can directly convert the first registered voice into the second registered voice corresponding to the second terminal device according to this channel model. This reduces the number of conversions of the registered voice, thereby reducing the information lost during conversion and improving the accuracy of the voiceprint template generated based on the second registered voice.
  • the second terminal device may perform frequency domain conversion on the first registered voice to obtain a frequency domain signal St1' corresponding to the first registered voice. Then, the frequency domain signal St1' can be directly converted into the frequency domain signal St2' corresponding to the second terminal device according to the mapping relationship between the voice signal corresponding to the first terminal device and the voice signal corresponding to the second terminal device. Finally, inverse FFT transformation can be performed on the frequency domain signal St2' to obtain the second registered voice corresponding to the second terminal device.
  • it should be noted that the first terminal device may also directly send the voice features corresponding to the first registered voice to each second terminal device, and each second terminal device may obtain its corresponding second registration voice according to these voice features, so as to register the voiceprint of each second terminal device. That is, the first terminal device can directly send the voice features obtained after frequency domain conversion of the first registered voice to each second terminal device, so that each second terminal device can skip the frequency domain conversion process and improve its processing performance.
  • the conversion process of converting the first registered voice to obtain the second registered voice in this embodiment may also be performed by the first terminal device.
  • specifically, the first terminal device can remove the channel information from the first registered voice through the first channel model of the first terminal device relative to the original voice signal, so as to obtain the original voice without channel information; then, the channel information corresponding to the second terminal device can be added to the original voice through the second channel model of the second terminal device relative to the original voice signal, so as to obtain the second registered voice including the channel information corresponding to the second terminal device, and the second registered voice can be sent to the second terminal device.
  • alternatively, the first terminal device can directly convert the first registration voice into the second registration voice corresponding to the second terminal device through the channel model between the first terminal device and the second terminal device, and can send the second registration voice to the second terminal device.
  • in addition, the conversion process of the registered voice can also be split between the first terminal device and the second terminal device. That is, after acquiring the first registered voice input by the user, the first terminal device can remove the channel information from the first registered voice through the first channel model of the first terminal device relative to the original voice signal to obtain the original voice, and can send the original voice to the second terminal device. After receiving the original voice, the second terminal device can add channel information to the original voice through the second channel model of the second terminal device relative to the original voice signal, so as to obtain the second registration voice including the channel information corresponding to the second terminal device.
  • the user can directly use the voiceprint recognition function of the second terminal device without performing voiceprint registration in the second terminal device.
  • specifically, the user can directly input an authentication voice into the second terminal device; after receiving the authentication voice, the second terminal device can obtain the feature vector (i.e., the authentication template) corresponding to the authentication voice through the voiceprint recognition model, and calculate the similarity between the obtained feature vector and the voiceprint template in the second terminal device, so as to identify the user according to the similarity and a preset first similarity threshold.
  • the first similarity threshold may be specifically set according to the actual situation, for example, the first similarity threshold may be set to 70%.
  • This embodiment does not limit the algorithm for calculating the similarity between the feature vector corresponding to the authentication voice and the voiceprint template.
  • the similarity between the feature vector corresponding to the authentication voice and the voiceprint template can be calculated by any one of algorithms such as cosine distance scoring (CDS), linear discriminant analysis (LDA), and probabilistic linear discriminant analysis (PLDA).
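For instance, a cosine-distance check against the stored template might look like this sketch (the 0.70 threshold mirrors the example first similarity threshold above; the function names are illustrative):

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two voiceprint vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def verify_speaker(auth_vector, voiceprint_template, threshold=0.70):
    """Accept the speaker if the similarity reaches the first threshold."""
    return cosine_similarity(auth_vector, voiceprint_template) >= threshold
```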
  • in addition, when the terminal device determines that the similarity between the authentication voice and the voiceprint template in the terminal device is greater than or equal to a preset second similarity threshold, the terminal device can determine the authentication voice to be a high-quality voice sample, and can use the high-quality voice sample to perform incremental learning on the voiceprint recognition model corresponding to the terminal device, so as to update that model.
  • the terminal device can also generate high-quality voice samples corresponding to other terminal devices according to the authentication voice, so that the other terminal devices can perform incremental learning on their corresponding voiceprint recognition models according to these high-quality voice samples and update them. In this way, high-quality authentication voices can be obtained during the user's daily use to update the voiceprint recognition model corresponding to each terminal device, improving the match between each model and the actual usage scenario and the robustness of voiceprint recognition, thereby improving the accuracy of voiceprint recognition on each terminal device.
  • the second similarity threshold may be specifically set according to the actual situation, and the second similarity threshold may be greater than or equal to the first similarity threshold.
  • the second similarity threshold may be set to 90%.
  • during incremental learning, the high-quality voice samples may be trained jointly with the original training data corresponding to each terminal device in a weighted manner, so as to update the voiceprint recognition model corresponding to each terminal device.
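One hedged way to realize such weighted joint training (the oversampling weight and the torch-based data plumbing are illustrative assumptions):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_incremental_loader(original_ds, high_quality_ds,
                            hq_weight=3.0, batch_size=32):
    """Mix the original training data with new high-quality samples,
    drawing the new samples more often via per-sample weights."""
    combined = ConcatDataset([original_ds, high_quality_ds])
    weights = [1.0] * len(original_ds) + [hq_weight] * len(high_quality_ds)
    sampler = WeightedRandomSampler(weights, num_samples=len(weights))
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```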
  • through the above method, the registration voice acquired by the first terminal device can be converted to generate the registration voices corresponding to each second terminal device, so as to register the voiceprint of each second terminal device. Thus, one registration voice input can be used to perform voiceprint registration on multiple terminal devices, reducing the number of voice inputs required for voiceprint registration of multiple terminal devices and improving user experience.
  • however, the method provided by the above Embodiment 1 needs to perform the registration voice conversion processing on the first terminal device and/or the second terminal device, which greatly increases the computation load of the first terminal device and/or the second terminal device and affects their usage performance.
  • FIG. 9 shows a schematic diagram of an application scenario of the cross-device voiceprint registration method provided by Embodiment 2 of the present application.
  • the application scenario may include multiple terminal devices 100 and a server 90, where the server 90 may be a cloud server or a control center, so that the registration voice conversion processing is performed by the server, thereby reducing the computation load of the first terminal device and/or the second terminal device and ensuring their usage performance.
  • the server 90 may communicate with each terminal device 100 respectively.
  • the server 90 or the storage device communicatively connected to the server 90 may store the channel model of each terminal device 100 relative to the original voice signal, or store the channel model between any two terminal devices 100 .
  • the first terminal device may send the first registration voice input by the user to the server 90 .
  • the server 90 can convert the first registered voice according to the first channel model of the first terminal device relative to the original voice signal and the second channel model of each second terminal device relative to the original voice signal, to obtain the second registered voice corresponding to each second terminal device; alternatively, the server 90 may convert the first registered voice according to the channel model between the first terminal device and each second terminal device to obtain the second registered voice corresponding to each second terminal device. A voiceprint template corresponding to each second terminal device can then be generated according to each second registered voice, so as to reduce the number of voice inputs for voiceprint registration of multiple terminal devices.
  • FIG. 10 shows a schematic flowchart of a method for cross-device voiceprint registration provided by this embodiment. As shown in Figure 10, the method may include:
  • S1001: A user inputs a first registration voice to a first terminal device.
  • S1002: The first terminal device generates a voiceprint template corresponding to the first terminal device according to the first registered voice.
  • S1003: The first terminal device sends the first registration voice to the server.
  • the user can input the first registration voice into the first terminal device.
  • the first registered voice refers to the voice received by the first terminal device, that is, the first registered voice is appended with channel information corresponding to the first terminal device.
  • after obtaining the first registered voice, the first terminal device can process the first registered voice through the voiceprint recognition model corresponding to the first terminal device to obtain the voiceprint template corresponding to the first registered voice, so as to complete the voiceprint registration of the first terminal device. The first terminal device may also send the first registration voice to the server 90, so that the server 90 obtains the registration voices corresponding to the other terminal devices according to the first registration voice and performs voiceprint registration on the other terminal devices.
S1004: The server performs conversion processing on the first registered voice according to the first channel model of the first terminal device relative to the original voice signal and the second channel model of the second terminal device relative to the original voice signal, or according to the channel model between the first terminal device and the second terminal device, to obtain the second registered voice.
S1005: The server sends the second registration voice to the second terminal device.
Exemplarily, when the channel model is a channel model of a terminal device relative to the original voice signal, after receiving the first registration voice sent by the first terminal device, the server can determine the first channel model of the first terminal device relative to the original voice signal according to the device name and/or device identifier of the first terminal device, and determine the second channel model of the second terminal device relative to the original voice signal according to the device name and/or device identifier of the second terminal device. The channel information corresponding to the first terminal device can then be removed from the first registered voice according to the first channel model, so as to obtain the original voice without channel information. Subsequently, the channel information corresponding to each second terminal device can be added to the original voice according to the second channel model, so as to obtain the second registration voice containing the channel information corresponding to that second terminal device, and each second registration voice can be sent to the corresponding second terminal device.
Exemplarily, when the channel model is a channel model between terminal devices, after receiving the first registration voice sent by the first terminal device, the server can determine the channel model between the first terminal device and the second terminal device according to the device names and/or device identifiers of the first and second terminal devices, and can directly convert the first registered voice into the second registered voice corresponding to the second terminal device according to that channel model. This reduces the number of conversions of the registered voice and the information loss during conversion, thereby improving the accuracy of the voiceprint template generated based on the second registered voice.
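For reference, the application builds these channel models from measured frequency response curves: the same original swept-frequency signal is played to the devices, and per-frequency gain ratios are taken (St/S for a device model relative to the original signal, St2/St1 for a model mapping the first device's voice to the second device's). Below is a minimal sketch, assuming all responses are measured on a common frequency grid; the epsilon guard is an added safety assumption, not part of the application.

```python
import numpy as np

def device_model_from_responses(device_response, original_response, eps=1e-12):
    """Per-device channel model St/S: per-frequency gain of the device's
    received signal relative to the original swept-frequency signal."""
    return np.asarray(device_response) / (np.asarray(original_response) + eps)

def pair_model_from_responses(first_response, second_response, eps=1e-12):
    """Inter-device channel model St2/St1: multiplying a first-device
    spectrum by this gain approximates the second device's spectrum."""
    return np.asarray(second_response) / (np.asarray(first_response) + eps)
```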
S1006: The second terminal device generates a voiceprint template corresponding to the second terminal device according to the second registered voice.
Here, after each second terminal device receives its corresponding second registered voice, it can process that second registered voice according to the voiceprint recognition model corresponding to the second terminal device to obtain the voiceprint template corresponding to the second registered voice, that is, extract voiceprint features from each second registered voice to obtain the voiceprint template corresponding to each second terminal device. For example, after receiving the second registered voice A, the second terminal device A can extract the voiceprint features of the second registered voice A according to the voiceprint recognition model A corresponding to the second terminal device A, to obtain the voiceprint template A corresponding to the second terminal device A; after the second terminal device B receives the second registered voice B, it can extract the voiceprint features of the second registered voice B according to the voiceprint recognition model B corresponding to the second terminal device B, to obtain the voiceprint template B corresponding to the second terminal device B; and so on. The second registered voice A is the voice to which the channel information corresponding to the second terminal device A has been added, and the second registered voice B is the voice to which the channel information corresponding to the second terminal device B has been added.
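As a loose sketch of this step, not the application's actual model, the code below turns a received second registered voice into a template vector. It computes simple log-power spectral features per frame and hands them to an embedding function that stands in for the device's voiceprint recognition model (e.g., the GMM-UBM or x-vector variants mentioned elsewhere in this application); make_voiceprint_template, embed_fn, and the frame parameters are illustrative assumptions.

```python
import numpy as np

def make_voiceprint_template(voice, embed_fn, frame_len=400, hop=160):
    """Turn a second registered voice into a voiceprint template vector.

    embed_fn is assumed to map a (frames x bins) feature array to a
    fixed-size embedding; a log-power spectrogram is used here as a
    simple stand-in for front-end features such as MFCC or FBank.
    """
    frames = []
    for start in range(0, len(voice) - frame_len + 1, hop):
        frame = voice[start:start + frame_len] * np.hanning(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2
        frames.append(np.log(power + 1e-10))
    features = np.stack(frames)                  # (num_frames, num_bins)
    template = np.asarray(embed_fn(features))    # voiceprint feature vector
    return template / np.linalg.norm(template)   # unit norm for scoring
```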
In one example, to further reduce the calculation amount of each second terminal device, the server 90 may also directly generate the voiceprint template corresponding to each second terminal device. A voiceprint recognition model corresponding to each terminal device may also be stored in the server 90 or in a storage device communicatively connected to the server 90. After the server 90 obtains the second registered voice corresponding to each second terminal device, the server 90 can obtain the voiceprint recognition model corresponding to each second terminal device according to the device name and/or device identifier of that second terminal device, process each second registered voice according to the corresponding voiceprint recognition model to obtain the voiceprint template corresponding to each second terminal device, and send each voiceprint template to the corresponding second terminal device.
It should be noted that, in the process of voiceprint recognition performed by a terminal device (including the above-mentioned first terminal device and second terminal devices), when the terminal device determines that the similarity between the authentication voice and the voiceprint template in the terminal device is greater than or equal to a preset second similarity threshold, the terminal device may send the authentication voice to the server 90. The server 90 may use the authentication voice to perform incremental learning on the voiceprint recognition model corresponding to the terminal device so as to update that model, and may send the updated voiceprint recognition model back to the terminal device. At the same time, the server 90 can also generate training voices for the other terminal devices according to the authentication voice, perform incremental learning on the voiceprint recognition models corresponding to the other terminal devices so as to update them, and send each updated voiceprint recognition model to the corresponding terminal device. That is, in this embodiment, the server can obtain high-quality authentication voices from the user's daily use to update the voiceprint recognition model corresponding to each terminal device, improving the match between each model and the actual usage scenario and the robustness of voiceprint recognition in each terminal device, thereby improving the accuracy of voiceprint recognition on each terminal device.
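The gating described here can be sketched as follows: a cosine similarity score (one of the scoring options this application mentions) between the authentication template and the stored voiceprint template decides whether the authentication voice counts as a high-quality sample and is forwarded to the server for incremental learning. The 0.90 threshold and the upload_fn hook are illustrative assumptions, not values fixed by the application.

```python
import numpy as np

SECOND_SIMILARITY_THRESHOLD = 0.90  # illustrative; the application leaves this configurable

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def maybe_collect_for_update(auth_template, stored_template, auth_voice, upload_fn):
    """Forward a high-quality authentication voice to the server, which then
    performs incremental learning and pushes updated models back to devices."""
    if cosine_similarity(auth_template, stored_template) >= SECOND_SIMILARITY_THRESHOLD:
        upload_fn(auth_voice)  # upload_fn stands in for the device-to-server transfer
```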
In this embodiment, the server performs the registration voice conversion processing to generate the registration voices corresponding to the other terminal devices. This not only achieves the purpose of registering voiceprints on multiple terminal devices with a single input of a registration voice, reducing the number of voice inputs for multi-device voiceprint registration and improving the user experience, but also reduces the calculation amount of each terminal device, ensuring the usage performance of each terminal device.
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
As shown in FIG. 11, the electronic device 11 of this embodiment includes: at least one processor 1100 (only one is shown in FIG. 11), a memory 1101, and a computer program 1102 that is stored in the memory 1101 and can run on the at least one processor 1100. When the processor 1100 executes the computer program 1102, the electronic device 11 implements the steps in any of the above embodiments of the cross-device voiceprint registration method.
The electronic device 11 may be a terminal device or a server. The electronic device 11 may include, but is not limited to, the processor 1100 and the memory 1101.
Those skilled in the art can understand that FIG. 11 is only an example of the electronic device 11 and does not constitute a limitation on the electronic device 11; the electronic device 11 may include more or fewer components than shown, combine some components, or have different components, and may, for example, also include input and output devices, network access devices, and the like.
The processor 1100 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In some embodiments, the memory 1101 may be an internal storage unit of the electronic device 11, such as a hard disk or memory of the electronic device 11. In other embodiments, the memory 1101 may also be an external storage device of the electronic device 11, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 11. Further, the memory 1101 may also include both an internal storage unit of the electronic device 11 and an external storage device.
The memory 1101 is used to store the operating system, application programs, a bootloader, data, and other programs, such as the program code of the computer program. The memory 1101 may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program; when the computer program is executed by a computer, the computer is enabled to implement the steps in the foregoing method embodiments.
Embodiments of the present application further provide a computer program product; when the computer program product runs on an electronic device, the electronic device is enabled to implement the steps in the foregoing method embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present application implements all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through a computer program, and the computer program may be stored in a computer-readable storage medium. The computer program includes computer program code, and the computer program code may be in the form of source code, object code, an executable file, some intermediate form, or the like.
The computer-readable storage medium may include at least: any entity or apparatus capable of carrying the computer program code to an apparatus/electronic device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable storage media may not be electrical carrier signals or telecommunication signals.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other manners. For example, the apparatus/electronic device embodiments described above are only illustrative. For instance, the division into modules or units is only a division by logical function; in actual implementation there may be other division methods, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or in other forms. The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.

Abstract

本申请适用于终端技术领域,尤其涉及跨设备声纹注册方法、电子设备及计算机可读存储介质。所述方法可以对第一终端设备获取的第一注册语音进行转换处理,生成第二终端设备对应的第二注册语音,来对第二终端设备进行声纹注册,以实现一次注册语音的输入可对多个终端设备进行声纹注册的目的,减少多终端设备声纹注册的语音输入次数,提升用户体验。

Description

跨设备声纹注册方法、电子设备及存储介质
本申请要求于2020年07月07日提交国家知识产权局、申请号为202010650133.9、申请名称为“跨设备声纹注册方法、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请属于终端技术领域,尤其涉及跨设备声纹注册方法、电子设备及计算机可读存储介质。
背景技术
声纹识别,即说话人识别,是一种通过语音自动辨识和确认说话人身份的技术,广泛应用于手机、智能音箱等终端设备。在进行声纹识别之前,需要用户先在终端设备中进行声纹注册,即需要用户在终端设备中输入注册语音,以根据所输入的注册语音生成声纹模板,从而根据声纹模板来进行用户身份的识别。目前,用户往往具有多个终端设备,而要实现各终端设备的声纹识别则需要用户分别在各终端设备中进行语音输入来对各终端设备进行声纹注册,语音输入次数较多,影响用户体验。
发明内容
本申请实施例提供了跨设备声纹注册方法、电子设备及计算机可读存储介质,可进行注册语音的迁移来实现一次语音输入对多终端设备进行声纹注册的目的,减少多终端设备声纹注册的语音输入次数,提高用户体验。
第一方面,本申请实施例提供了一种跨设备声纹注册方法,应用于第二终端设备,所述方法可以包括:
获取第一终端设备对应的第一注册语音;
对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音;
根据所述第二注册语音生成所述第二终端设备对应的声纹模板。
通过上述的跨设备声纹注册方法,第二终端设备可以对第一终端设备获取的第一注册语音进行转换处理,生成第二终端设备所对应的第二注册语音,来对第二终端设备进行声纹注册,以实现一次注册语音的输入可对多个终端设备进行声纹注册的目的,减少多终端设备声纹注册的语音输入次数,提升用户体验。
在一个示例中,所述对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音,可以包括:
通过所述第一终端设备对应的第一信道模型对所述第一注册语音进行转换处理,得到所述第一注册语音对应的原始语音,所述第一信道模型用于表征所述第一终端设备对应的语音与原始语音之间的映射关系;
通过所述第二终端设备对应的第二信道模型对所述原始语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第二信道模型用于表征所述第二终端设备对应的语音与原始语音之间的映射关系。
本申请实施例中,终端设备对应的信道模型可以是终端设备相对于原始语音信号的信道模型,用于表征终端设备获取的语音信号与原始语音信号之间的映射关系,原始语音信号为没有终端设备的信道信息附加的语音信号。
在另一个示例中,所述对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音,可以包括:
通过第三信道模型对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第三信道模型用于表征所述第一终端设备对应的语音与所述第二终端设备对应的语音之间的映射关系。
本申请实施例中,终端设备对应的信道模型也可以是两终端设备之间的信道模型,用于表征两终端设备获取的语音信号之间的映射关系。
在第一方面的一种可能的实现方式中,所述第一信道模型和所述第二信道模型为基于频率响应曲线构建的信道模型,或者为基于频谱特征构建的信道模型。
同样地,所述第三信道模型为基于频率响应曲线构建的信道模型,或者为基于频谱特征构建的信道模型。
示例性的,在建立终端设备相对于原始语音信号的信道模型时,可向终端设备播放原始扫频信号,测量终端设备接收到的声音信号的频率响应曲线St,并测量原始扫频信号的频率响应曲线S,然后可根据频率响应曲线St和频率响应曲线S计算各频率响应增益值,并可根据各频率响应增益值建立终端设备相对于原始语音信号的信道模型。其中,原始扫频信号可以为扫频信号发生器输出的原始声音信号,频率响应增益值可以为频率响应曲线St与频率响应曲线S中相同频率所对应的值之间的比值,终端设备相对于原始语音信号的信道模型可以为St/S。
示例性的,在建立终端设备相对于原始语音信号的信道模型时,可向终端设备播放原始语音信号,获取终端设备接收到的语音信号,并可以将原始语音信号和终端设备接收到的语音信号发送至预设的神经网络模型,该神经网络模型可分别提取原始语音信号对应的频谱特征A和语音信号对应的频谱特征B,并学习频谱特征A和频谱特征B之间的映射关系,从而得到终端设备相对于原始语音信号的信道模型。
示例性的,在建立两终端设备之间的信道模型时,如在建立第一终端设备与第二终端设备之间的信道模型时,可分别向第一终端设备和第二终端设备播放原始扫频信号,其中,向第一终端设备和第二终端设备所播放的原始扫频信号相同,测量第一终端设备接收到的声音信号的频率响应曲线St1和第二终端设备接收到的声音信号的频率响应曲线St2,然后可以根据频率响应曲线St1和St2计算各频率响应增益值,并根据频率响应增益值建立第一终端设备相对于第二终端设备的信道模型,和/或建立第二终端设备相对于第一终端设备的信道模型。其中,第一终端设备相对于第二终端设备的信道模型可以为St1/St2,第二终端设备相对于第一终端设备的信道模型可以为St2/St1。
示例性的,在建立两终端设备之间的信道模型时,如在建立第一终端设备与第二终端设备之间的信道模型时,可分别向第一终端设备和第二终端设备播放原始语音信号,获取第一终端设备接收到的语音信号C和第二终端设备接收到的语音信号D,并可以将语音信号C和语音信号D发送至预设的神经网络模型,该神经网络模型可分别提取语音信号C对应的频谱特征C和语音信号D对应的频谱特征D,并学习频谱特征C和频谱特征D之间的映射关系,从而得到第一终端设备相对于第二终端设备的信道模型,和/或得到第二终端设备相对于第一终端设备的信道模型。
需要说明的是,所述根据所述第二注册语音生成所述第二终端设备对应的声纹模板,可以包括:
根据所述第二终端设备对应的声纹识别模型和所述第二注册语音生成所述第二终端设备对应的声纹模板,所述第二终端设备对应的声纹识别模型为基于所述第二终端设备获取的训练语音训练得到的声纹识别模型。
可选地,在所述根据所述第二注册语音生成所述第二终端设备对应的声纹模板之后,还可以包括:
获取所述第二终端设备对应的认证语音;
根据所述第二终端设备对应的声纹识别模型和所述认证语音生成所述认证语音对应的认证模板;
确定所述认证模板与所述声纹模板之间的相似度;
当所述相似度大于预设的相似度阈值时,根据所述认证语音更新所述第二终端设备对应的声纹识别模型。
应理解,当所述相似度大于预设的相似度阈值时,所述方法还可以包括:对所述认证语音进行转换处理,得到所述第一终端设备对应的训练语音,并向所述第一终端设备发送所述训练语音,所述训练语音用于更新所述第一终端设备对应的声纹识别模型。
通过上述可选的方式,本申请实施例可以获取用户日常使用过程中高质量的认证语音来对各终端设备对应的声纹识别模型进行更新,提高各终端设备对应的声纹识别模型与实际使用场景的匹配性,提高各终端设备中声纹识别的鲁棒性,从而提高各终端设备声纹识别的准确率。
第二方面,本申请实施例提供了一种跨设备声纹注册方法,应用于第一终端设备或服务器,所述方法可以包括:
获取所述第一终端设备对应的第一注册语音;
对所述第一注册语音进行转换处理,得到第二终端设备对应的第二注册语音;
向所述第二终端设备发送所述第二注册语音,所述第二注册语音用于生成所述第二终端设备对应的声纹模板。
通过上述的跨设备声纹注册方法,第一终端设备或服务器可以对第一终端设备获取的第一注册语音进行转换处理,生成第二终端设备所对应的第二注册语音,并向第二终端设备发送第二注册语音,第二终端设备可以直接基于所接收到的第二注册语音对第二终端设备进行声纹注册,可实现一次注册语音的输入可对多个终端设备进行声纹注册的目的,减少多终端设备声纹注册的语音输入次数,提升用户体验。同时还可以减少第二终端设备的计算量,确保第二终端设备的使用性能。
在一个示例中,所述对所述第一注册语音进行转换处理,得到第二终端设备对应的第二注册语音,可以包括:
通过所述第一终端设备对应的第一信道模型对所述第一注册语音进行转换处理,得到所述第一注册语音对应的原始语音,所述第一信道模型用于表征所述第一终端设备对应的语音与原始语音之间的映射关系;
通过所述第二终端设备对应的第二信道模型对所述原始语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第二信道模型用于表征所述第二终端设备对应的语音与原始语音之间的映射关系。
在另一个示例中,所述对所述第一注册语音进行转换处理,得到第二终端设备对应的第二注册语音,可以包括:
通过第三信道模型对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第三信道模型用于表征所述第一终端设备对应的语音与所述第二终端设备对应的语音之间的映射关系。
可以理解的是,当所述方法应用于服务器时,服务器通过转换处理获取第二终端设备对应的第二注册语音后,也可以直接根据第二终端设备对应的声纹识别模型和第二注册语音生成第二终端设备对应的声纹模板,并向第二终端设备发送所生成的声纹模板,以直接通过服务器生成声纹模板发送至第二终端设备,降低第二终端设备的计算量,降低对第二终端设备的性能要求。
示例性,在所述向所述第二终端设备发送所述第二注册语音之后,还可以包括:
获取所述第二终端设备对应的认证语音,所述认证语音与所述第二终端设备对应的声纹模板之间的相似度大于预设的相似度阈值;
对所述认证语音进行转换处理,得到所述第一终端设备对应的训练语音,并向所述第一终端设备发送所述训练语音,所述训练语音用于更新所述第一终端设备对应的声纹识别模型。
通过上述可选的方式,在本申请实施例提供的方法应用于服务器时,服务器可以获取用户日常使用过程中高质量的认证语音,并对认证语音进行转换处理,得到各终端设备对应的训练语音,来对各终端设备对应的声纹识别模型进行更新,提高各终端设备对应的声纹识别模型与实际使用场景的匹配性,提高各终端设备中声纹识别的鲁棒性,从而提高各终端设备声纹识别的准确率。
第三方面,本申请实施例提供了一种跨设备声纹注册方法,可以包括:
第一终端设备获取所述第一终端设备对应的第一注册语音;
所述第一终端设备对所述第一注册语音进行转换处理,得到所述第一注册语音对应的第一原始语音,并向第二终端设备发送所述第一原始语音;
所述第二终端设备接收来自所述第一终端设备的所述第一原始语音,并对所述第一原始语音进行转换处理,得到所述第二终端设备对应的第二注册语音;
所述第二终端设备根据所述第二注册语音生成所述第二终端设备对应的声纹模板。
通过上述的跨设备声纹注册方法,可以将第一注册语音转换为第二注册语音的过程分解为第一注册语音转换为原始语音的过程和原始语音转换为第二注册语音的过程,并可以通过第一终端设备执行第一注册语音转换为原始语音的过程,以及通过第二终端设备执行原始语音转换为第二注册语音的过程,以将第一注册语音转换为第二注册语音的过程分解至第一终端设备和第二终端设备执行,可以降低各终端设备的计算量,从而确保各终端设备的使用性能。
在第三方面的一种可能的实现方式中,所述第一终端设备获取所述第一终端设备对应的第一注册语音,可以包括:
所述第一终端设备获取所述第一终端设备与用户之间的交互语音,并获取所述交互语音中的目标语音,所述目标语音为所述用户对应的语音;
所述第一终端设备根据所述目标语音对应的信噪比和/或语音能量等级从所述目标语音中获取所述第一终端设备对应的第一注册语音。
需要说明的是,第一终端设备也可以从用户与第一终端设备日常的语音交互中获取第一注册语音,以通过自学习免注册的方式来进行各终端设备的声纹注册,从而简化声纹注册的操作流程,提高用户体验。
在一个示例中,所述第二终端设备根据所述第二注册语音生成所述第二终端设备对应的声纹模板之后,还可以包括:
所述第二终端设备获取所述第二终端设备对应的认证语音,根据所述第二终端设备对应的声纹识别模型和所述认证语音生成所述认证语音对应的认证模板,并确定所述认证模板与所述声纹模板之间的相似度;
当所述相似度大于预设的相似度阈值时,所述第二终端设备根据所述认证语音更新所述第二终端设备对应的声纹识别模型,并根据所述第二终端设备对应的第二信道模型对所述认证语音进行转 换处理,得到所述认证语音对应的第二原始语音,向所述第一终端设备发送所述第二原始语音;
所述第一终端设备接收来自所述第二终端设备的所述第二原始语音,根据所述第一终端设备对应的第一信道模型对所述第二原始语音进行转换处理,得到所述第一终端设备对应的训练语音,并根据所述第一终端设备对应的训练语音更新所述第一终端设备对应的声纹识别模型。
第四方面,本申请实施例提供了一种跨设备声纹注册装置,应用于第二终端设备,所述装置可以包括:
注册语音获取模块,用于获取第一终端设备对应的第一注册语音;
转换处理模块,用于对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音;
声纹注册模块,用于根据所述第二注册语音生成所述第二终端设备对应的声纹模板。
在一个示例中,所述转换处理模块,可以包括:
第一转换处理单元,用于通过所述第一终端设备对应的第一信道模型对所述第一注册语音进行转换处理,得到所述第一注册语音对应的原始语音,所述第一信道模型用于表征所述第一终端设备对应的语音与原始语音之间的映射关系;
第二转换处理单元,用于通过所述第二终端设备对应的第二信道模型对所述原始语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第二信道模型用于表征所述第二终端设备对应的语音与原始语音之间的映射关系。
在另一个示例中,所述转换处理模块,可以包括:
第三转换处理单元,用于通过第三信道模型对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第三信道模型用于表征所述第一终端设备对应的语音与所述第二终端设备对应的语音之间的映射关系。
在第四方面的一种可能的实现方式中,所述第一信道模型和所述第二信道模型为基于频率响应曲线构建的信道模型,或者为基于频谱特征构建的信道模型。
在第四方面的另一种可能的实现方式中,所述第三信道模型为基于频率响应曲线构建的信道模型,或者为基于频谱特征构建的信道模型。
可选地,所述声纹注册模块,具体用于根据所述第二终端设备对应的声纹识别模型和所述第二注册语音生成所述第二终端设备对应的声纹模板,所述第二终端设备对应的声纹识别模型为基于所述第二终端设备获取的训练语音训练得到的声纹识别模型。
在一个示例中,所述装置还可以包括:
认证语音获取模块,用于获取所述第二终端设备对应的认证语音;
认证模板生成模块,用于根据所述第二终端设备对应的声纹识别模型和所述认证语音生成所述认证语音对应的认证模板;
相似度确定模块,用于确定所述认证模板与所述声纹模板之间的相似度;
模型更新模块,用于当所述相似度大于预设的相似度阈值时,根据所述认证语音更新所述第二终端设备对应的声纹识别模型。
应理解,当所述相似度大于预设的相似度阈值时,所述装置还可以包括:
训练语音获取模块,用于对所述认证语音进行转换处理,得到所述第一终端设备对应的训练语音,并向所述第一终端设备发送所述训练语音,所述训练语音用于更新所述第一终端设备对应的声纹识别模型。
第五方面,本申请实施例提供了一种跨设备声纹注册装置,应用于第一终端设备或服务器,所述装置可以包括:
第一注册语音获取模块,用于获取所述第一终端设备对应的第一注册语音;
转换处理模块,用于对所述第一注册语音进行转换处理,得到第二终端设备对应的第二注册语音;
第二注册语音发送模块,用于向所述第二终端设备发送所述第二注册语音,所述第二注册语音用于生成所述第二终端设备对应的声纹模板。
在第五方面的一种可能的实现方式中,所述转换处理模块,可以包括:
第一转换处理单元,用于通过所述第一终端设备对应的第一信道模型对所述第一注册语音进行转换处理,得到所述第一注册语音对应的原始语音,所述第一信道模型用于表征所述第一终端设备对应的语音与原始语音之间的映射关系;
第二转换处理单元,用于通过所述第二终端设备对应的第二信道模型对所述原始语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第二信道模型用于表征所述第二终端设备对应的语音与原始语音之间的映射关系。
在第五方面的一种可能的实现方式中,所述转换处理模块,还可以包括:
第三转换处理单元,用于通过第三信道模型对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第三信道模型用于表征所述第一终端设备对应的语音与所述第二终端设备对应的语音之间的映射关系。
第六方面,本申请实施例提供了一种跨设备声纹注册系统,可以包括第一终端设备和第二终端设备,所述第一终端设备包括注册语音获取模块和第一转换处理模块,所述第二终端设备包括第二转换处理模块和声纹注册模块;
所述注册语音获取模块,用于获取所述第一终端设备对应的第一注册语音;
所述第一转换处理模块,用于对所述第一注册语音进行转换处理,得到所述第一注册语音对应的第一原始语音,向所述第二终端设备发送所述第一原始语音;
所述第二转换处理模块,用于接收来自所述第一终端设备的所述第一原始语音,并对所述第一原始语音进行转换处理,得到所述第二终端设备对应的第二注册语音;
所述声纹注册模块,用于根据所述第二注册语音生成所述第二终端设备对应的声纹模板。
在一个示例中,所述注册语音获取模块,可以包括:
目标语音获取单元,用于获取所述第一终端设备与用户之间的交互语音,并获取所述交互语音中的目标语音,所述目标语音为所述用户对应的语音;
注册语音获取单元,用于根据所述目标语音对应的信噪比和/或语音能量等级从所述目标语音中获取所述第一终端设备对应的第一注册语音。
在第六方面的一种可能的实现方式中,所述第二终端设备还可以包括认证语音获取模块、认证模板生成模块、相似度确定模块和第一模型更新模块;所述第一终端设备还可以包括训练语音获取模块和第二模型更新模块:
所述认证语音获取模块,用于获取所述第二终端设备对应的认证语音;
所述认证模板生成模块,用于根据所述第二终端设备对应的声纹识别模型和所述认证语音生成所述认证语音对应的认证模板;
所述相似度确定模块,用于确定所述认证模板与所述声纹模板之间的相似度;
所述第一模型更新模块,用于当所述相似度大于预设的相似度阈值时,根据所述认证语音更新所述第二终端设备对应的声纹识别模型,并根据所述第二终端设备对应的第二信道模型对所述认证语音进行转换处理,得到所述认证语音对应的第二原始语音,向所述第一终端设备发送所述第二原始语音;
所述训练语音获取模块,用于接收来自所述第二终端设备的所述第二原始语音,根据所述第一终端设备对应的第一信道模型对所述第二原始语音进行转换处理,得到所述第一终端设备对应的训练语音;
所述第二模型更新模块,用于根据所述第一终端设备对应的训练语音更新所述第一终端设备对应的声纹识别模型。
第七方面,本申请实施例提供了一种电子设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时,使所述电子设备实现上述第一方面中任一项,或第二方面中任一项所述的跨设备声纹注册方法。
第八方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被计算机执行时,使所述计算机实现上述第一方面中任一项,或第二方面中任一项所述的跨设备声纹注册方法。
第九方面,本申请实施例提供了一种计算机程序产品,当计算机程序产品在电子设备上运行时,使得电子设备执行上述第一方面中任一项,或第二方面中任一项所述的跨设备声纹注册方法。
附图说明
图1是本申请实施例提供的终端设备的结构示意图;
图2是本申请实施例提供的终端设备的软件架构示意图;
图3是本申请实施例提供的一个应用场景示意图;
图4是本申请实施例提供的应用界面示意图一;
图5是本申请实施例提供的应用界面示意图二;
图6是本申请实施例提供的应用界面示意图三;
图7是本申请实施例提供的应用界面示意图四;
图8是本申请实施例一提供的跨设备声纹注册方法的流程示意图;
图9是本申请实施例提供的另一个应用场景示意图;
图10是本申请实施例二提供的跨设备声纹注册方法的流程示意图;
图11是本申请实施例提供的电子设备的结构示意图。
具体实施方式
应当理解,当在本申请说明书和所附权利要求书中使用时,术语“包括”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本申请说明书和所附权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
另外,在本申请说明书和所附权利要求书的描述中,术语“第一”、“第二”、“第三”等仅用于区分描述,而不能理解为指示或暗示相对重要性。
在本申请说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。
声纹识别是一种通过语音自动辨识和确认说话人身份的技术,可以应用于手机、智能手表、智能音箱等终端设备。其中,声纹识别包括声纹注册和声纹验证两个阶段。在声纹注册阶段,用户需要向终端设备输入注册语音,终端设备可以根据所获取到的注册语音生成声纹模板;在声纹验证阶段,终端设备可以将用户输入的认证语音与注册阶段所生成的声纹模板进行相似度打分来识别用户身份。
在声纹注册时,终端设备所获取的注册语音会受到该终端设备对应的信道的影响,即该终端设备所获取的注册语音中会有该终端设备对应的信道信息附加。而不同的终端设备往往由不同的硬器件组成,因此不同的终端设备具有不同的信道信息,使得不同的终端设备进行声纹注册的注册语音也不同。由于信道的不同,某一终端设备获取到的注册语音只能应用于该终端设备的声纹注册,当用户拥有多个终端设备,且用户想要在多个终端设备中使用声纹识别时,用户需要分别在各终端设备上进行语音输入来对各终端设备进行声纹注册,语音输入次数较多,影响用户体验。
为解决上述问题,本申请实施例提供了跨设备声纹注册方法、电子设备及计算机可读存储介质,可以对某一终端设备获取的注册语音进行转换处理,生成其他终端设备对应的注册语音,来对其他终端设备进行声纹注册,实现一次注册语音的输入可对多个终端设备进行声纹注册的目的,减少多终端设备声纹注册的语音输入次数,提升用户体验。
需要说明的是,本申请实施例涉及的终端设备可以为手机、平板电脑、可穿戴设备(如智能耳机、智能手环等)、智能音箱、智能家居、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、个人数字助理(personal digital assistant,PDA)、桌上型计算机等。
以下首先介绍本申请实施例涉及的终端设备。请参阅图1,图1是本申请实施例提供的终端设备100的结构示意图。
终端设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中,传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本申请实施例示意的结构并不构成对终端设备100的具体限定。在本申请另一些实施例中,终端设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些 部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
I2C接口是一种双向同步串行总线,包括一根串行数据线(serial data line,SDA)和一根串行时钟线(derail clock line,SCL)。在一些实施例中,处理器110可以包含多组I2C总线。处理器110可以通过不同的I2C总线接口分别耦合触摸传感器180K,充电器,闪光灯,摄像头193等。例如:处理器110可以通过I2C接口耦合触摸传感器180K,使处理器110与触摸传感器180K通过I2C总线接口通信,实现终端设备100的触摸功能。
I2S接口可以用于音频通信。在一些实施例中,处理器110可以包含多组I2S总线。处理器110可以通过I2S总线与音频模块170耦合,实现处理器110与音频模块170之间的通信。在一些实施例中,音频模块170可以通过I2S接口向无线通信模块160传递音频信号,实现通过蓝牙耳机接听电话的功能。
PCM接口也可以用于音频通信,将模拟信号抽样,量化和编码。在一些实施例中,音频模块170与无线通信模块160可以通过PCM总线接口耦合。在一些实施例中,音频模块170也可以通过PCM接口向无线通信模块160传递音频信号,实现通过蓝牙耳机接听电话的功能。所述I2S接口和所述PCM接口都可以用于音频通信。
UART接口是一种通用串行数据总线,用于异步通信。该总线可以为双向通信总线。它将要传输的数据在串行通信与并行通信之间转换。在一些实施例中,UART接口通常被用于连接处理器110与无线通信模块160。例如:处理器110通过UART接口与无线通信模块160中的蓝牙模块通信,实现蓝牙功能。在一些实施例中,音频模块170可以通过UART接口向无线通信模块160传递音频信号,实现通过蓝牙耳机播放音乐的功能。
MIPI接口可以被用于连接处理器110与显示屏194,摄像头193等外围器件。MIPI接口包括摄像头串行接口(camera serial interface,CSI),显示屏串行接口(display serial interface,DSI)等。在一些实施例中,处理器110和摄像头193通过CSI接口通信,实现终端设备100的拍摄功能。处理器110和显示屏194通过DSI接口通信,实现终端设备100的显示功能。
GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号,也可被配置为数据信号。在一些实施例中,GPIO接口可以用于连接处理器110与摄像头193,显示屏194,无线通信模块160,音频模块170,传感器模块180等。GPIO接口还可以被配置为I2C接口,I2S接口,UART接口,MIPI接口等。
USB接口130是符合USB标准规范的接口,具体可以是Mini USB接口,Micro USB接口,USB Type C接口等。USB接口130可以用于连接充电器为终端设备100充电,也可以用于终端设备100与外围设备之间传输数据。也可以用于连接耳机,通过耳机播放音频。该接口还可以用于连接其他终端设备,例如AR设备等。
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对终端设备100的结构限定。在本申请另一些实施例中,终端设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。在一些有线充电的实施例中,充电管理模块140可以通过USB接口130接收有线充电器的充电输入。在一些无线充电的实施例中,充电管理模块140可以通过终端设备100的无线充电线圈接收无线充电输入。充电管理模块140为电池142充电的同时,还可以通过电源管理模块141为终端设备供电。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,显示屏194,摄像头193,和无线通信模块160等供电。电源管理模块141还可以用于监测电池容量,电池循环次数,电池健康状态(漏电,阻抗)等参数。在其他一些实施例中,电源管理模块141也可以设置于处理器110中。在另一些实施例中,电源管理模块141和充电管理模块140也可以设置于同一个器件中。
终端设备100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。终端设备100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在终端设备100上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器170A,受话器170B等)输出声音信号,或通过显示屏194显示图像或视频。在一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器110,与移动通信模块150或其他功能模块设置在同一个器件中。
无线通信模块160可以提供应用在终端设备100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,终端设备100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得终端设备100可以通过无线通信技术与网络以及其他设备通信。所述无线通信技术可以包括全球移动通讯系统(global system for mobile communications,GSM),通用分组无线服务(general packet radio service,GPRS),码分多址接入(code division multiple access,CDMA),宽带码分多址(wideband code division multiple access,WCDMA),时分码分多址(time-division code division multiple access,TD-SCDMA),长期演进(long term evolution,LTE),BT,GNSS,WLAN,NFC,FM,和/或IR技术等。所述GNSS可以包括全球卫星定位系统(global positioning system,GPS),全球导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)。
终端设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,终端设备100可以包括1个或N个显示屏194,N为大于1的正整数。
终端设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,终端设备100可以包括1个或N个摄像头193,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。 例如,当终端设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。终端设备100可以支持一种或多种视频编解码器。这样,终端设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现终端设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展终端设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储终端设备100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行终端设备100的各种功能应用以及数据处理。
终端设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。终端设备100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当终端设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。终端设备100可以设置至少一个麦克风170C。在另一些实施例中,终端设备100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,终端设备100还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动终端设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
压力传感器180A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器180A可以设置于显示屏194。压力传感器180A的种类很多,如电阻式压力传感器,电感式压力传感器,电容式压力传感器等。电容式压力传感器可以是包括至少两个具有导电材料的平行板。当有力作用于压力传感器180A,电极之间的电容改变。终端设备100根据电容的变化确定压力的强度。当有触摸操作作用于显示屏194,终端设备100根据压力传感器180A检测所述触摸操作强度。 终端设备100也可以根据压力传感器180A的检测信号计算触摸的位置。在一些实施例中,作用于相同触摸位置,但不同触摸操作强度的触摸操作,可以对应不同的操作指令。例如:当有触摸操作强度小于第一压力阈值的触摸操作作用于短消息应用图标时,执行查看短消息的指令。当有触摸操作强度大于或等于第一压力阈值的触摸操作作用于短消息应用图标时,执行新建短消息的指令。
陀螺仪传感器180B可以用于确定终端设备100的运动姿态。在一些实施例中,可以通过陀螺仪传感器180B确定终端设备100围绕三个轴(即,x,y和z轴)的角速度。陀螺仪传感器180B可以用于拍摄防抖。示例性的,当按下快门,陀螺仪传感器180B检测终端设备100抖动的角度,根据角度计算出镜头模组需要补偿的距离,让镜头通过反向运动抵消终端设备100的抖动,实现防抖。陀螺仪传感器180B还可以用于导航,体感游戏场景。
气压传感器180C用于测量气压。在一些实施例中,终端设备100通过气压传感器180C测得的气压值计算海拔高度,辅助定位和导航。
磁传感器180D包括霍尔传感器。终端设备100可以利用磁传感器180D检测翻盖皮套的开合。在一些实施例中,当终端设备100是翻盖机时,终端设备100可以根据磁传感器180D检测翻盖的开合。进而根据检测到的皮套的开合状态或翻盖的开合状态,设置翻盖自动解锁等特性。
加速度传感器180E可检测终端设备100在各个方向上(一般为三轴)加速度的大小。当终端设备100静止时可检测出重力的大小及方向。还可以用于识别终端设备姿态,应用于横竖屏切换,计步器等应用。
距离传感器180F,用于测量距离。终端设备100可以通过红外或激光测量距离。在一些实施例中,拍摄场景,终端设备100可以利用距离传感器180F测距以实现快速对焦。
接近光传感器180G可以包括例如发光二极管(LED)和光检测器,例如光电二极管。发光二极管可以是红外发光二极管。终端设备100通过发光二极管向外发射红外光。终端设备100使用光电二极管检测来自附近物体的红外反射光。当检测到充分的反射光时,可以确定终端设备100附近有物体。当检测到不充分的反射光时,终端设备100可以确定终端设备100附近没有物体。终端设备100可以利用接近光传感器180G检测用户手持终端设备100贴近耳朵通话,以便自动熄灭屏幕达到省电的目的。接近光传感器180G也可用于皮套模式,口袋模式自动解锁与锁屏。
环境光传感器180L用于感知环境光亮度。终端设备100可以根据感知的环境光亮度自适应调节显示屏194亮度。环境光传感器180L也可用于拍照时自动调节白平衡。环境光传感器180L还可以与接近光传感器180G配合,检测终端设备100是否在口袋里,以防误触。
指纹传感器180H用于采集指纹。终端设备100可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。
温度传感器180J用于检测温度。在一些实施例中,终端设备100利用温度传感器180J检测的温度,执行温度处理策略。例如,当温度传感器180J上报的温度超过阈值,终端设备100执行降低位于温度传感器180J附近的处理器的性能,以便降低功耗实施热保护。在另一些实施例中,当温度低于另一阈值时,终端设备100对电池142加热,以避免低温导致终端设备100异常关机。在其他一些实施例中,当温度低于又一阈值时,终端设备100对电池142的输出电压执行升压,以避免低温导致的异常关机。
触摸传感器180K,也称“触控器件”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过 显示屏194提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器180K也可以设置于终端设备100的表面,与显示屏194所处的位置不同。
骨传导传感器180M可以获取振动信号。在一些实施例中,骨传导传感器180M可以获取人体声部振动骨块的振动信号。骨传导传感器180M也可以接触人体脉搏,接收血压跳动信号。在一些实施例中,骨传导传感器180M也可以设置于耳机中,结合成骨传导耳机。音频模块170可以基于所述骨传导传感器180M获取的声部振动骨块的振动信号,解析出语音信号,实现语音功能。应用处理器可以基于所述骨传导传感器180M获取的血压跳动信号解析心率信息,实现心率检测功能。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。终端设备100可以接收按键输入,产生与终端设备100的用户设置以及功能控制有关的键信号输入。
马达191可以产生振动提示。马达191可以用于来电振动提示,也可以用于触摸振动反馈。例如,作用于不同应用(例如拍照,音频播放等)的触摸操作,可以对应不同的振动反馈效果。作用于显示屏194不同区域的触摸操作,马达191也可对应不同的振动反馈效果。不同的应用场景(例如:时间提醒,接收信息,闹钟,游戏等)也可以对应不同的振动反馈效果。触摸振动反馈效果还可以支持自定义。
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。
SIM卡接口195用于连接SIM卡。SIM卡可以通过插入SIM卡接口195,或从SIM卡接口195拔出,实现和终端设备100的接触和分离。终端设备100可以支持1个或N个SIM卡接口,N为大于1的正整数。SIM卡接口195可以支持Nano SIM卡,Micro SIM卡,SIM卡等。同一个SIM卡接口195可以同时插入多张卡。所述多张卡的类型可以相同,也可以不同。SIM卡接口195也可以兼容不同类型的SIM卡。SIM卡接口195也可以兼容外部存储卡。终端设备100通过SIM卡和网络交互,实现通话以及数据通信等功能。在一些实施例中,终端设备100采用eSIM,即:嵌入式SIM卡。eSIM卡可以嵌在终端设备100中,不能和终端设备100分离。
终端设备100的软件系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。本申请实施例以分层架构的Android系统为例,示例性说明终端设备100的软件结构。
图2是本申请实施例的终端设备100的软件结构框图。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。
应用程序层可以包括一系列应用程序包。
如图2所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,短信息等应用程序。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。
如图2所示,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频, 图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供终端设备100的通信功能。例如通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,终端设备振动,指示灯闪烁等。
Android Runtime包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。
2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
下面结合捕获拍照场景,示例性说明终端设备100软件以及硬件的工作流程。
当触摸传感器180K接收到触摸操作,相应的硬件中断被发给内核层。内核层将触摸操作加工成原始输入事件(包括触摸坐标,触摸操作的时间戳等信息)。原始输入事件被存储在内核层。应用程序框架层从内核层获取原始输入事件,识别该输入事件所对应的控件。以该触摸操作是触摸单击操作,该单击操作所对应的控件为相机应用图标的控件为例,相机应用调用应用框架层的接口,启动相机应用,进而通过调用内核层启动摄像头驱动,通过摄像头193捕获静态图像或视频。
以下介绍本申请实施例提供的跨设备声纹注册方法。其中,该方法已事先对用户的各终端设备建立有对应的信道模型。当用户在某一终端设备中进行声纹注册时,该方法可以获取用户在该终端设备中所输入的注册语音,并可以根据各终端设备对应的信道模型对该注册语音进行转换处理,得到与用户的其他终端设备对应的注册语音,以对用户的其他终端设备进行声纹注册,得到各终端设备对应的声纹模板。
需要说明的是,终端设备对应的信道模型可以是终端设备相对于原始语音信号的信道模型,用于表征终端设备获取的语音信号与原始语音信号之间的映射关系,原始语音信号为没有终端设备的 信道信息附加的语音信号;或者,可以是两终端设备之间的信道模型,用于表征两终端设备获取的语音信号之间的映射关系。
本申请实施例可以基于语音信号的频率响应曲线来构建终端设备对应的信道模型,或者可以基于语音信号的频谱特征来构建终端设备对应的信道模型。
示例性的,在建立终端设备相对于原始语音信号的信道模型时,可向终端设备播放原始扫频信号,测量终端设备接收到的声音信号的频率响应曲线St,并测量原始扫频信号的频率响应曲线S,然后可根据频率响应曲线St和频率响应曲线S计算各频率响应增益值,并可根据各频率响应增益值建立终端设备相对于原始语音信号的信道模型。其中,原始扫频信号可以为扫频信号发生器输出的原始声音信号,频率响应增益值可以为频率响应曲线St与频率响应曲线S中相同频率所对应的值之间的比值,终端设备相对于原始语音信号的信道模型可以为St/S。
或者,可向终端设备播放原始语音信号,获取终端设备接收到的语音信号,并可以将原始语音信号和终端设备接收到的语音信号发送至预设的神经网络模型,该神经网络模型可分别提取原始语音信号对应的频谱特征A和语音信号对应的频谱特征B,并学习频谱特征A和频谱特征B之间的映射关系,从而得到终端设备相对于原始语音信号的信道模型。其中,原始语音信号可以为基于标准的声音采集装置(如麦克风)采集的语音信号,该预设的神经网络模型可以为基于大量的语音信号数据对训练得到的神经网络模型,每一语音信号数据对包括原始语音信号和该原始语音信号经由该终端设备所接收到的语音信号。
示例性的,在建立两终端设备之间的信道模型时,如在建立第一终端设备与第二终端设备之间的信道模型时,可分别向第一终端设备和第二终端设备播放原始扫频信号,其中,向第一终端设备和第二终端设备所播放的原始扫频信号相同,测量第一终端设备接收到的声音信号的频率响应曲线St1和第二终端设备接收到的声音信号的频率响应曲线St2,然后可以根据频率响应曲线St1和St2计算各频率响应增益值,并根据频率响应增益值建立第一终端设备相对于第二终端设备的信道模型,和/或建立第二终端设备相对于第一终端设备的信道模型。其中,第一终端设备相对于第二终端设备的信道模型可以为St1/St2,第二终端设备相对于第一终端设备的信道模型可以为St2/St1。
或者,可分别向第一终端设备和第二终端设备播放原始语音信号,获取第一终端设备接收到的语音信号C和第二终端设备接收到的语音信号D,并可以将语音信号C和语音信号D发送至预设的神经网络模型,该神经网络模型可分别提取语音信号C对应的频谱特征C和语音信号D对应的频谱特征D,并学习频谱特征C和频谱特征D之间的映射关系,从而得到第一终端设备相对于第二终端设备的信道模型,和/或得到第二终端设备相对于第一终端设备的信道模型。其中,原始语音信号可以为基于标准的声音采集装置(如麦克风)采集的语音信号,该预设的神经网络模型可以为基于大量的语音信号数据对训练得到的神经网络模型,每一语音信号数据对包括第一终端设备对应的语音信号和第二终端设备对应的语音信号。
应理解,各终端设备对应的声纹模板可以根据各终端设备对应的声纹识别模型生成。因此,本申请实施例可以事先训练得到各终端设备对应的声纹识别模型,以根据各终端设备对应的声纹识别模型和注册语音来生成各终端设备对应的声纹模板。其中,声纹模板可以为声纹识别模型输出的特征矢量等,即声纹模板可以为声纹识别模型从注册语音中提取出的声纹特征所组成的特征矢量等。
在此,声纹识别模型可以为基于高斯混合-通用背景模型(gaussian mixture model-universal background model,GMM-UBM)的声纹识别模型,或者可以为基于支持向量机(support vector machine,SVM)的声纹识别模型,或者可以为基于联合因子分析(joint factor analysis,JFA)的声纹识别模型,或者可以为基于全因子空间(identity vector,i-vector)的声纹识别模型,或者可以为基于时延神经网络(time-delay neural network,TDNN)的声纹识别模型。其中,各终端设备对应的声纹识别模型可以相同,也可以不同。例如,各终端设备对应的声纹识别模型可以均为基于GMM-UBM的声纹识别模型,或者可以均为基于TDNN的声纹识别模型。例如,终端设备A对应的声纹识别模型可以为基于GMM-UBM的声纹识别模型,终端设备B对应的声纹识别模型可以为基于SVM的声纹识别模型,终端设备C对应的声纹识别模型可以为基于JFA的声纹识别模型,等等。
应理解,根据终端设备对应的声纹识别模型和注册语音生成终端设备对应的声纹模板的具体过程可以为:可以首先对注册语音进行语音特征的提取,其中,所提取的语音特征可以为梅尔频率倒谱系数(mel-frequency cepstral coefficients,MFCC)或者可以为滤波器组(filter bank,FBank)特征。然后可以通过声纹识别模型对所提取的语音特征进行处理,得到注册语音对应的声纹模板。例如,可以通过GMM-UBM模型对语音特征进行处理,得到高斯均值超矢量作为该注册语音对应的声纹模板;或者可以通过i-vector模型对语音特征进行处理,得到i-vector矢量作为该注册语音对应的声纹模板;或者可以通过深度神经网络(deep neural network,DNN)对语音特征进行处理,得到d-vector矢量作为该注册语音对应的声纹模板;或者可以通过TDNN网络对语音特征进行处理,得到x-vector矢量作为该注册语音对应的声纹模板,等等。
需要说明的是,各终端设备对应的声纹识别模型可以基于各终端设备获取的训练语音集训练得到。例如,可以利用终端设备A获取的训练语音集A对终端设备A对应的声纹识别模型进行训练,可以利用终端设备B获取的训练语音集B对终端设备B对应的声纹识别模型进行训练,可以利用终端设备C获取的训练语音集C对终端设备C对应的声纹识别模型进行训练,等等。其中,终端设备获取的训练语音中附加有该终端设备对应的信道信息。在此,通过各终端设备获取的训练语音集来进行各声纹识别模型的训练,使得各声纹识别模型可更好的与各终端设备相匹配,以提高各终端设备对应的声纹模板的准确性,从而提升各终端设备声纹识别的识别准确率。
需要说明的是,本申请实施例对声纹识别模型的训练过程不作具体限定,如可以采用现有的训练方法来对声纹识别模型进行训练。
以下结合具体应用场景介绍本申请实施例提供的跨设备声纹注册方法。
【实施例一】
图3是本申请实施例一提供的跨设备声纹注册方法的应用场景示意图。如图3所示,该应用场景可包括多个终端设备100,各终端设备100之间可以通过近距离通信或者网络实现互联互通。其中,各终端设备100或者与各终端设备100通信连接的存储装置中存储有该终端设备相对于原始语音信号的信道模型,或者存储有该终端设备与其他任一终端设备之间的信道模型。
应理解,终端设备100为具有跨设备声纹注册功能的终端设备。在各终端设备100的跨设备声纹注册功能被开启后,用户在某一终端设备100(以下称为第一终端设备)中进行声纹注册时,该第一终端设备可以将用户输入至该第一终端设备的第一注册语音发送至其他终端设备100(以下统称为第二终端设备),各第二终端设备可以利用第一终端设备相对于原始语音信号的信道模型,以及第二终端设备相对于原始语音信号的信道模型来对第一注册语音进行转换处理,得到各第二终端设备对应的第二注册语音;或者,各第二终端设备可以根据第一终端设备与第二终端设备之间的信道模型来对第一注册语音进行转换处理,得到各第二终端设备对应的第二注册语音,以根据各第二注册语音来生成各第二终端设备对应的声纹模板,减少多终端设备声纹注册的语音输入次数。
本实施例中,任一终端设备100的跨设备声纹注册功能可以由用户手动开启。
在一个示例中,终端设备100可以显示跨设备声纹注册的图标和/或菜单栏,来供用户手动开启终端设备100的跨设备声纹注册功能。例如,如图4中的(a)所示,终端设备100可以在快捷控制界面中显示跨设备声纹注册的图标,该快捷控制界面中还可以包括蓝牙、飞行模式、移动数据、无线局域网、手电筒、亮度等常规功能的图标,以实现蓝牙、飞行模式、移动数据等相关功能的快捷操作。或者,如图4中的(b)所示,终端设备100可以在设置界面中显示跨设备声纹注册的菜单栏,该设置界面中还可以包括蓝牙、飞行模式、移动数据、无线局域网、亮度等常规功能的设置菜单栏,以实现蓝牙、飞行模式、无线局域网等相关功能的设置操作。
当终端设备100在快捷控制界面或者设置界面上检测到用户的相关操作(例如检测到相关的触摸操作或点击操作)时,可确定终端设备100的跨设备声纹注册功能需被开启,终端设备100可以直接开启跨设备声纹注册功能,或者可以通过弹窗询问用户是否确定开启跨设备声纹注册功能。例如,终端设备100检测到如图4中的(a)所示跨设备声纹注册的图标被点亮,或者检测到如图4中的(b)所示跨设备声纹注册的菜单栏的开关键被开启时,终端设备100可以直接开启跨设备声纹注册功能;或者可以显示如图4中的(c)所示的弹窗,以显示“确定开启跨设备声纹注册?”的询问信息,以及“是”和“否”的选择键。当终端设备100检测到用户点击“是”时,终端设备100可以开启跨设备声纹注册功能。
在一个示例中,终端设备100检测到用户在终端设备100中进行注册语音的输入时,终端设备100可以通过弹窗询问用户是否开启跨设备声纹注册功能。例如,当终端设备100检测到用户正在终端设备100中进行注册语音的输入来对终端设备100进行声纹注册时,终端设备100可以显示如图5所示的弹窗,以显示“正在进行声纹注册,是否开启跨设备声纹注册?”,以及提供“是”或“否”的选择键,当终端设备100检测到用户点击“是”时,终端设备100可以开启跨设备声纹注册功能。
当终端设备100的跨设备声纹注册功能被开启后,终端设备100可以提供如图6中的(a)所示的声纹注册管理界面,以供用户选择进行跨设备声纹注册的终端设备。其中,声纹注册管理界面可以包括:已选设备栏60、候选设备栏61和用于添加新设备的添加控件62。已选设备栏60用于显示用户已选择的进行跨设备声纹注册的终端设备,若未选择任何终端设备,则可以显示用于表示未选择终端设备的提示信息,例如显示如图6中的(a)所示的“尚未选择任何终端设备”的提示消息。候选设备栏61用于显示具备跨设备声纹注册功能的终端设备。在此,候选设备栏61中显示的终端设备可以与用户的账号相关联,当用户登录账号后,该账号所关联的所有具备跨设备声纹注册功能的终端设备可以显示于候选设备栏61中。应理解,已选设备栏60和候选设备栏61中所显示的终端设备可以为终端设备的设备名称和/或设备标识,例如,可以显示如图6和图7中所示的设备名称或设备标识“AAA”、“BBB”、“CCC”和“DDD”等。
在选择进行跨设备声纹注册的终端设备时,用户可以直接选择候选设备栏61中的设备名称和/或设备标识来选择进行跨设备声纹注册的终端设备。终端设备100可以在用户选择了候选设备栏61中的设备名称和/或设备标识后,直接将该设备名称和/或设备标识添加至已选设备栏60中;或者,可以通过弹窗询问用户是否确定选择该终端设备。例如,如图6中的(a)所示,在终端设备100检测到用户对候选设备栏61中的设备名称“AAA”执行了选择操作(例如点击操作或触摸操作)时,终端设备100可以显示如图6中的(b)所示的弹窗,以显示“确定选择终端设备AAA?”的询问信息,以及“确认”和“取消”的选择键,当终端设备100检测到用户点击“确认”时,如图6中的(c)所示,终端设备100可以将设备名称“AAA”添加至已选设备栏60中,并可以删除候选设备栏61中的设备名称“AAA”。
若用户需要删除某已选终端设备,用户可以对已选设备栏60中待删除的设备名称和/或设备标识进行选择操作(例如点击操作或触摸操作),此时终端设备100可以从已选设备栏60中删除该设备名称和/或设备标识;或者,可以通过弹窗询问用户是否确定删除该终端设备。例如,如图7中的(a)所示,在终端设备100检测到用户对已选设备栏60中的设备名称“BBB”执行了选择操作时,终端设备100可以显示如图7中的(b)所示的弹窗,以显示“确定删除终端设备BBB?”的询问信息,以及“确认”和“取消”的选择键。当终端设备100检测到用户点击“确认”时,如图7中的(c)所示,终端设备100可以将设备名称“BBB”从已选设备栏60中删除,并可以将设备名称“BBB”添加至候选设备栏61,以方便用户再次进行选择。
若除候选设备栏61中的终端设备外,用户还想在其他终端设备中进行跨设备声纹注册时,用户可以通过添加控件62进行设备添加。其中,所添加的终端设备可以为已开启跨设备声纹注册功能的终端设备,也可以为未开启跨设备声纹注册功能的终端设备。具体地,当所添加的终端设备为已开启跨设备声纹注册功能的终端设备时,终端设备100可以直接将所添加的终端设备的设备名称和/或设备标识添加至已选设备栏60。当所添加的终端设备为未开启跨设备声纹注册功能的终端设备时,终端设备100可以向所添加的终端设备发送开启请求,所添加的终端设备中可以弹出相关弹窗,以提示用户开启所添加的终端设备的跨声纹注册功能。在用户开启所添加的终端设备的跨设备声纹注册功能后,终端设备100即可以将所添加的终端设备的设备名称和/或设备标识添加至已选设备栏60中。
请参阅图8,图8示出了本实施例提供的一种跨设备声纹注册方法的流程示意图。如图8所示,该方法可以包括:
S801、用户向第一终端设备输入第一注册语音。
S802、第一终端设备根据第一注册语音生成第一终端设备对应的声纹模板。
S803、第一终端设备将第一注册语音发送至第二终端设备。
可以理解的是,在确定出进行跨设备声纹注册的终端设备后,用户可以向第一终端设备中输入第一注册语音。该第一注册语音是指第一终端设备所接收到的语音,即第一注册语音中附加有第一终端设备所对应的信道信息。第一终端设备接收到第一注册语音后,可以根据第一终端设备对应的声纹识别模型对该第一注册语音进行处理,得到该第一注册语音对应的声纹模板,以完成第一终端设备的声纹注册。同时,第一终端设备还可以将该第一注册语音发送至各第二终端设备,各第二终端设备可以根据该第一注册语音得到各第二终端设备对应的第二注册语音来对各第二终端设备进行声纹注册。
需要说明的是,第一终端设备也可以从用户与第一终端设备日常的语音交互中获取第一注册语音,以通过自学习免注册的方式来进行各终端设备的声纹注册,从而简化声纹注册的操作流程,提高用户体验。
具体地,在用户与第一终端设备进行语音交互时,第一终端设备可以通过自学习方式来筛选出该用户的语音,例如可通过聚类方式筛选出该用户的语音,并可以从所筛选出的语音中选取质量好的语音作为第一注册语音来对各终端设备进行声纹注册。在此,可以通过评估各语音的信噪比和/或语音能量等级等方式来选取质量好的语音。
S804、第二终端设备根据第一终端设备相对于原始语音信号的第一信道模型和第二终端设备相对于原始语音信号的第二信道模型对第一注册语音进行转换处理;或者根据第一终端设备与第二终端设备之间的信道模型对第一注册语音进行转换处理,得到第二注册语音。
S805、第二终端设备根据第二注册语音生成第二终端设备对应的声纹模板。
示例性的,当信道模型为终端设备相对于原始语音信号的信道模型时,第二终端设备接收到第一终端设备发送的第一注册语音后,可以根据第一终端设备的设备名称和/或设备标识确定第一终端设备相对于原始语音信号的第一信道模型,以及根据第二终端设备的设备名称和/或设备标识确定第二终端设备相对于原始语音信号的第二信道模型。然后可以通过第一信道模型去除第一注册语音中第一终端设备所对应的信道信息,以得到不包含信道信息的原始语音。随后可以通过第二信道模型添加第二终端设备所对应的信道信息至原始语音,以得到包含第二终端设备对应的信道信息的第二注册语音,从而可以根据第二注册语音生成第二终端设备对应的声纹模板。
具体地,第二终端设备接收到第一注册语音后,可以对第一注册语音进行频域转换,得到第一注册语音对应的频域信号St1’。例如,可以通过快速傅里叶变换(fast fourier transform,FFT)对第一注册语音进行频域转换,得到第一注册语音对应的频域信号St1’。然后可以通过第一终端设备相对于原始语音信号的第一信道模型对频域信号St1’进行转换处理,即可以根据第一终端设备对应的语音信号与原始语音信号之间的映射关系去除频域信号St1’中第一终端设备所对应的信道信息,得到原始频域信号S’。随后可以通过第二终端设备相对于原始语音信号的第二信道模型对原始频域信号S’进行转换处理,即可以根据第二信道模型对应的语音信号与原始语音信号之间的映射关系添加第二终端设备所对应的信道信息至原始频域信号S’,得到第二终端设备对应的频域信号St2’。最后可以对频域信号St2’进行FFT反变换来得到第二终端设备对应的第二注册语音。
示例性的,当信道模型为终端设备之间的信道模型时,第二终端设备接收到第一终端设备发送的第一注册语音后,可以根据第一终端设备的设备名称和/或设备标识以及第二终端设备的设备名称和/或设备标识确定第一终端设备与第二终端设备之间的信道模型,并可以根据该信道模型直接将第一注册语音转换为第二终端设备对应的第二注册语音,减少注册语音的转换次数,以减少注册语音转换过程中的信息损失,从而提高基于第二注册语音生成的声纹模板的准确性。
具体地,第二终端设备接收到第一注册语音后,可以对第一注册语音进行频域转换,得到第一注册语音对应的频域信号St1’。然后可以根据第一终端设备对应的语音信号与第二终端设备对应的语音信号之间的映射关系直接将频域信号St1’转换为第二终端设备对应的频域信号St2’。最后可以对频域信号St2’进行FFT反变换来得到第二终端设备对应的第二注册语音。
示例性的,第一终端设备也可以直接将该第一注册语音对应的语音特征发送至各第二终端设备,各第二终端设备可以根据该第一注册语音对应的语音特征得到各第二终端设备对应的第二注册语音来对各第二终端设备进行声纹注册。即第一终端设备也可以直接将该第一注册语音进行频域转换后的语音特征发送至各第二终端设备,可使得各第二终端设备省去频域转换过程,提高各第二终端设备的处理性能。
应理解,本实施例对第一注册语音进行转换得到第二注册语音的转换过程也可以由第一终端设备执行。
示例性的,第一终端设备获取到用户输入的第一注册语音后,可以通过第一终端设备相对于原始语音信号的第一信道模型对第一注册语音进行信道信息的去除,得到不包含信道信息的原始语音。然后可以通过第二终端设备相对于原始语音信号的第二信道模型将第二终端设备对应的信道信息添加至原始语音,得到包含第二终端设备对应的信道信息的第二注册语音,并可以将第二注册语音发送至第二终端设备。
示例性的,第一终端设备获取到用户输入的第一注册语音后,可以通过第一终端设备与第二终端设备之间的信道模型将第一注册语音直接转换为第二终端设备对应的第二注册语音,并可以将第二注册语音发送至第二终端设备。
在一个示例中,当信道模型为终端设备相对于原始语音信号的信道模型时,为了降低第一终端设备和/或第二终端设备的计算量,可以将注册语音的转换过程分解至第一终端设备和第二终端设备。即第一终端设备获取到用户输入的第一注册语音后,可以通过第一终端设备相对于原始语音信号的第一信道模型对第一注册语音进行信道信息去除,以得到原始语音,并可以将原始语音发送至第二终端设备。第二终端设备接收到原始语音后,可以通过第二终端设备相对于原始语音信号的第二信道模型对原始语音进行信道信息的添加,以得到包含第二终端设备对应的信道信息的第二注册语音。
在生成第二终端设备对应的声纹模板后,用户可以直接使用第二终端设备的声纹识别功能,而无需再在第二终端设备中进行声纹注册。具体地,用户可以直接向第二终端设备输入认证语音,第二终端设备接收到认证语音后,可以通过声纹识别模型获取该认证语音对应的特征矢量(即认证模板),并计算所获取的特征矢量与第二终端设备中的声纹模板之间的相似度,以根据相似度与预设的第一相似度阈值来识别用户的身份。当相似度大于或等于第一相似度阈值时,可以确定认证语音来自所注册的用户;当相似度小于第一相似度阈值时,可以确定认证语音可能不是来自所注册的用户。其中,第一相似度阈值可以根据实际情况具体设置,例如可以将第一相似度阈值设置为70%。
本实施例对计算认证语音对应的特征矢量与声纹模板之间的相似度的算法不作限定。示例性的,可以通过余弦距离(cosine distance,CDS)、线性判别分析(linear discriminant analysis,LDA)、概率线性判别分析(probabilistic linear discriminant analysis,PLDA)等算法中的任一种来计算认证语音对应的特征矢量与声纹模板之间的相似度。
需要说明的是,在终端设备(包括上述的第一终端设备和第二终端设备)进行声纹识别的过程中,当该终端设备确定认证语音与该终端设备中声纹模板之间的相似度大于或等于预设的第二相似度阈值时,该终端设备可以将该认证语音确定为高质量语音样本,并可以利用该高质量语音样本对该终端设备对应的声纹识别模型进行增量学习,以更新该终端设备对应的声纹识别模型。同时该终端设备还可以根据该认证语音生成其他终端设备对应的高质量语音样本,使得其他终端设备可以根据该高质量语音样本对其他终端设备对应的声纹识别模型进行增量学习,以更新其他终端设备对应的声纹识别模型。即本实施例可以获取用户日常使用过程中高质量的认证语音来对各终端设备对应的声纹识别模型进行更新,提高各终端设备对应的声纹识别模型与实际使用场景的匹配性,提高各终端设备中声纹识别的鲁棒性,从而提高各终端设备声纹识别的准确率。
其中,第二相似度阈值可以根据实际情况具体设置,且第二相似度阈值可以大于或等于第一相似度阈值。例如可以将第二相似度阈值设置为90%。
本实施例对利用高质量语音样本进行增量学习的算法不作限定。示例性的,可以将高质量语音样本通过加权的方式与各终端设备对应的原始训练数据进行联合训练,来更新各终端设备对应的声纹识别模型。
本实施例可以对第一终端设备获取的注册语音进行转换处理,生成各第二终端设备所对应的注册语音,来对各第二终端设备进行声纹注册,实现一次注册语音的输入可对多个终端设备进行声纹注册的目的,减少多终端设备声纹注册的语音输入次数,提升用户体验。
【实施例二】
上述实施例一提供的方法需要通过第一终端设备和/或第二终端设备来进行注册语音的转换处理,极大地增加了第一终端设备和/或第二终端设备的计算量,影响第一终端设备和/或第二终端设备的使用性能。
请参阅图9,图9示出了本申请实施例二提供的跨设备声纹注册方法的应用场景示意图。该应用场景可以包括多个终端设备100和服务器90,其中,服务器90可以为云端服务器或控制中心等,以通过服务器来进行注册语音的转换处理,降低第一终端设备和/或第二终端设备的计算量,确保第一终端设备和/或第二终端设备的使用性能。
其中,服务器90可以分别与各终端设备100进行通信。服务器90或者与服务器90通信连接的存储装置中可以存储有各终端设备100相对于原始语音信号的信道模型,或者存储有任意两终端设备100之间的信道模型。
应理解,当用户在第一终端设备中进行声纹注册时,该第一终端设备可以将用户输入的第一注册语音发送至服务器90。服务器90可以根据该第一终端设备相对于原始语音信号的第一信道模型,以及各第二终端设备相对于原始语音信号的第二信道模型来对第一注册语音进行转换处理,得到各第二终端设备对应的第二注册语音;或者,服务器90可以根据第一终端设备与各第二终端设备之间的信道模型来对第一注册语音进行转换处理,得到各第二终端设备对应的第二注册语音。从而可以根据各第二注册语音生成各第二终端设备对应的声纹模板,以减少多终端设备声纹注册的语音输入次数。
请参阅图10,图10示出了本实施例提供的一种跨设备声纹注册方法的流程示意图。如图10所示,该方法可以包括:
S1001、用户向第一终端设备输入第一注册语音。
S1002、第一终端设备根据第一注册语音生成第一终端设备对应的声纹模板。
S1003、第一终端设备将第一注册语音发送至服务器。
可以理解的是,在确定出进行跨设备声纹注册的终端设备后,用户可以向第一终端设备中输入第一注册语音。该第一注册语音是指第一终端设备所接收到的语音,即第一注册语音中附加有第一终端设备所对应的信道信息。第一终端设备接收到第一注册语音后,可以利用第一终端设备对应的声纹识别模型对该第一注册语音进行处理,得到该第一注册语音对应的声纹模板,以完成第一终端设备的声纹注册。同时,第一终端设备还可以将该第一注册语音发送至服务器90,以便服务器90根据该第一注册语音得到其他终端设备所对应的注册语音来对其他终端设备进行声纹注册。
S1004、服务器根据第一终端设备相对于原始语音信号的第一信道模型和第二终端设备相对于原始语音信号的第二信道模型对第一注册语音进行转换处理;或者根据第一终端设备与第二终端设备之间的信道模型对第一注册语音进行转换处理,得到第二注册语音。
S1005、服务器将第二注册语音发送至第二终端设备。
示例性的,当信道模型为终端设备相对于原始语音信号的信道模型时,服务器接收到第一终端设备发送的第一注册语音后,可以根据第一终端设备的设备名称和/或设备标识确定第一终端设备相对于原始语音信号的第一信道模型,以及根据第二终端设备的设备名称和/或设备标识确定第二终端设备相对于原始语音信号的第二信道模型。然后可以根据第一信道模型去除第一注册语音中第一终端设备所对应的信道信息,以得到不包含信道信息的原始语音。随后可以根据第二信道模型添加第二终端设备所对应的信道信息至原始语音,以得到包含各第二终端设备对应的信道信息的第二注册语音,并可以将各第二注册语音发送至对应的第二终端设备。
示例性的,当信道模型为终端设备之间的信道模型时,服务器接收到第一终端设备发送的第一注册语音后,可以根据第一终端设备的设备名称和/或设备标识以及第二终端设备的设备名称和/或设备标识确定第一终端设备与第二终端设备之间的信道模型,并可以根据该信道模型直接将第一注册语音转换为第二终端设备对应的第二注册语音,以减少注册语音的转换次数,减少注册语音转换过程中的信息损失,从而提高基于第二注册语音生成的声纹模板的准确性。
S1006、第二终端设备根据第二注册语音生成第二终端设备对应的声纹模板。
在此,各第二终端设备接收到对应的第二注册语音后,可以根据各第二终端设备对应的声纹识别模型对各第二注册语音进行处理,得到该第二注册语音对应的声纹模板,即从各第二注册语音中提取声纹特征来得到各第二终端设备对应的声纹模板。例如,第二终端设备A接收到第二注册语音A后,可以根据第二终端设备A对应的声纹识别模型A对第二注册语音A进行声纹特征的提取,得到第二终端设备A对应的声纹模板A;第二终端设备B接收到第二注册语音B后,可以根据第二终端设备B对应的声纹识别模型B对第二注册语音B进行声纹特征的提取,得到第二终端设备B对应的声纹模板B,等等。其中,第二注册语音A为添加有第二终端设备A所对应的信道信息的语音,第二注册语音B为添加有第二终端设备B所对应的信道信息的语音。
在一个示例中,为进一步减少各第二终端设备的计算量,也可以直接由服务器90来生成各第二终端设备对应的声纹模板。服务器90或者与服务器90通信连接的存储装置中还可以存储有各终端设备对应的声纹识别模型。当服务器90获取到各第二终端设备对应的第二注册语音后,服务器90可以根据各第二终端设备对应的设备名称和/或设备标识获取各第二终端设备对应的声纹识别模型,并可以根据各第二终端设备对应的声纹识别模型对各第二注册语音进行处理,得到各第二终端设备对应的声纹模板,并将各第二终端设备对应的声纹模板分别发送至对应的第二终端设备。
需要说明的是,在终端设备(包括上述的第一终端设备和第二终端设备)进行声纹识别的过程中,当该终端设备确定认证语音与该终端设备中声纹模板之间的相似度大于或等于预设的第二相似度阈值时,该终端设备可以将该认证语音发送至服务器90。服务器90可以利用该认证语音对该终端设备对应的声纹识别模型进行增量学习,以更新该终端设备对应的声纹识别模型,并可以将更新后的声纹识别模型发送至该终端设备。同时服务器90还可以根据该认证语音分别生成其他终端设备的训练语音来对其他终端设备对应的声纹识别模型进行增量学习,以更新其他终端设备对应的声纹识别模型,并可以将更新后的各声纹识别模型分别发送至对应的终端设备。即本实施例中,服务器可以获取用户日常使用过程中高质量的认证语音来对各终端设备对应的声纹识别模型进行更新,以提高各终端设备对应的声纹识别模型与实际使用场景的匹配性,提高各终端设备中声纹识别的鲁棒性,从而提高各终端设备声纹识别的准确率。
本实施例中,通过服务器进行注册语音的转换处理,来生成其他终端设备对应的注册语音,不仅可实现一次注册语音的输入可对多个终端设备进行声纹注册的目的,减少多终端设备声纹注册的语音输入次数,提升用户体验。同时还可减少各终端设备的计算量,确保各终端设备的使用性能。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
图11为本申请一实施例提供的电子设备的结构示意图。如图11所示,该实施例的电子设备11包括:至少一个处理器1100(图11中仅示出一个)、存储器1101以及存储在所述存储器1101中并可在所述至少一个处理器1100上运行的计算机程序1102,所述处理器1100执行所述计算机程序1102时,使所述电子设备11实现上述任意各个跨设备声纹注册方法实施例中的步骤。
所述电子设备11可以是终端设备或服务器。所述电子设备11可包括,但不仅限于,处理器1100、存储器1101。本领域技术人员可以理解,图11仅仅是电子设备11的举例,并不构成对电子设备11 的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如还可以包括输入输出设备、网络接入设备等。
所述处理器1100可以是中央处理单元(central processing unit,CPU),该处理器1100还可以是其他通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
所述存储器1101在一些实施例中可以是所述电子设备11的内部存储单元,例如电子设备11的硬盘或内存。所述存储器1101在另一些实施例中也可以是所述电子设备11的外部存储设备,例如所述电子设备11上配备的插接式硬盘,智能存储卡(smart media card,SMC),安全数字(secure digital,SD)卡,闪存卡(flash card)等。进一步地,所述存储器1101还可以既包括所述电子设备11的内部存储单元也包括外部存储设备。所述存储器1101用于存储操作系统、应用程序、引导装载程序(bootLoader)、数据以及其他程序等,例如所述计算机程序的程序代码等。所述存储器1101还可以用于暂时地存储已经输出或者将要输出的数据。
本申请实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被计算机执行时,使所述计算机实现上述各个方法实施例中的步骤。
本申请实施例提供了一种计算机程序产品,当计算机程序产品在电子设备上运行时,使得电子设备实现上述各个方法实施例中的步骤。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读存储介质中。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的步骤。其中,所述计算机程序包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读存储介质至少可以包括:能够将计算机程序代码携带到装置/电子设备的任何实体或装置、记录介质、计算机存储器、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、电载波信号、电信信号以及软件分发介质。例如U盘、移动硬盘、磁碟或者光盘等。在某些司法管辖区,根据立法和专利实践,计算机可读存储介质不可以是电载波信号和电信信号。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的实施例中,应该理解到,所揭露的装置/电子设备和方法,可以通过其它的方式实现。例如,以上所描述的装置/电子设备实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性, 机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (16)

  1. 一种跨设备声纹注册方法,其特征在于,应用于第二终端设备,所述方法包括:
    获取第一终端设备对应的第一注册语音;
    对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音;
    根据所述第二注册语音生成所述第二终端设备对应的声纹模板。
  2. 根据权利要求1所述的方法,其特征在于,所述对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音包括:
    通过所述第一终端设备对应的第一信道模型对所述第一注册语音进行转换处理,得到所述第一注册语音对应的原始语音,所述第一信道模型用于表征所述第一终端设备对应的语音与原始语音之间的映射关系;
    通过所述第二终端设备对应的第二信道模型对所述原始语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第二信道模型用于表征所述第二终端设备对应的语音与原始语音之间的映射关系。
  3. 根据权利要求1所述的方法,其特征在于,所述对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音包括:
    通过第三信道模型对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第三信道模型用于表征所述第一终端设备对应的语音与所述第二终端设备对应的语音之间的映射关系。
  4. 根据权利要求2所述的方法,其特征在于,所述第一信道模型和所述第二信道模型为基于频率响应曲线构建的信道模型,或者为基于频谱特征构建的信道模型。
  5. 根据权利要求3所述的方法,其特征在于,所述第三信道模型为基于频率响应曲线构建的信道模型,或者为基于频谱特征构建的信道模型。
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述根据所述第二注册语音生成所述第二终端设备对应的声纹模板包括:
    根据所述第二终端设备对应的声纹识别模型和所述第二注册语音生成所述第二终端设备对应的声纹模板,所述第二终端设备对应的声纹识别模型为基于所述第二终端设备获取的训练语音训练得到的声纹识别模型。
  7. 根据权利要求1-6任一项所述的方法,其特征在于,在所述根据所述第二注册语音生成所述第二终端设备对应的声纹模板之后,还包括:
    获取所述第二终端设备对应的认证语音;
    根据所述第二终端设备对应的声纹识别模型和所述认证语音生成所述认证语音对应的认证模板;
    确定所述认证模板与所述声纹模板之间的相似度;
    当所述相似度大于预设的相似度阈值时,根据所述认证语音更新所述第二终端设备对应的声纹识别模型。
  8. 根据权利要求7所述的方法,其特征在于,当所述相似度大于预设的相似度阈值时,所述方法还包括:
    对所述认证语音进行转换处理,得到所述第一终端设备对应的训练语音,并向所述第一终端设备发送所述训练语音,所述训练语音用于更新所述第一终端设备对应的声纹识别模型。
  9. 一种跨设备声纹注册方法,其特征在于,应用于第一终端设备或服务器,所述方法包括:
    获取所述第一终端设备对应的第一注册语音;
    对所述第一注册语音进行转换处理,得到第二终端设备对应的第二注册语音;
    向所述第二终端设备发送所述第二注册语音,所述第二注册语音用于生成所述第二终端设备对应的声纹模板。
  10. 根据权利要求9所述的方法,其特征在于,所述对所述第一注册语音进行转换处理,得到第二终端设备对应的第二注册语音包括:
    通过所述第一终端设备对应的第一信道模型对所述第一注册语音进行转换处理,得到所述第一注册语音对应的原始语音,所述第一信道模型用于表征所述第一终端设备对应的语音与原始语音之间的映射关系;
    通过所述第二终端设备对应的第二信道模型对所述原始语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第二信道模型用于表征所述第二终端设备对应的语音与原始语音之间的映射关系。
  11. 根据权利要求9所述的方法,其特征在于,所述对所述第一注册语音进行转换处理,得到第二终端设备对应的第二注册语音包括:
    通过第三信道模型对所述第一注册语音进行转换处理,得到所述第二终端设备对应的第二注册语音,所述第三信道模型用于表征所述第一终端设备对应的语音与所述第二终端设备对应的语音之间的映射关系。
  12. 一种跨设备声纹注册方法,其特征在于,包括:
    第一终端设备获取所述第一终端设备对应的第一注册语音;
    所述第一终端设备对所述第一注册语音进行转换处理,得到所述第一注册语音对应的第一原始语音,并向第二终端设备发送所述第一原始语音;
    所述第二终端设备接收来自所述第一终端设备的所述第一原始语音,并对所述第一原始语音进行转换处理,得到所述第二终端设备对应的第二注册语音;
    所述第二终端设备根据所述第二注册语音生成所述第二终端设备对应的声纹模板。
  13. 根据权利要求12所述的方法,其特征在于,所述第一终端设备获取所述第一终端设备对应的第一注册语音包括:
    所述第一终端设备获取所述第一终端设备与用户之间的交互语音,并获取所述交互语音中的目标语音,所述目标语音为所述用户对应的语音;
    所述第一终端设备根据所述目标语音对应的信噪比和/或语音能量等级从所述目标语音中获取所述第一终端设备对应的第一注册语音。
  14. 根据权利要求12或13所述的方法,其特征在于,所述第二终端设备根据所述第二注册语音生成所述第二终端设备对应的声纹模板之后包括:
    所述第二终端设备获取所述第二终端设备对应的认证语音,根据所述第二终端设备对应的声纹识别模型和所述认证语音生成所述认证语音对应的认证模板,并确定所述认证模板与所述声纹模板之间的相似度;
    当所述相似度大于预设的相似度阈值时,所述第二终端设备根据所述认证语音更新所述第二终端设备对应的声纹识别模型,并根据所述第二终端设备对应的第二信道模型对所述认证语音进行转换处理,得到所述认证语音对应的第二原始语音,向所述第一终端设备发送所述第二原始语音;
    所述第一终端设备接收来自所述第二终端设备的所述第二原始语音,根据所述第一终端设备对应的第一信道模型对所述第二原始语音进行转换处理,得到所述第一终端设备对应的训练语音,并根据所述第一终端设备对应的训练语音更新所述第一终端设备对应的声纹识别模型。
  15. 一种电子设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时,使所述电子设备实现如权利要求1至8任一项,或9-11任一项所述的跨设备声纹注册方法。
  16. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其特征在于,所述计算机程序被计算机执行时,使所述计算机实现如权利要求1至8任一项,或9-11任一项所述的跨设备声纹注册方法。
PCT/CN2021/104585 2020-07-07 2021-07-05 跨设备声纹注册方法、电子设备及存储介质 WO2022007757A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010650133.9A CN114093368A (zh) 2020-07-07 2020-07-07 跨设备声纹注册方法、电子设备及存储介质
CN202010650133.9 2020-07-07

Publications (1)

Publication Number Publication Date
WO2022007757A1 true WO2022007757A1 (zh) 2022-01-13

Family

ID=79552256

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/104585 WO2022007757A1 (zh) 2020-07-07 2021-07-05 跨设备声纹注册方法、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN114093368A (zh)
WO (1) WO2022007757A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060020460A1 (en) * 2003-07-31 2006-01-26 Fujitsu Limited Voice authentication system
CN101047508A (zh) * 2007-01-15 2007-10-03 深圳市莱克科技有限公司 登录认证系统
CN105321520A (zh) * 2014-06-16 2016-02-10 丰唐物联技术(深圳)有限公司 一种语音控制方法及装置
CN108259280A (zh) * 2018-02-06 2018-07-06 北京语智科技有限公司 一种室内智能化控制的实现方法、系统
CN108492830A (zh) * 2018-03-28 2018-09-04 深圳市声扬科技有限公司 声纹识别方法、装置、计算机设备和存储介质
CN109378006A (zh) * 2018-12-28 2019-02-22 三星电子(中国)研发中心 一种跨设备声纹识别方法及系统

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114400009A (zh) * 2022-03-10 2022-04-26 深圳市声扬科技有限公司 声纹识别方法、装置以及电子设备

Also Published As

Publication number Publication date
CN114093368A (zh) 2022-02-25

Similar Documents

Publication Publication Date Title
WO2020211701A1 (zh) 模型训练方法、情绪识别方法及相关装置和设备
CN112231025B (zh) Ui组件显示的方法及电子设备
CN111030990B (zh) 一种建立通信连接的方法及客户端、服务端
CN111742539B (zh) 一种语音控制命令生成方法及终端
WO2021052139A1 (zh) 手势输入方法及电子设备
WO2022127787A1 (zh) 一种图像显示的方法及电子设备
CN113821767A (zh) 应用程序的权限管理方法、装置和电子设备
WO2021218429A1 (zh) 应用窗口的管理方法、终端设备及计算机可读存储介质
CN111031492B (zh) 呼叫需求响应方法、装置及电子设备
CN115333941A (zh) 获取应用运行情况的方法及相关设备
WO2022007707A1 (zh) 家居设备控制方法、终端设备及计算机可读存储介质
WO2022007757A1 (zh) 跨设备声纹注册方法、电子设备及存储介质
CN113380240B (zh) 语音交互方法和电子设备
WO2022161077A1 (zh) 语音控制方法和电子设备
WO2022062902A1 (zh) 一种文件传输方法和电子设备
WO2022022319A1 (zh) 一种图像处理方法、电子设备、图像处理系统及芯片系统
CN114828098B (zh) 数据传输方法和电子设备
WO2021147483A1 (zh) 数据分享的方法和装置
WO2020233581A1 (zh) 一种测量高度的方法和电子设备
CN115730091A (zh) 批注展示方法、装置、终端设备及可读存储介质
CN115373957A (zh) 杀应用的方法及设备
CN114003241A (zh) 应用程序的界面适配显示方法、系统、电子设备和介质
CN113867851A (zh) 电子设备操作引导信息录制方法、获取方法和终端设备
WO2022179495A1 (zh) 一种隐私风险反馈方法、装置及第一终端设备
WO2022166550A1 (zh) 数据传输方法及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21837516

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21837516

Country of ref document: EP

Kind code of ref document: A1