CN114026637A - Voiceprint recognition and registration device and cross-device voiceprint recognition method


Info

Publication number: CN114026637A
Application number: CN202080001170.5A
Authority: CN (China)
Prior art keywords: voiceprint, user, voiceprint information, voice
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 高振东, 吴晶, 陈晓
Assignee (current and original): Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation

Abstract

A voiceprint recognition apparatus, a voiceprint registration apparatus, and a cross-device voiceprint recognition method. The method is applied to a system that includes a voiceprint recognition device and a voiceprint registration device. The voiceprint registration device eliminates device-related noise in a first voice to obtain registered voiceprint information, which provides the basis for voiceprint recognition on the recognition device. The voiceprint recognition device eliminates device-related noise in a second voice to obtain target voiceprint information. During recognition, the voiceprint recognition device matches the target voiceprint information against the registered voiceprint information to obtain a matching result that indicates the user's identity. Because the registered voiceprint information is the voiceprint remaining after the device's influence on the biological voiceprint has been eliminated, it can be shared by other devices: the user can use the voiceprint recognition function on the recognition device without repeating voiceprint registration, the tedious registration process is avoided, and cross-device voiceprint recognition is achieved.

Description

Voiceprint recognition and registration device and cross-device voiceprint recognition method

Technical Field
The application relates to the technical field of voiceprint recognition, in particular to a voiceprint recognition and registration device and a cross-device voiceprint recognition method.
Background
Voiceprint recognition is a biometric identification technology that determines a speaker's identity by converting a voice signal into a digital signal and analyzing it with a computer. At present, a device must enroll the user's voiceprint in advance before its voiceprint recognition function can be used. With the continuous development of smart-device technology and the growing number of voiceprint-enabled devices, registering a voiceprint on every device has become a tedious process.
In the prior art, voiceprint mapping models may be established between different devices: a voiceprint is extracted from speech recorded by a first device and enrolled; a voiceprint feature is then extracted from a voice command recorded by a second device and mapped, using the established mapping model, onto the voiceprint enrolled on the first device, so that no enrollment is needed on the second device.
In this scheme, the stored voiceprint mapping models are device-specific, so pairwise mappings between all devices are required and the number of mapping relationships becomes excessive: when Q devices exist in an environment, Q × (Q - 1) pairwise voiceprint mapping models must be established. Moreover, if the channel of a single device changes, every mapping model involving that device becomes invalid, and recognition accuracy is difficult to guarantee.
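As a quick illustration of how this scales (the value Q = 10 is an assumed example, not a figure taken from the application):

$$
Q \times (Q - 1)\,\Big|_{Q = 10} \;=\; 10 \times 9 \;=\; 90 \ \text{pairwise mapping models,}
$$

whereas the scheme described below requires only a single device-independent enrollment per user, no matter how many recognition devices are later added.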
Disclosure of Invention
The embodiments of this application provide a voiceprint recognition apparatus in a recognition device, a voiceprint registration apparatus in a registration device, and a cross-device voiceprint recognition method. The method is applied to a cross-device voiceprint recognition system that includes the recognition device and the registration device. The registration device registers the user's voice: it eliminates noise related to the registration device from the recorded voice to obtain the user's biological voiceprint and records that biological voiceprint under the user's voiceprint identifier to obtain registered voiceprint information; in other words, the user's biological voiceprint is recorded on file, and the recorded biological voiceprint is referred to as the user's registered voiceprint information. The registered voiceprint information provides the basis for voiceprint recognition. The recognition device receives voice input by the user, eliminates noise related to the recognition device from that voice to obtain the user's biological voiceprint (referred to as target voiceprint information), and implements the voiceprint recognition function through the registered voiceprint information shared by the registration device: it matches the target voiceprint information against the registered voiceprint information, and the matching result indicates the user's identity.
In a first aspect, an embodiment of this application provides a voiceprint recognition apparatus in a recognition device. The recognition device may be a terminal, and the voiceprint recognition apparatus may be a processor, a chip, or a chip system in that terminal; alternatively, the voiceprint recognition apparatus is the terminal itself, which is a component of the recognition device. The apparatus comprises a processor, and a voice input/output device and a transceiver connected to the processor. The voice input/output device is configured to receive voice input by a user. The processor is configured to eliminate noise related to the recognition device from the received voice to obtain target voiceprint information; that is, the target voiceprint information is the biological voiceprint remaining after the recognition device's influence on the voice has been eliminated, and the user's biological voiceprint is equivalent to the voiceprint a listener would hear from the speaker face to face. The transceiver then obtains the user's registered voiceprint information, which includes at least one biological voiceprint template obtained by eliminating noise associated with the registration device. The processor is further configured to match the target voiceprint information against the registered voiceprint information to obtain a matching result, and the matching result indicates the user's identity. In this embodiment, the registered voiceprint information is the user's biological voiceprint information extracted from the voice after the noise related to the registration device has been eliminated. Because the registration device's influence on the biological voiceprint is removed, the registered voiceprint information can be shared with other recognition devices: the user only needs to register a voiceprint on one device (the registration device) to use the voiceprint recognition function on multiple devices, the recognition device does not need to perform voiceprint registration, the tedious registration process is avoided, cross-device voiceprint recognition is achieved, and user experience is greatly improved. Furthermore, since both the registered voiceprint information and the target voiceprint information are biological voiceprints with device-related noise separated out, the device's influence on the voiceprint is eliminated and recognition accuracy remains high regardless of the type of terminal used for recognition.
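The application does not fix a particular matching algorithm for this step. Purely as an illustrative sketch, assuming the target and registered voiceprints are fixed-length embedding vectors, the match could be scored with cosine similarity against each enrolled template (the function names, the cosine scorer, and the 0.7 threshold are assumptions, not part of the disclosure):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def match_voiceprint(target: np.ndarray,
                     enrolled_templates: list,
                     threshold: float = 0.7) -> bool:
    """Return True if the target voiceprint matches any enrolled template.

    `target` is the device-noise-free voiceprint extracted by the recognition
    device; `enrolled_templates` are the biological voiceprint templates in the
    user's registered voiceprint information.
    """
    best = max(cosine_similarity(target, t) for t in enrolled_templates)
    return best >= threshold
```

A practical implementation would calibrate the decision threshold on held-out data and might use a different scoring back end; the point here is only that both sides of the comparison are device-independent voiceprints.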
In one possible implementation, the apparatus further includes a transceiver connected to the processor. The transceiver is configured to receive the registered voiceprint information and its voiceprint identifier from the storage device or the registration device, where the voiceprint identifier indicates the user. In this embodiment the registered voiceprint information may be stored in the storage device, which provides a basis for sharing it with other devices: when the user needs to use the recognition device for voiceprint recognition, no voiceprint registration is performed on the recognition device; instead, the registered voiceprint information held by the storage device or the registration device is shared.
In one possible implementation, the transceiver is further configured to send the user's voiceprint identifier to the storage device or the registration device to request the user's registered voiceprint information. The user's voiceprint identifier corresponds to the registered voiceprint information: the recognition apparatus sends the voiceprint identifier to the storage device or the registration device, and the storage device or the registration device returns the registered voiceprint information corresponding to that identifier to the voiceprint recognition apparatus.
In one possible implementation, the processor is specifically configured to input the voice into a voiceprint extraction model and to eliminate noise related to the recognition device from the voice through that model, obtaining the target voiceprint information. In this embodiment the voiceprint extraction model is trained in advance by machine learning, and the trained model directly extracts the user's biological voiceprint from the voice, which gives better robustness.
In one possible implementation, the voiceprint extraction model is obtained by learning from corpora collected by a plurality of devices. The plurality of devices may be different types of terminals, including but not limited to terminals in smart homes, lighting systems, smart speakers, robots, vehicle-mounted terminals, user equipment, mobile phones, tablet computers, personal computers, virtual reality terminal devices, and augmented reality terminal devices. Because the voiceprint extraction model is trained on corpora collected by a plurality of different devices, the voiceprint information it outputs removes the influence of different devices on the voice and yields the user's biological voiceprint.
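The application does not disclose the architecture of the voiceprint extraction model. The following is a minimal, hypothetical sketch of the general idea, assuming a small PyTorch embedding network trained with a speaker-classification loss on corpora pooled from many device types, so that the learned embedding encodes speaker identity rather than device characteristics (all layer sizes, the pooling scheme, and the training objective are assumptions):

```python
import torch
import torch.nn as nn

class VoiceprintExtractor(nn.Module):
    """Maps per-utterance acoustic features to a fixed-length voiceprint embedding."""

    def __init__(self, feat_dim: int = 40, emb_dim: int = 192, num_speakers: int = 1000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.embedding = nn.Linear(256, emb_dim)
        # The classifier is used only during training; the embedding is the voiceprint.
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) acoustic features, e.g. log-mel filterbanks
        hidden = self.encoder(feats).mean(dim=1)   # temporal average pooling
        return self.embedding(hidden)              # the "biological voiceprint"

def train_step(model, feats, speaker_ids, optimizer):
    """Utterances are pooled from many device types, so the speaker loss pushes
    the embedding to keep speaker identity and discard device characteristics."""
    logits = model.classifier(model(feats))
    loss = nn.functional.cross_entropy(logits, speaker_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```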
In one possible implementation, the processor is further configured to perform frequency response compensation on the voice through a filter, where the frequency response compensation eliminates noise related to the recognition device from the voice to obtain the target voiceprint information. In this embodiment the voice signal is processed by the filter: over-emphasized components are attenuated, weaker components are boosted, and the frequency response is compensated, which removes the device-related noise. Filtering removes the noise components directly, the output is the user's biological voiceprint signal, and the approach is simple and fast to implement.
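As a rough sketch of this filtering alternative, assuming the recognition device's frequency response is known (for example, from calibration measurements), an inverse-response FIR filter can flatten the device's coloration before voiceprint extraction; the SciPy-based design below, the calibration values, and the helper name are illustrative assumptions rather than the application's implementation:

```python
import numpy as np
from scipy.signal import firwin2, lfilter

def compensate_frequency_response(speech: np.ndarray,
                                  freqs_hz: np.ndarray,
                                  device_gain: np.ndarray,
                                  sample_rate: int = 16000,
                                  numtaps: int = 257) -> np.ndarray:
    """Flatten the device's coloration with an inverse-response FIR filter.

    `freqs_hz` / `device_gain` describe the device's measured frequency
    response (assumed known from calibration) and must span 0 Hz to the
    Nyquist frequency. Bands the device attenuates are boosted and bands
    it exaggerates are attenuated.
    """
    nyq = sample_rate / 2.0
    norm_freqs = np.clip(freqs_hz / nyq, 0.0, 1.0)       # 0 .. 1, where 1 = Nyquist
    inverse_gain = 1.0 / np.maximum(device_gain, 1e-3)   # cap the boost
    fir = firwin2(numtaps, norm_freqs, inverse_gain)
    return lfilter(fir, [1.0], speech)

# Hypothetical calibration: the device boosts 1-2 kHz and rolls off above 6 kHz.
freqs = np.array([0.0, 1000.0, 2000.0, 6000.0, 8000.0])
gain = np.array([1.0, 1.3, 1.3, 0.6, 0.3])
mic_recording = np.random.randn(16000)                   # stand-in for recorded speech
compensated = compensate_frequency_response(mic_recording, freqs, gain)
```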
In one possible implementation, the processor is further configured to obtain user data associated with the user's identity and to execute an operation corresponding to that user data. In this embodiment the processor carries out the operation corresponding to the user's historical information, so the user does not have to repeat habitual operations every time, which improves the user experience.
In a second aspect, an embodiment of this application provides a voiceprint registration apparatus in a registration device. The registration device may be a terminal, and the voiceprint registration apparatus may be a processor, a chip, or a chip system in that terminal; alternatively, the voiceprint registration apparatus is the terminal itself, which is a component of the registration device. The apparatus comprises a processor and a voice input/output device connected to the processor. The voice input/output device is configured to receive voice input by a user. The processor is configured to eliminate noise related to the registration device from the voice to obtain the user's registered voiceprint information, which includes the user's biological voiceprint. The registered voiceprint information provides the basis for voiceprint recognition on other devices and may be shared by a plurality of recognition devices. In this embodiment the registration device extracts from the received voice a clean biological voiceprint of the user by eliminating device-related noise (this biological voiceprint is closest to the user's voiceprint when speaking face to face). Because the device's influence on the voice is eliminated, the extracted biological voiceprint information can serve as registered voiceprint information shared by multiple devices, achieving cross-device voiceprint recognition and greatly improving user experience.
In one possible implementation, the apparatus further includes a transceiver connected to the processor. The transceiver is configured to send the registered voiceprint information and its corresponding voiceprint identifier to the storage device or the recognition device, where the voiceprint identifier indicates the user. In this embodiment the storage device or the recognition device acts as the storage center for the registered voiceprint information and the voiceprint identifier, which provides a basis for sharing the registered voiceprint information.
In one possible implementation, the processor is specifically configured to input the voice into the voiceprint extraction model and to eliminate noise related to the registration device from the voice through that model, obtaining the registered voiceprint information. In this embodiment the voiceprint extraction model is trained in advance by machine learning, and the trained model directly extracts the user's biological voiceprint from the voice, which gives better robustness.
In one possible implementation, the voiceprint extraction model is obtained by learning from corpora collected by a plurality of devices. In this embodiment, the voiceprint information output by the model removes the influence of different devices on the voice and yields the user's biological voiceprint.
In one possible implementation, the processor is further configured to perform frequency response compensation on the voice through a filter, where the frequency response compensation eliminates noise related to the registration device from the voice to obtain the registered voiceprint information. As above, the filter attenuates over-emphasized components, boosts weaker components, and compensates the frequency response, so the device-related noise is removed, the output is the user's biological voiceprint signal, and the approach is simple and fast to implement.
In a third aspect, an embodiment of this application provides a cross-device voiceprint recognition method applied to a recognition device, which may include the following steps: the recognition device receives voice input by a user; it then eliminates noise related to the recognition device from the voice to obtain the user's biological voiceprint, referred to as target voiceprint information; it obtains the user's registered voiceprint information, which includes at least one biological voiceprint template obtained by eliminating noise related to the registration device; and it matches the target voiceprint information against the registered voiceprint information to obtain a matching result that indicates the user's identity. As in the first aspect, the registered voiceprint information is the user's biological voiceprint information extracted after the noise related to the registration device has been eliminated. Because the registration device's influence on the biological voiceprint is removed, the registered voiceprint information can be shared with other recognition devices: the user only needs to register a voiceprint on the registration device to use the voiceprint recognition function on multiple devices, no registration is needed on each recognition device, the tedious registration process is avoided, cross-device voiceprint recognition is achieved, and user experience is greatly improved. Since both the registered voiceprint information and the target voiceprint information are biological voiceprints with device-related noise separated out, recognition accuracy remains high regardless of the type of terminal used for recognition.
In an optional implementation, obtaining the user's registered voiceprint information may include: receiving, from the storage device or the registration device, the registered voiceprint information and the voiceprint identifier corresponding to it, where the voiceprint identifier indicates the user.
In an optional implementation, before receiving the registered voiceprint information and its corresponding voiceprint identifier from the storage device or the registration device, the method further includes: sending the voiceprint identifier to the storage device or the registration device to request the user's registered voiceprint information.
In an optional implementation, eliminating the noise related to the recognition device from the voice to obtain the target voiceprint information may include inputting the voice into a voiceprint extraction model and eliminating the recognition-device-related noise through that model. As described above, the model is trained in advance by machine learning and directly extracts the user's biological voiceprint from the voice, giving better robustness.
In an optional implementation, the voiceprint extraction model is obtained by learning from corpora collected by a plurality of devices, so the voiceprint information it outputs removes the influence of different devices on the voice and yields the user's biological voiceprint.
In an optional implementation, after matching the target voiceprint information against the registered voiceprint information to obtain a matching result, the method further includes: obtaining user data associated with the user's identity, where the user data may be historical data of the user's operations, and executing the operation corresponding to that user data. Because the recognition device obtains the user's data and carries out the operation corresponding to the historical information, the user does not have to repeat habitual operations every time, which improves user experience.
In a fourth aspect, an embodiment of this application provides a cross-device voiceprint recognition method applied to a registration device, which may include: receiving voice input by a user; and eliminating noise related to the registration device from the voice to obtain registered voiceprint information, which includes the user's biological voiceprint. In this embodiment the registration device extracts a clean biological voiceprint of the user from the received voice by eliminating device-related noise (this biological voiceprint is closest to the user's voiceprint when speaking face to face).
In an optional implementation, the registration device sends the registered voiceprint information and its corresponding voiceprint identifier to the storage device or the recognition device; the storage device or the recognition device then acts as the storage center for the registered voiceprint information and the voiceprint identifier, which provides a basis for sharing the registered voiceprint information.
In an optional implementation, eliminating the noise related to the registration device from the voice to obtain the registered voiceprint information may include: inputting the voice into a voiceprint extraction model and eliminating the registration-device-related noise through that model. As described above, the model is trained in advance by machine learning and directly extracts the user's biological voiceprint from the voice, giving better robustness.
In an optional implementation, the voiceprint extraction model is obtained by learning from corpora collected by a plurality of devices, so the voiceprint information it outputs removes the influence of different devices on the voice and yields the user's biological voiceprint.
In an optional implementation, eliminating the noise related to the registration device from the voice to obtain the registered voiceprint information may include: performing frequency response compensation on the voice through a filter, where the frequency response compensation eliminates the registration-device-related noise from the voice. As above, the filter attenuates over-emphasized components, boosts weaker components, and compensates the frequency response, so the device-related noise is removed, the output is the user's biological voiceprint signal, and the approach is simple and fast to implement.
In a fifth aspect, an embodiment of this application provides a voiceprint recognition apparatus having the functions implemented by the recognition device in the third aspect. These functions may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the functions. The apparatus includes: a voice input/output module configured to receive voice input by a user; a processing module configured to eliminate noise related to the recognition device from the voice received by the voice input/output module to obtain target voiceprint information, which is the user's biological voiceprint; and a transceiver module configured to obtain the user's registered voiceprint information, which includes at least one biological voiceprint template obtained by eliminating noise related to the registration device. The processing module is further configured to match the target voiceprint information with the registered voiceprint information obtained by the transceiver module to obtain a matching result that indicates the user's identity.
In a sixth aspect, an embodiment of this application provides a voiceprint registration apparatus having the functions implemented by the registration device in the fourth aspect. These functions may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the functions. The apparatus includes: a voice input/output module configured to receive voice input by a user; and a processing module configured to eliminate noise related to the registration device from the voice received by the voice input/output module to obtain the user's registered voiceprint information.
In a seventh aspect, an embodiment of this application provides a cross-device voiceprint recognition system that includes a registration device and a recognition device. The registration device receives a first voice input by a user and eliminates noise related to the registration device from the first voice to obtain registered voiceprint information, which includes the user's biological voiceprint and provides the basis for the recognition device to perform voiceprint recognition. The recognition device receives a second voice input by the user and eliminates noise related to the recognition device from the second voice to obtain target voiceprint information, which is the user's biological voiceprint. The recognition device matches the target voiceprint information against the registered voiceprint information to obtain a matching result that indicates the user's identity. Because the registration device's influence on the biological voiceprint has been eliminated, the registered voiceprint information can be shared with other voiceprint recognition devices: the user only needs to register a voiceprint on one device (the registration device) to use voiceprint recognition on multiple devices, no registration is needed on each device, the tedious registration process is avoided, cross-device voiceprint recognition is achieved, and user experience is greatly improved. Since both the registered voiceprint information and the target voiceprint information are voiceprint information with device-related noise eliminated, recognition accuracy remains high no matter which device the user uses for voiceprint recognition.
In an optional implementation, the system further includes a storage device. The storage device receives and stores the registered voiceprint information sent by the registration device together with the corresponding voiceprint identifier, where the voiceprint identifier indicates the user; the recognition device then receives the registered voiceprint information and the corresponding voiceprint identifier from the storage device. Because the storage device stores the registered voiceprint information, a basis is provided for sharing it with other devices: when the user needs to perform voiceprint recognition on another device (also called a voiceprint recognition device), no registration is needed on that device; the registered voiceprint information already registered on the registration device is simply shared.
In an eighth aspect, an embodiment of this application provides a chip including a processor and a memory, where the memory stores a program or instructions that, when executed by the processor, cause a recognition device to perform the method of any implementation of the third aspect or cause a registration device to perform the method of any implementation of the fourth aspect.
In a ninth aspect, an embodiment of this application provides a computer-readable medium storing a computer program or instructions that, when executed, cause a computer to perform the method of any implementation of the third aspect or of the fourth aspect.
Drawings
FIG. 1 is a schematic diagram of one embodiment of a communication system in an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of cross-device voiceprint recognition in an embodiment of the present application;
FIG. 3 is a flowchart illustrating steps of one embodiment of voiceprint registration of a registration device in an embodiment of the application;
FIG. 4 is a schematic diagram illustrating a scenario of an embodiment of storing registered voiceprint information according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a scenario of another embodiment of storing registered voiceprint information in an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a scenario of another embodiment of storing registered voiceprint information in an embodiment of the present application;
FIG. 7 is a flowchart illustrating the steps of one embodiment of voiceprint recognition in an embodiment of the present application;
FIG. 8 is a schematic diagram of an application scenario of cross-device voiceprint recognition in an embodiment of the present application;
FIG. 9 is a schematic diagram of another application scenario of cross-device voiceprint recognition in the embodiment of the present application;
FIG. 10 is a schematic diagram of training a voiceprint extraction model in an embodiment of the present application;
FIG. 11A is a schematic diagram of a curve without frequency response compensation in an embodiment of the present application;
FIG. 11B is a schematic diagram of a curve with frequency response compensation in an embodiment of the present application;
FIG. 12 is a diagram illustrating the generation of at least one voiceprint information template in an embodiment of the present application;
FIG. 13 is a schematic view of an embodiment of an apparatus in an embodiment of the present application;
FIG. 14 is a schematic view of another embodiment of an apparatus according to embodiments of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
An embodiment of this application provides a cross-device voiceprint recognition method in which voiceprint registration is completed on one device and the registered voiceprint information is then used on other, unregistered devices to complete the corresponding tasks. The multiple devices may be of the same type or of different types.
The method is applied to a communication system, an example of whose architecture is shown in fig. 1; the communication system includes a server 101 and a plurality of terminals 102. The server 101 may be a single server, a server cluster, or a cloud server. The terminal 102 may be a terminal in a smart home, including but not limited to smart appliances (such as a smart screen, a television, a smart washing machine, or a smart air conditioner), a lighting system, a smart speaker, or a robot; the terminal 102 may also be a vehicle-mounted terminal, user equipment (UE), a mobile phone, a tablet computer (pad), a personal computer, a virtual reality (VR) terminal device, or an augmented reality (AR) terminal device. By way of example and not limitation, in this application the terminal may also be a wearable device. A wearable device, also called a wearable smart device, is the general term for everyday wearables designed with wearable technology, such as glasses, gloves, watches, clothing, and shoes. A wearable device is a portable device worn directly on the body or integrated into the user's clothing or accessories; it is not merely a hardware device, but provides powerful functions through software support, data exchange, and cloud interaction. Broadly, wearable smart devices include full-featured, larger devices that can implement all or part of their functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on one type of application function and must be used together with other devices such as smartphones, for example various smart bracelets for vital-sign monitoring and smart jewelry.
In this application, the terminal 102 may also be a terminal in an internet of things (IoT) system. IoT is an important component of future information technology development; its main technical feature is connecting things to the network through communication technology, thereby achieving an intelligent network in which people and machines, and things and things, are interconnected. To better explain the embodiments of this application, the terms used in the application are explained first.
Device-related noise: from receiving speech to processing the speech signal, a device adds noise that is related to the device itself, including but not limited to channel noise, coding noise, and noise introduced by the physical characteristics and number of microphones, gain, distance, environment, and pre-processing algorithms. It can be understood that when the user's voice is entered through the registration device, the voiceprint information in the voice is affected by the device and therefore changes; it is no longer a "clean" voiceprint but a voiceprint with device-related noise added. Different devices have different channels, noise characteristics, and variations. For example, a sound signal may change during transmission because of signal attenuation or delay. A sound signal is a combination of components at different frequencies, so if those components are attenuated or delayed unevenly during transmission, the received signal is distorted; this is channel distortion. Distortion can also be introduced when the analog signal is converted into a digital signal for transmission. Because the bandwidth of the transmission channel is limited, a higher bit rate allows higher-quality signals to be transmitted during encoding, but the signal cannot be transformed completely losslessly during encoding and decoding, so the speech signal after codec processing suffers some loss. Different devices may affect the sound signal differently.
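To make the notion of device-related noise concrete, the following sketch roughly imitates the degradations described above by passing a clean signal through a band-limited channel, adding noise, and applying coarse quantization as a stand-in for codec loss; the specific band, SNR, and bit depth are illustrative assumptions, not values from the application:

```python
import numpy as np
from scipy.signal import butter, lfilter

def simulate_device_channel(clean: np.ndarray,
                            sample_rate: int = 16000,
                            band_hz: tuple = (300.0, 3400.0),
                            snr_db: float = 25.0,
                            bits: int = 8) -> np.ndarray:
    """Imitate device-related degradation: band-limited channel, additive
    noise, and coarse quantization standing in for codec loss."""
    nyq = sample_rate / 2.0
    b, a = butter(4, [band_hz[0] / nyq, band_hz[1] / nyq], btype="band")
    channel = lfilter(b, a, clean)                          # channel coloration
    noise_power = np.mean(channel ** 2) / (10.0 ** (snr_db / 10.0))
    noisy = channel + np.random.randn(len(channel)) * np.sqrt(noise_power)
    peak = np.max(np.abs(noisy)) + 1e-12
    levels = 2 ** (bits - 1)
    return np.round(noisy / peak * levels) / levels * peak  # codec-like loss
```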
Voiceprint identifier: the voiceprint identifier is used to distinguish registered voiceprint information, and each piece of registered voiceprint information has a unique voiceprint identifier. In this application the voiceprint identifier may be a universal account owned by each user, for example a user ID. The user ID can be a universal account used to perform operations related to the device: if the device is a mobile phone, services such as downloading software, data synchronization, and phone positioning can be used through this account; if the device is a television, user-related recommendations (such as favorite programs) can be made. Alternatively, the voiceprint identifier may be a dedicated account or identifier for the voiceprint recognition function. One voiceprint identifier may correspond to at least one biological voiceprint template of the same user, and the voiceprint identifier indicates the user. In the embodiments of this application the voiceprint identifier is described using a user ID as an example.
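The correspondence between a voiceprint identifier and the templates registered under it can be pictured as a simple keyed store; the sketch below only illustrates that relationship, and the structure and names are assumptions rather than the application's data model:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class RegisteredVoiceprint:
    """Registered voiceprint information kept under one voiceprint identifier."""
    voiceprint_id: str          # e.g. the user's universal account / user ID
    templates: list = field(default_factory=list)  # biological voiceprint templates

registry: dict[str, RegisteredVoiceprint] = {}

def add_template(voiceprint_id: str, template: np.ndarray) -> None:
    """One identifier corresponds to at least one template of the same user."""
    entry = registry.setdefault(voiceprint_id, RegisteredVoiceprint(voiceprint_id))
    entry.templates.append(template)
```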
Registered voiceprint information: information about the user's biological voiceprint after the noise associated with the registration device has been removed. The registered voiceprint information includes at least one biological voiceprint template (i.e., a predetermined biological voiceprint used for matching or identification).
The user's biological voiceprint: the user's voiceprint independent of the device that records the speech, equivalent to the voiceprint a listener would hear from the speaker face to face.
In this application, the devices in the communication system of fig. 1 are distinguished by function into a registration device, a storage device, and a voiceprint recognition device. The registration device registers the user's voice: it extracts registered voiceprint information from the speech, and that information includes the user's own biological voiceprint obtained after separating the noise associated with the registration device. The storage device (also called a voiceprint sharing device) stores the user's registered voiceprint information and provides a sharing service for it. The voiceprint recognition device performs voiceprint recognition on the user's voice; it may be a device on which the user's voiceprint has not been registered, but it can implement the voiceprint recognition function through the shared registered voiceprint information.
The registration device (or called voiceprint registration apparatus) may be any terminal of a plurality of terminals in the communication system, for example, the terminal may be a mobile phone, a tablet computer, or the like.
The storage device may be a server, a cloud server, or any terminal of a plurality of terminals, such as at least one terminal of the plurality of terminals, or each terminal of the plurality of terminals.
The voiceprint recognition device (or called recognition device, or called voiceprint recognition apparatus) may be a terminal that does not perform voiceprint registration among the plurality of terminals. For example, if the registered terminal is a mobile phone or a tablet computer, the voiceprint recognition device may be a wearable device, a vehicle-mounted terminal, a smart screen, an earphone, a personal computer, or the like.
In the present application, the voiceprint recognition device is also referred to as a "first terminal", and the registration device is also referred to as a "second terminal".
It should be noted that the terms registration device, storage device, and voiceprint recognition device are used only to describe the functional roles of the devices; a single device is not limited to one function. For example, one terminal may be a registration device, a storage device, and a voiceprint recognition device at the same time: a mobile phone may be used to register the user's voiceprint, to store registered voiceprint information, and to recognize the user's voiceprint. Likewise, one terminal may be both a storage device and a voiceprint recognition device: a personal computer may store registered voiceprint information and also act as the voiceprint recognition device.
In the embodiments of this application, the registration device receives voice input by the user, separates the noise related to the registration device from that voice, extracts the user's biological voiceprint information, and registers it to obtain registered voiceprint information. Because the registered voiceprint information is voiceprint information with the device-related noise eliminated, the device's influence on the user's biological voiceprint is removed and the registered voiceprint information can be shared by other devices. The storage device stores the registered voiceprint information, providing a basis for sharing it with other devices: when the user needs to perform voiceprint recognition on another device (also called a voiceprint recognition device), no registration is needed on that device; the registered voiceprint information already registered on the registration device is shared instead. The voiceprint recognition device receives the voice input by the user, separates the noise related to the voiceprint recognition device from it, and extracts the user's biological voiceprint information (also called target voiceprint information); it then obtains the registered voiceprint information from the storage device and matches it against the target voiceprint information to recognize the user's identity. Because both the registered voiceprint information and the target voiceprint information have the device-related noise separated out, the user only needs to register a voiceprint on one device (the registration device) to use the voiceprint recognition function on multiple devices, the tedious registration process is avoided, cross-device voiceprint recognition is achieved, user experience is greatly improved, and recognition accuracy remains high regardless of which device is used for voiceprint recognition.
The method mainly comprises three stages: 1) acquisition of registered voiceprint information; 2) storage of registered voiceprint information; and 3) speaker voiceprint recognition. Referring to fig. 2: in the acquisition stage, device-related noise is eliminated from the voice to obtain the user's biological voiceprint, and the biological voiceprint is recorded under an existing user ID, i.e., registered, to obtain registered voiceprint information. In the storage stage, the registered voiceprint information is stored in association with the user ID, providing a basis for sharing it. In the speaker recognition stage, the registered voiceprint information is matched against the target voiceprint information to obtain a matching result that indicates the user's identity.
First, for stage 1), the acquisition of registered voiceprint information, the executing entity is the registration device, or a processor, chip, or chip system in the registration device. Referring to fig. 3, taking the registration device as the executing entity, the registration device may perform the following steps. Step 301: the registration device receives the voice recorded by the user. This voice is not tied to any particular text content; it is used mainly for extracting voiceprint information. The voice may be acquired through the voice input/output device; see the description of fig. 14 for details, which is not limited here. Step 302: the registration device eliminates the noise related to the registration device from the voice to obtain the user's biological voiceprint. The specific processing may refer to the noise-elimination schemes described later and is not repeated here.
Step 303: the registration device registers the user's biological voiceprint to obtain the user's registered voiceprint information. "Registration" means recording the user's biological voiceprint under an existing voiceprint identifier; it can be understood as establishing a correspondence between the user's biological voiceprint and the voiceprint identifier, or as assigning the user's voiceprint identifier to the biological voiceprint. The voiceprint identifier may be a universal account used to perform operations related to the device. For example, if the registration device is a mobile phone, the user registers a user ID after purchasing the phone, and services such as downloading software and phone positioning can be used through that ID. The voiceprint identifier may also be a dedicated account for the voiceprint recognition function, for example the user's phone number. To distinguish voiceprint information that has been "recorded" from voiceprint information that has not, the recorded voiceprint information, that is, the biological voiceprint to which a voiceprint identifier has been assigned, is called "registered voiceprint information". The registration device establishes the correspondence between the user's biological voiceprint and the user ID and thereby completes the registration process. The registered voiceprint information includes the user's biological voiceprint obtained in step 302.
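A minimal sketch of steps 301 to 303, with the noise-elimination step abstracted behind a callback because the application allows either the voiceprint extraction model or the frequency-response-compensation filter to play that role (the function and parameter names are assumptions):

```python
import numpy as np

def register_voiceprint(recorded_speech: np.ndarray,
                        voiceprint_id: str,
                        extract_biological_voiceprint,
                        registry: dict) -> None:
    """Sketch of steps 301-303.

    `extract_biological_voiceprint` stands in for either the voiceprint
    extraction model or the frequency-response-compensation filter; its exact
    form is not fixed here.
    """
    # Step 302: eliminate noise related to the registration device.
    biological_voiceprint = extract_biological_voiceprint(recorded_speech)
    # Step 303: "registration" = recording the voiceprint under the user's ID.
    registry.setdefault(voiceprint_id, []).append(biological_voiceprint)
```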
The registration device may itself act as the storage device for the registered voiceprint information. Optionally, it may also send the registered voiceprint information and the corresponding user ID to another storage device for storage; that storage device may be a cloud server, a server, or another terminal.
In this application, in the acquisition stage the device-related noise is eliminated from the received user voice to obtain the user's clean biological voiceprint (the one closest to the user's voiceprint when speaking face to face); because the device's influence on the voice is removed, the extracted biological voiceprint information can serve as registered voiceprint information shared by multiple devices.
Next, for stage 2), the storage (or sharing) of registered voiceprint information, the executing entity may be the storage device or a memory in the storage device; for example, if the storage device is a server, the executing entity may be the server or a memory in the server. The storage device receives the registered voiceprint information and the corresponding user ID sent by the registration device and stores them. Optionally, the registration device may also send information such as the registration time of the voiceprint information, the identifier of the registration device, and the voice intensity received by the registration device, and the storage device stores this information in association with the user ID. The registration time records when the voiceprint information was registered and can be used to prompt the user to update it periodically. The identifier of the registration device lets the storage device distinguish whether a received user ID was sent by the registration device or by an unregistered device (a voiceprint recognition device). The voice intensity information indicates how far the speaker was from the registration device when the registered voiceprint information was recorded. One user ID may correspond to several biological voiceprint templates of the same user, and these templates may correspond to the same or different voice intensity information. Each pass of acquisition, noise elimination, and registration by the registration device, i.e., one execution of steps 301 to 303, yields one biological voiceprint template; by repeating this process the registration device can obtain several templates. When the templates correspond to different voice intensities, the templates of the same user cover a variety of conditions (voiceprints recorded at different distances), which improves voiceprint recognition accuracy.
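One way to picture the record the storage device keeps per user, including the optional metadata described above (registration time, registering device identifier, and per-template voice intensity); the field names and types are illustrative assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VoiceprintTemplate:
    embedding: np.ndarray        # one biological voiceprint template
    voice_intensity_db: float    # speech level at enrollment (a proxy for distance)

@dataclass
class EnrollmentRecord:
    user_id: str                 # voiceprint identifier
    templates: list              # several templates can cover different distances
    registration_time: float     # e.g. time.time() at enrollment, for update prompts
    registering_device_id: str   # lets the storage device tell devices apart
```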
Optionally, in an example, please refer to fig. 4, the registration device sends the registration voiceprint information and the user ID to a cloud (or a server), and the cloud (or the server) stores the registration voiceprint information and the user ID in an associated manner. In this example, the registered voiceprint information of the user is stored in the cloud (or the server), and the registered voiceprint information is distinguished by the user ID, so that the storage space of the terminal can be saved, and the method can be applied to wider application scenes, including not only indoor home application scenes, but also outdoor scenes, such as application scenes of a cellular vehicle network, as long as the terminals capable of being connected to the network can share the registered voiceprint information.
Alternatively, in the second example, please refer to fig. 5, the registration device sends the registration voiceprint information and the user ID to the cloud (or the server), any terminal (e.g. the terminal 2, which may also be referred to as a third terminal) in the plurality of terminals may download the registration voiceprint information to the local according to the condition of its own storage resource, the registered voiceprint information may be stored in a third terminal, such as a tablet, which may be any terminal in a local area network, such as in a home scenario, the terminals included in the family scene comprise a mobile phone, an intelligent screen, an intelligent lamp and a tablet personal computer, the terminals can be connected in a wireless fidelity (WiFi) communication mode, a Bluetooth communication mode, an infrared communication mode, a network card communication mode and the like, when the voiceprint recognition device needs to perform voiceprint recognition, the voiceprint recognition device can quickly acquire the registered voiceprint information from the third terminal.
Alternatively, in a third example, referring to fig. 6, registered voiceprint information may be distributively stored in each of a plurality of terminals (e.g., a smart screen, a personal computer, a vehicle-mounted terminal, a tablet computer, etc.). If each terminal sends a user ID to the cloud, each terminal may receive registration voiceprint information corresponding to the user ID from the cloud (or the server). Or, voice print data sharing can be performed among the terminals in the modes of Bluetooth, WiFi, infrared, network card and the like. The voiceprint of the voiceprint device can be shared between the terminals periodically or non-periodically. The user may also configure on each device to not perform voiceprint updates and not share voiceprints. If the registered voiceprint data can not be shared among the terminals through the communication mode, the registered voiceprint information can be shared when the terminals are connected with other equipment.
Each terminal already stores the registered voiceprint information corresponding to the user ID, at the moment, if voiceprint recognition is needed, the voiceprint recognition device is also a storage device, the voiceprint recognition device can obtain the registered voiceprint information from the local, the registered voiceprint information does not need to be obtained from the cloud or other devices, the registered voiceprint information is shared, and voice recognition can be rapidly carried out.
Finally, for stage 3), speaker voiceprint recognition, the execution subject is a voiceprint recognition device (also referred to as a first terminal), or a processor, chip, or chip system in the first terminal. Taking the first terminal as the execution subject: the first terminal receives voice input by a user, separates the noise related to the first terminal in the voice, and extracts the biological voiceprint information (also referred to as target voiceprint information) of the user; the first terminal then acquires registered voiceprint information from the storage device and matches it with the target voiceprint information, thereby performing cross-device voiceprint recognition. If the registered voiceprint information matches the target voiceprint information, voiceprint recognition succeeds; otherwise, voiceprint recognition fails.
Referring to fig. 7, the speaker voiceprint recognition stage is described with the first terminal (e.g., a smart speaker) as the execution subject; the first terminal may perform the following steps. Step 701: the first terminal receives the voice input by the user; this can be realized through voice acquisition by the voice input and output device and is not expanded here. Step 702: the first terminal eliminates the noise related to the recognition device in the voice to obtain target voiceprint information, which is the biological voiceprint of the user; this process is described in detail below. Step 703: the first terminal obtains the registered voiceprint information of the user. The registered voiceprint information can be acquired from the registration device or another device, or acquired in advance, stored in the first terminal's memory, and read from the memory when needed. The specific process of obtaining it from the registration device or another device is described below.
In a possible implementation, the registered voiceprint information and its corresponding user ID are stored by a storage device or by the registration device; the storage device may be a cloud server or another server, a single terminal, or multiple terminals.
The voiceprint recognition device sends a user ID to the storage device or the registration device to request the registered voiceprint information of the user, and then receives the registered voiceprint information and the corresponding user ID returned by the storage device or the registration device.
In another possible implementation, if the storage device is one terminal or several terminals in the communication system, the storage device may share (or synchronize) the registered voiceprint information and the corresponding user ID to the other terminals, for example by periodic synchronization or when multiple terminals are connected to the same local area network. For example, if the storage device is a tablet computer and the communication system includes three terminals, a tablet computer, a television, and a mobile phone, then when the three terminals are in communication connection (for example, through WiFi), the tablet computer transmits the stored registered voiceprint information and the corresponding ID to the mobile phone and the television; that is, the voiceprint recognition device receives the registered voiceprint information and the corresponding ID sent by the storage device.
In another possible implementation, the first terminal sends the user ID and the identifier of the first terminal to the cloud server. The cloud server determines, according to the user ID, where the registered voiceprint information corresponding to that user ID is stored (that is, in which device). If the cloud server determines that the registered voiceprint information is stored in a first device, it sends the user ID and the identifier of the first terminal to the first device. The first device, which is in communication connection with the first terminal, sends the registered voiceprint information to the first terminal according to the identifier of the first terminal, and the first terminal receives the registered voiceprint information sent by the first device.
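The routing step performed by the cloud server in this implementation might look roughly like the sketch below; the location table, the message format, and the `send` helper are assumptions used only for illustration.

```python
# Hypothetical routing table on the cloud server: user ID -> device that currently
# holds the registered voiceprint information for that user.
location_table = {"1A": "tablet-07"}

def route_request(user_id, requesting_terminal_id, send):
    """Forward the first terminal's request to the device storing the voiceprint.

    `send(device_id, message)` is an assumed transport helper; the storing device
    then sends the registered voiceprint information to the requesting terminal.
    """
    storing_device = location_table.get(user_id)
    if storing_device is None:
        send(requesting_terminal_id, {"error": "no enrollment found", "user_id": user_id})
        return
    send(storing_device, {"user_id": user_id, "reply_to": requesting_terminal_id})

# Example transport stub that simply prints the outgoing message.
route_request("1A", "car-terminal-01", lambda device, msg: print(device, msg))
```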
Step 704: the first terminal matches the target voiceprint information with the registered voiceprint information to obtain a matching result, and the matching result is used to indicate the identity of the user. If the target voiceprint information matches the registered voiceprint information, the user is judged to be the same user as the one corresponding to the registered voiceprint information, that is, the user identity is genuine (a preset user); if they do not match, the user and the registered voiceprint information correspond to different users, that is, the user identity is not genuine. Optionally, if the user identity is the preset user, the first terminal executes the voice control instruction; if the user identity is not the preset user, the first terminal does not need to execute the voice control instruction.
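The patent does not prescribe a particular matching metric; the sketch below uses cosine similarity against the enrolled templates with an illustrative threshold, purely to show how the matching result can gate execution of the voice control instruction.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(target_voiceprint, enrolled_templates, threshold=0.7):
    # Speaker confirmation (1:1): the user counts as the preset user if the target
    # voiceprint is close enough to any of that user's enrolled templates.
    best = max(cosine_similarity(target_voiceprint, t) for t in enrolled_templates)
    return best >= threshold

enrolled = [[0.12, -0.03, 0.88], [0.10, -0.01, 0.90]]   # templates of the preset user
if verify([0.11, -0.02, 0.87], enrolled):
    print("preset user confirmed - execute the voice control instruction")
else:
    print("not the preset user - do not execute the instruction")
```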
For the above three stages, an application scenario is illustrated in fig. 8, where the terminals ordinarily used by user S include a mobile phone, a tablet computer, a computer, a smart screen, and a vehicle-mounted terminal. User S can register a voiceprint through a registration device such as the mobile phone: the mobile phone receives voice input by user S, for example "Hello, Xiaoyi" (the text content of the input voice does not matter), extracts the biological voiceprint information in the voice, that is, the voiceprint information after the noise related to the mobile phone is separated, associates it with the user ID of user S, and sends the user ID and the biological voiceprint information to the cloud, which serves as the storage center for registered voiceprint information. When user S wants to control the vehicle-mounted terminal by voice, user S logs in with the user ID on the vehicle-mounted terminal (a prior login is sufficient; it is not necessary to log in every time). The vehicle-mounted terminal sends the user ID to the cloud through the cellular network, and the cloud returns the registered voiceprint information corresponding to that user ID together with the user ID. The vehicle-mounted terminal then receives voice input by user S, for example "play music", separates the noise related to the vehicle-mounted terminal in the voice, extracts the target voiceprint information, and matches the registered voiceprint information received from the cloud with the target voiceprint information. When they match, the vehicle-mounted terminal confirms that user S is the preset user and executes the voice instruction "play music".
It is understood that, classified according to purpose, voiceprint recognition technology can be divided into "speaker confirmation" (speaker verification) and "speaker recognition" (speaker identification). Speaker confirmation determines whether the speaker is a specified person; speaker recognition determines which of the recorded speakers the speaker is.
The application scenario corresponding to fig. 8 is a "speaker confirmation" scenario: the vehicle-mounted terminal extracts the target voiceprint information from the received voice, determines whether the target voiceprint information and the registered voiceprint information corresponding to the user ID belong to the same person, and executes the voice instruction of user S only after the identity of user S is confirmed. If the target voiceprint information does not match the registered voiceprint information, the vehicle-mounted terminal does not need to execute the voice instruction of user S. The scenario in fig. 8 is only an example; the application scenarios of the present application include, but are not limited to, account login (e.g., bank account login), identity confirmation (e.g., voice recognition at a security door, identity recognition in financial and stock transactions), and the like.
The present application can also be applied to "speaker recognition" scenarios. Optionally, each piece of registered voiceprint information corresponds to a user ID; for example, in a home scenario, each user ID corresponds to user data, and the user data may differ between application scenarios, e.g., the programs the user likes, the temperature of the air conditioner, and so on. Taking the air conditioner temperature as the user data, the correspondence between the user ID and the user data may be as shown in table 1 below:
TABLE 1

User ID    User    Air conditioner temperature
1A         f       25 °C
2D         g       20 °C
3C         c       27 °C
The target voiceprint information is matched against the plurality of registered voiceprint information entries corresponding to the voiceprint identifiers, and the target registered voiceprint information that matches the target voiceprint information is determined; the user data associated with the user ID corresponding to that target registered voiceprint information is acquired; and the operation corresponding to the user data is then performed.
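For the 1:N "speaker recognition" case, a minimal sketch of matching against several users' templates and then fetching the associated user data (values mirroring Table 1) might look as follows; the similarity metric, threshold, and template vectors are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Enrolled templates and user data keyed by user ID (user data mirrors Table 1).
enrolled = {"1A": [0.12, -0.03, 0.88], "2D": [-0.40, 0.55, 0.10], "3C": [0.70, 0.20, -0.30]}
user_data = {"1A": {"ac_temperature_c": 25}, "2D": {"ac_temperature_c": 20}, "3C": {"ac_temperature_c": 27}}

def identify_and_fetch(target, threshold=0.7):
    # Speaker recognition (1:N): find the best-matching user ID, then return its user data.
    best_id, best_score = None, -1.0
    for uid, template in enrolled.items():
        score = cosine_similarity(target, template)
        if score > best_score:
            best_id, best_score = uid, score
    if best_score < threshold:
        return None, None          # no registered speaker matched
    return best_id, user_data[best_id]

uid, data = identify_and_fetch([0.11, -0.02, 0.87])
if uid is not None:
    print(f"speaker {uid}: set the air conditioner to {data['ac_temperature_c']} degrees C")
```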
For example, referring to fig. 9, in a "speaker recognition" application scenario, a family includes three members: user f (e.g., dad), user g (e.g., mom), and user c (e.g., a child). Taking a tablet computer as the registration device, the three family members can perform voiceprint registration through the tablet computer, and the tablet computer can register a preset number of voiceprint entries, where the preset number can be set by the user or by the system of the tablet computer. In this scenario, for example, one terminal may register the voiceprint information of the three users. User f logs in with the user ID (or has logged in in advance); the tablet computer receives the voice input by user f for registration, which may be speech with any text content (such as "Hello, Xiaoyi"), since the text content of the voice is not limited. The tablet computer separates the noise related to the tablet computer in the voice, extracts the biological voiceprint information of user f, associates it with the user ID (such as "1A"), and registers it to obtain the first registered voiceprint information of user f. Similarly, the tablet computer receives the voice input by user g for registration to obtain the second registered voiceprint information of user g, corresponding to a user ID such as "2D", and receives the voice input by user c for registration to obtain the third registered voiceprint information of user c, corresponding to a user ID such as "3C". The tablet computer sends each piece of registered voiceprint information and its corresponding user ID to the cloud for storage.
If user f wants to control the air conditioner through the smart speaker (the smart speaker is already associated with the user ID in the smart home application of the tablet computer), the smart speaker receives the voice input by user f, "Xiaoyi, turn on the air conditioner", separates the noise in the voice related to the smart speaker, and extracts the biological voiceprint information (target voiceprint information) of the user. The smart speaker sends the user ID to the cloud, and the cloud sends the three user IDs and the corresponding three pieces of registered voiceprint information to the smart speaker; alternatively, the smart speaker has previously stored them. The smart speaker matches the target voiceprint information with the three pieces of registered voiceprint information, and if the target voiceprint information matches the first registered voiceprint information (e.g., the registered voiceprint information of user f), the voice instruction is executed. Optionally, the smart speaker determines the user ID corresponding to the first registered voiceprint information (e.g., "1A") and can then determine the related user data corresponding to that voiceprint identifier, e.g., "temperature 25 °C". It can be understood that when the smart speaker recognizes the user ID (e.g., "1A") corresponding to the first registered voiceprint information, indicating that the voice was input by user f, the historical user data recorded by the smart speaker may be: user ID "1A" corresponds to "temperature 25 °C", meaning that user f usually sets the air conditioner to 25 °C. Based on this information, the smart speaker sends a control instruction to the air conditioner to set its temperature to 25 °C.
Voiceprint recognition technology thus allows a particular user to be identified so that commands are executed selectively. At the same time, by distinguishing different users, a smart device can provide personalized services for each of them, which broadens the applications of smart devices.
The above scenario is a home scenario; the present application can also be applied to a work scenario. The registered voiceprint information of the members of a working group corresponds to multiple IDs, and the voiceprint recognition technology in this application can identify which user in the working group the speaker is (i.e., speaker recognition). Different users have different permissions; if the speaker is identified as user d by the corresponding ID, the voiceprint recognition device directly applies the user data associated with user d's permissions.
In this example, "speaker recognition" may be performed by voiceprint recognition, and the user data corresponding to the speaker, including but not limited to historical information associated with the "speaker," or user data corresponding to the "speaker" rights, may be determined by recognition of the speaker. The device can make intelligent customization or intelligent recommendation according to the user data. For example, in an application scenario, the smart speaker may perform an operation corresponding to history information (e.g., a temperature of 25 ℃) associated with the "speaker" (the user corresponding to the user ID "1A"), without requiring the user to perform a habitual operation each time, thereby improving user experience.
Optionally, in step 702 of the implementation corresponding to fig. 7, the specific manner in which the first terminal extracts the target voiceprint information, which is independent of the first terminal, from the voice, and in step 302 of the implementation corresponding to fig. 3, the specific manner in which the registration device extracts the device-independent biological voiceprint information from the voice, may include: first, a machine learning approach; and second, a signal processing approach.
First, the machine learning approach.
The first terminal (or the voiceprint recognition device) takes the voice as the input of a voiceprint extraction model, and the target voiceprint information is output by the voiceprint extraction model; the voiceprint extraction model is obtained by learning from corpora collected by various devices.
The voiceprint extraction model includes, but is not limited to, a Gaussian Mixture Model (GMM), a Gaussian mixture model-universal background model (GMM-UBM), i-vector, x-vector, DNN-vector, a Deep Neural Network (DNN), and algorithms such as speech parsing, speech factorization, clustering, and transformation.
The machine learning includes: a training phase for the voiceprint extraction model and an application phase for the voiceprint extraction model.
Referring to fig. 10, in the training phase of the voiceprint extraction model, a large amount of corpora for learning and reference data are obtained, and the corpora include speech collected by different types of devices. For example, the different types of devices include, but are not limited to, smart appliances (e.g., smart screens, televisions, smart washing machines, smart air conditioners, etc.), lighting systems, smart speakers, and robots; they may also be vehicle-mounted terminals, user equipment, mobile phones, tablet computers, personal computers, virtual reality terminal devices, augmented reality terminal devices, wearable devices, and other devices with a voice recording function. The reference data is voiceprint information that is device-independent (or only weakly device-dependent).
A large amount of corpora input by the same user is collected through different types of devices and fed into a voiceprint model (such as a GMM-UBM model), and the voiceprint data output by the GMM-UBM model is compared with the reference data, where the reference data is the biological voiceprint data of the user. If the difference between the output voiceprint data and the reference data is greater than or equal to a threshold, the data is fed into the GMM-UBM model again, new voiceprint data is output, and the output is once more compared with the reference data. Through continuous iterative training, once the difference between the voiceprint data output by the GMM-UBM model and the reference data is smaller than the threshold, the model output is close to the reference data, and the voiceprint extraction model is obtained. The voiceprint extraction model is used to separate device-related noise and extract the biological voiceprint of the user. In this example, the voiceprint extraction model is trained in advance in a machine learning manner, and the trained model is used to directly extract the biological voiceprint of the user from the speech, which offers better robustness.
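The GMM-UBM model itself is not reproduced here; in the sketch below a single linear projection stands in for the model, purely to illustrate the compare-with-reference-and-iterate loop described above. All data, the learning rate, and the stopping threshold are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
corpora = rng.normal(size=(200, 16))             # features of speech collected by different devices
reference = corpora @ rng.normal(size=(16, 8))   # assumed device-independent reference voiceprints

W = rng.normal(scale=0.1, size=(16, 8))          # stand-in model parameters to be learned
lr, threshold = 1e-3, 1e-3

for step in range(10_000):
    output = corpora @ W                         # voiceprint data output by the stand-in model
    diff = output - reference
    loss = float(np.mean(diff ** 2))             # difference between output and reference data
    if loss < threshold:                         # stop once the output is close to the reference
        break
    W -= lr * (corpora.T @ diff) / len(corpora)  # adjust the model and compare again

print(f"stopped after {step} iterations, difference {loss:.6f}")
```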
Second, the signal processing approach.
In the present application, the voice signal can be processed by a filter and subjected to frequency response compensation, where the frequency response compensation is used to eliminate the noise related to the recognition device in the voice so as to obtain the biological voiceprint information of the user. The filter is the most basic signal processing device; it extracts a desired signal from several signals mixed together. In this application, the main function of the filter is to eliminate the various noises that affect signal processing: the filter produces different gains at different frequencies, so that specific signals are emphasized, over-emphasized signals are attenuated, and weaker signals are enhanced, thereby eliminating the device noise.
The filter can be expressed by the following expression:

y(n) = \sum_{k=0}^{N} a_k \, x(n-k)        (1)

Formula (1) is a finite impulse response filter, where n is the time index, N is the unit impulse response length of the digital filter, the coefficients a_k are convolved with the input x to produce the filter output y, and k runs from 0 to N.
Alternatively, the filter can be expressed by the following expression:

y(n) = \sum_{k=0}^{N} a_k \, x(n-k) + \sum_{j=0}^{P} b_j \, y(n-j)        (2)

Formula (2) is an infinite impulse response filter, where n is the time index and N and P are unit impulse response lengths of the digital filter; k runs from 0 to N and the coefficients a_k are convolved with the input x, while j runs from 0 to P and the coefficients b_j are convolved with the previous outputs y; the two sums together yield the filter output y.
If the frequency responses of different devices are made flat or consistent by adjusting the coefficients a and x in formula (1), or by adjusting the coefficients a and b in formula (2), the noise related to the devices in the voice is filtered out.
Referring to fig. 11A, fig. 11A includes three curves: an upper limit curve 1101, a lower limit curve 1102, and a curve 1103 without frequency response compensation. The frequency response curve 1103 has a peak between 600 Hz and 1.2 kHz that exceeds the upper limit curve 1101, and it trends upward between 300 Hz and 500 Hz. Referring to fig. 11B, fig. 11B includes three curves: the upper limit curve 1101, the lower limit curve 1102, and a curve 1104 with frequency response compensation. By adjusting the coefficients a and x in formula (1), or the coefficients a and b in formula (2), the frequency responses of different devices are made flat; in the range of 300 Hz to 2.5 kHz the frequency response becomes consistent (e.g., -2 dBr), so that the noise related to the devices in the speech is filtered out and the biological voiceprint of the user is obtained.
In this example, the voice signal is processed by the filter: over-emphasized signals are attenuated, weaker signals are enhanced, and the frequency response is compensated, thereby removing the device-related noise. The noise is filtered out directly by the filter, and the output signal is the biological voiceprint signal of the user; this method is simple to implement and fast.
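A sketch of such frequency response compensation is given below, using a designed FIR filter (i.e. the coefficients a_k of formula (1)); the sampling rate, band edges, and gain values are assumptions chosen only to illustrate flattening an uneven device response.

```python
import numpy as np
from scipy.signal import firwin2, lfilter

fs = 16_000                                    # assumed sampling rate of the recording
rng = np.random.default_rng(0)
speech = rng.normal(size=fs)                   # stand-in for one second of recorded voice

# Desired compensation gain versus normalized frequency (0 = DC, 1 = Nyquist).
# Illustrative values: attenuate an assumed peak around 600 Hz-1.2 kHz so the
# overall response of the device becomes roughly flat.
freq = [0.0, 600 / (fs / 2), 1200 / (fs / 2), 2500 / (fs / 2), 1.0]
gain = [1.0, 0.6, 0.6, 1.0, 1.0]

b = firwin2(numtaps=101, freq=freq, gain=gain)  # FIR taps, i.e. the a_k of formula (1)
compensated = lfilter(b, [1.0], speech)         # y(n) = sum_k a_k * x(n - k)

# `compensated` approximates the device-independent signal from which the
# biological voiceprint features would then be extracted.
```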
In an optional implementation, the biological voiceprint of a person is not fixed but may change, for example across different periods of a day, different health conditions (healthy versus ill), or different ages; all of these can alter the biological voiceprint of the same person. To improve the robustness of the system, several voiceprints may be registered for the same user; that is, the same user corresponds to several biological voiceprint templates, one user ID corresponds to one set of registered voiceprint information, and the registered voiceprint information includes a plurality of biological voiceprint templates.
The storage device may generate a voiceprint model (i.e., a voiceprint template) from a plurality of biological voiceprints of the same user. Referring to fig. 12, taking an x-vector system as an example, the x-vector system is a speaker recognition system built on a DNN: by training the DNN, the speaker's speech is mapped to an embedding vector of fixed dimension, called an x-vector.
The x-vector network receives voiceprint information of the same user, namely the user's voiceprint with the device-related noise removed. The x-vector network can capture the user's voiceprint information from shorter utterances and is more robust to short speech. One input corresponds to one x-vector, which is the user's voiceprint information (already device-independent, since the device-related information was removed beforehand), or in other words becomes the user's voiceprint model. If several pieces of device-independent voiceprint information are input through several devices, or through one device, several x-vectors are generated, and dimensionality reduction is performed through Linear Discriminant Analysis (LDA). The multiple vectors can cover more of the user's pronunciation conditions (such as voiceprint information in different time periods, or in a healthy or unhealthy state); that is, one user corresponds to several biological voiceprint models (or templates), and these models form a voiceprint information template library, which further improves the voiceprint recognition effect. The same user may correspond to 10 to 30 templates, and the templates can be updated periodically or aperiodically, with old templates replaced by new ones, improving the robustness of the biological voiceprint templates.
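A sketch of building such a template library is given below, using scikit-learn's LinearDiscriminantAnalysis for the LDA step; the random vectors stand in for x-vector network outputs, and the dimensions and template limit are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_users, vecs_per_user, dim = 4, 20, 64          # real x-vectors are often higher-dimensional
xvectors = rng.normal(size=(n_users * vecs_per_user, dim))   # stand-ins for network outputs
labels = np.repeat(np.arange(n_users), vecs_per_user)        # which user each x-vector belongs to

# LDA is fitted across users and used to reduce dimensionality, as described above.
lda = LinearDiscriminantAnalysis(n_components=n_users - 1)
reduced = lda.fit_transform(xvectors, labels)

# Keep a bounded number of templates per user (the text suggests roughly 10-30),
# replacing the oldest template when a new one arrives.
MAX_TEMPLATES = 30
template_library = {}
for user, vec in zip(labels, reduced):
    templates = template_library.setdefault(int(user), [])
    templates.append(vec)
    if len(templates) > MAX_TEMPLATES:
        templates.pop(0)                          # old templates are replaced by newer ones
```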
It is understood that the enrollment device may enroll one biological voiceprint template at a time or enroll several templates over multiple sessions, or each of several enrollment devices may enroll a biological voiceprint template separately. Likewise, the enrollment device may send one or more biological voiceprint templates to the storage device or the recognition device in a single message, or send several templates in several messages. The transmission channel includes but is not limited to wireless and wired modes and can be implemented by a transceiver; see the description corresponding to fig. 14. The recognition device may retrieve the one or more biological voiceprint templates from the enrollment device or the storage device. For the recognition device, these templates are used for matching during voiceprint recognition and belong to enrollment voiceprint information that the recognition device can use, even though that enrollment voiceprint information was not generated and enrolled by the recognition device itself but came from another device. With this scheme, device-independent voiceprint information is shared, flexible cross-device voiceprint recognition is achieved, and user experience is improved.
Corresponding to the methods provided by the above method embodiments, the embodiments of the present application further provide corresponding apparatuses, which include modules for executing the above embodiments. A module may be software, hardware, or a combination of the two. As shown in fig. 13, an apparatus 1300 provided in the embodiments of the present application may be a terminal or a component of a terminal (e.g., an integrated circuit or a chip). The apparatus includes a voice input and output module 1301 (or voice input and output unit), a processing module 1302 (or processing unit), and a transceiver module 1303 (or transceiver unit).
In one possible design, the apparatus 1300 may perform the functions of the identification device in the above method embodiment: a voice input/output module 1301 for receiving a voice input by a user; a processing module 1302, configured to eliminate noise related to the recognition device in the speech received by the speech input/output module 1301 to obtain target voiceprint information, where the target voiceprint information is a biological voiceprint of the user; a transceiver module 1303, configured to obtain registration voiceprint information of a user, where the registration voiceprint information includes at least one biological voiceprint template obtained by eliminating noise related to a registration device; the processing module 1302 is further configured to match the target voiceprint information obtained by the processing module 1302 with the registered voiceprint information obtained by the transceiver module 1303 to obtain a matching result, where the matching result is used to indicate the user identity.
Further, the voice input/output module 1301 is configured to execute step 701 in the embodiment corresponding to fig. 7, and please refer to the detailed description in step 701 for specific implementation, which is not described herein again. The processing module 1302 is configured to execute step 702 and step 704 in the embodiment corresponding to fig. 7, and please refer to the detailed description of step 702 and step 704, which is not described herein again. The transceiver module 1303 is configured to execute step 703 in the embodiment corresponding to fig. 7, and please refer to the detailed description of step 703 for specific implementation, which is not described herein again.
In another possible design, the apparatus 1300 may perform the functions of the registration device in the above method embodiment: a voice input/output module 1301 for receiving a voice input by a user; a processing module 1302, configured to eliminate noise related to the registration device in the voice received by the voice input/output module 1301 to obtain registered voiceprint information of the user. Optionally, the transceiver module 1303 is configured to send the registered voiceprint information obtained by the processing module 1302 and the corresponding voiceprint identifier to another storage device.
Further, the voice input/output module 1301 is configured to execute step 301 in the embodiment corresponding to fig. 3, and please refer to the detailed description in step 301 for specific implementation, which is not described herein again. The processing module 1302 is configured to execute step 302 and step 303 in the embodiment corresponding to fig. 3, which is not described herein again.
In another implementation, the apparatus may be a chip or an integrated circuit. In this case, the transceiver module 1303 may be a communication interface, the processing module 1302 may be a logic circuit, and the voice input/output module 1301 may be an audio circuit. Alternatively, the communication interface may be an input-output interface or a transceiving circuit. The input-output interface may include an input interface and an output interface. The transceiver circuitry may include input interface circuitry and output interface circuitry.
In one implementation, the processing module 1302 may be a processing device, and the functions of the processing device may be implemented partly or wholly by software. In that case, the processing device may include a memory for storing a computer program and a processor for reading and executing the computer program stored in the memory to perform the corresponding processes and/or steps in any of the method embodiments. Alternatively, the processing device may include only a processor; the memory storing the computer program is then located outside the processing device, and the processor is connected to the memory through circuits/wires to read and execute the computer program stored in the memory.
It is to be understood that each of the functional components in fig. 13 can be implemented by software, hardware or a combination of the two, and is not particularly limited.
In addition, fig. 14 is a schematic structural diagram of an apparatus 1400 according to an embodiment of the present application. The apparatus may be a terminal, or an integrated circuit, a chip, or a chip system in a terminal. Taking a terminal as an example, the terminal may include but is not limited to a smart home terminal, a lighting system, a smart speaker, a robot, and the like; it may also be a vehicle-mounted terminal, user equipment, a mobile phone, a tablet computer, a personal computer, etc.
As shown in fig. 14, the apparatus 1400 comprises a processor 1401, a transceiver 1402, a memory 1403, and a voice input and output device 1404, which can communicate with one another through an internal connection path to transfer control signals and/or data signals. The memory 1403 stores a computer program, and the processor 1401 calls and runs the computer program from the memory 1403 to control the transceiver 1402 to transmit and receive signals. The processor 1401 is the control center of the apparatus: it connects the various parts of the whole device through various interfaces and lines, and performs the functions of the device and processes data by running or executing software programs and/or modules stored in the memory 1403 and calling data stored in the memory 1403.
The voice input and output device 1404 provides an audio interface between the user and the device. The voice input unit may be an audio circuit or a voice recognizer. The audio circuit may include a speaker 14041 and a microphone 14042: the microphone 14042 converts collected sound signals into electrical signals, which the audio circuit receives and converts into audio data, and the audio data is output to the processor 1401 for processing, where the processor 1401 removes the device-related noise to obtain the biological voiceprint of the user.
Optionally, the apparatus may further comprise an antenna, through which the transceiver 1402 transmits or receives wireless signals. The transceiver 1402 may be used to send registration voiceprint information and the corresponding voiceprint identifier to, or receive them from, other devices. Alternatively, the processor 1401 and the memory 1403 may be combined into one processing device, with the processor 1401 executing the program code stored in the memory 1403 to implement the above functions. Optionally, the memory 1403 may be integrated in the processor 1401, or the memory 1403 may be separate from the processor 1401, i.e., external to it. Optionally, the transceiver 1402 includes, but is not limited to, a Radio Frequency (RF) circuit, a communication interface, a WiFi module, a Bluetooth module, and the like.
Optionally, the apparatus may further include a display unit 1405 for displaying information input by the user or provided to the user, and various images. The display panel of the display unit may be configured as a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like.
In one possible design, the apparatus 1400 may be used to perform the functions performed by the identification device in the method embodiment: a voice input-output device 1404 for receiving a voice input by a user; a processor 1401 for eliminating noise in the speech related to the recognition device to obtain target voiceprint information, the target voiceprint information being a biological voiceprint of the user; a transceiver 1402 for obtaining enrollment voiceprint information for a user, the enrollment voiceprint information including at least one biological voiceprint template resulting from removal of noise associated with an enrollment device; the processor 1401 is further configured to match the target voiceprint information with the registered voiceprint information to obtain a matching result, where the matching result is used to indicate the user identity.
Optionally, the transceiver 1402 is configured to receive the registered voiceprint information and a voiceprint identifier of the registered voiceprint information sent by the storage device or the registration device, where the voiceprint identifier is used to indicate the user. Optionally, the transceiver 1402 is further configured to send the voiceprint identification of the user to the storage device or the registration device to request the registration voiceprint information of the user.
Optionally, the processor 1401 is specifically configured to input the speech into a voiceprint extraction model, and eliminate noise related to the recognition device in the speech through the voiceprint extraction model to obtain the target voiceprint information. Optionally, the voiceprint extraction model is obtained by learning corpora collected by a plurality of devices. Optionally, the processor 1401 is further configured to perform frequency response compensation on the speech through a filter, where the frequency response compensation is used to eliminate noise associated with the recognition device in the speech to obtain the target voiceprint information. Optionally, the processor 1401 is further configured to obtain user data associated with the user identity; an operation corresponding to the user data is performed.
In one possible design, the apparatus 1400 is configured to perform the functions performed by the registration device in the above method embodiment: a voice input-output device 1404 for receiving a voice input by a user; a processor 1401 for: and eliminating noise related to the registered equipment in the voice to obtain registered voiceprint information of the user, wherein the registered voiceprint information specifically comprises a biological voiceprint of the user.
Optionally, the transceiver 1402 is configured to send the registered voiceprint information and a voiceprint identifier corresponding to the registered voiceprint information to the storage device or the identification device, where the voiceprint identifier is used to indicate the user. Optionally, the processor 1401 is specifically configured to input the voice into a voiceprint extraction model, and eliminate noise related to the registration device in the voice through the voiceprint extraction model to obtain the registered voiceprint information. Optionally, the voiceprint extraction model is obtained by learning corpora collected by a plurality of devices. Optionally, the processor 1401 is further configured to perform frequency response compensation on the speech through a filter, where the frequency response compensation is used to eliminate noise associated with the registration device in the speech to obtain the registered voiceprint information.
It is understood that the processor in the embodiments of the present application may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method embodiments may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The approaches described herein may be implemented in a variety of ways. For example, these techniques may be implemented in hardware, software, or a combination of hardware and software. For a hardware implementation, the processing units used to perform these techniques at a communication device (e.g., a base station, terminal, network entity, or chip) may be implemented in one or more general-purpose processors, DSPs, digital signal processing devices, ASICs, programmable logic devices, FPGAs, or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combinations of the above. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
It will be appreciated that the memory in the embodiments of the subject application can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), double data rate SDRAM, enhanced SDRAM, SLDRAM, Synchronous Link DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The present application also provides a computer-readable medium having stored thereon a computer program which, when executed by a computer, performs the functions of any of the method embodiments described above. The present application also provides a computer program product which, when executed by a computer, implements the functionality of any of the above-described method embodiments.
It should be appreciated that reference throughout this specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the various embodiments are not necessarily referring to the same embodiment throughout the specification. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should be understood that, in the present application, "when …", "if" and "if" all refer to the fact that the device performs the corresponding processing under certain objective conditions, and are not limited to time, and do not require any judgment action for the device to perform, nor do they imply other limitations. Reference in the present application to an element using the singular is intended to mean "one or more" rather than "one and only one" unless specifically stated otherwise. In the present application, unless otherwise specified, "at least one" is intended to mean "one or more" and "a plurality" is intended to mean "two or more". Additionally, the terms "system" and "network" are often used interchangeably herein.
Herein, the term "at least one of … …" or "at least one of … …" means all or any combination of the listed items, e.g., "at least one of A, B and C", may mean: the compound comprises six cases of separately existing A, separately existing B, separately existing C, simultaneously existing A and B, simultaneously existing B and C, and simultaneously existing A, B and C, wherein A can be singular or plural, B can be singular or plural, and C can be singular or plural.
It is understood that in the embodiments of the present application, "B corresponding to a" means that B is associated with a, from which B can be determined. It should also be understood that determining B from a does not mean determining B from a alone, but may be determined from a and/or other information. The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A can be singular or plural, and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
For convenience and brevity of description, a person skilled in the art may refer to the corresponding processes in the foregoing method embodiments for specific working processes of the system, the apparatus, and the unit described above, which are not described herein again.
It will be appreciated that the systems, apparatus and methods described herein may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The functions described in this embodiment, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The same or similar parts between the various embodiments in this application may be referred to each other. In the embodiments and the implementation methods/implementation methods in the embodiments in the present application, unless otherwise specified or conflicting in logic, terms and/or descriptions between different embodiments and between various implementation methods/implementation methods in various embodiments have consistency and can be mutually cited, and technical features in different embodiments and various implementation methods/implementation methods in various embodiments can be combined to form new embodiments, implementation methods, or implementation methods according to the inherent logic relationships thereof. The above-described embodiments of the present application do not limit the scope of the present application.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (23)

  1. A voiceprint recognition apparatus in a recognition device, comprising: a processor, and a voice input and output device and a transceiver connected with the processor;
    the voice input and output device is used for receiving voice input by a user;
    the processor is used for eliminating noise related to the identification equipment in the voice to obtain target voiceprint information, and the target voiceprint information is a biological voiceprint of the user;
    the transceiver is used for acquiring registration voiceprint information of the user, wherein the registration voiceprint information comprises at least one biological voiceprint template obtained by eliminating noise related to registration equipment;
    the processor is further configured to match the target voiceprint information with the registration voiceprint information to obtain a matching result, where the matching result is used to indicate the user identity.
  2. The apparatus of claim 1,
    the transceiver is configured to receive the registered voiceprint information and a voiceprint identifier of the registered voiceprint information sent by the storage device or the registered device, where the voiceprint identifier is used to indicate the user.
  3. The apparatus of claim 2,
    the transceiver is further configured to send the voiceprint identifier of the user to the storage device or the registration device to request for registration voiceprint information of the user.
  4. The device according to any one of claims 1 to 3,
    the processor is specifically configured to input the voice into a voiceprint extraction model, and eliminate noise in the voice related to the recognition device through the voiceprint extraction model to obtain the target voiceprint information.
  5. The apparatus according to claim 4, wherein the voiceprint extraction model is obtained by learning corpora collected by a plurality of devices.
  6. The device according to any one of claims 1 to 3,
    the processor is further configured to perform frequency response compensation on the speech through a filter, where the frequency response compensation is used to eliminate noise in the speech related to the recognition device to obtain the target voiceprint information.
  7. The apparatus according to any one of claims 1 to 6, wherein the processor is further configured to:
    obtaining user data associated with the user identity;
    and executing the operation corresponding to the user data.
  8. A voiceprint registration apparatus in a registration device, comprising: a processor and a voice input and output device connected with the processor;
    the voice input and output device is used for receiving voice input by a user;
    the processor is configured to eliminate noise associated with the enrollment device from the speech to obtain enrollment voiceprint information of the user, the enrollment voiceprint information including a biological voiceprint of the user.
  9. The apparatus of claim 8, further comprising a transceiver coupled to the processor;
    the transceiver is configured to send the registered voiceprint information and a voiceprint identifier corresponding to the registered voiceprint information to a storage device or an identification device, where the voiceprint identifier is used to indicate the user.
  10. The apparatus according to claim 8 or 9,
    the processor is specifically configured to input the voice into a voiceprint extraction model, and eliminate noise related to the registration device in the voice through the voiceprint extraction model to obtain the registration voiceprint information.
  11. The apparatus according to claim 10, wherein the voiceprint extraction model is learned from corpora collected by a plurality of devices.
  12. The apparatus according to claim 8 or 9,
    the processor is further configured to perform frequency response compensation on the voice through a filter, where the frequency response compensation is used to eliminate noise in the voice related to the registration device to obtain the registration voiceprint information.
  13. A cross-device voiceprint recognition method is applied to recognition devices and is characterized by comprising the following steps:
    receiving voice input by a user;
    eliminating noise related to the recognition device in the voice to obtain target voiceprint information, wherein the target voiceprint information is a biological voiceprint of the user;
    acquiring registration voiceprint information of the user, wherein the registration voiceprint information comprises at least one biological voiceprint template obtained by eliminating noise related to registration equipment;
    and matching the target voiceprint information with the registered voiceprint information to obtain a matching result, wherein the matching result is used for indicating the identity of the user.
  14. The method of claim 13, wherein the obtaining registered voiceprint information of the user comprises:
    and receiving the registered voiceprint information and a voiceprint identifier corresponding to the registered voiceprint information, which are sent by a storage device or the registered device, wherein the voiceprint identifier is used for indicating the user.
  15. The method according to claim 14, wherein before receiving the registered voiceprint information and the voiceprint identifier corresponding to the registered voiceprint information sent by the storage device or the registered device, the method further comprises:
    and sending the voiceprint identification to the storage device or the registration device to request the registration voiceprint information of the user.
  16. The method according to any one of claims 13-15, wherein said removing noise associated with the recognition device from the speech to obtain target voiceprint information comprises:
    and inputting the voice into a voiceprint extraction model, and eliminating noise related to the recognition equipment in the voice through the voiceprint extraction model to obtain the target voiceprint information.
  17. The method of claim 16, wherein the voiceprint extraction model is learned from corpora collected by a plurality of devices.
  18. The method according to any one of claims 13-17, wherein after matching the target voiceprint information with the registered voiceprint information to obtain a matching result, the method further comprises:
    obtaining user data associated with the user identity;
    and executing the operation corresponding to the user data.
  19. A cross-device voiceprint recognition method is applied to a registration device and is characterized by comprising the following steps:
    receiving voice input by a user;
    eliminating noise related to the registered equipment in the voice to obtain registered voiceprint information; the registered voiceprint information includes a biometric voiceprint of the user.
  20. The method of claim 19, further comprising:
    and sending the registered voiceprint information and the voiceprint identification corresponding to the registered voiceprint information to a storage device or an identification device.
  21. The method according to claim 19 or 20, wherein said removing noise associated with said enrollment device from said speech to obtain biometric voice print information comprises:
    and inputting the voice into a voiceprint extraction model, and eliminating noise related to the registration equipment in the voice through the voiceprint extraction model to obtain the registration voiceprint information.
  22. The method of claim 21, wherein the voiceprint extraction model is learned from corpora collected by a plurality of devices.
  23. The method according to claim 19 or 20, wherein said removing noise associated with said enrollment device from said speech to obtain biometric voice print information comprises:
    and performing frequency response compensation on the voice through a filter, wherein the frequency response compensation is used for eliminating noise related to the registration device in the voice to obtain the registration voiceprint information.
CN202080001170.5A 2020-05-19 2020-05-19 Voiceprint recognition and registration device and cross-device voiceprint recognition method Pending CN114026637A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/090930 WO2021232213A1 (en) 2020-05-19 2020-05-19 Voiceprint recognition apparatus, voiceprint registration apparatus and cross-device voiceprint recognition method

Publications (1)

Publication Number Publication Date
CN114026637A true CN114026637A (en) 2022-02-08

Family

ID=78708944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080001170.5A Pending CN114026637A (en) 2020-05-19 2020-05-19 Voiceprint recognition and registration device and cross-device voiceprint recognition method

Country Status (2)

Country Link
CN (1) CN114026637A (en)
WO (1) WO2021232213A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780787A (en) * 2022-04-01 2022-07-22 杭州半云科技有限公司 Voiceprint retrieval method, identity verification method, identity registration method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9373330B2 (en) * 2014-08-07 2016-06-21 Nuance Communications, Inc. Fast speaker recognition scoring using I-vector posteriors and probabilistic linear discriminant analysis
CN106779121A (en) * 2016-12-23 2017-05-31 上海木爷机器人技术有限公司 Electronic bill processing method and system
CN108172230A * 2018-01-03 2018-06-15 平安科技(深圳)有限公司 Voiceprint registration method based on voiceprint recognition model, terminal device and storage medium
CN108989349B (en) * 2018-08-31 2022-11-29 平安科技(深圳)有限公司 User account unlocking method and device, computer equipment and storage medium
CN109360579A (en) * 2018-12-05 2019-02-19 途客电力科技(天津)有限公司 Charging pile phonetic controller and system
CN111161746B (en) * 2019-12-31 2022-04-15 思必驰科技股份有限公司 Voiceprint registration method and system

Also Published As

Publication number Publication date
WO2021232213A1 (en) 2021-11-25

Similar Documents

Publication Publication Date Title
CN111699528B (en) Electronic device and method for executing functions of electronic device
US11386905B2 (en) Information processing method and device, multimedia device and storage medium
KR102571011B1 (en) Responding to Remote Media Classification Queries Using Classifier Models and Context Parameters
CN102779509B (en) Voice processing equipment and voice processing method
CN104335559B (en) A kind of method of automatic regulating volume, volume adjustment device and electronic equipment
CN112074900B (en) Audio analysis for natural language processing
CN103024530A (en) Intelligent television voice response system and method
US9911417B2 (en) Internet of things system with voice-controlled functions and method for processing information of the same
CN109724215A (en) Air conditioning control method, air conditioning control device, air-conditioning equipment and storage medium
WO2014004865A1 (en) System for adaptive delivery of context-based media
CN104575509A (en) Voice enhancement processing method and device
EP4009206A1 (en) System and method for authenticating a user by voice to grant access to data
CN111081234A (en) Voice acquisition method, device, equipment and storage medium
WO2019101099A1 (en) Video program identification method and device, terminal, system, and storage medium
CN114026637A (en) Voiceprint recognition and registration device and cross-device voiceprint recognition method
US20180182393A1 (en) Security enhanced speech recognition method and device
CN111653284B (en) Interaction and identification method, device, terminal equipment and computer storage medium
CN114090986A (en) Method for identifying user on public equipment and electronic equipment
CN105930522A (en) Intelligent music recommendation method, system and device
CN112420063A (en) Voice enhancement method and device
CN114722234A (en) Music recommendation method, device and storage medium based on artificial intelligence
CN113593582A (en) Control method and device of intelligent device, storage medium and electronic device
CN113436613A (en) Voice recognition method and device, electronic equipment and storage medium
KR20220082258A (en) Electronic device, and method for providing memory service in electronic device
US11349979B2 (en) Electronic device for supporting user-customized service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination