CN116935858A - Voiceprint recognition method and voiceprint recognition device - Google Patents
Voiceprint recognition method and voiceprint recognition device
- Publication number
- CN116935858A (application number CN202210374386.7A)
- Authority
- CN
- China
- Prior art keywords
- voice
- voiceprint
- voiceprint vector
- terminal equipment
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/14—Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
Abstract
The embodiment of the application provides a voiceprint recognition method and device, relating to the field of terminal technologies. The method comprises the following steps: the terminal device collects a first voice and obtains a first voiceprint vector corresponding to the first voice; when the terminal device determines that the first voice meets a preset condition, it scores the similarity between the first voiceprint vector and a preset second voiceprint vector to obtain a first value; when the first value is smaller than or equal to a first threshold, the terminal device scores the similarity between a third voiceprint vector and a preset fourth voiceprint vector to obtain a second value; and when the second value is greater than a second threshold, the terminal device executes the task corresponding to the first voice. Because the third voiceprint vector simulates the user's voiceprint in a mask-wearing scenario, comparing it against the fourth voiceprint vector lets the terminal device recognize the voice accurately when the user wears a mask, which improves the accuracy of the voiceprint recognition method.
Description
Technical Field
The present application relates to the field of terminal technologies, and in particular, to a voiceprint recognition method and apparatus.
Background
With the popularization and development of the internet, users' functional demands on terminal devices are becoming increasingly diverse. For example, to simplify how a user operates a terminal device, the device may allow the user to wake it, or certain functions on it, by voice. Because the voiceprint data of different users is unique, the terminal device can judge whether a received sound belongs to the registered user (that is, the owner of the terminal device) based on that voiceprint data.
In general, the terminal device may score the similarity between the voiceprint data of the registered user and the voiceprint data of a received speaker based on a voiceprint model; the terminal device is woken up when the score exceeds a preset threshold and is not woken up when the score is below that threshold.
However, when a user wearing a mask attempts a voice wake-up, the mask's interference with the voice signal lowers the accuracy of the voiceprint recognition method.
Disclosure of Invention
The embodiment of the application provides a voiceprint recognition method and a voiceprint recognition device. A terminal device can collect a first voice and obtain a first voiceprint vector when the first voice meets a preset condition. When the similarity score between the first voiceprint vector and a second voiceprint vector is smaller than or equal to a first threshold, the terminal device falls back to a third voiceprint vector that simulates the user's voiceprint when wearing a mask; when the similarity between a fourth voiceprint vector and the third voiceprint vector is greater than a second threshold, the terminal device treats the voice as recognized. In this way, voiceprint recognition remains accurate in a mask-wearing scenario, improving the accuracy of the voiceprint recognition method.
In a first aspect, an embodiment of the present application provides a voiceprint recognition method, including: the terminal device collects a first voice and obtains a first voiceprint vector corresponding to the first voice; when the terminal device determines that the first voice meets a preset condition, the terminal device scores the similarity between the first voiceprint vector and a preset second voiceprint vector to obtain a first value; when the first value is smaller than or equal to a first threshold, the terminal device scores the similarity between a third voiceprint vector and a preset fourth voiceprint vector to obtain a second value, where the third voiceprint vector is obtained by filtering out a preset frequency from the first voice, the fourth voiceprint vector is obtained by filtering out the preset frequency from a second voice, and the second voice is the voice corresponding to the second voiceprint vector; and when the second value is greater than a second threshold, the terminal device executes the task corresponding to the first voice. In this way, the third voiceprint vector simulates the user's voiceprint in a mask-wearing scenario, so comparing it against the fourth voiceprint vector allows the voice to be recognized accurately when the user wears a mask, improving the accuracy of the voiceprint recognition method.
The first voice may be speaker voiceprint data in the embodiment of the present application; the first voiceprint vector may be a speaker voiceprint vector C in the embodiment of the present application; the second voiceprint vector may be the registered voiceprint vector a in the embodiment of the present application; the first threshold may be T1 in the embodiment of the present application; the third voiceprint vector may be a speaker voiceprint vector D in the embodiment of the present application; the fourth voiceprint vector may be the registered voiceprint vector B in the embodiment of the present application; the second threshold may be T2 in an embodiment of the present application.
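The two-stage decision above can be sketched as follows. This is an illustrative sketch only: the cosine-similarity metric, the function names, and the threshold values `t1`/`t2` are assumptions for illustration, not the patented implementation.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two voiceprint vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def should_execute_task(c, a, d, b, t1=0.80, t2=0.60):
    """c: speaker vector C, a: enrolled vector A,
    d/b: their mask-simulated (low-pass-filtered) counterparts D/B."""
    first_value = cosine_similarity(c, a)   # first value vs. first threshold T1
    if first_value > t1:
        return True                         # normal (no-mask) path succeeds
    second_value = cosine_similarity(d, b)  # second value vs. second threshold T2
    return second_value > t2                # mask-simulated fallback path
```

Choosing `t2` below `t1` reflects the intuition that the filtered vectors carry less information, but the patent does not specify the threshold values.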
In a possible implementation manner, obtaining the similarity score of the third voiceprint vector and the preset fourth voiceprint vector to obtain the second value includes: the terminal device obtains the ratio of the energy of a first signal in the first voice to the energy of a second signal in the first voice, to obtain a third value, where the signal frequency of the second signal is greater than that of the first signal; when the terminal device determines that the first ratio of the third value to a fourth value does not fall within any preset range in the database, or that the first ratio falls within a preset range but no correspondence exists between that preset range and a voiceprint vector, the terminal device obtains the similarity score of the third voiceprint vector and the fourth voiceprint vector to obtain the second value; the fourth value is the ratio of the energy of a third signal in the second voice to the energy of a fourth signal in the second voice, and the signal frequency of the fourth signal is greater than that of the third signal. In this way, when no preset range matching the first ratio is found in the database, the terminal device can fall back to comparing the fourth voiceprint vector with the third voiceprint vector, which reduces the complexity of the algorithm.
The third value may be K2 in the embodiment of the present application, the fourth value may be K1 in the embodiment of the present application, and the first ratio may be K2/K1 in the embodiment of the present application.
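One plausible reading of the third value (and analogously the fourth) is the ratio of low-band to high-band spectral energy within a voice, split at the 2000 Hz figure mentioned later in the description. The sketch below works under that assumption; the function name and the split frequency are illustrative, not taken from the claims.

```python
import numpy as np

def band_energy_ratio(samples, sample_rate, split_hz=2000):
    # Ratio of the energy of the low-frequency signal (the "first signal")
    # to the energy of the high-frequency signal (the "second signal").
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    power = np.abs(spectrum) ** 2
    low = power[freqs <= split_hz].sum()   # energy below the split frequency
    high = power[freqs > split_hz].sum()   # energy above the split frequency
    return low / high
```

When a mask attenuates the high band, this ratio grows, which is what lets the first ratio (speaker ratio over enrolled ratio) flag a likely mask-wearing scenario.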
In one possible implementation manner, after the terminal device performs the task corresponding to the first voice, the method further includes: when the terminal device determines that the first ratio falls within a preset range but no correspondence exists between that preset range and a voiceprint vector, the terminal device establishes a correspondence between the first voiceprint vector and the preset range in the database. Because the third and fourth voiceprint vectors are derived from low-frequency signals whose high-frequency components have been filtered out, comparing two such low-frequency signals for similarity carries a certain risk; the terminal device therefore adds the first voiceprint vector, which has a higher similarity to the registered user, to the database for subsequent use, improving the accuracy of subsequent voiceprint recognition.
In one possible implementation manner, when the terminal device determines that the first ratio falls within a preset range but no correspondence exists between that preset range and a voiceprint vector, establishing the correspondence between the first voiceprint vector and the preset range in the database includes: the terminal device displays a first interface, where the first interface includes prompt information asking whether the first voice belongs to the preset user and a first control for confirming that it does; when the terminal device receives an operation on the first control, the terminal device establishes the correspondence between the first voiceprint vector and the preset range in the database. In this way, the terminal device adds the first voiceprint vector to the database only after confirmation by the speaker, which increases the security of voiceprint recognition.
In one possible implementation manner, obtaining the ratio of the energy of the first signal in the first voice to the energy of the second signal in the first voice, to obtain the third value, includes: the terminal device performs noise reduction processing on the first voice to obtain a third voice; the terminal device obtains the ratio of the energy of a fifth signal in the third voice to the energy of a sixth signal in the third voice, to obtain the third value, where the fifth signal corresponds to the first signal and the sixth signal corresponds to the second signal. In this way, noise reduction improves the quality of the collected sound, so that the terminal device can perform voiceprint recognition on the denoised sound, improving the accuracy of the voiceprint recognition method.
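The embodiment does not specify which noise-reduction algorithm is used. As a purely illustrative stand-in, a crude spectral-gating sketch could look like this (the threshold scheme is an assumption, not the patented step):

```python
import numpy as np

def simple_denoise(samples, threshold_ratio=0.1):
    # Zero out spectral bins whose magnitude falls below a fraction of the
    # peak magnitude; broadband noise spreads thinly across bins and is
    # removed, while strong tonal speech components are kept.
    spectrum = np.fft.rfft(samples)
    mags = np.abs(spectrum)
    spectrum[mags < threshold_ratio * mags.max()] = 0
    return np.fft.irfft(spectrum, n=len(samples))
```

Real systems would use something more robust (e.g. a noise-floor estimate per bin), but the shape of the step is the same: denoise first, then compute the band-energy ratio on the cleaned signal.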
In one possible implementation, the method further includes: when the terminal device determines that the first ratio falls within a preset range and a correspondence exists between that preset range and a voiceprint vector, the terminal device extracts a fifth voiceprint vector corresponding to the preset range from the database; the terminal device scores the similarity between the first voiceprint vector and the fifth voiceprint vector to obtain a fifth value; and when the fifth value is greater than a third threshold, the terminal device executes the task corresponding to the first voice. In this way, the terminal device can perform voiceprint recognition based on the fifth voiceprint vector, which has a higher similarity to the first voiceprint vector, ensuring the accuracy of voiceprint recognition.
The fifth voiceprint vector may be a speaker voiceprint vector X in the embodiment of the present application; the third threshold may be T3 in an embodiment of the present application.
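The database lookup that selects a stored voiceprint vector by preset energy-ratio range might be organised as below. The data layout and names are hypothetical; the patent only says a preset range may or may not have a corresponding voiceprint vector.

```python
def lookup_voiceprint(db, first_ratio):
    """db: list of ((low, high), vector_or_None) entries, where each entry
    maps a preset energy-ratio range to a stored voiceprint vector (or to
    None if no correspondence has been established yet for that range)."""
    for (low, high), vector in db:
        if low <= first_ratio <= high:
            return (low, high), vector   # range matched; vector may be None
    return None, None                    # no preset range matched the ratio
```

A `(range, None)` result corresponds to the "preset range matched but no voiceprint vector associated" branch, where the method falls back to comparing the filtered vectors and may then store the first voiceprint vector for that range.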
In one possible implementation manner, after the terminal device performs the task corresponding to the first voice, the method further includes: the terminal device performs voiceprint fusion on the fifth voiceprint vector and the first voiceprint vector to obtain a sixth voiceprint vector; and the terminal device establishes a correspondence between the sixth voiceprint vector and the preset range in the database. Because both the first voiceprint vector and the fifth voiceprint vector may deviate somewhat from the user's real voiceprint vector, fusing them yields a sixth voiceprint vector that is closer to the user's real voice, so that the terminal device can subsequently perform more accurate voiceprint recognition based on the sixth voiceprint vector.
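The voiceprint-fusion step could be as simple as a renormalised weighted average of the two vectors. The weighting below is an assumption for illustration, not the patented fusion method:

```python
import numpy as np

def fuse_voiceprints(v_old, v_new, weight_new=0.5):
    # Weighted average of the stored (fifth) and fresh (first) voiceprint
    # vectors, renormalised to unit length so that cosine-style similarity
    # scoring keeps working on the fused result.
    fused = (1 - weight_new) * np.asarray(v_old) + weight_new * np.asarray(v_new)
    return fused / np.linalg.norm(fused)
```

A smaller `weight_new` makes the database drift more slowly toward recent recordings; the right balance depends on how noisy individual wake-up utterances are.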
In one possible implementation, the method further includes: when the first value is greater than the first threshold, the terminal device executes the task corresponding to the first voice. In this way, the terminal device can accurately recognize the user's voice when the user is not wearing a mask.
In one possible implementation manner, before the terminal device collects the first voice, the method further includes: when the terminal device receives an operation for enabling voiceprint recognition, the terminal device displays a second interface, where the second interface includes prompt information asking whether to collect the second voice of the preset user and a second control for confirming collection; and when the terminal device receives an operation on the second control, the terminal device collects the second voice. In this way, the user can register a voiceprint through the prompt interface of the voice wake-up function, so that the terminal device can later recognize voiceprints against the voiceprint data recorded at registration.
In one possible implementation manner, the preset frequency is a frequency preset by the terminal device and indicates the frequency of the signal components that are filtered out when a mask is worn.
In one possible implementation, the preset frequency is greater than 2000 hertz. By filtering out frequencies above the preset frequency, the terminal device can simulate the voiceprint vector that the user would produce in a mask-wearing scenario, and then perform voiceprint recognition with that filtered voiceprint vector, improving the accuracy of voiceprint recognition when the user wears a mask.
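Treating the mask as a low-pass filter, the mask-simulated input can be sketched with an ideal FFT low-pass at the 2000 Hz cutoff. This is an illustrative sketch (an ideal brick-wall filter, name and cutoff default assumed), not the claimed implementation:

```python
import numpy as np

def simulate_mask(samples, sample_rate, cutoff_hz=2000):
    # Remove all spectral content above cutoff_hz, mimicking the description's
    # premise that a mask acts essentially as a low-pass filter on speech.
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0
    return np.fft.irfft(spectrum, n=len(samples))
```

The third and fourth voiceprint vectors would then be extracted from `simulate_mask(first_voice, ...)` and `simulate_mask(second_voice, ...)` respectively, so both sides of the comparison have been degraded the same way.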
In a second aspect, an embodiment of the present application provides a voiceprint recognition device, including: an acquisition unit configured to collect a first voice and obtain a first voiceprint vector corresponding to the first voice; and a processing unit configured to, when the terminal device determines that the first voice meets a preset condition, score the similarity between the first voiceprint vector and a preset second voiceprint vector to obtain a first value; when the first value is smaller than or equal to a first threshold, the processing unit is further configured to score the similarity between a third voiceprint vector and a preset fourth voiceprint vector to obtain a second value, where the third voiceprint vector is obtained by filtering out a preset frequency from the first voice, the fourth voiceprint vector is obtained by filtering out the preset frequency from a second voice, and the second voice is the voice corresponding to the second voiceprint vector; and when the second value is greater than a second threshold, the processing unit is further configured to execute the task corresponding to the first voice.
In one possible implementation manner, the processing unit is specifically configured to obtain a ratio of energy of the first signal in the first voice to energy of the second signal in the first voice, so as to obtain a third value; wherein the signal frequency of the second signal is greater than the signal frequency of the first signal; when the terminal device determines that the first ratio of the third value to the fourth value does not meet the preset range in the database, or the first ratio meets the preset range and no corresponding relation exists between the preset range and the voiceprint vector, the processing unit is further specifically configured to obtain a similarity score of the third voiceprint vector and the fourth voiceprint vector, so as to obtain a second value; the fourth value is a ratio of energy of the third signal in the second voice to energy of the fourth signal in the second voice, and a signal frequency of the fourth signal is greater than a signal frequency of the third signal.
In one possible implementation manner, when the terminal device determines that the first ratio falls within a preset range but no correspondence exists between that preset range and a voiceprint vector, the processing unit is further configured to establish a correspondence between the first voiceprint vector and the preset range in the database.
In one possible implementation manner, when the terminal device determines that the first ratio meets a preset range and a corresponding relation between the preset range and the voiceprint vector does not exist, the display unit is used for displaying a first interface; the first interface comprises prompt information for indicating whether the first voice is of a preset user or not and a first control for indicating that the first voice is of the preset user or not; when the terminal equipment receives the operation for the first control, the processing unit is further used for establishing a corresponding relation between the first voiceprint vector and a preset range in the database.
In one possible implementation manner, the processing unit is specifically configured to perform noise reduction processing on the first voice to obtain a third voice; the processing unit is specifically configured to obtain a ratio of energy of a fifth signal in the third voice to energy of a sixth signal in the third voice, so as to obtain a third numerical value; wherein the fifth signal corresponds to the first signal and the sixth signal corresponds to the second signal.
In one possible implementation manner, when the terminal device determines that the first ratio meets the preset range and a corresponding relationship exists between the preset range and the voiceprint vector, the processing unit is specifically configured to extract a fifth voiceprint vector corresponding to the preset range from the database; the processing unit is further specifically configured to obtain a similarity score of the first voiceprint vector and the fifth voiceprint vector, so as to obtain a fifth numerical value; when the fifth value is greater than the third threshold, the processing unit is further specifically configured to execute a task corresponding to the first voice.
In one possible implementation manner, the processing unit is further configured to perform voiceprint fusion on the fifth voiceprint vector and the first voiceprint vector to obtain a sixth voiceprint vector; and the processing unit is further configured to establish a correspondence between the sixth voiceprint vector and the preset range in the database.
In one possible implementation, the processing unit is further configured to perform the task corresponding to the first voice when the first value is greater than the first threshold.
In a possible implementation manner, when the terminal device receives an operation for starting voiceprint recognition, the display unit is further configured to display a second interface; the second interface comprises prompt information for indicating whether to acquire second voice of the preset user and a second control for indicating to acquire the second voice; when the terminal equipment receives the operation for the second control, the processing unit is further used for acquiring second voice.
In one possible implementation manner, the preset frequency is a frequency preset by the terminal device, and the preset frequency is used for indicating a frequency corresponding to a signal filtered when the mask is worn.
In one possible implementation, the preset frequency is greater than 2000 hertz.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed, causes the terminal device to perform the voiceprint recognition method described in the first aspect or any implementation manner of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing instructions that, when executed, cause a computer to perform a voiceprint recognition method as described in the first aspect or any implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program which, when run, causes a computer to perform the voiceprint recognition method described in the first aspect or any implementation of the first aspect.
It should be understood that the second to fifth aspects of the present application correspond to the technical solutions of the first aspect of the present application, and the advantages obtained by each aspect and the corresponding possible embodiments are similar, and are not repeated.
Drawings
FIG. 1 is a schematic view of a scene provided in an embodiment of the present application;
fig. 2 is a schematic hardware structure of a terminal device according to an embodiment of the present application;
fig. 3 is a schematic flow chart of voiceprint registration according to an embodiment of the present application;
fig. 4 is a schematic flow chart of a voiceprint recognition method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of sound attenuation according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating another voiceprint recognition method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an interface for voiceprint registration according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an interface for speaker verification according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a voiceprint recognition device according to an embodiment of the present application;
fig. 10 is a schematic hardware structure diagram of another terminal device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
In order to clearly describe the technical solutions of the embodiments of the present application, words such as "first" and "second" are used in the embodiments of the present application to distinguish between identical or similar items having substantially the same function and effect. For example, the first value and the second value merely distinguish different values and do not limit their order. Those skilled in the art will appreciate that words such as "first" and "second" do not limit quantity or order of execution, nor do they necessarily indicate a difference.
In the present application, the words "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of" the following items means any combination of those items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c may each be single or plural.
As epidemics at home and abroad have worsened, masks have become an indispensable daily item, and their interference with the voice signal affects the voice wake-up function of terminal devices. How to wake a terminal device normally through voiceprint when the user wears a mask is therefore a problem to be solved.
A voiceprint is the sound-wave spectrum, displayed by an electroacoustic instrument, that carries speech information, and it can be used to characterize a speaker's vocal features. Voiceprints are not only specific but also relatively stable. It will be appreciated that whether a speaker deliberately imitates another person's voice and tone or speaks in a whisper, the speaker's voiceprint always differs from the real voiceprint of the person being imitated, even if the imitation is remarkably lifelike. Therefore, voiceprint recognition can be widely used in speaker-recognition scenarios.
In the embodiment of the application, when the user wears a mask, the terminal device can judge whether a received sound belongs to the registered user by using voiceprint data that reflects the user's voice while wearing a mask, and wake up when the received sound is determined to be the registered user's.
By way of example, fig. 1 is a schematic view of a scenario provided in an embodiment of the present application. In the embodiment corresponding to fig. 1, a mobile phone is taken as an example of the terminal device; this example does not limit the embodiment of the present application.
As shown in fig. 1, the scenario may include: a user 101 wearing a mask and a mobile phone 102. The user 101 may be a registered user of the mobile phone 102 (or, it may be understood that the user 101 may be the owner of the mobile phone 102).
In the scenario corresponding to fig. 1, the user 101 is a registered user of the mobile phone 102, and the mobile phone 102 may register voiceprint data of the user 101 in a quiet scenario. Thus, when the user 101 attempts to wake up the mobile phone 102, the mobile phone 102 may score the similarity between the registered voiceprint data of the user 101 and the speaker voiceprint data: when the score exceeds a preset threshold, the mobile phone 102 may be woken up; when the score is less than the preset threshold, the mobile phone 102 may not be woken up.
However, when the user 101 wears a mask to wake up the device, the mask is essentially equivalent to a low-pass filter: high-frequency signals (such as sounds above 2000 hertz (hz), or sounds in the 2000hz-7000hz range) in the speaker voiceprint data of the user 101 are filtered out, so there is a large difference between the mask-affected voiceprint data and the registered voiceprint data of the user 101. As a result, the similarity score between the mask-affected voiceprint data and the registered voiceprint data of the user 101 can hardly exceed the preset threshold, which makes it difficult to wake up the mobile phone 102. For example, when the preset threshold is 80 points and the similarity score between the mask-affected voiceprint data and the registered voiceprint data of the user 101 is 50 points, the user 101 wearing a mask has little chance of waking up the mobile phone 102. Therefore, when the user wears a mask, the voiceprint recognition method has lower accuracy, which seriously affects the user's normal use of the device wake-up function.
In view of this, an embodiment of the present application provides a voiceprint recognition method in which a terminal device may collect a first voice and obtain a first voiceprint vector when the first voice meets a preset condition. When the similarity score between the first voiceprint vector and a second voiceprint vector is less than or equal to a first threshold, the terminal device simulates a third voiceprint vector for the scenario in which the user wears a mask; when it determines that the similarity between a fourth voiceprint vector and the third voiceprint vector is greater than a second threshold, the voiceprint recognition passes. In this way, the terminal device can accurately recognize the voice in the mask-wearing scenario, increasing the accuracy of the voiceprint recognition method.
The first voice may be speaker voiceprint data in the embodiment of the present application; the first voiceprint vector may be a speaker voiceprint vector C in the embodiment of the present application; the second voiceprint vector may be the registered voiceprint vector a in the embodiment of the present application; the first threshold may be T1 in the embodiment of the present application; the third voiceprint vector may be a speaker voiceprint vector D in the embodiment of the present application; the fourth voiceprint vector may be the registered voiceprint vector B in the embodiment of the present application; the second threshold may be T2 in an embodiment of the present application.
It can be understood that the voiceprint recognition method provided by the embodiment of the present application not only can be used in the equipment wake-up scene shown in fig. 1, but also can be used in other scenes for identity authentication, such as payment scenes, and the embodiment of the present application is not limited in particular.
It is understood that the above terminal device may also be referred to as a terminal (terminal), user equipment (UE), a mobile station (MS), a mobile terminal (MT), etc. The terminal device may be a mobile phone (mobile phone) with a microphone, a smart TV, a wearable device, a tablet (Pad), a computer with a wireless transceiving function, a virtual reality (VR) terminal device, an augmented reality (augmented reality, AR) terminal device, a wireless terminal in industrial control (industrial control), a wireless terminal in self-driving (self-driving), a wireless terminal in remote medical surgery (remote medical surgery), a wireless terminal in smart grid (smart grid), a wireless terminal in transportation safety (transportation safety), a wireless terminal in smart city (smart city), a wireless terminal in smart home (smart home), etc. The embodiment of the present application does not limit the specific technology or specific device form adopted by the terminal device.
Therefore, in order to better understand the embodiments of the present application, the structure of the terminal device of the embodiments of the present application will be described below. Fig. 2 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
The terminal device may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, an indicator 192, a camera 193, a display 194, and the like.
It will be appreciated that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the terminal device. In other embodiments of the present application, the terminal device may include more or fewer components than illustrated, certain components may be combined or split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units. Wherein the different processing units may be separate devices or may be integrated in one or more processors. A memory may also be provided in the processor 110 for storing instructions and data.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge a terminal device, or may be used to transfer data between the terminal device and a peripheral device. And can also be used for connecting with a headset, and playing audio through the headset. The interface may also be used to connect other terminal devices, such as AR devices, etc.
The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. The power management module 141 is used for connecting the charge management module 140 and the processor 110.
The wireless communication function of the terminal device may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Antennas in the terminal device may be used to cover single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G or the like applied on a terminal device. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation.
The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wirelesslocal area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), etc. as applied on a terminal device.
The terminal device implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. In some embodiments, the terminal device may include 1 or N display screens 194, N being a positive integer greater than 1.
The terminal device may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The camera 193 is used to capture still images or video. In some embodiments, the terminal device may include 1 or N cameras 193, N being a positive integer greater than 1.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to realize expansion of the memory capability of the terminal device. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 121 may be used to store computer-executable program code that includes instructions. The internal memory 121 may include a storage program area and a storage data area.
The terminal device may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. The terminal device can listen to music through the speaker 170A or listen to hands-free calls. A receiver 170B, also referred to as a "earpiece", is used to convert the audio electrical signal into a sound signal. When the terminal device picks up a call or voice message, the voice can be picked up by placing the receiver 170B close to the human ear. The earphone interface 170D is used to connect a wired earphone.
Microphone 170C, also referred to as a "mic" or "sound transmitter", is used to convert sound signals into electrical signals. In the embodiment of the present application, the terminal device may receive, based on the microphone 170C, the sound signal for waking up the terminal device and convert the sound signal into an electrical signal that can be processed subsequently, such as the voiceprint data described in the embodiment of the present application. The terminal device may have at least one microphone 170C.
The sensor module 180 may include one or more of the following sensors, for example: a pressure sensor, a gyroscope sensor, a barometric sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, or a bone conduction sensor, etc. (not shown in fig. 2).
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The terminal device may receive key inputs, generating key signal inputs related to user settings of the terminal device and function control. The indicator 192 may be an indicator light, may be used to indicate a state of charge, a change in charge, a message indicating a missed call, a notification, etc.
The software system of the terminal device may adopt a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, a cloud architecture, or the like, which will not be described herein.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be implemented independently or combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
In the embodiment of the application, before voiceprint recognition is performed, the terminal equipment can acquire the voiceprint vectors of the registered user voiceprint data in different scenes by using the voiceprint model based on the embodiment corresponding to the following fig. 3, so that the subsequent terminal equipment can score the similarity of the registered user voiceprint data based on the voiceprint vectors in different scenes. Wherein the voiceprint vector can be used to characterize the voice characteristics of the speaker.
Fig. 3 is a schematic flow chart of voiceprint registration according to an embodiment of the present application. In the embodiment corresponding to fig. 3, the voiceprint data of the registered user collected by the terminal device is voiceprint data when the registered user does not wear the mask.
As shown in fig. 3, the voice print registration process may include the steps of:
s301, the terminal equipment acquires voice print data of the registered user.
The voice print data of the registered user can be voice data of the registered user acquired by the terminal equipment based on a microphone.
For example, the voiceprint data of the registered user may be voiceprint data acquired by the registered user in a scene where the environment is quiet and the mask is not worn.
S302, the terminal equipment detects wake-up words.
For example, in a scenario where a terminal device in a sleep state is awakened by a wake-up word, the wake-up word may be hello YOYO or the like; alternatively, in a scenario where payment is made using a wake word, the wake word may be a confirmation payment or the like; it can be understood that the wake-up word can be set according to an actual application scenario, which is not limited in the embodiment of the present application.
In an exemplary scenario of user voiceprint registration, the terminal device may acquire the registered user voiceprint data in real time and perform wake-up word detection on the registered user voiceprint data, and when the wake-up word is detected, the terminal device may perform the step shown in S303, or when the wake-up word is not detected, the terminal device may display prompt information, where the prompt information may be used to instruct the registered user to perform wake-up word detection again.
It will be appreciated that after the wake-up word detection is passed, the terminal device may perform the steps shown in S303 to obtain a registered voiceprint vector a, perform the steps shown in S304-S305 to obtain a registered voiceprint vector B, and perform the steps shown in S306-S307 to obtain a value K1.
The registered voiceprint vector a can be understood as a voiceprint vector acquired by a registered user in a scene of not wearing the mask; the registered voiceprint vector B can be understood as a voiceprint vector obtained by simulating a registered user in a wearing mask scene; the K1 can be understood as the energy ratio of the low frequency signal in the voice print data of the registered user to the high frequency signal in the voice print data of the registered user.
S303, the terminal equipment calculates a registered voiceprint vector A corresponding to voiceprint data of the registered user based on the voiceprint model.
For example, one possible implementation of calculating, by the terminal device, the registered voiceprint vector a corresponding to the registered user voiceprint data based on the voiceprint model may be: the terminal equipment can acquire voice print data of a registered user; extracting acoustic characteristics of the registered user corresponding to the voice print data of the registered user; the registered user acoustic features are input into a voiceprint model to obtain a registered voiceprint vector A.
In an embodiment of the present application, the voiceprint model may include one or more of the following, for example: gaussian mixture model (gaussian mixture model, GMM), gaussian mixture background model (GMM-universal background model, GMM-UBM), gaussian mixture support vector machine (GMM-support vector machine, GMM-SVM), joint factor analysis (joint factor analysis, JFA), GMM-based i-vector method, deep neural network (deep neural networks, DNN) -based d-vector method, or Neural Network (NNET) -based x-vector, etc., the voiceprint model employed in the embodiments of the present application is not particularly limited.
In the embodiment of the present application, the terminal device may extract acoustic features using one or more of the following methods, for example: mel-scale frequency cepstral coefficients (mel-scale frequency cepstral coefficients, MFCC), filter bank (FBank), or linear prediction coefficients (linear prediction coefficient, LPC). The method for extracting acoustic features is not specifically limited in the embodiment of the present application.
It can be understood that the terminal device may also obtain the voiceprint vector corresponding to the voiceprint data based on other methods, for example, a trained neural network model, which is not limited in the embodiment of the present application.
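To make the feature-extraction and voiceprint-model steps above concrete, here is a minimal, purely illustrative Python sketch: it computes crude FBank-style band energies per frame and mean-pools them into one normalized embedding. The function names, band layout, frame sizes, and the pooling step are all assumptions for illustration; a real GMM, i-vector, or x-vector model would replace the simple pooling.

```python
import numpy as np

def fbank_features(signal, sr=16000, frame_len=400, hop=160, n_bands=24):
    """Crude FBank-style features: log energies in linearly spaced
    frequency bands per windowed frame (a stand-in for a mel filter bank)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        bands = np.array_split(spectrum, n_bands)
        frames.append([np.log(b.sum() + 1e-10) for b in bands])
    return np.array(frames)          # shape: (n_frames, n_bands)

def voiceprint_vector(signal, sr=16000):
    """Toy 'voiceprint model': mean-pool frame features into one
    fixed-length, L2-normalised embedding (a d-vector-like pooling)."""
    feats = fbank_features(signal, sr)
    vec = feats.mean(axis=0)
    return vec / np.linalg.norm(vec)
```

Because the embedding is L2-normalised, two such vectors can be compared directly with a cosine score, which matches the scoring options the text lists later.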
S304, the terminal equipment filters high-frequency signals above 2000hz from voice print data of the registered user.
By way of example, the terminal device may filter out high-frequency signals above 2000hz in the voice print data of the registered user using a low-pass filter, such as a Butterworth filter or a Chebyshev filter.
It can be understood that, because the mask is essentially equivalent to a low-pass filter, high-frequency signals above 2000hz in voice print data of the registered user can be filtered, so that the terminal device can simulate voice print data of the user when wearing the mask through a processing mode of filtering the high-frequency signals in the voice print data of the registered user. The frequency range of the filtered high-frequency signal may not be limited to 2000hz or more, or may be a value of 1000hz or more, or 3000hz or more, or may be a frequency range such as 2000hz-7000hz, which is not limited in the embodiment of the present application.
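As a rough illustration of the S304 filtering step, the sketch below zeroes all FFT bins above the cutoff. This brick-wall filter is a stand-in for the Butterworth or Chebyshev low-pass filters named above; the function name, sample rate, and brick-wall choice are assumptions for illustration only.

```python
import numpy as np

def simulate_mask(signal, sr=16000, cutoff=2000):
    """Approximate the mask's low-pass effect by zeroing all FFT bins
    above `cutoff` Hz (a brick-wall stand-in for the Butterworth /
    Chebyshev low-pass filters mentioned in S304)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    spectrum[freqs > cutoff] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))
```

The cutoff is a parameter because, as the text notes, the filtered range could instead start at 1000hz or 3000hz, or cover a band such as 2000hz-7000hz.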
S305, the terminal equipment calculates a registered voiceprint vector B corresponding to the registered user voiceprint data after filtering the high-frequency signals based on the voiceprint model.
The method for calculating the registered voiceprint vector B corresponding to the registered user voiceprint data after filtering the high-frequency signal based on the voiceprint model by the terminal device may refer to the description in the step shown in S303, and will not be described herein.
S306, the terminal equipment performs noise reduction processing on the voice print data of the registered user to obtain the voice print data of the registered user after the noise reduction processing.
For example, the terminal device may perform noise reduction processing on the voice print data of the registered user by adopting noise reduction processing methods such as NN noise reduction, so as to filter out environmental noise in the voice print data of the registered user and obtain voiceprint data with a better sound effect. The training process of NN noise reduction generally uses a large amount of training data so that the model can identify voiceprint data, and a cost function commonly used in model training may be: mean square error (mean square error, MSE), or scale-invariant signal-to-noise ratio (scale invariant signal to noise ratio, SI-SNR), etc. For example, the SI-SNR may be formulated as:

SI-SNR = 10·log10( ||S_target||² / ||E_noise||² ), where S_target = ( <Ŝ, S> / ||S||² )·S and E_noise = Ŝ − S_target,

wherein S can be understood as the voice print data of the registered user, Ŝ can be understood as the noise-reduced voice print data of the registered user, ||·|| may represent the Euclidean norm, and <·,·> may represent the dot product.
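Under the SI-SNR definition discussed above (S denoting the clean registered-user signal and Ŝ the noise-reduced estimate), a minimal numpy sketch of the cost computation might look as follows. The mean-subtraction and the epsilon guard are common conventions added here for numerical stability, not mandated by the text.

```python
import numpy as np

def si_snr(s, s_hat, eps=1e-8):
    """Scale-invariant SNR in dB between reference s and estimate s_hat:
    project s_hat onto s to get the target component, treat the residual
    as noise, and return the log power ratio of the two."""
    s = s - s.mean()
    s_hat = s_hat - s_hat.mean()
    s_target = (np.dot(s_hat, s) / (np.dot(s, s) + eps)) * s
    e_noise = s_hat - s_target
    return 10 * np.log10(np.dot(s_target, s_target) /
                         (np.dot(e_noise, e_noise) + eps))
```

Scale invariance is the point of the projection step: an estimate that is merely a rescaled copy of the reference still scores very high, which is why this cost suits denoising networks whose output gain is arbitrary.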
It can be understood that the noise reduction method and the cost function involved in the noise reduction method are not limited in the embodiment of the present application.
S307, the terminal equipment calculates the energy ratio K1 of the low-frequency signal in the voice print data of the registered user after the noise reduction processing and the high-frequency signal in the voice print data of the registered user after the noise reduction processing.
In the embodiment of the application, the low-frequency signal can be a signal with the signal frequency range meeting the range of 250hz-2000hz and the like; the high frequency signal may be a signal whose frequency range satisfies the range of 2000hz to 7000hz or the like. The value of the signal frequency range in the low-frequency signal and the value of the signal frequency range in the high-frequency signal may be other values, which are not limited in the embodiment of the present application.
Based on the above, after the terminal device calculates the registered voiceprint vector A, the registered voiceprint vector B, and the value K1 for the first time, the terminal device may store them, so that speaker voiceprint data detected in a voiceprint recognition scenario can be judged based on these three pieces of data. The registered voiceprint vector A and the registered voiceprint vector B may be stored in a preset database so that the terminal device can use them at any time; other voiceprint vectors of the registered user may also be stored in the preset database.
In the embodiment corresponding to fig. 3, the terminal device may perform voiceprint recognition based on the embodiment corresponding to fig. 4 below, on the basis of the registered voiceprint vector a, the registered voiceprint vector B, and the value K1 obtained by the terminal device based on voiceprint registration of the registered user.
Fig. 4 is a schematic flow chart of a voiceprint recognition method according to an embodiment of the present application. In the embodiment corresponding to fig. 4, the preset data stores a registered voiceprint vector a and a registered voiceprint vector B.
As shown in fig. 4, the voiceprint recognition method may include the steps of:
s401, the terminal equipment acquires voice print data of a speaker.
The speaker voiceprint data may be voiceprint data obtained by the terminal device when the speaker does not wear the mask, or may be voiceprint data obtained by the terminal device when the speaker wears the mask, which is not specifically limited in the embodiment of the present application.
S402, the terminal equipment detects wake-up words.
In a possible implementation manner, after the wake-up word detection is passed, the terminal device may synchronously execute the steps shown in S403 to obtain a speaker voiceprint vector C, execute the steps shown in S405 to S406 to obtain a speaker voiceprint vector D, and/or execute the steps shown in S410 to S411 to obtain a value K2; further, the terminal device may temporarily store the speaker voiceprint vector C, the speaker voiceprint vector D, and/or the value K2 for subsequent use.
S403, the terminal equipment calculates a speaker voiceprint vector C corresponding to the speaker voiceprint data based on the voiceprint model.
It can be understood that the method for calculating the speaker voiceprint vector C corresponding to the speaker voiceprint data by the terminal device based on the voiceprint model can be referred to the description in the step shown in S303, and will not be described herein.
S404, the terminal equipment judges whether the similarity score of the speaker voiceprint vector C and the registered voiceprint vector A is larger than T1.
In the embodiment of the present application, when the terminal device determines that the similarity score between the speaker voiceprint vector C and the registered voiceprint vector a is greater than (or equal to) T1, the terminal device may execute the step shown in S408; alternatively, when the terminal device determines that the similarity score of the speaker voiceprint vector C and the registered voiceprint vector a is equal to or less than T1, the terminal device may perform the step shown in S406.
For example, the terminal device may calculate the similarity score between the speaker voiceprint vector C and the registered voiceprint vector a by using cosine (cosine) score, probability linear discriminant analysis (probabilistic linear discriminant analysis, PLDA) and other methods, which is not limited in the embodiment of the present application.
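As an illustration of the cosine scoring option mentioned above, a minimal sketch follows. The `accept` helper and its threshold value are hypothetical additions for illustration, and a PLDA scorer could be substituted for `cosine_score` as the text notes.

```python
import numpy as np

def cosine_score(v1, v2):
    """Cosine similarity between two voiceprint vectors, in [-1, 1]."""
    return float(np.dot(v1, v2) /
                 (np.linalg.norm(v1) * np.linalg.norm(v2)))

def accept(v1, v2, threshold=0.7):
    """Hypothetical decision rule: pass when the score exceeds the
    threshold (T1 or T2 in the text; 0.7 is an illustrative value)."""
    return cosine_score(v1, v2) > threshold
```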
It can be understood that, in a scenario where the speaker performs voiceprint recognition without wearing a mask, by setting the threshold T1 the terminal device can ensure that only a voice with extremely high similarity to the registered user's voice, such as the registered user's own unmasked voice, passes voiceprint recognition, so that the terminal device can accurately recognize the user's voice in the no-mask scenario.
S405, the terminal equipment filters high-frequency signals above 2000hz from voice print data of the speaker.
S406, the terminal equipment calculates a speaker voiceprint vector D corresponding to speaker voiceprint data after filtering the high-frequency signals based on the voiceprint model.
It can be understood that the definition of the high-frequency signal in the steps shown in S405-S406 and the method for calculating the speaker voiceprint vector D can be described in the steps shown in S304-S305, and will not be described herein.
S407, the terminal equipment judges whether the similarity score of the speaker voiceprint vector D and the registered voiceprint vector B is larger than T2.
In the embodiment of the present application, when the terminal device determines that the similarity score between the speaker voiceprint vector D and the registered voiceprint vector B is greater than (or equal to) T2, the terminal device may execute the step shown in S408; alternatively, when the terminal device determines that the similarity score of the speaker voiceprint vector D and the registered voiceprint vector B is equal to or less than T2, the terminal device may perform the step shown in S409.
It can be understood that, in the scenario where the speaker wears a mask during voiceprint recognition, since the high-frequency signals above 2000hz have been filtered out of the speaker voiceprint vector D, both the speaker voiceprint vector D and the registered voiceprint vector B may be based on low-frequency signals that are not affected by the mask. Therefore, when the terminal device performs voiceprint recognition using the similarity between the speaker voiceprint vector D and the registered voiceprint vector B, the influence of the mask can be ignored, making the voiceprint recognition result more accurate.
S408, the terminal equipment determines that the judgment is successful.
For example, when a terminal device in a sleep state is woken up by voice, the terminal device may be woken up when it determines that the decision is successful; for instance, the terminal device may light up the screen and play a voice message. When the user wakes the terminal device with "hello YOYO", after the decision is successful the terminal device may play "I am here" or another voice message.
S409, the terminal equipment determines that the judgment fails.
For example, when the terminal device determines that the currently received voiceprint data does not belong to voiceprint data of the registered user, the present round of verification fails. For example, when a terminal device in a sleep state is awakened by voice, when the terminal device determines that the decision fails, the terminal device cannot be awakened, and thus the terminal device can continue to maintain the sleep state.
It can be appreciated that, based on the steps shown in S401-S409, the terminal device may implement voiceprint recognition of a speaker in a mask-wearing scene. In a possible implementation manner, the terminal device may also determine whether to add the speaker voiceprint vector C to a preset database based on the steps shown in S410-S413 described below.
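The two-stage decision of S404-S409 above can be summarized in a small sketch, assuming the similarity scores for the C-vs-A and D-vs-B comparisons have already been computed. The function name and the threshold values are placeholders.

```python
def wake_decision(score_c_a, score_d_b, t1=0.8, t2=0.7):
    """Two-stage decision of S404/S407: first compare speaker vector C
    against registered vector A with threshold T1; only on failure fall
    back to the mask-simulated pair (D vs B) with threshold T2.
    Score and threshold values here are illustrative placeholders."""
    if score_c_a > t1:
        return True      # S408: no-mask path succeeds
    if score_d_b > t2:
        return True      # S408: mask path succeeds
    return False         # S409: decision fails
```

In a real implementation the D-vs-B score would only need to be computed when the first comparison fails, which is exactly the ordering S404 and S407 describe.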
S410, the terminal equipment performs noise reduction processing on the voice print data of the speaker to obtain the voice print data of the speaker after the noise reduction processing.
S411, the terminal equipment calculates the energy ratio K2 of the low-frequency signal in the voice print data of the speaker after the noise reduction processing and the high-frequency signal in the voice print data of the speaker after the noise reduction processing.
It is understood that the noise reduction processing method, the definition of the low frequency signal, the definition of the high frequency signal, and the method for calculating the value K2 in the steps shown in S410-S411 can be described in the steps shown in S306-S307, and will not be described herein.
S412, when the wake-up word decision is successful and no voiceprint vector exists in the K2/K1 range in the preset database, the terminal device performs speaker confirmation.
In the embodiment of the present application, the preset database may store not only the registered voiceprint vector A and the registered voiceprint vector B, but also the correspondence between different numerical ranges and voiceprint vectors. The different numerical ranges may be divided based on the value of K2/K1; for example, the value ranges of K2/K1 may be: (a1, b1), (a2, b2), (a3, b3), etc.
It will be appreciated that the value ranges of K2/K1 may be related to the different degrees of sound attenuation when different masks are worn. Fig. 5 is a schematic diagram illustrating sound attenuation according to an embodiment of the present application. In the embodiment corresponding to fig. 5, the abscissa may be the frequency of the sound and the ordinate may be the attenuation (in dB) caused by wearing a mask; the solid line can be understood as a simple mask and the dotted line as a special mask, such as an N95 mask.
As shown in fig. 5, when the user wears a simple mask, the high-frequency signal from 2000hz to 7000hz is reduced by about 3dB-4dB, whereas when the user wears an N95 mask, the high-frequency signal from 2000hz to 7000hz is reduced by about 12dB. It is understood that wearing a mask attenuates the high-frequency signal by roughly 3dB to 12dB.
Thus, the terminal device may divide the attenuation range from 3dB to 12dB, for example into 3dB-6dB, 6dB-9dB, and 9dB-12dB, so as to correspond to the attenuation degrees of different types of masks. Further, when the attenuation range is 3dB-6dB, the corresponding a1 may be 2 and b1 may be 4, so the value range of K2/K1 may be (2, 4); when the attenuation range is 6dB-9dB, the corresponding a2 may be 4 and b2 may be 8, so the value range of K2/K1 may be (4, 8); when the attenuation range is 9dB-12dB, the corresponding a3 may be 8 and b3 may be 16, so the value range of K2/K1 may be (8, 16).
It is understood that the number of K2/K1 ranges and the endpoint values of those ranges may take other values, which is not limited in the embodiment of the present application.
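The endpoint values 2, 4, 8, and 16 above are consistent with converting the dB attenuation bounds to linear energy ratios via 10^(dB/10) (3 dB corresponds to roughly a factor of 2 in energy, 6 dB to a factor of 4, and so on). A minimal sketch of this correspondence, under the assumption that K2/K1 is an energy ratio:

```python
def attenuation_db_to_ratio_bounds(low_db, high_db):
    """Convert a high-frequency attenuation range in dB to (lower, upper)
    bounds on the energy ratio K2/K1, via the power relation 10**(dB/10)."""
    return (10 ** (low_db / 10), 10 ** (high_db / 10))

# The three attenuation bands map approximately onto (2, 4), (4, 8), (8, 16):
for low_db, high_db in [(3, 6), (6, 9), (9, 12)]:
    a, b = attenuation_db_to_ratio_bounds(low_db, high_db)
    print(f"{low_db}-{high_db} dB -> ({a:.1f}, {b:.1f})")
```

The exact conversions are 1.995, 3.98, 7.94, and 15.85, so the stated endpoints appear to be rounded to convenient integers.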
In a possible implementation manner, the terminal device may also directly add the voiceprint vector C of a speaker who has passed voiceprint recognition to the preset database; in this way, the preset database may store voiceprint vectors of the speaker in different scenes. Furthermore, the terminal device may determine whether voiceprint recognition passes based on the similarity between a voiceprint vector stored in the preset database and the currently received voiceprint vector.
In a possible implementation manner, the terminal device may perform speaker verification based on video verification or triggering of the prompt information in the terminal device by the user, so as to further verify whether the current speaker is a registered user.
It can be understood that, since the speaker voiceprint vector D and the registered voiceprint vector B may both be derived from low-frequency signals whose high-frequency components have been filtered out, a similarity comparison based on these two low-frequency signals alone may carry a risk of misidentification. The terminal device may therefore, based on speaker confirmation, add the speaker voiceprint vector C, which has a higher similarity with the registered user, to the preset database for subsequent use.
S413, under the condition that the speaker confirms passing, the terminal device adds the speaker voiceprint vector C into a preset database.
For example, the terminal device may, based on the value of K2/K1, add the speaker voiceprint vector C under the matching K2/K1 range in the preset database. For example, when the terminal device determines that K2/K1 falls within the range (a1, b1), the terminal device may add the speaker voiceprint vector C under the range (a1, b1); the preset database may then store the correspondence between (a1, b1) and the speaker voiceprint vector C.
Based on this voiceprint recognition method, the terminal device can not only realize voiceprint recognition in the mask-wearing scene, but can also add voiceprint vectors that pass voiceprint recognition to the preset database, so that the terminal device can subsequently perform voiceprint recognition based on the voiceprint vectors in the preset database.
In the embodiment corresponding to fig. 3, the terminal device obtains the registered voiceprint vector A, the registered voiceprint vector B, and the value K1 based on the voiceprint registration of the registered user; in addition, after the terminal device adds a successfully judged speaker voiceprint vector to the preset database based on the embodiment corresponding to fig. 4, the terminal device may perform voiceprint recognition based on the embodiment corresponding to fig. 6 described below.
Fig. 6 is a schematic flow chart of another voiceprint recognition method according to an embodiment of the present application. In the embodiment corresponding to fig. 6, the preset database stores: the registered voiceprint vector A, the registered voiceprint vector B, the correspondence between the speaker voiceprint vector X1 and (a1, b1), the correspondence between the speaker voiceprint vector X2 and (a2, b2), and the correspondence between the speaker voiceprint vector X3 and (a3, b3). The speaker voiceprint vector X1, the speaker voiceprint vector X2, and the speaker voiceprint vector X3 may be voiceprint vectors added to the preset database by the terminal device based on the embodiment corresponding to fig. 4.
As shown in fig. 6, in the case that a plurality of voiceprint vectors are stored in the preset database, the voiceprint recognition method performed for a speaker (who may be the same as the speaker in the embodiment corresponding to fig. 4) may include the following steps:
S601, the terminal equipment acquires voiceprint data of the speaker.
S602, the terminal equipment detects the wake-up word.
In a possible implementation manner, after the wake-up word detection is passed, the terminal device may synchronously execute the steps shown in S603 to obtain a speaker voiceprint vector C, execute the steps shown in S609-S610 to obtain a speaker voiceprint vector D, and/or execute the steps shown in S605-S606 to obtain a value K2; further, the terminal device may temporarily store the speaker voiceprint vector C, the speaker voiceprint vector D, and/or the value K2.
S603, the terminal equipment calculates a speaker voiceprint vector C corresponding to the speaker voiceprint data based on the voiceprint model.
It can be understood that the method for calculating the speaker voiceprint vector C corresponding to the speaker voiceprint data by the terminal device based on the voiceprint model can be referred to the description in the step shown in S303, and will not be described herein.
S604, the terminal equipment judges whether the similarity score of the speaker voiceprint vector C and the registered voiceprint vector A is larger than T1.
In the embodiment of the present application, when the terminal device determines that the similarity score between the speaker voiceprint vector C and the registered voiceprint vector A is greater than (or equal to) T1, the terminal device may execute the step shown in S612; alternatively, when the terminal device determines that the similarity score of the speaker voiceprint vector C and the registered voiceprint vector A is less than (or equal to) T1, the terminal device may perform the step shown in S607.
It can be understood that the method by which the terminal device determines the similarity score between the speaker voiceprint vector C and the registered voiceprint vector A can be referred to the description of the step shown in S404, and is not repeated here.
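The patent does not fix the similarity metric or the value of T1. The sketch below assumes cosine similarity between embedding vectors and a hypothetical threshold, purely for illustration of the S604 decision:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two voiceprint embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

T1 = 0.8  # hypothetical threshold, not taken from the patent
vector_c = [0.12, 0.88, 0.21]  # speaker voiceprint vector C (made-up values)
vector_a = [0.10, 0.90, 0.20]  # registered voiceprint vector A (made-up values)
score = cosine_similarity(vector_c, vector_a)
passed = score > T1  # True -> proceed to S612; False -> proceed to S607
```

Any embedding-space similarity (e.g., PLDA scoring) could play the same role; the branch structure is what matters here.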
S605, the terminal equipment performs noise reduction processing on the voice print data of the speaker to obtain the voice print data of the speaker after the noise reduction processing.
S606, the terminal equipment calculates the energy ratio K2 of the low-frequency signal in the voice print data of the speaker after the noise reduction processing and the high-frequency signal in the voice print data of the speaker after the noise reduction processing.
It is understood that the noise reduction processing method, the definition of the low frequency signal, the definition of the high frequency signal, and the method for calculating the value K2 in the steps shown in S605-S606 can be described in the steps shown in S306-S307, and will not be described herein.
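The exact band-energy computation is not specified in this section. One plausible reading of K2, sketched below, takes the noise-reduced signal and computes the ratio of low-band to high-band energy from its magnitude spectrum, with the 2000 Hz split borrowed from S609 (an assumption, not a stated parameter of S606):

```python
import numpy as np

def low_high_energy_ratio(samples, sample_rate, split_hz=2000.0):
    """Energy below split_hz divided by energy at or above split_hz."""
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return power[freqs < split_hz].sum() / power[freqs >= split_hz].sum()

# A 500 Hz tone plus a half-amplitude 3000 Hz tone: the energy ratio is
# the squared amplitude ratio, i.e. about 4.
sr = 16000
t = np.arange(sr) / sr
speech_like = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
k2 = low_high_energy_ratio(speech_like, sr)
```

With real speech, the same function would be applied to the noise-reduced frames rather than a synthetic tone.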
S607, the terminal equipment judges whether a voiceprint vector exists in the range of K2/K1 in the preset database.
In the embodiment of the present application, when the terminal device determines that a voiceprint vector exists under the range containing K2/K1 in the preset database, the terminal device may extract that voiceprint vector, for example the speaker voiceprint vector X, and execute the step shown in S608; or, when the terminal device determines that no voiceprint vector exists under the range containing K2/K1 in the preset database, the terminal device may perform the step shown in S610. The speaker voiceprint vector X may be any voiceprint vector in the preset database other than the registered voiceprint vector A and the registered voiceprint vector B.
In a possible implementation manner, when the terminal device determines that the K2/K1 has a range in a preset database and that the K2/K1 has a voiceprint vector in the range in the preset database, the terminal device may extract the voiceprint vector, for example, extract the speaker voiceprint vector X, and execute the step shown in S608; or when the terminal device determines that the K2/K1 does not have a range in the preset database, or that the K2/K1 has a range in the preset database and that the K2/K1 does not have a voiceprint vector in the range in the preset database, the terminal device may execute the step shown in S610.
S608, the terminal equipment judges whether the similarity score of the speaker voiceprint vector C and the speaker voiceprint vector X is larger than T3.
In the embodiment of the present application, when the terminal device determines that the similarity score between the speaker voiceprint vector C and the speaker voiceprint vector X is greater than (or equal to) T3, the terminal device may execute the step shown in S612; alternatively, when the terminal device determines that the similarity score of the speaker voiceprint vector C and the speaker voiceprint vector X is equal to or less than (or less than) T3, the terminal device may perform the step shown in S610.
It can be understood that, in the step shown in S604, the terminal device compares the speaker voiceprint vector C for similarity against the registered voiceprint vector A. The registered voiceprint vector A is obtained from the registered user in a scene without a mask, whereas the speaker voiceprint vector C is likely to be obtained in a scene where the user wears a mask. Using the registered voiceprint vector A alone, the terminal device therefore cannot recognize the user's voice across different scenes or voice states; for example, it may fail to recognize the voice of a mask-wearing user based on the registered voiceprint vector A, resulting in a low success rate of voiceprint recognition. Thus, the terminal device can retrieve from the preset database a speaker voiceprint vector X that may have been obtained historically in a mask-wearing scene, and judge its similarity with the speaker voiceprint vector C obtained in the current scene, thereby improving the success rate of voiceprint recognition.
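The lookup in S607 can be sketched as a search over the K2/K1 ranges stored in the preset database; the dictionary layout below is an assumption for illustration, not the patent's actual data structure:

```python
def lookup_voiceprint(db, ratio):
    """Return the stored voiceprint vector whose K2/K1 range contains
    `ratio`, or None when no range matches or the matching range has no
    vector yet (in which case the method falls back to S610-S611)."""
    for (lower, upper), vector in db.items():
        if lower < ratio < upper and vector is not None:
            return vector
    return None

preset_db = {(2, 4): [0.1, 0.9], (4, 8): None}  # illustrative contents
assert lookup_voiceprint(preset_db, 3.0) == [0.1, 0.9]  # vector X found -> S608
assert lookup_voiceprint(preset_db, 5.0) is None        # range exists, no vector -> S610
assert lookup_voiceprint(preset_db, 20.0) is None       # no matching range -> S610
```

The three branches mirror the three outcomes described for S607 and its possible implementation above.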
S609, the terminal equipment filters out high-frequency signals above 2000 Hz in the voiceprint data of the speaker.
S610, the terminal equipment calculates a speaker voiceprint vector D corresponding to speaker voiceprint data after filtering high-frequency signals based on the voiceprint model.
It can be understood that the definition of the high-frequency signal in the steps shown in S609-S610 and the method for calculating the speaker voiceprint vector D can be described in the steps shown in S304-S305, and are not described herein.
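The filtering in S609 is not specified beyond its 2000 Hz cutoff. A minimal FFT-based sketch is shown below; a production system would more likely use a proper low-pass filter design (e.g., a Butterworth filter), so this is a stand-in, not the patent's method:

```python
import numpy as np

def remove_above(samples, sample_rate, cutoff_hz=2000.0):
    """Zero out spectral components above cutoff_hz and reconstruct the
    time-domain signal, keeping only the low-frequency content."""
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(samples))

sr = 16000
t = np.arange(sr) / sr
tone_3000 = np.sin(2 * np.pi * 3000 * t)  # lies above the cutoff
filtered = remove_above(tone_3000, sr)     # close to all zeros
```

The voiceprint vector D of S610 would then be computed from the output of such a filter rather than from the raw signal.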
S611, the terminal device judges whether the similarity score of the speaker voiceprint vector D and the registered voiceprint vector B is larger than T2.
In the embodiment of the present application, when the terminal device determines that the similarity score between the speaker voiceprint vector D and the registered voiceprint vector B is greater than (or equal to) T2, the terminal device may execute the step shown in S612; alternatively, when the terminal device determines that the similarity score of the speaker voiceprint vector D and the registered voiceprint vector B is equal to or less than T2 (or less than T2), the terminal device may perform the step shown in S613.
S612, the terminal equipment determines that the judgment is successful.
S613, the terminal equipment determines that the judgment fails.
It will be appreciated that, based on the steps shown in S601-S613 above, the terminal device may implement voiceprint recognition for the speaker. In a possible implementation manner, the terminal device may also determine, based on the steps shown in S614-S616 below, whether to add the speaker voiceprint vector C, or the fused voiceprint vector corresponding to it, to the preset database. The fused voiceprint vector may be a voiceprint vector obtained by fusing the speaker voiceprint vector C with the speaker voiceprint vector X.
S614, the terminal equipment determines whether a voiceprint vector exists in the range of K2/K1 in a preset database.
In the embodiment of the present application, when the terminal device determines that the voice print vector exists in the range of K2/K1 in the preset database, the terminal device may extract the voice print vector, for example, extract the voice print vector X of the speaker, and execute the step shown in S615; or, when the terminal device determines that there is no voiceprint vector in the range of K2/K1 in the preset database, the terminal device may perform the step shown in S616.
S615, the terminal equipment fuses the voice print vector C of the speaker and the voice print vector X of the speaker to obtain a fused voice print vector, and the fused voice print vector is added to a range of K2/K1 in a preset database.
It can be understood that the speaker voiceprint vector C and the speaker voiceprint vector X may be voiceprint vectors obtained from the speaker in different scenes, so each may deviate to some extent from the user's true voiceprint vector. Therefore, by fusing the two voiceprint vectors, the terminal device can obtain a fused voiceprint vector closer to the user's real voice. Further, after the terminal device adds the fused voiceprint vector to the preset database, the terminal device can judge the speaker's voice more accurately based on this fused voiceprint vector, thereby enhancing the accuracy of the voiceprint recognition method.
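The patent leaves the fusion operation of S615 unspecified. An element-wise mean of the two embeddings, re-normalized so the result remains unit-length for cosine scoring, is one simple choice and is what the sketch below assumes:

```python
import numpy as np

def fuse_voiceprints(vec_c, vec_x):
    """Element-wise mean of two voiceprint embeddings, re-normalized to
    unit length so the fused vector stays comparable under cosine scoring."""
    fused = (np.asarray(vec_c, dtype=float) + np.asarray(vec_x, dtype=float)) / 2.0
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused
```

Weighted averaging (e.g., favoring the more recent vector) would fit the same slot; the key property is that the fused vector lies between the two scene-specific observations.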
S616, the terminal equipment confirms the speaker, and adds the voice print vector C of the speaker to the range of K2/K1 in a preset database after the speaker confirms the speaker.
It can be understood that the method for speaker verification and the step of adding the speaker voiceprint vector C to the range of K2/K1 in the preset database can be described in the step shown in S412, and will not be described herein.
Based on this voiceprint recognition method, the terminal device can not only perform voiceprint recognition based on the data in the preset database, but can also fuse a newly acquired voiceprint vector with a voiceprint vector in the preset database and store the fused voiceprint vector in the preset database, thereby enhancing the flexibility and accuracy of voiceprint recognition.
On the basis of the embodiment corresponding to fig. 3, the terminal device may perform voiceprint registration based on the embodiment corresponding to fig. 7 described below. Fig. 7 is an interface schematic diagram of voiceprint registration according to an embodiment of the present application. In the embodiment corresponding to fig. 7, a terminal device is taken as an example for a mobile phone to be described as an example, which does not limit the embodiment of the present application.
When the mobile phone receives an operation from the user for setting the voice wake-up function, the mobile phone may display the interface shown as a in fig. 7, where the interface may include: a control for setting user information (e.g., a control displayed as "My"), a control for setting power-key wake-up, a control 701 for setting voice wake-up, a control for the user to view more functions, and the like.
In the interface shown as a in fig. 7, when the mobile phone receives the operation that the user triggers the control 701 for setting voice wakeup (or turning on voiceprint recognition), the mobile phone may display the interface shown as b in fig. 7. A control 702 for turning on voice wakeup, etc., is included in the interface shown as b in fig. 7.
In the interface shown as b in fig. 7, when the handset receives the operation of the user triggering the control 702 for turning on voice wakeup, the handset may display the interface shown as c in fig. 7. The interface shown in c in fig. 7 may include: prompt 703, a confirmation control corresponding to the prompt 703, a cancel control corresponding to the prompt 703, and the like. The prompt 703 may be used to instruct the user to perform voiceprint registration, for example, the prompt 703 may be displayed as: when the voice wake-up function is first turned on, the device needs to record your voice. It will be appreciated that the content displayed in the prompt 703 is not limited in the embodiment of the present application.
In a possible implementation manner, the terminal device may also support the user to register a plurality of voiceprint data in the device, so that the terminal device can recognize the voices of a plurality of users when the voice wakes up.
Based on the voice print registration, the user can register voice print according to the prompt interface corresponding to the voice wake-up function, so that the terminal equipment can recognize voice print according to voice print data recorded during the registration of the user.
On the basis of the embodiment corresponding to fig. 4 or fig. 6, the terminal device may perform speaker verification based on the embodiment corresponding to fig. 8 described below. Fig. 8 is a schematic diagram of an interface for speaker verification according to an embodiment of the present application.
S412 in the embodiment corresponding to fig. 4 may be: under the condition that the wake-up word judgment is successful and no voiceprint vector exists in the range of K2/K1 in the preset database, the terminal equipment can carry out speaker confirmation based on the embodiment corresponding to FIG. 8. Alternatively, S616 in the corresponding embodiment of fig. 6 may be: the terminal device may perform speaker verification based on the corresponding embodiment of fig. 8.
As shown in fig. 8, the interface may be an interface displayed by the mobile phone after the mobile phone is awakened by voice, where the interface may include: file management control, email control, music control, gallery control, camera control, address book control, telephone control, information control, prompt information 801, confirmation control corresponding to the prompt information 801, cancellation control corresponding to the prompt information 801, and the like. The prompt 801 is used to instruct to perform authentication again, for example, the prompt 801 may be: a voice wake-up is detected requesting confirmation of whether it is the voice of the registered user of the device. It will be appreciated that the content displayed in the prompt 801 is not limited in this embodiment of the present application.
Based on this, the terminal device can confirm whether to add the voice of the user at the time of waking up the voice to a preset database based on the secondary voiceprint verification.
It should be understood that the interface provided by the embodiment of the present application is only an example, and is not limited to the embodiment of the present application.
The method provided by the embodiment of the present application is described above with reference to fig. 3 to 8, and the device for performing the method provided by the embodiment of the present application is described below. As shown in fig. 9, fig. 9 is a schematic structural diagram of a voiceprint recognition device according to an embodiment of the present application, where the voiceprint recognition device may be a terminal device in the embodiment of the present application, or may be a chip or a chip system in the terminal device.
As shown in fig. 9, the voiceprint recognition apparatus 90 may be used in a communication device, circuit, hardware component, or chip, and includes: an acquisition unit 901, a display unit 902, and a processing unit 903. The acquisition unit 901 is used to support the voiceprint recognition apparatus 90 in performing the voice acquisition step; the display unit 902 is used to support the voiceprint recognition apparatus 90 in performing the display step; and the processing unit 903 is used to support the voiceprint recognition apparatus 90 in performing the information processing steps.
The embodiment of the application provides a voiceprint recognition apparatus 90, which is configured to acquire a first voice and obtain a first voiceprint vector corresponding to the first voice; when the terminal device determines that the first voice meets a preset condition, the processing unit 903 is configured to obtain a similarity score of the first voiceprint vector and a preset second voiceprint vector, so as to obtain a first numerical value; when the first value is less than or equal to the first threshold, the processing unit 903 is further configured to obtain a similarity score between a third voiceprint vector and a preset fourth voiceprint vector, so as to obtain a second value; the third voiceprint vector is a voiceprint vector obtained after filtering out a preset frequency from the first voice, the fourth voiceprint vector is a voiceprint vector obtained after filtering out the preset frequency from the second voice, and the second voice is the voice corresponding to the second voiceprint vector; when the second value is greater than the second threshold, the processing unit 903 is further configured to execute a task corresponding to the first voice.
In a possible implementation manner, the processing unit 903 is specifically configured to obtain a ratio of energy of the first signal in the first voice to energy of the second signal in the first voice, so as to obtain a third value; wherein the signal frequency of the second signal is greater than the signal frequency of the first signal; when the terminal device determines that the first ratio of the third value to the fourth value does not meet the preset range in the database, or the first ratio meets the preset range and no corresponding relation exists between the preset range and the voiceprint vector, the processing unit 903 is further specifically configured to obtain a similarity score of the third voiceprint vector and the fourth voiceprint vector, so as to obtain a second value; the fourth value is a ratio of energy of the third signal in the second voice to energy of the fourth signal in the second voice, and a signal frequency of the fourth signal is greater than a signal frequency of the third signal.
In a possible implementation manner, when the terminal device determines that the first ratio meets the preset range but the preset range does not have a correspondence with a voiceprint vector, the processing unit 903 is further configured to establish a correspondence between the first voiceprint vector and the preset range in the database.
In one possible implementation manner, when the terminal device determines that the first ratio meets the preset range and the preset range does not have a corresponding relationship with the voiceprint vector, the display unit 902 is configured to display a first interface; the first interface comprises prompt information for indicating whether the first voice is of a preset user or not and a first control for indicating that the first voice is of the preset user or not; when the terminal device receives an operation for the first control, the processing unit 903 is further configured to establish a correspondence between the first voiceprint vector and a preset range in the database.
In one possible implementation manner, the processing unit 903 is specifically configured to perform noise reduction processing on the first voice to obtain a third voice; the processing unit 903 is specifically configured to obtain a ratio of the energy of the fifth signal in the third voice to the energy of the sixth signal in the third voice, so as to obtain a third value; wherein the fifth signal corresponds to the first signal and the sixth signal corresponds to the second signal.
In one possible implementation manner, when the terminal device determines that the first ratio meets the preset range and the preset range has a corresponding relationship with the voiceprint vector, the processing unit 903 is specifically configured to extract a fifth voiceprint vector corresponding to the preset range from the database; the processing unit 903 is further specifically configured to obtain a similarity score of the first voiceprint vector and the fifth voiceprint vector, to obtain a fifth numerical value; when the fifth value is greater than the third threshold, the processing unit 903 is further specifically configured to execute the task corresponding to the first voice.
In a possible implementation manner, the processing unit 903 is further configured to perform voiceprint fusion on the fifth voiceprint vector and the first voiceprint vector by using the terminal device to obtain a sixth voiceprint vector; the processing unit 903 is further configured to establish a correspondence between the sixth voiceprint vector and a preset range in the database.
In a possible implementation, when the first value is greater than the first threshold, the processing unit 903 is further configured to perform a task corresponding to the first voice.
In a possible implementation manner, when the terminal device receives an operation for turning on voiceprint recognition, the display unit 902 is further configured to display a second interface; the second interface comprises prompt information for indicating whether to acquire second voice of the preset user and a second control for indicating to acquire the second voice; when the terminal device receives an operation for the second control, the processing unit 903 is further configured to obtain a second voice.
In one possible implementation manner, the preset frequency is a frequency preset by the terminal device, and the preset frequency is used to indicate the frequency of the signal that is attenuated when a mask is worn.
In one possible implementation, the preset frequency is greater than 2000 hertz.
In a possible implementation, the voiceprint recognition device 90 may also include a communication unit 904. Specifically, the communication unit is configured to support the voiceprint recognition device 90 to perform the step of transmitting data and the step of receiving data. The communication unit 904 may be an input or output interface, a pin or circuit, or the like.
In a possible embodiment, the voiceprint recognition device may further include: a storage unit 905. The processing unit 903 and the storage unit 905 are connected by a line. The storage unit 905 may include one or more memories, which may be one or more devices, devices in a circuit for storing programs or data. The storage unit 905 may exist independently and is connected to the processing unit 903 provided in the voiceprint recognition apparatus through a communication line. The storage unit 905 may also be integrated with the processing unit 903.
The storage unit 905 may store computer-executable instructions of the method in the terminal device to cause the processing unit 903 to perform the method in the above-described embodiment. The storage unit 905 may be a register, a cache, a RAM, or the like, and the storage unit 905 may be integrated with the processing unit 903. The storage unit 905 may be a read-only memory (ROM) or other type of static storage device that may store static information and instructions, and the storage unit 905 may be independent of the processing unit 903.
Fig. 10 is a schematic diagram of a hardware structure of another terminal device according to an embodiment of the present application, as shown in fig. 10, where the terminal device includes a processor 1001, a communication line 1004, and at least one communication interface (the communication interface 1003 is exemplified in fig. 10).
The processor 1001 may be a general purpose central processing unit (central processing unit, CPU), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present application.
Communication line 1004 may include circuitry to communicate information between the components described above.
Communication interface 1003 uses any transceiver-like device for communicating with other devices or communication networks, such as ethernet, wireless local area network (wireless local area networks, WLAN), etc.
Possibly, the terminal device may also comprise a memory 1002.
The memory 1002 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (electrically erasable programmable read-only memory, EEPROM), a compact disc read-only memory (compact disc read-only memory, CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may exist independently and be coupled to the processor via the communication line 1004. The memory may also be integrated with the processor.
The memory 1002 is used for storing computer-executable instructions for performing the aspects of the present application, and is controlled by the processor 1001 for execution. The processor 1001 is configured to execute computer-executable instructions stored in the memory 1002, thereby implementing the voiceprint recognition method provided by the embodiment of the present application.
Possibly, the computer-executable instructions in the embodiments of the present application may also be referred to as application program codes, which are not limited in particular.
In a particular implementation, the processor 1001 may include one or more CPUs, such as CPU0 and CPU1 in fig. 10, as one embodiment.
In a specific implementation, as an embodiment, the terminal device may include a plurality of processors, such as processor 1001 and processor 1005 in fig. 10. Each of these processors may be a single-core (single-CPU) processor or may be a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
Fig. 11 is a schematic structural diagram of a chip according to an embodiment of the present application. Chip 110 includes one or more (including two) processors 1120 and a communication interface 1130.
In some implementations, the memory 1140 stores the following elements: executable modules or data structures, or a subset thereof, or an extended set thereof.
In an embodiment of the application, memory 1140 may include read only memory and random access memory and provide instructions and data to processor 1120. A portion of memory 1140 may also include non-volatile random access memory (non-volatile random access memory, NVRAM).
In an embodiment of the application, the processor 1120, the communication interface 1130, and the memory 1140 are coupled together by a bus system 1110. The bus system 1110 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. For ease of description, the various buses are labeled as the bus system 1110 in fig. 11.
The methods described above for embodiments of the present application may be applied to the processor 1120 or implemented by the processor 1120. The processor 1120 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuitry in hardware or by instructions in the form of software in the processor 1120. The processor 1120 may be a general-purpose processor (e.g., a microprocessor or a conventional processor), a digital signal processor (digital signal processing, DSP), an application-specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and the processor 1120 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application.
The steps of the methods disclosed in connection with the embodiments of the present application may be directly embodied as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium mature in the art, such as random access memory, read-only memory, programmable read-only memory, or electrically erasable programmable read-only memory (EEPROM). The storage medium is located in the memory 1140; the processor 1120 reads the information in the memory 1140 and completes the steps of the above methods in combination with its hardware.
In the above embodiments, the instructions stored by the memory for execution by the processor may be implemented in the form of a computer program product. The computer program product may be written in the memory in advance, or may be downloaded in the form of software and installed in the memory.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, or microwave). The computer-readable storage medium may be a magnetic medium, an optical medium, a semiconductor medium (e.g., a solid-state drive (SSD)), or the like.
The embodiment of the application also provides a computer readable storage medium. The methods described in the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. Computer readable media can include computer storage media and communication media and can include any medium that can transfer a computer program from one place to another. The storage media may be any target media that is accessible by a computer.
As one possible design, the computer-readable medium may include compact disc read-only memory (CD-ROM), RAM, ROM, EEPROM, or other optical disk storage; the computer-readable medium may also include magnetic disk storage or other magnetic storage devices. Moreover, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber-optic cable, twisted pair, DSL, or those wireless technologies are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
Combinations of the above should also be included within the scope of computer-readable media. The foregoing is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (14)
1. A method of voiceprint recognition, the method comprising:
the method comprises the steps that terminal equipment collects first voice to obtain a first voiceprint vector corresponding to the first voice;
when the terminal equipment determines that the first voice meets a preset condition, the terminal equipment obtains a similarity score of the first voiceprint vector and a preset second voiceprint vector to obtain a first numerical value;
when the first value is smaller than or equal to a first threshold value, the terminal equipment obtains a similarity score of a third voiceprint vector and a preset fourth voiceprint vector to obtain a second value; the third voiceprint vector is a voiceprint vector obtained by filtering out the preset frequency in the first voice, the fourth voiceprint vector is a voiceprint vector obtained by filtering out the preset frequency in the second voice, and the second voice is a voice corresponding to the second voiceprint vector;
and when the second value is larger than a second threshold value, the terminal equipment executes the task corresponding to the first voice.
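The two-stage comparison recited in claim 1 can be sketched as follows. This is an illustrative sketch only: the claims do not fix a particular scoring function, so cosine similarity is assumed here, and the threshold values and the names `cosine_score` and `verify` are hypothetical.

```python
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity score between two voiceprint vectors (cosine similarity assumed)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(first_vec: np.ndarray, second_vec: np.ndarray,
           third_vec: np.ndarray, fourth_vec: np.ndarray,
           first_threshold: float = 0.7, second_threshold: float = 0.6) -> bool:
    """Two-stage check: score the full-band vectors first; if that fails,
    fall back to the frequency-filtered vectors (e.g., the mask-wearing case)."""
    first_value = cosine_score(first_vec, second_vec)
    if first_value > first_threshold:
        return True                      # full-band match already succeeds
    second_value = cosine_score(third_vec, fourth_vec)
    return second_value > second_threshold
```

If the full-band score clears the first threshold, the fallback comparison is skipped entirely, matching the ordering of the claim elements.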
2. The method of claim 1, wherein the obtaining, by the terminal equipment, a similarity score of the third voiceprint vector and a preset fourth voiceprint vector to obtain the second value comprises:
the terminal equipment obtains the ratio of the energy of a first signal in the first voice to the energy of a second signal in the first voice to obtain a third numerical value; wherein the signal frequency of the second signal is greater than the signal frequency of the first signal;
when the terminal equipment determines that the first ratio of the third numerical value to the fourth numerical value does not meet a preset range in a database, or the first ratio meets the preset range and no corresponding relation exists between the preset range and the voiceprint vector, the terminal equipment obtains a similarity score of the third voiceprint vector and the fourth voiceprint vector to obtain the second numerical value;
the fourth value is a ratio of energy of a third signal in the second voice to energy of a fourth signal in the second voice, and a signal frequency of the fourth signal is greater than a signal frequency of the third signal.
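The energy ratio recited in claim 2 (energy of the lower-frequency "first signal" versus the higher-frequency "second signal") can be sketched as follows. The FFT-based band split, the 2000 Hz boundary (borrowed from claim 11), and the name `band_energy_ratio` are assumptions for illustration.

```python
import numpy as np

def band_energy_ratio(samples: np.ndarray, sample_rate: int,
                      split_hz: float = 2000.0) -> float:
    """Ratio of low-band energy to high-band energy of a speech segment.
    A mask attenuates high frequencies, so this ratio grows when one is worn."""
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    energy = np.abs(spectrum) ** 2
    low = energy[freqs <= split_hz].sum()    # "first signal": lower frequencies
    high = energy[freqs > split_hz].sum()    # "second signal": higher frequencies
    return float(low / high)
```

Comparing this ratio for the collected voice against the same ratio for the enrolled voice yields the "first ratio" that is tested against the preset range in the database.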
3. The method of claim 2, wherein after the terminal device performs the task corresponding to the first voice, the method further comprises:
when the terminal equipment determines that the first ratio meets the preset range and the corresponding relation between the preset range and the voiceprint vector does not exist, the terminal equipment establishes the corresponding relation between the first voiceprint vector and the preset range in the database.
4. A method according to claim 3, wherein when the terminal device determines that the first ratio meets the preset range and that there is no correspondence between the preset range and a voiceprint vector, the terminal device establishes a correspondence between the first voiceprint vector and the preset range in the database, comprising:
when the terminal equipment determines that the first ratio meets the preset range and the corresponding relation between the preset range and the voiceprint vector does not exist, the terminal equipment displays a first interface; the first interface comprises prompt information for indicating whether the first voice is the voice of a preset user or not and a first control for indicating that the first voice is the voice of the preset user;
when the terminal equipment receives the operation for the first control, the terminal equipment establishes a corresponding relation between the first voiceprint vector and the preset range in the database.
5. The method of claim 2, wherein the terminal device obtains a ratio of energy of a first signal in the first voice to energy of a second signal in the first voice to obtain a third value, comprising:
the terminal equipment performs noise reduction processing on the first voice to obtain a third voice;
the terminal equipment obtains the ratio of the energy of the fifth signal in the third voice to the energy of the sixth signal in the third voice to obtain the third numerical value; wherein the fifth signal corresponds to the first signal and the sixth signal corresponds to the second signal.
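Claim 5 does not specify a particular noise-reduction algorithm. As an illustrative stand-in (not the claimed implementation), a minimal spectral-subtraction sketch is shown below; the function name `denoise` and the separate noise-estimate input are hypothetical.

```python
import numpy as np

def denoise(samples: np.ndarray, noise_estimate: np.ndarray) -> np.ndarray:
    """Minimal spectral subtraction: subtract an estimated noise magnitude
    spectrum from the speech magnitude spectrum (floored at zero), keeping
    the original phase. Both inputs must have the same length."""
    spec = np.fft.rfft(samples)
    noise_mag = np.abs(np.fft.rfft(noise_estimate))
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # floor at zero energy
    phase = np.angle(spec)
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(samples))
```

The third value of claim 5 would then be computed on the denoised "third voice" rather than on the raw first voice.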
6. The method according to claim 2, wherein the method further comprises:
when the terminal equipment determines that the first ratio meets the preset range and the corresponding relation exists between the preset range and the voiceprint vector, the terminal equipment extracts a fifth voiceprint vector corresponding to the preset range from the database;
the terminal equipment obtains a similarity score of the first voiceprint vector and the fifth voiceprint vector to obtain a fifth numerical value;
and when the fifth value is larger than a third threshold value, the terminal equipment executes the task corresponding to the first voice.
7. The method of claim 6, wherein after the terminal device performs the task corresponding to the first voice, the method further comprises:
the terminal equipment performs voiceprint fusion on the fifth voiceprint vector and the first voiceprint vector to obtain a sixth voiceprint vector;
and the terminal equipment establishes a corresponding relation between the sixth voiceprint vector and the preset range in the database.
8. The method according to any one of claims 1-7, further comprising:
and when the first value is larger than the first threshold value, the terminal equipment executes the task corresponding to the first voice.
9. The method of claim 1, wherein prior to the terminal device collecting the first voice, the method further comprises:
when the terminal equipment receives an operation for starting voiceprint recognition, the terminal equipment displays a second interface; the second interface comprises prompt information for indicating whether to acquire the second voice of the preset user and a second control for indicating to acquire the second voice;
and when the terminal equipment receives the operation for the second control, the terminal equipment acquires the second voice.
10. The method according to claim 1, wherein the preset frequency is a frequency preconfigured by the terminal equipment, and the preset frequency indicates the frequency of signal components that are filtered out when a mask is worn.
11. The method of claim 10, wherein the preset frequency is greater than 2000 Hz.
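Filtering out the preset frequency (components above 2000 Hz, per claims 10 and 11) before extracting the third and fourth voiceprint vectors can be sketched with simple FFT zeroing. A proper low-pass filter would be used in practice, and the function name is hypothetical.

```python
import numpy as np

def filter_preset_frequency(samples: np.ndarray, sample_rate: int,
                            cutoff_hz: float = 2000.0) -> np.ndarray:
    """Remove components above the preset frequency so an enrolled utterance
    is compared only over the band a mask leaves intact. FFT zeroing is a
    crude stand-in for a real low-pass filter (it ignores edge effects)."""
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0      # drop everything above the cutoff
    return np.fft.irfft(spectrum, n=len(samples))
```

Applying the same filtering to both the first voice and the enrolled second voice keeps the third and fourth voiceprint vectors comparable band-for-band.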
12. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, causes the terminal device to perform the method according to any of claims 1 to 11.
13. A computer readable storage medium storing a computer program, which when executed by a processor causes a computer to perform the method of any one of claims 1 to 11.
14. A computer program product comprising a computer program which, when run, causes a computer to perform the method of any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210374386.7A CN116935858A (en) | 2022-04-11 | 2022-04-11 | Voiceprint recognition method and voiceprint recognition device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116935858A true CN116935858A (en) | 2023-10-24 |
Family
ID=88393127
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210374386.7A Pending CN116935858A (en) | 2022-04-11 | 2022-04-11 | Voiceprint recognition method and voiceprint recognition device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116935858A (en) |
2022-04-11: CN202210374386.7A patent application filed; published as CN116935858A; status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022033556A1 (en) | Electronic device and speech recognition method therefor, and medium | |
CN111696532B (en) | Speech recognition method, device, electronic equipment and storage medium | |
CN113393856B (en) | Pickup method and device and electronic equipment | |
WO2019105238A1 (en) | Method and terminal for speech signal reconstruction and computer storage medium | |
CN111933112A (en) | Awakening voice determination method, device, equipment and medium | |
CN111445901A (en) | Audio data acquisition method and device, electronic equipment and storage medium | |
CN114299933B (en) | Speech recognition model training method, device, equipment, storage medium and product | |
US20240013789A1 (en) | Voice control method and apparatus | |
CN105635452A (en) | Mobile terminal and contact person identification method thereof | |
CN114067776A (en) | Electronic device and audio noise reduction method and medium thereof | |
CN112233689A (en) | Audio noise reduction method, device, equipment and medium | |
CN115881118A (en) | Voice interaction method and related electronic equipment | |
CN113362836B (en) | Vocoder training method, terminal and storage medium | |
CN114333774A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
WO2023124248A1 (en) | Voiceprint recognition method and apparatus | |
CN112133319B (en) | Audio generation method, device, equipment and storage medium | |
CN111652624A (en) | Ticket buying processing method, ticket checking processing method, device, equipment and storage medium | |
CN116935858A (en) | Voiceprint recognition method and voiceprint recognition device | |
CN116386091A (en) | Fingerprint identification method and device | |
CN113506566B (en) | Sound detection model training method, data processing method and related device | |
CN114120987B (en) | Voice wake-up method, electronic equipment and chip system | |
CN111028846B (en) | Method and device for registration of wake-up-free words | |
CN113192531A (en) | Method, terminal and storage medium for detecting whether audio is pure music audio | |
CN116524919A (en) | Equipment awakening method, related device and communication system | |
CN114093368A (en) | Cross-device voiceprint registration method, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||