WO2021169365A1 - Method and apparatus for voiceprint recognition - Google Patents

Method and apparatus for voiceprint recognition (声纹识别的方法和装置)

Info

Publication number: WO2021169365A1
Authority: WIPO (PCT)
Prior art keywords: emotion, voiceprint, user, emotions, registered
Application number: PCT/CN2020/125337
Other languages: English (en), French (fr)
Inventors: 郎玥, 徐嘉明
Original Assignee: Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021169365A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; pattern matching strategies
    • G10L 17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state

Definitions

  • This application relates to the field of biometrics, and more specifically, to methods and devices for voiceprint recognition.
  • Voiceprint recognition distinguishes unknown voices by analyzing the characteristics of one or more speech signals. Simply put, it is a technology for determining whether a given utterance was spoken by a particular person.
  • The theoretical basis of voiceprint recognition is that each person's voice has unique features, through which the voices of different people can be effectively distinguished.
  • The basic principle of voiceprint recognition is to analyze the similarity between the spectra of voice signals.
  • The characteristics of the speech spectrum directly affect the result of voiceprint recognition. Generally, users are relatively calm when registering voiceprint templates.
  • In actual use, however, the user's emotions are diverse: sometimes the user is anxious, sometimes happy and excited. These emotions affect the characteristics of the speech spectrum, so emotional fluctuations have a negative impact on the accuracy of voiceprint recognition.
  • In an existing solution, an emotion detection method is used to detect the degree of deformation of emotional speech to calculate an emotion factor, and the speech changes caused by emotion are compensated at the model layer and the feature layer in the training and recognition phases. However, this solution relies on the accuracy of emotion detection when determining the emotion factor, and inaccurate emotion detection will reduce the accuracy of voiceprint recognition. In addition, the compensation of voice features will further affect the accuracy of voiceprint recognition.
  • The present application provides a method and device for voiceprint recognition. By matching the voice signal to be recognized with the voiceprint template under the same emotion, it helps reduce the influence of the user's emotional fluctuations on voiceprint recognition, thereby enhancing the robustness of voiceprint recognition.
  • In a first aspect, a method for voiceprint recognition is provided, including: acquiring the voice signal to be recognized of a user to be recognized; performing emotion recognition on the voice signal to be recognized to obtain a first emotion; obtaining the voiceprint template of a registered user under the first emotion; and determining, according to the voice signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user.
  • In the embodiments of the present application, emotion recognition is performed on the to-be-recognized voice signal of the user to be recognized to obtain the first emotion of the voice signal to be recognized, the voiceprint template of the registered user under the first emotion is obtained, and the voice signal to be recognized is matched against the voiceprint template to determine whether the user to be identified is the registered user. Therefore, by matching the voice signal to be recognized with the voiceprint template under the same emotion, the embodiments of the present application help reduce the influence of the user's emotional fluctuations on voiceprint recognition, help provide a consistent voiceprint recognition experience while the user is in different emotions, and thereby enhance the robustness of voiceprint recognition.
  • For example, the voiceprint template of the registered user can be matched against the voice feature vector of the voice signal to be recognized to obtain the similarity between the voice feature vector of the voice signal to be recognized and the voiceprint template of the registered user. It can then be judged whether the similarity is higher than a threshold. When the similarity is higher than the threshold, the user to be identified is determined to be the registered user; at this time, corresponding operations can be performed in response to the user's request, such as unlocking the smart terminal or opening an application, which is not limited. When the similarity is not higher than the threshold, the user to be identified is determined not to be the registered user; at this time, the user's request can be rejected, for example by keeping the screen locked or refusing to open the application, which is not limited.
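  • For illustration only (not part of the application): if the voiceprint template and the voice features are taken to be fixed-length embedding vectors, the matching and threshold decision described above can be sketched as follows; the cosine similarity measure and the 0.8 threshold are assumed placeholders.

```python
import numpy as np

def is_registered_user(template: np.ndarray,
                       features: np.ndarray,
                       threshold: float = 0.8) -> bool:
    """Match the voice feature vector of the signal to be recognized
    against the registered user's voiceprint template."""
    # Cosine similarity between the two embeddings (assumed measure;
    # the application does not mandate a specific similarity).
    similarity = float(np.dot(template, features) /
                       (np.linalg.norm(template) * np.linalg.norm(features)))
    # Above the threshold: treat the user to be identified as the
    # registered user (e.g. unlock the terminal); otherwise reject.
    return similarity > threshold
```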
  • the first emotion includes at least one of calm, joy, anger, sadness, eagerness, fear, and surprise.
  • The first emotion can be a single emotion, such as calm, joy, anger, sadness, eagerness, fear, or surprise; it can also be a mixture of multiple emotions, such as a mixture of calm and joy, or a mixture of anger, eagerness, and sadness, which is not limited in the embodiments of the present application.
  • When the first emotions are different, the corresponding voiceprint templates are different.
  • In some implementations, the first emotion is a single emotion among multiple emotions, and obtaining the voiceprint template of the registered user under the first emotion includes calling the voiceprint template under that emotion, with which voiceprint recognition can then be performed.
  • In this way, the embodiments of the present application recognize the emotion of the to-be-recognized voice signal of the user to be recognized, invoke the registered user's voiceprint template under that emotion, and match the voice signal to be recognized against that voiceprint template to determine whether the user to be identified is the registered user. Therefore, by matching the voice signal to be recognized with the voiceprint template under the same emotion, the embodiments of the present application help reduce the influence of the user's emotional fluctuations on voiceprint recognition, help provide a consistent voiceprint recognition experience while the user is in different emotions, and thereby enhance the robustness of voiceprint recognition.
  • the first emotion is characterized by a weight coefficient of each of at least two emotions.
  • In this case, obtaining the voiceprint template of the registered user under the first emotion includes: obtaining the voiceprint template corresponding to the first emotion according to the voiceprint templates corresponding to the at least two emotions and the weight coefficient of each emotion.
  • The first emotion here is a mixed emotion composed of multiple emotions. In this case, the mixed voiceprint template corresponding to the first emotion can be generated according to the voiceprint templates corresponding to the registered user's multiple emotions, and voiceprint matching is then performed according to the mixed voiceprint template.
  • In this way, the weight coefficient of each emotion contained in the user's current emotion is identified, and the voiceprint templates of those emotions in the registered user's voiceprint template set are weighted and summed according to the weight coefficients to obtain the mixed voiceprint template; the voice signal to be recognized is matched against the mixed voiceprint template, and it is judged whether the user to be recognized is the registered user. Therefore, by matching the voice signal to be recognized with the voiceprint template under the same emotion, the embodiments of the present application help reduce the influence of the user's emotional fluctuations on voiceprint recognition, help provide a consistent voiceprint recognition experience while the user is in different emotions, and thereby enhance the robustness of voiceprint recognition.
  • the first emotion may also be displayed through a display interface, so that the user knows the emotion corresponding to the current voice signal to be recognized.
  • When the first emotion is characterized by a weight coefficient for each of at least two emotions, displaying the first emotion through the display interface may include displaying each emotion and the weight coefficient of each emotion through the display interface.
  • In some implementations, a first operation of the user may also be obtained, where the first operation is used to correct the type of the first emotion, or to correct the weight coefficient of each of the at least two emotions in the first emotion. Then, in response to the first operation, the first emotion is updated.
  • In this way, the first emotion is displayed to the user, and when the user is not satisfied with the type of the first emotion, or with the weight coefficient of each emotion in the first emotion, the first emotion can be updated with reference to the user's judgment of his or her true emotion. This helps accurately identify the user's current emotional state, helps reduce the impact of the user's emotional fluctuations on voiceprint recognition, and thus helps provide a consistent voiceprint recognition experience in different emotions, thereby enhancing the robustness of voiceprint recognition.
  • In some implementations, before the obtaining of the voiceprint template of the registered user under the first emotion, the method further includes: acquiring registered voice signals under multiple different emotions, and obtaining a voiceprint template for the registered user under each of the multiple different emotions.
  • the embodiment of the present application can generate voiceprint templates for users in different emotions, and the voiceprint templates under different emotions are different. Therefore, the embodiment of the present application can adapt to different emotional changes of the user in the process of voiceprint recognition, which helps to improve the accuracy of voiceprint recognition.
  • the registered voice of the user under different emotions can be directly collected, and the registered voice signal of the user under different emotions can be obtained.
  • For example, at least two preset emotions can be displayed to the user through a display interface; then a second operation of the user is obtained, where the second operation is used to enter the user's voice under the at least two preset emotions. In response to the second operation, the registered voice signals under the at least two preset emotions are obtained, where the registered voice signals under the multiple different emotions include the registered voice signals under the at least two preset emotions.
  • the preset emotion may be calm, joy, anger, sadness, eagerness, fear or surprise, etc., which is not limited in the embodiment of the present application.
  • In some implementations, the acquiring of the registered voice signals under multiple different emotions includes: acquiring a first registered voice signal, performing emotion conversion on the first registered voice signal, and obtaining the registered voice signals under the multiple different emotions.
  • That is, the emotion conversion converts the first registered voice signal into a registered voice signal under each of the multiple different emotions.
  • In some implementations, the determining of whether the user to be recognized is the registered user based on the voice signal to be recognized and the voiceprint template includes: performing feature extraction on the voice signal to be recognized to obtain voiceprint information, and determining, according to the voiceprint information and the voiceprint template, whether the user to be identified is the registered user.
  • The voiceprint information can identify the characteristic information of the voice signal to be recognized. Therefore, the voiceprint information of the voice signal to be recognized can be matched with the registered user's voiceprint template under the emotion, and it can be judged whether the user to be recognized is the registered user.
  • In a second aspect, an embodiment of the present application provides a voiceprint recognition device, which is used to execute the method in the first aspect or any possible implementation of the first aspect. The device includes modules for executing the method in the first aspect or any possible implementation of the first aspect.
  • In a third aspect, an embodiment of the present application provides a voiceprint recognition device, including: one or more processors; and a memory for storing one or more programs; where, when the one or more programs are executed by the one or more processors, the one or more processors implement the method in the first aspect or any possible implementation of the first aspect.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable medium for storing a computer program, the computer program including instructions for executing the method in the first aspect or any possible implementation of the first aspect.
  • In a fifth aspect, the embodiments of the present application also provide a computer program product containing instructions which, when the computer program product runs on a computer, cause the computer to execute the method in the first aspect or any possible implementation of the first aspect.
  • For the beneficial effects achieved by the second to fifth aspects of the application and their corresponding implementations, refer to the beneficial effects achieved by the first aspect of the application and its corresponding implementations; they are not repeated here.
  • FIG. 1 is a schematic flowchart of a method for voiceprint recognition.
  • FIG. 2 is a schematic diagram of a voiceprint recognition system provided by an embodiment of the present application.
  • FIG. 3 is a specific example of a voiceprint registration process provided by an embodiment of the present application.
  • FIG. 4 is another specific example of the voiceprint registration process provided by an embodiment of the present application.
  • FIG. 5 is a specific example of the voiceprint recognition process provided by an embodiment of the present application.
  • FIG. 6 is another specific example of the voiceprint recognition process provided by an embodiment of the present application.
  • FIG. 7 is an example of a display interface of a terminal device provided by an embodiment of the present application.
  • FIG. 8 is another example of a display interface of a terminal device provided by an embodiment of the present application.
  • FIG. 9 is another example of a display interface of a terminal device provided by an embodiment of the present application.
  • FIG. 10 is another example of a display interface of a terminal device provided by an embodiment of the present application.
  • FIG. 11 is another example of a display interface of a terminal device provided by an embodiment of the present application.
  • FIG. 12 is another example of a display interface of a terminal device provided by an embodiment of the present application.
  • FIG. 13 is another example of a display interface of a terminal device provided by an embodiment of the present application.
  • FIG. 14 is another example of a display interface of a terminal device provided by an embodiment of the present application.
  • FIG. 15 is a schematic flowchart of a method for voiceprint recognition provided by an embodiment of the present application.
  • FIG. 16 is a schematic block diagram of a voiceprint recognition device provided by an embodiment of the present application.
  • FIG. 17 is a schematic block diagram of another voiceprint recognition device provided by an embodiment of the present application.
  • FIG. 1 shows a schematic flowchart of a method 100 for voiceprint recognition.
  • voiceprint recognition mainly includes two processes: voiceprint registration and voiceprint confirmation/identification.
  • In the voiceprint registration phase (including step 101, step 102, step 103, and step 104), the voiceprint templates of one or more users can be obtained.
  • In the voiceprint confirmation/identification stage (including step 101, step 102, step 103, step 105, and step 106), the voice feature information of an unknown speaker can be obtained and then matched against the known voiceprint templates obtained in the voiceprint registration stage to perform voiceprint confirmation/identification.
  • the voiceprint confirmation/discrimination stage may also be referred to as the voiceprint recognition stage.
  • Voiceprint confirmation is speaker verification, which is used to determine whether an unknown speaker is a designated person. In voiceprint confirmation, the acquired voice feature information of the unknown speaker can be matched with the voiceprint template of the designated person to confirm whether the unknown speaker is the designated person.
  • Voiceprint identification is speaker identification, which is used to determine which of several known recorded speakers an unknown speaker is. In voiceprint identification, the acquired voice features of the unknown speaker can be matched against the voiceprint templates of the multiple known recorded speakers respectively, to determine which of them the unknown speaker is.
  • In the voiceprint registration stage, the collected voice signal of the user (also called the registered voice signal) can first undergo signal processing, such as step 101 (voice detection) and step 102 (voice enhancement), to obtain the processed registered voice signal.
  • step 103 is performed on the processed registered voice signal, that is, feature extraction is performed to obtain feature information of the registered voice signal.
  • step 104 is performed, that is, the feature information of the registered voice signal is trained through the voiceprint model to obtain the voiceprint template of the user.
  • the user's voiceprint template can be obtained.
  • the user can be referred to as a "registered user".
  • the voiceprint template of at least one user can be obtained in the above-mentioned manner, that is, the voiceprint template of at least one registered user can be obtained.
  • a voiceprint template library may be established through the above voiceprint registration process, and the voiceprint template library may include multiple voiceprint templates of different registered users.
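  • As an illustrative sketch (assumed structure, not specified by the application), such a voiceprint template library can be modeled as a nested mapping from registered user to per-emotion voiceprint templates, with templates taken to be fixed-length embedding vectors:

```python
import numpy as np
from typing import Dict

# emotion -> voiceprint template (placeholder 128-dim embeddings)
VoiceprintTemplateSet = Dict[str, np.ndarray]
# registered user -> voiceprint template set for multiple emotions
TemplateLibrary = Dict[str, VoiceprintTemplateSet]

library: TemplateLibrary = {
    "registered_user_1": {
        "calm":  np.random.rand(128),
        "joy":   np.random.rand(128),
        "anger": np.random.rand(128),
    },
}

def get_template(user: str, emotion: str) -> np.ndarray:
    """Fetch one registered user's voiceprint template under one emotion."""
    return library[user][emotion]
```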
  • In the voiceprint confirmation/identification stage, the collected voice signal of the user (also called the voice signal to be recognized) can likewise undergo signal processing, such as step 101 (voice detection) and step 102 (voice enhancement), to obtain the processed voice signal to be recognized.
  • step 103 is performed, that is, feature extraction is performed on the processed voice signal to be recognized, and the feature information of the voice signal to be recognized is obtained.
  • step 105 is performed, which is to perform voiceprint matching between the feature information of the voice signal to be recognized and the voiceprint template of the registered user.
  • In this way, the similarity score between the feature information of the voice signal to be recognized and the voiceprint template can be obtained.
  • step 106 is executed, that is, according to the similarity score, it is confirmed whether the user to be identified is a registered user.
  • the user's voiceprint template includes the spectral characteristics of the user's voice signal.
  • the speech spectrum of a sound signal is a graphical representation of the sound signal, which can represent the change of the frequency amplitude of each frequency point of the sound signal over time.
  • the amplitude of the sound signal at each frequency point can be distinguished by color.
  • the fundamental frequency and harmonic frequency of the speaker's voice appear as bright lines on the spectrum.
  • In actual use, however, the user's emotions can be diverse, and these emotions affect the spectral characteristics of the user's voice. As a result, for the same user under different emotions, the difference in the spectral characteristics of the voice may be relatively large, which affects the accuracy of voiceprint recognition.
  • the user performs voiceprint registration when the user is in a calm mood, and the obtained voiceprint template at this time contains the spectral characteristics of the voice signal of the user in a calm state.
  • In that case, if the user's emotion fluctuates during recognition, the difference between the spectral characteristics of the extracted voice signal to be recognized and the spectral characteristics in the voiceprint template may be relatively large, which may result in a low degree of voiceprint matching and affect the accuracy of voiceprint recognition.
  • In view of this, the embodiments of the present application provide an emotion-adaptive voiceprint recognition method: voiceprint templates (or voiceprint template sets) for multiple emotions are generated, and voiceprint matching is performed according to those voiceprint templates (or voiceprint template sets), so as to realize emotion-adaptive voiceprint recognition.
  • The emotion may include at least one of calm, joy, anger, sadness, eagerness, fear, and surprise. That is, the emotion can be a single emotion such as calm, joy, anger, sadness, eagerness, fear, or surprise, or a combination of two or more emotions (a mixed emotion), which is not limited in the embodiments of the present application.
  • The generation of voiceprint templates (or voiceprint template sets) for multiple emotions is completed in the voiceprint registration stage. For example, the user's registered voice signals under different emotions can be entered directly, or the registered voice signal under one emotion can be emotion-converted to generate registered voice signals under different emotions. The registered voice signals under the different emotions are then trained to generate voiceprint templates for the multiple different emotions.
  • emotions can be preset in the terminal device, such as preset emotions such as calm, joy, anger, sadness, fear, eagerness, and surprise.
  • In the embodiments of the present application, a voiceprint template can be generated separately for each of a variety of different preset emotions, such as the voiceprint template under calm, the voiceprint template under joy, the voiceprint template under anger, the voiceprint template under sadness, the voiceprint template under fear, the voiceprint template under eagerness, and the voiceprint template under surprise, which is not limited in the embodiments of the application. The corresponding voiceprint templates differ across the different emotions.
  • the voiceprint matching is completed in the voiceprint recognition stage.
  • emotion recognition may be performed on the speech to be recognized, and a corresponding voiceprint template may be obtained according to the result of the emotion recognition, and then voiceprint matching may be performed according to the voiceprint template.
  • the result of emotion recognition that is, the emotion obtained by performing emotion recognition on the voice signal to be recognized, can be called the first emotion.
  • the voiceprint templates corresponding to the different emotions are different.
  • the first emotion may be one of multiple preset emotions, that is, a single emotion, such as calm, joy, anger, sadness, fear, eagerness, or surprise.
  • the voiceprint template corresponding to the emotion can be selected from the voiceprint templates of multiple preset emotions, and then the voiceprint matching is performed according to the selected voiceprint template and the voiceprint characteristics of the voice signal to be recognized.
  • For example, if the first emotion is joy, the voiceprint template under joy can be determined as the voiceprint template under the first emotion.
  • the first emotion may be a mixed emotion composed of multiple preset emotions, such as a mixed emotion of calm and sadness, a mixed emotion of joy and eagerness, a mixed emotion of anger, sadness and eagerness, and so on.
  • the voiceprint templates of the multiple preset emotions can be used to generate the mixed voiceprint template of the first emotion, and the voiceprint matching is performed according to the mixed voiceprint template and the voiceprint feature of the voice signal to be recognized.
  • For example, if the first emotion is a mixture of calm and sadness, the mixed voiceprint template under the first emotion can be generated according to the voiceprint template under calm and the voiceprint template under sadness.
  • Since the embodiments of this application match the voice signal to be recognized with the voiceprint template under the same emotion, they help reduce the influence of emotional fluctuations on voiceprint recognition, help provide a consistent voiceprint recognition experience while the user is in different emotions, and thereby enhance the robustness of voiceprint recognition.
  • FIG. 2 shows a schematic diagram of a voiceprint recognition system 200 provided by an embodiment of the present application.
  • The system 200 can be applied to the voiceprint recognition function of various terminal devices, such as smart devices including mobile phones, smart speakers, and in-vehicle electronic devices, for the terminal device to confirm the user's identity, for example to wake up the device or start a smart assistant.
  • the system 200 may include a signal processing module 201, an emotion change module 202, a voiceprint template generation module 203, a feature extraction module 204, an emotion recognition module 205, a voiceprint template acquisition module 206, and a voiceprint matching module 207.
  • the arrow in FIG. 2 can be used to indicate the transmission direction of the signal flow.
  • Among them, the signal processing module 201, the emotion recognition module 205, the voiceprint template acquisition module 206, the feature extraction module 204, and the voiceprint matching module 207 can be used in the voiceprint confirmation/recognition process, while the signal processing module 201, the emotion change module 202, and the voiceprint template generation module 203 can be used in the voiceprint registration process. Generally, voiceprint registration is required before voiceprint confirmation/recognition.
  • The signal processing module 201 is used to perform signal processing on the acquired voice signal, such as voice activation detection, noise reduction processing, and de-reverberation processing, to obtain a processed voice signal.
  • In the voiceprint registration stage, the signal processing module 201 is used to perform signal processing on the registered voice signal to obtain the processed registered voice signal; in the voiceprint confirmation/recognition stage, the signal processing module 201 is used to perform signal processing on the voice signal to be recognized to obtain the processed voice signal to be recognized.
  • the system 200 may include one or more signal processing modules 201, which is not limited in the embodiment of the present application.
  • the signal processing module that performs signal processing on the registered voice signal and the signal processing module that performs signal processing on the voice signal to be recognized may be the same module or different modules, and both are within the scope of protection of the embodiments of the present application.
  • the emotion change module 202 is used for performing emotion change processing on the registered voice signal during the voiceprint registration stage to obtain registered voice signals under different emotions.
  • the emotion change module 202 may perform emotion change processing on the registered voice signals processed by the signal processing module 201 to obtain registered voice signals in different emotions.
  • different emotions can be referred to the above description, for the sake of brevity, it will not be repeated here.
  • the voiceprint template generation module 203 is configured to perform voiceprint template training according to registered voice signals corresponding to different emotions, and obtain voiceprint templates corresponding to different emotions, that is, voiceprint templates of multiple emotions.
  • Specifically, the voiceprint template generation module 203 may extract the feature information of the registered voice signal and perform voiceprint template training on the feature information to generate the corresponding voiceprint template. In the embodiments of the present application, voiceprint template training may be performed on the registered voice signals under multiple different emotions, so as to obtain the user's voiceprint templates under the different emotions.
  • The feature extraction module 204 is configured to perform feature extraction on the voice signal to be recognized in the voiceprint confirmation/recognition stage to obtain the feature information of the voice signal to be recognized, that is, the voiceprint information.
  • the emotion recognition module 205 is configured to perform emotion recognition on the to-be-recognized voice signal of the user to be recognized in the voiceprint confirmation/recognition stage, and determine the emotion corresponding to the to-be-recognized voice signal.
  • For example, the emotion recognition module 205 can recognize the user's emotion according to the characteristics of the speech spectrum in the acquired voice signal. For example, if the user is eager when recording a voice, it can be recognized that the emotion corresponding to the voice signal to be recognized is eagerness. For another example, if the user is joyful when recording a voice, it can be recognized that the emotion corresponding to the voice signal to be recognized is joy. For another example, if the user is angry when recording a voice, it can be recognized that the emotion corresponding to the voice signal to be recognized is anger.
  • The emotion recognition module 205 may be a discrete speech emotion classifier, or may perform dimensional speech emotion prediction, which is not limited in the embodiments of the present application.
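  • As a hedged illustration (the application does not specify a classifier), a discrete speech emotion classifier can be sketched as a function that scores each preset emotion and normalizes the scores into probabilities; the linear weights below are placeholders standing in for a trained model:

```python
import numpy as np

EMOTIONS = ["calm", "joy", "anger", "sadness", "eagerness", "fear", "surprise"]

def classify_emotion(features: np.ndarray,
                     weights: np.ndarray,
                     bias: np.ndarray) -> dict:
    """Toy discrete emotion classifier: one score per preset emotion,
    normalized into probabilities with a softmax."""
    scores = weights @ features + bias       # shape: (len(EMOTIONS),)
    exp = np.exp(scores - scores.max())      # numerically stable softmax
    probs = exp / exp.sum()
    return dict(zip(EMOTIONS, probs))
```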
  • the voiceprint template acquisition module 206 is used to determine the voiceprint template used in the voiceprint matching according to the emotion recognition result and the voiceprint template of multiple emotions in the voiceprint confirmation/recognition stage.
  • For example, the voiceprint template obtaining module 206 may obtain the voiceprint template of the emotion corresponding to the voice signal to be recognized from the voiceprint template library, or generate, according to the voiceprint templates in the voiceprint template library, a mixed voiceprint template of the emotion corresponding to the voice signal to be recognized.
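  • A minimal sketch of that acquisition logic, under the same assumed per-emotion template mapping as above: a single recognized emotion fetches a stored template, while a mixed emotion falls through to the weighted combination of formula (1) described later:

```python
import numpy as np
from typing import Dict

def acquire_template(templates: Dict[str, np.ndarray],
                     emotion_weights: Dict[str, float]) -> np.ndarray:
    """Return the stored template for a single emotion, or a mixed
    voiceprint template when the recognized emotion is a mixture."""
    if len(emotion_weights) == 1:
        # Single emotion: call up the stored voiceprint template.
        (emotion,) = emotion_weights
        return templates[emotion]
    # Mixed emotion: weight each per-emotion template by the weight
    # coefficient of that emotion and sum the results.
    return sum(w * templates[e] for e, w in emotion_weights.items())
```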
  • the voiceprint matching module 207 is configured to perform voiceprint matching according to the voiceprint template and the characteristic information of the voice signal to be recognized in the voiceprint confirmation/recognition stage, and determine whether the user to be recognized is a registered user.
  • It should be understood that FIG. 2 shows the modules or units of the voiceprint recognition system 200, but these modules or units are only examples. The voiceprint recognition device in the embodiments of the present application may also include other modules or units, or variations of each of the modules or units shown. Moreover, the voiceprint recognition system in the embodiments of the present application need not include all of the modules or units in FIG. 2.
  • the registered voice signals of the user in different emotions can be obtained, and the voiceprint templates of the corresponding emotions are generated according to the registered voice signals in different emotions.
  • the following describes two ways of generating voiceprint templates in different emotions provided by the embodiments of the present application in conjunction with FIG. 3 and FIG. 4.
  • FIG. 3 shows a specific example of the voiceprint registration process provided by an embodiment of the present application. Here, the user's voice is converted into different emotions to obtain the user's voice under the different emotions, and a voiceprint template corresponding to each emotion is then generated.
  • step 301 can first obtain the registered voice signal input by the user.
  • the user may input a piece of voice through the voice acquisition module of the device to obtain the registered voice signal corresponding to the voice.
  • the registered voice signal may be referred to as the registered voice signal input by the user.
  • the registered voice signal may be processed by the signal processing module 201 in FIG. 2 to obtain the processed registered voice signal.
  • the processing process can be referred to the above description, for the sake of brevity, it will not be repeated here.
  • the user can input the voice in a calm mood, or input the voice in a situation of emotional fluctuations such as sadness, anger, and joy, which is not limited in the embodiment of the present application.
  • the voice input by the user may be text-related or text-independent, which is not limited in the embodiments of the present application.
  • Then, step 302 can be executed to convert the user's registered voice signal into different emotions. For example, the emotion change module 202 in FIG. 2 performs emotion conversion on the registered voice signal input by the user to obtain the user's registered voice signals under various emotions.
  • Emotion conversion is a direct conversion of the user's registered voice signal. For example, it may convert the user's registered voice signal into a registered voice signal under sadness, a registered voice signal under anger, a registered voice signal under joy, and so on, which is not limited in the embodiments of the present application.
  • the user's registered voice signal may be the user's voice signal collected by the device, and may be a time-domain signal that has undergone processing such as endpoint detection, noise reduction processing, and de-reverberation.
  • For example, a spectrum-prosody double-conversion speech emotion conversion algorithm may be used to achieve the emotion conversion, or a sparsity-constrained emotional speech conversion algorithm may be used, which is not limited in the embodiments of the present application.
  • the types of emotions can be preset.
  • For example, the emotion change module 202 can preset four emotions: sadness, anger, joy, and eagerness. In this case, when the emotion change module 202 obtains the registered voice signal input by the user, it can perform emotion conversion on that signal and obtain the user's registered voice signals under sadness, anger, joy, and eagerness.
  • preset emotion types can be added, changed, or deleted according to user needs.
  • step 303 may be performed to generate voiceprint templates in different emotions according to the registered voice signals of the user in different emotions.
  • For example, the voiceprint template generation module 203 in FIG. 2 may be used to generate the user's voiceprint template under each emotion.
  • the user's voiceprint templates under different emotions may constitute a set, which may be referred to as a set of voiceprint templates for multiple emotions.
  • the voiceprint template library may include voiceprint template sets of multiple emotions of multiple registered users.
  • the voiceprint registration of the user can be completed.
  • the user can be referred to as a registered user.
  • the embodiment of the present application can generate voiceprint templates for users in different emotions, and the voiceprint templates under different emotions are different. Therefore, the embodiment of the present application can adapt to different emotional changes of the user in the process of voiceprint recognition, which helps to improve the accuracy of voiceprint recognition.
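  • The FIG. 3 flow (steps 301 to 303) can be summarized in a short sketch; convert_emotion and train_template are hypothetical stand-ins for an emotion conversion algorithm and voiceprint model training, neither of which is pinned down by the application:

```python
import numpy as np

PRESET_EMOTIONS = ["sadness", "anger", "joy", "eagerness"]

def convert_emotion(signal: np.ndarray, emotion: str) -> np.ndarray:
    # Placeholder: a real system would apply a speech emotion conversion
    # algorithm here (e.g. spectrum-prosody double conversion).
    return signal

def train_template(signal: np.ndarray) -> np.ndarray:
    # Placeholder: a real system would train a voiceprint model; here we
    # just produce a fixed-length stand-in "embedding".
    return np.resize(signal, 128)

def register(enrolled_signal: np.ndarray) -> dict:
    """Steps 301-303: convert one enrolled recording into every preset
    emotion and train a per-emotion voiceprint template."""
    return {emotion: train_template(convert_emotion(enrolled_signal, emotion))
            for emotion in PRESET_EMOTIONS}
```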
  • FIG. 4 shows another specific example of the voiceprint registration process provided by an embodiment of the present application.
  • the registered voice signals of users under different emotions can be directly collected, and then corresponding voiceprint templates can be trained according to the registered voices under different emotions.
  • At least one registered voice signal input by the user can first be obtained through step 401, where the at least one registered voice signal includes the user's registered voice signals under at least one emotion.
  • For example, when the user performs voiceprint registration, the user may be prompted through the interface of the terminal device to enter voices under different emotions, or may be prompted by voice to do so, which is not limited here.
  • step 402 can be performed, that is, according to the registered voice signals in different emotions, voiceprint templates in corresponding emotions are generated.
  • step 402 is similar to step 303, and reference may be made to the above description. For the sake of brevity, it will not be repeated here.
  • the voiceprint registration of the user can be completed.
  • the user can be referred to as a registered user.
  • the embodiment of the present application can generate voiceprint templates for users in different emotions. Therefore, the embodiment of the present application can adapt to different emotional changes of the user in the process of voiceprint recognition, which helps to improve the accuracy of voiceprint recognition.
  • It should be noted that when the system 200 includes the emotion change module 202 and the voiceprint template generation module 203, the system 200 can complete both the voiceprint registration process and the voiceprint confirmation/recognition process.
  • Optionally, the terminal device that includes the system 200 can also send the acquired registered voice signal to another device, such as a cloud server; the other device trains on the received registered voice signal of the user to generate the user's voiceprint template, and then sends the voiceprint template to the terminal device.
  • the process of generating the voiceprint template by the cloud server is similar to the process of generating the voiceprint template by the terminal device, which can be referred to the above description. For the sake of brevity, it will not be repeated here.
  • In the voiceprint confirmation/recognition stage, the voiceprint template of the registered user under an emotion can be obtained according to the recognized emotion of the user's voice signal to be recognized, and the feature information of the voice signal to be recognized is then matched against the registered user's voiceprint template under that emotion to obtain the voiceprint confirmation/recognition result.
  • FIG. 5 shows a specific example of the voiceprint recognition process provided by the embodiment of the present application.
  • Through emotion recognition, it can be judged that the user's current emotional state is a single emotion among multiple preset emotions. In this case, voiceprint recognition can be performed by calling the voiceprint template under that emotion.
  • step 501 may be used to obtain the voice signal to be recognized input by the user.
  • the user may input a piece of voice through the voice acquisition module of the device to obtain the voice signal to be recognized corresponding to the voice.
  • the user is the user to be identified.
  • the to-be-recognized voice signal may be processed by the signal processing module 201 in FIG. 2 to obtain the processed to-be-recognized voice signal.
  • the processing process can be referred to the above description, for the sake of brevity, it will not be repeated here.
  • the user may input the voice in a calm mood, or input the voice in a situation of emotional fluctuations such as sadness, anger, joy, etc., which is not limited in the embodiment of the present application.
  • step 502 can be performed to perform emotion recognition on the voice signal to be recognized, and obtain the first emotion of the current user.
  • the first emotion may be one of preset emotions, such as sadness, anger, joy, etc.
  • step 502 may be performed by the emotion recognition module 205 in FIG. 2.
  • step 503 may be performed to perform voiceprint feature extraction on the voice signal to be recognized to obtain voiceprint information of the voice signal to be recognized.
  • step 503 may be performed by the feature extraction module 204 in FIG. 2.
  • the user's to-be-recognized voice signal may be the user's to-be-recognized voice signal collected by the device, and may be a time-domain signal that has undergone processing such as endpoint detection, noise reduction processing, and de-reverberation.
  • It should be noted that the feature extraction algorithm used for voiceprint feature extraction in the voiceprint confirmation/recognition phase is the same as the feature extraction algorithm used in the voiceprint registration phase when training to generate the voiceprint template.
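  • For illustration, assuming the shared features are MFCCs (the application does not name a specific feature type), the same extraction routine would simply be called in both phases, for example via librosa:

```python
import librosa
import numpy as np

def extract_voiceprint_features(path: str, sr: int = 16000) -> np.ndarray:
    """Shared feature extraction for both registration and recognition,
    assuming MFCC features purely for this sketch."""
    signal, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)
    # Average over time to obtain one fixed-length vector per utterance.
    return mfcc.mean(axis=1)
```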
  • step 504 may be performed to retrieve the voiceprint template of the first emotion of the registered user according to the recognition result of emotion recognition, and perform voiceprint judgment on the voice signal to be recognized, thereby determining the identity of the user.
  • the voiceprint information obtained in step 503 can be matched with the voiceprint template of the registered user's first emotion to determine whether the user to be identified is the registered user.
  • step 504 may be performed by the voiceprint template obtaining module 206 and the voiceprint matching module 207 in FIG. 2.
  • the voiceprint template obtaining module 206 may obtain the voiceprint template of the first emotion from the voiceprint template set of the registered user according to the first emotion identified in step 502. Then, the voiceprint matching module 207 matches the voiceprint information obtained in step 503 with the voiceprint template in the first mood, and determines whether the user to be identified is the registered user.
  • For example, the voiceprint template of the registered user can be matched against the voice feature vector of the voice signal to be recognized to obtain the similarity between the voice feature vector of the voice signal to be recognized and the voiceprint template of the registered user. It can then be judged whether the similarity is higher than a threshold. When the similarity is higher than the threshold, the user to be identified is determined to be the registered user; at this time, corresponding operations can be performed in response to the user's request, such as unlocking the smart terminal or opening an application, which is not limited. When the similarity is not higher than the threshold, the user to be identified is determined not to be the registered user; at this time, the user's request can be rejected, for example by keeping the screen locked or refusing to open the application, which is not limited.
  • In this way, the embodiments of the present application recognize the emotion of the to-be-recognized voice signal of the user to be recognized, call the registered user's voiceprint template under that emotion, and match the feature information of the voice signal to be recognized against that voiceprint template to determine whether the user to be identified is the registered user. Therefore, by matching the voice signal to be recognized with the voiceprint template under the same emotion, the embodiments of the present application help reduce the influence of the user's emotional fluctuations on voiceprint recognition, help provide a consistent voiceprint recognition experience while the user is in different emotions, and thereby enhance the robustness of voiceprint recognition.
  • FIG. 6 shows another specific example of the voiceprint recognition process provided by the embodiment of the present application.
  • Through emotion recognition, it is judged that the user's current emotional state is a mixed emotion composed of multiple preset emotions. In this case, the mixed voiceprint template of the user's current emotion can be generated according to the voiceprint templates corresponding to those preset emotions, and voiceprint recognition is then performed based on the mixed voiceprint template.
  • step 601 can obtain the voice signal to be recognized input by the user. Specifically, for step 601, refer to the description of step 501, and for brevity, it will not be repeated here.
  • step 602 may be performed to perform emotion recognition on the voice signal to be recognized, and obtain the first emotion of the current user.
  • the first emotion is a mixed emotion composed of multiple preset emotions, that is, the first emotion is a combination of two or more emotions in the preset emotions.
  • the voice of the user to be recognized often contains multiple emotional factors, such as anger and eagerness, or joy and excitement.
  • Through emotion recognition, it may be difficult to determine which single emotion the current emotion belongs to; in this case, a combination of multiple emotions can be used to describe the user's current emotional state.
  • the first emotion may be characterized by a weight coefficient of each emotion in at least two emotions.
  • an emotion recognition module may be used to perform emotion recognition on the to-be-recognized voice signal of the user to be recognized, and obtain the weight coefficient of each emotion in the user's current first emotion.
  • the weight coefficient of each emotion can represent the proportion of each emotion in the first emotion. That is, by multiplying each of the at least two emotions by the weight coefficient of each emotion, and then summing the at least two products, the first emotion can be obtained.
  • the weight coefficient of each of the at least two emotions included in the first emotion may constitute a weight coefficient vector of the first emotion.
  • For example, the weight coefficient vector of the first emotion can be obtained through step 602 and expressed as [W_1, ..., W_i, ..., W_N], where W_i is the weight coefficient corresponding to the i-th emotion and represents the proportion of the i-th emotion, and N represents the total number of emotion types contained in the first emotion. N may be the number of voiceprint templates of different emotions included in the voiceprint template set of multiple emotions, or N may be the number of preset emotion types. N is a positive integer greater than 1.
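  • Reading the weight coefficients as proportions, the first emotion can be formalized as a convex combination of the preset emotions (a sketch of the natural interpretation, not stated in this form in the application):

```latex
% e_i denotes the i-th preset emotion, E the first (mixed) emotion
E = \sum_{i=1}^{N} W_i \, e_i, \qquad W_i \ge 0, \qquad \sum_{i=1}^{N} W_i = 1
```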
  • For example, the emotion recognition module 205 can recognize that, in the first emotion, the probability of anger is 60%, the probability of eagerness is 30%, and the probability of sadness is 10%; the weight coefficient of anger can then be recorded as 0.6, the weight coefficient of eagerness as 0.3, and the weight coefficient of sadness as 0.1.
  • step 603 may be performed to perform voiceprint feature extraction on the voice signal to be recognized to obtain voiceprint information of the voice signal to be recognized.
  • step 603 refer to the description of step 503, and for brevity, it will not be repeated here.
  • step 604 can be executed to generate a mixed voiceprint template.
  • the mixed voiceprint template is the voiceprint template in the first mood.
  • the voiceprint template corresponding to each of the at least two emotions in the first emotion may be determined from the voiceprint templates of the registered user under multiple different emotions, and then the voiceprint template corresponding to each emotion The pattern template and the weight coefficient of each emotion are used to obtain the voiceprint template corresponding to the first emotion.
  • step 604 may be performed by the voiceprint template obtaining module 206 in FIG. 2.
  • For example, the voiceprint template obtaining module 206 can obtain the registered user's voiceprint template set, and then, according to the weight coefficient of each emotion in the first emotion (that is, the weight coefficient vector of the first emotion), compute the weighted average of the voiceprint templates of those emotions in the template set to obtain the mixed voiceprint template.
  • For example, the mixed voiceprint template can satisfy the following formula (1):
  • x = Σ_{i=1}^{N} W_i · x_i    (1)
  • where x represents the mixed voiceprint template, x_i represents the voiceprint template corresponding to the i-th emotion, and W_i and N are as described above.
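  • A minimal numpy sketch of formula (1), using the example weights 0.6/0.3/0.1 from above; the 128-dimensional templates are placeholders:

```python
import numpy as np

# Hypothetical per-emotion voiceprint templates of the registered user.
templates = {
    "anger":     np.random.rand(128),
    "eagerness": np.random.rand(128),
    "sadness":   np.random.rand(128),
}
# Weight coefficient vector of the first emotion from emotion recognition.
weights = {"anger": 0.6, "eagerness": 0.3, "sadness": 0.1}

# Formula (1): x = sum_i W_i * x_i
mixed_template = sum(w * templates[e] for e, w in weights.items())
```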
  • step 605 may be executed to perform a voiceprint judgment on the voice signal to be recognized based on the mixed voiceprint template obtained in step 604.
  • step 605 may be performed by the voiceprint matching module 207 in FIG. 2.
  • the voiceprint matching module 207 can match the voiceprint information obtained in step 603 with the mixed voiceprint template to determine whether the user to be identified is the registered user.
  • For example, the voiceprint template of the registered user can be matched against the voice feature vector of the voice signal to be recognized to obtain the similarity between the voice feature vector of the voice signal to be recognized and the voiceprint template of the registered user. It can then be judged whether the similarity is higher than a threshold. When the similarity is higher than the threshold, the user to be identified is determined to be the registered user; at this time, corresponding operations can be performed in response to the user's request, such as unlocking the smart terminal or opening an application, which is not limited. When the similarity is not higher than the threshold, the user to be identified is determined not to be the registered user; at this time, the user's request can be rejected, for example by keeping the screen locked or refusing to open the application, which is not limited.
  • In this way, the weight coefficient of each emotion contained in the user's current emotion is identified, and the voiceprint templates of those emotions in the registered user's voiceprint template set are weighted and summed according to the weight coefficients to obtain the mixed voiceprint template; the characteristic information of the voice signal to be recognized is matched against the mixed voiceprint template, and it is judged whether the user to be recognized is the registered user. Therefore, by matching the voice signal to be recognized with the voiceprint template under the same emotion, the embodiments of the present application help reduce the influence of the user's emotional fluctuations on voiceprint recognition, help provide a consistent voiceprint recognition experience while the user is in different emotions, and thereby enhance the robustness of voiceprint recognition.
  • the terminal device may prompt the user whether the voiceprint registration is required.
  • FIG. 7 shows an example of a display interface of a terminal device.
  • the display interface of the terminal device can display "whether to register a voiceprint template".
  • the terminal device may also display two virtual keys of "Yes" and "No" for obtaining user operations.
  • When the user selects "Yes", the terminal device can enter the interface for recording the user's voice; when the user selects "No", the terminal device exits the interface shown in FIG. 7.
  • the terminal device may also obtain the user's operation through physical buttons. For example, when the user selects the "confirm” button, the user can enter the interface for recording the user's registered voice, and when the user selects the "return” button, the interface shown in FIG. 7 is exited.
  • In addition, the terminal device can give the user a voice prompt, such as playing "whether to register a voiceprint template" or another prompt through an audio player, which is not limited in the embodiments of the present application.
  • the user can also choose to add a new voiceprint template for voiceprint recognition in the security settings.
  • FIG. 8 shows an example of display interfaces of a terminal device.
  • For example, the user can perform the "voiceprint" operation on the security and privacy display interface shown on the left in FIG. 8. In response, the display may present the interface shown on the right side of FIG. 8, on which the user can perform the "new voiceprint template" operation.
  • In response, the terminal device can enter the interface for recording the user's voice.
  • FIG. 9 shows an example of the display interface of the terminal device.
  • the display interface of the terminal device can display "Please select the voice recording method", as well as the two voice recording methods, which are "voice with multiple emotions" and "voice with one emotion”. ".
  • When the user selects "voice with multiple emotions", the terminal device can enter the interface for recording voices under multiple emotions; when the user selects "voice with one emotion", the terminal device enters the interface for recording the voice under one emotion.
  • FIG. 10 shows an example of an interface for inputting a user's voice.
  • "Please select the emotions when recording voices” can be displayed on the display interface, and the preset emotions in the terminal device can be displayed, for example Calm, sad, joy, fear, anger, and eagerness, etc., but the embodiments of the present application are not limited thereto.
•   After seeing the display interface, the user can perform an operation of selecting an emotion, for example selecting the "fear" emotion. Exemplarily, the user can select the emotion in which they wish to record a voice according to their own mood.
•   In response to the user's emotion-selection operation, the interface shown in (b) of FIG. 10 may be displayed to the user. Taking the selection of "fear" as an example, the prompt "please record a voice in the fear emotion" and a "start recording" virtual button can be displayed on the interface at this time. The user can then perform the operation of recording a voice in the fear emotion, for example by long-pressing the "start recording" virtual button while speaking.
  • the terminal device may obtain the registered voice signal in the fear emotion entered by the user through a voice acquisition module (for example, a microphone component).
  • the above description only takes the user's input of voices in fear emotions as an example, and the user can also use the same method to input voices in other emotions, which is not limited in the embodiment of the present application.
  • the embodiments of the present application do not limit the time and sequence of the user inputting voices in a certain emotion.
  • the user can input voices in different emotions at different times. These are all within the protection scope of the embodiments of the present application.
•   In FIG. 10, the operation performed by the user of selecting a preset emotion and recording a voice under that preset emotion may be called operation #1; that is, operation #1 is used to record the user's voice under the preset emotion, but the embodiments of the present application are not limited thereto.
•   when the terminal device has no display interface, or while the terminal device displays the interface shown in FIG. 10, the user can also be given voice prompts, for example playing "Please select the emotion for the recorded voice", "Please record a voice in the fear emotion", or other voices through an audio player, which is not limited in this embodiment of the application.
•   After the terminal device obtains the user's registered voice signals under different emotions, it can perform signal processing on the registered voice signals under the different emotions, such as voice activation detection, voice noise reduction processing, and de-reverberation processing, which is not limited in the embodiments of this application.
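As a rough illustration of this preprocessing chain, the sketch below implements only the simplest of the three steps, an energy-based voice activity gate; the noise reduction and de-reverberation stages, whose algorithms this application does not fix, are left as assumed placeholders:

```python
import numpy as np

def energy_vad(signal: np.ndarray, frame_len: int = 512,
               ratio: float = 0.1) -> np.ndarray:
    """Crude voice activation detection: keep frames whose energy exceeds
    a fraction of the loudest frame's energy."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    voiced = frames[energy > ratio * energy.max()]
    return voiced.reshape(-1)

def preprocess(signal: np.ndarray) -> np.ndarray:
    """Signal processing applied to a registered voice signal."""
    signal = energy_vad(signal)
    # Noise reduction and de-reverberation would be applied here.
    return signal
```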
  • Fig. 11 shows another example of an interface for inputting a user's voice.
•   As shown in (a) of FIG. 11, after the user chooses to record a voice in one emotion, "Please select the emotions for emotion conversion" can be displayed on the display interface, together with the emotions preset in the terminal device, such as calm, sadness, joy, fear, anger, and eagerness, but the embodiments of the present application are not limited thereto.
•   Correspondingly, after seeing the display interface, the user can perform operation #2 of selecting multiple different emotions from the at least two preset emotions, for example selecting the emotions "calm", "joy", and "fear".
  • the interface shown in (b) in FIG. 11 may be displayed to the user.
  • the prompt of "please record voice” and the virtual button of "start recording” can be displayed on the interface.
  • the user can perform the operation of recording voice.
•   In response to the user's voice recording operation, the terminal device may obtain the registered voice signal entered by the user through a voice acquisition module (for example, a microphone component).
  • the terminal device may perform signal processing on the registered voice signal, such as voice activation detection, voice noise reduction processing, de-reverberation processing, etc., which are not limited in this embodiment of the application.
•   Then, the terminal device can perform emotion conversion on the registered voice signal, transforming it into registered voice signals in the at least two emotions selected in FIG. 11, that is, obtaining registered voice signals for multiple emotions of the user.
•   As an example, the emotion change module 202 in FIG. 2 can be used to perform emotion conversion on the registered voice signal.
  • a voiceprint template of the user's multiple emotions can be generated.
  • the voiceprint template generation module 203 in FIG. 2 may be used to generate voiceprint templates in multiple emotions.
•   For details of the emotion conversion and the process of generating the voiceprint templates, refer to the above description; for brevity, details are not repeated here.
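A compact sketch of this registration flow is given below. It assumes, purely for illustration, that a voiceprint template is an embedding vector extracted from the converted registration voice; `convert_emotion` and `extract_embedding` are hypothetical stand-ins for the emotion change module 202 and the template training of module 203, whose concrete algorithms the application leaves open:

```python
import numpy as np

PRESET_EMOTIONS = ["calm", "sadness", "joy", "fear", "anger", "eagerness"]

def register_templates(registered_signal: np.ndarray,
                       convert_emotion, extract_embedding) -> dict:
    """Build one voiceprint template per preset emotion from a single
    registration recording (the FIG. 11 style flow)."""
    templates = {}
    for emotion in PRESET_EMOTIONS:
        # Emotion conversion: transform the registration signal into the
        # target emotion (module 202 in FIG. 2).
        converted = convert_emotion(registered_signal, emotion)
        # Template generation: here simply the utterance embedding; with
        # several utterances per emotion their embeddings could be averaged.
        templates[emotion] = extract_embedding(converted)
    return templates
```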
  • the voiceprint recognition of the user to be recognized can be performed.
  • the terminal device may prompt the user to perform voiceprint verification.
  • the terminal device can enter the interface for recording the tester's test voice.
  • Fig. 12 shows another example of an interface for inputting a user's voice.
  • "Please enter your voice for voiceprint verification” can be displayed on the display interface.
  • the terminal device may also display a virtual button for "start recording" on the interface.
•   In response to the user's voice input operation, the terminal device may obtain the user's to-be-recognized voice signal through a voice acquisition module (for example, a microphone component).
•   when the terminal device has no display interface, or while the terminal device displays the interface shown in FIG. 12, the terminal device can also give the user a voice prompt, for example playing "Please enter a voice to perform voiceprint verification" or other voices through an audio player, which is not limited in the embodiment of the present application.
  • the terminal device may perform signal processing on the voice signal to be recognized, such as voice activation detection, voice noise reduction processing, de-reverberation processing, etc., which are not limited in the embodiment of the present application.
  • the terminal device After the terminal device obtains the voice signal to be recognized, on the one hand, it can perform feature extraction on the voice signal to be recognized to obtain the voiceprint information of the voice signal to be recognized.
  • the feature extraction module 204 in FIG. 2 can perform feature extraction on the voice signal to be recognized.
  • emotion recognition can be performed on the voice signal to be recognized, and the first emotion corresponding to the voice signal to be recognized can be obtained.
  • the emotion recognition module 205 in FIG. 2 can be used to perform emotion recognition on the voice signal to be recognized.
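The first emotion can be a single label or a set of weight coefficients. As a hedged sketch, assuming a discrete emotion classifier that outputs one raw score per preset emotion (the application also allows a dimensional emotion predictor instead), a softmax turns the scores into weight coefficients that sum to 1:

```python
import numpy as np

PRESET_EMOTIONS = ["calm", "joy", "anger", "sadness",
                   "eagerness", "fear", "surprise"]

def emotion_weights(scores: np.ndarray) -> dict:
    """Map raw per-emotion classifier scores to the weight coefficients
    characterizing the first emotion."""
    e = np.exp(scores - scores.max())   # numerically stable softmax
    probs = e / e.sum()
    return dict(zip(PRESET_EMOTIONS, probs))
```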
  • the first emotion that is, the detected emotion of the user, may be displayed to the user through a display interface.
  • Diagram (a) in FIG. 13 shows an example of an interface displaying the first emotion.
  • the first emotion is one of the preset emotions.
  • Figure 14 (a) shows another example of an interface that displays the first emotion.
•   In (a) of FIG. 14, the first emotion is characterized by the weight coefficient of each of at least two emotions. At this time, each of the at least two preset emotions and the weight coefficient of each preset emotion can be displayed on the display interface. For example, as shown in (a) of FIG. 14, in the first emotion the weight coefficient of anger is 0.6, the weight coefficient of eagerness is 0.3, the weight coefficient of sadness is 0.1, and the weight coefficients of the other emotions are 0.
•   when the user is not satisfied with the type of the first emotion displayed in the display interface, or with the weight coefficient of each of the at least two emotions in the first emotion, operation #3 may be performed, that is, correcting the type of the first emotion, or modifying the weight coefficient of each of the at least two emotions in the first emotion.
  • the terminal device After obtaining the user's operation #3, the terminal device can update the first emotion according to the operation #3.
  • the interface shown in Figure 13 (b) can be displayed to the user.
•   at this time, the user can be provided with optional emotion types, such as eagerness or calm, for the user to choose from.
  • the emotions for the user to select in FIG. 13(b) may be the types of emotions that may be obtained when performing emotion recognition on the voice signal to be recognized.
  • the interface shown in (b) in FIG. 14 can be displayed to the user. At this time, the user can choose to change the weight coefficient of each emotion.
•   When the first emotion is one of the preset emotions, the voiceprint template of the registered user in the first emotion can be directly called, and the voiceprint template can be matched with the voiceprint information of the voice signal to be recognized, to determine whether the user to be recognized is a registered user.
•   As an example, the voiceprint template of the registered user in the first emotion can be obtained through the voiceprint template obtaining module 206 in FIG. 2, and the voiceprint matching module 207 can match the voiceprint template against the voiceprint information of the to-be-recognized voice signal and obtain the matching result.
•   When the first emotion is a combination of multiple preset emotions, the weight coefficient vector corresponding to the first emotion can be determined, and the registered voiceprint templates of the different emotions of the registered user are weighted by the weight coefficient vector to obtain a mixed voiceprint template. Then, the mixed voiceprint template is matched with the voiceprint information of the voice signal to be recognized, and it is determined whether the user to be recognized is a registered user.
•   As an example, the mixed voiceprint template can be obtained by the voiceprint template obtaining module 206 in FIG. 2, and the voiceprint matching module 207 can match the mixed voiceprint template against the voiceprint information of the voice signal to be recognized and obtain the matching result.
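A minimal sketch of building the mixed voiceprint template, assuming each per-emotion template is an embedding vector and the weight coefficients come from the emotion recognition step (for example {"anger": 0.6, "eagerness": 0.3, "sadness": 0.1}); emotions with weight 0 are simply omitted:

```python
import numpy as np

def mixed_voiceprint_template(templates: dict, weights: dict) -> np.ndarray:
    """Weight each emotion's registered voiceprint template by its
    coefficient in the first emotion and sum the results."""
    return np.sum([w * templates[emotion] for emotion, w in weights.items()],
                  axis=0)
```

The resulting vector is then scored against the voiceprint information of the voice signal to be recognized exactly as in the single-emotion case.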
•   Since the embodiment of the present application matches the feature information of the voice signal to be recognized against the voiceprint template under the same emotion, it can help reduce the influence of the user's mood fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
  • FIG. 15 shows a schematic flowchart of a method for voiceprint recognition provided by an embodiment of the present application. Wherein, the method can be executed by the system 200 in FIG. 2. The method includes steps 710 to 740.
  • Step 710 Obtain the to-be-recognized voice signal of the to-be-recognized user.
  • Step 720 Perform emotion recognition on the voice signal to be recognized, and obtain a first emotion corresponding to the voice signal to be recognized.
•   Step 730 Obtain a voiceprint template of the registered user corresponding to the first emotion.
•   Wherein, when the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different.
  • Step 740 Determine whether the user to be recognized is the registered user according to the voice signal to be recognized and the voiceprint template.
•   Therefore, in the embodiment of the present application, emotion recognition is performed on the to-be-recognized voice signal of the user to be recognized, the first emotion of the to-be-recognized voice signal is obtained, the voiceprint template of the registered user in the first emotion is obtained, and the to-be-recognized voice signal is matched against the voiceprint template to determine whether the user to be recognized is a registered user. Therefore, by matching the voice signal to be recognized with the voiceprint template under the same emotion, the embodiment of the present application can help reduce the influence of the user's mood fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
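Putting steps 710 to 740 together, a hedged end-to-end sketch might look as follows; it reuses the illustrative helpers from the earlier sketches (`emotion_weights`, `mixed_voiceprint_template`, `is_registered`) and treats a weight of 1.0 on a single emotion as the single-emotion case:

```python
def verify_speaker(signal, templates, classify_emotion, extract_embedding,
                   threshold: float = 0.7) -> bool:
    """Steps 710-740: emotion-adaptive voiceprint verification."""
    weights = emotion_weights(classify_emotion(signal))   # step 720
    feature = extract_embedding(signal)                   # voiceprint info
    active = {e: w for e, w in weights.items() if w > 0}
    if len(active) == 1:                                  # single emotion:
        (emotion, _), = active.items()                    # direct lookup
        template = templates[emotion]                     # (step 730)
    else:                                                 # mixed emotion
        template = mixed_voiceprint_template(templates, active)
    return is_registered(template, feature, threshold)    # step 740
```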
  • the voiceprint template of the registered user can be matched with the voice feature vector of the voice signal to be recognized, and the similarity between the voice feature vector of the voice signal to be recognized and the voiceprint template of the registered user can be obtained. Then, it can be judged whether the similarity is higher than the threshold. When the similarity is higher than the threshold, it is determined that the user to be identified is a registered user. At this time, corresponding operations can be performed in response to the user's request, such as unlocking the smart terminal, or opening an application, etc., which is not limited. When the similarity is not higher than the threshold, it is determined that the user to be identified is not a registered user. At this time, the user's request can be rejected, such as keeping the screen locked, or refusing to open the application, etc., which is not limited.
  • the obtaining the voiceprint template of the registered user in the first emotion includes:
•   the first emotion may be a single emotion among a plurality of preset emotions, and in this case voiceprint recognition can be performed by calling the voiceprint template under that emotion.
•   Therefore, the embodiment of the present application recognizes the emotion of the to-be-recognized voice signal of the user to be recognized, invokes the voiceprint template of the registered user in that emotion, and performs voiceprint matching between the to-be-recognized voice signal and the registered user's voiceprint template in that emotion, to determine whether the user to be recognized is a registered user. By matching the voice signal to be recognized with the voiceprint template under the same emotion, the embodiment of the present application can help reduce the influence of the user's mood fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
  • the first emotion is characterized by a weight coefficient of each of at least two emotions.
  • the obtaining the voiceprint template of the registered user in the first emotion includes:
  • the voiceprint template corresponding to the first emotion is obtained.
  • the first emotion at this time can be a mixed emotion composed of multiple preset emotions.
•   At this time, the mixed voiceprint template corresponding to the first emotion can be generated according to the voiceprint templates corresponding to the multiple preset emotions of the registered user, and voiceprint matching can then be performed according to the mixed voiceprint template.
•   Therefore, the embodiment of the present application identifies the weight coefficient of each emotion contained in the user's current emotion, performs a weighted summation over the corresponding emotions in the registered user's voiceprint template set according to those weight coefficients to obtain a mixed voiceprint template, matches the voice signal to be recognized against the mixed voiceprint template, and judges whether the user to be recognized is a registered user. By matching the voice signal to be recognized with the voiceprint template under the same emotion, the embodiment of the present application can help reduce the influence of the user's mood fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
  • the first emotion may also be displayed through a display interface.
•   when the first emotion is characterized by the weight coefficient of each of the at least two emotions, the displaying of the first emotion through the display interface may include displaying each emotion and the weight coefficient of each emotion through the display interface.
•   the user's first operation may also be obtained, where the first operation is used to correct the type of the first emotion, or to correct the weight coefficient of each of the at least two emotions in the first emotion. Then, in response to the first operation, the first emotion is updated.
•   Therefore, the embodiment of the present application displays the first emotion to the user, and, when the user is not satisfied with the type of the first emotion or with the weight coefficient of each emotion in the first emotion, the first emotion can be updated with reference to the user's judgment of their own true emotion. This helps to accurately identify the user's current emotional state and to reduce the impact of the user's emotional fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
•   before the obtaining of the voiceprint template of the registered user in the first emotion, the method further includes: acquiring registered voice signals under multiple different emotions; and acquiring, according to the registered voice signals of the multiple different emotions, a voiceprint template for each of the multiple different emotions of the registered user.
  • the embodiment of the present application can generate voiceprint templates for users in different emotions, and the voiceprint templates under different emotions are different. Therefore, the embodiment of the present application can adapt to different emotional changes of the user in the process of voiceprint recognition, which helps to improve the accuracy of voiceprint recognition.
•   As an implementation, the user's registered voices under different emotions can be directly collected, so that the registered voice signals of the user under the different emotions are obtained.
•   the acquiring of registered voice signals in multiple different emotions includes: acquiring a first registered voice signal; and performing emotion conversion on the first registered voice signal to obtain the registered voice signals in the multiple different emotions.
•   the performing of emotion conversion on the first registered voice signal to obtain the registered voice signals in the multiple different emotions includes: displaying at least two preset emotions to the user through a display interface; obtaining a third operation of the user, where the third operation is used to select the multiple different emotions from the at least two preset emotions; and, in response to the third operation, performing emotion transformation on the first registered voice signal to obtain the registered voice signals in the multiple different emotions.
•   the determining of whether the user to be recognized is the registered user according to the voice signal to be recognized and the voiceprint template includes: performing voiceprint feature extraction on the voice signal to be recognized to obtain the voiceprint information of the voice signal to be recognized; and determining, according to the voiceprint information and the voiceprint template, whether the user to be recognized is the registered user.
  • the first emotion includes at least one of calm, joy, anger, sadness, eagerness, fear, and surprise.
  • the voiceprint recognition method provided by the embodiment of the present application is described in detail above with reference to FIGS. 1 to 15.
•   The voiceprint recognition apparatus of the embodiments of the present application is introduced below with reference to FIG. 16 and FIG. 17. It should be understood that the voiceprint recognition apparatuses in FIG. 16 and FIG. 17 can execute each step of the voiceprint recognition method of the embodiments of the present application; to avoid repetition, repeated descriptions are appropriately omitted when introducing the apparatuses in FIG. 16 and FIG. 17 below.
  • Fig. 16 is a schematic block diagram of a voiceprint recognition device according to an embodiment of the present application.
  • the voiceprint recognition device 800 in FIG. 16 includes a first acquisition unit 810, an emotion recognition unit 820, a second acquisition unit 830, and a judgment unit 840.
•   Specifically, when the voiceprint recognition device 800 executes the voiceprint recognition method: the first obtaining unit 810 is configured to obtain the to-be-recognized voice signal of the user to be recognized; the emotion recognition unit 820 is configured to perform emotion recognition on the voice signal to be recognized, to obtain the first emotion corresponding to the voice signal to be recognized; the second obtaining unit 830 is configured to obtain the voiceprint template of the registered user in the first emotion, where, when the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different; and the judging unit 840 is configured to determine, according to the voice signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user.
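The unit division of apparatus 800 can be pictured as a thin skeleton like the one below (a sketch only; as the embodiments themselves note, the unit division is a logical one, and real products may split or merge units differently):

```python
class VoiceprintRecognitionApparatus:
    """Mirrors the unit split of apparatus 800 in FIG. 16."""

    def __init__(self, acquire, recognize_emotion, get_template, match):
        self.first_obtaining_unit = acquire                # unit 810
        self.emotion_recognition_unit = recognize_emotion  # unit 820
        self.second_obtaining_unit = get_template          # unit 830
        self.judging_unit = match                          # unit 840

    def run(self) -> bool:
        signal = self.first_obtaining_unit()
        first_emotion = self.emotion_recognition_unit(signal)
        template = self.second_obtaining_unit(first_emotion)
        return self.judging_unit(signal, template)
```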
•   the second obtaining unit 830 is specifically configured to obtain the voiceprint template corresponding to the first emotion from the voiceprint templates of the registered user in multiple different emotions, where the multiple different emotions include the first emotion.
  • the first emotion is characterized by a weight coefficient of each of at least two emotions.
  • the second acquiring unit 830 is specifically configured to determine each of the at least two emotions in the first emotion from the voiceprint templates of the registered user in multiple different emotions. The corresponding voiceprint template, and then according to the voiceprint template of each emotion and the weight coefficient of each emotion, the voiceprint template corresponding to the first emotion is obtained.
  • the device 800 further includes a display interface for displaying the first emotion.
  • the display interface is specifically configured to display each emotion and the information of each emotion. Weight coefficient.
  • the device 800 further includes a third obtaining unit for obtaining a first operation of the user, where the first operation is used to correct the type of the first emotion, or used to correct the The weight coefficient of each of the at least two emotions in the first emotion.
  • the emotion recognition unit 820 is further configured to update the first emotion in response to the first operation.
  • the device 800 further includes a fourth acquiring unit for acquiring registered voice signals in multiple different emotions. And, it may further include a fifth acquiring unit, configured to acquire a voiceprint template of each emotion of the registered user in the multiple different emotions according to the registered voice signals of the multiple different emotions.
  • the fourth acquiring unit may be the same unit as the first acquiring unit, but the embodiment of the present application does not limit this.
  • the fourth obtaining unit is specifically configured to display at least two preset emotions to the user through a display interface; and then obtain a second operation of the user, and the second operation is used to enter the user's Voices in at least two preset emotions; in response to the second operation, acquiring registered voice signals in the at least two preset emotions, wherein the registered voice signals in the multiple different emotions include the at least Registration voice signals under two preset emotions.
  • the fourth acquiring unit is specifically configured to acquire a first registered voice signal, and then perform emotional conversion on the first registered voice signal to acquire registered voice signals in the multiple different emotions.
  • the fourth obtaining unit is specifically configured to display at least two preset emotions to the user through a display interface; and then obtain a third operation of the user, where the third operation is used to The multiple different emotions are selected from at least two preset emotions; in response to the third operation, emotion transformation is performed on the first registered voice signal to obtain registered voice signals in the multiple different emotions.
  • the determining unit 840 is specifically configured to perform voiceprint feature extraction on the voice signal to be recognized, and obtain voiceprint information of the voice signal to be recognized. Then, the judging unit 840 judges whether the user to be identified is the registered user according to the voiceprint information and the voiceprint template.
  • the first emotion includes at least one of calm, joy, anger, sadness, eagerness, fear, and surprise.
  • FIG. 17 is a schematic structural diagram of a voiceprint recognition device according to an embodiment of the present application.
  • the voiceprint recognition device may be a terminal device.
  • the voiceprint recognition device includes a communication module 910, a sensor 920, a user input module 930, an output module 940, a processor 950, an audio and video input module 960, a memory 970, and a power supply 980.
  • the communication module 910 may include at least one module that enables communication between the computer system and a communication system or other computer systems.
  • the communication module 910 may include one or more of a wired network interface, a broadcast receiving module, a mobile communication module, a wireless Internet module, a local area communication module, and a location (or positioning) information module.
  • the sensor 920 may sense the current state of the system, such as open/closed state, position, contact with the user, direction, and acceleration/deceleration, and the sensor 920 may generate a sensing signal for controlling the operation of the system.
  • the user input module 930 is used to receive inputted digital information, character information, or contact touch operations/non-contact gestures, and receive signal input related to user settings and function control of the system.
  • the user input module 930 includes a touch panel and/or other input devices.
•   the user input module 930 may be used to obtain a first operation input by the user, where the first operation is used to correct the type of the first emotion, or to correct the weight coefficient of each of at least two emotions in the first emotion.
  • the user input module 930 may be used to obtain a second operation input by the user, where the second operation is used to record the user's voice in the at least two preset emotions.
  • the user input module 930 may be used to obtain a third operation input by the user, and the third operation is used to select the multiple different emotions from at least two preset emotions.
  • the output module 940 includes a display panel for displaying information input by the user, information provided to the user, various menu interfaces of the system, and the like.
  • the display panel may be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED).
  • the touch panel may cover the display panel to form a touch display screen.
  • the output module 940 may also include an audio output module, an alarm, a haptic module, and so on.
  • the output module 940 is configured to display the first emotion to the user through the display screen, for example, display the type of the first emotion, or the weight coefficient of each emotion in at least two emotions in the first emotion.
  • the output module 940 may be used for displaying or prompting the user whether to register a voiceprint template through the display screen, or prompting the user to select the emotion when recording voice, or prompting the user to select the emotion of emotion conversion. This is not limited.
  • the audio and video input module 960 is used to input audio signals or video signals.
  • the audio and video input module 960 may include a camera and a microphone.
  • the power supply 980 may receive external power and internal power under the control of the processor 950, and provide power required for the operation of various components of the system.
•   the processor 950 may refer to one or more processors. For example, the processor 950 may include one or more central processing units, or include a central processing unit and a graphics processor, or include an application processor and a coprocessor (for example, a micro control unit).
  • the processor 950 includes multiple processors, the multiple processors may be integrated on the same chip, or each may be an independent chip.
  • a processor may include one or more physical cores, where the physical core is the smallest processing module.
  • the processor 950 is configured to obtain the to-be-recognized voice signal of the to-be-recognized user, perform emotion recognition on the to-be-recognized voice signal, and obtain the first emotion corresponding to the to-be-recognized voice signal. Then, the processor 950 obtains the voiceprint templates of the registered user in the first emotion, where when the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different. Then, the processor 950 determines whether the user to be recognized is the registered user according to the voice signal to be recognized and the voiceprint template.
•   the processor 950 is further configured to obtain registered voice signals in multiple different emotions, and then obtain, according to the registered voice signals of the multiple different emotions, the voiceprint template of each of the multiple different emotions of the registered user.
•   the processor 950 is further configured to update the first emotion in response to the user's first operation; or to acquire, in response to a second operation of the user, registered voice signals in the at least two preset emotions; or to perform, in response to a third operation of the user, emotion transformation on the first registered voice signal to obtain registered voice signals in the multiple different emotions.
  • the memory 970 stores a computer program, and the computer program includes an operating system program 972, an application program 971, and the like.
•   Typical operating systems include those used in desktop or notebook systems, such as Microsoft's Windows and Apple's macOS, as well as those used in mobile terminals, such as the Android-based system developed by Google.
  • the method provided in the foregoing embodiment can be implemented in software, and can be considered as a specific implementation of the application program 971.
•   the memory 970 may be one or more of the following types: flash memory, hard disk type memory, micro multimedia card type memory, card type memory (such as SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, or optical disk.
  • the memory 970 may also be a network storage device on the Internet, and the system may perform operations such as updating or reading the memory 970 on the Internet.
•   the processor 950 is used to read the computer program in the memory 970 and then execute the method defined by the computer program. For example, the processor 950 reads the operating system program 972 to run the operating system on the system and implement various functions of the operating system, or reads one or more application programs 971 to run applications on the system.
  • the memory 970 also stores other data 973 besides the computer program, such as the voiceprint template, the voice signal to be recognized, and the registered voice signal involved in this application.
•   The connection relationship of the various modules in FIG. 16 is only an example; the method provided by any embodiment of the present application can also be applied to voiceprint recognition devices with other connection modes, for example with all modules connected through a bus.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
•   the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
•   If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
•   Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
•   The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Abstract

A voiceprint recognition method and apparatus. The voiceprint recognition method includes: obtaining a to-be-recognized voice signal of a user to be recognized (710); performing emotion recognition on the voice signal to be recognized, to obtain a first emotion corresponding to the voice signal to be recognized (720); obtaining a voiceprint template of a registered user in the first emotion, where, when the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different (730); and judging, according to the voice signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user (740). By matching the voice signal to be recognized with the voiceprint template under the same emotion, the influence of the user's mood fluctuations on voiceprint recognition can be reduced, thereby enhancing the robustness of voiceprint recognition.

Description

Method and apparatus for voiceprint recognition

This application claims priority to Chinese patent application No. 202010132716.2, filed with the China National Intellectual Property Administration on February 29, 2020 and entitled "Method and apparatus for voiceprint recognition", which is incorporated herein by reference in its entirety.

Technical Field

This application relates to the field of biometric recognition, and more specifically, to a method and apparatus for voiceprint recognition.

Background

Voiceprint recognition achieves the identification of an unknown voice by analyzing the features of one or more voice signals; simply put, it is a technology for judging whether a certain utterance was spoken by a certain person. The theoretical basis of voiceprint recognition is that every person's voice has unique characteristics, by which the voices of different people can be effectively distinguished. The basic principle of voiceprint recognition is to analyze the similarity between the spectrograms of voice signals, and the features of the spectrogram directly affect the result of voiceprint recognition. Usually, the user is in a relatively calm mood when registering a voiceprint template.

In actual use, however, the user's emotions are varied: sometimes the user is anxious, sometimes happy or excited, and these emotions all affect the features of the spectrogram, so that emotional fluctuations have a negative impact on the accuracy of voiceprint recognition. In an existing voiceprint recognition scheme, the degree of deformation of emotional speech is detected by an emotion detection method to calculate an emotion factor, and the speech changes caused by emotion are compensated at the model layer and the feature layer in the training stage and the recognition stage respectively. However, on the one hand, this scheme depends on the accuracy of emotion detection when determining the emotion factor, and inaccurate emotion detection reduces the accuracy of voiceprint recognition; on the other hand, the compensation of the speech features further affects the accuracy of voiceprint recognition.

Therefore, how to improve the accuracy of voiceprint recognition when the user's emotions fluctuate is a problem that urgently needs to be solved.
Summary

This application provides a method and apparatus for voiceprint recognition. By matching the voice signal to be recognized with the voiceprint template under the same emotion, the influence of the user's mood fluctuations on voiceprint recognition can be reduced, thereby enhancing the robustness of voiceprint recognition.

According to a first aspect, a method for voiceprint recognition is provided, the method including:

obtaining a to-be-recognized voice signal of a user to be recognized;

performing emotion recognition on the voice signal to be recognized, to obtain a first emotion corresponding to the voice signal to be recognized;

obtaining a voiceprint template of a registered user corresponding to the first emotion, where, when the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different;

judging, according to the voice signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user.

Therefore, in this embodiment of the application, emotion recognition is performed on the to-be-recognized voice signal of the user to be recognized to obtain the first emotion of the voice signal to be recognized, the voiceprint template of the registered user in the first emotion is obtained, and voiceprint matching is performed between the voice signal to be recognized and the voiceprint template to judge whether the user to be recognized is the registered user. By matching the voice signal to be recognized with the voiceprint template under the same emotion, this embodiment of the application helps reduce the influence of the user's mood fluctuations on voiceprint recognition, helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.

As an example, the voiceprint template of the registered user can be matched with the voice feature vector of the voice signal to be recognized, to obtain the similarity between the voice feature vector of the voice signal to be recognized and the voiceprint template of the registered user. It can then be judged whether the similarity is higher than a threshold. When the similarity is higher than the threshold, the user to be recognized is judged to be the registered user; at this time, a corresponding operation can be performed in response to the user's request, such as unlocking the smart terminal or opening an application, which is not limited. When the similarity is not higher than the threshold, the user to be recognized is judged not to be the registered user; at this time, the user's request can be rejected, for example by keeping the screen locked or refusing to open the application, which is not limited.

Exemplarily, the first emotion includes at least one of calm, joy, anger, sadness, eagerness, fear, and surprise. That is, the first emotion may be a single one of multiple emotions, such as calm, joy, anger, sadness, eagerness, fear, or surprise; the first emotion may also be a mixed emotion composed of multiple emotions, such as a mixture of calm and joy, or a mixture of anger, eagerness, and sadness, which is not limited in this embodiment of the application. In this embodiment of the application, when the first emotion corresponds to different emotions, the corresponding voiceprint templates are different.
With reference to the first aspect, in some implementations of the first aspect, the obtaining of the voiceprint template of the registered user in the first emotion includes:

obtaining, from the voiceprint templates of the registered user under multiple different emotions, the voiceprint template corresponding to the first emotion, where the multiple different emotions include the first emotion.

That is, in this case the first emotion is a single one of multiple emotions, and voiceprint recognition can be performed by invoking the voiceprint template under that emotion.

Therefore, this embodiment of the application recognizes the emotion of the to-be-recognized voice signal of the user to be recognized, invokes the voiceprint template of the registered user in that emotion, and performs voiceprint matching between the to-be-recognized voice signal and the registered user's voiceprint template in that emotion, to judge whether the user to be recognized is the registered user. By matching the voice signal to be recognized with the voiceprint template under the same emotion, this embodiment of the application helps reduce the influence of the user's mood fluctuations on voiceprint recognition, helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.

With reference to the first aspect, in some implementations of the first aspect, the first emotion is characterized by a weight coefficient of each of at least two emotions.

The obtaining of the voiceprint template of the registered user in the first emotion includes:

determining, from the voiceprint templates of the registered user under multiple different emotions, the voiceprint template corresponding to each of the at least two emotions in the first emotion;

obtaining the voiceprint template corresponding to the first emotion according to the voiceprint template of each emotion and the weight coefficient of each emotion.

That is, in this case the first emotion is a mixed emotion composed of multiple emotions; a mixed voiceprint template corresponding to the first emotion can be generated according to the voiceprint templates corresponding to the multiple emotions of the registered user, and voiceprint matching is then performed according to the mixed voiceprint template.

Therefore, this embodiment of the application identifies the weight coefficient of each emotion contained in the user's current emotion, performs a weighted summation over the corresponding emotions in the registered user's voiceprint template set according to those weight coefficients to obtain a mixed voiceprint template, matches the voice signal to be recognized against the mixed voiceprint template, and judges whether the user to be recognized is the registered user. By matching the voice signal to be recognized with the voiceprint template under the same emotion, this embodiment of the application helps reduce the influence of the user's mood fluctuations on voiceprint recognition, helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.

With reference to the first aspect, in some implementations of the first aspect, the first emotion may also be displayed through a display interface, so that the user learns the emotion corresponding to the current voice signal to be recognized.

With reference to the first aspect, in some implementations of the first aspect, when the first emotion is characterized by the weight coefficient of each of at least two emotions, the displaying of the first emotion through the display interface may include displaying each emotion and the weight coefficient of each emotion through the display interface.

In some possible implementations, when the user is not satisfied with the type of the first emotion, or with the weight coefficient of each emotion in the first emotion, the user's first operation may also be obtained, where the first operation is used to correct the type of the first emotion, or to correct the weight coefficient of each of the at least two emotions in the first emotion. Then, in response to the first operation, the first emotion is updated.

Therefore, this embodiment of the application displays the first emotion to the user, and, when the user is not satisfied with the type of the first emotion or with the weight coefficient of each emotion in the first emotion, the first emotion can be updated with reference to the user's judgment of their own true emotion. This helps to accurately identify the user's current emotional state and to reduce the impact of the user's emotional fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
With reference to the first aspect, in some implementations of the first aspect, before the obtaining of the voiceprint template of the registered user in the first emotion, the method further includes:

acquiring registered voice signals under multiple different emotions;

acquiring, according to the registered voice signals of the multiple different emotions, the voiceprint template of each of the multiple different emotions of the registered user.

Therefore, compared with the prior art, in which only a voiceprint template of the user in a calm state is generated, this embodiment of the application can generate voiceprint templates of the user under different emotions, and the voiceprint templates under the different emotions are different. Therefore, this embodiment of the application can adapt to the user's different emotional changes in the process of voiceprint recognition, which helps to improve the accuracy of voiceprint recognition.

As an implementation, the user's registered voices under different emotions can be directly collected, so that the registered voice signals of the user under the different emotions are obtained.

With reference to the first aspect, in some implementations of the first aspect, at least two preset emotions can be displayed to the user through a display interface; then, a second operation of the user is obtained, where the second operation is used to record the user's voices under the at least two preset emotions; in response to the second operation, the registered voice signals under the at least two preset emotions are acquired, where the registered voice signals under the multiple different emotions include the registered voice signals under the at least two preset emotions.

Exemplarily, the preset emotions may be calm, joy, anger, sadness, eagerness, fear, surprise, or the like, which is not limited in this embodiment of the application.

In this way, the user can be guided, through the interface of the terminal device, to record voices under at least two emotions, so that the registered voice signals of the user under different emotions are obtained.

With reference to the first aspect, in some implementations of the first aspect, the acquiring of registered voice signals under multiple different emotions includes:

acquiring a first registered voice signal;

performing emotion conversion on the first registered voice signal, to acquire the registered voice signals under the multiple different emotions.

Therefore, by collecting the user's registered voice under one emotion, obtaining the user's registered voice signal under that emotion, and performing emotion conversion on the registered voice signal, registered voice signals under multiple different emotions can be obtained.

With reference to the first aspect, in some implementations of the first aspect, the performing of emotion conversion on the first registered voice signal to acquire the registered voice signals under the multiple different emotions includes:

displaying at least two preset emotions to the user through a display interface;

obtaining a third operation of the user, where the third operation is used to select the multiple different emotions from the at least two preset emotions;

in response to the third operation, performing emotion transformation on the first registered voice signal, to acquire the registered voice signals under the multiple different emotions.

In this way, the user can be guided, through the interface of the terminal device, to select the emotions for which emotion conversion is required, so that the registered voice signal is emotion-converted according to the emotion types selected by the user, and the registered voice signals of the user under different emotions are obtained.

With reference to the first aspect, in some implementations of the first aspect, the judging, according to the voice signal to be recognized and the voiceprint template, of whether the user to be recognized is the registered user includes:

performing voiceprint feature extraction on the voice signal to be recognized, to obtain the voiceprint information of the voice signal to be recognized;

judging, according to the voiceprint information and the voiceprint template, whether the user to be recognized is the registered user.

The voiceprint information can identify the feature information of the voice signal to be recognized; therefore, performing voiceprint matching between the voiceprint information of the voice signal to be recognized and the registered user's voiceprint template under the corresponding emotion makes it possible to judge whether the user to be recognized is the registered user.

According to a second aspect, an embodiment of the application provides an apparatus for voiceprint recognition, configured to execute the method in the first aspect or any possible implementation of the first aspect. Specifically, the apparatus includes modules configured to execute the method in the first aspect or any possible implementation of the first aspect.

According to a third aspect, an embodiment of the application provides an apparatus for voiceprint recognition, including: one or more processors; and a memory, configured to store one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method in the first aspect or any possible implementation of the first aspect.

According to a fourth aspect, an embodiment of the application provides a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the first aspect or any possible implementation of the first aspect.

According to a fifth aspect, an embodiment of the application further provides a computer program product containing instructions, which, when run on a computer, causes the computer to execute the method in the first aspect or any possible implementation of the first aspect.

It should be understood that for the beneficial effects achieved by the second to fifth aspects of this application and their corresponding implementations, reference is made to the beneficial effects achieved by the first aspect of this application and its corresponding implementations; details are not repeated here.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a voiceprint recognition method;

FIG. 2 is a schematic diagram of a voiceprint recognition system provided by an embodiment of this application;

FIG. 3 is a specific example of a voiceprint registration process provided by an embodiment of this application;

FIG. 4 is another specific example of a voiceprint registration process provided by an embodiment of this application;

FIG. 5 is a specific example of a voiceprint recognition process provided by an embodiment of this application;

FIG. 6 is another specific example of a voiceprint recognition process provided by an embodiment of this application;

FIG. 7 is an example of a display interface of a terminal device provided by an embodiment of this application;

FIG. 8 is another example of a display interface of a terminal device provided by an embodiment of this application;

FIG. 9 is another example of a display interface of a terminal device provided by an embodiment of this application;

FIG. 10 is another example of a display interface of a terminal device provided by an embodiment of this application;

FIG. 11 is another example of a display interface of a terminal device provided by an embodiment of this application;

FIG. 12 is another example of a display interface of a terminal device provided by an embodiment of this application;

FIG. 13 is another example of a display interface of a terminal device provided by an embodiment of this application;

FIG. 14 is another example of a display interface of a terminal device provided by an embodiment of this application;

FIG. 15 is a schematic flowchart of a voiceprint recognition method provided by an embodiment of this application;

FIG. 16 is a schematic block diagram of a voiceprint recognition apparatus provided by an embodiment of this application;

FIG. 17 is a schematic block diagram of another voiceprint recognition apparatus provided by an embodiment of this application.
Detailed Description

The technical solutions in this application are described below with reference to the accompanying drawings.

FIG. 1 shows a schematic flowchart of a voiceprint recognition method 100. As shown in FIG. 1, voiceprint recognition mainly includes two processes: voiceprint registration and voiceprint confirmation/identification. In the voiceprint registration stage (including step 101, step 102, step 103, and step 104), voiceprint templates of one or more users can be obtained. In the voiceprint confirmation/identification stage (including step 101, step 102, step 103, step 105, and step 106), the voice feature information of an unknown speaker can be obtained, and this feature information is then matched against the known voiceprint templates obtained in the voiceprint registration stage to perform voiceprint confirmation/identification. The voiceprint confirmation/identification stage may also be called the voiceprint recognition stage.

Voiceprint confirmation, that is, speaker verification, is used to judge whether an unknown speaker is a certain designated person. Exemplarily, during voiceprint confirmation, the obtained voice feature information of the unknown speaker can be matched against the voiceprint template of the designated person, to confirm whether the unknown speaker is the designated person.

Voiceprint identification, that is, speaker identification, is used to judge which of the known recorded speakers an unknown speaker is. During voiceprint identification, the obtained voice characteristics of the unknown speaker can be matched against the voiceprint templates of multiple known recorded speakers respectively, to judge which of these known recorded speakers the current speaker is.

Still referring to FIG. 1, in the voiceprint registration stage, signal processing can be performed on the collected voice signal of the user (which may also be called the registered voice signal), for example step 101 (voice detection) and step 102 (voice enhancement), to obtain a processed registered voice signal. As an example, step 101, voice detection, may include voice activation detection, and step 102, voice enhancement, may include voice noise reduction processing, de-reverberation processing, and the like. Then, step 103 is executed on the processed registered voice signal, that is, feature extraction is performed to obtain the feature information of the registered voice signal. Then, step 104 is executed, that is, the feature information of the registered voice signal is trained through a voiceprint model to obtain the voiceprint template of the user.

After the user completes voiceprint registration, the voiceprint template of the user can be obtained. At this time, the user may be called a "registered user".

In addition, the voiceprint templates of at least one user can be obtained in the above manner, that is, the voiceprint templates of at least one registered user are obtained. In some embodiments, a voiceprint template library can be established through the above voiceprint registration process, and the library may include multiple voiceprint templates of different registered users.

In the voiceprint confirmation/identification stage, signal processing can likewise be performed on the collected voice signal of the user (which may also be called the voice signal to be recognized), for example step 101 (voice detection) and step 102 (voice enhancement), to obtain a processed voice signal to be recognized. Then, step 103 is executed, that is, feature extraction is performed on the processed voice signal to be recognized to obtain its feature information. Then, step 105 is executed, that is, voiceprint matching is performed between the feature information of the voice signal to be recognized and the voiceprint template of a registered user. As an example, a similarity score between the feature information of the voice signal to be recognized and the voiceprint template can be obtained. Then, step 106 is executed, that is, according to the similarity score, it is confirmed whether the user to be recognized is a registered user.

In some embodiments, a user's voiceprint template includes the spectrogram features of the user's voice signal. Specifically, the spectrogram of a voice signal is a visual representation of the voice signal, which can show how the amplitude at each frequency point of the voice signal changes over time. As an example, the amplitudes of the voice signal at the various frequency points can be distinguished by color, where the fundamental frequency and harmonics of the speaker's voice appear as bright lines on the spectrogram. During voiceprint matching, the user's voice signal can be processed to obtain its spectrogram, and the similarity between this spectrogram and the spectrogram in the voiceprint template is then compared, finally achieving the purpose of voiceprint recognition.
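As a toy illustration of spectrogram-level comparison (a sketch only: it uses SciPy's STFT-based spectrogram and compares time-averaged log spectra, whereas practical systems typically compare learned embeddings trained from such spectral features):

```python
import numpy as np
from scipy.signal import spectrogram

def spectrogram_similarity(a: np.ndarray, b: np.ndarray,
                           fs: int = 16000) -> float:
    """Cosine similarity between the time-averaged log spectrograms of
    two voice signals sampled at rate fs."""
    _, _, Sa = spectrogram(a, fs)
    _, _, Sb = spectrogram(b, fs)
    # Average over time so utterances of different lengths are comparable.
    va = np.log(Sa + 1e-10).mean(axis=1)
    vb = np.log(Sb + 1e-10).mean(axis=1)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
```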
In the process of voiceprint recognition, the user's emotions can be varied, and these emotions affect the spectral characteristics of the voice produced by the user. This may cause the spectral characteristics of voices produced by the same user under different emotions to differ considerably, thereby affecting the accuracy of voiceprint recognition. For example, if the user performs voiceprint registration in a calm mood, the voiceprint template obtained at that time contains the spectrogram characteristics of the user's voice signal in the calm state. When the user is then in a joyful state, the spectrogram characteristics of the extracted voice signal to be recognized may differ considerably from those in the voiceprint template, which may lead to a low degree of voiceprint matching and affect the accuracy of voiceprint recognition.

In view of this, an embodiment of this application provides an emotion-adaptive voiceprint recognition method, in which voiceprint templates of multiple emotions (or a voiceprint template set) are generated, and voiceprint matching is performed according to the voiceprint templates of the multiple emotions (or the voiceprint template set), thereby realizing emotion-adaptive voiceprint recognition.

As an example, in this embodiment of the application, an emotion may include at least one of calm, joy, anger, sadness, eagerness, fear, surprise, and the like. That is, an emotion may be a single emotion such as calm, joy, anger, sadness, eagerness, fear, or surprise, or a combined or mixed emotion of two or more of them, which is not limited in this embodiment of the application.

Generating the voiceprint templates of multiple emotions (or the voiceprint template set) is completed in the voiceprint registration stage. For example, the user's registered voice signals under different emotions can be recorded, or emotion transformation can be performed on a registered voice signal under one emotion to generate registered voice signals under different emotions. The registered voice signals under the different emotions are then trained to generate voiceprint templates of multiple different emotions.

As an implementation, emotions can be preset in the terminal device, for example multiple preset emotions such as calm, joy, anger, sadness, fear, eagerness, and surprise. In the voiceprint registration stage, a voiceprint template can be generated for each of the multiple different preset emotions, such as a voiceprint template under the calm emotion, a voiceprint template under the joy emotion, a voiceprint template under the anger emotion, a voiceprint template under the sadness emotion, a voiceprint template under the fear emotion, a voiceprint template under the eagerness emotion, and a voiceprint template under the surprise emotion, which is not limited in this embodiment of the application. The voiceprint templates corresponding to different emotions are different.

Performing voiceprint matching according to the voiceprint templates of multiple emotions (or the voiceprint template set) is completed in the voiceprint recognition stage. Exemplarily, emotion recognition can be performed on the voice to be recognized, the corresponding voiceprint template is obtained according to the result of the emotion recognition, and voiceprint matching is then performed according to that voiceprint template. The result of the emotion recognition, that is, the emotion obtained by performing emotion recognition on the voice signal to be recognized, may be called the first emotion. When the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different.

As an implementation, the first emotion may be one of multiple preset different emotions, that is, a single emotion, such as calm, joy, anger, sadness, fear, eagerness, or surprise. In this case, the voiceprint template of the corresponding emotion can be selected from the voiceprint templates of the multiple preset emotions, and voiceprint matching is then performed between the selected voiceprint template and the voiceprint features of the voice signal to be recognized. As a specific example, when the first emotion is joy, the voiceprint template under the joy emotion can be determined as the voiceprint template under the first emotion.

As another implementation, the first emotion may be a mixed emotion composed of multiple preset emotions, such as a mixture of calm and sadness, a mixture of joy and eagerness, or a mixture of anger, sadness, and eagerness. In this case, a mixed voiceprint template of the first emotion can be generated by using the voiceprint templates of the multiple preset emotions, and voiceprint matching is performed between the mixed voiceprint template and the voiceprint features of the voice signal to be recognized. As a specific example, when the first emotion is a mixed emotion of calm and sadness, the mixed voiceprint template under the first emotion can be generated according to the voiceprint template under the calm emotion and the voiceprint template under the sadness emotion.

Since this embodiment of the application matches the voice signal to be recognized with the voiceprint template under the same emotion, it helps reduce the influence of emotional fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
FIG. 2 shows a schematic diagram of a voiceprint recognition system 200 provided by an embodiment of this application. Exemplarily, the system 200 can be applied to the voiceprint recognition function of various terminal devices, such as mobile phones, smart speakers, in-vehicle electronic devices and other smart devices, for the terminal device to confirm the user's identity so as to implement functions such as waking up the device and starting a smart assistant, which is not limited in this embodiment of the application.

As shown in FIG. 2, the system 200 may include a signal processing module 201, an emotion change module 202, a voiceprint template generation module 203, a feature extraction module 204, an emotion recognition module 205, a voiceprint template obtaining module 206, and a voiceprint matching module 207. The arrows in FIG. 2 can be used to indicate the transmission direction of the signal flow.

Among them, the signal processing module 201, the emotion recognition module 205, the voiceprint template obtaining module 206, the feature extraction module 204, and the voiceprint matching module 207 can be used for the voiceprint confirmation/identification process, and the signal processing module 201, the emotion change module 202, and the voiceprint template generation module 203 can be used for the voiceprint registration process. Usually, voiceprint registration needs to be performed before voiceprint confirmation/identification.

The signal processing module 201 is configured to perform signal processing on the obtained voice signal. As an example, processing the signal includes, for example, performing voice activation detection, noise reduction processing, and de-reverberation processing on the signal, to obtain a processed voice signal.

As an example, in the voiceprint registration stage, the signal processing module 201 is configured to perform signal processing on the registered voice signal to obtain a processed registered voice signal; in the voiceprint confirmation/identification stage, the signal processing module 201 is configured to perform signal processing on the voice signal to be recognized to obtain a processed voice signal to be recognized.

In this embodiment of the application, the system 200 may include one or more signal processing modules 201, which is not limited. In specific embodiments, the signal processing module that processes the registered voice signal and the signal processing module that processes the voice signal to be recognized may be the same module or different modules, both of which fall within the protection scope of the embodiments of this application.

The emotion change module 202 is configured to perform emotion change processing on the registered voice signal in the voiceprint registration stage, to obtain registered voice signals under different emotions. As an example, the emotion change module 202 can perform emotion change processing on the registered voice signal processed by the signal processing module 201 to obtain registered voice signals under different emotions. For details of the different emotions, refer to the description above; for brevity, details are not repeated here.

The voiceprint template generation module 203 is configured to perform voiceprint template training according to the registered voice signals corresponding to different emotions, to obtain the voiceprint templates corresponding to the different emotions, that is, the voiceprint templates of multiple emotions.

As an example, the voiceprint template generation module 203 can extract the feature information of a voice signal and perform voiceprint template training on the feature information to generate the corresponding voiceprint template. In some embodiments, voiceprint template training can be performed separately on the registered voice signals under multiple different emotions, so as to obtain the user's voiceprint templates under the different emotions respectively.

The feature extraction module 204 is configured to perform feature extraction on the voice signal to be recognized in the voiceprint confirmation/identification stage, to obtain the feature information of the voice signal to be recognized, that is, the voiceprint information.

The emotion recognition module 205 is configured to perform emotion recognition on the to-be-recognized voice signal of the user to be recognized in the voiceprint confirmation/identification stage, to determine the emotion corresponding to the voice signal to be recognized.

Specifically, fluctuations in the user's emotions affect the spectrogram features of the user's voice. The emotion recognition module 205 can recognize the user's emotion according to the spectrogram features of the obtained voice signal. For example, if the user is eager when recording the voice, the emotion corresponding to the voice signal to be recognized can be recognized as eagerness; if the user is joyful when recording the voice, the emotion corresponding to the voice signal to be recognized can be recognized as joy; and if the user is angry when recording the voice, the emotion corresponding to the voice signal to be recognized can be recognized as anger.

Exemplarily, the emotion recognition module 205 may be a discrete speech emotion classifier, or a dimensional speech emotion predictor, or the like, which is not limited in this embodiment of the application.

The voiceprint template obtaining module 206 is configured to determine, in the voiceprint confirmation/identification stage, the voiceprint template used in voiceprint matching according to the emotion recognition result and the voiceprint templates of multiple emotions. Exemplarily, the voiceprint template obtaining module 206 can obtain, from the voiceprint template library, the voiceprint template of the emotion corresponding to the voice signal to be recognized, or generate, according to the voiceprint templates in the voiceprint template library, the mixed voiceprint template of the emotion corresponding to the voice signal to be recognized.

The voiceprint matching module 207 is configured to perform voiceprint matching in the voiceprint confirmation/identification stage according to the voiceprint template and the feature information of the voice signal to be recognized, to judge whether the user to be recognized is a registered user.

It should be understood that FIG. 2 shows modules or units of the voiceprint recognition system 200, but these modules or units are only examples; the voiceprint recognition apparatus of the embodiments of this application may also include other modules or units, or variations of the modules or units in FIG. 2. In addition, the system in FIG. 2 does not necessarily have to include all of the modules or units shown in FIG. 2.
Below, with reference to the voiceprint recognition system 200 shown in FIG. 2 and FIGS. 3 to 6 below, the voiceprint registration and voiceprint confirmation/identification processes provided by the embodiments of this application are described in detail.

In the voiceprint registration stage, the user's registered voice signals under different emotions can be obtained, and the voiceprint templates under the corresponding emotions are generated according to the registered voice signals under the different emotions. Two ways of generating voiceprint templates under different emotions provided by the embodiments of this application are described below with reference to FIG. 3 and FIG. 4.

FIG. 3 shows a specific example of the voiceprint registration process provided by an embodiment of this application. Here, different emotion changes can be applied to the user's voice to obtain the user's voices under different emotions, and the voiceprint templates of the corresponding emotions are then generated.

As shown in FIG. 3, in the voiceprint registration stage, the registered voice signal input by the user can first be obtained through step 301. Exemplarily, the user can input a segment of voice through the voice acquisition module of the device, and the registered voice signal corresponding to that voice is obtained. This registered voice signal may be called the registered voice signal input by the user.

Optionally, the registered voice signal can be processed by the signal processing module 201 in FIG. 2 to obtain a processed registered voice signal. For details of the processing, refer to the description above; for brevity, details are not repeated here.

It should be noted that the user may input the voice in a calm mood, or in a state of emotional fluctuation such as sadness, anger, or joy, which is not limited in this embodiment of the application.

It should also be noted that, in this embodiment of the application, the voice input by the user may be text-dependent or text-independent, which is not limited in this embodiment of the application.

Then, step 302 can be executed to transform the user's registered voice signal into different emotions. For example, the emotion change module 202 in FIG. 2 performs emotion change on the registered voice signal input by the user to obtain the registered voice signals of the user under various different emotions.

Emotion change directly converts the user's registered voice signal. As an example, emotion change can transform the user's registered voice signal into a registered voice signal under the sadness emotion, a registered voice under the anger emotion, a registered voice under the joy emotion, and so on, which is not limited in this embodiment of the application.

As an example, the user's registered voice signal may be the user's voice signal collected by the device, and may be a time-domain signal that has undergone processing such as endpoint detection, noise reduction, and de-reverberation.

Exemplarily, emotion conversion can be implemented by a spectrum-prosody double-transformation speech emotion conversion algorithm, or by a sparsity-constrained emotional speech conversion algorithm, which is not limited in this embodiment of the application.

In some embodiments, the emotion types can be preset. As an example, the emotion change module 202 can preset four emotions: sadness, anger, joy, and eagerness. In this case, when the emotion change module 202 obtains the registered voice signal input by the user, it can perform emotion conversion on the registered voice signal input by the user, to obtain the user's registered voice signal under the sadness emotion, registered voice signal under the anger emotion, registered voice signal under the joy emotion, and registered voice signal under the eagerness emotion. Optionally, the preset emotion types can also be added, changed, or deleted according to the user's needs.

Then, step 303 can be executed to generate, according to the user's registered voice signals under different emotions, the voiceprint templates under the different emotions. As an example, the voiceprint template of the user under a given emotion can be generated by the voiceprint template generation module 203 in FIG. 2.

In some embodiments, the user's voiceprint templates under different emotions can form a set, which may be called a voiceprint template set of multiple emotions. Exemplarily, the voiceprint template library may contain the multi-emotion voiceprint template sets of multiple registered users.

Therefore, through the above steps 301 to 303, the voiceprint registration of the user can be completed, and the user may now be called a registered user. Moreover, compared with the prior art, in which only a voiceprint template of the user in a calm state is generated, this embodiment of the application can generate voiceprint templates of the user under different emotions, and the voiceprint templates under the different emotions are different. Therefore, this embodiment of the application can adapt to the user's different emotional changes in the process of voiceprint recognition, which helps to improve the accuracy of voiceprint recognition.
FIG. 4 shows another specific example of the voiceprint registration process provided by an embodiment of this application. Here, the user's registered voice signals under different emotions can be collected directly, and the corresponding voiceprint templates are then trained according to the registered voices under the different emotions.

As shown in FIG. 4, in the voiceprint registration stage, at least one registered voice signal input by the user can first be obtained through step 401, where the at least one registered voice signal includes at least one registered voice signal of the user under at least one emotion. That is, the user's registered voices under different emotions can be collected directly to obtain the registered voice signals of the user under the different emotions.

As some possible implementations, when the user performs voiceprint registration, the user can be prompted through the interface of the terminal device to record voices under different emotions, or prompted by voice to record voices under different emotions, which is not limited in this embodiment of the application.

Then, step 402 can be executed, that is, the voiceprint templates under the corresponding emotions are generated according to the registered voice signals under the different emotions. Specifically, step 402 is similar to step 303; refer to the description above, and for brevity, details are not repeated here.

Therefore, through the above steps 401 to 402, the voiceprint registration of the user can be completed, and the user may now be called a registered user. Moreover, compared with the prior art, in which only a voiceprint template of the user in a calm state is generated, this embodiment of the application can generate voiceprint templates of the user under different emotions. Therefore, this embodiment of the application can adapt to the user's different emotional changes in the process of voiceprint recognition, which helps to improve the accuracy of voiceprint recognition.

It should be noted that, in this embodiment of the application, when the system architecture 200 includes the emotion change module 202 and the voiceprint template generation module 203, the system 200 can complete the voiceprint registration process and the voiceprint confirmation/identification process. When the system 200 does not include the emotion change module 202 and the voiceprint template generation module 203, the terminal device containing the system 200 can send the obtained registered voice signal to another device, such as a cloud server, which trains and generates the user's voiceprint template according to the received registered voice signal of the user and then sends the voiceprint template to the terminal device. Specifically, the process by which the cloud server generates the voiceprint template is similar to the process by which the terminal device generates it; refer to the description above, and for brevity, details are not repeated here.

In the voiceprint confirmation/identification stage, according to the recognized emotion of the user's voice signal to be recognized, the voiceprint template of the registered user under that emotion can be obtained, and voiceprint matching is then performed between the feature information of the voice signal to be recognized and the registered user's voiceprint template under that emotion to obtain the voiceprint confirmation/identification result. Two different voiceprint recognition methods provided by the embodiments of this application are described below with reference to FIG. 5 and FIG. 6.
FIG. 5 shows a specific example of the voiceprint recognition process provided by an embodiment of this application. Here, through emotion recognition, it can be judged that the user's current emotional state is a single one of multiple preset different emotions, and voiceprint recognition can then be performed by invoking the voiceprint template under that emotion.

As shown in FIG. 5, in the voiceprint confirmation/identification stage, the voice signal to be recognized input by the user can first be obtained through step 501. Exemplarily, the user can input a segment of voice through the voice acquisition module of the device, and the voice signal to be recognized corresponding to that voice is obtained. Here, this user is the user to be recognized.

Optionally, the voice signal to be recognized can be processed by the signal processing module 201 in FIG. 2 to obtain a processed voice signal to be recognized. For details of the processing, refer to the description above; for brevity, details are not repeated here.

Here, the user may input the voice in a calm mood, or in a state of emotional fluctuation such as sadness, anger, or joy, which is not limited in this embodiment of the application.

Then, step 502 can be executed to perform emotion recognition on the voice signal to be recognized and obtain the current user's first emotion. Here, the first emotion may be one of the preset emotions, such as sadness, anger, or joy. Exemplarily, step 502 can be executed by the emotion recognition module 205 in FIG. 2.

Then, step 503 can be executed to perform voiceprint feature extraction on the voice signal to be recognized, to obtain the voiceprint information of the voice signal to be recognized. Exemplarily, step 503 can be executed by the feature extraction module 204 in FIG. 2.

As an example, the user's voice signal to be recognized may be the user's to-be-recognized voice signal collected by the device, and may be a time-domain signal that has undergone processing such as endpoint detection, noise reduction, and de-reverberation.

As a possible implementation, the feature extraction algorithm used for voiceprint feature extraction in the voiceprint confirmation/identification stage is the same as the feature extraction algorithm used when training and generating the voiceprint template in the voiceprint registration stage.

Then, step 504 can be executed: according to the recognition result of the emotion recognition, the registered user's voiceprint template of the first emotion is retrieved, and a voiceprint decision is made on the voice signal to be recognized, so as to judge the user's identity. As an example, the voiceprint information obtained in step 503 can be matched against the registered user's voiceprint template of the first emotion, to determine whether the user to be recognized is the registered user.

Exemplarily, step 504 can be executed by the voiceprint template obtaining module 206 and the voiceprint matching module 207 in FIG. 2. The voiceprint template obtaining module 206 can obtain, according to the first emotion recognized in step 502, the voiceprint template of the first emotion from the registered user's voiceprint template set. Then, the voiceprint matching module 207 matches the voiceprint information obtained in step 503 against the voiceprint template under the first emotion, to determine whether the user to be recognized is the registered user.

As an example, the voiceprint template of the registered user can be matched with the voice feature vector of the voice signal to be recognized, to obtain the similarity between the voice feature vector of the voice signal to be recognized and the voiceprint template of the registered user. It can then be judged whether the similarity is higher than a threshold. When the similarity is higher than the threshold, the user to be recognized is judged to be the registered user; at this time, a corresponding operation can be performed in response to the user's request, such as unlocking the smart terminal or opening an application, which is not limited. When the similarity is not higher than the threshold, the user to be recognized is judged not to be the registered user; at this time, the user's request can be rejected, for example by keeping the screen locked or refusing to open the application, which is not limited.

Therefore, this embodiment of the application recognizes the emotion of the to-be-recognized voice signal of the user to be recognized, invokes the registered user's voiceprint template under that emotion, and performs voiceprint matching between the feature information of the to-be-recognized voice signal and the registered user's voiceprint template under that emotion, to judge whether the user to be recognized is a registered user. By matching the voice signal to be recognized with the voiceprint template under the same emotion, this embodiment of the application helps reduce the influence of the user's mood fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
FIG. 6 shows another specific example of the voiceprint recognition process provided by an embodiment of this application. Here, through emotion recognition, it is judged that the user's current emotional state is a mixed emotion composed of multiple preset emotions; in this case, a mixed voiceprint template of the user's current emotion can be generated according to the voiceprint templates corresponding to the multiple preset emotions, and voiceprint recognition is then performed according to the mixed voiceprint template.

As shown in FIG. 6, in the voiceprint confirmation/identification stage, the voice signal to be recognized input by the user can first be obtained through step 601. For details of step 601, refer to the description of step 501; for brevity, details are not repeated here.

Then, step 602 can be executed to perform emotion recognition on the voice signal to be recognized and obtain the current user's first emotion. Here, the first emotion is a mixed emotion composed of multiple preset emotions, that is, the first emotion is a combination of two or more of the preset emotions.

In some scenarios, the voice of the user to be recognized often contains multiple emotional factors, such as anger plus eagerness, or joy plus excitement. During emotion recognition it is difficult to determine which single category the current emotion belongs to; in this case, a combination of multiple emotions can be used to describe the user's current emotional state.

As a possible implementation, the first emotion may be characterized by a weight coefficient of each of at least two emotions. Exemplarily, an emotion recognition module can be used to perform emotion recognition on the to-be-recognized voice signal of the user to be recognized, to obtain the weight coefficient of each emotion in the user's current first emotion. Here, the weight coefficient of each emotion can represent the proportion of that emotion in the first emotion. That is, the first emotion can be obtained by multiplying each of the at least two emotions by its weight coefficient and summing the at least two products.

In some embodiments, the weight coefficients of the at least two emotions included in the first emotion can form the weight coefficient vector of the first emotion.

Exemplarily, the weight coefficient vector of the first emotion can be obtained through step 602 and can be expressed as $[W_1, \dots, W_i, \dots, W_N]$, where $W_i$ is the weight coefficient corresponding to the $i$-th emotion, representing the probability that the $i$-th emotion occurs in the voice signal to be recognized, and $N$ represents the total number of emotion types contained in the first emotion. As a specific example, $N$ may be the number of voiceprint templates of different emotions contained in the multi-emotion voiceprint template set, or $N$ may be the number of preset emotion types, where $N$ is a positive integer greater than 1.

For example, the emotion recognition module 205 can recognize that, in the first emotion, the probability of the anger emotion is 60%, the probability of the eagerness emotion is 30%, and the probability of the sadness emotion is 10%; then the weight coefficient of anger can be recorded as 0.6, the weight coefficient of eagerness as 0.3, and the weight coefficient of sadness as 0.1.

Then, step 603 can be executed to perform voiceprint feature extraction on the voice signal to be recognized, to obtain the voiceprint information of the voice signal to be recognized. For details of step 603, refer to the description of step 503; for brevity, details are not repeated here.

Then, step 604 can be executed to generate the mixed voiceprint template. Here, the mixed voiceprint template is the voiceprint template under the first emotion. Exemplarily, from the registered user's voiceprint templates under multiple different emotions, the voiceprint template corresponding to each of the at least two emotions in the first emotion can be determined, and the voiceprint template corresponding to the first emotion is then obtained according to the voiceprint template of each emotion and the weight coefficient of each emotion.

Exemplarily, step 604 can be executed by the voiceprint template obtaining module 206 in FIG. 2. The voiceprint template obtaining module 206 can obtain the registered user's voiceprint template set, and then, according to the weight coefficient of each emotion in the first emotion, that is, the weight coefficient vector of the first emotion, compute a weighted average of the voiceprint templates of those emotions in the template set to obtain the mixed voiceprint template. As an example, the mixed voiceprint template can satisfy the following formula (1):

$$x = \sum_{i=1}^{N} W_i x_i \tag{1}$$

where $x$ denotes the mixed voiceprint template, $x_i$ denotes the voiceprint template corresponding to the $i$-th emotion, and $W_i$ and $N$ are as described above.
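As a numeric check of formula (1) with the example weights above (anger 0.6, eagerness 0.3, sadness 0.1), using toy two-dimensional templates whose values are invented purely for illustration:

```python
import numpy as np

templates = {
    "anger":     np.array([0.9, 0.1]),
    "eagerness": np.array([0.4, 0.6]),
    "sadness":   np.array([0.2, 0.8]),
}
weights = {"anger": 0.6, "eagerness": 0.3, "sadness": 0.1}

# x = sum_i W_i * x_i (formula (1)); since the weights sum to 1, this is
# the weighted average of the per-emotion voiceprint templates.
x = sum(w * templates[e] for e, w in weights.items())
print(x)  # [0.68 0.32]
```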
Then, step 605 can be executed to make a voiceprint decision on the voice signal to be recognized according to the mixed voiceprint template obtained in step 604. Exemplarily, step 605 can be executed by the voiceprint matching module 207 in FIG. 2. The voiceprint matching module 207 can match the voiceprint information obtained in step 603 against the mixed voiceprint template, to determine whether the user to be recognized is the registered user.

As an example, the voiceprint template of the registered user can be matched with the voice feature vector of the voice signal to be recognized, to obtain the similarity between the voice feature vector of the voice signal to be recognized and the voiceprint template of the registered user. It can then be judged whether the similarity is higher than a threshold. When the similarity is higher than the threshold, the user to be recognized is judged to be the registered user; at this time, a corresponding operation can be performed in response to the user's request, such as unlocking the smart terminal or opening an application, which is not limited. When the similarity is not higher than the threshold, the user to be recognized is judged not to be the registered user; at this time, the user's request can be rejected, for example by keeping the screen locked or refusing to open the application, which is not limited.

Therefore, this embodiment of the application identifies the weight coefficient of each emotion contained in the user's current emotion, performs a weighted summation over the corresponding emotions in the registered user's voiceprint template set according to those weight coefficients to obtain a mixed voiceprint template, matches the feature information of the voice signal to be recognized against the mixed voiceprint template, and judges whether the user to be recognized is a registered user. By matching the voice signal to be recognized with the voiceprint template under the same emotion, this embodiment of the application helps reduce the influence of the user's mood fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
Below, with reference to FIGS. 7 to 14, the voiceprint recognition method provided by the embodiments of this application is described from the perspective of a user using a terminal device.

First, voiceprint registration is performed to generate the user's voiceprint model.

As an example, when the user uses the voiceprint recognition function of the terminal device for the first time, the terminal device can ask the user whether voiceprint registration is required. As a specific example, refer to FIG. 7, which shows an example of the display interface of a terminal device. As shown in FIG. 7, "whether to register a voiceprint template" can be displayed on the display interface of the terminal device. Optionally, the terminal device can also display two virtual keys, "Yes" and "No", for obtaining the user's operation. When the user inputs the "Yes" operation, in response to that operation the terminal device can enter the interface for recording the user's voice; when the user inputs the "No" operation, in response to that operation the terminal device exits the interface.

In some embodiments, the terminal device can also obtain the user's operation through physical buttons. For example, when the user selects the "confirm" button, the interface for recording the user's registered voice can be entered, and when the user selects the "return" button, the interface shown in FIG. 7 is exited.

When the terminal device has no display interface, or while the terminal device displays the interface shown in FIG. 7, the terminal device can give the user a voice prompt, for example playing "whether to register a voiceprint template" or another voice through an audio player, which is not limited in this embodiment of the application.

As another example, the user can also choose, in the security settings, to add a new voiceprint template for voiceprint recognition. As a specific example, refer to FIG. 8, which shows an example of a terminal device. As shown in FIG. 8, the user can input an operation to enter "Voiceprint" through the security and privacy display interface on the left side of FIG. 8. In response to that operation, the display interface can present the interface shown on the right side of FIG. 8. At this time, the user can input a "create new voiceprint template" operation, and in response to that operation the terminal device can enter the interface for recording the user's voice.

In this embodiment of the application, two methods can be used to obtain the user's registered voice signals under multiple different emotions. As a specific example, refer to FIG. 9, which shows an example of the display interface of a terminal device. As shown in FIG. 9, "Please select the voice recording method" can be displayed on the display interface of the terminal device, together with the two voice recording methods, namely "record voices in multiple emotions" and "record a voice in one emotion". When the user performs the operation of selecting "record voices in multiple emotions", in response to that operation the terminal device can enter the interface for recording multiple voices. When the user inputs the operation of selecting "record a voice in one emotion", in response to that operation the terminal device enters the interface for recording a voice in one emotion.

FIG. 10 shows an example of an interface for recording the user's voice. As shown in (a) of FIG. 10, after the user chooses to record voices in multiple emotions, "Please select the emotion for the recorded voice" can be displayed on the display interface, together with the emotions preset in the terminal device, such as calm, sadness, joy, fear, anger, and eagerness, but this embodiment of the application is not limited thereto. Correspondingly, after seeing the display interface, the user can perform an operation of selecting an emotion, for example selecting the "fear" emotion. Exemplarily, the user can select the emotion in which they wish to record a voice according to their own mood.

In response to the user's emotion-selection operation, the interface shown in (b) of FIG. 10 can be displayed to the user. Taking the user's selection of the "fear" emotion as an example, the prompt "please record a voice in the fear emotion" and a "start recording" virtual button can be displayed on the interface at this time. The user can then perform the operation of recording a voice in the fear emotion; for example, the user can long-press the "start recording" virtual button and at the same time input a segment of voice in the fear emotion. In response to the user's operation of recording the voice in the fear emotion, the terminal device can obtain, through a voice acquisition module (for example, a microphone component), the registered voice signal in the fear emotion entered by the user.

It should be noted that the above description only takes the user's recording of a voice in the fear emotion as an example; the user can also record voices in other emotions in the same way, which is not limited in this embodiment of the application. In addition, this embodiment of the application does not limit the time and sequence in which the user records voices in a certain emotion; for example, the user can record voices in different emotions at different times, all of which fall within the protection scope of the embodiments of this application.

In FIG. 10, the operation performed by the user of selecting a preset emotion and recording a voice under that preset emotion can be called operation #1, that is, operation #1 is used to record the user's voice under the preset emotion, but this embodiment of the application is not limited thereto.

In some embodiments, when the terminal device has no display interface, or while the terminal device displays the interface shown in FIG. 10, the user can also be given voice prompts, for example playing "Please select the emotion for the recorded voice", "Please record a voice in the fear emotion", or other voices through an audio player, which is not limited in this embodiment of the application.

After the terminal device obtains the user's registered voice signals under different emotions, it can perform signal processing on the registered voice signals under the different emotions, such as voice activation detection, voice noise reduction processing, and de-reverberation processing, which is not limited in this embodiment of the application.
FIG. 11 shows another example of an interface for recording the user's voice. As shown in (a) of FIG. 11, after the user chooses to record a voice in one emotion, "Please select the emotions for emotion conversion" can be displayed on the display interface, together with the emotions preset in the terminal device, such as calm, sadness, joy, fear, anger, and eagerness, but this embodiment of the application is not limited thereto. Correspondingly, after seeing the display interface, the user can perform operation #2 of selecting multiple different emotions from the at least two preset emotions, for example selecting the emotions "calm", "joy", and "fear".

In response to the user's emotion-selection operation #2, the interface shown in (b) of FIG. 11 can be displayed to the user. At this time, the prompt "please record a voice" and a "start recording" virtual button can be displayed on the interface. The user can then perform the operation of recording a voice; for example, the user can long-press the "start recording" virtual button and at the same time input a segment of voice. In response to the user's voice recording operation, the terminal device can obtain the registered voice signal entered by the user through a voice acquisition module (for example, a microphone component). It should be noted that the type of emotion in which the user records this voice is not limited here.

Optionally, after obtaining the registered voice signal, the terminal device can perform signal processing on the registered voice signal, such as voice activation detection, voice noise reduction processing, and de-reverberation processing, which is not limited in this embodiment of the application.

Then, the terminal device can perform emotion conversion on the registered voice signal, transforming it into registered voice signals in the at least two emotions selected in FIG. 11, that is, obtaining registered voice signals for multiple emotions of the user. As an example, the emotion change module 202 in FIG. 2 can be used to perform emotion change on the registered voice signal.

Then, according to the user's registered voice signals under the multiple emotions, the voiceprint templates of the user's multiple emotions can be generated. As an example, the voiceprint templates under the multiple emotions can be generated by the voiceprint template generation module 203 in FIG. 2.

For details of the emotion conversion and the process of generating the voiceprint templates, refer to the description above; for brevity, details are not repeated here.

After the voiceprint registration is completed, voiceprint recognition can be performed on the user to be recognized.

As an example, when the user turns on the terminal device, or enables certain functions of the terminal device that require security verification, the terminal device can prompt the user that voiceprint verification is required. As an example, the terminal device can enter the interface for recording the tester's test voice. FIG. 12 shows another example of an interface for recording the user's voice. As shown in FIG. 12, "Please record a voice for voiceprint verification" can be displayed on the display interface. Optionally, the terminal device can also display a "start recording" virtual button on the interface. When the user chooses to record a voice, the user can click or long-press the "start recording" virtual button, and input a segment of to-be-recognized voice after clicking the "start recording" button or while long-pressing it. In response to the user's voice input operation, the terminal device can obtain the user's voice signal to be recognized through a voice acquisition module (for example, a microphone component).

In some embodiments, when the terminal device has no display interface, or while the terminal device displays the interface shown in FIG. 12, the terminal device can also give the user a voice prompt, for example playing "Please record a segment of voice for voiceprint verification" or another voice through an audio player, which is not limited in this embodiment of the application.

After obtaining the user's voice signal to be recognized, the terminal device can perform signal processing on the voice signal to be recognized, such as voice activation detection, voice noise reduction processing, and de-reverberation processing, which is not limited in this embodiment of the application.
After the terminal device obtains the voice signal to be recognized, on the one hand, it can perform feature extraction on the voice signal to be recognized to obtain the voiceprint information of the voice signal to be recognized. As an example, feature extraction can be performed on the voice signal to be recognized by the feature extraction module 204 in FIG. 2. On the other hand, emotion recognition can be performed on the voice signal to be recognized to obtain the first emotion corresponding to the voice signal to be recognized. As an example, emotion recognition can be performed on the voice signal to be recognized by the emotion recognition module 205 in FIG. 2.

In some embodiments, the first emotion, that is, the detected emotion of the user, can be displayed to the user through a display interface. Diagram (a) in FIG. 13 shows an example of an interface displaying the first emotion, where the first emotion is one of the preset emotions. Diagram (a) in FIG. 14 shows another example of an interface displaying the first emotion, where the first emotion is characterized by the weight coefficient of each of at least two emotions; in this case, each of the at least two preset emotions and the weight coefficient of each preset emotion can be displayed on the display interface. For example, as shown in (a) of FIG. 14, in the first emotion the weight coefficient of anger is 0.6, the weight coefficient of eagerness is 0.3, the weight coefficient of sadness is 0.1, and the weight coefficients of the remaining emotions are 0.

In some optional embodiments, when the user is not satisfied with the type of the first emotion displayed on the display interface, or with the weight coefficient of each of the at least two emotions in the first emotion, operation #3 can be performed, that is, correcting the type of the first emotion, or modifying the weight coefficient of each of the at least two emotions in the first emotion. After obtaining the user's operation #3, the terminal device can update the first emotion according to operation #3.

As a specific example, in FIG. 13, when the user performs the modification operation, the interface shown in (b) of FIG. 13 can be displayed to the user; at this time, selectable emotion types, such as eagerness or calm, can be provided for the user to choose from. As a possible implementation, the emotions available for the user to select in (b) of FIG. 13 may be the emotion types that can be obtained when performing emotion recognition on the voice signal to be recognized.

As another specific example, in FIG. 14, when the user performs the modification operation, the interface shown in (b) of FIG. 14 can be displayed to the user; at this time, the user can choose to change the weight coefficient of each emotion.

When the first emotion is one of the preset emotions, the registered user's voiceprint template under the first emotion can be directly called, and the voiceprint template is matched with the voiceprint information of the voice signal to be recognized, to judge whether the user to be recognized is a registered user.

As an example, the registered user's voiceprint template under the first emotion can be obtained through the voiceprint template obtaining module 206 in FIG. 2, and the voiceprint matching module 207 matches the voiceprint template against the voiceprint information of the voice signal to be recognized and obtains the matching result.

When the first emotion is a combination of multiple preset emotions, the weight coefficient vector corresponding to the first emotion can be determined, and the registered user's registered voiceprint templates of the different emotions are weighted by the weight coefficient vector to obtain the mixed voiceprint template. Then, the mixed voiceprint template is matched with the voiceprint information of the voice signal to be recognized, to judge whether the user to be recognized is a registered user.

As an example, the mixed voiceprint template can be obtained through the voiceprint template obtaining module 206 in FIG. 2, and the voiceprint matching module 207 matches the mixed voiceprint template against the voiceprint information of the voice signal to be recognized and obtains the matching result.

For details of the voiceprint recognition process, refer to the description above; for brevity, details are not repeated here.

Since this embodiment of the application matches the feature information of the voice signal to be recognized against the voiceprint template under the same emotion, it helps reduce the influence of the user's mood fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
FIG. 15 shows a schematic flowchart of a voiceprint recognition method provided by an embodiment of this application. The method can be executed by the system 200 in FIG. 2 and includes steps 710 to 740.

Step 710: obtain a to-be-recognized voice signal of a user to be recognized.

Step 720: perform emotion recognition on the voice signal to be recognized, to obtain a first emotion corresponding to the voice signal to be recognized.

Step 730: obtain a voiceprint template of a registered user corresponding to the first emotion, where, when the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different.

Step 740: judge, according to the voice signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user.

Therefore, in this embodiment of the application, emotion recognition is performed on the to-be-recognized voice signal of the user to be recognized to obtain the first emotion of the voice signal to be recognized, the voiceprint template of the registered user in the first emotion is obtained, and voiceprint matching is performed between the voice signal to be recognized and the voiceprint template to judge whether the user to be recognized is a registered user. By matching the voice signal to be recognized with the voiceprint template under the same emotion, this embodiment of the application helps reduce the influence of the user's mood fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.

As an example, the voiceprint template of the registered user can be matched with the voice feature vector of the voice signal to be recognized, to obtain the similarity between the voice feature vector of the voice signal to be recognized and the voiceprint template of the registered user. It can then be judged whether the similarity is higher than a threshold. When the similarity is higher than the threshold, the user to be recognized is judged to be the registered user; at this time, a corresponding operation can be performed in response to the user's request, such as unlocking the smart terminal or opening an application, which is not limited. When the similarity is not higher than the threshold, the user to be recognized is judged not to be the registered user; at this time, the user's request can be rejected, for example by keeping the screen locked or refusing to open the application, which is not limited.

In some possible implementations, the obtaining of the voiceprint template of the registered user in the first emotion includes:

obtaining, from the voiceprint templates of the registered user under multiple different emotions, the voiceprint template corresponding to the first emotion, where the multiple different emotions include the first emotion.

That is, in this case the first emotion can be a single one of multiple preset emotions, and voiceprint recognition can be performed by invoking the voiceprint template under that emotion. For details, refer to the description of FIG. 5 above; for brevity, details are not repeated here.

Therefore, this embodiment of the application recognizes the emotion of the to-be-recognized voice signal of the user to be recognized, invokes the registered user's voiceprint template under that emotion, and performs voiceprint matching between the to-be-recognized voice signal and the registered user's voiceprint template under that emotion, to judge whether the user to be recognized is a registered user. By matching the voice signal to be recognized with the voiceprint template under the same emotion, this embodiment of the application helps reduce the influence of the user's mood fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.

In some possible implementations, the first emotion is characterized by a weight coefficient of each of at least two emotions.

The obtaining of the voiceprint template of the registered user in the first emotion includes:

determining, from the voiceprint templates of the registered user under multiple different emotions, the voiceprint template corresponding to each of the at least two emotions in the first emotion;

obtaining the voiceprint template corresponding to the first emotion according to the voiceprint template of each emotion and the weight coefficient of each emotion.

That is, in this case the first emotion can be a mixed emotion composed of multiple preset emotions; the mixed voiceprint template corresponding to the first emotion can be generated according to the voiceprint templates corresponding to the multiple preset emotions of the registered user, and voiceprint matching is then performed according to the mixed voiceprint template. For details, refer to the description of FIG. 6 above; for brevity, details are not repeated here.

Therefore, this embodiment of the application identifies the weight coefficient of each emotion contained in the user's current emotion, performs a weighted summation over the corresponding emotions in the registered user's voiceprint template set according to those weight coefficients to obtain a mixed voiceprint template, matches the voice signal to be recognized against the mixed voiceprint template, and judges whether the user to be recognized is a registered user. By matching the voice signal to be recognized with the voiceprint template under the same emotion, this embodiment of the application helps reduce the influence of the user's mood fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.

In some possible implementations, the first emotion can also be displayed through a display interface.

In some possible implementations, when the first emotion is characterized by the weight coefficient of each of at least two emotions, the displaying of the first emotion through the display interface may include displaying each emotion and the weight coefficient of each emotion through the display interface.

In some possible implementations, when the user is not satisfied with the type of the first emotion, or with the weight coefficient of each emotion in the first emotion, the user's first operation can also be obtained, where the first operation is used to correct the type of the first emotion, or to correct the weight coefficient of each of the at least two emotions in the first emotion. Then, in response to the first operation, the first emotion is updated.

For details of displaying the first emotion and updating the first emotion, refer to the descriptions of FIG. 13 and FIG. 14 above; for brevity, details are not repeated here.

Therefore, this embodiment of the application displays the first emotion to the user, and, when the user is not satisfied with the type of the first emotion or with the weight coefficient of each emotion in the first emotion, the first emotion can be updated with reference to the user's judgment of their own true emotion. This helps to accurately identify the user's current emotional state and to reduce the impact of the user's emotional fluctuations on voiceprint recognition, thereby helping the user obtain a consistent voiceprint recognition experience under different emotions and enhancing the robustness of voiceprint recognition.
在一些可能的实现方式中,所述获取已注册用户在所述第一情绪下的声纹模板之前,还包括:
获取多种不同情绪下的注册语音信号;
根据所述多种不同情绪的注册语音信号,获取所述已注册用户在所述多种不同情绪中的每种情绪的声纹模板。
因此,相对于现有技术中只生成用户在情绪平静状态下的声纹模板,本申请实施例能够生成用户在不同的情绪下的声纹模板,并且该不同情绪下的声纹模板不同。因此,本申请实施例能够在声纹识别的过程中,适配用户不同的情绪变化,有助于提升声纹识别的准确率。
作为一种实现方式,可以直接采集用户在不同情绪下的注册语音,获取该用户的不同情绪下的注册语音信号。具体的,可以参见上文图4中的描述,为了简洁,这里不再赘述。
在一些可能的实现方式中,所述获取多种不同情绪下的注册语音信号,包括:
通过显示界面向用户显示至少两种预设情绪;
获取用户的第二操作,所述第二操作用于录入用户在所述至少两种预设情绪下的语音;
响应于所述第二操作,获取所述至少两种预设情绪下的注册语音信号,其中,所述多种不同情绪下的注册语音信号包括所述至少两种预设情绪下的注册语音信号。
具体的,可以参见上文图10中的描述,为了简洁,这里不再赘述。
In some possible implementations, the obtaining enrollment speech signals under multiple different emotions includes:
obtaining a first enrollment speech signal; and
performing emotion conversion on the first enrollment speech signal to obtain the enrollment speech signals under the multiple different emotions.
For details, refer to the description of FIG. 3 above; for brevity, it is not repeated here. A sketch of this conversion-based enrollment flow is given below.
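The emotion-conversion model itself is outside what the disclosure specifies; the sketch below therefore shows only the surrounding control flow, with convert_emotion as a hypothetical stand-in that a real system would replace with a trained prosody/voice-conversion network.

```python
import numpy as np

def convert_emotion(waveform: np.ndarray, target_emotion: str) -> np.ndarray:
    """Hypothetical emotion-conversion model. A real implementation would be a
    trained prosody/voice-conversion network; this stub returns the input
    unchanged so that only the control flow is demonstrated."""
    del target_emotion  # unused by the stub
    return waveform.copy()

def enroll_by_conversion(first_signal: np.ndarray,
                         target_emotions: list[str]) -> dict[str, np.ndarray]:
    """Derive per-emotion enrollment signals from one recorded signal."""
    return {e: convert_emotion(first_signal, e) for e in target_emotions}

first_enrollment_signal = np.zeros(16000)  # placeholder for the recorded signal
signals = enroll_by_conversion(first_enrollment_signal, ["joy", "anger", "sadness"])
```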
In some possible implementations, the performing emotion conversion on the first enrollment speech signal to obtain the enrollment speech signals under the multiple different emotions includes:
displaying at least two preset emotions to the user on a display interface;
obtaining a third operation of the user, where the third operation is used to select the multiple different emotions from the at least two preset emotions; and
performing, in response to the third operation, emotion conversion on the first enrollment speech signal to obtain the enrollment speech signals under the multiple different emotions.
For details, refer to the description of FIG. 11 above; for brevity, it is not repeated here.
In some possible implementations, the determining, according to the speech signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user includes:
performing voiceprint feature extraction on the speech signal to be recognized to obtain voiceprint information of the speech signal to be recognized; and
determining, according to the voiceprint information and the voiceprint template, whether the user to be recognized is the registered user.
A sketch of the feature-extraction step is given below.
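The disclosure does not commit to a particular voiceprint feature. One conventional choice, shown here purely as an assumed example using the librosa library, is to summarize the utterance by its time-averaged MFCC vector; production systems more often use neural speaker embeddings.

```python
import numpy as np
import librosa

def extract_voiceprint_info(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Summarize an utterance as its time-averaged MFCC vector, used here as an
    assumed stand-in for the voiceprint information."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=20)  # shape (20, frames)
    return mfcc.mean(axis=1)

# Example on a synthetic one-second signal.
info = extract_voiceprint_info(np.random.default_rng(2).standard_normal(16000))
```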
In some possible implementations, the first emotion includes at least one of calm, joy, anger, sadness, eagerness, fear, and surprise.
The voiceprint recognition method provided by the embodiments of this application has been described in detail above with reference to FIG. 1 to FIG. 15; the voiceprint recognition apparatus of the embodiments of this application is introduced below with reference to FIG. 16 and FIG. 17. It should be understood that the voiceprint recognition apparatuses in FIG. 16 and FIG. 17 can perform each step of the voiceprint recognition methods in the embodiments of this application; to avoid repetition, duplicated descriptions are omitted as appropriate below.
FIG. 16 is a schematic block diagram of a voiceprint recognition apparatus according to an embodiment of this application. The voiceprint recognition apparatus 800 in FIG. 16 includes a first obtaining unit 810, an emotion recognition unit 820, a second obtaining unit 830, and a judging unit 840.
Specifically, when the voiceprint recognition apparatus 800 performs the voiceprint recognition method, the first obtaining unit 810 is configured to obtain a speech signal to be recognized of a user to be recognized; the emotion recognition unit 820 is configured to perform emotion recognition on the speech signal to be recognized to obtain a first emotion corresponding to it; the second obtaining unit 830 is configured to obtain a voiceprint template of a registered user under the first emotion, where when the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different; and the judging unit 840 is configured to determine, according to the speech signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user.
In some possible implementations, the second obtaining unit 830 is specifically configured to obtain, from voiceprint templates of the registered user under multiple different emotions, the voiceprint template corresponding to the first emotion, where the multiple different emotions include the first emotion.
In some possible implementations, the first emotion is characterized by a weight coefficient for each of at least two emotions. In this case, the second obtaining unit 830 is specifically configured to determine, from voiceprint templates of the registered user under multiple different emotions, the voiceprint template corresponding to each of the at least two emotions in the first emotion, and then obtain, according to the voiceprint template of each emotion and the weight coefficient of each emotion, the voiceprint template corresponding to the first emotion.
In some possible implementations, the apparatus 800 further includes a display interface configured to display the first emotion.
In some possible implementations, when the first emotion is characterized by a weight coefficient for each of at least two emotions, the display interface is specifically configured to display each emotion and its weight coefficient.
In some possible implementations, the apparatus 800 further includes a third obtaining unit configured to obtain a first operation of the user, where the first operation is used to correct the type of the first emotion or to correct the weight coefficient of each of the at least two emotions in the first emotion. The emotion recognition unit 820 is further configured to update the first emotion in response to the first operation.
In some possible implementations, the apparatus 800 further includes a fourth obtaining unit configured to obtain enrollment speech signals under multiple different emotions, and may further include a fifth obtaining unit configured to obtain, according to those enrollment speech signals, a voiceprint template of the registered user for each of the multiple different emotions.
In one possible implementation, the fourth obtaining unit may be the same unit as the first obtaining unit, although the embodiments of this application are not limited in this respect.
In some possible implementations, the fourth obtaining unit is specifically configured to display at least two preset emotions to the user on a display interface; obtain a second operation of the user, where the second operation is used to record the user's speech under the at least two preset emotions; and obtain, in response to the second operation, enrollment speech signals under the at least two preset emotions, where the enrollment speech signals under the multiple different emotions include the enrollment speech signals under the at least two preset emotions.
In some possible implementations, the fourth obtaining unit is specifically configured to obtain a first enrollment speech signal and then perform emotion conversion on it to obtain the enrollment speech signals under the multiple different emotions.
In some possible implementations, the fourth obtaining unit is specifically configured to display at least two preset emotions to the user on a display interface; obtain a third operation of the user, where the third operation is used to select the multiple different emotions from the at least two preset emotions; and perform, in response to the third operation, emotion conversion on the first enrollment speech signal to obtain the enrollment speech signals under the multiple different emotions.
In some possible implementations, the judging unit 840 is specifically configured to perform voiceprint feature extraction on the speech signal to be recognized to obtain its voiceprint information, and then determine, according to the voiceprint information and the voiceprint template, whether the user to be recognized is the registered user.
In some possible implementations, the first emotion includes at least one of calm, joy, anger, sadness, eagerness, fear, and surprise.
FIG. 17 is a schematic structural diagram of a voiceprint recognition apparatus according to an embodiment of this application. As an example, the voiceprint recognition apparatus may be a terminal device. As shown in FIG. 17, the voiceprint recognition apparatus includes a communication module 910, a sensor 920, a user input module 930, an output module 940, a processor 950, an audio/video input module 960, a memory 970, and a power supply 980.
The communication module 910 may include at least one module enabling communication between this computer system and a communication system or another computer system. For example, the communication module 910 may include one or more of a wired network interface, a broadcast receiving module, a mobile communication module, a wireless Internet module, a local-area communication module, and a location (or positioning) information module. Each of these modules has multiple implementations in the prior art, which are not described one by one in this application.
The sensor 920 can sense the current state of the system, such as open/closed state, position, contact with the user, orientation, and acceleration/deceleration, and can generate sensing signals used to control the operation of the system.
The user input module 930 is configured to receive input digit information, character information, contact touch operations, or contactless gestures, and to receive signal input related to user settings and function control of the system, among others. The user input module 930 includes a touch panel and/or other input devices.
For example, the user input module 930 may be configured to obtain a first operation input by the user, where the first operation is used to correct the type of the first emotion or to correct the weight coefficient of each of the at least two emotions in the first emotion.
As another example, the user input module 930 may be configured to obtain a second operation input by the user, where the second operation is used to record the user's speech under the at least two preset emotions.
As another example, the user input module 930 may be configured to obtain a third operation input by the user, where the third operation is used to select the multiple different emotions from at least two preset emotions.
The output module 940 includes a display panel configured to display information input by the user, information provided to the user, various menu interfaces of the system, and the like. Optionally, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. In some other embodiments, the touch panel may cover the display panel to form a touch display screen. In addition, the output module 940 may further include an audio output module, an alarm, a haptic module, and the like.
For example, the output module 940 is configured to display the first emotion to the user on the display screen, for example the type of the first emotion, or the weight coefficient of each of the at least two emotions in the first emotion.
As another example, the output module 940 may be configured to display or prompt on the display screen whether to enroll a voiceprint template, or to prompt the user to select the emotion for speech recording, or to prompt the user to select the emotion for emotion conversion, among others; the embodiments of this application are not limited in this respect.
The audio/video input module 960 is configured to input an audio or video signal and may include a camera and a microphone.
The power supply 980 can receive external and internal power under the control of the processor 950 and provide the power required for the operation of the components of the system.
The processor 950 may refer to one or more processors; for example, the processor 950 may include one or more central processing units, or a central processing unit and a graphics processing unit, or an application processor and a coprocessor (for example, a micro control unit). When the processor 950 includes multiple processors, these processors may be integrated on the same chip or may each be an independent chip. A processor may include one or more physical cores, where a physical core is the smallest processing module.
For example, the processor 950 is configured to obtain a speech signal to be recognized of a user to be recognized and perform emotion recognition on it to obtain the corresponding first emotion. The processor 950 then obtains a voiceprint template of a registered user under the first emotion, where when the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different. The processor 950 then determines, according to the speech signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user.
As another example, the processor 950 is further configured to obtain enrollment speech signals under multiple different emotions and then obtain, according to them, a voiceprint template of the registered user for each of the multiple different emotions.
As another example, the processor 950 is further configured to update the first emotion in response to a first operation of the user; or to obtain, in response to a second operation of the user, the enrollment speech signals under the at least two preset emotions; or to perform, in response to a third operation of the user, emotion conversion on the first enrollment speech signal to obtain the enrollment speech signals under the multiple different emotions.
The memory 970 stores a computer program, which includes an operating system program 972, application programs 971, and the like. Typical operating systems include systems for desktops and laptops such as Microsoft's Windows or Apple's MacOS, and systems for mobile terminals such as the Android® system developed by Google on the basis of Linux®. The methods provided in the foregoing embodiments may be implemented in software and can be regarded as a specific implementation of an application program 971.
The memory 970 may be one or more of the following types: flash memory, hard disk type memory, micro multimedia card type memory, card type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, or an optical disc. In some other embodiments, the memory 970 may also be a network storage device on the Internet, and the system may perform operations such as updating or reading on the memory 970 over the Internet.
The processor 950 is configured to read the computer program in the memory 970 and then execute the method defined by it; for example, the processor 950 reads the operating system program 972 to run the operating system on the system and implement its various functions, or reads one or more application programs 971 to run applications on the system.
The memory 970 also stores data 973 other than the computer program, for example the voiceprint templates, speech signals to be recognized, and enrollment speech signals involved in this application.
The connection relationships among the modules in FIG. 17 are only an example; the method provided in any embodiment of this application may also be applied to voiceprint recognition apparatuses with other connection modes, for example with all modules connected through a bus.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above is only the specific implementation of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed in this application, and these shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (25)

  1. A voiceprint recognition method, characterized by comprising:
    obtaining a speech signal to be recognized of a user to be recognized;
    performing emotion recognition on the speech signal to be recognized to obtain a first emotion corresponding to the speech signal to be recognized;
    obtaining a voiceprint template of a registered user under the first emotion, wherein when the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different; and
    determining, according to the speech signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user.
  2. The method according to claim 1, characterized in that the obtaining a voiceprint template of a registered user under the first emotion comprises:
    obtaining, from voiceprint templates of the registered user under multiple different emotions, the voiceprint template corresponding to the first emotion, wherein the multiple different emotions comprise the first emotion.
  3. The method according to claim 1, characterized in that the first emotion is characterized by a weight coefficient for each of at least two emotions;
    wherein the obtaining a voiceprint template of a registered user under the first emotion comprises:
    determining, from voiceprint templates of the registered user under multiple different emotions, the voiceprint template corresponding to each of the at least two emotions in the first emotion; and
    obtaining, according to the voiceprint template of each emotion and the weight coefficient of each emotion, the voiceprint template corresponding to the first emotion.
  4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
    displaying the first emotion on a display interface.
  5. The method according to claim 4, characterized in that when the first emotion is characterized by a weight coefficient for each of at least two emotions, the displaying the first emotion on a display interface comprises:
    displaying each emotion and the weight coefficient of each emotion on the display interface.
  6. The method according to claim 4 or 5, characterized by further comprising:
    obtaining a first operation of a user, wherein the first operation is used to correct the type of the first emotion, or to correct the weight coefficient of each of the at least two emotions in the first emotion; and
    updating the first emotion in response to the first operation.
  7. The method according to any one of claims 1 to 6, characterized in that before the obtaining a voiceprint template of a registered user under the first emotion, the method further comprises:
    obtaining enrollment speech signals under multiple different emotions; and
    obtaining, according to the enrollment speech signals under the multiple different emotions, a voiceprint template of the registered user for each of the multiple different emotions.
  8. The method according to claim 7, characterized in that the obtaining enrollment speech signals under multiple different emotions comprises:
    displaying at least two preset emotions to a user on a display interface;
    obtaining a second operation of the user, wherein the second operation is used to record the user's speech under the at least two preset emotions; and
    obtaining, in response to the second operation, enrollment speech signals under the at least two preset emotions, wherein the enrollment speech signals under the multiple different emotions comprise the enrollment speech signals under the at least two preset emotions.
  9. The method according to claim 7, characterized in that the obtaining enrollment speech signals under multiple different emotions comprises:
    obtaining a first enrollment speech signal; and
    performing emotion conversion on the first enrollment speech signal to obtain the enrollment speech signals under the multiple different emotions.
  10. The method according to claim 9, characterized in that the performing emotion conversion on the first enrollment speech signal to obtain the enrollment speech signals under the multiple different emotions comprises:
    displaying at least two preset emotions to a user on a display interface;
    obtaining a third operation of the user, wherein the third operation is used to select the multiple different emotions from the at least two preset emotions; and
    performing, in response to the third operation, emotion conversion on the first enrollment speech signal to obtain the enrollment speech signals under the multiple different emotions.
  11. The method according to any one of claims 1 to 10, characterized in that the determining, according to the speech signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user comprises:
    performing voiceprint feature extraction on the speech signal to be recognized to obtain voiceprint information of the speech signal to be recognized; and
    determining, according to the voiceprint information and the voiceprint template, whether the user to be recognized is the registered user.
  12. The method according to any one of claims 1 to 11, characterized in that the first emotion comprises at least one of calm, joy, anger, sadness, eagerness, fear, and surprise.
  13. A voiceprint recognition apparatus, characterized by comprising:
    a first obtaining unit, configured to obtain a speech signal to be recognized of a user to be recognized;
    an emotion recognition unit, configured to perform emotion recognition on the speech signal to be recognized to obtain a first emotion corresponding to the speech signal to be recognized;
    a second obtaining unit, configured to obtain a voiceprint template of a registered user under the first emotion, wherein when the first emotion corresponds to different emotions, the voiceprint templates corresponding to the different emotions are different; and
    a judging unit, configured to determine, according to the speech signal to be recognized and the voiceprint template, whether the user to be recognized is the registered user.
  14. The apparatus according to claim 13, characterized in that the second obtaining unit is specifically configured to:
    obtain, from voiceprint templates of the registered user under multiple different emotions, the voiceprint template corresponding to the first emotion, wherein the multiple different emotions comprise the first emotion.
  15. The apparatus according to claim 13, characterized in that the first emotion is characterized by a weight coefficient for each of at least two emotions;
    wherein the second obtaining unit is specifically configured to:
    determine, from voiceprint templates of the registered user under multiple different emotions, the voiceprint template corresponding to each of the at least two emotions in the first emotion; and
    obtain, according to the voiceprint template of each emotion and the weight coefficient of each emotion, the voiceprint template corresponding to the first emotion.
  16. The apparatus according to any one of claims 13 to 15, characterized by further comprising:
    a display interface, configured to display the first emotion.
  17. The apparatus according to claim 16, characterized in that when the first emotion is characterized by a weight coefficient for each of at least two emotions, the display interface is specifically configured to display each emotion and the weight coefficient of each emotion.
  18. The apparatus according to claim 16 or 17, characterized by further comprising:
    a third obtaining unit, configured to obtain a first operation of a user, wherein the first operation is used to correct the type of the first emotion, or to correct the weight coefficient of each of the at least two emotions in the first emotion;
    wherein the emotion recognition unit is further configured to update the first emotion in response to the first operation.
  19. The apparatus according to any one of claims 13 to 18, characterized by further comprising:
    a fourth obtaining unit, configured to obtain enrollment speech signals under multiple different emotions; and
    a fifth obtaining unit, configured to obtain, according to the enrollment speech signals under the multiple different emotions, a voiceprint template of the registered user for each of the multiple different emotions.
  20. The apparatus according to claim 19, characterized in that the fourth obtaining unit is specifically configured to:
    display at least two preset emotions to a user on a display interface;
    obtain a second operation of the user, wherein the second operation is used to record the user's speech under the at least two preset emotions; and
    obtain, in response to the second operation, enrollment speech signals under the at least two preset emotions, wherein the enrollment speech signals under the multiple different emotions comprise the enrollment speech signals under the at least two preset emotions.
  21. The apparatus according to claim 19, characterized in that the fourth obtaining unit is specifically configured to:
    obtain a first enrollment speech signal; and
    perform emotion conversion on the first enrollment speech signal to obtain the enrollment speech signals under the multiple different emotions.
  22. The apparatus according to claim 21, characterized in that the fourth obtaining unit is specifically configured to:
    display at least two preset emotions to a user on a display interface;
    obtain a third operation of the user, wherein the third operation is used to select the multiple different emotions from the at least two preset emotions; and
    perform, in response to the third operation, emotion conversion on the first enrollment speech signal to obtain the enrollment speech signals under the multiple different emotions.
  23. The apparatus according to any one of claims 13 to 22, characterized in that the judging unit is specifically configured to:
    perform voiceprint feature extraction on the speech signal to be recognized to obtain voiceprint information of the speech signal to be recognized; and
    determine, according to the voiceprint information and the voiceprint template, whether the user to be recognized is the registered user.
  24. The apparatus according to any one of claims 13 to 23, characterized in that the first emotion comprises at least one of calm, joy, anger, sadness, eagerness, fear, and surprise.
  25. A terminal device, characterized by comprising:
    one or more processors; and
    a memory, configured to store one or more programs;
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1 to 12.
PCT/CN2020/125337 2020-02-29 2020-10-30 Voiceprint recognition method and apparatus WO2021169365A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010132716.2 2020-02-29
CN202010132716.2A CN113327620A (zh) Voiceprint recognition method and apparatus

Publications (1)

Publication Number Publication Date
WO2021169365A1 true WO2021169365A1 (zh) 2021-09-02

Family

ID=77413073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125337 WO2021169365A1 (zh) Voiceprint recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN113327620A (zh)
WO (1) WO2021169365A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012205A (zh) Voiceprint recognition method, graphical interface, and electronic device
CN117133281A (zh) Speech recognition method and electronic device
CN116612766B (zh) Conference system with voiceprint enrollment function and voiceprint enrollment method
CN117198338B (zh) Artificial-intelligence-based walkie-talkie voiceprint recognition method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226742A (zh) Voiceprint recognition method based on emotion compensation
CN101419800A (zh) Emotional speaker recognition method based on spectrum shifting
US20120253807A1 (en) Speaker state detecting apparatus and speaker state detecting method
CN103456302A (zh) Emotional speaker recognition method based on emotion GMM model weight synthesis
CN109346079A (zh) Voice interaction method and apparatus based on voiceprint recognition
CN109473106A (zh) Voiceprint sample collection method and apparatus, computer device, and storage medium
US20190356779A1 (en) * 2016-11-02 2019-11-21 International Business Machines Corporation System and Method for Monitoring and Visualizing Emotions in Call Center Dialogs at Call Centers

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4438014B1 (ja) Harmful customer detection system, method therefor, and harmful customer detection program
US20160372116A1 (en) Voice authentication and speech recognition system and method
CN108305643B (zh) Method and apparatus for determining emotion information
CN110164455A (zh) Apparatus, method, and storage medium for user identity recognition
CN108764010A (zh) Emotional state determination method and apparatus
CN110265062A (zh) Intelligent post-loan collection method and apparatus based on emotion detection


Also Published As

Publication number Publication date
CN113327620A (zh) 2021-08-31

Similar Documents

Publication Publication Date Title
WO2021169365A1 (zh) Voiceprint recognition method and apparatus
US11495224B2 (en) Contact resolution for communications systems
JP6725672B2 (ja) Identification of voice inputs providing credentials
US9934775B2 (en) Unit-selection text-to-speech synthesis based on predicted concatenation parameters
CN110310623B (zh) Sample generation method, model training method, apparatus, medium, and electronic device
EP3824462B1 (en) Electronic apparatus for processing user utterance and controlling method thereof
CN108346427A (zh) Speech recognition method, apparatus, device, and storage medium
CN111179975A (zh) Voice endpoint detection method for emotion recognition, electronic device, and storage medium
US11664030B2 (en) Information processing method, system, electronic device, and computer storage medium
CN107707745A (zh) Method and apparatus for extracting information
JP2004533640A (ja) Method and apparatus for managing information about a person
WO2020107834A1 (zh) Verification content generation method for lip-reading recognition, and related apparatus
CN111653265B (zh) Speech synthesis method and apparatus, storage medium, and electronic device
US10699706B1 (en) Systems and methods for device communications
WO2020098523A1 (zh) Speech recognition method and apparatus, and computing device
CN110544468B (zh) Application wake-up method and apparatus, storage medium, and electronic device
CN107610706A (zh) Method and apparatus for processing voice search results
US10866948B2 (en) Address book management apparatus using speech recognition, vehicle, system and method thereof
CN112765971A (zh) Text-to-speech conversion method and apparatus, electronic device, and storage medium
US11151995B2 (en) Electronic device for mapping an invoke word to a sequence of inputs for generating a personalized command
JP4143541B2 (ja) Method and system for non-intrusive speaker verification using behavior models
CN109064720B (zh) Position prompting method and apparatus, storage medium, and electronic device
CN110781329A (zh) Image search method and apparatus, terminal device, and storage medium
US10841411B1 (en) Systems and methods for establishing a communications session
KR102622350B1 (ko) Electronic apparatus and control method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921482

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921482

Country of ref document: EP

Kind code of ref document: A1