WO2021139589A1 - Voice processing method, medium, and system - Google Patents

Voice processing method, medium, and system

Info

Publication number: WO2021139589A1
Authority: WO (WIPO/PCT)
Prior art keywords: speaker, current, feature, template, features
Application number: PCT/CN2020/141600
Other languages: French (fr), Chinese (zh)
Inventors: 胡伟湘, 王亚如, 李伟, 芦宇
Original assignee: Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021139589A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces

Definitions

  • One or more embodiments of the present application relate generally to the field of voice processing, and specifically to a voice processing method, medium, and system.
  • Speaker recognition, also known as voiceprint recognition, is a biometric recognition technology that automatically recognizes and confirms a speaker's identity by analyzing and extracting features from voice signals.
  • Existing speaker recognition includes two stages: registration and verification.
  • In the registration phase, the system requires the registrant to enter multiple registered voices according to specified requirements, and converts these registered voices into corresponding speaker models.
  • In the verification phase, the system performs feature analysis and extraction on the input verification voice, scores its similarity against the speaker models generated in the registration phase, and judges whether the verification voice matches the registrant according to a set threshold.
  • A first aspect of the present application provides a voice processing method, which may include: receiving multiple voice inputs and extracting multiple voice features from the multiple voice inputs; determining multiple speaker features based on the multiple voice features; clustering the multiple speaker features into at least one speaker feature category, where the at least one speaker feature category corresponds one-to-one to at least one speaker and each speaker feature category includes at least one of the multiple speaker features; determining at least one speaker template based on the at least one speaker feature category, where the at least one speaker template corresponds one-to-one to the at least one speaker; and receiving a current voice input from a current speaker and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one of the at least one speaker.
  • In this way, the speaker does not need to perform a dedicated voice registration; instead, during voice interaction with the speaker, the speaker features of different speakers are clustered to obtain the speaker template corresponding to each speaker, which is then used to identify different speakers. The embodiments of the present application can therefore realize imperceptible, registration-free enrollment and avoid the negative experience that explicit registration brings to the speaker.
  • In some embodiments, determining the multiple speaker features based on the multiple voice features includes determining the multiple speaker features through a voiceprint model, where the voiceprint model includes at least one of a Gaussian mixture-universal background model (GMM-UBM), an I-vector model, and a joint factor analysis (JFA) model, and the multiple speaker features include at least one of the mean supervector of the GMM-UBM, the I-vector of the I-vector model, and the speaker-related supervector of the JFA model.
  • In some embodiments, determining the at least one speaker template based on the at least one speaker feature category includes: determining the mean or the weighted sum of the at least one speaker feature in each speaker feature category, and using each resulting mean or weighted sum as the speaker template for that category.
  • In some embodiments, clustering the multiple speaker features into the at least one speaker feature category is based on at least one of: the similarity between every two of the multiple speaker features, the offset between two of the multiple speaker features, and the density distribution of the multiple speaker features.
  • In some embodiments, receiving the current voice input from the current speaker and determining whether the current speaker matches one of the at least one speaker further includes: receiving the current voice input from the current speaker and extracting a current voice feature from it; determining a current speaker feature based on the current voice feature; determining whether the current speaker feature matches one of the at least one speaker template; and, when the current speaker feature is determined to match a speaker template, determining that the current speaker matches the speaker corresponding to that speaker template.
  • In some embodiments, determining whether the current speaker matches one of the at least one speaker includes: determining the match based on the similarity between the current speaker feature and each speaker template in the at least one speaker template.
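  • As an illustration of this similarity-based matching (not part of the claims), the following minimal Python sketch compares the current speaker feature against every template and applies a threshold; the cosine-similarity choice, the threshold value, and all names are assumptions:

```python
import numpy as np

def match_speaker(current_feature, templates, sim_threshold=0.7):
    """Return the ID of the best-matching speaker template, or None.

    current_feature: 1-D feature vector (e.g., an i-vector) of the current speaker.
    templates: dict mapping speaker ID -> template vector.
    sim_threshold: illustrative similarity threshold (assumed value).
    """
    best_id, best_sim = None, -1.0
    for speaker_id, template in templates.items():
        # Cosine similarity between the current feature and this template.
        sim = np.dot(current_feature, template) / (
            np.linalg.norm(current_feature) * np.linalg.norm(template))
        if sim > best_sim:
            best_id, best_sim = speaker_id, sim
    # The current speaker matches only if the maximum similarity clears the threshold.
    return best_id if best_sim >= sim_threshold else None
```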
  • In some embodiments, the method further includes: when it is determined that the current speaker matches a speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of speaker features in the speaker feature category corresponding to that speaker equals a first threshold; and, when the sum is determined not to equal the first threshold, updating the speaker template corresponding to that speaker based on the current speaker feature and the speaker features already in the category.
  • In this way, the current speaker's feature is added to the speaker feature category matching the current speaker and that category's speaker template is updated, improving the accuracy of speaker recognition.
  • In some embodiments, the method further includes: when it is determined that the current speaker matches a speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of speaker features in the speaker feature category corresponding to that speaker equals the first threshold; when the sum is determined to equal the first threshold, adding the current speaker feature to the multiple speaker features to form updated multiple speaker features; clustering the updated multiple speaker features into updated at least one speaker feature category, where the updated at least one speaker feature category corresponds one-to-one to the updated at least one speaker and each updated speaker feature category includes at least one of the updated multiple speaker features; and determining updated at least one speaker template based on the updated at least one speaker feature category, where the updated at least one speaker template corresponds one-to-one to the updated at least one speaker.
  • In this way, when the speaker category corresponding to the matched speaker satisfies the clustering condition, all speaker features (including the current speaker feature) are re-clustered, updating the speaker feature categories and their corresponding speaker templates. As the number of received voice inputs increases, the recognition accuracy for each speaker can thus be gradually improved; a sketch of this branch follows.
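  • A minimal sketch of this update-or-recluster branch, assuming an illustrative first threshold and a caller-supplied re-clustering hook (both hypothetical):

```python
import numpy as np

def update_template(category):
    # Template as the mean of the category's features (per the first aspect).
    return np.mean(category, axis=0)

def handle_matched_feature(feature, category, all_features, recluster_all,
                           first_threshold=100):
    """Add a matched feature to its category, then update or re-cluster.

    recluster_all: callable that re-clusters all features and rebuilds templates
    (hypothetical hook; a mean-shift version is sketched later in this document).
    first_threshold: illustrative stand-in for the claimed first threshold.
    """
    category.append(feature)
    all_features.append(feature)
    if len(category) == first_threshold:
        # Clustering condition met: re-cluster every feature, including the new one.
        return recluster_all(all_features)
    # Otherwise just refresh this speaker's template from its updated category.
    return update_template(category)
```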
  • In some embodiments, the method further includes: when it is determined that the current speaker does not match the at least one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of speaker features not included in any speaker feature category is greater than or equal to a second threshold; when the sum is determined to be greater than or equal to the second threshold, clustering the current speaker feature together with the speaker features not included in any speaker feature category into at least one other speaker feature category, where the at least one other speaker feature category corresponds one-to-one to at least one other speaker; and determining at least one other speaker template based on the at least one other speaker feature category, where the at least one other speaker template corresponds one-to-one to the at least one other speaker.
  • In this way, the features of unmatched speakers are clustered, yielding new speaker feature categories and the new speaker templates corresponding to them. As the number of received voice inputs increases, the accuracy of speaker recognition can be gradually improved.
  • In some embodiments, the method further includes: when it is determined that the current speaker does not match the at least one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of speaker features not included in any speaker feature category is greater than or equal to the second threshold; when the sum is determined to be greater than or equal to the second threshold, adding the current speaker feature and the speaker features not included in any speaker feature category to the multiple speaker features to form updated multiple speaker features; clustering the updated multiple speaker features into updated at least one speaker feature category; and determining updated at least one speaker template based on the updated at least one speaker feature category, where the updated at least one speaker template corresponds one-to-one to the updated at least one speaker.
  • In this way, the recognition accuracy for each speaker can be gradually improved.
  • In some embodiments, the method further includes: when it is determined that the current speaker matches one of the multiple speakers, obtaining the current user identification of the current speaker through interaction with the current speaker; and associating the current user identification with the speaker feature category and the speaker template corresponding to that speaker.
  • In this way, a more personalized and intelligent interactive experience can be provided to the speaker.
  • In some embodiments, the current user identification includes at least one of the name, gender, age, permissions, and preferences of the current speaker.
  • In some embodiments, the method further includes: receiving a next voice input from a next speaker and determining, based on the next voice input and the at least one speaker template, whether the next speaker matches one of the at least one speaker; and, when it is determined that the next speaker matches that speaker, using the current user identification to identify the next speaker.
  • In some embodiments, the method further includes: when an updated speaker feature category among the updated at least one speaker feature category includes multiple speaker features associated with multiple user identifications, determining the user identification associated with the largest number of speaker features in that category, and associating that user identification with the corresponding updated speaker template, where each user identification includes at least one of the speaker's name, gender, age, permissions, and preferences.
  • In some embodiments, the method further includes: determining the voice attributes of the current speaker based on the current voice input of the current speaker; and, when it is determined that the current speaker matches one of the multiple speakers, associating the voice attributes with the speaker feature category corresponding to that speaker.
  • In this way, a more personalized and intelligent interactive experience can be provided to the speaker.
  • In some embodiments, the voice attributes include at least one of an age attribute of the voice and a gender attribute of the voice.
  • A second aspect of the present application provides a machine-readable medium on which instructions are stored; when the instructions are executed on a machine, the machine executes any of the above voice processing methods.
  • A third aspect of the present application provides a system, which includes: a processor; and a memory storing instructions that, when executed by the processor, cause the system to execute any of the above voice processing methods.
  • Fig. 1 shows a schematic diagram of a speaker recognition scene according to an embodiment of the present application;
  • Fig. 2 shows a schematic structural diagram of a speaker recognition device according to an embodiment of the present application;
  • Fig. 3 shows a schematic structural diagram of a speaker template acquisition module according to an embodiment of the present application;
  • Fig. 4 shows a schematic diagram of a scene of voice interaction between a current speaker and a speaker recognition device according to an embodiment of the present application;
  • Fig. 5 shows a schematic flowchart of a speaker recognition method according to an embodiment of the present application;
  • Fig. 6 shows another schematic flowchart of a speaker recognition method according to an embodiment of the present application;
  • Fig. 7 shows another schematic flowchart of a speaker recognition method according to an embodiment of the present application;
  • Fig. 8 shows a schematic structural diagram of a system according to an embodiment of the present application.
  • Fig. 1 shows a schematic diagram of a speaker recognition scene according to an embodiment of the present application.
  • As shown, the speaker recognition apparatus 100 can interact with multiple speakers 200 at different times and, during this interaction, receives voice inputs 300 from the multiple speakers 200.
  • The speaker recognition device 100 does not require the speakers 200 to perform a special voice registration; instead, it can recognize different speakers 200 based on its voice interaction with them.
  • The speaker recognition device 100 may include, but is not limited to, smart speakers, smart headphones, smart bracelets, smart large screens, portable or mobile devices, mobile phones, personal digital assistants, cellular phones, handheld PCs, portable media players, handheld devices, wearable devices (for example, watches, bracelets, display glasses or goggles, head-mounted displays, head-mounted devices), navigation equipment, servers, network equipment, graphics equipment, video game equipment, set-top boxes, laptop devices, virtual reality and/or augmented reality devices, Internet of Things devices, industrial control devices, in-vehicle infotainment devices, streaming media client devices, e-books, reading devices, POS machines, and other devices.
  • The speaker recognition device 100 may include an interaction module 110, a speaker feature acquisition module 120, a speaker template acquisition module 130, and a speaker matching module 140, and may optionally include a user identification acquisition module 150 and a voice attribute recognition module 160.
  • One or more components of the speaker recognition device 100 can be implemented by an application-specific integrated circuit (ASIC), an electronic circuit, a (shared, dedicated, or group) processor and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or any combination of other suitable components that provide the described functions.
  • the processor may be a microprocessor, a digital signal processor, a microcontroller, etc., and/or any combination thereof.
  • the processor may be a single-core processor, a multi-core processor, etc., and/or any combination thereof.
  • The speaker recognition device 100 may also include an input/output module for receiving the voice input 300 from the speaker 200 and for outputting interactive sentences to the speaker 200 in the form of voice, text, etc. Examples of input/output modules may include, but are not limited to, speakers, microphones, and displays (for example, liquid crystal displays, touch screen displays, etc.).
  • the interaction module 110 is used to interact with the speaker 200, where the interaction may include, but is not limited to, interactive sentences in the form of voice and/or text.
  • the interaction module 110 may be implemented by using any interaction technology in the prior art.
  • The speaker feature acquisition module 120 is configured to extract the voice features of the speaker 200 from the voice input 300 of the speaker 200 (for example, but not limited to, FBank (filter bank) features, MFCC (Mel-frequency cepstral coefficient) features, etc.) and, based on the voice features of the speaker 200, to obtain the speaker features of the speaker 200 according to a voiceprint model (for example, but not limited to, the GMM-UBM (Gaussian mixture model-universal background model), the I-vector model, the JFA (joint factor analysis) model, etc.).
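  • As a concrete illustration of this step (not part of the patent text), here is a minimal sketch of FBank and MFCC extraction using the open-source librosa library; the sample rate and coefficient counts are assumed values:

```python
import librosa

def extract_voice_features(wav_path, sr=16000, n_mfcc=20, n_mels=40):
    """Extract FBank (log-mel) and MFCC features from one voice input."""
    y, sr = librosa.load(wav_path, sr=sr)
    # FBank features: log-scaled mel filter-bank energies.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)
    # MFCC features: Mel-frequency cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return fbank.T, mfcc.T  # both shaped (frames, coefficients)
```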
  • The speaker template acquisition module 130 is used to determine whether the speaker feature set satisfies a clustering condition, where the speaker feature set may include multiple speaker features from multiple speakers 200. When the speaker feature set meets the clustering condition, the speaker template acquisition module 130 can cluster the multiple speaker features in the set into at least one speaker feature category, where the speaker feature categories correspond one-to-one to the speakers 200 and each category includes at least one speaker feature from the set. For each speaker feature category, the speaker template acquisition module 130 can then obtain, based on the at least one speaker feature in that category, a speaker template for the speaker 200 associated with that category, where the speaker templates correspond one-to-one to the speakers 200.
  • The clustering condition may include at least one of the following: the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to a first clustering threshold; or the number of speaker features in a speaker feature category is equal to a second clustering threshold.
  • a speaker template corresponding to the speaker feature category may be determined based on the mean value and/or weighted sum of at least one speaker feature in the speaker feature category.
  • The speaker matching module 140 is configured to determine, based on whether the current speaker feature of the current speaker 200 matches one of the at least one existing speaker template, whether the current speaker 200 matches one of the at least one speaker 200 corresponding to the at least one speaker template.
  • Specifically, when it is determined that the current speaker feature of the current speaker 200 matches a speaker template, the speaker matching module 140 determines that the current speaker 200 matches the speaker 200 corresponding to the matched speaker template; when it is determined that the current speaker feature matches none of the speaker templates, it determines that the current speaker 200 does not match any of the speakers 200 corresponding to those templates.
  • The speaker matching module 140 may determine whether the current speaker feature matches one of the existing speaker templates based on the similarity between the current speaker feature of the current speaker 200 and each of the at least one existing speaker template.
  • the user identification acquisition module 150 is configured to acquire the user identification of the speaker 200 based on the interaction between the interaction module 110 and the speaker 200.
  • The user identification may include, but is not limited to, name, gender, age, permissions, preferences, etc.
  • The voice attribute recognition module 160 is used to recognize the voice attribute information of the speaker 200 based on the voice features of the speaker 200 (for example, but not limited to, FBank features, MFCC features, etc.). The voice attribute information may include, but is not limited to, the gender attribute of the voice, the age attribute of the voice, and so on.
  • the interaction module 110 may interact with the speaker 200 based on the sound attribute information of the speaker 200, so that the user identification obtaining module 150 obtains the user identification of the speaker 200.
  • the speaker feature acquisition module 120 may preprocess the speech input 300 of the speaker 200, where the preprocessing may include, but is not limited to, signal enhancement, de-reverberation, denoising, etc.
  • the speaker feature acquisition module 120 can also extract the voice features of the speaker 200 from the preprocessed voice input 300, such as, but not limited to, FBank features, MFCC features, and the like.
  • The speaker feature acquisition module 120 can also determine the signal-to-noise ratio of the voice input 300 of the speaker 200, so that the speaker template acquisition module 130 can select high-quality voice inputs for acquiring the speaker template.
  • the speaker feature obtaining module 120 may also obtain the speaker feature of the speaker 200 based on the voice feature of the speaker 200 and the voiceprint model of the speaker 200.
  • The voiceprint model is used to describe the spatial distribution of the voice features of the speaker 200. Examples of the voiceprint model may include, but are not limited to, the GMM-UBM model, the I-vector model, the JFA model, etc.
  • The GMM-UBM model uses Gaussian probability density functions to describe the spatial distribution of the speaker's voice features. Establishing a GMM-UBM model includes two parts: first, a universal background model (UBM) that describes features common to all speakers is trained on a large amount of voice data from many speakers; then, the voice features of each individual speaker are used to adapt the UBM via maximum a posteriori (MAP) estimation, thereby obtaining that speaker's Gaussian mixture model (GMM).
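  • The following minimal sketch (an illustration, not the patent's implementation) shows the two parts with scikit-learn: training a UBM on pooled background data, then MAP-adapting its means to one speaker to form the mean supervector; the component count and relevance factor are conventional assumed values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=64):
    """Train a universal background model on pooled voice features (frames x dims)."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(background_features)
    return ubm

def map_adapt_supervector(ubm, speaker_features, relevance=16.0):
    """MAP-adapt the UBM means to one speaker and return the mean supervector."""
    # Posterior probability (responsibility) of each component for each frame.
    gamma = ubm.predict_proba(speaker_features)            # (frames, components)
    n_k = gamma.sum(axis=0)                                # soft counts per component
    # Weighted mean of the speaker's frames under each component.
    e_k = (gamma.T @ speaker_features) / np.maximum(n_k[:, None], 1e-10)
    # Interpolate between the speaker statistics and the UBM means.
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted_means = alpha * e_k + (1.0 - alpha) * ubm.means_
    # The mean supervector concatenates the adapted component means.
    return adapted_means.ravel()
```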
  • The JFA model builds on the speaker's GMM model: it defines an eigenvoice space, an eigenchannel space, and a residual space to describe the spatial distribution of the voice features of the speaker 200. That is, the mean supervector of the speaker's GMM model (formed by concatenating the mean vectors of the individual Gaussian probability density functions) is decomposed into a speaker-related supervector and a channel-related supervector, so that channel interference can be removed and a more accurate description of the speaker obtained.
  • The I-vector model is also based on the speaker's GMM model: it defines a single total variability space that captures both inter-speaker and inter-channel differences to describe the spatial distribution of the voice features of the speaker 200, and, based on this total variability space, extracts a more compact I-vector from the mean supervector of the speaker's GMM model (i.e., the supervector is modeled as M = m + Tw, where m is the UBM mean supervector, T is the total variability matrix, and w is the I-vector).
  • When the voiceprint model is the GMM-UBM model, the speaker feature determined by the speaker feature acquisition module 120 may be the mean supervector of the GMM-UBM model; when the voiceprint model is the JFA model, it may be the speaker-related supervector of the JFA model; and when the voiceprint model of the speaker 200 is the I-vector model, it may be the I-vector of the I-vector model.
  • The voiceprint model and the speaker features of the speaker 200 are not limited to these; other types of voiceprint models and speaker features may also be used.
  • The speaker template acquisition module 130 includes, but is not limited to, a speaker feature set maintenance unit 131, a speaker feature clustering unit 132, and a speaker template obtaining unit 133.
  • The speaker feature set maintenance unit 131 can add the speaker features of the speaker 200 to the speaker feature set according to predetermined rules; it can also determine whether the speaker feature set satisfies the clustering condition and, if the speaker feature set satisfies the aforementioned clustering condition, trigger the speaker feature clustering unit 132 to cluster the multiple speaker features in the speaker feature set into at least one speaker feature category.
  • The foregoing predetermined rules may include: when no speaker template exists, the speaker feature set maintenance unit 131 may directly add the current speaker feature of the current speaker 200 to the speaker feature set; when a speaker template already exists and the speaker matching module 140 determines that the current speaker feature of the current speaker 200 matches a speaker template, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to the matched speaker template; and when a speaker template already exists but the speaker matching module 140 determines that the current speaker feature matches no speaker template, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set without assigning it to any speaker feature category.
  • The foregoing predetermined rules may further include: the speaker feature set maintenance unit 131 may also decide, according to the signal-to-noise ratio of the voice input 300 of the current speaker 200, whether to add the current speaker feature to the speaker feature set. Specifically, if the signal-to-noise ratio of the voice input 300 is higher than or equal to a signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 adds the current speaker feature of the current speaker 200 to the speaker feature set; if the signal-to-noise ratio is lower than the threshold, the unit does not add the current speaker feature to the speaker feature set.
  • The aforementioned clustering condition may include: the number of speaker features not included in any speaker feature category (for example, because no speaker feature category exists yet, or because the speaker matching module 140 has determined that they match no speaker template) is greater than or equal to the first clustering threshold.
  • The aforementioned clustering condition may also include: the number of speaker features in at least one speaker feature category is equal to a second clustering threshold, where the second clustering threshold may include one or more values, for example, but not limited to, 50, 100, 200, 500, or other values. A minimal sketch of these maintenance rules follows.
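  • A minimal sketch of these maintenance rules, with all thresholds and names chosen for illustration only:

```python
class SpeakerFeatureSet:
    """Sketch of the feature-set maintenance rules described above."""

    def __init__(self, snr_threshold=15.0, first_threshold=10, second_threshold=50):
        self.uncategorized = []      # features not yet in any speaker feature category
        self.categories = {}         # system identifier -> list of features
        self.snr_threshold = snr_threshold        # assumed SNR gate (dB)
        self.first_threshold = first_threshold    # first clustering threshold
        self.second_threshold = second_threshold  # one value of the second threshold

    def add_feature(self, feature, snr, matched_id=None):
        """Add a feature per the predetermined rules; return True if accepted."""
        if snr < self.snr_threshold:
            return False  # low-SNR input is not used for template building
        if matched_id is None:
            self.uncategorized.append(feature)
        else:
            self.categories.setdefault(matched_id, []).append(feature)
        return True

    def should_cluster_new(self):
        # First condition: enough uncategorized features to form new categories.
        return len(self.uncategorized) >= self.first_threshold

    def should_recluster(self):
        # Second condition: some category has grown to the second threshold.
        return any(len(f) == self.second_threshold for f in self.categories.values())
```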
  • The speaker feature clustering unit 132 is configured to cluster the multiple speaker features in the speaker feature set into at least one speaker feature category according to the trigger instruction of the speaker feature set maintenance unit 131, and to assign a system identifier (for example, but not limited to, speaker A, speaker B, etc.) to each of the at least one speaker feature category, where each system identifier is associated with one speaker feature category, with the at least one speaker feature in that category, and with the speaker template corresponding to that category.
  • Examples of clustering algorithms may include, but are not limited to, the mean-shift clustering algorithm, density clustering algorithms (for example, but not limited to, DBSCAN (Density-Based Spatial Clustering of Applications with Noise)), hierarchical clustering algorithms, and other clustering algorithms. Among them, the mean-shift clustering algorithm clusters the speaker features based on the offsets between the speaker features, density clustering algorithms cluster the speaker features based on the density distribution of the speaker features, and hierarchical clustering algorithms cluster the speaker features based on the similarity between every two speaker features.
  • When the number of speaker features not included in any speaker feature category reaches the first clustering threshold, the speaker feature clustering unit 132 can cluster just those uncategorized speaker features, or it can perform clustering on all speaker features in the speaker feature set; when the number of speaker features in at least one speaker feature category is equal to the second clustering threshold, the speaker feature clustering unit 132 can cluster all speaker features in the speaker feature set.
  • For example, the mean-shift clustering algorithm can include the following steps:
  • S2: Determine all speaker features in the region centered on the current center point with radius r, group these speaker features into the same cluster, and record the number of times each of these speaker features has appeared in a cluster;
  • S3: Calculate the offset vectors (i.e., the difference vectors) from the center point to each speaker feature determined in S2, and use the mean of these offset vectors as the mean-shift vector;
  • S7: Determine the speaker feature categories: for each speaker feature, the cluster in which that speaker feature appeared most frequently is used as the speaker feature category to which it belongs.
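  • Steps S1 and S4 through S6 are not reproduced above; a standard mean-shift procedure fills those gaps. The following sketch (an assumption, not the patent's own implementation) therefore uses scikit-learn's MeanShift, whose bandwidth plays the role of the radius r:

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_speaker_features(features, radius=1.0):
    """Cluster speaker features (n x dim array) with mean-shift.

    radius is an illustrative value standing in for r above.
    """
    features = np.asarray(features)
    ms = MeanShift(bandwidth=radius).fit(features)
    # ms.labels_[i] is the speaker feature category of feature i.
    return ms.labels_, ms.cluster_centers_

def group_by_category(features, labels):
    """Group features into {category label: [features]} for template building."""
    categories = {}
    for feat, label in zip(features, labels):
        categories.setdefault(int(label), []).append(feat)
    return categories
```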
  • In some embodiments, the speaker template obtaining unit 133 may determine the mean of the at least one speaker feature in a speaker feature category and use it as the speaker template; for example, when the speaker features are I-vectors of the I-vector model, the speaker template obtaining unit 133 may use the mean vector of the at least one I-vector in the category as the speaker template.
  • In other embodiments, the speaker template obtaining unit 133 may determine the weighted sum of the at least one speaker feature in the speaker feature category and use it as the speaker template, where the weight of each speaker feature can be determined according to the signal-to-noise ratio of the voice input corresponding to that speaker feature.
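  • A minimal sketch of both template computations described above; the SNR normalization scheme is an illustrative assumption:

```python
import numpy as np

def mean_template(category_features):
    """Speaker template as the mean of the category's features (e.g., i-vectors)."""
    return np.mean(category_features, axis=0)

def snr_weighted_template(category_features, snrs):
    """Speaker template as an SNR-weighted combination of the category's features.

    Weights are the normalized SNRs of the corresponding voice inputs.
    """
    weights = np.asarray(snrs, dtype=float)
    weights /= weights.sum()
    return np.average(category_features, axis=0, weights=weights)
```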
  • The speaker matching module 140 may first determine whether a speaker template already exists. When the speaker matching module 140 determines that no speaker template exists (for example, because the speaker feature set has not yet been clustered), it reports no match for the current speaker 200 and sends the current speaker feature to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 of the speaker template acquisition module 130 can add the current speaker feature of the current speaker 200 to the speaker feature set.
  • When the speaker matching module 140 determines that at least one speaker template already exists (for example, because the speaker feature set has been clustered), it can determine the similarity between the current speaker feature and each speaker template and determine whether the maximum similarity is higher than or equal to a similarity threshold.
  • In some embodiments, the similarity between the current speaker feature and a speaker template can be determined by calculating the distance between the current speaker feature and the speaker template, where the distance can include, but is not limited to, cosine distance, EMD (Earth Mover's Distance), Euclidean distance, Manhattan distance, etc.
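  • The following sketch (illustrative, not the patent's implementation) turns each listed distance into a similarity score; the distance-to-similarity mappings are conventions chosen here, and treating the vectors as 1-D distributions for the Earth Mover's Distance is a simplification:

```python
from scipy.spatial.distance import cityblock, cosine, euclidean
from scipy.stats import wasserstein_distance

def similarity(feature, template, metric="cosine"):
    """Similarity between a speaker feature and a speaker template."""
    if metric == "cosine":
        return 1.0 - cosine(feature, template)   # cosine similarity
    if metric == "euclidean":
        return 1.0 / (1.0 + euclidean(feature, template))
    if metric == "manhattan":
        return 1.0 / (1.0 + cityblock(feature, template))
    if metric == "emd":
        return 1.0 / (1.0 + wasserstein_distance(feature, template))
    raise ValueError(f"unknown metric: {metric}")
```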
  • When the speaker matching module 140 determines that the maximum similarity is higher than or equal to the similarity threshold, it can determine that the current speaker feature matches the speaker template with the maximum similarity, and thus that the current speaker 200 matches the speaker 200 corresponding to that speaker template.
  • The speaker matching module 140 can also send the current speaker feature and the matching result of the current speaker 200 (for example, but not limited to, the speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, such as speaker A) to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can add the current speaker feature to the speaker feature category corresponding to the matched speaker template (hereinafter referred to as the speaker feature category matching the current speaker 200, for example, but not limited to, speaker A).
  • Next, the speaker feature set maintenance unit 131 may determine whether the number of speaker features in the speaker feature category matching the current speaker 200 (for example, but not limited to, speaker A) is equal to the second clustering threshold. If the speaker feature set maintenance unit 131 determines that the number of speaker features in that category is equal to the second clustering threshold, it can trigger the speaker feature clustering unit 132 to use a clustering algorithm to re-cluster all speaker features in the speaker feature set, obtaining updated at least one speaker feature category, where the updated speaker feature categories correspond one-to-one to the speakers 200 and each updated speaker feature category includes at least one speaker feature from the speaker feature set.
  • The speaker feature set maintenance unit 131 may then determine the system identifier associated with each updated speaker feature category. Specifically, for each updated speaker feature category, it may determine, among the system identifiers associated with the speaker features in that updated category, the system identifier associated with the largest number of those speaker features, and associate that system identifier with the updated speaker feature category and with the speaker features within it. The speaker feature set maintenance unit 131 may also determine, among the updated speaker feature categories, the one matching the current speaker 200, namely the updated category whose associated system identifier is the same as the system identifier associated with the speaker feature category that matched the current speaker 200 before re-clustering.
  • The speaker template obtaining unit 133 may obtain the updated speaker template corresponding to the updated speaker feature category matching the current speaker 200 from the speaker features in that updated category.
  • The speaker template obtaining unit 133 may also obtain an updated speaker template corresponding to each updated speaker feature category from the speaker features in each updated category, where the updated speaker templates correspond one-to-one to the speakers 200.
  • If the speaker feature set maintenance unit 131 determines that the number of speaker features in the speaker feature category matching the current speaker 200 is not equal to the second clustering threshold, it does not trigger the speaker feature clustering unit 132 to re-cluster the speaker feature set.
  • In this case, the speaker template obtaining unit 133 may update the speaker template corresponding to the speaker feature category matching the current speaker 200 based on the speaker features (including the current speaker feature) in that category, where the speaker template is obtained as described above and is not repeated here.
  • If the speaker matching module 140 determines that the maximum similarity between the current speaker feature and the speaker templates is lower than the similarity threshold, it can determine that the current speaker feature matches none of the speaker templates, that is, that the current speaker 200 does not match any of the speakers 200 corresponding to those speaker templates.
  • The speaker matching module 140 can send the current speaker feature and the matching result of the current speaker 200 (for example, but not limited to, information indicating that the current speaker 200 matches no speaker 200 corresponding to any speaker template) to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 of the speaker template acquisition module 130 can add the current speaker feature to the speaker feature set without assigning it to any speaker feature category.
  • The speaker feature set maintenance unit 131 may determine whether the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to the first clustering threshold. If so, it can trigger the speaker feature clustering unit 132 to use a clustering algorithm to cluster those uncategorized speaker features, obtaining at least one new speaker feature category and assigning each new category a new system identifier (for example, but not limited to, speaker C, speaker D, etc.), where the new speaker feature categories correspond one-to-one to speakers 200 and each new category includes at least one speaker feature. Each new system identifier is associated with one new speaker feature category, with the at least one speaker feature in that category, and with the speaker template corresponding to that category.
  • Alternatively, the speaker feature set maintenance unit 131 may trigger the speaker feature clustering unit 132 to use a clustering algorithm to re-cluster all speaker features in the speaker feature set, obtaining updated at least one speaker feature category, where the updated at least one speaker feature category corresponds one-to-one to the speakers 200 and each updated speaker feature category includes at least one speaker feature from the speaker feature set.
  • The speaker template obtaining unit 133 may obtain a new speaker template corresponding to each new speaker feature category from the at least one speaker feature in each new speaker feature category, where the new speaker templates correspond one-to-one to the speakers 200.
  • The speaker template obtaining unit 133 may also obtain the updated speaker template corresponding to each updated speaker feature category from the speaker features in that updated category, where the updated speaker templates correspond one-to-one to the speakers 200. For the acquisition of the speaker template, please refer to the above description, which is not repeated here.
  • If the speaker feature set maintenance unit 131 determines that the number of speaker features not included in any speaker feature category in the speaker feature set is less than the first clustering threshold, it does not trigger the speaker feature clustering unit 132 to perform clustering.
  • The user identification acquisition module 150 may determine, according to the matching result of the current speaker 200, whether the user identification of the current speaker 200 (for example, name, gender, age, permissions, preferences, etc.) needs to be acquired. If the user identification acquisition module 150 determines that the user identification of the speaker feature category corresponding to the matched speaker template has not yet been obtained, it can trigger the interaction module 110 to interact with the current speaker 200 and determine, from the interactive voice input of the current speaker 200, the user identification for the speaker feature category matching the current speaker 200.
  • Fig. 4 is a schematic diagram of a scene of voice interaction between the current speaker 200 and the speaker recognition device 100 according to an embodiment of the present application.
  • The interaction module 110 of the speaker recognition device 100 may also interact with the current speaker 200 in text form.
  • the interaction module 110 and the current speaker 200 may have the following voice interactions:
  • Interaction module 110 "I have listened to your voice for a long time, what do you call it?"
  • Interactive module 110 "It's nice to meet you, can you know more about you?"
  • Interactive module 110 "Which of the following types of movies do you prefer: Kung Fu, Comedy, Thriller".
  • the interaction module 110 and the current speaker 200 may have the following voice interactions:
  • Interaction module 110 "I have listened to your voice for a long time, what do you call it?"
  • Interactive module 110 "Please enter the highest authority password.”
  • Interactive module 110 "The password is wrong, please enter it again"
  • Interactive module 110 "The password is correct, Mr. Zhang, congratulations you have the Host authority.”
  • Based on the above voice interactions, the user identification acquisition module 150 may determine that the name of the current speaker 200 is "Zhang San", the preference is "comedy", and the authority is "Host authority".
  • The user identification acquisition module 150 may send the user identification to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 adds the current speaker feature to the speaker feature category matching the current speaker 200 and associates the current speaker's user identification with that speaker feature category, with the speaker features in it, and with the corresponding speaker template.
  • the interaction module 110 can provide the current speaker with a more personalized and intelligent interactive experience based on the user identification of the current speaker 200.
  • For example, the interaction module 110 can identify a future speaker 200 according to the user identification associated with the speaker features in the speaker feature category, and provide the future speaker 200 with a more personalized and intelligent interactive experience, such as, but not limited to, the following interactive scenarios:
  • Future Speaker 200 "Help me adjust the temperature of the air conditioner to 25 degrees.”
  • Interaction module 110 (According to the user identification of the speaker feature category matching the future speaker 200, the name of the future speaker 200 is determined to be Li Si) "Okay, Mr. Li, I have helped you set the temperature of the air conditioner to 25 degrees. .”
  • Interaction module 110 (According to the user identification of the speaker feature category matched with the future speaker 200, the preference of the future speaker 200 is determined to be a comedy) "Mr. Li, a comedy movie with a high rating has been launched recently, named “The Richest Man in Tomatoes" "Please enjoy it next.”
  • Interaction module 110 (According to the user ID of the speaker feature category matching the future speaker 200, the permission of the future speaker 200 is determined as the guest permission) "Mr. Li, sorry, your current permission is not enough. Please upgrade to the highest permission.”
  • When the matching result of the current speaker 200 includes a speaker template matching the current speaker feature (or the system identifier of the speaker feature category corresponding to that template) and the user identification acquisition module 150 determines that the user identification of that speaker feature category has already been obtained, or when the matching result includes information indicating that the current speaker 200 matches no speaker 200 corresponding to any speaker template, the user identification acquisition module 150 does not acquire the user identification of the current speaker 200, that is, it does not trigger the interaction module 110 to interact with the current speaker 200.
  • the speaker recognition device 100 may further include a voice attribute recognition module 160.
  • The voice attribute recognition module 160 may identify the voice attribute information of the current speaker 200 based on the voice features of the current speaker 200 (for example, but not limited to, FBank features, MFCC features, etc.), where the voice attribute information may include, but is not limited to, the gender attribute of the voice (male voice, female voice), the age attribute of the voice (for example, child, adult, etc.), and so on.
  • The voice attribute recognition module 160 can use any voice attribute recognition technology in the prior art to recognize the voice attributes of the current speaker 200, for example, but not limited to, a voice classifier obtained by training a classification neural network on a large amount of voice sample data. A minimal sketch of such a classifier follows.
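  • A minimal sketch of such a classifier (the network shape, the mean-MFCC pooling, and the labels are illustrative assumptions, not the patent's design):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_voice_attribute_classifier(mfcc_features, labels):
    """Train a small neural-network classifier for a voice attribute.

    mfcc_features: list of (frames, coeffs) MFCC arrays, one per voice sample.
    labels: attribute label per sample, e.g. "male"/"female" or "child"/"adult".
    """
    # Pool each utterance to its mean MFCC vector (an illustrative choice).
    X = np.stack([m.mean(axis=0) for m in mfcc_features])
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    clf.fit(X, labels)
    return clf

def predict_voice_attribute(clf, mfcc):
    """Predict the attribute of one utterance from its MFCC array."""
    return clf.predict(mfcc.mean(axis=0)[None, :])[0]
```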
  • After the voice attribute recognition module 160 recognizes the voice attribute information of the current speaker 200, it can send the voice attribute information to the interaction module 110, so that the interaction module 110 interacts with the current speaker 200 based on the voice attribute information to obtain the user identification of the current speaker 200. For example, the voice attribute recognition module 160 may recognize that the voice of the current speaker 200 is a child's voice; then, when the interaction module 110 interacts with the current speaker 200, it can address the current speaker 200 as a "child", and, to determine the name of the current speaker 200, the interaction module 110 and the current speaker 200 may have the following interaction scenario:
  • Interaction module 110: "Hello, little one, you sound like a very cute child. What's your name?"
  • The user identification acquisition module 150 and the voice attribute recognition module 160 may respectively send the user identification and the voice attribute information of the current speaker 200 to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can add the current speaker feature to the speaker feature category matching the current speaker 200 and associate the current speaker's voice attribute information and user identification with that speaker feature category, with the speaker features in it, and with the corresponding speaker template.
  • the interaction module 110 can provide the current speaker with a more personalized and intelligent interaction experience based on the user identification and voice attribute information of the current speaker 200.
  • For example, the interaction module 110 can recognize a future speaker 200 according to the voice attribute information and user identification associated with the speaker feature category, and provide the future speaker 200 with a more personalized and intelligent interactive experience, such as, but not limited to, the following interactive scenario:
  • Future speaker 200: "Play a song for me."
  • Interaction module 110 (determining from the user identification of the speaker feature category matching the future speaker 200 that the future speaker's name is Yaya, and from the voice attribute information that the age attribute of the voice is child).
  • In some embodiments, the voice attribute recognition module 160 may determine, according to the matching result of the speaker recognition device 100 for the current speaker 200, whether the voice attributes of the current speaker 200 need to be recognized. For example, when the matching result of the current speaker 200 includes a speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, and the voice attribute recognition module 160 determines that the voice attribute information of that speaker category has not yet been acquired, the voice attribute recognition module 160 can recognize the voice attributes of the current speaker 200 and send the voice attribute information to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can associate the current speaker's voice attribute information with the speaker features in the speaker feature category matching the current speaker 200.
  • When the voice attribute recognition module 160 determines that the voice attribute information of the speaker feature category has already been identified, or when the matching result of the current speaker 200 includes information indicating that the current speaker 200 matches no speaker 200 corresponding to any speaker template, the voice attribute recognition module 160 does not recognize the voice attributes of the current speaker 200.
  • the interaction module 110 can provide the current speaker with a more personalized and intelligent interaction experience based on the voice attribute information of the current speaker 200.
  • The interaction module 110 may also provide a more personalized and intelligent interactive experience to a future speaker 200 according to the voice attribute information associated with the speaker feature category.
  • the voice attribute recognition module 160 can recognize the voice attributes of the current speaker 200, and send the voice attribute information to the interaction module 110, so that The interaction module 110 can provide the current speaker 200 with a more personalized and intelligent interaction experience based on the sound attribute information, for example, but not limited to, the following interaction scenarios:
  • Interaction module 110 (the voice attribute recognition module 160 determines the age attribute of the voice to be child): "The air conditioner has been turned on and set to 28 degrees Celsius."
  • Interaction module 110 (the voice attribute recognition module 160 determines the age attribute of the voice to be adult): "The air conditioner has been turned on and set to 25 degrees Celsius."
  • After re-clustering, each updated speaker feature category may use the user identification and/or voice attribute information corresponding to the largest number of its speaker features as the user identification and/or voice attribute information associated with the updated speaker feature category. For example, if an updated speaker feature category obtained by re-clustering contains 48 speaker features corresponding to the user identification "Zhang San" and 2 speaker features corresponding to the user identification "Li Si", then the user identification "Zhang San" is associated with the updated speaker feature category.
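  • As a concrete illustration of this majority-vote association, the following Python sketch picks the user identification held by the largest number of speaker features in an updated category; the function name and the list-of-labels representation are illustrative assumptions, not structures from the application:

```python
from collections import Counter

def associate_majority_user_id(category_user_ids):
    """Return the user ID carried by the most speaker features in an
    updated speaker feature category (ties broken arbitrarily)."""
    known = [uid for uid in category_user_ids if uid is not None]
    if not known:
        return None  # no user ID has been collected for this category yet
    user_id, _count = Counter(known).most_common(1)[0]
    return user_id

# Example from the text: 48 features labeled "Zhang San", 2 labeled "Li Si".
labels = ["Zhang San"] * 48 + ["Li Si"] * 2
print(associate_majority_user_id(labels))  # -> Zhang San
```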
  • According to the embodiments of the present application, there is no need for the speaker to perform a dedicated voice registration; instead, during voice interaction with the speakers, the speaker features of different speakers are clustered to obtain the speaker template corresponding to each speaker, which is used to identify different speakers. Therefore, the embodiments of the present application can realize imperceptible registration and avoid the negative experience that registration brings to the speaker.
  • Adding the speaker feature of the current speaker to the speaker feature category matching the current speaker and updating the speaker template of that speaker feature category can improve speaker recognition accuracy.
  • Updating the speaker template of an existing speaker feature category, or obtaining the speaker template of a new speaker feature category, can also improve the accuracy of speaker recognition.
  • a more personalized and intelligent interactive experience can be provided to the speaker.
  • Fig. 5 shows a schematic flow chart of a speaker recognition method according to an embodiment of the present application.
  • Different components of the speaker recognition apparatus 100 in Figs. 2-3 can implement different blocks or other parts of the method.
  • The order in which the method steps are described should not be construed as requiring that these steps be executed in that order; the steps need not be performed in the order described, and the method may include other steps besides these, or only some of the described steps.
  • the speaker recognition method includes:
  • Block 501: receive the current voice input from the current speaker 200 through the interaction module 110;
  • Block 502: preprocess the current voice input 300 of the current speaker 200 through the speaker feature acquisition module 120, where the preprocessing may include, but is not limited to, signal enhancement, dereverberation, denoising, etc.; the speaker feature acquisition module 120 then extracts the current voice features of the current speaker 200, such as, but not limited to, FBank features or MFCC features, from the preprocessed current voice input 300;
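  • By way of illustration only, the following Python sketch shows one way such frame-level voice features could be computed, here MFCCs via the librosa library; the application does not mandate any particular feature extractor, and the function name and parameter values are assumptions:

```python
import numpy as np
import librosa  # assumed available; any MFCC/FBank implementation would do

def extract_voice_features(waveform, sample_rate=16000, n_mfcc=20):
    """Extract per-frame MFCC features from a (preprocessed) waveform.

    Returns an array of shape (n_frames, n_mfcc); FBank features could be
    obtained analogously from librosa.feature.melspectrogram.
    """
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T  # (frames, coefficients)

# Toy usage on one second of synthetic audio.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
tone = (0.1 * np.sin(2 * np.pi * 220.0 * t)).astype(np.float32)
print(extract_voice_features(tone).shape)  # e.g. (32, 20) with default hops
```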
  • Block 503: through the speaker feature acquisition module 120, based on the current voice features of the current speaker 200, obtain the current speaker feature of the current speaker 200 according to a voiceprint model, such as, but not limited to, a GMM-UBM, an I-vector model, or a JFA model; the current speaker feature may include, but is not limited to, the mean supervector of the GMM-UBM model, the speaker-dependent supervector of the JFA model, the I-vector of the I-vector model, etc.;
  • Block 504: determine whether at least one speaker template exists through the speaker matching module 140; if it is determined that at least one speaker template exists, block 505 is executed, and if it is determined that no speaker template exists, block 507 is executed;
  • Block 505: through the speaker matching module 140, determine the similarity between the current speaker feature of the current speaker 200 and each speaker template; specifically, the similarity between the current speaker feature and each speaker template can be determined by calculating the distance between them, where the distance may include, but is not limited to, cosine distance, EMD (Earth Mover's Distance), Euclidean distance, Manhattan distance, etc.;
  • Block 506: through the speaker matching module 140, determine, according to the maximum similarity between the current speaker feature of the current speaker 200 and the speaker templates, whether the current speaker 200 matches one speaker 200 among the at least one speaker 200 corresponding to the at least one speaker template; specifically, when the speaker matching module 140 determines that the maximum similarity is higher than or equal to a similarity threshold, it can determine that the current speaker feature matches the speaker template with the maximum similarity, that is, that the current speaker 200 matches the speaker 200 corresponding to that speaker template; in the case that the speaker matching module 140 determines that the maximum similarity is lower than the similarity threshold, it can determine that the current speaker feature does not match any speaker template, that is, that the current speaker 200 does not match any speaker 200 corresponding to the speaker templates;
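  • The following Python sketch illustrates blocks 505 and 506 under stated assumptions: cosine similarity is used as the similarity measure (any of the distances listed above would serve), the 0.75 similarity threshold is an invented illustrative value, and the template dictionary is a hypothetical representation:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_speaker(current_feature, templates, threshold=0.75):
    """Block 505/506 sketch: score the current speaker feature against every
    speaker template and accept the best match only above the threshold.

    Returns (matched_system_id_or_None, best_score_or_None).
    """
    if not templates:
        return None, None  # no templates yet (block 504 -> block 507)
    scores = {sid: cosine_similarity(current_feature, template)
              for sid, template in templates.items()}
    best_id = max(scores, key=scores.get)
    if scores[best_id] >= threshold:
        return best_id, scores[best_id]  # current speaker 200 matches
    return None, scores[best_id]         # no speaker template matches

rng = np.random.default_rng(0)
templates = {"speaker A": rng.normal(size=64), "speaker B": rng.normal(size=64)}
current = templates["speaker A"] + 0.05 * rng.normal(size=64)
print(match_speaker(current, templates))  # -> ('speaker A', ~0.99)
```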
  • Block 507: the speaker feature set maintenance unit 131 of the speaker template acquisition module 130 adds the current speaker feature to the speaker feature set. In one example, in the case that the speaker matching module 140 determines that the current speaker feature matches a speaker template, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to the matching speaker template; in the case that the speaker matching module 140 determines that the current speaker feature does not match any speaker template, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set without assigning the current speaker feature to any speaker feature category;
  • In addition, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set. Specifically, if the signal-to-noise ratio of the current voice input of the current speaker 200 is higher than or equal to a signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set; if the signal-to-noise ratio of the current voice input of the current speaker 200 is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
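  • A minimal sketch of this signal-to-noise-ratio gate follows; the SNR estimator, the 15 dB threshold, and the assumption that a noise-only segment is available are all illustrative choices, not details from the application:

```python
import numpy as np

def estimate_snr_db(speech_segment, noise_segment):
    """Rough SNR estimate in dB from a speech segment and a noise-only
    segment (how the noise floor is obtained is not specified here)."""
    p_speech = float(np.mean(np.square(speech_segment)))
    p_noise = float(np.mean(np.square(noise_segment))) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)

def should_keep_feature(speech_segment, noise_segment, snr_threshold_db=15.0):
    """Keep the current speaker feature only if the current voice input's
    SNR reaches the (illustrative) threshold."""
    return estimate_snr_db(speech_segment, noise_segment) >= snr_threshold_db

rng = np.random.default_rng(3)
noise = 0.01 * rng.normal(size=16000)
speech = np.sin(2 * np.pi * 220.0 * np.linspace(0.0, 1.0, 16000)) + noise
print(should_keep_feature(speech, noise))  # -> True for this clean input
```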
  • Block 508: determine, through the speaker feature set maintenance unit 131 of the speaker template acquisition module 130, whether the speaker feature set satisfies the clustering condition; if so, execute block 509; if not, return to block 501, that is, receive the next voice input;
  • In one example, determining whether the speaker feature set satisfies the clustering condition may include determining whether the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to a first clustering threshold; in another example, it may include determining whether the number of speaker features in the speaker feature category corresponding to the speaker template matched with the current speaker feature is equal to a second clustering threshold, where the second clustering threshold may include one or more values, for example, but not limited to, 50, 100, 200, 500, or other values;
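  • The two clustering conditions just described can be captured in a small helper such as the following sketch; the threshold values are only the illustrative ones mentioned in the text, and the function shape is an assumption:

```python
def clustering_condition_met(uncategorized_count,
                             matched_category_count=None,
                             first_threshold=50,
                             second_thresholds=(50, 100, 200, 500)):
    """Return True if the speaker feature set satisfies either clustering
    condition: enough uncategorized features (first clustering threshold),
    or the matched category has just reached one of the second clustering
    threshold values. `matched_category_count` is None when no speaker
    template matched the current speaker feature.
    """
    if uncategorized_count >= first_threshold:
        return True
    if matched_category_count is not None \
            and matched_category_count in second_thresholds:
        return True
    return False

print(clustering_condition_met(uncategorized_count=50))         # -> True
print(clustering_condition_met(3, matched_category_count=100))  # -> True
print(clustering_condition_met(3, matched_category_count=101))  # -> False
```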
  • Block 509: the speaker feature clustering unit 132 of the speaker template acquisition module 130 uses a clustering algorithm to cluster the speaker features in the speaker feature set;
  • In one example, in the case that the speaker matching module 140 determines that the current speaker feature does not match any speaker template and the number of speaker features not included in any speaker feature category is greater than or equal to the first clustering threshold, the speaker feature clustering unit 132 can cluster the speaker features that are not included in any speaker feature category and obtain at least one new speaker feature category.
  • Each new speaker feature category is assigned a new system identification (for example, but not limited to, speaker C, speaker D, etc.), where the new speaker feature categories correspond one-to-one to speakers 200 and each new speaker feature category includes at least one speaker feature. A new system identification can be associated with the new speaker feature category, with at least one speaker feature in the new speaker feature category, and with the speaker template corresponding to the new speaker feature category. In another example, the speaker feature clustering unit 132 may re-cluster all speaker features in the speaker feature set to obtain at least one updated speaker feature category, where the at least one updated speaker feature category corresponds one-to-one to speakers 200 and each updated speaker feature category includes at least one speaker feature in the speaker feature set;
  • For example, in the case that the speaker matching module 140 determines that the current speaker feature matches the speaker template with the greatest similarity, and the number of speaker features within the speaker feature category corresponding to that matching speaker template is equal to the second clustering threshold, the speaker feature clustering unit 132 may re-cluster all speaker features in the speaker feature set to obtain at least one updated speaker feature category, where the at least one updated speaker feature category corresponds one-to-one to speakers 200 and each updated speaker feature category includes at least one speaker feature in the speaker feature set;
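  • Since the application does not prescribe a particular clustering algorithm, the following sketch stands in with average-linkage agglomerative clustering over cosine distance (via SciPy); the 0.4 cut distance and the synthetic data are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_speaker_features(features, cut_distance=0.4):
    """Group speaker feature vectors into speaker feature categories and
    return one integer category label per feature vector."""
    Z = linkage(features, method="average", metric="cosine")
    return fcluster(Z, t=cut_distance, criterion="distance")

# Two synthetic speakers: tight bundles around two random directions.
rng = np.random.default_rng(1)
dir_a = rng.normal(size=64)
dir_a /= np.linalg.norm(dir_a)
dir_b = rng.normal(size=64)
dir_b /= np.linalg.norm(dir_b)
features = np.vstack([dir_a + 0.05 * rng.normal(size=(30, 64)),
                      dir_b + 0.05 * rng.normal(size=(20, 64))])
print(sorted(set(cluster_speaker_features(features))))  # e.g. [1, 2]
```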
  • Block 510: through the speaker template obtaining unit 133 of the speaker template acquisition module 130, for each speaker feature category, obtain the speaker template corresponding to that speaker feature category according to the speaker features in the category;
  • In one example, the speaker template obtaining unit 133 may obtain a new speaker template corresponding to each new speaker feature category according to the at least one speaker feature in that new speaker feature category;
  • In one example, the speaker template obtaining unit 133 may obtain the updated speaker template corresponding to each updated speaker feature category according to the at least one updated speaker feature in that category; in another example, the speaker template obtaining unit 133 may obtain only the updated speaker template corresponding to the updated speaker feature category matching the current speaker 200, according to the at least one updated speaker feature in that category, where the updated speaker feature category matching the current speaker 200 can be determined with reference to the description of the apparatus above;
  • In addition, the speaker template obtaining unit 133 may update, according to the speaker features (including the current speaker feature) in the speaker feature category matching the current speaker 200, the speaker template corresponding to the speaker feature category matched with the current speaker 200;
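  • Consistent with the summary of the invention later in this document, which mentions the mean or a weighted sum of a category's speaker features, a minimal template-derivation sketch is given below; re-running it after the current speaker feature has been added to the matched category yields that category's updated speaker template:

```python
import numpy as np

def speaker_template(category_features, weights=None):
    """Derive a speaker template from the speaker features of one category:
    the plain mean by default, or a weighted sum when weights are given."""
    features = np.asarray(category_features, dtype=float)
    if weights is None:
        return features.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * features).sum(axis=0)

category = [np.ones(4), 3.0 * np.ones(4)]
print(speaker_template(category))                # -> [2. 2. 2. 2.]
print(speaker_template(category, [0.25, 0.25]))  # -> [1. 1. 1. 1.]
```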
  • Fig. 6 shows another schematic flow chart of a speaker recognition method according to an embodiment of the present application.
  • Different components of the speaker recognition device 100 in Figs. 2 and 3 can implement different blocks or other parts of the method.
  • The order in which the method steps are described should not be construed as requiring that these steps be executed in that order; the steps need not be performed in the order described, and the method may include other steps besides these, or only some of the described steps.
  • the speaker recognition method includes:
  • Blocks 601-606: refer to the description of blocks 501-506, which is not repeated here;
  • Block 607: the user identification acquisition module 150 may determine, according to the matching result of the current speaker 200, whether it is necessary to obtain the user identification of the current speaker 200, for example, name, gender, age, authority, preferences, etc.;
  • In one example, in the case that the matching result of the current speaker 200 includes a speaker template matching the current speaker feature, or the system identification of the speaker feature category corresponding to that speaker template, if the user identification acquisition module 150 determines that it has not yet acquired the user identification of the speaker feature category corresponding to the speaker template, the user identification acquisition module 150 determines that the user identification of the current speaker 200 needs to be obtained and triggers the interaction module 110 to interact with the current speaker 200;
  • In another example, in the case that the matching result of the current speaker 200 includes a speaker template that matches the current speaker feature, or the system identification of the speaker feature category corresponding to that speaker template, if the user identification acquisition module 150 determines that it has already acquired the user identification of the speaker feature category corresponding to the speaker template, or when the matching result of the current speaker 200 includes information indicating that the current speaker 200 does not match the speaker 200 corresponding to any speaker template, the user identification acquisition module 150 determines that it is not necessary to acquire the user identification of the current speaker 200 and does not trigger the interaction module 110 to interact with the current speaker 200;
  • Block 608: obtain the user identification of the current speaker 200 through the user identification acquisition module 150 according to the interaction with the current speaker;
  • Block 609: the speaker feature set maintenance unit 131 of the speaker template acquisition module 130 adds the current speaker feature to the speaker feature set. In one example, in the case of a match, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to the matching speaker template; in addition, when the user identification acquisition module 150 has acquired the user identification of the current speaker 200, the speaker feature set maintenance unit 131 also associates the user identification of the current speaker with the speaker feature category, the speaker features in the speaker feature category, and the speaker template corresponding to the speaker feature category;
  • In another example, in the case of no match, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set without assigning the current speaker feature to any speaker feature category;
  • In addition, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set. Specifically, if the signal-to-noise ratio of the current voice input of the current speaker 200 is higher than or equal to a signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set; if the signal-to-noise ratio of the current voice input of the current speaker 200 is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
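  • The association bookkeeping described for block 609 might be kept in a structure like the following toy registry; the class and field names are hypothetical, not the application's actual data layout:

```python
class SpeakerRegistry:
    """Toy per-category bookkeeping: features, template, user ID, and (for
    the method of Fig. 7) voice attributes share one category record, so
    associating with the category covers all of them."""

    def __init__(self):
        self.categories = {}  # system_id -> category record

    def _category(self, system_id):
        return self.categories.setdefault(
            system_id,
            {"features": [], "template": None,
             "user_id": None, "voice_attributes": None})

    def add_feature(self, system_id, feature):
        self._category(system_id)["features"].append(feature)

    def associate(self, system_id, user_id=None, voice_attributes=None):
        record = self._category(system_id)
        if user_id is not None:
            record["user_id"] = user_id
        if voice_attributes is not None:
            record["voice_attributes"] = voice_attributes

registry = SpeakerRegistry()
registry.add_feature("speaker A", [0.1, 0.2])
registry.associate("speaker A", user_id="Zhang San")
print(registry.categories["speaker A"]["user_id"])  # -> Zhang San
```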
  • Block 610: refer to the description of block 508, which is not repeated here;
  • Block 611: refer to the description of block 509, which is not repeated here;
  • In addition, since the speaker features in a speaker feature category have associated user identifications, if re-clustering occurs, then for each updated speaker feature category after re-clustering, the user identification corresponding to the largest number of speaker features may be used as the user identification associated with the updated speaker feature category. For example, if an updated speaker feature category after re-clustering contains 48 speaker features corresponding to the user identification "Zhang San" and 2 speaker features corresponding to the user identification "Li Si", then the user identification "Zhang San" is associated with that updated speaker feature category;
  • Fig. 7 shows another schematic flow chart of a speaker recognition method according to an embodiment of the present application.
  • Different components of the speaker recognition apparatus 100 in Figs. 2 and 3 can implement different blocks or other parts of the method.
  • The order in which the method steps are described should not be construed as requiring that these steps be executed in that order; the steps need not be performed in the order described, and the method may include other steps besides these, or only some of the described steps.
  • the speaker recognition method includes:
  • Blocks 701-707: refer to the descriptions of blocks 601-607, which are not repeated here;
  • Block 708: through the voice attribute recognition module 160, recognize the voice attributes of the current speaker 200 according to the current voice features of the current speaker, where the voice attribute information may include, but is not limited to, the gender attribute of the voice, the age attribute of the voice, etc.;
  • the voice attribute recognition module 160 can use any voice attribute recognition technology in the prior art to recognize the voice attributes of the current speaker 200, for example, but not limited to, a voice classifier obtained by training a classification neural network on a large amount of voice sample data.
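  • The application mentions a classifier trained on a large amount of voice sample data, for example a classification neural network; as a stand-in, the following sketch trains a scikit-learn logistic-regression classifier on synthetic utterance-level features to predict the age attribute of the voice, with the data, labels, and model choice all being illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in training data: utterance-level feature vectors with age labels.
rng = np.random.default_rng(2)
X_train = np.vstack([rng.normal(0.0, 1.0, size=(100, 64)),
                     rng.normal(1.0, 1.0, size=(100, 64))])
y_train = np.array(["child"] * 100 + ["adult"] * 100)

classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def recognize_voice_attributes(current_voice_feature):
    """Predict an age attribute for the current speaker 200's voice."""
    return classifier.predict(current_voice_feature.reshape(1, -1))[0]

print(recognize_voice_attributes(rng.normal(1.0, 1.0, size=64)))  # e.g. adult
```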
  • Block 709: obtain the user identification of the current speaker 200 through the user identification acquisition module 150 according to an interaction with the current speaker 200 that is based on the voice attribute information;
  • the voice attribute information may be sent to the interaction module 110, so that the interaction module 110 interacts with the current speaker 200 based on the voice attribute information .
  • the user ID acquisition module 150 can determine the user ID of the current speaker 200 according to the interactive voice input of the current speaker 200;
  • Block 710: add the current speaker feature to the speaker feature set through the speaker feature set maintenance unit 131 of the speaker template acquisition module 130;
  • In one example, in the case of a match, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to the matching speaker template; in addition, when the voice attribute recognition module 160 has recognized the voice attributes of the current speaker 200 and the user identification acquisition module 150 has acquired the user identification of the current speaker 200, the speaker feature set maintenance unit 131 also associates the voice attributes and user identification of the current speaker with the speaker feature category, the speaker features in the speaker feature category, and the speaker template corresponding to the speaker feature category;
  • In another example, in the case of no match, the speaker feature set maintenance unit 131 can add the current speaker feature to the speaker feature set without assigning the current speaker feature to any speaker feature category;
  • In addition, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set. Specifically, if the signal-to-noise ratio of the current voice input of the current speaker 200 is higher than or equal to a signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set; if the signal-to-noise ratio of the current voice input of the current speaker 200 is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
  • Blocks 711-713: refer to the description of blocks 610-612, which is not repeated here.
  • In addition, after re-clustering, each updated speaker feature category may use the user identification and voice attribute information corresponding to the largest number of its speaker features as the user identification and voice attribute information associated with the updated speaker feature category.
  • According to the embodiments of the present application, there is no need for the speaker to perform a dedicated voice registration; instead, during voice interaction with the speakers, the speaker features of different speakers are clustered to obtain the speaker template corresponding to each speaker, which is used to identify different speakers. Therefore, the embodiments of the present application can realize imperceptible registration and avoid the negative experience that registration brings to the speaker.
  • Adding the speaker feature of the current speaker to the speaker feature category matching the current speaker and updating the speaker template of that speaker feature category can improve speaker recognition accuracy.
  • Updating the speaker template of an existing speaker feature category, or obtaining the speaker template of a new speaker feature category, can also improve the accuracy of speaker recognition.
  • a more personalized and intelligent interactive experience can be provided to the speaker.
  • FIG. 8 shows a schematic structural diagram of an example system 800 according to an embodiment of the present application.
  • the system 800 may include one or more processors 802, system control logic 808 connected to at least one of the processors 802, system memory 804 connected to the system control logic 808, non-volatile memory (NVM) 806 connected to the system control logic 808, and a network interface 810 connected to the system control logic 808.
  • the processor 802 may include one or more single-core or multi-core processors.
  • the processor 802 may include any combination of a general-purpose processor and a special-purpose processor (for example, a graphics processor, an application processor, a baseband processor, etc.).
  • the processor 802 may be configured to execute one or more of the various embodiments shown in Figs. 5-7.
  • system control logic 808 may include any suitable interface controller to provide any suitable interface to at least one of the processors 802 and/or to any suitable device or component in communication with the system control logic 808.
  • system control logic 808 may include one or more memory controllers to provide an interface to the system memory 804.
  • the system memory 804 may be used to load and store data and/or instructions for the system 800.
  • the memory 804 of the system 800 may include any suitable volatile memory, such as a suitable dynamic random access memory (DRAM).
  • the NVM/memory 806 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions.
  • the NVM/memory 806 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as at least one HDD (Hard Disk Drive), CD (Compact Disc) drive, or DVD (Digital Versatile Disc) drive.
  • the NVM/memory 806 may include a portion of the storage resources installed on the apparatus of the system 800, or may be accessible by the device without necessarily being a part of the device.
  • the NVM/storage 806 can be accessed through the network via the network interface 810.
  • the system memory 804 and the NVM/memory 806 may include, respectively, a temporary copy and a permanent copy of the instructions 820.
  • the instructions 820 may include instructions that, when executed by at least one of the processors 802, cause the system 800 to implement one or more of the various embodiments shown in FIGS. 5-7.
  • the instructions 820, hardware, firmware, and/or software components thereof may additionally/alternatively be placed in the system control logic 808, the network interface 810, and/or the processor 802.
  • the network interface 810 may include a transceiver to provide a radio interface for the system 800, and then communicate with any other suitable devices (such as a front-end module, an antenna, etc.) through one or more networks.
  • the network interface 810 may be integrated with other components of the system 800.
  • In one embodiment, the system 800 may include at least one of a processor 802, the system memory 804, the NVM/memory 806, and a firmware device (not shown) having instructions stored thereon; when at least one of the processors 802 executes the instructions, the system 800 implements one or more of the various embodiments shown in Figs. 5-7.
  • the network interface 810 may further include any suitable hardware and/or firmware to provide a multiple input multiple output radio interface.
  • the network interface 810 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
  • In one embodiment, at least one of the processors 802 may be packaged together with the logic of one or more controllers of the system control logic 808 to form a system in package (SiP). In one embodiment, at least one of the processors 802 may be integrated on the same die with the logic of one or more controllers of the system control logic 808 to form a system on chip (SoC).
  • the system 800 may further include: an input/output (I/O) interface 812.
  • the I/O interface 812 may include a user interface to enable a user to interact with the system 800, and a peripheral component interface designed so that peripheral components can also interact with the system 800.
  • the system 800 further includes a sensor for determining at least one of environmental conditions and location information related to the system 800.
  • the user interface may include, but is not limited to, a display (e.g., a liquid crystal display or a touch screen display), speakers, a microphone, one or more cameras (e.g., still image cameras and/or video cameras), a flash (e.g., an LED flash), and a keyboard.
  • the peripheral component interface may include, but is not limited to, a non-volatile memory port, an audio jack, and a power interface.
  • the sensor may include, but is not limited to, a gyroscope sensor, an accelerometer, a proximity sensor, an ambient light sensor, and a positioning unit.
  • the positioning unit may also be part of or interact with the network interface 810 to communicate with components of the positioning network (eg, global positioning system (GPS) satellites).
  • As used herein, "module" or "unit" may refer to, be, or include: an application-specific integrated circuit (ASIC), an electronic circuit, a (shared, dedicated, or group) processor and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functions.
  • the various embodiments of the mechanism disclosed in this application may be implemented in hardware, software, firmware, or a combination of these implementation methods.
  • the embodiments of the present application can be implemented as computer programs or program code executed on a programmable system that includes at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • Program codes can be applied to input instructions to perform the functions described in this application and generate output information.
  • the output information can be applied to one or more output devices in a known manner.
  • a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
  • DSP digital signal processor
  • ASIC application specific integrated circuit
  • the program code can be implemented in a high-level programming language or an object-oriented programming language to communicate with the processing system.
  • assembly language or machine language can also be used to implement the program code.
  • the mechanisms described in this application are not limited in scope to any particular programming language; in either case, the language can be a compiled language or an interpreted language.
  • the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof.
  • one or more aspects of at least some embodiments may be implemented by representative instructions stored on a computer-readable storage medium.
  • the instructions represent various logic in the processor and, when read by a machine, cause the machine to fabricate the logic that performs the techniques described in this application.
  • IP cores can be stored on a tangible computer-readable storage medium and provided to multiple customers or production facilities to be loaded into the manufacturing machine that actually manufactures the logic or processor.
  • Such computer-readable storage media may include, but are not limited to, non-transitory tangible arrangements of articles manufactured or formed by machines or devices, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memory (ROM), random access memory (RAM) such as dynamic random access memory (DRAM) and static random access memory (SRAM), erasable programmable read-only memory (EPROM), flash memory, and electrically erasable programmable read-only memory (EEPROM); phase change memory (PCM); magnetic or optical cards; or any other type of medium suitable for storing electronic instructions.
  • Therefore, each embodiment of the present application also includes a non-transitory computer-readable storage medium that contains instructions or contains design data, such as a hardware description language (HDL), which defines the structures, circuits, apparatuses, processor and/or system features described in this application.

Abstract

A voice processing method comprises: receiving multiple pieces of voice input, and extracting multiple voice features from the multiple pieces of voice input; determining multiple speaker features on the basis of the multiple voice features; clustering the multiple speaker features into at least one speaker feature category, wherein the at least one speaker feature category corresponds one-to-one to at least one speaker (200), and each of the at least one speaker feature category comprises at least one of the multiple speaker features; determining at least one speaker template on the basis of the at least one speaker feature category, wherein the at least one speaker template corresponds one-to-one to the at least one speaker (200); and receiving current voice input from a current speaker (200), and determining, on the basis of the current voice input and the at least one speaker template, whether the current speaker (200) matches one of the at least one speaker (200). The method implements imperceptible registration, and prevents registration from causing a negative experience for the speaker (200).

Description

Voice processing method, medium, and system

This application claims priority to the Chinese patent application No. 202010025486.X, filed with the Chinese Patent Office on January 10, 2020 and entitled "Voice processing method, medium, and system", the entire content of which is incorporated herein by reference.

Technical field

One or more embodiments of the present application generally relate to the field, and specifically relate to a voice processing method, medium, and system.
Background

Speaker recognition, that is, voiceprint recognition, belongs to biometric recognition technology; it is a technology that automatically identifies and verifies the identity of a speaker through the analysis of, and feature extraction from, speech signals.

Existing speaker recognition includes two stages: registration and verification. In the registration stage, the system requires the registrant to record multiple registration utterances according to specified requirements, and the system internally converts these registration utterances into a corresponding speaker model. In the voiceprint verification stage, the system performs feature analysis and extraction on the recorded verification speech, scores its similarity against the speaker model generated in the registration stage, and judges, according to a set threshold, whether the verification speech matches the registrant.
Summary of the invention

The following describes the present application from multiple aspects; the implementations and beneficial effects of the following aspects may be cross-referenced.

A first aspect of the present application provides a voice processing method, which may include: receiving multiple voice inputs and extracting multiple voice features from the multiple voice inputs; determining multiple speaker features based on the multiple voice features; clustering the multiple speaker features into at least one speaker feature category, where the at least one speaker feature category corresponds one-to-one to at least one speaker, and each speaker feature category of the at least one speaker feature category includes at least one speaker feature of the multiple speaker features; determining at least one speaker template based on the at least one speaker feature category, where the at least one speaker template corresponds one-to-one to the at least one speaker; and receiving a current voice input from a current speaker, and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker.
According to the embodiments of the present application, the speaker does not need to perform a dedicated voice registration; instead, during voice interaction with the speakers, the speaker features of different speakers are clustered to obtain the speaker template corresponding to each speaker, which is then used to identify different speakers. Therefore, the embodiments of the present application can realize imperceptible registration and avoid the negative experience that registration brings to the speaker.
In some embodiments, determining multiple speaker features based on the multiple voice features includes: determining the multiple speaker features through a voiceprint model based on the multiple voice features, where the voiceprint model includes at least one of a Gaussian mixture-universal background model (GMM-UBM), an I-vector model, and a joint factor analysis (JFA) model, and the multiple speaker features include at least one of the mean supervector of the GMM-UBM, the I-vector of the I-vector model, and the speaker-dependent supervector of the JFA model.

In some embodiments, determining at least one speaker template based on the at least one speaker feature category includes: determining the mean or weighted sum of the at least one speaker feature within each speaker feature category, and using the at least one mean or weighted sum as the at least one speaker template.

In some embodiments, clustering the multiple speaker features into at least one speaker feature category includes: clustering the multiple speaker features into the at least one speaker feature category based on at least one of the similarity between every two speaker features among the multiple speaker features, the offset between two speaker features among the multiple speaker features, and the density distribution of the multiple speaker features.

In some embodiments, receiving the current voice input from the current speaker and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker further includes: receiving the current voice input from the current speaker and extracting a current voice feature from the current voice input; determining a current speaker feature based on the current voice feature; determining whether the current speaker feature matches one speaker template of the at least one speaker template; and in the case of determining that the current speaker feature matches one speaker template, determining that the current speaker matches the speaker corresponding to that speaker template.

In some embodiments, receiving the current voice input from the current speaker and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker includes: determining whether the current speaker matches one speaker of the at least one speaker based on the similarity between the current speaker feature and each speaker template of the at least one speaker template.
In some embodiments, the method further includes: in the case of determining that the current speaker matches one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of the at least one speaker feature in the speaker feature category corresponding to that speaker is equal to a first threshold; and in the case of determining that the sum is not equal to the first threshold, updating the speaker template corresponding to that speaker based on the current speaker feature and the at least one speaker feature in that speaker feature category.

According to the embodiments of the present application, after the current speaker is successfully matched, adding the speaker feature of the current speaker to the speaker feature category matching the current speaker and updating the speaker template of that speaker feature category can improve speaker recognition accuracy.
In some embodiments, the method further includes: in the case of determining that the current speaker matches one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of the at least one speaker feature in the speaker feature category corresponding to that speaker is equal to the first threshold; in the case of determining that the sum is equal to the first threshold, adding the current speaker feature to the multiple speaker features to form updated multiple speaker features; clustering the updated multiple speaker features into at least one updated speaker feature category, where the at least one updated speaker feature category corresponds one-to-one to at least one updated speaker, and each updated speaker feature category includes at least one speaker feature of the updated multiple speaker features; and determining at least one updated speaker template based on the at least one updated speaker feature category, where the at least one updated speaker template corresponds one-to-one to the at least one updated speaker.

According to the embodiments of the present application, after the current speaker is successfully matched, if the speaker category corresponding to the matched speaker satisfies the clustering condition, all speaker features (including the current speaker feature) are re-clustered, thereby updating the speaker feature categories and the speaker templates corresponding to them; in this way, as the number of received voice inputs increases, speaker recognition accuracy can be gradually improved.
In some embodiments, the method further includes: in the case of determining that the current speaker does not match the at least one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of the at least one speaker feature not included in the at least one speaker feature category is greater than or equal to a second threshold; in the case of determining that the sum is greater than or equal to the second threshold, clustering the current speaker feature and the at least one speaker feature not included in the at least one speaker feature category into at least one other speaker feature category, where the at least one other speaker feature category corresponds one-to-one to at least one other speaker; and determining at least one other speaker template based on the at least one other speaker feature category, where the at least one other speaker template corresponds one-to-one to the at least one other speaker.

According to the embodiments of the present application, after the current speaker fails to be matched, if the number of features of unmatched speakers satisfies the clustering condition, the features of these unmatched speakers are clustered to obtain new speaker feature categories and the new speaker templates corresponding to them; in this way, as the number of received voice inputs increases, speaker recognition accuracy can be gradually improved.
In some embodiments, the method further includes: in the case of determining that the current speaker does not match the at least one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of the at least one speaker feature not included in the at least one speaker category is greater than or equal to the second threshold; in the case of determining that the sum is greater than or equal to the second threshold, adding the current speaker feature and the at least one speaker feature not included in the at least one speaker category to the multiple speaker features to form updated multiple speaker features; clustering the updated multiple speaker features into at least one updated speaker feature category; and determining at least one updated speaker template based on the at least one updated speaker feature category, where the at least one updated speaker template corresponds one-to-one to at least one updated speaker.

According to the embodiments of the present application, after the current speaker fails to be matched, if the number of features of unmatched speakers satisfies the clustering condition, all speaker features (including the current speaker feature) are re-clustered, thereby updating the speaker feature categories and the speaker templates corresponding to them; in this way, as the number of received voice inputs increases, speaker recognition accuracy can be gradually improved.
In some embodiments, the method further includes: in the case of determining that the current speaker matches one speaker of the multiple speakers, obtaining a current user identification of the current speaker through interaction with the current speaker; and associating the current user identification with one speaker feature category of the at least one speaker feature category and one speaker template of the at least one speaker template, where that speaker feature category and that speaker template correspond to the matched speaker.

According to the embodiments of the present application, by determining the user identification of the speaker, a more personalized and intelligent interactive experience can be provided to the speaker.

In some embodiments, the current user identification includes at least one of the name, gender, age, authority, and preferences of the current speaker.

In some embodiments, the method further includes: receiving a next voice input from a next speaker, and determining, based on the next voice input and the at least one speaker template, whether the next speaker matches one speaker of the at least one speaker; and in the case of determining that the next speaker matches that speaker, identifying the next speaker with the current user identification.

In some embodiments, the method further includes: in the case that one updated speaker feature category of the at least one updated speaker feature category includes multiple speaker features and the multiple speaker features are associated with multiple user identifications, determining the user identification associated with the largest number of speaker features among the multiple speaker features; and associating that user identification with one speaker template of the at least one updated speaker template, where that updated speaker template corresponds to the updated speaker feature category, and where each user identification of the multiple user identifications includes at least one of the speaker's name, gender, age, authority, and preferences.
In some embodiments, the method further includes: determining a voice attribute of the current speaker based on the current voice input of the current speaker; and in the case of determining that the current speaker matches one speaker of the multiple speakers, associating the voice attribute with the speaker feature category corresponding to that speaker.

According to the embodiments of the present application, by determining the voice attribute information of the speaker, a more personalized and intelligent interactive experience can be provided to the speaker.

In some embodiments, the voice attribute includes at least one of the age attribute of the voice and the gender attribute of the voice.
A second aspect of the present application provides a machine-readable medium having instructions stored thereon which, when run on a machine, cause the machine to execute any one of the above voice processing methods.

A third aspect of the present application provides a system, including: a processor; and a memory having instructions stored thereon which, when run by the processor, cause the system to execute any one of the above voice processing methods.
Description of the drawings

Fig. 1 shows a schematic diagram of a scene of speaker recognition according to an embodiment of the present application;

Fig. 2 shows a schematic structural diagram of a speaker recognition apparatus according to an embodiment of the present application;

Fig. 3 shows a schematic structural diagram of a speaker template acquisition module according to an embodiment of the present application;

Fig. 4 shows a schematic diagram of a scene of voice interaction between a current speaker and the speaker recognition apparatus according to an embodiment of the present application;

Fig. 5 shows a schematic flow chart of a speaker recognition method according to an embodiment of the present application;

Fig. 6 shows another schematic flow chart of a speaker recognition method according to an embodiment of the present application;

Fig. 7 shows another schematic flow chart of a speaker recognition method according to an embodiment of the present application;

Fig. 8 shows a schematic structural diagram of a system according to an embodiment of the present application.
Detailed description

The present application is further described below with reference to specific embodiments and the accompanying drawings. The specific embodiments described here are only intended to explain the application, not to limit it. In addition, for ease of description, the drawings show only the parts related to the present application rather than all of the structures or processes. It should be noted that, in this specification, similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined and explained in subsequent drawings.

Fig. 1 shows a schematic diagram of a scene of speaker recognition according to an embodiment of the present application. As shown in the figure, the speaker recognition apparatus 100 can interact with multiple speakers 200 at different times and receive voice inputs 300 from the multiple speakers 200 during the interaction. According to some embodiments of the present application, the speaker recognition apparatus 100 does not require the speakers 200 to perform a dedicated voice registration, but can identify different speakers 200 based on voice interaction with them.

According to some embodiments of the present application, the speaker recognition apparatus 100 may include, but is not limited to, a smart speaker, smart headphones, a smart bracelet, a smart large screen, a portable or mobile device, a mobile phone, a personal digital assistant, a cellular phone, a handheld PC, a portable media player, a handheld device, a wearable device (for example, a watch, a bracelet, display glasses or goggles, a head-mounted display, a head-mounted device), a navigation device, a server, a network device, a graphics device, a video game device, a set-top box, a laptop device, a virtual reality and/or augmented reality device, an Internet of Things device, an industrial control device, an in-vehicle infotainment device, a streaming media client device, an e-book or reading device, a POS machine, and other devices.

Fig. 2 shows a schematic structural diagram of the speaker recognition apparatus 100 according to an embodiment of the present application. As shown in the figure, the speaker recognition apparatus 100 may include an interaction module 110, a speaker feature acquisition module 120, a speaker template acquisition module 130, and a speaker matching module 140, and optionally a user identification acquisition module 150 and a voice attribute recognition module 160. One or more components of the speaker recognition apparatus 100 (for example, one or more of the interaction module 110, the speaker feature acquisition module 120, the speaker template acquisition module 130, the speaker matching module 140, the user identification acquisition module 150, and the voice attribute recognition module 160) may be constituted by any combination of an application-specific integrated circuit (ASIC), an electronic circuit, a (shared, dedicated, or group) processor and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and other suitable components that provide the described functions. According to one aspect, the processor may be a microprocessor, a digital signal processor, a microcontroller, etc., and/or any combination thereof. According to another aspect, the processor may be a single-core processor, a multi-core processor, etc., and/or any combination thereof.
It should be noted that the structure of the speaker recognition apparatus 100 is not limited to that shown in FIG. 2. The speaker recognition apparatus 100 may further include, but is not limited to, an input/output module for receiving the voice input 300 from a speaker 200 and for outputting interactive sentences to the speaker 200 in the form of voice, text, etc. Examples of the input/output module may include, but are not limited to, loudspeakers, microphones, and displays (for example, liquid crystal displays, touch-screen displays, etc.).
According to some embodiments of the present application, the interaction module 110 is configured to interact with a speaker 200, where the interaction may include, but is not limited to, interactive sentences in voice and/or text form. In the embodiments of the present application, the interaction module 110 may be implemented using any interaction technology in the prior art.
According to some embodiments of the present application, the speaker feature acquisition module 120 is configured to extract the voice features of a speaker 200 (for example, but not limited to, FBank (Filter Bank) features, MFCC (Mel Frequency Cepstrum Coefficient) features, etc.) from the speaker's voice input 300, and to obtain the speaker features of the speaker 200 based on those voice features according to a voiceprint model (for example, but not limited to, a GMM-UBM (Gaussian Mixture Model-Universal Background Model), an I-vector model, a JFA (Joint Factor Analysis) model, etc.).
According to some embodiments of the present application, the speaker template acquisition module 130 is configured to determine whether a speaker feature set satisfies a clustering condition, where the speaker feature set may include multiple speaker features from multiple speakers 200. When the speaker feature set satisfies the clustering condition, the speaker template acquisition module 130 can cluster the multiple speaker features in the set into at least one speaker feature category, where speaker feature categories correspond one-to-one to speakers 200 and each category includes at least one speaker feature from the set. For each speaker feature category, the speaker template acquisition module 130 can then obtain, based on the at least one speaker feature within that category, a speaker template for the speaker 200 associated with that category, where speaker templates also correspond one-to-one to speakers 200.
In one example, the clustering condition may include at least one of the following: the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to a first clustering threshold; the number of speaker features in at least one speaker feature category is equal to a second clustering threshold.
In one example, for each speaker feature category, the speaker template corresponding to that category may be determined based on the mean and/or a weighted sum of the at least one speaker feature within the category.
According to some embodiments of the present application, the speaker matching module 140 is configured to determine, based on whether the current speaker feature of a current speaker 200 matches one of at least one existing speaker template, whether the current speaker 200 matches one of the at least one speaker 200 corresponding to the at least one speaker template.
In one example, when the speaker matching module 140 determines that the current speaker feature of the current speaker 200 matches one of the at least one existing speaker template, it determines that the current speaker 200 matches the speaker 200 corresponding to the matched speaker template; when it determines that the current speaker feature of the current speaker 200 matches none of the speaker templates, it determines that the current speaker 200 matches none of the speakers 200 corresponding to those templates.
In one example, the speaker matching module 140 may determine, based on the similarity between the current speaker feature of the current speaker 200 and each of the at least one existing speaker template, whether the current speaker feature matches one of the at least one existing speaker template.
According to some embodiments of the present application, the user identification acquisition module 150 is configured to acquire the user identification of a speaker 200 based on the interaction between the interaction module 110 and the speaker 200, where the user identification may include, but is not limited to, name, gender, age, permissions, preferences, etc.
According to some embodiments of the present application, the voice attribute recognition module 160 is configured to recognize the voice attribute information of a speaker 200 based on the speaker's voice features (for example, but not limited to, FBank features, MFCC features, etc.), where the voice attribute information may include, but is not limited to, the gender attribute of the voice, the age attribute of the voice, and so on.
In some examples, the interaction module 110 may interact with a speaker 200 based on the speaker's voice attribute information, so that the user identification acquisition module 150 acquires the user identification of the speaker 200.
The functions of the modules of the speaker recognition apparatus 100 are further described below with reference to FIG. 2 and FIG. 3.
According to some embodiments of the present application, the speaker feature acquisition module 120 may preprocess the voice input 300 of a speaker 200, where the preprocessing may include, but is not limited to, signal enhancement, dereverberation, denoising, etc. In addition, the speaker feature acquisition module 120 may extract the voice features of the speaker 200, for example, but not limited to, FBank features, MFCC features, etc., from the preprocessed voice input 300.
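By way of illustration only, the following is a minimal Python sketch of such feature extraction using the librosa library; the sampling rate, the number of coefficients, and the function name are assumptions of the example rather than parameters specified by the embodiments.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=20):
    # Load the (already preprocessed) voice input and compute MFCC features.
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # shape: (num_frames, n_mfcc)
```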
According to other embodiments of the present application, the speaker feature acquisition module 120 may also determine the signal-to-noise ratio of the voice input 300 of a speaker 200, so that the speaker template acquisition module 130 can select high-quality voice inputs for obtaining speaker templates.
According to some embodiments of the present application, the speaker feature acquisition module 120 may also obtain the speaker features of a speaker 200 based on the speaker's voice features and a voiceprint model of the speaker 200. The voiceprint model describes the spatial distribution of the speaker's voice features; examples of voiceprint models may include, but are not limited to, the GMM-UBM model, the I-vector model, and the JFA model. The GMM-UBM model uses Gaussian probability density functions to describe the spatial distribution of a speaker's voice features. Building a GMM-UBM model involves two steps: first, a universal background model (UBM) describing features common to speakers is trained on voice data from a large number of speakers; then, using the UBM as the initial model, adaptive training based on maximum a posteriori probability is performed with each speaker's voice features, yielding that speaker's Gaussian mixture model (GMM). The JFA model builds on the speaker's GMM model and defines an eigenvoice space, an eigenchannel space, and a residual space to describe the spatial distribution of the voice features of the speaker 200; that is, it splits the mean supervector of the speaker's GMM (formed by concatenating the mean vectors of the individual Gaussian probability density functions) into a speaker-related supervector and a channel-related supervector, so that channel interference can be removed and a more accurate description of the speaker obtained. The I-vector model is also based on the speaker's GMM model; it defines a total variability space that captures both inter-speaker and inter-channel differences to describe the spatial distribution of the voice features of the speaker 200, and uses this space to extract a more compact I-vector from the mean supervector of the speaker's GMM.
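As an illustrative sketch only, the following Python code shows how a GMM-UBM mean supervector of the kind described above could be computed with scikit-learn; the number of mixture components and the relevance factor are assumed values chosen for the example, not values prescribed by the embodiments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features, n_components=64):
    # Train the universal background model on voice features pooled
    # from a large number of speakers.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(pooled_features)
    return ubm

def mean_supervector(ubm, speaker_features, relevance=16.0):
    # MAP-adapt the UBM means toward one speaker's voice features and
    # concatenate the adapted means into a mean supervector.
    feats = np.asarray(speaker_features)
    gamma = ubm.predict_proba(feats)                    # (N, K) responsibilities
    n_k = gamma.sum(axis=0)                             # soft counts per component
    e_k = gamma.T @ feats                               # (K, D) weighted feature sums
    e_k = e_k / np.maximum(n_k, 1e-10)[:, None]         # per-component means
    alpha = (n_k / (n_k + relevance))[:, None]          # adaptation coefficients
    adapted = alpha * e_k + (1.0 - alpha) * ubm.means_  # MAP mean adaptation
    return adapted.ravel()                              # (K * D,) supervector
```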
As an example, when the voiceprint model of a speaker 200 is a GMM-UBM model, the speaker feature determined by the speaker feature acquisition module 120 may be the mean supervector of the GMM-UBM model; when the voiceprint model of the speaker 200 is a JFA model, the speaker feature determined by the module 120 may be the speaker-related supervector of the JFA model; and when the voiceprint model of the speaker 200 is an I-vector model, the speaker feature determined by the module 120 may be the I-vector of the I-vector model.
It should be noted that the voiceprint model and the speaker features of a speaker 200 are not limited to the above; other types of voiceprint models and speaker features may also be used.
FIG. 3 shows a schematic structural diagram of the speaker template acquisition module 130 according to an embodiment of the present application. As shown in the figure, the speaker template acquisition module 130 includes, but is not limited to, a speaker feature set maintenance unit 131, a speaker feature clustering unit 132, and a speaker template acquisition unit 133.
According to some embodiments of the present application, the speaker feature set maintenance unit 131 may add the speaker features of a speaker 200 to the speaker feature set according to predetermined rules. The unit 131 may also determine whether the speaker feature set satisfies the clustering condition; if the speaker feature set satisfies the clustering condition, the unit 131 triggers the speaker feature clustering unit 132 to cluster the multiple speaker features in the set into at least one speaker feature category.
As an example, the predetermined rules may include: when no speaker template exists yet, the speaker feature set maintenance unit 131 may directly add the current speaker feature of the current speaker 200 to the speaker feature set; when speaker templates already exist and the speaker matching module 140 determines that the current speaker feature of the current speaker 200 matches one speaker template, the unit 131 may add the current speaker feature to the speaker feature category corresponding to the matched template; when speaker templates already exist and the speaker matching module 140 determines that the current speaker feature matches none of the templates, the unit 131 may add the current speaker feature to the speaker feature set without assigning it to any speaker feature category.
As another example, the predetermined rules may further include: the speaker feature set maintenance unit 131 may determine, according to the signal-to-noise ratio of the voice input 300 of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set. Specifically, if the signal-to-noise ratio of the voice input 300 of the current speaker 200 is higher than or equal to a signal-to-noise ratio threshold, the unit 131 determines to add the current speaker feature to the speaker feature set; if the signal-to-noise ratio is lower than the threshold, the unit 131 determines not to add it.
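Purely as an illustration, such a signal-to-noise gate might be sketched as follows; the power-based SNR estimate from a noise-only segment and the 15 dB threshold are assumptions of the example.

```python
import numpy as np

def estimate_snr_db(speech_samples, noise_samples):
    # Rough SNR estimate from a speech segment and a noise-only segment.
    p_speech = float(np.mean(np.square(speech_samples)))
    p_noise = float(np.mean(np.square(noise_samples))) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)

def keep_for_feature_set(snr_db, snr_threshold_db=15.0):
    # Only features from sufficiently clean voice inputs enter the set.
    return snr_db >= snr_threshold_db
```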
As an example, the clustering condition may include: the number of speaker features not included in any speaker feature category (for example, speaker features not included in any category because no speaker feature category exists yet, or speaker features determined by the speaker matching module 140 to match none of the speaker templates) is greater than or equal to the first clustering threshold.
As another example, the clustering condition may include: the number of speaker features in at least one speaker feature category is equal to the second clustering threshold, where the second clustering threshold may include one or more values, for example, but not limited to, 50, 100, 200, 500, or other values.
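The two conditions above can be checked together; a minimal sketch follows, in which the threshold values are illustrative assumptions.

```python
def clustering_condition_met(num_unassigned, category_sizes,
                             first_threshold=10, second_threshold=50):
    # Condition 1: enough features lie outside every existing category.
    # Condition 2: some category has grown to the second clustering threshold.
    return (num_unassigned >= first_threshold
            or any(size == second_threshold for size in category_sizes))
```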
According to some embodiments of the present application, the speaker feature clustering unit 132 is configured to cluster, in response to a trigger instruction from the speaker feature set maintenance unit 131, the multiple speaker features in the speaker feature set into at least one speaker feature category, and to assign a system identifier (for example, but not limited to, Speaker A, Speaker B, etc.) to the at least one speaker feature category, where a system identifier is associated with one speaker feature category and is also associated with the at least one speaker feature within that category and the speaker template corresponding to that category. Examples of clustering algorithms may include, but are not limited to, the Mean-Shift clustering algorithm, density clustering algorithms (for example, but not limited to, the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm), hierarchical clustering algorithms, or other clustering algorithms, where the Mean-Shift clustering algorithm clusters speaker features based on the offsets between them, density clustering algorithms cluster speaker features based on their density distribution, and hierarchical clustering algorithms cluster speaker features based on the similarity between every two speaker features.
As an example, when the number of speaker features not included in any speaker feature category is greater than or equal to the first clustering threshold, the speaker feature clustering unit 132 may cluster only those speaker features not included in any category, or may cluster all speaker features in the speaker feature set; when the number of speaker features in at least one speaker feature category is equal to the second clustering threshold, the unit 132 may cluster all speaker features in the speaker feature set.
The clustering of speaker features by the speaker feature clustering unit 132 is described below using the Mean-Shift clustering algorithm as an example (a code sketch follows the steps). The Mean-Shift clustering algorithm may include the following steps:
S1: In the set of speaker features to be clustered, randomly select one speaker feature as the center point;
S2: Determine all speaker features within the region of radius r centered on the center point, assign these speaker features to the same cluster, and record the number of times each of these speaker features has appeared;
S3: Compute the offset vectors (i.e., difference vectors) from the center point to each speaker feature determined in S2, and take the mean of these offset vectors as the mean-shift vector;
S4: Move the center point along the direction of the mean-shift vector by a distance equal to the magnitude of the mean-shift vector, obtaining a new center point;
S5: Repeat steps S2 to S4 until the magnitude of the mean-shift vector is smaller than a predetermined value; it should be noted that all speaker features involved in this iterative process are assigned to the same cluster;
S6: Repeat steps S1 to S5 until every speaker feature in the set to be clustered has a corresponding cluster;
S7: Determine the speaker feature categories: for each speaker feature in each cluster, take the cluster in which that speaker feature appeared most often as the speaker feature category to which it belongs.
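As a sketch only, an equivalent result can be obtained with an off-the-shelf Mean-Shift implementation; the bandwidth parameter below plays the role of the radius r and its value is an assumption of the example.

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_speaker_features(features, bandwidth=0.5):
    # Each resulting label corresponds to one speaker feature category;
    # the cluster centers can serve as initial speaker templates.
    features = np.asarray(features)
    ms = MeanShift(bandwidth=bandwidth)
    labels = ms.fit_predict(features)
    return labels, ms.cluster_centers_
```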
According to some embodiments of the present application, for each speaker feature category, the speaker template acquisition unit 133 may determine the mean of the at least one speaker feature within that category and use it as the speaker template. For example, when the speaker features are I-vectors of an I-vector model, the unit 133 may use the mean vector of the at least one I-vector within the category as the speaker template.
According to some embodiments of the present application, for each speaker feature category, the speaker template acquisition unit 133 may determine a weighted sum of the at least one speaker feature within that category and use it as the speaker template, where the weight of each speaker feature may be determined according to the signal-to-noise ratio of the voice input corresponding to that speaker feature.
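For illustration, both template constructions are sketched below; normalizing the (assumed positive) SNR values into weights is a choice made for the example.

```python
import numpy as np

def build_template(features, snrs=None):
    # Mean template by default; SNR-weighted template when SNRs are given.
    feats = np.asarray(features, dtype=float)
    if snrs is None:
        return feats.mean(axis=0)
    weights = np.asarray(snrs, dtype=float)
    weights = weights / weights.sum()  # higher SNR -> larger weight
    return (weights[:, None] * feats).sum(axis=0)
```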
According to some embodiments of the present application, the speaker matching module 140 may first determine whether any speaker template already exists. When the module 140 determines that no speaker template exists (for example, the speaker feature set has not yet been clustered), it does not perform matching for the current speaker 200 and sends the current speaker feature of the current speaker 200 to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 of the module 130 can add the current speaker feature to the speaker feature set.
When the speaker matching module 140 determines that speaker templates already exist (for example, the speaker feature set has been clustered), it may determine the similarity between the current speaker feature and each speaker template, as well as the maximum of these similarities, and determine whether this maximum similarity is higher than or equal to a similarity threshold. In one example, the similarity between the current speaker feature and a speaker template may be determined by computing the distance between them, where the distance may include, but is not limited to, cosine distance, EMD (Earth Mover's Distance), Euclidean distance, Manhattan distance, etc.
On the one hand, when the speaker matching module 140 determines that the maximum similarity is higher than or equal to the similarity threshold, it may determine that the current speaker feature matches the speaker template with the maximum similarity, i.e., that the current speaker 200 matches the speaker 200 corresponding to that speaker template.
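A minimal sketch of this matching step, using cosine similarity and an assumed threshold of 0.7, might look as follows.

```python
import numpy as np

def match_speaker(current_feature, templates, threshold=0.7):
    # Compare the current speaker feature against every template and
    # accept the best match only if it clears the similarity threshold.
    x = np.asarray(current_feature, dtype=float)
    sims = [float(np.dot(x, t) / (np.linalg.norm(x) * np.linalg.norm(t)))
            for t in templates]
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return best, sims[best]  # index of the matched speaker template
    return None, sims[best]      # no speaker template matches
```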
Subsequently, the speaker matching module 140 may also send the current speaker feature and the matching result for the current speaker 200 (for example, but not limited to, the speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, for example, but not limited to, Speaker A) to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can add the current speaker feature to the speaker feature category corresponding to the matched speaker template (hereinafter referred to as the speaker feature category matching the current speaker 200, for example, but not limited to, Speaker A).
Subsequently, the speaker feature set maintenance unit 131 may determine whether the number of speaker features in the speaker feature category matching the current speaker 200 (for example, but not limited to, Speaker A) is equal to the second clustering threshold. When the unit 131 determines that the number of speaker features in that category is equal to the second clustering threshold, it may trigger the speaker feature clustering unit 132 to re-cluster all speaker features in the speaker feature set using a clustering algorithm, so as to obtain at least one updated speaker feature category, where the updated speaker feature categories correspond one-to-one to speakers 200 and each updated category includes at least one speaker feature from the speaker feature set.
The speaker feature set maintenance unit 131 may determine the system identifier associated with each updated speaker feature category. Specifically, for each updated speaker feature category, the speaker template acquisition unit 133 may determine, among the system identifiers associated with the updated at least one speaker feature within that category, the system identifier corresponding to the largest number of speaker features, and associate that identifier with the updated category and with the updated at least one speaker feature within it. The unit 131 may also determine, among the at least one updated speaker feature category, the updated category matching the current speaker 200, namely the category whose associated system identifier is the same as the system identifier associated with the speaker feature category that matched the current speaker 200 before re-clustering.
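A small sketch of this majority-vote association follows; representing unlabeled features as None is an assumption of the example. For instance, an updated category whose features carry 48 "Speaker A" identifiers and 2 "Speaker B" identifiers keeps "Speaker A".

```python
from collections import Counter

def assign_system_identifier(feature_identifiers):
    # After re-clustering, keep the system identifier carried by the
    # largest number of features in the updated category.
    counts = Counter(i for i in feature_identifiers if i is not None)
    return counts.most_common(1)[0][0] if counts else None
```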
The speaker template acquisition unit 133 may obtain, based on the updated at least one speaker feature within the updated speaker feature category matching the current speaker 200, an updated speaker template corresponding to that updated category. The unit 133 may also obtain, based on the updated at least one speaker feature within each updated speaker feature category, an updated speaker template corresponding to each updated category, where updated speaker templates correspond one-to-one to speakers 200. For the acquisition of the updated speaker templates, reference may be made to the description above, which is not repeated here.
In addition, when the speaker feature set maintenance unit 131 determines that the number of speaker features in the speaker feature category matching the current speaker 200 is not equal to the second clustering threshold, the unit 131 does not trigger the speaker feature clustering unit 132 to re-cluster the speaker feature set. However, the speaker template acquisition unit 133 may update the speaker template corresponding to the category matching the current speaker 200 based on the speaker features within that category (including the current speaker feature); for the acquisition of the speaker template, reference may be made to the description above, which is not repeated here.
On the other hand, when the speaker matching module 140 determines that the maximum similarity between the current speaker feature and the speaker templates is lower than the similarity threshold, it may determine that the current speaker feature matches none of the speaker templates, i.e., that the current speaker 200 matches none of the speakers 200 corresponding to those templates.
Subsequently, the speaker matching module 140 may send the current speaker feature and the matching result for the current speaker 200 (for example, but not limited to, information indicating that the current speaker 200 matches none of the speakers 200 corresponding to the speaker templates) to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can add the current speaker feature to the speaker feature set without assigning it to any speaker feature category.
Subsequently, the speaker feature set maintenance unit 131 may determine whether the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to the first clustering threshold. When the unit 131 determines that this number is greater than or equal to the first clustering threshold, it may trigger the speaker feature clustering unit 132 to cluster those unassigned speaker features using a clustering algorithm, obtain at least one new speaker feature category, and assign a new system identifier (for example, but not limited to, Speaker C, Speaker D, etc.) to the at least one new category, where the new speaker feature categories correspond one-to-one to speakers 200, each new category includes at least one speaker feature, and a new system identifier may be associated with one new category as well as with the at least one speaker feature within that category and the speaker template corresponding to it. In another example, the unit 131 may trigger the speaker feature clustering unit 132 to re-cluster all speaker features in the speaker feature set using a clustering algorithm, so as to obtain at least one updated speaker feature category, where the updated categories correspond one-to-one to speakers 200 and each updated category includes at least one speaker feature from the set.
Subsequently, the speaker template acquisition unit 133 may obtain, based on the at least one speaker feature within each new speaker feature category, a new speaker template corresponding to that category, where new speaker templates correspond one-to-one to speakers 200. In another example, the unit 133 may also obtain, based on the updated at least one speaker feature within each updated speaker feature category, an updated speaker template corresponding to each updated category, where updated speaker templates correspond one-to-one to speakers 200. For the acquisition of the speaker templates, reference may be made to the description above, which is not repeated here.
In addition, when the speaker feature set maintenance unit 131 determines that the number of speaker features in the speaker feature set not included in any speaker feature category is smaller than the first clustering threshold, the unit 131 does not trigger the speaker feature clustering unit 132 to perform clustering.
According to some embodiments of the present application, when the speaker recognition apparatus 100 includes the user identification acquisition module 150, the module 150 may determine, based on the matching result for the current speaker 200, whether the user identification of the current speaker 200, for example, name, gender, age, permissions, preferences, etc., needs to be acquired.
On the one hand, when the matching result for the current speaker 200 includes the speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, and the user identification acquisition module 150 determines that the user identification of that speaker feature category has not yet been acquired, the module 150 may trigger the interaction module 110 to interact with the current speaker 200 and determine, according to the interactive voice input of the current speaker 200, the user identification of the speaker feature category matching the current speaker 200.
FIG. 4 is a schematic diagram of a scenario in which the current speaker 200 and the speaker recognition apparatus 100 interact by voice according to an embodiment of the present application. It should be noted that the interaction module 110 of the apparatus 100 may also interact with the current speaker 200 in text form. In the scenario shown in FIG. 4, to determine the name and preference information of the current speaker 200, the following voice interaction may take place between the interaction module 110 and the current speaker 200:
Interaction module 110: "I've been listening to your voice for a while now. May I ask your name?"
Current speaker 200: "Zhang San."
Interaction module 110: "Nice to meet you. May I get to know you a little better?"
Current speaker 200: "Sure."
Interaction module 110: "Which of the following types of movies do you prefer: kung fu, comedy, thriller..."
Current speaker 200: "Comedy."
In another example, to determine the name and permission information of the current speaker 200, the following voice interaction may take place between the interaction module 110 and the current speaker 200:
Interaction module 110: "I've been listening to your voice for a while now. May I ask your name?"
Current speaker 200: "Zhang San."
Interaction module 110: "Hello, Zhang San. Would you like to set your own permission level? Host has higher permissions, and Guest has guest permissions."
Current speaker 200: "Host permissions."
Interaction module 110: "Please enter the highest-permission password."
Current speaker 200: "1 2 3 4 5 6"
Interaction module 110: "The password is incorrect. Please enter it again."
Current speaker 200: "6 5 4 3 2 1"
Interaction module 110: "The password is correct. Mr. Zhang, congratulations, you now have Host permissions."
Based on the above voice interactions, the user identification acquisition module 150 may determine that the name of the current speaker 200 is "Zhang San", the preference is "comedy", and the permission is "Host permissions".
After acquiring the user identification of the current speaker 200, the user identification acquisition module 150 may send the user identification to the speaker template acquisition module 130, so that when the speaker feature set maintenance unit 131 adds the current speaker feature to the speaker feature category matching the current speaker 200, it associates the user identification of the current speaker with that category, with the speaker features within it, and with the speaker template corresponding to it.
In this way, the interaction module 110 can provide the current speaker with a more personalized and more intelligent interactive experience based on the user identification of the current speaker 200. In addition, when the speaker recognition apparatus 100 determines that some future speaker 200 matches this speaker feature category, the interaction module 110 can identify the future speaker 200 according to the user identification associated with the speaker features in that category and provide the future speaker 200 with a more personalized and more intelligent interactive experience, for example, but not limited to, the following interaction scenarios:
Future speaker 200: "Set the air conditioner to 25 degrees for me."
Interaction module 110: (determining, from the user identification of the speaker feature category matching the future speaker 200, that the future speaker's name is Li Si) "Okay, Mr. Li, the air conditioner has been set to 25 degrees for you."
Future speaker 200: "Play the latest movie for me."
Interaction module 110: (determining, from the user identification of the speaker feature category matching the future speaker 200, that the future speaker's preference is comedy) "Mr. Li, a highly rated comedy movie, 《西红柿首富》, was released recently. Please enjoy."
Future speaker 200: "Delete all local movies for me."
Interaction module 110: (determining, from the user identification of the speaker feature category matching the future speaker 200, that the future speaker's permission is Guest) "Mr. Li, sorry, your current permission level is insufficient. Please upgrade to the highest permission level."
On the other hand, when the matching result for the current speaker 200 includes the speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, and the user identification acquisition module 150 determines that the user identification of that category has already been acquired, or when the matching result for the current speaker 200 includes information indicating that the current speaker 200 matches none of the speakers 200 corresponding to the speaker templates, the module 150 will not acquire the user identification of the current speaker 200, i.e., it will not trigger the interaction module 110 to interact with the current speaker 200.
According to some embodiments of the present application, the speaker recognition apparatus 100 may further include the voice attribute recognition module 160. When the user identification acquisition module 150 determines that the user identification of the current speaker 200 needs to be acquired, the voice attribute recognition module 160 may recognize the voice attribute information of the current speaker 200 based on the current speaker's voice features (for example, but not limited to, FBank features, MFCC features, etc.), where the voice attribute information may include, but is not limited to, the gender attribute of the voice (male voice, female voice), the age attribute of the voice (for example, child, adult, etc.), and so on. In the embodiments of the present application, the voice attribute recognition module 160 may use any voice attribute recognition technology in the prior art to recognize the voice attributes of the current speaker 200, for example, but not limited to, a voice classifier obtained by training a classification neural network on a large amount of voice sample data.
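Purely as a sketch, and substituting a simple logistic-regression classifier for the classification neural network mentioned above, such an attribute classifier could be trained as follows; summarizing each utterance by its mean MFCC vector and the label encoding are assumptions of the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_voice_attribute_classifier(mfcc_features_list, labels):
    # Each training sample is one utterance, summarized by the mean of
    # its MFCC frames; labels encode the attribute (e.g., 0 = adult,
    # 1 = child, or a gender attribute).
    X = np.stack([np.mean(m, axis=0) for m in mfcc_features_list])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, np.asarray(labels))
    return clf
```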
After the voice attribute recognition module 160 recognizes the voice attribute information of the current speaker 200, it may send the voice attribute information to the interaction module 110, so that the interaction module 110 interacts with the current speaker 200 based on this information in order to acquire the current speaker's user identification. For example, the voice attribute recognition module 160 may recognize that the voice of the current speaker 200 is a child's voice; when the interaction module 110 then interacts with the current speaker 200, it may address the current speaker 200 as "little friend", and to determine the name of the current speaker 200, the following interaction scenario may take place between the interaction module 110 and the current speaker 200:
Interaction module 110: "Hello, little friend, you sound like a lovely child. What's your name?"
Current speaker 200: "My name is Yaya."
After acquiring the user identification of the current speaker 200, the user identification acquisition module 150 and the voice attribute recognition module 160 may respectively send the current speaker's user identification and voice attribute information to the speaker template acquisition module 130, so that when the speaker feature set maintenance unit 131 adds the current speaker feature to the speaker feature category matching the current speaker 200, it associates the current speaker's voice attribute information and user identification with that category, with the speaker features within it, and with the speaker template corresponding to it.
In this way, the interaction module 110 can provide the current speaker with a more personalized and more intelligent interactive experience based on the user identification and voice attribute information of the current speaker 200. In addition, when the speaker recognition apparatus 100 determines that some future speaker 200 matches this speaker feature category, the interaction module 110 can identify the future speaker 200 according to the voice attribute information and user identification associated with that category and provide the future speaker 200 with a more personalized and more intelligent interactive experience, for example, but not limited to, the following interaction scenario:
Future speaker 200: "Play a song for me."
Interaction module 110: (determining, from the user identification of the speaker feature category matching the future speaker 200, that the future speaker's name is Yaya, and determining from the voice attribute information that the age attribute of the voice is child) "Hello, Yaya. Next I will play 《两只老虎》 ("Two Tigers") for you."
According to some other embodiments of the present application, the voice attribute recognition module 160 may determine, according to the matching result of the speaker recognition apparatus 100 for the current speaker 200, whether the voice attributes of the current speaker 200 need to be recognized. For example, when the matching result for the current speaker 200 includes the speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, and the module 160 determines that the voice attribute information of that speaker feature category has not yet been acquired, the module 160 may recognize the voice attributes of the current speaker 200 and send the voice attribute information to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can associate the current speaker's voice attribute information with the speaker features in the category matching the current speaker 200. When the matching result for the current speaker 200 includes the speaker template matching the current speaker feature, or the system identifier of the corresponding speaker feature category, and the module 160 determines that the voice attribute information of that category has already been recognized, or when the matching result includes information indicating that the current speaker 200 matches none of the speakers 200 corresponding to the speaker templates, the module 160 does not recognize the voice attributes of the current speaker 200.
In this way, the interaction module 110 can provide the current speaker with a more personalized and more intelligent interactive experience based on the voice attribute information of the current speaker 200. In addition, when the speaker recognition apparatus 100 determines that some future speaker 200 matches this speaker feature category, the interaction module 110 can provide the future speaker 200 with a more personalized and more intelligent interactive experience according to the voice attribute information associated with that category.
According to some other embodiments of the present application, regardless of the matching result for the current speaker 200, the voice attribute recognition module 160 may recognize the voice attributes of the current speaker 200 and send the voice attribute information to the interaction module 110, so that the interaction module 110 can provide the current speaker 200 with a more personalized and more intelligent interactive experience based on the voice attribute information, for example, but not limited to, the following interaction scenarios:
Current speaker 200: "Turn on the air conditioner for me."
Interaction module 110: (the voice attribute recognition module 160 determines that the age attribute of the voice is child) "The air conditioner has been turned on and set to 28 degrees Celsius."
Interaction module 110: (the voice attribute recognition module 160 determines that the age attribute of the voice is adult) "The air conditioner has been turned on and set to 25 degrees Celsius."
It should be noted that, when the speaker features in a speaker feature category have user identifications and/or voice attribute information associated with them and re-clustering occurs, then for each updated speaker feature category after re-clustering, the user identification and/or voice attribute information corresponding to the largest number of speaker features may be taken as the user identification and/or voice attribute information associated with that updated category. For example, if within one updated speaker feature category after re-clustering, 48 speaker features correspond to the user identification "Zhang San" and 2 speaker features correspond to the user identification "Li Si", then the user identification "Zhang San" is associated with that updated category.
In the embodiments of the present application, a speaker does not need to perform dedicated voice registration; instead, during voice interactions with speakers, the speaker features of different speakers are clustered to obtain a speaker template corresponding to each speaker, which is then used to identify different speakers. The embodiments of the present application can therefore achieve imperceptible registration and avoid the negative experience that registration brings to the speaker.
Further, after the current speaker is successfully identified, adding the current speaker's speaker feature to the speaker feature category matching the current speaker and updating the speaker template of that category can improve the accuracy of speaker recognition. In addition, setting a clustering condition on the speaker feature set and re-clustering the set once the condition is satisfied, thereby updating the speaker templates of existing speaker feature categories or obtaining speaker templates for new speaker feature categories, can also improve the accuracy of speaker recognition.
Further, by determining the user identification and/or voice attribute information of a speaker, a more personalized and more intelligent interactive experience can be provided to the speaker.
FIG. 5 shows a schematic flowchart of a speaker recognition method according to an embodiment of the present application. Different components of the speaker recognition apparatus 100 in FIGS. 2-3 may implement different blocks or other parts of the method. For content not described in the apparatus embodiments above, reference may be made to the method embodiments below; likewise, for content not described in the method embodiments, reference may be made to the apparatus embodiments above. It should be noted that the order in which the method steps are described should not be interpreted as meaning that the steps must be executed in that order; the steps need not be executed in the order described, the method may include steps other than these, and it may include only some of these steps. As shown in FIG. 5, the speaker recognition method includes:
Block 501: receive, through the interaction module 110, a current voice input from the current speaker 200;

Block 502: preprocess, through the speaker feature acquisition module 120, the current voice input 300 of the current speaker 200, where the preprocessing may include, but is not limited to, signal enhancement, dereverberation, denoising, and the like; in addition, the speaker feature acquisition module 120 may also extract current voice features of the current speaker 200 from the preprocessed current voice input 300, such as, but not limited to, FBank features, MFCC features, and the like;
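As a minimal illustration of the feature extraction in block 502 (assuming the librosa library is available; the sampling rate, frame length, hop length, and filter counts are illustrative assumptions), FBank and MFCC features can be computed as follows:

```python
# Minimal sketch of the feature extraction in block 502; all frame
# parameters are illustrative assumptions.
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000) -> dict:
    y, sr = librosa.load(wav_path, sr=sr)
    # Log-Mel filterbank (FBank) features: 25 ms frames, 10 ms hop
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=40)
    fbank = np.log(mel + 1e-6)
    # MFCC features derived from the same Mel spectrogram settings
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=400, hop_length=160)
    return {"fbank": fbank, "mfcc": mfcc}
```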
Block 503: obtain, through the speaker feature acquisition module 120, the current speaker feature of the current speaker 200 based on the current voice features of the current speaker 200 and according to a voiceprint model, such as, but not limited to, a GMM-UBM, an I-vector model, or a JFA model; the current speaker feature may be, for example, but is not limited to, the mean supervector of the GMM-UBM, the speaker-dependent supervector of the JFA model, or the I-vector of the I-vector model;

Block 504: determine, through the speaker matching module 140, whether at least one speaker template exists; if so, execute block 505; if not, execute block 507;

Block 505: determine, through the speaker matching module 140, the similarity between the current speaker feature of the current speaker 200 and each speaker template;

In one example, the similarity between the current speaker feature and each speaker template may be determined by calculating the distance between them, where the distance may include, but is not limited to, a cosine distance, an EMD (Earth Mover's Distance), a Euclidean distance, or a Manhattan distance;

Block 506: determine, through the speaker matching module 140, whether the current speaker 200 matches one of the at least one speaker 200 corresponding to the at least one speaker template, according to the maximum similarity between the current speaker feature of the current speaker 200 and the speaker templates;

In one example, when the speaker matching module 140 determines that the maximum similarity is higher than or equal to a similarity threshold, it may determine that the current speaker feature matches the speaker template having the maximum similarity, that is, that the current speaker 200 matches the speaker 200 corresponding to that speaker template; when the speaker matching module 140 determines that the maximum similarity is lower than the similarity threshold, it may determine that the current speaker feature matches none of the speaker templates, that is, that the current speaker 200 matches none of the speakers 200 corresponding to the speaker templates;
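As a minimal illustration of blocks 505 and 506 (assuming fixed-length speaker feature vectors and cosine similarity; the threshold of 0.7 is an illustrative assumption, not a value from the application):

```python
# Minimal sketch of blocks 505-506: cosine similarity against each
# template, then a threshold decision on the maximum similarity.
import numpy as np

def match_speaker(current_feature: np.ndarray,
                  templates: dict[str, np.ndarray],
                  threshold: float = 0.7):
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    similarities = {sid: cosine(current_feature, t)
                    for sid, t in templates.items()}
    best_id = max(similarities, key=similarities.get)
    if similarities[best_id] >= threshold:
        return best_id          # matched speaker identifier
    return None                 # no template matches
```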
Block 507: add, through the speaker feature set maintenance unit 131 of the speaker template acquisition unit 130, the current speaker feature to the speaker feature set;

In one example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to that matching speaker template; when the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set in such a way that it does not belong to any speaker feature category;

In another example, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set; specifically, if the signal-to-noise ratio of the current voice input is higher than or equal to a signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set, and if the signal-to-noise ratio of the current voice input is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
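As a minimal illustration of this signal-to-noise ratio gate (assuming an energy-based SNR estimate over a speech segment and a noise segment; the 15 dB threshold is an illustrative assumption):

```python
# Minimal sketch of the SNR gate in block 507; threshold and SNR
# estimator are illustrative assumptions.
import numpy as np

SNR_THRESHOLD_DB = 15.0  # assumed value

def estimate_snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    # Ratio of speech power to noise power, in decibels
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)

def maybe_add_feature(feature, speech, noise, feature_set: list) -> bool:
    if estimate_snr_db(speech, noise) >= SNR_THRESHOLD_DB:
        feature_set.append(feature)
        return True
    return False
```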
Block 508: determine, through the speaker feature set maintenance unit 131 of the speaker template acquisition unit 130, whether the speaker feature set meets a clustering condition; if so, execute block 509; if not, return to block 501, that is, receive the next voice input;

In one example, when the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, determining whether the speaker feature set meets the clustering condition may include determining whether the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to a first clustering threshold;

In another example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, determining whether the speaker feature set meets the clustering condition may include determining whether the number of speaker features in the speaker feature category corresponding to that matching speaker template is equal to a second clustering threshold, where the second clustering threshold may include one or more values, for example, but not limited to, 50, 100, 200, or 500;
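As a minimal illustration of the two clustering conditions (the first clustering threshold of 50 is an illustrative assumption; the milestone values are those named in the text):

```python
# Minimal sketch of the clustering condition check in block 508.
from typing import Optional

FIRST_CLUSTER_THRESHOLD = 50                     # assumed value
SECOND_CLUSTER_THRESHOLDS = {50, 100, 200, 500}  # values from the text

def should_recluster(uncategorized_count: int,
                     matched_category_size: Optional[int]) -> bool:
    if matched_category_size is None:
        # No template matched: trigger once enough uncategorized
        # speaker features have accumulated
        return uncategorized_count >= FIRST_CLUSTER_THRESHOLD
    # A template matched: trigger whenever the matched category's
    # size reaches one of the milestone values
    return matched_category_size in SECOND_CLUSTER_THRESHOLDS
```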
Block 509: cluster, through the speaker feature clustering unit 132 of the speaker template acquisition unit 130, the speaker features using a clustering algorithm;

In one example, when the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, and the number of speaker features not included in any speaker feature category is greater than or equal to the first clustering threshold, the speaker feature clustering unit 132 may cluster those speaker features that are not included in any speaker feature category to obtain at least one new speaker feature category, and assign a new system identifier (for example, but not limited to, Speaker C, Speaker D, etc.) to the at least one new speaker feature category, where the new speaker feature categories correspond one-to-one to speakers 200 and each new speaker feature category includes at least one speaker feature; a new system identifier may be associated with a new speaker feature category, and may also be associated with the at least one speaker feature within that new category and with the speaker template corresponding to that new category; in another example, the speaker feature clustering unit 132 may re-cluster all the speaker features in the speaker feature set to obtain at least one updated speaker feature category, where the at least one updated speaker feature category corresponds one-to-one to speakers 200 and each updated speaker feature category includes at least one speaker feature from the speaker feature set;

In another example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, and the number of speaker features in the speaker feature category corresponding to that matching speaker template is equal to the second clustering threshold, the speaker feature clustering unit 132 may re-cluster all the speaker features in the speaker feature set to obtain at least one updated speaker feature category, where the at least one updated speaker feature category corresponds one-to-one to speakers 200 and each updated speaker feature category includes at least one speaker feature from the speaker feature set;

For a description of the clustering algorithm, refer to the description in the apparatus section above, which is not repeated here;
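As a minimal illustration of block 509 (assuming scikit-learn is available; agglomerative clustering with a cosine metric, average linkage, and a distance threshold of 0.4 are illustrative assumptions, not the algorithm specified by the application):

```python
# Minimal sketch of the re-clustering in block 509.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def recluster(features: np.ndarray) -> np.ndarray:
    # features: shape (n_samples, dim); returns one category label
    # per speaker feature
    clusterer = AgglomerativeClustering(
        n_clusters=None,         # let the distance threshold decide
        distance_threshold=0.4,  # assumed cutoff
        metric="cosine",         # named "affinity" in older scikit-learn
        linkage="average",
    )
    return clusterer.fit_predict(features)
```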
Block 510: obtain, through the speaker template acquisition unit 133 of the speaker template acquisition unit 130, for each speaker feature category, the speaker template corresponding to that category according to the speaker features within the category;

In one example, when the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, and the number of speaker features not included in any speaker feature category is greater than or equal to the first clustering threshold, the speaker template acquisition unit 133 may obtain the new speaker template corresponding to each new speaker feature category according to the at least one speaker feature within that new category; the speaker template acquisition unit 133 may also obtain the updated speaker template corresponding to each updated speaker feature category according to the updated at least one speaker feature within that updated category;

In another example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, and the number of speaker features in the speaker feature category corresponding to that matching speaker template is equal to the second clustering threshold, the speaker template acquisition unit 133 may obtain the updated speaker template corresponding to each updated speaker feature category according to the updated at least one speaker feature within that category; the speaker template acquisition unit 133 may also obtain the updated speaker template corresponding to the updated speaker feature category that matches the current speaker 200, according to the updated at least one speaker feature within that category, where the determination of the updated speaker feature category that matches the current speaker 200 may refer to the description in the apparatus section above;

In another example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, and the number of speaker features in the speaker feature category corresponding to that matching speaker template is not equal to the second clustering threshold, the speaker template acquisition unit 133 may update the speaker template corresponding to the speaker feature category that matches the current speaker 200, according to the speaker features (including the current speaker feature) within that category;

In addition, for the acquisition of the speaker template, refer to the description in the apparatus section above, which is not repeated here.
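As a minimal illustration of block 510, the following sketch computes each template as the mean of its category's speaker features, which is one of the two options (mean or weighted sum) recited in claim 3; the function names are illustrative:

```python
# Minimal sketch of block 510: one template per category, computed
# as the mean of the category's speaker features.
import numpy as np

def build_templates(features: np.ndarray,
                    labels: np.ndarray) -> dict[int, np.ndarray]:
    templates = {}
    for category in np.unique(labels):
        members = features[labels == category]
        templates[int(category)] = members.mean(axis=0)
    return templates
```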
It should be noted that after block 510 is executed, the method may return to block 501, that is, receive the next voice input.
Fig. 6 shows another schematic flowchart of a speaker recognition method according to an embodiment of the present application. Different components of the speaker recognition apparatus 100 in Figs. 2 and 3 may implement different blocks or other parts of the method. For content not described in the foregoing apparatus embodiments, refer to the following method embodiments; likewise, for content not described in the method embodiments, refer to the foregoing apparatus embodiments. It should be noted that the order in which the method steps are described should not be construed to mean that the steps must be executed in that order; the steps need not be executed in the order described, the method may include steps other than those described, and it may also include only some of these steps. As shown in Fig. 6, the speaker recognition method includes:
Blocks 601-606: refer to the description of blocks 501-506, which is not repeated here;

Block 607: determine, through the user identification acquisition module 150, whether the user identification of the current speaker 200 needs to be acquired;

As an example, the user identification acquisition module 150 may determine, according to the matching result for the current speaker 200, whether the user identification of the current speaker 200, for example, a name, gender, age, authority, or preference, needs to be acquired;

For example, when the matching result for the current speaker 200 includes a speaker template that matches the current speaker feature, or the system identifier of the speaker feature category corresponding to that speaker template, and the user identification acquisition module 150 determines that the user identification of that speaker feature category has not yet been acquired, the user identification acquisition module 150 determines that the user identification of the current speaker 200 needs to be acquired and triggers the interaction module 110 to interact with the current speaker 200;

When the matching result for the current speaker 200 includes a speaker template that matches the current speaker feature, or the system identifier of the speaker feature category corresponding to that speaker template, and the user identification acquisition module 150 determines that the user identification of that speaker feature category has already been acquired, or when the matching result for the current speaker 200 includes information indicating that the current speaker 200 matches none of the speakers 200 corresponding to the speaker templates, the user identification acquisition module 150 determines that the user identification of the current speaker 200 does not need to be acquired and does not trigger the interaction module 110 to interact with the current speaker 200;

Block 608: acquire, through the user identification acquisition module 150, the user identification of the current speaker 200 according to the interaction with the current speaker;

Block 609: add, through the speaker feature set maintenance unit 131 of the speaker template acquisition unit 130, the current speaker feature to the speaker feature set;
In one example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to that matching speaker template; in addition, when the user identification acquisition module 150 has acquired the user identification of the current speaker 200, the speaker feature set maintenance unit 131 also associates the user identification of the current speaker with that speaker feature category, the speaker features within that category, and the speaker template corresponding to that category;

In another example, when the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set in such a way that it does not belong to any speaker feature category;

In another example, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set; specifically, if the signal-to-noise ratio of the current voice input is higher than or equal to the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set, and if the signal-to-noise ratio of the current voice input is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
Block 610: refer to the description of block 508, which is not repeated here;

Block 611: refer to the description of block 509, which is not repeated here;

It should be noted that, in a case where the speaker features within a speaker feature category have user identifications associated with them, if re-clustering occurs, then for each updated speaker feature category after the re-clustering, the user identification corresponding to the largest number of speaker features may be used as the user identification associated with that updated speaker feature category; for example, if within one updated speaker feature category after re-clustering there are 48 speaker features corresponding to the user identification "Zhang San" and 2 speaker features corresponding to the user identification "Li Si", then the user identification "Zhang San" is associated with that updated speaker feature category;

Block 612: refer to the description of block 510, which is not repeated here.
Fig. 7 shows another schematic flowchart of a speaker recognition method according to an embodiment of the present application. Different components of the speaker recognition apparatus 100 in Figs. 2 and 3 may implement different blocks or other parts of the method. For content not described in the foregoing apparatus embodiments, refer to the following method embodiments; likewise, for content not described in the method embodiments, refer to the foregoing apparatus embodiments. It should be noted that the order in which the method steps are described should not be construed to mean that the steps must be executed in that order; the steps need not be executed in the order described, the method may include steps other than those described, and it may also include only some of these steps. As shown in Fig. 7, the speaker recognition method includes:

Blocks 701-707: refer to the description of blocks 601-607, which is not repeated here;
Block 708: recognize, through the voice attribute recognition module 160, the voice attributes of the current speaker according to the voice features of the current speaker, where the voice attribute information may include, but is not limited to, the gender attribute of the voice, the age attribute of the voice, and the like; in the embodiments of the present application, the voice attribute recognition module 160 may use any voice attribute recognition technique in the prior art to recognize the voice attributes of the current speaker 200, for example, but not limited to, a voice classifier obtained by training a classification neural network on a large amount of voice sample data.
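As a minimal illustration of such a classifier (assuming scikit-learn and utterance-level mean MFCC vectors as input; the network size and the gender labels are illustrative assumptions, and a production classifier would be trained on a large amount of voice sample data as described):

```python
# Minimal sketch of a voice attribute (gender) classifier for block 708.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_gender_classifier(mfcc_means: np.ndarray,
                            labels: np.ndarray) -> MLPClassifier:
    # mfcc_means: shape (n_utterances, n_mfcc); labels: e.g. "male"/"female"
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    clf.fit(mfcc_means, labels)
    return clf

def predict_gender(clf: MLPClassifier, mfcc_mean: np.ndarray) -> str:
    return str(clf.predict(mfcc_mean.reshape(1, -1))[0])
```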
Block 709: acquire, through the user identification acquisition module 150, the user identification of the current speaker 200 according to an interaction with the current speaker 200 that is based on the speaker's voice attribute information;

In one example, after the voice attribute recognition module 160 recognizes the voice attribute information of the current speaker 200, the voice attribute information may be sent to the interaction module 110 so that the interaction module 110 interacts with the current speaker 200 based on that information, and the user identification acquisition module 150 may determine the user identification of the current speaker 200 according to the interactive voice input of the current speaker 200;
Block 710: add, through the speaker feature set maintenance unit 131 of the speaker template acquisition unit 130, the current speaker feature to the speaker feature set;

In one example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to that matching speaker template; in addition, when the voice attribute recognition module 160 has recognized the voice attributes of the current speaker 200 and the user identification acquisition module 150 has acquired the user identification of the current speaker 200, the speaker feature set maintenance unit 131 also associates the voice attributes and the user identification of the current speaker with that speaker feature category, the speaker features within that category, and the speaker template corresponding to that category;

When the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set in such a way that it does not belong to any speaker feature category;

In another example, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set; specifically, if the signal-to-noise ratio of the current voice input is higher than or equal to the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set, and if the signal-to-noise ratio of the current voice input is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
Blocks 711-713: refer to the description of blocks 610-612, which is not repeated here.

It should be noted that, in block 712, in a case where the speaker features within a speaker feature category have user identifications and voice attribute information associated with them, if re-clustering occurs, then for each updated speaker feature category after the re-clustering, the user identification and voice attribute information corresponding to the largest number of speaker features may be used as the user identification and voice attribute information associated with that updated speaker feature category.
In the embodiments of the present application, the speaker does not need to perform dedicated voice registration. Instead, during voice interaction with speakers, the speaker features of different speakers are clustered to obtain a speaker template corresponding to each speaker, which is then used to distinguish different speakers. Therefore, the embodiments of the present application can realize imperceptible registration and avoid the negative experience that explicit registration brings to the speaker.
Further, after the current speaker is successfully identified, adding the speaker feature of the current speaker to the speaker feature category that matches the current speaker and updating the speaker template of that category can improve the accuracy of speaker recognition. In addition, by setting a clustering condition for the speaker feature set and re-clustering the set once the condition is met, so as to update the speaker templates of existing speaker feature categories or to obtain speaker templates for new speaker feature categories, the accuracy of speaker recognition can also be improved.

Further, by determining the user identification and/or voice attribute information of a speaker, a more personalized and more intelligent interactive experience can be provided to the speaker.
Fig. 8 shows a schematic structural diagram of an example system 800 according to an embodiment of the present application. The system 800 may include one or more processors 802, system control logic 808 connected to the one or more processors 802, system memory 804 connected to the system control logic 808, a non-volatile memory (NVM) 806 connected to the system control logic 808, and a network interface 810 connected to the system control logic 808.

The processors 802 may include one or more single-core or multi-core processors. The processors 802 may include any combination of general-purpose processors and special-purpose processors (for example, graphics processors, application processors, baseband processors, etc.). In the embodiments of the present application, the processors 802 may be configured to execute one or more of the various embodiments shown in Figs. 5-7.
In some embodiments, the system control logic 808 may include any suitable interface controller to provide any suitable interface to the one or more processors 802 and/or to any suitable device or component in communication with the system control logic 808.

In some embodiments, the system control logic 808 may include one or more memory controllers to provide an interface to the system memory 804. The system memory 804 may be used to load and store data and/or instructions for the system 800. In some embodiments, the system memory 804 may include any suitable volatile memory, such as a suitable dynamic random access memory (DRAM).
The NVM/storage 806 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the NVM/storage 806 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as one or more of an HDD (Hard Disk Drive), a CD (Compact Disc) drive, and a DVD (Digital Versatile Disc) drive.

The NVM/storage 806 may include a portion of the storage resources installed on the apparatus of the system 800, or it may be accessible by the device without necessarily being a part of the device. For example, the NVM/storage 806 may be accessed over a network via the network interface 810.

In particular, the system memory 804 and the NVM/storage 806 may include a temporary copy and a permanent copy, respectively, of instructions 820. The instructions 820 may include instructions that, when executed by at least one of the processors 802, cause the system 800 to implement one or more of the various embodiments shown in Figs. 5-7. In some embodiments, the instructions 820, hardware, firmware, and/or software components thereof may additionally or alternatively reside in the system control logic 808, the network interface 810, and/or the processors 802.
The network interface 810 may include a transceiver for providing the system 800 with a radio interface for communicating with any other suitable device (such as a front-end module, an antenna, etc.) over one or more networks. In some embodiments, the network interface 810 may be integrated with other components of the system 800. For example, the network interface 810 may include at least one of the processors 802, the system memory 804, the NVM/storage 806, and a firmware device (not shown) having instructions that, when executed by at least one of the processors 802, cause the system 800 to implement one or more of the various embodiments shown in Figs. 5-7.

The network interface 810 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, the network interface 810 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
In one embodiment, one or more of the processors 802 may be packaged together with the logic of one or more controllers for the system control logic 808 to form a system in package (SiP). In one embodiment, one or more of the processors 802 may be integrated on the same die with the logic of one or more controllers for the system control logic 808 to form a system on chip (SoC).

The system 800 may further include an input/output (I/O) interface 812. The I/O interface 812 may include a user interface to enable a user to interact with the system 800, and a peripheral component interface designed so that peripheral components can also interact with the system 800. In some embodiments, the system 800 further includes sensors for determining at least one of environmental conditions and location information related to the system 800.
In some embodiments, the user interface may include, but is not limited to, a display (for example, a liquid crystal display, a touchscreen display, etc.), speakers, a microphone, one or more cameras (for example, still image cameras and/or video cameras), a flashlight (for example, a light-emitting diode flash), and a keyboard.

In some embodiments, the peripheral component interface may include, but is not limited to, a non-volatile memory port, an audio jack, and a power interface.

In some embodiments, the sensors may include, but are not limited to, a gyroscope sensor, an accelerometer, a proximity sensor, an ambient light sensor, and a positioning unit. The positioning unit may also be part of, or interact with, the network interface 810 to communicate with components of a positioning network (for example, Global Positioning System (GPS) satellites).
Although the description of this application is presented in conjunction with preferred embodiments, this does not mean that the features of the invention are limited to those embodiments. On the contrary, the purpose of describing the invention in conjunction with embodiments is to cover other alternatives or modifications that may be derived from the claims of this application. To provide a thorough understanding of this application, the following description contains many specific details; this application may also be implemented without these details. In addition, to avoid confusing or obscuring the focus of this application, some specific details are omitted from the description. It should be noted that, where there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another.

In addition, various operations are described as multiple discrete operations in a manner most helpful for understanding the illustrative embodiments; however, the order of description should not be construed as implying that these operations are necessarily order-dependent. In particular, these operations need not be performed in the order of presentation.

Unless the context requires otherwise, the terms "comprising", "having", and "including" are synonymous. The phrase "A/B" means "A or B". The phrase "A and/or B" means "(A and B) or (A or B)".

As used herein, the term "module" or "unit" may refer to, be, or include an application-specific integrated circuit (ASIC), an electronic circuit, a (shared, dedicated, or group) processor and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.

In the drawings, some structural or method features are shown in specific arrangements and/or orders. However, it should be understood that such specific arrangements and/or orders may not be required. In some embodiments, these features may be arranged in a manner and/or order different from that shown in the illustrative drawings. In addition, the inclusion of structural or method features in a particular drawing does not imply that such features are required in all embodiments; in some embodiments, these features may not be included or may be combined with other features.
The embodiments of the mechanisms disclosed in this application may be implemented in hardware, software, firmware, or a combination of these implementation approaches. The embodiments of this application may be implemented as a computer program or program code executed on a programmable system, where the programmable system includes multiple processors, a storage system (including volatile and non-volatile memory and/or storage elements), multiple input devices, and multiple output devices.

Program code may be applied to input instructions to perform the functions described in this application and to generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system. Where needed, the program code may also be implemented in assembly language or machine language. In fact, the mechanisms described in this application are not limited in scope to any particular programming language. In either case, the language may be a compiled language or an interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. In some cases, one or more aspects of at least some embodiments may be implemented by representative instructions stored on a computer-readable storage medium, the instructions representing various logic in a processor, which, when read by a machine, cause the machine to fabricate logic for performing the techniques described in this application. These representations, referred to as "IP cores", may be stored on a tangible computer-readable storage medium and supplied to various customers or production facilities for loading into the fabrication machines that actually manufacture the logic or processor.

Such computer-readable storage media may include, but are not limited to, non-transitory tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disc read-only memories (CD-ROMs), compact disc rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, and electrically erasable programmable read-only memories (EEPROMs); phase change memories (PCMs); magnetic or optical cards; or any other type of medium suitable for storing electronic instructions.

Therefore, the embodiments of this application also include non-transitory computer-readable storage media containing instructions or containing design data, such as hardware description language (HDL), that defines the structures, circuits, apparatuses, processors, and/or system features described in this application.

Claims (18)

  1. A voice processing method, characterized by comprising:
    receiving a plurality of voice inputs and extracting a plurality of voice features from the plurality of voice inputs;
    determining a plurality of speaker features based on the plurality of voice features;
    clustering the plurality of speaker features into at least one speaker feature category, wherein the at least one speaker feature category corresponds one-to-one to at least one speaker, and each speaker feature category of the at least one speaker feature category comprises at least one speaker feature of the plurality of speaker features;
    determining at least one speaker template based on the at least one speaker feature category, wherein the at least one speaker template corresponds one-to-one to the at least one speaker; and
    receiving a current voice input from a current speaker, and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker.
  2. The voice processing method according to claim 1, characterized in that the determining a plurality of speaker features based on the plurality of voice features comprises:
    determining the plurality of speaker features based on the plurality of voice features through a voiceprint model;
    wherein the voiceprint model comprises at least one of a Gaussian mixture model-universal background model (GMM-UBM), an I-vector model, and a joint factor analysis model, and the plurality of speaker features comprise at least one of a mean supervector of the GMM-UBM, an I-vector of the I-vector model, and a speaker-dependent supervector of the joint factor analysis model.
  3. The voice processing method according to claim 1 or 2, characterized in that the determining at least one speaker template based on the at least one speaker feature category comprises:
    determining a mean or a weighted sum of the at least one speaker feature within each speaker feature category; and
    using the at least one mean or weighted sum of the at least one speaker feature as the at least one speaker template.
  4. The voice processing method according to any one of claims 1 to 3, characterized in that the clustering the plurality of speaker features into at least one speaker feature category comprises:
    clustering the plurality of speaker features into the at least one speaker feature category based on at least one of a similarity between every two speaker features of the plurality of speaker features, an offset between two speaker features of the plurality of speaker features, and a density distribution of the plurality of speaker features.
  5. The voice processing method according to any one of claims 1 to 4, characterized in that the receiving a current voice input from a current speaker and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker comprises:
    receiving the current voice input from the current speaker, and extracting a current voice feature from the current voice input;
    determining a current speaker feature based on the current voice feature;
    determining whether the current speaker feature matches one speaker template of the at least one speaker template; and
    in a case where it is determined that the current speaker feature matches the one speaker template, determining that the current speaker matches the speaker corresponding to the one speaker template.
  6. The voice processing method according to claim 3, characterized in that the receiving a current voice input from a current speaker and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker comprises:
    determining whether the current speaker matches one speaker of the at least one speaker based on a similarity between the current speaker feature and each speaker template of the at least one speaker template.
  7. The voice processing method according to any one of claims 1 to 6, characterized by further comprising:
    in a case where it is determined that the current speaker matches the one speaker, determining whether a sum of a number of current speaker features of the current speaker and a number of the at least one speaker feature in the speaker feature category, of the at least one speaker feature category, corresponding to the one speaker is equal to a first threshold; and
    in a case where it is determined that the sum of the numbers is not equal to the first threshold, updating the speaker template corresponding to the one speaker based on the current speaker feature and the at least one speaker feature in the one speaker feature category.
  8. The voice processing method according to any one of claims 1 to 7, characterized by further comprising:
    in a case where it is determined that the current speaker matches the one speaker, determining whether a sum of a number of current speaker features of the current speaker and a number of the at least one speaker feature in the speaker feature category, of the at least one speaker feature category, corresponding to the one speaker is equal to a first threshold;
    in a case where it is determined that the sum of the numbers is equal to the first threshold, adding the current speaker feature to the plurality of speaker features to form an updated plurality of speaker features;
    clustering the updated plurality of speaker features into at least one updated speaker feature category, wherein the at least one updated speaker feature category corresponds one-to-one to at least one updated speaker, and each updated speaker feature category of the at least one updated speaker feature category comprises at least one speaker feature of the updated plurality of speaker features; and
    determining at least one updated speaker template based on the at least one updated speaker feature category, wherein the at least one updated speaker template corresponds one-to-one to the at least one updated speaker.
  9. The voice processing method according to any one of claims 1 to 8, characterized by further comprising:
    in a case where it is determined that the current speaker does not match the at least one speaker, determining whether a sum of a number of current speaker features of the current speaker and a number of at least one speaker feature not included in the at least one speaker feature category is greater than or equal to a second threshold;
    in a case where it is determined that the sum of the numbers is greater than or equal to the second threshold, clustering the current speaker feature and the at least one speaker feature not included in the at least one speaker feature category into at least one other speaker feature category, wherein the at least one other speaker feature category corresponds one-to-one to at least one other speaker; and
    determining at least one other speaker template based on the at least one other speaker feature category, wherein the at least one other speaker template corresponds one-to-one to the at least one other speaker.
  10. The voice processing method according to any one of claims 1 to 9, characterized by further comprising:
    in a case where it is determined that the current speaker does not match the at least one speaker, determining whether a sum of a number of current speaker features of the current speaker and a number of at least one speaker feature not included in the at least one speaker category is greater than or equal to a second threshold;
    in a case where it is determined that the sum of the numbers is greater than or equal to the second threshold, adding the current speaker feature and the at least one speaker feature not included in the at least one speaker category to the plurality of speaker features to form an updated plurality of speaker features;
    clustering the updated plurality of speaker features into at least one updated speaker feature category; and
    determining at least one updated speaker template based on the at least one updated speaker feature category, wherein the at least one updated speaker template corresponds one-to-one to at least one updated speaker.
  11. The voice processing method according to any one of claims 1 to 10, characterized by further comprising:
    in a case where it is determined that the current speaker matches one speaker of the plurality of speakers, acquiring a current user identification of the current speaker through interaction with the current speaker; and
    associating the current user identification with one speaker feature category of the at least one speaker feature category and with one speaker template of the at least one speaker template, wherein the one speaker feature category and the one speaker template correspond to the one speaker.
  12. The voice processing method according to claim 11, wherein the current user identification includes at least one of the name, gender, age, authority, and preferences of the current speaker.
  13. The voice processing method according to claim 11 or 12, further comprising:
    receiving a next voice input from a next speaker, and determining, based on the next voice input and the at least one speaker template, whether the next speaker matches the one speaker among the at least one speaker;
    in a case where it is determined that the next speaker matches the one speaker, identifying the next speaker with the current user identification.
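The matching step in claims 11 and 13 could, for example, score a speaker feature against every stored template and accept the best match only above a similarity floor. Cosine similarity and the match_threshold value below are assumptions of this sketch; the claims leave the comparison method open.

```python
import numpy as np

def match_speaker(feature, templates, match_threshold=0.7):
    """Return the ID of the best-matching speaker template, or None when
    no template scores above the (assumed) similarity threshold."""
    best_id, best_score = None, -1.0
    for speaker_id, template in templates.items():
        score = float(
            np.dot(feature, template)
            / (np.linalg.norm(feature) * np.linalg.norm(template))
        )
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id if best_score >= match_threshold else None
```

A matched next speaker can then simply be identified with the user identification previously associated with that speaker's template.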
  14. The voice processing method according to claim 8 or 10, further comprising:
    in a case where one updated speaker feature category among the at least one updated speaker feature category includes a plurality of speaker features and the plurality of speaker features are associated with a plurality of user identifications, determining one user identification associated with the largest number of speaker features among the plurality of speaker features; and
    associating the one user identification with one speaker template among the at least one updated speaker template, wherein the one updated speaker template corresponds to the one updated speaker feature category,
    wherein each user identification among the plurality of user identifications includes at least one of the speaker's name, gender, age, authority, and preferences.
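The selection rule in claim 14 amounts to a majority vote over the user identifications attached to the features of a merged category. A minimal sketch, assuming each feature carries an optional per-feature user-ID tag (a representation the claim does not specify):

```python
from collections import Counter

def dominant_user_id(feature_user_ids):
    """Pick the user ID associated with the largest number of features in
    an updated category; feature_user_ids holds one ID per feature, with
    None marking features that were never tagged."""
    counts = Counter(uid for uid in feature_user_ids if uid is not None)
    return counts.most_common(1)[0][0] if counts else None
```

The winning ID is then re-associated with the updated template of that category, so a re-clustering pass does not lose the identity learned earlier.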
  15. The voice processing method according to any one of claims 1-14, further comprising:
    determining a voice attribute of the current speaker based on the current voice input of the current speaker; and
    in a case where it is determined that the current speaker matches one speaker among the plurality of speakers, associating the voice attribute with the speaker feature category corresponding to the one speaker.
  16. The voice processing method according to claim 15, wherein the voice attribute includes at least one of an age attribute of the voice and a gender attribute of the voice.
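Claims 15 and 16 only require that coarse voice attributes be attached to the matched speaker's category; how the attributes are estimated is left open. A sketch of the bookkeeping, assuming an external age/gender classifier supplies the values:

```python
def attach_voice_attributes(category_attributes, category_id, age_group, gender):
    """Associate (externally classified, assumed) voice attributes with
    the speaker feature category of the matched current speaker."""
    category_attributes.setdefault(category_id, {}).update(
        {"age": age_group, "gender": gender}
    )
```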
  17. A machine-readable medium, characterized in that instructions are stored on the medium, and when the instructions are run on the machine, the machine is caused to execute the voice processing method according to any one of claims 1 to 16.
  18. A system, characterized in that it comprises:
    a processor; and
    a memory, on which instructions are stored, and when the instructions are run by the processor, the system is caused to execute the voice processing method according to any one of claims 1 to 16.
PCT/CN2020/141600 2020-01-10 2020-12-30 Voice processing method, medium, and system WO2021139589A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010025486.XA CN113129901A (en) 2020-01-10 2020-01-10 Voice processing method, medium and system
CN202010025486.X 2020-01-10

Publications (1)

Publication Number Publication Date
WO2021139589A1 true WO2021139589A1 (en) 2021-07-15

Family

ID=76771220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141600 WO2021139589A1 (en) 2020-01-10 2020-12-30 Voice processing method, medium, and system

Country Status (2)

Country Link
CN (1) CN113129901A (en)
WO (1) WO2021139589A1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998022936A1 (en) * 1996-11-22 1998-05-28 T-Netix, Inc. Subword-based speaker verification using multiple classifier fusion, with channel, fusion, model, and threshold adaptation
CN1462366A (en) * 2001-05-10 2003-12-17 皇家菲利浦电子有限公司 Background learning of speaker voices
CN1652206A (en) * 2005-04-01 2005-08-10 郑方 Sound veins identifying method
CN1662956A (en) * 2002-06-19 2005-08-31 皇家飞利浦电子股份有限公司 Mega speaker identification (ID) system and corresponding methods therefor
CN101241699A (en) * 2008-03-14 2008-08-13 北京交通大学 A speaker identification system for remote Chinese teaching
CN103562993A (en) * 2011-12-16 2014-02-05 华为技术有限公司 Speaker recognition method and device
EP2808866A1 (en) * 2013-05-31 2014-12-03 Nuance Communications, Inc. Method and apparatus for automatic speaker-based speech clustering
CN108597525A (en) * 2018-04-25 2018-09-28 四川远鉴科技有限公司 Voice vocal print modeling method and device
CN109243465A (en) * 2018-12-06 2019-01-18 平安科技(深圳)有限公司 Voiceprint authentication method, device, computer equipment and storage medium
CN110600041A (en) * 2019-07-29 2019-12-20 华为技术有限公司 Voiceprint recognition method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1387350A1 (en) * 2002-07-25 2004-02-04 Sony International (Europe) GmbH Spoken man-machine interface with speaker identification
JP5052449B2 (en) * 2008-07-29 2012-10-17 日本電信電話株式会社 Speech section speaker classification apparatus and method, speech recognition apparatus and method using the apparatus, program, and recording medium
CN102820033B (en) * 2012-08-17 2013-12-04 南京大学 Voiceprint identification method
JP6171544B2 (en) * 2013-05-08 2017-08-02 カシオ計算機株式会社 Audio processing apparatus, audio processing method, and program
US11417343B2 (en) * 2017-05-24 2022-08-16 Zoominfo Converse Llc Automatic speaker identification in calls using multiple speaker-identification parameters
JP6676009B2 (en) * 2017-06-23 2020-04-08 日本電信電話株式会社 Speaker determination device, speaker determination information generation method, and program

Also Published As

Publication number Publication date
CN113129901A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
EP3525205B1 (en) Electronic device and method of performing function of electronic device
US11822770B1 (en) Input-based device operation mode management
US20220101861A1 (en) Selective requests for authentication for voice-based launching of applications
US11430449B2 (en) Voice-controlled management of user profiles
US20210312905A1 (en) Pre-Training With Alignments For Recurrent Neural Network Transducer Based End-To-End Speech Recognition
CN105940407B (en) System and method for assessing the intensity of audio password
EP3676831B1 (en) Natural language user input processing restriction
US11763808B2 (en) Temporary account association with voice-enabled devices
US10714085B2 (en) Temporary account association with voice-enabled devices
EP3257043B1 (en) Speaker recognition in multimedia system
US20160019887A1 (en) Method and device for context-based voice recognition
JP2018536889A (en) Method and apparatus for initiating operations using audio data
US11727939B2 (en) Voice-controlled management of user profiles
US10916249B2 (en) Method of processing a speech signal for speaker recognition and electronic apparatus implementing same
US10885910B1 (en) Voice-forward graphical user interface mode management
US20180357269A1 (en) Address Book Management Apparatus Using Speech Recognition, Vehicle, System and Method Thereof
KR20200095947A (en) Electronic device and Method for controlling the electronic device thereof
WO2021139589A1 (en) Voice processing method, medium, and system
JP2024510798A (en) Hybrid multilingual text-dependent and text-independent speaker verification
Khosravani et al. The Intelligent Voice System for the IberSPEECH-RTVE 2018 Speaker Diarization Challenge.
CN112513845A (en) Transient account association with voice-enabled devices
US11076018B1 (en) Account association for voice-enabled devices
US20220399016A1 (en) Presence-based application invocation
CN111862947A (en) Method, apparatus, electronic device, and computer storage medium for controlling smart device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20912907

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20912907

Country of ref document: EP

Kind code of ref document: A1