WO2021139589A1 - Voice processing method, medium, and system - Google Patents

Voice processing method, medium, and system

Info

Publication number: WO2021139589A1
Authority: WO (WIPO/PCT)
Prior art keywords: speaker, current, feature, template, features
Application number: PCT/CN2020/141600
Other languages: French (fr), Chinese (zh)
Inventors: 胡伟湘, 王亚如, 李伟, 芦宇
Original assignee: Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021139589A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces

Definitions

  • One or more embodiments of the present application relate generally to the field of voice processing, and specifically to a voice processing method, medium, and system.
  • Speaker recognition, also known as voiceprint recognition, is a biometric recognition technology that automatically recognizes and confirms a speaker's identity by analyzing and extracting features from voice signals.
  • Existing speaker recognition includes two stages: registration and verification.
  • In the registration phase, the system requires the registrant to enter multiple registered voices according to specified requirements, and converts these registered voices into corresponding speaker models.
  • In the verification phase, the system performs feature analysis and extraction on the input verification voice, scores its similarity against the speaker models generated in the registration phase, and judges whether the verification voice matches the registrant according to a set threshold.
  • A first aspect of the present application provides a voice processing method, which may include: receiving multiple voice inputs and extracting multiple voice features from the multiple voice inputs; determining multiple speaker features based on the multiple voice features; clustering the multiple speaker features into at least one speaker feature category, where the at least one speaker feature category corresponds one-to-one to at least one speaker and each speaker feature category includes at least one of the multiple speaker features; determining at least one speaker template based on the at least one speaker feature category, where the at least one speaker template corresponds one-to-one to the at least one speaker; and receiving a current voice input from a current speaker and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one of the at least one speaker.
  • In this way, the speaker does not need to perform a dedicated voice registration; instead, during voice interaction with the speaker, the speaker features of different speakers are clustered to obtain the speaker template corresponding to each speaker, which is then used to identify different speakers. The embodiments of the present application can therefore realize imperceptible, registration-free enrollment and avoid the negative experience that explicit registration brings to the speaker.
  • In some embodiments, determining the multiple speaker features based on the multiple voice features includes determining the multiple speaker features through a voiceprint model, where the voiceprint model includes at least one of a Gaussian mixture-universal background model (GMM-UBM), an I-vector model, and a joint factor analysis (JFA) model, and the multiple speaker features include at least one of the mean supervector of the GMM-UBM, the I-vector of the I-vector model, and the speaker-related supervector of the JFA model.
  • In some embodiments, determining the at least one speaker template based on the at least one speaker feature category includes: determining the mean or the weighted sum of the at least one speaker feature in each speaker feature category, and using each resulting mean or weighted sum as the speaker template for that category.
  • In some embodiments, clustering the multiple speaker features into the at least one speaker feature category is based on at least one of: the similarity between every two of the multiple speaker features, the offset between two of the multiple speaker features, and the density distribution of the multiple speaker features.
  • In some embodiments, receiving the current voice input from the current speaker and determining whether the current speaker matches one of the at least one speaker further includes: receiving the current voice input from the current speaker and extracting a current voice feature from it; determining a current speaker feature based on the current voice feature; determining whether the current speaker feature matches one of the at least one speaker template; and, when the current speaker feature is determined to match a speaker template, determining that the current speaker matches the speaker corresponding to that speaker template.
  • In some embodiments, determining whether the current speaker matches one of the at least one speaker includes: determining the match based on the similarity between the current speaker feature and each speaker template in the at least one speaker template.
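  • As an illustration of this similarity-based matching (not part of the claims), the following minimal Python sketch compares the current speaker feature against every template and applies a threshold; the cosine-similarity choice, the threshold value, and all names are assumptions:

```python
import numpy as np

def match_speaker(current_feature, templates, sim_threshold=0.7):
    """Return the ID of the best-matching speaker template, or None.

    current_feature: 1-D feature vector (e.g., an i-vector) of the current speaker.
    templates: dict mapping speaker ID -> template vector.
    sim_threshold: illustrative similarity threshold (assumed value).
    """
    best_id, best_sim = None, -1.0
    for speaker_id, template in templates.items():
        # Cosine similarity between the current feature and this template.
        sim = np.dot(current_feature, template) / (
            np.linalg.norm(current_feature) * np.linalg.norm(template))
        if sim > best_sim:
            best_id, best_sim = speaker_id, sim
    # The current speaker matches only if the maximum similarity clears the threshold.
    return best_id if best_sim >= sim_threshold else None
```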
  • In some embodiments, the method further includes: when it is determined that the current speaker matches a speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of speaker features in the speaker feature category corresponding to that speaker equals a first threshold; and, when the sum is determined not to equal the first threshold, updating the speaker template corresponding to that speaker based on the current speaker feature and the speaker features already in the category.
  • In this way, the current speaker's feature is added to the speaker feature category matching the current speaker and that category's speaker template is updated, improving the accuracy of speaker recognition.
  • In some embodiments, the method further includes: when it is determined that the current speaker matches a speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of speaker features in the speaker feature category corresponding to that speaker equals the first threshold; when the sum is determined to equal the first threshold, adding the current speaker feature to the multiple speaker features to form updated multiple speaker features; clustering the updated multiple speaker features into updated at least one speaker feature category, where the updated at least one speaker feature category corresponds one-to-one to the updated at least one speaker and each updated speaker feature category includes at least one of the updated multiple speaker features; and determining updated at least one speaker template based on the updated at least one speaker feature category, where the updated at least one speaker template corresponds one-to-one to the updated at least one speaker.
  • In this way, when the speaker category corresponding to the matched speaker satisfies the clustering condition, all speaker features (including the current speaker feature) are re-clustered, updating the speaker feature categories and their corresponding speaker templates. As the number of received voice inputs increases, the recognition accuracy for each speaker can thus be gradually improved; a sketch of this branch follows.
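  • A minimal sketch of this update-or-recluster branch, assuming an illustrative first threshold and a caller-supplied re-clustering hook (both hypothetical):

```python
import numpy as np

def update_template(category):
    # Template as the mean of the category's features (per the first aspect).
    return np.mean(category, axis=0)

def handle_matched_feature(feature, category, all_features, recluster_all,
                           first_threshold=100):
    """Add a matched feature to its category, then update or re-cluster.

    recluster_all: callable that re-clusters all features and rebuilds templates
    (hypothetical hook; a mean-shift version is sketched later in this document).
    first_threshold: illustrative stand-in for the claimed first threshold.
    """
    category.append(feature)
    all_features.append(feature)
    if len(category) == first_threshold:
        # Clustering condition met: re-cluster every feature, including the new one.
        return recluster_all(all_features)
    # Otherwise just refresh this speaker's template from its updated category.
    return update_template(category)
```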
  • In some embodiments, the method further includes: when it is determined that the current speaker does not match the at least one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of speaker features not included in any speaker feature category is greater than or equal to a second threshold; when the sum is determined to be greater than or equal to the second threshold, clustering the current speaker feature together with the speaker features not included in any speaker feature category into at least one other speaker feature category, where the at least one other speaker feature category corresponds one-to-one to at least one other speaker; and determining at least one other speaker template based on the at least one other speaker feature category, where the at least one other speaker template corresponds one-to-one to the at least one other speaker.
  • In this way, the features of unmatched speakers are clustered, yielding new speaker feature categories and the new speaker templates corresponding to them. As the number of received voice inputs increases, the accuracy of speaker recognition can be gradually improved.
  • In some embodiments, the method further includes: when it is determined that the current speaker does not match the at least one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of speaker features not included in any speaker feature category is greater than or equal to the second threshold; when the sum is determined to be greater than or equal to the second threshold, adding the current speaker feature and the speaker features not included in any speaker feature category to the multiple speaker features to form updated multiple speaker features; clustering the updated multiple speaker features into updated at least one speaker feature category; and determining updated at least one speaker template based on the updated at least one speaker feature category, where the updated at least one speaker template corresponds one-to-one to the updated at least one speaker.
  • In this way, the recognition accuracy for each speaker can be gradually improved.
  • In some embodiments, the method further includes: when it is determined that the current speaker matches one of the multiple speakers, obtaining the current user identification of the current speaker through interaction with the current speaker; and associating the current user identification with the speaker feature category and the speaker template corresponding to that speaker.
  • In this way, a more personalized and intelligent interactive experience can be provided to the speaker.
  • In some embodiments, the current user identification includes at least one of the name, gender, age, permissions, and preferences of the current speaker.
  • In some embodiments, the method further includes: receiving a next voice input from a next speaker and determining, based on the next voice input and the at least one speaker template, whether the next speaker matches one of the at least one speaker; and, when it is determined that the next speaker matches that speaker, using the current user identification to identify the next speaker.
  • In some embodiments, the method further includes: when an updated speaker feature category among the updated at least one speaker feature category includes multiple speaker features associated with multiple user identifications, determining the user identification associated with the largest number of speaker features in that category, and associating that user identification with the corresponding updated speaker template, where each user identification includes at least one of the speaker's name, gender, age, permissions, and preferences.
  • In some embodiments, the method further includes: determining the voice attributes of the current speaker based on the current voice input of the current speaker; and, when it is determined that the current speaker matches one of the multiple speakers, associating the voice attributes with the speaker feature category corresponding to that speaker.
  • In this way, a more personalized and intelligent interactive experience can be provided to the speaker.
  • In some embodiments, the voice attributes include at least one of an age attribute of the voice and a gender attribute of the voice.
  • A second aspect of the present application provides a machine-readable medium on which instructions are stored; when the instructions are executed on a machine, the machine executes any of the above voice processing methods.
  • A third aspect of the present application provides a system, which includes: a processor; and a memory storing instructions that, when executed by the processor, cause the system to execute any of the above voice processing methods.
  • Fig. 1 shows a schematic diagram of a speaker recognition scene according to an embodiment of the present application;
  • Fig. 2 shows a schematic structural diagram of a speaker recognition device according to an embodiment of the present application;
  • Fig. 3 shows a schematic structural diagram of a speaker template acquisition module according to an embodiment of the present application;
  • Fig. 4 shows a schematic diagram of a scene of voice interaction between a current speaker and a speaker recognition device according to an embodiment of the present application;
  • Fig. 5 shows a schematic flowchart of a speaker recognition method according to an embodiment of the present application;
  • Fig. 6 shows another schematic flowchart of a speaker recognition method according to an embodiment of the present application;
  • Fig. 7 shows another schematic flowchart of a speaker recognition method according to an embodiment of the present application;
  • Fig. 8 shows a schematic structural diagram of a system according to an embodiment of the present application.
  • Fig. 1 shows a schematic diagram of a speaker recognition scene according to an embodiment of the present application.
  • As shown, the speaker recognition apparatus 100 can interact with multiple speakers 200 at different times and, during this interaction, receives voice inputs 300 from the multiple speakers 200.
  • The speaker recognition device 100 does not require the speakers 200 to perform a special voice registration; instead, it can recognize different speakers 200 based on its voice interaction with them.
  • The speaker recognition device 100 may include, but is not limited to, smart speakers, smart headphones, smart bracelets, smart large screens, portable or mobile devices, mobile phones, personal digital assistants, cellular phones, handheld PCs, portable media players, handheld devices, wearable devices (for example, watches, bracelets, display glasses or goggles, head-mounted displays, head-mounted devices), navigation equipment, servers, network equipment, graphics equipment, video game equipment, set-top boxes, laptop devices, virtual reality and/or augmented reality devices, Internet of Things devices, industrial control devices, in-vehicle infotainment devices, streaming media client devices, e-books, reading devices, POS machines, and other devices.
  • The speaker recognition device 100 may include an interaction module 110, a speaker feature acquisition module 120, a speaker template acquisition module 130, and a speaker matching module 140, and may optionally include a user identification acquisition module 150 and a voice attribute recognition module 160.
  • One or more components of the speaker recognition device 100 can be implemented by an application-specific integrated circuit (ASIC), an electronic circuit, a (shared, dedicated, or group) processor and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or any combination of other suitable components that provide the described functions.
  • the processor may be a microprocessor, a digital signal processor, a microcontroller, etc., and/or any combination thereof.
  • the processor may be a single-core processor, a multi-core processor, etc., and/or any combination thereof.
  • The speaker recognition device 100 may also include an input/output module for receiving the voice input 300 from the speaker 200 and for outputting interactive sentences to the speaker 200 in the form of voice, text, etc. Examples of input/output modules may include, but are not limited to, speakers, microphones, and displays (for example, liquid crystal displays, touch screen displays, etc.).
  • the interaction module 110 is used to interact with the speaker 200, where the interaction may include, but is not limited to, interactive sentences in the form of voice and/or text.
  • the interaction module 110 may be implemented by using any interaction technology in the prior art.
  • The speaker feature acquisition module 120 is configured to extract the voice features of the speaker 200 from the voice input 300 of the speaker 200 (for example, but not limited to, FBank (filter bank) features, MFCC (Mel-frequency cepstral coefficient) features, etc.) and, based on the voice features of the speaker 200, to obtain the speaker features of the speaker 200 according to a voiceprint model (for example, but not limited to, the GMM-UBM (Gaussian mixture model-universal background model), the I-vector model, the JFA (joint factor analysis) model, etc.).
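  • As a concrete illustration of this step (not part of the patent text), here is a minimal sketch of FBank and MFCC extraction using the open-source librosa library; the sample rate and coefficient counts are assumed values:

```python
import librosa

def extract_voice_features(wav_path, sr=16000, n_mfcc=20, n_mels=40):
    """Extract FBank (log-mel) and MFCC features from one voice input."""
    y, sr = librosa.load(wav_path, sr=sr)
    # FBank features: log-scaled mel filter-bank energies.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)
    # MFCC features: Mel-frequency cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return fbank.T, mfcc.T  # both shaped (frames, coefficients)
```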
  • The speaker template acquisition module 130 is used to determine whether the speaker feature set satisfies a clustering condition, where the speaker feature set may include multiple speaker features from multiple speakers 200. When the speaker feature set meets the clustering condition, the speaker template acquisition module 130 can cluster the multiple speaker features in the set into at least one speaker feature category, where the speaker feature categories correspond one-to-one to the speakers 200 and each category includes at least one speaker feature from the set. For each speaker feature category, the speaker template acquisition module 130 can then obtain, based on the at least one speaker feature in that category, a speaker template for the speaker 200 associated with that category, where the speaker templates correspond one-to-one to the speakers 200.
  • The clustering condition may include at least one of the following: the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to a first clustering threshold; or the number of speaker features in a speaker feature category is equal to a second clustering threshold.
  • a speaker template corresponding to the speaker feature category may be determined based on the mean value and/or weighted sum of at least one speaker feature in the speaker feature category.
  • The speaker matching module 140 is configured to determine, based on whether the current speaker feature of the current speaker 200 matches one of the at least one existing speaker template, whether the current speaker 200 matches one of the at least one speaker 200 corresponding to the at least one speaker template.
  • Specifically, when it is determined that the current speaker feature of the current speaker 200 matches a speaker template, the speaker matching module 140 determines that the current speaker 200 matches the speaker 200 corresponding to the matched speaker template; when it is determined that the current speaker feature matches none of the speaker templates, it determines that the current speaker 200 does not match any of the speakers 200 corresponding to those templates.
  • The speaker matching module 140 may determine whether the current speaker feature matches one of the existing speaker templates based on the similarity between the current speaker feature of the current speaker 200 and each of the at least one existing speaker template.
  • the user identification acquisition module 150 is configured to acquire the user identification of the speaker 200 based on the interaction between the interaction module 110 and the speaker 200.
  • The user identification may include, but is not limited to, name, gender, age, permissions, preferences, etc.
  • The voice attribute recognition module 160 is used to recognize the voice attribute information of the speaker 200 based on the voice features of the speaker 200 (for example, but not limited to, FBank features, MFCC features, etc.). The voice attribute information may include, but is not limited to, the gender attribute of the voice, the age attribute of the voice, and so on.
  • the interaction module 110 may interact with the speaker 200 based on the sound attribute information of the speaker 200, so that the user identification obtaining module 150 obtains the user identification of the speaker 200.
  • the speaker feature acquisition module 120 may preprocess the speech input 300 of the speaker 200, where the preprocessing may include, but is not limited to, signal enhancement, de-reverberation, denoising, etc.
  • the speaker feature acquisition module 120 can also extract the voice features of the speaker 200 from the preprocessed voice input 300, such as, but not limited to, FBank features, MFCC features, and the like.
  • The speaker feature acquisition module 120 can also determine the signal-to-noise ratio of the voice input 300 of the speaker 200, so that the speaker template acquisition module 130 can select high-quality voice inputs for acquiring the speaker template.
  • the speaker feature obtaining module 120 may also obtain the speaker feature of the speaker 200 based on the voice feature of the speaker 200 and the voiceprint model of the speaker 200.
  • The voiceprint model is used to describe the spatial distribution of the voice features of the speaker 200. Examples of the voiceprint model may include, but are not limited to, the GMM-UBM model, the I-vector model, the JFA model, etc.
  • The GMM-UBM model uses Gaussian probability density functions to describe the spatial distribution of the speaker's voice features. Establishing a GMM-UBM model includes two parts: first, a universal background model (UBM) that describes features common to all speakers is trained on a large amount of voice data from many speakers; then, the voice features of each individual speaker are used to adapt the UBM via maximum a posteriori (MAP) estimation, thereby obtaining that speaker's Gaussian mixture model (GMM).
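  • The following minimal sketch (an illustration, not the patent's implementation) shows the two parts with scikit-learn: training a UBM on pooled background data, then MAP-adapting its means to one speaker to form the mean supervector; the component count and relevance factor are conventional assumed values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=64):
    """Train a universal background model on pooled voice features (frames x dims)."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(background_features)
    return ubm

def map_adapt_supervector(ubm, speaker_features, relevance=16.0):
    """MAP-adapt the UBM means to one speaker and return the mean supervector."""
    # Posterior probability (responsibility) of each component for each frame.
    gamma = ubm.predict_proba(speaker_features)            # (frames, components)
    n_k = gamma.sum(axis=0)                                # soft counts per component
    # Weighted mean of the speaker's frames under each component.
    e_k = (gamma.T @ speaker_features) / np.maximum(n_k[:, None], 1e-10)
    # Interpolate between the speaker statistics and the UBM means.
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted_means = alpha * e_k + (1.0 - alpha) * ubm.means_
    # The mean supervector concatenates the adapted component means.
    return adapted_means.ravel()
```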
  • The JFA model builds on the speaker's GMM model: it defines an eigenvoice space, an eigenchannel space, and a residual space to describe the spatial distribution of the voice features of the speaker 200. That is, the mean supervector of the speaker's GMM model (formed by concatenating the mean vectors of the individual Gaussian probability density functions) is decomposed into a speaker-related supervector and a channel-related supervector, so that channel interference can be removed and a more accurate description of the speaker obtained.
  • The I-vector model is also based on the speaker's GMM model: it defines a single total variability space that captures both inter-speaker and inter-channel differences to describe the spatial distribution of the voice features of the speaker 200, and, based on this total variability space, extracts a more compact I-vector from the mean supervector of the speaker's GMM model (i.e., the supervector is modeled as M = m + Tw, where m is the UBM mean supervector, T is the total variability matrix, and w is the I-vector).
  • When the voiceprint model is the GMM-UBM model, the speaker feature determined by the speaker feature acquisition module 120 may be the mean supervector of the GMM-UBM model; when the voiceprint model is the JFA model, it may be the speaker-related supervector of the JFA model; and when the voiceprint model of the speaker 200 is the I-vector model, it may be the I-vector of the I-vector model.
  • The voiceprint model and the speaker features of the speaker 200 are not limited to these; other types of voiceprint models and speaker features may also be used.
  • The speaker template acquisition module 130 includes, but is not limited to, a speaker feature set maintenance unit 131, a speaker feature clustering unit 132, and a speaker template obtaining unit 133.
  • The speaker feature set maintenance unit 131 can add the speaker features of the speaker 200 to the speaker feature set according to predetermined rules; it can also determine whether the speaker feature set satisfies the clustering condition and, if the speaker feature set satisfies the aforementioned clustering condition, trigger the speaker feature clustering unit 132 to cluster the multiple speaker features in the speaker feature set into at least one speaker feature category.
  • The foregoing predetermined rules may include: when no speaker template exists, the speaker feature set maintenance unit 131 may directly add the current speaker feature of the current speaker 200 to the speaker feature set; when a speaker template already exists and the speaker matching module 140 determines that the current speaker feature of the current speaker 200 matches a speaker template, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to the matched speaker template; and when a speaker template already exists but the speaker matching module 140 determines that the current speaker feature matches no speaker template, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set without assigning it to any speaker feature category.
  • The foregoing predetermined rules may further include: the speaker feature set maintenance unit 131 may also decide, according to the signal-to-noise ratio of the voice input 300 of the current speaker 200, whether to add the current speaker feature to the speaker feature set. Specifically, if the signal-to-noise ratio of the voice input 300 is higher than or equal to a signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 adds the current speaker feature of the current speaker 200 to the speaker feature set; if the signal-to-noise ratio is lower than the threshold, the unit does not add the current speaker feature to the speaker feature set.
  • The aforementioned clustering condition may include: the number of speaker features not included in any speaker feature category (for example, because no speaker feature category exists yet, or because the speaker matching module 140 has determined that they match no speaker template) is greater than or equal to the first clustering threshold.
  • The aforementioned clustering condition may also include: the number of speaker features in at least one speaker feature category is equal to a second clustering threshold, where the second clustering threshold may include one or more values, for example, but not limited to, 50, 100, 200, 500, or other values. A minimal sketch of these maintenance rules follows.
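  • A minimal sketch of these maintenance rules, with all thresholds and names chosen for illustration only:

```python
class SpeakerFeatureSet:
    """Sketch of the feature-set maintenance rules described above."""

    def __init__(self, snr_threshold=15.0, first_threshold=10, second_threshold=50):
        self.uncategorized = []      # features not yet in any speaker feature category
        self.categories = {}         # system identifier -> list of features
        self.snr_threshold = snr_threshold        # assumed SNR gate (dB)
        self.first_threshold = first_threshold    # first clustering threshold
        self.second_threshold = second_threshold  # one value of the second threshold

    def add_feature(self, feature, snr, matched_id=None):
        """Add a feature per the predetermined rules; return True if accepted."""
        if snr < self.snr_threshold:
            return False  # low-SNR input is not used for template building
        if matched_id is None:
            self.uncategorized.append(feature)
        else:
            self.categories.setdefault(matched_id, []).append(feature)
        return True

    def should_cluster_new(self):
        # First condition: enough uncategorized features to form new categories.
        return len(self.uncategorized) >= self.first_threshold

    def should_recluster(self):
        # Second condition: some category has grown to the second threshold.
        return any(len(f) == self.second_threshold for f in self.categories.values())
```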
  • The speaker feature clustering unit 132 is configured to cluster the multiple speaker features in the speaker feature set into at least one speaker feature category according to the trigger instruction of the speaker feature set maintenance unit 131, and to assign a system identifier (for example, but not limited to, speaker A, speaker B, etc.) to each of the at least one speaker feature category, where each system identifier is associated with one speaker feature category, with the at least one speaker feature in that category, and with the speaker template corresponding to that category.
  • Examples of clustering algorithms may include, but are not limited to, the mean-shift clustering algorithm, density clustering algorithms (for example, but not limited to, DBSCAN (Density-Based Spatial Clustering of Applications with Noise)), hierarchical clustering algorithms, and other clustering algorithms. Among them, the mean-shift clustering algorithm clusters the speaker features based on the offsets between the speaker features, density clustering algorithms cluster the speaker features based on the density distribution of the speaker features, and hierarchical clustering algorithms cluster the speaker features based on the similarity between every two speaker features.
  • When the number of speaker features not included in any speaker feature category reaches the first clustering threshold, the speaker feature clustering unit 132 can cluster just those uncategorized speaker features, or it can perform clustering on all speaker features in the speaker feature set; when the number of speaker features in at least one speaker feature category is equal to the second clustering threshold, the speaker feature clustering unit 132 can cluster all speaker features in the speaker feature set.
  • For example, the mean-shift clustering algorithm can include the following steps:
  • S2: Determine all speaker features in the region centered on the current center point with radius r, group these speaker features into the same cluster, and record the number of times each of these speaker features has appeared in a cluster;
  • S3: Calculate the offset vectors (i.e., the difference vectors) from the center point to each speaker feature determined in S2, and use the mean of these offset vectors as the mean-shift vector;
  • S7: Determine the speaker feature categories: for each speaker feature, the cluster in which that speaker feature appeared most frequently is used as the speaker feature category to which it belongs.
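  • Steps S1 and S4 through S6 are not reproduced above; a standard mean-shift procedure fills those gaps. The following sketch (an assumption, not the patent's own implementation) therefore uses scikit-learn's MeanShift, whose bandwidth plays the role of the radius r:

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_speaker_features(features, radius=1.0):
    """Cluster speaker features (n x dim array) with mean-shift.

    radius is an illustrative value standing in for r above.
    """
    features = np.asarray(features)
    ms = MeanShift(bandwidth=radius).fit(features)
    # ms.labels_[i] is the speaker feature category of feature i.
    return ms.labels_, ms.cluster_centers_

def group_by_category(features, labels):
    """Group features into {category label: [features]} for template building."""
    categories = {}
    for feat, label in zip(features, labels):
        categories.setdefault(int(label), []).append(feat)
    return categories
```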
  • In some embodiments, the speaker template obtaining unit 133 may determine the mean of the at least one speaker feature in a speaker feature category and use it as the speaker template; for example, when the speaker features are I-vectors of the I-vector model, the speaker template obtaining unit 133 may use the mean vector of the at least one I-vector in the category as the speaker template.
  • In other embodiments, the speaker template obtaining unit 133 may determine the weighted sum of the at least one speaker feature in the speaker feature category and use it as the speaker template, where the weight of each speaker feature can be determined according to the signal-to-noise ratio of the voice input corresponding to that speaker feature.
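  • A minimal sketch of both template computations described above; the SNR normalization scheme is an illustrative assumption:

```python
import numpy as np

def mean_template(category_features):
    """Speaker template as the mean of the category's features (e.g., i-vectors)."""
    return np.mean(category_features, axis=0)

def snr_weighted_template(category_features, snrs):
    """Speaker template as an SNR-weighted combination of the category's features.

    Weights are the normalized SNRs of the corresponding voice inputs.
    """
    weights = np.asarray(snrs, dtype=float)
    weights /= weights.sum()
    return np.average(category_features, axis=0, weights=weights)
```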
  • The speaker matching module 140 may first determine whether a speaker template already exists. When the speaker matching module 140 determines that no speaker template exists (for example, because the speaker feature set has not yet been clustered), it reports no match for the current speaker 200 and sends the current speaker feature to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 of the speaker template acquisition module 130 can add the current speaker feature of the current speaker 200 to the speaker feature set.
  • When the speaker matching module 140 determines that at least one speaker template already exists (for example, because the speaker feature set has been clustered), it can determine the similarity between the current speaker feature and each speaker template and determine whether the maximum similarity is higher than or equal to a similarity threshold.
  • In some embodiments, the similarity between the current speaker feature and a speaker template can be determined by calculating the distance between the current speaker feature and the speaker template, where the distance can include, but is not limited to, cosine distance, EMD (Earth Mover's Distance), Euclidean distance, Manhattan distance, etc.
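  • The following sketch (illustrative, not the patent's implementation) turns each listed distance into a similarity score; the distance-to-similarity mappings are conventions chosen here, and treating the vectors as 1-D distributions for the Earth Mover's Distance is a simplification:

```python
from scipy.spatial.distance import cityblock, cosine, euclidean
from scipy.stats import wasserstein_distance

def similarity(feature, template, metric="cosine"):
    """Similarity between a speaker feature and a speaker template."""
    if metric == "cosine":
        return 1.0 - cosine(feature, template)   # cosine similarity
    if metric == "euclidean":
        return 1.0 / (1.0 + euclidean(feature, template))
    if metric == "manhattan":
        return 1.0 / (1.0 + cityblock(feature, template))
    if metric == "emd":
        return 1.0 / (1.0 + wasserstein_distance(feature, template))
    raise ValueError(f"unknown metric: {metric}")
```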
  • When the speaker matching module 140 determines that the maximum similarity is higher than or equal to the similarity threshold, it can determine that the current speaker feature matches the speaker template with the maximum similarity, and thus that the current speaker 200 matches the speaker 200 corresponding to that speaker template.
  • The speaker matching module 140 can also send the current speaker feature and the matching result of the current speaker 200 (for example, but not limited to, the speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, such as speaker A) to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can add the current speaker feature to the speaker feature category corresponding to the matched speaker template (hereinafter referred to as the speaker feature category matching the current speaker 200, for example, but not limited to, speaker A).
  • Next, the speaker feature set maintenance unit 131 may determine whether the number of speaker features in the speaker feature category matching the current speaker 200 (for example, but not limited to, speaker A) is equal to the second clustering threshold. If the speaker feature set maintenance unit 131 determines that the number of speaker features in that category is equal to the second clustering threshold, it can trigger the speaker feature clustering unit 132 to use a clustering algorithm to re-cluster all speaker features in the speaker feature set, obtaining updated at least one speaker feature category, where the updated speaker feature categories correspond one-to-one to the speakers 200 and each updated speaker feature category includes at least one speaker feature from the speaker feature set.
  • The speaker feature set maintenance unit 131 may then determine the system identifier associated with each updated speaker feature category. Specifically, for each updated speaker feature category, it may determine, among the system identifiers associated with the speaker features in that updated category, the system identifier associated with the largest number of those speaker features, and associate that system identifier with the updated speaker feature category and with the speaker features within it. The speaker feature set maintenance unit 131 may also determine, among the updated speaker feature categories, the one matching the current speaker 200, namely the updated category whose associated system identifier is the same as the system identifier associated with the speaker feature category that matched the current speaker 200 before re-clustering.
  • The speaker template obtaining unit 133 may obtain the updated speaker template corresponding to the updated speaker feature category matching the current speaker 200 from the speaker features in that updated category.
  • The speaker template obtaining unit 133 may also obtain an updated speaker template corresponding to each updated speaker feature category from the speaker features in each updated category, where the updated speaker templates correspond one-to-one to the speakers 200.
  • If the speaker feature set maintenance unit 131 determines that the number of speaker features in the speaker feature category matching the current speaker 200 is not equal to the second clustering threshold, it does not trigger the speaker feature clustering unit 132 to re-cluster the speaker feature set.
  • In this case, the speaker template obtaining unit 133 may update the speaker template corresponding to the speaker feature category matching the current speaker 200 based on the speaker features (including the current speaker feature) in that category, where the speaker template is obtained as described above and is not repeated here.
  • If the speaker matching module 140 determines that the maximum similarity between the current speaker feature and the speaker templates is lower than the similarity threshold, it can determine that the current speaker feature matches none of the speaker templates, that is, that the current speaker 200 does not match any of the speakers 200 corresponding to those speaker templates.
  • The speaker matching module 140 can send the current speaker feature and the matching result of the current speaker 200 (for example, but not limited to, information indicating that the current speaker 200 matches no speaker 200 corresponding to any speaker template) to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 of the speaker template acquisition module 130 can add the current speaker feature to the speaker feature set without assigning it to any speaker feature category.
  • The speaker feature set maintenance unit 131 may determine whether the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to the first clustering threshold. If so, it can trigger the speaker feature clustering unit 132 to use a clustering algorithm to cluster those uncategorized speaker features, obtaining at least one new speaker feature category and assigning each new category a new system identifier (for example, but not limited to, speaker C, speaker D, etc.), where the new speaker feature categories correspond one-to-one to speakers 200 and each new category includes at least one speaker feature. Each new system identifier is associated with one new speaker feature category, with the at least one speaker feature in that category, and with the speaker template corresponding to that category.
  • Alternatively, the speaker feature set maintenance unit 131 may trigger the speaker feature clustering unit 132 to use a clustering algorithm to re-cluster all speaker features in the speaker feature set, obtaining updated at least one speaker feature category, where the updated at least one speaker feature category corresponds one-to-one to the speakers 200 and each updated speaker feature category includes at least one speaker feature from the speaker feature set.
  • The speaker template obtaining unit 133 may obtain a new speaker template corresponding to each new speaker feature category from the at least one speaker feature in each new speaker feature category, where the new speaker templates correspond one-to-one to the speakers 200.
  • The speaker template obtaining unit 133 may also obtain the updated speaker template corresponding to each updated speaker feature category from the speaker features in that updated category, where the updated speaker templates correspond one-to-one to the speakers 200. For the acquisition of the speaker template, please refer to the above description, which is not repeated here.
  • If the speaker feature set maintenance unit 131 determines that the number of speaker features not included in any speaker feature category in the speaker feature set is less than the first clustering threshold, it does not trigger the speaker feature clustering unit 132 to perform clustering.
  • The user identification acquisition module 150 may determine, according to the matching result of the current speaker 200, whether the user identification of the current speaker 200 (for example, name, gender, age, permissions, preferences, etc.) needs to be acquired. If the user identification acquisition module 150 determines that the user identification of the speaker feature category corresponding to the matched speaker template has not yet been obtained, it can trigger the interaction module 110 to interact with the current speaker 200 and determine, from the interactive voice input of the current speaker 200, the user identification for the speaker feature category matching the current speaker 200.
  • Fig. 4 is a schematic diagram of a scene of voice interaction between the current speaker 200 and the speaker recognition device 100 according to an embodiment of the present application.
  • The interaction module 110 of the speaker recognition device 100 may also interact with the current speaker 200 in text form.
  • the interaction module 110 and the current speaker 200 may have the following voice interactions:
  • Interaction module 110 "I have listened to your voice for a long time, what do you call it?"
  • Interactive module 110 "It's nice to meet you, can you know more about you?"
  • Interactive module 110 "Which of the following types of movies do you prefer: Kung Fu, Comedy, Thriller".
  • the interaction module 110 and the current speaker 200 may have the following voice interactions:
  • Interaction module 110 "I have listened to your voice for a long time, what do you call it?"
  • Interactive module 110 "Please enter the highest authority password.”
  • Interactive module 110 "The password is wrong, please enter it again"
  • Interactive module 110 "The password is correct, Mr. Zhang, congratulations you have the Host authority.”
  • Based on the above voice interactions, the user identification acquisition module 150 may determine that the name of the current speaker 200 is "Zhang San", the preference is "comedy", and the authority is "Host authority".
  • The user identification acquisition module 150 may send the user identification to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 adds the current speaker feature to the speaker feature category matching the current speaker 200 and associates the current speaker's user identification with that speaker feature category, with the speaker features in it, and with the corresponding speaker template.
  • the interaction module 110 can provide the current speaker with a more personalized and intelligent interactive experience based on the user identification of the current speaker 200.
  • For example, the interaction module 110 can identify a future speaker 200 according to the user identification associated with the speaker features in the speaker feature category, and provide the future speaker 200 with a more personalized and intelligent interactive experience, such as, but not limited to, the following interactive scenarios:
  • Future Speaker 200 "Help me adjust the temperature of the air conditioner to 25 degrees.”
  • Interaction module 110 (According to the user identification of the speaker feature category matching the future speaker 200, the name of the future speaker 200 is determined to be Li Si) "Okay, Mr. Li, I have helped you set the temperature of the air conditioner to 25 degrees. .”
  • Interaction module 110 (According to the user identification of the speaker feature category matched with the future speaker 200, the preference of the future speaker 200 is determined to be a comedy) "Mr. Li, a comedy movie with a high rating has been launched recently, named “The Richest Man in Tomatoes" "Please enjoy it next.”
  • Interaction module 110 (According to the user ID of the speaker feature category matching the future speaker 200, the permission of the future speaker 200 is determined as the guest permission) "Mr. Li, sorry, your current permission is not enough. Please upgrade to the highest permission.”
  • When the matching result of the current speaker 200 includes a speaker template matching the current speaker feature (or the system identifier of the speaker feature category corresponding to that template) and the user identification acquisition module 150 determines that the user identification of that speaker feature category has already been obtained, or when the matching result includes information indicating that the current speaker 200 matches no speaker 200 corresponding to any speaker template, the user identification acquisition module 150 does not acquire the user identification of the current speaker 200, that is, it does not trigger the interaction module 110 to interact with the current speaker 200.
  • the speaker recognition device 100 may further include a voice attribute recognition module 160.
  • The voice attribute recognition module 160 may identify the voice attribute information of the current speaker 200 based on the voice features of the current speaker 200 (for example, but not limited to, FBank features, MFCC features, etc.), where the voice attribute information may include, but is not limited to, the gender attribute of the voice (male voice, female voice), the age attribute of the voice (for example, child, adult, etc.), and so on.
  • The voice attribute recognition module 160 can use any voice attribute recognition technology in the prior art to recognize the voice attributes of the current speaker 200, for example, but not limited to, a voice classifier obtained by training a classification neural network on a large amount of voice sample data. A minimal sketch of such a classifier follows.
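  • A minimal sketch of such a classifier (the network shape, the mean-MFCC pooling, and the labels are illustrative assumptions, not the patent's design):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_voice_attribute_classifier(mfcc_features, labels):
    """Train a small neural-network classifier for a voice attribute.

    mfcc_features: list of (frames, coeffs) MFCC arrays, one per voice sample.
    labels: attribute label per sample, e.g. "male"/"female" or "child"/"adult".
    """
    # Pool each utterance to its mean MFCC vector (an illustrative choice).
    X = np.stack([m.mean(axis=0) for m in mfcc_features])
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    clf.fit(X, labels)
    return clf

def predict_voice_attribute(clf, mfcc):
    """Predict the attribute of one utterance from its MFCC array."""
    return clf.predict(mfcc.mean(axis=0)[None, :])[0]
```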
  • After the voice attribute recognition module 160 recognizes the voice attribute information of the current speaker 200, it can send the voice attribute information to the interaction module 110, so that the interaction module 110 interacts with the current speaker 200 based on the voice attribute information to obtain the user identification of the current speaker 200. For example, the voice attribute recognition module 160 may recognize that the voice of the current speaker 200 is a child's voice; then, when the interaction module 110 interacts with the current speaker 200, it can address the current speaker 200 as a "child", and, to determine the name of the current speaker 200, the interaction module 110 and the current speaker 200 may have the following interaction scenario:
  • Interaction module 110: "Hello, little one, you sound like a very cute child. What's your name?"
  • The user identification acquisition module 150 and the voice attribute recognition module 160 may respectively send the user identification and the voice attribute information of the current speaker 200 to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can add the current speaker feature to the speaker feature category matching the current speaker 200 and associate the current speaker's voice attribute information and user identification with that speaker feature category, with the speaker features in it, and with the corresponding speaker template.
  • the interaction module 110 can provide the current speaker with a more personalized and intelligent interaction experience based on the user identification and voice attribute information of the current speaker 200.
  • For example, the interaction module 110 can recognize a future speaker 200 according to the voice attribute information and user identification associated with the speaker feature category, and provide the future speaker 200 with a more personalized and intelligent interactive experience, such as, but not limited to, the following interactive scenario:
  • Future speaker 200: "Play a song for me."
  • Interaction module 110 (determining from the user identification of the speaker feature category matching the future speaker 200 that the future speaker's name is Yaya, and from the voice attribute information that the age attribute of the voice is child).
  • In some embodiments, the voice attribute recognition module 160 may determine, according to the matching result of the speaker recognition device 100 for the current speaker 200, whether the voice attributes of the current speaker 200 need to be recognized. For example, when the matching result of the current speaker 200 includes a speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, and the voice attribute recognition module 160 determines that the voice attribute information of that speaker category has not yet been acquired, the voice attribute recognition module 160 can recognize the voice attributes of the current speaker 200 and send the voice attribute information to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can associate the current speaker's voice attribute information with the speaker features in the speaker feature category matching the current speaker 200.
  • When the voice attribute recognition module 160 determines that the voice attribute information of the speaker feature category has already been identified, or when the matching result of the current speaker 200 includes information indicating that the current speaker 200 matches no speaker 200 corresponding to any speaker template, the voice attribute recognition module 160 does not recognize the voice attributes of the current speaker 200.
  • the interaction module 110 can provide the current speaker with a more personalized and intelligent interaction experience based on the voice attribute information of the current speaker 200.
  • The interaction module 110 may also provide a more personalized and intelligent interactive experience to a future speaker 200 according to the voice attribute information associated with the speaker feature category.
  • the voice attribute recognition module 160 can recognize the voice attributes of the current speaker 200, and send the voice attribute information to the interaction module 110, so that The interaction module 110 can provide the current speaker 200 with a more personalized and intelligent interaction experience based on the sound attribute information, for example, but not limited to, the following interaction scenarios:
  • Interaction module 110 (the voice attribute recognition module 160 determines the age attribute of the voice to be child): "The air conditioner has been turned on and set to 28 degrees Celsius."
  • Interaction module 110 (the voice attribute recognition module 160 determines the age attribute of the voice to be adult): "The air conditioner has been turned on and set to 25 degrees Celsius."
  • After re-clustering, each updated speaker feature category may use the user identification and/or voice attribute information corresponding to the largest number of its speaker features as the user identification and/or voice attribute information associated with the updated speaker feature category. For example, if an updated speaker feature category obtained by re-clustering contains 48 speaker features corresponding to the user identification "Zhang San" and 2 speaker features corresponding to the user identification "Li Si", then the user identification "Zhang San" is associated with the updated speaker feature category.
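  • As a concrete illustration of this majority-vote association, the following Python sketch picks the user identification held by the largest number of speaker features in an updated category; the function name and the list-of-labels representation are illustrative assumptions, not structures from the application:

```python
from collections import Counter

def associate_majority_user_id(category_user_ids):
    """Return the user ID carried by the most speaker features in an
    updated speaker feature category (ties broken arbitrarily)."""
    known = [uid for uid in category_user_ids if uid is not None]
    if not known:
        return None  # no user ID has been collected for this category yet
    user_id, _count = Counter(known).most_common(1)[0]
    return user_id

# Example from the text: 48 features labeled "Zhang San", 2 labeled "Li Si".
labels = ["Zhang San"] * 48 + ["Li Si"] * 2
print(associate_majority_user_id(labels))  # -> Zhang San
```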
  • According to the embodiments of the present application, there is no need for the speaker to perform a dedicated voice registration; instead, during voice interaction with the speakers, the speaker features of different speakers are clustered to obtain the speaker template corresponding to each speaker, which is used to identify different speakers. Therefore, the embodiments of the present application can realize imperceptible registration and avoid the negative experience that registration brings to the speaker.
  • Adding the speaker feature of the current speaker to the speaker feature category matching the current speaker and updating the speaker template of that speaker feature category can improve speaker recognition accuracy.
  • Updating the speaker template of an existing speaker feature category, or obtaining the speaker template of a new speaker feature category, can also improve the accuracy of speaker recognition.
  • a more personalized and intelligent interactive experience can be provided to the speaker.
  • Fig. 5 shows a schematic flow chart of a speaker recognition method according to an embodiment of the present application.
  • Different components of the speaker recognition apparatus 100 in Figs. 2-3 can implement different blocks or other parts of the method.
  • The order in which the method steps are described should not be construed as requiring that these steps be executed in that order; the steps need not be performed in the order described, and the method may include other steps besides these, or only some of the described steps.
  • the speaker recognition method includes:
  • Block 501: receive the current voice input from the current speaker 200 through the interaction module 110;
  • Block 502: preprocess the current voice input 300 of the current speaker 200 through the speaker feature acquisition module 120, where the preprocessing may include, but is not limited to, signal enhancement, dereverberation, denoising, etc.; the speaker feature acquisition module 120 then extracts the current voice features of the current speaker 200, such as, but not limited to, FBank features or MFCC features, from the preprocessed current voice input 300;
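  • By way of illustration only, the following Python sketch shows one way such frame-level voice features could be computed, here MFCCs via the librosa library; the application does not mandate any particular feature extractor, and the function name and parameter values are assumptions:

```python
import numpy as np
import librosa  # assumed available; any MFCC/FBank implementation would do

def extract_voice_features(waveform, sample_rate=16000, n_mfcc=20):
    """Extract per-frame MFCC features from a (preprocessed) waveform.

    Returns an array of shape (n_frames, n_mfcc); FBank features could be
    obtained analogously from librosa.feature.melspectrogram.
    """
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T  # (frames, coefficients)

# Toy usage on one second of synthetic audio.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
tone = (0.1 * np.sin(2 * np.pi * 220.0 * t)).astype(np.float32)
print(extract_voice_features(tone).shape)  # e.g. (32, 20) with default hops
```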
  • Block 503: through the speaker feature acquisition module 120, based on the current voice features of the current speaker 200, obtain the current speaker feature of the current speaker 200 according to a voiceprint model, such as, but not limited to, a GMM-UBM, an I-vector model, or a JFA model; the current speaker feature may include, but is not limited to, the mean supervector of the GMM-UBM model, the speaker-dependent supervector of the JFA model, the I-vector of the I-vector model, etc.;
  • Block 504: determine whether at least one speaker template exists through the speaker matching module 140; if it is determined that at least one speaker template exists, block 505 is executed, and if it is determined that no speaker template exists, block 507 is executed;
  • Block 505: through the speaker matching module 140, determine the similarity between the current speaker feature of the current speaker 200 and each speaker template; specifically, the similarity between the current speaker feature and each speaker template can be determined by calculating the distance between them, where the distance may include, but is not limited to, cosine distance, EMD (Earth Mover's Distance), Euclidean distance, Manhattan distance, etc.;
  • Block 506: through the speaker matching module 140, determine, according to the maximum similarity between the current speaker feature of the current speaker 200 and the speaker templates, whether the current speaker 200 matches one speaker 200 among the at least one speaker 200 corresponding to the at least one speaker template; specifically, when the speaker matching module 140 determines that the maximum similarity is higher than or equal to a similarity threshold, it can determine that the current speaker feature matches the speaker template with the maximum similarity, that is, that the current speaker 200 matches the speaker 200 corresponding to that speaker template; in the case that the speaker matching module 140 determines that the maximum similarity is lower than the similarity threshold, it can determine that the current speaker feature does not match any speaker template, that is, that the current speaker 200 does not match any speaker 200 corresponding to the speaker templates;
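  • The following Python sketch illustrates blocks 505 and 506 under stated assumptions: cosine similarity is used as the similarity measure (any of the distances listed above would serve), the 0.75 similarity threshold is an invented illustrative value, and the template dictionary is a hypothetical representation:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_speaker(current_feature, templates, threshold=0.75):
    """Block 505/506 sketch: score the current speaker feature against every
    speaker template and accept the best match only above the threshold.

    Returns (matched_system_id_or_None, best_score_or_None).
    """
    if not templates:
        return None, None  # no templates yet (block 504 -> block 507)
    scores = {sid: cosine_similarity(current_feature, template)
              for sid, template in templates.items()}
    best_id = max(scores, key=scores.get)
    if scores[best_id] >= threshold:
        return best_id, scores[best_id]  # current speaker 200 matches
    return None, scores[best_id]         # no speaker template matches

rng = np.random.default_rng(0)
templates = {"speaker A": rng.normal(size=64), "speaker B": rng.normal(size=64)}
current = templates["speaker A"] + 0.05 * rng.normal(size=64)
print(match_speaker(current, templates))  # -> ('speaker A', ~0.99)
```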
  • Block 507: the speaker feature set maintenance unit 131 of the speaker template acquisition module 130 adds the current speaker feature to the speaker feature set. In one example, in the case that the speaker matching module 140 determines that the current speaker feature matches a speaker template, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to the matching speaker template; in the case that the speaker matching module 140 determines that the current speaker feature does not match any speaker template, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set without assigning the current speaker feature to any speaker feature category;
  • In addition, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set. Specifically, if the signal-to-noise ratio of the current voice input of the current speaker 200 is higher than or equal to a signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set; if the signal-to-noise ratio of the current voice input of the current speaker 200 is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
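  • A minimal sketch of this signal-to-noise-ratio gate follows; the SNR estimator, the 15 dB threshold, and the assumption that a noise-only segment is available are all illustrative choices, not details from the application:

```python
import numpy as np

def estimate_snr_db(speech_segment, noise_segment):
    """Rough SNR estimate in dB from a speech segment and a noise-only
    segment (how the noise floor is obtained is not specified here)."""
    p_speech = float(np.mean(np.square(speech_segment)))
    p_noise = float(np.mean(np.square(noise_segment))) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)

def should_keep_feature(speech_segment, noise_segment, snr_threshold_db=15.0):
    """Keep the current speaker feature only if the current voice input's
    SNR reaches the (illustrative) threshold."""
    return estimate_snr_db(speech_segment, noise_segment) >= snr_threshold_db

rng = np.random.default_rng(3)
noise = 0.01 * rng.normal(size=16000)
speech = np.sin(2 * np.pi * 220.0 * np.linspace(0.0, 1.0, 16000)) + noise
print(should_keep_feature(speech, noise))  # -> True for this clean input
```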
  • Block 508: determine, through the speaker feature set maintenance unit 131 of the speaker template acquisition module 130, whether the speaker feature set satisfies the clustering condition; if so, execute block 509; if not, return to block 501, that is, receive the next voice input;
  • In one example, determining whether the speaker feature set satisfies the clustering condition may include determining whether the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to a first clustering threshold; in another example, it may include determining whether the number of speaker features in the speaker feature category corresponding to the speaker template matched with the current speaker feature is equal to a second clustering threshold, where the second clustering threshold may include one or more values, for example, but not limited to, 50, 100, 200, 500, or other values;
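  • The two clustering conditions just described can be captured in a small helper such as the following sketch; the threshold values are only the illustrative ones mentioned in the text, and the function shape is an assumption:

```python
def clustering_condition_met(uncategorized_count,
                             matched_category_count=None,
                             first_threshold=50,
                             second_thresholds=(50, 100, 200, 500)):
    """Return True if the speaker feature set satisfies either clustering
    condition: enough uncategorized features (first clustering threshold),
    or the matched category has just reached one of the second clustering
    threshold values. `matched_category_count` is None when no speaker
    template matched the current speaker feature.
    """
    if uncategorized_count >= first_threshold:
        return True
    if matched_category_count is not None \
            and matched_category_count in second_thresholds:
        return True
    return False

print(clustering_condition_met(uncategorized_count=50))         # -> True
print(clustering_condition_met(3, matched_category_count=100))  # -> True
print(clustering_condition_met(3, matched_category_count=101))  # -> False
```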
  • Block 509: the speaker feature clustering unit 132 of the speaker template acquisition module 130 uses a clustering algorithm to cluster the speaker features in the speaker feature set;
  • In one example, in the case that the speaker matching module 140 determines that the current speaker feature does not match any speaker template and the number of speaker features not included in any speaker feature category is greater than or equal to the first clustering threshold, the speaker feature clustering unit 132 can cluster the speaker features that are not included in any speaker feature category and obtain at least one new speaker feature category.
  • Each new speaker feature category is assigned a new system identification (for example, but not limited to, speaker C, speaker D, etc.), where the new speaker feature categories correspond one-to-one to speakers 200 and each new speaker feature category includes at least one speaker feature. A new system identification can be associated with the new speaker feature category, with at least one speaker feature in the new speaker feature category, and with the speaker template corresponding to the new speaker feature category. In another example, the speaker feature clustering unit 132 may re-cluster all speaker features in the speaker feature set to obtain at least one updated speaker feature category, where the at least one updated speaker feature category corresponds one-to-one to speakers 200 and each updated speaker feature category includes at least one speaker feature in the speaker feature set;
  • For example, in the case that the speaker matching module 140 determines that the current speaker feature matches the speaker template with the greatest similarity, and the number of speaker features within the speaker feature category corresponding to that matching speaker template is equal to the second clustering threshold, the speaker feature clustering unit 132 may re-cluster all speaker features in the speaker feature set to obtain at least one updated speaker feature category, where the at least one updated speaker feature category corresponds one-to-one to speakers 200 and each updated speaker feature category includes at least one speaker feature in the speaker feature set;
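  • Since the application does not prescribe a particular clustering algorithm, the following sketch stands in with average-linkage agglomerative clustering over cosine distance (via SciPy); the 0.4 cut distance and the synthetic data are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_speaker_features(features, cut_distance=0.4):
    """Group speaker feature vectors into speaker feature categories and
    return one integer category label per feature vector."""
    Z = linkage(features, method="average", metric="cosine")
    return fcluster(Z, t=cut_distance, criterion="distance")

# Two synthetic speakers: tight bundles around two random directions.
rng = np.random.default_rng(1)
dir_a = rng.normal(size=64)
dir_a /= np.linalg.norm(dir_a)
dir_b = rng.normal(size=64)
dir_b /= np.linalg.norm(dir_b)
features = np.vstack([dir_a + 0.05 * rng.normal(size=(30, 64)),
                      dir_b + 0.05 * rng.normal(size=(20, 64))])
print(sorted(set(cluster_speaker_features(features))))  # e.g. [1, 2]
```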
  • Block 510: through the speaker template obtaining unit 133 of the speaker template acquisition module 130, for each speaker feature category, obtain the speaker template corresponding to that speaker feature category according to the speaker features in the category;
  • In one example, the speaker template obtaining unit 133 may obtain a new speaker template corresponding to each new speaker feature category according to the at least one speaker feature in that new speaker feature category;
  • In one example, the speaker template obtaining unit 133 may obtain the updated speaker template corresponding to each updated speaker feature category according to the at least one updated speaker feature in that category; in another example, the speaker template obtaining unit 133 may obtain only the updated speaker template corresponding to the updated speaker feature category matching the current speaker 200, according to the at least one updated speaker feature in that category, where the updated speaker feature category matching the current speaker 200 can be determined with reference to the description of the apparatus above;
  • In addition, the speaker template obtaining unit 133 may update, according to the speaker features (including the current speaker feature) in the speaker feature category matching the current speaker 200, the speaker template corresponding to the speaker feature category matched with the current speaker 200;
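  • Consistent with the summary of the invention later in this document, which mentions the mean or a weighted sum of a category's speaker features, a minimal template-derivation sketch is given below; re-running it after the current speaker feature has been added to the matched category yields that category's updated speaker template:

```python
import numpy as np

def speaker_template(category_features, weights=None):
    """Derive a speaker template from the speaker features of one category:
    the plain mean by default, or a weighted sum when weights are given."""
    features = np.asarray(category_features, dtype=float)
    if weights is None:
        return features.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * features).sum(axis=0)

category = [np.ones(4), 3.0 * np.ones(4)]
print(speaker_template(category))                # -> [2. 2. 2. 2.]
print(speaker_template(category, [0.25, 0.25]))  # -> [1. 1. 1. 1.]
```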
  • Fig. 6 shows another schematic flow chart of a speaker recognition method according to an embodiment of the present application.
  • Different components of the speaker recognition device 100 in Figs. 2 and 3 can implement different blocks or other parts of the method.
  • The order in which the method steps are described should not be construed as requiring that these steps be executed in that order; the steps need not be performed in the order described, and the method may include other steps besides these, or only some of the described steps.
  • the speaker recognition method includes:
  • Blocks 601-606: refer to the description of blocks 501-506, which is not repeated here;
  • Block 607: the user identification acquisition module 150 may determine, according to the matching result of the current speaker 200, whether it is necessary to obtain the user identification of the current speaker 200, for example, name, gender, age, authority, preferences, etc.;
  • In one example, in the case that the matching result of the current speaker 200 includes a speaker template matching the current speaker feature, or the system identification of the speaker feature category corresponding to that speaker template, if the user identification acquisition module 150 determines that it has not yet acquired the user identification of the speaker feature category corresponding to the speaker template, the user identification acquisition module 150 determines that the user identification of the current speaker 200 needs to be obtained and triggers the interaction module 110 to interact with the current speaker 200;
  • In another example, in the case that the matching result of the current speaker 200 includes a speaker template that matches the current speaker feature, or the system identification of the speaker feature category corresponding to that speaker template, if the user identification acquisition module 150 determines that it has already acquired the user identification of the speaker feature category corresponding to the speaker template, or when the matching result of the current speaker 200 includes information indicating that the current speaker 200 does not match the speaker 200 corresponding to any speaker template, the user identification acquisition module 150 determines that it is not necessary to acquire the user identification of the current speaker 200 and does not trigger the interaction module 110 to interact with the current speaker 200;
  • Block 608: obtain the user identification of the current speaker 200 through the user identification acquisition module 150 according to the interaction with the current speaker;
  • Block 609: the speaker feature set maintenance unit 131 of the speaker template acquisition module 130 adds the current speaker feature to the speaker feature set. In one example, in the case of a match, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to the matching speaker template; in addition, when the user identification acquisition module 150 has acquired the user identification of the current speaker 200, the speaker feature set maintenance unit 131 also associates the user identification of the current speaker with the speaker feature category, the speaker features in the speaker feature category, and the speaker template corresponding to the speaker feature category;
  • In another example, in the case of no match, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set without assigning the current speaker feature to any speaker feature category;
  • In addition, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set. Specifically, if the signal-to-noise ratio of the current voice input of the current speaker 200 is higher than or equal to a signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set; if the signal-to-noise ratio of the current voice input of the current speaker 200 is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
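  • The association bookkeeping described for block 609 might be kept in a structure like the following toy registry; the class and field names are hypothetical, not the application's actual data layout:

```python
class SpeakerRegistry:
    """Toy per-category bookkeeping: features, template, user ID, and (for
    the method of Fig. 7) voice attributes share one category record, so
    associating with the category covers all of them."""

    def __init__(self):
        self.categories = {}  # system_id -> category record

    def _category(self, system_id):
        return self.categories.setdefault(
            system_id,
            {"features": [], "template": None,
             "user_id": None, "voice_attributes": None})

    def add_feature(self, system_id, feature):
        self._category(system_id)["features"].append(feature)

    def associate(self, system_id, user_id=None, voice_attributes=None):
        record = self._category(system_id)
        if user_id is not None:
            record["user_id"] = user_id
        if voice_attributes is not None:
            record["voice_attributes"] = voice_attributes

registry = SpeakerRegistry()
registry.add_feature("speaker A", [0.1, 0.2])
registry.associate("speaker A", user_id="Zhang San")
print(registry.categories["speaker A"]["user_id"])  # -> Zhang San
```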
  • Block 610: refer to the description of block 508, which is not repeated here;
  • Block 611: refer to the description of block 509, which is not repeated here;
  • In addition, since the speaker features in a speaker feature category have associated user identifications, if re-clustering occurs, then for each updated speaker feature category after re-clustering, the user identification corresponding to the largest number of speaker features may be used as the user identification associated with the updated speaker feature category. For example, if an updated speaker feature category after re-clustering contains 48 speaker features corresponding to the user identification "Zhang San" and 2 speaker features corresponding to the user identification "Li Si", then the user identification "Zhang San" is associated with that updated speaker feature category;
  • Fig. 7 shows another schematic flow chart of a speaker recognition method according to an embodiment of the present application.
  • Different components of the speaker recognition apparatus 100 in Figs. 2 and 3 can implement different blocks or other parts of the method.
  • The order in which the method steps are described should not be construed as requiring that these steps be executed in that order; the steps need not be performed in the order described, and the method may include other steps besides these, or only some of the described steps.
  • the speaker recognition method includes:
  • Blocks 701-707: refer to the descriptions of blocks 601-607, which are not repeated here;
  • Block 708: through the voice attribute recognition module 160, recognize the voice attributes of the current speaker 200 according to the current voice features of the current speaker, where the voice attribute information may include, but is not limited to, the gender attribute of the voice, the age attribute of the voice, etc.;
  • the voice attribute recognition module 160 can use any voice attribute recognition technology in the prior art to recognize the voice attributes of the current speaker 200, for example, but not limited to, a voice classifier obtained by training a classification neural network on a large amount of voice sample data.
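  • The application mentions a classifier trained on a large amount of voice sample data, for example a classification neural network; as a stand-in, the following sketch trains a scikit-learn logistic-regression classifier on synthetic utterance-level features to predict the age attribute of the voice, with the data, labels, and model choice all being illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in training data: utterance-level feature vectors with age labels.
rng = np.random.default_rng(2)
X_train = np.vstack([rng.normal(0.0, 1.0, size=(100, 64)),
                     rng.normal(1.0, 1.0, size=(100, 64))])
y_train = np.array(["child"] * 100 + ["adult"] * 100)

classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def recognize_voice_attributes(current_voice_feature):
    """Predict an age attribute for the current speaker 200's voice."""
    return classifier.predict(current_voice_feature.reshape(1, -1))[0]

print(recognize_voice_attributes(rng.normal(1.0, 1.0, size=64)))  # e.g. adult
```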
  • Block 709: obtain the user identification of the current speaker 200 through the user identification acquisition module 150 according to an interaction with the current speaker 200 that is based on the voice attribute information;
  • the voice attribute information may be sent to the interaction module 110, so that the interaction module 110 interacts with the current speaker 200 based on the voice attribute information .
  • the user ID acquisition module 150 can determine the user ID of the current speaker 200 according to the interactive voice input of the current speaker 200;
  • Block 710: add the current speaker feature to the speaker feature set through the speaker feature set maintenance unit 131 of the speaker template acquisition module 130;
  • In one example, in the case of a match, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to the matching speaker template; in addition, when the voice attribute recognition module 160 has recognized the voice attributes of the current speaker 200 and the user identification acquisition module 150 has acquired the user identification of the current speaker 200, the speaker feature set maintenance unit 131 also associates the voice attributes and user identification of the current speaker with the speaker feature category, the speaker features in the speaker feature category, and the speaker template corresponding to the speaker feature category;
  • In another example, in the case of no match, the speaker feature set maintenance unit 131 can add the current speaker feature to the speaker feature set without assigning the current speaker feature to any speaker feature category;
  • In addition, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set. Specifically, if the signal-to-noise ratio of the current voice input of the current speaker 200 is higher than or equal to a signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set; if the signal-to-noise ratio of the current voice input of the current speaker 200 is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
  • Blocks 711-713: refer to the description of blocks 610-612, which is not repeated here.
  • In addition, after re-clustering, each updated speaker feature category may use the user identification and voice attribute information corresponding to the largest number of its speaker features as the user identification and voice attribute information associated with the updated speaker feature category.
  • According to the embodiments of the present application, there is no need for the speaker to perform a dedicated voice registration; instead, during voice interaction with the speakers, the speaker features of different speakers are clustered to obtain the speaker template corresponding to each speaker, which is used to identify different speakers. Therefore, the embodiments of the present application can realize imperceptible registration and avoid the negative experience that registration brings to the speaker.
  • Adding the speaker feature of the current speaker to the speaker feature category matching the current speaker and updating the speaker template of that speaker feature category can improve speaker recognition accuracy.
  • Updating the speaker template of an existing speaker feature category, or obtaining the speaker template of a new speaker feature category, can also improve the accuracy of speaker recognition.
  • a more personalized and intelligent interactive experience can be provided to the speaker.
  • FIG. 8 shows a schematic structural diagram of an example system 800 according to an embodiment of the present application.
  • the system 800 may include one or more processors 802, system control logic 808 connected to at least one of the processors 802, system memory 804 connected to the system control logic 808, non-volatile memory (NVM) 806 connected to the system control logic 808, and a network interface 810 connected to the system control logic 808.
  • the processor 802 may include one or more single-core or multi-core processors.
  • the processor 802 may include any combination of a general-purpose processor and a special-purpose processor (for example, a graphics processor, an application processor, a baseband processor, etc.).
  • the processor 802 may be configured to execute one or more of the various embodiments shown in Figs. 5-7.
  • system control logic 808 may include any suitable interface controller to provide any suitable interface to at least one of the processors 802 and/or to any suitable device or component in communication with the system control logic 808.
  • system control logic 808 may include one or more memory controllers to provide an interface to the system memory 804.
  • the system memory 804 may be used to load and store data and/or instructions for the system 800.
  • the memory 804 of the system 800 may include any suitable volatile memory, such as a suitable dynamic random access memory (DRAM).
  • the NVM/memory 806 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions.
  • the NVM/memory 806 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as at least one HDD (Hard Disk Drive), CD (Compact Disc) drive, or DVD (Digital Versatile Disc) drive.
  • the NVM/memory 806 may include a portion of the storage resources installed on the apparatus of the system 800, or may be accessible by the device without necessarily being a part of the device.
  • the NVM/storage 806 can be accessed through the network via the network interface 810.
  • the system memory 804 and the NVM/memory 806 may include, respectively, a temporary copy and a permanent copy of the instructions 820.
  • the instructions 820 may include instructions that, when executed by at least one of the processors 802, cause the system 800 to implement one or more of the various embodiments shown in FIGS. 5-7.
  • the instructions 820, hardware, firmware, and/or software components thereof may additionally/alternatively be placed in the system control logic 808, the network interface 810, and/or the processor 802.
  • the network interface 810 may include a transceiver to provide a radio interface for the system 800, and then communicate with any other suitable devices (such as a front-end module, an antenna, etc.) through one or more networks.
  • the network interface 810 may be integrated with other components of the system 800.
  • In one embodiment, the system 800 may include at least one of a processor 802, the system memory 804, the NVM/memory 806, and a firmware device (not shown) having instructions stored thereon; when at least one of the processors 802 executes the instructions, the system 800 implements one or more of the various embodiments shown in Figs. 5-7.
  • the network interface 810 may further include any suitable hardware and/or firmware to provide a multiple input multiple output radio interface.
  • the network interface 810 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
  • In one embodiment, at least one of the processors 802 may be packaged together with the logic of one or more controllers of the system control logic 808 to form a system in package (SiP). In one embodiment, at least one of the processors 802 may be integrated on the same die with the logic of one or more controllers of the system control logic 808 to form a system on chip (SoC).
  • the system 800 may further include: an input/output (I/O) interface 812.
  • the I/O interface 812 may include a user interface to enable a user to interact with the system 800, and a peripheral component interface designed so that peripheral components can also interact with the system 800.
  • the system 800 further includes a sensor for determining at least one of environmental conditions and location information related to the system 800.
  • the user interface may include, but is not limited to, a display (e.g., a liquid crystal display or a touch screen display), speakers, a microphone, one or more cameras (e.g., still image cameras and/or video cameras), a flash (e.g., an LED flash), and a keyboard.
  • the peripheral component interface may include, but is not limited to, a non-volatile memory port, an audio jack, and a power interface.
  • the sensor may include, but is not limited to, a gyroscope sensor, an accelerometer, a proximity sensor, an ambient light sensor, and a positioning unit.
  • the positioning unit may also be part of or interact with the network interface 810 to communicate with components of the positioning network (eg, global positioning system (GPS) satellites).
  • As used herein, "module" or "unit" may refer to, be, or include: an application-specific integrated circuit (ASIC), an electronic circuit, a (shared, dedicated, or group) processor and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functions.
  • the various embodiments of the mechanism disclosed in this application may be implemented in hardware, software, firmware, or a combination of these implementation methods.
  • the embodiments of the present application can be implemented as computer programs or program code executed on a programmable system that includes at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • Program codes can be applied to input instructions to perform the functions described in this application and generate output information.
  • the output information can be applied to one or more output devices in a known manner.
  • a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
  • DSP digital signal processor
  • ASIC application specific integrated circuit
  • the program code can be implemented in a high-level programming language or an object-oriented programming language to communicate with the processing system.
  • assembly language or machine language can also be used to implement the program code.
  • the mechanisms described in this application are not limited in scope to any particular programming language; in either case, the language can be a compiled language or an interpreted language.
  • the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof.
  • one or more aspects of at least some embodiments may be implemented by representative instructions stored on a computer-readable storage medium.
  • the instructions represent various logic in the processor and, when read by a machine, cause the machine to fabricate the logic that performs the techniques described in this application.
  • IP cores can be stored on a tangible computer-readable storage medium and provided to multiple customers or production facilities to be loaded into the manufacturing machine that actually manufactures the logic or processor.
  • Such computer-readable storage media may include, but are not limited to, non-transitory tangible arrangements of articles manufactured or formed by machines or devices, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memory (ROM), random access memory (RAM) such as dynamic random access memory (DRAM) and static random access memory (SRAM), erasable programmable read-only memory (EPROM), flash memory, and electrically erasable programmable read-only memory (EEPROM); phase change memory (PCM); magnetic or optical cards; or any other type of medium suitable for storing electronic instructions.
  • Therefore, each embodiment of the present application also includes a non-transitory computer-readable storage medium that contains instructions or contains design data, such as a hardware description language (HDL), which defines the structures, circuits, apparatuses, processor and/or system features described in this application.

Abstract

A voice processing method comprises: receiving multiple pieces of voice input, and extracting multiple voice features from the multiple pieces of voice input; determining multiple speaker features on the basis of the multiple voice features; clustering the multiple speaker features into at least one speaker feature category, wherein the at least one speaker feature category corresponds one-to-one to at least one speaker (200), and each of the at least one speaker feature category comprises at least one of the multiple speaker features; determining at least one speaker template on the basis of the at least one speaker feature category, wherein the at least one speaker template corresponds one-to-one to the at least one speaker (200); and receiving current voice input from a current speaker (200), and determining, on the basis of the current voice input and the at least one speaker template, whether the current speaker (200) matches one of the at least one speaker (200). The method implements imperceptible registration, and prevents registration from causing a negative experience for the speaker (200).

Description

Voice processing method, medium, and system

This application claims priority to the Chinese patent application No. 202010025486.X, filed with the Chinese Patent Office on January 10, 2020 and entitled "Voice processing method, medium, and system", the entire content of which is incorporated herein by reference.

Technical field

One or more embodiments of the present application generally relate to the field, and specifically relate to a voice processing method, medium, and system.
Background

Speaker recognition, that is, voiceprint recognition, belongs to biometric recognition technology; it is a technology that automatically identifies and verifies the identity of a speaker through the analysis of, and feature extraction from, speech signals.

Existing speaker recognition includes two stages: registration and verification. In the registration stage, the system requires the registrant to record multiple registration utterances according to specified requirements, and the system internally converts these registration utterances into a corresponding speaker model. In the voiceprint verification stage, the system performs feature analysis and extraction on the recorded verification speech, scores its similarity against the speaker model generated in the registration stage, and judges, according to a set threshold, whether the verification speech matches the registrant.
Summary of the invention

The following describes the present application from multiple aspects; the implementations and beneficial effects of the following aspects may be cross-referenced.

A first aspect of the present application provides a voice processing method, which may include: receiving multiple voice inputs and extracting multiple voice features from the multiple voice inputs; determining multiple speaker features based on the multiple voice features; clustering the multiple speaker features into at least one speaker feature category, where the at least one speaker feature category corresponds one-to-one to at least one speaker, and each speaker feature category of the at least one speaker feature category includes at least one speaker feature of the multiple speaker features; determining at least one speaker template based on the at least one speaker feature category, where the at least one speaker template corresponds one-to-one to the at least one speaker; and receiving a current voice input from a current speaker, and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker.
According to the embodiments of the present application, the speaker does not need to perform a dedicated voice registration; instead, during voice interaction with the speakers, the speaker features of different speakers are clustered to obtain the speaker template corresponding to each speaker, which is then used to identify different speakers. Therefore, the embodiments of the present application can realize imperceptible registration and avoid the negative experience that registration brings to the speaker.
In some embodiments, determining multiple speaker features based on the multiple voice features includes: determining the multiple speaker features through a voiceprint model based on the multiple voice features, where the voiceprint model includes at least one of a Gaussian mixture-universal background model (GMM-UBM), an I-vector model, and a joint factor analysis (JFA) model, and the multiple speaker features include at least one of the mean supervector of the GMM-UBM, the I-vector of the I-vector model, and the speaker-dependent supervector of the JFA model.

In some embodiments, determining at least one speaker template based on the at least one speaker feature category includes: determining the mean or weighted sum of the at least one speaker feature within each speaker feature category, and using the at least one mean or weighted sum as the at least one speaker template.

In some embodiments, clustering the multiple speaker features into at least one speaker feature category includes: clustering the multiple speaker features into the at least one speaker feature category based on at least one of the similarity between every two speaker features among the multiple speaker features, the offset between two speaker features among the multiple speaker features, and the density distribution of the multiple speaker features.

In some embodiments, receiving the current voice input from the current speaker and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker further includes: receiving the current voice input from the current speaker and extracting a current voice feature from the current voice input; determining a current speaker feature based on the current voice feature; determining whether the current speaker feature matches one speaker template of the at least one speaker template; and in the case of determining that the current speaker feature matches one speaker template, determining that the current speaker matches the speaker corresponding to that speaker template.

In some embodiments, receiving the current voice input from the current speaker and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker includes: determining whether the current speaker matches one speaker of the at least one speaker based on the similarity between the current speaker feature and each speaker template of the at least one speaker template.
In some embodiments, the method further includes: in the case of determining that the current speaker matches one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of the at least one speaker feature in the speaker feature category corresponding to that speaker is equal to a first threshold; and in the case of determining that the sum is not equal to the first threshold, updating the speaker template corresponding to that speaker based on the current speaker feature and the at least one speaker feature in that speaker feature category.

According to the embodiments of the present application, after the current speaker is successfully matched, adding the speaker feature of the current speaker to the speaker feature category matching the current speaker and updating the speaker template of that speaker feature category can improve speaker recognition accuracy.
In some embodiments, the method further includes: in the case of determining that the current speaker matches one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of the at least one speaker feature in the speaker feature category corresponding to that speaker is equal to the first threshold; in the case of determining that the sum is equal to the first threshold, adding the current speaker feature to the multiple speaker features to form updated multiple speaker features; clustering the updated multiple speaker features into at least one updated speaker feature category, where the at least one updated speaker feature category corresponds one-to-one to at least one updated speaker, and each updated speaker feature category includes at least one speaker feature of the updated multiple speaker features; and determining at least one updated speaker template based on the at least one updated speaker feature category, where the at least one updated speaker template corresponds one-to-one to the at least one updated speaker.

According to the embodiments of the present application, after the current speaker is successfully matched, if the speaker category corresponding to the matched speaker satisfies the clustering condition, all speaker features (including the current speaker feature) are re-clustered, thereby updating the speaker feature categories and the speaker templates corresponding to them; in this way, as the number of received voice inputs increases, speaker recognition accuracy can be gradually improved.
In some embodiments, the method further includes: in the case of determining that the current speaker does not match the at least one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of the at least one speaker feature not included in the at least one speaker feature category is greater than or equal to a second threshold; in the case of determining that the sum is greater than or equal to the second threshold, clustering the current speaker feature and the at least one speaker feature not included in the at least one speaker feature category into at least one other speaker feature category, where the at least one other speaker feature category corresponds one-to-one to at least one other speaker; and determining at least one other speaker template based on the at least one other speaker feature category, where the at least one other speaker template corresponds one-to-one to the at least one other speaker.

According to the embodiments of the present application, after the current speaker fails to be matched, if the number of features of unmatched speakers satisfies the clustering condition, the features of these unmatched speakers are clustered to obtain new speaker feature categories and the new speaker templates corresponding to them; in this way, as the number of received voice inputs increases, speaker recognition accuracy can be gradually improved.
In some embodiments, the method further includes: in the case of determining that the current speaker does not match the at least one speaker, determining whether the sum of the number of current speaker features of the current speaker and the number of the at least one speaker feature not included in the at least one speaker category is greater than or equal to the second threshold; in the case of determining that the sum is greater than or equal to the second threshold, adding the current speaker feature and the at least one speaker feature not included in the at least one speaker category to the multiple speaker features to form updated multiple speaker features; clustering the updated multiple speaker features into at least one updated speaker feature category; and determining at least one updated speaker template based on the at least one updated speaker feature category, where the at least one updated speaker template corresponds one-to-one to at least one updated speaker.

According to the embodiments of the present application, after the current speaker fails to be matched, if the number of features of unmatched speakers satisfies the clustering condition, all speaker features (including the current speaker feature) are re-clustered, thereby updating the speaker feature categories and the speaker templates corresponding to them; in this way, as the number of received voice inputs increases, speaker recognition accuracy can be gradually improved.
In some embodiments, the method further includes: in the case of determining that the current speaker matches one speaker of the multiple speakers, obtaining a current user identification of the current speaker through interaction with the current speaker; and associating the current user identification with one speaker feature category of the at least one speaker feature category and one speaker template of the at least one speaker template, where that speaker feature category and that speaker template correspond to the matched speaker.

According to the embodiments of the present application, by determining the user identification of the speaker, a more personalized and intelligent interactive experience can be provided to the speaker.

In some embodiments, the current user identification includes at least one of the name, gender, age, authority, and preferences of the current speaker.

In some embodiments, the method further includes: receiving a next voice input from a next speaker, and determining, based on the next voice input and the at least one speaker template, whether the next speaker matches one speaker of the at least one speaker; and in the case of determining that the next speaker matches that speaker, identifying the next speaker with the current user identification.

In some embodiments, the method further includes: in the case that one updated speaker feature category of the at least one updated speaker feature category includes multiple speaker features and the multiple speaker features are associated with multiple user identifications, determining the user identification associated with the largest number of speaker features among the multiple speaker features; and associating that user identification with one speaker template of the at least one updated speaker template, where that updated speaker template corresponds to the updated speaker feature category, and where each user identification of the multiple user identifications includes at least one of the speaker's name, gender, age, authority, and preferences.
In some embodiments, the method further includes: determining a voice attribute of the current speaker based on the current voice input of the current speaker; and in the case of determining that the current speaker matches one speaker of the multiple speakers, associating the voice attribute with the speaker feature category corresponding to that speaker.

According to the embodiments of the present application, by determining the voice attribute information of the speaker, a more personalized and intelligent interactive experience can be provided to the speaker.

In some embodiments, the voice attribute includes at least one of the age attribute of the voice and the gender attribute of the voice.
A second aspect of the present application provides a machine-readable medium having instructions stored thereon which, when run on a machine, cause the machine to execute any one of the above voice processing methods.

A third aspect of the present application provides a system, including: a processor; and a memory having instructions stored thereon which, when run by the processor, cause the system to execute any one of the above voice processing methods.
Description of the drawings

Fig. 1 shows a schematic diagram of a scene of speaker recognition according to an embodiment of the present application;

Fig. 2 shows a schematic structural diagram of a speaker recognition apparatus according to an embodiment of the present application;

Fig. 3 shows a schematic structural diagram of a speaker template acquisition module according to an embodiment of the present application;

Fig. 4 shows a schematic diagram of a scene of voice interaction between a current speaker and the speaker recognition apparatus according to an embodiment of the present application;

Fig. 5 shows a schematic flow chart of a speaker recognition method according to an embodiment of the present application;

Fig. 6 shows another schematic flow chart of a speaker recognition method according to an embodiment of the present application;

Fig. 7 shows another schematic flow chart of a speaker recognition method according to an embodiment of the present application;

Fig. 8 shows a schematic structural diagram of a system according to an embodiment of the present application.
Detailed description

The present application is further described below with reference to specific embodiments and the accompanying drawings. The specific embodiments described here are only intended to explain the application, not to limit it. In addition, for ease of description, the drawings show only the parts related to the present application rather than all of the structures or processes. It should be noted that, in this specification, similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined and explained in subsequent drawings.

Fig. 1 shows a schematic diagram of a scene of speaker recognition according to an embodiment of the present application. As shown in the figure, the speaker recognition apparatus 100 can interact with multiple speakers 200 at different times and receive voice inputs 300 from the multiple speakers 200 during the interaction. According to some embodiments of the present application, the speaker recognition apparatus 100 does not require the speakers 200 to perform a dedicated voice registration, but can identify different speakers 200 based on voice interaction with them.

According to some embodiments of the present application, the speaker recognition apparatus 100 may include, but is not limited to, a smart speaker, smart headphones, a smart bracelet, a smart large screen, a portable or mobile device, a mobile phone, a personal digital assistant, a cellular phone, a handheld PC, a portable media player, a handheld device, a wearable device (for example, a watch, a bracelet, display glasses or goggles, a head-mounted display, a head-mounted device), a navigation device, a server, a network device, a graphics device, a video game device, a set-top box, a laptop device, a virtual reality and/or augmented reality device, an Internet of Things device, an industrial control device, an in-vehicle infotainment device, a streaming media client device, an e-book or reading device, a POS machine, and other devices.

Fig. 2 shows a schematic structural diagram of the speaker recognition apparatus 100 according to an embodiment of the present application. As shown in the figure, the speaker recognition apparatus 100 may include an interaction module 110, a speaker feature acquisition module 120, a speaker template acquisition module 130, and a speaker matching module 140, and optionally a user identification acquisition module 150 and a voice attribute recognition module 160. One or more components of the speaker recognition apparatus 100 (for example, one or more of the interaction module 110, the speaker feature acquisition module 120, the speaker template acquisition module 130, the speaker matching module 140, the user identification acquisition module 150, and the voice attribute recognition module 160) may be constituted by any combination of an application-specific integrated circuit (ASIC), an electronic circuit, a (shared, dedicated, or group) processor and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and other suitable components that provide the described functions. According to one aspect, the processor may be a microprocessor, a digital signal processor, a microcontroller, etc., and/or any combination thereof. According to another aspect, the processor may be a single-core processor, a multi-core processor, etc., and/or any combination thereof.
It should be noted that the structure of the speaker recognition apparatus 100 is not limited to that shown in FIG. 2. The speaker recognition apparatus 100 may further include, but is not limited to, an input/output module for receiving the voice input 300 from a speaker 200 and for outputting interactive sentences to the speaker 200 in the form of voice, text, etc. Examples of the input/output module may include, but are not limited to, loudspeakers, microphones, and displays (for example, liquid crystal displays, touch-screen displays, etc.).
According to some embodiments of the present application, the interaction module 110 is configured to interact with a speaker 200, where the interaction may include, but is not limited to, interactive sentences in voice and/or text form. In the embodiments of the present application, the interaction module 110 may be implemented using any interaction technology in the prior art.
According to some embodiments of the present application, the speaker feature acquisition module 120 is configured to extract the voice features of a speaker 200 (for example, but not limited to, FBank (Filter Bank) features, MFCC (Mel Frequency Cepstrum Coefficient) features, etc.) from the speaker's voice input 300, and to obtain the speaker features of the speaker 200 based on those voice features according to a voiceprint model (for example, but not limited to, a GMM-UBM (Gaussian Mixture Model-Universal Background Model), an I-vector model, a JFA (Joint Factor Analysis) model, etc.).
According to some embodiments of the present application, the speaker template acquisition module 130 is configured to determine whether a speaker feature set satisfies a clustering condition, where the speaker feature set may include multiple speaker features from multiple speakers 200. When the speaker feature set satisfies the clustering condition, the speaker template acquisition module 130 can cluster the multiple speaker features in the set into at least one speaker feature category, where speaker feature categories correspond one-to-one to speakers 200 and each category includes at least one speaker feature from the set. For each speaker feature category, the speaker template acquisition module 130 can then obtain, based on the at least one speaker feature within that category, a speaker template for the speaker 200 associated with that category, where speaker templates also correspond one-to-one to speakers 200.
In one example, the clustering condition may include at least one of the following: the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to a first clustering threshold; the number of speaker features in at least one speaker feature category is equal to a second clustering threshold.
In one example, for each speaker feature category, the speaker template corresponding to that category may be determined based on the mean and/or a weighted sum of the at least one speaker feature within the category.
According to some embodiments of the present application, the speaker matching module 140 is configured to determine, based on whether the current speaker feature of a current speaker 200 matches one of at least one existing speaker template, whether the current speaker 200 matches one of the at least one speaker 200 corresponding to the at least one speaker template.
In one example, when the speaker matching module 140 determines that the current speaker feature of the current speaker 200 matches one of the at least one existing speaker template, it determines that the current speaker 200 matches the speaker 200 corresponding to the matched speaker template; when it determines that the current speaker feature of the current speaker 200 matches none of the speaker templates, it determines that the current speaker 200 matches none of the speakers 200 corresponding to those templates.
In one example, the speaker matching module 140 may determine, based on the similarity between the current speaker feature of the current speaker 200 and each of the at least one existing speaker template, whether the current speaker feature matches one of the at least one existing speaker template.
According to some embodiments of the present application, the user identification acquisition module 150 is configured to acquire the user identification of a speaker 200 based on the interaction between the interaction module 110 and the speaker 200, where the user identification may include, but is not limited to, name, gender, age, permissions, preferences, etc.
According to some embodiments of the present application, the voice attribute recognition module 160 is configured to recognize the voice attribute information of a speaker 200 based on the speaker's voice features (for example, but not limited to, FBank features, MFCC features, etc.), where the voice attribute information may include, but is not limited to, the gender attribute of the voice, the age attribute of the voice, and so on.
In some examples, the interaction module 110 may interact with a speaker 200 based on the speaker's voice attribute information, so that the user identification acquisition module 150 acquires the user identification of the speaker 200.
The functions of the modules of the speaker recognition apparatus 100 are further described below with reference to FIG. 2 and FIG. 3.
According to some embodiments of the present application, the speaker feature acquisition module 120 may preprocess the voice input 300 of a speaker 200, where the preprocessing may include, but is not limited to, signal enhancement, dereverberation, denoising, etc. In addition, the speaker feature acquisition module 120 may extract the voice features of the speaker 200, for example, but not limited to, FBank features, MFCC features, etc., from the preprocessed voice input 300.
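By way of illustration only, the following is a minimal Python sketch of such feature extraction using the librosa library; the sampling rate, the number of coefficients, and the function name are assumptions of the example rather than parameters specified by the embodiments.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=20):
    # Load the (already preprocessed) voice input and compute MFCC features.
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # shape: (num_frames, n_mfcc)
```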
According to other embodiments of the present application, the speaker feature acquisition module 120 may also determine the signal-to-noise ratio of the voice input 300 of a speaker 200, so that the speaker template acquisition module 130 can select high-quality voice inputs for obtaining speaker templates.
According to some embodiments of the present application, the speaker feature acquisition module 120 may also obtain the speaker features of a speaker 200 based on the speaker's voice features and a voiceprint model of the speaker 200. The voiceprint model describes the spatial distribution of the speaker's voice features; examples of voiceprint models may include, but are not limited to, the GMM-UBM model, the I-vector model, and the JFA model. The GMM-UBM model uses Gaussian probability density functions to describe the spatial distribution of a speaker's voice features. Building a GMM-UBM model involves two steps: first, a universal background model (UBM) describing features common to speakers is trained on voice data from a large number of speakers; then, using the UBM as the initial model, adaptive training based on maximum a posteriori probability is performed with each speaker's voice features, yielding that speaker's Gaussian mixture model (GMM). The JFA model builds on the speaker's GMM model and defines an eigenvoice space, an eigenchannel space, and a residual space to describe the spatial distribution of the voice features of the speaker 200; that is, it splits the mean supervector of the speaker's GMM (formed by concatenating the mean vectors of the individual Gaussian probability density functions) into a speaker-related supervector and a channel-related supervector, so that channel interference can be removed and a more accurate description of the speaker obtained. The I-vector model is also based on the speaker's GMM model; it defines a total variability space that captures both inter-speaker and inter-channel differences to describe the spatial distribution of the voice features of the speaker 200, and uses this space to extract a more compact I-vector from the mean supervector of the speaker's GMM.
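As an illustrative sketch only, the following Python code shows how a GMM-UBM mean supervector of the kind described above could be computed with scikit-learn; the number of mixture components and the relevance factor are assumed values chosen for the example, not values prescribed by the embodiments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features, n_components=64):
    # Train the universal background model on voice features pooled
    # from a large number of speakers.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(pooled_features)
    return ubm

def mean_supervector(ubm, speaker_features, relevance=16.0):
    # MAP-adapt the UBM means toward one speaker's voice features and
    # concatenate the adapted means into a mean supervector.
    feats = np.asarray(speaker_features)
    gamma = ubm.predict_proba(feats)                    # (N, K) responsibilities
    n_k = gamma.sum(axis=0)                             # soft counts per component
    e_k = gamma.T @ feats                               # (K, D) weighted feature sums
    e_k = e_k / np.maximum(n_k, 1e-10)[:, None]         # per-component means
    alpha = (n_k / (n_k + relevance))[:, None]          # adaptation coefficients
    adapted = alpha * e_k + (1.0 - alpha) * ubm.means_  # MAP mean adaptation
    return adapted.ravel()                              # (K * D,) supervector
```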
As an example, when the voiceprint model of a speaker 200 is a GMM-UBM model, the speaker feature determined by the speaker feature acquisition module 120 may be the mean supervector of the GMM-UBM model; when the voiceprint model of the speaker 200 is a JFA model, the speaker feature determined by the module 120 may be the speaker-related supervector of the JFA model; and when the voiceprint model of the speaker 200 is an I-vector model, the speaker feature determined by the module 120 may be the I-vector of the I-vector model.
It should be noted that the voiceprint model and the speaker features of a speaker 200 are not limited to the above; other types of voiceprint models and speaker features may also be used.
FIG. 3 shows a schematic structural diagram of the speaker template acquisition module 130 according to an embodiment of the present application. As shown in the figure, the speaker template acquisition module 130 includes, but is not limited to, a speaker feature set maintenance unit 131, a speaker feature clustering unit 132, and a speaker template acquisition unit 133.
According to some embodiments of the present application, the speaker feature set maintenance unit 131 may add the speaker features of a speaker 200 to the speaker feature set according to predetermined rules. The unit 131 may also determine whether the speaker feature set satisfies the clustering condition; if the speaker feature set satisfies the clustering condition, the unit 131 triggers the speaker feature clustering unit 132 to cluster the multiple speaker features in the set into at least one speaker feature category.
As an example, the predetermined rules may include: when no speaker template exists yet, the speaker feature set maintenance unit 131 may directly add the current speaker feature of the current speaker 200 to the speaker feature set; when speaker templates already exist and the speaker matching module 140 determines that the current speaker feature of the current speaker 200 matches one speaker template, the unit 131 may add the current speaker feature to the speaker feature category corresponding to the matched template; when speaker templates already exist and the speaker matching module 140 determines that the current speaker feature matches none of the templates, the unit 131 may add the current speaker feature to the speaker feature set without assigning it to any speaker feature category.
As another example, the predetermined rules may further include: the speaker feature set maintenance unit 131 may determine, according to the signal-to-noise ratio of the voice input 300 of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set. Specifically, if the signal-to-noise ratio of the voice input 300 of the current speaker 200 is higher than or equal to a signal-to-noise ratio threshold, the unit 131 determines to add the current speaker feature to the speaker feature set; if the signal-to-noise ratio is lower than the threshold, the unit 131 determines not to add it.
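Purely as an illustration, such a signal-to-noise gate might be sketched as follows; the power-based SNR estimate from a noise-only segment and the 15 dB threshold are assumptions of the example.

```python
import numpy as np

def estimate_snr_db(speech_samples, noise_samples):
    # Rough SNR estimate from a speech segment and a noise-only segment.
    p_speech = float(np.mean(np.square(speech_samples)))
    p_noise = float(np.mean(np.square(noise_samples))) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)

def keep_for_feature_set(snr_db, snr_threshold_db=15.0):
    # Only features from sufficiently clean voice inputs enter the set.
    return snr_db >= snr_threshold_db
```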
As an example, the clustering condition may include: the number of speaker features not included in any speaker feature category (for example, speaker features not included in any category because no speaker feature category exists yet, or speaker features determined by the speaker matching module 140 to match none of the speaker templates) is greater than or equal to the first clustering threshold.
As another example, the clustering condition may include: the number of speaker features in at least one speaker feature category is equal to the second clustering threshold, where the second clustering threshold may include one or more values, for example, but not limited to, 50, 100, 200, 500, or other values.
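The two conditions above can be checked together; a minimal sketch follows, in which the threshold values are illustrative assumptions.

```python
def clustering_condition_met(num_unassigned, category_sizes,
                             first_threshold=10, second_threshold=50):
    # Condition 1: enough features lie outside every existing category.
    # Condition 2: some category has grown to the second clustering threshold.
    return (num_unassigned >= first_threshold
            or any(size == second_threshold for size in category_sizes))
```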
According to some embodiments of the present application, the speaker feature clustering unit 132 is configured to cluster, in response to a trigger instruction from the speaker feature set maintenance unit 131, the multiple speaker features in the speaker feature set into at least one speaker feature category, and to assign a system identifier (for example, but not limited to, Speaker A, Speaker B, etc.) to the at least one speaker feature category, where a system identifier is associated with one speaker feature category and is also associated with the at least one speaker feature within that category and the speaker template corresponding to that category. Examples of clustering algorithms may include, but are not limited to, the Mean-Shift clustering algorithm, density clustering algorithms (for example, but not limited to, the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm), hierarchical clustering algorithms, or other clustering algorithms, where the Mean-Shift clustering algorithm clusters speaker features based on the offsets between them, density clustering algorithms cluster speaker features based on their density distribution, and hierarchical clustering algorithms cluster speaker features based on the similarity between every two speaker features.
As an example, when the number of speaker features not included in any speaker feature category is greater than or equal to the first clustering threshold, the speaker feature clustering unit 132 may cluster only those speaker features not included in any category, or may cluster all speaker features in the speaker feature set; when the number of speaker features in at least one speaker feature category is equal to the second clustering threshold, the unit 132 may cluster all speaker features in the speaker feature set.
The clustering of speaker features by the speaker feature clustering unit 132 is described below using the Mean-Shift clustering algorithm as an example (a code sketch follows the steps). The Mean-Shift clustering algorithm may include the following steps:
S1: In the set of speaker features to be clustered, randomly select one speaker feature as the center point;
S2: Determine all speaker features within the region of radius r centered on the center point, assign these speaker features to the same cluster, and record the number of times each of these speaker features has appeared;
S3: Compute the offset vectors (i.e., difference vectors) from the center point to each speaker feature determined in S2, and take the mean of these offset vectors as the mean-shift vector;
S4: Move the center point along the direction of the mean-shift vector by a distance equal to the magnitude of the mean-shift vector, obtaining a new center point;
S5: Repeat steps S2 to S4 until the magnitude of the mean-shift vector is smaller than a predetermined value; it should be noted that all speaker features involved in this iterative process are assigned to the same cluster;
S6: Repeat steps S1 to S5 until every speaker feature in the set to be clustered has a corresponding cluster;
S7: Determine the speaker feature categories: for each speaker feature in each cluster, take the cluster in which that speaker feature appeared most often as the speaker feature category to which it belongs.
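As a sketch only, an equivalent result can be obtained with an off-the-shelf Mean-Shift implementation; the bandwidth parameter below plays the role of the radius r and its value is an assumption of the example.

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_speaker_features(features, bandwidth=0.5):
    # Each resulting label corresponds to one speaker feature category;
    # the cluster centers can serve as initial speaker templates.
    features = np.asarray(features)
    ms = MeanShift(bandwidth=bandwidth)
    labels = ms.fit_predict(features)
    return labels, ms.cluster_centers_
```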
According to some embodiments of the present application, for each speaker feature category, the speaker template acquisition unit 133 may determine the mean of the at least one speaker feature within that category and use it as the speaker template. For example, when the speaker features are I-vectors of an I-vector model, the unit 133 may use the mean vector of the at least one I-vector within the category as the speaker template.
According to some embodiments of the present application, for each speaker feature category, the speaker template acquisition unit 133 may determine a weighted sum of the at least one speaker feature within that category and use it as the speaker template, where the weight of each speaker feature may be determined according to the signal-to-noise ratio of the voice input corresponding to that speaker feature.
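For illustration, both template constructions are sketched below; normalizing the (assumed positive) SNR values into weights is a choice made for the example.

```python
import numpy as np

def build_template(features, snrs=None):
    # Mean template by default; SNR-weighted template when SNRs are given.
    feats = np.asarray(features, dtype=float)
    if snrs is None:
        return feats.mean(axis=0)
    weights = np.asarray(snrs, dtype=float)
    weights = weights / weights.sum()  # higher SNR -> larger weight
    return (weights[:, None] * feats).sum(axis=0)
```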
According to some embodiments of the present application, the speaker matching module 140 may first determine whether any speaker template already exists. When the module 140 determines that no speaker template exists (for example, the speaker feature set has not yet been clustered), it does not perform matching for the current speaker 200 and sends the current speaker feature of the current speaker 200 to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 of the module 130 can add the current speaker feature to the speaker feature set.
When the speaker matching module 140 determines that speaker templates already exist (for example, the speaker feature set has been clustered), it may determine the similarity between the current speaker feature and each speaker template, as well as the maximum of these similarities, and determine whether this maximum similarity is higher than or equal to a similarity threshold. In one example, the similarity between the current speaker feature and a speaker template may be determined by computing the distance between them, where the distance may include, but is not limited to, cosine distance, EMD (Earth Mover's Distance), Euclidean distance, Manhattan distance, etc.
On the one hand, when the speaker matching module 140 determines that the maximum similarity is higher than or equal to the similarity threshold, it may determine that the current speaker feature matches the speaker template with the maximum similarity, i.e., that the current speaker 200 matches the speaker 200 corresponding to that speaker template.
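A minimal sketch of this matching step, using cosine similarity and an assumed threshold of 0.7, might look as follows.

```python
import numpy as np

def match_speaker(current_feature, templates, threshold=0.7):
    # Compare the current speaker feature against every template and
    # accept the best match only if it clears the similarity threshold.
    x = np.asarray(current_feature, dtype=float)
    sims = [float(np.dot(x, t) / (np.linalg.norm(x) * np.linalg.norm(t)))
            for t in templates]
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return best, sims[best]  # index of the matched speaker template
    return None, sims[best]      # no speaker template matches
```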
Subsequently, the speaker matching module 140 may also send the current speaker feature and the matching result for the current speaker 200 (for example, but not limited to, the speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, for example, but not limited to, Speaker A) to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can add the current speaker feature to the speaker feature category corresponding to the matched speaker template (hereinafter referred to as the speaker feature category matching the current speaker 200, for example, but not limited to, Speaker A).
Subsequently, the speaker feature set maintenance unit 131 may determine whether the number of speaker features in the speaker feature category matching the current speaker 200 (for example, but not limited to, Speaker A) is equal to the second clustering threshold. When the unit 131 determines that the number of speaker features in that category is equal to the second clustering threshold, it may trigger the speaker feature clustering unit 132 to re-cluster all speaker features in the speaker feature set using a clustering algorithm, so as to obtain at least one updated speaker feature category, where the updated speaker feature categories correspond one-to-one to speakers 200 and each updated category includes at least one speaker feature from the speaker feature set.
The speaker feature set maintenance unit 131 may determine the system identifier associated with each updated speaker feature category. Specifically, for each updated speaker feature category, the speaker template acquisition unit 133 may determine, among the system identifiers associated with the updated at least one speaker feature within that category, the system identifier corresponding to the largest number of speaker features, and associate that identifier with the updated category and with the updated at least one speaker feature within it. The unit 131 may also determine, among the at least one updated speaker feature category, the updated category matching the current speaker 200, namely the category whose associated system identifier is the same as the system identifier associated with the speaker feature category that matched the current speaker 200 before re-clustering.
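A small sketch of this majority-vote association follows; representing unlabeled features as None is an assumption of the example. For instance, an updated category whose features carry 48 "Speaker A" identifiers and 2 "Speaker B" identifiers keeps "Speaker A".

```python
from collections import Counter

def assign_system_identifier(feature_identifiers):
    # After re-clustering, keep the system identifier carried by the
    # largest number of features in the updated category.
    counts = Counter(i for i in feature_identifiers if i is not None)
    return counts.most_common(1)[0][0] if counts else None
```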
The speaker template acquisition unit 133 may obtain, based on the updated at least one speaker feature within the updated speaker feature category matching the current speaker 200, an updated speaker template corresponding to that updated category. The unit 133 may also obtain, based on the updated at least one speaker feature within each updated speaker feature category, an updated speaker template corresponding to each updated category, where updated speaker templates correspond one-to-one to speakers 200. For the acquisition of the updated speaker templates, reference may be made to the description above, which is not repeated here.
In addition, when the speaker feature set maintenance unit 131 determines that the number of speaker features in the speaker feature category matching the current speaker 200 is not equal to the second clustering threshold, the unit 131 does not trigger the speaker feature clustering unit 132 to re-cluster the speaker feature set. However, the speaker template acquisition unit 133 may update the speaker template corresponding to the category matching the current speaker 200 based on the speaker features within that category (including the current speaker feature); for the acquisition of the speaker template, reference may be made to the description above, which is not repeated here.
On the other hand, when the speaker matching module 140 determines that the maximum similarity between the current speaker feature and the speaker templates is lower than the similarity threshold, it may determine that the current speaker feature matches none of the speaker templates, i.e., that the current speaker 200 matches none of the speakers 200 corresponding to those templates.
Subsequently, the speaker matching module 140 may send the current speaker feature and the matching result for the current speaker 200 (for example, but not limited to, information indicating that the current speaker 200 matches none of the speakers 200 corresponding to the speaker templates) to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can add the current speaker feature to the speaker feature set without assigning it to any speaker feature category.
Subsequently, the speaker feature set maintenance unit 131 may determine whether the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to the first clustering threshold. When the unit 131 determines that this number is greater than or equal to the first clustering threshold, it may trigger the speaker feature clustering unit 132 to cluster those unassigned speaker features using a clustering algorithm, obtain at least one new speaker feature category, and assign a new system identifier (for example, but not limited to, Speaker C, Speaker D, etc.) to the at least one new category, where the new speaker feature categories correspond one-to-one to speakers 200, each new category includes at least one speaker feature, and a new system identifier may be associated with one new category as well as with the at least one speaker feature within that category and the speaker template corresponding to it. In another example, the unit 131 may trigger the speaker feature clustering unit 132 to re-cluster all speaker features in the speaker feature set using a clustering algorithm, so as to obtain at least one updated speaker feature category, where the updated categories correspond one-to-one to speakers 200 and each updated category includes at least one speaker feature from the set.
Subsequently, the speaker template acquisition unit 133 may obtain, based on the at least one speaker feature within each new speaker feature category, a new speaker template corresponding to that category, where new speaker templates correspond one-to-one to speakers 200. In another example, the unit 133 may also obtain, based on the updated at least one speaker feature within each updated speaker feature category, an updated speaker template corresponding to each updated category, where updated speaker templates correspond one-to-one to speakers 200. For the acquisition of the speaker templates, reference may be made to the description above, which is not repeated here.
In addition, when the speaker feature set maintenance unit 131 determines that the number of speaker features in the speaker feature set not included in any speaker feature category is smaller than the first clustering threshold, the unit 131 does not trigger the speaker feature clustering unit 132 to perform clustering.
According to some embodiments of the present application, when the speaker recognition apparatus 100 includes the user identification acquisition module 150, the module 150 may determine, based on the matching result for the current speaker 200, whether the user identification of the current speaker 200, for example, name, gender, age, permissions, preferences, etc., needs to be acquired.
On the one hand, when the matching result for the current speaker 200 includes the speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, and the user identification acquisition module 150 determines that the user identification of that speaker feature category has not yet been acquired, the module 150 may trigger the interaction module 110 to interact with the current speaker 200 and determine, according to the interactive voice input of the current speaker 200, the user identification of the speaker feature category matching the current speaker 200.
FIG. 4 is a schematic diagram of a scenario in which the current speaker 200 and the speaker recognition apparatus 100 interact by voice according to an embodiment of the present application. It should be noted that the interaction module 110 of the apparatus 100 may also interact with the current speaker 200 in text form. In the scenario shown in FIG. 4, to determine the name and preference information of the current speaker 200, the following voice interaction may take place between the interaction module 110 and the current speaker 200:
Interaction module 110: "I've been listening to your voice for a while now. May I ask your name?"
Current speaker 200: "Zhang San."
Interaction module 110: "Nice to meet you. May I get to know you a little better?"
Current speaker 200: "Sure."
Interaction module 110: "Which of the following types of movies do you prefer: kung fu, comedy, thriller..."
Current speaker 200: "Comedy."
In another example, to determine the name and permission information of the current speaker 200, the following voice interaction may take place between the interaction module 110 and the current speaker 200:
Interaction module 110: "I've been listening to your voice for a while now. May I ask your name?"
Current speaker 200: "Zhang San."
Interaction module 110: "Hello, Zhang San. Would you like to set your own permission level? Host has higher permissions, and Guest has guest permissions."
Current speaker 200: "Host permissions."
Interaction module 110: "Please enter the highest-permission password."
Current speaker 200: "1 2 3 4 5 6"
Interaction module 110: "The password is incorrect. Please enter it again."
Current speaker 200: "6 5 4 3 2 1"
Interaction module 110: "The password is correct. Mr. Zhang, congratulations, you now have Host permissions."
Based on the above voice interactions, the user identification acquisition module 150 may determine that the name of the current speaker 200 is "Zhang San", the preference is "comedy", and the permission is "Host permissions".
After acquiring the user identification of the current speaker 200, the user identification acquisition module 150 may send the user identification to the speaker template acquisition module 130, so that when the speaker feature set maintenance unit 131 adds the current speaker feature to the speaker feature category matching the current speaker 200, it associates the user identification of the current speaker with that category, with the speaker features within it, and with the speaker template corresponding to it.
In this way, the interaction module 110 can provide the current speaker with a more personalized and more intelligent interactive experience based on the user identification of the current speaker 200. In addition, when the speaker recognition apparatus 100 determines that some future speaker 200 matches this speaker feature category, the interaction module 110 can identify the future speaker 200 according to the user identification associated with the speaker features in that category and provide the future speaker 200 with a more personalized and more intelligent interactive experience, for example, but not limited to, the following interaction scenarios:
Future speaker 200: "Set the air conditioner to 25 degrees for me."
Interaction module 110: (determining, from the user identification of the speaker feature category matching the future speaker 200, that the future speaker's name is Li Si) "Okay, Mr. Li, the air conditioner has been set to 25 degrees for you."
Future speaker 200: "Play the latest movie for me."
Interaction module 110: (determining, from the user identification of the speaker feature category matching the future speaker 200, that the future speaker's preference is comedy) "Mr. Li, a highly rated comedy movie, 《西红柿首富》, was released recently. Please enjoy."
Future speaker 200: "Delete all local movies for me."
Interaction module 110: (determining, from the user identification of the speaker feature category matching the future speaker 200, that the future speaker's permission is Guest) "Mr. Li, sorry, your current permission level is insufficient. Please upgrade to the highest permission level."
On the other hand, when the matching result for the current speaker 200 includes the speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, and the user identification acquisition module 150 determines that the user identification of that category has already been acquired, or when the matching result for the current speaker 200 includes information indicating that the current speaker 200 matches none of the speakers 200 corresponding to the speaker templates, the module 150 will not acquire the user identification of the current speaker 200, i.e., it will not trigger the interaction module 110 to interact with the current speaker 200.
According to some embodiments of the present application, the speaker recognition apparatus 100 may further include the voice attribute recognition module 160. When the user identification acquisition module 150 determines that the user identification of the current speaker 200 needs to be acquired, the voice attribute recognition module 160 may recognize the voice attribute information of the current speaker 200 based on the current speaker's voice features (for example, but not limited to, FBank features, MFCC features, etc.), where the voice attribute information may include, but is not limited to, the gender attribute of the voice (male voice, female voice), the age attribute of the voice (for example, child, adult, etc.), and so on. In the embodiments of the present application, the voice attribute recognition module 160 may use any voice attribute recognition technology in the prior art to recognize the voice attributes of the current speaker 200, for example, but not limited to, a voice classifier obtained by training a classification neural network on a large amount of voice sample data.
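Purely as a sketch, and substituting a simple logistic-regression classifier for the classification neural network mentioned above, such an attribute classifier could be trained as follows; summarizing each utterance by its mean MFCC vector and the label encoding are assumptions of the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_voice_attribute_classifier(mfcc_features_list, labels):
    # Each training sample is one utterance, summarized by the mean of
    # its MFCC frames; labels encode the attribute (e.g., 0 = adult,
    # 1 = child, or a gender attribute).
    X = np.stack([np.mean(m, axis=0) for m in mfcc_features_list])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, np.asarray(labels))
    return clf
```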
After the voice attribute recognition module 160 recognizes the voice attribute information of the current speaker 200, it may send the voice attribute information to the interaction module 110, so that the interaction module 110 interacts with the current speaker 200 based on this information in order to acquire the current speaker's user identification. For example, the voice attribute recognition module 160 may recognize that the voice of the current speaker 200 is a child's voice; when the interaction module 110 then interacts with the current speaker 200, it may address the current speaker 200 as "little friend", and to determine the name of the current speaker 200, the following interaction scenario may take place between the interaction module 110 and the current speaker 200:
Interaction module 110: "Hello, little friend, you sound like a lovely child. What's your name?"
Current speaker 200: "My name is Yaya."
After acquiring the user identification of the current speaker 200, the user identification acquisition module 150 and the voice attribute recognition module 160 may respectively send the current speaker's user identification and voice attribute information to the speaker template acquisition module 130, so that when the speaker feature set maintenance unit 131 adds the current speaker feature to the speaker feature category matching the current speaker 200, it associates the current speaker's voice attribute information and user identification with that category, with the speaker features within it, and with the speaker template corresponding to it.
In this way, the interaction module 110 can provide the current speaker with a more personalized and more intelligent interactive experience based on the user identification and voice attribute information of the current speaker 200. In addition, when the speaker recognition apparatus 100 determines that some future speaker 200 matches this speaker feature category, the interaction module 110 can identify the future speaker 200 according to the voice attribute information and user identification associated with that category and provide the future speaker 200 with a more personalized and more intelligent interactive experience, for example, but not limited to, the following interaction scenario:
Future speaker 200: "Play a song for me."
Interaction module 110: (determining, from the user identification of the speaker feature category matching the future speaker 200, that the future speaker's name is Yaya, and determining from the voice attribute information that the age attribute of the voice is child) "Hello, Yaya. Next I will play 《两只老虎》 ("Two Tigers") for you."
According to some other embodiments of the present application, the voice attribute recognition module 160 may determine, according to the matching result of the speaker recognition apparatus 100 for the current speaker 200, whether the voice attributes of the current speaker 200 need to be recognized. For example, when the matching result for the current speaker 200 includes the speaker template matching the current speaker feature, or the system identifier of the speaker feature category corresponding to that template, and the module 160 determines that the voice attribute information of that speaker feature category has not yet been acquired, the module 160 may recognize the voice attributes of the current speaker 200 and send the voice attribute information to the speaker template acquisition module 130, so that the speaker feature set maintenance unit 131 can associate the current speaker's voice attribute information with the speaker features in the category matching the current speaker 200. When the matching result for the current speaker 200 includes the speaker template matching the current speaker feature, or the system identifier of the corresponding speaker feature category, and the module 160 determines that the voice attribute information of that category has already been recognized, or when the matching result includes information indicating that the current speaker 200 matches none of the speakers 200 corresponding to the speaker templates, the module 160 does not recognize the voice attributes of the current speaker 200.
In this way, the interaction module 110 can provide the current speaker with a more personalized and more intelligent interactive experience based on the voice attribute information of the current speaker 200. In addition, when the speaker recognition apparatus 100 determines that some future speaker 200 matches this speaker feature category, the interaction module 110 can provide the future speaker 200 with a more personalized and more intelligent interactive experience according to the voice attribute information associated with that category.
According to some other embodiments of the present application, regardless of the matching result for the current speaker 200, the voice attribute recognition module 160 may recognize the voice attributes of the current speaker 200 and send the voice attribute information to the interaction module 110, so that the interaction module 110 can provide the current speaker 200 with a more personalized and more intelligent interactive experience based on the voice attribute information, for example, but not limited to, the following interaction scenarios:
Current speaker 200: "Turn on the air conditioner for me."
Interaction module 110: (the voice attribute recognition module 160 determines that the age attribute of the voice is child) "The air conditioner has been turned on and set to 28 degrees Celsius."
Interaction module 110: (the voice attribute recognition module 160 determines that the age attribute of the voice is adult) "The air conditioner has been turned on and set to 25 degrees Celsius."
It should be noted that, when the speaker features in a speaker feature category have user identifications and/or voice attribute information associated with them and re-clustering occurs, then for each updated speaker feature category after re-clustering, the user identification and/or voice attribute information corresponding to the largest number of speaker features may be taken as the user identification and/or voice attribute information associated with that updated category. For example, if within one updated speaker feature category after re-clustering, 48 speaker features correspond to the user identification "Zhang San" and 2 speaker features correspond to the user identification "Li Si", then the user identification "Zhang San" is associated with that updated category.
In the embodiments of the present application, a speaker does not need to perform dedicated voice registration; instead, during voice interactions with speakers, the speaker features of different speakers are clustered to obtain a speaker template corresponding to each speaker, which is then used to identify different speakers. The embodiments of the present application can therefore achieve imperceptible registration and avoid the negative experience that registration brings to the speaker.
Further, after the current speaker is successfully identified, adding the current speaker's speaker feature to the speaker feature category matching the current speaker and updating the speaker template of that category can improve the accuracy of speaker recognition. In addition, setting a clustering condition on the speaker feature set and re-clustering the set once the condition is satisfied, thereby updating the speaker templates of existing speaker feature categories or obtaining speaker templates for new speaker feature categories, can also improve the accuracy of speaker recognition.
Further, by determining the user identification and/or voice attribute information of a speaker, a more personalized and more intelligent interactive experience can be provided to the speaker.
FIG. 5 shows a schematic flowchart of a speaker recognition method according to an embodiment of the present application. Different components of the speaker recognition apparatus 100 in FIGS. 2-3 may implement different blocks or other parts of the method. For content not described in the apparatus embodiments above, reference may be made to the method embodiments below; likewise, for content not described in the method embodiments, reference may be made to the apparatus embodiments above. It should be noted that the order in which the method steps are described should not be interpreted as meaning that the steps must be executed in that order; the steps need not be executed in the order described, the method may include steps other than these, and it may include only some of these steps. As shown in FIG. 5, the speaker recognition method includes:
Block 501: receive, through the interaction module 110, a current voice input from the current speaker 200;

Block 502: preprocess, through the speaker feature acquisition module 120, the current voice input 300 of the current speaker 200, where the preprocessing may include, but is not limited to, signal enhancement, dereverberation, denoising, and the like; in addition, the speaker feature acquisition module 120 may also extract current voice features of the current speaker 200 from the preprocessed current voice input 300, such as, but not limited to, FBank features, MFCC features, and the like;
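As a minimal illustration of the feature extraction in block 502 (assuming the librosa library is available; the sampling rate, frame length, hop length, and filter counts are illustrative assumptions), FBank and MFCC features can be computed as follows:

```python
# Minimal sketch of the feature extraction in block 502; all frame
# parameters are illustrative assumptions.
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000) -> dict:
    y, sr = librosa.load(wav_path, sr=sr)
    # Log-Mel filterbank (FBank) features: 25 ms frames, 10 ms hop
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=40)
    fbank = np.log(mel + 1e-6)
    # MFCC features derived from the same Mel spectrogram settings
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=400, hop_length=160)
    return {"fbank": fbank, "mfcc": mfcc}
```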
Block 503: obtain, through the speaker feature acquisition module 120, the current speaker feature of the current speaker 200 based on the current voice features of the current speaker 200 and according to a voiceprint model, such as, but not limited to, a GMM-UBM, an I-vector model, or a JFA model; the current speaker feature may be, for example, but is not limited to, the mean supervector of the GMM-UBM, the speaker-dependent supervector of the JFA model, or the I-vector of the I-vector model;

Block 504: determine, through the speaker matching module 140, whether at least one speaker template exists; if so, execute block 505; if not, execute block 507;

Block 505: determine, through the speaker matching module 140, the similarity between the current speaker feature of the current speaker 200 and each speaker template;

In one example, the similarity between the current speaker feature and each speaker template may be determined by calculating the distance between them, where the distance may include, but is not limited to, a cosine distance, an EMD (Earth Mover's Distance), a Euclidean distance, or a Manhattan distance;

Block 506: determine, through the speaker matching module 140, whether the current speaker 200 matches one of the at least one speaker 200 corresponding to the at least one speaker template, according to the maximum similarity between the current speaker feature of the current speaker 200 and the speaker templates;

In one example, when the speaker matching module 140 determines that the maximum similarity is higher than or equal to a similarity threshold, it may determine that the current speaker feature matches the speaker template having the maximum similarity, that is, that the current speaker 200 matches the speaker 200 corresponding to that speaker template; when the speaker matching module 140 determines that the maximum similarity is lower than the similarity threshold, it may determine that the current speaker feature matches none of the speaker templates, that is, that the current speaker 200 matches none of the speakers 200 corresponding to the speaker templates;
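As a minimal illustration of blocks 505 and 506 (assuming fixed-length speaker feature vectors and cosine similarity; the threshold of 0.7 is an illustrative assumption, not a value from the application):

```python
# Minimal sketch of blocks 505-506: cosine similarity against each
# template, then a threshold decision on the maximum similarity.
import numpy as np

def match_speaker(current_feature: np.ndarray,
                  templates: dict[str, np.ndarray],
                  threshold: float = 0.7):
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    similarities = {sid: cosine(current_feature, t)
                    for sid, t in templates.items()}
    best_id = max(similarities, key=similarities.get)
    if similarities[best_id] >= threshold:
        return best_id          # matched speaker identifier
    return None                 # no template matches
```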
Block 507: add, through the speaker feature set maintenance unit 131 of the speaker template acquisition unit 130, the current speaker feature to the speaker feature set;

In one example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to that matching speaker template; when the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set in such a way that it does not belong to any speaker feature category;

In another example, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set; specifically, if the signal-to-noise ratio of the current voice input is higher than or equal to a signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set, and if the signal-to-noise ratio of the current voice input is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
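As a minimal illustration of this signal-to-noise ratio gate (assuming an energy-based SNR estimate over a speech segment and a noise segment; the 15 dB threshold is an illustrative assumption):

```python
# Minimal sketch of the SNR gate in block 507; threshold and SNR
# estimator are illustrative assumptions.
import numpy as np

SNR_THRESHOLD_DB = 15.0  # assumed value

def estimate_snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    # Ratio of speech power to noise power, in decibels
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise)

def maybe_add_feature(feature, speech, noise, feature_set: list) -> bool:
    if estimate_snr_db(speech, noise) >= SNR_THRESHOLD_DB:
        feature_set.append(feature)
        return True
    return False
```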
Block 508: determine, through the speaker feature set maintenance unit 131 of the speaker template acquisition unit 130, whether the speaker feature set meets a clustering condition; if so, execute block 509; if not, return to block 501, that is, receive the next voice input;

In one example, when the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, determining whether the speaker feature set meets the clustering condition may include determining whether the number of speaker features in the speaker feature set that are not included in any speaker feature category is greater than or equal to a first clustering threshold;

In another example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, determining whether the speaker feature set meets the clustering condition may include determining whether the number of speaker features in the speaker feature category corresponding to that matching speaker template is equal to a second clustering threshold, where the second clustering threshold may include one or more values, for example, but not limited to, 50, 100, 200, or 500;
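As a minimal illustration of the two clustering conditions (the first clustering threshold of 50 is an illustrative assumption; the milestone values are those named in the text):

```python
# Minimal sketch of the clustering condition check in block 508.
from typing import Optional

FIRST_CLUSTER_THRESHOLD = 50                     # assumed value
SECOND_CLUSTER_THRESHOLDS = {50, 100, 200, 500}  # values from the text

def should_recluster(uncategorized_count: int,
                     matched_category_size: Optional[int]) -> bool:
    if matched_category_size is None:
        # No template matched: trigger once enough uncategorized
        # speaker features have accumulated
        return uncategorized_count >= FIRST_CLUSTER_THRESHOLD
    # A template matched: trigger whenever the matched category's
    # size reaches one of the milestone values
    return matched_category_size in SECOND_CLUSTER_THRESHOLDS
```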
Block 509: cluster, through the speaker feature clustering unit 132 of the speaker template acquisition unit 130, the speaker features using a clustering algorithm;

In one example, when the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, and the number of speaker features not included in any speaker feature category is greater than or equal to the first clustering threshold, the speaker feature clustering unit 132 may cluster those speaker features that are not included in any speaker feature category to obtain at least one new speaker feature category, and assign a new system identifier (for example, but not limited to, Speaker C, Speaker D, etc.) to the at least one new speaker feature category, where the new speaker feature categories correspond one-to-one to speakers 200 and each new speaker feature category includes at least one speaker feature; a new system identifier may be associated with a new speaker feature category, and may also be associated with the at least one speaker feature within that new category and with the speaker template corresponding to that new category; in another example, the speaker feature clustering unit 132 may re-cluster all the speaker features in the speaker feature set to obtain at least one updated speaker feature category, where the at least one updated speaker feature category corresponds one-to-one to speakers 200 and each updated speaker feature category includes at least one speaker feature from the speaker feature set;

In another example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, and the number of speaker features in the speaker feature category corresponding to that matching speaker template is equal to the second clustering threshold, the speaker feature clustering unit 132 may re-cluster all the speaker features in the speaker feature set to obtain at least one updated speaker feature category, where the at least one updated speaker feature category corresponds one-to-one to speakers 200 and each updated speaker feature category includes at least one speaker feature from the speaker feature set;

For a description of the clustering algorithm, refer to the description in the apparatus section above, which is not repeated here;
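As a minimal illustration of block 509 (assuming scikit-learn is available; agglomerative clustering with a cosine metric, average linkage, and a distance threshold of 0.4 are illustrative assumptions, not the algorithm specified by the application):

```python
# Minimal sketch of the re-clustering in block 509.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def recluster(features: np.ndarray) -> np.ndarray:
    # features: shape (n_samples, dim); returns one category label
    # per speaker feature
    clusterer = AgglomerativeClustering(
        n_clusters=None,         # let the distance threshold decide
        distance_threshold=0.4,  # assumed cutoff
        metric="cosine",         # named "affinity" in older scikit-learn
        linkage="average",
    )
    return clusterer.fit_predict(features)
```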
Block 510: obtain, through the speaker template acquisition unit 133 of the speaker template acquisition unit 130, for each speaker feature category, the speaker template corresponding to that category according to the speaker features within the category;

In one example, when the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, and the number of speaker features not included in any speaker feature category is greater than or equal to the first clustering threshold, the speaker template acquisition unit 133 may obtain the new speaker template corresponding to each new speaker feature category according to the at least one speaker feature within that new category; the speaker template acquisition unit 133 may also obtain the updated speaker template corresponding to each updated speaker feature category according to the updated at least one speaker feature within that updated category;

In another example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, and the number of speaker features in the speaker feature category corresponding to that matching speaker template is equal to the second clustering threshold, the speaker template acquisition unit 133 may obtain the updated speaker template corresponding to each updated speaker feature category according to the updated at least one speaker feature within that category; the speaker template acquisition unit 133 may also obtain the updated speaker template corresponding to the updated speaker feature category that matches the current speaker 200, according to the updated at least one speaker feature within that category, where the determination of the updated speaker feature category that matches the current speaker 200 may refer to the description in the apparatus section above;

In another example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, and the number of speaker features in the speaker feature category corresponding to that matching speaker template is not equal to the second clustering threshold, the speaker template acquisition unit 133 may update the speaker template corresponding to the speaker feature category that matches the current speaker 200, according to the speaker features (including the current speaker feature) within that category;

In addition, for the acquisition of the speaker template, refer to the description in the apparatus section above, which is not repeated here.
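As a minimal illustration of block 510, the following sketch computes each template as the mean of its category's speaker features, which is one of the two options (mean or weighted sum) recited in claim 3; the function names are illustrative:

```python
# Minimal sketch of block 510: one template per category, computed
# as the mean of the category's speaker features.
import numpy as np

def build_templates(features: np.ndarray,
                    labels: np.ndarray) -> dict[int, np.ndarray]:
    templates = {}
    for category in np.unique(labels):
        members = features[labels == category]
        templates[int(category)] = members.mean(axis=0)
    return templates
```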
It should be noted that after block 510 is executed, the method may return to block 501, that is, receive the next voice input.
Fig. 6 shows another schematic flowchart of a speaker recognition method according to an embodiment of the present application. Different components of the speaker recognition apparatus 100 in Figs. 2 and 3 may implement different blocks or other parts of the method. For content not described in the foregoing apparatus embodiments, refer to the following method embodiments; likewise, for content not described in the method embodiments, refer to the foregoing apparatus embodiments. It should be noted that the order in which the method steps are described should not be construed to mean that the steps must be executed in that order; the steps need not be executed in the order described, the method may include steps other than those described, and it may also include only some of these steps. As shown in Fig. 6, the speaker recognition method includes:
Blocks 601-606: refer to the description of blocks 501-506, which is not repeated here;

Block 607: determine, through the user identification acquisition module 150, whether the user identification of the current speaker 200 needs to be acquired;

As an example, the user identification acquisition module 150 may determine, according to the matching result for the current speaker 200, whether the user identification of the current speaker 200, for example, a name, gender, age, authority, or preference, needs to be acquired;

For example, when the matching result for the current speaker 200 includes a speaker template that matches the current speaker feature, or the system identifier of the speaker feature category corresponding to that speaker template, and the user identification acquisition module 150 determines that the user identification of that speaker feature category has not yet been acquired, the user identification acquisition module 150 determines that the user identification of the current speaker 200 needs to be acquired and triggers the interaction module 110 to interact with the current speaker 200;

When the matching result for the current speaker 200 includes a speaker template that matches the current speaker feature, or the system identifier of the speaker feature category corresponding to that speaker template, and the user identification acquisition module 150 determines that the user identification of that speaker feature category has already been acquired, or when the matching result for the current speaker 200 includes information indicating that the current speaker 200 matches none of the speakers 200 corresponding to the speaker templates, the user identification acquisition module 150 determines that the user identification of the current speaker 200 does not need to be acquired and does not trigger the interaction module 110 to interact with the current speaker 200;

Block 608: acquire, through the user identification acquisition module 150, the user identification of the current speaker 200 according to the interaction with the current speaker;

Block 609: add, through the speaker feature set maintenance unit 131 of the speaker template acquisition unit 130, the current speaker feature to the speaker feature set;
In one example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to that matching speaker template; in addition, when the user identification acquisition module 150 has acquired the user identification of the current speaker 200, the speaker feature set maintenance unit 131 also associates the user identification of the current speaker with that speaker feature category, the speaker features within that category, and the speaker template corresponding to that category;

In another example, when the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set in such a way that it does not belong to any speaker feature category;

In another example, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set; specifically, if the signal-to-noise ratio of the current voice input is higher than or equal to the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set, and if the signal-to-noise ratio of the current voice input is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
Block 610: refer to the description of block 508, which is not repeated here;

Block 611: refer to the description of block 509, which is not repeated here;

It should be noted that, in a case where the speaker features within a speaker feature category have user identifications associated with them, if re-clustering occurs, then for each updated speaker feature category after the re-clustering, the user identification corresponding to the largest number of speaker features may be used as the user identification associated with that updated speaker feature category; for example, if within one updated speaker feature category after re-clustering there are 48 speaker features corresponding to the user identification "Zhang San" and 2 speaker features corresponding to the user identification "Li Si", then the user identification "Zhang San" is associated with that updated speaker feature category;

Block 612: refer to the description of block 510, which is not repeated here.
Fig. 7 shows another schematic flowchart of a speaker recognition method according to an embodiment of the present application. Different components of the speaker recognition apparatus 100 in Figs. 2 and 3 may implement different blocks or other parts of the method. For content not described in the foregoing apparatus embodiments, refer to the following method embodiments; likewise, for content not described in the method embodiments, refer to the foregoing apparatus embodiments. It should be noted that the order in which the method steps are described should not be construed to mean that the steps must be executed in that order; the steps need not be executed in the order described, the method may include steps other than those described, and it may also include only some of these steps. As shown in Fig. 7, the speaker recognition method includes:

Blocks 701-707: refer to the description of blocks 601-607, which is not repeated here;
Block 708: recognize, through the voice attribute recognition module 160, the voice attributes of the current speaker according to the voice features of the current speaker, where the voice attribute information may include, but is not limited to, the gender attribute of the voice, the age attribute of the voice, and the like; in the embodiments of the present application, the voice attribute recognition module 160 may use any voice attribute recognition technique in the prior art to recognize the voice attributes of the current speaker 200, for example, but not limited to, a voice classifier obtained by training a classification neural network on a large amount of voice sample data.
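As a minimal illustration of such a classifier (assuming scikit-learn and utterance-level mean MFCC vectors as input; the network size and the gender labels are illustrative assumptions, and a production classifier would be trained on a large amount of voice sample data as described):

```python
# Minimal sketch of a voice attribute (gender) classifier for block 708.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_gender_classifier(mfcc_means: np.ndarray,
                            labels: np.ndarray) -> MLPClassifier:
    # mfcc_means: shape (n_utterances, n_mfcc); labels: e.g. "male"/"female"
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    clf.fit(mfcc_means, labels)
    return clf

def predict_gender(clf: MLPClassifier, mfcc_mean: np.ndarray) -> str:
    return str(clf.predict(mfcc_mean.reshape(1, -1))[0])
```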
Block 709: acquire, through the user identification acquisition module 150, the user identification of the current speaker 200 according to an interaction with the current speaker 200 that is based on the speaker's voice attribute information;

In one example, after the voice attribute recognition module 160 recognizes the voice attribute information of the current speaker 200, the voice attribute information may be sent to the interaction module 110 so that the interaction module 110 interacts with the current speaker 200 based on that information, and the user identification acquisition module 150 may determine the user identification of the current speaker 200 according to the interactive voice input of the current speaker 200;
Block 710: add, through the speaker feature set maintenance unit 131 of the speaker template acquisition unit 130, the current speaker feature to the speaker feature set;

In one example, when the speaker matching module 140 determines that the current speaker feature matches the speaker template having the maximum similarity, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature category corresponding to that matching speaker template; in addition, when the voice attribute recognition module 160 has recognized the voice attributes of the current speaker 200 and the user identification acquisition module 150 has acquired the user identification of the current speaker 200, the speaker feature set maintenance unit 131 also associates the voice attributes and the user identification of the current speaker with that speaker feature category, the speaker features within that category, and the speaker template corresponding to that category;

When the speaker matching module 140 determines that the current speaker feature matches none of the speaker templates, the speaker feature set maintenance unit 131 may add the current speaker feature to the speaker feature set in such a way that it does not belong to any speaker feature category;

In another example, the speaker feature set maintenance unit 131 may also determine, according to the signal-to-noise ratio of the current voice input of the current speaker 200, whether to add the current speaker feature of the current speaker 200 to the speaker feature set; specifically, if the signal-to-noise ratio of the current voice input is higher than or equal to the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines to add the current speaker feature to the speaker feature set, and if the signal-to-noise ratio of the current voice input is lower than the signal-to-noise ratio threshold, the speaker feature set maintenance unit 131 determines not to add the current speaker feature to the speaker feature set;
Blocks 711-713: refer to the description of blocks 610-612, which is not repeated here.

It should be noted that, in block 712, in a case where the speaker features within a speaker feature category have user identifications and voice attribute information associated with them, if re-clustering occurs, then for each updated speaker feature category after the re-clustering, the user identification and voice attribute information corresponding to the largest number of speaker features may be used as the user identification and voice attribute information associated with that updated speaker feature category.
In the embodiments of the present application, the speaker does not need to perform dedicated voice registration. Instead, during voice interaction with speakers, the speaker features of different speakers are clustered to obtain a speaker template corresponding to each speaker, which is then used to distinguish different speakers. Therefore, the embodiments of the present application can realize imperceptible registration and avoid the negative experience that explicit registration brings to the speaker.
Further, after the current speaker is successfully identified, adding the speaker feature of the current speaker to the speaker feature category that matches the current speaker and updating the speaker template of that category can improve the accuracy of speaker recognition. In addition, by setting a clustering condition for the speaker feature set and re-clustering the set once the condition is met, so as to update the speaker templates of existing speaker feature categories or to obtain speaker templates for new speaker feature categories, the accuracy of speaker recognition can also be improved.

Further, by determining the user identification and/or voice attribute information of a speaker, a more personalized and more intelligent interactive experience can be provided to the speaker.
Fig. 8 shows a schematic structural diagram of an example system 800 according to an embodiment of the present application. The system 800 may include one or more processors 802, system control logic 808 connected to the one or more processors 802, system memory 804 connected to the system control logic 808, a non-volatile memory (NVM) 806 connected to the system control logic 808, and a network interface 810 connected to the system control logic 808.

The processors 802 may include one or more single-core or multi-core processors. The processors 802 may include any combination of general-purpose processors and special-purpose processors (for example, graphics processors, application processors, baseband processors, etc.). In the embodiments of the present application, the processors 802 may be configured to execute one or more of the various embodiments shown in Figs. 5-7.
In some embodiments, the system control logic 808 may include any suitable interface controller to provide any suitable interface to the one or more processors 802 and/or to any suitable device or component in communication with the system control logic 808.

In some embodiments, the system control logic 808 may include one or more memory controllers to provide an interface to the system memory 804. The system memory 804 may be used to load and store data and/or instructions for the system 800. In some embodiments, the system memory 804 may include any suitable volatile memory, such as a suitable dynamic random access memory (DRAM).
The NVM/storage 806 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the NVM/storage 806 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as one or more of an HDD (Hard Disk Drive), a CD (Compact Disc) drive, and a DVD (Digital Versatile Disc) drive.

The NVM/storage 806 may include a portion of the storage resources installed on the apparatus of the system 800, or it may be accessible by the device without necessarily being a part of the device. For example, the NVM/storage 806 may be accessed over a network via the network interface 810.

In particular, the system memory 804 and the NVM/storage 806 may include a temporary copy and a permanent copy, respectively, of instructions 820. The instructions 820 may include instructions that, when executed by at least one of the processors 802, cause the system 800 to implement one or more of the various embodiments shown in Figs. 5-7. In some embodiments, the instructions 820, hardware, firmware, and/or software components thereof may additionally or alternatively reside in the system control logic 808, the network interface 810, and/or the processors 802.
The network interface 810 may include a transceiver for providing the system 800 with a radio interface for communicating with any other suitable device (such as a front-end module, an antenna, etc.) over one or more networks. In some embodiments, the network interface 810 may be integrated with other components of the system 800. For example, the network interface 810 may include at least one of the processors 802, the system memory 804, the NVM/storage 806, and a firmware device (not shown) having instructions that, when executed by at least one of the processors 802, cause the system 800 to implement one or more of the various embodiments shown in Figs. 5-7.

The network interface 810 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, the network interface 810 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
In one embodiment, one or more of the processors 802 may be packaged together with the logic of one or more controllers for the system control logic 808 to form a system in package (SiP). In one embodiment, one or more of the processors 802 may be integrated on the same die with the logic of one or more controllers for the system control logic 808 to form a system on chip (SoC).

The system 800 may further include an input/output (I/O) interface 812. The I/O interface 812 may include a user interface to enable a user to interact with the system 800, and a peripheral component interface designed so that peripheral components can also interact with the system 800. In some embodiments, the system 800 further includes sensors for determining at least one of environmental conditions and location information related to the system 800.
In some embodiments, the user interface may include, but is not limited to, a display (for example, a liquid crystal display, a touchscreen display, etc.), speakers, a microphone, one or more cameras (for example, still image cameras and/or video cameras), a flashlight (for example, a light-emitting diode flash), and a keyboard.

In some embodiments, the peripheral component interface may include, but is not limited to, a non-volatile memory port, an audio jack, and a power interface.

In some embodiments, the sensors may include, but are not limited to, a gyroscope sensor, an accelerometer, a proximity sensor, an ambient light sensor, and a positioning unit. The positioning unit may also be part of, or interact with, the network interface 810 to communicate with components of a positioning network (for example, Global Positioning System (GPS) satellites).
Although the description of this application is presented in conjunction with preferred embodiments, this does not mean that the features of the invention are limited to those embodiments. On the contrary, the purpose of describing the invention in conjunction with embodiments is to cover other alternatives or modifications that may be derived from the claims of this application. To provide a thorough understanding of this application, the following description contains many specific details; this application may also be implemented without these details. In addition, to avoid confusing or obscuring the focus of this application, some specific details are omitted from the description. It should be noted that, where there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another.

In addition, various operations are described as multiple discrete operations in a manner most helpful for understanding the illustrative embodiments; however, the order of description should not be construed as implying that these operations are necessarily order-dependent. In particular, these operations need not be performed in the order of presentation.

Unless the context requires otherwise, the terms "comprising", "having", and "including" are synonymous. The phrase "A/B" means "A or B". The phrase "A and/or B" means "(A and B) or (A or B)".

As used herein, the term "module" or "unit" may refer to, be, or include an application-specific integrated circuit (ASIC), an electronic circuit, a (shared, dedicated, or group) processor and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.

In the drawings, some structural or method features are shown in specific arrangements and/or orders. However, it should be understood that such specific arrangements and/or orders may not be required. In some embodiments, these features may be arranged in a manner and/or order different from that shown in the illustrative drawings. In addition, the inclusion of structural or method features in a particular drawing does not imply that such features are required in all embodiments; in some embodiments, these features may not be included or may be combined with other features.
The embodiments of the mechanisms disclosed in this application may be implemented in hardware, software, firmware, or a combination of these implementation approaches. The embodiments of this application may be implemented as a computer program or program code executed on a programmable system, where the programmable system includes multiple processors, a storage system (including volatile and non-volatile memory and/or storage elements), multiple input devices, and multiple output devices.

Program code may be applied to input instructions to perform the functions described in this application and to generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system. Where needed, the program code may also be implemented in assembly language or machine language. In fact, the mechanisms described in this application are not limited in scope to any particular programming language. In either case, the language may be a compiled language or an interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. In some cases, one or more aspects of at least some embodiments may be implemented by representative instructions stored on a computer-readable storage medium, the instructions representing various logic in a processor, which, when read by a machine, cause the machine to fabricate logic for performing the techniques described in this application. These representations, referred to as "IP cores", may be stored on a tangible computer-readable storage medium and supplied to various customers or production facilities for loading into the fabrication machines that actually manufacture the logic or processor.

Such computer-readable storage media may include, but are not limited to, non-transitory tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disc read-only memories (CD-ROMs), compact disc rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, and electrically erasable programmable read-only memories (EEPROMs); phase change memories (PCMs); magnetic or optical cards; or any other type of medium suitable for storing electronic instructions.

Therefore, the embodiments of this application also include non-transitory computer-readable storage media containing instructions or containing design data, such as hardware description language (HDL), that defines the structures, circuits, apparatuses, processors, and/or system features described in this application.

Claims (18)

  1. A voice processing method, characterized by comprising:
    receiving a plurality of voice inputs and extracting a plurality of voice features from the plurality of voice inputs;
    determining a plurality of speaker features based on the plurality of voice features;
    clustering the plurality of speaker features into at least one speaker feature category, wherein the at least one speaker feature category corresponds one-to-one to at least one speaker, and each speaker feature category of the at least one speaker feature category comprises at least one speaker feature of the plurality of speaker features;
    determining at least one speaker template based on the at least one speaker feature category, wherein the at least one speaker template corresponds one-to-one to the at least one speaker; and
    receiving a current voice input from a current speaker, and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker.
  2. The voice processing method according to claim 1, characterized in that the determining a plurality of speaker features based on the plurality of voice features comprises:
    determining the plurality of speaker features based on the plurality of voice features through a voiceprint model;
    wherein the voiceprint model comprises at least one of a Gaussian mixture model-universal background model (GMM-UBM), an I-vector model, and a joint factor analysis model, and the plurality of speaker features comprise at least one of a mean supervector of the GMM-UBM, an I-vector of the I-vector model, and a speaker-dependent supervector of the joint factor analysis model.
  3. The voice processing method according to claim 1 or 2, characterized in that the determining at least one speaker template based on the at least one speaker feature category comprises:
    determining a mean or a weighted sum of the at least one speaker feature within each speaker feature category; and
    using the at least one mean or weighted sum of the at least one speaker feature as the at least one speaker template.
  4. The voice processing method according to any one of claims 1 to 3, characterized in that the clustering the plurality of speaker features into at least one speaker feature category comprises:
    clustering the plurality of speaker features into the at least one speaker feature category based on at least one of a similarity between every two speaker features of the plurality of speaker features, an offset between two speaker features of the plurality of speaker features, and a density distribution of the plurality of speaker features.
  5. The voice processing method according to any one of claims 1 to 4, characterized in that the receiving a current voice input from a current speaker and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker comprises:
    receiving the current voice input from the current speaker, and extracting a current voice feature from the current voice input;
    determining a current speaker feature based on the current voice feature;
    determining whether the current speaker feature matches one speaker template of the at least one speaker template; and
    in a case where it is determined that the current speaker feature matches the one speaker template, determining that the current speaker matches the speaker corresponding to the one speaker template.
  6. The voice processing method according to claim 3, characterized in that the receiving a current voice input from a current speaker and determining, based on the current voice input and the at least one speaker template, whether the current speaker matches one speaker of the at least one speaker comprises:
    determining whether the current speaker matches one speaker of the at least one speaker based on a similarity between the current speaker feature and each speaker template of the at least one speaker template.
  7. The voice processing method according to any one of claims 1 to 6, characterized by further comprising:
    in a case where it is determined that the current speaker matches the one speaker, determining whether a sum of a number of current speaker features of the current speaker and a number of the at least one speaker feature in the speaker feature category, of the at least one speaker feature category, corresponding to the one speaker is equal to a first threshold; and
    in a case where it is determined that the sum of the numbers is not equal to the first threshold, updating the speaker template corresponding to the one speaker based on the current speaker feature and the at least one speaker feature in the one speaker feature category.
  8. The voice processing method according to any one of claims 1 to 7, characterized by further comprising:
    in a case where it is determined that the current speaker matches the one speaker, determining whether a sum of a number of current speaker features of the current speaker and a number of the at least one speaker feature in the speaker feature category, of the at least one speaker feature category, corresponding to the one speaker is equal to a first threshold;
    in a case where it is determined that the sum of the numbers is equal to the first threshold, adding the current speaker feature to the plurality of speaker features to form an updated plurality of speaker features;
    clustering the updated plurality of speaker features into at least one updated speaker feature category, wherein the at least one updated speaker feature category corresponds one-to-one to at least one updated speaker, and each updated speaker feature category of the at least one updated speaker feature category comprises at least one speaker feature of the updated plurality of speaker features; and
    determining at least one updated speaker template based on the at least one updated speaker feature category, wherein the at least one updated speaker template corresponds one-to-one to the at least one updated speaker.
  9. The voice processing method according to any one of claims 1 to 8, characterized by further comprising:
    in a case where it is determined that the current speaker does not match the at least one speaker, determining whether a sum of a number of current speaker features of the current speaker and a number of at least one speaker feature not included in the at least one speaker feature category is greater than or equal to a second threshold;
    in a case where it is determined that the sum of the numbers is greater than or equal to the second threshold, clustering the current speaker feature and the at least one speaker feature not included in the at least one speaker feature category into at least one other speaker feature category, wherein the at least one other speaker feature category corresponds one-to-one to at least one other speaker; and
    determining at least one other speaker template based on the at least one other speaker feature category, wherein the at least one other speaker template corresponds one-to-one to the at least one other speaker.
  10. The voice processing method according to any one of claims 1 to 9, characterized by further comprising:
    in a case where it is determined that the current speaker does not match the at least one speaker, determining whether a sum of a number of current speaker features of the current speaker and a number of at least one speaker feature not included in the at least one speaker category is greater than or equal to a second threshold;
    in a case where it is determined that the sum of the numbers is greater than or equal to the second threshold, adding the current speaker feature and the at least one speaker feature not included in the at least one speaker category to the plurality of speaker features to form an updated plurality of speaker features;
    clustering the updated plurality of speaker features into at least one updated speaker feature category; and
    determining at least one updated speaker template based on the at least one updated speaker feature category, wherein the at least one updated speaker template corresponds one-to-one to at least one updated speaker.
  11. The voice processing method according to any one of claims 1 to 10, characterized by further comprising:
    in a case where it is determined that the current speaker matches one speaker of the plurality of speakers, acquiring a current user identification of the current speaker through interaction with the current speaker; and
    associating the current user identification with one speaker feature category of the at least one speaker feature category and with one speaker template of the at least one speaker template, wherein the one speaker feature category and the one speaker template correspond to the one speaker.
  12. The voice processing method according to claim 11, wherein the current user identification includes at least one of the name, gender, age, authority, and preferences of the current speaker.
  13. The voice processing method according to claim 11 or 12, further comprising:
    receiving a next voice input from a next speaker, and determining, based on the next voice input and the at least one speaker template, whether the next speaker matches the one speaker among the at least one speaker;
    in a case where it is determined that the next speaker matches the one speaker, identifying the next speaker with the current user identification.
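The matching step in claims 11 and 13 could, for example, score a speaker feature against every stored template and accept the best match only above a similarity floor. Cosine similarity and the match_threshold value below are assumptions of this sketch; the claims leave the comparison method open.

```python
import numpy as np

def match_speaker(feature, templates, match_threshold=0.7):
    """Return the ID of the best-matching speaker template, or None when
    no template scores above the (assumed) similarity threshold."""
    best_id, best_score = None, -1.0
    for speaker_id, template in templates.items():
        score = float(
            np.dot(feature, template)
            / (np.linalg.norm(feature) * np.linalg.norm(template))
        )
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id if best_score >= match_threshold else None
```

A matched next speaker can then simply be identified with the user identification previously associated with that speaker's template.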
  14. The voice processing method according to claim 8 or 10, further comprising:
    in a case where one updated speaker feature category among the at least one updated speaker feature category includes a plurality of speaker features and the plurality of speaker features are associated with a plurality of user identifications, determining one user identification associated with the largest number of speaker features among the plurality of speaker features; and
    associating the one user identification with one speaker template among the at least one updated speaker template, wherein the one updated speaker template corresponds to the one updated speaker feature category,
    wherein each user identification among the plurality of user identifications includes at least one of the speaker's name, gender, age, authority, and preferences.
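The selection rule in claim 14 amounts to a majority vote over the user identifications attached to the features of a merged category. A minimal sketch, assuming each feature carries an optional per-feature user-ID tag (a representation the claim does not specify):

```python
from collections import Counter

def dominant_user_id(feature_user_ids):
    """Pick the user ID associated with the largest number of features in
    an updated category; feature_user_ids holds one ID per feature, with
    None marking features that were never tagged."""
    counts = Counter(uid for uid in feature_user_ids if uid is not None)
    return counts.most_common(1)[0][0] if counts else None
```

The winning ID is then re-associated with the updated template of that category, so a re-clustering pass does not lose the identity learned earlier.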
  15. The voice processing method according to any one of claims 1-14, further comprising:
    determining a voice attribute of the current speaker based on the current voice input of the current speaker; and
    in a case where it is determined that the current speaker matches one speaker among the plurality of speakers, associating the voice attribute with the speaker feature category corresponding to the one speaker.
  16. The voice processing method according to claim 15, wherein the voice attribute includes at least one of an age attribute of the voice and a gender attribute of the voice.
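Claims 15 and 16 only require that coarse voice attributes be attached to the matched speaker's category; how the attributes are estimated is left open. A sketch of the bookkeeping, assuming an external age/gender classifier supplies the values:

```python
def attach_voice_attributes(category_attributes, category_id, age_group, gender):
    """Associate (externally classified, assumed) voice attributes with
    the speaker feature category of the matched current speaker."""
    category_attributes.setdefault(category_id, {}).update(
        {"age": age_group, "gender": gender}
    )
```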
  17. A machine-readable medium, characterized in that instructions are stored on the medium, and when the instructions are run on the machine, the machine is caused to execute the voice processing method according to any one of claims 1 to 16.
  18. A system, characterized in that it comprises:
    a processor; and
    a memory, on which instructions are stored, and when the instructions are run by the processor, the system is caused to execute the voice processing method according to any one of claims 1 to 16.
PCT/CN2020/141600 2020-01-10 2020-12-30 Voice processing method, medium, and system WO2021139589A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010025486.XA CN113129901A (en) 2020-01-10 2020-01-10 Voice processing method, medium and system
CN202010025486.X 2020-01-10

Publications (1)

Publication Number Publication Date
WO2021139589A1 true WO2021139589A1 (en) 2021-07-15

Family

ID=76771220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141600 WO2021139589A1 (en) 2020-01-10 2020-12-30 Voice processing method, medium, and system

Country Status (2)

Country Link
CN (1) CN113129901A (en)
WO (1) WO2021139589A1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998022936A1 (en) * 1996-11-22 1998-05-28 T-Netix, Inc. Subword-based speaker verification using multiple classifier fusion, with channel, fusion, model, and threshold adaptation
CN1462366A (en) * 2001-05-10 2003-12-17 皇家菲利浦电子有限公司 Background learning of speaker voices
CN1652206A (en) * 2005-04-01 2005-08-10 郑方 Sound veins identifying method
CN1662956A (en) * 2002-06-19 2005-08-31 皇家飞利浦电子股份有限公司 Mega speaker identification (ID) system and corresponding methods therefor
CN101241699A (en) * 2008-03-14 2008-08-13 北京交通大学 A speaker identification system for remote Chinese teaching
CN103562993A (en) * 2011-12-16 2014-02-05 华为技术有限公司 Speaker recognition method and device
EP2808866A1 (en) * 2013-05-31 2014-12-03 Nuance Communications, Inc. Method and apparatus for automatic speaker-based speech clustering
CN108597525A (en) * 2018-04-25 2018-09-28 四川远鉴科技有限公司 Voice vocal print modeling method and device
CN109243465A (en) * 2018-12-06 2019-01-18 平安科技(深圳)有限公司 Voiceprint authentication method, device, computer equipment and storage medium
CN110600041A (en) * 2019-07-29 2019-12-20 华为技术有限公司 Voiceprint recognition method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1387350A1 (en) * 2002-07-25 2004-02-04 Sony International (Europe) GmbH Spoken man-machine interface with speaker identification
JP5052449B2 (en) * 2008-07-29 2012-10-17 日本電信電話株式会社 Speech section speaker classification apparatus and method, speech recognition apparatus and method using the apparatus, program, and recording medium
CN102820033B (en) * 2012-08-17 2013-12-04 南京大学 Voiceprint identification method
JP6171544B2 (en) * 2013-05-08 2017-08-02 カシオ計算機株式会社 Audio processing apparatus, audio processing method, and program
US11417343B2 (en) * 2017-05-24 2022-08-16 Zoominfo Converse Llc Automatic speaker identification in calls using multiple speaker-identification parameters
JP6676009B2 (en) * 2017-06-23 2020-04-08 日本電信電話株式会社 Speaker determination device, speaker determination information generation method, and program

Also Published As

Publication number Publication date
CN113129901A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
EP3525205B1 (en) Electronic device and method of performing function of electronic device
US11822770B1 (en) Input-based device operation mode management
US20220101861A1 (en) Selective requests for authentication for voice-based launching of applications
US11430449B2 (en) Voice-controlled management of user profiles
US20210312905A1 (en) Pre-Training With Alignments For Recurrent Neural Network Transducer Based End-To-End Speech Recognition
CN105940407B (en) System and method for assessing the intensity of audio password
EP3676831B1 (en) Natural language user input processing restriction
US11763808B2 (en) Temporary account association with voice-enabled devices
US10714085B2 (en) Temporary account association with voice-enabled devices
EP3257043B1 (en) Speaker recognition in multimedia system
US20160019887A1 (en) Method and device for context-based voice recognition
JP2018536889A (en) Method and apparatus for initiating operations using audio data
US11727939B2 (en) Voice-controlled management of user profiles
US10916249B2 (en) Method of processing a speech signal for speaker recognition and electronic apparatus implementing same
US10885910B1 (en) Voice-forward graphical user interface mode management
US20180357269A1 (en) Address Book Management Apparatus Using Speech Recognition, Vehicle, System and Method Thereof
KR20200095947A (en) Electronic device and Method for controlling the electronic device thereof
WO2021139589A1 (en) Voice processing method, medium, and system
JP2024510798A (en) Hybrid multilingual text-dependent and text-independent speaker verification
Khosravani et al. The Intelligent Voice System for the IberSPEECH-RTVE 2018 Speaker Diarization Challenge.
CN112513845A (en) Transient account association with voice-enabled devices
US11076018B1 (en) Account association for voice-enabled devices
US20220399016A1 (en) Presence-based application invocation
CN111862947A (en) Method, apparatus, electronic device, and computer storage medium for controlling smart device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20912907

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20912907

Country of ref document: EP

Kind code of ref document: A1