CN114495911A - Speaker clustering method, device and equipment - Google Patents


Info

Publication number
CN114495911A
Authority
CN
China
Prior art keywords
speaker
voice
community
network
clustering
Prior art date
Legal status
Pending
Application number
CN202210028998.0A
Other languages
Chinese (zh)
Inventor
郑斯奇
索宏彬
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210028998.0A
Publication of CN114495911A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/23 Clustering techniques
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0631 Creating reference templates; Clustering
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a speaker clustering method, device, and equipment. The method comprises the following steps: dividing the speech to be processed into a plurality of speech segments; acquiring the speaker features of the speech segments; building a community network with the speech segments as nodes and speaker-feature similarity as edge weights; and determining, through a community detection algorithm, the set of speech segments corresponding to each speaker from the community network. By performing speaker clustering based on community detection, this approach can effectively improve the accuracy of speaker clustering.

Description

Speaker clustering method, device and equipment
Technical Field
The application relates to the technical field of speech processing, and in particular to a speaker clustering method, device, and system, and to an electronic device.
Background
With the wide application of intelligent speech technology in daily life, accurately identifying the start and end times at which different speakers talk during multi-person voice interaction has gradually become a research hotspot, since it is a prerequisite for back-end recognition technologies such as speech recognition.
Speaker diarization (also called speaker logging) technology mainly addresses a long audio recording of a multi-person conversation (usually single-channel spoken dialog with many overlapping segments): the computer automatically identifies how many speakers appear in the audio and detects the start and end timestamps of each speaker's turns, answering the question of who spoke when. This makes it convenient to quickly retrieve and locate the speech segments of a specific speaker. Speaker diarization is the basis of subsequent modules such as speech recognition and voiceprint recognition, and is widely applied to the transcription and indexing of meeting scenarios.
Speaker diarization mainly adopts clustering to group a large number of speech segments by speaker identity, gathering together the segments that belong to the same speaker. At present, mainstream speaker clustering uses common methods such as k-means, agglomerative hierarchical clustering (AHC), and spectral clustering. In the process of implementing the present invention, however, the inventors found that these technical solutions have at least the following problems: a single person is easily split into multiple clusters, or different speakers are wrongly merged, so the accuracy of speaker clustering is low.
Disclosure of Invention
The application provides a speaker clustering method to solve the problem of low speaker-clustering accuracy in the prior art. The application additionally provides a speaker clustering device and system, and an electronic device.
The application provides a speaker clustering method, which comprises the following steps:
dividing the speech to be processed into a plurality of speech segments;
acquiring the speaker features of the speech segments;
building a community network with the speech segments as nodes and speaker-feature similarity as edge weights;
and determining, through a community detection algorithm, the set of speech segments corresponding to each speaker from the community network.
Optionally, the determining, by a community detection algorithm, a speech segment set corresponding to each speaker according to the community network includes:
removing all edges in the community network, and taking each node of the community network as a community;
adding the removed edges back to the community network one at a time; whenever an added edge connects two different communities, merging the two communities: determining, according to the speaker-feature similarity, the modularity gain of the resulting new community partition, and selecting for merging the two communities that maximize the modularity gain, until no merge that increases modularity can be found;
and selecting the community partition with the maximum modularity as a voice segment set corresponding to each speaker according to the modularity values corresponding to the various community partitions.
Optionally, the method further includes:
and performing dimensionality reduction on the speaker features through a Uniform Manifold Approximation and Projection (UMAP) algorithm.
Optionally, the method further includes:
and smoothing the speaker characteristics.
Optionally, the obtaining of the speaker characteristics of the speech segment includes:
and acquiring the characteristics of the speaker through a speaker characteristic identification network.
Optionally, the method further includes:
displaying the speaker clustering results obtained through the community detection algorithm, where each cluster in the clustering results corresponds to one speaker and each node in a cluster corresponds to one speech segment;
and updating the clustering result according to the correction information of the clustering result provided by the user.
Optionally, the method further includes:
determining speaking time corresponding to each speaker according to the voice segment set corresponding to each speaker;
and displaying the corresponding speaker and the voice segment according to the target speaking time.
The present application further provides a speaker clustering device, comprising:
the voice dividing unit is used for dividing the voice to be processed into a plurality of voice fragments;
the speaker characteristic identification unit is used for acquiring the speaker characteristics of the voice segment;
the community network construction unit is used for constructing a community network by taking the voice fragments as nodes and the speaker characteristic similarity as an edge value;
and the community detection unit is used for determining the voice fragment set corresponding to each speaker according to the community network through a community detection algorithm.
The present application further provides an electronic device, comprising:
a processor and a memory;
a memory for storing a program implementing the speaker clustering method described above; after the device is powered on, the program of the method is run by the processor.
The present application further provides a conference transcription system, including:
the conference terminal is used for collecting conference voice and sending the conference voice to the server;
the server is used for dividing conference voice into a plurality of voice fragments; acquiring the speaker characteristics of the voice segments; establishing a community network by taking the voice segments as nodes and the speaker characteristic similarity as an edge value; determining a voice fragment set corresponding to each speaker according to the community network through a community detection algorithm; and forming a speaking text corresponding to each speaker according to the transcribed text of each voice segment.
The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.
Compared with the prior art, the method has the following advantages:
the speaker clustering method provided by the embodiments of the application divides the speech to be processed into a plurality of speech segments; acquires the speaker features of the speech segments; builds a community network with the speech segments as nodes and speaker-feature similarity as edge weights; and determines, through a community detection algorithm, the set of speech segments corresponding to each speaker from the community network. By performing speaker clustering based on community detection, this processing mode can effectively improve the accuracy of speaker clustering.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a speaker clustering method provided herein;
FIG. 2 is a schematic diagram of a community network according to an embodiment of the speaker clustering method provided in the present application;
FIG. 3 is a schematic diagram of speaker clustering according to an embodiment of the speaker clustering method provided in the present application.
Detailed Description
In the following description, numerous specific details are set forth to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application. The application is therefore not limited to the specific implementations disclosed below.
In this application, a speaker clustering method and device, a conference transcription system, and an electronic device are provided. Each of the schemes is described in detail in the following embodiments.
First embodiment
Please refer to fig. 1, which is a flowchart illustrating a speaker clustering method according to the present application. In this embodiment, the method may include the steps of:
step S101: dividing the voice to be processed into a plurality of voice fragments.
In this step, a speech segmentation module can remove the non-speech parts and cut the input speech into short segments, i.e., a plurality of speech segments. In practice, any segmentation scheme can be adopted, for example segments of 1.5 to 2 seconds, with an overlap of 0.5 to 1 second between adjacent segments.
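The sliding-window segmentation described above can be sketched as follows; the 2-second window, 0.5-second overlap, and 16 kHz sample rate are assumed example values within the ranges the text mentions, not parameters fixed by the patent.

```python
def split_into_segments(num_samples, sample_rate=16000,
                        window_s=2.0, overlap_s=0.5):
    """Return (start, end) sample indices of overlapping speech segments."""
    win = int(window_s * sample_rate)                 # samples per segment
    hop = int((window_s - overlap_s) * sample_rate)   # step between segment starts
    segments = []
    start = 0
    while start + win <= num_samples:
        segments.append((start, start + win))
        start += hop
    if start < num_samples:          # keep the trailing partial segment
        segments.append((start, num_samples))
    return segments

# A 10-second recording at 16 kHz yields 2 s segments starting every 1.5 s.
segs = split_into_segments(16000 * 10)
```

In a full system, the returned index ranges would be applied to the waveform after a voice-activity step has removed the non-speech parts.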
Step S103: and acquiring the speaker characteristics of the voice segments.
The speaker characteristics can be voiceprint characteristics of the speaker, and the method provided by the embodiment of the application can realize clustering of the voiceprint characteristics.
In this step, a feature-vector extraction module can extract, from each speech segment, a feature vector that discriminates between speakers, such as an i-vector or d-vector. In a specific implementation, the feature-vector extraction module can adopt a neural-network-based speaker feature recognition network that takes the speech segment as input and outputs the speaker feature. Since speaker feature-vector extraction is mature prior art, it is not described further here.
In this embodiment, the speaker features of the speech segments form a speaker feature matrix, whose rows may be feature dimensions and whose columns may be speech segments. For example, if the speaker feature of each speech segment is a 500-dimensional vector and the whole conference recording is divided into 1000 speech segments, the result is a 500 x 1000 matrix, which can later be reduced to a new matrix of, for example, 30 x 1000.
Step S105: and establishing a community network by taking the voice segments as nodes and the speaker characteristic similarity as an edge value.
The method provided by the embodiments of the application treats the speaker clustering problem as a community detection problem. The basic definition of a community in a community network is a tightly connected set of nodes with many internal connections and relatively few external connections. According to the structural relationships of the network graph, the speech segments can be reasonably partitioned, that is, assigned to their respective speakers.
As shown in fig. 2, this embodiment builds a community network with the speech segments as nodes and speaker-feature similarity as edge weights. The speaker-feature similarity of an edge refers to the similarity between the speaker features of the two speech segments corresponding to the edge's two nodes; it may be computed as the product of the two segments' speaker feature vectors. In a specific implementation, the speaker-feature similarity between every pair of speech segments can be calculated.
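A minimal sketch of this construction follows. The text only says the similarity may be a product of the two feature vectors; the normalized inner product (cosine similarity) used here is one common concrete choice, an assumption rather than the patent's specified measure.

```python
import math

def cosine(u, v):
    """Cosine similarity: the inner product of u and v after normalization."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_adjacency(features):
    """Build the community network: one node per speech segment,
    edge weight = speaker-feature similarity between the two segments."""
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = cosine(features[i], features[j])
    return adj

# Toy speaker features: segments 0 and 1 resemble each other, segment 2 does not.
feats = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
adj = build_adjacency(feats)
```

In practice the features would be the (possibly dimension-reduced) vectors from the previous step, and the resulting weight matrix is what the community detection step operates on.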
In practical applications, the method may further include a step of performing dimensionality reduction on the speaker features, such as reducing the 500-dimensional speaker features to 30 dimensions. This processing can effectively improve the accuracy and stability of speaker clustering.
In this embodiment, the speaker features are reduced in dimension through the Uniform Manifold Approximation and Projection (UMAP) algorithm. UMAP preserves the global topological structure of the distribution, that is, the distance relations between the classes (speakers), and can therefore effectively improve the separation of the speaker clusters. In a specific implementation, other dimensionality-reduction methods such as t-SNE, LDA, or PCA can also be adopted.
Step S107: and determining a voice fragment set corresponding to each speaker according to the community network through a community detection algorithm.
Based on the community network, speaker clustering is performed on all speech segments through a community detection algorithm: the number of speakers can be determined, and a speaker identity is assigned to each speech segment. The community detection algorithm may be, for example, the Leiden algorithm.
In one example, step S107 may include the following sub-steps:
step S1071: and removing all edges in the community network, and taking each node of the community network as a community.
Step S1073: and respectively adding edges which are not added into the community network back to the network, adding one edge each time, merging two communities if the edges added into the network are connected with two different communities, determining the modularity increment for forming new community division according to the speaker characteristic similarity, and selecting the two communities with the largest modularity increment for merging until the merging for increasing the modularity can not be found.
Community detection is a method that optimizes modularity. Modularity may be defined as the fraction of the network's total edge weight that falls inside communities, minus the fraction expected under a random null model.
Step S1075: selecting, according to the modularity values of the various community partitions, the community partition with the maximum modularity as the sets of speech segments corresponding to the speakers.
The community partition with the maximum modularity is taken as the speaker clustering result for all speech segments. Ideally, if the whole conference includes three speakers, then all the speech segments of one speaker (e.g., segments 5, 15, 64, etc.) form one community.
The optimization goal of the community detection algorithm is to maximize the modularity over the whole network. In a specific implementation, the modularity value can be calculated by the following formula:
$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

$$\delta(c_i, c_j) = \begin{cases}1, & c_i = c_j\\ 0, & \text{otherwise}\end{cases}$$
In this formula, assume there are n speech segments and let A denote the corresponding adjacency matrix. A_ij is the speaker-feature similarity between the speech segments of the i-th and j-th nodes; k_i and k_j are the degrees of the i-th and j-th nodes, respectively; and 2m is the sum of the degrees over all speech segments. The "degree" of a node here means the sum of the products of its speaker feature with the speaker features of all other segments; c_i is the community of node i, and δ(c_i, c_j) equals 1 when the two nodes are in the same community and 0 otherwise.
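The modularity of a given community partition can be computed directly from the quantities defined above. The sketch below implements the standard weighted (Newman) modularity; the toy adjacency matrix and partition labels are illustrative.

```python
def modularity(adj, communities):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    n = len(adj)
    degree = [sum(adj[i]) for i in range(n)]   # weighted degree k_i
    two_m = sum(degree)                        # 2m: each edge weight counted twice
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:   # delta(c_i, c_j) = 1
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

# Two tight triangles of segments joined by one weak edge.
adj = [
    [0.0, 1.0, 1.0, 0.1, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
good = modularity(adj, [0, 0, 0, 1, 1, 1])   # one community per triangle
bad = modularity(adj, [0, 1, 0, 1, 0, 1])    # labels scattered across triangles
```

Grouping each triangle as one community yields a clearly positive Q, while a scattered labeling yields a negative one; this is the signal the greedy merging described above climbs.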
In one example, the method may further comprise the step of smoothing the speaker features. In a specific implementation, the clustering results can be arranged in the time order of the original conference and then smoothed. For example, if the speaker-label sequence of a stretch of speech is 11112122222, it can be smoothed to 11111222222. Smoothing makes the clustering result easier to read when presented.
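One simple smoothing choice, assumed here rather than specified by the text, is a sliding majority (mode) filter over the time-ordered per-segment speaker labels: isolated labels are flipped to match their neighbors.

```python
from collections import Counter

def smooth_labels(labels, window=3):
    """Replace each label with the most common label in a small window."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        out.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return out

print("".join(smooth_labels(list("11112122222"))))   # → 11111222222
```

The exact boundary between two runs can shift by a segment depending on the window size; larger windows smooth more aggressively.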
In one example, the method may further comprise the steps of: displaying the speaker clustering results obtained through the community detection algorithm, where each cluster in the clustering results corresponds to one speaker and each point in a cluster corresponds to one speech segment; and updating the clustering results according to correction information about the clustering results provided by the user. The correction information may include corrections to the speakers, corrections to the speaker to which a speech segment belongs, or both. FIG. 3 shows speaker clustering results in which each cluster corresponds to a speaker and each dot corresponds to a speech segment.
In one example, the method provided by the embodiments of the application is applied to a conference scenario: the speech to be processed includes multi-person conference speech, the speakers include conference users, and the speech-segment sets corresponding to the speakers include the sets of conference speech segments corresponding to the conference users. Correspondingly, the method may further comprise the following steps: determining the speaking times of each conference user according to that user's set of speech segments; and displaying the corresponding conference user and speech segment for a target speaking time. The target speaking time can be any time point or time period in the conference specified by the user. Applied to a conference scenario, the method can separate the speech of each participant in a multi-person meeting and mark the speaking intervals of the different speakers in the collected speech stream; that is, the speaker identity of each stretch of speech can be looked up with time as the index.
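The time index described above can be sketched as follows; the (start, end, speaker) segment triples and the function names are illustrative assumptions, not structures taken from the patent.

```python
def build_time_index(segments):
    """Group each speaker's speaking intervals: speaker -> [(start, end), ...]."""
    index = {}
    for start, end, speaker in segments:
        index.setdefault(speaker, []).append((start, end))
    return index

def speakers_at(segments, t):
    """Look up, with time as the index, who is speaking at time t (seconds)."""
    return sorted({spk for start, end, spk in segments if start <= t < end})

# Clustered segments of a toy two-person meeting (times in seconds).
segments = [(0.0, 2.0, "A"), (1.5, 3.5, "B"), (3.0, 5.0, "A")]
idx = build_time_index(segments)
```

Overlapping speech simply yields more than one speaker for the queried instant, which matches the aliased (multi-speaker) segments the background section mentions.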
As can be seen from the foregoing embodiment, the speaker clustering method provided by the embodiments of the application divides the speech to be processed into a plurality of speech segments; acquires the speaker features of the speech segments; builds a community network with the speech segments as nodes and speaker-feature similarity as edge weights; and determines, through a community detection algorithm, the set of speech segments corresponding to each speaker from the community network. By performing speaker clustering based on community detection, this processing mode can effectively improve the accuracy of speaker clustering.
Second embodiment
In the above embodiments, a speaker clustering method is provided, and correspondingly, a speaker clustering device is also provided. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application further provides a speaker clustering device, comprising: the system comprises a voice dividing unit, a speaker characteristic identification unit, a community network construction unit and a community detection unit.
The voice dividing unit is used for dividing the voice to be processed into a plurality of voice fragments; the speaker characteristic identification unit is used for acquiring the speaker characteristics of the voice segment; the community network construction unit is used for constructing a community network by taking the voice fragments as nodes and the speaker characteristic similarity as an edge value; and the community detection unit is used for determining the voice fragment set corresponding to each speaker according to the community network through a community detection algorithm.
In one example, the community detection unit is specifically configured to: remove all edges in the community network and take each node of the community network as a community; add the removed edges back to the network one at a time, merging two communities whenever an added edge connects two different communities, determining the modularity gain of the resulting new community partition according to the speaker-feature similarity, and selecting for merging the two communities that maximize the modularity gain, until no merge that increases modularity can be found; and select, according to the modularity values of the various community partitions, the partition with the maximum modularity as the sets of speech segments corresponding to the speakers.
In one example, the apparatus may further include: a dimensionality-reduction unit, configured to reduce the dimension of the speaker features through a Uniform Manifold Approximation and Projection (UMAP) algorithm.
In one example, the apparatus may further include: and the smoothing processing unit is used for smoothing the speaker characteristics.
In one example, the speaker characteristic identification unit is specifically configured to acquire the speaker characteristic through a speaker characteristic identification network.
In one example, the apparatus may further include: a speaker clustering result display unit and a clustering result correction unit. The speaker clustering result display unit is used for displaying the speaker clustering results obtained through the community detection algorithm, each cluster included in the clustering results corresponds to one speaker, and the point in each cluster corresponds to one voice segment; and the clustering result correcting unit is used for updating the clustering result according to the correcting information of the clustering result provided by the user.
In one example, the apparatus may further include: the time marking unit and the speaker searching unit. The time marking unit is used for determining speaking time corresponding to each speaker according to the voice segment set corresponding to each speaker; and the speaker searching unit is used for displaying the corresponding speaker and the voice segment according to the target speaking time.
Third embodiment
In the foregoing embodiment, a speaker clustering method is provided, and accordingly, the present application further provides an electronic device. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; a memory for storing a program for implementing the speaker clustering method, wherein the following steps are executed after the device is powered on and the program for the method is run by the processor: dividing the voice to be processed into a plurality of voice fragments; acquiring the speaker characteristics of the voice segments; establishing a community network by taking the voice segments as nodes and the speaker characteristic similarity as an edge value; and determining a voice fragment set corresponding to each speaker according to the community network through a community detection algorithm.
Fourth embodiment
Corresponding to the speaker clustering method, the application also provides a conference transcription system. Parts of this embodiment that are the same as the first embodiment are not described again, please refer to corresponding parts in the first embodiment.
The conference transcription system is used for clustering the characters transcribed according to the conference voice according to the voice of the speaker, and marking each segment of characters with a label corresponding to the speaker. The system comprises: a conference terminal and a server.
The conference terminal is used for collecting conference voice and sending the conference voice to the server; the server is used for dividing conference voice into a plurality of voice fragments; acquiring the speaker characteristics of the voice segments; establishing a community network by taking the voice segments as nodes and the speaker characteristic similarity as an edge value; determining a voice fragment set corresponding to each speaker according to the community network through a community detection algorithm; and forming a speaking text corresponding to each speaker according to the transcribed text of each voice segment.
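The final server step, forming the speaking text for each speaker from the per-segment transcripts, might look like the following sketch; merging consecutive same-speaker segments into one labeled utterance is an assumed presentation choice, and the field names are illustrative.

```python
def merge_utterances(segments):
    """segments: (speaker, transcribed_text) pairs in time order.
    Consecutive segments from the same speaker are joined into one utterance."""
    merged = []
    for speaker, text in segments:
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return merged

# Per-segment transcripts tagged with the clustered speaker labels.
transcript = [("A", "hello"), ("A", "everyone"), ("B", "hi"), ("A", "ok")]
utterances = merge_utterances(transcript)
```

Each resulting (speaker, text) pair is one labeled block of speaking text, ready to be rendered as a meeting transcript.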
Although the present application has been disclosed above by way of preferred embodiments, they are not intended to limit it. Those skilled in the art can make possible variations and modifications without departing from the spirit and scope of the application; the protection scope of the application should therefore be determined by the scope defined by the claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in computer-readable media, such as random-access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media) such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (10)

1. A speaker clustering method, comprising:
dividing the voice to be processed into a plurality of voice fragments;
acquiring the speaker characteristics of the voice segments;
establishing a community network by taking the voice segments as nodes and the speaker characteristic similarity as an edge value;
and determining a voice segment set corresponding to each speaker according to the community network through a community detection algorithm.
2. The method of claim 1, wherein determining the voice segment set corresponding to each speaker according to the community network through a community detection algorithm comprises:
removing all edges in the community network, and taking each node of the community network as a community;
adding the removed edges back into the community network one at a time; whenever an added edge connects two different communities, merging the two communities, determining the modularity increment of the resulting new community division according to the speaker characteristic similarity, and selecting for merging the two communities that yield the maximum modularity increment, until no merge that increases the modularity can be found;
and selecting, according to the modularity values corresponding to the various community divisions, the community division with the maximum modularity as the voice segment set corresponding to each speaker.
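The procedure in claim 2 amounts to greedy modularity maximization in the spirit of Newman's fast algorithm: start from singleton communities and repeatedly merge the pair whose merge most increases the modularity. The following pure-Python sketch uses a toy similarity matrix as an assumed input in place of real speaker-feature similarities, and favors readability over the incremental bookkeeping a production implementation would use.

```python
def modularity(adj, communities):
    """Weighted modularity Q of a community division.

    adj is a symmetric similarity matrix with zero diagonal;
    communities is a list of lists of node indices.
    """
    two_m = sum(sum(row) for row in adj)          # total edge weight, counted twice
    degree = [sum(row) for row in adj]
    q = 0.0
    for comm in communities:
        for i in comm:
            for j in comm:
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

def greedy_merge(adj):
    """Start from singleton communities; merge while modularity grows."""
    communities = [[i] for i in range(len(adj))]
    while True:
        q = modularity(adj, communities)
        best_gain, best_pair = 0.0, None
        for a in range(len(communities)):
            for b in range(a + 1, len(communities)):
                merged = (communities[:a] + communities[a + 1:b]
                          + communities[b + 1:]
                          + [communities[a] + communities[b]])
                gain = modularity(adj, merged) - q
                if gain > best_gain:
                    best_gain, best_pair = gain, (a, b)
        if best_pair is None:                     # no modularity-increasing merge left
            return communities
        a, b = best_pair
        communities[a] = communities[a] + communities[b]
        del communities[b]

# Two tight pairs of segments joined by weak cross edges.
adj = [[0.0, 1.0, 0.1, 0.0],
       [1.0, 0.0, 0.0, 0.1],
       [0.1, 0.0, 0.0, 1.0],
       [0.0, 0.1, 1.0, 0.0]]
division = greedy_merge(adj)
```

On this toy network, the greedy merges stop at two communities, one per synthetic speaker.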
3. The method of claim 1, further comprising:
and carrying out dimension reduction processing on the speaker characteristics by a uniform manifold approximation and projection algorithm.
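The reduction step in claim 3 would typically use UMAP via the third-party umap-learn package (`umap.UMAP(n_components=...).fit_transform(X)`). Since that dependency is an assumption here, the sketch below uses a pure-NumPy PCA stand-in to illustrate the same reduce-before-clustering shape contract; it is not the patent's stated algorithm.

```python
import numpy as np

def reduce_dim(features, n_components=2):
    """Project speaker features to a lower dimension (PCA via SVD).

    UMAP (umap-learn) would be a drop-in replacement here with the same
    (n_segments, dim) -> (n_segments, n_components) contract.
    """
    X = features - features.mean(axis=0)          # center each dimension
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T                # project onto top components

rng = np.random.default_rng(1)
feats = rng.normal(size=(10, 256))                # 10 segments, 256-dim embeddings
low = reduce_dim(feats, n_components=2)
```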
4. The method of claim 1, further comprising:
and smoothing the speaker characteristics.
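One plausible reading of the smoothing in claim 4 is a moving average over temporally adjacent segments' features; both the moving-average choice and the window size are assumptions for illustration.

```python
import numpy as np

def smooth_features(features, window=3):
    """Moving-average smoothing of per-segment speaker features.

    Each segment's embedding is replaced by the mean of the embeddings
    in a window of temporally adjacent segments, with the window shrunk
    at the boundaries.
    """
    half = window // 2
    out = np.empty_like(features, dtype=float)
    for i in range(len(features)):
        lo, hi = max(0, i - half), min(len(features), i + half + 1)
        out[i] = features[lo:hi].mean(axis=0)
    return out

feats = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
smoothed = smooth_features(feats)
```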
5. The method of claim 1, wherein said obtaining speaker characteristics of said speech segments comprises:
and acquiring the characteristics of the speaker through a speaker characteristic identification network.
6. The method of claim 1, further comprising:
displaying the speaker clustering result obtained through the community detection algorithm, wherein each cluster included in the clustering result corresponds to one speaker, and each point in a cluster corresponds to one voice segment;
and updating the clustering result according to the correction information of the clustering result provided by the user.
7. The method of claim 1, further comprising:
determining speaking time corresponding to each speaker according to the voice segment set corresponding to each speaker;
and displaying, according to a target speaking time, the corresponding speaker and voice segment.
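The first step of claim 7 — totaling segment durations per speaker — can be sketched as follows; representing each segment as a (start, end) pair in seconds is an assumption for illustration.

```python
def speaking_time(clusters, segments):
    """Total speaking time per speaker.

    clusters maps a speaker label to the indices of that speaker's
    segments; segments is a list of (start, end) times in seconds.
    """
    return {
        speaker: sum(segments[i][1] - segments[i][0] for i in idxs)
        for speaker, idxs in clusters.items()
    }

segments = [(0.0, 2.5), (2.5, 4.0), (4.0, 7.0), (7.0, 8.0)]
clusters = {"speaker_1": [0, 2], "speaker_2": [1, 3]}
times = speaking_time(clusters, segments)
```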
8. A speaker clustering apparatus, comprising:
the voice dividing unit is used for dividing the voice to be processed into a plurality of voice fragments;
the speaker characteristic identification unit is used for acquiring the speaker characteristics of the voice segments;
the community network construction unit is used for constructing a community network by taking the voice segments as nodes and the speaker characteristic similarity as an edge value;
and the community detection unit is used for determining the voice segment set corresponding to each speaker according to the community network through a community detection algorithm.
9. An electronic device, comprising:
a processor and a memory;
a memory for storing a program for implementing the speaker clustering method according to any one of claims 1-7, wherein after the device is powered on, the program of the method is read and executed by the processor.
10. A conference transcription system, comprising:
the conference terminal is used for collecting conference voice and sending the conference voice to the server;
the server is used for dividing the conference voice into a plurality of voice segments; acquiring the speaker characteristics of the voice segments; establishing a community network by taking the voice segments as nodes and the speaker characteristic similarity as an edge value; determining, through a community detection algorithm, the voice segment set corresponding to each speaker according to the community network; and forming the speaking text corresponding to each speaker according to the transcribed text of each voice segment.
CN202210028998.0A 2022-01-11 2022-01-11 Speaker clustering method, device and equipment Pending CN114495911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210028998.0A CN114495911A (en) 2022-01-11 2022-01-11 Speaker clustering method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210028998.0A CN114495911A (en) 2022-01-11 2022-01-11 Speaker clustering method, device and equipment

Publications (1)

Publication Number Publication Date
CN114495911A (en) 2022-05-13

Family

ID=81511689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210028998.0A Pending CN114495911A (en) 2022-01-11 2022-01-11 Speaker clustering method, device and equipment

Country Status (1)

Country Link
CN (1) CN114495911A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114822511A (en) * 2022-06-29 2022-07-29 阿里巴巴达摩院(杭州)科技有限公司 Voice detection method, electronic device and computer storage medium


Similar Documents

Publication Publication Date Title
US9881617B2 (en) Blind diarization of recorded calls with arbitrary number of speakers
CN108197282B (en) File data classification method and device, terminal, server and storage medium
US20160217793A1 (en) Acoustic signature building for a speaker from multiple sessions
Phan et al. Spatio-temporal attention pooling for audio scene classification
CN111583906A (en) Role recognition method, device and terminal for voice conversation
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
US20220335950A1 (en) Neural network-based signal processing apparatus, neural network-based signal processing method, and computer-readable storage medium
Flamary et al. Spoken WordCloud: Clustering recurrent patterns in speech
Dey et al. DNN based speaker embedding using content information for text-dependent speaker verification
KR20230175258A (en) End-to-end speaker separation through iterative speaker embedding
Battaglino et al. Acoustic context recognition using local binary pattern codebooks
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN113409827B (en) Voice endpoint detection method and system based on local convolution block attention network
CN114495911A (en) Speaker clustering method, device and equipment
US20190115044A1 (en) Method and device for audio recognition
US11322156B2 (en) Features search and selection techniques for speaker and speech recognition
JP2016162437A (en) Pattern classification device, pattern classification method and pattern classification program
CN114333840A (en) Voice identification method and related device, electronic equipment and storage medium
CN111326161B (en) Voiceprint determining method and device
CN114218428A (en) Audio data clustering method, device, equipment and storage medium
CN108877777B (en) Voice recognition method and system
Gayathri et al. Convolutional Recurrent Neural Networks Based Speech Emotion Recognition
CN112395414A (en) Text classification method and training method, device, medium and equipment of classification model
Gutkin et al. Structural representation of speech for phonetic classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination