CN115050372A - Audio segment clustering method and device, electronic equipment and medium - Google Patents

Audio segment clustering method and device, electronic equipment and medium

Info

Publication number
CN115050372A
Authority
CN
China
Prior art keywords
audio
clustering
speaker
clustering result
segment
Prior art date
Legal status
Pending
Application number
CN202210828411.4A
Other languages
Chinese (zh)
Inventor
王斌
王乾坤
穆维林
杨晶生
Current Assignee
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202210828411.4A
Publication of CN115050372A
Status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/06 Decision making techniques; Pattern matching strategies
    • G10L 17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 17/18 Artificial neural networks; Connectionist approaches
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 25/84 Detection of presence or absence of voice signals for discriminating voice from noise

Abstract

The present disclosure provides an audio segment clustering method and device, an electronic device, and a medium. The clustering method comprises: acquiring a first clustering result corresponding to a first audio segment; acquiring a second audio segment collected in the current sampling period; and parsing the second audio segment according to the first clustering result to obtain a second clustering result. The second audio segment and the first audio segment, which was collected in the previous sampling period, are both taken from the same real-time audio stream, and the second clustering result comprises the identification information of at least one speaker in the second audio segment and the timestamp information corresponding to that identification information. The method performs segmentation and clustering on a streaming speaker audio stream and updates speaker information online in real time, so that each speaker's speaking situation is reflected in real time, improving speaker identification accuracy and user experience.

Description

Audio segment clustering method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for clustering audio segments, an electronic device, and a medium.
Background
Speaker segmentation and clustering, also called speaker diarization (SD), refers to a technique for distinguishing the voices of different speakers according to the speakers' identities; it solves the problem of "who speaks when".
When parsing an audio file, existing approaches generally only support the situation in which different microphones are used for different speakers; that is, different speakers and what each of them says can be identified only when each speaker has a separate microphone or sound-receiving device. If two or more speakers use the same microphone, for example when two or more users speak in succession in a conference room, it is impossible to detect who the current speaker is, including whether the current speaker is newly added or has spoken before. As a result, the audience of an online conference cannot learn who is speaking in real time, which affects user experience and conference quality.
Disclosure of Invention
To solve the above technical problem and improve the accuracy of speaker recognition, the present disclosure provides the following technical solutions:
in a first aspect, an embodiment of the present disclosure provides a method for clustering audio segments, where the method includes the following steps: acquiring a first clustering result corresponding to a first audio segment, wherein the first clustering result comprises identity identification information of at least one speaker in the first audio segment and timestamp information corresponding to the identity identification information; and acquiring a second audio clip acquired in the current sampling period, and analyzing the second audio clip according to the first clustering result to obtain a second clustering result.
The second audio segment and the first audio segment, which was collected in the previous sampling period, are both taken from the same real-time audio stream, and the second clustering result comprises the identification information of at least one speaker in the second audio segment and the timestamp information corresponding to that identification information.
According to the method provided in this aspect, during audio stream generation, the current audio segment is obtained in real time according to a preset sampling period and is aligned with the first clustering result of the audio segment recorded at the previous moment to obtain the second clustering result, so that speaker information is updated online in real time.
With reference to the first aspect, in a possible implementation manner of the first aspect, the current sampling period is shorter than the duration of the second audio segment, and the parsing the second audio segment according to the first clustering result to obtain a second clustering result includes:
acquiring a first sub-segment and a second sub-segment respectively when the current sampling period is shorter than the second audio segment; clustering the first sub-segment according to the first clustering result and the principle of longest speaker speaking duration to obtain a first sub-result; and clustering the second sub-segment according to the first clustering result and the principle of voiceprint feature matching to obtain a second sub-result. The first sub-segment is the audio stream of the portion where the first audio segment and the second audio segment overlap in time, and the second sub-segment is the remainder of the second audio segment excluding the first sub-segment.
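For purposes of illustration only, the following sketch shows one way the split into the first sub-segment and the second sub-segment could be computed, assuming each audio segment is represented simply by its start and end times in seconds; the function and variable names are illustrative and not part of the disclosure.

    # Illustrative sketch: split the second audio segment into the part that overlaps
    # the first segment in time (first sub-segment) and the remainder (second sub-segment).
    # The (start, end) representation and all names are assumptions made for this sketch.

    def split_second_segment(first_start: float, first_end: float,
                             second_start: float, second_end: float):
        """Return (first_sub, second_sub) as (start, end) tuples, or None where a part is empty."""
        overlap_start = max(first_start, second_start)
        overlap_end = min(first_end, second_end)
        if overlap_start >= overlap_end:               # no time overlap at all
            return None, (second_start, second_end)
        first_sub = (overlap_start, overlap_end)       # audio shared with the first segment
        second_sub = (overlap_end, second_end) if overlap_end < second_end else None
        return first_sub, second_sub

    # Example: 300 s segments triggered 60 s apart overlap for 240 s:
    # split_second_segment(0.0, 300.0, 60.0, 360.0) -> ((60.0, 300.0), (300.0, 360.0))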
With reference to the first aspect, in another possible implementation manner of the first aspect, clustering the first sub-segment according to the first clustering result and the principle of longest speaker speaking duration to obtain a first sub-result includes: using the Hungarian algorithm to search for an optimal mapping relationship based on the speaking duration of each speaker in the first clustering result and the identity ID corresponding to that speaker, to obtain the first sub-result; under the optimal mapping relationship, the overlap duration between the clustered audio portions is longest.
With reference to the first aspect, in yet another possible implementation manner of the first aspect, clustering the second sub-segment according to the first clustering result and the voiceprint feature matching principle to obtain a second sub-result includes: acquiring a first voiceprint feature corresponding to the second sub-segment and at least one second voiceprint feature in the first clustering result, where each second voiceprint feature corresponds to one speaker of the first audio segment; comparing the similarity of the first voiceprint feature with the at least one second voiceprint feature; and if a target second voiceprint feature exists among the at least one second voiceprint feature, marking the first voiceprint feature and the target second voiceprint feature as the same speaker to obtain the second sub-result. The similarity between the target second voiceprint feature and the first voiceprint feature is greater than or equal to a threshold.
With reference to the first aspect, in yet another possible implementation manner of the first aspect, the method further includes: if the target second voiceprint feature does not exist, marking the first voiceprint feature as corresponding to a new speaker to obtain the second sub-result, where the new speaker is different from any speaker clustered in the first audio segment.
With reference to the first aspect, in a further possible implementation manner of the first aspect, the method further includes: calculating the second voiceprint feature as the average of the voiceprint vectors of the same speaker over the sampling periods of the real-time audio stream.
With reference to the first aspect, in yet another possible implementation manner of the first aspect, when the current sampling period is greater than or equal to the duration of the second audio segment, parsing the second audio segment according to the first clustering result to obtain a second clustering result includes: clustering the second audio segment according to the first clustering result and the voiceprint feature matching principle to obtain the second clustering result.
With reference to the first aspect, in yet another possible implementation manner of the first aspect, the acquiring a second audio segment acquired in a current sampling period includes: when an end of speech (ASR final) is detected by voice activity detection (VAD), obtaining the second audio segment according to the sampling period.
With reference to the first aspect, in yet another possible implementation manner of the first aspect, the method further includes: updating the first clustering result based on the second clustering result to obtain a third clustering result of the current audio stream, and displaying the third clustering result. The current audio stream includes the first audio segment and the second audio segment, and the third clustering result includes the identification information of at least one speaker within the current audio stream and the timestamp information corresponding to the identification information.
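As a non-limiting illustration of how the steps of the first aspect fit together, the sketch below outlines the streaming update loop in Python; the helper functions (parse_segment, merge_results, display) are hypothetical placeholders for the parsing, updating and displaying operations described above, not a disclosed implementation.

    # Illustrative streaming loop: each new segment is parsed against the previous
    # clustering result, and the stream-level (third) clustering result is updated
    # and displayed online. All helper callables are assumed placeholders.

    def streaming_diarization(audio_stream, sample_segments, parse_segment, merge_results, display):
        """Incrementally cluster a real-time audio stream, one sampled segment at a time."""
        first_result = {}        # clustering result of the previous segment (speaker ID -> timestamps)
        overall_result = {}      # third clustering result covering the whole stream so far
        for second_segment in sample_segments(audio_stream):    # one segment per sampling period
            second_result = parse_segment(second_segment, first_result)    # step 102
            overall_result = merge_results(overall_result, second_result)  # update to third result
            display(overall_result)                                        # show speaker IDs online
            first_result = second_result
        return overall_result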
In a second aspect, an embodiment of the present disclosure further provides an apparatus for clustering audio segments, where the apparatus includes:
an obtaining unit, configured to obtain a first clustering result corresponding to a first audio segment, where the first clustering result includes the identification information of at least one speaker in the first audio segment and the timestamp corresponding to the identification information;
an acquisition unit, configured to acquire a second audio segment collected in the current sampling period;
and the processing unit is used for analyzing the second audio segment according to the first clustering result to obtain a second clustering result.
The second audio segment and the first audio segment, which was collected in the previous sampling period, are both taken from the same real-time audio stream, and the second clustering result comprises the identification information of at least one speaker in the second audio segment and the timestamp information corresponding to that identification information.
Furthermore, the apparatus comprises further functional units or modules for implementing the methods of the various embodiments of the first aspect.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores computer program instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, perform the audio segment clustering method of the first aspect or any implementation manner of the first aspect.
Furthermore, the electronic device may further comprise at least one functional module or unit such as an interface, a transceiver, etc.
In a fourth aspect, the embodiments of the present disclosure further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, may implement the method of the first aspect or any of the implementation manners of the first aspect.
It should be noted that, beneficial effects corresponding to the technical solutions of the various implementation manners of the second aspect to the fourth aspect may refer to the beneficial effects of the foregoing first aspect and the various implementation manners of the first aspect, and are not described again.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic view of a scene of an audio streaming application provided in an embodiment of the present disclosure;
fig. 2 is a flowchart of a method for clustering audio segments according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of acquiring a first audio segment and a second audio segment according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of updating a clustering result of a current audio stream according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating parsing a second audio clip to obtain a second clustering result according to an embodiment of the disclosure;
FIG. 6 is a flow chart for obtaining a second sub-result according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating a method for determining a second sub-result in a second clustering result according to an embodiment of the disclosure;
fig. 8 is a block diagram illustrating a structure of an apparatus for clustering audio segments according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions of the present disclosure will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
Basic AI technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interaction systems, mechatronics, and the like. AI software technologies mainly cover computer vision, robotics, biometric recognition, speech processing, natural language processing, machine learning, deep learning, and other directions.
Biometric recognition is a technology in which a computer identifies a person's identity using physiological or behavioral characteristics. Based on unique, reliable, and stable physiological characteristics of the human body (such as fingerprints, irises, faces, and palm prints) or behavioral characteristics (such as voice, keystrokes, gait, and signature), it uses the capabilities of computers and network technology for image processing and pattern recognition to identify a person. The technology offers good security, reliability, and effectiveness.
Voiceprint Recognition (VPR), also called Speaker Recognition (SR), is one of the biometric recognition technologies; it determines the identity of a speaker from the voice. Because voiceprint recognition is safe, reliable, and convenient, it can be widely applied wherever identity recognition is needed.
It is understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user information related to the present disclosure, such as user identity information, voiceprint information in the user audio stream, etc., should be authorized in a proper manner according to the relevant laws and regulations.
For example, prior to a meeting, the system may send prompt information to the user to explicitly inform the user that the requested operation will require acquiring and using the user's personal information. The user can then autonomously decide, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application, server or storage medium, that performs the operations of the technical solutions of the present disclosure, so that the speaker's identity can be recognized from the acquired voiceprint information. In addition, during authorization of the user's personal information, prompt information may be sent to the user, for example in a pop-up window, where the prompt information may be presented as text.
It should be understood that the above process of notifying the user and obtaining authorization for personal information is only illustrative and does not limit the implementation of the present disclosure; other ways of satisfying the relevant laws and regulations may also be applied to the implementation of the present disclosure.
The audio segment clustering method based on real-time speaker recording provided by the embodiments of this application is applicable to the application environment shown in fig. 1. The application environment 100 shown in fig. 1 includes at least one terminal device, such as the mobile phone 10 and the notebook 20, as well as the server 30; the mobile phone 10, the notebook 20, and the server 30 may be connected through the network 40.
The terminal devices include, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, portable wearable devices, and the like. The server 30 may be implemented as a stand-alone server or as a server cluster comprised of a plurality of servers. Further, the server 30 may be an independent server, or may be a server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
Further, the network connecting between the terminal devices and the server may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. In one example, a user may use a terminal device to communicate with a server over a network to enable the reception and/or transmission of information between the terminal device and the server.
It should be noted that the method provided in the embodiment of the present disclosure may be executed by a server, for example, the server acquires an audio stream reported by each terminal device and then processes the audio stream, or the method may also be executed by a terminal device, for example, a notebook computer collects an audio stream of a speaker in real time and processes the audio stream in real time, and the execution subject of the method is not limited in this embodiment.
It should be understood that the number of the terminal devices, the networks, and the servers in fig. 1 is only illustrative, and any number of the terminal devices, the networks, and the servers may be provided according to implementation requirements, and the terminal devices in the embodiments of the present disclosure may specifically correspond to an application system in actual production.
The disclosed embodiments provide a method for clustering audio segments. Clustering divides a data set into different classes or clusters according to a certain criterion, so that data objects within the same cluster are as similar as possible and data objects in different clusters are as different as possible. That is, after clustering, data of the same class are gathered together as much as possible, and data of different classes are separated as much as possible. In this embodiment, the audio streams of different speakers speaking in real time are clustered.
The present embodiment also involves speech recognition, or Automatic Speech Recognition (ASR), which aims to convert the vocabulary content of human speech into computer-readable input such as keystrokes, binary codes or character sequences. The process includes acoustic feature extraction, acoustic modeling, language processing, and the like.
The embodiments of the present disclosure will be further explained with reference to the drawings.
As shown in fig. 2, it is a flowchart of a method for clustering audio segments provided in an embodiment of the present disclosure, where the method is applicable to the foregoing server or terminal device, and the method specifically includes:
step 101: the method comprises the steps of obtaining a first clustering result corresponding to a first audio segment, wherein the first clustering result comprises identity identification information of at least one speaker in the first audio segment and timestamp information corresponding to the identity identification information.
The first audio clip is a part of the speaker audio stream recorded in real time, and can be obtained by sampling according to a preset sampling period. The audio stream is an audio file generated by collecting the speech of at least one speaker in real time.
Specifically, in the process of generating the audio stream, a microphone converts the speaker's acoustic signal into an electrical signal; the speech to be processed is captured by the microphone and input into a speech detection model, and the outputs of the speech detection model are spliced together to obtain the audio stream. The speech frames in the audio stream contain speech from at least one speaker; for example, in the conference audio stream of a conference room, the audio stream or audio segment collected by a single microphone contains speech from at least one speaker. This embodiment does not limit the specific way in which the audio stream is obtained.
The identification information of at least one speaker in the first clustering result may include, for example, the identity ID of each speaker, also called the "speaker ID", and may further include, but is not limited to, a user name or a customized identity code/number. The timestamp information corresponding to the identification information may refer, for example, to the speaking duration of each speaker in the first audio segment; it is in fact a time period recording the interval from the start time to the end time of each speaker's speech in the first audio segment. There is a mapping relationship between each speaker's identity ID and the timestamp. For example, the first clustering result records that a mapping relationship exists between speaker 1 and a first time interval, between speaker 2 and a second time interval, and so on. In addition, the first clustering result may be obtained by parsing the first audio segment.
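Purely as an illustration of the data involved, the clustering result described above could be held in a structure like the following, where timestamps are assumed to be stored as (start, end) pairs in seconds; the class and field names are not part of the disclosure.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class ClusteringResult:
        """Maps each speaker ID to the time intervals during which that speaker talks."""
        # e.g. {"speaker 1": [(0.0, 12.5), (40.0, 55.0)], "speaker 2": [(12.5, 40.0)]}
        segments: Dict[str, List[Tuple[float, float]]] = field(default_factory=dict)

        def speaking_duration(self, speaker_id: str) -> float:
            """Total speaking time of one speaker, used later when aligning labels by longest overlap."""
            return sum(end - start for start, end in self.segments.get(speaker_id, []))

    # An example first clustering result with two speakers and their timestamps:
    first_result = ClusteringResult({"speaker 1": [(0.0, 12.5)], "speaker 2": [(12.5, 30.0)]})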
It can be understood that, if the first audio segment contains voices of multiple speakers, the clustering result in step 101 may contain identification information of at least one speaker in the first audio segment and corresponding timestamp information thereof; or, the clustering result may also include the identification information of all speakers in the first audio segment and the corresponding timestamp information.
Step 102: and acquiring a second audio clip acquired in the current sampling period, and analyzing the second audio clip according to the first clustering result to obtain a second clustering result.
The second audio segment and the first audio segment, which was collected in the previous sampling period, are both taken from the same real-time audio stream, and the second clustering result comprises the identification information of at least one speaker in the second audio segment and the timestamp information corresponding to that identification information.
It should be noted that the speakers in the first audio segment and the second audio segment may be the same or different.
This embodiment denotes the duration of an audio segment by Ni; for example, the duration of the first audio segment is N1 and the duration of the second audio segment is N2, with N1 = N2. Depending on the sampling period, there are two possible relations between the first audio segment and the second audio segment: first, the sampling period is shorter than the second audio segment's duration N2, in which case there is a time-overlapping portion between N1 and N2; second, the sampling period is greater than or equal to N2, in which case there is no time overlap between N1 and N2. This embodiment mainly discusses the processing of audio streams that have a time-overlapping portion.
As shown in fig. 3, if the sampling period is shorter than the segment duration N, the audio streams N1 and N2 share a time-overlapping portion as well as a non-overlapping portion. Optionally, N = 300 s (seconds).
Optionally, the first audio segment in step 101 and the second audio segment in step 102 may be obtained using a Voice Activity Detection (VAD) algorithm. The purpose of VAD is to examine the input signal and distinguish the speech signal from various background noise signals.
Specifically, whenever the end of speech is detected by VAD, i.e., at each ASR final, the current time is recorded and one round of clustering is triggered to obtain one audio segment; the current time is recorded as t. In this example, the timestamp recorded at the ASR final of the first audio segment is t1 and that of the second audio segment is t2, and part of the audio stream overlaps in time between the two segments.
It should be understood that, when segmenting the audio stream, the durations of the segments cut at the current time t2 and the previous time t1 could differ, i.e., N1 ≠ N2; in this embodiment, however, the speech is segmented with a fixed segmentation duration so that the audio segments have the same duration.
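Under the assumption that each triggered segment covers the N seconds of audio preceding the trigger time, the segment boundaries could be computed as in the sketch below; the VAD/ASR end-point detector itself is abstracted away and the names are illustrative.

    SEGMENT_DURATION_N = 300.0   # fixed segmentation duration in seconds (N = 300 s in the example)

    def segment_bounds(trigger_time_t: float, n: float = SEGMENT_DURATION_N):
        """Time bounds of the audio segment cut when an ASR final is detected at time t."""
        return max(0.0, trigger_time_t - n), trigger_time_t

    # Two consecutive ASR finals at t1 = 280 s and t2 = 340 s give segments (0, 280) and (40, 340);
    # the interval (40, 280) is their time-overlapping portion.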
In step 102, the second audio segment is parsed according to the first clustering result to obtain the second clustering result; the specific procedure may be the same as or different from the method used in step 101 to parse the first audio segment and obtain the first clustering result. For example, based on the speakers identified in the first clustering result and the timestamp corresponding to each speaker, the second audio segment is analyzed with a preset algorithm to identify the speakers in the second audio segment and the timestamp corresponding to each of them. The preset algorithm may be one or more deep learning algorithms based on a neural network model, the Hungarian algorithm, or the like. In addition, across the two audio segments, the processing algorithms for the time-overlapping portion and the non-overlapping portion may be the same or different. It should be understood that the preset algorithm may also be another clustering algorithm, which is not limited in this embodiment.
In addition, the method of the embodiment further includes: and updating the first clustering result based on the second clustering result to obtain a third clustering result of the current audio stream, and displaying the third clustering result.
The duration of the current audio stream differs from that of the first audio segment N1 and of the second audio segment N2; that is, the current audio stream includes both the first audio segment and the second audio segment, and the third clustering result includes the identification information of at least one speaker in the current audio stream and the timestamp information corresponding to that identification information.
In a specific example, fig. 4 illustrates a clustering result of the current audio stream obtained through updating. In the first audio segment N1, the first clustering result yields 4 speakers, namely speaker 1, speaker 2, speaker 3 and speaker 4, represented by speaker ID numbers 1, 2, 3 and 4. The timestamp corresponding to each speaker, shown in fig. 4, represents that speaker's speaking duration, and each speaking duration is composed of one or more feature vectors (embeddings). The embedding window length can be set to 1.5 s, and the sliding-window step of the embeddings is set to 0.75 s. It should be understood that other, larger or smaller, window lengths and step lengths may also be used.
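With the 1.5 s embedding window and 0.75 s step mentioned above, the layout of the embedding windows over a segment could be computed as in the following sketch (the embedding extractor itself is not shown and the names are illustrative):

    def embedding_windows(segment_start: float, segment_end: float,
                          window: float = 1.5, step: float = 0.75):
        """Start/end times of the sliding embedding windows covering one audio segment."""
        windows = []
        t = segment_start
        while t + window <= segment_end:
            windows.append((t, t + window))
            t += step
        return windows

    # e.g. embedding_windows(0.0, 4.5)
    #   -> [(0.0, 1.5), (0.75, 2.25), (1.5, 3.0), (2.25, 3.75), (3.0, 4.5)]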
Based on the first clustering result, the second audio segment N2 is parsed to obtain a second clustering result, which contains speaker 4 and speaker 5 together with the timestamps (speaking durations) of each. The second clustering result is then label-aligned with the first clustering result. Because, within the time-overlapping portion of N1 and N2, the speaking duration attributed to "speaker 4" in the corresponding clustering result is greater than that attributed to "speaker 3", the time-overlapping portion of the second audio segment N2 is labeled "speaker 4" (speaker ID 4) after label alignment. For the non-overlapping audio stream in N2, the obtained second clustering result contains speaker 5, whose identification information differs from that of the 4 speakers in the first clustering result, so the non-overlapping portion of the second audio segment N2 is labeled "speaker 5" (speaker ID 5). The resulting second clustering result thus contains "speaker 4" and "speaker 5" and the timestamp information corresponding to each of them.
A third clustering result is then obtained from the first clustering result and the second clustering result; it comprises both, and where they differ it conforms to the second clustering result. In this example, the current audio stream corresponding to the third clustering result contains 5 speakers, whose identification information is speaker IDs 1 to 5, together with the timestamp information corresponding to each speaker.
It should be noted that, if the two adjacent clustering results disagree, the first clustering result is updated based on the most recent clustering, that is, the second clustering result obtained at the current time t2, to produce the third clustering result.
According to the method provided by this embodiment, during audio stream generation, the current audio segment is obtained in real time according to a preset sampling period and is aligned with the first clustering result of the audio segment recorded at the previous moment to obtain the second clustering result.
In addition, in this method the audio segments are obtained in real time based on detection of the ASR final time point and the clustering result of the audio segments is updated, realizing clustering and processing of streaming speaker audio; compared with offline audio stream processing, this offers better timeliness and higher feedback quality.
The clustering process of the above-mentioned method step 102 is explained in detail below.
Referring to FIG. 5, a flow chart for parsing a second audio segment to obtain a second clustering result is shown. The second clustering result obtained in step 102 includes a first sub-result and a second sub-result, where the first sub-result is a clustering result corresponding to a first sub-segment in the second audio segment, and the second sub-result is a clustering result corresponding to a second sub-segment in the second audio segment. The first sub-segment is an audio stream of a time overlapping portion of the first audio segment and the second audio segment, and the second sub-segment is the remaining audio segment of the second audio segment except the first sub-segment.
In the step 102, the second audio segment is analyzed according to the first clustering result to obtain a second clustering result, and the specific process includes:
step 1021: clustering the first sub-segment according to the longest speaking time of the speaker according to the first clustering result to obtain a first sub-result, and,
step 1022: and clustering the second sub-segments according to the first clustering result and the voiceprint feature matching principle to obtain a second sub-result.
Specifically, step 1021 includes: using the Hungarian algorithm to search for an optimal mapping relationship based on the speaking duration of each speaker in the first clustering result and the identity ID corresponding to that speaker, to obtain the first sub-result; under the optimal mapping relationship, the overlap duration between the clustered audio portions is longest.
The Hungarian algorithm is a combinatorial optimization algorithm that solves assignment problems in polynomial time; here it is used to find a maximum matching.
The mapping relationship maps the speaker IDs of the previous (historical) sampling period to their embeddings; optionally, this mapping is denoted "M0", and all embeddings corresponding to each speaker ID in the previous audio segment are recorded in M0. For convenience of illustration, this embodiment refers to the total of all embedding durations belonging to one speaker ID as a "time block" (chunk). In the example shown in fig. 4, the first audio segment N1 is parsed with the speaker clustering algorithm SD, and the first clustering result S1 is: speaker ID1-chunk 1, speaker ID2-chunk 2, speaker ID3-chunk 3 and speaker ID4-chunk 4, i.e., 4 mappings corresponding to the identity information and timestamp information of 4 speakers.
Similarly, the second audio segment N2 is parsed by the SD algorithm to obtain the second clustering result S2: speaker ID4-chunk 5 and speaker ID5-chunk 6, i.e., two mappings corresponding to two speakers. Then, based on the audio stream of the time-overlapping portion, the speaker IDs and timestamp information of S1 and S2 are label-aligned using the Hungarian algorithm, i.e., by finding the optimal mapping relationship that maximizes the duration of the audio overlapped between S1 and S2. Because the overlap duration of speaker ID4 in S2 is longer than that of speaker ID3 in S1, the mapping of the first sub-result obtained by this duration-optimal matching is: speaker ID4-chunk 5.
In this embodiment, for audio segments that have a time-overlapping portion, the clustering result of the overlapping audio stream is obtained by the duration-optimal matching mapping described above; the clustering result obtained in this way has higher accuracy and takes little time.
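A minimal sketch of this duration-based label alignment is given below, using the Hungarian algorithm as implemented in scipy; it assumes each speaker's timestamps are available as (start, end) intervals, which is an assumption of the sketch rather than the disclosed data format.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def interval_overlap(a, b):
        """Overlap duration in seconds between two (start, end) intervals."""
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    def align_labels(prev_result, curr_result):
        """Map each speaker ID of the current result (S2) to the previous-result speaker ID (S1)
        whose audio overlaps it for the longest total duration."""
        prev_ids, curr_ids = list(prev_result), list(curr_result)
        overlap = np.zeros((len(curr_ids), len(prev_ids)))
        for i, c in enumerate(curr_ids):
            for j, p in enumerate(prev_ids):
                overlap[i, j] = sum(interval_overlap(ci, pi)
                                    for ci in curr_result[c] for pi in prev_result[p])
        # The Hungarian algorithm minimizes cost, so negate the overlap to maximize it.
        rows, cols = linear_sum_assignment(-overlap)
        return {curr_ids[i]: prev_ids[j] for i, j in zip(rows, cols) if overlap[i, j] > 0}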
Further, in the above step 1022, clustering the second sub-segment according to the first clustering result and the voiceprint feature matching principle to obtain the second sub-result, as shown in fig. 6, specifically includes:
step 301: and acquiring a first voiceprint characteristic corresponding to the second sub-segment and at least one second voiceprint characteristic in the first clustering result, wherein each second voiceprint characteristic corresponds to a speaker of the first audio segment.
Specifically, one possible way to extract the voiceprint features is to perform acoustic feature extraction on the second sub-segment using a feature model, where the acoustic features include voiceprint features, yielding the first voiceprint feature. Optionally, the first voiceprint feature can be represented by a feature vector. Similarly, acoustic feature extraction is performed on the first audio segment to obtain the second voiceprint feature, which includes at least one feature vector.
It should be noted that methods for extracting the voiceprint feature include, but are not limited to, feature extraction based on Mel-Frequency Cepstral Coefficients (MFCC). Because MFCC is a cepstrum-based extraction method, MFCC features fit the characteristics of human hearing well and give good speech feature extraction results. MFCC feature extraction is well known to those skilled in the art and is not described in detail here.
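As a sketch of MFCC-based voiceprint feature extraction, the following uses the librosa library; the library choice, the 16 kHz sample rate and the averaging over frames are assumptions of this illustration, not requirements of the disclosure.

    import librosa
    import numpy as np

    def mfcc_voiceprint(wav_path: str, n_mfcc: int = 20) -> np.ndarray:
        """Return a single MFCC-based feature vector for an audio clip by averaging over frames."""
        y, sr = librosa.load(wav_path, sr=16000)                  # load audio resampled to 16 kHz
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, n_frames)
        return mfcc.mean(axis=1)                                  # average over time frames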
Step 302: and carrying out similarity comparison on the first voiceprint feature and the at least one second voiceprint feature.
Specifically, it is determined whether a target second voiceprint feature exists among the at least one second voiceprint feature, where the similarity between the target second voiceprint feature and the first voiceprint feature is greater than or equal to a threshold.
Step 303: and if the target second voiceprint feature exists, marking the first voiceprint feature and the target second voiceprint feature as the same speaker to obtain the second sub-result.
Specifically, if a second voiceprint feature exists in at least one second voiceprint feature, and the similarity between the second voiceprint feature and the first voiceprint feature is greater than or equal to a threshold value, the second voiceprint feature is taken as a target second voiceprint feature, and the speaker ID corresponding to the target second voiceprint feature is the speaker ID of the first voiceprint feature.
For example, as shown in fig. 7, if in the second audio segment N2 it is determined that the first voiceprint feature of the non-overlapping audio stream (the second sub-segment) is closest to the second voiceprint feature of speaker ID3 in the first audio segment N1, the second sub-segment is considered to come from the same person as speaker ID3 in chunk 3, and the second sub-segment is marked as speaker ID3-chunk 5, yielding the second sub-result. The voiceprint feature of chunk 3 is the average of all feature vectors (embeddings) contained in chunk 3, where each embedding yields one voiceprint feature; similarly, the voiceprint features of the other chunks are obtained by averaging their embeddings.
In addition, the method further comprises: if the target second voiceprint feature does not exist, marking the first voiceprint feature as corresponding to a new speaker to obtain the second sub-result, where the new speaker is different from any speaker clustered in the first audio segment.
For example, the second clustering result S2 contains speaker ID5 and its corresponding timestamp information; that is, the number of speakers obtained by clustering in S2 is greater than the number of speakers in S1, and speaker ID5 is not among the speaker IDs 1-4 clustered in the first clustering result S1. In other words, none of the second voiceprint features is similar to the first voiceprint feature, or the similarity is below the threshold; in this case, speaker ID5 is marked as a new speaker ID, and the second sub-result is speaker ID5-chunk 6.
In another possible implementation of step 301, the at least one second voiceprint feature may be extracted by calculating each second voiceprint feature as the average of the voiceprint vectors (e.g., embeddings) of the same speaker over the sampling periods of the real-time audio stream; each second voiceprint feature is then compared with the first voiceprint feature for similarity. The first audio segment contains M embeddings, each obtained by periodically sliding a window of a preset length, where M is a positive integer greater than or equal to 1.
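The match-or-new-speaker decision of steps 302 and 303 can be sketched as below, assuming each voiceprint feature is a vector (the mean of a speaker's embeddings, as described above) and using cosine similarity; the threshold value is an illustrative assumption.

    import numpy as np

    SIMILARITY_THRESHOLD = 0.75   # assumed value; the disclosure only requires some threshold

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def assign_speaker(first_voiceprint: np.ndarray, second_voiceprints: dict) -> str:
        """Return the existing speaker ID whose mean-embedding voiceprint is most similar to the
        new sub-segment, or a new speaker ID if no similarity reaches the threshold."""
        best_id, best_sim = None, -1.0
        for speaker_id, voiceprint in second_voiceprints.items():
            sim = cosine_similarity(first_voiceprint, voiceprint)
            if sim > best_sim:
                best_id, best_sim = speaker_id, sim
        if best_id is not None and best_sim >= SIMILARITY_THRESHOLD:
            return best_id                                   # same speaker as an existing cluster
        return f"speaker {len(second_voiceprints) + 1}"      # a new speaker not seen before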
In this embodiment, for the audio streams of the time non-overlapping portions, voiceprint feature extraction and similarity comparison are used to determine whether the speakers in the current clustering result are the same as those clustered in the previous sampling period or are newly added speakers, so that the non-overlapping audio streams can be labeled accurately and the labels of the two audio segments can be aligned.
Optionally, in another embodiment, if the current sampling period is greater than or equal to the duration of the second audio segment, there is no temporally overlapping audio between the first audio segment and the second audio segment, and the parsing of the second audio segment according to the first clustering result in step 102 to obtain the second clustering result includes: clustering the second audio segment according to the first clustering result and the voiceprint feature matching principle to obtain the second clustering result.
Further, the principle of voiceprint matching is the same as the aforementioned method for analyzing the second sub-segment to obtain the clustering result, and the specific process may refer to the description of steps 301 to 303 in the above embodiment, which is not described herein again.
Optionally, in the above embodiment, displaying the third clustering result corresponding to the current audio stream specifically includes displaying the speaker IDs. In the example shown in fig. 4, after the second clustering result is used to update and obtain the third clustering result, the speaker ID5 corresponding to the current speaker is displayed in the text description of the conference window. Thus, when speakers 1 to 5 use the same microphone in the same conference scene, the method of this embodiment can display the clustered speaker IDs of the different speakers in real time, achieving identification of different speakers behind the same microphone and improving audio recording quality and user experience.
Optionally, in another embodiment, the method further includes obtaining a clustering result of historical audio segments, where the historical audio segments include the first audio segment or further include more historical audio segments. In the example shown in fig. 3, audio is recorded starting from t = 0, and clustering is triggered every time an ASR end is detected using VAD, producing a clustering result.
If a clustering operation is triggered at some time t but the recorded duration is less than N (for example, N = 300 s while the current clustering time span is 240 s, and 240 s < 300 s), then all embeddings collected over the whole historical period (240 s) are clustered to obtain a clustering result, e.g., S0. Once the historical duration is greater than or equal to 300 s, the method flow shown in fig. 2 or fig. 5 can be executed; for the specific process, refer to the description of the foregoing method embodiments, which is not repeated here.
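A sketch of this bootstrap branch follows, assuming the accumulated embeddings are clustered in one pass with an off-the-shelf agglomerative clustering while the recorded history is still shorter than N; the clustering backend and its parameters are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    SEGMENT_DURATION_N = 300.0   # seconds

    def cluster_history(embeddings: np.ndarray, recorded_seconds: float,
                        distance_threshold: float = 1.0):
        """While less than N seconds have been recorded, cluster all historical embeddings at once
        (result S0); afterwards the incremental per-segment flow of fig. 2 / fig. 5 applies instead."""
        if recorded_seconds >= SEGMENT_DURATION_N:
            return None   # handled by the incremental clustering flow, not this bootstrap step
        clusterer = AgglomerativeClustering(n_clusters=None,
                                            distance_threshold=distance_threshold,
                                            linkage="average")
        return clusterer.fit_predict(embeddings)   # one cluster label (speaker) per embedding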
In the embodiment of the present disclosure, a device for clustering audio segments is further provided, where the device corresponds to the method for clustering audio segments in the above embodiment one to one, and is used to implement the method for clustering audio segments in the real-time voice recording process. Specifically, as shown in fig. 8, the apparatus includes the following modules:
the obtaining unit 801 is configured to obtain a first clustering result corresponding to the first audio segment, where the first clustering result includes identification information of at least one speaker in the first audio segment, and a timestamp corresponding to the identification information.
The acquiring unit 802 is configured to acquire a second audio segment acquired in a current sampling period.
The processing unit 803 is configured to parse the second audio segment according to the first clustering result to obtain a second clustering result.
The second audio segment and the first audio segment, which was collected in the previous sampling period, are both taken from the same real-time audio stream, and the second clustering result comprises the identification information of at least one speaker in the second audio segment and the timestamp information corresponding to that identification information.
It should be understood that the above-mentioned apparatus may also include other more or less units/modules, such as a sending unit, a calculating unit, a storage unit, etc., and the present embodiment does not limit the structure of the apparatus.
Optionally, in a specific embodiment, the current sampling period is smaller than the second audio piece,
the acquiring unit 802 is further configured to acquire a first sub-segment and a second sub-segment respectively according to that the current sampling period is smaller than the second audio segment, where the first sub-segment is an audio stream of a time overlapping portion of the first audio segment and the second audio segment, and the second sub-segment is a remaining audio segment of the second audio segment excluding the first sub-segment.
The processing unit 803 is specifically configured to cluster the first sub-segment according to the first clustering result and the longest speaking time of the speaker to obtain a first sub-result; and clustering the second sub-segment according to the first clustering result and the voiceprint feature matching principle to obtain a second sub-result.
Further, the processing unit 803 is specifically configured to use the Hungarian algorithm to search for an optimal mapping relationship based on the speaking duration of each speaker in the first clustering result and the identity ID corresponding to that speaker, so as to obtain the first sub-result; under the optimal mapping relationship, the overlap duration between the clustered audio portions is longest.
Optionally, in another specific embodiment, the processing unit 803 is further specifically configured to obtain a first voiceprint feature corresponding to the second sub-segment and at least one second voiceprint feature in the first clustering result, where each second voiceprint feature corresponds to a speaker of the first audio segment; comparing the similarity of the first voiceprint feature with the similarity of the at least one second voiceprint feature; and if the target second voiceprint feature exists in the at least one second voiceprint feature, marking the first voiceprint feature and the target second voiceprint feature as the same speaker to obtain a second sub-result. And the similarity between the target second voiceprint feature and the first voiceprint feature is greater than or equal to a threshold value.
Optionally, in another specific embodiment, the processing unit 803 is further specifically configured to, if the target second voiceprint feature does not exist, mark the first voiceprint feature as corresponding to a new speaker, which is different from all speakers clustered in the first audio segment, to obtain the second sub-result.
Optionally, the apparatus further includes a calculating unit, not shown in fig. 8, configured to calculate an average value based on the voiceprint vectors of the same speaker in each sampling period of the real-time audio stream before performing the similarity comparison, so as to obtain a second voiceprint feature.
Optionally, in another specific embodiment, the acquiring unit 802 is further configured to acquire the second audio segment according to the sampling period when an end of speech is detected by voice activity detection (VAD).
Optionally, in another specific embodiment, the apparatus further comprises a display unit.
The processing unit 803 is further configured to update the first clustering result based on the second clustering result, obtain a third clustering result of the current audio stream, and display the third clustering result through the display unit. The current audio stream includes the first audio segment and the second audio segment, and the third clustering result includes the identification information of at least one speaker in the current audio stream and the timestamp information corresponding to the identification information.
It is noted that, in the application, relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises" and "comprising," and any variations thereof, in the above modules/units are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and the division of the modules presented in this application is merely a logical division and may be implemented in a practical application in another division.
For the specific definition of the audio segment clustering device, reference may be made to the above definition of the audio segment clustering method, which is not described herein again. The modules in the audio segment clustering device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the electronic device, or can be stored in a memory in the electronic device in a software form, so that the processor can call and execute operations corresponding to the modules.
In another embodiment of the present application, an electronic device is further provided, as shown in fig. 9, the electronic device includes a processor 901, a memory 902, and at least one interface 903, and the processor 901, the memory 902, and the at least one interface 903 may be connected through a bus.
The memory 902 stores a computer program that can be executed on the processor 901, and the processor 901, when executing the computer program, can implement the steps of the method for clustering audio segments in the above embodiments, such as the steps 101 to 102 shown in fig. 2 and other extensions of the method and related steps. Alternatively, when the processor 901 executes a computer program, the functions of the modules/units of the audio segment clustering device in the above embodiments, such as all or part of the functions of the acquiring unit 801, the acquiring unit 802, and the processing unit 803 shown in fig. 8, can be realized. To avoid repetition, further description is omitted here.
Further, the processor 901 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor 901 is the control center of the electronic device and connects the various parts of the whole electronic device by using various interfaces and lines.
The memory 902 can be used for storing the computer programs and/or modules, and the processor 901 can implement various functions of the electronic device by running or executing the computer programs and/or modules stored in the memory 902 and calling the data stored in the memory. In addition, the memory 902 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, video data) created according to the use of the cellular phone, and the like.
The memory 902 may be integrated in the processor 901, or may be provided separately from the processor 901.
The at least one interface 903 includes a communication interface and an input/output interface, for example a USB interface. The communication interface is used to enable the electronic device to communicate with other devices, such as a server or a terminal device; the input/output interface is used to connect external devices, such as a display or display screen, a mouse, a keyboard, a microphone, a radio, and the like.
It should be understood that the electronic device in this embodiment may also include more or fewer components, for example at least one sensor.
In an embodiment, the present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program can implement the steps of the audio segment clustering method in the above embodiments, such as the steps of the methods shown in fig. 2, 5, or 6 described above and their other extensions and related steps. Alternatively, when executed by a processor, the computer program implements the functions of the modules/units of the audio segment clustering device in the above embodiments.
In addition, all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium and which, when executed, includes the processes of the method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory.
Further, the non-volatile memory may include Read-Only Memory (ROM), Programmable ROM (PROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present disclosure and are intended to be included within the scope of the present disclosure.

Claims (12)

1. A method for clustering audio segments, the method comprising:
acquiring a first clustering result corresponding to a first audio segment, wherein the first clustering result comprises identity identification information of at least one speaker in the first audio segment and timestamp information corresponding to the identity identification information;
acquiring a second audio segment collected in the current sampling period, and parsing the second audio segment according to the first clustering result to obtain a second clustering result;
wherein the second audio segment and the first audio segment collected in the previous sampling period are both extracted from the same real-time audio stream, and the second clustering result comprises identity identification information of at least one speaker in the second audio segment and timestamp information corresponding to the identity identification information.
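By way of a non-limiting illustration outside the claim language, the clustering result recited in claim 1 can be thought of as a list of speaker-labeled time spans. The Python sketch below shows one possible layout under that assumption; the names SpeakerSegment and ClusteringResult are illustrative and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SpeakerSegment:
    """One labeled span: a speaker identity plus its timestamp information."""
    speaker_id: str   # identity identification information of the speaker
    start: float      # start time in seconds, relative to the real-time audio stream
    end: float        # end time in seconds

@dataclass
class ClusteringResult:
    """Clustering result for one audio segment: all speakers and their time spans."""
    segments: List[SpeakerSegment] = field(default_factory=list)

    def speakers(self) -> set:
        return {s.speaker_id for s in self.segments}

# Illustrative first clustering result for a first audio segment covering 0-10 s.
first_result = ClusteringResult([
    SpeakerSegment("spk_1", 0.0, 4.2),
    SpeakerSegment("spk_2", 4.2, 10.0),
])
```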
2. The method of claim 1, wherein, when the current sampling period is shorter than the duration of the second audio segment, parsing the second audio segment according to the first clustering result to obtain the second clustering result comprises:
acquiring a first sub-segment and a second sub-segment based on the current sampling period being shorter than the duration of the second audio segment, wherein the first sub-segment is the portion of audio in which the first audio segment and the second audio segment overlap in time, and the second sub-segment is the remaining portion of the second audio segment excluding the first sub-segment;
clustering the first sub-segment according to the first clustering result and a longest-speaking-duration principle to obtain a first sub-result; and
clustering the second sub-segment according to the first clustering result and a voiceprint feature matching principle to obtain a second sub-result.
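A minimal sketch of the splitting step recited in claim 2, assuming each audio segment is identified by its start and end times within the real-time stream; the function name and boundary handling are assumptions for illustration, not taken from the patent.

```python
def split_second_segment(first_start, first_end, second_start, second_end):
    """Split the second audio segment into the part that overlaps the first
    segment in time (first sub-segment) and the remainder (second sub-segment).
    Applies when the sampling period is shorter than the second segment, so the
    two segments overlap. Times are seconds relative to the audio stream."""
    overlap_start = max(first_start, second_start)
    overlap_end = min(first_end, second_end)
    if overlap_start >= overlap_end:
        # No overlap: the whole second segment is newly captured audio.
        return None, (second_start, second_end)
    first_sub = (overlap_start, overlap_end)   # time-overlapping part
    second_sub = (overlap_end, second_end)     # remaining, newly captured part
    return first_sub, second_sub

# Example: first segment covers 0-10 s, second segment covers 6-16 s;
# the first sub-segment is then 6-10 s and the second sub-segment 10-16 s.
print(split_second_segment(0.0, 10.0, 6.0, 16.0))
```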
3. The method of claim 2, wherein clustering the first sub-segment according to the first clustering result and the longest-speaking-duration principle to obtain the first sub-result comprises:
searching, by using the Hungarian algorithm, for an optimal mapping relation according to the speaking duration of each speaker in the first clustering result and the identity ID corresponding to that speaker, to obtain the first sub-result, wherein under the optimal mapping relation the overlapping duration of the clustered audio portions is longest.
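Claim 3 names the Hungarian algorithm. As a hedged illustration, the optimal mapping could be computed with scipy.optimize.linear_sum_assignment over a matrix of overlapping speaking durations between fresh cluster labels and known speakers; the matrix values and names below are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters_to_speakers(overlap_seconds: np.ndarray) -> dict:
    """Map fresh cluster labels (rows) to known speaker identities (columns)
    so that the total overlapping speaking duration is maximized.
    overlap_seconds[i, j] is how long cluster i and speaker j are both
    active within the first sub-segment."""
    # linear_sum_assignment minimizes cost, so negate to maximize overlap.
    rows, cols = linear_sum_assignment(-overlap_seconds)
    return dict(zip(rows.tolist(), cols.tolist()))

# Illustrative overlap matrix: 2 fresh clusters vs. 3 known speakers.
overlap = np.array([
    [3.5, 0.2, 0.0],   # cluster 0 overlaps speaker 0 for 3.5 s
    [0.1, 0.0, 2.8],   # cluster 1 overlaps speaker 2 for 2.8 s
])
print(match_clusters_to_speakers(overlap))  # {0: 0, 1: 2}
```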
4. The method of claim 2, wherein clustering the second sub-segment according to the first clustering result and the voiceprint feature matching principle to obtain the second sub-result comprises:
acquiring a first voiceprint feature corresponding to the second sub-segment and at least one second voiceprint feature in the first clustering result, wherein each second voiceprint feature corresponds to a speaker of the first audio segment;
comparing the similarity between the first voiceprint feature and each of the at least one second voiceprint feature; and
if a target second voiceprint feature whose similarity to the first voiceprint feature is greater than or equal to a threshold exists in the at least one second voiceprint feature, marking the first voiceprint feature and the target second voiceprint feature as the same speaker to obtain the second sub-result.
5. The method of claim 4, further comprising:
if the target second voiceprint feature does not exist, marking the first voiceprint feature as corresponding to a new speaker to obtain the second sub-result, wherein the new speaker is different from any speaker in the clustering result of the first audio segment.
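A sketch of the matching logic of claims 4 and 5 taken together, assuming cosine similarity between voiceprint vectors; the threshold value of 0.75 and the new-speaker naming scheme are illustrative assumptions, not values given in the patent.

```python
import numpy as np

def assign_speaker(query_vec: np.ndarray, speaker_vecs: dict, threshold: float = 0.75) -> str:
    """Label the second sub-segment's voiceprint (query_vec) with an existing
    speaker if its cosine similarity to that speaker's stored voiceprint is at
    least `threshold`; otherwise treat it as a new speaker.
    `speaker_vecs` maps speaker id -> stored (second) voiceprint vector."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_id, best_sim = None, -1.0
    for spk_id, vec in speaker_vecs.items():
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_id, best_sim = spk_id, sim

    if best_id is not None and best_sim >= threshold:
        return best_id                        # matched an existing speaker (claim 4)
    return f"spk_{len(speaker_vecs) + 1}"     # no match: a new speaker (claim 5)
```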
6. The method of claim 4, further comprising:
calculating each second voiceprint feature as the average of the voiceprint vectors of the same speaker over the sampling periods of the real-time audio stream.
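A sketch of the per-speaker averaging described in claim 6, assuming the stored voiceprint is kept as a running mean updated once per sampling period; the class name is illustrative.

```python
import numpy as np

class SpeakerVoiceprint:
    """Per-speaker voiceprint maintained as the running mean of the voiceprint
    vectors extracted for that speaker in each sampling period."""
    def __init__(self, dim: int):
        self.vec_sum = np.zeros(dim, dtype=np.float64)
        self.count = 0

    def update(self, vec: np.ndarray) -> np.ndarray:
        """Add one sampling period's voiceprint vector and return the new mean."""
        self.vec_sum += vec
        self.count += 1
        return self.mean()

    def mean(self) -> np.ndarray:
        return self.vec_sum / max(self.count, 1)
```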
7. The method of claim 1, wherein, when the current sampling period is greater than or equal to the duration of the second audio segment, parsing the second audio segment according to the first clustering result to obtain the second clustering result comprises:
clustering the second audio segment according to the first clustering result and a voiceprint feature matching principle to obtain the second clustering result.
8. The method according to any one of claims 1-7, wherein acquiring the second audio segment collected in the current sampling period comprises:
when an end of speech is detected by voice activity detection (VAD), acquiring the second audio segment according to the sampling period.
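Claim 8 refers to voice activity detection only generically. The sketch below stands in a simple frame-energy threshold for a real VAD, purely to illustrate cutting a segment where an end of speech is detected; the frame length and thresholds are arbitrary assumptions.

```python
import numpy as np

def segments_on_voice_end(samples: np.ndarray, sample_rate: int, frame_ms: int = 30,
                          energy_thresh: float = 1e-4, min_silence_frames: int = 10):
    """Yield audio segments that end where speech ends, using mean frame energy
    as a stand-in for a real VAD. `samples` is a 1-D float array in [-1, 1]."""
    frame_len = int(sample_rate * frame_ms / 1000)
    seg_start, silence, in_speech = 0, 0, False
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        voiced = float(np.mean(frame ** 2)) > energy_thresh
        if voiced:
            in_speech, silence = True, 0
        elif in_speech:
            silence += 1
            if silence >= min_silence_frames:
                # End of speech detected: emit the segment captured so far.
                yield samples[seg_start:i + frame_len]
                seg_start, in_speech, silence = i + frame_len, False, 0
    if in_speech:
        yield samples[seg_start:]
```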
9. The method according to any one of claims 1-7, further comprising:
updating the first clustering result based on the second clustering result to obtain a third clustering result of the current audio stream, and displaying the third clustering result; wherein the current audio stream comprises the first audio segment and the second audio segment, and the third clustering result comprises identity identification information of at least one speaker within the current audio stream and timestamp information corresponding to the identity identification information.
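A sketch of the update step in claim 9, reusing the illustrative SpeakerSegment/ClusteringResult layout shown after claim 1; the rule that labels from the re-analyzed region override the overlapping part of the earlier result is an assumption about how the merge could be done.

```python
def update_clustering(first_result: "ClusteringResult",
                      second_result: "ClusteringResult") -> "ClusteringResult":
    """Merge the second clustering result into the first to obtain the third
    clustering result covering the whole audio stream captured so far."""
    if not second_result.segments:
        return first_result
    second_start = min(s.start for s in second_result.segments)
    merged = [s for s in first_result.segments if s.end <= second_start]
    for s in first_result.segments:
        if s.start < second_start < s.end:
            # Truncate a first-result span that runs into the re-analyzed region.
            merged.append(SpeakerSegment(s.speaker_id, s.start, second_start))
    merged.extend(second_result.segments)
    merged.sort(key=lambda s: s.start)
    return ClusteringResult(merged)
```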
10. An apparatus for clustering audio segments, the apparatus comprising:
an obtaining unit, configured to acquire a first clustering result corresponding to a first audio segment, wherein the first clustering result comprises identity identification information of at least one speaker in the first audio segment and timestamp information corresponding to the identity identification information;
a collecting unit, configured to acquire a second audio segment collected in the current sampling period; and
a processing unit, configured to parse the second audio segment according to the first clustering result to obtain a second clustering result;
wherein the second audio segment and the first audio segment collected in the previous sampling period are both extracted from the same real-time audio stream, and the second clustering result comprises identity identification information of at least one speaker in the second audio segment and timestamp information corresponding to the identity identification information.
11. An electronic device comprising a memory and a processor, the memory and the processor coupled;
the memory to store computer program instructions;
the computer program instructions, when read and executed by the processor, implement the method of any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium is used for storing a computer program;
the computer program, when executed by a computer, implements the method of any one of claims 1 to 9.
CN202210828411.4A 2022-07-13 2022-07-13 Audio segment clustering method and device, electronic equipment and medium Pending CN115050372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210828411.4A CN115050372A (en) 2022-07-13 2022-07-13 Audio segment clustering method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210828411.4A CN115050372A (en) 2022-07-13 2022-07-13 Audio segment clustering method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN115050372A true CN115050372A (en) 2022-09-13

Family

ID=83165342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210828411.4A Pending CN115050372A (en) 2022-07-13 2022-07-13 Audio segment clustering method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115050372A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117079667A (en) * 2023-10-16 2023-11-17 华南师范大学 Scene classification method, device, equipment and readable storage medium
CN117079667B (en) * 2023-10-16 2023-12-22 华南师范大学 Scene classification method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN107928673B (en) Audio signal processing method, audio signal processing apparatus, storage medium, and computer device
US10706873B2 (en) Real-time speaker state analytics platform
CN110136727B (en) Speaker identification method, device and storage medium based on speaking content
CN112435684B (en) Voice separation method and device, computer equipment and storage medium
CN111145782B (en) Overlapped speech recognition method, device, computer equipment and storage medium
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
CN108492830B (en) Voiceprint recognition method and device, computer equipment and storage medium
CN110505504B (en) Video program processing method and device, computer equipment and storage medium
CN111797632B (en) Information processing method and device and electronic equipment
CN111105782A (en) Session interaction processing method and device, computer equipment and storage medium
CN113035202B (en) Identity recognition method and device
CN111739539A (en) Method, device and storage medium for determining number of speakers
US20180308501A1 (en) Multi speaker attribution using personal grammar detection
CN108877779B (en) Method and device for detecting voice tail point
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN113488024A (en) Semantic recognition-based telephone interruption recognition method and system
CN113851136A (en) Clustering-based speaker recognition method, device, equipment and storage medium
CN109065026B (en) Recording control method and device
CN115050372A (en) Audio segment clustering method and device, electronic equipment and medium
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
CN109842805B (en) Video viewpoint generation method and device, computer equipment and storage medium
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN112466287B (en) Voice segmentation method, device and computer readable storage medium
CN114138960A (en) User intention identification method, device, equipment and medium
CN114329042A (en) Data processing method, device, equipment, storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination