CN111462759B - Speaker labeling method, device, equipment and storage medium
- Publication number
- CN111462759B (Application No. CN202010249826.7A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- voice data
- probability
- labeled
- marked
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Stereophonic System (AREA)
Abstract
The application provides a speaker labeling method, device, equipment and storage medium. The speaker labeling method comprises: acquiring acoustic features of voice data to be labeled; and labeling the speaker of the voice data to be labeled at least according to the acoustic features of the voice data to be labeled and the features of the speakers that have appeared in the labeled voice data, wherein the features of the speakers that have appeared in the labeled voice data are determined based on the association, learned during speaker labeling of the labeled voice data, between each speaker and the acoustic features of that speaker's voice data. With this method, speaker labeling of voice data can be realized while ensuring high speaker labeling accuracy.
Description
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a speaker labeling method, device, equipment, and storage medium.
Background
With the rapid development of artificial neural networks in the field of speaker recognition, the demand for voice data with speaker labels is becoming urgent. Such data can be used to optimize speaker recognition models, which is of great significance for improving speaker recognition performance.
At present, a large amount of voice data is accumulated when artificial intelligence technology provides voice services for users. This voice data can be used to optimize a speaker recognition model, but speaker labels for the accumulated voice data cannot be obtained directly and can only be added afterwards. How to realize speaker labeling of voice data is therefore a practical need in the field of speaker recognition.
Disclosure of Invention
Based on the above requirements, the embodiments of the present application provide a speaker labeling method, device, equipment and storage medium, which at least can implement speaker labeling on voice data.
A speaker annotation method comprising:
acquiring acoustic characteristics of voice data to be marked;
labeling the speaker of the voice data to be labeled according to at least the acoustic characteristics of the voice data to be labeled and the characteristics of the speaker appearing in the labeled voice data;
the characteristic of the speaker appearing in the marked voice data is determined based on the association relation between the speaker learned in the speaker marking process of the marked voice data and the acoustic characteristic of the voice data.
A speaker annotation device comprising:
The characteristic extraction unit is used for acquiring acoustic characteristics of voice data to be marked;
the speaker labeling unit is used for labeling the speaker of the voice data to be labeled at least according to the acoustic characteristics of the voice data to be labeled and the characteristics of the speaker appearing in the labeled voice data;
the characteristic of the speaker appearing in the marked voice data is determined based on the association relation between the speaker learned in the speaker marking process of the marked voice data and the acoustic characteristic of the voice data.
A speaker annotation device comprising:
a memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor implements the speaker labeling method described above by executing the program stored in the memory.
A storage medium having a computer program stored thereon, which when executed by a processor, implements the speaker tagging method described above.
The speaker labeling method provided by the present application can label the speaker of voice data to be labeled according to the acoustic features of the voice data to be labeled and the features of the speakers that have appeared in the labeled voice data. Moreover, the features of the speakers that have appeared in the labeled voice data, as used in the above scheme, are determined based on the association, learned in the process of labeling the labeled voice data, between each speaker and the acoustic features of that speaker's voice data.
It can be understood that, in the process of labeling speakers for voice data, the method continuously learns the association between a speaker and the acoustic features of that speaker's voice data, and determines the speaker's features based on the learned association. As more and more of a speaker's voice data is labeled, this association is learned ever more comprehensively and deeply, so that the influence of factors such as environment, channel and emotion on the speaker's features can be counteracted and the speaker's features can be captured ever more accurately.
After the features of the speakers that have appeared in the labeled voice data are determined in this way, using them to label the speaker of new voice data to be labeled improves the accuracy of speaker labeling.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of a speaker labeling method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of another speaker labeling method according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of yet another speaker labeling method according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of yet another speaker labeling method according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow chart of a speaker annotation process according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a speaker labeling device according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a speaker labeling device according to an embodiment of the present application.
Detailed Description
The technical scheme of the embodiment of the application is suitable for application scenes for distinguishing and marking the speakers of the voice data. By adopting the technical scheme of the embodiment of the application, the speaker of the voice data can be identified, and the speaker is marked on the voice data according to the identification result.
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The embodiment of the application provides a speaker labeling method, which is shown in fig. 1, and comprises the following steps:
s101, acquiring acoustic characteristics of voice data to be marked;
the above-mentioned voice data to be marked generally refers to voice data which contains the voice content of a speaker, the speaker of the speaker needs to be identified, and the speaker is marked according to the identification result of the speaker, and the voice data can be voice data taken from any channel, any environment and any user. For example, the user voice data acquired by a terminal such as a television or a mobile phone, or the user voice data acquired from a network gateway may be used.
As a preferred process, after obtaining the voice data to be labeled, the embodiment of the application first removes suspected noise data from the obtained voice data according to indexes such as signal-to-noise ratio and energy, where suspected noise data includes data with a low signal-to-noise ratio or low energy. Then, voice activity detection (VAD) is performed to obtain valid voice segments. In this voice data preprocessing stage, various existing techniques, including but not limited to the scheme described above, may be used to remove invalid voice segments as far as possible.
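For illustration only, the following sketch shows a simple energy-based pre-filter of the kind described above; the frame sizes and energy threshold are assumed values, and a separate VAD step would normally follow.

```python
# Minimal energy-based pre-filter (illustrative; frame sizes and the
# energy threshold are assumed values, not prescribed by this application).
import numpy as np

def drop_low_energy_frames(signal, sr=16000, frame_ms=25, shift_ms=10, energy_floor=1e-4):
    frame = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    kept = []
    for start in range(0, len(signal) - frame + 1, shift):
        f = signal[start:start + frame]
        if np.mean(f ** 2) >= energy_floor:   # keep frames above the energy floor
            kept.append(f)
    return kept   # frames that survive the suspected-noise removal
```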
And after the voice data preprocessing, extracting acoustic features of the voice data to be marked to obtain acoustic features of the voice data to be marked.
As a preferred implementation manner, the embodiment of the application extracts the filter bank feature of the voice data to be annotated as the acoustic feature.
The filter bank feature specifically refers to a voice data feature extracted by means of a filter bank, and is one of the commonly used voice data features. A voice signal generally contains components in various frequency bands, but only the components in certain frequency bands are truly of interest or of value for purposes such as human listening and voice signal recognition; the remaining components may be of little use and may even interfere with normal recognition. Extracting the features of voice frames by means of a filter bank therefore effectively extracts the voice data features that are truly of interest or of value, while preventing redundant voice data features from interfering with subsequent recognition.
As an alternative implementation manner, the above specific implementation process of extracting the filter bank feature of each voice frame may be implemented with reference to a solution of extracting the filter bank feature of voice data known in the prior art, which is not described in detail in the embodiments of the present application.
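As an illustration of one such prior-art scheme, log-Mel filter bank features can be extracted along the following lines; librosa and the specific frame/filter settings are assumptions of this sketch, not requirements of the application.

```python
# Sketch of filter bank (log-Mel) feature extraction; librosa and the
# parameter values here are illustrative assumptions.
import numpy as np
import librosa

def extract_fbank(wav_path, sr=16000, n_mels=40, frame_len=0.025, frame_shift=0.010):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(sr * frame_len),
        hop_length=int(sr * frame_shift),
        n_mels=n_mels,
    )
    return np.log(mel + 1e-6).T   # shape: (num_frames, n_mels)
```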
S102, labeling the speaker of the voice data to be labeled at least according to the acoustic characteristics of the voice data to be labeled and the characteristics of the speaker appearing in the labeled voice data.
Specifically, when speaker labeling is performed on the voice data to be labeled, this embodiment refers to the features of the speakers that have appeared in the labeled voice data. The labeled voice data mentioned here refers to voice data for which speaker labeling has already been completed before the speaker labeling of the current voice data to be labeled, in particular voice data whose speaker labeling was achieved by executing the speaker labeling method provided in the embodiments of the present application.
The number of labeled speech data is typically a plurality of pieces, wherein the speakers of each piece of speech data may be the same or different. The embodiment of the application takes different speakers in the speakers marked with the voice data as speakers which appear in the voice data.
For each speaker that has appeared in the labeled voice data, the embodiment of the application also determines the feature of that speaker when the speaker is identified from the labeled voice data. The feature of a speaker refers to information that is derived from the speaker's voice data and can characterize that speaker.
It can be appreciated that once the features of the speakers that have appeared in the labeled voice data are determined, the features of a number of known speakers have effectively been accumulated. When identifying the speaker of new voice data to be labeled, the acoustic features of the new voice data can be compared with and distinguished from the features of the speakers that have appeared in the labeled voice data, so that the speaker of the voice data to be labeled can be identified and then labeled according to the identification result.
For example, after the acoustic feature of the voice data to be labeled is extracted, comparing the acoustic feature with the feature of the speaker appearing in the labeled voice data, and if the similarity between the acoustic feature of the voice data to be labeled and the feature of the speaker a in the speaker A, B, C, D appearing in the labeled voice data is the largest and exceeds a preset feature similarity threshold, considering the speaker of the voice data to be labeled as the speaker a; if the similarity between the acoustic features of the voice data to be annotated and the features of the speaker A, B, C, D which appears in the annotated voice data does not exceed the preset feature similarity threshold, the speaker of the voice data to be annotated is considered to be the speaker which does not appear in the annotated voice data, and the speaker is set as a new speaker E.
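The comparison described in this example can be sketched as follows; cosine similarity and the 0.7 threshold are illustrative choices of the sketch, since the application does not fix a particular similarity measure.

```python
import numpy as np

def compare_with_known_speakers(utt_feat, speaker_feats, threshold=0.7):
    # speaker_feats: dict mapping speaker id (e.g. "A", "B", ...) -> feature vector
    best_id, best_sim = None, -1.0
    for spk_id, feat in speaker_feats.items():
        sim = float(np.dot(utt_feat, feat) /
                    (np.linalg.norm(utt_feat) * np.linalg.norm(feat) + 1e-9))
        if sim > best_sim:
            best_id, best_sim = spk_id, sim
    if best_sim >= threshold:
        return best_id        # most similar speaker that has appeared in labeled data
    return "new_speaker"      # no known speaker is similar enough, e.g. speaker E
```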
It can be understood that according to the speaker labeling method disclosed by the application, the identification and labeling of the speaker to be labeled with the voice data can be realized by continuously accumulating the characteristics of the speaker labeled with the voice data.
Furthermore, the inventors of the present application found during their research that, when a speaker's voice data is collected through different channels such as a mobile phone, a television or a network, or when the speaker's environment, mood and so on change, the speaker features extracted from the voice data will change as well. The features of a speaker who has appeared in the labeled voice data may therefore vary from one piece of voice data to another, and whether the features of the speakers that have appeared in the labeled voice data are accurate directly affects the speaker labeling of the voice data to be labeled.
Therefore, in order to avoid influencing speaker labeling of the voice data to be labeled due to inaccurate characteristics of the speaker appearing in the labeled voice data, the embodiment of the application sets that in the process of labeling the speaker of the labeled voice data, the association relationship between the speaker and the acoustic characteristics of the voice data is continuously learned, and then the characteristics of the speaker are determined according to the learned association relationship.
For example, for a speaker a who has appeared in the noted voice data, assuming that 10 pieces of voice data in total are voice data of the speaker a in the noted voice data, in the process of speaker labeling each piece of voice data of the speaker a, the association relationship between the speaker a and the acoustic feature of the voice data thereof is learned, and the feature of the speaker a is determined from the learned association relationship.
It will be appreciated that as speaker a's voice data is continually labeled, the association between speaker a and its acoustic features of voice data is continually learned, and the determined features of speaker a are continually updated.
It can be understood that, for a certain speaker appearing in the labeled voice data, as the number of the labeled voice data is increased, the association relationship between the learned speaker and the acoustic features of the voice data is also increased, so that the influence of the changes of factors such as channels, environments, moods and the like on the characteristics of the speaker can be offset to a certain extent, and the determined characteristics of the speaker are more accurate.
According to the above processing, the characteristics of the speaker that appears in the noted voice data in the embodiment of the present application are determined based on the association between the speaker learned in the process of labeling the noted voice data and the acoustic characteristics of the voice data thereof. The feature is used for labeling the new speaker of the voice data to be labeled, and is beneficial to improving the accuracy of labeling the speaker.
Moreover, by executing the speaker labeling method, the features of known speakers are continuously refined during the speaker labeling process. Learning and application thus complement each other, so that the whole speaker labeling scheme becomes increasingly accurate, and the influence on speaker labeling of objective factors such as channel and environment, or of changes in subjective factors such as the speaker's mood, can gradually be eliminated.
As an exemplary implementation manner, labeling the speaker of the voice data to be labeled according to at least the acoustic feature of the voice data to be labeled and the feature of the speaker appearing in the labeled voice data, includes:
and determining the speaker of the voice data to be marked from the speaker appearing in the marked voice data according to the acoustic characteristics of the voice data to be marked and the characteristics of the speaker appearing in the marked voice data.
By comparing the acoustic characteristics of the voice data to be marked with the characteristics of the speaker appearing in the marked voice data, it can be determined whether the speaker of the voice data to be marked is one of the speakers appearing in the marked voice data, and if the speaker of the voice data to be marked is one of the speakers appearing in the marked voice data, the speaker is determined to be the speaker of the voice data to be marked, namely, the speaker of the voice data to be marked is determined from the speakers appearing in the marked voice data.
The above process of determining the speaker of the voice data to be labeled from the speakers that have appeared in the labeled voice data works well in scenarios where the speaker of the voice data to be labeled is certainly one of the speakers that have appeared in the labeled voice data; the features of those speakers serve as the features of known speakers for identifying the speaker of the voice data to be labeled.
In a typical speaker labeling scenario, however, the labeled voice data cannot cover all speakers, so the speakers that have appeared in the labeled voice data may not cover every possible speaker, and the speaker of the voice data to be labeled may be a new speaker who has not appeared in the labeled voice data. In that case, determining the speaker of the voice data to be labeled only from the speakers that have appeared in the labeled voice data cannot guarantee an accurate identification.
In order to cope with the more general speaker annotation scenario, the embodiment of the present application presets a new speaker feature, which may be a fixed value, for example, may be set to be all zeros.
After the acoustic features of the voice data to be labeled are obtained, the speaker of the voice data to be labeled is determined from a speaker set composed of the speakers that have appeared in the labeled voice data and the new speaker, according to the acoustic features of the voice data to be labeled, the features of the speakers that have appeared in the labeled voice data, and the feature of the new speaker.
The acoustic features of the voice data to be labeled are compared with the features of the speaker appearing in the labeled voice data and the features of the new speaker, and the features of the speaker most similar to the acoustic features of the voice data to be labeled are determined. It will be appreciated that the speaker of the voice data to be annotated may be any one of the speakers that have appeared in the annotated voice data at this time, or the new speaker described above.
The new speaker refers generically to a speaker who has not appeared in the labeled voice data, not to a specific person. As long as the similarity between the acoustic features of the voice data to be labeled and the feature of the new speaker is greater than its similarity with the features of every speaker that has appeared in the labeled voice data, the speaker of the voice data to be labeled is labeled as a new speaker.
For example, if the similarity between the acoustic features of a first piece of voice data to be labeled and the feature of the new speaker is greater than its similarity with the features of all speakers A, B, C, D that have appeared in the labeled voice data, the speaker of the first piece of voice data to be labeled is labeled as speaker E; if the similarity between the acoustic features of a second piece of voice data to be labeled and the feature of the new speaker is greater than its similarity with the features of the speakers A, B, C, D, E that have appeared in the labeled voice data, the speaker of the second piece of voice data to be labeled is labeled as speaker F, and so on.
It can be understood that, based on the preset characteristics of the new speaker, by executing the speaker labeling method provided by the embodiment of the present application, any speaker to be labeled with voice data can be labeled as a speaker that appears in the labeled voice data, or as a new speaker, so that effective speaker labeling can be achieved.
For convenience of description, the embodiment of the present application composes the speaker that appears in the noted voice data and the new speaker into a speaker set.
As an alternative implementation, referring to fig. 2, another embodiment of the present application discloses that determining the speaker of the voice data to be labeled from the speaker set composed of the speakers that have appeared in the labeled voice data and the new speaker, according to the acoustic features of the voice data to be labeled, the features of the speakers that have appeared in the labeled voice data and the feature of the new speaker, includes:
s202, determining the first labeling probability and the second labeling probability of each piece of voice data to be labeled according to the acoustic characteristics of the voice data to be labeled and the characteristics of each speaker in the speaker set; wherein the speaker set includes each speaker that has appeared in the labeled speech data and the new speaker.
Each first labeling probability represents the probability that the speaker of the voice data to be labeled is one of the speakers that have appeared in the labeled voice data, and the second labeling probability represents the probability that the speaker of the voice data to be labeled is the new speaker.
Specifically, according to the acoustic characteristics of the voice data to be marked and the characteristics of each speaker in the speaker set, the probability that the speaker of the voice data to be marked is each speaker in the speaker set is calculated.
For example, the characteristics of each speaker in the speaker set are compared with the acoustic characteristics of the voice data to be marked, so as to determine the probability that the speaker of the voice data to be marked is the speaker.
After the operation processing is performed, the probability that the speaker of the voice data to be marked is each speaker in the speaker set can be obtained respectively.
In order to facilitate the distinction, the embodiment of the present application sets the probability that the speaker of the voice data to be marked is the speaker of the marked voice data as a first marking probability, and sets the probability that the speaker of the voice data to be marked is the new speaker as a second marking probability. It will be appreciated that the number of the first labeling probabilities is equal to the number of the speakers that appear in the labeled speech data, and each first labeling probability is used to represent a probability that the speaker of the speech data to be labeled is a speaker that appears in the labeled speech data, and the number of the second labeling probabilities is only one and is used to represent a probability that the speaker of the speech data to be labeled is a new speaker.
S203, determining the speaker of the voice data to be annotated from the speaker set at least according to the first annotation probability and the second annotation probability.
Specifically, after the first labeling probability and the second labeling probability are respectively determined, the probability that the speaker of the voice data to be labeled is each speaker in the speaker set is respectively determined.
On this basis, the speaker of the voice data to be marked can be determined from the speaker set, for example, the speaker may be a speaker which appears in some marked voice data in the speaker set, or may be a new speaker which does not appear in the marked voice data.
As an exemplary implementation manner, in the embodiment of the present application, a labeling probability with a maximum probability value is selected from the first labeling probability and the second labeling probability, and then, according to the selected labeling probability with the maximum probability value, a speaker of the voice data to be labeled is determined from the speaker set.
It can be understood that, because the first labeling probabilities and the second labeling probabilities are used to represent the probability that the speaker of the to-be-labeled voice data is a certain speaker, if the probability that the speaker of the to-be-labeled voice data is a certain speaker is maximum, the speaker of the to-be-labeled voice data should be labeled as the speaker. Therefore, in the embodiment of the present application, the maximum labeling probability is selected from the first labeling probability and the second labeling probability, and the speaker corresponding to the maximum labeling probability is set as the speaker of the voice data to be labeled.
For example, assuming that the speaker set includes the speaker A, B, C, D that appears in the labeled speech data and the new speaker E, and labeling probabilities corresponding to the speaker A, B, C, D, E are 40%, 35%, 60%, 75%, and 50%, respectively, where 40%, 35%, 60%, and 75% are the first labeling probability and 50% are the second labeling probability, since 75% is the maximum labeling probability, the speaker D corresponding to 75% is set as the speaker to be labeled with the speech data.
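Using the numbers in this example, the selection amounts to an argmax over the first and second labeling probabilities:

```python
# Reproducing the example: first labeling probabilities for speakers A-D
# and the second labeling probability for the new speaker E.
first_probs = {"A": 0.40, "B": 0.35, "C": 0.60, "D": 0.75}
second_prob = 0.50

candidates = {**first_probs, "new_speaker_E": second_prob}
speaker = max(candidates, key=candidates.get)
print(speaker)   # -> "D", the speaker with the maximum labeling probability
```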
Step S201 in the embodiment shown in fig. 2 corresponds to step S101 in the method embodiment shown in fig. 1, and the specific content thereof may refer to the content of the method embodiment shown in fig. 1, and is not further described herein.
Optionally, in another embodiment of the present application, the determining the first labeling probability and the second labeling probability of each speaker in the speaker set according to the acoustic feature of the to-be-labeled voice data and the feature of each speaker in the speaker set further includes:
updating a preset feature processing model by utilizing the features of each speaker in the speaker set respectively to obtain a feature processing model corresponding to each speaker in the speaker set;
Respectively inputting the acoustic characteristics of the voice data to be annotated into a characteristic processing model corresponding to each speaker in the speaker set to obtain each first annotation probability and second annotation probability of the voice data to be annotated;
the feature processing model is obtained by training it at least to extract a feature sequence from acoustic feature samples of voice data and to calculate, from the acoustic feature samples of the voice data, the probability that the speaker of the voice data is the speaker corresponding to the feature processing model.
Specifically, the embodiment of the application trains the feature processing model in advance and is used for calculating the speaker labeling probability of the voice data to be labeled.
The feature processing model can be obtained based on training an RNN model, and in the embodiment of the application, an LSTM model is taken as an example, and the feature processing model is obtained by training the LSTM model. The feature processing model takes the acoustic features of the voice data as input, and can extract and obtain a feature sequence of the voice data based on the acoustic features of the voice data. In the training stage, besides the function of training the feature processing model to extract the feature sequence of the acoustic features of the voice data, the feature processing model is trained to determine the probability that the speaker of the voice data is the speaker corresponding to the feature processing model according to the acoustic features of the voice data.
In addition, the feature processing model can also output its hidden layer state. The hidden layer state reflects the internal operation parameters of the feature processing model.
In the embodiment of the present application, the feature processing model obtained by training in advance is updated by using the features of each speaker in the speaker set, and after the feature processing model is updated by using the features of the speaker, the feature processing model carries the feature information of the speaker, so that the feature processing model is used as the feature processing model corresponding to the speaker. Through the above processing, a feature processing model corresponding to each speaker in the above speaker group can be obtained.
When a feature processing model corresponding to each speaker in the speaker set is obtained, the acoustic feature of the voice data to be marked is input into the feature processing model corresponding to the speaker, and the probability that the speaker of the voice data to be marked is the speaker corresponding to the feature processing model is calculated by using the model, so that the first marking probability and the second marking probability can be obtained respectively.
As an exemplary implementation manner, the features of the speaker in the foregoing embodiments of the present application include an acoustic feature sequence and feature extraction operation parameters.
For a speaker who has appeared in the labeled voice data, the acoustic feature sequence is extracted from the feature sequence obtained when the feature processing model corresponding to that speaker processed the acoustic features of the speaker's most recently labeled voice data, and the feature extraction operation parameters are the hidden layer state of that feature processing model when it processed the acoustic features of the speaker's most recently labeled voice data.
It will be appreciated that the acoustic feature sequence and feature extraction parameters contained in the features of a speaker who has appeared in the labeled speech data are both related to its most recent labeled speech data. The latest marked voice data of the speaker refers to the voice data of the speaker, wherein the speaker marking is completed before the current voice data to be marked is marked by the speaker, and the speaker marking result is the voice data of the speaker.
Therefore, as the number of the labeled voice data increases, the acoustic feature sequence and the feature extraction operation parameters of the speaker appearing in the labeled voice data are updated continuously, that is, the association relationship between the acoustic features of the voice data and the speaker of the voice data is learned continuously, and the features of the speaker are updated based on the learned association relationship.
For the new speaker, the acoustic feature sequence and the feature extraction operation parameters are preset to fixed values.
On the basis, when each speaker in the speaker set is utilized to update the feature processing model, the acoustic feature sequence and the feature extraction operation parameters of each speaker in the speaker set are input into the feature processing model so as to update the feature processing model, and the feature processing model corresponding to each speaker in the speaker set is obtained.
For example, suppose 10 utterances have been labeled, the corresponding speaker labels are [1,1,1,2,2,2,2,2,2,2], the corresponding speaker acoustic feature sequences are [X1, X2, X3, X4, X5, X6, X7, X8, X9, X10], and the corresponding feature processing model hidden layer states are [h1, h2, h3, h4, h5, h6, h7, h8, h9, h10]. Then the acoustic feature sequence of the most recent voice data of speaker No. 1 is X3, with corresponding hidden layer state h3; the acoustic feature sequence of the most recent voice data of speaker No. 2 is X10, with corresponding hidden layer state h10. X3-h3 and X10-h10 are then respectively input into the pre-trained feature processing model to obtain the feature processing model corresponding to speaker No. 1 and the feature processing model corresponding to speaker No. 2.
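The bookkeeping in this example can be written directly as follows (the X and h entries are placeholders standing in for the real feature sequences and hidden layer states):

```python
# For each speaker, keep the feature sequence and hidden layer state of that
# speaker's most recent labeled voice data.
labels  = [1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
feats   = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10"]
hiddens = ["h1", "h2", "h3", "h4", "h5", "h6", "h7", "h8", "h9", "h10"]

latest = {}
for y, x, h in zip(labels, feats, hiddens):
    latest[y] = (x, h)          # later voice data overwrites earlier entries
print(latest)   # {1: ('X3', 'h3'), 2: ('X10', 'h10')}
```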
Illustratively, the embodiment of the application uses an LSTM (Long Short-Term Memory) network to construct the feature processing model. After training, the model parameter θ is fixed. The output hidden layer state h_t = LSTM(X_{t'}, h_{t'} | θ) is associated with the speaker labeling result y_t, and m_t = f(h_t | θ) represents the probability, output by the feature processing model, that the speaker of the voice data to be labeled is the speaker corresponding to that feature processing model.
Here t' denotes the index of the most recently labeled voice data of speaker y_t.
The acoustic feature sequence X_t of the voice data output by the feature processing model for speaker y_t follows a normal distribution whose parameters are related to m:
X_t | X_{[t-1]}, y_t ~ N(μ_t, σ²)
Therefore, the acoustic feature sequence X_t corresponding to speaker label y_t is obtained by sampling from this normally distributed feature sequence, for example by Monte Carlo sampling of the feature sequence output by the feature processing model.
Optionally, referring to fig. 3, another embodiment of the present application further discloses a speaker labeling method provided in the embodiment of the present application further includes:
s303, determining the third labeling probability and the fourth labeling probability of the voice data to be labeled according to the number of labeled voice data corresponding to each speaker appearing in the labeled voice data.
Each third labeling probability represents the probability that the speaker of the voice data to be labeled is one of the speakers that have appeared in the labeled voice data, and the fourth labeling probability represents the probability that the speaker of the voice data to be labeled is the new speaker.
Specifically, since the embodiment of the application determines in advance the features of the speakers that have appeared in the labeled voice data as well as the feature of the new speaker, and determines the speaker of the voice data to be labeled from the speaker set composed of those speakers and the new speaker, it can be understood that the speaker of the voice data to be labeled is either one of the speakers that have appeared in the labeled voice data or the new speaker.
If it can be estimated in advance whether the speaker of the voice data to be labeled is more likely to be a particular speaker that has appeared in the labeled voice data or a new speaker, this provides theoretical support for further determining the speaker of the voice data to be labeled from the speaker set.
In order to facilitate the distinction, in the embodiment of the present application, the third labeling probability and the fourth labeling probability are used to represent the probability that the speaker of the voice data to be labeled is one of the speakers that have appeared in the labeled voice data, and the probability that the speaker of the voice data to be labeled is the new speaker.
As a preferred implementation manner, the embodiment of the present application determines the third labeling probability and the fourth labeling probability of the voice data to be labeled according to the number of labeled voice data corresponding to each speaker that appears in the labeled voice data.
It can be appreciated that if the number of labeled voice data of a certain speaker appearing in the labeled voice data is smaller, the probability that the voice data to be labeled is the voice data of the speaker is smaller, and the speaker of the voice data to be labeled is more likely to be a new speaker; if the number of marked voice data of a certain speaker appearing in the marked voice data is large, the probability that the voice data to be marked is the voice data of the speaker is large, and the speaker of the voice data to be marked is more likely to be the speaker.
Therefore, by counting the number of the labeled voice data corresponding to each speaker appearing in the labeled voice data, the third labeling probability and the fourth labeling probability of each voice data to be labeled can be determined. In general, the number of labeled speech data corresponding to the speaker who has appeared in the labeled speech data is proportional to the third labeling probability and inversely proportional to the fourth labeling probability. For example, if the number of labeled voice data corresponding to a certain speaker appearing in the labeled voice data is larger, the third labeling probability corresponding to the speaker is larger, and correspondingly, the fourth labeling probability is smaller; conversely, if the number of labeled voice data corresponding to a certain speaker appearing in the labeled voice data is smaller, the third labeling probability corresponding to the speaker is smaller, and accordingly, the fourth labeling probability is larger.
Based on the third labeling probability and the fourth labeling probability, determining the speaker of the voice data to be labeled from the speaker set at least according to the first labeling probability and the second labeling probability, including:
s304, determining the speaker of the voice data to be annotated from the speaker set at least according to the first annotation probability, the second annotation probability, the third annotation probability and the fourth annotation probability.
Specifically, after the first labeling probability, the second labeling probability, the third labeling probability and the fourth labeling probability are defined, the probability values may be combined to determine which speaker in the speaker set is the speaker of the voice data to be labeled, and then the speaker of the voice data to be labeled is determined as the speaker.
Since both the first labeling probabilities and the third labeling probabilities represent the probability that the speaker of the voice data to be labeled is a particular speaker that has appeared in the labeled voice data, each speaker that has appeared in the labeled voice data has a corresponding first labeling probability and a corresponding third labeling probability, both representing the probability that the speaker of the voice data to be labeled is that speaker. The first labeling probability and the third labeling probability corresponding to the same speaker therefore correspond to each other.
As an exemplary implementation, the embodiment of the application multiplies each first labeling probability by its corresponding third labeling probability, that is, multiplies the first and third labeling probabilities of the same speaker that has appeared in the labeled voice data, and each resulting product is taken as a first comprehensive probability, so that each speaker that has appeared in the labeled voice data obtains a corresponding first comprehensive probability.
Meanwhile, multiplying the second labeling probability with the fourth labeling probability to obtain a second comprehensive probability, wherein the second comprehensive probability represents the probability that the speaker of the voice data to be labeled is a new speaker.
After the first comprehensive probability and the second comprehensive probability are obtained respectively, the comprehensive probability with the maximum probability value is selected from the first comprehensive probability and the second comprehensive probability, and the speaker corresponding to the comprehensive probability with the maximum probability value is determined as the speaker of the voice data to be marked.
As an exemplary implementation manner, the embodiment of the present application trains a speaker prediction model in advance, which is used to determine each third labeling probability and fourth labeling probability of the voice data to be labeled.
In the training stage of the pre-constructed speaker prediction model, the number of voice data of each speaker that has appeared in the labeled voice data is input into the speaker prediction model, and the model is used to determine the probability that the speaker of a voice data sample is each already-appeared speaker and the probability that it is a new speaker; training ends when the speaker prediction model can accurately label the speaker of the voice data samples.
After the training, after determining the number of the labeled voice data corresponding to each speaker appearing in the labeled voice data, inputting the number information into a speaker prediction model obtained by training in advance, respectively determining the probability that the speaker of the voice data to be labeled is each speaker appearing in the labeled voice data, and determining the probability that the speaker of the labeled voice data is a new speaker, so as to obtain the third labeling probability and the fourth labeling probability.
Illustratively, the speaker prediction model described above may be defined as follows:
a) The probability that the speaker of the voice data to be labeled is the k-th of the K_{t-1} speakers that have appeared in the labeled voice data:
p(y_t = k | y_{[t-1]}) = N_{k,t-1} / (Σ_{k'} N_{k',t-1} + α)
b) The probability that the speaker of the voice data to be labeled is a new speaker:
p(y_t = K_{t-1} + 1 | y_{[t-1]}) = α / (Σ_{k'} N_{k',t-1} + α)
where N_{k,t-1} denotes the number of voice data of the k-th speaker among the t-1 labeled voice data, and α is a parameter to be learned by the model.
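These two expressions can be computed directly from the per-speaker counts; the function name below is illustrative.

```python
# Count-based speaker prediction (third and fourth labeling probabilities);
# alpha is the parameter learned by the model.
def speaker_prediction(counts, alpha):
    """counts: dict speaker_id -> N_{k,t-1}, the number of labeled voice data
    of that speaker among the t-1 labeled voice data."""
    denom = sum(counts.values()) + alpha
    third = {spk: n / denom for spk, n in counts.items()}   # existing speakers
    fourth = alpha / denom                                  # new speaker
    return third, fourth
```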
Steps S301 and S302 in the embodiment shown in fig. 3 correspond to steps S201 and S202 in the method embodiment shown in fig. 2, respectively, and the specific content thereof is shown in the method embodiment shown in fig. 2, and will not be described herein.
As a preferred implementation manner, referring to fig. 4, the speaker labeling method provided in the embodiment of the present application further includes:
s404, extracting speaker characteristics of the voice data to be marked according to the acoustic characteristics of the voice data to be marked.
Illustratively, the embodiment of the application trains a speaker characteristic extraction model in advance for extracting the speaker characteristic, and the speaker characteristic extraction model can be obtained based on the training of the RNN model.
Inputting the acoustic characteristics of the voice data to be marked into the speaker characteristic extraction model, and extracting to obtain the speaker characteristics of the voice data to be marked.
For example, inputting the filter bank features extracted from the voice data to be annotated into the speaker feature extraction model to obtain the speaker features in the voice data to be annotated.
S405, determining the speaker jump probability of the voice data to be labeled according to the speaker features of the voice data to be labeled and the speaker features of the last labeled voice data.
The speaker skip probability represents the probability that the speaker of the voice data to be marked is different from the speaker of the last voice data marked.
Specifically, referring to the method for extracting the speaker characteristics of the voice data to be labeled, the speaker characteristics of the last labeled voice data can be extracted.
The above-mentioned last marked voice data refers to voice data that completes speaker marking before speaker marking is performed on the current voice data to be marked, and the last marked voice data is also latest voice data that completes speaker marking before speaker marking is performed on the current voice data to be marked.
By comparing the speaker features of the current voice data to be labeled with the speaker features of the last labeled voice data, it can be determined whether the speaker of the current voice data to be labeled is the same as the speaker of the last labeled voice data, and the speaker jump probability of the voice data to be labeled can thereby be determined. The speaker jump probability is inversely related to the similarity between the speaker features of the current voice data to be labeled and those of the last labeled voice data: the higher the similarity, the lower the probability that the speaker of the current voice data to be labeled differs from the speaker of the last labeled voice data, and hence the smaller the speaker jump probability.
Based on the skip probability of the voice data to be annotated, determining a speaker of the voice data to be annotated from the speaker set at least according to the first annotation probability, the second annotation probability, the third annotation probability and the fourth annotation probability, includes:
s406, determining the speaker of the voice data to be annotated from the speaker set according to the speaker jump probability, the first annotation probability, the second annotation probability, the third annotation probability and the fourth annotation probability.
In an exemplary embodiment of the present application, according to the first labeling probabilities, the speaker skip probabilities, and the third labeling probabilities, third comprehensive probabilities corresponding to the speakers that appear in the labeled speech data are calculated.
Specifically, for the speaker corresponding to the last labeled voice data, the first labeling probability, the third labeling probability and the speaker non-jump probability corresponding to that speaker are multiplied to obtain the third comprehensive probability corresponding to that speaker.
The speaker non-jump probability is calculated from the speaker jump probability. Since the speaker of the current voice data to be labeled either jumps or does not jump relative to the speaker of the last labeled voice data, that is, the two speakers are either the same or different, the non-jump probability is obtained by subtracting the speaker jump probability from 1.
And multiplying the first labeling probability, the third labeling probability and the speaker skip probability corresponding to other speakers appearing in the labeled voice data to obtain a third comprehensive probability corresponding to each speaker.
Meanwhile, multiplying the speaker skip probability, the second labeling probability and the fourth labeling probability to obtain a fourth comprehensive probability corresponding to the new speaker.
It can be understood that the above-mentioned comprehensive probabilities corresponding to the respective speakers respectively represent the probabilities that the speaker to be annotated with the voice data is the speaker.
And determining the speaker of the voice data to be marked from the speaker set according to the maximum comprehensive probability of the probability values in the third comprehensive probability and the fourth comprehensive probability.
Specifically, among the third comprehensive probabilities and the fourth comprehensive probability, the comprehensive probability with the largest value is selected, and the speaker corresponding to that comprehensive probability is taken as the speaker of the voice data to be labeled.
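Putting the pieces together, the comprehensive probabilities can be formed and compared as follows. This is a sketch: the dictionaries of first/third probabilities, the second/fourth probabilities and the jump probability are assumed to have been computed as described above.

```python
# Combine first/third labeling probabilities (existing speakers) and
# second/fourth labeling probabilities (new speaker) with the speaker
# jump / non-jump probability, then take the most probable candidate.
def decide_speaker(first, third, second, fourth, p_jump, last_speaker):
    combined = {}
    for spk in first:
        change = (1.0 - p_jump) if spk == last_speaker else p_jump
        combined[spk] = first[spk] * third[spk] * change       # third comprehensive probability
    combined["new_speaker"] = second * fourth * p_jump         # fourth comprehensive probability
    return max(combined, key=combined.get)
```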
As an exemplary implementation, the embodiment of the present application trains a speaker skip model in advance, for determining a speaker skip probability of voice data to be annotated.
In the training stage of the speaker skip model, the speaker features of different voice data are input into the speaker skip model, so that the model determines the probability that the speakers of the different voice data are different; this probability serves as the speaker skip probability output by the model.
Training is complete when the speaker skip model can accurately determine, from the speaker features of different voice data, the probability that their speakers are different.
When the speaker skip probability of the voice data to be marked is calculated, the speaker characteristic of the voice data to be marked and the speaker characteristic of the last marked voice data are input into the speaker skip model to obtain the speaker skip probability of the voice data to be marked.
Illustratively, the speaker skip model described above may be represented by the following expression:
p(z_t = 0 | z_{[t-1]}, γ) = g_γ(z_{[t-1]})
where g_γ is a function with an arbitrary parameter γ; it may be a neural network model, or a binary distribution with parameter γ = {p_0} ∈ [0, 1].
It can be understood that, in the latter case, whether the speaker of the voice data to be labeled jumps is determined by the probability p_0, which may be a preset probability value.
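As a sketch of the two alternatives just described, the jump probability can either be a fixed binary-distribution parameter p_0 or come from a model over the speaker features of adjacent voice data; the cosine-based mapping below is an assumption of the sketch, not a formula from this application.

```python
import numpy as np

def speaker_jump_probability(cur_feat, prev_feat, p0=None):
    # Fixed binary-distribution case: simply return the preset p0.
    if p0 is not None:
        return p0
    # Otherwise derive a probability from how dissimilar the speaker features
    # of the current and the last labeled voice data are (illustrative mapping).
    cos = float(np.dot(cur_feat, prev_feat) /
                (np.linalg.norm(cur_feat) * np.linalg.norm(prev_feat) + 1e-9))
    return 0.5 * (1.0 - cos)    # higher similarity -> lower jump probability
```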
Steps S401 to S403 in the method embodiment shown in fig. 4 correspond to steps S301 to S303 in the method embodiment shown in fig. 3, respectively, and the specific content thereof is please refer to the content of the method embodiment shown in fig. 3, which is not described herein again.
Further, another embodiment of the present application further discloses that the above speaker labeling method further includes:
and acquiring the characteristics of the speaker of the voice data to be marked.
Firstly, a feature sequence obtained by processing the acoustic features of the voice data to be marked by a feature processing model corresponding to the speaker of the voice data to be marked is obtained.
The feature processing model corresponding to the speaker of the voice data to be labeled refers to the feature processing model whose output first labeling probability, calculated based on the acoustic features of the voice data to be labeled, is the largest, or whose comprehensive probability formed from that first labeling probability is the largest.
In the process of labeling the speaker of the voice data to be labeled, the acoustic features of the voice data to be labeled have already been input into the feature processing models corresponding to the known speakers (including the speakers appearing in the labeled voice data and the new speaker), so that the feature sequences output by the feature processing models corresponding to the known speakers have been obtained.
At this time, the feature processing model corresponding to the speaker of the voice data to be labeled can be directly determined from the feature processing models corresponding to the known speakers, and the feature sequence obtained by that model from processing the acoustic features of the voice data to be labeled can be read directly.
And extracting the acoustic feature sequence of the speaker of the voice data to be marked from the obtained feature sequence.
Specifically, the obtained feature sequence is sampled, and the obtained sampling result is taken as the acoustic feature sequence of the speaker of the voice data to be marked.
Then, the hidden layer state of the feature processing model corresponding to the speaker of the voice data to be labeled, produced when that model processes the acoustic features of the voice data to be labeled to obtain the feature sequence, is acquired and taken as the feature extraction operation parameter used when extracting the acoustic feature sequence of the speaker of the voice data to be labeled;
and the acoustic feature sequence of the speaker of the voice data to be marked and the feature extraction operation parameters are used as the features of the speaker of the voice data to be marked.
After the acoustic feature sequence of the speaker of the voice data to be marked and the feature extraction operation parameters when the acoustic feature sequence of the speaker of the voice data to be marked is extracted are respectively obtained, the acoustic feature sequence and the feature extraction operation parameters are used as the features of the speaker of the voice data to be marked. The characteristics of the speaker can be used for updating the characteristic processing model when the speaker is marked on new voice data later.
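By way of illustration only, the following Python sketch shows how such speaker features (a sampled feature sequence plus the hidden layer state used as the feature extraction operation parameter) might be assembled; the LSTM-based feature processing model, dimensions and random sampling strategy are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class FeatureProcessingModel(nn.Module):
    def __init__(self, acoustic_dim: int = 40, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(acoustic_dim, hidden_dim, batch_first=True)

    def forward(self, acoustic_feats, hidden_state=None):
        # acoustic_feats: (1, T, acoustic_dim); returns the feature sequence and
        # the new hidden layer state of the model
        return self.lstm(acoustic_feats, hidden_state)

def speaker_features(model, acoustic_feats, prev_hidden=None, num_samples=8):
    feat_seq, new_hidden = model(acoustic_feats, prev_hidden)
    # sample frames of the feature sequence as the speaker acoustic feature sequence
    idx = torch.randint(0, feat_seq.size(1), (num_samples,))
    speaker_feat_seq = feat_seq[:, idx, :]
    # the hidden state becomes the feature extraction operation parameter for the
    # next time this speaker's feature processing model is updated
    return speaker_feat_seq, new_hidden
```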
As a preferred processing manner, the embodiment of the present application sets that the noted voice data and the voice data to be noted are voice data with the same attribute.
Wherein the same attributes mentioned above include, but are not limited to, from the same channel, collected from the same kind of terminal, or the same language, etc.
The embodiment of the application sets that, before speaker labeling is performed on voice data, the voice data is classified according to attribute, and then the voice data with the same attribute is labeled according to the speaker labeling method provided in the embodiment of the application. For example, for voice data from the same terminal ID, a speaker labeling system dedicated to speaker labeling of the voice data from that terminal ID is established according to the speaker labeling method of the embodiment of the present application.
It can be understood that voice data with the same attribute share some similar characteristics, while voice data with different attributes may differ considerably. By classifying the voice data first, the speaker labeling method provided by the application can focus on speaker labeling of voice data with the same attribute, so that the processing of the voice data is more refined and specialized. In addition, labeling voice data of different attributes separately prevents feature interference between voice data of different attributes from affecting the speaker labeling accuracy.
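A minimal Python sketch of this grouping step is given below for illustration; `attribute_of` and `label_speakers` are assumed callables standing in for attribute detection and the speaker labeling method itself, not parts of this application.

```python
from collections import defaultdict

def label_by_attribute(utterances, attribute_of, label_speakers):
    groups = defaultdict(list)
    for utt in utterances:
        # e.g. attribute_of(utt) could return a terminal ID, channel or language tag
        groups[attribute_of(utt)].append(utt)
    # each attribute gets its own dedicated speaker labeling pass, so features of
    # differently sourced audio do not interfere with each other
    return {attr: label_speakers(items) for attr, items in groups.items()}
```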
In summary, the speaker labeling method provided in the embodiment of the present application uses the feature processing model, the speaker prediction model, and the speaker skip model to identify and label the speaker of the voice data to be labeled, specifically, determines the probability that the speaker of the voice data to be labeled is the labeled speaker or the new speaker based on the probability values output by the three models, thereby determining the speaker of the voice data to be labeled.
The probability calculation described above can be expressed by the following formula:
p(X_t, y_t, z_t | X_[t-1], y_[t-1], z_[t-1]) = p(X_t | X_[t-1], y_[t]) · p(y_t | z_t, y_[t-1]) · p(z_t | z_[t-1])    (1)
wherein p(X_t | X_[t-1], y_[t]) represents the probability output by the feature processing model in the above embodiments of the present application, which may specifically be the first labeling probability or the second labeling probability; p(y_t | z_t, y_[t-1]) represents the probability output by the speaker prediction model in the above embodiments, which may specifically be the third labeling probability or the fourth labeling probability; and p(z_t | z_[t-1]) represents the probability output by the speaker skip model in the above embodiments, namely the speaker skip probability.
t denotes the current voice data to be labeled, and [t-1] denotes the sequence of already labeled voice data; for example, if 6 pieces of voice data have been labeled, [t-1] is 1, 2, 3, 4, 5, 6.
X_t denotes the speaker acoustic feature sequence of the current voice data to be labeled. A speaker acoustic feature sequence is a multidimensional vector obtained by mapping a piece of voice data into another space through a speaker model, and can represent the speaker feature of the current voice data. X_[t-1] denotes the speaker acoustic feature sequences of the already labeled voice data.
y_t denotes the speaker of the current voice data to be labeled, and y_[t-1] denotes the sequence of speakers corresponding to the labeled voice data, which corresponds one-to-one with X_[t-1]. When t = 1, y_1 = 1, i.e., the speaker corresponding to the first piece of voice data is labeled as speaker No. 1.
z_t is a Boolean value: z_t = 0 indicates that the speaker of the current voice data to be labeled is the same speaker as that of the previous piece of labeled voice data, in which case y_t = y_(t-1); z_t = 1 indicates that the speaker of the current voice data to be labeled is not the same speaker as that of the previous piece of labeled voice data, in which case y_t ≠ y_(t-1).
For an easy understanding of the several variables X, y, z described above, reference is made to the following examples:
assume that the voice data currently to be labeled is the 7th piece, i.e., t = 7, and the speaker labeling results of the first 6 pieces of voice data have been obtained: y_[6] = (1, 1, 2, 3, 2, 2), the corresponding speaker acoustic feature sequences X_[6] = (X_1, X_2, X_3, X_4, X_5, X_6), and the corresponding speaker skip sequence z_[6] = (0, 1, 1, 1, 0). By executing the speaker labeling method of the embodiment of the present application, the speaker y_7 of the current voice data to be labeled can be determined from X_[6], y_[6] and z_[6], and the corresponding speaker acoustic feature sequence X_7 can be obtained.
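Purely for readability, the running state in this example could be represented as plain Python containers; the layout below is an assumption made only for illustration.

```python
# The running state of the example above (t = 7), written as plain containers.
state = {
    "y": [1, 1, 2, 3, 2, 2],                     # speaker labels y_[6]
    "X": ["X1", "X2", "X3", "X4", "X5", "X6"],   # speaker feature sequences X_[6]
    "z": [0, 1, 1, 1, 0],                        # skip indicators z_[6]
}
```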
It should be noted that the feature processing model, the speaker prediction model and the speaker skip model are trained synchronously, and the three models are closely connected and work in combination, so that the speaker labeling method provided by the embodiment of the application forms a systematic speaker labeling scheme and achieves the purpose of accurately identifying the speaker of the voice data.
It will be appreciated that the probability p(X_t | X_[t-1], y_[t]) output by the feature processing model is influenced by the prediction result y_t of the speaker prediction model, and the probability p(y_t | z_t, y_[t-1]) output by the speaker prediction model is influenced by the skip value z_t output by the speaker skip model. In the synchronous training process of the three models, these influence relationships become tighter and tighter and the models cooperate more and more properly, so that after training the speaker of the voice data can be accurately identified.
In order to facilitate calculation, the logarithm of the above formula (1) is taken, and the speaker of the voice data to be labeled can be predicted and determined based on the probabilities output by the above three models:
(y_t*, z_t*) = argmax_(y_t, z_t) [ ln p(X_t | X_[t-1], y_[t]) + ln p(y_t | z_t, y*_[t-1]) + ln p(z_t | z*_[t-1]) ]    (2)
wherein (y_t*, z_t*) represents the finally predicted speaker labeling result and speaker skip situation; argmax_(y_t, z_t) indicates taking, as the final value of (y_t*, z_t*), the (y_t, z_t) combination for which the corresponding sum of probabilities is the largest; z*_[t-1] represents the speaker skip situations of the previous t-1 pieces of voice data; y*_[t-1] represents the speaker labeling results of the previous t-1 pieces of voice data; y_t represents the speaker prediction result for the current voice data to be labeled, which may be a speaker determined according to each labeling probability or each comprehensive probability. According to formula (2), by finding the maximum probability value corresponding to each combination of (y_t, z_t), the speaker of the voice data to be labeled can be finally determined.
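As an illustration of the decision rule in formula (2), the following Python sketch enumerates candidate (y_t, z_t) combinations and keeps the one with the largest sum of log probabilities; the three callables are assumed wrappers around the outputs of the feature processing model, the speaker prediction model and the speaker skip model, not an API disclosed in this application.

```python
import math

def decide(candidates, log_p_feat, log_p_pred, log_p_skip):
    best, best_score = None, -math.inf
    for y_t, z_t in candidates:
        score = log_p_feat(y_t) + log_p_pred(y_t, z_t) + log_p_skip(z_t)
        if score > best_score:
            best, best_score = (y_t, z_t), score
    return best  # (y_t*, z_t*)

# Example: candidates for t = 7 could be [(1, z), (2, z), (3, z)] with the
# appropriate z for each known speaker, plus (4, 1) for the new speaker No. 4.
```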
In order to more vividly show the speaker labeling method provided by the embodiment of the application, the following is exemplified:
if 7 pieces of voice data need to be marked with the speaker, the speaker marking method provided by the embodiment of the application is executed according to the following processing, so that the speaker marking of each piece of voice data is realized:
step 1: initializing speaker acoustic feature sequence and feature processing model hidden layer state X 0 =0,h 0 =0, the initialized acoustic feature sequence and feature processing model hidden layer state of the speaker are respectively used as the acoustic feature sequence and feature extraction of the preset new speaker And taking the operation parameters.
Step 2: speaker labeling is sequentially carried out on each voice data:
t=1: annotating a first piece of speech data
a) Calculate the hidden layer state of the feature processing model: h_1 = LSTM(X_0, h_0 | θ).
b) The feature processing model calculates the probability that the speaker of this voice data is speaker No. 1: m_1 = f(h_1 | θ).
c) Label the speaker of the current voice data as speaker No. 1, y_1 = 1, and obtain the speaker acoustic feature sequence X_1.
d) Save the current state: X_[1] = (X_1), y_[1] = (1), h_[1] = (h_1), z_[1] = (), where z_[1] is empty.
t=2: annotating a second piece of voice data
(1) Determining probabilities that the speech data is speaker number 1 and a new speaker, respectively
Case A: z_2 = 0, y_2 = y_1 = 1
a) Calculate the hidden layer state of the feature processing model: h_2_A = LSTM(X_1, h_1 | θ).
b) The feature processing model calculates the probability that the speaker of this voice data is speaker No. 1: m_2_A = f(h_2_A | θ).
c) Label the speaker of the current voice data as speaker No. 1, y_2 = 1, and obtain the speaker acoustic feature sequence X_2_A.
d) Calculate the probability p_A that the speaker of this voice data is speaker No. 1 as the product of three terms: the first term is the output of the speaker skip model (here the probability of not skipping, since z_2 = 0), the second term is the probability, calculated by the speaker prediction model, that the speaker of this voice data is speaker No. 1, and the third term is the probability m_2_A, determined by the feature processing model, that the speaker of this voice data is speaker No. 1.
Case B: z_2 = 1, y_2 is a new speaker
a) Calculate the hidden layer state of the feature processing model: h_2_B = LSTM(X_0, h_0 | θ).
b) The feature processing model calculates the probability that the speaker of this voice data is a new speaker: m_2_B = f(h_2_B | θ).
c) Label the speaker of the current voice data as speaker No. 2, y_2 = 2, and obtain the speaker acoustic feature sequence X_2_B.
d) Calculate the probability p_B that the speaker of this voice data is speaker No. 2, likewise as the product of the speaker skip probability, the probability output by the speaker prediction model that the speaker is a new speaker, and the probability m_2_B determined by the feature processing model.
(2) Determining speaker labeling results:
a) If p_B > p_A, then z_2 = 1 and y_2 = 2; update X_2 = X_2_B, the number of pieces of voice data occupied by speaker No. 1 in the first 2 pieces N_1,2 = 1, the number of pieces of voice data occupied by speaker No. 2 in the first 2 pieces N_2,2 = 1, and the hidden layer state of the feature processing model h_2 = h_2_B.
If p_B < p_A, then z_2 = 0 and y_2 = 1; update X_2 = X_2_A, the number of pieces of voice data occupied by speaker No. 1 in the first 2 pieces N_1,2 = 2, and the hidden layer state of the feature processing model h_2 = h_2_A.
b) Save the current state.
Suppose p_B > p_A in a); then X_[2] = (X_1, X_2), y_[2] = (1, 2), h_[2] = (h_1, h_2), z_[2] = (1).
t = 3: Label the 3rd piece of voice data …
Proceed in this way in sequence: in each speaker labeling, calculate the probability that the speaker of the voice data to be labeled is each speaker appearing in the historical voice data, as well as the probability that it is a new speaker, and select the speaker corresponding to the combination with the maximum probability as the speaker of the voice data to be labeled.
The speaker of each voice data is marked in turn according to the above description, and the marking result is obtained assuming that the speaker marking of the first 6 voice data is completed:
y_[6] = (1, 1, 2, 3, 2, 2)
X_[6] = (X_1, X_2, X_3, X_4, X_5, X_6)
h_[6] = (h_1, h_2, h_3, h_4, h_5, h_6)
the speaker labeling method provided by the embodiment of the application is executed according to the processing procedure shown in fig. 5, so that the speaker labeling of the 7th piece of voice data can be realized.
Specifically, the feature processing model is updated with the acoustic feature sequence and feature processing model hidden layer state of speaker No. 1, speaker No. 2 and speaker No. 3 respectively, and with the preset acoustic feature sequence and hidden layer state; the updated models are then used to calculate the probabilities that the speaker of the 7th piece of voice data is speaker No. 1, speaker No. 2, speaker No. 3 or speaker No. 4, respectively; finally, the maximum probability value is selected from these probabilities to determine the speaker of the 7th piece of voice data.
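To tie the example together, the following Python sketch labels one utterance by scoring each known speaker and a preset new-speaker hypothesis and keeping the best candidate; the state tuples and the `score_candidate` callable are assumptions kept consistent with the earlier sketches, not the exact implementation of this application.

```python
def label_utterance(acoustic_feats, known_speakers, preset_state, score_candidate):
    """known_speakers: dict speaker_id (int) -> (feature_sequence, hidden_state);
    preset_state: (X_0, h_0) used for the new-speaker hypothesis;
    score_candidate: returns (probability, new_feature_sequence, new_hidden_state)."""
    new_id = max(known_speakers) + 1          # e.g. speaker No. 4 in the example
    hypotheses = dict(known_speakers)
    hypotheses[new_id] = preset_state
    best = None
    for spk, (feat_seq, hidden) in hypotheses.items():
        prob, new_seq, new_hidden = score_candidate(
            acoustic_feats, feat_seq, hidden, is_new=(spk == new_id))
        if best is None or prob > best[1]:
            best = (spk, prob, new_seq, new_hidden)
    speaker, _, feat_seq, hidden = best
    return speaker, feat_seq, hidden
```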
Corresponding to the above speaker labeling method, the embodiment of the present application further provides a speaker labeling device, as shown in fig. 6, where the device includes:
a feature extraction unit 100, configured to obtain acoustic features of voice data to be labeled;
a speaker labeling unit 110, configured to label the speaker of the voice data to be labeled according to at least the acoustic feature of the voice data to be labeled and the feature of the speaker that appears in the labeled voice data;
The characteristic of the speaker appearing in the marked voice data is determined based on the association relation between the speaker learned in the speaker marking process of the marked voice data and the acoustic characteristic of the voice data.
The speaker labeling device provided by the application can label the speaker of the voice data to be labeled according to the acoustic characteristics of the voice data to be labeled and the characteristics of the speaker appearing in the labeled voice data. And, the characteristics of the speaker who appears in the labeled voice data applied in the above-described scheme of the present application are determined based on the association relationship between the speaker learned in the process of labeling the labeled voice data and the acoustic characteristics of the voice data thereof.
It can be understood that in the process of labeling the speaker for the voice data, the speaker labeling method continuously learns the association relation between the speaker and the acoustic features of the voice data, and determines the speaker features based on the learned association relation, so that as the voice data of the speaker are labeled more and more, the association relation between the speaker and the acoustic features of the voice data of the speaker is learned more and more comprehensively and deeply, and the influence of the environment, the channel, the emotion and other factors on the speaker features can be counteracted, so that the features of the speaker are mastered more and more accurately.
After the characteristics of the speaker appearing in the marked voice data are determined according to the mode, the characteristics are used for marking the speaker for the new voice data to be marked, and the accuracy of marking the speaker can be improved.
Illustratively, the speaker labeling unit 110 labels the speaker of the voice data to be labeled according to at least the acoustic features of the voice data to be labeled and the features of the speaker appearing in the labeled voice data, and specifically includes:
and determining the speaker of the voice data to be marked from the speaker appearing in the marked voice data according to the acoustic characteristics of the voice data to be marked and the characteristics of the speaker appearing in the marked voice data.
Illustratively, the speaker labeling unit 110 labels the speaker of the voice data to be labeled according to at least the acoustic features of the voice data to be labeled and the features of the speaker appearing in the labeled voice data, and specifically includes:
determining the speaker of the voice data to be marked from a speaker set formed by the speakers appearing in the marked voice data and the new speaker, according to the acoustic characteristics of the voice data to be marked, the characteristics of the speakers appearing in the marked voice data and the characteristics of the preset new speaker.
As an alternative implementation manner, the speaker labeling unit 110 determines, according to the acoustic feature of the to-be-labeled voice data, the feature of the speaker appearing in the labeled voice data, and the feature of the preset new speaker, the speaker of the to-be-labeled voice data from the speaker set formed by the speaker appearing in the labeled voice data and the new speaker, including:
determining each first labeling probability and each second labeling probability of the voice data to be labeled according to the acoustic characteristics of the voice data to be labeled and the characteristics of each speaker in the speaker set;
wherein, the first labeling probabilities respectively represent the probabilities that the speaker of the voice data to be labeled is the speaker of the voice data already labeled, and the second labeling probabilities represent the probabilities that the speaker of the voice data to be labeled is the new speaker;
and determining the speaker of the voice data to be annotated from the speaker set at least according to the first annotation probability and the second annotation probability.
As an optional implementation manner, the determining, according to the acoustic feature of the to-be-annotated voice data and the feature of each speaker in the speaker set, the respective first annotation probability and the second annotation probability of the to-be-annotated voice data includes:
Updating a preset feature processing model by utilizing the features of each speaker in the speaker set respectively to obtain a feature processing model corresponding to each speaker in the speaker set;
respectively inputting the acoustic characteristics of the voice data to be annotated into a characteristic processing model corresponding to each speaker in the speaker set to obtain each first annotation probability and second annotation probability of the voice data to be annotated;
the feature processing model is obtained by at least extracting a feature sequence of an acoustic feature sample of voice data, calculating the probability that a speaker of the voice data is determined to be a speaker corresponding to the feature processing model according to the acoustic feature sample of the voice data, and training.
As an optional implementation manner, the updating the preset feature processing model by using the features of each speaker in the speaker set to obtain a feature processing model corresponding to each speaker in the speaker set includes:
respectively inputting the acoustic feature sequence and feature extraction operation parameters of each speaker in the speaker set into a preset feature processing model, and updating the preset feature processing model to obtain a feature processing model corresponding to each speaker in the speaker set;
Wherein, the acoustic feature sequence of a speaker appearing in the marked voice data is extracted from the feature sequence obtained by processing, with the feature processing model corresponding to that speaker, the acoustic features of that speaker's most recently marked voice data; the feature extraction operation parameter of the speaker is the hidden layer state of the feature processing model corresponding to that speaker when it processes the acoustic features of the speaker's most recently marked voice data;
the acoustic feature sequence and the feature extraction operation parameters of the new speaker are preset acoustic feature sequence and feature extraction operation parameters.
As an optional implementation manner, the determining, from the speaker set, the speaker of the voice data to be annotated according to at least the first annotation probability and the second annotation probability includes:
selecting the labeling probability with the maximum probability value from the first labeling probabilities and the second labeling probabilities;
and determining the speaker of the voice data to be annotated from the speaker set according to the annotation probability with the maximum probability value.
As another implementation, the method further includes:
Determining the third labeling probability and the fourth labeling probability of the voice data to be labeled according to the quantity of the labeled voice data corresponding to each speaker appearing in the labeled voice data;
wherein, the third labeling probabilities respectively represent the probabilities that the speaker of the voice data to be labeled is the speaker of the voice data already labeled, and the fourth labeling probability represents the probability that the speaker of the voice data to be labeled is the new speaker;
and determining the speaker of the voice data to be annotated from the speaker set at least according to the first annotation probability and the second annotation probability, including:
and determining the speaker of the voice data to be annotated from the speaker set at least according to the first annotation probability, the second annotation probability, the third annotation probability and the fourth annotation probability.
The determining, according to the number of labeled voice data corresponding to each speaker that appears in the labeled voice data, the third labeling probability and the fourth labeling probability of the voice data to be labeled includes:
Inputting the number of pieces of labeled voice data corresponding to each speaker appearing in the labeled voice data into a pre-trained speaker prediction model, and determining the third labeling probability and the fourth labeling probability of the voice data to be labeled;
wherein the speaker prediction model is trained at least by performing the following training process: according to the number of the voice data of each marked speaker, the probability that the speaker of the voice data sample is any speaker in the marked speakers and the probability that the speaker of the voice data sample is a new speaker are determined.
The determining, from the speaker set, the speaker of the voice data to be annotated according to at least the first annotation probability, the second annotation probability, the third annotation probability, and the fourth annotation probability includes:
corresponding to each first labeling probability, multiplying the first labeling probability by a corresponding third labeling probability to obtain each first comprehensive probability; the first labeling probability and the third labeling probability which correspond to each other correspond to the same speaker which appears in the labeled voice data;
Multiplying the second labeling probability with the fourth labeling probability to obtain a second comprehensive probability;
and determining the speaker of the voice data to be marked from the speaker set according to the comprehensive probability with the maximum probability value in the first comprehensive probability and the second comprehensive probability.
As another alternative implementation, the method further includes:
extracting speaker characteristics of the voice data to be marked according to the acoustic characteristics of the voice data to be marked;
determining the speaker skip probability of the voice data to be marked according to the speaker characteristics of the voice data to be marked and the speaker characteristics of the last marked voice data; the speaker skip probability represents the probability that the speaker of the voice data to be marked is different from the speaker of the previous marked voice data;
the determining, from the speaker set, the speaker of the voice data to be annotated according to at least the first annotation probability, the second annotation probability, the third annotation probability, and the fourth annotation probability, includes:
and determining the speaker of the voice data to be annotated from the speaker set according to the speaker jump probability, the first annotation probability, the second annotation probability, the third annotation probability and the fourth annotation probability.
The determining the speaker skip probability of the voice data to be marked according to the speaker characteristic of the voice data to be marked and the speaker characteristic of the last marked voice data includes:
inputting the speaker characteristics of the voice data to be marked and the speaker characteristics of the last marked voice data into a preset speaker jump model to obtain the speaker jump probability of the voice data to be marked;
the speaker jump model is obtained at least through training by determining the probability of different speakers of different voice data according to the speaker characteristics of the different voice data.
The determining, according to the speaker skip probability, the first labeling probability, the second labeling probability, the third labeling probability and the fourth labeling probability, the speaker of the voice data to be labeled from the speaker set includes:
respectively calculating third comprehensive probabilities corresponding to all the speakers appearing in the marked voice data according to the first marking probabilities, the speaker jump probabilities and the third marking probabilities;
Multiplying the speaker jump probability, the second labeling probability and the fourth labeling probability to obtain a fourth comprehensive probability;
and determining the speaker of the voice data to be marked from the speaker set according to the maximum comprehensive probability of the probability values in the third comprehensive probability and the fourth comprehensive probability.
Further, the method further comprises:
acquiring a feature sequence obtained by processing the acoustic features of the voice data to be marked by a feature processing model corresponding to the speaker of the voice data to be marked;
extracting the acoustic feature sequence of the speaker of the voice data to be marked from the obtained feature sequence;
acquiring the hidden layer state of the feature processing model corresponding to the speaker of the voice data to be marked when that model processes the acoustic features of the voice data to be marked to obtain the feature sequence, and taking the acquired hidden layer state as the feature extraction operation parameter used when extracting the acoustic feature sequence of the speaker of the voice data to be marked;
and the acoustic feature sequence of the speaker of the voice data to be marked and the feature extraction operation parameters are used as the features of the speaker of the voice data to be marked.
As a preferred implementation, the noted voice data and the voice data to be noted are voice data with the same attribute.
The specific working contents of each unit of the speaker labeling device described above, please refer to the contents of the above method embodiment, and are not repeated here.
Another embodiment of the present application further discloses a speaker labeling device, referring to fig. 7, the device includes:
a memory 200 and a processor 210;
wherein the memory 200 is connected to the processor 210, and is used for storing a program;
the processor 210 is configured to implement the speaker labeling method disclosed in any of the foregoing embodiments by running a program stored in the memory 200.
Specifically, the speaker labeling device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are interconnected by a bus. Wherein:
a bus may comprise a path that communicates information between components of a computer system.
Processor 210 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, or may be an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to aspects of the present invention. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Processor 210 may include a main processor, and may also include a baseband chip, modem, and the like.
The memory 200 stores programs for implementing the technical scheme of the present invention, and may also store an operating system and other key services. In particular, the programs may include program code comprising computer operating instructions. More specifically, the memory 200 may include read-only memory (ROM), other types of static storage devices that can store static information and instructions, random access memory (RAM), other types of dynamic storage devices that can store information and instructions, disk storage, flash memory, and the like.
The input device 230 may include means for receiving data and information entered by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include means, such as a display screen, printer, speakers, etc., that allow information to be output to a user.
The communication interface 220 may include devices using any transceiver or the like for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The processor 210 executes the programs stored in the memory 200 and invokes other devices, which may be used to implement the steps of the speaker labeling method provided in the embodiments of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the steps of the speaker labeling method provided in any of the foregoing embodiments.
For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present application is not limited by the order of acts described, as some acts may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described as different from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other. For the apparatus class embodiments, the description is relatively simple as it is substantially similar to the method embodiments, and reference is made to the description of the method embodiments for relevant points.
The steps in the methods of the embodiments of the present application may be sequentially adjusted, combined, and pruned according to actual needs.
The modules and sub-modules in the device and the terminal of the embodiments of the present application may be combined, divided, and deleted according to actual needs.
In the embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of modules or sub-modules is merely a logical function division, and there may be other manners of division in actual implementation, for example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules or sub-modules illustrated as separate components may or may not be physically separate, and components that are modules or sub-modules may or may not be physical modules or sub-modules, i.e., may be located in one place, or may be distributed over multiple network modules or sub-modules. Some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional module or sub-module in each embodiment of the present application may be integrated in one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated in one module. The integrated modules or sub-modules may be implemented in hardware or in software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software elements may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (17)
1. A method for labeling a speaker, comprising:
acquiring acoustic characteristics of voice data to be marked;
labeling the speaker of the voice data to be labeled according to at least the acoustic characteristics of the voice data to be labeled and the characteristics of the speaker appearing in the labeled voice data;
the characteristic of the speaker appearing in the marked voice data is determined based on the association relation between the speaker learned in the speaker marking process of the marked voice data and the acoustic characteristic of the voice data;
the labeling the speaker of the voice data to be labeled at least according to the acoustic characteristics of the voice data to be labeled and the characteristics of the speaker appearing in the labeled voice data, including:
determining each first labeling probability and each second labeling probability of the voice data to be labeled according to the acoustic characteristics of the voice data to be labeled and the characteristics of each speaker in the speaker set;
wherein, the first labeling probabilities respectively represent the probabilities that the speaker of the voice data to be labeled is the speaker of the voice data already labeled, and the second labeling probabilities represent the probabilities that the speaker of the voice data to be labeled is a new speaker;
And determining the speaker of the voice data to be annotated from the speaker set at least according to the first annotation probability and the second annotation probability.
2. The method according to claim 1, wherein labeling the speaker of the voice data to be labeled based at least on the acoustic features of the voice data to be labeled, the features of the speaker that appeared in the labeled voice data, comprises:
and determining the speaker of the voice data to be marked from the speaker appearing in the marked voice data according to the acoustic characteristics of the voice data to be marked and the characteristics of the speaker appearing in the marked voice data.
3. The method according to claim 1, wherein labeling the speaker of the voice data to be labeled based at least on the acoustic features of the voice data to be labeled, the features of the speaker that appeared in the labeled voice data, comprises:
determining the speaker of the voice data to be marked from a speaker set formed by the speakers appearing in the marked voice data and the new speaker, according to the acoustic characteristics of the voice data to be marked, the characteristics of the speakers appearing in the marked voice data and the characteristics of the preset new speaker.
4. A method according to claim 3, wherein said determining the respective first labeling probability and second labeling probability of the voice data to be labeled based on the acoustic features of the voice data to be labeled and the features of each speaker in the speaker set comprises:
updating a preset feature processing model by utilizing the features of each speaker in the speaker set respectively to obtain a feature processing model corresponding to each speaker in the speaker set;
respectively inputting the acoustic characteristics of the voice data to be annotated into a characteristic processing model corresponding to each speaker in the speaker set to obtain each first annotation probability and second annotation probability of the voice data to be annotated;
the feature processing model is obtained by at least extracting a feature sequence of an acoustic feature sample of voice data, calculating the probability that a speaker of the voice data is determined to be a speaker corresponding to the feature processing model according to the acoustic feature sample of the voice data, and training.
5. The method of claim 4, wherein updating the predetermined feature processing model by using the features of each speaker in the speaker set to obtain the feature processing model corresponding to each speaker in the speaker set includes:
Respectively inputting the acoustic feature sequence and feature extraction operation parameters of each speaker in the speaker set into a preset feature processing model, and updating the preset feature processing model to obtain a feature processing model corresponding to each speaker in the speaker set;
wherein the acoustic feature sequence of a speaker appearing in the marked voice data is extracted from the feature sequence obtained by processing, with the feature processing model corresponding to that speaker, the acoustic features of that speaker's most recently marked voice data, and the feature extraction operation parameter of the speaker is the hidden layer state of the feature processing model corresponding to that speaker when it processes the acoustic features of the speaker's most recently marked voice data;
the acoustic feature sequence and the feature extraction operation parameters of the new speaker are preset acoustic feature sequence and feature extraction operation parameters.
6. A method according to claim 3, wherein said determining the speaker of the speech data to be annotated from the set of speakers based at least on the respective first annotation probability and the second annotation probability comprises:
Selecting the labeling probability with the maximum probability value from the first labeling probabilities and the second labeling probabilities;
and determining the speaker of the voice data to be annotated from the speaker set according to the annotation probability with the maximum probability value.
7. A method according to claim 3, characterized in that the method further comprises:
determining the third labeling probability and the fourth labeling probability of the voice data to be labeled according to the quantity of the labeled voice data corresponding to each speaker appearing in the labeled voice data;
wherein, the third labeling probabilities respectively represent the probabilities that the speaker of the voice data to be labeled is the speaker of the voice data already labeled, and the fourth labeling probability represents the probability that the speaker of the voice data to be labeled is the new speaker;
and determining the speaker of the voice data to be annotated from the speaker set at least according to the first annotation probability and the second annotation probability, including:
and determining the speaker of the voice data to be annotated from the speaker set at least according to the first annotation probability, the second annotation probability, the third annotation probability and the fourth annotation probability.
8. The method of claim 7, wherein determining the third labeling probability and the fourth labeling probability of the voice data to be labeled according to the number of labeled voice data corresponding to each speaker that appears in the labeled voice data comprises:
inputting the number of pieces of labeled voice data corresponding to each speaker appearing in the labeled voice data into a pre-trained speaker prediction model, and determining a third labeling probability and a fourth labeling probability of the voice data to be labeled;
wherein the speaker prediction model is trained at least by performing the following training process: according to the number of the voice data of each marked speaker, the probability that the speaker of the voice data sample is any speaker in the marked speakers and the probability that the speaker of the voice data sample is a new speaker are determined.
9. The method of claim 7, wherein determining the speaker of the speech data to be annotated from the speaker set based at least on the respective first annotation probability, the second annotation probability, the respective third annotation probability, and the fourth annotation probability comprises:
Corresponding to each first labeling probability, multiplying the first labeling probability by a corresponding third labeling probability to obtain each first comprehensive probability; the first labeling probability and the third labeling probability which correspond to each other correspond to the same speaker which appears in the labeled voice data;
multiplying the second labeling probability with the fourth labeling probability to obtain a second comprehensive probability;
and determining the speaker of the voice data to be marked from the speaker set according to the comprehensive probability with the maximum probability value in the first comprehensive probability and the second comprehensive probability.
10. The method of claim 7, wherein the method further comprises:
extracting speaker characteristics of the voice data to be marked according to the acoustic characteristics of the voice data to be marked;
determining the speaker skip probability of the voice data to be marked according to the speaker characteristics of the voice data to be marked and the speaker characteristics of the last marked voice data; the speaker skip probability represents the probability that the speaker of the voice data to be marked is different from the speaker of the previous marked voice data;
The determining, from the speaker set, the speaker of the voice data to be annotated according to at least the first annotation probability, the second annotation probability, the third annotation probability, and the fourth annotation probability, includes:
and determining the speaker of the voice data to be annotated from the speaker set according to the speaker jump probability, the first annotation probability, the second annotation probability, the third annotation probability and the fourth annotation probability.
11. The method of claim 10, wherein determining the probability of speaker skip for the voice data to be annotated based on the speaker characteristic of the voice data to be annotated and the speaker characteristic of the last annotated voice data comprises:
inputting the speaker characteristics of the voice data to be marked and the speaker characteristics of the last marked voice data into a preset speaker jump model to obtain the speaker jump probability of the voice data to be marked;
the speaker jump model is obtained at least through training by determining the probability of different speakers of different voice data according to the speaker characteristics of the different voice data.
12. The method of claim 10, wherein the determining the speaker of the voice data to be annotated from the speaker set according to the speaker skip probability, the respective first annotation probability, the second annotation probability, the respective third annotation probability, and the fourth annotation probability comprises:
respectively calculating third comprehensive probabilities corresponding to all the speakers appearing in the marked voice data according to the first marking probabilities, the speaker jump probabilities and the third marking probabilities;
multiplying the speaker jump probability, the second labeling probability and the fourth labeling probability to obtain a fourth comprehensive probability;
and determining the speaker of the voice data to be marked from the speaker set according to the maximum comprehensive probability of the probability values in the third comprehensive probability and the fourth comprehensive probability.
13. The method according to claim 4, wherein the method further comprises:
acquiring a feature sequence obtained by processing the acoustic features of the voice data to be marked by a feature processing model corresponding to the speaker of the voice data to be marked;
Extracting the acoustic feature sequence of the speaker of the voice data to be marked from the obtained feature sequence;
acquiring the hidden layer state of the feature processing model corresponding to the speaker of the voice data to be marked when that model processes the acoustic features of the voice data to be marked to obtain the feature sequence, and taking the acquired hidden layer state as the feature extraction operation parameter used when extracting the acoustic feature sequence of the speaker of the voice data to be marked;
and the acoustic feature sequence of the speaker of the voice data to be marked and the feature extraction operation parameters are used as the features of the speaker of the voice data to be marked.
14. The method according to any one of claims 1 to 13, wherein the noted speech data and the speech data to be noted are speech data of the same attribute.
15. A speaker tagging device, comprising:
the characteristic extraction unit is used for acquiring acoustic characteristics of voice data to be marked;
the speaker labeling unit is used for labeling the speaker of the voice data to be labeled at least according to the acoustic characteristics of the voice data to be labeled and the characteristics of the speaker appearing in the labeled voice data;
The characteristic of the speaker appearing in the marked voice data is determined based on the association relation between the speaker learned in the speaker marking process of the marked voice data and the acoustic characteristic of the voice data;
the labeling the speaker of the voice data to be labeled at least according to the acoustic characteristics of the voice data to be labeled and the characteristics of the speaker appearing in the labeled voice data, including:
determining each first labeling probability and each second labeling probability of the voice data to be labeled according to the acoustic characteristics of the voice data to be labeled and the characteristics of each speaker in the speaker set;
wherein, the first labeling probabilities respectively represent the probabilities that the speaker of the voice data to be labeled is the speaker of the voice data already labeled, and the second labeling probabilities represent the probabilities that the speaker of the voice data to be labeled is a new speaker;
and determining the speaker of the voice data to be annotated from the speaker set at least according to the first annotation probability and the second annotation probability.
16. A speaker annotation device comprising:
a memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor implements the speaker tagging method according to any one of claims 1 to 14 by executing a program stored in the memory.
17. A storage medium having stored thereon a computer program which, when executed by a processor, implements the speaker annotation method of any of claims 1 to 14.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010249826.7A (CN111462759B) | 2020-04-01 | 2020-04-01 | Speaker labeling method, device, equipment and storage medium
Publications (2)

Publication Number | Publication Date
---|---
CN111462759A | 2020-07-28
CN111462759B | 2024-02-13

Family
ID=71681288

Family Applications (1)

Application Number | Priority Date | Filing Date
---|---|---
CN202010249826.7A (CN111462759B, Active) | 2020-04-01 | 2020-04-01

Country Status (1)

Country | Link
---|---
CN | CN111462759B (en)
Citations (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1202687A (en) * | 1997-05-06 | 1998-12-23 | 国际商业机器公司 | Speaker recognition over large population with fast and detailed matches |
US5946654A (en) * | 1997-02-21 | 1999-08-31 | Dragon Systems, Inc. | Speaker identification using unsupervised speech models |
CN1758332A (en) * | 2005-10-31 | 2006-04-12 | 浙江大学 | Speaker recognition method based on MFCC linear emotion compensation |
CN103035239A (en) * | 2012-12-17 | 2013-04-10 | 清华大学 | Speaker recognition method based on partial learning |
CN104538035A (en) * | 2014-12-19 | 2015-04-22 | 深圳先进技术研究院 | Speaker recognition method and system based on Fisher supervectors |
CN104903954A (en) * | 2013-01-10 | 2015-09-09 | 感官公司 | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination |
CN105895080A (en) * | 2016-03-30 | 2016-08-24 | 乐视控股(北京)有限公司 | Voice recognition model training method, speaker type recognition method and device |
CN105895104A (en) * | 2014-05-04 | 2016-08-24 | 讯飞智元信息科技有限公司 | Adaptive speaker identification method and system |
CN106019230A (en) * | 2016-05-27 | 2016-10-12 | 南京邮电大学 | Sound source positioning method based on i-vector speaker recognition |
CN106683680A (en) * | 2017-03-10 | 2017-05-17 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device and computer equipment and computer readable media |
CN107146601A (en) * | 2017-04-07 | 2017-09-08 | 南京邮电大学 | A kind of rear end i vector Enhancement Methods for Speaker Recognition System |
CN107369440A (en) * | 2017-08-02 | 2017-11-21 | 北京灵伴未来科技有限公司 | The training method and device of a kind of Speaker Identification model for phrase sound |
CN107993664A (en) * | 2018-01-26 | 2018-05-04 | 北京邮电大学 | A kind of robust method for distinguishing speek person based on Competitive ANN |
CN108320732A (en) * | 2017-01-13 | 2018-07-24 | 阿里巴巴集团控股有限公司 | The method and apparatus for generating target speaker's speech recognition computation model |
CN108417201A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | The more speaker's identity recognition methods of single channel and system |
CN108735200A (en) * | 2018-06-27 | 2018-11-02 | 北京灵伴即时智能科技有限公司 | A kind of speaker's automatic marking method |
CN109410954A (en) * | 2018-11-09 | 2019-03-01 | 杨岳川 | A kind of unsupervised more Speaker Identification device and method based on audio-video |
CN109564759A (en) * | 2016-08-03 | 2019-04-02 | 思睿逻辑国际半导体有限公司 | Speaker Identification |
CN109643549A (en) * | 2016-08-31 | 2019-04-16 | 三星电子株式会社 | Audio recognition method and device based on speaker identification |
CN110047504A (en) * | 2019-04-18 | 2019-07-23 | 东华大学 | Method for distinguishing speek person under identity vector x-vector linear transformation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7457745B2 (en) * | 2002-12-03 | 2008-11-25 | Hrl Laboratories, Llc | Method and apparatus for fast on-line automatic speaker/environment adaptation for speech/speaker recognition in the presence of changing environments |
US10950244B2 (en) * | 2017-11-29 | 2021-03-16 | ILLUMA Labs LLC. | System and method for speaker authentication and identification |
- 2020-04-01: Application CN202010249826.7A filed in China (CN); granted as patent CN111462759B, legal status Active
Also Published As
Publication number | Publication date |
---|---|
CN111462759A (en) | 2020-07-28 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110364144B (en) | Speech recognition model training method and device | |
WO2019223457A1 (en) | Mixed speech recognition method and apparatus, and computer readable storage medium | |
CN110267119B (en) | Video precision and chroma evaluation method and related equipment | |
CN111339774A (en) | Text entity relation extraction method and model training method | |
CN110970018B (en) | Speech recognition method and device | |
CN110321863A (en) | Age recognition methods and device, storage medium | |
CN110096617B (en) | Video classification method and device, electronic equipment and computer-readable storage medium | |
CN112232276B (en) | Emotion detection method and device based on voice recognition and image recognition | |
CN111508480A (en) | Training method of audio recognition model, audio recognition method, device and equipment | |
CN110797031A (en) | Voice change detection method, system, mobile terminal and storage medium | |
CN110930969A (en) | Background music determination method and related equipment | |
CN112232506A (en) | Network model training method, image target recognition method, device and electronic equipment | |
CN111868823A (en) | Sound source separation method, device and equipment | |
CN116932735A (en) | Text comparison method, device, medium and equipment | |
US10956792B2 (en) | Methods and apparatus to analyze time series data | |
CN113436614A (en) | Speech recognition method, apparatus, device, system and storage medium | |
WO2024208075A1 (en) | Updating method, apparatus and device for behavior data prediction model, and storage medium | |
CN111462759B (en) | Speaker labeling method, device, equipment and storage medium | |
CN115510299A (en) | Data classification method, model compression method, device, equipment and program product | |
CN113782014B (en) | Speech recognition method and device | |
CN113555037B (en) | Method and device for detecting tampered area of tampered audio and storage medium | |
CN115063858A (en) | Video facial expression recognition model training method, device, equipment and storage medium | |
CN111124350B (en) | Skill determination method and related equipment | |
CN113434494A (en) | Data cleaning method and system, electronic equipment and storage medium | |
CN112463964A (en) | Text classification and model training method, device, equipment and storage medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |