CN110910891A - Speaker segmentation labeling method and device based on long short-term memory neural network - Google Patents

Speaker segmentation labeling method and device based on long short-term memory neural network

Info

Publication number
CN110910891A
Authority
CN
China
Prior art keywords
speaker
time
audio
feature
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911118136.1A
Other languages
Chinese (zh)
Other versions
CN110910891B (en)
Inventor
宓仕达
杜姗姗
冯瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201911118136.1A priority Critical patent/CN110910891B/en
Publication of CN110910891A publication Critical patent/CN110910891A/en
Application granted granted Critical
Publication of CN110910891B publication Critical patent/CN110910891B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

The invention provides a speaker segmentation labeling method and device based on a long short-term memory (LSTM) neural network, in which a speaker recognition sample labeling model based on an LSTM deep neural network is used to detect, from the audio to be tested, when each speaker's voice occurs and how long it lasts. The method comprises the following steps: step S1, preprocessing the audio to be tested to obtain audio frame-level features f1 and f2; step S2, building a speaker recognition sample labeling model based on the LSTM deep neural network, the speaker sample labeling model comprising a speaker conversion detection sub-model and a speaker feature modeling sub-model; step S3, training the speaker conversion detection sub-model and the speaker feature modeling sub-model respectively; and step S4, inputting the audio frame-level features f1 and f2 into the speaker recognition sample labeling model so as to record, by speaker, the speaking time periods of all speakers in the audio to be tested.

Description

Speaker segmentation labeling method and device based on long short-term memory neural network
Technical Field
The invention belongs to the technical field of computer audition and artificial intelligence, relates to methods for speaker conversion detection, voice feature space modeling and segmentation labeling in auditory scenes, and particularly relates to a speaker conversion detection and voice segmentation labeling method based on a bidirectional long short-term memory (BLSTM) network model.
Background
With the rapid progress of machine learning techniques and computer hardware, application fields such as computer vision, natural language processing and voice detection have made breakthrough advances in recent years. Speaker segmentation labeling is a fundamental task in computer speech processing, and its recording accuracy has been greatly improved.
The speaker segmentation labeling task can be divided into two key subtasks: speaker conversion detection and speech feature space modeling.
The speaker conversion detection task is responsible for finding whether speaker conversion points exist in an input voice segment; its output is a series of confidence labels indicating the time points at which conversions occur, so that the audio can be divided into segments each dominated by a single speaker. Voice feature space modeling is responsible for generating a feature space that can distinguish the voice characteristics of a sufficient number of speakers and establishing a mapping such that voice feature segments from different speakers can be effectively told apart, i.e., generating a d-vector for each speaker.
Speaker segmentation labeling, or speaker recording (diarization), is of great significance both in computer acoustics and in practical applications, and has attracted the close attention and research effort of a large number of researchers over the past decades. With the development of powerful machine learning theory and feature analysis techniques, research activity on topics such as computer acoustics and speaker segmentation has kept growing in recent years, and new research results and practical applications are published every year. Speaker segmentation is also applied in many practical tasks, such as intelligent human-computer interaction, sensitive-person feature recognition and intelligent meeting recording. However, the detection accuracy of prior-art speaker segmentation recording methods is still low and cannot meet the requirements of practical, general detection tasks. Thus, speaker segmentation labeling has not been solved perfectly and remains an important and challenging research topic.
To improve the accuracy of speaker recording, the common approach at present is to enlarge the training data used to train the detection model and to rely on an unsupervised clustering method to extract the segments of each speaker. However, on the one hand, collecting a large amount of training data is extremely difficult, and the larger data volume prolongs model training; on the other hand, the clustering methods used by existing approaches may not adapt well to the generated speaker feature space, which affects the accuracy of segmentation and recording.
Disclosure of Invention
In order to solve the above problems, the invention provides a speaker recording method with a simple structure, low training cost, no dependence on a clustering algorithm and high recognition accuracy, adopting the following technical scheme:
the invention provides a speaker segmentation labeling method based on a long-time and short-time memory deep neural network, which is characterized in that a speaker recognition sample labeling model based on the long-time and short-time memory deep neural network is adopted to detect the occurrence and duration time of voice of each speaker from audio to be detected, and the method comprises the following steps: step S1, preprocessing the audio to be detected to obtain audio frame level characteristics f1 and audio frame level characteristics f2, wherein the audio frame level characteristics f1 are data required by conversion detection of a speaker, and the audio frame level characteristics f2 are data required by voice print characteristic modeling of the speaker; step S2, building a speaker recognition sample labeling model based on a long-time and short-time memory deep neural network, wherein the speaker sample labeling model comprises a speaker conversion detection sub-model and a speaker characteristic modeling sub-model; step S3, inputting a training set containing a plurality of groups of speaker conversion training audios into the constructed speaker conversion detection submodel for training, and inputting a training containing a plurality of groups of speaker characteristic modeling training audios, namely the speaker characteristic modeling submodel for model training; step S4, inputting the audio frame level features f1 and the audio frame level features f2 into a speaker recognition sample labeling model based on a long-time and short-time memory deep neural network so as to complete the classified recording of the speaking time periods of all speakers in the audio to be tested, wherein the step S4 comprises the following substeps: step S4-1, inputting the audio frame level feature f1 into a speaker conversion detection submodel so as to identify the time point of a speaker conversion point in the audio to be detected; step S4-2, cutting out the characteristic segment of a single speaker from the audio frame level characteristic f2 according to the time point; step S4-3, inputting the characteristic segments into the speaker characteristic modeling submodel so as to generate the characteristic vector of each characteristic segment; and step S4-4, correspondingly distributing each characteristic segment to a certain stored or newly-built speaker information according to the cosine similarity between the characteristic vectors, so as to record each speaking time period and the corresponding speaker according to the characteristic segments and the time points.
The speaker segmentation labeling method based on the long short-term memory deep neural network provided by the invention may further have the technical feature that step S1 comprises the following sub-steps: step S1-1, performing Mel-frequency cepstral coefficient (MFCC), MFCC first-derivative and MFCC second-derivative operations on the audio to be tested, and combining the three results of each frame into a fused feature, which serves as the audio frame-level feature f1 of that frame; and step S1-2, applying a logarithmic Mel filter bank to the audio to be tested and taking the result as the audio frame-level feature f2 of each frame.
The speaker segmentation labeling method based on the long short-term memory deep neural network provided by the invention may further have the technical feature that step S3 comprises the following sub-steps: step S3-1, initializing the speaker recognition sample labeling model, the model parameters it contains being set randomly; step S3-2, inputting the corresponding training sets into the speaker conversion detection sub-model and the speaker feature modeling sub-model respectively to perform one iteration; step S3-3, calculating the respective loss errors from the model parameters of the speaker conversion detection sub-model and the speaker feature modeling sub-model, where the speaker conversion detection sub-model uses a logarithmic cross-entropy loss function and the speaker feature modeling sub-model uses a custom generalized end-to-end loss function; step S3-4, back-propagating the loss errors of the speaker conversion detection sub-model and the speaker feature modeling sub-model respectively so as to update the model parameters; and step S3-5, repeating steps S3-2 to S3-4 until the training completion condition is reached, thereby obtaining the trained speaker recognition sample labeling model.
The invention also provides a speaker segmentation labeling device based on the long short-term memory deep neural network, characterized in that a speaker recognition sample labeling model based on the long short-term memory deep neural network is used to detect, from the audio to be tested, when each speaker's voice occurs and how long it lasts, the device comprising: a preprocessing part that preprocesses the audio to be tested to obtain audio frame-level features f1 and f2, where feature f1 is the data required for speaker conversion detection and feature f2 is the data required for speaker voiceprint feature modeling; and a speaker recognition part that records, by speaker, the speaking time periods of all speakers in the audio to be tested according to the audio frame-level features f1 and f2, the speaker recognition part containing a pre-trained speaker recognition sample labeling model based on the long short-term memory deep neural network, the speaker sample labeling model comprising a speaker conversion detection sub-model and a speaker feature modeling sub-model, and the speaker recognition part comprising: a time-point identification unit that inputs feature f1 into the speaker conversion detection sub-model to identify the time points of speaker conversion points in the audio to be tested; a feature segment cutting unit that cuts single-speaker feature segments out of feature f2 according to these time points; a feature vector generation unit that inputs the feature segments into the speaker feature modeling sub-model to generate a feature vector for each segment; and a speaker matching and recording unit that assigns each feature segment to stored or newly created speaker information according to the cosine similarity between feature vectors, so as to record the speaking time period of each speaker.
Action and Effect of the invention
According to the speaker segmentation labeling method based on the long short-term memory deep neural network, the speaker sample labeling model used to analyze the audio to be tested comprises a speaker conversion detection sub-model and a speaker feature modeling sub-model. The speaker conversion detection sub-model performs speaker conversion detection directly on sequential feature windows of the audio to be tested, so it can directly learn the sequential characteristics of segments that contain the voices of both the preceding and the following speaker, and the audio can therefore be divided into segments by speaker more accurately. The speaker feature modeling sub-model helps generate more similar embedding vectors for voice segments from the same speaker and more discriminative embedding vectors for voice segments from different speakers, which improves the final speaker segmentation labeling accuracy. Compared with existing high-accuracy models, the speaker sample labeling model used in this speaker segmentation labeling method is quick and convenient to build while maintaining high accuracy, and its training requires less computation, so it can be adapted to new usage scenarios by brief retraining on a transferred data set.
Drawings
FIG. 1 is a flowchart of the speaker segmentation labeling method based on the long short-term memory deep neural network in an embodiment of the present invention; and
fig. 2 is a schematic structural diagram of the speaker recognition sample labeling model based on the long short-term memory deep neural network in an embodiment of the present invention.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the speaker segmentation labeling method based on the long short-term memory deep neural network of the invention is described in detail below with reference to the embodiment and the accompanying drawings.
< example >
In this embodiment, the data set used is TIMIT. TIMIT is a speech corpus designed for acoustic-phonetic research and for the development and evaluation of automatic speech recognition systems; it contains around 5 hours of audio data collected with wide-band microphones from speakers of the eight major dialect regions of the United States. The recordings are 16-bit at a 16,000 Hz sampling rate. The TIMIT data set contains audio from 630 speakers, each of whom read 10 sentences, giving 6300 single-speaker speech samples. Over 396,000 speaker transition points were obtained by randomly concatenating these samples.
In addition, the hardware platform on which the speaker segmentation labeling method of this embodiment is implemented requires an NVIDIA GTX 1080 Ti graphics card (for GPU acceleration).
In this embodiment, two kinds of preprocessing are first applied to the audio of the data set; the two parts of the speaker recognition sample labeling model based on the long short-term memory deep neural network are then trained separately; finally, the speaker conversion time points and the labeled speaking time periods of each speaker contained in the target audio are obtained through the speaker recognition sample labeling model, i.e., "who spoke when" is determined and recorded. The method specifically comprises the following five processes: preprocessing, model building, training of the speaker conversion detection model, training of the speaker voiceprint feature space model, and speaker segmentation labeling.
Fig. 1 is a flowchart of the speaker segmentation labeling method based on the long short-term memory deep neural network in an embodiment of the present invention.
As shown in fig. 1, the speaker segmentation labeling method based on the long short-term memory deep neural network includes the following steps:
Step S1, preprocessing the audio to be tested to obtain two audio frame-level features f1 and f2.
In this embodiment, the audio to be tested is an audio (e.g., a telephone recording, a conference recording, etc.) including a plurality of speaker utterance fragments. The preprocessing process in step S1 includes the following sub-steps:
and step S1-1, carrying out overlapped sliding window sampling on the sequence of the audio to be tested (namely the time sequence of each sampling frame of the audio), carrying out short-time Fourier transform on each window obtained by sampling, and carrying out logarithmic Mel filtering to obtain the audio frame level characteristic f 1. The audio frame level feature f1 is used to identify the speaker, and the dimension of this part of the feature is 40 × N, where N is the number of windows sampled by the sliding window.
Step S1-2, calculating the Mel frequency cepstrum coefficient of the original audio to be tested, and the first derivative and the second derivative of the Mel frequency cepstrum coefficient thereof, and stacking the audio as the audio frame level feature f2 according to the corresponding sequence of time slices. The audio frame level feature f2 is used to detect the speaker transition point in the audio, and since the same window parameters are used in sliding window sampling, the dimension of this feature is 59 × N.
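As a concrete illustration of the two preprocessing streams, the following Python sketch (using librosa) produces the two frame-level features described above. The window length, hop length, number of MFCCs and the dropping of the 0th cepstral coefficient are assumptions made only so that the output dimensions match the stated 59 × N and 40 × N; the patent does not specify these parameters.

```python
# Sketch of steps S1-1 / S1-2: f1 (MFCC + deltas, 59 x N) for speaker conversion
# detection, f2 (40-band log-Mel, 40 x N) for speaker voiceprint modeling.
# Frame parameters and coefficient counts are assumptions, not taken from the patent.
import numpy as np
import librosa

def extract_features(path, sr=16000, n_fft=400, hop_length=160):
    y, _ = librosa.load(path, sr=sr)

    # f1: 20 MFCCs (0th coefficient dropped) + 20 first derivatives + 20 second
    # derivatives, stacked per frame -> 19 + 20 + 20 = 59 rows.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop_length)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    f1 = np.vstack([mfcc[1:], d1, d2])           # shape (59, N)

    # f2: short-time spectra passed through a 40-band logarithmic Mel filter bank.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=40)
    f2 = librosa.power_to_db(mel)                # shape (40, N)
    return f1, f2
```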
Step S2, building the speaker recognition sample labeling model (hereinafter also simply "the model") based on the long short-term memory deep neural network.
In step S2 of this embodiment, the speaker recognition sample labeling model is built with the mainstream deep learning framework PyTorch. The model is a speaker recording neural network based on the long short-term memory neural network that does not depend on a clustering method, and it consists mainly of two modules: the speaker conversion detection sub-model and the speaker feature modeling sub-model.
The speaker recognition sample labeling model treats speaker segmentation labeling as two connected tasks. First, the speaker conversion detection sub-model detects speaker conversions in the audio to be tested, providing a basis for cutting the audio in the subsequent steps; then the speaker feature modeling sub-model embeds the cut segments into a feature space, extracting a voiceprint embedding vector for the speaker of each segment, and these vectors, after normalization, serve as the feature basis for labeling speaker identity.
Fig. 2 is a schematic structural diagram of the speaker recognition sample labeling model based on the long short-term memory deep neural network in an embodiment of the present invention.
As shown in FIG. 2, the Speaker recognition sample labeling model includes two sub-model structures, namely, a Speaker conversion detection sub-model (SCD-Net) and a Speaker feature modeling sub-model (Speaker-Embedding-Net).
The speaker conversion detection sub-model builds the speaker-conversion-point labeling structure and provides the basis for segmenting the audio to be tested; because the speaker switch inside each window is predicted directly, the feature information of the conversion point can be extracted more effectively. The speaker conversion detection sub-model comprises, arranged in sequence, an input layer I, a bidirectional long short-term memory network B1, a bidirectional long short-term memory network B2, a fully connected layer FC1, a fully connected layer FC2 and an output layer O.
The speaker feature modeling sub-model (also called the speaker voiceprint feature space modeling sub-model) extracts embedding vectors from the input audio feature segments, and the similarity relations between these vectors serve as the basis for attributing the speech segments to speakers. This sub-model comprises, arranged in sequence, an input layer I', a long short-term memory network L1, a long short-term memory network L2, a long short-term memory network L3, a fully connected layer FC'1 and an output layer O'.
The speaker recognition sample labeling model of this embodiment is composed of bidirectional long short-term memory network structures and fully connected layers, and an L2 regularization operation is applied to the data after each bidirectional long short-term memory layer. Because the task is a two-class problem, ReLU is used as the activation function of the fully connected layers. Specifically, the structure of the speaker recognition sample labeling model based on the long short-term memory deep neural network is as follows (a code sketch of this structure is given after the list):
(1) an input layer I for inputting the pre-processed sliding window sampled frame-level features f1, which is 59 × 320 in size;
(2) two bidirectional long short-term memory network structures, B1 and B2, where B1 has 32 hidden nodes and its output is 32 × 320, and B2 has 20 hidden nodes and its output is 20 × 320;
(3) two fully connected layer structures including FC1 and FC2, where FC1 output is 40 × 320 and FC2 output is 10 × 320;
(4) the output layer O of the speaker conversion detection sub-model, which actually consists of a fully connected layer with output dimension 2 × 320 followed by a log-softmax layer, so the final output is 2 × 320;
(5) the input layer I' of the speaker voiceprint feature space modeling sub-model, which receives the preprocessed frame-level feature f2 segments cut according to the output of the speaker conversion detection sub-model; the segments are sampled into window data of size 40 × 160, and the size of I' matches the window size, i.e., 40 × 160;
(6) the long short-term memory networks L1, L2 and L3, each of the three having 768 hidden nodes and an output of 768 × 160;
(7) a fully connected layer FC'1, which maps the result of the previous layer to 256 dimensions;
(8) the output layer O' is an L2 normalization layer, and the output is 256-dimensional feature space embedding vectors.
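A minimal PyTorch sketch of the two sub-model structures listed above follows. The per-direction hidden sizes (half of each stated output width), the interpretation of the described "L2 regularization operation on the data" as L2 normalization of each BLSTM output, and the use of the last time step as the utterance summary in the embedding network are all assumptions; only the layer order and output dimensions come from the list above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCDNet(nn.Module):
    """Speaker conversion detection sub-model: I -> B1 -> B2 -> FC1 -> FC2 -> O."""
    def __init__(self, in_dim=59):
        super().__init__()
        # hidden sizes chosen so the bidirectional outputs match the stated widths (assumption)
        self.b1 = nn.LSTM(in_dim, 16, bidirectional=True, batch_first=True)  # per-frame output 32
        self.b2 = nn.LSTM(32, 10, bidirectional=True, batch_first=True)      # per-frame output 20
        self.fc1 = nn.Linear(20, 40)
        self.fc2 = nn.Linear(40, 10)
        self.out = nn.Linear(10, 2)                                          # change / no-change

    def forward(self, x):                        # x: (batch, 320, 59)
        h, _ = self.b1(x)
        h = F.normalize(h, p=2, dim=-1)          # "L2 regularization" read as L2 normalization
        h, _ = self.b2(h)
        h = F.normalize(h, p=2, dim=-1)
        h = F.relu(self.fc1(h))
        h = F.relu(self.fc2(h))
        return F.log_softmax(self.out(h), dim=-1)    # (batch, 320, 2) frame-wise log-probabilities

class SpeakerEmbeddingNet(nn.Module):
    """Speaker voiceprint feature space modeling sub-model: I' -> L1 -> L2 -> L3 -> FC'1 -> O'."""
    def __init__(self, in_dim=40):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, 768, num_layers=3, batch_first=True)
        self.fc = nn.Linear(768, 256)

    def forward(self, x):                        # x: (batch, 160, 40)
        h, _ = self.lstm(x)
        emb = self.fc(h[:, -1, :])               # last frame as utterance summary (assumption)
        return F.normalize(emb, p=2, dim=-1)     # 256-dim L2-normalized embedding (d-vector)
```

In this reading, SCDNet emits a change/no-change log-probability for each of the 320 frames in a window, while SpeakerEmbeddingNet maps each 160-frame segment window to a single 256-dimensional voiceprint vector.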
Step S3, inputting a training set containing multiple groups of speaker conversion training audio into the constructed speaker conversion detection sub-model to train it, and inputting a training set containing multiple groups of speaker feature modeling training audio into the speaker feature modeling sub-model to train it.
In this embodiment, the acoustic speech data set TIMIT is used as training data. The speaker conversion detection sub-model uses more than 396,000 speaker conversion points generated by randomly splicing the 6300 voice segments of TIMIT; the speaker voiceprint feature space modeling sub-model uses the voice data of the 630 individual speakers contained in the TIMIT data set to construct the voiceprint feature space. The training data are processed in a way similar to step S1, namely: for the speaker conversion part, sliding-window sampling is applied to each audio file and each window is labeled, yielding the more than 396,000 speaker conversion points from the data set, with each conversion point corresponding to 5 window samples; these windows form the speaker conversion audio training set of the speaker conversion detection sub-model. For the speaker voiceprint feature training set required by the speaker voiceprint feature space modeling sub-model, 4 audio segments are sampled from each original TIMIT audio file, which strengthens the sub-model's ability to recognize voices contained in incomplete sentences. A data-construction sketch is given below.
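The following sketch illustrates one plausible way to build the speaker-conversion training windows from spliced TIMIT utterances as described above. The jittered window placement and the frame-wise "new speaker" labeling convention are assumptions, since the patent only states that each conversion point yields 5 labeled window samples.

```python
# Sketch: build labeled 320-frame windows around a speaker change point created by
# splicing the f1 features of two different single-speaker TIMIT utterances.
# Window placement jitter and the frame-wise labeling scheme are assumptions.
import random
import numpy as np

def make_change_point_windows(utt_a, utt_b, win=320, n_windows=5):
    """utt_a, utt_b: frame-level f1 feature matrices (59 x N) of two different speakers."""
    joined = np.concatenate([utt_a, utt_b], axis=1)   # spliced features, change at frame Na
    change = utt_a.shape[1]
    windows, labels = [], []
    for _ in range(n_windows):
        start = change - win // 2 + random.randint(-win // 4, win // 4)  # jitter (assumption)
        start = int(np.clip(start, 0, joined.shape[1] - win))            # assumes enough frames
        seg = joined[:, start:start + win]
        frame_labels = np.zeros(win, dtype=np.int64)
        if start <= change < start + win:
            frame_labels[change - start:] = 1        # frames after the change marked as class 1
        windows.append(seg)
        labels.append(frame_labels)
    return windows, labels
```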
The model training process of step S3 specifically includes the following steps:
step S3-1, initializing a speaker identification sample labeling model, wherein each layer of the speaker identification sample labeling model comprises different model parameters, and the model parameters are randomly set during initialization.
Step S3-2, inputting the corresponding training sets into the speaker conversion detection sub-model and the speaker feature modeling sub-model respectively to carry out one iteration.
In step S3-2 of this embodiment, the speaker conversion detection sub-model is trained with a batch size of 32 for a total of 35,200 iterations; the speaker voiceprint feature space modeling sub-model is trained with a batch size of 16 for a total of 56,000 iterations.
Step S3-3, calculating the respective loss errors of the two sub-models, where the speaker conversion detection sub-model uses a logarithmic cross-entropy loss function and the speaker feature modeling sub-model uses a custom generalized end-to-end loss function.
Step S3-4, back-propagating the loss errors of the speaker conversion detection sub-model and the speaker feature modeling sub-model respectively so as to update the model parameters.
During model training, for the speaker conversion detection sub-model, after each iteration (i.e., after a training sound fragment passes through the model) the loss error (log-softmax loss, i.e., logarithmic cross-entropy loss) is calculated on the output of the last layer and then back-propagated, thereby updating the model parameters; for the speaker voiceprint feature space modeling sub-model, the Generalized End-to-End Loss is used in the same way after each iteration. In addition, an adaptive learning rate is used during the training of both models to adjust the learning rate at each iteration, and training is complete once the parameters of every layer have converged.
Step S3-5, repeating steps S3-2 to S3-4 until the training completion condition is reached, thereby obtaining the trained model. The training completion condition is a conventional one; for example, convergence of the model parameters can be regarded as completion of training.
After the iterative training, with its per-iteration loss calculation and back-propagation, the trained speaker recognition sample labeling model based on the long short-term memory deep neural network is obtained (a condensed training-loop sketch is given below). This embodiment uses the trained model to record speaker voices in segments in a telephone-recording scenario.
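The training-loop sketch below corresponds to steps S3-2 to S3-5, assuming the model classes sketched earlier, PyTorch DataLoaders that yield the window tensors and labels described above, and a separate ge2e_loss implementation. The optimizer choice (Adam) and learning rate are assumptions; the patent only states that an adaptive learning rate is used.

```python
import torch.nn as nn
import torch.optim as optim

def train_scd(model, loader, iters=35200, device="cuda"):
    """Speaker conversion detection sub-model: log-softmax output + NLLLoss = log cross-entropy."""
    model.to(device).train()
    criterion = nn.NLLLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)   # adaptive learning rate (Adam assumed)
    step = 0
    while step < iters:
        for x, y in loader:                               # x: (32, 320, 59), y: (32, 320)
            logp = model(x.to(device))                    # (32, 320, 2)
            loss = criterion(logp.reshape(-1, 2), y.to(device).reshape(-1))
            optimizer.zero_grad()
            loss.backward()                               # back-propagate the loss error
            optimizer.step()                              # update the model parameters
            step += 1
            if step >= iters:
                break

def train_embedding(model, loader, ge2e_loss, iters=56000, device="cuda"):
    """Speaker voiceprint sub-model; ge2e_loss is a placeholder for the generalized
    end-to-end loss (its learnable scale/offset would also need to be optimized)."""
    model.to(device).train()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    step = 0
    while step < iters:
        for segments, speaker_ids in loader:              # batches of 16 speaker-grouped segments
            embeddings = model(segments.to(device))       # (16, 256) d-vectors
            loss = ge2e_loss(embeddings, speaker_ids.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= iters:
                break
```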
Step S4, the audio frame-level features f1 and f2 are input into the speaker recognition sample labeling model so as to record, by speaker, the speaking time periods of each speaker in the audio to be tested.
In this embodiment, step S4 includes the following sub-steps:
and S4-1, inputting the audio frame level characteristics f1 into the speaker conversion detection submodel so as to identify the time point of the speaker conversion point in the audio to be detected.
In this embodiment, the time point identified by the speaker conversion detection sub-model is the time position information of the speaker conversion point in the audio to be detected.
Step S4-2, feature segments covering the time periods in which a single speaker is speaking are cut out of the audio frame-level feature f2 according to the time points identified in step S4-1.
And step S4-3, inputting the feature segments into the speaker feature modeling submodel so as to generate a feature vector of each feature segment, wherein the feature vector is a voiceprint feature space embedding vector of each feature segment.
Step S4-4, assigning each feature segment to stored or newly created speaker information according to the cosine similarity between the feature vectors, so that each speaking time period and the corresponding speaker are recorded from the feature segments and the time points.
In step S4-4 of this embodiment, the cosine similarity is computed in turn between the feature vector of each segment and the voiceprint feature vectors already stored for known speakers, so as to match the feature segment with speaker information. If matching speaker information is found, the feature segment is attached to it directly; if none matches, the segment belongs to a speaker that has not appeared before, so a new piece of speaker information is created and the segment is attached to it.
After a feature segment is matched with speaker information, a record is made from the segment and its corresponding time points: the corresponding speaking time period in the audio to be tested is obtained and recorded, together with the corresponding speaker information. Once all matching and recording are finished, the attribution labeling of the speaker voice segments in each audio file to be tested is complete. A sketch of this matching logic is given below.
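The following sketch illustrates the matching logic of step S4-4: the embedding of each cut segment is compared by cosine similarity with the voiceprint stored for every speaker seen so far, and a new speaker entry is created when nothing matches. The similarity threshold value and the use of the first segment embedding as the stored voiceprint are assumptions; the patent does not specify them.

```python
import torch
import torch.nn.functional as F

def assign_segments(embeddings, time_spans, threshold=0.75):
    """embeddings: list of 256-dim L2-normalized tensors, one per cut segment.
    time_spans: list of (start_s, end_s) pairs derived from the detected conversion points.
    Returns (speaker_id, start_s, end_s) records, i.e. 'who spoke when'."""
    speakers = []                                 # one stored voiceprint vector per known speaker
    records = []
    for emb, (start, end) in zip(embeddings, time_spans):
        if speakers:
            sims = torch.stack([F.cosine_similarity(emb, s, dim=0) for s in speakers])
            best = int(torch.argmax(sims))
            if sims[best] >= threshold:           # matched an already-stored speaker
                records.append((best, start, end))
                continue
        speakers.append(emb)                      # unseen voice: create new speaker information
        records.append((len(speakers) - 1, start, end))
    return records
```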
In this embodiment, the model is also tested using a spliced TIMIT test set as the audio to be tested, where each test audio is a concatenation of single-speaker utterances from several different speakers.
The specific process is as follows: the audio data in the test set are preprocessed with the same method as step S1 to obtain the two frame-level features for 200 test audio files; these are fed one by one into the trained speaker recognition sample labeling model; each test audio is cut according to the output of the speaker conversion detection sub-model; the cut segments are fed into the voiceprint feature sub-model to generate 256-dimensional voiceprint embedding vectors, which are stored; this process is repeated, and each newly generated embedding vector is compared by cosine similarity with the stored vectors and recorded. In the end, the attribution labels of the speaker voice segments in the test audio are obtained.
In this embodiment, the trained speaker recognition sample labeling model based on the long short-term memory deep neural network achieves a speaker diarization error rate (DER) of 14.22% on the test set. The same test set was also used to run comparison tests on other prior-art methods, with the results shown in the following table:
table 1 comparative test results
(The body of Table 1 is provided as an image in the original publication and is not reproduced here.)
As can be seen from Table 1, compared with common labeling methods, the method provided by the invention effectively improves the accuracy of speaker segmentation labeling.
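For reference, the diarization error rate (DER) quoted above is conventionally computed, following the standard NIST definition (the patent does not spell out its scoring configuration), as

DER = (T_false_alarm + T_missed_speech + T_speaker_confusion) / T_total_scored_speech,

i.e., the fraction of scored speech time that is falsely detected, missed, or attributed to the wrong speaker; a DER of 14.22% therefore means that roughly one seventh of the scored speech time is labeled incorrectly.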
Action and Effect of the Embodiment
According to the speaker segmentation labeling method based on the long short-term memory deep neural network of this embodiment, the speaker sample labeling model used to analyze the audio to be tested comprises a speaker conversion detection sub-model and a speaker feature modeling sub-model. The speaker conversion detection sub-model performs speaker conversion detection directly on sequential feature windows of the audio to be tested, so it can directly learn the sequential characteristics of segments that contain the voices of both the preceding and the following speaker, and the audio can therefore be divided into segments by speaker more accurately. The speaker feature modeling sub-model helps generate more similar embedding vectors for voice segments from the same speaker and more discriminative embedding vectors for voice segments from different speakers, which improves the final speaker segmentation labeling accuracy. Compared with existing high-accuracy models, the speaker sample labeling model used in the speaker segmentation labeling method of this embodiment is quick and convenient to build while maintaining high accuracy, and its training requires less computation, so it can be adapted to new usage scenarios by brief retraining on a transferred data set.
In addition, in this embodiment the speaker feature modeling sub-model uses a new metric-learning loss function, so that during training the model pays more attention to samples that belong to different speakers yet have high similarity, which improves the discriminative ability while also improving training efficiency.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
For example, the foregoing embodiment provides a speaker segmentation labeling method based on a long short-term memory deep neural network, which mainly comprises the steps of preprocessing, model building, model training and speaker recognition. For convenience of practical use, however, the trained speaker sample labeling model can also be packaged to form a speaker recognition part, and this speaker recognition part together with a preprocessing part for preprocessing the audio to be tested can form a speaker segmentation labeling device based on the long short-term memory deep neural network, in which the speaker recognition part calls the model to complete the corresponding recognition task after the audio to be tested has been processed by the preprocessing part. The preprocessing part uses the preprocessing method of step S1 of the speaker segmentation labeling method, and the speaker recognition part uses the processing method of step S4; specifically, the speaker recognition part comprises a time-point identification unit executing step S4-1, a feature segment cutting unit executing step S4-2, a feature vector generation unit executing step S4-3 and a speaker matching and recording unit executing step S4-4. The working principle of each unit is the same as described in the corresponding step and is not repeated here.

Claims (4)

1. A speaker segmentation labeling method based on a long short-term memory neural network, characterized in that a speaker recognition sample labeling model based on a long short-term memory deep neural network is adopted to detect, from audio to be tested, when each speaker's voice occurs and how long it lasts, the method comprising the following steps:
step S1, preprocessing the audio to be detected to obtain audio frame level characteristics f1 and audio frame level characteristics f2, wherein the audio frame level characteristics f1 are data required by conversion detection of a speaker, and the audio frame level characteristics f2 are data required by voiceprint characteristic modeling of the speaker;
step S2, building a speaker recognition sample labeling model based on the long short-term memory deep neural network, wherein the speaker sample labeling model comprises a speaker conversion detection sub-model and a speaker characteristic modeling sub-model;
step S3, inputting a training set containing a plurality of groups of speaker conversion training audios into the constructed speaker conversion detection submodel for training, and inputting training containing a plurality of groups of speaker characteristic modeling training audios into the speaker characteristic modeling submodel for model training;
step S4, inputting the audio frame level features f1 and the audio frame level features f2 into the speaker recognition sample labeling model based on the long short-term memory deep neural network so as to complete the classified recording of the speaking time periods of the speakers in the audio to be tested,
wherein the step S4 includes the following sub-steps:
step S4-1, inputting the audio frame level feature f1 into the speaker conversion detection submodel so as to identify the time point of the speaker conversion point in the audio to be detected;
step S4-2, cutting out the characteristic segment of the single speaker from the audio frame level characteristic f2 according to the time point;
step S4-3, inputting the feature segments into the speaker feature modeling submodel so as to generate a feature vector of each feature segment;
and step S4-4, correspondingly allocating each feature segment to a certain stored or newly-built speaker information according to the cosine similarity between the feature vectors, so as to record each speaking time period and the corresponding speaker according to the feature segments and the time points.
2. The speaker segmentation labeling method based on the long short-term memory neural network as claimed in claim 1, wherein:
wherein the step S1 includes the following sub-steps:
step S1-1, carrying out operation of a logarithm Mel filter bank on the audio frequency to be detected, and taking the operation result as the audio frequency frame level characteristic f2 of each frame;
and S1-2, performing Mel frequency cepstrum coefficient operation, MFCC first derivative operation and MFCC second derivative operation on the audio to be detected, further combining the three operation results of each frame to form a fusion feature, and using the fusion feature as the audio frame-level feature f1 of each frame.
3. The speaker segmentation labeling method based on the long short-term memory neural network as claimed in claim 1, wherein:
wherein the step S3 includes the following sub-steps:
step S3-1, initializing the speaker identification sample labeling model, wherein model parameters contained in the speaker identification sample labeling model are randomly set;
step S3-2, inputting the corresponding training set into the speaker conversion detection submodel and the speaker characteristic modeling submodel respectively to perform one iteration;
step S3-3, calculating respective loss error according to the model parameters of the speaker conversion detection submodel and the speaker characteristic modeling submodel, wherein the speaker conversion detection submodel uses a logarithm cross entropy loss function, and the speaker characteristic modeling submodel uses a self-defined generation mode end-to-end loss function;
step S3-4, respectively and reversely propagating the loss error of the speaker conversion detection submodel and the speaker characteristic modeling submodel so as to update the model parameters;
and S3-5, repeating the steps S3-2 to S3-4 until a training completion condition is reached, thereby obtaining the trained speaker recognition sample labeling model.
4. A speaker segmentation labeling device based on a long short-term memory neural network, characterized in that a speaker recognition sample labeling model based on a long short-term memory deep neural network is adopted to detect, from audio to be tested, when each speaker's voice occurs and how long it lasts, the device comprising:
the preprocessing part is used for preprocessing the audio to be detected to obtain audio frame level characteristics f1 and audio frame level characteristics f2, the audio frame level characteristics f1 are data required by conversion detection of a speaker, and the audio frame level characteristics f2 are data required by voice print characteristic modeling of the speaker; and
a speaker recognition part, which records, by speaker, the speaking time periods of all speakers in the audio to be tested according to the audio frame level features f1 and f2, the speaker recognition part comprising a pre-trained speaker recognition sample labeling model based on a long short-term memory deep neural network,
wherein, the speaker sample labeling model comprises a speaker conversion detection submodel and a speaker characteristic modeling submodel,
the speaker recognition unit includes:
the time point identification unit is used for inputting the audio frame level characteristic f1 into the speaker conversion detection submodel so as to identify the time point of the speaker conversion point in the audio to be detected;
a feature segment cutting unit which cuts out the feature segment of the single speaker from the audio frame level feature f2 according to the time point;
a feature vector generation unit which inputs the feature segments into the speaker feature modeling submodel to generate a feature vector of each feature segment; and
and the speaker matching recording unit correspondingly allocates each characteristic segment to certain stored or newly-built speaker information according to the cosine similarity among the characteristic vectors, so as to record the speaking time period of each speaker.
CN201911118136.1A 2019-11-15 2019-11-15 Speaker segmentation labeling method based on long-time and short-time memory deep neural network Active CN110910891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911118136.1A CN110910891B (en) 2019-11-15 2019-11-15 Speaker segmentation labeling method based on long-time and short-time memory deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911118136.1A CN110910891B (en) 2019-11-15 2019-11-15 Speaker segmentation labeling method based on long-time and short-time memory deep neural network

Publications (2)

Publication Number Publication Date
CN110910891A true CN110910891A (en) 2020-03-24
CN110910891B CN110910891B (en) 2022-02-22

Family

ID=69816489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911118136.1A Active CN110910891B (en) 2019-11-15 2019-11-15 Speaker segmentation labeling method based on long-time and short-time memory deep neural network

Country Status (1)

Country Link
CN (1) CN110910891B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785287A (en) * 2020-07-06 2020-10-16 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
CN112017685A (en) * 2020-08-27 2020-12-01 北京字节跳动网络技术有限公司 Voice generation method, device, equipment and computer readable medium
CN112669855A (en) * 2020-12-17 2021-04-16 北京沃东天骏信息技术有限公司 Voice processing method and device
CN113470698A (en) * 2021-06-30 2021-10-01 北京有竹居网络技术有限公司 Speaker transfer point detection method, device, equipment and storage medium
CN116705055A (en) * 2023-08-01 2023-09-05 国网福建省电力有限公司 Substation noise monitoring method, system, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040107100A1 (en) * 2002-11-29 2004-06-03 Lie Lu Method of real-time speaker change point detection, speaker tracking and speaker model construction
CN1662956A (en) * 2002-06-19 2005-08-31 皇家飞利浦电子股份有限公司 Mega speaker identification (ID) system and corresponding methods therefor
US7295970B1 (en) * 2002-08-29 2007-11-13 At&T Corp Unsupervised speaker segmentation of multi-speaker speech data
CN106782507A (en) * 2016-12-19 2017-05-31 平安科技(深圳)有限公司 The method and device of voice segmentation
CN107342077A (en) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 A kind of speaker segmentation clustering method and system based on factorial analysis
US20180166066A1 (en) * 2016-12-14 2018-06-14 International Business Machines Corporation Using long short-term memory recurrent neural network for speaker diarization segmentation
CN108257592A (en) * 2018-01-11 2018-07-06 广州势必可赢网络科技有限公司 A kind of voice dividing method and system based on shot and long term memory models
CN109545228A (en) * 2018-12-14 2019-03-29 厦门快商通信息技术有限公司 A kind of end-to-end speaker's dividing method and system
CN110211595A (en) * 2019-06-28 2019-09-06 四川长虹电器股份有限公司 A kind of speaker clustering system based on deep learning
CN110299150A (en) * 2019-06-24 2019-10-01 中国科学院计算技术研究所 A kind of real-time voice speaker separation method and system

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1662956A (en) * 2002-06-19 2005-08-31 皇家飞利浦电子股份有限公司 Mega speaker identification (ID) system and corresponding methods therefor
US7295970B1 (en) * 2002-08-29 2007-11-13 At&T Corp Unsupervised speaker segmentation of multi-speaker speech data
US20040107100A1 (en) * 2002-11-29 2004-06-03 Lie Lu Method of real-time speaker change point detection, speaker tracking and speaker model construction
US20180166066A1 (en) * 2016-12-14 2018-06-14 International Business Machines Corporation Using long short-term memory recurrent neural network for speaker diarization segmentation
CN106782507A (en) * 2016-12-19 2017-05-31 平安科技(深圳)有限公司 The method and device of voice segmentation
CN107342077A (en) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 A kind of speaker segmentation clustering method and system based on factorial analysis
CN108257592A (en) * 2018-01-11 2018-07-06 广州势必可赢网络科技有限公司 A kind of voice dividing method and system based on shot and long term memory models
CN109545228A (en) * 2018-12-14 2019-03-29 厦门快商通信息技术有限公司 A kind of end-to-end speaker's dividing method and system
CN110299150A (en) * 2019-06-24 2019-10-01 中国科学院计算技术研究所 A kind of real-time voice speaker separation method and system
CN110211595A (en) * 2019-06-28 2019-09-06 四川长虹电器股份有限公司 A kind of speaker clustering system based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAO XINGXING et al.: "Short-time speaker recognition method based on common feature selection", Computer Engineering *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785287A (en) * 2020-07-06 2020-10-16 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
CN111785287B (en) * 2020-07-06 2022-06-07 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
US11676609B2 (en) 2020-07-06 2023-06-13 Beijing Century Tal Education Technology Co. Ltd. Speaker recognition method, electronic device, and storage medium
CN112017685A (en) * 2020-08-27 2020-12-01 北京字节跳动网络技术有限公司 Voice generation method, device, equipment and computer readable medium
CN112017685B (en) * 2020-08-27 2023-12-22 抖音视界有限公司 Speech generation method, device, equipment and computer readable medium
CN112669855A (en) * 2020-12-17 2021-04-16 北京沃东天骏信息技术有限公司 Voice processing method and device
CN113470698A (en) * 2021-06-30 2021-10-01 北京有竹居网络技术有限公司 Speaker transfer point detection method, device, equipment and storage medium
CN113470698B (en) * 2021-06-30 2023-08-08 北京有竹居网络技术有限公司 Speaker conversion point detection method, device, equipment and storage medium
CN116705055A (en) * 2023-08-01 2023-09-05 国网福建省电力有限公司 Substation noise monitoring method, system, equipment and storage medium
CN116705055B (en) * 2023-08-01 2023-10-17 国网福建省电力有限公司 Substation noise monitoring method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN110910891B (en) 2022-02-22

Similar Documents

Publication Publication Date Title
CN110910891B (en) Speaker segmentation labeling method based on long-time and short-time memory deep neural network
US11636860B2 (en) Word-level blind diarization of recorded calls with arbitrary number of speakers
López-Espejo et al. Deep spoken keyword spotting: An overview
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
US9881617B2 (en) Blind diarization of recorded calls with arbitrary number of speakers
CN107731233B (en) Voiceprint recognition method based on RNN
CN110852215B (en) Multi-mode emotion recognition method and system and storage medium
Tong et al. A comparative study of robustness of deep learning approaches for VAD
CN101923855A (en) Test-irrelevant voice print identifying system
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN110648691B (en) Emotion recognition method, device and system based on energy value of voice
Akbacak et al. Environmental sniffing: noise knowledge estimation for robust speech systems
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN110570873A (en) voiceprint wake-up method and device, computer equipment and storage medium
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
CN115249479A (en) BRNN-based power grid dispatching complex speech recognition method, system and terminal
Rudresh et al. Performance analysis of speech digit recognition using cepstrum and vector quantization
Dalsaniya et al. Development of a novel database in Gujarati language for spoken digits classification
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN115101076A (en) Speaker clustering method based on multi-scale channel separation convolution characteristic extraction
CN114792518A (en) Voice recognition system based on scheduling domain technology, method thereof and storage medium
CN110807370A (en) Multimode-based conference speaker identity noninductive confirmation method
CN111326161B (en) Voiceprint determining method and device
CN112735470B (en) Audio cutting method, system, equipment and medium based on time delay neural network
CN116072146A (en) Pumped storage station detection method and system based on voiceprint recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant