CN114005453A - Model training method, voiceprint feature extraction device, and program product - Google Patents

Model training method, voiceprint feature extraction device, and program product

Info

Publication number
CN114005453A
CN114005453A
Authority
CN
China
Prior art keywords
target
subframe
model
audio
value
Prior art date
Legal status
Pending
Application number
CN202111290709.6A
Other languages
Chinese (zh)
Inventor
赵情恩
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111290709.6A priority Critical patent/CN114005453A/en
Publication of CN114005453A publication Critical patent/CN114005453A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band


Abstract

The present disclosure provides a model training method, a voiceprint feature extraction method, a device thereof, and a program product, and relates to speech technology and deep learning technology in the field of artificial intelligence. The technical scheme comprises the following steps: acquiring a first model applied to a first scene, a first subframe, and a target subframe, wherein the first subframe is obtained by framing a first audio applied to the first scene, the target subframe is obtained by framing a target audio applied to a target scene, and the first audio has annotation information; extracting a first spectral feature of the first subframe, and extracting a target spectral feature of the target subframe; and training the first model according to the first spectral feature of the first subframe, the annotation information of the first audio, the target spectral feature of the target subframe, and the target audio to which the target subframe belongs, so as to obtain a target model. In this embodiment, the information of the target audio to which a subframe belongs is used as the annotation information of that subframe, so that the first model can be trained by using the first audio with annotation information together with the target audio, and a target model capable of identifying the voiceprint features of audio in the target scene is obtained.

Description

Model training method, voiceprint feature extraction device, and program product
Technical Field
The present disclosure relates to speech technology and deep learning technology in the field of artificial intelligence, and in particular to a model training method, a voiceprint feature extraction device, and a program product.
Background
At present, voiceprint recognition technology is applied to many scenes, and the identity of a speaker in audio can be determined by carrying out voiceprint recognition on the audio. The model used to identify the voiceprint can generally be obtained by means of model training.
In the related art, the voiceprint recognition model can be applied to various scenes, and in order to reduce the training cost of the model, the existing voiceprint recognition model can be adjusted, so that the model can be applied to a target scene. For example, a voiceprint recognition model which can be applied to the insurance field exists, and the voiceprint recognition model which can be applied to the banking business can be obtained by performing optimization training on the model by using data related to the banking business.
Before the model can be optimally trained, however, business data of the target scene must be collected so that the existing model can be trained with it. This approach requires collecting a large amount of business data of the target scene and labeling it before the existing model can be trained, so it has a long cycle and a high cost.
Disclosure of Invention
The present disclosure provides a model training method, a voiceprint feature extraction method, a device, and a program product, so as to solve the problems of the long cycle and high cost of cross-scene training of an existing model in the related art.
According to a first aspect of the present disclosure, there is provided a model training method, comprising:
acquiring a first model applied to a first scene, a first subframe, and a target subframe, wherein the first subframe is obtained by framing a first audio applied to the first scene, and the target subframe is obtained by framing a target audio applied to a target scene; wherein the first audio has annotation information;
extracting a first spectrum characteristic of the first subframe and extracting a target spectrum characteristic of the target subframe;
and training the first model according to the first spectral feature of the first subframe, the annotation information of the first audio, the target spectral feature of the target subframe, and the target audio to which the target subframe belongs, so as to obtain a target model.
According to a second aspect of the present disclosure, there is provided a method for extracting a voiceprint feature, including:
acquiring audio data to be identified, and extracting the frequency spectrum characteristics of the audio data;
inputting the frequency spectrum characteristics into a target model to obtain the voiceprint characteristics of the audio data; the target model is trained by the method according to the first aspect.
According to a third aspect of the present disclosure, there is provided a model training apparatus comprising:
the acquisition unit is used for acquiring a first model applied to a first scene, a first subframe, and a target subframe, wherein the first subframe is obtained by framing a first audio applied to the first scene, and the target subframe is obtained by framing a target audio applied to a target scene; wherein the first audio has annotation information;
the extraction unit is used for extracting first spectrum characteristics of the first subframe and extracting target spectrum characteristics of the target subframe;
and the training unit is used for training the first model according to the first spectral feature of the first subframe, the annotation information of the first audio, the target spectral feature of the target subframe, and the target audio to which the target subframe belongs, so as to obtain a target model.
According to a fourth aspect of the present disclosure, there is provided an apparatus for extracting a voiceprint feature, comprising:
the audio data acquisition unit is used for acquiring audio data to be identified and extracting the spectral characteristics of the audio data;
the voiceprint feature extraction unit is used for inputting the spectral feature into a preset target model to obtain the voiceprint feature of the audio data; the target model is obtained by training with the apparatus of the third aspect.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first or second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first or second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of an electronic device can read the computer program, execution of the computer program by the at least one processor causing the electronic device to perform the method of the first or second aspect.
In the model training method, the voiceprint feature extraction method, the device thereof and the program product provided by the disclosure, because the target audio does not have the labeling information, the target audio can be divided into a plurality of subframes, and the information of the target audio to which the subframes belong is used as the labeling information of the subframes, so that the first model can be trained by using the first audio with the labeling information and the target audio, and the target model capable of identifying the voiceprint features of the audio in the target scene is obtained.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a diagram illustrating voiceprint recognition of audio data in accordance with an exemplary embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a model training method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a diagram illustrating a framing process for audio according to an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic flow chart diagram illustrating a model training method according to another exemplary embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a method for extracting voiceprint features according to an exemplary embodiment of the disclosure;
FIG. 6 is a schematic diagram illustrating a model training apparatus according to an exemplary embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating a model training apparatus according to another exemplary embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an apparatus for extracting voiceprint features according to an exemplary embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device used to implement methods of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present, a user identity identification requirement exists in many application scenarios, and in one implementation mode, the user identity can be identified by using a voiceprint recognition technology. For example, in a banking scene, audio data spoken by a user may be collected, and a preset model is used to identify voiceprint features of the audio data, and the voiceprint features are compared with pre-stored voiceprint features of the user, so as to identify the identity of the user.
FIG. 1 is a diagram illustrating voiceprint recognition of audio data in accordance with an exemplary embodiment.
As shown in fig. 1, a piece of audio data is obtained, and front-end features of the audio data may be extracted, for example, MFCC (Mel-Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction coefficients), Fbank (Mel filter bank features), or audio features extracted based on the FFT (Fast Fourier Transform).
And inputting the front-end features into a pre-trained voiceprint recognition model, wherein the voiceprint recognition model is used for processing the front-end features so as to extract the voiceprint features of the speaker. The voiceprint recognition model can be, for example, a GMM, DNN, CNN, ResNet, SincNet, or like structure.
And classifying the voiceprint characteristics by utilizing a back-end classification model (such as COS, LDA, PLDA and the like) so as to identify the identity of the speaker.
In general, model training can be performed for a specific scene to obtain a model applied to a corresponding scene, but the training cost is high in this way, and therefore, an existing model applied to a first scene can also be optimally trained to obtain a model applicable to a target scene. For example, the model applied to the banking business is further trained to obtain the model applied to the insurance business.
In one embodiment, the voiceprint recognition model used for extracting voiceprint features can be optimally trained. However, in this way, a large amount of data of the target scene needs to be collected, the data needs to be labeled manually, and the voiceprint recognition model needs to be retrained to obtain a voiceprint recognition model applicable to the target scene. This method requires a large amount of data to be prepared in advance and a large amount of manual labeling, resulting in a long training period and low training efficiency.
In another embodiment, the back-end classification model can be optimally trained: a certain amount of data of the target scene is collected, the data is labeled manually, and the back-end classification model is retrained. Because the structure of the back-end classification model is simpler than that of the voiceprint recognition model, the more complex voiceprint recognition model does not need to be updated in this implementation, and only the simpler back-end classification model needs to be retrained. This reduces the training complexity, but in this scheme the voiceprint recognition model still cannot accurately recognize the voiceprint features of the target scene.
In order to solve the above technical problem, in the scheme provided by the present disclosure, the first audio used for training in the first scene and its annotation information, together with the target audio applied to the target scene, are used to jointly train the existing first model, so as to obtain the target model. In this way, the target model can draw on the voiceprint recognition capability of the first model, and because the target audio is used for training, the target model can be applied to the target scene.
Fig. 2 is a flowchart illustrating a model training method according to an exemplary embodiment of the present disclosure.
As shown in fig. 2, the model training method provided by the present disclosure includes:
step 201, acquiring a first model, a first subframe and a target subframe applied to a first scene, wherein the first subframe is obtained by framing a first audio applied to the first scene, and the target subframe is obtained by framing a target audio applied to a target scene; wherein the first audio has annotation information.
The method provided by the present disclosure is executed by an electronic device with computing capability, which may be a computer, for example. The computer can be used for carrying out optimization training on the existing first model so as to obtain the target model.
In particular, a first model may be preset, where the first model is a model applied in a first scene, specifically a model for recognizing voiceprint features. For example, it may be a model for identifying voiceprint features applied in banking services.
In practical application, the spectral characteristics of the audio data can be input into the first model, so that the voiceprint characteristics of the audio data can be identified through the first model.
Further, the first model can be optimized and trained to obtain a model applied to a target scene, specifically a model for identifying voiceprint features. For example, it may be a model for identifying voiceprint features that is applied in insurance services.
In practical application, the electronic device may further obtain a first subframe, where the first subframe is obtained by performing framing processing on a first audio, and the first audio is audio data applied in a first scene, for example, the first model is a model applied in banking services for identifying voiceprint features, and the obtained first audio is audio data in the banking services.
In an alternative embodiment, the first audio may be a first audio used to train the first model, the first audio having annotation information.
The electronic device may further obtain a target subframe, where the target subframe is obtained by performing framing processing on a target audio, and the target audio is audio data applied in a target scene. For example, if the target model is a model applied in an insurance service for identifying voiceprint features, the obtained target audio is audio data in the insurance service.
Specifically, the target audio does not have labeling information, that is, in the solution of the present disclosure, the target audio does not need to be labeled before the model training.
Fig. 3 is a diagram illustrating a framing process for audio according to an exemplary embodiment of the present disclosure.
Further, both the first audio and the target audio may be framed according to the framing processing method shown in fig. 3.
As shown in fig. 3, for one piece of audio data 31, a plurality of subframes 32 with a length of 25ms may be determined in the audio data 31 in steps of 10 ms.
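As an illustration of the framing described above, the following is a minimal sketch (not part of the patent; the sample rate, function name, and parameters are assumptions) that splits a waveform into 25 ms subframes with a 10 ms step:

```python
import numpy as np

def split_into_subframes(waveform: np.ndarray, sample_rate: int = 16000,
                         frame_ms: float = 25.0, step_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D waveform into overlapping subframes (25 ms window, 10 ms step)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per subframe
    step_len = int(sample_rate * step_ms / 1000)    # samples per step
    n_frames = 1 + max(0, (len(waveform) - frame_len) // step_len)
    return np.stack([waveform[i * step_len: i * step_len + frame_len]
                     for i in range(n_frames)])     # shape: (n_frames, frame_len)

# Usage: a one-second dummy signal at 16 kHz yields 98 subframes of 400 samples each.
subframes = split_into_subframes(np.random.randn(16000))
```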
Step 202, extracting a first spectrum feature of the first subframe, and extracting a target spectrum feature of the target subframe.
In one embodiment, for each first subframe, the electronic device may extract its first spectral feature, and for each target subframe, the electronic device may extract its target spectral feature.
In another embodiment, when the target model is trained, a batch of first subframes and target subframes may also be acquired, and spectral features are extracted for the acquired first subframes and target subframes. For example, m first subframes and n target subframes may be obtained, and spectral features may be extracted for each of the obtained subframes.
In another embodiment, when the target model is trained, a batch of first audio and target audio may be acquired, the acquired first audio and target audio may be subjected to framing processing, and the obtained spectral features of each subframe are extracted. For example, m first audios and n target audios may be obtained, the audios may be subjected to framing processing respectively to obtain a plurality of first subframes and target subframes, and then the first spectral features of the first subframes and the target spectral features of the target subframes are extracted.
The first spectral feature and the target spectral feature are the same type of feature, such as MFCC, PLP, Fbank, or FFT-based features.
A method for extracting spectral features may be set in advance, for example a method for extracting the MFCC of audio data, by which the MFCC features of the individual subframes are extracted; a method for extracting the PLP of audio data, by which the PLP features of the individual subframes are extracted, may also be provided.
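As a hedged sketch of this step (assuming the librosa library and MFCC features; the function name and parameter values are illustrative and not prescribed by this disclosure):

```python
import numpy as np
import librosa

def extract_mfcc(subframe: np.ndarray, sample_rate: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Extract an n_mfcc-dimensional MFCC vector for a single 25 ms subframe."""
    mfcc = librosa.feature.mfcc(y=subframe.astype(np.float32), sr=sample_rate,
                                n_mfcc=n_mfcc, n_mels=40,
                                n_fft=len(subframe), hop_length=len(subframe), center=False)
    return mfcc.mean(axis=1)  # one feature vector per subframe

# The same extractor would be applied to both first subframes and target subframes.
```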
Step 203, training the first model according to the first spectral feature of the first subframe, the labeling information of the first audio, the target spectral feature of the target subframe, and the target audio to which the target subframe belongs, so as to obtain a target model.
Specifically, the first model may be jointly trained by using the obtained first spectral feature of the first subframe, the labeling information of the first audio, the target spectral feature of the target subframe, and the target audio to which the target subframe belongs, so as to obtain the target model.
Further, the first spectrum feature of the first subframe may be input into the first model, and the first model may perform recognition processing on the first spectrum feature to obtain a recognition result of the first subframe. The recognition result may be the identity of the speaker in the first sub-frame.
For example, 1000 pieces of first audio may be prepared in total, where each piece of audio is spoken by a single user, and the audio data may be recorded from 300 persons.
Since the first audio has annotation information, 300 recognition results can be set at the fully connected layer, so that the first model can output the probability that the first subframe belongs to each recognition result. Parameters of the first model can be adjusted by using the recognition result of the first subframe and the annotation information, so that the first model is trained.
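For the labeled first subframes, this classification training can be illustrated with a standard cross-entropy objective; the following is a minimal sketch (the 300-way output, batch size, and random tensors are purely illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(8, 300, requires_grad=True)  # recognition results of the first model for 8 first subframes
labels = torch.randint(0, 300, (8,))              # speaker labels taken from the annotation of the first audio
loss = criterion(logits, labels)                  # classification loss on the labeled first subframes
loss.backward()                                   # gradients used to adjust the parameters of the first model
```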
Since the target audio does not have the label information, the classification information cannot be set in advance through the full connection layer, and thus the first model cannot output the recognition result of the target sub-frame. The target feature vector of the target subframe can be extracted through the first model, and the first model is trained by the target feature vector. The target feature vector extracted by the first model is also the voiceprint feature of the target subframe.
In particular, each target audio contains only one speaking user, and therefore the recognition effect of the first model on target subframes can be measured based on this property. If two target subframes belong to the same target audio, their target feature vectors should be relatively close, and if they belong to different target audios, their target feature vectors should not be close. Therefore, the parameters of the first model can be adjusted according to the target feature vectors of the target subframes and the target audios to which the target subframes belong, so that the first model can be trained.
Furthermore, the first model can be gradually optimized through multiple iterations, and when a preset condition is met, the iteration can be stopped, so that the target model is obtained. The target model can be applied in the target scene to identify the voiceprint features of audio data in the target scene.
For example, when the iteration number reaches a preset number, the training of the model may be stopped, and for example, when the recognition effect of the model on the first subframe and the target subframe is good, the training of the model may also be stopped.
The model training method provided by the present disclosure includes: acquiring a first model applied to a first scene, a first subframe, and a target subframe, wherein the first subframe is obtained by framing a first audio applied to the first scene, the target subframe is obtained by framing a target audio applied to a target scene, and the first audio has annotation information; extracting a first spectral feature of the first subframe, and extracting a target spectral feature of the target subframe; and training the first model according to the first spectral feature of the first subframe, the annotation information of the first audio, the target spectral feature of the target subframe, and the target audio to which the target subframe belongs, so as to obtain a target model. In this embodiment, since the target audio does not have annotation information, the target audio may be split into a plurality of subframes, and the information of the target audio to which each subframe belongs is used as the annotation information of that subframe, so that the first model may be trained by using the first audio with annotation information together with the target audio, and a target model that can identify the voiceprint features of audio in the target scene may be obtained.
Fig. 4 is a flowchart illustrating a model training method according to another exemplary embodiment of the present disclosure.
As shown in fig. 4, the model training method provided by the present disclosure includes:
step 401, acquiring a first subframe, where the first subframe is obtained by performing framing processing on a first audio applied in a first scene; wherein the first audio has annotation information.
Step 401 is similar to the manner of obtaining the first subframe in step 201, and is not repeated here.
Step 402, extracting a first spectral feature of the first subframe, and training a preset model by using the first spectral feature of the first subframe and the label information of the first audio to obtain a first model.
The first spectral feature of any first subframe can be extracted in the following manner:
for any first subframe, the initial spectral feature of the first subframe may be determined, for example, an algorithm for extracting the spectral feature may be preset to extract the initial spectral feature of each first subframe. For example, the initial spectral feature may be a feature such as MFCC, PLP, Fbank, or specifically, a 40-dimensional initial spectral feature of the first subframe may be extracted.
Acquiring associated first subframes of the first subframe, wherein the associated first subframes of the first subframe comprise: a first preset number of first subframes located before the first subframe, and/or a second preset number of first subframes located after the first subframe.
Specifically, the electronic device may determine the associated first subframe of the first subframe according to a position relationship of the first subframe in the first audio. For example, a first subframe of a first preset number before the first subframe and/or a first subframe of a first preset number after the first subframe may be used as the associated first subframe of the first subframe.
For example, when splitting the first audio, 0-20 ms may be used as the first first subframe, 10-30 ms as the second first subframe, and 20-40 ms as the third first subframe. When the preset number is 1, for the second first subframe, the first and third first subframes may be taken as its associated first subframes; for the first first subframe, the second first subframe may be taken as its associated first subframe; and for the third first subframe, the second and fourth first subframes may be taken as its associated first subframes.
The characteristic mean value of the first subframe can also be determined according to the initial spectrum characteristic of the associated first subframe of the first subframe.
Further, for one of the first subframes, the characteristic mean value of the first subframe may be determined according to the initial spectral characteristic of its associated first subframe. For example, for the first subframe, the characteristic mean value may be determined according to the initial spectral characteristics of its associated first subframe.
In practical application, the feature mean may be calculated from the initial spectral features of the associated first subframes; for example, if the initial spectral features are 40-dimensional, a mean value may be calculated for each dimension, so as to obtain a 40-dimensional feature mean.
And determining the difference value between the initial spectral feature of the first subframe and the feature mean value of the first subframe as the first spectral feature of the first subframe.
The electronic device may calculate a difference between the initial spectral feature of the first subframe and the feature mean, and then obtain the first spectral feature of the first subframe. For example, the 40-dimensional features in the initial spectral features may be used to respectively subtract the corresponding 40-dimensional features in the feature mean to obtain the first spectral feature of the first subframe.
In this embodiment, by removing the feature mean value associated with the first subframe from the initial spectral feature of the first subframe, noise data in the first subframe can be removed, and thus a more accurate first spectral feature can be obtained.
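The neighbour-mean subtraction described above could be sketched as follows (a simplified illustration under the assumption of one preceding and one following associated subframe; the helper name is hypothetical):

```python
import numpy as np

def subtract_neighbour_mean(features: np.ndarray, before: int = 1, after: int = 1) -> np.ndarray:
    """features: (n_subframes, dim) initial spectral features of one audio.
    For each subframe, subtract the mean of the initial spectral features of its
    associated subframes (the `before` preceding and `after` following subframes)."""
    n = len(features)
    out = np.empty_like(features)
    for i in range(n):
        idx = [j for j in range(i - before, i + after + 1) if 0 <= j < n and j != i]
        mean = features[idx].mean(axis=0) if idx else 0.0
        out[i] = features[i] - mean
    return out
```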
The electronic device may train a preset model by using the obtained first subframe and the labeling information of the first audio frequency to which the first subframe belongs, so as to obtain a first model applied to a first scene.
Specifically, the preset model may be composed of multiple TDNN (Time Delay Neural Network) layers and multiple ResNet layers. The input data of the preset model is the spectral features of the audio data, and the output recognition result is the probability that the user in the audio data is each speaker. For example, if the first audio is recorded audio data of 300 users, the recognition result is the probability that the user in the input data is each of the 300 persons.
Furthermore, a speaker feature vector (embedding) is extracted from the second-to-last hidden layer of the preset model, and this feature vector is used to represent the voiceprint features of the user. The last layer of the preset model can determine the recognition result according to the feature vector. When the trained model is applied, the feature vector of the audio data can be output by the model, and the voiceprint features of the user are thereby extracted.
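A rough, non-authoritative PyTorch skeleton of such a model is sketched below; the layer counts, dimensions, and pooling are assumptions made for illustration and are not specified by this description:

```python
import torch
import torch.nn as nn

class VoiceprintModel(nn.Module):
    """Toy skeleton: TDNN-style 1-D convolutions, a stand-in residual block,
    an embedding layer (second-to-last, i.e. the speaker feature vector),
    and a final classification layer over the training speakers."""
    def __init__(self, feat_dim: int = 40, embed_dim: int = 256, n_speakers: int = 300):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
        )
        self.res_fc = nn.Linear(512, 512)                    # stand-in for the ResNet layers
        self.embedding = nn.Linear(512, embed_dim)           # second-to-last layer: embedding
        self.classifier = nn.Linear(embed_dim, n_speakers)   # last layer: per-speaker logits

    def forward(self, x: torch.Tensor):
        # x: (batch, time, feat_dim) spectral features; "time" is the number of
        # consecutive subframes fed together (a simplification of the per-subframe description).
        h = self.tdnn(x.transpose(1, 2)).mean(dim=2)   # temporal pooling
        h = torch.relu(self.res_fc(h) + h)             # residual connection
        emb = self.embedding(h)                        # voiceprint feature vector
        return emb, self.classifier(emb)               # (embedding, recognition logits)
```

During training the logits would be compared with the annotation information, while at inference only the embedding is used as the voiceprint feature.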
In practical application, the electronic device may construct a loss function according to a recognition result of a first subframe output by the preset model and the labeling information of the first audio to which the first subframe belongs, train the preset model based on the loss function, and obtain the first model through multiple iterative training.
The first audio and its annotation information serve as the source data of the first scene, and the resulting first model is applied to the first scene. According to the scheme provided by the disclosure, the first model can be further optimally trained to obtain the target model applied to the target scene.
In this embodiment, the first model may be obtained by training with the source data and its annotation information, so as to obtain a first model with a voiceprint recognition function. The first model can be further optimized subsequently; the optimized model can draw on the voiceprint recognition function of the first model, and optimally training the first model in this way can reduce the training cost of the target model.
Step 403, acquiring a target subframe, where the target subframe is obtained by performing framing processing on a target audio applied in a target scene.
Step 403 is similar to the manner of obtaining the target subframe in step 202, and is not repeated here.
Steps 404-407 can be performed for any target subframe, so as to extract the target spectral feature of that target subframe.
Step 404, for any target subframe, determining an initial spectrum characteristic of the target subframe.
Step 405, obtaining associated target subframes of the target subframe, where the associated target subframes of the target subframe include: a third preset number of target subframes located before the target subframe, and/or a fourth preset number of target subframes located after the target subframe.
Step 406, determining a feature mean of the target subframe according to the initial spectrum feature of the associated target subframe of the target subframe.
Step 407, determining a difference between the initial spectrum feature of the target subframe and the feature mean of the target subframe, which is the target spectrum feature of the target subframe.
The implementation of steps 404-407 is similar to the manner of extracting the first spectral feature of the first subframe, and is not described again.
In this embodiment, by removing the feature mean value of the associated target subframe from the initial spectral feature of the target subframe, the noise data in the target subframe can be removed, and thus a more accurate target spectral feature can be obtained.
In one embodiment, the following steps 408 and 409 can be repeatedly executed until a preset stop-training condition is satisfied. In another embodiment, steps 401 to 409 can be repeated until the preset stop-training condition is satisfied.
Step 408, inputting the first spectrum feature and the target spectrum feature into the first model to obtain a recognition result corresponding to the first spectrum feature and a target feature vector corresponding to the target subframe.
The electronic device may input the first spectral feature and the target spectral feature into the first model; for example, the first spectral features obtained from m pieces of first audio and the target spectral features obtained from n pieces of target audio may be input into the first model as one batch of training data.
Specifically, for the first spectral feature, the first model may output its recognition result, which may include a probability for each user, characterizing the probability that the speaker of the first subframe is that user.
Further, the first model may output its target feature vector for the target spectral feature.
After the first spectral feature of the first subframe is input into the first model, the second-to-last layer of the first model can extract a source feature vector of the first spectral feature, and the last layer of the first model can determine the recognition result of the first subframe according to the source feature vector. For example, the probability that the speaker of the first subframe is each of the 300 users can be determined, e.g. the probability that the speaker is the first user is 1%, the probability that the speaker is the second user is 3%, and so on. The final classification result may be determined according to these probabilities; for example, the user with the highest probability may be determined as the speaker of the first subframe.
After the target spectral feature of the target subframe is input into the first model, the second-to-last layer of the first model can extract the target feature vector of the target spectral feature. Since the target audio does not have annotation information, classification information cannot be set through the fully connected layer in advance, and the first model does not classify the target feature vector.
Specifically, the identifiers of the first spectrum feature and the target spectrum feature may be preset, so that the electronic device can distinguish the first spectrum feature from the target spectrum feature, for the first spectrum feature, the electronic device may determine the recognition result by using the first model, and for the target spectrum feature, the electronic device may determine the target feature vector by using the first model.
Step 409, determining a loss function value by using the recognition result of the first spectral feature, the labeling information of the first audio, the target feature vector of the target subframe and the target audio to which the target subframe belongs, and optimizing the first model by using the loss function value to obtain an optimized first model.
In practical application, the first audio to which the first sub-frame belongs has the annotation information, and therefore, the annotation information of the first audio can also be used as the annotation information of the first sub-frame. For example, the first model may be trained using the recognition result of the first subframe and the label information of the first subframe.
In particular, each target audio contains only one speaking user, and therefore the recognition effect of the first model on target subframes can be measured based on this property. If two target subframes belong to the same target audio, their target feature vectors should be relatively close, and if they belong to different target audios, their target feature vectors should not be close.
Further, the value of the loss function may be determined by combining the recognition result of the first spectral feature, the labeling information of the first audio, the target feature vector of the target subframe, and the target audio to which the target subframe belongs, so that the first model is jointly trained by using the first audio and the target audio to obtain the target model.
In practical application, iterative training can be performed on the first model for multiple times, and the training is stopped when a preset training stopping condition is met, so that a target model is obtained. For example, when the number of times of training reaches a preset number, the training may be stopped, and for example, when the value of the loss function is smaller than a preset value, the training may be stopped.
Through this implementation, the first model can be trained jointly with the first audio and the target audio. Although the target audio has no annotation information, the method provided by the disclosure can use the property that the speakers of target subframes belonging to the same target audio are the same user, and train the first model with the target audio, so that a target model applicable to the target scene is obtained; meanwhile, training the target model with the first audio can prevent the target model from overfitting to the data of the target scene.
The value of the first loss function may be determined based on the recognition result of the first spectral feature and the label information of the first spectral feature.
In practical application, the first audio to which the first sub-frame belongs has the annotation information, and therefore, the annotation information of the first audio can also be used as the annotation information of the first sub-frame. Therefore, the value of the first loss function can be determined by using the recognition result of the first sub-frame and the annotation information of the first audio to which the first sub-frame belongs. The effect of the first model on the recognition of the first sub-frame is measured using the value of the first loss function.
Determining a value of a second loss function according to the target feature vector of each target subframe and the target audio to which each target subframe belongs; the value of the second loss function is used for representing comparison information among the target feature vectors of a plurality of target subframes.
The target audio to which a target subframe belongs does not have annotation information, but if two target subframes belong to the same target audio, their target feature vectors should be close, and if they belong to different target audios, their target feature vectors should not be close. Thus, the value of the second loss function may be determined using the target feature vectors of target subframes belonging to the same target audio and the target feature vectors of target subframes belonging to different target audios. The recognition effect of the first model on the target subframes can be measured by the value of the second loss function.
Specifically, the value of the loss function can be determined according to the value of the first loss function and the value of the second loss function, and the loss function can measure the overall recognition effect of the model during training. The parameters of the model can be adjusted through the loss function, so as to obtain a target model that meets the conditions.
In an alternative embodiment, the value of the loss function of the model may be determined by the difference between the value of the first loss function and the value of the weighted second loss function. For example, the value of the loss function may be:
L_total = L_cla + λ·L_cl

where L_total is the value of the loss function, L_cla is the value of the first loss function, L_cl is the value of the second loss function, and λ is a preset weighting coefficient. In this embodiment, the user can adjust the ratio of the value of the first loss function to the value of the second loss function through λ, thereby meeting the requirements of various cross-scene model training.
Through the method, the value of the loss function of the model can be determined, the value is used for measuring the recognition effect of the model on the first spectrum characteristic and the target spectrum characteristic, the model can be optimized by the value, the optimized model can more accurately recognize the first spectrum characteristic and the target spectrum characteristic, and then the target model capable of accurately recognizing the target spectrum characteristic can be obtained through multiple iterations.
Further, in determining the value of the second loss function, the first comparison information may be determined according to the target feature vectors of target subframes belonging to the same target audio. Since these target subframes belong to the same target audio, their target feature vectors should be close to each other. Therefore, a smaller first comparison information indicates that the model recognizes audio data of the same user better; otherwise, it indicates that the model recognizes audio data of the same user worse.
In practical application, the second comparison information can be determined according to the target feature vectors of target subframes belonging to different target audios. Since these target subframes belong to different target audios, their target feature vectors should differ greatly from each other. Therefore, a smaller second comparison information indicates that the model recognizes audio data of different users worse; otherwise, it indicates that the model recognizes audio data of different users better.
Finally, the first comparison information and the second comparison information can be combined to determine the value of the second loss function. The smaller the first comparison information and the larger the second comparison information, the better the recognition effect of the model, so a value that measures the recognition effect of the model can be determined from the first comparison information and the second comparison information.
By the implementation mode, even if the target audio does not have the labeling information, the value of the second loss function for measuring the recognition effect of the model on the target audio can be determined according to the target audio to which the target subframe belongs, and further, under the condition that the target audio does not need to be labeled, the existing first model can be optimally trained by using the target audio to obtain the target model.
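As a non-authoritative illustration of how the first and second comparison information might be combined into the second loss and the total loss (the cosine-similarity formulation, helper names, and the subtraction used to combine the two terms are assumptions; the patent does not fix a particular formula):

```python
import torch
import torch.nn.functional as F

def second_loss(embeddings: torch.Tensor, audio_ids: torch.Tensor) -> torch.Tensor:
    """embeddings: (n, d) target feature vectors of a batch of target subframes.
    audio_ids:   (n,)  index of the target audio each target subframe belongs to."""
    sim = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
    same = (audio_ids.unsqueeze(1) == audio_ids.unsqueeze(0)).float()
    same_pairs = same - torch.eye(len(audio_ids))        # pairs from the same target audio (no self-pairs)
    diff_pairs = 1.0 - same                              # pairs from different target audios
    first_cmp = ((1 - sim) * same_pairs).sum() / same_pairs.sum().clamp(min=1)   # should be small
    second_cmp = ((1 - sim) * diff_pairs).sum() / diff_pairs.sum().clamp(min=1)  # should be large
    return first_cmp - second_cmp

# Total loss, following L_total = L_cla + lambda_ * L_cl:
def total_loss(cla_loss: torch.Tensor, cl_loss: torch.Tensor, lambda_: float = 0.5) -> torch.Tensor:
    return cla_loss + lambda_ * cl_loss
```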
Fig. 5 is a flowchart illustrating a method for extracting a voiceprint feature according to an exemplary embodiment of the present disclosure.
As shown in fig. 5, the method for extracting voiceprint features provided by the present disclosure includes:
step 501, obtaining audio data to be identified, and extracting frequency spectrum characteristics of the audio data.
Step 502, inputting the frequency spectrum characteristics into a preset target model to obtain the voiceprint characteristics of the audio data.
The method provided by the disclosure can be applied to an electronic device with computing capability, such as a computer or a mobile phone, and the target model can be deployed on the electronic device.
Specifically, a preset spectral feature extraction algorithm can be used to extract spectral features in the audio data to be identified, and then the target model is used to process the extracted spectral features to obtain the voiceprint features of the audio data.
Further, the target model may be a model trained by the method shown in fig. 2 or fig. 4, and the second-to-last layer of the target model may be used to output the voiceprint feature of the audio data; the voiceprint feature may, for example, be a feature vector.
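A hedged sketch of this extraction step, reusing the toy model skeleton sketched earlier (the helper name and tensor shapes are assumptions):

```python
import numpy as np
import torch

def extract_voiceprint(model: torch.nn.Module, spectral_features: np.ndarray) -> np.ndarray:
    """spectral_features: (n_subframes, feat_dim) features of the audio to be identified.
    Returns the embedding produced by the model's second-to-last layer.
    (The toy skeleton above assumes at least ~9 subframes because of its convolutions.)"""
    model.eval()
    with torch.no_grad():
        x = torch.as_tensor(spectral_features, dtype=torch.float32).unsqueeze(0)  # (1, time, dim)
        embedding, _ = model(x)   # the toy skeleton returns (embedding, recognition logits)
    return embedding.squeeze(0).numpy()
```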
In an optional implementation manner, if the electronic device is a mobile phone, the electronic device may send the obtained voiceprint feature to a server, and the server compares the voiceprint feature with a preset voiceprint feature, so as to determine a user matching the voiceprint feature. Wherein the preset voiceprint feature is associated with user information.
In another optional implementation, if the electronic device is a computer, a voiceprint feature may be preset in the computer, and the preset voiceprint feature is associated with user information. The computer may compare the voiceprint feature with a preset voiceprint feature to determine a user matching the voiceprint feature.
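The comparison with a pre-stored voiceprint could, for example, use cosine similarity against a threshold; both the threshold value and the helper name below are illustrative assumptions:

```python
import numpy as np

def matches_enrolled_voiceprint(voiceprint: np.ndarray, enrolled: np.ndarray,
                                threshold: float = 0.7) -> bool:
    """Return True if the extracted voiceprint is close enough to the stored (enrolled) one."""
    cos = float(np.dot(voiceprint, enrolled) /
                (np.linalg.norm(voiceprint) * np.linalg.norm(enrolled) + 1e-8))
    return cos >= threshold
```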
Fig. 6 is a schematic structural diagram of a model training apparatus according to an exemplary embodiment of the present disclosure.
As shown in fig. 6, the present disclosure provides a model training apparatus 600, comprising:
a training data obtaining unit 610, configured to obtain a first model applied to a first scene, a first subframe, and a target subframe, where the first subframe is obtained by performing framing processing on a first audio applied to the first scene, and the target subframe is obtained by performing framing processing on a target audio applied to a target scene; wherein the first audio has annotation information;
a spectral feature extraction unit 620, configured to extract a first spectral feature of the first subframe and extract a target spectral feature of the target subframe;
a training unit 630, configured to train the first model according to the first spectral feature of the first subframe, the label information of the first audio, the target spectral feature of the target subframe, and the target audio to which the target subframe belongs, so as to obtain a target model.
In the model training apparatus provided by the present disclosure, because the target audio does not have annotation information, the target audio can be split into a plurality of subframes, and the information of the target audio to which each subframe belongs is used as the annotation information of that subframe, so that the first model can be trained by using the first audio with annotation information together with the target audio, and a target model capable of identifying the voiceprint features of audio in the target scene is obtained.
Fig. 7 is a schematic structural diagram of a model training apparatus according to another exemplary embodiment of the present disclosure.
As shown in fig. 7, in the model training apparatus 700 provided by the present disclosure, the training data acquisition unit 710 is similar to the training data acquisition unit 610 in fig. 6, the spectral feature extraction unit 720 is similar to the spectral feature extraction unit 620 in fig. 6, and the training unit 730 is similar to the training unit 630 in fig. 6.
The training unit 730 includes a recognition module 731 and a training module 732:
the recognition module 731 and the training module 732 repeatedly perform the following steps until a preset training stopping condition is met:
the identifying module 731 inputs the first spectrum feature and the target spectrum feature into the first model, and obtains an identifying result corresponding to the first spectrum feature and a target feature vector corresponding to the target subframe;
the training module 732 determines a value of a loss function by using the recognition result of the first spectral feature and the labeling information of the first audio, the target feature vector of the target subframe and the target audio to which the target subframe belongs, and optimizes the first model by using the value of the loss function to obtain an optimized first model;
and the optimized first model obtained when the preset training stopping condition is met is a target model.
Wherein the training module 732 is further configured to:
determining a value of a first loss function according to the identification result of the first spectrum characteristic and the labeling information of the first spectrum characteristic;
determining a value of a second loss function according to the target feature vector of each target subframe and the target audio frequency to which each target subframe belongs; the value of the second loss function is used for representing comparison information among target characteristic vectors of a plurality of target subframes;
and determining the value of the loss function according to the value of the first loss function and the value of the second loss function.
Wherein the training module 732 is further configured to:
determining first comparison information according to target characteristic vectors of target subframes belonging to the same target audio;
determining second comparison information according to target characteristic vectors of target subframes belonging to different target audios;
and determining the value of the second loss function according to the first comparison information and the second comparison information.
Wherein the training module 732 is further configured to:
determining a difference between the value of the first loss function and the weighted value of the second loss function as the value of the loss function.
The spectral feature extraction unit 720 includes a first feature extraction module 721 configured to:
aiming at any first subframe, determining the initial spectrum characteristics of the first subframe;
acquiring a related first subframe of the first subframe, wherein the related first subframe of the first subframe comprises: a first preset number of first subframes before the first subframe, and/or a second preset number of first subframes after the first subframe;
determining a characteristic mean value of the first subframe according to the initial spectrum characteristic of the first subframe related to the first subframe;
and determining the difference value between the initial spectral feature of the first subframe and the feature mean value of the first subframe as the first spectral feature of the first subframe.
The spectral feature extraction unit 720 includes a target feature extraction module 722, configured to:
aiming at any target subframe, determining the initial spectrum characteristics of the target subframe;
acquiring a related target subframe of the target subframe, wherein the related target subframe of the target subframe comprises: a third preset number of target subframes located before the target subframe, and/or a fourth preset number of target subframes located after the target subframe;
determining a characteristic mean value of the target subframe according to the initial spectrum characteristic of the associated target subframe of the target subframe;
and determining the difference value between the initial frequency spectrum characteristic of the target subframe and the characteristic mean value of the target subframe as the target frequency spectrum characteristic of the target subframe.
In the apparatus 700 provided in this embodiment, the training data obtaining unit 710 is further configured to:
and training a preset model by using the first spectral feature of the first subframe and the labeling information of the first audio to obtain the first model.
Fig. 8 is a schematic structural diagram of an apparatus for extracting a voiceprint feature according to an exemplary embodiment of the present disclosure.
As shown in fig. 8, the present disclosure provides an apparatus 800 for extracting a voiceprint feature, including:
the audio data acquiring unit 810 is configured to acquire audio data to be identified, and extract a spectral feature of the audio data;
a voiceprint feature extraction unit 820, configured to input the spectral feature into a preset target model, so as to obtain the voiceprint feature of the audio data; the target model is obtained by training with the apparatus shown in FIG. 6 or FIG. 7.
The present disclosure provides a model training method, a voiceprint feature extraction method, a device thereof, and a program product, which are applied to speech technology and deep learning technology in the field of artificial intelligence, so as to solve the problems of the long cycle and high cost of cross-scene training of an existing model in the related art.
It should be noted that the first model and the target model in this embodiment are not voiceprint recognition models for a specific user, and cannot reflect personal information of a specific user. It should be noted that the audio data in the present embodiment is from a public data set.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
According to an embodiment of the present disclosure, the present disclosure also provides a computer program product comprising a computer program stored in a readable storage medium. At least one processor of the electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device performs the solution provided by any of the embodiments described above.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the model training method or the voiceprint feature extraction method. For example, in some embodiments, the model training method or the voiceprint feature extraction method can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described model training method or voiceprint feature extraction method may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the model training method or the voiceprint feature extraction method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system intended to overcome the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (21)

1. A model training method, comprising:
obtaining a first model applied to a first scene, a first subframe, and a target subframe, wherein the first subframe is obtained by framing a first audio applied to the first scene, and the target subframe is obtained by framing a target audio applied to a target scene; wherein the first audio has labeling information;
extracting a first spectral feature of the first subframe and extracting a target spectral feature of the target subframe;
and training the first model according to the first spectral feature of the first subframe, the labeling information of the first audio, the target spectral feature of the target subframe, and the target audio to which the target subframe belongs, to obtain a target model.
2. The method of claim 1, wherein the training the first model according to the first spectral feature of the first subframe, the labeling information of the first audio, the target spectral feature of the target subframe, and the target audio to which the target subframe belongs, to obtain a target model comprises:
repeatedly executing the following steps until a preset training stopping condition is met:
inputting the first spectral feature and the target spectral feature into the first model to obtain a recognition result corresponding to the first spectral feature and a target feature vector corresponding to the target subframe;
determining a value of a loss function by using the recognition result of the first spectral feature, the labeling information of the first audio, the target feature vector of the target subframe, and the target audio to which the target subframe belongs, and optimizing the first model by using the value of the loss function to obtain an optimized first model;
and the optimized first model obtained when the preset training stopping condition is met is a target model.
3. The method of claim 2, wherein the determining the value of the loss function by using the recognition result of the first spectral feature, the labeling information of the first audio, the target feature vector of the target subframe, and the target audio to which the target subframe belongs comprises:
determining a value of a first loss function according to the recognition result of the first spectral feature and the labeling information of the first audio;
determining a value of a second loss function according to the target feature vector of each target subframe and the target audio to which each target subframe belongs, wherein the value of the second loss function is used for representing comparison information among the target feature vectors of a plurality of target subframes;
and determining the value of the loss function according to the value of the first loss function and the value of the second loss function.
4. The method of claim 3, wherein the determining a value of a second loss function according to the target feature vector of each target subframe and the target audio to which each target subframe belongs comprises:
determining first comparison information according to target feature vectors of target subframes belonging to the same target audio;
determining second comparison information according to target feature vectors of target subframes belonging to different target audios;
and determining the value of the second loss function according to the first comparison information and the second comparison information.
5. The method of claim 3 or 4, wherein said determining the value of the loss function from the values of the first and second loss functions comprises:
determining a difference between the value of the first loss function and the weighted value of the second loss function as the value of the loss function.
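To make claims 2-5 concrete, the following sketch shows one possible optimization step: a cross-entropy first loss over the labeled first-scene data, a contrastive-style second loss that compares target feature vectors of target subframes from the same versus different target audios, and a total loss equal to the first loss minus a weighted second loss. The cosine-similarity form of the comparison, the reuse of the same network for recognition results and embeddings, and the weight value are assumptions for illustration only, not details fixed by the claims.

    import torch
    import torch.nn.functional as F

    def loss_value(model, first_feats, first_labels, target_feats, target_audio_ids, weight=0.1):
        # First loss: recognition results of the first spectral features vs.
        # the labeling information of the first audio.
        first_loss = F.cross_entropy(model(first_feats), first_labels)

        # Second loss: comparison information among target feature vectors.
        # The model's outputs are reused here as embeddings, which is an assumption.
        emb = F.normalize(model(target_feats), dim=1)
        sim = emb @ emb.t()                                  # pairwise cosine similarity
        same = target_audio_ids.unsqueeze(0) == target_audio_ids.unsqueeze(1)
        off_diag = ~torch.eye(len(emb), dtype=torch.bool)
        first_cmp = sim[same & off_diag].mean()   # same target audio (assumes >= 2 such subframes)
        second_cmp = sim[~same].mean()            # different target audios (assumes >= 2 audios)
        second_loss = first_cmp - second_cmp      # larger when same-audio subframes agree more

        # Claim 5: loss = value of the first loss minus the weighted value of the second loss.
        return first_loss - weight * second_loss

Minimizing this total loss improves recognition on the first-scene data while pulling target feature vectors of the same target audio together and pushing those of different target audios apart, which is consistent with the weighted subtraction recited in claim 5.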
6. The method of any of claims 1-5, wherein the extracting the first spectral feature of the first subframe comprises:
for any first subframe, determining the initial spectral feature of the first subframe;
acquiring the associated first subframes of the first subframe, wherein the associated first subframes comprise: a first preset number of first subframes preceding the first subframe and/or a second preset number of first subframes following the first subframe;
determining a feature mean value of the first subframe according to the initial spectral features of the associated first subframes;
and determining the difference between the initial spectral feature of the first subframe and the feature mean value of the first subframe as the first spectral feature of the first subframe.
7. The method of claim 1, wherein the extracting the target spectral feature of the target subframe comprises:
for any target subframe, determining the initial spectral feature of the target subframe;
acquiring the associated target subframes of the target subframe, wherein the associated target subframes comprise: a third preset number of target subframes preceding the target subframe and/or a fourth preset number of target subframes following the target subframe;
determining a feature mean value of the target subframe according to the initial spectral features of the associated target subframes;
and determining the difference between the initial spectral feature of the target subframe and the feature mean value of the target subframe as the target spectral feature of the target subframe.
8. The method of any one of claims 1-7, wherein the obtaining a first model applied to a first scene comprises:
training a preset model by using the first spectral feature of the first subframe and the labeling information of the first audio to obtain the first model.
9. A voiceprint feature extraction method, comprising:
acquiring audio data to be identified, and extracting a spectral feature of the audio data;
inputting the spectral feature into a target model to obtain a voiceprint feature of the audio data, wherein the target model is trained by the method of any one of claims 1-8.
10. A model training apparatus comprising:
a training data acquisition unit, configured to acquire a first model applied to a first scene, a first subframe, and a target subframe, wherein the first subframe is obtained by framing a first audio applied to the first scene, and the target subframe is obtained by framing a target audio applied to a target scene; wherein the first audio has labeling information;
a spectral feature extraction unit, configured to extract a first spectral feature of the first subframe and extract a target spectral feature of the target subframe;
and a training unit, configured to train the first model according to the first spectral feature of the first subframe, the labeling information of the first audio, the target spectral feature of the target subframe, and the target audio to which the target subframe belongs, to obtain a target model.
11. The apparatus of claim 10, wherein the training unit comprises an identification module and a training module, wherein:
the identification module and the training module repeatedly execute the following steps until a preset training stopping condition is met:
the identification module inputs the first spectral feature and the target spectral feature into the first model to obtain a recognition result corresponding to the first spectral feature and a target feature vector corresponding to the target subframe;
the training module determines a value of a loss function by using the recognition result of the first spectral feature, the labeling information of the first audio, the target feature vector of the target subframe, and the target audio to which the target subframe belongs, and optimizes the first model by using the value of the loss function to obtain an optimized first model;
and the optimized first model obtained when the preset training stopping condition is met is a target model.
12. The apparatus of claim 11, wherein the training module is further configured to:
determining a value of a first loss function according to the recognition result of the first spectral feature and the labeling information of the first audio;
determining a value of a second loss function according to the target feature vector of each target subframe and the target audio to which each target subframe belongs, wherein the value of the second loss function is used for representing comparison information among the target feature vectors of a plurality of target subframes;
and determining the value of the loss function according to the value of the first loss function and the value of the second loss function.
13. The apparatus of claim 12, wherein the training module is further configured to:
determining first comparison information according to target feature vectors of target subframes belonging to the same target audio;
determining second comparison information according to target feature vectors of target subframes belonging to different target audios;
and determining the value of the second loss function according to the first comparison information and the second comparison information.
14. The apparatus of claim 12 or 13, wherein the training module is further configured to:
determining a difference between the value of the first loss function and the weighted value of the second loss function as the value of the loss function.
15. The apparatus according to any one of claims 10-14, wherein the spectral feature extraction unit comprises a first feature extraction module configured to:
for any first subframe, determining the initial spectral feature of the first subframe;
acquiring the associated first subframes of the first subframe, wherein the associated first subframes comprise: a first preset number of first subframes preceding the first subframe and/or a second preset number of first subframes following the first subframe;
determining a feature mean value of the first subframe according to the initial spectral features of the associated first subframes;
and determining the difference between the initial spectral feature of the first subframe and the feature mean value of the first subframe as the first spectral feature of the first subframe.
16. The apparatus according to any one of claims 10-14, wherein the spectral feature extraction unit comprises a target feature extraction module configured to:
for any target subframe, determining the initial spectral feature of the target subframe;
acquiring the associated target subframes of the target subframe, wherein the associated target subframes comprise: a third preset number of target subframes preceding the target subframe and/or a fourth preset number of target subframes following the target subframe;
determining a feature mean value of the target subframe according to the initial spectral features of the associated target subframes;
and determining the difference between the initial spectral feature of the target subframe and the feature mean value of the target subframe as the target spectral feature of the target subframe.
17. The apparatus according to any one of claims 10-16, wherein the training data acquisition unit is further configured to:
training a preset model by using the first spectral feature of the first subframe and the labeling information of the first audio to obtain the first model.
18. An apparatus for extracting voiceprint features, comprising:
an audio data acquisition unit, configured to acquire audio data to be identified and extract a spectral feature of the audio data;
and a voiceprint feature extraction unit, configured to input the spectral feature into a preset target model to obtain a voiceprint feature of the audio data, wherein the target model is trained by the apparatus of any one of claims 10-17.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 9.
CN202111290709.6A 2021-11-02 2021-11-02 Model training method, voiceprint feature extraction device, and program product Pending CN114005453A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111290709.6A CN114005453A (en) 2021-11-02 2021-11-02 Model training method, voiceprint feature extraction device, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111290709.6A CN114005453A (en) 2021-11-02 2021-11-02 Model training method, voiceprint feature extraction device, and program product

Publications (1)

Publication Number Publication Date
CN114005453A true CN114005453A (en) 2022-02-01

Family

ID=79926655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111290709.6A Pending CN114005453A (en) 2021-11-02 2021-11-02 Model training method, voiceprint feature extraction device, and program product

Country Status (1)

Country Link
CN (1) CN114005453A (en)

Similar Documents

Publication Publication Date Title
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN107610709B (en) Method and system for training voiceprint recognition model
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN113392920B (en) Method, apparatus, device, medium, and program product for generating cheating prediction model
CN112911072A (en) Call center volume identification method and device, electronic equipment and storage medium
CN112634880A (en) Speaker identification method, device, equipment, storage medium and program product
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN112786058B (en) Voiceprint model training method, voiceprint model training device, voiceprint model training equipment and storage medium
CN113327596B (en) Training method of voice recognition model, voice recognition method and device
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
EP4024393A2 (en) Training a speech recognition model
CN113035230B (en) Authentication model training method and device and electronic equipment
CN114005453A (en) Model training method, voiceprint feature extraction device, and program product
CN114067805A (en) Method and device for training voiceprint recognition model and voiceprint recognition
CN114333848A (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN112735432A (en) Audio recognition method and device, electronic equipment and storage medium
CN114119972A (en) Model acquisition and object processing method and device, electronic equipment and storage medium
CN113763968A (en) Method, apparatus, device, medium and product for recognizing speech
CN113920987A (en) Voice recognition method, device, equipment and storage medium
CN112633381A (en) Audio recognition method and training method of audio recognition model
CN113380233B (en) Audio recognition method, device, training method, training device, equipment and storage medium
CN114360559B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN114678040B (en) Voice consistency detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination