CN114913860A - Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product - Google Patents

Info

Publication number
CN114913860A
CN114913860A (application CN202210450804.6A)
Authority
CN
China
Prior art keywords
voiceprint recognition
network
initial
data
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210450804.6A
Other languages
Chinese (zh)
Inventor
黄淋
饶宇熹
宁博
黎明欣
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210450804.6A
Publication of CN114913860A
Legal status: Pending

Classifications

    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
        • G10L17/00 Speaker identification or verification techniques
            • G10L17/04 Training, enrolment or model building
            • G10L17/06 Decision making techniques; pattern matching strategies
            • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
            • G10L17/18 Artificial neural networks; connectionist approaches
        • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 characterised by the type of extracted parameters
                • G10L25/24 the extracted parameters being the cepstrum
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
                • G06N3/044 Recurrent networks, e.g. Hopfield networks
                • G06N3/045 Combinations of networks
                • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a voiceprint recognition method, a voiceprint recognition apparatus, a computer device, a storage medium, and a program product. The method comprises: acquiring voice data to be recognized, and inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result. With this method, a pre-trained voiceprint recognition model recognizes the voice data to be recognized. Because the voiceprint recognition model to be trained reuses the convolutional-layer parameters of a generative adversarial network (GAN) model that was trained after data augmentation of a small-sample voice data training set, training the voiceprint recognition model draws on knowledge learned from a large, augmented voice data set; this accelerates convergence of the training and improves the recognition accuracy of the voiceprint recognition model.

Description

Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product
Technical Field
The present application relates to the field of computer technologies, and in particular, to a voiceprint recognition method, apparatus, computer device, storage medium, and program product.
Background
Voiceprint recognition determines a speaker's identity by extracting voice features from the speaker's speech. Because voiceprints are convenient to collect, well accepted by users, low-cost, and contactless, voiceprint recognition is well suited to remote-identification scenarios and is widely used in banking, securities, and other fields.
Because deep learning offers strong nonlinear expressiveness and automatic feature learning, it has become the mainstream approach for extracting deep voice features; in the related art, a deep-learning-based voiceprint recognition model extracts deep features from voice data to perform voiceprint recognition. In practice, however, voice data are scarce, and it is difficult to collect enough of them to train such a model; a voiceprint recognition model trained on a small-sample voice data set therefore suffers from low recognition accuracy.
Disclosure of Invention
In view of the above, there is a need for a voiceprint recognition method, apparatus, computer device, storage medium, and program product that solve the above technical problem.
In a first aspect, a voiceprint recognition method includes:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
wherein the initial parameters of the convolutional layer used when training the voiceprint recognition model are determined from the convolutional-layer parameters of a pre-trained generative adversarial network (GAN) model, the GAN model having been trained after data augmentation of a small-sample voice data training set.
In one embodiment, the process of constructing the voiceprint recognition model comprises the following steps:
training an initial generative adversarial network (GAN) model on a small-sample voice data training set to obtain a trained GAN model;
migrating network parameters of the trained GAN model into an initial voiceprint recognition model, and training the initial voiceprint recognition model on the small-sample voice data training set to obtain the voiceprint recognition model; the convolutional layers of the initial voiceprint recognition model and of the GAN model have the same structure.
In one embodiment, the initial GAN model comprises an initial generator network and an initial discriminator network;
training the initial GAN model on the small-sample voice data training set to obtain the trained GAN model then includes:
preprocessing the small-sample voice data training set to obtain a preprocessing result;
and inputting random noise data into the initial generator network to obtain generated data, inputting the preprocessing result and the generated data into the initial discriminator network, and jointly training the initial generator network and the initial discriminator network to obtain a generator network and a discriminator network.
In one embodiment, inputting the preprocessing result and the generated data into the initial discriminator network and jointly training the initial generator network and the initial discriminator network to obtain the generator network and the discriminator network includes:
inputting the preprocessing result and the generated data into the initial discriminator network to obtain an initial discrimination prediction;
calculating, with a loss function, a prediction error value between the initial discrimination prediction and the standard discrimination result;
updating the network parameters of the initial generator network and the initial discriminator network according to the prediction error value;
and, if the prediction error value satisfies a preset convergence condition, determining that training of the initial generator network and the initial discriminator network is finished, yielding the generator network and the discriminator network.
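The joint-training steps above can be sketched as a minimal loop: compute the prediction error, update both networks from it, and stop once the preset convergence condition holds. The update functions below are placeholders standing in for the real generator and discriminator parameter updates, and the tolerance and iteration cap are illustrative assumptions, not values fixed by the application:

```python
def joint_train(update_generator, update_discriminator, prediction_error,
                max_iters=1000, tol=1e-3):
    """Jointly train both networks: compute the prediction error value,
    update the parameters of both networks from it, and stop once the
    error satisfies the preset convergence condition."""
    err = float("inf")
    for _ in range(max_iters):
        err = prediction_error()       # loss between prediction and standard result
        if err < tol:                  # preset convergence condition
            break
        update_generator(err)          # update initial generator parameters
        update_discriminator(err)      # update initial discriminator parameters
    return err

# Toy usage: a stand-in "error" that halves on every generator update.
state = {"err": 1.0}
final = joint_train(lambda e: state.update(err=state["err"] * 0.5),
                    lambda e: None,
                    lambda: state["err"])
```

In the real setting `prediction_error` would evaluate the loss of the discriminator on the preprocessing result and the generated data, and the two update functions would apply gradient steps to each network.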
In one embodiment, preprocessing the small-sample voice data training set to obtain the preprocessing result includes:
framing the small-sample voice data in the small-sample voice data training set to obtain a plurality of voice frames;
windowing each voice frame to obtain corresponding windowed data;
applying a Fourier transform to each piece of windowed data to determine a two-dimensional spectrogram;
and mapping the frequency data in the two-dimensional spectrogram onto the mel scale to obtain mel-spectrogram data, which is determined as the preprocessing result.
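The four preprocessing steps above (framing, windowing, Fourier transform, mel mapping) can be sketched in plain NumPy. The sample rate, frame length, hop size, and number of mel bands below are illustrative assumptions, not values fixed by the application:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(signal, sr=16000, frame_len=400, hop=160, n_mels=40):
    # Step 1: framing -- split the waveform into overlapping frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # Step 2: windowing -- apply a Hann window to each frame.
    frames = frames * np.hanning(frame_len)
    # Step 3: Fourier transform -- magnitude spectra form a 2-D spectrogram.
    spec = np.abs(np.fft.rfft(frames, axis=1))     # shape (n_frames, frame_len//2 + 1)
    # Step 4: map linear frequency bins onto the mel scale with triangular filters.
    n_bins = spec.shape[1]
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bin_pts = np.round(hz_pts / (sr / 2) * (n_bins - 1)).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return spec @ fbank.T                          # mel spectrogram, (n_frames, n_mels)
```

A log or power compression step is often applied afterwards in practice; the application only specifies the mel mapping itself.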
In one embodiment, the discriminator network in the generative adversarial network (GAN) model comprises a first convolutional layer and a first fully connected layer, and the initial voiceprint recognition model comprises a second convolutional layer and a second fully connected layer;
migrating the network parameters of the GAN model into the initial voiceprint recognition model and training the initial voiceprint recognition model on the small-sample voice data training set then includes:
determining the network parameters of the first convolutional layer in the discriminator network as the network parameters of the second convolutional layer in the initial voiceprint recognition model, and initializing the network parameters of the second fully connected layer;
and training the initial voiceprint recognition model on the small-sample voice data training set.
In a second aspect, a voiceprint recognition apparatus comprises:
a voice data acquisition module, configured to acquire the voice data to be recognized;
and a voiceprint recognition module, configured to input the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
wherein the initial parameters of the convolutional layer used when training the voiceprint recognition model are determined from the convolutional-layer parameters of a pre-trained generative adversarial network (GAN) model, the GAN model having been trained after data augmentation of a small-sample voice data training set.
In a third aspect, a computer device comprises a memory and a processor, the memory storing a computer program, and the processor implementing the steps of any of the above methods in the first aspect when executing the computer program.
In a fourth aspect, a readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods in the first aspect described above.
In a fifth aspect, a computer program product comprises a computer program which, when executed by a processor, implements the steps of any of the methods in the first aspect described above.
According to the voiceprint recognition method, apparatus, computer device, storage medium, and program product above, the computer device can acquire the voice data to be recognized and input it into a preset voiceprint recognition model to obtain a voiceprint recognition result. The pre-trained voiceprint recognition model recognizes the voice data to be recognized, and because the model to be trained reuses the convolutional-layer parameters of a generative adversarial network (GAN) model trained after data augmentation of the small-sample voice data training set, its training draws on knowledge learned from a large, augmented voice data set; this accelerates convergence of the voiceprint recognition model's training and improves its recognition accuracy.
Drawings
FIG. 1 is a diagram of an application environment of a voiceprint recognition method in one embodiment;
FIG. 2 is a flow diagram illustrating a voiceprint recognition method in one embodiment;
FIG. 3 is a flow diagram illustrating a method for constructing a voiceprint recognition model in one embodiment;
FIG. 4 is a schematic flow chart of a method for training an initial generative adversarial network (GAN) model on a small-sample voice data training set to obtain a trained GAN model in another embodiment;
FIG. 5 is a flow chart of a method for jointly training an initial generator network and an initial discriminator network in another embodiment;
FIG. 6 is a flowchart of a method for preprocessing a small-sample voice data training set to obtain a preprocessing result in another embodiment;
FIG. 7 is a flowchart of a method for migrating network parameters of the GAN model into an initial voiceprint recognition model and training the initial voiceprint recognition model in another embodiment;
FIG. 8 is a block diagram of the structure of a voiceprint recognition apparatus in one embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voiceprint recognition method provided by this application can be applied to the voiceprint recognition system shown in fig. 1, which comprises a voice acquisition device and a computer device. Optionally, the voice acquisition device may be a recording pen, a sound collector, a voice detector, or the like; the computer device may be implemented as an independent server or a server cluster composed of multiple servers, or may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The voice acquisition device and the computer device can communicate with each other, for example over Wi-Fi, a mobile network, or Bluetooth. The following embodiments explain how the voiceprint recognition method is implemented.
To extract the voice features of a speaking subject and thereby determine its identity, an embodiment of the present application provides a voiceprint recognition method. Taking its application to the computer device in fig. 1 as an example, as shown in fig. 2, the voiceprint recognition method includes the following steps:
S100, acquiring the voice data to be recognized.
Specifically, the voice acquisition device may collect the voice data of at least one speaking subject at regular intervals, and the computer device may receive the voice data sent by the voice acquisition device in real time; this is the voice data to be recognized. Alternatively, the voice acquisition device may store the collected voice data locally or in the cloud, and in actual application the computer device may fetch the pre-stored voice data from either location. The computer device may also download simulated voice data from the Internet as the voice data to be recognized.
It should be noted that the voice data to be recognized may contain only the speaking subject. Alternatively, the computer device may acquire a voice data set to be recognized and perform feature extraction on it to isolate the voice data that contains only the speaking subject, i.e., the voice data to be recognized. Optionally, such a data set may include voice data of the speaking subject as well as of non-speaking sources; in this embodiment, the non-speaking sources may be environmental sounds, such as the sounds of electronic devices or of nature (wind, thunder, running water), and so on.
S200, inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized. The initial parameters of the convolutional layer used when training the voiceprint recognition model are determined from the convolutional-layer parameters of a pre-trained generative adversarial network (GAN) model, and the GAN model is trained after data augmentation of a small-sample voice data training set.
Specifically, the preset voiceprint recognition model may be a pre-trained voiceprint recognition model. Optionally, it can be a deep learning network model, for example at least one of a convolutional neural network, a recurrent neural network, or a deep belief network.
It should be noted that the voiceprint recognition model may include at least one of a convolutional layer, a fully connected layer, and a pooling layer, in any number; this is not limited here. In this embodiment, however, the voiceprint recognition model includes at least one convolutional layer. Optionally, before training, the initial (network) parameters of that convolutional layer may be the pre-trained parameters of the convolutional layer in the GAN model.
Optionally, data augmentation may be understood as cropping, shifting, changing the brightness of, adding noise to, rotating, and/or mirroring the voice data in the small-sample voice data training set. The voiceprint recognition result may be the identity of the speaking subject, established either by speaker identification or by speaker verification. Speaker identification can be understood as a 1:N comparison of the speaking subject's voice data against the voice data of N speakers in a preset database, finding the matching entry and taking the corresponding speaker as the speaking subject. Speaker verification can be understood as a 1:1 comparison of the speaking subject's voice data against that subject's voice data in a preset database, to confirm the claimed identity.
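The 1:N identification and 1:1 verification just described can be sketched over fixed-length voiceprint embeddings. The cosine-similarity scoring, the acceptance threshold, and the toy 2-D embeddings below are illustrative assumptions; the application does not prescribe a particular comparison metric:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two voiceprint embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe, enrolled):
    """1:N speaker identification: compare the probe embedding against every
    enrolled speaker in the database and return the best-matching identity."""
    return max(enrolled, key=lambda name: cosine(probe, enrolled[name]))

def verify(probe, reference, threshold=0.7):
    """1:1 speaker verification: accept the claimed identity only if the
    similarity to that speaker's reference embedding exceeds the threshold."""
    return cosine(probe, reference) >= threshold

# Toy database of two enrolled speakers (hypothetical 2-D embeddings).
enrolled = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
probe = np.array([0.9, 0.1])
```

In practice the embeddings would be produced by the voiceprint recognition model itself, and the threshold would be tuned on held-out verification trials.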
With this voiceprint recognition method, the computer device can acquire the voice data to be recognized and input it into a preset voiceprint recognition model to obtain a voiceprint recognition result. The pre-trained voiceprint recognition model recognizes the voice data to be recognized, and because the model to be trained reuses the convolutional-layer parameters of a generative adversarial network model trained after data augmentation of the small-sample voice data training set, its training draws on knowledge learned from a large, augmented voice data set; this accelerates convergence of the voiceprint recognition model's training and improves its recognition accuracy.
Training a voiceprint recognition model on a small-sample voice data training set alone generally causes overfitting, poor generalization ability, and low voiceprint recognition accuracy. In one embodiment, therefore, as shown in fig. 3, the voiceprint recognition model can be constructed through the following steps:
S210, training the initial generative adversarial network (GAN) model on the small-sample voice data training set to obtain the trained GAN model.
Specifically, the small-sample voice data training set may be a combined set of voice data from multiple speaking subjects. Optionally, the GAN model may include a generative model and a discriminative model, and each of the two may include at least one of a convolutional layer, a fully connected layer, and a pooling layer. The overall structures of the generative model and the discriminative model may be the same or different, but their network parameters differ. In this embodiment, layers at the same position in different models may or may not have the same structure.
S220, migrating the network parameters of the trained GAN model into an initial voiceprint recognition model, and training the initial voiceprint recognition model on the small-sample voice data training set to obtain the voiceprint recognition model; the convolutional layers of the initial voiceprint recognition model and of the GAN model have the same structure.
Specifically, migrating the network parameters of the GAN model into the initial voiceprint recognition model can be understood as taking the trained network parameters of the generative model and/or the discriminative model as some or all of the network parameters of the initial voiceprint recognition model. The computer device then trains the initial voiceprint recognition model on the small-sample voice data training set.
The small-sample voice data training set used to train the initial voiceprint recognition model may differ from the one used to train the GAN model; in this embodiment, however, the two training sets are the same.
In this embodiment, the initial voiceprint recognition model and the GAN model may each include a convolutional layer, and to enable the convolutional-layer parameters to be migrated, those convolutional layers may have the same structure. Apart from the convolutional layer, the network structures of the two models may be the same or different.
With this construction, the convolutional-layer parameters of a GAN model trained after data augmentation of the small-sample voice data training set are migrated into the initial voiceprint recognition model, which is then trained on the small-sample voice data training set; the initial model thus draws on knowledge learned from a large, augmented voice data set, which accelerates convergence of the training and improves recognition accuracy. In addition, because the GAN model is used to augment the small-sample voice data set, the gap between the augmented set and the small-sample set is reduced, and transferring the convolutional-layer parameters trained on the augmented samples into the voiceprint recognition model for small-sample training improves the generalization ability of a model trained on small samples.
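The parameter migration of S220 can be sketched as copying the trained convolutional-layer parameters unchanged while re-initialising the classification head. The layer names, shapes, and random initialisation below are illustrative assumptions, not values fixed by the application:

```python
import numpy as np

rng = np.random.default_rng(0)

# Trained GAN discriminator parameters (layer names and shapes are illustrative).
discriminator = {
    "conv1.weight": rng.normal(size=(16, 1, 3, 3)),
    "conv2.weight": rng.normal(size=(32, 16, 3, 3)),
    "fc.weight": rng.normal(size=(1, 32)),        # real/fake output head
}

def migrate(disc_params, n_speakers, feat_dim=32):
    """Initialise the voiceprint model: reuse the discriminator's convolutional
    layers unchanged, re-initialise the fully connected classification layer."""
    model = {k: v.copy() for k, v in disc_params.items() if k.startswith("conv")}
    model["fc.weight"] = rng.normal(scale=0.01, size=(n_speakers, feat_dim))
    return model

model = migrate(discriminator, n_speakers=100)
```

The migrated model is then fine-tuned on the small-sample voice data training set; only the fully connected layer starts from scratch.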
The following describes how the initial GAN model is trained on the small-sample voice data training set so that the trained GAN model learns knowledge comparable to that of a large-sample voice data set. In one embodiment, the initial GAN model includes an initial generator network and an initial discriminator network; as shown in fig. 4, the training step of S210 can be implemented as follows:
S211, preprocessing the small-sample voice data training set to obtain a preprocessing result.
Specifically, the computer device may preprocess the voice data in the small-sample voice data training set, for example by removing noise (environmental noise, busy tones, ring-back tones, etc.), augmenting the data (overlaying echo, changing the playback rate, randomly masking in the time and frequency domains), clipping, converting the data, and/or extracting features, to obtain the preprocessing result.
S212, inputting random noise data into the initial generator network to obtain generated data, inputting the preprocessing result and the generated data into the initial discriminator network, and jointly training the initial generator network and the initial discriminator network to obtain a generator network and a discriminator network.
Specifically, the random noise may be Gaussian noise, single-frequency noise, impulse noise, fluctuation noise, white noise, and/or the like. During training of the initial GAN model, the computer device may first feed the generated random noise data to the initial generator network to obtain simulated data (i.e., the generated data), then feed the preprocessing result and the generated data to the initial discriminator network to jointly train the two networks; when both networks satisfy their respective convergence conditions, the current initial generator network is taken as the generator network and the current initial discriminator network as the discriminator network. Optionally, joint training may be understood as training the initial generator network and the initial discriminator network simultaneously.
It should be noted that the initial generator network may include a fully connected layer and several deconvolution (transposed-convolution) layers. The computer device can feed random noise data into the initial generator network, convert it into three-dimensional data through the fully connected layer, and then upsample that data through the deconvolution layers to obtain the generated data. Optionally, each deconvolution layer may output twice as much feature data as the preceding deconvolution layer.
It will be appreciated that the initial discriminator network may comprise several two-dimensional convolutional layers and a fully connected layer. The computer device can feed the preprocessing result and the generated data into the initial discriminator network, downsample them through the two-dimensional convolutional layers to learn deep voice features of the input, and then output the discrimination result of the initial discriminator network through the fully connected layer.
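The doubling behaviour of the generator's deconvolution layers can be checked with the standard transposed-convolution output-size formula. The kernel size 4, stride 2, padding 1, and the initial 4x4 spatial size are DCGAN-style assumptions, not values fixed by the application:

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution:
    out = (in - 1) * stride - 2 * padding + kernel.
    With kernel=4, stride=2, padding=1 each layer exactly doubles the input."""
    return (size - 1) * stride - 2 * padding + kernel

# Noise is first mapped by the fully connected layer to a small 3-D block
# (here assumed 4x4 spatially), then upsampled by three deconvolution layers.
size = 4
sizes = [size]
for _ in range(3):
    size = deconv_out(size)
    sizes.append(size)
# sizes is now [4, 8, 16, 32]: each deconvolution layer doubles the feature map.
```

The discriminator's strided two-dimensional convolutions follow the inverse formula, halving the spatial size at each downsampling layer.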
As shown in fig. 5, the step of inputting the preprocessing result and the generated data into the initial arbiter network in S212, and performing joint training on the initial generator network and the initial arbiter network to obtain the generator network and the arbiter network may be implemented by the following steps:
s2121, inputting the preprocessing result and the generated data into an initial discriminator network to obtain an initial discrimination prediction result.
Specifically, the computer device may input the pre-processing result or the generated data to the initial arbiter network to obtain the initial arbitration prediction result. However, in this embodiment, the computer device may simultaneously input the preprocessing result and the generated data output by the initial generator network to the initial arbiter network to obtain the initial discrimination prediction result.
It should be noted that, when the initial generator network and the initial discriminator network have not been trained, the initial discriminator network can correctly discriminate the authenticity of the preprocessing result relative to the generated data, and in this case, the initial discrimination prediction result may be the preprocessing result and the generated data each carrying an identifier. Optionally, the identifier may distinguish the authenticity of the preprocessing result and the generated data. The preprocessing result can be determined as real data; the generated data is simulated data and can be determined as pseudo data.
Optionally, when the training of the initial generator network and the initial discriminator network is finished, the initial discriminator network may erroneously judge the authenticity of the preprocessing result relative to the generated data, that is, in this case, the preprocessing result is determined as pseudo data and the generated data as real data. The initial discrimination prediction result may still be the preprocessing result and the generated data carrying identifiers, but these identifiers are opposite to those carried when the initial generator network and the initial discriminator network have not been trained.
And S2122, calculating a prediction error value between the initial judgment prediction result and the standard judgment result through a loss function.
Specifically, the above-mentioned loss function may be a 0-1 loss function, a squared loss function, an absolute value loss function, a logarithmic loss function, or the like. Optionally, the loss function includes a parameter corresponding to the initial discrimination prediction result and the standard discrimination result.
It should be noted that the computer device may substitute the initial discrimination prediction result into the loss function to obtain a prediction error value between the initial discrimination prediction result and the standard discrimination result. Optionally, the standard discrimination result may be the preprocessing result and the generated data, where the identifier carried in the preprocessing result is an identifier of pseudo data, and the identifier carried in the generated data is an identifier of real data. Optionally, the standard discrimination result may be understood as the gold standard for training the generated confrontation network model.
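As one concrete possibility for such a loss function (an assumption for illustration; the text does not fix a specific form beyond the families listed above), a binary cross-entropy between the discriminator's predictions and the standard discrimination labels could be computed as follows; the function name is hypothetical:

```python
import math

def bce_loss(predictions, labels):
    # Binary cross-entropy between discriminator outputs in (0, 1) and the
    # standard discrimination labels (1 = real, 0 = pseudo). A small epsilon
    # guards against log(0).
    eps = 1e-12
    total = 0.0
    for p, y in zip(predictions, labels):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(predictions)

# An undecided discriminator (output 0.5) yields a loss of ln(2) per sample.
print(bce_loss([0.5, 0.5], [1, 0]))
```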
And S2123, updating network parameters in the initial generator network and the initial discriminator network according to the prediction error value.
Specifically, the computer device may adjust the network parameters in the initial generator network and the initial discriminator network based on the magnitude of the prediction error value. Optionally, if the prediction error value is larger, the adjustment value of the network parameter may be slightly larger, and if the prediction error value is smaller, the adjustment value of the network parameter may be slightly smaller.
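The error-scaled adjustment described above can be sketched as follows; in practice, GAN training typically drives updates through a fixed optimizer (e.g., Adam) rather than this explicit scaling, so the function and its base step size are purely illustrative assumptions:

```python
def adjust_parameter(param, gradient, error_value, base_step=0.01):
    # Scale the adjustment with the prediction error value: a larger error
    # produces a slightly larger update, a smaller error a slightly smaller one.
    step = base_step * error_value
    return param - step * gradient

# The same gradient moves the parameter further when the error is larger.
print(adjust_parameter(1.0, 1.0, 2.0))   # larger error
print(adjust_parameter(1.0, 1.0, 0.5))   # smaller error
```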
And S2124, if the prediction error value meets a preset convergence condition, determining that both the initial generator network and the initial discriminator network are trained, and obtaining the generator network and the discriminator network.
It can be understood that, in the training process of the initial generator network and the initial discriminator network, the steps in S211 and S2121-S2123 need to be iterated continuously. After each iteration, it may be determined whether the prediction error value is less than or equal to a preset error threshold, or whether the number of iterations has reached a preset iteration number threshold. If either condition is satisfied, it is determined that the training of both the initial generator network and the initial discriminator network is completed, the current initial generator network is determined to be the generator network, and the current initial discriminator network is determined to be the discriminator network.
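The iterate-until-convergence control flow of S211 and S2121-S2123 can be sketched as below; the geometric error decay is only a stand-in for an actual training round, and all threshold values are illustrative assumptions:

```python
def joint_train(initial_error, decay, error_threshold, max_iters):
    # Iterate until the prediction error value falls to the preset error
    # threshold, or the iteration count reaches the preset iteration threshold.
    error, iters = initial_error, 0
    while error > error_threshold and iters < max_iters:
        error *= decay   # stand-in for one round of S211, S2121-S2123
        iters += 1
    converged = error <= error_threshold or iters >= max_iters
    return error, iters, converged

print(joint_train(1.0, 0.5, 0.1, 100))
```

Either stopping condition alone ends the training loop, mirroring the "error threshold or iteration count" criterion in the text.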
According to the voiceprint recognition method, the confrontation network model can be generated by training after data amplification is carried out on the small sample voice data training set, so that the generated confrontation network model can learn a large amount of knowledge in large sample voice data, network parameters of the generated confrontation network model can be further migrated to the voiceprint recognition model to carry out small sample training, and the generalization capability of the voiceprint recognition model for small sample training is improved.
When the confrontation network model is trained through the voice data, the characteristics of the voice data need to be extracted, so that before training, the voice data can be preprocessed to obtain the Mel frequency spectrogram data (namely, the voice characteristic data). In an embodiment, as shown in fig. 6, the step of preprocessing the small sample speech data training set in S211 to obtain a preprocessing result may specifically include:
s2111, framing the small sample voice data in the small sample voice data training set to obtain a plurality of voice frame data.
Specifically, since voice data is non-stationary overall but approximately stationary over a short time, with no abrupt changes, in order to facilitate processing, the small sample voice data in the small sample voice data training set may be divided into multiple frames of stationary voice data. Optionally, the voice data may be one-dimensional data.
Optionally, the computer device may perform framing on the small sample voice data in the small sample voice data training set according to a time sequence to obtain a plurality of voice frame data.
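The framing step can be sketched as follows, splitting a one-dimensional sample sequence into short-time frames in time order; the frame length and hop length are hypothetical illustrative parameters:

```python
def frame_signal(samples, frame_len, hop_len):
    # Split one-dimensional speech samples into (possibly overlapping)
    # short-time frames, advancing hop_len samples per frame in time order.
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

# 10 samples, frame length 4, hop 2 -> frames at offsets 0, 2, 4, 6.
print(frame_signal(list(range(10)), 4, 2))
```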
S2112, windowing is carried out on the voice frame data respectively to obtain corresponding windowed data.
Specifically, in order that the spectral energy of each voice frame data does not leak during the subsequent Fourier transform, windowing processing may be performed on each voice frame data. Optionally, windowing may be understood as a process of intercepting the voice frame data with a window function. Optionally, the window type may be a rectangular window, a triangular window, a Hanning window, a Gaussian window, or the like.
It should be noted that, because the two ends of the Hamming window are close to zero but not zero, side lobe leakage can be effectively reduced. Therefore, in this embodiment, the computer device may apply a Hamming window to each voice frame data, respectively, to obtain the windowed data corresponding to each voice frame data.
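A minimal sketch of this Hamming windowing step (pure Python, function names hypothetical); with the standard coefficients 0.54 and 0.46, the window's endpoints are 0.08, non-zero as noted above:

```python
import math

def hamming_window(N):
    # w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)); endpoints equal 0.08.
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    # Multiply each sample of a voice frame by the window coefficient.
    w = hamming_window(len(frame))
    return [x * wn for x, wn in zip(frame, w)]

print(hamming_window(5))
```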
S2113, performing Fourier transform on each windowed data, and determining a two-dimensional spectrogram.
Further, the computer device may perform fourier transform on each windowed data, and combine the fourier transform results together according to the time order to obtain the two-dimensional spectrogram. Optionally, the two-dimensional spectrogram may include X-axis data and Y-axis data; the X-axis data may be time and the Y-axis data may be frequency.
S2114, mapping the frequency data in the two-dimensional spectrogram onto a Mel scale to obtain Mel spectrogram data, and determining the Mel spectrogram data as a preprocessing result.
Specifically, the computer device may map the frequency data f of the Y axis in the two-dimensional spectrogram onto the mel scale mel according to a mapping relationship between the frequency data f and the mel scale mel, so as to obtain the mel spectrogram data. Optionally, the mapping relationship may be a proportional relationship, a functional relationship, a logarithmic relationship, an exponential relationship, and/or the like.
In the present embodiment, the mapping relationship between the frequency data f and the mel scale mel can be expressed by the following formula:
mel = 2595 * log10(1 + f/700)    (1);
The numerical values in formula (1) may take other values, which is not limited here.
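As an illustrative sketch, formula (1) can be computed directly; the function name below is hypothetical:

```python
import math

def hz_to_mel(f):
    # mel = 2595 * log10(1 + f / 700), as in formula (1)
    return 2595.0 * math.log10(1.0 + f / 700.0)

# 700 Hz maps to 2595 * log10(2), approximately 781.17 mel.
print(hz_to_mel(700))
```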
The voiceprint recognition method can be used for preprocessing the small sample voice data in the small sample voice data training set to obtain the Mel frequency spectrogram data, then the countermeasure network model is generated through Mel frequency spectrogram data training, so that the optimal network parameters corresponding to the large sample can be obtained through the generated countermeasure network model training, and on the basis, the voiceprint recognition model with high generalization capability can be trained through transfer learning.
As one embodiment, the arbiter network in the generation countermeasure network model includes a first convolution layer and a first full connection layer; the initial voiceprint recognition model comprises a second convolution layer and a second full-connection layer; as shown in fig. 7, the step of migrating the network parameters for generating the confrontation network model into the initial voiceprint recognition model in S220 and training the initial voiceprint recognition model through the small sample speech data training set may be implemented by the following steps:
s221, determining the network parameters of the first convolution layer in the discriminator network as the network parameters of the second convolution layer in the initial voiceprint recognition model, and initializing the network parameters of the second full-connection layer.
In the embodiment, the arbiter network in the generation countermeasure network model includes a convolutional layer (i.e. the first convolutional layer) and a fully-connected layer (i.e. the first fully-connected layer); the initial voiceprint recognition model also includes a convolutional layer (i.e., a second convolutional layer) and a fully-connected layer (i.e., a second fully-connected layer). Wherein, the first convolution layer and the second convolution layer can have the same structure; the first full connection layer and the second full connection layer may have the same or different structures.
It should be noted that the computer device may assign the network parameter for generating the first convolution layer in the countermeasure network model to the network parameter for the second convolution layer in the initial voiceprint recognition model, and at the same time, the computer device may initialize the network parameter for the second fully-connected layer in the initial voiceprint recognition model. If the type output by the voiceprint recognition model is the same as the type output by the discriminator network, the network parameter of the first full connection layer in the discriminator network can be determined as the network parameter of the second full connection layer in the initial voiceprint recognition model, and under the condition, the first full connection layer and the second full connection layer have the same structure; if the output type of the voiceprint recognition model is different from the output type of the discriminator network, the network parameters of the second full connection layer in the initial voiceprint recognition model need to be initialized.
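The parameter migration in S221 can be sketched with plain dictionaries: the trained convolutional parameters are copied across unchanged, while the fully connected layer receives a fresh random initialization. All names and the initialization range are hypothetical:

```python
import random

def migrate_parameters(discriminator_params, fc_init_scale=0.01):
    # Assign the discriminator's first-convolution-layer parameters to the
    # voiceprint model's second convolution layer, and re-initialize the
    # second fully connected layer with small random values.
    return {
        "conv": list(discriminator_params["conv"]),           # transferred as-is
        "fc": [random.uniform(-fc_init_scale, fc_init_scale)
               for _ in discriminator_params["fc"]],          # fresh initialization
    }

trained = {"conv": [0.3, -0.7, 1.2], "fc": [0.9, 0.9]}
print(migrate_parameters(trained))
```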
S222, training the initial voiceprint recognition model through a small sample voice data training set.
Further, the computer device may train the initial voiceprint recognition model through the voice data in the small sample voice data training set. Specifically, the computer device may input all the voice data in the small sample voice data training set into the initial voiceprint recognition model to obtain a voiceprint recognition prediction result, calculate a prediction error value between the voiceprint recognition prediction result and the standard voiceprint recognition result through a loss function, and update the network parameters in the initial voiceprint recognition model according to the prediction error value. This process is repeated, inputting all the voice data in the small sample voice data training set into the initial voiceprint recognition model after each parameter update, until the prediction error value meets a preset error threshold or the number of iterations reaches a preset iteration number threshold, thereby obtaining the pre-trained voiceprint recognition model. The standard voiceprint recognition result may be an idealized voiceprint recognition result, that is, the gold standard for training the voiceprint recognition model.
The voiceprint recognition method can determine the network parameters of a first convolution layer in the discriminator network as the network parameters of a second convolution layer in the initial voiceprint recognition model, initialize the network parameters of a second full connection layer, and train the initial voiceprint recognition model through a small sample voice data training set to obtain a voiceprint recognition result; the method can transfer the network parameters corresponding to the large sample which is generated and learned by the confrontation network model to the voiceprint recognition model to be trained, so that the knowledge obtained by training the large sample voice data set is introduced during training of the voiceprint recognition model to be trained, the convergence rate of the voiceprint recognition model training can be further accelerated, and the accuracy of voiceprint recognition model recognition can be improved.
In order to facilitate understanding of those skilled in the art, the voiceprint recognition method provided by the present application is described by taking an execution subject as a computer device as an example, and specifically, the method includes:
(1) and framing the small sample voice data in the small sample voice data training set to obtain a plurality of voice frame data.
(2) And respectively carrying out windowing processing on the voice frame data to obtain corresponding windowed data.
(3) And performing Fourier transform on each windowed data to determine a two-dimensional spectrogram.
(4) And mapping the frequency data in the two-dimensional spectrogram onto a Mel scale to obtain Mel spectrogram data, and determining the Mel spectrogram data as a preprocessing result.
(5) And inputting the random noise data into an initial generator network to obtain generated data, and inputting the preprocessing result and the generated data into an initial discriminator network to obtain an initial discrimination prediction result.
(6) And calculating a prediction error value between the initial judgment prediction result and the standard judgment result through a loss function.
(7) And updating network parameters in the initial generator network and the initial arbiter network according to the prediction error value.
(8) And if the predicted error value meets the preset convergence condition, determining that the training of the initial generator network and the training of the initial discriminator network are finished, and obtaining the generator network and the discriminator network.
(9) And determining the network parameters of the first convolution layer in the discriminator network as the network parameters of the second convolution layer in the initial voiceprint recognition model, and initializing the network parameters of the second full-connection layer.
(10) Training the initial voiceprint recognition model through a small sample voice data training set to obtain a voiceprint recognition model; the initial voiceprint recognition model and the convolution layer for generating the countermeasure network model have the same structure.
(11) And acquiring voice data to be recognized.
(12) And inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized.
For the implementation processes of (1) to (12), reference may be specifically made to the description of the above embodiments, and the implementation principles and technical effects thereof are similar and will not be described herein again.
It should be understood that although the various steps in the flow charts of fig. 2-7 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-7 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a voiceprint recognition apparatus including: voice data acquisition module 11 and voiceprint recognition module 12, wherein:
the voice data acquisition module 11 is used for acquiring voice data to be recognized;
the voiceprint recognition module 12 is configured to input the voice data to be recognized into a preset voiceprint recognition model, so as to obtain a voiceprint recognition result of the voice data to be recognized;
the initial parameters of the convolutional layer during the training of the voiceprint recognition model are determined according to the parameters of the convolutional layer which is trained in advance and generates the confrontation network model, and the confrontation network model is generated by training after data amplification is carried out on a small sample voice data training set.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, the voiceprint recognition device further comprises: a first model training module and a second model training module, wherein:
the first model training module is used for training the initially generated confrontation network model through a small sample voice data training set to obtain a generated confrontation network model;
the second model training module is used for transferring the network parameters for generating the confrontation network model to the initial voiceprint recognition model, and training the initial voiceprint recognition model through a small sample voice data training set to obtain the voiceprint recognition model; the initial voiceprint recognition model and the convolution layer for generating the countermeasure network model have the same structure.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, the initially generated confrontation network model comprises an initial generator network and an initial arbiter network; the first model training module includes: a preprocessing unit and a joint training unit, wherein:
the preprocessing unit is used for preprocessing the small sample voice data training set to obtain a preprocessing result;
and the joint training unit is used for inputting the random noise data into the initial generator network to obtain generated data, inputting the preprocessing result and the generated data into the initial discriminator network, and performing joint training on the initial generator network and the initial discriminator network to obtain the generator network and the discriminator network.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, the joint training unit comprises: the device comprises a discriminator network processing subunit, a prediction error value calculating subunit, a network parameter updating subunit and a training end determining subunit, wherein:
the discriminator network processing subunit is used for inputting the preprocessing result and the generated data into an initial discriminator network to obtain an initial discrimination prediction result;
the prediction error value calculation operator unit is used for calculating a prediction error value between the initial judgment prediction result and the standard judgment result through a loss function;
the network parameter updating subunit is used for updating the network parameters in the initial generator network and the initial discriminator network according to the prediction error value;
and the training end determining subunit is used for determining that the training of the initial generator network and the initial discriminator network is finished when the prediction error value meets the preset convergence condition, so as to obtain the generator network and the discriminator network.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, the pre-processing unit comprises: a framing subunit, a windowing subunit, a fourier transform subunit, and a data mapping subunit, wherein:
the framing subunit is used for framing the small sample voice data in the small sample voice data training set to obtain a plurality of voice frame data;
the windowing subunit is used for respectively carrying out windowing processing on the voice frame data to obtain corresponding windowed data;
the Fourier transform subunit is used for performing Fourier transform on each windowed data to determine a two-dimensional spectrogram;
and the data mapping subunit is used for mapping the frequency data in the two-dimensional spectrogram onto a Mel scale to obtain Mel spectrogram data, and determining the Mel spectrogram data as a preprocessing result.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, the arbiter network in the generation countermeasure network model comprises a first convolutional layer and a first fully connected layer; the initial voiceprint recognition model comprises a second convolution layer and a second full-connection layer; the second model training module comprises: a network parameter initialization unit and a voiceprint recognition model training unit, wherein:
the network parameter initialization unit is used for determining the network parameters of the first convolutional layer in the discriminator network as the network parameters of the second convolutional layer in the initial voiceprint recognition model and initializing the network parameters of the second full-connection layer;
and the voiceprint recognition model training unit is used for training the initial voiceprint recognition model through a small sample voice data training set.
The voiceprint recognition apparatus provided in this embodiment may implement the method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
For the specific definition of the voiceprint recognition device, reference may be made to the above definition of the voiceprint recognition method, which is not described herein again. The modules in the voiceprint recognition apparatus can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the voice data to be recognized. The network interface of the computer device is used for communicating with an external endpoint through a network connection. The computer program is executed by a processor to implement a voiceprint recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
the initial parameters of the convolutional layer during the training of the voiceprint recognition model are determined according to the parameters of the convolutional layer which is trained in advance and generates the confrontation network model, and the confrontation network model is generated by training after data amplification is carried out on a small sample voice data training set.
In one embodiment, a readable storage medium is provided, on which a computer program is stored, which computer program, when executed by a processor, performs the steps of:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
the initial parameters of the convolutional layer during the training of the voiceprint recognition model are determined according to the parameters of the convolutional layer which is trained in advance and generates the confrontation network model, and the confrontation network model is generated by training after data amplification is carried out on a small sample voice data training set.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
the initial parameters of the convolutional layer during the training of the voiceprint recognition model are determined according to the parameters of the convolutional layer which is trained in advance and generates the confrontation network model, and the confrontation network model is generated by training after data amplification is carried out on a small sample voice data training set.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of voiceprint recognition, the method comprising:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
the initial parameters of the convolutional layer during the training of the voiceprint recognition model are determined according to the parameters of the convolutional layer which is trained in advance and generates the confrontation network model, and the confrontation network model is generated by training after data amplification is carried out on a small sample speech data training set.
2. The voiceprint recognition method according to claim 1, wherein the construction process of the voiceprint recognition model comprises:
training an initially generated confrontation network model through the small sample voice data training set to obtain the generated confrontation network model;
migrating the network parameters of the generated confrontation network model to an initial voiceprint recognition model, and training the initial voiceprint recognition model through the small sample voice data training set to obtain the voiceprint recognition model; wherein the initial voiceprint recognition model and the convolution layer generating the countermeasure network model have the same structure.
3. The voiceprint recognition method of claim 2 wherein said initially generated confrontation network model comprises an initial generator network and an initial discriminator network;
then, the training the initially generated confrontation network model through the small sample voice data training set to obtain a generated confrontation network model, including:
preprocessing the small sample voice data training set to obtain a preprocessing result;
and inputting random noise data into the initial generator network to obtain generated data, inputting the preprocessing result and the generated data into the initial discriminator network, and performing joint training on the initial generator network and the initial discriminator network to obtain the generator network and the discriminator network.
4. The voiceprint recognition method according to claim 3, wherein the inputting the preprocessing result and the generated data into the initial discriminator network and jointly training the initial generator network and the initial discriminator network to obtain the generator network and the discriminator network comprises:
inputting the preprocessing result and the generated data into the initial discriminator network to obtain an initial discrimination prediction result;
calculating a prediction error value between the initial discrimination prediction result and a standard discrimination result through a loss function;
updating network parameters of the initial generator network and the initial discriminator network according to the prediction error value;
and if the prediction error value satisfies a preset convergence condition, determining that training of the initial generator network and the initial discriminator network is complete, thereby obtaining the generator network and the discriminator network.
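The joint training procedure of claims 3 and 4 — a generator producing data from random noise, a discriminator scoring real versus generated samples, parameter updates driven by a loss function, and a convergence condition on the error value — can be illustrated with a deliberately tiny numpy adversarial pair. The scalar "networks" on 1-D Gaussian data, learning rate, and stopping tolerance are all illustrative assumptions, not the patent's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy stand-ins: generator g(z) = a*z + b, discriminator d(x) = sigmoid(w*x + c)
a, b = 1.0, 0.0          # initial generator parameters
w, c = 0.1, 0.0          # initial discriminator parameters
lr, tol, prev_loss = 0.05, 1e-4, np.inf

for step in range(2000):
    real = rng.normal(3.0, 1.0, size=64)   # stand-in for preprocessed real samples
    z = rng.normal(size=64)                # random noise input
    fake = a * z + b                       # generated data

    # Discriminator prediction and loss (binary cross-entropy)
    s_r, s_f = sigmoid(w * real + c), sigmoid(w * fake + c)
    d_loss = -np.mean(np.log(s_r + 1e-9) + np.log(1 - s_f + 1e-9))

    # Update discriminator parameters from the loss gradient
    gw = np.mean(-(1 - s_r) * real + s_f * fake)
    gc = np.mean(-(1 - s_r) + s_f)
    w, c = w - lr * gw, c - lr * gc

    # Update generator parameters: push the discriminator to score fakes as real
    s_f = sigmoid(w * fake + c)
    ga = np.mean(-(1 - s_f) * w * z)
    gb = np.mean(-(1 - s_f) * w)
    a, b = a - lr * ga, b - lr * gb

    # Preset convergence condition on the change of the loss value
    if abs(prev_loss - d_loss) < tol:
        break
    prev_loss = d_loss
```

With these settings the generator's offset `b` drifts toward the real-data mean, which is exactly the "generated data fools the discriminator" dynamic the claim describes.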
5. The voiceprint recognition method according to claim 3, wherein the preprocessing the small-sample speech data training set to obtain a preprocessing result comprises:
framing the speech data in the small-sample speech data training set to obtain a plurality of speech frames;
windowing each speech frame to obtain corresponding windowed data;
performing a Fourier transform on each piece of windowed data to obtain a two-dimensional spectrogram;
and mapping the frequency data of the two-dimensional spectrogram onto the Mel scale to obtain Mel spectrogram data, and taking the Mel spectrogram data as the preprocessing result.
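A minimal numpy sketch of the claim-5 preprocessing chain (framing, Hamming windowing, Fourier transform, triangular mel filterbank). The frame length, hop size, and filter count are illustrative assumptions.

```python
import numpy as np

def mel_spectrogram(signal, sr=16000, frame_len=400, hop=160, n_mels=40):
    """Framing -> windowing -> |FFT|^2 -> mel filterbank (toy sketch of claim 5)."""
    # 1) framing
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # 2) windowing with a Hamming window
    frames = frames * np.hamming(frame_len)
    # 3) Fourier transform -> two-dimensional (power) spectrogram
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, frame_len//2 + 1)
    # 4) map frequency bins onto the mel scale via triangular filters
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    mel2hz = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    mel_pts = mel2hz(np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((frame_len + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return spec @ fbank.T                                  # (n_frames, n_mels)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
mel = mel_spectrogram(sig)
```

A production pipeline would typically use a library routine (e.g. librosa's melspectrogram) rather than this hand-rolled filterbank.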
6. The voiceprint recognition method according to any one of claims 2 to 5, wherein the discriminator network in the generative adversarial network model comprises a first convolutional layer and a first fully connected layer, and the initial voiceprint recognition model comprises a second convolutional layer and a second fully connected layer;
and the migrating the network parameters of the generative adversarial network model to an initial voiceprint recognition model and training the initial voiceprint recognition model through the small-sample speech data training set comprises:
taking the network parameters of the first convolutional layer in the discriminator network as the network parameters of the second convolutional layer in the initial voiceprint recognition model, and initializing the network parameters of the second fully connected layer;
and training the initial voiceprint recognition model through the small-sample speech data training set.
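The parameter migration of claim 6 — copy the discriminator's convolutional-layer weights into the recognition model and reinitialize the fully connected head — can be sketched with plain parameter dictionaries. All names and shapes are hypothetical; a framework would do this via partial state-dict loading.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter dictionaries; the convolutional shapes match by construction
discriminator = {
    "conv1.weight": rng.normal(size=(16, 1, 3, 3)),
    "conv2.weight": rng.normal(size=(32, 16, 3, 3)),
    "fc.weight":    rng.normal(size=(1, 32)),     # real/fake output head
}
recognizer = {
    "conv1.weight": np.zeros((16, 1, 3, 3)),
    "conv2.weight": np.zeros((32, 16, 3, 3)),
    "fc.weight":    np.zeros((10, 32)),           # speaker-class output head
}

# Migrate convolutional-layer parameters; reinitialize the fully connected layer
for name in recognizer:
    if name.startswith("conv"):
        recognizer[name] = discriminator[name].copy()          # transferred weights
    else:
        recognizer[name] = rng.normal(scale=0.01, size=recognizer[name].shape)
```

After this step the recognition model is fine-tuned on the small-sample training set, with the transferred convolutional weights serving only as the initialization.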
7. A voiceprint recognition apparatus, said apparatus comprising:
the voice data acquisition module is used for acquiring voice data to be recognized;
the voiceprint recognition module is used for inputting the voice data to be recognized into a preset voiceprint recognition model to obtain a voiceprint recognition result of the voice data to be recognized;
wherein initial parameters of the convolutional layers used in training the voiceprint recognition model are determined according to the convolutional-layer parameters of a pre-trained generative adversarial network model, and the generative adversarial network model is obtained by training after data augmentation is performed on a small-sample speech data training set.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the voiceprint recognition method according to any one of claims 1 to 6.
9. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the voiceprint recognition method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1-6.
CN202210450804.6A 2022-04-27 2022-04-27 Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product Pending CN114913860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210450804.6A CN114913860A (en) 2022-04-27 2022-04-27 Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product


Publications (1)

Publication Number Publication Date
CN114913860A true CN114913860A (en) 2022-08-16

Family

ID=82765611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210450804.6A Pending CN114913860A (en) 2022-04-27 2022-04-27 Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN114913860A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727308A (en) * 2024-02-18 2024-03-19 百鸟数据科技(北京)有限责任公司 Mixed bird song recognition method based on deep migration learning
CN117727308B (en) * 2024-02-18 2024-04-26 百鸟数据科技(北京)有限责任公司 Mixed bird song recognition method based on deep migration learning

Similar Documents

Publication Publication Date Title
JP7177167B2 (en) Mixed speech identification method, apparatus and computer program
CN110444214B (en) Speech signal processing model training method and device, electronic equipment and storage medium
CN109166586B (en) Speaker identification method and terminal
CN108922544B (en) Universal vector training method, voice clustering method, device, equipment and medium
CN109065028B (en) Speaker clustering method, speaker clustering device, computer equipment and storage medium
US20190051292A1 (en) Neural network method and apparatus
WO2019227574A1 (en) Voice model training method, voice recognition method, device and equipment, and medium
CN110310647B (en) Voice identity feature extractor, classifier training method and related equipment
WO2019232851A1 (en) Method and apparatus for training speech differentiation model, and computer device and storage medium
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
WO2019232772A1 (en) Systems and methods for content identification
US20200125836A1 (en) Training Method for Descreening System, Descreening Method, Device, Apparatus and Medium
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN110929836B (en) Neural network training and image processing method and device, electronic equipment and medium
WO2023005386A1 (en) Model training method and apparatus
WO2021127982A1 (en) Speech emotion recognition method, smart device, and computer-readable storage medium
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
Wei et al. A method of underwater acoustic signal classification based on deep neural network
WO2021042544A1 (en) Facial verification method and apparatus based on mesh removal model, and computer device and storage medium
CN114913860A (en) Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product
CN116257762B (en) Training method of deep learning model and method for controlling mouth shape change of virtual image
KR20220065209A (en) Method and apparatus for recognizing image of various quality
CN113542527B (en) Face image transmission method and device, electronic equipment and storage medium
CN113688655B (en) Method, device, computer equipment and storage medium for identifying interference signals
CN111951791A (en) Voiceprint recognition model training method, recognition method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination