CN111783939A - Voiceprint recognition model training method and device, mobile terminal and storage medium - Google Patents

Voiceprint recognition model training method and device, mobile terminal and storage medium

Info

Publication number
CN111783939A
CN111783939A
Authority
CN
China
Prior art keywords
training
full
connection layer
recognition model
voiceprint recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010469636.6A
Other languages
Chinese (zh)
Inventor
洪国强
肖龙源
李稀敏
刘晓葳
叶志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010469636.6A
Publication of CN111783939A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a voiceprint recognition model training method, a voiceprint recognition model training device, a mobile terminal and a storage medium, wherein the method comprises the following steps: controlling an xvector voiceprint recognition model to perform feature extraction on training data to obtain training feature vectors, and performing type recognition on the training feature vectors through a first full connection layer to obtain preset feature vectors and dynamic digital feature vectors; inputting the preset feature vectors and the dynamic digital feature vectors into a second full connection layer and a third full connection layer respectively; and performing loss calculation on the outputs of the second and third full connection layers to obtain a first loss probability and a second loss probability, and training the two layers according to those probabilities. Because the second full connection layer is trained on the preset feature vectors and the third full connection layer is trained on the dynamic digital feature vectors, the trained xvector voiceprint recognition model recognizes text-semi-correlated voiceprints more effectively.

Description

Voiceprint recognition model training method and device, mobile terminal and storage medium
Technical Field
The invention belongs to the technical field of voiceprint recognition, and particularly relates to a voiceprint recognition model training method and device, a mobile terminal and a storage medium.
Background
Each person's voice carries unique biological characteristics, and voiceprint recognition refers to a technical means of identifying a speaker by the speaker's voice. Like fingerprint recognition and similar techniques, voiceprint recognition offers high safety and reliability and can be applied in any occasion that requires identity recognition, for example in criminal investigation and in financial fields such as banking, securities and insurance. Compared with traditional identity recognition technologies, voiceprint recognition has the advantages of a simple voiceprint extraction process, low cost, uniqueness and difficulty of forgery.
In the existing voiceprint recognition process, the xvector model achieves a good voiceprint recognition effect, and voiceprint application scenarios generally fall into text independence, text correlation (fixed password) and text semi-correlation (dynamic digits). However, existing training methods give the model a poor recognition effect in the text-semi-correlated scenario, so detection efficiency and accuracy suffer there.
Disclosure of Invention
The embodiment of the invention aims to provide a voiceprint recognition model training method, a voiceprint recognition model training device, a mobile terminal and a storage medium, so as to solve the problems of low audio detection efficiency and poor audio detection accuracy in existing voiceprint recognition model training methods.
The embodiment of the invention is realized in such a way that a voiceprint recognition model training method comprises the following steps:
acquiring training data, and inputting the training data into an xvector voiceprint recognition model; the training data comprises preset data and dynamic digital data;
extracting features of the training data based on the xvector voiceprint recognition model to obtain a training feature vector, and inputting the training feature vector into a first full connection layer;
performing type recognition on the training feature vector through the first full-connection layer to obtain a preset feature vector and a dynamic digital feature vector;
inputting the preset feature vector and the dynamic digital feature vector into a second full-connection layer and a third full-connection layer respectively, wherein the second full-connection layer and the third full-connection layer each correspond to one output;
respectively performing loss calculation on the outputs of the second full connection layer and the third full connection layer by using a loss calculation layer to obtain a first loss probability and a second loss probability;
and training the second full-link layer according to the first loss probability, and training the third full-link layer according to the second loss probability until the outputs of the second full-link layer and the third full-link layer are converged.
Further, the step of performing feature extraction on the training data based on the xvector voiceprint recognition model includes:
inputting the training data into a TDNN (Time-Delay Neural Network) in the xvector voiceprint recognition model, and controlling the TDNN to extract the characteristics of the training data to obtain training characteristics;
and controlling the TDNN to carry out nonlinear transformation on the training characteristics to obtain the training feature vector.
Further, the step of using the loss calculation layer to perform loss calculation on the outputs of the second fully-connected layer and the third fully-connected layer respectively to obtain a first loss probability and a second loss probability includes:
performing loss calculation on the output of the second full-connection layer according to a preset loss function and the preset feature vector to obtain a first loss probability;
and performing loss calculation on the output of the third full-connection layer according to the preset loss function and the dynamic digital feature vector to obtain a second loss probability.
Further, the step of training the second fully-connected layer according to the first loss probability and the step of training the third fully-connected layer according to the second loss probability comprises:
and carrying out forward propagation in the xvector voiceprint recognition model according to the first loss probability, and carrying out backward propagation in the xvector voiceprint recognition model according to the second loss probability.
Further, before the step of inputting the training feature vector into the first fully-connected layer, the method further comprises:
pooling the training feature vectors output by each TDNN, and inputting the pooled training feature vectors into the first full connection layer.
Further, the step of pooling the training feature vectors output by each of the TDNN networks includes:
accumulating the training feature vectors output by each TDNN, calculating a mean value and a standard deviation in all the training feature vectors according to a vector accumulation result, and taking the mean value and the standard deviation as the output of the training feature vectors after pooling processing.
Still further, the method further comprises:
acquiring voiceprint data to be recognized, and inputting the voiceprint data to be recognized into the xvector voiceprint recognition model;
controlling the xvector voiceprint recognition model to recognize the voiceprint data to be recognized, and taking the output result of the first full connection layer as the output vector of the xvector voiceprint recognition model;
calculating a matching value between the output vector and a locally pre-stored sample vector according to an Euclidean distance formula, and acquiring a serial number value of the sample vector corresponding to the maximum value in the matching value;
and when the serial number value is judged to be larger than the serial number threshold value, judging that the voiceprint identification of the voiceprint data to be identified is qualified.
Another objective of an embodiment of the present invention is to provide a training apparatus for a voiceprint recognition model, where the apparatus includes:
the training data acquisition module is used for acquiring training data and inputting the training data into an xvector voiceprint recognition model, wherein the training data comprises preset data and dynamic digital data;
the feature extraction module is used for extracting features of the training data based on the xvector voiceprint recognition model to obtain a training feature vector, and inputting the training feature vector into a first full connection layer;
the characteristic type recognition module is used for carrying out type recognition on the training characteristic vector through the first full-connection layer to obtain a preset characteristic vector and a dynamic digital characteristic vector;
the characteristic output module is used for correspondingly inputting the preset characteristic vector and the dynamic digital characteristic vector into a second full connection layer and a third full connection layer respectively, and the second full connection layer and the third full connection layer each correspond to one output;
the loss calculation module is used for performing loss calculation on the outputs of the second full connection layer and the third full connection layer by using a loss calculation layer to obtain a first loss probability and a second loss probability;
and the model training module is used for training the second full-connection layer according to the first loss probability and training the third full-connection layer according to the second loss probability until the outputs of the second full-connection layer and the third full-connection layer are converged.
Another object of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above-mentioned voiceprint recognition model training method.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the above-mentioned mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the above-mentioned voiceprint recognition model training method.
According to the embodiment of the invention, through the design of training the second full connection layer according to the preset feature vectors and training the third full connection layer according to the dynamic digital feature vectors, the recognition effect of the trained xvector voiceprint recognition model on text semi-correlation is improved, so that the model can perform effective voiceprint recognition on text-independent, text-dependent and text-semi-dependent data, and the accuracy of voiceprint recognition is improved.
Drawings
FIG. 1 is a flowchart of a training method for a voiceprint recognition model according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a voiceprint recognition model training method provided by a second embodiment of the invention;
FIG. 3 is a schematic structural diagram of a training apparatus for a voiceprint recognition model according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a mobile terminal according to a fourth embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
Example one
Referring to fig. 1, a flowchart of a voiceprint recognition model training method according to a first embodiment of the present invention is shown, which includes the steps of:
step S10, acquiring training data, and inputting the training data into an xvector voiceprint recognition model;
the size and information parameters of the training data can be set according to user requirements, the training data comprises preset data and dynamic digital data, the preset data can be character data, voice data or digital data, for example, the training data comprises 100 fixed texts and 100 dynamic numbers, and each fixed text and each dynamic number can be randomly generated;
preferably, the xvector voiceprint recognition model includes a TDNN network (Time-Delay neural network), a pooling layer and a plurality of full connection layers, specifically, the xvector voiceprint recognition model includes a first full connection layer, a second full connection layer and a third full connection layer, and further, a user can set the number of the full connection layers according to own requirements;
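For illustration only, the following is a minimal PyTorch sketch of such a structure, with a TDNN trunk, statistics pooling and a first full connection layer shared by the second and third full connection layers. The layer sizes, the kernel and dilation settings of the two TDNN layers and the class counts are assumptions made for this sketch, not values prescribed by the embodiment:

```python
import torch
import torch.nn as nn

class TwoBranchXVector(nn.Module):
    """Sketch of an xvector model with a shared trunk and two branch outputs."""

    def __init__(self, feat_dim=24, embed_dim=512,
                 num_preset_classes=100, num_digit_classes=100):
        super().__init__()
        # Two TDNN layers, realised as dilated 1-D convolutions over frames.
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
        )
        # First full connection layer, shared by both branches.
        self.fc1 = nn.Linear(2 * 512, embed_dim)  # 2x: mean and std after pooling
        # Second branch (preset/fixed-text data) and third branch (dynamic digits).
        self.fc2 = nn.Linear(embed_dim, num_preset_classes)
        self.fc3 = nn.Linear(embed_dim, num_digit_classes)

    def pool(self, x):
        # Statistics pooling: concatenate per-utterance mean and standard deviation.
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

    def forward(self, feats):
        # feats: (batch, feat_dim, frames) front-end features such as MFCCs
        h = self.tdnn(feats)
        embedding = self.fc1(self.pool(h))
        return embedding, self.fc2(embedding), self.fc3(embedding)
```

Because the trunk (TDNN layers, pooling and fc1) is shared, training either branch updates the same embedding, which is what lets one model serve the fixed-text and dynamic-digit scenarios at once;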
step S20, extracting the features of the training data based on the xvector voiceprint recognition model to obtain a training feature vector, and inputting the training feature vector into a first full connection layer;
the training feature vector can be an MFCC feature vector, after the MFCC feature extraction is carried out on training data by an xvector voiceprint recognition model, vector conversion is carried out on the extracted MFCC feature to obtain the MFCC feature vector, and the MFCC feature vector is input into a first full-connection layer to be convolved;
step S30, performing type recognition on the training feature vector through the first full-connection layer to obtain a preset feature vector and a dynamic digital feature vector;
the method comprises the steps that a first full-connection layer is controlled to identify vector identification in training feature vectors, and preset feature vectors corresponding to preset data and dynamic digital feature vectors corresponding to dynamic digital data are obtained correspondingly based on identification results;
preferably, in this step, the sample data in the training data may be set according to the requirements of the user, but the sample data includes at least two different sample data, so as to ensure that after the first full-connection layer performs type recognition on the training feature vectors, at least two different types of feature vectors are obtained;
step S40, inputting the preset feature vector and the dynamic digital feature vector into a second full connection layer and a third full connection layer respectively;
the second full connection layer and the third full connection layer both correspond to one output, and share the TDNN, the pooling layer and the first full connection layer in the xvector voiceprint recognition model;
specifically, in this embodiment, the second full-link layer and the third full-link layer are respectively provided with a different output, so that the second full-link layer and the third full-link layer can respectively perform model training on different features, so that the subsequent first full-link layer can recognize different types of features, and the diversity of the first full-link layer and the xvector voiceprint recognition model on voiceprint recognition is improved;
step S50, respectively performing loss calculation on the outputs of the second full connection layer and the third full connection layer using a loss calculation layer to obtain a first loss probability and a second loss probability;
the posterior probabilities of the second full-link layer and the third full-link layer, namely the probability value of the speaker, can be effectively calculated by using the design that the loss calculation layer is used for respectively carrying out loss calculation on the outputs of the second full-link layer and the third full-link layer;
step S60, training the second fully-connected layer according to the first loss probability, and training the third fully-connected layer according to the second loss probability until the outputs of the second fully-connected layer and the third fully-connected layer converge;
preferably, when the second full connection layer and the third full connection layer reach a preset iteration number, the model training of the xvector voiceprint recognition model is automatically stopped, so that the trained xvector voiceprint recognition model can effectively achieve recognition effects on voiceprint data which are text-independent, text-dependent and text-semi-dependent;
in this embodiment, the second full connection layer is trained according to the preset feature vectors and the third full connection layer is trained according to the dynamic digital feature vectors, which improves the recognition effect of the trained xvector voiceprint recognition model on text semi-correlation, so that the model can perform effective voiceprint recognition on text-independent, text-dependent and text-semi-dependent data, improving the accuracy of voiceprint recognition.
Example two
Referring to fig. 2, a flowchart of a voiceprint recognition model training method according to a second embodiment of the present invention is shown, which includes the steps of:
step S11, acquiring training data, and inputting the training data into an xvector voiceprint recognition model;
the training data comprises preset data and dynamic digital data, and further comprises at least two different sample data, wherein one of the sample data is dynamic digital data, so that the trained xvector voiceprint recognition model can have an effective voiceprint recognition effect on the semi-correlation of the text;
step S21, inputting the front-end features of the training data into a TDNN (Time-Delay Neural Network) in an xvector voiceprint recognition model, and controlling the TDNN to perform feature extraction on the front-end features to obtain training features;
the front-end feature may be an MFCC feature, the TDNN network is used to express a relationship of a voiceprint feature in time, and preferably, two TDNN networks are used in the xvector voiceprint recognition model;
step S31, controlling a TDNN to carry out nonlinear transformation on the training features to obtain training feature vectors;
the training feature vector can be an MFCC feature vector, and after the TDNN extracts the MFCC feature of training data, the extracted MFCC feature is subjected to nonlinear transformation to achieve the effect of feature vector conversion, so that the MFCC feature vector is obtained;
specifically, in this step, the MFCC feature vector may be obtained by performing pre-emphasis, framing, windowing, fast fourier transform, band-pass filtering, logarithmic operation, and discrete cosine transform on the training feature;
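As an illustration of this feature chain, the sketch below extracts MFCCs with librosa, which internally performs the framing, windowing, fast Fourier transform, mel band-pass filtering, logarithm and discrete cosine transform listed above; pre-emphasis is applied by hand. The sampling rate, frame parameters and coefficient count are illustrative assumptions:

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, n_mfcc=24):
    """Sketch: compute MFCC features for one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])  # pre-emphasis
    # 25 ms frames with a 10 ms hop at 16 kHz (assumed settings)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)
    return mfcc  # shape: (n_mfcc, frames)
```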
step S41, pooling the training feature vectors output by each TDNN, and inputting the pooled training feature vectors into a first full connection layer;
wherein, Pooling processing (Pooling), also called undersampling or downsampling, is mainly used for feature dimension reduction and compressing the number of data and parameters to reduce overfitting and improve the fault tolerance of the model;
specifically, in this step, the pooling processing of the training feature vector output by each TDNN network includes:
accumulating the training feature vectors output by each TDNN, calculating a mean value and a standard deviation in all the training feature vectors according to a vector accumulation result, and taking the mean value and the standard deviation as the output of the training feature vectors after pooling processing.
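A minimal sketch of this pooling step, assuming the frame-level TDNN outputs of one utterance are collected in a single (channels, frames) tensor, and matching the pool() method of the model sketch above:

```python
import torch

def stats_pooling(frame_outputs):
    """Pool variable-length frame-level outputs into one fixed-size vector.

    frame_outputs: (channels, frames) tensor for a single utterance.
    Returns the concatenated mean and standard deviation over the frame axis.
    """
    mean = frame_outputs.mean(dim=1)
    std = frame_outputs.std(dim=1)
    return torch.cat([mean, std], dim=0)  # shape: (2 * channels,)
```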
Step S51, performing type recognition on the training feature vectors through the first full-connection layer to obtain preset feature vectors and dynamic digital feature vectors;
the method comprises the steps that a first full-connection layer is controlled to identify vector identification in training feature vectors, and preset feature vectors corresponding to preset data and dynamic digital feature vectors corresponding to dynamic digital data are obtained correspondingly based on identification results;
step S61, inputting the preset feature vector and the dynamic digital feature vector into a second full connection layer and a third full connection layer respectively;
the second full connection layer and the third full connection layer each correspond to one output, and share the TDNN, the pooling layer and the first full connection layer in the xvector voiceprint recognition model;
step S71, performing loss calculation on the output of the second full connection layer according to a preset loss function and a preset feature vector to obtain a first loss probability, and performing loss calculation on the output of the third full connection layer according to the preset loss function and a dynamic digital feature vector to obtain a second loss probability;
step S81, forward propagation is carried out in the xvector voiceprint recognition model according to the first loss probability, and backward propagation is carried out in the xvector voiceprint recognition model according to the second loss probability until the output of the second full connection layer and the third full connection layer is converged;
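A sketch of one such training step follows, reusing the TwoBranchXVector model sketched in embodiment one. Cross-entropy is assumed as the preset loss function, and the two branch losses are summed before backpropagating through the shared trunk, which is one straightforward reading of propagating according to both loss probabilities; neither choice is fixed by this embodiment:

```python
import torch
import torch.nn.functional as F

model = TwoBranchXVector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(feats, labels, is_dynamic):
    """One training step over a mixed batch.

    feats: (batch, feat_dim, frames); labels: per-branch class indices;
    is_dynamic: bool mask, True for dynamic-digit samples (third branch).
    """
    _, preset_logits, digit_logits = model(feats)
    loss = torch.zeros(())
    if (~is_dynamic).any():  # first loss probability: second full connection layer
        loss = loss + F.cross_entropy(preset_logits[~is_dynamic], labels[~is_dynamic])
    if is_dynamic.any():     # second loss probability: third full connection layer
        loss = loss + F.cross_entropy(digit_logits[is_dynamic], labels[is_dynamic])
    optimizer.zero_grad()
    loss.backward()          # both losses update the shared TDNN/pooling/fc1 trunk
    optimizer.step()
    return loss.item()
```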
step S91, acquiring voiceprint data to be recognized, and inputting the voiceprint data to be recognized into an xvector voiceprint recognition model;
step S101, controlling an xvector voiceprint recognition model to recognize voiceprint data to be recognized, and taking an output result of a first full connection layer as an output vector of the xvector voiceprint recognition model;
step S111, calculating a matching value between the output vector and a sample vector prestored locally according to an Euclidean distance formula, and acquiring a serial number value of the sample vector corresponding to the maximum value in the matching value;
wherein the Euclidean distance formula adopted between the output vector and the sample vector is:

d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}

where a is the output vector and b is a sample vector; using this formula, the current feature value (output vector) is scored in a 1:N search against the feature values (sample vectors) stored in the voiceprint library to obtain the matching values;
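A small sketch of this 1:N search, assuming the enrolled sample vectors are stacked row-wise in a NumPy array; since the embodiment specifies only the distance formula, the mapping from distance to a matching value in (0, 1] is an illustrative choice:

```python
import numpy as np

def match_speaker(output_vec, sample_vecs):
    """1:N search: score a query embedding against every enrolled sample vector.

    output_vec: (dim,) embedding from the first full connection layer.
    sample_vecs: (N, dim) locally pre-stored sample vectors.
    Returns the index of the best match and its matching value.
    """
    dists = np.linalg.norm(sample_vecs - output_vec, axis=1)  # d(a, b) per sample
    scores = 1.0 / (1.0 + dists)  # assumed distance-to-score mapping
    best = int(np.argmax(scores))
    return best, scores[best]
```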
specifically, in this embodiment, a number table is pre-stored, and the number table stores corresponding relationships between different matching values and number values, so that the number value is queried by matching a maximum value of the matching values with the number table;
step S121, when the number value is judged to be larger than the number threshold value, judging that the voiceprint identification of the voiceprint data to be identified is qualified;
the queried number value and the number threshold are subjected to size judgment to judge whether the voiceprint identification of the voiceprint data to be identified is qualified, specifically, the number threshold can be subjected to parameter setting according to requirements, for example, the number threshold can be 0.8, 0.9 or 0.95, and the number threshold is used for judging whether the voiceprint features in the voiceprint data to be identified are consistent with locally pre-stored sample voiceprint features;
further, in this embodiment, when it is determined that the voiceprint recognition of the voiceprint data to be recognized is qualified, obtaining a user identifier corresponding to the number value, and outputting the user identifier, where the user identifier may be stored in a manner of characters, numbers, images, or biological features, and the user identifier is used to point to a corresponding user, for example, when the user identifier is stored in a manner of characters, the user identifier may be a user name, such as "zhang san", "lie si", and the like; when the user identification is stored in a numbering mode, the user identification can be a user job number, and when the user identification is stored in an image mode, the user identification is a head portrait picture of a user;
in this embodiment, the second full connection layer is trained according to the preset feature vectors and the third full connection layer is trained according to the dynamic digital feature vectors, which improves the recognition effect of the trained xvector voiceprint recognition model on text semi-correlation, so that the model can perform effective voiceprint recognition on text-independent, text-dependent and text-semi-dependent data, improving the accuracy of voiceprint recognition.
Example three
Referring to fig. 3, a schematic structural diagram of a voiceprint recognition model training apparatus 100 according to a third embodiment of the present invention is shown, including: a training data acquisition module 10, a feature extraction module 11, a feature type recognition module 12, a feature output module 13, a loss calculation module 14 and a model training module 15, wherein:
the training data acquisition module 10 is configured to acquire training data and input the training data into an xvector voiceprint recognition model, where the training data includes preset data and dynamic digital data, a size and information parameters of the training data may be set according to a user requirement, and the training data includes the preset data and the dynamic digital data;
preferably, the xvector voiceprint recognition model includes a TDNN Network (Time-Delay Neural Network), a pooling layer and a plurality of full connection layers, and specifically, the xvector voiceprint recognition model includes a first full connection layer, a second full connection layer and a third full connection layer, and further, the user can set the number of the full connection layers according to the own requirement.
And the feature extraction module 11 is configured to perform feature extraction on the training data based on the xvector voiceprint recognition model to obtain a training feature vector, and input the training feature vector into the first full connection layer, where the training feature vector may be an MFCC feature vector, and after performing MFCC feature extraction on the training data by the xvector voiceprint recognition model, perform vector conversion on the extracted MFCC feature to obtain the MFCC feature vector, and input the MFCC feature vector into the first full connection layer for convolution.
Preferably, the feature extraction module 11 is further configured to: input the training data into a TDNN (Time-Delay Neural Network) in the xvector voiceprint recognition model, and control the TDNN to extract the characteristics of the training data to obtain training characteristics;
and controlling the TDNN to carry out nonlinear transformation on the training features to obtain the training feature vector.
And the feature type recognition module 12 is configured to perform type recognition on the training feature vectors through the first full connection layer to obtain preset feature vectors and dynamic digital feature vectors, wherein the first full connection layer is controlled to recognize the vector identifiers in the training feature vectors, and the preset feature vectors corresponding to the preset data and the dynamic digital feature vectors corresponding to the dynamic digital data are obtained correspondingly based on the recognition results.
And the feature output module 13 is configured to correspondingly input the preset feature vectors and the dynamic digital feature vectors into a second full connection layer and a third full connection layer respectively; the second full connection layer and the third full connection layer each correspond to one output, and share the TDNN network, the pooling layer and the first full connection layer in the xvector voiceprint recognition model.
And a loss calculation module 14, configured to perform loss calculation on the outputs of the second full connection layer and the third full connection layer by using a loss calculation layer, respectively, so as to obtain a first loss probability and a second loss probability.
Wherein the loss calculation module 14 is further configured to: performing loss calculation on the output of the second full-connection layer according to a preset loss function and the preset feature vector to obtain a first loss probability;
and performing loss calculation on the output of the third full-connection layer according to the preset loss function and the dynamic digital feature vector to obtain a second loss probability.
And the model training module 15 is used for training the second full connection layer according to the first loss probability and training the third full connection layer according to the second loss probability until the outputs of the second full connection layer and the third full connection layer converge. Preferably, when the second full connection layer and the third full connection layer reach a preset iteration number, the model training of the xvector voiceprint recognition model is automatically stopped, so that the trained xvector voiceprint recognition model can effectively recognize text-independent, text-dependent and text-semi-dependent voiceprint data.
Wherein the model training module 15 is further configured to: and carrying out forward propagation in the xvector voiceprint recognition model according to the first loss probability, and carrying out backward propagation in the xvector voiceprint recognition model according to the second loss probability.
Specifically, in this embodiment, the training apparatus 100 for a voiceprint recognition model further includes:
a feature pooling module 16, configured to pool the training feature vectors output by each TDNN network, and input the pooled training feature vectors into the first full connection layer.
Preferably, the feature pooling module 16 is further configured to: accumulating the training feature vectors output by each TDNN, calculating a mean value and a standard deviation in all the training feature vectors according to a vector accumulation result, and taking the mean value and the standard deviation as the output of the training feature vectors after pooling processing.
In addition, the voiceprint recognition model training apparatus 100 further includes:
the voiceprint recognition model 17 is used for acquiring voiceprint data to be recognized and inputting the voiceprint data to be recognized into the xvector voiceprint recognition model;
controlling the xvector voiceprint recognition model to recognize the voiceprint data to be recognized, and taking the output result of the first full connection layer as the output vector of the xvector voiceprint recognition model;
calculating a matching value between the output vector and a locally pre-stored sample vector according to an Euclidean distance formula, and acquiring a serial number value of the sample vector corresponding to the maximum value in the matching value;
and when the serial number value is judged to be larger than the serial number threshold value, judging that the voiceprint identification of the voiceprint data to be identified is qualified.
In this embodiment, the second full connection layer is trained according to the preset feature vectors and the third full connection layer is trained according to the dynamic digital feature vectors, which improves the recognition effect of the trained xvector voiceprint recognition model on text semi-correlation, so that the model can perform effective voiceprint recognition on text-independent, text-dependent and text-semi-dependent data, improving the accuracy of voiceprint recognition.
Example four
Referring to fig. 4, a mobile terminal 101 according to a fourth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal 101 execute the above-mentioned voiceprint recognition model training method.
The present embodiment also provides a storage medium on which a computer program used in the above-mentioned mobile terminal 101 is stored, which when executed, includes the steps of:
acquiring training data, and inputting the training data into an xvector voiceprint recognition model; the training data comprises preset data and dynamic digital data;
extracting features of the training data based on the xvector voiceprint recognition model to obtain a training feature vector, and inputting the training feature vector into a first full connection layer;
performing type recognition on the training feature vector through the first full-connection layer to obtain a preset feature vector and a dynamic digital feature vector;
inputting the preset feature vector and the dynamic digital feature vector into a second full-connection layer and a third full-connection layer respectively, wherein the second full-connection layer and the third full-connection layer each correspond to one output;
respectively performing loss calculation on the outputs of the second full connection layer and the third full connection layer by using a loss calculation layer to obtain a first loss probability and a second loss probability;
and training the second full-link layer according to the first loss probability, and training the third full-link layer according to the second loss probability until the outputs of the second full-link layer and the third full-link layer are converged. The storage medium may be, for example, a ROM/RAM, a magnetic disk or an optical disc.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is used as an example, in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to needs, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application.
Those skilled in the art will appreciate that the component structure shown in fig. 3 does not constitute a limitation of the voiceprint recognition model training apparatus of the present invention, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components, and that the voiceprint recognition model training method in fig. 1-2 is also implemented using more or fewer components than those shown in fig. 3, or some components in combination, or a different arrangement of components. The units, modules, etc. referred to herein are a series of computer programs that can be executed by a processor (not shown) in the target voiceprint recognition model training apparatus and that can perform specific functions, and all of them can be stored in a storage device (not shown) of the target voiceprint recognition model training apparatus.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A method for training a voiceprint recognition model, the method comprising:
acquiring training data, and inputting the training data into an xvector voiceprint recognition model; the training data comprises preset data and dynamic digital data;
extracting features of the training data based on the xvector voiceprint recognition model to obtain a training feature vector, and inputting the training feature vector into a first full connection layer;
performing type recognition on the training feature vector through the first full-connection layer to obtain a preset feature vector and a dynamic digital feature vector;
inputting the preset feature vector and the dynamic digital feature vector into a second full-connection layer and a third full-connection layer respectively, wherein the second full-connection layer and the third full-connection layer each correspond to one output;
respectively performing loss calculation on the outputs of the second full connection layer and the third full connection layer by using a loss calculation layer to obtain a first loss probability and a second loss probability;
and training the second full-link layer according to the first loss probability, and training the third full-link layer according to the second loss probability until the outputs of the second full-link layer and the third full-link layer are converged.
2. The method for training the voiceprint recognition model according to claim 1, wherein the step of performing the feature extraction on the training data based on the xvector voiceprint recognition model comprises:
inputting the training data into a TDNN (Time-Delay Neural Network) in the xvector voiceprint recognition model, and controlling the TDNN to extract the characteristics of the training data to obtain training characteristics;
and controlling the TDNN to carry out nonlinear transformation on the training features to obtain the training feature vector.
3. The training method of the voiceprint recognition model according to claim 1, wherein the step of performing the loss calculation on the outputs of the second fully-connected layer and the third fully-connected layer respectively by using the loss calculation layer comprises:
performing loss calculation on the output of the second full-connection layer according to a preset loss function and the preset feature vector to obtain a first loss probability;
and performing loss calculation on the output of the third full-connection layer according to the preset loss function and the dynamic digital feature vector to obtain a second loss probability.
4. The method of claim 1, wherein the step of training the second fully-connected layer according to the first loss probability and the step of training the third fully-connected layer according to the second loss probability comprises:
and carrying out forward propagation in the xvector voiceprint recognition model according to the first loss probability, and carrying out backward propagation in the xvector voiceprint recognition model according to the second loss probability.
5. The method of training a voiceprint recognition model according to claim 1, wherein, before the step of inputting the training feature vectors into the first fully connected layer, the method further comprises:
pooling the training feature vectors output by each TDNN, and inputting the pooled training feature vectors into the first full connection layer.
6. The method for training the voiceprint recognition model according to claim 5, wherein the step of pooling the training feature vectors outputted by each of the TDNN networks comprises:
accumulating the training feature vectors output by each TDNN, calculating a mean value and a standard deviation in all the training feature vectors according to a vector accumulation result, and taking the mean value and the standard deviation as the output of the training feature vectors after pooling processing.
7. The method of claim 1, wherein the method further comprises:
acquiring voiceprint data to be recognized, and inputting the voiceprint data to be recognized into the xvector voiceprint recognition model;
controlling the xvector voiceprint recognition model to recognize the voiceprint data to be recognized, and taking the output result of the first full connection layer as the output vector of the xvector voiceprint recognition model;
calculating a matching value between the output vector and a locally pre-stored sample vector according to an Euclidean distance formula, and acquiring a serial number value of the sample vector corresponding to the maximum value in the matching value;
and when the serial number value is judged to be larger than the serial number threshold value, judging that the voiceprint identification of the voiceprint data to be identified is qualified.
8. A voiceprint recognition model training apparatus, the apparatus comprising:
the training data acquisition module is used for acquiring training data and inputting the training data into an xvector voiceprint recognition model, wherein the training data comprises preset data and dynamic digital data;
the feature extraction module is used for extracting features of the training data based on the xvector voiceprint recognition model to obtain a training feature vector, and inputting the training feature vector into a first full connection layer;
the characteristic type recognition module is used for carrying out type recognition on the training characteristic vector through the first full-connection layer to obtain a preset characteristic vector and a dynamic digital characteristic vector;
the characteristic output module is used for correspondingly inputting the preset characteristic vector and the dynamic digital characteristic vector into a second full connection layer and a third full connection layer respectively, and the second full connection layer and the third full connection layer each correspond to one output;
the loss calculation module is used for performing loss calculation on the outputs of the second full connection layer and the third full connection layer by using a loss calculation layer to obtain a first loss probability and a second loss probability;
and the model training module is used for training the second full-connection layer according to the first loss probability and training the third full-connection layer according to the second loss probability until the outputs of the second full-connection layer and the third full-connection layer are converged.
9. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor for executing the computer program to make the mobile terminal execute the voiceprint recognition model training method according to any one of claims 1 to 7.
10. A storage medium, characterized in that it stores a computer program for use in a mobile terminal according to claim 9, which computer program, when being executed by a processor, carries out the steps of the voiceprint recognition model training method according to any one of claims 1 to 7.
CN202010469636.6A 2020-05-28 2020-05-28 Voiceprint recognition model training method and device, mobile terminal and storage medium Pending CN111783939A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010469636.6A CN111783939A (en) 2020-05-28 2020-05-28 Voiceprint recognition model training method and device, mobile terminal and storage medium


Publications (1)

Publication Number Publication Date
CN111783939A 2020-10-16

Family

ID=72754420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010469636.6A Pending CN111783939A (en) 2020-05-28 2020-05-28 Voiceprint recognition model training method and device, mobile terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111783939A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382298A (en) * 2020-11-17 2021-02-19 北京清微智能科技有限公司 Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof
CN112382298B (en) * 2020-11-17 2024-03-08 北京清微智能科技有限公司 Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof
CN113421575A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium
CN113421575B (en) * 2021-06-30 2024-02-06 平安科技(深圳)有限公司 Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium
CN113470655A (en) * 2021-07-02 2021-10-01 因诺微科技(天津)有限公司 Voiceprint recognition method of time delay neural network based on phoneme log-likelihood ratio

Similar Documents

Publication Publication Date Title
RU2738325C2 (en) Method and device for authenticating an individual
CN110265037B (en) Identity verification method and device, electronic equipment and computer readable storage medium
CN111783939A (en) Voiceprint recognition model training method and device, mobile terminal and storage medium
CN111243603B (en) Voiceprint recognition method, system, mobile terminal and storage medium
US6772119B2 (en) Computationally efficient method and apparatus for speaker recognition
CN104834849A (en) Dual-factor identity authentication method and system based on voiceprint recognition and face recognition
CN111312259B (en) Voiceprint recognition method, system, mobile terminal and storage medium
CN103794207A (en) Dual-mode voice identity recognition method
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
CN111145758A (en) Voiceprint recognition method, system, mobile terminal and storage medium
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN110634492B (en) Login verification method, login verification device, electronic equipment and computer readable storage medium
CN111816185A (en) Method and device for identifying speaker in mixed voice
CN111653283B (en) Cross-scene voiceprint comparison method, device, equipment and storage medium
CN111816203A (en) Synthetic speech detection method for inhibiting phoneme influence based on phoneme-level analysis
JP7259981B2 (en) Speaker authentication system, method and program
CN111611437A (en) Method and device for preventing face voiceprint verification and replacement attack
CN109545226B (en) Voice recognition method, device and computer readable storage medium
CN204576520U (en) Based on the Dual-factor identity authentication device of Application on Voiceprint Recognition and recognition of face
CN112581967B (en) Voiceprint retrieval method, front-end back-end server and back-end server
CN111370000A (en) Voiceprint recognition algorithm evaluation method, system, mobile terminal and storage medium
Chetty et al. Liveness detection using cross-modal correlations in face-voice person authentication.
Gofman et al. Hidden markov models for feature-level fusion of biometrics on mobile devices
TWI778234B (en) Speaker verification system
Shenai et al. Fast biometric authentication system based on audio-visual fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination