CN112133289B - Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium - Google Patents


Info

Publication number
CN112133289B
Authority
CN
China
Prior art keywords
sample
spectrogram
phoneme
phonemes
characteristic information
Prior art date
Legal status
Active
Application number
CN202011324851.3A
Other languages
Chinese (zh)
Other versions
CN112133289A (en)
Inventor
曹岩岗
Current Assignee
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yuanjian Information Technology Co Ltd
Priority to CN202011324851.3A
Publication of CN112133289A
Application granted
Publication of CN112133289B
Legal status: Active


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention provides a voiceprint identification model training method, a voiceprint identification method, a device, equipment and a medium, relating to the technical field of data processing. The method comprises the following steps: determining a target phoneme sample from phonemes of a voice sample, wherein the phonemes of the voice sample are labeled with speaker labels in advance; generating a broadband spectrogram and a narrow-band spectrogram of the phonemes in the target phoneme sample; acquiring first sample characteristic information of the broadband spectrogram and second sample characteristic information of the narrow-band spectrogram; and performing model training with a preset neural network architecture according to the first sample characteristic information and the second sample characteristic information to obtain a voiceprint identification model with the neural network architecture. Training on the first sample characteristic information of the broadband spectrogram and the second sample characteristic information of the narrow-band spectrogram yields a voiceprint identification model with a neural network architecture; voiceprint identification of speech to be identified can then be performed with this model, which reduces the waste of human resources and improves the objectivity and accuracy of voiceprint identification.

Description

Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium
Technical Field
The invention relates to the technical field of data processing, in particular to a method, a device, equipment and a medium for training a voiceprint identification model and identifying a voiceprint.
Background
Voiceprint recognition, also called speaker recognition, is an identification technology. Different people produce different voiceprints when speaking, so recognizing the voiceprint corresponding to a segment of voice information, that is, recognizing the corresponding speaker, is becoming increasingly important.
In the related art, voice information is listened to by human ear, and the voiceprint corresponding to the voice information is then identified manually.
However, identifying voiceprints by ear in this way wastes human resources, and the identification result is easily inaccurate.
Disclosure of Invention
The present invention aims to provide a voiceprint identification model training method, a voiceprint identification method, an apparatus, a device and a medium, so as to solve the problems in the related art that identifying voiceprints by human ear wastes human resources and easily yields inaccurate identification results.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:
in a first aspect, an embodiment of the present invention provides a method for training a voiceprint identification model, including:
determining a target phoneme sample from phonemes of a voice sample, wherein the phonemes of the voice sample are labeled with speaker labels in advance;
generating a broadband spectrogram and a narrow-band spectrogram of phonemes in the target phoneme sample;
acquiring first sample characteristic information of the broadband spectrogram and second sample characteristic information of the narrow-band spectrogram;
and performing model training by adopting a preset neural network architecture according to the first sample characteristic information and the second sample characteristic information to obtain the voiceprint identification model with the neural network architecture.
Optionally, the determining a target phoneme sample from phonemes of the speech sample includes:
selecting a reference phoneme sample from phonemes of the speech sample;
determining phonemes with the same speaker label from the phonemes of the voice sample as the homogeneous phoneme sample according to the phonemes in the reference phoneme sample;
determining phonemes with different speaker labels from the phonemes of the voice sample as heterogeneous phoneme samples according to the phonemes in the reference phoneme sample;
the target phoneme sample includes: the reference phoneme sample, the homogeneous phoneme sample, and the heterogeneous phoneme sample.
Optionally, the generating a wideband spectrogram and a narrowband spectrogram of a phoneme in the target phoneme sample includes:
performing data enhancement on phonemes in the target phoneme sample;
and drawing the wide-band spectrogram and the narrow-band spectrogram of the phoneme subjected to data enhancement.
Optionally, the drawing the wide-band spectrogram and the narrow-band spectrogram of the data-enhanced phoneme includes:
framing the data-enhanced phoneme to obtain a plurality of phoneme frames;
windowing each phoneme frame according to the frame length of each phoneme frame to obtain a windowed phoneme frame;
carrying out Fourier transform on the windowed phoneme frame to obtain a frequency domain phoneme frame;
calculating the energy of the frequency domain phoneme frame on a frequency scale;
and drawing the broadband spectrogram and the narrow-band spectrogram according to the energy of the frequency scale.
Optionally, the drawing the wide-band spectrogram and the narrow-band spectrogram according to the energy of the frequency scale includes:
integrating the energy of the frequency scale into a two-dimensional matrix;
and carrying out gray mapping on the two-dimensional matrix to obtain the broadband spectrogram and the narrow-band spectrogram.
Optionally, the first sample feature information includes: formant and power spectrum information;
the second sample characteristic information includes: fundamental frequency and harmonic information.
In a second aspect, an embodiment of the present invention provides a voiceprint authentication method, including:
extracting a plurality of identification phonemes of the speech to be identified;
generating a broadband spectrogram and a narrow-band spectrogram of each identification phoneme;
acquiring first identification characteristic information of the broadband spectrogram and second identification characteristic information of the narrow-band spectrogram;
performing voiceprint identification by adopting a voiceprint identification model according to the first identification characteristic information and the second identification characteristic information to obtain a voiceprint identification result;
the voiceprint identification result is used for indicating whether the multiple voices to be identified come from the same speaker; the voiceprint identification model is obtained by adopting the training method of any one of the first aspect.
In a third aspect, an embodiment of the present invention further provides a training apparatus for a voiceprint identification model, including:
the system comprises a determining module, a judging module and a judging module, wherein the determining module is used for determining a target phoneme sample from phonemes of a voice sample, and phonemes of the voice sample are labeled with speaker labels in advance;
the generating module is used for generating a broadband spectrogram and a narrowband spectrogram of the phonemes in the target phoneme sample;
the acquisition module is used for acquiring first sample characteristic information of the broadband spectrogram and second sample characteristic information of the narrow-band spectrogram;
and the training module is used for carrying out model training by adopting a preset neural network architecture according to the first sample characteristic information and the second sample characteristic information to obtain the voiceprint identification model with the neural network architecture.
Optionally, the determining module is configured to select a reference phoneme sample from phonemes of the speech sample; determine phonemes with the same speaker label from the phonemes of the voice sample as the homogeneous phoneme sample according to the phonemes in the reference phoneme sample; and determine phonemes with different speaker labels from the phonemes of the voice sample as the heterogeneous phoneme sample according to the phonemes in the reference phoneme sample. The target phoneme sample includes: the reference phoneme sample, the homogeneous phoneme sample, and the heterogeneous phoneme sample.
Optionally, the generating module is further configured to perform data enhancement on phonemes in the target phoneme sample; and drawing the wide-band spectrogram and the narrow-band spectrogram of the phoneme subjected to data enhancement.
Optionally, the generating module is further configured to perform framing processing on the data-enhanced phoneme to obtain a plurality of phoneme frames; windowing each phoneme frame according to the frame length of each phoneme frame to obtain a windowed phoneme frame; carrying out Fourier transform on the windowed phoneme frame to obtain a frequency domain phoneme frame; calculating the energy of the frequency domain phoneme frame on a frequency scale; and drawing the broadband spectrogram and the narrow-band spectrogram according to the energy of the frequency scale.
Optionally, the generating module is further configured to integrate the energy of the frequency scale into a two-dimensional matrix; and carrying out gray mapping on the two-dimensional matrix to obtain the broadband spectrogram and the narrow-band spectrogram.
Optionally, the first sample feature information includes: formant and power spectrum information;
the second sample characteristic information includes: fundamental frequency and harmonic information.
In a fourth aspect, an embodiment of the present invention further provides a voiceprint authentication apparatus, including:
the extraction module is used for extracting a plurality of identification phonemes of the speech to be identified;
the generating module is used for generating a broadband spectrogram and a narrow-band spectrogram of each identification phoneme;
the acquisition module is used for acquiring first identification characteristic information of the broadband spectrogram and second identification characteristic information of the narrow-band spectrogram;
the identification module is used for carrying out voiceprint identification by adopting a voiceprint identification model according to the first identification characteristic information and the second identification characteristic information to obtain a voiceprint identification result;
the voiceprint identification result is used for indicating whether the multiple voices to be identified are the same speaker or not; the voiceprint identification model is obtained by adopting the training method of any one of the first aspect.
In a fifth aspect, an embodiment of the present invention further provides a processing device, including a memory and a processor, wherein the memory stores a computer program executable by the processor, and the processor, when executing the computer program, implements the method of any one of the first and second aspects.
In a sixth aspect, an embodiment of the present invention further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is read and executed, the method according to any one of the first and second aspects is implemented.
The invention has the beneficial effects that: the embodiment of the application provides a voiceprint identification model training method, which comprises the following steps: determining a target phoneme sample from phonemes of a voice sample, wherein the phonemes of the voice sample are labeled with speaker labels in advance; generating a broadband spectrogram and a narrow-band spectrogram of the phonemes in the target phoneme sample; acquiring first sample characteristic information of the broadband spectrogram and second sample characteristic information of the narrow-band spectrogram; and performing model training with a preset neural network architecture according to the first sample characteristic information and the second sample characteristic information to obtain a voiceprint identification model with the neural network architecture. Training on the first sample characteristic information of the broadband spectrogram and the second sample characteristic information of the narrow-band spectrogram yields a voiceprint identification model with a neural network architecture; voiceprint identification of speech to be identified can then be performed with this model, which reduces the waste of human resources and improves the objectivity and accuracy of voiceprint identification.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of a method for training a voiceprint authentication model according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a training method for a voiceprint authentication model according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a training method for a voiceprint authentication model according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a training method for a voiceprint authentication model according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for training a voiceprint authentication model according to an embodiment of the present invention;
fig. 6 is a schematic flow chart of a voiceprint authentication method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a training apparatus for a voiceprint authentication model according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a voiceprint authentication apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present application, it should be noted that terms such as "upper" and "lower", which indicate an orientation or positional relationship based on the drawings or on how the product of the application is usually placed in use, are used only for convenience in describing the application and simplifying the description; they do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the application.
Furthermore, the terms "first," "second," and the like in the description and in the claims, as well as in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
The present application is directed to the problems in the related art that identifying voiceprints by human ear wastes human resources and easily yields inaccurate identification results. The embodiment of the application therefore provides a voiceprint identification model training method, which comprises the following steps: determining a target phoneme sample from phonemes of a voice sample, wherein the phonemes of the voice sample are labeled with speaker labels in advance; generating a broadband spectrogram and a narrow-band spectrogram of the phonemes in the target phoneme sample; acquiring first sample characteristic information of the broadband spectrogram and second sample characteristic information of the narrow-band spectrogram; and performing model training with a preset neural network architecture according to the first sample characteristic information and the second sample characteristic information to obtain a voiceprint identification model with the neural network architecture. Training on the first sample characteristic information of the broadband spectrogram and the second sample characteristic information of the narrow-band spectrogram yields a voiceprint identification model with a neural network architecture; voiceprint identification of speech to be identified can then be performed with this model, which reduces the waste of human resources and improves the objectivity and accuracy of voiceprint identification.
In the voiceprint identification model training method provided in the embodiment of the present application, the execution subject may be a processing device; the processing device may be a terminal or a server, or another type of device with processing capability, which is not specifically limited in the embodiment of the present application. The following describes the training method of the voiceprint identification model provided in the embodiment of the present application with the processing device as the execution subject.
Fig. 1 is a schematic flowchart of a method for training a voiceprint authentication model according to an embodiment of the present invention, and as shown in fig. 1, the method may include:
s101, determining a target phoneme sample from phonemes of the voice sample.
Wherein, the phoneme of the voice sample is labeled with a speaker label in advance.
In some embodiments, the processing device may obtain a plurality of voice samples with speaker tags, extract phonemes from each voice sample to obtain phonemes of the plurality of voice samples, and classify and integrate the phonemes of the plurality of voice samples to obtain a plurality of target phoneme samples.
It should be noted that a phoneme is the smallest speech unit, divided according to the natural attributes of speech; it can be analyzed in terms of the pronunciation actions within a syllable, with one action constituting one phoneme. The phonemes in the target phoneme sample may be phoneme pairs.
S102, generating a broadband spectrogram and a narrow-band spectrogram of the phonemes in the target phoneme sample.
Wherein, one phoneme can correspond to one wide-band spectrogram and one narrow-band spectrogram respectively, and the information characterized by the wide-band spectrogram and the narrow-band spectrogram can be different.
Optionally, the wide-band spectrogram can clearly display the formant structure and the spectral envelope and can reflect the rapid time-varying process of the frequency spectrum; the narrow-band spectrogram can clearly display the structure of the harmonics and reflect the time-varying process of the fundamental frequency.
In one possible implementation, the processing device may generate a wideband spectrogram and a narrowband spectrogram of each phoneme in the target phoneme sample by using a preset spectrogram generation rule. The preset spectrogram generating rule may include a preset broadband spectrogram generating rule and a preset narrowband spectrogram generating rule, and the corresponding spectrogram may be generated by using the corresponding rule.
S103, obtaining first sample characteristic information of the broadband spectrogram and second sample characteristic information of the narrow-band spectrogram.
Since the phonemes of the voice sample are labeled with speaker labels, the target phoneme sample determined from these phonemes also carries speaker labels; the broadband spectrogram and narrow-band spectrogram of each phoneme therefore correspond to a speaker label, and the first sample characteristic information and the second sample characteristic information carry speaker labels as well.
In one possible implementation, the processing device may extract first sample feature information from the wide-band spectrogram and extract second sample feature information from the narrow-band spectrogram, respectively, to obtain the first sample feature information and the second sample feature information of each phoneme. The first sample feature information and the second sample feature information of the same phoneme may then be spliced.
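For illustration only, the following is a minimal sketch of such a splice under the assumption that the two kinds of characteristic information are plain feature vectors; all names and values are dummies, not from the patent:

```python
import numpy as np

# hypothetical per-phoneme feature vectors; all values are dummies
wideband_feats = np.array([820.0, 1530.0, 2480.0])  # e.g. formant / power spectrum info
narrowband_feats = np.array([118.0, 236.0, 354.0])  # e.g. fundamental / harmonic info
spliced = np.concatenate([wideband_feats, narrowband_feats])  # joint model input
```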
And S104, performing model training by adopting a preset neural network architecture according to the first sample characteristic information and the second sample characteristic information to obtain a voiceprint identification model with the neural network architecture.
Optionally, the preset neural network architecture may be a triplet (ternary) convolutional neural network architecture. Such an architecture has the advantage of being able to fit arbitrary nonlinear functions, modeling the feature extraction, feature transfer and perception of external information that occur in human neurons.
In some embodiments, the processing device may input the first sample characteristic information and the second sample characteristic information of each phoneme in the target phoneme sample into the preset neural network architecture, perform forward calculation layer by layer, calculate the loss function for the round of training, back-propagate the gradients through the neural network architecture, and update the network parameters; training is complete when the training result meets a preset condition, yielding a voiceprint identification model with the neural network architecture.
The training result meets the preset condition when, for example, the loss function no longer changes, or an evaluation value calculated during training is smaller than or equal to a preset evaluation value.
In summary, the embodiment of the present application provides a voiceprint identification model training method, including: determining a target phoneme sample from phonemes of a voice sample, wherein the phonemes of the voice sample are labeled with speaker labels in advance; generating a broadband spectrogram and a narrow-band spectrogram of the phonemes in the target phoneme sample; acquiring first sample characteristic information of the broadband spectrogram and second sample characteristic information of the narrow-band spectrogram; and performing model training with a preset neural network architecture according to the first sample characteristic information and the second sample characteristic information to obtain a voiceprint identification model with the neural network architecture. Training on the first sample characteristic information of the broadband spectrogram and the second sample characteristic information of the narrow-band spectrogram yields a voiceprint identification model with a neural network architecture; voiceprint identification of speech to be identified can then be performed with this model, which reduces the waste of human resources and improves the objectivity and accuracy of voiceprint identification.
Optionally, fig. 2 is a schematic flowchart of a training method for a voiceprint identification model according to an embodiment of the present invention, and as shown in fig. 2, the process of determining a target phoneme sample from phonemes of a speech sample in S101 may include:
s201, selecting a reference phoneme sample from phonemes of the voice sample.
It should be noted that the processing device may randomly select the reference phoneme sample from the phonemes of the speech sample. Of course, the processing device may also select the reference phoneme sample based on other selection rules, which is not specifically limited in this embodiment.
S202, determining, according to the phonemes in the reference phoneme sample, phonemes with the same speaker tag from the phonemes of the voice sample as the homogeneous phoneme sample.
S203, determining, according to the phonemes in the reference phoneme sample, phonemes with different speaker tags from the phonemes of the voice sample as the heterogeneous phoneme sample.
The reference phoneme sample and the homogeneous phoneme sample may be a phoneme pair having the same speaker tag, and the reference phoneme sample and the heterogeneous phoneme sample may be a phoneme pair having different speaker tags. The target phoneme sample may include: a reference phoneme sample, a homogeneous phoneme sample, and a heterogeneous phoneme sample. The reference phoneme sample, the homogeneous phoneme sample, and the heterogeneous phoneme sample may constitute a triplet.
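As an illustration (not the patent's implementation), the following sketch draws such a triplet from a list of (phoneme, speaker label) pairs, an assumed data layout; it also assumes every speaker label appears on at least two phonemes:

```python
import random

def sample_triplet(phonemes):
    """Draw a (reference, homogeneous, heterogeneous) phoneme triplet.

    phonemes: list of (phoneme_data, speaker_label) pairs (assumed structure)."""
    reference, ref_label = random.choice(phonemes)  # randomly selected reference sample
    same = [p for p, label in phonemes if label == ref_label and p is not reference]
    diff = [p for p, label in phonemes if label != ref_label]
    homogeneous = random.choice(same)    # same speaker label as the reference
    heterogeneous = random.choice(diff)  # different speaker label
    return reference, homogeneous, heterogeneous
```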
In the embodiment of the present application, the loss function in the training process may be a triplet loss:

L = max(d+ - d- + α, 0)

wherein x_a denotes the reference sample, x_p denotes the homogeneous phoneme sample, x_n denotes the heterogeneous phoneme sample, d+ represents the loss between the homogeneous phoneme sample and the reference sample, d- represents the loss between the heterogeneous phoneme sample and the reference sample, and α is the margin.

It should be noted that d+ = ‖f(x_a) - f(x_p)‖², wherein f(x_a) is the feature of the reference sample and f(x_p) is the feature of the homogeneous phoneme sample.

In addition, d- = ‖f(x_a) - f(x_n)‖², wherein f(x_a) is the feature of the reference sample and f(x_n) is the feature of the heterogeneous phoneme sample. This loss function makes target phoneme samples of the same category closer together in the feature space and target phoneme samples of different categories farther apart, so that the margin maintains a certain boundary between the same-category and different-category distances.
In the embodiment of the present application, the preset neural network architecture may include an input layer, a convolutional layer, a nonlinear activation layer, a pooling layer, and a final convolutional layer, and an Adam (adaptive moment estimation) optimizer may be used. The input of the input layer can be the first sample characteristic information and the second sample characteristic information. The pooling layer reduces the feature dimension, which compresses the number of parameters, reduces overfitting, and at the same time improves the robustness of the model. The features can be further refined by repeating the convolution, nonlinear activation and pooling layers a second and a third time, and finally attaching one more convolutional layer.
In addition, in the training process, the weights corresponding to the reference phoneme sample, the homogeneous phoneme sample and the heterogeneous phoneme sample can be shared, as in the sketch below.
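The following is a minimal PyTorch sketch, not the patent's actual implementation, of such a shared-weight triplet convolutional network trained with a margin-based triplet loss and an Adam optimizer; the layer sizes, channel counts, input shape (stacked wide-band and narrow-band feature maps), and margin value are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class EmbeddingCNN(nn.Module):
    """One branch of the triplet network (hypothetical layer sizes).

    The same instance is applied to the reference, homogeneous and
    heterogeneous inputs, so its weights are shared across the three."""
    def __init__(self, in_channels: int = 2, embed_dim: int = 128):
        super().__init__()
        # conv -> nonlinear activation -> pooling, repeated three times,
        # followed by one final convolutional layer, as described above
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, embed_dim, 1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x).flatten(1)  # (batch, embed_dim)

net = EmbeddingCNN()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)  # Adam, as in the text
criterion = nn.TripletMarginLoss(margin=0.2)             # margin value assumed

# dummy batches: stacked wide-band and narrow-band feature maps (2 channels)
anchor = torch.randn(8, 2, 64, 64)    # reference phoneme samples
positive = torch.randn(8, 2, 64, 64)  # homogeneous phoneme samples
negative = torch.randn(8, 2, 64, 64)  # heterogeneous phoneme samples

loss = criterion(net(anchor), net(positive), net(negative))
optimizer.zero_grad()
loss.backward()    # back-propagate the gradients
optimizer.step()   # update the network parameters
```

Because the three branches reuse one EmbeddingCNN instance, the weights for the reference, homogeneous and heterogeneous samples are automatically shared, matching the weight-sharing note above.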
Optionally, fig. 3 is a schematic flow chart of a training method of a voiceprint identification model according to an embodiment of the present invention, and as shown in fig. 3, the process of generating a wideband spectrogram and a narrowband spectrogram of a phoneme in a target phoneme sample in S102 may include:
s301, performing data enhancement on the phonemes in the target phoneme sample.
The data enhancement is carried out on the phonemes in the target phoneme sample, so that the training data volume can be increased, and the robustness and the accuracy can be enhanced.
In some embodiments, the processing device may perform data enhancement on the phonemes in the target phoneme sample by random perturbation within preset ranges. The phonemes can be processed by adjusting volume, speech speed, reverberation, noise, and the like, and the magnitude of the random perturbation can be controlled.
It should be noted that the preset range corresponding to the adjusted volume may be 0 to +10dB (decibel), the preset range corresponding to the speech rate may be 0.95 to 1.05 times the speech rate, the preset range corresponding to the RT60 (reverberation time) of reverberation may be 0 to 1.3s (second), and the preset range corresponding to the noise may be 0 to +15dB (decibel).
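For illustration, a sketch that draws random augmentation parameters within the preset ranges above; the parameter names are assumptions, and the audio processing itself is left abstract:

```python
import random

def sample_augmentation_params():
    """Draw random augmentation parameters within the preset ranges above."""
    return {
        "gain_db": random.uniform(0.0, 10.0),       # volume: 0 to +10 dB
        "speed": random.uniform(0.95, 1.05),        # speech rate: 0.95x to 1.05x
        "rt60_s": random.uniform(0.0, 1.3),         # reverberation RT60: 0 to 1.3 s
        "noise_db": random.uniform(0.0, 15.0),      # noise level: 0 to +15 dB (interpretation assumed)
    }
```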
S302, drawing a broadband spectrogram and a narrow-band spectrogram of the phoneme subjected to data enhancement.
In this embodiment of the application, the processing device may use a preset wideband spectrogram drawing rule and a preset narrowband spectrogram drawing rule to draw a corresponding wideband spectrogram and a corresponding narrowband spectrogram according to the data-enhanced phoneme.
In summary, data enhancement is performed on the phonemes in the target phoneme sample, and the broadband spectrogram and narrow-band spectrogram of the data-enhanced phonemes are drawn. This makes the trained voiceprint identification model more reliable and robust, which in turn makes the output voiceprint identification result more accurate.
Optionally, fig. 4 is a schematic flow chart of a training method of a voiceprint identification model according to an embodiment of the present invention, and as shown in fig. 4, the process of drawing the data-enhanced wideband spectrogram and narrowband spectrogram of a phoneme in S302 may include:
s401, framing the phonemes after the data enhancement is carried out, and a plurality of phoneme frames are obtained.
The processing device may determine a frame length and a frame shift of the wideband speech spectrogram and a frame length and a frame shift of the narrowband speech spectrogram, and then may perform framing processing on the data-enhanced phonemes according to the corresponding frame lengths and frame shifts to obtain two groups of phoneme frames, one group of phoneme frames corresponding to the wideband speech spectrogram and one group of phoneme frames corresponding to the narrowband speech spectrogram, where each group of phoneme frames includes a plurality of phoneme frames.
It should be noted that the frame length and the frame shift satisfy a predetermined relationship: the frame shift may be greater than or equal to a preset proportion of the frame length, where the preset proportion may be 50%, 75%, or another value, which is not specifically limited in this embodiment of the application.
In the embodiment of the application, the wide-band spectrogram has higher time resolution but lower frequency resolution, while the narrow-band spectrogram has lower time resolution but higher frequency resolution. The reason is that the analysis-frame lengths of the two spectrograms differ, which leads to different bandwidths; the wide-band and narrow-band spectrograms are named accordingly.
In some embodiments, taking the Gaussian analysis window as an example (other windows behave similarly), the -3 dB bandwidth d of the analysis frame (in Hertz) and the frame length l (in seconds) satisfy the approximate relationship

d = 1.3 / l

In this embodiment of the application, the bandwidths of the wide-band spectrogram and the narrow-band spectrogram may be 260 Hz and 43 Hz respectively, so the frame length of the wide-band spectrogram may be 0.005 second and the frame length of the narrow-band spectrogram may be 0.03 second. Correspondingly, when the frame shift is 50% of the frame length, the frame shift of the wide-band spectrogram can be 0.0025 second and the frame shift of the narrow-band spectrogram can be 0.015 second.
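A quick numeric check of this relationship (a sketch; the constant 1.3 is implied by the 260 Hz / 0.005 s and 43 Hz / 0.03 s pairs above):

```python
def gaussian_minus_3db_bandwidth_hz(frame_len_s: float) -> float:
    """-3 dB bandwidth d (Hz) of a Gaussian analysis window of length l (s)."""
    return 1.3 / frame_len_s

print(gaussian_minus_3db_bandwidth_hz(0.005))  # 260.0 Hz -> wide-band spectrogram
print(gaussian_minus_3db_bandwidth_hz(0.03))   # ~43.3 Hz -> narrow-band spectrogram
```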
S402, according to the frame length of each phoneme frame, windowing is carried out on each phoneme frame to obtain a windowed phoneme frame.
The frame length of each phoneme frame in one group of phoneme frames can be the same, and the frame length of each phoneme frame in the other group of phoneme frames can also be the same. The windowed phoneme frame includes: the windowed phoneme frame corresponding to the broadband spectrogram and the windowed phoneme frame corresponding to the narrow-band spectrogram.
In one possible implementation, the processing device may use a frame length of a plurality of phoneme frames of the wideband spectrogram as a window length corresponding to the wideband spectrogram; and taking the frame length of the multiple phoneme frames of the narrow-band spectrogram as the window length corresponding to the narrow-band spectrogram. Then, windowing each phoneme frame corresponding to the broadband spectrogram by adopting a preset window according to the window length corresponding to the broadband spectrogram to obtain a windowed phoneme frame corresponding to the broadband spectrogram; and windowing each phoneme frame corresponding to the narrow-band spectrogram by adopting a preset window according to the window length corresponding to the narrow-band spectrogram to obtain a windowed phoneme frame corresponding to the narrow-band spectrogram.
It should be noted that common analysis windows include: Bartlett, Blackman, Bartlett-Hann, Blackman-Harris, Bohman, Flattop, Gauss, Hamming, Hann, Nuttall, Parzen, Rectangular, and Triangular windows.
Optionally, the preset window used in the embodiment of the present application may be a Hamming window, and the windowed phoneme frame may be determined by multiplying the phoneme frame by the window function of the preset window. When the preset window is a Hamming window, the corresponding window function may be:

w(n) = 0.54 - 0.46 · cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1

wherein n is the sample index in the phoneme frame, and N is the total number of sample points in one phoneme frame.
In the embodiment of the application, the broadband spectrogram and the narrowband spectrogram can be mutually converted only by changing the window length by using similar steps, and the reusability is high.
And S403, carrying out Fourier transform on the windowed phoneme frame to obtain a frequency domain phoneme frame.
Wherein the frequency domain phoneme frame may include: the frequency domain phoneme frame corresponding to the broadband spectrogram and the frequency domain phoneme frame corresponding to the narrow-band spectrogram.
In some embodiments, the processing device may perform a Discrete Fourier Transform (DFT) on the windowed frames of phonemes to convert the time domain frames of phonemes into frequency domain frames of phonemes.
It should be noted that the above discrete Fourier transform formula may be:

X(k) = sum_{n=0}^{N-1} x(n) · e^{-j2πkn/N}, k = 0, 1, …, N - 1

wherein x(n) is the windowed phoneme frame, X(k) is the frequency domain phoneme frame, n is the sample index in the phoneme frame, and N is the total number of sample points in one phoneme frame.
In the embodiment of the present application, the number of samples of each windowed phoneme frame is padded to an integral power of 2 so that the Fast Fourier Transform (FFT) can be used to accelerate the calculation. For example, at a sampling rate of 8000 Hz, the frames of the wide-band and narrow-band spectrograms contain 40 and 240 samples respectively, and zero padding at the end of each windowed phoneme frame brings them to 64 and 256 samples respectively. A sketch of this framing, windowing and FFT procedure is given below.
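Putting S401 to S403 together, the following NumPy sketch frames a signal, applies a Hamming window and takes a zero-padded FFT; the sampling rate, frame lengths, frame shifts and FFT sizes follow the values above, while everything else (function name, signal layout) is an illustrative assumption:

```python
import numpy as np

def frames_to_spectra(signal: np.ndarray, sr: int = 8000,
                      frame_len_s: float = 0.005,
                      frame_shift_s: float = 0.0025,
                      n_fft: int = 64) -> np.ndarray:
    """Frame a signal, apply a Hamming window, take a zero-padded FFT.

    The defaults match the wide-band settings above; use 0.03 s / 0.015 s
    / 256 for the narrow-band spectrogram."""
    frame_len = int(sr * frame_len_s)      # 40 samples in the wide-band case
    frame_shift = int(sr * frame_shift_s)  # 20 samples (50% of the frame length)
    window = np.hamming(frame_len)         # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = signal[start:start + frame_len] * window  # windowed phoneme frame
        spectra.append(np.fft.rfft(frame, n=n_fft))       # zero-padded FFT
    return np.array(spectra)  # (num_frames, n_fft // 2 + 1), complex
```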
S404, calculating the energy of the frequency domain phoneme frame on the frequency scale.
Wherein, the energy of the frequency scale can include: the energy of the frequency scale corresponding to the broadband spectrogram and the energy of the frequency scale corresponding to the narrow-band spectrogram.
In this embodiment of the present application, performing Fourier transform on a windowed phoneme frame yields a frequency domain phoneme frame: a group of complex numbers that represent the amplitude and phase at different frequency scales. Since a spectrogram displays energy on the frequency scales, the energy of the frequency domain phoneme frame on each frequency scale can be calculated with a preset formula.
In addition, the preset formula may be:

E(k) = X_R(k)² + X_I(k)²

wherein X_R(k) and X_I(k) are respectively the real part and the imaginary part of the complex value X(k).
S405, drawing a broadband spectrogram and a narrow-band spectrogram according to the energy of the frequency scale.
In this embodiment, the processing device may draw the wideband spectrogram according to the energy of the frequency scale corresponding to the wideband spectrogram, and draw the narrowband spectrogram according to the energy of the frequency scale corresponding to the narrowband spectrogram.
Optionally, fig. 5 is a schematic flow chart of a training method for a voiceprint identification model according to an embodiment of the present invention, and as shown in fig. 5, the process of drawing the wide-band spectrogram and the narrow-band spectrogram according to the energy of the frequency scale in S405 may include:
and S501, integrating the energy of the frequency scale into a two-dimensional matrix.
In some embodiments, the processing device may integrate the energy of the frequency scale corresponding to the wideband spectrogram into a two-dimensional matrix corresponding to the wideband spectrogram; and integrating the energy of the frequency scale corresponding to the narrow-band spectrogram into a two-dimensional matrix corresponding to the narrow-band spectrogram.
S502, gray mapping is carried out on the two-dimensional matrix to obtain a broadband spectrogram and a narrow-band spectrogram.
Wherein the two-dimensional matrix may comprise energy values.
In a possible implementation manner, the energy value in the two-dimensional matrix corresponding to the broadband spectrogram is subjected to gray mapping by using the maximum energy value, the minimum energy value and a preset energy range to obtain the broadband spectrogram. The energy value in the two-dimensional matrix corresponding to the narrow-band spectrogram can be subjected to gray mapping by adopting the maximum energy value, the minimum energy value and a preset energy range to obtain the narrow-band spectrogram.
In addition, the ordinate of the broadband spectrogram and the narrowband spectrogram can be frequency scale, and the abscissa can be time.
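Continuing the sketch for S404 through S502, energy is computed from the complex spectra, integrated into a two-dimensional matrix and gray-mapped; the dB conversion and the 0 to 255 gray range are assumptions, since the text only specifies mapping with the maximum and minimum energy values:

```python
import numpy as np

def spectra_to_gray_image(spectra: np.ndarray) -> np.ndarray:
    """Turn complex frame spectra into a grayscale spectrogram image (sketch).

    spectra: (num_frames, num_bins) complex array, e.g. from frames_to_spectra()."""
    # S404: energy on each frequency scale, E(k) = X_R(k)^2 + X_I(k)^2
    energy = spectra.real ** 2 + spectra.imag ** 2
    # S501: integrate into a two-dimensional matrix (frequency scale x time);
    # the log/dB scale here is an assumption
    matrix = 10.0 * np.log10(energy.T + 1e-10)
    # S502: gray mapping with the maximum and minimum energy values
    lo, hi = matrix.min(), matrix.max()
    gray = 255.0 * (matrix - lo) / (hi - lo + 1e-10)
    # invert so that larger energy appears darker, as described in the text below
    return (255 - gray).astype(np.uint8)
```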
Optionally, the first sample feature information includes: formant and power spectrum information; the second sample characteristic information includes: fundamental frequency and harmonic information.
In the embodiment of the application, a Keyword Spotting (KWS) search technique can be used to locate the time of a vowel phoneme in the wide-band spectrogram; the frequency of each formant can then be determined by observing the darker regions of the wide-band spectrogram, and the trend of the power spectrum can be read from the shade of the color.
It should be noted that the narrow-band spectrogram mainly displays fundamental frequency and harmonic information. The fundamental frequency is the vibration frequency of the vocal cords when the original sound is produced, and the lowest horizontal stripe in the narrow-band spectrogram corresponds to the fundamental frequency. The frequencies of the harmonics are integer multiples of the fundamental frequency, so the harmonics divide into odd and even harmonics, and the timbre of a voice is closely related to the fundamental frequency and the harmonics.
In addition, the wide-band spectrogram can be vertically distributed, and the narrow-band spectrogram can be horizontally distributed. The darker the colors of the wide band spectrogram and the narrow band spectrogram, the larger the corresponding energy, and the black area represents a phoneme segment.
In the voiceprint authentication method provided in the embodiment of the present application, the execution subject may be a processing device; the processing device may be a terminal or a server, or another type of device with processing capability, which is not specifically limited in the embodiment of the present application. The following describes the voiceprint authentication method provided in the embodiment of the present application with the processing device as the execution subject.
In summary, the embodiment of the present application provides a voiceprint identification model training method, including: determining a target phoneme sample from phonemes of a voice sample, wherein the phonemes of the voice sample are labeled with speaker labels in advance; generating a broadband spectrogram and a narrow-band spectrogram of the phonemes in the target phoneme sample; acquiring first sample characteristic information of the broadband spectrogram and second sample characteristic information of the narrow-band spectrogram; and performing model training with a preset neural network architecture according to the first sample characteristic information and the second sample characteristic information to obtain a voiceprint identification model with the neural network architecture. Training on these two kinds of characteristic information yields a voiceprint identification model that can perform voiceprint identification on speech to be identified, which reduces the waste of human resources and improves the objectivity and accuracy of voiceprint identification. Moreover, because the first sample characteristic information and the second sample characteristic information are determined from the broadband spectrogram and the narrow-band spectrogram, the determined characteristic information is more accurate; with bandwidths of 260 Hz and 43 Hz for the wide-band and narrow-band spectrograms respectively, the identification result is more accurate; and performing data enhancement on the phonemes in the target phoneme sample further improves accuracy and robustness.
Fig. 6 is a schematic flow chart of a voiceprint authentication method according to an embodiment of the present invention, and as shown in fig. 6, the voiceprint authentication method may include:
s801, extracting a plurality of identification phonemes of the speech to be identified.
The multiple voices to be identified may be voices of the same speaker or voices of different speakers.
In some embodiments, the number of voices to be identified may be two. If one voice to be identified is labeled with a speaker tag and the other is not, the voiceprint identification model can identify whether the speaker of the unlabeled voice is the speaker of the labeled voice.
S802, generating a wide-band spectrogram and a narrow-band spectrogram of each identification phoneme.
It should be noted that the generation of the wide-band spectrogram and the narrow-band spectrogram of each identified phoneme in S802 is similar to the generation of the wide-band spectrogram and the narrow-band spectrogram of the phoneme in the target phoneme sample in S102, and is not repeated here.
S803, first identification characteristic information of the broadband spectrogram and second identification characteristic information of the narrow-band spectrogram are obtained.
It should be noted that the process of S803 is similar to the process of S103 described above, and is not described here again.
And S804, according to the first identification characteristic information and the second identification characteristic information, performing voiceprint identification by adopting a voiceprint identification model to obtain a voiceprint identification result.
The voiceprint identification result is used for indicating whether the multiple voices to be identified come from the same speaker; the voiceprint identification model is obtained by adopting the training method of any one of the above figures 1 to 5.
In a possible implementation manner, the processing device may splice the first identification characteristic information and the second identification characteristic information of the same identification phoneme and input the spliced information into the voiceprint identification model; the network extracts features layer by layer and then outputs a voiceprint identification result. The voiceprint identification result can indicate that the multiple voices to be identified are homogeneous or heterogeneous: when they are homogeneous, they come from the same speaker; when they are heterogeneous, they do not. A sketch of such an identification step is given below.
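As an illustration of this step (reusing the hypothetical embedding model from the training sketch; the distance threshold is an assumed, untuned value, not from the patent):

```python
import torch

def same_speaker(model: torch.nn.Module, feats_a: torch.Tensor,
                 feats_b: torch.Tensor, threshold: float = 0.8) -> bool:
    """Decide whether two utterances are homogeneous, i.e. the same speaker.

    feats_a / feats_b: spliced wide-band + narrow-band feature tensors of
    shape (1, C, H, W); threshold is an assumed value."""
    with torch.no_grad():
        emb_a = model(feats_a)  # features extracted by the network layer by layer
        emb_b = model(feats_b)
    distance = torch.norm(emb_a - emb_b, dim=1).item()  # embedding-space distance
    return distance < threshold  # small distance -> homogeneous (same speaker)
```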
In summary, an embodiment of the present invention provides a voiceprint identification method, including: extracting a plurality of identification phonemes of the speech to be identified; generating a wide-band spectrogram and a narrow-band spectrogram of each identification phoneme; acquiring first identification characteristic information of the wide-band spectrogram and second identification characteristic information of the narrow-band spectrogram; and performing voiceprint identification with a voiceprint identification model according to the first identification characteristic information and the second identification characteristic information to obtain a voiceprint identification result. The voiceprint identification result indicates whether the multiple voices to be identified come from the same speaker. Performing voiceprint identification on the speech to be identified with the trained voiceprint identification model reduces the waste of human resources and improves the objectivity and accuracy of voiceprint identification.
FIG. 7 is a schematic structural diagram of a training apparatus for a voiceprint authentication model according to an embodiment of the present invention; as shown in fig. 7, the training device of the voiceprint identification model may include:
a determining module 901, configured to determine a target phoneme sample from phonemes of a speech sample, where phonemes of the speech sample are labeled with a speaker tag in advance;
a generating module 902, configured to generate a wideband spectrogram and a narrowband spectrogram of a phoneme in a target phoneme sample;
an obtaining module 903, configured to obtain first sample feature information of a wideband spectrogram and second sample feature information of a narrowband spectrogram;
and a training module 904, configured to perform model training by using a preset neural network architecture according to the first sample feature information and the second sample feature information, so as to obtain a voiceprint identification model with the neural network architecture.
Optionally, the determining module 901 is configured to select a reference phoneme sample from phonemes of the speech sample; determine phonemes with the same speaker label from the phonemes of the voice sample as the homogeneous phoneme sample according to the phonemes in the reference phoneme sample; and determine phonemes with different speaker labels from the phonemes of the voice sample as the heterogeneous phoneme sample according to the phonemes in the reference phoneme sample. The target phoneme sample includes: the reference phoneme sample, the homogeneous phoneme sample, and the heterogeneous phoneme sample.
Optionally, the generating module 902 is further configured to perform data enhancement on phonemes in the target phoneme sample; and drawing a broadband spectrogram and a narrow-band spectrogram of the phoneme after data enhancement.
Optionally, the generating module 902 is further configured to perform framing processing on the data-enhanced phoneme to obtain a plurality of phoneme frames; windowing each phoneme frame according to the frame length of each phoneme frame to obtain a windowed phoneme frame; carrying out Fourier transform on the windowed phoneme frame to obtain a frequency domain phoneme frame; calculating the energy of the frequency domain phoneme frame on the frequency scale; and drawing a broadband spectrogram and a narrow-band spectrogram according to the energy of the frequency scale.
Optionally, the generating module 902 is further configured to integrate energy of the frequency scale into a two-dimensional matrix; and carrying out gray mapping on the two-dimensional matrix to obtain a broadband spectrogram and a narrowband spectrogram.
Optionally, the first sample feature information includes: formant and power spectrum information;
the second sample characteristic information includes: fundamental frequency and harmonic information.
Fig. 8 is a schematic structural diagram of a voiceprint authentication apparatus according to an embodiment of the present invention. As shown in fig. 8, the voiceprint authentication apparatus may include:
an extracting module 1001 configured to extract a plurality of authentication phonemes of a speech to be authenticated;
a generating module 1002, configured to generate a wideband spectrogram and a narrowband spectrogram of each identification phoneme;
an obtaining module 1003, configured to obtain first identifying feature information of the wideband spectrogram and second identifying feature information of the narrowband spectrogram;
the identification module 1004 is used for performing voiceprint identification by adopting a voiceprint identification model according to the first identification characteristic information and the second identification characteristic information to obtain a voiceprint identification result;
the voiceprint identification result is used for indicating whether a plurality of voices to be identified are the same speaker or not; the voiceprint identification model is obtained by adopting any one of the training methods.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
These modules may be one or more integrated circuits configured to implement the above methods, for example: one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
Fig. 9 is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention. As shown in Fig. 9, the processing apparatus may include a processor 1101 and a memory 1102.
The memory 1102 is used for storing programs, and the processor 1101 calls the programs stored in the memory 1102 to execute the above-described method embodiments. The specific implementation and technical effects are similar, and are not described herein again.
Optionally, the present invention also provides a program product, for example a computer-readable storage medium, comprising a program which, when executed by a processor, carries out the above method embodiments.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of units is only a logical functional division, and other divisions may be adopted in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A training method of a voiceprint authentication model is characterized by comprising the following steps:
determining a target phoneme sample from phonemes of a voice sample, wherein the phonemes of the voice sample are labeled with speaker labels in advance;
generating a broadband spectrogram and a narrow-band spectrogram of phonemes in the target phoneme sample;
acquiring first sample characteristic information of the broadband spectrogram and second sample characteristic information of the narrow-band spectrogram;
and performing model training by adopting a preset neural network architecture according to the first sample characteristic information and the second sample characteristic information to obtain the voiceprint identification model with the neural network architecture.
2. The method of claim 1, wherein the determining a target phoneme sample from phonemes of a voice sample comprises:
selecting a reference phoneme sample from the phonemes of the voice sample;
determining, according to the phonemes in the reference phoneme sample, phonemes of the voice sample that have the same speaker label as the homogeneous phoneme sample;
determining, according to the phonemes in the reference phoneme sample, phonemes of the voice sample that have different speaker labels as the heterogeneous phoneme sample;
the target phoneme sample includes: the reference phoneme sample, the homogeneous phoneme sample, and the heterogeneous phoneme sample.
3. The method of claim 1, wherein the generating a wide-band spectrogram and a narrow-band spectrogram of phonemes in the target phoneme sample comprises:
performing data enhancement on phonemes in the target phoneme sample;
and drawing the wide-band spectrogram and the narrow-band spectrogram of the phoneme subjected to data enhancement.
4. The method of claim 3, wherein the drawing of the wide-band spectrogram and the narrow-band spectrogram of the data-enhanced phonemes comprises:
framing the data-enhanced phoneme to obtain a plurality of phoneme frames;
windowing each phoneme frame according to the frame length of each phoneme frame to obtain a windowed phoneme frame;
carrying out Fourier transform on the windowed phoneme frame to obtain a frequency domain phoneme frame;
calculating the energy of the frequency domain phoneme frame on a frequency scale;
and drawing the broadband spectrogram and the narrow-band spectrogram according to the energy of the frequency scale.
5. The method of claim 4, wherein the drawing of the wide-band spectrogram and the narrow-band spectrogram according to the energy of the frequency scale comprises:
integrating the energy of the frequency scale into a two-dimensional matrix;
and carrying out gray mapping on the two-dimensional matrix to obtain the broadband spectrogram and the narrow-band spectrogram.
6. The method of any of claims 1-5, wherein the first sample characteristic information comprises: formant and power spectrum information;
the second sample characteristic information includes: fundamental frequency and harmonic information.
7. A voiceprint authentication method comprising:
extracting a plurality of identification phonemes of the speech to be identified;
generating a broadband spectrogram and a narrow-band spectrogram of each identification phoneme;
acquiring first identification characteristic information of the broadband spectrogram and second identification characteristic information of the narrow-band spectrogram;
performing voiceprint identification by adopting a voiceprint identification model according to the first identification characteristic information and the second identification characteristic information to obtain a voiceprint identification result;
the voiceprint identification result is used for indicating whether the multiple voices to be identified are from the same speaker; the voiceprint identification model is obtained by the training method of any one of claims 1-6.
8. A training apparatus for a voiceprint authentication model, comprising:
a determining module, configured to determine a target phoneme sample from phonemes of a voice sample, wherein the phonemes of the voice sample are labeled with speaker labels in advance;
a generating module, configured to generate a broadband spectrogram and a narrowband spectrogram of the phonemes in the target phoneme sample;
an acquisition module, configured to acquire first sample characteristic information of the broadband spectrogram and second sample characteristic information of the narrowband spectrogram;
and a training module, configured to perform model training by adopting a preset neural network architecture according to the first sample characteristic information and the second sample characteristic information to obtain the voiceprint identification model with the neural network architecture.
9. A voiceprint authentication apparatus comprising:
an extraction module, configured to extract a plurality of identification phonemes of the speech to be identified;
a generating module, configured to generate a broadband spectrogram and a narrow-band spectrogram of each identification phoneme;
an acquisition module, configured to acquire first identification characteristic information of the broadband spectrogram and second identification characteristic information of the narrow-band spectrogram;
and an identification module, configured to perform voiceprint identification by adopting a voiceprint identification model according to the first identification characteristic information and the second identification characteristic information to obtain a voiceprint identification result;
wherein the voiceprint identification result is used for indicating whether the multiple voices to be identified are from the same speaker, and the voiceprint identification model is obtained by the training method of any one of claims 1-6.
10. An apparatus for voiceprint authentication, comprising: a processor and a memory, the memory storing a computer program executable by the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1-7.
11. A computer-readable storage medium having stored thereon a computer program which, when read and executed by a processor, implements the method of any one of claims 1-7.
CN202011324851.3A 2020-11-24 2020-11-24 Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium Active CN112133289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011324851.3A CN112133289B (en) 2020-11-24 2020-11-24 Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium

Publications (2)

Publication Number Publication Date
CN112133289A CN112133289A (en) 2020-12-25
CN112133289B (en) 2021-02-26

Family

ID=73852251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011324851.3A Active CN112133289B (en) 2020-11-24 2020-11-24 Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium

Country Status (1)

Country Link
CN (1) CN112133289B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327586B (en) * 2021-06-01 2023-11-28 深圳市北科瑞声科技股份有限公司 Voice recognition method, device, electronic equipment and storage medium
CN115831152B (en) * 2022-11-28 2023-07-04 国网山东省电力公司应急管理中心 Sound monitoring device and method for monitoring operation state of emergency equipment generator in real time

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680601A (en) * 2017-10-18 2018-02-09 深圳势必可赢科技有限公司 A kind of identity homogeneity method of inspection retrieved based on sound spectrograph and phoneme and device
CN108766417A (en) * 2018-05-29 2018-11-06 广州国音科技有限公司 A kind of the identity homogeneity method of inspection and device based on phoneme automatically retrieval
US20190371298A1 (en) * 2014-12-15 2019-12-05 Baidu Usa Llc Deep learning models for speech recognition
CN110634490A (en) * 2019-10-17 2019-12-31 广州国音智能科技有限公司 Voiceprint identification method, device and equipment
CN111063342A (en) * 2020-01-02 2020-04-24 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111311327A (en) Service evaluation method, device, equipment and storage medium based on artificial intelligence
CN108806665A (en) Phoneme synthesizing method and device
CN109256138B (en) Identity verification method, terminal device and computer readable storage medium
CN110459241B (en) Method and system for extracting voice features
CN108962231B (en) Voice classification method, device, server and storage medium
Simantiraki et al. Stress detection from speech using spectral slope measurements
CN112133289B (en) Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium
CN110880329A (en) Audio identification method and equipment and storage medium
US20210350791A1 (en) Accent detection method and accent detection device, and non-transitory storage medium
CN109979486B (en) Voice quality assessment method and device
CN114242044B (en) Voice quality evaluation method, voice quality evaluation model training method and device
Yin et al. Automatic cognitive load detection from speech features
KR20210071713A (en) Speech Skill Feedback System
CN113470688B (en) Voice data separation method, device, equipment and storage medium
CN109101956B (en) Method and apparatus for processing image
CN113539243A (en) Training method of voice classification model, voice classification method and related device
Islam et al. Noise-robust text-dependent speaker identification using cochlear models
CN111739509A (en) Electronic book audio generation method, electronic device and storage medium
CN116343813A (en) Chinese speech enhancement method
Albuquerque et al. Automatic no-reference speech quality assessment with convolutional neural networks
CN110675858A (en) Terminal control method and device based on emotion recognition
CN111091816B (en) Data processing system and method based on voice evaluation
Płonkowski Using bands of frequencies for vowel recognition for Polish language
Atkins et al. Visualization of Babble–Speech Interactions Using Andrews Curves
CN113763992A (en) Voice evaluation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant