CN113223536A - Voiceprint recognition method and device and terminal equipment

Info

Publication number
CN113223536A
CN113223536A
Authority
CN
China
Prior art keywords
audio
target
voiceprint
feature vector
neural network
Prior art date
Legal status
Granted
Application number
CN202010062402.XA
Other languages
Chinese (zh)
Other versions
CN113223536B (en)
Inventor
唐延欢
Current Assignee
TCL Research America Inc
Original Assignee
TCL Research America Inc
Priority date
Filing date
Publication date
Application filed by TCL Research America Inc filed Critical TCL Research America Inc
Priority to CN202010062402.XA priority Critical patent/CN113223536B/en
Publication of CN113223536A publication Critical patent/CN113223536A/en
Application granted granted Critical
Publication of CN113223536B publication Critical patent/CN113223536B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/18 Artificial neural networks; Connectionist approaches


Abstract

The application is applicable to the technical field of speech processing, and provides a voiceprint recognition method, a voiceprint recognition device and terminal equipment. The method comprises the following steps: acquiring an audio feature vector of audio to be recognized; inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network comprises a plurality of dilated convolution layers for extracting context information of the audio feature vector in the time dimension; and comparing the target voiceprint feature vector with registered voiceprint feature vectors to determine the target user corresponding to the audio to be recognized. The method and the device can improve the accuracy of voiceprint recognition while ensuring recognition efficiency.

Description

Voiceprint recognition method and device and terminal equipment
Technical Field
The application belongs to the technical field of speech processing, and particularly relates to a voiceprint recognition method, a voiceprint recognition device and terminal equipment.
Background
Voiceprint Recognition (VPR), also known as Speaker Recognition, is one of the biometric technologies and has long attracted broad attention in academia and industry. The classic traditional voiceprint recognition technique is the i-vector, but its accuracy is relatively poor. For this reason, Google proposed the GE2E (Generalized End-to-End) network structure, which achieves higher recognition accuracy in voiceprint recognition than i-vectors. However, the complex neural network structure of GE2E makes the model occupy too much space and recognition too slow, which is unfavorable for application in practical production environments.
Disclosure of Invention
In view of this, embodiments of the present application provide a voiceprint recognition method, a voiceprint recognition device and a terminal device, so as to solve the problem in the prior art of how to ensure recognition efficiency while improving the accuracy of voiceprint recognition.
A first aspect of an embodiment of the present application provides a voiceprint recognition method, including:
acquiring an audio feature vector of audio to be recognized, wherein the audio feature vector comprises a time dimension and a spectral feature dimension, and one unit time in the time dimension corresponds to a group of spectral feature information in the spectral feature dimension;
inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network comprises a plurality of dilated convolution layers for extracting context information of the audio feature vector in the time dimension;
and comparing the target voiceprint feature vector with the registered voiceprint feature vectors to determine the target user corresponding to the audio to be recognized.
A second aspect of the embodiments of the present application provides a voiceprint recognition apparatus, including:
an audio feature vector acquisition unit, used for acquiring an audio feature vector of audio to be recognized, wherein the audio feature vector comprises a time dimension and a spectral feature dimension, and one unit time in the time dimension corresponds to a group of spectral feature information in the spectral feature dimension;
a target neural network unit, used for inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network comprises a plurality of dilated convolution layers for extracting context information of the audio feature vector in the time dimension;
and a determining unit, used for comparing the target voiceprint feature vector with the registered voiceprint feature vectors and determining the target user corresponding to the audio to be recognized.
A third aspect of the embodiments of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the voiceprint recognition method when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the steps of the voiceprint recognition method described above.
A fifth aspect of embodiments of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the above voiceprint recognition method.
Compared with the prior art, the embodiments of the present application have the following advantages: in the embodiments of the present application, feature extraction is performed on the audio feature vector of the audio to be recognized through a target neural network consisting of a SENet module, a dilated convolution network and a fully connected layer to obtain a target voiceprint feature vector, and the target voiceprint feature vector is compared with the registered voiceprint feature vectors to determine the target user corresponding to the audio to be recognized. Because the dilated convolution network extracts the context information of the audio feature vector in the time dimension, the accuracy of voiceprint recognition can be improved; meanwhile, the target neural network has a simple structure relative to the GE2E network, so the complexity of extracting the voiceprint feature information can be reduced and the efficiency of extracting the voiceprint feature information improved, thereby improving the efficiency of voiceprint recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic flow chart of an implementation of a first voiceprint recognition method provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a target neural network provided in an embodiment of the present application;
fig. 3 is a schematic flow chart of an implementation of a second voiceprint recognition method provided in the embodiment of the present application;
FIG. 4 is a schematic diagram of a voiceprint recognition apparatus provided by an embodiment of the present application;
fig. 5 is a schematic diagram of a terminal device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
In addition, in the description of the present application, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
Example one:
fig. 1 shows a schematic flow chart of a first voiceprint recognition method provided in an embodiment of the present application, which is detailed as follows:
in S101, an audio feature vector of an audio to be identified is obtained, where the audio feature vector includes a time dimension and a spectral feature dimension, and one unit time in the time dimension corresponds to a set of spectral feature information in the spectral feature dimension.
The voiceprint recognition method in the embodiment of the present application is specifically a text-independent voiceprint recognition method, that is, the user does not need to pronounce specified content, and the audio to be recognized in the embodiment of the present application is audio of any speaking content uttered by the user. The audio to be recognized is obtained through a sound collection device or from a storage unit storing the audio to be recognized, and the audio feature vector of the audio to be recognized is extracted through time-domain and frequency-domain transformation and analysis; the audio feature vector can be stored in the npy file format. The audio feature vector comprises a time dimension and a spectral feature dimension, and can be represented as a*b, where "*" is a multiplication sign, a is the length of the audio feature vector in the time dimension, and b is the length of the audio feature vector in the spectral feature dimension. One unit time in the time dimension corresponds to one group of spectral feature information in the spectral feature dimension, that is, the audio feature vector comprises the audio feature information of a units of time, and the audio feature information of each unit time can be represented by one group of spectral feature information with length b.
Specifically, the step S101 specifically includes:
acquiring the audio to be recognized, and filtering silence from the audio to be recognized to obtain an effective audio segment; intercepting the effective audio segment according to a target duration to obtain target audio;
and extracting Mel-frequency cepstral coefficient (MFCC) features of the target audio to obtain the audio feature vector.
The audio to be recognized is acquired, silence is filtered from the audio to be recognized to obtain the effective audio segment of the audio to be recognized, and the effective audio segment is intercepted according to the target duration (for example, 3 seconds) to obtain the target audio. Optionally, if the duration of the effective audio segment of the audio to be recognized is less than an effective duration threshold (e.g., 1 second), the audio to be recognized is discarded and new audio to be recognized is acquired. Optionally, if the duration of the effective audio segment is greater than or equal to the effective duration threshold and less than or equal to the target duration, the audio is not intercepted and the effective audio segment is directly used as the target audio, so that the duration of the target audio in the embodiment of the present application is greater than or equal to the effective duration threshold and less than or equal to the target duration. Specifically, the target duration and the effective duration threshold are determined according to the recognition accuracy of the target neural network. Optionally, the audio to be recognized in the embodiment of the present application is specifically short audio (for example, audio with a duration of 1 to 3 seconds), and the target neural network in the embodiment of the present application can perform voiceprint recognition accurately from the limited audio feature information of short audio. That is, because the target neural network in the embodiment of the present application has high recognition accuracy and requires only a small amount of input information, the duration of the acquired audio to be recognized can be shortened and the data volume processed by the target neural network reduced, so that the efficiency of voiceprint recognition can be improved.
Mel-frequency cepstral coefficient (MFCC) feature extraction is performed on the target audio according to parameters such as a preset sampling rate, frame length and first step size, to obtain an audio feature vector containing a time dimension and a spectral feature dimension, where the size of the audio feature vector is a specified size, that is, the length in the time dimension and the length in the spectral feature dimension are target lengths determined by the parameter settings used during MFCC feature extraction. By way of example and not limitation, the preset sampling rate is 16 kHz, the frame length is 25 ms, the first step size is 32 ms, and the size of the audio feature vector is 96 × 64. Optionally, if the duration of the target audio is less than 3 seconds, the size of the first audio feature vector obtained after MFCC feature extraction is less than the specified size; in this case, the portion of the first audio feature vector that falls short of the specified size is filled with "0", and a second audio feature vector of the specified size is obtained by padding to serve as the audio feature vector finally input to the target neural network. For example, assuming the duration of the target audio is 2 seconds, the size of the first audio feature vector obtained by MFCC feature extraction is 63 × 64, which is smaller than the specified size of 96 × 64; therefore the missing 33 × 64 portion of the first audio feature vector is filled with "0", and a second audio feature vector of size 96 × 64 is obtained by padding as the final audio feature vector.
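For illustration only, the preprocessing and MFCC extraction described above could be sketched in Python roughly as follows; the librosa calls, the silence threshold (top_db), the 16 kHz sampling rate and the helper name audio_to_feature_vector are assumptions made for this sketch and are not part of the original disclosure:

    import numpy as np
    import librosa

    TARGET_FRAMES = 96     # length in the time dimension (~3 s at a 32 ms step)
    N_MFCC = 64            # length in the spectral feature dimension
    SAMPLE_RATE = 16000    # assumed sampling rate
    FRAME_LEN = 0.025      # 25 ms frame length
    STEP_LEN = 0.032       # 32 ms first step size

    def audio_to_feature_vector(wav_path, min_dur=1.0, max_dur=3.0):
        # Silence filtering, interception at the target duration, MFCC extraction
        # and zero-padding to the fixed 96 x 64 size described above.
        y, sr = librosa.load(wav_path, sr=SAMPLE_RATE)
        intervals = librosa.effects.split(y, top_db=30)          # non-silent intervals
        effective = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y
        if len(effective) < min_dur * sr:
            return None                                          # too short: discard and re-acquire
        effective = effective[: int(max_dur * sr)]               # intercept at the target duration
        mfcc = librosa.feature.mfcc(
            y=effective, sr=sr, n_mfcc=N_MFCC,
            n_fft=int(FRAME_LEN * sr), hop_length=int(STEP_LEN * sr),
        ).T                                                      # shape: (frames, 64)
        if mfcc.shape[0] < TARGET_FRAMES:                        # pad the shortfall with "0"
            pad = np.zeros((TARGET_FRAMES - mfcc.shape[0], N_MFCC), dtype=mfcc.dtype)
            mfcc = np.vstack([mfcc, pad])
        return mfcc[:TARGET_FRAMES]                              # final 96 x 64 audio feature vector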
In S102, the audio feature vector is input into the target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, where the target neural network consists of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network includes a plurality of dilated convolution layers for extracting context information of the audio feature vector in the time dimension.
The audio feature vector of the specified size is input into the target neural network through a single channel (that is, the number of input channels is 1), and feature extraction is performed on the audio feature vector to obtain the target voiceprint feature vector corresponding to the audio feature vector. The target neural network consists of a Squeeze-and-Excitation Networks (SENet) module, a dilated convolution network and a fully connected layer. The SENet module is used to model the correlation among the channels in the target neural network; SENet determines the weight parameter of each channel from the sample data during training, so that the trained SENet module can accurately extract the feature information of each channel according to these weights. The dilated convolution network is used to extract the context information of the audio feature vector in the time dimension. In the embodiment of the present application, the context information is specifically feature information that integrates the spectral feature information corresponding to a plurality of preceding and following unit times in the audio feature vector. Specifically, the dilated convolution network comprises a plurality of dilated convolution layers, and each dilated convolution layer comprises a convolution kernel of size n*1, where n is a positive integer greater than 1 and "*" is a multiplication sign. The n*1 convolution kernel integrates context information of the audio feature vector over n unit times, that is, it strengthens the context connection in the audio feature extraction process, thereby improving the accuracy of voiceprint recognition. Moreover, because dilated convolution is adopted, the receptive field of the convolution kernel is wider without losing data accuracy.
Specifically, the target neural network consists of a first convolution layer, a SENet module, a first reconstruction layer, a first fully connected layer, a second reconstruction layer, a dilated convolution network, a third reconstruction layer, an average pooling layer and a second fully connected layer, and step S102 specifically includes:
S10201: inputting the audio feature vector into the target neural network and obtaining a first feature vector through the first convolution layer, wherein the first feature vector comprises a time dimension, a spectral feature dimension and a channel dimension;
S10202: weighting the information of each channel of the first feature vector through the SENet module to obtain a second feature vector;
S10203: passing the second feature vector sequentially through the first reconstruction layer, the first fully connected layer and the second reconstruction layer to obtain a third feature vector;
S10204: passing the third feature vector sequentially through the plurality of dilated convolution layers of the dilated convolution network to extract the context information of the third feature vector in the time dimension and obtain a fourth feature vector, wherein each dilated convolution layer comprises a convolution kernel of size n*1, n is a positive integer greater than 1, and "*" is a multiplication sign;
S10205: passing the fourth feature vector sequentially through the third reconstruction layer, the average pooling layer and the second fully connected layer to obtain the target voiceprint feature vector of the target size.
As shown in Fig. 2, the target neural network of the embodiment of the present application consists of a first convolution layer Conv1-Relu, a SENet module, a first reconstruction layer Reshape1, a first fully connected layer Fc1, a second reconstruction layer Reshape2, a dilated convolution network Dilated-Conv-Net, a third reconstruction layer Reshape3, an average pooling layer Avg-pool and a second fully connected layer Fc2.
In S10201, the audio feature vector is input to the target neural network via a single channel (i.e., the number of input channels is 1), and a first feature vector is obtained through the first convolution layer Conv1-Relu. The first convolution layer comprises a first number of channels, the convolution kernel of each channel is 3 × 3 and the stride is a second step size; correspondingly, the first feature vector output by the first convolution layer comprises a channel dimension in addition to the time dimension and the spectral feature dimension, and the length of the first feature vector in the channel dimension is equal to the first number. Illustratively, the size of the audio feature vector is 96 × 64 and the number of input channels is 1, i.e., the input data is 96 × 64 × 1; the first convolution layer has 32 channels, the convolution kernel size of each channel is 3 × 3 and the stride is 2, so after the first convolution layer a first feature vector of size 48 × 32 × 32 is obtained, where "48" is the length in the time dimension, the first "32" is the length in the spectral feature dimension, and the second "32" is the length in the channel dimension.
In S10202, the first feature vector output by the first convolution layer is weighted by the SENet module according to the channel weight parameters of SENet to obtain a second feature vector. The dimensions and size of the second feature vector are identical to those of the first feature vector.
In S10203, the second feature vector obtained after channel weighting passes through the first reconstruction layer Reshape1, the first fully connected layer Fc1 and the second reconstruction layer Reshape2 to obtain a third feature vector whose channel dimension has length 1. Illustratively, the size of the second feature vector is 48 × 32 × 32; after the first reconstruction layer Reshape1 it becomes a feature vector of size 48 × 1024 (32 × 32); it is then mapped to a feature vector of size 48 × 256 by the first fully connected layer Fc1, which consists of a fully connected layer and a tanh activation function; the channel dimension is then expanded by the second reconstruction layer Reshape2, resulting in a third feature vector of size 48 × 256 × 1. The data format of the second feature vector is thus adjusted through the first reconstruction layer, the first fully connected layer and the second reconstruction layer to obtain a single-channel third feature vector (with length 1 in the channel dimension), so as to match the input data format required by the dilated convolution network.
In S10204, the third feature vector passes sequentially through the dilated convolution layers of the dilated convolution network, and the context information of the third feature vector in the time dimension is extracted to obtain a fourth feature vector.
The third feature vector is processed by a plurality of dilated convolution layers of the dilated convolution network, each containing an n*1 convolution kernel, and the context information in the time dimension of the third feature vector is extracted; that is, the spectral feature information corresponding to each unit time is correlated with the spectral feature information corresponding to adjacent unit times, to obtain a fourth feature vector containing the context information in the time dimension, whose size is consistent with that of the third feature vector. Specifically, the dilated convolution network comprises a plurality of dilated convolution layers, and each dilated convolution layer comprises a convolution kernel of size n*1, where n is a positive integer greater than 1 and "*" is a multiplication sign. The n*1 convolution kernel integrates the context information of the feature vector over n unit times, that is, it strengthens the context connection in the audio feature extraction process, thereby improving the accuracy of voiceprint recognition. Illustratively, the dilated convolution network consists of five dilated convolution layers Dilated-Conv1, Dilated-Conv2, Dilated-Conv3, Dilated-Conv4 and Dilated-Conv5, all of which are single-channel with stride 1; the convolution kernel sizes are 5 × 1, 9 × 1, 15 × 1, 24 × 1 and 24 × 1 in sequence, and the corresponding dilation rates are 1, 2, 3, 1 and 1. Assuming the size of the third feature vector is 48 × 256 × 1, the size of the fourth feature vector processed by the dilated convolution network is also 48 × 256 × 1.
In S10205, the data format of the fourth feature vector is first adjusted by the third reconstruction layer Reshape3, and the average of the spectral feature information over the different unit times is then taken in the time dimension by the average pooling layer Avg-pool, giving a feature vector of length 1 in the time dimension (i.e., a single time dimension, so the time dimension can be omitted from the representation); the target voiceprint feature vector of the target size is then obtained through the second fully connected layer, which consists of a fully connected layer and a tanh activation function. Illustratively, the fourth feature vector of size 48 × 256 × 1 becomes a feature vector of size 48 × 256 through the third reconstruction layer, and the spectral feature information of the 48 different unit times is summed and averaged by the average pooling layer to give a feature vector of size 256 (i.e., a feature vector of length 256 in the spectral feature dimension, normalized over both the time dimension and the channel dimension); the feature vector of size 256 is then mapped by the second fully connected layer into the target voiceprint feature vector of target size 512, which contains 512 dimensions of feature information.
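The structure walked through in S10201 to S10205 can be illustrated with a rough PyTorch sketch under the example sizes given above (96 × 64 single-channel input, 32 channels in the first convolution layer, five dilated convolution layers with kernel sizes 5, 9, 15, 24, 24 and dilation rates 1, 2, 3, 1, 1, and a 512-dimensional output). The SE-block reduction ratio, the activations inside the SE block and the 'same' padding of the dilated layers are assumptions where the text does not fix them; this is a sketch, not a definitive implementation:

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        # Squeeze-and-Excitation: learn a weight per channel and rescale the channels.
        def __init__(self, channels, reduction=8):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid(),
            )
        def forward(self, x):                       # x: (B, C, T, F)
            w = self.fc(x.mean(dim=(2, 3)))         # squeeze over time and spectrum: (B, C)
            return x * w.unsqueeze(-1).unsqueeze(-1)

    class TargetNet(nn.Module):
        def __init__(self, emb_dim=512):
            super().__init__()
            self.conv1 = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU())
            self.se = SEBlock(32)
            self.fc1 = nn.Sequential(nn.Linear(32 * 32, 256), nn.Tanh())
            # Five single-channel dilated convolutions over the time dimension only.
            dilated = []
            for k, d in [(5, 1), (9, 2), (15, 3), (24, 1), (24, 1)]:
                dilated.append(nn.Conv2d(1, 1, kernel_size=(k, 1), dilation=(d, 1), padding='same'))
                dilated.append(nn.ReLU())
            self.dilated_conv_net = nn.Sequential(*dilated)
            self.fc2 = nn.Sequential(nn.Linear(256, emb_dim), nn.Tanh())

        def forward(self, x):                       # x: (B, 1, 96, 64) audio feature vector
            x = self.conv1(x)                       # Conv1-Relu: (B, 32, 48, 32)
            x = self.se(x)                          # channel-weighted, same size
            x = x.permute(0, 2, 1, 3).reshape(x.size(0), 48, -1)   # Reshape1: (B, 48, 1024)
            x = self.fc1(x)                         # Fc1: (B, 48, 256)
            x = x.unsqueeze(1)                      # Reshape2: (B, 1, 48, 256)
            x = self.dilated_conv_net(x)            # Dilated-Conv-Net: (B, 1, 48, 256)
            x = x.squeeze(1)                        # Reshape3: (B, 48, 256)
            x = x.mean(dim=1)                       # Avg-pool over time: (B, 256)
            return self.fc2(x)                      # Fc2: target voiceprint vector (B, 512)

A single 96 × 64 audio feature vector would then be mapped as TargetNet()(torch.randn(1, 1, 96, 64)), giving a 512-dimensional target voiceprint feature vector.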
In S103, the target voiceprint feature vector is compared with the registered voiceprint feature vector, and a target user corresponding to the audio to be identified is determined.
In the embodiment of the present application, registered voiceprint feature vectors and the corresponding identification information of users are pre-stored, where the identification information can be the user's name, number or other information. The target voiceprint feature vector obtained in step S102 is compared with the registered voiceprint feature vectors, the registered voiceprint feature vector with the highest similarity to the target voiceprint feature vector is found, the user corresponding to that registered voiceprint feature vector is determined to be the target user corresponding to the current audio to be recognized, and the identification information of the target user is output. Specifically, the registered voiceprint feature vector with the highest similarity to the target voiceprint feature vector is found by calculating the cosine similarity between the target voiceprint feature vector and each pre-stored registered voiceprint feature vector.
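A minimal sketch of this comparison step, assuming the registered voiceprint feature vectors are held in an in-memory dictionary keyed by user identification information (the dictionary layout and the acceptance threshold are illustrative assumptions; the description itself only requires selecting the most similar registered vector):

    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def identify(target_vec, registered, threshold=0.7):
        # registered: dict mapping user identification info -> registered voiceprint vector.
        # Returns the identification info of the most similar registered user, or None.
        best_user, best_sim = None, -1.0
        for user_id, reg_vec in registered.items():
            sim = cosine_similarity(target_vec, reg_vec)
            if sim > best_sim:
                best_user, best_sim = user_id, sim
        return best_user if best_sim >= threshold else None   # None -> prompt registration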
Optionally, before step S101, the method includes:
receiving a registration instruction, and acquiring identification information of a user to be registered and corresponding audio to be registered;
obtaining the voiceprint feature vector of the audio to be registered through the target neural network;
and storing the voiceprint feature vector of the audio to be registered and the corresponding user identification information in a target database to obtain the registered voiceprint feature vector and the corresponding user identification information.
Preferably, during registration, multiple audios to be registered from the same user to be registered are acquired and passed through the target neural network simultaneously or sequentially to obtain multiple voiceprint feature vectors of that user, and the average of these voiceprint feature vectors is taken as the final voiceprint feature vector of the user to be registered for registration, which further improves the accuracy of the registration data and hence the accuracy of subsequent voiceprint recognition.
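A sketch of this registration flow, reusing the hypothetical audio_to_feature_vector and TargetNet helpers from the earlier sketches (both names are assumptions, not names used in the original disclosure):

    import numpy as np
    import torch

    def register_user(user_id, wav_paths, model, registered):
        # Obtain a voiceprint feature vector for each enrollment audio of the same user,
        # then store the average as the final registered voiceprint feature vector.
        vectors = []
        for path in wav_paths:
            feat = audio_to_feature_vector(path)                 # 96 x 64 MFCC features
            if feat is None:
                continue                                         # effective segment too short
            with torch.no_grad():
                emb = model(torch.from_numpy(feat).float()[None, None])   # (1, 512)
            vectors.append(emb.numpy()[0])
        if vectors:
            registered[user_id] = np.mean(vectors, axis=0)       # final voiceprint feature vector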
Optionally, after the step S103, the method further includes:
and if the registered voiceprint characteristic vector matched with the target voiceprint characteristic vector is not found, indicating the current user to register the target voiceprint characteristic vector.
If no registered voiceprint feature vector matching the target voiceprint feature vector is found, the user information corresponding to the current audio to be recognized has not been registered; therefore the user is prompted to enter the identification information of the current user, the identification information and the target voiceprint feature vector are stored correspondingly in the target database, and the registration of the target voiceprint feature vector is completed.
In the embodiment of the present application, feature extraction is performed on the audio feature vector of the audio to be recognized through a target neural network consisting of a SENet module, a dilated convolution network and a fully connected layer to obtain a target voiceprint feature vector, and the target voiceprint feature vector is compared with the registered voiceprint feature vectors to determine the target user corresponding to the audio to be recognized. Because the dilated convolution network extracts the context information of the audio feature vector in the time dimension, the accuracy of voiceprint recognition can be improved; meanwhile, the target neural network has a simple structure relative to the GE2E network, so the complexity of extracting the voiceprint feature information can be reduced and the efficiency of extracting the voiceprint feature information improved, thereby improving the efficiency of voiceprint recognition.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Example two:
fig. 3 shows a schematic flow chart of a second voiceprint recognition method provided in the embodiment of the present application, which is detailed as follows:
in S301, sample data is obtained, where the sample data is from audio data of different users.
Audio data of different users are obtained and preprocessed to obtain audio feature vectors of the different users as sample data. Or, the audio feature vectors of the audio data of different users pre-stored in the form of npy files are read to obtain sample data for training. Specifically, in the sample data, two or more audio feature vectors exist per user.
In S302, inputting the sample data into the target neural network for training until the intra-class audio similarity and the inter-class audio similarity meet preset conditions to obtain a trained target neural network; the intra-class audio similarity is the similarity between the voiceprint feature vectors corresponding to different audio data belonging to the same user, and the inter-class audio similarity is the similarity between the voiceprint feature vectors corresponding to different audio data belonging to different users.
Sample data is input into the target neural network for training, and the learnable parameters of each network layer are adjusted until, for the voiceprint feature vectors obtained from the sample data, the intra-class audio similarity and the inter-class audio similarity meet the preset condition, so that the intra-class audio similarity is as large as possible and the inter-class audio similarity as small as possible. The intra-class audio similarity refers to the similarity between voiceprint feature vectors belonging to the same user, and the inter-class audio similarity refers to the similarity between voiceprint feature vectors belonging to different users. Specifically, the similarity between voiceprint feature vectors can be represented by cosine similarity. Specifically, the preset condition may be that the intra-class audio similarity is greater than a first preset threshold and the inter-class audio similarity is less than a second preset threshold, or the preset condition may be that the difference between the intra-class audio similarity and the inter-class audio similarity is greater than a preset difference, so that the intra-class audio similarity is as large as possible and the inter-class audio similarity as small as possible.
Optionally, step S302 includes:
inputting the preset sample data into the target neural network in sequence for training until the value of an objective function meets a preset condition, to obtain the trained target neural network, wherein the objective function of the target neural network is:

Sc = (1/NM) × Σ_vi [ sim{(vi, vj) | i ≠ j, vi ∈ P, vj ∈ P} - sim{(vi, vk) | vi ∈ P, vk ∉ P} ]

wherein Sc is the value of the objective function and represents the difference between the intra-class audio similarity and the inter-class audio similarity; N is the number of users corresponding to the sample data input in the current batch, M is the number of sample data corresponding to each user, and Σ_vi denotes summation over all voiceprint feature vectors vi of the current batch; vi represents the voiceprint feature vector obtained from any sample data of the current batch through the target neural network model, P represents the user corresponding to vi, vj is a voiceprint feature vector belonging to the same user as vi, and vk is a voiceprint feature vector not belonging to the same user as vi; sim{(vi, vj) | i ≠ j, vi ∈ P, vj ∈ P} represents the average cosine similarity between vi and the voiceprint feature vectors vj belonging to the same user P, and sim{(vi, vk) | vi ∈ P, vk ∉ P} represents the average cosine similarity between vi and the voiceprint feature vectors vk of other users.
Sample data of a preset batch size is acquired from the data set each time and input into the target neural network for training, where the sample data of each batch comes from a preset number of utterances from a preset number of users. For example, the preset batch size is set to 64, that is, 64 sample data are input as one batch each time to train the target neural network, where the 64 sample data come from 16 users and each user corresponds to 4 utterances, i.e., 4 sample data.
The objective function of the target neural network is

Sc = (1/NM) × Σ_vi [ sim{(vi, vj) | i ≠ j, vi ∈ P, vj ∈ P} - sim{(vi, vk) | vi ∈ P, vk ∉ P} ]

wherein the value Sc calculated by the objective function represents the difference between the intra-class audio similarity and the inter-class audio similarity; N is the number of users corresponding to the sample data input in the current batch, M is the number of sample data corresponding to each user, and NM (N multiplied by M) equals the number of samples in the preset batch. vi represents the voiceprint feature vector obtained from any sample data of the current batch through the target neural network model, P represents the user corresponding to vi, vj is a voiceprint feature vector belonging to the same user as vi, and vk is a voiceprint feature vector not belonging to the same user as vi. sim{(vi, vj) | i ≠ j, vi ∈ P, vj ∈ P} represents the average cosine similarity between vi and the voiceprint feature vectors vj belonging to the same user P, and sim{(vi, vk) | vi ∈ P, vk ∉ P} represents the average cosine similarity between vi and the voiceprint feature vectors vk of other users. During training, the negative value of Sc is taken as the loss, and gradient descent is used for training until the descent gradient of (-Sc) is smaller than a preset value and the accuracy of the target neural network is higher than an accuracy threshold, yielding the trained target neural network. That is, the preset condition in the embodiment of the present application may be that the descent gradient of (-Sc) is smaller than the preset value and the accuracy of the target neural network is higher than the accuracy threshold; at this point the intra-class audio similarity is large and the inter-class audio similarity is small for audio processed by the target neural network. Preferably, when (-Sc) takes its minimum value, i.e. when Sc is at its maximum, the cosine similarity between the voiceprint feature vectors of intra-class sample data (i.e. sample data belonging to the same user) is as large as possible and the cosine similarity between the voiceprint feature vectors of inter-class sample data (i.e. sample data belonging to different users) is as small as possible, so that the recognition accuracy of the corresponding target neural network is the highest.
Optionally, the learning rate of the target neural network during training is dynamically adjusted according to a preset target learning rate and the current training step number.
Specifically, the learning rate of the target neural network during training is dynamically adjusted according to the preset target learning rate and the current training step number by combining warm-up with learning rate decay. Specifically, the learning rate lr during training is dynamically adjusted by the following learning rate formula:
lr = flr × 10^0.5 × min(step × 10^-1.5, step^-0.5)
wherein flr is a preset target learning rate, and step is the current training step number.
According to this learning rate formula, the learning rate is gradually warmed up in the initial training stage and increased to the preset target learning rate, which accelerates training convergence; in the later training stage, after the learning rate has reached the target learning rate, it gradually decays so that the target neural network can converge more accurately. Through this dynamic adjustment, both the training speed and the accuracy of the target neural network can be improved.
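Written out as a helper, the schedule above (with the 10^0.5 and 10^-1.5 constants as reconstructed from the formula) looks roughly like this; the handling of step 0 is an added safeguard:

    def learning_rate(flr, step):
        # Warm-up then decay: lr = flr * 10**0.5 * min(step * 10**-1.5, step**-0.5)
        step = max(step, 1)                                 # avoid step**-0.5 at step 0
        return flr * 10 ** 0.5 * min(step * 10 ** -1.5, step ** -0.5)

    # With this form the learning rate rises during roughly the first 10 steps (warm-up),
    # reaches the preset target learning rate flr, and then decays as step**-0.5.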
Optionally, the voiceprint recognition method is specifically applied to a far-field recording scene, and the sample data includes far-field recording data carrying background noise and noise-free audio data of a preset number.
The voiceprint recognition method is particularly applicable to far-field recording scenarios, such as the far-field recording scenario of a smart television. Audio in a far-field recording scenario contains a certain amount of background noise, and correspondingly the sample data used to train the target neural network also includes far-field recording data containing background noise. In addition, because far-field recording data may be too noisy, which makes it more difficult for the target neural network to converge, the sample data in the embodiment of the present application includes a preset amount of noise-free audio data in addition to the far-field recording data containing background noise. Training the target neural network with the combination of far-field recording data containing background noise and the preset amount of noise-free audio data ensures that the trained target neural network fits voiceprint recognition in far-field recording scenarios accurately while improving the convergence speed of the target neural network. Illustratively, the sample data set in the embodiment of the present application includes 16074 far-field recordings (each stored as an npy file) from 5512 users and 255763 noise-free audio recordings (each stored as an npy file) from 2500 users.
In S303, an audio feature vector of the audio to be identified is obtained, where the audio feature vector includes a time dimension and a spectral feature dimension, and one unit time in the time dimension corresponds to a group of spectral feature information in the spectral feature dimension.
In S304, the audio feature vector is input into the target neural network to obtain the target voiceprint feature vector corresponding to the audio feature vector, where the target neural network consists of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network includes a plurality of dilated convolution layers for extracting the context information of the audio feature vector in the time dimension.
In S305, the target voiceprint feature vector is compared with the registered voiceprint feature vector, and a target user corresponding to the audio to be identified is determined.
S303 to S305 in the present embodiment are respectively the same as S101 to S103 in the previous embodiment, and please refer to the related description of S101 to S103 in the previous embodiment, which is not described herein again.
In the embodiment of the present application, the target neural network is trained until the intra-class audio similarity and the inter-class audio similarity meet the preset condition, so that the finally trained target neural network makes the cosine similarity between the voiceprint feature vectors of intra-class sample data (i.e., sample data belonging to the same user) as large as possible, and the cosine similarity between the voiceprint feature vectors of inter-class sample data (i.e., sample data belonging to different users) as small as possible, which improves the recognition accuracy of the target neural network and therefore the accuracy of the voiceprint recognition method.
By way of example and not limitation, the following provides test validation procedures and results for the voiceprint recognition method of embodiments of the present application:
(I) accuracy testing
A1: acquire speech data from 6 users, 2 utterances per person, that are outside the sample data set, and perform preprocessing and MFCC feature extraction to obtain 12 audio feature vectors, each carrying the corresponding user identification information;
A2: input all 12 audio feature vectors from step A1 into the target neural network for feature extraction to obtain the corresponding 12 voiceprint feature vectors;
A3: take each of the 12 voiceprint feature vectors in turn and calculate the similarity between the current voiceprint feature vector and all other voiceprint feature vectors; if the voiceprint feature vector with the highest similarity belongs to the same user as the current voiceprint feature vector, the model recognition is judged correct, otherwise it is judged wrong; repeat until all 12 voiceprint feature vectors have been traversed;
A4: count the recognition results of step A3 to obtain the final accuracy.
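A sketch of the accuracy test A1 to A4, assuming the 12 voiceprint feature vectors and their user labels are available as parallel lists (a leave-one-out nearest-neighbour check by cosine similarity):

    import numpy as np

    def accuracy_test(vectors, user_ids):
        # For each voiceprint feature vector, the most similar other vector
        # must belong to the same user for the recognition to count as correct.
        correct = 0
        for i, v in enumerate(vectors):
            sims = [
                (float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))), user_ids[j])
                for j, w in enumerate(vectors) if j != i
            ]
            _, best_user = max(sims)
            correct += int(best_user == user_ids[i])
        return correct / len(vectors)                       # final accuracy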
According to verification, the accuracy of the voiceprint recognition method in the embodiment of the present application is higher than that of voiceprint recognition through GE2E. Illustratively, in one test result, the accuracy of voiceprint recognition using the GE2E network is 0.704, while the accuracy of voiceprint recognition using the target neural network of the embodiment of the present application is 0.805.
(II) operation speed test
B1: acquire sample data of 2 utterances of speech from each of 6 users, i.e., 6 × 2 = 12 sample data (which can be 12 npy files), from the data set, input them into the target neural network as one batch for testing, and record the time consumed by the target neural network to run;
B2: repeat step B1 100 times to obtain 100 time-consumption values, remove the maximum and minimum of the 100 values to leave 98 values, average the 98 values to obtain the final time consumption, and compare it with the running time of the GE2E-based voiceprint recognition method.
By comparison, the final time consumption of voiceprint recognition through the target neural network in the embodiment of the present application is lower than the running time of the GE2E-based voiceprint recognition method. Illustratively, in one test result, the GE2E-network-based voiceprint recognition method took 0.656 seconds to run, while the voiceprint recognition method using the target neural network of the embodiment of the present application took 0.040 seconds.
Example three:
fig. 4 shows a schematic structural diagram of a voiceprint recognition apparatus provided in an embodiment of the present application, and for convenience of description, only parts related to the embodiment of the present application are shown:
the voiceprint recognition device includes: an audio feature vector obtaining unit 41, a target neural network unit 42, and a determining unit 43. Wherein:
the audio feature vector acquiring unit 41 is configured to acquire an audio feature vector of an audio to be identified, where the audio feature vector includes a time dimension and a spectral feature dimension, and one unit time in the time dimension corresponds to a set of spectral feature information in the spectral feature dimension.
Optionally, the audio feature vector obtaining unit 41 includes an audio to be identified obtaining module and an MFCC feature extraction module:
the audio to be recognized acquisition module is used for acquiring the audio to be recognized and filtering silence from the audio to be recognized to obtain an effective audio segment, and for intercepting the effective audio segment according to the target duration to obtain the target audio;
and the MFCC feature extraction module is used for extracting the Mel-frequency cepstral coefficient (MFCC) features of the target audio to obtain the audio feature vector.
And a target neural network unit 42, configured to input the audio feature vector into the target neural network to obtain the target voiceprint feature vector corresponding to the audio feature vector, where the target neural network consists of a SENet module, a dilated convolution network and a fully connected layer, and the dilated convolution network includes a plurality of dilated convolution layers for extracting the context information of the audio feature vector in the time dimension.
Optionally, the target neural network unit comprises a training module, used for obtaining sample data, wherein the sample data comes from audio data of different users; and for inputting the sample data into the target neural network for training until the intra-class audio similarity and the inter-class audio similarity meet the preset condition, to obtain the trained target neural network; the intra-class audio similarity is the similarity between the voiceprint feature vectors belonging to the same user, and the inter-class audio similarity is the similarity between the voiceprint feature vectors belonging to different users.
Optionally, the training module is specifically configured to input the preset sample data into the target neural network in sequence for training until the value of the objective function meets the preset condition, to obtain the trained target neural network, where the objective function of the target neural network is:
Sc = (1/NM) × Σ_vi [ sim{(vi, vj) | i ≠ j, vi ∈ P, vj ∈ P} - sim{(vi, vk) | vi ∈ P, vk ∉ P} ]

wherein Sc is the value of the objective function and represents the difference between the intra-class audio similarity and the inter-class audio similarity; N is the number of users corresponding to the sample data input in the current batch, and M is the number of sample data corresponding to each user; vi represents the voiceprint feature vector obtained from any sample data of the current batch through the target neural network model, P represents the user corresponding to vi, vj is a voiceprint feature vector belonging to the same user as vi, and vk is a voiceprint feature vector not belonging to the same user as vi; sim{(vi, vj) | i ≠ j, vi ∈ P, vj ∈ P} represents the average cosine similarity between vi and the voiceprint feature vectors vj belonging to the same user P, and sim{(vi, vk) | vi ∈ P, vk ∉ P} represents the average cosine similarity between vi and the voiceprint feature vectors vk of other users.
Optionally, the training module includes a learning rate adjusting module, configured to dynamically adjust a learning rate of the target neural network during training according to a preset target learning rate and a current training step number.
Optionally, the voiceprint recognition apparatus is applied to a far-field recording scene, and the sample data includes far-field recording data carrying background noise and a preset number of noise-free audio data.
Optionally, the target neural network specifically consists of a first convolution layer, a SENet module, a first reconstruction layer, a first fully connected layer, a second reconstruction layer, a dilated convolution network, a third reconstruction layer, an average pooling layer and a second fully connected layer, and the target neural network unit 42 is specifically configured to:
inputting the audio feature vector into a target neural network, and obtaining a first feature vector through the first convolution layer, wherein the first feature vector comprises a time dimension, a frequency spectrum feature dimension and a channel dimension;
the first feature vector is weighted over each channel by the SENet module to obtain a second feature vector;
the second feature vector sequentially passes through the first reconstruction layer, the first fully connected layer and the second reconstruction layer to obtain a third feature vector;
the third feature vector passes sequentially through the dilated convolution layers of the dilated convolution network, and the context information of the third feature vector in the time dimension is extracted to obtain a fourth feature vector, wherein each dilated convolution layer comprises a convolution kernel of size n*1, n is a positive integer greater than 1, and "*" is a multiplication sign;
and the fourth feature vector sequentially passes through the third reconstruction layer, the average pooling layer and the second full-connection layer to obtain the target voiceprint feature vector with the target size.
A determining unit 43, configured to compare the target voiceprint feature vector with the registered voiceprint feature vector, and determine a target user corresponding to the audio to be identified.
Optionally, the determining unit 43 further includes:
and the indicating module is used for indicating the user to register the target voiceprint feature vector if the registered voiceprint feature vector matched with the target voiceprint feature vector is not found.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Example four:
fig. 5 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in fig. 5, the terminal device 5 of this embodiment includes: a processor 50, a memory 51 and a computer program 52, such as a voiceprint recognition program, stored in said memory 51 and executable on said processor 50. The processor 50, when executing the computer program 52, implements the steps in the above-described respective voiceprint recognition method embodiments, such as the steps S101 to S103 shown in fig. 1. Alternatively, the processor 50, when executing the computer program 52, implements the functions of the modules/units in the above-mentioned device embodiments, such as the functions of the units 41 to 43 shown in fig. 4.
Illustratively, the computer program 52 may be partitioned into one or more modules/units, which are stored in the memory 51 and executed by the processor 50 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 52 in the terminal device 5. For example, the computer program 52 may be divided into an audio feature vector obtaining unit, a target neural network unit and a determining unit, and each unit specifically functions as follows:
the audio feature vector acquisition unit is used for acquiring an audio feature vector of an audio to be identified, wherein the audio feature vector comprises a time dimension and a spectrum feature dimension, and one unit time in the time dimension corresponds to a group of spectrum feature information in the spectrum feature dimension.
And the target neural network unit is used for inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of an SENet module, a hole convolution network and a fully-connected layer, and the hole convolution network comprises a plurality of hole convolution layers for extracting context information of the audio feature vector in the time dimension.
And the determining unit is used for comparing the target voiceprint feature vector with the registered voiceprint feature vector and determining the target user corresponding to the audio to be identified.
The terminal device 5 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor 50, a memory 51. Those skilled in the art will appreciate that fig. 5 is merely an example of a terminal device 5 and does not constitute a limitation of terminal device 5 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the terminal device may also include input-output devices, network access devices, buses, etc.
The processor 50 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash memory card (Flash Card) provided on the terminal device 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used for storing the computer program and other programs and data required by the terminal device. The memory 51 may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and can realize the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A voiceprint recognition method, comprising:
acquiring an audio feature vector of an audio to be identified, wherein the audio feature vector comprises a time dimension and a spectrum feature dimension, and one unit time in the time dimension corresponds to a group of spectrum feature information in the spectrum feature dimension;
inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of an SENet module, a hole convolution network and a fully-connected layer, and the hole convolution network comprises a plurality of hole convolution layers for extracting context information of the audio feature vector in a time dimension;
and comparing the target voiceprint feature vector with the registered voiceprint feature vector to determine a target user corresponding to the audio to be identified.
2. The voiceprint recognition method according to claim 1, wherein the obtaining the audio feature vector of the audio to be recognized comprises:
acquiring audio to be recognized, and filtering silence from the audio to be recognized to obtain an effective audio segment; clipping the effective audio segment according to a target duration to obtain a target audio;
and extracting Mel-frequency cepstral coefficient (MFCC) features from the target audio to obtain the audio feature vector.
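Illustratively, the feature extraction of this claim can be sketched with librosa as follows. The sampling rate, the silence threshold, the target duration, the padding of short segments and the number of MFCC coefficients are assumptions made for this sketch only and are not specified by the present application.

import numpy as np
import librosa

def extract_audio_feature_vector(path, sr=16000, target_seconds=3.0, n_mfcc=40):
    """Filter silence, clip to the target duration, and extract MFCC features."""
    audio, sr = librosa.load(path, sr=sr)

    # Filter silence: keep only intervals above an energy threshold (assumed 25 dB below peak).
    intervals = librosa.effects.split(audio, top_db=25)
    effective = np.concatenate([audio[s:e] for s, e in intervals]) if len(intervals) else audio

    # Clip the effective segment to the target duration to obtain the target audio
    # (padding of too-short segments is an assumption of this sketch).
    target_len = int(target_seconds * sr)
    if len(effective) >= target_len:
        target_audio = effective[:target_len]
    else:
        target_audio = np.pad(effective, (0, target_len - len(effective)))

    # MFCC features: one group of spectral feature information per unit time (frame).
    mfcc = librosa.feature.mfcc(y=target_audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T   # shape: (time, n_mfcc)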
3. The voiceprint recognition method according to claim 1, wherein before said obtaining the audio feature vector of the audio to be recognized, further comprising:
acquiring sample data, wherein the sample data come from audio data of different users;
inputting the sample data into the target neural network for training until the intra-class audio similarity and the inter-class audio similarity meet preset conditions to obtain a trained target neural network; the intra-class audio similarity is the similarity between the voiceprint feature vectors belonging to the same user, and the inter-class audio similarity is the similarity between the voiceprint feature vectors belonging to different users.
4. The voiceprint recognition method according to claim 3, wherein the inputting the sample data into the target neural network for training until the intra-class audio similarity and the inter-class audio similarity satisfy a preset condition, to obtain the trained target neural network comprises:
inputting preset sample data into the target neural network in sequence for training until the value of an objective function meets a preset condition, to obtain the trained target neural network, wherein the objective function of the target neural network is:
[formula image FDA0002374899790000021: objective function Sc]
wherein Sc is the value of the objective function and represents the difference between the intra-class audio similarity and the inter-class audio similarity; N is the number of users corresponding to the sample data input in the current batch; M is the number of sample data corresponding to each user; v_i denotes the voiceprint feature vector obtained by passing any sample data of the current batch through the target neural network model; P denotes the user corresponding to v_i; v_j is a voiceprint feature vector belonging to the same user as v_i; v_k is a voiceprint feature vector not belonging to the same user as v_i; sim{(v_i, v_j) | i ≠ j, v_i ∈ P, v_j ∈ P} denotes the cosine similarity between v_i and the voiceprint feature vectors v_j belonging to the same user P, and
[formula image FDA0002374899790000022]
denotes the cosine similarity between v_i and the voiceprint feature vectors v_k of other users.
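Since the exact formula is given only as an image above, the following PyTorch sketch reconstructs a plausible version of the objective under stated assumptions: the intra-class term is taken as the mean cosine similarity between embeddings of the same user, the inter-class term as the mean cosine similarity between embeddings of different users, and Sc as their difference (to be maximized). The batch layout of N users with M samples each and the use of means are assumptions of this sketch.

import torch
import torch.nn.functional as F

def similarity_objective(embeddings, user_ids):
    """
    embeddings: (N*M, D) voiceprint feature vectors produced by the target neural network.
    user_ids:   (N*M,) user label for each embedding (assumes N >= 2 users, M >= 2 samples per user).
    Returns Sc = mean intra-class cosine similarity - mean inter-class cosine similarity.
    """
    v = F.normalize(embeddings, dim=1)
    sim = v @ v.t()                                        # pairwise cosine similarities
    same = user_ids.unsqueeze(0) == user_ids.unsqueeze(1)
    off_diag = ~torch.eye(len(v), dtype=torch.bool, device=v.device)
    intra = sim[same & off_diag].mean()                    # sim(v_i, v_j), i != j, same user
    inter = sim[~same].mean()                              # sim(v_i, v_k), different users
    return intra - inter

# Training-step sketch: stop once Sc meets the preset condition (e.g. exceeds a threshold).
# loss = -similarity_objective(model(batch_features), batch_user_ids)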
5. The voiceprint recognition method according to claim 3, wherein a learning rate of the target neural network during training is dynamically adjusted according to a preset target learning rate and a current training step number.
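The claim does not fix a particular schedule; the following is one hedged sketch of dynamically adjusting the learning rate from a preset target learning rate and the current training step, here a linear warm-up toward the target followed by an inverse-square-root decay. Both the warm-up length and the decay rule are assumptions of this sketch.

def dynamic_learning_rate(target_lr, step, warmup_steps=1000):
    """Assumed schedule: ramp up to the preset target learning rate, then decay with the step count."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps        # linear warm-up
    return target_lr * (warmup_steps / (step + 1)) ** 0.5   # inverse square-root decay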
6. The voiceprint recognition method according to claim 1, wherein the target neural network specifically consists of a first convolution layer, an SENet module, a first reconstruction layer, a first fully-connected layer, a second reconstruction layer, a hole convolution network, a third reconstruction layer, an average pooling layer and a second fully-connected layer, and the inputting the audio feature vector into the target neural network to obtain the target voiceprint feature vector corresponding to the audio feature vector comprises:
inputting the audio feature vector into the target neural network and obtaining a first feature vector through the first convolution layer, wherein the first feature vector comprises a time dimension, a spectral feature dimension and a channel dimension;
the first feature vector weights the information of each channel through the SENet module to obtain a second feature vector;
the second feature vector sequentially passes through the first reconstruction layer, the first fully-connected layer and the second reconstruction layer to obtain a third feature vector;
the third feature vector sequentially passes through the plurality of hole convolution layers of the hole convolution network, which extract context information of the third feature vector in the time dimension, to obtain a fourth feature vector, wherein each hole convolution layer comprises a convolution kernel of size n × 1, n being a positive integer greater than 1;
and the fourth feature vector sequentially passes through the third reconstruction layer, the average pooling layer and the second fully-connected layer to obtain the target voiceprint feature vector of the target size.
7. The voiceprint recognition method according to any one of claims 1 to 6, wherein after the comparing the target voiceprint feature vector with the registered voiceprint feature vector, the method further comprises:
if no registered voiceprint feature vector matching the target voiceprint feature vector is found, prompting the current user to register the target voiceprint feature vector.
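Illustratively, the comparison and registration-prompt logic of claims 1 and 7 can be sketched as follows. The cosine-similarity threshold and the storage of registered voiceprint feature vectors in a plain dictionary are assumptions made only for this sketch.

import numpy as np

def identify_or_prompt(target_vec, registered, threshold=0.75):
    """
    target_vec: target voiceprint feature vector of the audio to be identified.
    registered: dict mapping user id -> registered voiceprint feature vector.
    Returns the matched user id, or None to indicate that the current user should register target_vec.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    best_user, best_sim = None, -1.0
    for user, vec in registered.items():
        sim = cosine(target_vec, vec)
        if sim > best_sim:
            best_user, best_sim = user, sim

    if best_user is None or best_sim < threshold:
        return None   # no registered voiceprint matches: prompt the user to register target_vec
    return best_user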
8. A voiceprint recognition apparatus comprising:
the audio feature vector acquisition unit is used for acquiring an audio feature vector of an audio to be identified, wherein the audio feature vector comprises a time dimension and a spectrum feature dimension, and one unit time in the time dimension corresponds to a group of spectrum feature information in the spectrum feature dimension;
the target neural network unit is used for inputting the audio feature vector into a target neural network to obtain a target voiceprint feature vector corresponding to the audio feature vector, wherein the target neural network consists of an SENet module, a hole convolution network and a fully-connected layer, and the hole convolution network comprises a plurality of hole convolution layers for extracting context information of the audio feature vector in a time dimension;
and the determining unit is used for comparing the target voiceprint feature vector with the registered voiceprint feature vector and determining the target user corresponding to the audio to be identified.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202010062402.XA 2020-01-19 2020-01-19 Voiceprint recognition method and device and terminal equipment Active CN113223536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010062402.XA CN113223536B (en) 2020-01-19 2020-01-19 Voiceprint recognition method and device and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010062402.XA CN113223536B (en) 2020-01-19 2020-01-19 Voiceprint recognition method and device and terminal equipment

Publications (2)

Publication Number Publication Date
CN113223536A true CN113223536A (en) 2021-08-06
CN113223536B CN113223536B (en) 2024-04-19

Family

ID=77085012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010062402.XA Active CN113223536B (en) 2020-01-19 2020-01-19 Voiceprint recognition method and device and terminal equipment

Country Status (1)

Country Link
CN (1) CN113223536B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170358306A1 (en) * 2016-06-13 2017-12-14 Alibaba Group Holding Limited Neural network-based voiceprint information extraction method and apparatus
CN107492382A (en) * 2016-06-13 2017-12-19 阿里巴巴集团控股有限公司 Voiceprint extracting method and device based on neutral net
US20190117087A1 (en) * 2017-10-25 2019-04-25 Terumo Kabushiki Kaisha Diagnostic Method, Method for Validation of Diagnostic Method, and Treatment Method
CN110010133A (en) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 Vocal print detection method, device, equipment and storage medium based on short text
CN110309880A (en) * 2019-07-01 2019-10-08 天津工业大学 A kind of 5 days and 9 days hatching egg embryo's image classification methods based on attention mechanism CNN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHENG-I LAI ET AL.: "《ASSERT:Anti-Spoofing with Squeeze-Excitation and Residual neTworks》", 《ARXIV:1904.01120V1》, pages 1 - 5 *
LEI FAN ET AL.: "《Semantic Segmentation With Global Encoding and Dilated Decoder in Street Scenes》", 《IEEE ACCESS》, vol. 6, pages 50333 - 50343 *
TAEJUN KIM ET AL.: "《Comparison and Analysis of SampleCNN Architectures for Audio Classification》", 《IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING》, vol. 13, no. 2, pages 285 - 297 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113697321A (en) * 2021-09-16 2021-11-26 安徽世绿环保科技有限公司 Garbage bag coding system for garbage classification station
CN113716246A (en) * 2021-09-16 2021-11-30 安徽世绿环保科技有限公司 Resident rubbish throwing traceability system
CN114780787A (en) * 2022-04-01 2022-07-22 杭州半云科技有限公司 Voiceprint retrieval method, identity verification method, identity registration method and device
CN116844553A (en) * 2023-06-02 2023-10-03 支付宝(杭州)信息技术有限公司 Data processing method, device and equipment
CN116741182B (en) * 2023-08-15 2023-10-20 中国电信股份有限公司 Voiceprint recognition method and voiceprint recognition device

Also Published As

Publication number Publication date
CN113223536B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
JP7152514B2 (en) Voiceprint identification method, model training method, server, and computer program
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN109584884B (en) Voice identity feature extractor, classifier training method and related equipment
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN107610707A (en) A kind of method for recognizing sound-groove and device
WO2019019256A1 (en) Electronic apparatus, identity verification method and system, and computer-readable storage medium
CN105096955B (en) A kind of speaker's method for quickly identifying and system based on model growth cluster
CN110556126B (en) Speech recognition method and device and computer equipment
CN108269575B (en) Voice recognition method for updating voiceprint data, terminal device and storage medium
WO2019136912A1 (en) Electronic device, identity authentication method and system, and storage medium
CN109801634A (en) A kind of fusion method and device of vocal print feature
CN108875463B (en) Multi-view vector processing method and device
CN110880329A (en) Audio identification method and equipment and storage medium
CN110265035B (en) Speaker recognition method based on deep learning
CN103794207A (en) Dual-mode voice identity recognition method
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN110111798B (en) Method, terminal and computer readable storage medium for identifying speaker
CN113823293B (en) Speaker recognition method and system based on voice enhancement
JP7160095B2 (en) ATTRIBUTE IDENTIFIER, ATTRIBUTE IDENTIFICATION METHOD, AND PROGRAM
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN108630208B (en) Server, voiceprint-based identity authentication method and storage medium
CN110188338B (en) Text-dependent speaker verification method and apparatus
CN113035230A (en) Authentication model training method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 516006 TCL science and technology building, No. 17, Huifeng Third Road, Zhongkai high tech Zone, Huizhou City, Guangdong Province

Applicant after: TCL Technology Group Co.,Ltd.

Address before: 516006 Guangdong province Huizhou Zhongkai hi tech Development Zone No. nineteen District

Applicant before: TCL Corp.

GR01 Patent grant