CN112614493B - Voiceprint recognition method, system, storage medium and electronic device - Google Patents
- Publication number
- CN112614493B (granted publication of application CN202011409154.8A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- voiceprint
- convolutional neural
- layer
- convolutional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G: Physics
- G10: Musical instruments; acoustics
- G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
- G10L17/00: Speaker identification or verification techniques
- G10L17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04: Training, enrolment or model building
- G10L17/06: Decision making techniques; pattern matching strategies
- G10L17/18: Artificial neural networks; connectionist approaches
- G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the extracted parameters being the cepstrum
Abstract
The application relates to the technical field of voiceprint recognition, and in particular to a voiceprint recognition method, a voiceprint recognition system, a storage medium and an electronic device, solving the problem in the related art that square convolution with a fixed receptive field leads to a poor final voiceprint recognition effect. The method comprises the following steps: extracting the voiceprint features to be verified from the voice information through a pre-trained convolutional neural network model, the model being obtained by training a convolutional neural network that comprises a deformable convolutional layer; comparing the similarity of the voiceprint features to be verified with the registered voiceprint features; and judging whether the similarity result is greater than a preset threshold value, voiceprint recognition being successful if it is. Because the voiceprint features are extracted through a convolutional neural network with an added deformable convolutional layer, the receptive field adapts to different voiceprint features, the resulting convolutional neural network model is more robust, and voiceprint recognition accuracy is improved.
Description
Technical Field
The present application relates to the field of voiceprint recognition technologies, and in particular, to a voiceprint recognition method, a voiceprint recognition system, a storage medium, and an electronic device.
Background
Voiceprint recognition is a technology for voice-based identity authentication and is one form of biometric recognition. Its fields of application are very wide and will continue to expand with the development of intelligent speech technology. In recent years, deep learning has become a hot topic in voiceprint recognition, and voiceprint recognition systems modeled with deep convolutional neural networks have shown large gains in recognition performance, benefiting from large amounts of labeled audio data.
In existing voiceprint recognition methods modeled with deep convolutional neural networks, the convolution kernel performs the convolution operation over a local region of the input features. With traditional square convolution, only voiceprint features within a fixed square region can be sampled, the receptive field cannot adapt to different voiceprint features, and the final voiceprint recognition effect is therefore poor.
Disclosure of Invention
In view of the above problems, the present application provides a voiceprint recognition method, system, storage medium, and electronic device, which solve the technical problem in the related art that a final voiceprint recognition effect is poor due to the square convolution with a fixed receptive field.
In a first aspect, the present application provides a voiceprint recognition method, including:
receiving voice information;
extracting the voiceprint features to be verified in the voice information through a pre-trained convolutional neural network model; the convolutional neural network model which is trained in advance is obtained by training a convolutional neural network comprising a deformable convolutional layer;
comparing the similarity of the voiceprint features to be verified and the registered voiceprint features which are registered in advance to obtain a similarity result;
and judging whether the similarity result is greater than a preset threshold value, and if the similarity result is greater than the preset threshold value, successfully identifying the voiceprint.
Optionally, the comparing the similarity between the voiceprint feature to be verified and the registered voiceprint feature that is registered in advance to obtain a similarity result includes:
and calculating the similarity between the voiceprint features to be verified and the registered voiceprint features which are registered in advance by a cosine calculation method to obtain a similarity result.
Optionally, the process of registering the voiceprint feature includes:
receiving registration voice information;
extracting registration voiceprint characteristics in the registration voice information through a pre-trained convolutional neural network model; the convolutional neural network model which is trained in advance is obtained by training a convolutional neural network comprising a deformable convolutional layer.
Optionally, the training process of the convolutional neural network model includes:
establishing a convolutional neural network; the convolutional neural network comprises a first convolutional layer, a first pooling layer, a deformable convolutional layer, a second pooling layer, a second convolutional layer and a full-connection layer which are sequentially arranged; the first convolutional layer comprises a first sub-convolutional layer and a second sub-convolutional layer, the deformable convolutional layer comprises a first sub-deformable convolutional layer and a second sub-deformable convolutional layer;
and training the convolutional neural network by taking the training voiceprint features which are marked in advance as input to obtain the convolutional neural network model.
Optionally, the training voiceprint feature is a mel-frequency cepstrum coefficient feature.
Optionally, the deformable convolution layer is configured to add an offset parameter to each element of the convolution kernel to obtain the adaptive receptive field.
Optionally, the first pooling layer and the second pooling layer are used to reduce feature size, enlarge the receptive field, and/or reduce the computational effort.
In a second aspect, the present application provides a voiceprint recognition system, the system comprising:
a receiving unit for receiving voice information;
the extracting unit is used for extracting the voiceprint features to be verified in the voice information through a pre-trained convolutional neural network model; the convolutional neural network model which is trained in advance is obtained by training a convolutional neural network comprising a deformable convolutional layer;
the comparison unit is used for comparing the similarity of the voiceprint features to be verified and the registered voiceprint features which are registered in advance to obtain a similarity result;
and the verification unit is used for judging whether the similarity result is greater than a preset threshold value or not, and if the similarity result is greater than the preset threshold value, the voiceprint recognition is successful.
In a third aspect, a storage medium stores a computer program executable by one or more processors; when executed, the computer program implements the voiceprint recognition method described in the first aspect above.
In a fourth aspect, an electronic device comprises a memory and a processor, the memory having a computer program stored thereon, the memory and the processor being communicatively connected to each other, the computer program, when executed by the processor, performing the voiceprint recognition method as described in the first aspect above.
The application provides a voiceprint recognition method, system, storage medium and electronic device. The method comprises: receiving voice information; extracting the voiceprint features to be verified from the voice information through a pre-trained convolutional neural network model, the pre-trained model being obtained by training a convolutional neural network that comprises a deformable convolutional layer; comparing the similarity of the voiceprint features to be verified with the pre-registered voiceprint features to obtain a similarity result; and judging whether the similarity result is greater than a preset threshold value, voiceprint recognition being successful if it is. Because the voiceprint features are extracted through a convolutional neural network with an added deformable convolutional layer, the receptive field adapts to different voiceprint features, the resulting convolutional neural network model is more robust, and voiceprint recognition accuracy is improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present invention; those skilled in the art can derive other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of a voiceprint recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a convolutional neural network provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a voiceprint recognition system according to an embodiment of the present application;
fig. 4 is a connection block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following detailed description is provided with reference to the accompanying drawings and embodiments, so that how the technical means are applied to solve the technical problems and achieve the corresponding technical effects can be fully understood and implemented. Provided there is no conflict, the embodiments of the present application and the various features within them can be combined with each other, and the resulting technical solutions all fall within the scope of protection of the present application.
As noted in the background, in existing voiceprint recognition methods modeled with deep convolutional neural networks, the convolution kernel performs the convolution operation over a local region of the input features. With traditional square convolution, only voiceprint features within a fixed square region can be sampled, and the receptive field cannot adapt to different voiceprint features, so the final voiceprint recognition effect is poor.
In view of this, the present application provides a voiceprint recognition method, system, storage medium and electronic device, which solve the technical problem in the related art that the final voiceprint recognition effect is poor due to the square convolution with a fixed receptive field.
Example one
Fig. 1 is a schematic flow chart of a voiceprint recognition method provided in an embodiment of the present application, and as shown in fig. 1, the method includes:
s101, receiving voice information;
s102, extracting voiceprint features to be verified in the voice information through a pre-trained convolutional neural network model;
in step S102, the pre-trained convolutional neural network model is trained by a convolutional neural network including a deformable convolutional layer.
S103, comparing the similarity of the voiceprint features to be verified and the registered voiceprint features which are registered in advance to obtain a similarity result;
and S104, judging whether the similarity result is greater than a preset threshold value, and if the similarity result is greater than the preset threshold value, successfully identifying the voiceprint.
It should be noted that, because the set of people the voiceprint recognition system must verify is not fixed, retraining the convolutional neural network model every time a person is added is impractical. The convolutional neural network model therefore serves as a feature extractor throughout the method and is not used as a classifier or recognizer.
Optionally, the comparing the similarity between the voiceprint feature to be verified and the registered voiceprint feature that is registered in advance to obtain a similarity result includes:
and calculating the similarity between the voiceprint features to be verified and the registered voiceprint features which are registered in advance by a cosine calculation method to obtain a similarity result.
It should be noted that the present invention is not limited to calculating the similarity between the voiceprint feature to be verified and the pre-registered voiceprint feature by the cosine method; other calculation methods may be used as needed, as long as the similarity between the two features is ultimately obtained.
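A minimal sketch of the cosine similarity comparison described above (the embedding values below are made up for illustration; in practice both vectors come from the convolutional neural network model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative embeddings (in practice these come from the trained model).
enrolled = np.array([0.2, 0.9, -0.4, 0.1])
probe = np.array([0.25, 0.85, -0.35, 0.05])

score = cosine_similarity(enrolled, probe)
print(round(score, 3))  # close to 1.0 for similar voiceprints
```

The similarity result is then compared against the preset threshold to decide whether recognition succeeds.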
Optionally, the process of registering the voiceprint feature includes:
receiving registration voice information;
extracting registration voiceprint characteristics in the registration voice information through a pre-trained convolutional neural network model; the convolutional neural network model which is trained in advance is obtained by training a convolutional neural network comprising a deformable convolutional layer.
It should be noted that, to handle growth in the number of people to be identified, a pre-registration approach may be adopted: the voiceprint features of newly added people are registered so that they can be used for similarity comparison in subsequent voiceprint identification.
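A minimal sketch of this registration-then-verification flow; `extract_voiceprint` is a hypothetical stand-in for the trained convolutional neural network model, and the threshold value is illustrative rather than taken from the patent:

```python
import numpy as np

THRESHOLD = 0.8  # illustrative preset threshold

def extract_voiceprint(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the pre-trained CNN feature extractor (returns unit vector)."""
    v = audio[:4].astype(float)
    return v / (np.linalg.norm(v) + 1e-9)

registry = {}  # speaker id -> registered voiceprint feature

def enroll(speaker_id: str, audio: np.ndarray) -> None:
    """Registration: extract and store the registration voiceprint feature."""
    registry[speaker_id] = extract_voiceprint(audio)

def verify(speaker_id: str, audio: np.ndarray) -> bool:
    """Verification: compare against the registered feature, apply threshold."""
    probe = extract_voiceprint(audio)
    score = float(np.dot(probe, registry[speaker_id]))  # cosine (unit vectors)
    return score > THRESHOLD

enroll("alice", np.array([1.0, 2.0, 3.0, 4.0]))
print(verify("alice", np.array([1.1, 2.0, 2.9, 4.1])))  # True: similar audio
```

Because enrollment only stores an embedding, new users can be added without retraining the model, matching the note above.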
Optionally, the training process of the convolutional neural network model includes:
establishing a convolutional neural network; the convolutional neural network comprises a first convolutional layer, a first pooling layer, a deformable convolutional layer, a second pooling layer, a second convolutional layer and a full-connection layer which are sequentially arranged; the first convolutional layer comprises a first sub-convolutional layer and a second sub-convolutional layer, the deformable convolutional layer comprises a first sub-deformable convolutional layer and a second sub-deformable convolutional layer;
and training the convolutional neural network by taking the training voiceprint features which are marked in advance as input to obtain the convolutional neural network model.
It should be noted that the first convolution layer is used for performing preliminary extraction on the input training voiceprint features to obtain the intermediate layer features, which is convenient for subsequent continuous processing of the deformable convolution layer and the like.
Specifically, fig. 2 shows the schematic structural diagram of the convolutional neural network provided in the embodiment of the present application. Training voiceprint features are input to the first convolutional layer, where feature extraction yields intermediate-layer features; these are then processed by each subsequent layer in turn and output through the fully connected layer.
Optionally, the training voiceprint feature is a mel-frequency cepstrum coefficient feature.
It should be noted that mel-frequency cepstral coefficients (MFCCs) are cepstral parameters extracted on the mel frequency scale, which describes the nonlinear frequency response of the human ear. Because these features make no assumptions about or place no restrictions on the input signal, and draw on results from auditory modeling research, they are more robust than LPCC features based on the vocal-tract model, better match the auditory characteristics of the human ear, and retain good recognition performance when the signal-to-noise ratio decreases.
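A minimal sketch of how MFCC features of this kind can be computed (frame length, hop size, filter count, and coefficient count below are common illustrative defaults, not values from the patent):

```python
import numpy as np

def hz_to_mel(f):
    # The mel scale models the nonlinear frequency response of the human ear.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC computation; all parameter defaults are illustrative."""
    # 1. Frame the signal and apply a Hamming window; take the power spectrum.
    frames = [signal[s:s + n_fft] * np.hamming(n_fft)
              for s in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.asarray(frames), n_fft)) ** 2 / n_fft

    # 2. Build a triangular mel filterbank and take log filterbank energies.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)

    # 3. DCT-II decorrelates the log energies; keep the first n_ceps coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_mels)))
    return log_mel @ dct.T

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
feats = mfcc(sig)
print(feats.shape)  # (number of frames, 13)
```

In practice a library such as librosa would typically be used; the sketch only makes the mel-scale and cepstrum steps mentioned above concrete.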
Optionally, the deformable convolution layer is configured to add an offset parameter to each element of the convolution kernel to obtain the adaptive receptive field.
It should be noted that after the training voiceprint features are input to the first convolutional layer, feature extraction yields intermediate-layer features, and at the same time a specific offset for each element of the convolution kernel is learned. When training reaches the deformable convolutional layer, an offset parameter is added to each kernel element; this offset parameter allows the sampling network's receptive field to adapt to the shape of the target under measurement, so that the most accurate features are obtained.
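The offset mechanism can be illustrated with a minimal numerical sketch in the style of deformable convolution, where a learned (dy, dx) shift is added to each kernel element's sampling point and fractional positions are resolved by bilinear interpolation; the feature map, kernel weights, and offsets below are made up for illustration:

```python
import numpy as np

def bilinear_sample(feat: np.ndarray, y: float, x: float) -> float:
    """Sample a 2-D feature map at a fractional (y, x) location."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def deformable_response(feat, weights, center, offsets):
    """One output of a 3x3 deformable convolution centered at `center`.

    `offsets` holds a learned (dy, dx) for each of the 9 kernel elements,
    shifting its sampling point away from the regular square grid.
    """
    cy, cx = center
    out = 0.0
    taps = [(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)]
    for i, (ky, kx) in enumerate(taps):
        dy, dx = offsets[i]
        out += weights[i] * bilinear_sample(feat, cy + ky + dy, cx + kx + dx)
    return out

feat = np.arange(25, dtype=float).reshape(5, 5)
weights = np.ones(9) / 9.0                # averaging kernel for clarity
zero = np.zeros((9, 2))                   # zero offsets: plain square conv
print(deformable_response(feat, weights, (2, 2), zero))     # 12.0, the 3x3 mean
shifted = np.full((9, 2), 0.5)            # every tap moved by (+0.5, +0.5)
print(deformable_response(feat, weights, (2, 2), shifted))  # 15.0, off-grid field
```

With zero offsets the result equals ordinary square convolution; nonzero offsets move the sampling points, which is how the receptive field adapts its shape.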
Optionally, the first pooling layer and the second pooling layer are used to reduce feature size, enlarge the receptive field, and/or reduce the computational effort.
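A minimal illustration of how a pooling layer reduces feature size so that each output position summarizes a larger input region, thereby enlarging the receptive field and reducing computation (the 2x2 window and example values are illustrative):

```python
import numpy as np

def max_pool2x2(x: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2: halves each spatial dimension."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return x.max(axis=(1, 3))

feat = np.arange(16.0).reshape(4, 4)
pooled = max_pool2x2(feat)
print(pooled.shape)  # (2, 2): each output summarizes a 2x2 input region
print(pooled)
```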
In summary, an embodiment of the present application provides a voiceprint recognition method, comprising: receiving voice information; extracting the voiceprint features to be verified from the voice information through a pre-trained convolutional neural network model, the pre-trained model being obtained by training a convolutional neural network that comprises a deformable convolutional layer; comparing the similarity of the voiceprint features to be verified with the pre-registered voiceprint features to obtain a similarity result; and judging whether the similarity result is greater than a preset threshold value, voiceprint recognition being successful if it is. Because the voiceprint features are extracted through a convolutional neural network with an added deformable convolutional layer, the receptive field adapts to different voiceprint features, the resulting convolutional neural network model is more robust, and voiceprint recognition accuracy is improved.
Example two
Based on the voiceprint recognition method disclosed in the above embodiment of the present invention, fig. 3 specifically discloses a voiceprint recognition system applying the voiceprint recognition method.
As shown in fig. 3, the embodiment of the present invention discloses a voiceprint recognition system, which includes:
a receiving unit 301, configured to receive voice information;
an extracting unit 302, configured to extract a voiceprint feature to be verified in the voice information through a pre-trained convolutional neural network model; the convolutional neural network model which is trained in advance is obtained by training a convolutional neural network comprising a deformable convolutional layer;
a comparing unit 303, configured to compare similarity between the voiceprint feature to be verified and a registered voiceprint feature that is registered in advance, to obtain a similarity result;
the verification unit 304 is configured to determine whether the similarity result is greater than a preset threshold, and if the similarity result is greater than the preset threshold, the voiceprint recognition is successful.
For the specific working processes of the receiving unit 301, the extracting unit 302, the comparing unit 303 and the verifying unit 304 in the voiceprint recognition system disclosed in the embodiment of the present invention, reference may be made to the corresponding contents in the voiceprint recognition method disclosed in the above embodiment of the present invention, and details are not repeated here.
In summary, an embodiment of the present application provides a voiceprint recognition system that: receives voice information; extracts the voiceprint features to be verified from the voice information through a pre-trained convolutional neural network model, the pre-trained model being obtained by training a convolutional neural network that comprises a deformable convolutional layer; compares the similarity of the voiceprint features to be verified with the pre-registered voiceprint features to obtain a similarity result; and judges whether the similarity result is greater than a preset threshold value, voiceprint recognition being successful if it is. Because the voiceprint features are extracted through a convolutional neural network with an added deformable convolutional layer, the receptive field adapts to different voiceprint features, the resulting convolutional neural network model is more robust, and voiceprint recognition accuracy is improved.
EXAMPLE III
The present embodiment further provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, or an app marketplace, on which a computer program is stored. When executed by a processor, the computer program implements the method steps of the first embodiment; details are not repeated in this embodiment.
Example four
Fig. 4 is a connection block diagram of an electronic device 500 according to an embodiment of the present application, and as shown in fig. 4, the electronic device 500 may include: a processor 501, a memory 502, a multimedia component 503, an input/output (I/O) interface 504, and a communication component 505.
The processor 501 is configured to execute all or part of the steps in the voiceprint recognition method according to the first embodiment. The memory 502 is used to store various types of data, which may include, for example, instructions for any application or method in the electronic device, as well as application-related data.
The processor 501 may be implemented by an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and is configured to perform the voiceprint recognition method of the first embodiment.
The memory 502 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The multimedia component 503 may include a screen, which may be a touch screen, and an audio component for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in a memory or transmitted through a communication component. The audio assembly also includes at least one speaker for outputting audio signals.
The I/O interface 504 provides an interface between the processor 501 and other interface modules, such as a keyboard, mouse, buttons, etc. These buttons may be virtual buttons or physical buttons.
The communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices. Wireless communication may use, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 505 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In summary, the present application provides a voiceprint recognition method, system, storage medium, and electronic device, where the method includes: receiving voice information; extracting the voiceprint features to be verified from the voice information through a pre-trained convolutional neural network model, the pre-trained model being obtained by training a convolutional neural network that comprises a deformable convolutional layer; comparing the similarity of the voiceprint features to be verified with the pre-registered voiceprint features to obtain a similarity result; and judging whether the similarity result is greater than a preset threshold value, voiceprint recognition being successful if it is. Because the voiceprint features are extracted through a convolutional neural network with an added deformable convolutional layer, the receptive field adapts to different voiceprint features, the resulting convolutional neural network model is more robust, and voiceprint recognition accuracy is improved.
In the embodiments provided in the present application, it should be understood that the disclosed method can be implemented in other ways. The above-described method embodiments are merely illustrative.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
Although the embodiments disclosed in the present application are described above, the above descriptions are only for the convenience of understanding the present application, and are not intended to limit the present application. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims.
Claims (9)
1. A voiceprint recognition method, the method comprising:
receiving voice information;
extracting the voiceprint features to be verified in the voice information through a pre-trained convolutional neural network model; the convolutional neural network model which is trained in advance is obtained by training a convolutional neural network comprising a deformable convolutional layer;
comparing the similarity of the voiceprint features to be verified and the registered voiceprint features which are registered in advance to obtain a similarity result;
determining whether the similarity result is greater than a preset threshold, wherein if the similarity result is greater than the preset threshold, the voiceprint recognition succeeds;
the training process of the convolutional neural network model comprises the following steps:
establishing a convolutional neural network, wherein the convolutional neural network comprises a first convolutional layer, a first pooling layer, a deformable convolutional layer, a second pooling layer, a second convolutional layer, and a fully connected layer arranged in sequence; the first convolutional layer comprises a first sub-convolutional layer and a second sub-convolutional layer; and the deformable convolutional layer comprises a first sub-deformable convolutional layer and a second sub-deformable convolutional layer; and
training the convolutional neural network with pre-labeled training voiceprint features as input to obtain the convolutional neural network model.
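The layer sequence recited in claim 1 can be traced with the standard convolution/pooling output-size formula. The kernel sizes, strides, padding, and the 64-unit input size below are illustrative assumptions — the claim does not specify these hyperparameters — so this is a sketch of how feature-map size evolves through the claimed sequence, not the patented configuration.

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Standard output-size formula shared by convolution and pooling layers.
    return (size + 2 * pad - kernel) // stride + 1

# Hypothetical hyperparameters for the claimed sequence:
# conv -> pool -> deformable conv -> pool -> conv (fully connected last).
layers = [
    ("first conv", 3, 1, 1),       # (name, kernel, stride, pad)
    ("first pool", 2, 2, 0),
    ("deformable conv", 3, 1, 1),  # offsets bend the sampling grid, not the size
    ("second pool", 2, 2, 0),
    ("second conv", 3, 1, 1),
]

size = 64  # assumed input feature-map side length
for name, k, s, p in layers:
    size = conv_out(size, k, s, p)
```

Each pooling stage halves the spatial size here, which is consistent with claim 6's size-reduction role; the two convolutions with padding 1 preserve size.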
2. The method according to claim 1, wherein comparing the similarity between the voiceprint feature to be verified and the pre-registered voiceprint feature to obtain a similarity result comprises:
calculating the similarity between the voiceprint features to be verified and the pre-registered voiceprint features using a cosine similarity calculation to obtain the similarity result.
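The cosine comparison of claim 2 can be sketched in a few lines. The feature vectors and the 0.75 threshold below are illustrative stand-ins — the patent leaves the embedding dimension and the preset threshold unspecified.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two feature vectors: 1.0 means the same
    # direction (very similar voiceprints), 0.0 means orthogonal features.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

probe = [0.2, 0.8, 0.1]     # hypothetical voiceprint feature to be verified
enrolled = [0.2, 0.8, 0.1]  # hypothetical pre-registered voiceprint feature
score = cosine_similarity(probe, enrolled)
accepted = score > 0.75     # assumed preset threshold per claim 1
```

Cosine similarity ignores vector magnitude, which is why it is a common choice for comparing speaker embeddings whose scale varies with utterance loudness and length.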
3. The method of claim 1, wherein the registration process for registering the voiceprint feature comprises:
receiving registration voice information;
extracting registration voiceprint features from the registration voice information through the pre-trained convolutional neural network model, wherein the pre-trained convolutional neural network model is obtained by training a convolutional neural network comprising a deformable convolutional layer.
4. The method of claim 1, wherein the training voiceprint features are mel-frequency cepstral coefficient features.
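Mel-frequency cepstral coefficients, as referenced in claim 4, are obtained by applying a discrete cosine transform to log mel filterbank energies. The sketch below shows only that final DCT step (the framing, FFT, and mel filterbank stages are omitted), using an unnormalized type-II DCT and hypothetical energy values.

```python
import math

def dct2(x):
    # Unnormalized type-II DCT: the step that turns log mel energies
    # into cepstral coefficients.
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

# Hypothetical log mel filterbank energies for one audio frame.
log_energies = [2.1, 2.0, 1.8, 1.5, 1.2, 1.0, 0.9, 0.8]
mfcc = dct2(log_energies)  # typically only the first ~13 coefficients are kept
```

A useful sanity check: a constant energy vector puts all its energy into coefficient 0, since the higher DCT basis functions are orthogonal to a constant.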
5. The method of claim 1, wherein the deformable convolutional layer is configured to add an offset parameter to each element of its convolution kernel to obtain an adaptive receptive field.
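The per-element offset idea of claim 5 can be sketched for a single output position, assuming a 3×3 kernel and bilinear interpolation at the fractional sampling positions (the standard way deformable convolution reads off-grid values). The toy feature map, weights, and offsets are all illustrative; in a real network the offsets are predicted by an auxiliary convolution over the same input.

```python
import math

def bilinear(img, y, x):
    # Sample img at a fractional (y, x) position with bilinear interpolation,
    # clamping coordinates to the image border.
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (img[y0][x0] * (1 - dy) * (1 - dx) + img[y0][x1] * (1 - dy) * dx +
            img[y1][x0] * dy * (1 - dx) + img[y1][x1] * dy * dx)

def deformable_conv_at(img, kernel, offsets, cy, cx):
    # One output position of a deformable convolution: each kernel element
    # (i, j) carries its own offset (dy, dx), so the sampling grid bends
    # instead of staying a rigid square -- the adaptive receptive field.
    k = len(kernel)
    r = k // 2
    out = 0.0
    for i in range(k):
        for j in range(k):
            dy, dx = offsets[i][j]
            out += kernel[i][j] * bilinear(img, cy + i - r + dy, cx + j - r + dx)
    return out

img = [[5 * i + j for j in range(5)] for i in range(5)]  # toy 5x5 feature map
kernel = [[1.0] * 3 for _ in range(3)]                   # illustrative weights
zero = [[(0.0, 0.0)] * 3 for _ in range(3)]              # reduces to plain conv
shift = [[(0.0, 1.0)] * 3 for _ in range(3)]             # grid bent one column right
```

With all offsets zero this reduces exactly to an ordinary 3×3 convolution, which makes the deformable layer a strict generalization of the standard convolutional layer.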
6. The method of claim 1, wherein the first pooling layer and the second pooling layer are configured to reduce feature size, enlarge the receptive field, and/or reduce computation.
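The size-reduction role described in claim 6 can be shown with a 2×2 max-pooling sketch. The window size and stride of 2 are assumed (the claim does not fix them), and the input values are illustrative.

```python
def max_pool_2x2(fm):
    # Halves each spatial dimension; each output keeps the strongest
    # activation in its 2x2 window. Because each later-layer cell now covers
    # twice the input area, the effective receptive field is enlarged and
    # subsequent layers do a quarter of the work.
    return [[max(fm[i][j], fm[i][j + 1], fm[i + 1][j], fm[i + 1][j + 1])
             for j in range(0, len(fm[0]) - 1, 2)]
            for i in range(0, len(fm) - 1, 2)]

fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
pooled = max_pool_2x2(fm)  # 4x4 feature map -> 2x2
```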
7. A voiceprint recognition system, said system comprising:
a receiving unit configured to receive voice information;
an extracting unit configured to extract the voiceprint features to be verified from the voice information through a pre-trained convolutional neural network model, wherein the pre-trained convolutional neural network model is obtained by training a convolutional neural network comprising a deformable convolutional layer;
a comparison unit configured to compare the similarity between the voiceprint features to be verified and pre-registered voiceprint features to obtain a similarity result; and
a verification unit configured to determine whether the similarity result is greater than a preset threshold, wherein if the similarity result is greater than the preset threshold, the voiceprint recognition succeeds;
the training process of the convolutional neural network model comprises the following steps:
establishing a convolutional neural network, wherein the convolutional neural network comprises a first convolutional layer, a first pooling layer, a deformable convolutional layer, a second pooling layer, a second convolutional layer, and a fully connected layer arranged in sequence; the first convolutional layer comprises a first sub-convolutional layer and a second sub-convolutional layer; and the deformable convolutional layer comprises a first sub-deformable convolutional layer and a second sub-deformable convolutional layer; and
training the convolutional neural network with pre-labeled training voiceprint features as input to obtain the convolutional neural network model.
8. A storage medium storing a computer program executable by one or more processors to implement the voiceprint recognition method according to any one of claims 1 to 6.
9. An electronic device comprising a memory and a processor communicatively connected to each other, wherein the memory stores a computer program which, when executed by the processor, performs the voiceprint recognition method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011409154.8A | 2020-12-04 | 2020-12-04 | Voiceprint recognition method, system, storage medium and electronic device
Publications (2)
Publication Number | Publication Date |
---|---|
CN112614493A CN112614493A (en) | 2021-04-06 |
CN112614493B true CN112614493B (en) | 2022-11-11 |
Family
ID=75228922
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011409154.8A | Voiceprint recognition method, system, storage medium and electronic device | 2020-12-04 | 2020-12-04
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112614493B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113327604A (en) * | 2021-07-02 | 2021-08-31 | 因诺微科技(天津)有限公司 | Ultrashort speech language identification method |
CN114093370B (en) * | 2022-01-19 | 2022-04-29 | 珠海市杰理科技股份有限公司 | Voiceprint recognition method and device, computer equipment and storage medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107610709B (en) * | 2017-08-01 | 2021-03-19 | 百度在线网络技术(北京)有限公司 | Method and system for training voiceprint recognition model |
CN108564025A (en) * | 2018-04-10 | 2018-09-21 | 广东电网有限责任公司 | A kind of infrared image object identification method based on deformable convolutional neural networks |
CN108766445A (en) * | 2018-05-30 | 2018-11-06 | 苏州思必驰信息科技有限公司 | Method for recognizing sound-groove and system |
CN110047490A (en) * | 2019-03-12 | 2019-07-23 | 平安科技(深圳)有限公司 | Method for recognizing sound-groove, device, equipment and computer readable storage medium |
CN111368684A (en) * | 2020-02-27 | 2020-07-03 | 北华航天工业学院 | Winter wheat automatic interpretation method based on deformable full-convolution neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||