CN113470653A - Voiceprint recognition method, electronic equipment and system

Info

Publication number
CN113470653A
Authority
CN
China
Prior art keywords
neural network, voice signal, network model, level features, recognized
Legal status
Pending
Application number
CN202010247516.1A
Other languages
Chinese (zh)
Inventor
黄劲文
芦宇
李晓建
曾夕娟
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010247516.1A
Publication of CN113470653A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 Training, enrolment or model building
    • G10L 17/18 Artificial neural networks; Connectionist approaches
    • G10L 17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application provides a voiceprint recognition method, electronic equipment and a system, which can realize a distributed voiceprint recognition neural network. In the method, a first electronic device acquires a voice signal of a user through at least one microphone, inputs the voice signal into a preset first neural network model, acquires low-level features of the voice signal, and then outputs the low-level features to a second electronic device. The second electronic device may obtain the low-level features of the voice signal to be recognized of the user to be recognized from the first electronic device, input the low-level features of the voice signal to be recognized into a preset second neural network model, obtain the high-level features of the voice signal to be recognized, and then match the high-level features of the voice signal to be recognized with the voiceprint template of the registered user to determine whether the user is the registered user. The method, the electronic device and the system can be applied to the field of AI, and more particularly to the field of intelligent voiceprint recognition.

Description

Voiceprint recognition method, electronic equipment and system
Technical Field
The present application relates to the field of Artificial Intelligence (AI), and more particularly, to a method, electronic device, and system for voiceprint recognition.
Background
Voiceprint recognition is a technique that determines whether a given utterance was spoken by a particular person by analyzing the characteristics of one or more speech signals, thereby identifying unknown voices. Since every voice has unique characteristics, the voices of different people can be effectively distinguished by those characteristics. Voiceprint recognition includes two phases: voiceprint enrollment and voiceprint verification. In the voiceprint enrollment stage, the voice information of the enrollee is converted into a voiceprint template of the speaker. In the voiceprint verification stage, the verification voice is scored for similarity against the speaker's voiceprint template generated in the enrollment stage, and it is judged whether the verification voice comes from that speaker. In both stages, a voiceprint recognition algorithm is used to obtain the characteristic information of the speaker's voice.
For an end-side interaction device, limited computing resources make it impossible to deploy a voiceprint recognition algorithm model of high accuracy and complexity. One existing solution deploys the voiceprint recognition model in its entirety on a computationally powerful back-end device. However, in this scheme, when the end-side interaction device has a microphone array, the voiceprint recognition model cannot exploit the multi-channel data acquired by the array, so the far-field performance and noise performance of the speech signal acquired by the end-side interaction device cannot be improved, which affects the accuracy of voiceprint recognition.
Therefore, how to perform voiceprint recognition on the end-side interaction device is an urgent problem to be solved.
Disclosure of Invention
The application provides a voiceprint recognition method, electronic equipment and a system, and realizes a distributed voiceprint recognition neural network.
In a first aspect, a method for voiceprint recognition is provided, and the method is applied to a first electronic device or a chip in the first electronic device. In the method, a first electronic device acquires a voice signal of a user through at least one microphone, and then inputs the voice signal into a preset first neural network model to acquire low-level features of the voice signal, wherein the first neural network model is obtained by using a first training data sample set, and the first training data sample set comprises a plurality of voice data samples acquired by the at least one microphone. The low-level features are then output to a second electronic device.
Here, the low-level features refer to relatively redundant features of the voice signal, i.e., insufficiently refined features of the voice signal, such as edges, lines, corners, and the like of the input speech spectrum. That is, different speakers cannot be clearly distinguished according to the low-level features of the speech signal.
It can be understood that the low-level features are an intermediate variable obtained in the process of extracting the feature information of the speech signal, that is, the neural network model can form more abstract high-level features of the speech signal by further combining the low-level features. In the embodiment of the application, the high-level features are relatively accurate features of the voice signal extracted by the neural network, namely refined features of the voice signal. That is, different speakers can be more clearly distinguished according to the high-level features of a speech signal than according to the low-level features of the speech signal. In other words, the high-level features of the speech signal are feature information of the speech signal that can be used to generate the voiceprint template, or feature information that can be used to match the voiceprint template.
Therefore, in the embodiment of the application, the first neural network model is deployed on the first electronic device and extracts the low-level features of the voice signal acquired by the first electronic device through the at least one microphone. Because the first neural network model is adapted to the at least one microphone, when there are multiple microphones the model can use the multi-channel data they acquire to improve the far-field performance and noise-scene performance of the voice signal, which helps improve the accuracy of voiceprint recognition.
In addition, since the first neural network model only needs to acquire low-level features of the voice signal, it places low demands on the computing power of the electronic device, so the first electronic device on which it is deployed need not have abundant computing power.
With reference to the first aspect, in certain implementations of the first aspect, the first neural network model may be a two-dimensional (2D) convolutional neural network, and at least one input channel of the first neural network model has a one-to-one correspondence with the at least one microphone.
One implementation of inputting the voice signal into the preset first neural network model may be to input the voice signal acquired by each microphone of the at least one microphone into the input channel corresponding to that microphone.
In this way, the number of input channels of the 2D convolutional neural network is set equal to the number of the at least one microphone. When there are multiple microphones, the voice information acquired by each microphone is input into its own input channel of the 2D convolutional neural network, so the 2D convolutional neural network model is adapted to the number and arrangement of the microphones on the first electronic device. The multi-channel data acquired by the at least one microphone can then be used to improve the far-field performance and noise-scene performance of the voice, which helps improve the accuracy of voiceprint recognition.
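As an illustrative sketch only (the patent does not publish an implementation), the one-to-one mapping of microphones to 2D CNN input channels could be expressed in PyTorch as follows; the 6-microphone array, channel counts, and spectrogram dimensions are assumptions for the example.

```python
# A minimal sketch, assuming a 6-microphone array and spectrogram inputs;
# none of these sizes come from the patent itself.
import torch
import torch.nn as nn

NUM_MICS = 6  # hypothetical ring array of 6 microphones

class Lightweight2DFrontEnd(nn.Module):
    """Shallow front end: in_channels equals the microphone count, so
    input channel i carries the spectrogram captured by microphone i."""
    def __init__(self, num_mics: int = NUM_MICS):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=num_mics, out_channels=32,
                               kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_mics, freq_bins, time_frames)
        return self.act(self.conv1(x))  # low-level feature maps

# One utterance, 6 microphone channels, 64 mel bins, 200 frames.
low_level = Lightweight2DFrontEnd()(torch.randn(1, NUM_MICS, 64, 200))
print(low_level.shape)  # torch.Size([1, 32, 64, 200])
```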
With reference to the first aspect, in certain implementations of the first aspect, the first neural network model is a three-dimensional convolutional neural network, and a depth of a convolution kernel of the first neural network model is the same as the number of the at least one microphone.
In this way, the depth of the convolution kernel of the 3D convolutional neural network is set equal to the number of microphones. When there are multiple microphones, the convolution kernel in each channel of the 3D convolutional neural network performs its convolution operation on the voice information acquired by all of the microphones simultaneously, so the 3D convolutional neural network is adapted to the number and arrangement of the microphones of the first electronic device. The multi-channel data acquired by the at least one microphone can then be used to improve the far-field performance and noise-scene performance of the voice, which helps improve the accuracy of voiceprint recognition.
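A corresponding sketch of the 3D-convolution variant, under the same assumed sizes: the microphone axis is treated as the depth dimension of the input volume, and the kernel depth equals the number of microphones, so every output value is computed from all microphone channels at once.

```python
# A minimal sketch; the kernel depth (6) matches the assumed microphone count.
import torch
import torch.nn as nn

NUM_MICS = 6

conv3d = nn.Conv3d(in_channels=1, out_channels=32,
                   kernel_size=(NUM_MICS, 3, 3),  # depth x height x width
                   padding=(0, 1, 1))

# Input volume: (batch, channel=1, depth=num_mics, freq_bins, time_frames).
x = torch.randn(1, 1, NUM_MICS, 64, 200)
low_level = conv3d(x)
print(low_level.shape)  # torch.Size([1, 32, 1, 64, 200]); mic axis collapsed
```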
For example, because the first neural network model can be adapted to the plurality of microphones, it can analyze the direction and distance of the sound source from the voice signals received by the microphones and thereby obtain time information and/or intensity information of the voice signal; that is, the low-level features may include this time information and/or intensity information. The embodiment of the present application can therefore use the multi-channel data collected by the plurality of microphones to improve the far-field performance of the voice signal.
In addition, the first neural network model can filter sound waves by using the phase differences between the sound waves received by the microphones, removing environmental background sound and retaining the desired sound, so the multi-channel data acquired by the microphone array can be used to improve the noise-scene performance of the voice signal.
With reference to the first aspect, in certain implementations of the first aspect, the voice signal of the user includes a to-be-recognized voice signal or a registration voice signal.
In a second aspect, a method for voiceprint recognition is provided, which is applied to a second electronic device or a chip in the second electronic device. In the method, the second electronic device may obtain low-level features of a to-be-recognized voice signal of a to-be-recognized user from the first electronic device, input the low-level features of the to-be-recognized voice signal into a preset second neural network model, and obtain high-level features of the to-be-recognized voice signal, where the second neural network model is obtained by using a second training data sample set, and the second training data sample set includes a plurality of voice data samples obtained by at least one first electronic device. Then, the second electronic device matches the high-level features of the speech signal to be recognized with the voiceprint template of the registered user. If the high-level features match the voiceprint template, the user to be recognized is determined to be the registered user; if they do not match, the user to be recognized is determined not to be the registered user.
Here, the low-level features refer to relatively redundant features of the voice signal, i.e., insufficiently refined features of the voice signal, such as edges, lines, corners, and the like of the input speech spectrum. That is, different speakers cannot be clearly distinguished according to the low-level features of the speech signal.
It can be understood that the low-level features are an intermediate variable obtained in the process of extracting the feature information of the speech signal, that is, the neural network model can form more abstract high-level features of the speech signal by further combining the low-level features. In the embodiment of the application, the high-level features are relatively accurate features of the voice signal extracted by the neural network, namely refined features of the voice signal. That is, different speakers can be more clearly distinguished according to the high-level features of a speech signal than according to the low-level features of the speech signal. In other words, the high-level features of the speech signal are feature information of the speech signal that can be used to generate the voiceprint template, or feature information that can be used to match the voiceprint template.
Therefore, compared with the prior art in which the neural network model is deployed in its entirety on one electronic device, the embodiment of the application implements a distributed voiceprint recognition neural network: the neural network model is split into a lightweight first neural network model and a deep second neural network model, the lightweight first neural network model is deployed on a first electronic device with limited computing power (such as an end-side interaction device), and the deep second neural network model is deployed on a second electronic device with sufficient computing power (such as a back-end computing device).
Further, when the first electronic device includes a plurality of microphones, the lightweight first neural network model can be adapted to the microphone array, which helps the deep second neural network model at the back end perform deep learning on the low-level features of the multi-channel data collected by the plurality of microphones, improving the far-field performance and noise performance of the speech signal reflected in the high-level features and thereby helping to improve the accuracy of voiceprint recognition.
With reference to the second aspect, in some implementations of the second aspect, before matching the high-level features of the speech signal to be recognized with the voiceprint template of the registered user, the low-level features of the registration voice signal of the user may further be obtained from the first electronic device, and the high-level features of the registration voice signal may be obtained by inputting its low-level features into the second neural network model. Then, a voiceprint template of the registered user is generated based on the high-level features of the registration voice signal.
In the embodiment of the present application, when there are multiple microphones, the high-level features are obtained by the second neural network model performing deep learning on the low-level features of the multi-channel data acquired by those microphones, so they have better far-field performance and noise performance. As a result, the voiceprint template of the registered user generated in the voiceprint registration stage and the voice feature vector of the voice signal to be recognized in the voiceprint verification stage are more accurate than in the prior art, which helps improve the accuracy of voiceprint recognition.
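The enrollment and verification flow of this aspect can be sketched as follows. The cosine-similarity score, the averaging of enrollment embeddings into a template, and the 0.7 threshold are illustrative assumptions; the patent does not specify the matching algorithm, and `back_end` stands for the hypothetical second neural network model.

```python
# A minimal sketch, assuming `back_end` maps low-level features to a
# 1-D high-level embedding; scoring method and threshold are assumptions.
import torch
import torch.nn.functional as F

def enroll(back_end, enrollment_low_level_feats):
    """Average the high-level features of several enrollment utterances
    into a single voiceprint template."""
    embs = [F.normalize(back_end(f), dim=-1) for f in enrollment_low_level_feats]
    return torch.stack(embs).mean(dim=0)

def verify(back_end, template, low_level_feat, threshold=0.7):
    """Match a to-be-recognized utterance against the template."""
    emb = F.normalize(back_end(low_level_feat), dim=-1)
    score = F.cosine_similarity(emb, template, dim=-1)
    return bool((score > threshold).item())  # True: treated as registered user
```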
In a third aspect, a method for voiceprint recognition is provided, in which a first electronic device acquires a to-be-recognized voice signal of a user through at least one microphone, inputs the to-be-recognized voice signal into a preset first neural network model, and acquires low-level features of the to-be-recognized voice signal, where the first neural network model is obtained by using a first training data sample set, and the first training data sample set includes a plurality of voice data samples acquired by the at least one microphone. Then, the first electronic device outputs the low-level features of the speech signal to be recognized to the second electronic device.
In the method, a second electronic device obtains low-level features of a voice signal to be recognized from a first electronic device, inputs the low-level features of the voice signal to be recognized into a preset second neural network model, and obtains high-level features of the voice signal to be recognized, wherein the second neural network model is obtained by using a second training data sample set, and the second training data sample set comprises a plurality of voice data samples obtained by at least one first electronic device. The second electronic device then matches the high-level features of the speech signal to be recognized with the voiceprint template of the registered user. And if the high-level features are matched with the voiceprint template, the second electronic equipment determines that the user to be identified is the registered user, and if the high-level features are not matched with the voiceprint template, the second electronic equipment determines that the user to be identified is not the registered user.
Therefore, compared with the prior art in which the neural network model is deployed in its entirety on one electronic device, the embodiment of the application implements a distributed voiceprint recognition neural network: the neural network model is split into a lightweight first neural network model and a deep second neural network model, the lightweight first neural network model is deployed on a first electronic device with limited computing power (such as an end-side interaction device), and the deep second neural network model is deployed on a second electronic device with sufficient computing power (such as a back-end computing device). Because the first neural network model only needs to acquire low-level features of the speech signal, it places low demands on the computing power of the electronic device, so the first electronic device on which it is deployed need not have abundant computing power.
Further, when the first electronic device includes a plurality of microphones, the lightweight first neural network model can be adapted to the microphone array, which helps the deep second neural network model at the back end perform deep learning on the low-level features of the multi-channel data collected by the plurality of microphones, improving the far-field performance and noise performance of the speech signal reflected in the high-level features and thereby helping to improve the accuracy of voiceprint recognition.
With reference to the third aspect, in certain implementations of the third aspect, the first neural network model is a two-dimensional convolutional neural network, and at least one input channel of the first neural network model has a one-to-one correspondence with the at least one microphone.
One implementation manner of inputting the speech signal to be recognized into the preset first neural network model is to input the speech signal to be recognized acquired by each microphone of the at least one microphone into an input channel corresponding to each microphone.
In this way, the number of input channels of the 2D convolutional neural network is set equal to the number of the at least one microphone. When there are multiple microphones, the voice information acquired by each microphone is input into its own input channel of the 2D convolutional neural network, so the 2D convolutional neural network model is adapted to the number and arrangement of the microphones on the first electronic device. The multi-channel data acquired by the at least one microphone can then be used to improve the far-field performance and noise-scene performance of the voice, which helps improve the accuracy of voiceprint recognition.
With reference to the third aspect, in certain implementations of the third aspect, the first neural network model is a three-dimensional convolutional neural network, and a depth of a convolution kernel of the first neural network model is the same as the number of the at least one microphone.
In this way, the depth of the convolution kernel of the 3D convolutional neural network is set equal to the number of microphones. When there are multiple microphones, the convolution kernel in each channel of the 3D convolutional neural network performs its convolution operation on the voice information acquired by all of the microphones simultaneously, so the 3D convolutional neural network is adapted to the number and arrangement of the microphones of the first electronic device. The multi-channel data acquired by the at least one microphone can then be used to improve the far-field performance and noise-scene performance of the voice, which helps improve the accuracy of voiceprint recognition.
With reference to the third aspect, in some implementations of the third aspect, before the second electronic device matches the high-level features of the speech signal to be recognized with the voiceprint template of the registered user, the first electronic device may further acquire the registration voice signal of the user through the at least one microphone and input the registration voice signal into the first neural network model to acquire the low-level features of the registration voice signal. Then, the first electronic device outputs the low-level features of the registration voice signal to the second electronic device.
The second electronic device acquires the low-level features of the registration voice signal from the first electronic device, inputs the low-level features of the registration voice signal into the second neural network model, and acquires the high-level features of the registration voice signal. Then, the second electronic device generates a voiceprint template of the registered user according to the high-level features of the registration voice signal.
In the embodiment of the present application, when there are multiple microphones, the high-level features are obtained by the second neural network model performing deep learning on the low-level features of the multi-channel data acquired by those microphones, so they have better far-field performance and noise performance. As a result, the voiceprint template of the registered user generated in the voiceprint registration stage and the voice feature vector of the voice signal to be recognized in the voiceprint verification stage are more accurate than in the prior art, which helps improve the accuracy of voiceprint recognition.
In a fourth aspect, an electronic device is provided that includes at least one microphone, a processor, and an output interface.
At least one microphone for acquiring a voice signal of a user.
And a processor, configured to input the voice signal into a preset first neural network model, and obtain low-level features of the voice signal, where the first neural network model is obtained by using a first training data sample set, and the first training data sample set includes a plurality of voice data samples obtained by the at least one microphone.
And the output interface is used for outputting the low-level features to the second electronic device.
Here, the low-level features refer to relatively redundant features of the voice signal, i.e., insufficiently refined features of the voice signal, such as edges, lines, corners, and the like of the input speech spectrum. That is, different speakers cannot be clearly distinguished according to the low-level features of the speech signal.
It can be understood that the low-level features are an intermediate variable obtained in the process of extracting the feature information of the speech signal, that is, the neural network model can form more abstract high-level features of the speech signal by further combining the low-level features. In the embodiment of the application, the high-level features are relatively accurate features of the voice signal extracted by the neural network, namely refined features of the voice signal. That is, different speakers can be more clearly distinguished according to the high-level features of a speech signal than according to the low-level features of the speech signal. In other words, the high-level features of the speech signal are feature information of the speech signal that can be used to generate the voiceprint template, or feature information that can be used to match the voiceprint template.
Therefore, in the embodiment of the application, the first neural network model is deployed on the electronic device and extracts the low-level features of the voice signal acquired by the electronic device through the at least one microphone. Because the first neural network model is adapted to the at least one microphone, when there are multiple microphones the model can use the multi-channel data they acquire to improve the far-field performance and noise-scene performance of the voice signal, which helps improve the accuracy of voiceprint recognition.
In addition, since the first neural network model only needs to acquire low-level features of the voice signal, it places low demands on the computing power of the electronic device, so the electronic device on which it is deployed need not have abundant computing power.
With reference to the fourth aspect, in some implementations of the fourth aspect, the first neural network model is a two-dimensional convolutional neural network, and at least one input channel of the first neural network model corresponds to at least one microphone one to one.
The processor is specifically configured to input the voice signal acquired by each of the at least one microphone into the input channel corresponding to each microphone.
In this way, the number of input channels of the 2D convolutional neural network is set equal to the number of the at least one microphone. When there are multiple microphones, the voice information acquired by each microphone is input into its own input channel of the 2D convolutional neural network, so the 2D convolutional neural network model is adapted to the number and arrangement of the microphones on the electronic device. The multi-channel data acquired by the at least one microphone can then be used to improve the far-field performance and noise-scene performance of the voice, which helps improve the accuracy of voiceprint recognition.
With reference to the fourth aspect, in some implementations of the fourth aspect, the first neural network model is a three-dimensional convolutional neural network, and a depth of a convolution kernel of the first neural network model is the same as the number of the at least one microphone.
In this way, the depth of the convolution kernel of the 3D convolutional neural network is set equal to the number of microphones. When there are multiple microphones, the convolution kernel in each channel of the 3D convolutional neural network performs its convolution operation on the voice information acquired by all of the microphones simultaneously, so the 3D convolutional neural network is adapted to the number and arrangement of the microphones of the electronic device. The multi-channel data acquired by the at least one microphone can then be used to improve the far-field performance and noise-scene performance of the voice, which helps improve the accuracy of voiceprint recognition.
With reference to the fourth aspect, in some implementations of the fourth aspect, the voice signal of the user includes a voice signal to be recognized or a registration voice signal.
In a fifth aspect, an electronic device is provided that includes an input interface and a processor.
The input interface is used for acquiring the low-level features of the voice signal to be recognized of the user to be recognized from the first electronic device.
And the processor is used for inputting the low-level features of the voice signal to be recognized into a preset second neural network model and acquiring the high-level features of the voice signal to be recognized, wherein the second neural network model is obtained by utilizing a second training data sample set, and the second training data sample set comprises a plurality of voice data samples acquired by at least one first electronic device.
The processor is further configured to match the high-level feature of the speech signal to be recognized with a voiceprint template of a registered user, determine that the user to be recognized is the registered user if the high-level feature is matched with the voiceprint template, and determine that the user to be recognized is not the registered user if the high-level feature is not matched with the voiceprint template.
Here, the low-level features refer to relatively redundant features of the voice signal, i.e., insufficiently refined features of the voice signal, such as edges, lines, corners, and the like of the input speech spectrum. That is, different speakers cannot be clearly distinguished according to the low-level features of the speech signal.
It can be understood that the low-level features are an intermediate variable obtained in the process of extracting the feature information of the speech signal, that is, the neural network model can form more abstract high-level features of the speech signal by further combining the low-level features. In the embodiment of the application, the high-level features are relatively accurate features of the voice signal extracted by the neural network, namely refined features of the voice signal. That is, different speakers can be more clearly distinguished according to the high-level features of a speech signal than according to the low-level features of the speech signal. In other words, the high-level features of the speech signal are feature information of the speech signal that can be used to generate the voiceprint template, or feature information that can be used to match the voiceprint template.
Therefore, compared with the prior art in which the neural network model is deployed in its entirety on one electronic device, the embodiment of the application implements a distributed voiceprint recognition neural network: the neural network model is split into a lightweight first neural network model and a deep second neural network model, the lightweight first neural network model is deployed on a first electronic device with limited computing power (such as an end-side interaction device), and the deep second neural network model is deployed on an electronic device with sufficient computing power (such as a back-end computing device).
Further, when the first electronic device includes a plurality of microphones, the lightweight first neural network model can be adapted to the microphone array, which helps the deep second neural network model at the back end perform deep learning on the low-level features of the multi-channel data collected by the plurality of microphones, improving the far-field performance and noise performance of the speech signal reflected in the high-level features and thereby helping to improve the accuracy of voiceprint recognition.
With reference to the fifth aspect, in some implementations of the fifth aspect, the input interface is further configured to obtain a low-level feature of a registration voice signal of the user from the first electronic device.
The processor is further configured to input the low-level features of the registration voice signal into the second neural network model and obtain the high-level features of the registration voice signal;
the processor is further configured to generate a voiceprint template of the registered user based on the high-level features of the registration voice signal.
In the embodiment of the present application, when there are multiple microphones, the high-level features are obtained by the second neural network model performing deep learning on the low-level features of the multi-channel data acquired by those microphones, so they have better far-field performance and noise performance. As a result, the voiceprint template of the registered user generated in the voiceprint registration stage and the voice feature vector of the voice signal to be recognized in the voiceprint verification stage are more accurate than in the prior art, which helps improve the accuracy of voiceprint recognition.
In a sixth aspect, a system for voiceprint recognition is provided, which includes the electronic device in the fourth aspect or any possible implementation of the fourth aspect and the electronic device in the fifth aspect or any possible implementation of the fifth aspect.
In a seventh aspect, an embodiment of the present application provides an electronic device, configured to execute the method in the first aspect or any possible implementation manner of the first aspect, and in particular, the electronic device includes a module configured to execute the method in the first aspect or any possible implementation manner of the first aspect.
In an eighth aspect, an embodiment of the present application provides an electronic device, configured to execute the method in the second aspect or any possible implementation manner of the second aspect, and specifically, the electronic device includes a module configured to execute the method in the second aspect or any possible implementation manner of the second aspect.
In a ninth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method in the first aspect or any possible implementation manner of the first aspect.
In a tenth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method in the second aspect or any possible implementation manner of the second aspect.
In an eleventh aspect, embodiments of the present application provide a computer-readable medium for storing a computer program comprising instructions for executing the method in the first aspect or any possible implementation manner of the first aspect, or in the second aspect or any possible implementation manner of the second aspect.
In a twelfth aspect, embodiments of the present application further provide a computer program product containing instructions which, when run on a computer, cause the computer to execute the method in the first aspect or any possible implementation manner of the first aspect, or in the second aspect or any possible implementation manner of the second aspect.
Drawings
FIG. 1 is a schematic diagram of a network architecture suitable for use in the present application;
FIG. 2 is a schematic diagram of a system for voiceprint recognition provided by an embodiment of the present application;
FIG. 3 shows three examples of microphone arrays;
FIG. 4 is an example of the pickup beam area of a ring microphone array containing 6 microphones;
FIG. 5 is a specific example of convolution of a 2D CNN model;
FIG. 6 is a specific example of convolution of a 3D CNN model;
FIG. 7 is a schematic flow chart diagram of a method for voiceprint recognition provided by an embodiment of the present application;
FIG. 8 is an example of a display interface of the end-side interaction device in an embodiment of the present application;
FIG. 9 is another example of a display interface of the end-side interaction device in an embodiment of the present application;
FIG. 10 is an example of an interface provided by an embodiment of the present application for entering user speech;
FIG. 11 is another example of an interface provided by an embodiment of the present application for entering user speech;
FIG. 12 is a schematic block diagram of an electronic device provided in an embodiment of the present application;
FIG. 13 is a schematic block diagram of another electronic device provided by embodiments of the present application;
FIG. 14 is a schematic block diagram of a system for voiceprint recognition provided by an embodiment of the present application;
FIG. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
First, an application scenario of the present application is introduced. FIG. 1 is a schematic diagram of a network architecture suitable for the present application. As shown in fig. 1, the network architecture may include an end-side interaction device and a back-end computing device.
Illustratively, the end-side interaction device may comprise at least one computationally limited electronic device. The electronic device with limited computing power is, for example, an intelligent vehicle-mounted device, a wearable device (such as smart glasses, a smart watch, and the like), an intelligent home device (such as a smart sound box, a coffee machine, a printer, and the like), and the embodiment of the present application does not limit this.
For example, the back-end computing device may be a cloud server with sufficient computing power, an intelligent portable device, a home computing center, or the like, which is not limited in the embodiment of the present application. The intelligent portable device is, for example, a powerful electronic device such as a mobile phone, a tablet computer, a notebook computer, or a palm computer. The home computing center is, for example, a powerful electronic device such as a mobile phone, a tablet computer, a notebook computer, a palm computer, a television, or a router. Neither is limited in the embodiment of the present application.
It should be noted that the end-side interaction device in fig. 1 is only an example, and the end-side interaction device to which the present application applies is not limited thereto; for example, it may also be an electronic device in an internet of things (IoT) system. In addition, the back-end computing device in fig. 1 is only an example, and the back-end computing device to which the present application applies is not limited thereto; it may be, for example, a mobile internet device or the like.
As a specific example, when the system architecture shown in fig. 1 is applied to a home scenario, the end-side interaction device may be an end-side device such as a smart speaker or a smart home device, and the back-end computing device may be a home computing center, for example a powerful mobile phone, television, or router, or a powerful cloud device such as a cloud server, which is not limited in this embodiment of the present application.
For another specific example, when the system architecture shown in fig. 1 is applied to a personal wearing scene, the end-side interaction device is a personal wearing device with limited computing power, such as an end-side device like a smart band, a smart watch, a smart headset, and smart glasses, and the back-end computing device may be a portable device with sufficient computing power, such as a mobile phone, which is not limited in this embodiment of the present application.
For example, the end-side interaction device and the back-end computing device may be connected through a wireless network or through a Bluetooth pairing connection, which is not limited in this embodiment of the application.
It should be noted that, a plurality of electronic devices shown in the embodiments of the present application are for better and more comprehensive description of the embodiments of the present application, but should not cause any limitation to the embodiments of the present application.
In the following, related terms related to the embodiments of the present application are described.
1. Voiceprint recognition mainly comprises two processes: voiceprint registration and voiceprint confirmation/identification. In the voiceprint registration stage, feature extraction is performed on a registration voice signal of a user to obtain feature information of the registration voice signal, and the feature information is trained to obtain a voiceprint template of the user. This user may then be referred to as a registered user. In the voiceprint confirmation/identification stage, the feature information of the voice signal to be recognized of an unknown speaker is obtained and then matched against the known voiceprint template obtained in the voiceprint registration stage. The voiceprint confirmation/identification stage may also be referred to as the voiceprint verification stage.
Voiceprint confirmation is speaker verification, used to judge whether an unknown speaker is a specified person. Voiceprint identification is speaker identification, used to judge which of the known, enrolled speakers an unknown speaker is.
As an example, the feature information of the registration voice signal or the voice signal to be recognized may be acquired by a neural network model.
2. Neural network models, which may also be referred to as artificial neural networks, are mathematical models that employ structures similar to brain neurosynaptic connections for information processing. The neural network model is formed by a large number of nodes (or called neurons) which are connected with each other. Each node represents a particular output function, called the excitation function. Every connection between two nodes represents a weighted value, called weight, for the signal passing through the connection, which is equivalent to the memory of the artificial neural network. The output of the network is different according to the connection mode of the network, the weight value and the excitation function. The network itself is usually an approximation to some algorithm or function in nature, and may also be an expression of a logic strategy.
Generally, a neural network model includes an input layer, an output layer, and a hidden layer. In the input layer, many neurons accept a large amount of nonlinear input information, called the input vector. In the output layer, information is transmitted, analyzed, and weighed across the neuron links to form an output result, called the output vector. The hidden layer is a layer of neurons and links between the input layer and the output layer; there may be one or more hidden layers. The number of nodes (neurons) in a hidden layer is not fixed, but the more nodes there are, the more pronounced the nonlinearity of the neural network, and hence the more pronounced its robustness (the ability of a system to maintain a certain performance under perturbation of parameters such as structure or size).
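As a minimal sketch of this input/hidden/output structure (all layer sizes are arbitrary examples, not values from this application):

```python
import torch.nn as nn

# One hidden layer of 16 neurons between a 10-dimensional input vector
# and a 2-dimensional output vector; the Linear weights play the role of
# the connection weights, and ReLU is the excitation function.
mlp = nn.Sequential(
    nn.Linear(10, 16),  # input layer -> hidden layer
    nn.ReLU(),          # excitation (activation) function
    nn.Linear(16, 2),   # hidden layer -> output layer
)
```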
3. A lightweight neural network may also be referred to as a shallow (or mild) neural network. When the number of hidden layers in a neural network model is small, such as one or two, the network may be called a lightweight neural network. The output layer of a lightweight neural network outputs low-level features of the input information, where the input information is, for example, images, sounds, or text.
Here, low-level features are the relatively redundant, insufficiently refined features of the input information extracted by the neural network. That is, different objects cannot be clearly distinguished by low-level features alone. Examples of low-level features are the edges, lines, and corners of an input image.
4. A neural network model may be referred to as a deep neural network when its number of hidden layers is large, for example three or more. A deep neural network can use its multilayer structure to extract and screen the input information layer by layer to realize representation learning. The output layer of a deep neural network outputs high-level features of the data; that is, the deep neural network combines low-level features to form more abstract high-level representations of attribute classes or features. Here, the low-level features can be regarded as intermediate variables produced by the deep neural network model while acquiring the high-level features.
Wherein, the high-level feature refers to a relatively accurate feature of the input information extracted by the neural network, namely a refined feature. That is, different objects can be more clearly distinguished according to the high-level features of the input information than according to the low-level features of the input information.
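This relationship between low-level and high-level features is what makes the split described later in this application possible: the first layer or two of a deep model produce the intermediate low-level features, and the remaining layers refine them into high-level features. A sketch under assumed layer sizes (not the patent's actual architecture):

```python
# A minimal sketch: slice one deep model into a shallow front end and a
# deep back end; all layer sizes are assumptions for illustration.
import torch
import torch.nn as nn

deep_model = nn.Sequential(
    nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),   # shallow part: low-level
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),  # deeper layers ...
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 128),                          # high-level embedding
)
front_end = deep_model[:2]  # would run on the end-side device
back_end = deep_model[2:]   # would run on the back-end device

x = torch.randn(1, 6, 64, 200)  # assumed 6-mic spectrogram input
low = front_end(x)              # low-level features (sent over the link)
high = back_end(low)            # 128-dim high-level features
print(low.shape, high.shape)
```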
5. Convolutional neural networks (CNNs) are feedforward artificial neural networks whose artificial neurons respond to units within their receptive field; they are well suited to large-scale image processing. A convolutional neural network includes convolutional layers and pooling layers.
A CNN consists of one or more convolutional layers topped by a fully connected layer (corresponding to a classical neural network). A convolutional layer is composed of a number of convolution units, and the parameters of each convolution unit are optimized by a back-propagation algorithm. The purpose of the convolution operation is to extract different features of the input. For example, a first convolutional layer may only extract some low-level features, while a network with more layers can iteratively extract more complex, higher-level features from the low-level ones.
The convolutional neural network comprises a one-dimensional convolutional neural network, a two-dimensional convolutional neural network, a three-dimensional convolutional neural network and the like. The convolution kernel corresponding to the one-dimensional convolution neural network is a one-dimensional convolution kernel, the convolution kernel corresponding to the two-dimensional convolution neural network is a two-dimensional convolution kernel, and the convolution kernel corresponding to the three-dimensional convolution neural network is a three-dimensional convolution kernel.
A convolution kernel is the set of parameters used in a convolutional neural network to extract features from a local region. Illustratively, a one-dimensional convolution kernel has a length dimension; a two-dimensional convolution kernel has height and width dimensions; and a three-dimensional convolution kernel has depth, height, and width dimensions. The depth of a three-dimensional convolution kernel may also be referred to as its first dimension, which is not limited in the embodiments of the present application.
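For illustration, the three kernel dimensionalities map onto the standard convolution modules as follows (channel counts and kernel sizes are arbitrary examples):

```python
import torch.nn as nn

conv1d = nn.Conv1d(1, 8, kernel_size=5)          # length only
conv2d = nn.Conv2d(1, 8, kernel_size=(3, 3))     # height x width
conv3d = nn.Conv3d(1, 8, kernel_size=(6, 3, 3))  # depth x height x width
print(conv3d.weight.shape)  # torch.Size([8, 1, 6, 3, 3])
```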
Fig. 2 is a schematic diagram illustrating a system 100 for voiceprint recognition according to an embodiment of the present application. As shown in FIG. 2, the system 100 includes an end-side interaction device 110 and a back-end computing device 120. Here, the end-side interaction device 110 may be an example of a first electronic device, and the back-end computing device 120 may be an example of a second electronic device.
In the embodiment of the present application, the end-side interaction device 110 may include a microphone component 111 and a first neural network model 112, and the back-end computing device 120 may include a second neural network model 121 and a voiceprint recognition module 122.
The microphone assembly 111 is used for receiving a sound signal of a user and converting the sound signal into an electric signal, i.e., a voice signal. The microphone assembly 111 includes an energy conversion device for converting an acoustic signal into an electrical signal. Illustratively, during a voiceprint enrollment phase, the microphone component 111 obtains an enrollment voice signal of the user, and during a voiceprint verification phase, the microphone component 111 obtains a to-be-recognized voice signal of the to-be-recognized user or tester.
Illustratively, the microphone assembly 111 may be a single microphone, or an array of microphones. A microphone array, i.e. an arrangement of microphones, is composed of a certain number (e.g. at least two) of microphones for sampling and processing the spatial characteristics of the sound field. For example, the plurality of microphones in the microphone array may be arranged according to some rules, such as a circular array, a rectangular array, or a linear array, or others, which is not limited in this application.
Fig. 3 shows 3 examples of microphone arrays: (a) an annular microphone array comprising 3 microphones, (b) an annular microphone array comprising 6 microphones, and (c) a linear microphone array comprising 6 microphones. It should be understood that fig. 3 is only an example, and the microphone array in the embodiment of the present application may also be any microphone array other than those in fig. 3, which is not limited in the embodiment of the present application.
It will be appreciated that when the microphone assembly includes multiple microphones, the time information and/or intensity information of the speech signal received by each microphone differs because each microphone occupies a different spatial location. Referring to fig. 4, an example of the pickup beam area of a circular microphone array containing 6 microphones is shown. As shown in fig. 4, taking the rightmost microphone d in the annular array as an example, its pickup beam area is the sector directly facing it. Within this pickup beam area, the sound from a source is strongest as received by microphone d and is regarded as the target sound signal. The region outside this sector is the sound suppression area of microphone d: sound arriving from there is treated as ambient noise, and its intensity can be considered small. Thus, the strength of the sound signal received by each microphone in the array is related to the direction from which it arrives, and to some extent the two can be considered equivalent expressions of the characteristics of the sound signal.
It should be noted that, in fig. 4, the principle of receiving the voice information by the microphones other than the microphone d in the microphone array is similar to the principle of receiving the voice information by the microphone d, and reference may be made to the description of the microphone d, and for brevity, the description is not repeated here.
Optionally, the end-side interaction device 110 may further include a signal processing module, configured to perform signal processing on the voice signal acquired by the microphone component 111, for example, voice activation detection, voice noise reduction processing, dereverberation processing, and the like, which is not limited in this embodiment of the application.
In some embodiments, the speech signal may be represented by a spectrogram. Specifically, a spectrogram is an imaged representation of a speech signal: the abscissa represents time, the ordinate represents frequency, and each coordinate point represents speech energy. Since three dimensions of information are drawn on a two-dimensional plane, the magnitude of the energy value is expressed by color; for example, the darker the color at a coordinate point, the greater the energy at that point. A spectrogram can therefore show how the amplitude of each frequency component of the sound signal, and its energy, vary with time.
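A sketch of producing such a spectrogram from a waveform; the 16 kHz sampling rate, FFT size, and hop length are assumed example values, not parameters from this application:

```python
import torch

waveform = torch.randn(16000)  # 1 s of audio at an assumed 16 kHz rate
spec = torch.stft(waveform, n_fft=512, hop_length=160,
                  window=torch.hann_window(512), return_complex=True)
energy = spec.abs() ** 2  # freq x time energy, typically rendered as color
print(energy.shape)       # torch.Size([257, 101])
```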
In the embodiment of the present application, the input of the first neural network model 112 is the voice signal collected by the microphone component 111, and the output is the low-level features of that voice signal. Illustratively, a spectrogram of the speech signal may be input into the first neural network model.
The first neural network model 112 is obtained by using a first training data sample set, which includes a plurality of voice data samples collected by the microphone component 111. That is, by learning the plurality of voice data samples collected by the microphone component 111, the parameters of the first neural network model 112 can be updated, making the model's algorithm more accurate.
Here, the low-level features refer to relatively redundant features of the voice signal, i.e., features that are not yet sufficiently refined, such as edges, lines, and corners of the input spectrogram. That is, different speakers cannot be clearly distinguished from the low-level features of the speech signal alone.
It can be understood that the low-level features are an intermediate variable obtained in the process of extracting the feature information of the speech signal; that is, the neural network model can form more abstract high-level features of the speech signal by further combining the low-level features. In the embodiment of the application, the high-level features are relatively precise features of the voice signal extracted by the neural network, namely refined features of the voice signal. That is, different speakers can be distinguished more clearly from the high-level features of a speech signal than from its low-level features. In other words, the high-level features of the speech signal are the feature information that can be used to generate the voiceprint template, or to match against the voiceprint template.
The low-level features are obtained by shallow learning of the input information by the neural network model. Illustratively, this may be achieved by configuring the first neural network model as a shallow neural network model. As a specific example, the first neural network model may include only a few hidden layers, such as 1 or 2 layers, which is not limited in this embodiment.
Therefore, in the embodiment of the application, the first neural network model is deployed on the end-side interaction device and acquires the low-level features of the voice signal collected by the device through the at least one microphone. Because the first neural network model is adapted to the at least one microphone, when multiple microphones are present the model can utilize the multi-channel data they acquire to improve the far-field performance and the noise-scene performance of the voice signal, thereby helping to improve the accuracy of voiceprint recognition.
In addition, since the first neural network model only needs to acquire low-level features of the voice signal, it places low computational demands on the electronic device; the electronic device deploying the first neural network model (e.g., the end-side interaction device) therefore does not need abundant computing power.
In some alternative embodiments, the first neural network model may be a CNN. The CNN model comprises at least one convolutional layer, and each convolutional layer is used for extracting features of the voice signal.
As a possible implementation, the first neural network model may be a two-dimensional (2D) CNN model. At this time, the number of input channels of the first neural network model is the same as the number of microphones included in the microphone component 111, and at least one input channel of the first neural network model corresponds to at least one microphone in the microphone component 111 one-to-one.
Correspondingly, in an implementation manner of inputting the voice signal acquired by the microphone into the preset first neural network model, the voice signal acquired by each microphone may be input into the input channel corresponding to each microphone.
Fig. 5 shows a specific example of convolution in the 2D CNN model, taking an annular microphone array comprising 3 microphones as an example. The number of input channels of the 2D CNN model may be the same as the number of microphones in the microphone assembly, for example a first input channel, a second input channel, and a third input channel. In addition, each input channel in the 2D CNN model corresponds to one convolution kernel. Fig. 5 illustrates the convolution process over the three input channels. Here, the voice information sample input to each channel may be a 2D picture sample, for example a 6 × 6 picture sample, i.e., with length and height both equal to 6. A 3 × 3 convolution kernel may then be applied separately to the picture sample on each input channel. Finally, a 4 × 4 picture feature can be extracted on each of the three input channels.
In this way, by setting the number of input channels of the 2D CNN model to be the same as the number of microphones in the microphone assembly, when the microphone assembly includes a microphone array, the voice information acquired by each microphone in the microphone array is respectively input into one input channel of the 2D CNN model, so that the 2D CNN model can be adapted to the microphone array of the end-side interaction device (i.e. the number and arrangement of the microphones in the microphone array), and further, the far-field performance of the voice and the performance of the noise scene can be improved by using the multi-channel data acquired by the microphone array.
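Purely as an illustration (the patent does not prescribe a framework), the convolution in fig. 5 might be sketched in PyTorch as follows; the grouped convolution gives each microphone channel its own 3 × 3 kernel, and the 6 × 6 inputs yield 4 × 4 features, matching the figure. All names and sizes here are assumptions.

```python
import torch
import torch.nn as nn

num_mics = 3  # one input channel per microphone, as in fig. 5

# groups=num_mics gives every channel its own 3x3 kernel, so each
# microphone's 6x6 "picture" is convolved separately.
conv2d = nn.Conv2d(in_channels=num_mics, out_channels=num_mics,
                   kernel_size=3, groups=num_mics, bias=False)

batch = torch.randn(1, num_mics, 6, 6)  # (batch, mic channels, height, width)
features = conv2d(batch)
print(features.shape)  # torch.Size([1, 3, 4, 4]): a 4x4 feature per channel
```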
As another possible implementation, the first neural network model may be a three-dimensional (3D) CNN model. At this time, the depth of the convolution kernel of the first neural network model is the same as the number of microphones included in the microphone assembly 111.
Fig. 6 shows a specific example of convolution in a 3D CNN model, again taking an annular microphone array comprising 3 microphones as an example. The depth of the convolution kernel of the 3D CNN model is 3, the same as the number of microphones in the microphone assembly; the number of input channels of the 3D CNN model is not limited here. By way of example, fig. 6 shows the convolution process on only one input channel (e.g., the ith channel). The voice information sample input into this channel may be a 3D picture sample, for example a 6 × 6 × 3 picture sample, whose length and height are both 6 and whose depth is 3. The depth of the convolution kernel must equal the depth of the picture sample, i.e., also 3, so a 3 × 3 × 3 convolution kernel may be used to perform the convolution operation on the picture sample on this input channel. Finally, a 4 × 4 picture feature can be extracted on this channel.
In this way, by setting the depth of the convolution kernel of the 3D CNN model to be the same as the number of microphones in the microphone assembly, when the microphone assembly includes a microphone array, the convolution kernel in each channel in the 3D CNN model can simultaneously perform convolution operation on the voice information acquired by each microphone in the microphone array, so that the 3D CNN model can be adapted to the microphone array of the end-side interaction device (i.e. to the number and arrangement of the microphones in the microphone array), and further, the multi-channel data acquired by the microphone array can be utilized to improve the far-field performance of the voice and the performance of the noise scene.
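Likewise, a hedged sketch of the 3D variant in fig. 6 (PyTorch again assumed): the kernel depth spans all three microphone "slices" at once, so a single convolution mixes the channels of the array.

```python
import torch
import torch.nn as nn

num_mics = 3  # kernel depth equals the number of microphones, as in fig. 6

# A 3x3x3 kernel whose depth spans all microphone slices simultaneously.
conv3d = nn.Conv3d(in_channels=1, out_channels=1, kernel_size=3, bias=False)

# (batch, input channel, depth=mics, height, width): one 6x6x3 sample.
sample = torch.randn(1, 1, num_mics, 6, 6)
features = conv3d(sample)
print(features.shape)  # torch.Size([1, 1, 1, 4, 4]): a single 4x4 feature
```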
For example, since the first neural network model can be adapted to the microphone array in the end-side interaction device, it can analyze the direction and distance of the sound source of the voice signal received by the array, and thereby obtain time information and/or intensity information of the voice signal. Because this time and/or intensity information can be included in the low-level features, the voiceprint recognition system can utilize the multi-channel data collected by the microphone assembly to improve the far-field performance of the voice signal.
In addition, the first neural network model can filter sound waves by utilizing the phase differences between the sound waves received by the microphone array, removing environmental background sound and retaining the needed sound waves, so that the voiceprint recognition system can utilize the multi-channel data collected by the array to improve performance in noisy scenes.
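A toy delay-and-sum illustration of this phase-based filtering (all names and values assumed, not from the patent): aligning each microphone's signal by its arrival delay reinforces the target direction while sound from other directions adds incoherently.

```python
import numpy as np

def delay_and_sum(signals: np.ndarray, delays: np.ndarray, fs: int) -> np.ndarray:
    """signals: (mics, samples); delays: per-mic arrival delays in seconds.
    Shifting each channel by its delay and averaging reinforces the target
    direction; other directions remain misaligned and are attenuated."""
    aligned = [np.roll(ch, -int(round(d * fs))) for ch, d in zip(signals, delays)]
    return np.mean(aligned, axis=0)
```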
In the embodiment of the present application, the input of the second neural network model 121 is the low-level features output by the first neural network model 112, and the output is the high-level features of the speech signal. Here, the high-level features can be referred to the above description, and are not described here for brevity.
Wherein the second neural network model 121 is derived using a second set of training data samples comprising a plurality of speech data samples acquired by microphone components of different end-side interaction devices.
That is, the second set of training data samples comprises a plurality of speech data samples acquired by microphone components of a plurality of different end-side interaction devices. Or, in other words, the speech data samples acquired by the microphone components in the plurality of end-side interaction devices may constitute a second set of training data samples of the second neural network model. Illustratively, the plurality of different end-side interaction devices includes the end-side interaction device 110 above.
In some possible embodiments, the system 100 may include a plurality of different end-side interaction devices, and the low-level features acquired by each of these devices may be output to the second neural network model 121. The back-end computing device 120 is then able to perform deep learning on the low-level features output by the at least one end-side interaction device 110. In this case, the training data sample set of the second neural network model 121 may include a plurality of voice data samples acquired by the microphone components of at least two end-side interaction devices.
In some optional embodiments, the high-level features are obtained by deep learning of the input information by the neural network model. Illustratively, obtaining high-level features of the speech signal may be accomplished by configuring the second neural network model as a deep neural network model. As a specific example, the second neural network model may include more convolutional layers, for example 3, 5, 6 or more layers, which is not limited in the embodiments of the present application.
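To make the shallow/deep division concrete, here is a hedged sketch in PyTorch; the layer counts, channel widths, and the 128-dimensional embedding size are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

# Shallow first model (end side): 2 conv layers -> low-level features.
first_model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3), nn.ReLU(),
)

# Deep second model (back end): more conv layers -> high-level embedding.
second_model = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 128),  # 128-dim speaker embedding (assumed size)
)

spectrogram = torch.randn(1, 3, 64, 64)  # 3 mics, 64x64 spectrogram patches
low_level = first_model(spectrogram)     # sent from device to back end
high_level = second_model(low_level)
print(low_level.shape, high_level.shape)  # [1,16,60,60], [1,128]
```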
In the embodiment of the application, when the microphone assembly comprises a microphone array, the first neural network model can be adapted to the microphone array in the end-side interaction device, so that the second neural network model can perform deep learning on low-level features of multi-channel data acquired by the microphone array to acquire high-level features of a voice signal. As an example, the second neural network model can learn spatial information of the speech information by using time information, intensity information, and angle information of the speech information received by the microphone array, thereby improving far-field performance of the speech signal. For another example, the second neural network model can further filter the sound waves by using the difference between the phases of the sound waves received by the microphone array, so that the environmental background sound can be removed to the maximum extent, and the required sound waves are left, thereby improving the performance of the noise scene of the speech signal.
Therefore, compared with the prior art in which the neural network model is deployed wholly on one electronic device, the embodiment of the application realizes a distributed voiceprint recognition neural network by splitting the neural network model into a lightweight first neural network model and a deep second neural network model, deploying the lightweight first model on the end-side interaction device with limited computing power, and deploying the deep second model on the back-end computing device with sufficient computing power.
Further, when the end-side interaction device comprises a microphone array, the lightweight first neural network model can be adapted to the microphone array on the end-side interaction device, so that the deep second neural network model at the back end can be used for deeply learning the low-level features of the multi-channel data acquired by the microphone array, and the far field performance and the noise performance of the voice signals in the high-level features are improved.
In this embodiment of the application, the voiceprint recognition module 122 is configured to generate, in the voiceprint registration stage, a voiceprint template of the user of the end-side interaction device according to the high-level features of the user's registered voice signal output by the second neural network model 121; this user may be referred to as a registered user, or the owner. The voiceprint recognition module 122 is further configured to, in the voiceprint verification stage, generate a voice feature vector of the to-be-recognized user (i.e., a tester) from the high-level features of the to-be-recognized voice signal output by the second neural network model 121, match that feature vector against the voiceprint template of the registered user or owner, and determine from the matching result whether the to-be-recognized user is the registered user or owner.
In the embodiment of the application, when the microphone assembly comprises a microphone array, the high-level features are obtained by deep learning, by the second neural network model, of the low-level features of the multi-channel data acquired by the array, and therefore have better far-field and noise performance. As a result, the voiceprint template of the registered user or owner generated in the voiceprint registration stage and the voice feature vector of the to-be-recognized voice signal generated in the voiceprint verification stage are more accurate than in the prior art, which helps to improve the accuracy of voiceprint recognition.
Fig. 7 is a schematic flow chart of a method 500 for voiceprint recognition provided by an embodiment of the present application. It should be understood that fig. 7 shows steps or operations of the method of voiceprint recognition, but these steps or operations are merely examples, and other operations or variations of the operations in fig. 7 may also be performed by embodiments of the present application. Moreover, the various steps in FIG. 7 may be performed in a different order than presented in FIG. 7, and it is possible that not all of the operations in FIG. 7 may be performed.
The method 500 may be applied to a home scenario or a personal wearable scenario, which is not limited in the embodiment of the present application. Illustratively, the method 500 may be performed by the system 100 for voiceprint recognition in fig. 2, but the embodiments of the present application are not limited thereto. The method 500 is described below taking the system 100 as an example.
Method 500 includes steps 501 through 509. Wherein steps 501 and 502 are performed by the end-side interaction device 110 and steps 503 to 509 are performed by the back-end computing device 120.
501, a microphone picks up speech.
Illustratively, the microphone component 111 in the end-side interaction device 110 may perform step 501, i.e. acquiring a speech signal of the user.
In some embodiments, during the voiceprint enrollment phase, the microphone may pick up the enrollment voice of the user, i.e., the enrollment voice signal.
For example, when the user first uses the voiceprint recognition function of the end-side interaction device, the device may prompt the user whether to register a main voiceprint template. As a specific example, fig. 8 shows an example of a display interface of an end-side interaction device. As shown in fig. 8, "whether to register the main voiceprint template" may be displayed through the display interface of the end-side interaction device. Optionally, the device may further display two virtual keys, "yes" and "no", for obtaining the user's operation. When the user inputs a "yes" operation, the end-side interaction device may, in response, enter an interface for entering the user's voice. When the user inputs a "no" operation, the end-side interaction device exits the voiceprint recognition function in response to the operation.
Optionally, the end-side interaction device may further obtain the operation of the user through a physical key. For example, the interface for entering the user's registration voice may be entered when the user selects the "OK" key and the voiceprint recognition function may be exited when the user selects the "BACK" key.
When the end-side interaction device does not have a display interface, or while the end-side interaction device displays the interface shown in fig. 8, the end-side interaction device may perform voice prompt on the user, for example, play "whether to register the main voiceprint template" or other voices through the audio player, which is not limited in this embodiment of the application.
As another example, after the user has entered the owner's voiceprint template, the user may also choose to add a new owner voiceprint template for voiceprint recognition in the security settings. As a specific example, fig. 9 shows another example of a display interface of an end-side interaction device. As shown in fig. 9, the user may input an operation to enter "voiceprint" through the security and privacy display interface on the left side of fig. 9. In response to this operation, the display interface may present the interface shown on the right side of fig. 9. At this point, the user can input an operation of "creating a new voiceprint". In response to this operation, the end-side interaction device may enter an interface for entering the user's voice.
Fig. 10 illustrates one example of an interface for entering user speech. As shown in fig. 10, "please enter speech to generate a master voiceprint template" may be displayed on the display interface. Optionally, the end-side interaction device may further display a "start recording" virtual key in the interface. When the user chooses to record, the user can click or long-press the "start recording" virtual key and then input a section of registration voice. In response to the user's voice input operation, the end-side interaction device may control the microphone assembly to acquire the user's registration voice signal.
Optionally, when the end-side interaction device does not display an interface, or while the end-side interaction device displays the interface shown in fig. 10, the end-side interaction device may further perform voice prompt on the user, for example, play "please enter a piece of voice to generate a main voiceprint template" through an audio player, or other voices, which is not limited in this embodiment of the present application.
Optionally, the end-side interactive device may further obtain the input registration voice of the user through a physical "start recording" key. At this time, the end-side interactive device does not need to display a virtual button for "start recording" to the user.
In some embodiments, during the voiceprint verification (also called voiceprint validation or recognition) phase, the microphone may pick up the voice to be recognized, i.e., the voice signal to be recognized, of the user to be recognized.
For example, when the user turns on the end-side interaction device or enables some function of the device that requires security authentication, the device may prompt the user that voiceprint authentication is required. As one example, the end-side interaction device may enter an interface for entering the speech to be recognized. Fig. 11 illustrates another example of an interface for entering user speech. As shown in fig. 11, "please enter voice for voiceprint verification" may be displayed on the display interface. Optionally, the device may further display a "start recording" virtual key in the interface. When the user chooses to record, the user can click or long-press the "start recording" virtual key and then input a section of test voice. In response to the user's voice input operation, the end-side interaction device may control the microphone assembly to acquire the user's test voice signal.
In some embodiments, when the end-side interaction device does not display an interface, or while it displays the interface shown in fig. 11, the device may also give the user a voice prompt, for example playing "please enter a piece of voice for voiceprint verification" or other voices through the audio player, which is not limited in this embodiment.
In some embodiments, the end-side interactive device may also obtain the user's input test voice through a physical "start recording" button. At this time, the end-side interactive device does not need to display a virtual button for "start recording" to the user.
In some embodiments, the end-side interaction device may further perform signal processing, such as voice activation detection, voice noise reduction processing, dereverberation processing, and the like, on the registration voice signal or the voice signal to be recognized acquired by the microphone to acquire a processed voice signal.
502, the first neural network model extracts low-level features.
Illustratively, inputting the speech signal obtained in step 501 into the first neural network model 112 in the end-side interaction device 110 may obtain low-level features of the speech signal. As an example, in the voiceprint registration stage, the registration voice signal may be input into the first neural network model, and the low-level features of the registration voice signal may be obtained, and in the voiceprint verification stage, the to-be-recognized voice signal may be input into the first neural network model, and the low-level features of the to-be-recognized voice signal may be obtained. Specifically, the low-level features can be referred to the above description, and are not described herein for brevity.
In this embodiment, before step 502 is executed, the first neural network model may be trained by using the voice information collected by the microphone component 111 as a training data sample set. For example, the first neural network model may be trained by the end-side device where the end-side interaction device 110 is located, or the first neural network model may be trained by the device where the back-end computing device 120 is located, which is not limited in this embodiment of the present application.
When the back-end computing device 120 trains the first neural network model, the end-side device may send the speech information obtained by the microphone component 111 to the device where the back-end computing device 120 is located. The device in which the back-end computing device 120 is located sends the obtained trained first neural network model to the end-side device after completing the training.
As a possible implementation manner, the first neural network model may be jointly trained end to end by fixing parameters of the second neural network model.
503, the second neural network model extracts high-level features.
Illustratively, the end-side device on which the end-side interaction device 110 is located may send the acquired low-level features of the speech signal through the communication interface to the second neural network 121 in the back-end computing device 120. Inputting these low-level features into the second neural network 121 yields the high-level features of the speech signal, where the speech signal may be a registration voice signal or a voice signal to be recognized. The high-level features are described above and, for brevity, are not repeated here.
In an embodiment of the application, the second neural network may be trained using speech information collected by microphone components on a plurality of different end-side devices as a training data sample set before performing 503. For example, the second neural network model may be trained by a device in which the back-end computing device 120 is located, which is not limited in this embodiment.
As a possible implementation manner, the second neural network model may be jointly trained end to end by fixing parameters of the first neural network model.
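As an illustrative sketch of such fixed-parameter joint training (the framework, loss, and helper names are assumptions, not specified by the patent): the first model's parameters are held fixed so that gradients update only the second model; training the first model with the second fixed is symmetric.

```python
import torch

def train_second_with_first_frozen(first_model, second_model, loader,
                                   loss_fn, optimizer):
    """One epoch of joint training with the first model's parameters fixed;
    `optimizer` is assumed to be built over second_model.parameters()."""
    first_model.eval()
    for speech, labels in loader:
        with torch.no_grad():                # parameters of the first
            low_level = first_model(speech)  # model stay fixed
        logits = second_model(low_level)     # assumed classifier head
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()                      # updates only the second model
        optimizer.step()
```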
And 504, generating a voiceprint template of the registered user or the registered owner.
Specifically, in the voiceprint registration stage, a voiceprint template of the registered user or owner may be generated according to the high-level features of the registered voice signal of the user acquired in step 503.
In some possible embodiments, a plurality of voiceprint templates of registered users or owners may be generated, which is not limited in the embodiments of the present application. If a voiceprint template of the registered user or owner already exists, the newly generated template may be used alongside the original one, or may replace the original registered user or owner voiceprint template; the embodiment of the application does not limit this.
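For illustration only (the patent does not specify how templates are formed), a voiceprint template is often taken as the mean of the enrollment embeddings; this sketch assumes that convention.

```python
import numpy as np

def make_voiceprint_template(enroll_embeddings: np.ndarray) -> np.ndarray:
    """enroll_embeddings: (num_utterances, dim) high-level features of the
    registration voice signals; the template is their normalized mean."""
    template = enroll_embeddings.mean(axis=0)
    return template / np.linalg.norm(template)
```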
505, obtaining a voice feature vector of the voice signal to be recognized.
Specifically, in the voiceprint verification stage, the speech feature vector of the speech signal to be recognized may be obtained according to the high-level feature of the speech signal to be recognized of the user to be recognized, which is obtained in step 503.
It should be noted that the voiceprint enrollment phase is typically performed before the voiceprint verification phase. That is, prior to step 505, the system for voiceprint recognition (e.g., via step 504) has obtained a voiceprint template for at least one registered user or owner.
And 506, matching scores.
Illustratively, in the voiceprint verification stage, after the voice feature vector of the voice signal to be recognized is obtained in step 505, step 506 is executed to score the match between the voiceprint template of the registered user or owner (generated in step 504) and the voice feature vector of the user to be recognized. The score describes the similarity between the voice feature vector of the voice signal to be recognized and the voiceprint template of the registered user or owner; illustratively, the higher the score, the higher the similarity.
507, whether the score is higher than a threshold value is judged.
And 508, when the score is higher than the threshold value, judging that the user to be identified is a registered user or an owner.
509, when the score is not higher than the threshold, it is determined that the user to be identified is not a registered user or is not an owner.
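A hedged sketch of steps 506 to 509 follows; the patent does not name a scoring function, so the cosine similarity and the threshold value here are assumptions.

```python
import numpy as np

THRESHOLD = 0.7  # assumed decision threshold

def match_score(feature_vec: np.ndarray, voiceprint_template: np.ndarray) -> float:
    """Cosine similarity between the test feature vector and the template;
    higher means more similar (step 506)."""
    num = float(np.dot(feature_vec, voiceprint_template))
    den = float(np.linalg.norm(feature_vec) * np.linalg.norm(voiceprint_template))
    return num / den

def is_registered_user(feature_vec, voiceprint_template) -> bool:
    # Steps 507-509: accept only when the score exceeds the threshold.
    return match_score(feature_vec, voiceprint_template) > THRESHOLD
```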
Therefore, compared with the prior art in which the neural network model is deployed wholly on the back-end processing device, the embodiment of the application realizes a distributed voiceprint recognition neural network by splitting the neural network model into a lightweight first neural network model and a deep second neural network model, deploying the lightweight first model on the computationally limited end-side interaction device, and deploying the deep second model on the computationally sufficient back-end computing device.
Further, when the end-side interaction device includes a microphone array, the lightweight first neural network model can be adapted to that array, so that the deep second neural network model at the back end can deeply learn the low-level features of the multi-channel data acquired by the array, thereby improving the far-field performance and the noise performance of the voice signal in the high-level features. Furthermore, because the high-level features are obtained by this deep learning of the multi-channel low-level features, the voiceprint template of the registered user or owner generated in the voiceprint registration stage and the voice feature vector generated in the voiceprint verification stage are more accurate than in the prior art, which helps to improve the accuracy of voiceprint recognition.
The method for voiceprint recognition provided by the embodiment of the present application is described in detail above with reference to fig. 1 to 11, and the electronic device and the system of the embodiment of the present application are described below with reference to fig. 12 to 15. It should be understood that the electronic device or system in fig. 12 to 15 can perform each step in the method of voiceprint recognition in the embodiment of the present application, and in order to avoid repetition, the repeated description is appropriately omitted when the electronic device or system in fig. 12 to 15 is introduced.
Fig. 12 is a schematic block diagram of an electronic device 1200 according to an embodiment of the application. The electronic device 1200 includes a processor 1210, at least one microphone 1220, and an output interface 1230. Wherein the processor 1210, the at least one microphone 1220 and the output interface 1230 are capable of performing the steps of the first electronic device (e.g., the end-side interaction device) in the method of voiceprint recognition of the embodiments of the application referred to in fig. 1 to 11 above.
In particular, when the electronic device 1200 is configured to perform the above-described method for voiceprint recognition, the processor 1210, the at least one microphone 1220 and the output interface 1230 function specifically as follows:
at least one microphone 1220 for acquiring a voice signal of a user.
A processor 1210, configured to input the voice signal into a preset first neural network model, and obtain low-level features of the voice signal, where the first neural network model is obtained by using a first training data sample set, and the first training data sample set includes a plurality of voice data samples obtained by the at least one microphone.
An output interface for outputting the low-level features to a second electronic device.
In some optional embodiments, the first neural network model is a two-dimensional convolutional neural network, and at least one input channel of the first neural network model is in one-to-one correspondence with the at least one microphone;
the processor is specifically configured to input the voice signal acquired by each of the at least one microphone into the input channel corresponding to each microphone.
In some optional embodiments, the first neural network model is a three-dimensional convolutional neural network, and a depth of a convolution kernel of the first neural network model is the same as the number of the at least one microphone.
In some optional embodiments, the voice signal of the user comprises a voice signal to be recognized or a registration voice signal.
Fig. 13 is a schematic block diagram of an electronic device 1300 according to an embodiment of the present application. The electronic device 1300 comprises a processor 1310 and an input interface 1320. The processor 1310 and the input interface 1320 are capable of performing the steps of the second electronic device (e.g., the back-end computing device) in the method of voiceprint recognition of the embodiments of the application referred to above in figs. 1 to 11.
In particular, when the electronic device 1300 is configured to perform the above-described method for voiceprint recognition, the processor 1310 and the input interface 1320 are specifically configured as follows:
an input interface 1320 for obtaining low-level features of a speech signal to be recognized of a user to be recognized from a first electronic device.
A processor 1310, configured to input the low-level features of the speech signal to be recognized into a preset second neural network model, and obtain the high-level features of the speech signal to be recognized, where the second neural network model is obtained by using a second training data sample set, and the second training data sample set includes a plurality of speech data samples obtained by at least one first electronic device.
The processor 1310 is further configured to match a high-level feature of the voice signal to be recognized with a voiceprint template of a registered user, determine that the user to be recognized is the registered user if the high-level feature is matched with the voiceprint template, and determine that the user to be recognized is not the registered user if the high-level feature is not matched with the voiceprint template.
In some optional embodiments, the input interface 1320 is further configured to obtain low-level features of the user's registered voice signal from the first electronic device.
The processor 1310 is further configured to input low-level features of the registered speech signal into the second neural network model, and obtain high-level features of the second speech signal.
The processor 1310 is further configured to generate a voiceprint template of the registered user based on high level features of the second speech signal.
Fig. 14 is a schematic block diagram of a system 1400 for voiceprint recognition according to an embodiment of the present application. The system 1400 includes the electronic device 1200 of fig. 12 and the electronic device 1300 of fig. 13.
Fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device is, for example, a first electronic device, or a second electronic device. As shown in fig. 15, the electronic device includes a communication module 1510, a sensor 1520, a user input module 1530, an output module 1540, a processor 1550, an audio-video input module 1560, a memory 1570, and a power supply 1580.
The communication module 1510 may include at least one module that enables communication between the electronic device and other electronic devices. For example, the communication module 1510 may include one or more of a wired network interface, a broadcast receiving module, a mobile communication module, a wireless internet module, a local area communication module, and a location (or position) information module, etc. The various modules are implemented in various ways in the prior art, and are not described in the application.
The sensors 1520 may sense a current state of the system, such as an open/close state, a position, whether there is contact with a user, a direction, and acceleration/deceleration, and the sensors 1520 may generate sensing signals for controlling the operation of the system.
A user input module 1530 for receiving input numerical information, character information, or contact touch operation/non-contact gesture, and receiving signal input related to user setting and function control of the system, etc. The user input module 1530 includes a touch panel and/or other input devices.
The output module 1540 includes a display panel for displaying information input by the user, information provided to the user, various menu interfaces of the system, and the like. Alternatively, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), or the like. In other embodiments, the touch panel can be overlaid on the display panel to form a touch display screen. In addition, the output module 1540 may further include an audio output module, an alarm, a haptic module, and the like.
And an audio and video input module 1560 for inputting audio signals or video signals. The audio/video input module 1560 may include a camera and a microphone. Wherein the microphone may be a microphone array.
The power supply 1580 may receive external power and internal power under the control of the processor 1550 and provide power required for the operation of the various components of the system.
Processor 1550 may refer to one or more processors; for example, processor 1550 may include one or more central processing units, or a central processing unit and a graphics processing unit, or an application processor and a coprocessor (e.g., a micro control unit). When processor 1550 includes multiple processors, these may be integrated on the same chip or may each be separate chips. A processor may include one or more physical cores, where a physical core is the smallest processing unit.
Memory 1570 stores computer programs, including an operating system program 1572 and application programs 1571. Typical operating systems include those for desktop or notebook computers, such as Windows from Microsoft Corporation and MacOS from Apple Inc., as well as systems for mobile terminals, such as the Android system developed by Google Inc. The method provided by the foregoing embodiment may be implemented by software, and may be considered a specific implementation of the application 1571.
Memory 1570 may be of one or more of the following types: flash memory, hard disk type memory, micro multimedia card type memory, card type memory (e.g., SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, or optical disk. In other embodiments, the memory 1570 may be a network storage device on the internet, and the system may perform updates or reads to the memory 1570 over the internet.
Processor 1550 is configured to read the computer programs in memory 1570 and then execute methods defined by the computer programs, such as processor 1550 reading operating system program 1572 to run an operating system on the system and implement various functions of the operating system, or reading one or more application programs 1571 to run applications on the electronic device.
The memory 1570 also stores other data 1573 than computer programs, such as a first neural network model, a second neural network model, or a voiceprint template, etc., as referred to herein.
The connection relationship of the modules in fig. 15 is only an example, and the method provided in any embodiment of the present application may also be applied to electronic devices with other connection modes, for example, all modules are connected through a bus.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

1. A method for voiceprint recognition, the method being applied to a first electronic device, the method comprising:
the first electronic equipment acquires a voice signal of a user through at least one microphone;
inputting the voice signal into a preset first neural network model to obtain low-level features of the voice signal, wherein the first neural network model is obtained by utilizing a first training data sample set, and the first training data sample set comprises a plurality of voice data samples obtained by the at least one microphone;
outputting the low-level features to a second electronic device.
2. The method of claim 1, wherein the first neural network model is a two-dimensional convolutional neural network, and at least one input channel of the first neural network model has a one-to-one correspondence with the at least one microphone;
wherein the inputting the voice signal into a preset first neural network model comprises:
and inputting the voice signal acquired by each microphone of the at least one microphone into the input channel corresponding to each microphone.
3. The method of claim 1, wherein the first neural network model is a three-dimensional convolutional neural network, and wherein the depth of the convolutional kernel of the first neural network model is the same as the number of the at least one microphone.
4. A method according to any one of claims 1-3, characterized in that the speech signal of the user comprises a speech signal to be recognized or a registration speech signal.
5. A method for voiceprint recognition, the method being applied to a second electronic device, the method comprising:
acquiring low-level features of a voice signal to be recognized of a user to be recognized from first electronic equipment;
inputting the low-level features of the voice signal to be recognized into a preset second neural network model to obtain the high-level features of the voice signal to be recognized, wherein the second neural network model is obtained by utilizing a second training data sample set, and the second training data sample set comprises a plurality of voice data samples obtained by at least one first electronic device;
matching the high-level characteristics of the voice signal to be recognized with the voiceprint template of the registered user;
and if the high-level features are matched with the voiceprint template, determining that the user to be identified is the registered user, and if the high-level features are not matched with the voiceprint template, determining that the user to be identified is not the registered user.
6. The method according to claim 5, wherein before matching the high-level features of the speech signal to be recognized with the voiceprint template of the registered user, further comprising:
acquiring low-level features of a registered voice signal of a user from the first electronic equipment;
inputting the low-level features of the registered voice signal into the second neural network model to obtain the high-level features of the second voice signal;
and generating the voiceprint template of the registered user according to the high-level characteristics of the second voice signal.
7. A method of voiceprint recognition, comprising:
the method comprises the steps that first electronic equipment acquires a voice signal to be recognized of a user through at least one microphone;
the first electronic device inputs the voice signal to be recognized into a preset first neural network model, and obtains low-level features of the voice signal to be recognized, wherein the first neural network model is obtained by using a first training data sample set, and the first training data sample set comprises a plurality of voice data samples obtained by the at least one microphone;
the first electronic equipment outputs the low-level features of the voice signal to be recognized to second electronic equipment;
the second electronic equipment acquires low-level features of the voice signal to be recognized from the first electronic equipment;
the second electronic device inputs the low-level features of the voice signal to be recognized into a preset second neural network model to obtain the high-level features of the voice signal to be recognized, wherein the second neural network model is obtained by utilizing a second training data sample set, and the second training data sample set comprises a plurality of voice data samples obtained by at least one first electronic device;
the second electronic equipment matches the high-level features of the voice signal to be recognized with the voiceprint template of the registered user;
and if the high-level features are matched with the voiceprint template, the second electronic equipment determines that the user to be identified is the registered user, and if the high-level features are not matched with the voiceprint template, the second electronic equipment determines that the user to be identified is not the registered user.
8. The method of claim 7, wherein the first neural network model is a two-dimensional convolutional neural network, and at least one input channel of the first neural network model has a one-to-one correspondence with the at least one microphone;
inputting the voice signal to be recognized into a preset first neural network model, wherein the inputting the voice signal to be recognized into the preset first neural network model comprises the following steps:
and inputting the voice signal to be recognized acquired by each microphone of the at least one microphone into an input channel corresponding to each microphone.
9. The method of claim 7, wherein the first neural network model is a three-dimensional convolutional neural network, and wherein the depth of the convolutional kernel of the first neural network model is the same as the number of the at least one microphone.
10. The method according to any one of claims 7-9, wherein before matching the high-level features of the speech signal to be recognized with the voiceprint template of the registered user, further comprising:
the first electronic equipment acquires a registration voice signal of a user through at least one microphone;
the first electronic equipment inputs the registration voice signal into the first neural network model to obtain the low-level features of the registration voice signal;
the first electronic equipment outputs the low-level features of the registration voice signal to the second electronic equipment;
the second electronic equipment acquires the low-level features of the registration voice signals from the first electronic equipment;
the second electronic equipment inputs the low-level features of the registration voice signal into the second neural network model to obtain the high-level features of the second voice signal;
and the second electronic equipment generates a voiceprint template of the registered user according to the high-level characteristics of the second voice signal.
11. An electronic device, comprising:
at least one microphone for acquiring a voice signal of a user;
the processor is used for inputting the voice signal into a preset first neural network model and acquiring low-level features of the voice signal, wherein the first neural network model is obtained by utilizing a first training data sample set, and the first training data sample set comprises a plurality of voice data samples acquired by the at least one microphone;
an output interface for outputting the low-level features to a second electronic device.
12. The electronic device of claim 11, wherein the first neural network model is a two-dimensional convolutional neural network, and at least one input channel of the first neural network model has a one-to-one correspondence with the at least one microphone;
the processor is specifically configured to input the voice signal acquired by each of the at least one microphone into the input channel corresponding to each microphone.
13. The electronic device of claim 11, wherein the first neural network model is a three-dimensional convolutional neural network, and wherein a depth of a convolution kernel of the first neural network model is the same as a number of the at least one microphone.
14. Electronic equipment according to any of claims 11-13, characterized in that the speech signal of the user comprises a speech signal to be recognized or a registration speech signal.
15. An electronic device, comprising:
the input interface is used for acquiring low-level characteristics of a voice signal to be recognized of a user to be recognized from the first electronic equipment;
the processor is used for inputting the low-level features of the voice signal to be recognized into a preset second neural network model and acquiring the high-level features of the voice signal to be recognized, wherein the second neural network model is acquired by utilizing a second training data sample set, and the second training data sample set comprises a plurality of voice data samples acquired by at least one first electronic device;
the processor is further configured to match a high-level feature of the voice signal to be recognized with a voiceprint template of a registered user, determine that the user to be recognized is the registered user if the high-level feature is matched with the voiceprint template, and determine that the user to be recognized is not the registered user if the high-level feature is not matched with the voiceprint template.
16. The apparatus of claim 15,
the input interface is further used for acquiring low-level features of a registration voice signal of a user from the first electronic equipment;
the processor is further configured to input the low-level features of the registered speech signal into the second neural network model, and obtain the high-level features of the second speech signal;
the processor is further configured to generate a voiceprint template of the registered user based on the high-level features of the second speech signal.
17. A system for voiceprint recognition comprising an electronic device according to any one of claims 11 to 14 and an electronic device according to claim 15 or 16.
CN202010247516.1A 2020-03-31 2020-03-31 Voiceprint recognition method, electronic equipment and system Pending CN113470653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010247516.1A CN113470653A (en) 2020-03-31 2020-03-31 Voiceprint recognition method, electronic equipment and system

Publications (1)

Publication Number Publication Date
CN113470653A true CN113470653A (en) 2021-10-01

Family

ID=77865886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010247516.1A Pending CN113470653A (en) 2020-03-31 2020-03-31 Voiceprint recognition method, electronic equipment and system

Country Status (1)

Country Link
CN (1) CN113470653A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113946810A (en) * 2021-12-07 2022-01-18 荣耀终端有限公司 Application program running method and electronic equipment
WO2023071730A1 (en) * 2021-10-28 2023-05-04 华为技术有限公司 Voiceprint registration method and electronic devices

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106027300A (en) * 2016-05-23 2016-10-12 深圳市飞仙智能科技有限公司 System and method for parameter optimization of intelligent robot applying neural network
CN106982359A (en) * 2017-04-26 2017-07-25 深圳先进技术研究院 A kind of binocular video monitoring method, system and computer-readable recording medium
CN109524014A (en) * 2018-11-29 2019-03-26 辽宁工业大学 A kind of Application on Voiceprint Recognition analysis method based on depth convolutional neural networks
CN110010133A (en) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 Vocal print detection method, device, equipment and storage medium based on short text
CN110175636A (en) * 2019-05-08 2019-08-27 深圳欧翼思特科技有限公司 A kind of Internet of Things deep neural network distribution differentiation inference system and method
EP3561733A1 (en) * 2018-04-25 2019-10-30 Deutsche Telekom AG Communication device
CN110705907A (en) * 2019-10-16 2020-01-17 江苏网进科技股份有限公司 Classroom teaching auxiliary supervision method and system based on audio voice processing technology
CN110751265A (en) * 2019-09-24 2020-02-04 中国科学院深圳先进技术研究院 Lightweight neural network construction method and system and electronic equipment
WO2020026741A1 (en) * 2018-08-03 2020-02-06 ソニー株式会社 Information processing method, information processing device, and information processing program
CN110827837A (en) * 2019-10-18 2020-02-21 中山大学 Whale activity audio classification method based on deep learning


Similar Documents

Publication Publication Date Title
CN110544488B (en) Method and device for separating multi-person voice
CN110853617B (en) Model training method, language identification method, device and equipment
WO2019210796A1 (en) Speech recognition method and apparatus, storage medium, and electronic device
CN112840396A (en) Electronic device for processing user words and control method thereof
KR101617649B1 (en) Recommendation system and method for video interesting section
KR20210052036A (en) Apparatus with convolutional neural network for obtaining multiple intent and method therof
CN111063342A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111327772B (en) Method, device, equipment and storage medium for automatic voice response processing
CN108877787A (en) Audio recognition method, device, server and storage medium
US20220254369A1 (en) Electronic device supporting improved voice activity detection
CN113470653A (en) Voiceprint recognition method, electronic equipment and system
CN110308886A (en) The system and method for voice command service associated with personalized task are provided
KR20190093962A (en) Speech signal processing mehtod for speaker recognition and electric apparatus thereof
CN110544287A (en) Picture matching processing method and electronic equipment
CN113611318A (en) Audio data enhancement method and related equipment
CN113327620A (en) Voiceprint recognition method and device
CN110827834B (en) Voiceprint registration method, system and computer readable storage medium
CN113920979B (en) Voice data acquisition method, device, equipment and computer readable storage medium
US20220358918A1 (en) Server for identifying false wakeup and method for controlling the same
CN116978359A (en) Phoneme recognition method, device, electronic equipment and storage medium
CN113763925B (en) Speech recognition method, device, computer equipment and storage medium
EP4195202A1 (en) Device for learning speaker authentication of registered user for voice recognition service, and method for operating same
US11670294B2 (en) Method of generating wakeup model and electronic device therefor
CN116486789A (en) Speech recognition model generation method, speech recognition method, device and equipment
CN111176430B (en) Interaction method of intelligent terminal, intelligent terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination