CN111949965A - Artificial intelligence-based identity verification method, device, medium and electronic equipment - Google Patents

Artificial intelligence-based identity verification method, device, medium and electronic equipment

Info

Publication number
CN111949965A
Authority
CN
China
Prior art keywords
sound
verified
feature vector
neural network
voiceprint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010811349.9A
Other languages
Chinese (zh)
Inventor
田植良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010811349.9A priority Critical patent/CN111949965A/en
Publication of CN111949965A publication Critical patent/CN111949965A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 - Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 - User authentication
    • G06F21/32 - User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches

Abstract

The embodiment of the application provides an identity verification method, apparatus, medium and electronic device based on artificial intelligence. The method comprises the following steps: performing voiceprint feature extraction on a collected voice signal to be verified through at least two neural networks respectively, to obtain a feature vector output by each neural network for the voice signal to be verified; splicing the feature vectors output by the neural networks for the voice signal to be verified to obtain a first voiceprint feature vector of the voice signal to be verified; predicting, according to the first voiceprint feature vector, a first prediction probability that the tone of the user from whom the voice signal to be verified originates is the same as that of an authorized user; and determining an identity verification result according to the first prediction probability. Because identity verification is performed in combination with the first prediction probability obtained through voiceprint recognition, the prior-art problem of being deceived by a static photo is effectively alleviated.

Description

Artificial intelligence-based identity verification method, device, medium and electronic equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an identity verification method, an identity verification device, an identity verification medium and electronic equipment based on artificial intelligence.
Background
With the research and progress of artificial intelligence technology, artificial intelligence has been applied in many fields, such as device unlocking based on face recognition, payment based on face recognition, and access control release based on face recognition. In these application scenarios, identity authentication needs to be performed through face recognition, and subsequent device unlocking, payment, access control release and the like are then performed according to the identity authentication result.
In practice, there are cases where an unauthorized user attempts to pass authentication by means of a photograph of an authorized user. Specifically, during image acquisition by the terminal device, the unauthorized user places a photo of the authorized user in front of the image acquisition module of the terminal device, so the image actually acquired by the terminal device is not the face of the unauthorized user but the photo held by the unauthorized user, and the terminal device may therefore let the unauthorized user pass authentication based on the acquired image of the photo. Therefore, authentication based on face recognition in the related art has the problem of being spoofed by a still picture.
Disclosure of Invention
Embodiments of the present application provide an identity authentication method, apparatus, medium, and electronic device based on artificial intelligence, so that the problem of spoofing by a static photo in an identity authentication process in the prior art can be solved at least to a certain extent.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of the embodiments of the present application, there is provided an identity authentication method based on artificial intelligence, including:
performing voiceprint feature extraction on the collected voice signals to be verified through at least two neural networks respectively to obtain feature vectors output by each neural network for the voice signals to be verified;
splicing the feature vectors output by each neural network for the voice signal to be verified to obtain a first voiceprint feature vector of the voice signal to be verified;
predicting a first prediction probability that the tone of the user from which the voice signal to be verified comes is the same as that of an authorized user according to the first voiceprint feature vector;
and determining an identity verification result according to the first prediction probability.
According to an aspect of an embodiment of the present application, there is provided an identity authentication apparatus based on artificial intelligence, including:
the characteristic extraction module is used for respectively carrying out voiceprint characteristic extraction on the collected voice signals to be verified through at least two neural networks to obtain a characteristic vector output by each neural network for the voice signals to be verified;
the splicing module is used for splicing the feature vectors output by each neural network for the voice signal to be verified to obtain a first voiceprint feature vector of the voice signal to be verified;
the prediction module is used for predicting a first prediction probability that the tone of a user from which the voice signal to be verified comes is the same as that of an authorized user according to the first voiceprint feature vector;
and the verification result determining module is used for determining an identity verification result according to the first prediction probability.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: a processor; and a memory having stored thereon computer-readable instructions that, when executed by the processor, implement the artificial intelligence based authentication method described above.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor, implement the artificial intelligence based authentication method described above.
In the scheme of the application, a first prediction probability that the tone of the user from whom the voice signal to be verified originates is the same as that of an authorized user is predicted in combination with a first voiceprint feature vector obtained by performing voiceprint feature extraction on the voice signal to be verified, and an identity verification result is determined according to the first prediction probability. In this process, the identity verification result is determined in combination with the voiceprint recognition result rather than relying solely on a face recognition result, so the prior-art problem that identity verification performed only through face recognition is deceived by a static photo can be effectively alleviated, and the accuracy of the identity verification result is improved.
Moreover, because different neural networks attend to different information during feature extraction, performing voiceprint feature extraction on the voice signal to be verified through at least two types of neural networks respectively, and splicing the feature vectors output by each type of neural network for the voice signal to be verified, yields a first voiceprint feature vector that characterizes the voiceprint features of the voice signal to be verified in a multi-dimensional manner. This helps ensure the accuracy of the first prediction probability, predicted from the first voiceprint feature vector, that the tone of the user from whom the voice signal to be verified originates is the same as that of the authorized user.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
fig. 1A and 1B show schematic diagrams of exemplary system architectures to which the technical aspects of the embodiments of the present application can be applied;
FIG. 2 is a flow diagram illustrating an artificial intelligence based authentication method according to one embodiment of the present application;
FIG. 3 is a flow diagram of step 230 of the corresponding embodiment of FIG. 2 in one embodiment;
FIG. 4 is a flow diagram illustrating an artificial intelligence based authentication method according to another embodiment;
FIG. 5 is a schematic diagram illustrating a predictive model according to one embodiment;
FIG. 6 is a flow diagram of step 240 of the corresponding embodiment of FIG. 2 in one embodiment;
FIG. 7 is a flow diagram illustrating an artificial intelligence based authentication method in accordance with another embodiment;
FIG. 8 is a schematic diagram illustrating an implementation of sound source localization according to an embodiment;
FIG. 9 is a schematic diagram illustrating an implementation of sound source localization according to another embodiment;
FIG. 10 is a block diagram illustrating an artificial intelligence based authentication apparatus, according to one embodiment;
FIG. 11 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
With the research and the progress of the artificial intelligence technology, the artificial intelligence technology is applied in a plurality of fields, for example, equipment unlocking based on face recognition, payment based on face recognition, access control release based on face recognition, and the like.
In practice, there are cases where an unauthorized user performs authentication using a photograph of an authorized user in order to pass authentication. Specifically, in the process of image acquisition by the terminal device for authentication, the actually acquired image is not the face image of the unauthorized user, but is an image of a photo of an authorized user held by the unauthorized user, and the terminal device may generate an authentication passing result based on the acquired image of the photo of the authorized user. Therefore, the authentication based on face recognition in the related art has a problem of being spoofed by a still picture. In order to solve the problem, the scheme of the embodiment of the application is provided.
Fig. 1A and 1B show schematic diagrams of exemplary system architectures to which the technical solutions of the embodiments of the present application can be applied.
In the system architecture shown in fig. 1A, a user to be verified and a terminal device 110 are included, where the terminal device 110 may be a smartphone, a tablet computer, a laptop computer, a desktop computer, a business handling device in a bank office, or the like, and is not specifically limited here. In order for the user to be verified to use the terminal device 110, the terminal device 110 needs to perform identity verification on the user to be verified; after the user to be verified passes the verification, the terminal device 110 unlocks the screen so that the user to be verified can use the terminal device 110.
Or, in a scenario where the terminal device 110 is installed with an application client, before the user to be authenticated uses the function provided by the application to enter the interactive interface of the application, the application performs authentication on the user to be authenticated, and after the user to be authenticated passes the authentication, the user is allowed to enter the interactive interface of the application.
To authenticate the identity of a user to be verified, the terminal device 110 performs voice acquisition to obtain a voice signal to be verified of that user, performs voiceprint feature extraction on the voice signal to be verified through at least two types of neural networks to obtain a first voiceprint feature vector of the voice signal to be verified, and performs probability prediction according to the first voiceprint feature vector to obtain a first prediction probability that the tone of the user indicated by the voice signal to be verified is the same as that of an authorized user; an identity verification result is then determined based on the first prediction probability.
In some embodiments of the application, in the process of performing identity verification based on voiceprint recognition, image acquisition is also performed on a user to be verified to obtain a face image to be authenticated, and further, face recognition is performed according to the face image to be authenticated so as to predict a second prediction probability that the user indicated by the face image to be authenticated is an authorized user. And finally, comprehensively determining the identity verification result corresponding to the user to be verified by combining the first prediction probability and the second prediction probability.
In some embodiments of the present application, limited by the computing and processing capability of the terminal device 110, performing authentication on the terminal device 110 may be slow, so authentication may instead be performed with the help of the server 120, which has stronger processing capability. In this application scenario, the identity authentication method of the present application can be implemented by the system architecture shown in fig. 1B. As shown in fig. 1B, in addition to the user to be verified and the terminal device 110, the system architecture further includes a server 120 communicatively connected to the terminal device 110, where the server 120 may be a cluster formed by a plurality of servers.
Under the system architecture shown in fig. 1B, terminal device 110 performs voice acquisition to obtain a voice signal to be verified, or acquires a face image of a user to be verified while acquiring voice to obtain a face image to be authenticated; then, the terminal device 110 sends the collected user information (the collected voice signal to be verified, or the collected voice signal to be verified and the collected face image to be authenticated) to the server 120, and the server 120 performs authentication according to the user information sent by the terminal device 110 to determine an authentication result. The server 120 then returns the authentication result to the terminal device 110.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
fig. 2 is a flow chart illustrating an artificial intelligence based authentication method according to an embodiment of the present application, which may be performed by a device having a computing processing function, such as the terminal device shown in fig. 1A or the server shown in fig. 1B. Referring to fig. 2, the authentication method at least includes steps 210 to 240, which are described in detail as follows:
step 210, performing voiceprint feature extraction on the collected voice signals to be verified through at least two types of neural networks respectively to obtain feature vectors output by each type of neural network for the voice signals to be verified.
With the improvement of the safety awareness of users, authentication is required in more and more scenes, such as unlocking of a terminal screen, authentication for using an application program, authentication for allowing an access control device to pass, authentication for opening a safe, and the like.
In the authentication method according to the embodiment of the present application, the authentication process may be started in response to an authentication request, in other words, the authentication request is used to indicate that authentication is started. The authentication request may be generated based on a user-triggered action. In different application scenarios, the triggered operations for generating the authentication request may differ.
For example, in a scenario of unlocking a screen of the terminal, the operation triggered for generating the authentication request may be a pressing operation of a user on a designated key on the terminal, a touch or click operation on the terminal screen, where the designated key is, for example, a power key on the terminal, a Home key on a smartphone, or the like.
In the scenario of performing authentication for an application in the terminal, the operation triggered to generate the authentication request may be a touch operation or a click operation of the user on an icon corresponding to the application.
After authentication is initiated, the process of steps 210 to 240 is performed for authentication. The voice signal to be verified is the voice signal collected for identity verification.
In some embodiments of the present application, the voiceprint feature extraction may be performed directly with the time-domain speech signal to be verified, or may be performed with a frequency-domain speech signal to be verified. For a scene in which a voice signal to be verified in the frequency domain is used for voiceprint feature extraction, after the voice signal to be verified in the time domain is acquired, time-frequency transformation needs to be performed on it to obtain the voice signal to be verified in the frequency domain, for example, a frequency spectrum of the voice signal to be verified, and then the frequency-domain signal is respectively input to each neural network for voiceprint feature extraction. The time-frequency transform may be a short-time Fourier transform, a Fourier transform, or the like, and is not specifically limited here.
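As an illustrative sketch of this time-frequency transformation (a minimal example in Python, assuming a mono 16 kHz signal; the window length and hop size are illustrative assumptions rather than values specified by this application), the frequency-domain representation fed to the neural networks could be computed as follows:

```python
import numpy as np
from scipy.signal import stft

def to_frequency_domain(waveform: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Convert a time-domain voice signal to a magnitude spectrogram via a
    short-time Fourier transform. The 25 ms window / 10 ms hop below are
    illustrative assumptions."""
    _, _, zxx = stft(waveform, fs=sample_rate, nperseg=400, noverlap=240)
    return np.abs(zxx)  # shape: (frequency_bins, time_frames)
```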
Because different neural networks have differences in the process of extracting features of voice signals to be verified, compared with the feature vectors output by only one neural network for the voice signals to be verified, the voiceprint feature extraction of the voice signals to be verified by at least two different neural networks respectively can reflect the voiceprint information of the voice signals to be verified in a multi-dimensional manner, and a multi-dimensional prediction basis can be provided for subsequent probability prediction, so that the accuracy of the predicted probability is ensured.
Wherein any two of the at least two neural networks differ at least in the composition structure of the neural network, e.g., the two neural networks differ in type. In some embodiments of the present application, the neural network used for performing voiceprint feature extraction on the speech signal to be verified may be a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a gated recurrent unit (GRU), a bidirectional recurrent neural network, a bidirectional long short-term memory network, a bidirectional gated recurrent unit, or the like.
It should be noted that before the at least two neural networks are used for extracting the voiceprint features of the voice signal to be verified, training needs to be performed through training samples, and the training process is described in detail in the following description and is not described herein again.
In a specific embodiment, the number of the neural networks used for extracting the voiceprint features of the speech signal to be verified may be set according to actual needs, for example, two, three, or more.
In some embodiments of the present application, in order to consider both the accuracy of the identity authentication and the processing capability of the device, voiceprint feature extraction is performed on the voice signals to be authenticated through two different neural networks. In this embodiment, step 210 includes: performing voiceprint feature extraction on a voice signal to be verified through a second neural network to obtain a first feature vector; and performing voiceprint feature extraction on the voice signal to be verified through a third neural network to obtain a second feature vector, wherein the third neural network is different from the second neural network. The types of the second neural network and the third neural network may be selected according to actual needs, and are not specifically limited herein.
In some embodiments of the present application, the second neural network may be a convolutional neural network and the third neural network may be a recurrent neural network.
Step 220, the feature vectors output by each neural network for the voice signal to be verified are spliced to obtain a first voiceprint feature vector of the voice signal to be verified.
The first voiceprint feature vector obtained by vector splicing fuses feature vectors output by each neural network for the voice signal to be verified, so that the first voiceprint feature vector can represent multi-dimensional voiceprint information of the voice signal to be verified.
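As a minimal, non-limiting sketch of steps 210 and 220 (in Python with PyTorch; the branch architectures, layer sizes and output dimensions are assumptions made for illustration only), two different neural networks each produce a feature vector from the same input, and the two vectors are spliced into the first voiceprint feature vector:

```python
import torch
import torch.nn as nn

class CnnBranch(nn.Module):
    """Illustrative convolutional branch (standing in for the 'second neural network')."""
    def __init__(self, n_freq_bins: int = 201, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_freq_bins, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time frames
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, spec):                   # spec: (batch, freq_bins, time)
        return self.fc(self.conv(spec).squeeze(-1))

class RnnBranch(nn.Module):
    """Illustrative recurrent branch (standing in for the 'third neural network')."""
    def __init__(self, n_freq_bins: int = 201, out_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_freq_bins, out_dim, batch_first=True)

    def forward(self, spec):                   # spec: (batch, freq_bins, time)
        _, h = self.rnn(spec.transpose(1, 2))  # feed spectrogram frames as a sequence
        return h[-1]                           # last hidden state as the feature vector

def first_voiceprint_vector(spec, cnn: CnnBranch, rnn: RnnBranch):
    """Step 220: splice the two branch outputs into one voiceprint feature vector."""
    return torch.cat([cnn(spec), rnn(spec)], dim=-1)   # (batch, 256)
```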
In step 230, a first prediction probability that the tone of the user from which the voice signal to be verified originates is the same as that of the authorized user is predicted according to the first voiceprint feature vector.
If two voices have the same tone, their voiceprint features are correspondingly highly similar; therefore, if the user from whom the voice signal to be verified originates has the same tone as the authorized user, the user to be verified can be regarded as the authorized user.
It can be understood that, before step 230, it is further necessary to collect and store the voice signal of the authorized user, so as to identify whether the voice signal to be authenticated is the same as the tone corresponding to the voice signal of the authorized user, with the voice signal of the authorized user as a reference in the authentication process.
Further, for the stored voice signal of the authorized user, voiceprint feature extraction is performed on the authorized voice signal in the same manner as the first voiceprint feature vector of the voice signal to be verified is obtained, and the voiceprint feature vector corresponding to the voice signal of the authorized user is correspondingly obtained.
In other words, the voice signals of the authorized users are respectively subjected to voiceprint feature extraction through at least two neural networks used for extracting the voiceprint features of the voice signals to be verified, and the respectively extracted feature vectors are spliced to obtain the reference voiceprint feature vectors representing the voiceprint information of the authorized users. The reference voiceprint feature vector is generated in the same mode as the first voiceprint feature vector, so that the voiceprint information dimensions of the sound signal reflected by the first voiceprint feature vector and the reference voiceprint feature vector are the same, and further, the influence on the probability prediction result due to the fact that the first voiceprint feature vector and the reference voiceprint feature vector are different in generation mode is avoided.
In some embodiments of the present application, a first prediction probability that a user from which a voice signal to be verified originates and an authorized user have the same tone color may be predicted based on a similarity between the first voiceprint feature vector and a reference voiceprint feature vector of the authorized user. Specifically, the cosine distance between the first voiceprint feature vector and the reference voiceprint feature vector may be calculated, and then the first prediction probability corresponding to the cosine distance obtained through calculation may be determined according to the mapping relationship between the cosine distance and the first prediction probability.
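A minimal sketch of this similarity-based variant (the linear rescaling from cosine similarity to a probability is an assumed example of the mapping relationship mentioned above, not a mapping prescribed by this application):

```python
import torch
import torch.nn.functional as F

def first_prediction_probability(query_vec: torch.Tensor,
                                 reference_vec: torch.Tensor) -> torch.Tensor:
    """Map the cosine similarity between the voiceprint vector of the signal to be
    verified and the authorized user's reference vector to a probability.

    The rescaling from [-1, 1] to [0, 1] is an assumed mapping chosen for illustration.
    """
    cos = F.cosine_similarity(query_vec, reference_vec, dim=-1)
    return (cos + 1.0) / 2.0
```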
In some embodiments of the present application, the first prediction probability may also be predicted from the first voiceprint feature vector and the reference voiceprint feature vector by means of deep learning. The process of implementing the first probability prediction by deep learning is shown in fig. 3, and includes steps 310 to 330, which are specifically described as follows:
step 310, obtaining a reference voiceprint feature vector, wherein the reference voiceprint feature vector is obtained by extracting voiceprint features of the stored voice signals of the authorized users through at least two neural networks.
And step 320, inputting the first voiceprint feature vector and the reference voiceprint feature vector into a first neural network for vector transformation to obtain a target feature vector.
The first neural network may be a recurrent neural network, a neural network formed by a fully connected layer and an activation layer, and the like, and is not specifically limited here. It is understood that, before step 320, the first neural network needs to be trained through training samples, and the specific training process is described below and will not be repeated here.
Vector transformation is performed by the first neural network according to the first voiceprint feature vector and the reference voiceprint feature vector, so that more complex and more dimensional features can be extracted from the first voiceprint feature vector and the reference voiceprint feature vector for probability prediction.
The vector transformation performed by the first neural network may be linear transformation, nonlinear transformation, or an alternation of linear transformation and nonlinear transformation.
In some embodiments of the present application, the first neural network comprises a first fully connected layer, an activation layer and a second fully connected layer in cascade, and step 320 comprises: performing linear transformation on the first voiceprint feature vector and the reference voiceprint feature vector by the first fully connected layer to obtain a first output result; performing nonlinear transformation on the first output result by the activation layer to obtain a second output result; and performing linear transformation on the second output result by the second fully connected layer to obtain a target feature vector.
In a fully connected layer, the input of each neuron is a linear combination of the outputs of the neurons in the previous layer; in other words, each neuron in the fully connected layer is connected with all neurons in the previous layer, which is equivalent to applying a linear transformation between the input and the output of the fully connected layer.
Therefore, in the first fully connected layer and the second fully connected layer, the input is linearly transformed to obtain the corresponding output result. It should be noted that the first fully connected layer or the second fully connected layer may comprise a single fully connected layer or multiple fully connected layers. For a case in which the first fully connected layer (or the second fully connected layer) comprises multiple fully connected layers, because the input and the output of each fully connected layer are related by a linear transformation, the transformation realized by the multiple fully connected layers is equivalent to one overall linear transformation.
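The cascaded structure of a first fully connected layer, an activation layer and a second fully connected layer described above could be sketched as follows (the dimensions, and the concatenation of the two input vectors before the first fully connected layer, are illustrative assumptions):

```python
import torch
import torch.nn as nn

class VectorTransform(nn.Module):
    """Illustrative 'first neural network': linear -> nonlinear -> linear."""
    def __init__(self, voiceprint_dim: int = 256, hidden_dim: int = 256,
                 target_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(2 * voiceprint_dim, hidden_dim)  # first fully connected layer
        self.act = nn.ReLU()                                   # activation layer
        self.fc2 = nn.Linear(hidden_dim, target_dim)           # second fully connected layer

    def forward(self, query_vec, reference_vec):
        x = torch.cat([query_vec, reference_vec], dim=-1)   # fuse the two voiceprint vectors
        return self.fc2(self.act(self.fc1(x)))              # target feature vector
```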
In the activation layer, each neuron performs nonlinear transformation on the input of the layer through an activation function to obtain a corresponding output result, namely the second output result. The activation function configured for each neuron in the activation layer may be a sigmoid function, where the expression of the sigmoid function is:
f(x)=1/(1+e^(-x)), (1)
the activation function may also be a tanh function, where the expression of the tanh function is:
f(x)=(e^x-e^(-x))/(e^x+e^(-x)), (2)
the activation function may also be a ReLU function, where the expression of the ReLU function is:
f(x)=max(0,x), (3)
the activation function may also be a SoftPlus function, where the expression of the SoftPlus function is:
f(x)=ln(1+e^x), (4)
it should be noted that the activation functions are only partially listed above, and are not considered to limit the scope of the application, and other activation functions for implementing a non-linear transformation between an output and an input are also applicable to the application. In a specific embodiment, the activation function configured by the neurons in the activation layer can be selected according to actual needs.
In this example, the alternation of linear transformation and nonlinear transformation in the first neural network is realized by arranging one activation layer; in other embodiments, a plurality of activation layers may be arranged, each activation layer separating two adjacent fully connected layers, so as to increase the depth of the first neural network and give it stronger nonlinear learning ability.
And 330, performing probability prediction according to the target feature vector to obtain a first prediction probability that the tone of the user from which the voice signal to be verified comes is the same as that of the authorized user.
The first voiceprint feature vector and the reference voiceprint feature vector are transformed by the first neural network so that they are fused, and the obtained target feature vector more richly represents the voiceprint information of the voice signal to be verified and the voiceprint information of the voice signal of the authorized user, thereby helping to ensure the accuracy of the predicted first prediction probability.
Continuing with FIG. 2, at step 240, the identity verification result is determined according to the first prediction probability.
It can be understood that the authentication result indicates a result that the authentication is passed and a result that the authentication is not passed, where the result that the authentication is passed indicates that the user from which the voice signal to be authenticated is an authorized user, and correspondingly, the result that the authentication is not passed indicates that the user from which the voice signal to be authenticated is not an authorized user.
In some embodiments of the present application, the authentication result may be determined in dependence on only the first predicted probability. Specifically, a probability range (hereinafter referred to as a third probability range for the convenience of distinguishing probability ranges) is set in advance for the authentication result indicating that the authentication passes, and if the obtained first predicted probability is within the third probability range, the authentication result is determined as a result indicating that the authentication passes; otherwise, if the first prediction probability is not in the third probability range, the identity verification result is determined to be a result indicating that the verification is not passed.
In some embodiments of the present application, in the identity verification process based on voiceprint recognition, a face image of the user to be verified (i.e., a face image to be authenticated) is also acquired, and a second prediction probability that the user indicated by the face image to be authenticated is an authorized user is predicted from it. After the first prediction probability and the second prediction probability are obtained, the identity verification result is determined by combining the first prediction probability and the second prediction probability. The process of determining the authentication result by combining the first prediction probability and the second prediction probability is described in detail below, and is not repeated here.
In the scheme of the application, a first prediction probability that the tone of the user from whom the voice signal to be verified originates is the same as that of an authorized user is predicted in combination with a first voiceprint feature vector obtained by performing voiceprint feature extraction on the voice signal to be verified, and an identity verification result is determined according to the first prediction probability. In this process, the identity verification result is determined in combination with the voiceprint recognition result rather than relying solely on a face recognition result, so the prior-art problem that identity verification performed only through face recognition is deceived by a static photo can be effectively alleviated, and the accuracy of the identity verification result is improved.
Moreover, because different neural networks attend to different information during feature extraction, performing voiceprint feature extraction on the voice signal to be verified through at least two types of neural networks respectively, and splicing the feature vectors output by each type of neural network for the voice signal to be verified, yields a first voiceprint feature vector that characterizes the voiceprint features of the voice signal to be verified in a multi-dimensional manner. This helps ensure the accuracy of the first prediction probability, predicted from the first voiceprint feature vector, that the tone of the user from whom the voice signal to be verified originates is the same as that of the authorized user.
In some embodiments of the present application, the process of steps 210 to 230 in the above embodiments is implemented by a prediction model. The prediction model includes the first neural network, the second neural network, and the third neural network listed in the above embodiments. In addition, the prediction model further comprises a multilayer perceptron; the target feature vector output by the first neural network is input into the multilayer perceptron, the multilayer perceptron performs classification and outputs a corresponding classification result, and the output classification result is used for indicating the first prediction probability that the tone of the user from whom the voice signal to be verified originates is the same as that of the authorized user.
Before the prediction model is used for implementing the process of steps 210 to 230, the prediction model also needs to be trained. The process of training the prediction model is shown in fig. 4 and includes steps 410 to 460, which are specifically described as follows:
step 410, obtaining training data, where the training data includes a plurality of training samples and labels corresponding to the training samples, the training samples include two segments of voice signals, and the labels corresponding to the training samples are used to indicate whether the two segments of voice signals included in the training samples are from the same user.
Wherein, in order to train the prediction model, a training sample needs to be constructed before step 410. Each training sample comprises two segments of speech signals.
Further, in order to ensure the training effect of the prediction model, the constructed training samples include positive training samples and negative training samples. The two segments of voice signals in a positive training sample come from the same user, that is, two segments of voice are collected while that user is speaking. The two segments of voice in a negative training sample come from two different users: a segment of voice signal is collected while each of the two users speaks separately, and the two segments of voice signals then form the negative training sample.
In the embodiment, the prediction model is trained by means of supervised training. Therefore, prior to step 410, labels also need to be labeled for each training sample. Specifically, for a positive training sample, labeling a first label, where the first label is used to indicate that two segments of voice signals in the positive training sample are from the same user; and for the negative training sample, labeling a second label, wherein the second label is used for indicating that the two sections of voice signals in the negative training sample are not from the same user.
Step 420, the second neural network respectively extracts the voiceprint characteristics of the two sections of voice signals in the training sample to obtain a first sample characteristic vector corresponding to each section of voice signals; and respectively carrying out voiceprint feature extraction on the two sections of voice signals in the training sample by using a third neural network to obtain a second sample feature vector corresponding to each section of voice signals.
And 430, splicing the corresponding first sample characteristic vector and the corresponding second sample characteristic vector aiming at each section of voice signals in the training samples to obtain a first sample voiceprint characteristic vector of each section of voice signals.
Step 440, the first neural network transforms the first sample voiceprint feature vectors corresponding to the two segments of speech signals in the training sample, and outputs the sample target feature vector of the training sample.
The first neural network may refer to the above description for transforming the voiceprint feature vectors of the first sample corresponding to the two segments of the speech signals in the input training sample, which is not described herein again.
And step 450, predicting by the multilayer perceptron according to the sample target feature vector to obtain the sample prediction probability that the two sections of voice signals of the training sample come from the same user.
And step 460, adjusting parameters of at least one of the first neural network, the second neural network, the third neural network and the multilayer perceptron according to the sample prediction probability and the label corresponding to the training sample.
Specifically, if the sample prediction probability predicted for the training sample does not match the label corresponding to the training sample, adjusting parameters of a prediction model, namely adjusting parameters of at least one of a first neural network, a second neural network, a third neural network and a multilayer perceptron in the prediction model until the sample prediction probability predicted for the training sample by the prediction model matches the label corresponding to the training sample after the parameters are adjusted; otherwise, if the sample prediction probability predicted by the training sample is consistent with the label corresponding to the training sample, continuing to train with the next training sample and the corresponding label.
In order to judge whether the sample prediction probability predicted for a training sample is consistent with the label corresponding to the training sample, a probability threshold is preset. If the sample prediction probability predicted for the training sample is greater than the probability threshold, the predicted sample prediction probability is regarded as indicating that the two segments of voice in the corresponding training sample come from the same user; otherwise, if the predicted sample prediction probability is not greater than the probability threshold, the predicted sample prediction probability is regarded as indicating that the two segments of speech in the corresponding training sample are not from the same user.
On this basis, if the sample prediction probability predicted for a training sample indicates that the two segments of speech signals in the training sample are from the same user and the label corresponding to the training sample indicates that they are from the same user, or if the sample prediction probability predicted for a training sample indicates that the two segments of speech signals in the training sample are not from the same user and the label indicates that they are not from the same user, then the sample prediction probability predicted for the training sample is consistent with the label corresponding to the training sample.
Similarly, if the sample prediction probability predicted for a training sample indicates that the two segments of speech signals in the training sample are from the same user while the label indicates that they are not from the same user, or if the sample prediction probability indicates that the two segments of speech signals are not from the same user while the label indicates that they are from the same user, then the sample prediction probability predicted for the training sample is not consistent with the label corresponding to the training sample.
In some embodiments of the present application, the training of the prediction model is ended when the loss function of the prediction model converges or the number of iterations of the prediction model reaches a set number of iterations.
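Putting steps 410 to 460 together, a minimal supervised training loop might look like the sketch below (building on the branch and transformation sketches above; the binary cross-entropy loss, the sigmoid classifier standing in for the multilayer perceptron, and the optimizer setup are illustrative assumptions, not choices mandated by this application):

```python
import torch
import torch.nn as nn

# cnn, rnn and transform are assumed to be instances of CnnBranch, RnnBranch
# and VectorTransform from the sketches above; the classifier below stands in
# for the multilayer perceptron.
classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

def predict_same_user(spec_a, spec_b, cnn, rnn, transform):
    """Predict the probability that two speech segments come from the same user."""
    vec_a = first_voiceprint_vector(spec_a, cnn, rnn)      # steps 420 and 430
    vec_b = first_voiceprint_vector(spec_b, cnn, rnn)
    target = transform(vec_a, vec_b)                       # step 440
    return torch.sigmoid(classifier(target)).squeeze(-1)   # step 450

def train_step(batch, cnn, rnn, transform, optimizer):
    """One parameter update over a batch of (spec_a, spec_b, label) samples;
    the optimizer is assumed to hold the parameters of all four sub-networks."""
    spec_a, spec_b, labels = batch      # labels: 1.0 same user, 0.0 different users
    prob = predict_same_user(spec_a, spec_b, cnn, rnn, transform)
    loss = nn.functional.binary_cross_entropy(prob, labels)   # step 460
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```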
Fig. 5 is a schematic diagram of a prediction model according to an embodiment, in which the first neural network includes a first fully connected layer, a ReLU layer, and a second fully connected layer in cascade, as shown in fig. 5. The ReLU layer in the first neural network serves as the activation layer, and the activation function configured for each neuron in the layer is the ReLU function. The second neural network is a convolutional neural network and the third neural network is a recurrent neural network.
As shown in fig. 5, the convolutional neural network includes cascaded convolutional layers, pooling layers, and fully connected layers. After the voice signal to be verified is input into the convolutional neural network, it is sequentially processed by the convolutional layer, the pooling layer and the fully connected layer, and the first feature vector for the voice signal to be verified is output.
After the voice signal to be verified is input to the recurrent neural network, the recurrent neural network outputs a second feature vector aiming at the voice signal to be verified; and then splicing the first characteristic vector and the second characteristic vector to obtain a first voiceprint characteristic vector.
In some embodiments, since the voice signal of the authorized user is stored in advance, a first feature vector may be output for the voice signal of the authorized user by using a convolutional neural network; outputting a second feature vector for the voice signal of the authorized user by using a recurrent neural network; and then splicing the first characteristic vector and the second characteristic vector of the voice signal of the authorized user to correspondingly obtain a first voiceprint characteristic vector, and storing the first voiceprint characteristic vector corresponding to the voice signal of the authorized user. In the subsequent process of identity authentication, voiceprint feature extraction is not carried out on the voice signal of the authorized user in each authentication, so that processing resources are saved.
After the first voiceprint feature vectors aiming at the voice signals to be verified and the first voiceprint feature vectors aiming at the voice signals of the authorized user are obtained, the two first voiceprint feature vectors are input into a first neural network, the first neural network alternately carries out linear transformation and nonlinear transformation aiming at the two input first voiceprint feature vectors, and target feature vectors are correspondingly output.
And then, inputting the target characteristic vector into a multi-layer perceptron, classifying by the multi-layer perceptron, and outputting a classification result, wherein the output classification result is used for indicating a first prediction probability that the tone of a user from which the voice signal to be verified comes is the same as that of an authorized user.
In some embodiments of the present application, as shown in fig. 6, step 240 comprises:
and step 610, acquiring a second prediction probability, wherein the second prediction probability is the probability that the user indicated by the face image to be verified is an authorized user according to the prediction of the face image to be verified.
And when the terminal performs identity authentication through voiceprint recognition, image acquisition is performed on the user to be authenticated to obtain a face image to be authenticated, and therefore, face recognition is performed on the basis of the obtained face image to be authenticated to correspondingly obtain a second prediction probability.
In some embodiments of the present application, the obtaining the second prediction probability may be performed by the following processes, including: acquiring a face image to be verified; extracting features of a face image to be verified to obtain a first face feature vector; and predicting according to the first face feature vector and the second face feature vector to obtain a second prediction probability, wherein the second face feature vector is obtained by extracting the features of the stored face image of the authorized user.
Specifically, the feature extraction performed on the face image to be verified may be implemented by a convolutional neural network; of course, other neural networks capable of implementing image recognition may also be used, which is not specifically limited here. The convolutional neural network may include a feature extraction layer (which may itself comprise a multilayer neural network) and an output layer; the feature extraction layer is configured to perform feature extraction on the input face image to be verified to obtain the first face feature vector, and the output layer is configured to perform classification prediction according to the first face feature vector and the second face feature vector and output the second prediction probability.
In some embodiments of the present application, after the first face feature vector is extracted, the second prediction probability may be predicted based on a similarity between the first face feature vector and the second face feature vector. Specifically, a mapping relation between the similarity and the second prediction probability is preset, and then after the similarity between the first face feature vector and the second face feature vector is obtained through calculation, the second prediction probability corresponding to the obtained similarity is correspondingly searched.
And step 620, weighting the first prediction probability and the second prediction probability to obtain a target prediction probability.
Wherein the weighting factor for the first prediction probability and the weighting factor for the second prediction probability may be determined experimentally. Of course, in the process of practical application, the two weighting coefficients may also be adjusted according to practical situations.
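A small sketch of steps 620 and 630 (the equal weighting coefficients and the pass threshold standing in for the first probability range are illustrative assumptions; as noted above, the coefficients may be determined experimentally and adjusted in practice):

```python
def verify_identity(first_prob: float, second_prob: float,
                    voice_weight: float = 0.5, face_weight: float = 0.5,
                    pass_threshold: float = 0.8) -> bool:
    """Fuse the voiceprint probability (first_prob) and the face recognition
    probability (second_prob) into a target prediction probability and check
    it against an assumed first probability range [pass_threshold, 1.0]."""
    target_prob = voice_weight * first_prob + face_weight * second_prob  # step 620
    return target_prob >= pass_threshold                                 # step 630
```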
And 630, if the target prediction probability is within the first probability range, determining the identity verification result as a result indicating that the verification is passed.
Otherwise, if the target prediction probability is not within the first probability range, determining the identity verification result as a result indicating that the verification is not passed.
In this embodiment, the second prediction probability obtained by face recognition and the first prediction probability obtained by voiceprint recognition are combined to comprehensively determine the identity verification result, thereby avoiding the situation of being cheated by a static photo when identity verification is performed only through face recognition. Moreover, combining the two authentication modes to determine the authentication result helps ensure the accuracy and reasonableness of the authentication result.
In some embodiments of the present application, after step 240, the method further comprises: if the identity verification result indicates that the verification is passed, unlocking the object requested to be unlocked; and if the identity authentication result indicates that the authentication is not passed, performing authentication failure prompt.
The object requested to be unlocked may be a display screen of the terminal device or an application requested to enter.
As described above, the authentication method may be performed for unlocking the screen of the terminal device; correspondingly, if the authentication result indicates that the authentication is passed, the terminal device performs the unlocking process. Otherwise, a verification failure prompt is given.
As described above, the authentication method may also be performed for entering an application program; correspondingly, if the authentication result indicates that the authentication is passed, the requested application program performs unlocking processing so that the interactive interface of the application program is entered. Otherwise, a verification failure prompt is given.
In some embodiments of the present application, in an application scenario of authentication for passage, after the authentication result is determined, if the authentication result indicates that the authentication is passed, a release action is executed so that the user to be verified is allowed to pass.
In some embodiments of the present application, prior to step 210, as shown in fig. 7, the method further comprises:
step 710, obtain an authentication request.
And 720, starting sound collection according to the authentication request.
Step 730, performing background sound identification on the collected first sound signal.
In some embodiments of the present application, the background sound identification may be performed based on a distance between a sound source corresponding to the first sound signal and the terminal device, and specifically, if the distance between the sound source corresponding to the first sound signal and the terminal device is within a target distance range, which indicates that the sound source corresponding to the first sound signal is closer to the terminal device, the first sound signal is determined to be a non-background sound; otherwise, if the distance between the sound source corresponding to the first sound signal and the terminal device exceeds the set target distance range, indicating that the sound source corresponding to the first sound signal is far away from the terminal device, determining that the first sound signal is the background sound.
In some embodiments of the present application, background sound identification may be performed based on the intensity of the first sound signal. Generally, because the user to be verified is close to the terminal device, the intensity of the sound made by the user to be verified, as measured at the sound collection module disposed on the terminal device, is higher than that of other sounds. An intensity threshold may therefore be set: if the intensity of the collected first sound signal exceeds the intensity threshold, the first sound signal is determined to be the sound made by the user to be verified, that is, a non-background sound; otherwise, if the intensity does not exceed the intensity threshold, the first sound signal is determined to be a background sound.
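A minimal sketch of this intensity-based identification is given below. Approximating the intensity by the root-mean-square amplitude of the sampled signal, and the particular threshold value, are assumptions made for illustration and are not definitions given by the present disclosure.

    import numpy as np

    def is_background_sound(samples: np.ndarray, intensity_threshold: float) -> bool:
        """Classify the first sound signal as a background sound when its
        intensity does not exceed the intensity threshold.

        Intensity is approximated here by the RMS amplitude of the samples;
        this measure is an assumption for illustration only."""
        intensity = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
        return intensity <= intensity_threshold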
In step 740, if it is determined that the first sound signal is not the background sound, the first sound signal is used as the voice signal to be verified.
In some embodiments of the present application, if the power of the collected first sound signal is lower than a power threshold, or the first sound signal is identified as a background sound, the second prediction probability is obtained; and if the second prediction probability is within the second probability range, the identity verification result is determined as a result indicating that the verification is passed.
In practice, the user to be verified may not speak after sound collection is started, in which case the collected first sound signal may be a weak environmental sound. If the power of the collected first sound signal is lower than the power threshold, it indicates that the first sound signal is very weak and that the user to be verified did not speak during sound collection. In this case, since the user to be verified does not speak, voiceprint recognition cannot be performed, and the identity verification result is therefore determined only from the second prediction probability.
Similarly, in practice the background may contain a relatively high-power sound, so that the collected first sound signal has a relatively high power but is actually a background sound while the user to be verified does not speak. In this case as well, since the user to be verified does not speak, the identity verification result is determined only from the second prediction probability.
In some embodiments of the present application, to reduce the possibility that the voice signal to be verified cannot be acquired because the user to be verified does not know that voice acquisition is required, a candidate question may be selected from a candidate question set according to the identity verification request after the request is received, and played by voice; the answer voice of the user to be verified to the played candidate question is then collected through the voice acquisition module in the terminal to obtain the voice signal to be verified. In this way, the voice-played candidate question prompts the user to be verified to speak, that is, the user is prompted to answer the question, so that the answer voice can be collected and used as the voice signal to be verified.
In some embodiments of the present application, sound collection is performed by a sound collection module that includes at least four sound collectors, and step 730 includes: positioning the sound source according to the first sound signals respectively collected by the at least four sound collectors, and determining a first distance between the sound source corresponding to the first sound signal and the sound collection module; if the first distance is within the set target distance range, determining that the first sound signal is a non-background sound; and if the first distance exceeds the set target distance range, determining that the first sound signal is a background sound.
In some embodiments of the present application, the distance from the sound source to each sound collector may be determined by measuring the time taken for the sound signal to reach each sound collector, and then, under the condition that the position of the sound collector is determined and the distance from the sound source to the sound collector is known, the position of the sound source may be determined in reverse, thereby implementing sound source localization.
The principle of sound source localization in a two-dimensional plane will now be explained. To determine the position of a sound source in a two-dimensional plane, at least three sound collectors need to be arranged in the plane. Suppose three sound collectors R1, R2 and R3 are arranged in the plane and the sound source is S. As shown in fig. 8, since the same sound signal from the sound source S is received by each of the three collectors, a circle can be drawn around each collector with the distance from that collector to the sound source S as the radius, and the intersection point of the three circles is the position of the sound source. Therefore, by measuring the time the sound signal takes to travel from the sound source to each of R1, R2 and R3, the distances from the sound source to the three collectors are calculated, and the position of the sound source can then be determined in combination with the positions of the three collectors.
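As a rough illustration of this circle-intersection idea, the sketch below linearizes the circle equations against the first collector and solves the resulting system by least squares; this particular numerical formulation is an assumption for illustration, not a requirement of the present disclosure.

    import numpy as np

    def locate_source_2d(collectors: np.ndarray, distances: np.ndarray) -> np.ndarray:
        """Locate a sound source in the plane from n >= 3 collector positions
        and the measured source-to-collector distances.

        collectors: (n, 2) array of collector coordinates
        distances:  (n,) array of distances from the source to each collector
        """
        x1, y1 = collectors[0]
        r1 = distances[0]
        # Subtract the circle equation of the first collector from the others
        # to obtain a linear system in the source coordinates (x, y).
        a = 2.0 * (collectors[1:] - collectors[0])
        b = (np.sum(collectors[1:] ** 2, axis=1) - (x1 ** 2 + y1 ** 2)
             - (distances[1:] ** 2 - r1 ** 2))
        source, *_ = np.linalg.lstsq(a, b, rcond=None)
        return source  # estimated (x, y) of the sound source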
Of course, if the sound source is located in three-dimensional space, at least four sound collectors are required to be arranged, so that sound source localization is realized according to the principle shown in fig. 8.
In some embodiments of the present application, sound source localization may also be performed using the TDOA (Time Difference of Arrival) principle, that is, the distance difference between the sound source and any two sound collectors is calculated from the time difference of the sound signal arriving at the two collectors, and the position of the sound source is then determined by combining the positions of the sound collectors with the determined distance differences.
The principle of sound source localization using TDOA in a two-dimensional plane will now be described. To determine the position of a sound source in a two-dimensional plane by the TDOA principle, at least three sound collectors need to be arranged in the plane. Suppose three sound collectors R1, R2 and R3 are arranged in the plane, the sound source is S, and the distances from the sound source S to the collectors R1, R2 and R3 are L1, L2 and L3, respectively.
After the time difference of the sound signal reaching any two sound collectors is obtained, the distance difference between the sound source and the two sound collectors is obtained by multiplying the sound velocity by the time difference. In a plane, a locus of points whose difference in distance from two fixed points is constant is a hyperbola, and thus a hyperbola can be determined based on the positions of any two of the three sound collectors and the difference in distance from the sound source to the two sound collectors.
On this basis, as shown in fig. 9, a hyperbola can be determined based on the positions of the sound collectors R1 and R2 and the difference in distance from the sound source to the sound collectors R1 and R2; similarly, another hyperbola can be determined based on the positions of the sound collectors R1 and R3 and the difference in the distances of the sound source S to the sound collectors R1 and R3; another hyperbola can be determined based on the positions of the sound collectors R2 and R3 and the difference in the distance of the sound source S to the sound collectors R2 and R3; the intersection point of the three hyperbolas is the position of the sound source.
Of course, in a three-dimensional space, at least four sound collectors are arranged, so as to determine the position of the sound source according to the principle shown in fig. 9.
The sound collection module comprises at least four sound collectors. One of the at least four sound collectors is used as a reference collector, and the sound collectors other than the reference collector are referred to below as the other collectors. The process of positioning the sound source according to the principle shown in fig. 9 is as follows:
taking the time at which the reference collector acquires the first sound signal as a reference time, and calculating, for each of the other collectors, the time difference between the time at which that collector acquires the first sound signal and the reference time; calculating, according to each time difference, the distance difference between the distance from the sound source corresponding to the first sound signal to that collector and the distance from the sound source to the reference collector; determining the position information of the sound source corresponding to the first sound signal according to these distance differences; and calculating the first distance according to the position information of the sound source corresponding to the first sound signal and the position information of the sound collection module.
Specifically, a coordinate system is constructed with the position of the reference collector as the origin, so that the coordinates of each of the other collectors in this coordinate system are obtained from the position information of the reference collector and of the other collectors.
The distance difference between the distance from the sound source corresponding to the first sound signal to each of the other collectors and the distance from the sound source to the reference collector is calculated from the time difference with which that collector acquires the first sound signal relative to the reference collector, by multiplying the time difference by the speed of sound.
The following matrix equation is constructed from the coordinates of the other collectors and the calculated distance differences:
AX = B, (5)
where A is an n×4 matrix, n is the number of the other collectors, n ≥ 4, and the ith row of A is [x_i, y_i, z_i, d_i], in which x_i is the x-axis coordinate of the ith collector, y_i is its y-axis coordinate, z_i is its z-axis coordinate, and d_i is the distance difference between the distance from the sound source corresponding to the first sound signal to the ith collector and the distance from the sound source to the reference collector; X = [x, y, z, R]^T, whose entries are the parameters to be determined, namely the coordinates (x, y, z) of the sound source and its distance R to the reference collector; and B is an n×1 column vector whose ith element is
B_i = (x_i² + y_i² + z_i² − d_i²) / 2,
which follows from expanding the distance equation (x − x_i)² + (y − y_i)² + (z − z_i)² = (R + d_i)² and using R² = x² + y² + z², the reference collector being at the coordinate origin.
Solving this matrix equation yields the position coordinates (x, y, z) of the sound source corresponding to the first sound signal, that is, the position information of the sound source corresponding to the first sound signal.
Furthermore, the distance between the sound source corresponding to the first sound signal and the sound collection module, namely the first distance, can be calculated from the position information of the sound source corresponding to the first sound signal and the position information of the sound collection module.
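For illustration only, the linearized system AX = B described above may be solved numerically as sketched below. The least-squares solver, the assumed speed of sound, and the function interface are choices made for this sketch rather than requirements of the present disclosure.

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s, an assumed ambient value

    def locate_source_tdoa(positions: np.ndarray, time_diffs: np.ndarray) -> np.ndarray:
        """Solve the linearized TDOA system AX = B with the reference collector
        at the coordinate origin.

        positions:  (n, 3) coordinates of the n >= 4 other collectors
        time_diffs: (n,) arrival-time differences of the first sound signal at
                    each of the other collectors relative to the reference collector
        Returns the estimated source coordinates (x, y, z)."""
        d = SPEED_OF_SOUND * time_diffs                       # distance differences d_i
        a = np.hstack([positions, d[:, None]])                # rows [x_i, y_i, z_i, d_i]
        b = 0.5 * (np.sum(positions ** 2, axis=1) - d ** 2)   # (x_i^2 + y_i^2 + z_i^2 - d_i^2) / 2
        solution, *_ = np.linalg.lstsq(a, b, rcond=None)
        x, y, z, _r = solution                                # _r is the source-to-reference distance R
        return np.array([x, y, z])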
Embodiments of the apparatus of the present application are described below, which may be used to perform the methods of the above-described embodiments of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method described above in the present application.
The present application provides an artificial intelligence-based identity verification apparatus 1000, which can be configured in the terminal device shown in fig. 1A. As shown in fig. 10, the identity verification apparatus 1000 includes:
The feature extraction module 1010 is configured to perform voiceprint feature extraction on the collected voice signal to be verified through at least two neural networks respectively, so as to obtain the feature vector output by each neural network for the voice signal to be verified.
The splicing module 1020 is configured to splice the feature vectors output by each neural network for the voice signal to be verified, so as to obtain a first voiceprint feature vector of the voice signal to be verified.
The prediction module 1030 is configured to predict, according to the first voiceprint feature vector, a first prediction probability that a user from which the voice signal to be verified originates is the same as the authorized user in tone.
And the verification result determining module 1040 is configured to determine an identity verification result according to the first prediction probability.
In some embodiments of the present application, the prediction module 1030 comprises: a reference voiceprint feature vector acquisition unit, configured to acquire a reference voiceprint feature vector, the reference voiceprint feature vector being obtained by performing voiceprint feature extraction on the stored voice signal of the authorized user through the at least two neural networks; a transformation unit, configured to input the first voiceprint feature vector and the reference voiceprint feature vector into the first neural network for vector transformation to obtain a target feature vector; and a prediction unit, configured to perform probability prediction according to the target feature vector to obtain a first prediction probability that the tone of the user from which the voice signal to be verified comes is the same as that of the authorized user.
In some embodiments of the present application, the first neural network comprises a first fully connected layer, an activation layer and a second fully connected layer in cascade; the transformation unit includes: a first linear transformation unit, configured to perform linear transformation on the first voiceprint feature vector and the reference voiceprint feature vector by the first fully connected layer to obtain a first output result; a nonlinear transformation unit, configured to perform nonlinear transformation on the first output result by the activation layer to obtain a second output result; and a second linear transformation unit, configured to perform linear transformation on the second output result by the second fully connected layer to obtain the target feature vector.
In some embodiments of the present application, the feature extraction module 1010 includes: the first extraction unit is used for extracting the voiceprint features of the voice signal to be verified through a second neural network to obtain a first feature vector; and the second extraction unit is used for extracting the voiceprint features of the voice signal to be verified through a third neural network to obtain a second feature vector, and the third neural network is different from the second neural network.
In some embodiments of the present application, the prediction unit is configured to perform probability prediction by a multilayer perceptron according to the target feature vector to obtain the first prediction probability that the tone of the user from which the voice signal to be verified comes is the same as that of the authorized user. In this embodiment, the artificial intelligence-based identity verification apparatus further includes:
the training data acquisition module is used for acquiring training data, the training data comprises a plurality of training samples and labels corresponding to the training samples, the training samples comprise two sections of voice signals, and the labels corresponding to the training samples are used for indicating whether the two sections of voice signals included in the training samples come from the same user or not.
And the second characteristic extraction module is used for respectively carrying out voiceprint characteristic extraction on the two sections of voice signals in the training sample by the second neural network to obtain a first sample characteristic vector corresponding to each section of voice signals.
And the third characteristic extraction module is used for respectively carrying out voiceprint characteristic extraction on the two sections of voice signals in the training sample by using a third neural network to obtain a second sample characteristic vector corresponding to each section of voice signals.
The second splicing module is used for splicing the corresponding first sample characteristic vector and the corresponding second sample characteristic vector aiming at each section of voice signals in the training samples to obtain a first sample voiceprint characteristic vector of each section of voice signals;
and the second transformation module is used for transforming the first neural network according to the first sample voiceprint characteristic vectors respectively corresponding to the two sections of voice signals in the training sample and outputting the sample target characteristic vector of the training sample.
And the second prediction module is used for predicting the sample prediction probability of two sections of voice signals of the training sample from the same user according to the sample target feature vector by the multilayer perceptron.
And the parameter adjusting module is used for adjusting at least one parameter of the first neural network, the second neural network, the third neural network and the multilayer perceptron according to the sample prediction probability and the label corresponding to the training sample.
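For illustration, one parameter-adjustment step over the modules described above may look as sketched below. The binary cross-entropy loss, a single optimizer shared by the four networks, a sigmoid-terminated perceptron, and the module interfaces are assumptions made only for this sketch.

    import torch
    import torch.nn as nn

    def training_step(second_net: nn.Module, third_net: nn.Module,
                      first_net: nn.Module, perceptron: nn.Module,
                      optimizer: torch.optim.Optimizer,
                      segment_a: torch.Tensor, segment_b: torch.Tensor,
                      label: torch.Tensor) -> float:
        """One training step for a sample of two voice segments and a same-user
        label (a float tensor in [0, 1])."""
        # Voiceprint feature extraction by the second and third neural networks,
        # followed by splicing into the first sample voiceprint feature vectors.
        vec_a = torch.cat([second_net(segment_a), third_net(segment_a)], dim=-1)
        vec_b = torch.cat([second_net(segment_b), third_net(segment_b)], dim=-1)
        # Transformation by the first neural network, then probability prediction
        # by the multilayer perceptron (assumed to end with a sigmoid).
        target_vec = first_net(vec_a, vec_b)
        prob = perceptron(target_vec)
        loss = nn.functional.binary_cross_entropy(prob, label)
        # Adjust the network parameters according to the prediction and the label.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()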
In some embodiments of the present application, the verification result determination module 1040 includes: a second prediction probability obtaining unit, configured to obtain a second prediction probability, which is the probability, predicted from the face image to be verified, that the user indicated by the face image to be verified is the authorized user; a weighting unit, configured to weight the first prediction probability and the second prediction probability to obtain a target probability; and a first determining unit, configured to determine the identity verification result as a result indicating that the verification is passed if the target probability is within the first probability range.
In some embodiments of the present application, the second prediction probability obtaining unit is further configured to: acquiring a face image to be verified; extracting features of a face image to be verified to obtain a first face feature vector; and predicting according to the first face feature vector and the second face feature vector to obtain a second prediction probability, wherein the second face feature vector is obtained by extracting the features of the stored face image of the authorized user.
In some embodiments of the present application, the artificial intelligence based authentication apparatus further comprises: the identity authentication request acquisition module is used for acquiring an identity authentication request; the starting module is used for starting sound collection according to the identity authentication request; the background sound identification module is used for carrying out background sound identification on the collected first sound signal; and the to-be-verified voice signal determining module is used for taking the first voice signal as the to-be-verified voice signal if the first voice signal is identified and determined to be the non-background sound.
In some embodiments of the present application, the artificial intelligence based identity authentication apparatus further includes a second determining unit, configured to obtain a second prediction probability if the power of the collected first sound signal is lower than a power threshold, or the first sound signal is determined to be a background sound by identification; and if the second prediction probability is within the second probability range, determining the identity verification result as a result indicating that the verification is passed.
In some embodiments of the present application, the sound collection is performed by a sound collection module, the sound collection module includes at least four sound collectors, and the background sound identification module includes: the sound source positioning unit is used for positioning a sound source according to first sound signals respectively collected by at least four sound collectors and determining a first distance between the sound source corresponding to the first sound signal and the sound collection module; a non-background sound determination unit, configured to determine that the first sound signal is a non-background sound if the first distance is within the set target distance range; and the background sound determining unit is used for determining the first sound signal as the background sound if the first distance exceeds the set target distance range.
In some embodiments of the present application, one of the at least four sound collectors is used as a reference collector, and the sound collectors other than the reference collector are the other collectors; the sound source localization unit includes: a time difference calculation unit, configured to take the time at which the reference collector acquires the first sound signal as a reference time and calculate, for each of the other collectors, the time difference between the time at which that collector acquires the first sound signal and the reference time; a distance difference calculation unit, configured to calculate, according to each time difference, the distance difference between the distance from the sound source corresponding to the first sound signal to that collector and the distance from the sound source to the reference collector; a sound source position information determining unit, configured to determine the position information of the sound source corresponding to the first sound signal according to these distance differences; and a first distance determining unit, configured to calculate the first distance according to the position information of the sound source corresponding to the first sound signal and the position information of the sound collection module.
In some embodiments of the present application, the artificial intelligence based authentication apparatus further comprises: the unlocking module is used for unlocking the object requested to be unlocked if the identity verification result indicates that the verification is passed; and the prompting module is used for prompting the verification failure if the identity verification result indicates that the verification fails.
FIG. 11 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system 1100 of the electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 11, the computer system 1100 includes a Central Processing Unit (CPU)1101, which can perform various appropriate actions and processes, such as executing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for system operation are also stored. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An Input/Output (I/O) interface 1105 is also connected to bus 1104.
The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse and the like; an output section 1107 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker and the like; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1110 as necessary, so that a computer program read therefrom is installed into the storage section 1108 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 1109, and/or installed from the removable medium 1111. When the computer program is executed by the Central Processing Unit (CPU) 1101, various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
According to an aspect of the present application, there is also provided an electronic device including: a processor; and a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the method of any of the above embodiments.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the method of any of the above embodiments.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Reference herein to "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. An identity authentication method based on artificial intelligence is characterized by comprising the following steps:
performing voiceprint feature extraction on the collected voice signals to be verified through at least two neural networks respectively to obtain feature vectors output by each neural network for the voice signals to be verified;
splicing the feature vectors output by each neural network for the voice signal to be verified to obtain a first voiceprint feature vector of the voice signal to be verified;
predicting a first prediction probability that the tone of the user from which the voice signal to be verified comes is the same as that of the authorized user according to the first voiceprint feature vector;
and determining an identity verification result according to the first prediction probability.
2. The method according to claim 1, wherein the predicting, according to the first voiceprint feature vector, a first prediction probability that the tone of the user from which the voice signal to be verified comes is the same as that of the authorized user comprises:
acquiring a reference voiceprint feature vector, wherein the reference voiceprint feature vector is obtained by extracting voiceprint features of voice signals of the stored authorized users through the at least two neural networks;
inputting the first voiceprint feature vector and the reference voiceprint feature vector into a first neural network for vector transformation to obtain a target feature vector;
and performing probability prediction according to the target feature vector to obtain a first prediction probability that the tone of the user from which the voice signal to be verified comes is the same as that of the authorized user.
3. The method of claim 2, wherein the first neural network comprises a first fully connected layer, an activation layer, and a second fully connected layer in cascade;
the inputting the first voiceprint feature vector and the reference voiceprint feature vector into a first neural network for vector transformation to obtain a target feature vector includes:
performing linear transformation on the first voiceprint feature vector and the reference voiceprint feature vector by the first fully connected layer to obtain a first output result;
carrying out nonlinear transformation on the first output result by the activation layer to obtain a second output result;
and performing linear transformation on the second output result by the second fully connected layer to obtain the target feature vector.
4. The method according to claim 2, wherein the obtaining of the feature vector output by each neural network for the voice signal to be verified by performing voiceprint feature extraction on the collected voice signal to be verified through at least two types of neural networks respectively comprises:
performing voiceprint feature extraction on the voice signal to be verified through a second neural network to obtain a first feature vector;
and carrying out voiceprint feature extraction on the voice signal to be verified through a third neural network to obtain a second feature vector, wherein the third neural network is different from the second neural network.
5. The method according to claim 4, wherein the performing probability prediction according to the target feature vector to obtain a first prediction probability that the tone of the user from which the voice signal to be verified originates is the same as that of an authorized user comprises:
performing probability prediction by the multilayer perceptron according to the target characteristic vector to obtain a first prediction probability that the tone of the user from which the voice signal to be verified comes is the same as that of the authorized user;
the method further comprises the following steps:
acquiring training data, wherein the training data comprises a plurality of training samples and labels corresponding to the training samples, the training samples comprise two sections of voice signals, and the labels corresponding to the training samples are used for indicating whether the two sections of voice signals included in the training samples are from the same user;
the second neural network respectively extracts the voiceprint characteristics of the two sections of voice signals in the training sample to obtain a first sample characteristic vector corresponding to each section of voice signals; the third neural network respectively extracts the voiceprint characteristics of the two sections of voice signals in the training sample to obtain a second sample characteristic vector corresponding to each section of voice signals;
for each section of voice signals in the training samples, splicing the corresponding first sample characteristic vector and the corresponding second sample characteristic vector to obtain a first sample voiceprint characteristic vector of each section of voice signals;
transforming by the first neural network according to first sample voiceprint feature vectors respectively corresponding to two sections of voice signals in the training sample, and outputting a sample target feature vector of the training sample;
predicting by the multilayer perceptron according to the sample target feature vector to obtain the sample prediction probability of two sections of voice signals of the training sample from the same user;
and adjusting parameters of at least one of the first neural network, the second neural network, the third neural network and the multilayer perceptron according to the sample prediction probability and the label corresponding to the training sample.
6. The method of claim 1, wherein determining an authentication result according to the first predictive probability comprises:
acquiring a second prediction probability, wherein the second prediction probability is the probability that the user indicated by the face image to be verified is the authorized user predicted according to the face image to be verified;
weighting the first prediction probability and the second prediction probability to obtain a target probability;
and if the target probability is within a first probability range, determining the identity verification result as a result indicating that the verification is passed.
7. The method of claim 6, wherein obtaining the second prediction probability comprises:
acquiring a face image to be verified;
extracting the features of the facial image to be verified to obtain a first facial feature vector;
and predicting according to the first face feature vector and a second face feature vector to obtain the second prediction probability, wherein the second face feature vector is obtained by performing feature extraction on the stored face image of the authorized user.
8. The method according to claim 6, wherein before the voiceprint feature extraction is performed on the collected voice signal to be verified through at least two types of neural networks respectively to obtain feature vectors output by each type of neural network for the voice signal to be verified, the method further comprises:
acquiring an identity authentication request;
starting sound collection according to the identity authentication request;
carrying out background sound identification on the collected first sound signal;
and if the first sound signal is identified and determined to be non-background sound, taking the first sound signal as the voice signal to be verified.
9. The method of claim 8, wherein after initiating voice capture according to the authentication request, the method further comprises:
if the power of the collected first sound signal is lower than a power threshold value, or the first sound signal is identified and determined to be background sound, acquiring the second prediction probability;
and if the second prediction probability is within a second probability range, determining the identity verification result as a result indicating that the verification is passed.
10. The method of claim 8, wherein sound collection is performed by a sound collection module, the sound collection module comprises at least four sound collectors, and the performing background sound recognition on the collected first sound signal comprises:
positioning a sound source according to first sound signals respectively collected by the at least four sound collectors, and determining a first distance between the sound source corresponding to the first sound signal and the sound collection module;
if the first distance is within a set target distance range, determining that the first sound signal is a non-background sound;
and if the first distance exceeds a set target distance range, determining that the first sound signal is a background sound.
11. The method according to claim 10, wherein one of the at least four sound collectors is used as a reference collector, and the sound collectors of the at least four sound collectors other than the reference collector are the other collectors;
the positioning a sound source according to first sound signals respectively collected by the at least four sound collectors and determining a first distance between the sound source corresponding to the first sound signal and the sound collection module comprises:
taking the time at which the reference collector acquires the first sound signal as a reference time, and calculating, for each of the other collectors, the time difference between the time at which that collector acquires the first sound signal and the reference time;
calculating, according to each time difference, the distance difference between the distance from the sound source corresponding to the first sound signal to that collector and the distance from the sound source to the reference collector;
determining position information of the sound source corresponding to the first sound signal according to the distance differences;
and calculating the first distance according to the position information of the sound source corresponding to the first sound signal and the position information of the sound collection module.
12. The method of claim 1, wherein after determining the authentication result according to the first predictive probability, the method further comprises:
if the identity verification result indicates that the verification is passed, unlocking the object requested to be unlocked;
and if the identity authentication result indicates that the authentication is not passed, performing authentication failure prompt.
13. An identity verification device based on artificial intelligence, comprising:
the characteristic extraction module is used for respectively carrying out voiceprint characteristic extraction on the collected voice signals to be verified through at least two neural networks to obtain a characteristic vector output by each neural network for the voice signals to be verified;
the splicing module is used for splicing the feature vectors output by each neural network for the voice signal to be verified to obtain a first voiceprint feature vector of the voice signal to be verified;
the prediction module is used for predicting a first prediction probability that the tone of a user from which the voice signal to be verified comes is the same as that of an authorized user according to the first voiceprint feature vector;
and the verification result determining module is used for determining an identity verification result according to the first prediction probability.
14. An electronic device, comprising:
a processor;
a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the method of any of claims 1 to 12.
15. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1 to 12.
CN202010811349.9A 2020-08-12 2020-08-12 Artificial intelligence-based identity verification method, device, medium and electronic equipment Pending CN111949965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010811349.9A CN111949965A (en) 2020-08-12 2020-08-12 Artificial intelligence-based identity verification method, device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010811349.9A CN111949965A (en) 2020-08-12 2020-08-12 Artificial intelligence-based identity verification method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN111949965A true CN111949965A (en) 2020-11-17

Family

ID=73331730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010811349.9A Pending CN111949965A (en) 2020-08-12 2020-08-12 Artificial intelligence-based identity verification method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111949965A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426723A (en) * 2015-11-20 2016-03-23 北京得意音通技术有限责任公司 Voiceprint identification, face identification and synchronous in-vivo detection-based identity authentication method and system
CN108288470A (en) * 2017-01-10 2018-07-17 富士通株式会社 Auth method based on vocal print and device
CN106872945A (en) * 2017-04-19 2017-06-20 北京地平线信息技术有限公司 Sound localization method, device and electronic equipment
CN109215643A (en) * 2017-07-05 2019-01-15 阿里巴巴集团控股有限公司 A kind of exchange method, electronic equipment and server
CN108399395A (en) * 2018-03-13 2018-08-14 成都数智凌云科技有限公司 The compound identity identifying method of voice and face based on end-to-end deep neural network
CN109255369A (en) * 2018-08-09 2019-01-22 网易(杭州)网络有限公司 Using the method and device of neural network recognization picture, medium and calculate equipment
CN109448734A (en) * 2018-09-20 2019-03-08 李庆湧 Unlocking terminal equipment and application starting method and device based on vocal print

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562691A (en) * 2020-11-27 2021-03-26 平安科技(深圳)有限公司 Voiceprint recognition method and device, computer equipment and storage medium
CN116386647A (en) * 2023-05-26 2023-07-04 北京瑞莱智慧科技有限公司 Audio verification method, related device, storage medium and program product
CN116386647B (en) * 2023-05-26 2023-08-22 北京瑞莱智慧科技有限公司 Audio verification method, related device, storage medium and program product

Similar Documents

Publication Publication Date Title
EP3477519B1 (en) Identity authentication method, terminal device, and computer-readable storage medium
CN108898086B (en) Video image processing method and device, computer readable medium and electronic equipment
US11062698B2 (en) Image-based approaches to identifying the source of audio data
US20210350346A1 (en) System and method for using passive multifactor authentication to provide access to secure services
JP2022532677A (en) Identity verification and management system
CN111091176A (en) Data recognition apparatus and method, and training apparatus and method
CN111401558A (en) Data processing model training method, data processing device and electronic equipment
US10713544B2 (en) Identification and/or verification by a consensus network using sparse parametric representations of biometric images
CN111931153B (en) Identity verification method and device based on artificial intelligence and computer equipment
CN113656761B (en) Business processing method and device based on biological recognition technology and computer equipment
CN111949965A (en) Artificial intelligence-based identity verification method, device, medium and electronic equipment
US20220328050A1 (en) Adversarially robust voice biometrics, secure recognition, and identification
CN113177850A (en) Method and device for multi-party identity authentication of insurance
CN113826135B (en) System, method and computer system for contactless authentication using voice recognition
KR20200083119A (en) User verification device and method
CN114330565A (en) Face recognition method and device
Sun et al. Open‐set iris recognition based on deep learning
CN109785558A (en) Door and window alarm method based on artificial intelligence
CN113542527B (en) Face image transmission method and device, electronic equipment and storage medium
Megalingam et al. Voter ID Card and Fingerprint-Based E-voting System
JP2021039749A (en) On-device training based user recognition method and apparatus
Naveen et al. Speaker Identification and Verification using Deep Learning
US11645372B2 (en) Multifactor handwritten signature verification
CN117079336B (en) Training method, device, equipment and storage medium for sample classification model
CN113436633B (en) Speaker recognition method, speaker recognition device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination