CN113707182A - Voiceprint recognition method and device, electronic equipment and storage medium - Google Patents
Voiceprint recognition method and device, electronic equipment and storage medium
- Publication number
- CN113707182A (application number CN202111091629.8A)
- Authority
- CN
- China
- Prior art keywords
- voiceprint information
- voiceprint
- information
- audio data
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L17/00—Speaker identification or verification techniques
Abstract
The disclosure provides a voiceprint recognition method and device, electronic equipment and a storage medium, belonging to the technical field of voiceprint recognition. The method comprises the following steps: determining second similarities between first voiceprint information of first audio data and second voiceprint information of a plurality of second audio data, based on the device type of the first voiceprint information, the device types of the second voiceprint information, and the first similarities between the first voiceprint information and the second voiceprint information of each piece of second audio data; and then determining target voiceprint information matched with the first voiceprint information. Because the device type corresponding to each piece of voiceprint information is considered when determining the target voiceprint information, recognition errors caused by different types of audio acquisition devices are reduced, and the accuracy of voiceprint recognition is improved.
Description
Technical Field
The present disclosure relates to the field of voiceprint recognition technologies, and in particular, to a voiceprint recognition method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of voiceprint recognition technology, it has been widely applied to user identity confirmation scenarios. At present, when confirming a user's identity, audio data of the user is generally obtained, voiceprint information matching the voiceprint information of that audio data is retrieved from a voiceprint information base, and the identity information associated with the matched voiceprint information is determined as the identity information of the current user. In this way, the user's identity is confirmed from the user's voiceprint information. The voiceprint information base stores the voiceprint information and associated user information of a plurality of audio data.
In the related art, the plurality of audio data stored in the voiceprint information base may be acquired by different audio acquisition devices, such as a microphone built into a mobile terminal, an external microphone of a desktop computer, or a dedicated acquisition device. When voiceprint retrieval is performed, the differing types of audio acquisition devices lead to low voiceprint recognition accuracy.
Disclosure of Invention
The embodiment of the disclosure provides a voiceprint recognition method and device, electronic equipment and a storage medium, which reduce recognition errors caused by different types of audio acquisition equipment and improve the accuracy of voiceprint recognition. The technical scheme comprises the following steps.
In one aspect, a voiceprint recognition method is provided, and the method includes:
acquiring first voiceprint information of first audio data;
determining first similarity corresponding to the first voiceprint information and second voiceprint information of a plurality of second audio data respectively, wherein the second audio data are acquired by different types of audio acquisition equipment;
determining second similarities respectively corresponding to the first voiceprint information and the second voiceprint information based on the device type of the first voiceprint information, the device types of the second voiceprint information and the first similarities;
and determining target voiceprint information matched with the first voiceprint information based on a plurality of second similarities.
In some embodiments, determining, based on the device type of the first voiceprint information, the device types of the second voiceprint information, and the first similarities, second similarities corresponding to the first voiceprint information and the second voiceprint information respectively includes:
inputting the first voiceprint information, the device type of the first voiceprint information, the second voiceprint information, the device types of the second voiceprint information and the first similarity into a voiceprint model, and determining second similarities corresponding to the first voiceprint information and the second voiceprint information respectively through the voiceprint model.
In some embodiments, determining the target voiceprint information that matches the first voiceprint information based on a plurality of the second similarities comprises:
ranking the second similarities in descending order, selecting the second voiceprint information corresponding to the second similarities ranked within the top target number of positions, and determining the selected second voiceprint information as the target voiceprint information.
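As an illustrative sketch (not the patent's implementation; all names and values below are invented), the descending sort and top-N selection might look like:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    voiceprint_id: str
    second_similarity: float

def select_target_voiceprints(candidates, top_n=1):
    """Rank candidates by second similarity, highest first, and keep the top N."""
    ranked = sorted(candidates, key=lambda c: c.second_similarity, reverse=True)
    return ranked[:top_n]

candidates = [
    Candidate("vp_003", 0.62),
    Candidate("vp_001", 0.91),
    Candidate("vp_002", 0.77),
]
top = select_target_voiceprints(candidates, top_n=2)
# top holds vp_001 (0.91) followed by vp_002 (0.77)
```

With `top_n=1` this reduces to picking the single most similar stored voiceprint.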
In some embodiments, after obtaining the first voiceprint information for the first audio data, the method further comprises:
based on first voiceprint information of the first audio data, searching in a plurality of voiceprint information bases respectively to obtain second voiceprint information of the plurality of second audio data, wherein the voiceprint information bases are used for storing the voiceprint information of the plurality of audio data acquired by the corresponding audio acquisition equipment respectively;
and executing the step of determining the first similarity corresponding to the first voiceprint information and the second voiceprint information of the plurality of second audio data respectively based on the second voiceprint information of the plurality of second audio data.
In some embodiments, before performing the search in the plurality of voiceprint information repositories, respectively, the method further comprises:
determining the device type of the first voiceprint information and searching in the voiceprint information base corresponding to that device type; and, if no voiceprint information matching the first voiceprint information exists in the voiceprint information base corresponding to that device type, searching in a plurality of other voiceprint information bases respectively.
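This fallback retrieval can be sketched as follows, assuming each base maps voiceprint identifiers to embedding vectors and a cosine-similarity threshold decides whether the home base produced a match. The threshold value and all names are assumptions for the example:

```python
import math

def cosine(a, b):
    # cosine similarity of two non-zero embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search_with_fallback(query, device_type, bases, threshold=0.8):
    """Search the base matching the query's device type first; if no entry
    clears the threshold, fall back to every other base."""
    home = bases.get(device_type, {})
    hits = [(vid, cosine(query, vec)) for vid, vec in home.items()]
    hits = [h for h in hits if h[1] >= threshold]
    if hits:
        return hits
    results = []
    for dtype, base in bases.items():
        if dtype == device_type:
            continue
        for vid, vec in base.items():
            results.append((vid, cosine(query, vec)))
    return results

bases = {"phone_mic": {"a": [1.0, 0.0]}, "desktop_mic": {"b": [0.0, 1.0]}}
```

A query that matches nothing in its own base (below the threshold) is then scored against every other base, widening the retrieval range exactly as the embodiment describes.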
In some embodiments, after determining the target voiceprint information that matches the first voiceprint information based on a plurality of the second similarities, the method further comprises:
and acquiring user information associated with the target voiceprint information, and determining the user information as the user information matched with the first audio data.
In one aspect, a voiceprint recognition apparatus is provided, the apparatus comprising:
the acquisition module is used for acquiring first voiceprint information of the first audio data;
the first similarity determining module is used for determining first similarities corresponding to the first voiceprint information and second voiceprint information of a plurality of second audio data respectively, and the plurality of second audio data are acquired by different types of audio acquisition equipment;
a second similarity determining module, configured to determine, based on the device type of the first voiceprint information, the device types of the second voiceprint information, and the first similarities, second similarities corresponding to the first voiceprint information and the second voiceprint information, respectively;
and the voiceprint information determining module is used for determining target voiceprint information matched with the first voiceprint information based on a plurality of second similarities.
In some embodiments, the second similarity determination module is to:
inputting the first voiceprint information, the device type of the first voiceprint information, the second voiceprint information, the device types of the second voiceprint information and the first similarity into a voiceprint model, and determining second similarities corresponding to the first voiceprint information and the second voiceprint information respectively through the voiceprint model.
In some embodiments, the voiceprint information determination module is to:
ranking the second similarities in descending order, selecting the second voiceprint information corresponding to the second similarities ranked within the top target number of positions, and determining the selected second voiceprint information as the target voiceprint information.
In some embodiments, the apparatus further comprises:
the retrieval module is used for respectively retrieving in a plurality of voiceprint information bases based on first voiceprint information of the first audio data to obtain second voiceprint information of the plurality of second audio data, and the voiceprint information bases are respectively used for storing the voiceprint information of the plurality of audio data acquired by the corresponding audio acquisition equipment;
and executing the step of determining the first similarity corresponding to the first voiceprint information and the second voiceprint information of the plurality of second audio data respectively based on the second voiceprint information of the plurality of second audio data.
In some embodiments, the retrieval module is further configured to:
determining the device type of the first voiceprint information and searching in the voiceprint information base corresponding to that device type; and, if no voiceprint information matching the first voiceprint information exists in the voiceprint information base corresponding to that device type, searching in a plurality of other voiceprint information bases respectively.
In some embodiments, the apparatus further comprises:
and the user information determining module is used for acquiring the user information associated with the target voiceprint information and determining the user information as the user information matched with the first audio data.
In one aspect, an electronic device is provided and includes one or more processors and one or more memories, where at least one program code is stored in the one or more memories, and loaded and executed by the one or more processors to implement the operations performed by the voiceprint recognition method.
In one aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the program code is loaded and executed by a processor to implement the operations performed by the voiceprint recognition method.
According to the technical solution provided by the embodiments of the present disclosure, the second similarities between the voiceprint information of the audio data to be recognized and the voiceprint information of each piece of stored audio data are determined based on the corresponding first similarities and the device type corresponding to each piece of voiceprint information, and the target voiceprint information matching the voiceprint information of the audio data to be recognized is then determined from these second similarities. In this way, the device types corresponding to all the voiceprint information are taken into account, recognition errors caused by different types of audio acquisition devices are reduced, and the accuracy of voiceprint recognition is improved.
Drawings
To describe the technical solutions in the embodiments of the present disclosure more clearly, the drawings required for the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present disclosure; other drawings can be derived from them by those skilled in the art without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of a voiceprint recognition method provided by an embodiment of the present disclosure;
fig. 2 is a flowchart of a voiceprint recognition method provided by an embodiment of the present disclosure;
fig. 3 is a flowchart of a voiceprint recognition method provided by an embodiment of the present disclosure;
fig. 4 is a block diagram of a voiceprint recognition apparatus provided by an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a server according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
First, description is made for an application scenario related in the embodiment of the present disclosure:
the voiceprint recognition method related to the embodiment of the disclosure can be applied to the identity confirmation scene of the user. For example, the voiceprint recognition method can be applied to a business scene of identity confirmation in a special case, and the identity information of the person related to the case can be determined by comparing the voiceprint of the audio data of the person related to the case with the audio data in the voiceprint information base. Or, the voiceprint recognition method can also be applied to an identity authentication scene, and the voiceprint comparison is carried out according to the audio data of the user to be authenticated and the audio data in the voiceprint information base, so that whether voiceprint information matched with the audio data of the user to be authenticated exists in the voiceprint information base can be determined, and whether the user to be authenticated is a legal user can be determined. Of course, the voiceprint recognition method may also be applied to other scenarios of identity confirmation, which is not limited in the embodiment of the present disclosure.
In the related art, in some implementations, the plurality of audio data stored in the voiceprint information base are acquired by different audio acquisition devices, such as a microphone built into a mobile terminal, an external microphone of a desktop computer, or a dedicated acquisition device. When the audio data to be identified is compared against reference audio data in the voiceprint information base, voiceprints from audio data acquired by different types of devices may be compared with one another; because the device types differ, voiceprint recognition errors arise and the accuracy of voiceprint recognition is reduced. In other implementations, the plurality of audio data stored in the voiceprint information base are all acquired by the same audio acquisition device, such as a dedicated acquisition device. In that case, voiceprint retrieval is performed only within a voiceprint information base of a single device type, which narrows the retrieval range and reduces the success rate of voiceprint matching.
Fig. 1 is a schematic diagram of an implementation environment of a voiceprint recognition method provided in an embodiment of the present disclosure, and referring to fig. 1, the implementation environment includes: a first electronic device 101 and a second electronic device 102.
The first electronic device 101 is a terminal device, specifically a terminal device operated by a user (such as a specific person); for convenience of description, "terminal" is used hereinafter to refer to the first electronic device 101. In some embodiments, the terminal is at least one of a smartphone, a smartwatch, a desktop computer, a laptop computer, a virtual reality terminal, an augmented reality terminal, a wireless terminal, and the like. The terminal has a communication function and can access a wired or wireless network. The terminal may refer to one of a plurality of terminals; this embodiment is illustrated with a single terminal only, and those skilled in the art will appreciate that the number of terminals may be greater or fewer.
In some embodiments, the terminal is associated with a voiceprint recognition platform that provides voiceprint recognition functionality. In the embodiment of the disclosure, the terminal is configured to send an identification request for the first audio data to the server in response to an identification operation for the first audio data.
The second electronic device 102 is a server, specifically a background server of the voiceprint recognition platform; for convenience of description, "server" is used hereinafter to refer to the second electronic device 102. In some embodiments, the server is an independent physical server, a server cluster or distributed file system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, Content Delivery Networks (CDNs), and big data and artificial intelligence platforms. The server and the terminal may be connected directly or indirectly through wired or wireless communication, which is not limited in this disclosure. The number of servers may be greater or fewer, which the embodiments of the present disclosure likewise do not limit, and the server may also include other functional servers in order to provide more comprehensive and diversified services.
In this embodiment, the server is configured to obtain first voiceprint information of first audio data; determine first similarities between the first voiceprint information and the second voiceprint information of a plurality of second audio data; determine second similarities between the first voiceprint information and each piece of second voiceprint information based on the device type of the first voiceprint information, the device types of the second voiceprint information, and the first similarities; and determine target voiceprint information matching the first voiceprint information based on the second similarities.
Fig. 2 is a flowchart of a voiceprint recognition method provided by an embodiment of the present disclosure, referring to fig. 2, the method is performed by a server and includes the following steps.
201. The server acquires first voiceprint information of the first audio data.
202. The server determines first similarities between the first voiceprint information and the second voiceprint information of a plurality of second audio data, where the second audio data are acquired by different types of audio acquisition devices.
203. The server determines second similarities respectively corresponding to the first voiceprint information and the second voiceprint information based on the device type of the first voiceprint information, the device types of the second voiceprint information and the first similarities.
204. The server determines target voiceprint information matched with the first voiceprint information based on a plurality of second similarities.
According to the technical solution provided by the embodiments of the present disclosure, the second similarities between the voiceprint information of the audio data to be recognized and the voiceprint information of each piece of stored audio data are determined based on the corresponding first similarities and the device type corresponding to each piece of voiceprint information, and the target voiceprint information matching the voiceprint information of the audio data to be recognized is then determined from these second similarities. In this way, the device types corresponding to all the voiceprint information are taken into account, recognition errors caused by different types of audio acquisition devices are reduced, and the accuracy of voiceprint recognition is improved.
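As a concrete illustration of steps 201-204, the sketch below computes second similarities by adjusting first similarities with a device-type compensation table and then picks the best match. The table values, type names, and function names are all assumptions made for this example; the patent itself determines the second similarities with a voiceprint model rather than a fixed table:

```python
# Assumed additive offsets for cross-device comparisons; symmetric by using
# frozenset keys. Same-device pairs fall through to a 0.0 default.
COMPENSATION = {
    frozenset(["phone_mic", "desktop_mic"]): 0.05,
    frozenset(["phone_mic", "dedicated"]): 0.08,
    frozenset(["desktop_mic", "dedicated"]): 0.06,
}

def second_similarity(first_sim, query_type, ref_type):
    """Step 203: adjust a first similarity using the two device types."""
    offset = COMPENSATION.get(frozenset([query_type, ref_type]), 0.0)
    return min(1.0, first_sim + offset)

def pick_target(query_type, references):
    """Step 204: references is a list of (voiceprint_id, device_type,
    first_similarity); return the (id, second_similarity) with the best score."""
    scored = [(vid, second_similarity(sim, query_type, dtype))
              for vid, dtype, sim in references]
    return max(scored, key=lambda item: item[1])

refs = [("vp_a", "phone_mic", 0.80), ("vp_b", "dedicated", 0.75)]
best = pick_target("phone_mic", refs)
# vp_b wins (0.75 + 0.08 = 0.83) despite the lower first similarity
```

The point of the example is the re-ranking effect: a cross-device candidate can overtake a same-device candidate once the device types are taken into account.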
Fig. 3 is a flowchart of a voiceprint recognition method provided by an embodiment of the present disclosure, and referring to fig. 3, fig. 3 illustrates a scheme with a terminal and a server as an execution subject, where the method includes the following steps.
301. The terminal sends an identification request for the first audio data to the server in response to the identification operation for the first audio data.
The first audio data is the audio data to be subjected to voiceprint recognition. In the scenario of identity confirmation for a specific case, the first audio data may be the audio data of a person involved in the case; in the identity verification scenario, the first audio data may be the audio data of the user to be verified. In some embodiments, the first audio data is acquired by a microphone built into a mobile terminal, by an external microphone of a desktop computer, or by a dedicated acquisition device; of course, the first audio data may also be acquired by other audio acquisition devices. The embodiments of the present disclosure do not limit this.
In the embodiment of the present disclosure, the identification request of the first audio data is used to request to perform voiceprint identification on the first audio data, so as to obtain the user information associated with the first audio data. The user information involved in the embodiments of the present disclosure is information authorized by the user or fully authorized by each party. Therefore, the identity of the user is confirmed according to the voiceprint information of the user.
In some embodiments, the terminal displays a voiceprint recognition page provided with a function control for uploading audio data, and the user uploads the first audio data by operating in this page. The corresponding process is as follows: in response to a trigger operation on the upload control in the voiceprint recognition page, the terminal displays a plurality of candidate audio data; in response to a selection operation on any candidate audio data, the selected candidate audio data is uploaded to the terminal, and the terminal thereby acquires the selected candidate audio data, that is, the first audio data.
Further, the voiceprint recognition page is provided with a function control for voiceprint recognition, and the recognition operation on the first audio data may be a trigger operation on this control. By triggering the control in the voiceprint recognition page, the user triggers the terminal to send the recognition request for the first audio data to the server: in response to the trigger operation on the voiceprint recognition control, the terminal sends the recognition request for the first audio data to the server.
In one specific example, the first audio data is acquired by a built-in microphone of the current terminal. For example, the first audio data may be call audio from a call record stored locally on the terminal. In response to a trigger operation on the control for uploading audio data, the terminal displays the call audio in the locally stored call record; in response to a selection operation on any call audio, the selected call audio is uploaded to the terminal, and the terminal acquires it as the first audio data and sends the recognition request for the first audio data to the server.
302. The server responds to the received identification request of the first audio data, and obtains first voiceprint information of the first audio data.
A voiceprint is the sound-wave spectrum, displayed by an electro-acoustic instrument, that carries speech information, and voiceprint information is information characterizing the speaker that is extracted from it by means of filters, models, and the like. Since the vocal organs used in speaking (tongue, teeth, larynx, lungs, nasal cavity) vary greatly from person to person in size and morphology, the voiceprints of any two people differ. In some embodiments, the first voiceprint information is a feature vector used to characterize the voiceprint of the first audio data.
In some embodiments, in response to receiving the identification request for the first audio data, the server acquires the first audio data carried in the request and extracts the first voiceprint information of the first audio data. In an optional embodiment, the server uses a neural network model to extract the first voiceprint information, and the corresponding process is as follows: the server extracts speech acoustic features from the first audio data, inputs the speech acoustic features into a neural network model, processes them through the model to output voiceprint features of the speech acoustic features, and takes the voiceprint features as the voiceprint information of the first audio data.
The neural network model sequentially comprises an input layer, a convolution layer, a pooling layer, a full-connection layer and an output layer from a bottom layer to an upper layer. The input layer is used for inputting the voice acoustic characteristics acquired by the server into the neural network model, converting the input voice acoustic characteristics into a digital matrix, and standardizing the input characteristics so that the neural network model can carry out a subsequent operation process; the convolutional layer is used for performing convolution operation on the digital matrix generated by the input layer, and performing feature extraction on the speech acoustic features based on the result of the convolution operation, and the neural network model can comprise one or more convolutional layers; the pooling layer is used for quantizing the feature extraction values obtained by the convolutional layer to obtain a matrix with a smaller dimension so as to further extract the vocal print features, and the neural network model can comprise one or more pooling layers; the full connection layer is used for integrating the extracted voiceprint features into complete voiceprint features through a weight matrix; the output layer is used for outputting the voiceprint characteristics obtained by integrating the full connection layers. The voiceprint features refer to features output by a fully connected layer of the neural network model. 
Accordingly, the specific process of the server determining the voiceprint information based on the neural network model may include: the server inputs the extracted voice acoustic features of the first audio data into a neural network model, sequentially passes through an input layer, a convolution layer and a full connection layer of the neural network model, performs convolution processing on the voice acoustic features through the convolution layer of the neural network model to obtain voice acoustic features output by the convolution layer, performs nonlinear combination on the input voice acoustic features through the full connection layer, outputs voiceprint features of the voice acoustic features, and takes the voiceprint features as voiceprint information of the first audio data.
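The forward pass described above (input features, convolution layer, pooling layer, fully connected layer, embedding output) can be sketched in miniature with numpy. The weights below are random placeholders and every dimension is invented for illustration; a real system would use a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_voiceprint(features, conv_w, fc_w):
    """features: (n_frames, feat_dim) acoustic features -> fixed-size embedding."""
    k = conv_w.shape[0]  # convolution kernel width along the time axis
    # convolution layer: contract each k-frame window against the kernel
    conv_out = np.stack([
        np.tensordot(features[t:t + k], conv_w, axes=([0, 1], [0, 1]))
        for t in range(features.shape[0] - k + 1)
    ])                                     # (n_windows, conv_channels)
    conv_out = np.maximum(conv_out, 0.0)   # ReLU nonlinearity
    # pooling layer: average over time, yielding one vector per utterance
    pooled = conv_out.mean(axis=0)         # (conv_channels,)
    # fully connected layer: integrate pooled features into the embedding
    return pooled @ fc_w                   # (embedding_dim,)

conv_w = rng.standard_normal((3, 13, 32))  # 3-frame kernel, 13-dim input, 32 channels
fc_w = rng.standard_normal((32, 16))       # 16-dimensional voiceprint embedding
feats = rng.standard_normal((50, 13))      # 50 frames of 13-dim acoustic features
embedding = extract_voiceprint(feats, conv_w, fc_w)
```

The mean pooling over time is what turns a variable-length utterance into a fixed-size voiceprint vector that can be compared across recordings.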
The speech acoustic feature may be Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP) coefficients, or filter bank features; of course, the speech acoustic feature may also be the original speech, that is, the first audio data itself.
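For illustration, a minimal log-mel filter bank feature extractor (one of the acoustic features mentioned above) can be written as follows. The frame length, hop, and filter count are assumed values; the mel conversion formulas are the standard ones:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(signal, sr=16000, n_fft=512, n_mels=26):
    """Return (n_frames, n_mels) log-mel filter bank features."""
    hop, win = n_fft // 2, n_fft
    # frame the signal, apply a Hann window, take the power spectrum
    frames = [signal[i:i + win] * np.hanning(win)
              for i in range(0, len(signal) - win + 1, hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2      # (n_frames, n_fft//2+1)
    # build triangular mel filters between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return np.log(power @ fbank.T + 1e-10)               # (n_frames, n_mels)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
feats = log_mel_features(sig)
```

Taking the discrete cosine transform of each row of these log-mel features would yield MFCCs; the filter bank output itself is often fed to the neural network directly.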
In other embodiments, the server can also extract the first voiceprint information by adopting other voiceprint extraction manners. The disclosed embodiment does not limit the process of obtaining voiceprint information.
303. The server determines the device type of the first voiceprint information, searches in the voiceprint information base corresponding to the device type based on the first voiceprint information of the first audio data, and if the voiceprint information base corresponding to the device type does not have voiceprint information matched with the first voiceprint information, step 304 is executed.
In the embodiment of the present disclosure, the server is associated with a plurality of voiceprint information bases, each used for storing the voiceprint information of the audio data collected by a corresponding type of audio acquisition device; that is, one voiceprint information base corresponds to one type of audio acquisition device. For example, the plurality of voiceprint information bases may include a base corresponding to a microphone built into a mobile terminal (for storing voiceprint information of audio data collected by that built-in microphone), a base corresponding to an external microphone of a desktop computer (for storing voiceprint information of audio data collected by that external microphone), and a base corresponding to a dedicated collection device (for storing voiceprint information of audio data collected by the dedicated collection device). The voiceprint information of a piece of audio data is a feature vector used to characterize its voiceprint.
In some embodiments, the server determines the device type of the first voiceprint information as follows: based on the audio identifier of the first audio data and a target correspondence, the server determines the device type corresponding to that audio identifier and takes it as the device type of the first audio data, that is, the device type of the first voiceprint information. The target correspondence comprises the audio identifiers of a plurality of audio data and the device type corresponding to each identifier. The audio identifier may be a name, a number, an ID, or the like of the audio data. Optionally, the target correspondence is stored in a voiceprint information base. Optionally, the device type is represented by a one-hot sequence, that is, a code used to identify the device type.
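The audio-identifier lookup and one-hot device-type code might look like the following sketch; the identifiers and device-type names are hypothetical, since the disclosure only requires that some audio-ID-to-device-type correspondence exist:

```python
# Hypothetical device types and target correspondence.
DEVICE_TYPES = ["mobile_builtin_mic", "desktop_external_mic", "dedicated_device"]
target_correspondence = {
    "audio_0017": "mobile_builtin_mic",
    "audio_0042": "dedicated_device",
}

def device_one_hot(audio_id):
    dev = target_correspondence[audio_id]                # look up device type by audio ID
    return [1 if t == dev else 0 for t in DEVICE_TYPES]  # one-hot code for the type

print(device_one_hot("audio_0042"))  # [0, 0, 1]
```

The resulting one-hot list is the device-type code that is later fed, alongside the voiceprint vectors, into the voiceprint model.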
In some embodiments, after determining the device type of the first voiceprint information, the server searches in a voiceprint information base corresponding to the device type based on the first voiceprint information of the first audio data by using a Faiss vector search engine, and if the voiceprint information base corresponding to the device type does not have voiceprint information matching the first voiceprint information, then step 304 is performed.
The Faiss vector retrieval engine is an open-source similarity search library. Its retrieval principle is to compute the similarity between the voiceprint to be recognized and each reference voiceprint and select those reaching a similarity threshold as matchable voiceprints. The similarity threshold is a preset similarity score: whether a reference voiceprint can serve as a matchable voiceprint is decided by comparing the computed similarity with this preset score. For example, if the computed similarity between the voiceprint to be recognized and a reference voiceprint is 80 and the preset similarity score is 75, the reference voiceprint is determined to be a matchable voiceprint.
This provides an efficient and reliable retrieval method for massive data in a high-dimensional space: the subset of matchable voiceprint information can be retrieved quickly, which reduces the amount of data the subsequent voiceprint recognition must consider and improves the operating efficiency of the server. For convenience of description, "retrieval similarity" is used below to denote the similarity computed during voiceprint retrieval.
In an alternative embodiment, the server performs voiceprint retrieval with the Faiss vector retrieval engine as follows: the server determines the retrieval similarity between the first voiceprint information and each piece of voiceprint information included in the voiceprint information base corresponding to the device type. If voiceprint information whose retrieval similarity reaches the similarity threshold exists, the voiceprint information base corresponding to the device type contains voiceprint information that can be matched with the first voiceprint information; if no voiceprint information reaches the threshold, that base contains no voiceprint information matching the first voiceprint information. In some embodiments, the retrieval similarity is expressed in terms of a feature distance, such as the Euclidean distance; it should be understood that the larger the distance, the smaller the retrieval similarity, and vice versa.
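Since Faiss itself is not shown here, the retrieval logic described above can be mimicked with a brute-force NumPy search; the distance-to-score mapping below is an illustrative assumption (the disclosure only requires that a larger distance mean a smaller retrieval similarity):

```python
import numpy as np

def retrieve(query, base, sim_threshold):
    """Brute-force stand-in for the Faiss search described above:
    a retrieval similarity is derived from Euclidean distance and
    then compared against the preset similarity score."""
    dists = np.linalg.norm(base - query, axis=1)
    sims = 100.0 / (1.0 + dists)             # illustrative distance -> score mapping
    hits = np.flatnonzero(sims >= sim_threshold)
    return hits, sims

rng = np.random.default_rng(1)
base = rng.normal(size=(5, 8))               # 5 reference voiceprints, 8-dim
query = base[3] + 0.01 * rng.normal(size=8)  # near-duplicate of entry 3
hits, sims = retrieve(query, base, sim_threshold=75)
print(hits)  # entry 3 should clear the threshold
```

A production system would replace this loop with a Faiss index, but the thresholding decision is the same.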
It should be noted that step 303 is an optional step. In some embodiments, after the server obtains the first voiceprint information of the first audio data, it is not necessary to perform the process of preferentially searching in the voiceprint information bases of the same device type in step 303, and the server may search in the multiple voiceprint information bases based on the first voiceprint information of the first audio data.
304. The server searches in a plurality of other voiceprint information bases respectively based on the first voiceprint information of the first audio data to obtain second voiceprint information of a plurality of second audio data, and the plurality of second audio data are acquired by different types of audio acquisition equipment.
In the embodiment of the disclosure, the second voiceprint information of the plurality of second audio data satisfies a similarity condition with the first voiceprint information. The similarity condition is used to screen out the second voiceprint information of the second audio data that can be matched. In some embodiments, the similarity condition is that the retrieval similarity reaches a preset similarity score. In some embodiments, the second voiceprint information is a feature vector used to characterize the voiceprint of the second audio data.
It should be noted that the process of the server performing retrieval in the multiple voiceprint information bases respectively refers to the voiceprint retrieval process in step 303, and is not described again. In some embodiments, the similarity score mentioned in step 303 is the same as the similarity score mentioned in step 304, or, in other embodiments, the similarity score mentioned in step 303 is different from the similarity score mentioned in step 304.
Steps 303 to 304 above describe a process in which the server automatically searches the voiceprint information base of the same device type first and extends the search to the plurality of voiceprint information bases if no voiceprint information matching the first voiceprint information is found there. The plurality of voiceprint information bases may or may not include the voiceprint information base of the same device type. In other embodiments, the process can also be triggered by the user; that is, when the voiceprint recognition function is triggered, the user can specify one or more voiceprint information bases for voiceprint retrieval. For example, the user may first specify the voiceprint information base of the same device type, in which case the terminal, in response to a triggering operation on a target voiceprint information base among the candidate voiceprint information bases in the voiceprint recognition page, sends the server a request to perform voiceprint retrieval of the first audio data in that target base. Further, if the retrieval result from the base of the same device type does not meet the user's needs, the user can select a plurality of voiceprint information bases for extended retrieval to broaden the data range.
305. The server determines first similarity corresponding to the first voiceprint information and second voiceprint information of the plurality of second audio data respectively.
In the embodiment of the present disclosure, the first similarity refers to a similarity calculated with a voiceprint similarity algorithm. In some embodiments, the server determines the first similarity between the first voiceprint information and the second voiceprint information of the plurality of second audio data based on a voiceprint similarity algorithm. In this way, by calculating the first similarity between pieces of voiceprint information, the similarity between them can be determined more accurately, further improving the accuracy of voiceprint recognition.
In some embodiments, for any piece of second voiceprint information, the server calculates the first similarity between the first voiceprint information and that second voiceprint information based on the feature dimensions they include, the feature value of each dimension, and the weight each dimension occupies. In other embodiments, the server determines the first similarity with another voiceprint similarity algorithm, for example a distance algorithm, a correlation coefficient algorithm, or a model-based algorithm.
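One plausible reading of the dimension-weighted first similarity is a weighted cosine; the disclosure does not fix the exact formula, so the following is only a sketch under that assumption:

```python
import numpy as np

def weighted_similarity(a, b, weights):
    """First-similarity sketch: per-dimension agreement between two
    voiceprint vectors, weighted by the weight each feature dimension
    occupies (a weighted cosine, one plausible choice)."""
    w = np.asarray(weights, dtype=float)
    num = np.sum(w * a * b)
    den = np.sqrt(np.sum(w * a * a)) * np.sqrt(np.sum(w * b * b))
    return num / den

a = np.array([0.2, 0.9, 0.1])
b = np.array([0.2, 0.9, 0.1])
print(weighted_similarity(a, b, [1.0, 2.0, 0.5]))  # ~1.0 for identical vectors
```

Dimensions with larger weights dominate the score, which is the behavior the per-dimension weighting in the text implies.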
306. The server determines second similarities respectively corresponding to the first voiceprint information and the second voiceprint information based on the device type of the first voiceprint information, the device types of the second voiceprint information and the first similarities.
In the embodiment of the present disclosure, the second similarity refers to a similarity calculated based on a voiceprint model. In some embodiments, the server determining the second similarity between the first voiceprint information and each of the second voiceprint information comprises: the server inputs the first voiceprint information, the equipment type of the first voiceprint information, the second voiceprint information, the equipment types of the second voiceprint information and the first similarity into a voiceprint model, and determines second similarities corresponding to the first voiceprint information and the second voiceprint information respectively through the voiceprint model.
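The inputs listed above can be concatenated into a single feature row for the voiceprint model; the layout below is an assumption, since the disclosure specifies only which quantities are fed in, not their arrangement:

```python
import numpy as np

def model_input(first_vp, first_dev_onehot, second_vp, second_dev_onehot, first_sim):
    """Concatenate the quantities fed to the voiceprint model into one
    feature row: both voiceprint vectors, both one-hot device-type codes,
    and the first similarity between them."""
    return np.concatenate([first_vp, first_dev_onehot,
                           second_vp, second_dev_onehot, [first_sim]])

row = model_input(np.zeros(4), [1, 0, 0], np.ones(4), [0, 0, 1], 0.83)
print(row.shape)  # (15,)
```

One such row is built per piece of second voiceprint information, and the model outputs one second similarity per row.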
In some embodiments, the voiceprint model is an XGBoost (eXtreme Gradient Boosting) model. The voiceprint model is used to predict the second similarity between pieces of voiceprint information.
The voiceprint model adopted in the embodiment of the disclosure is a trained model. In some embodiments, the server obtains a plurality of groups of sample voiceprint data together with their similarity labels and performs model training based on them to obtain the voiceprint model. Each group of sample voiceprint data comprises voiceprint information to be recognized, its device type, a plurality of pieces of reference voiceprint information, their device types, and the first similarities between the voiceprint information to be recognized and each piece of reference voiceprint information. Specifically, the training process of the voiceprint model is as follows: in the first iteration, the groups of sample voiceprint data are input into an initial model to obtain the similarity training result of the first iteration; a loss function is determined based on that result and the similarity labels, and the model parameters of the initial model are adjusted based on the loss function; the parameters adjusted in the first iteration then serve as the parameters for the second iteration, and so on. In the N-th iteration, the parameters adjusted in the (N-1)-th iteration serve as the new model parameters. Training continues in this way until the target condition is met, and the model from the iteration that meets the target condition is taken as the voiceprint model, where N is a positive integer greater than 1.
Optionally, the target condition met by training may be that the number of training iterations of the initial model reaches a target number, which may be a preset number of iterations; alternatively, the target condition may be that the loss value meets a target threshold condition, such as being less than 0.00001. The embodiments of the present disclosure are not limited in this respect.
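The iterate / compute-loss / adjust-parameters / check-target-condition pattern described above can be sketched with a simple gradient-descent stand-in on synthetic data; the actual model is XGBoost, whose boosting iterations follow the same stopping logic, and all data and hyperparameters here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # synthetic sample voiceprint feature rows
true_w = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y = X @ true_w                                   # synthetic similarity labels

w = np.zeros(5)                                  # initial model parameters
for n in range(1, 1001):                         # N-th pass reuses params from pass N-1
    pred = X @ w
    loss = np.mean((pred - y) ** 2)              # loss vs. the similarity labels
    if loss < 1e-4:                              # target condition: loss threshold
        break                                    # (or: n reaching a target count)
    w -= 0.1 * (2 / len(y)) * X.T @ (pred - y)   # adjust parameters using the loss
print(n, loss < 1e-4)
```

The model held when the target condition is first met corresponds to the trained voiceprint model in the text.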
In this embodiment, the voiceprint information bases are managed base-by-base according to the type of audio acquisition device. While the coverage of the voiceprint information bases is ensured, similarity calculation is performed through the voiceprint model, which reduces the voiceprint recognition errors caused by differing audio acquisition device types and improves the accuracy of voiceprint recognition.
307. The server determines target voiceprint information matched with the first voiceprint information based on a plurality of second similarities.
In some embodiments, the server determines the target voiceprint information that matches the first voiceprint information as follows: the server sorts the second similarities from high to low, selects the second voiceprint information corresponding to the second similarities ranked within the top target number, and determines it as the target voiceprint information. The target number is a preset fixed number, such as 1 or 10.
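The top-target-number selection in this step reduces to a sort-and-slice; the candidate identifiers below are hypothetical:

```python
def top_matches(candidates, target_number=2):
    """Sort (voiceprint_id, second_similarity) pairs from high to low and
    keep the top target number as the target voiceprint information."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [vp for vp, _ in ranked[:target_number]]

cands = [("vp_a", 0.61), ("vp_b", 0.93), ("vp_c", 0.78)]
print(top_matches(cands))  # ['vp_b', 'vp_c']
```

With `target_number=1` this yields a single best match, matching the "such as 1 or 10" examples in the text.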
308. The server acquires the user information associated with the target voiceprint information and determines the user information as the user information matched with the first audio data.
In some embodiments, the voiceprint information store also stores user information associated with a plurality of audio data.
309. The server returns the user information associated with the target voiceprint information to the terminal.
310. And the terminal receives the user information associated with the target voiceprint information and displays the user information associated with the target voiceprint information.
According to the technical solution provided by the embodiments of the present disclosure, the second similarity between the voiceprint information of the audio data to be recognized and each candidate piece of voiceprint information is determined based on their first similarity and the device type corresponding to each piece of voiceprint information, and the target voiceprint information matching the voiceprint information of the audio data to be recognized is then determined from these second similarities.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Fig. 4 is a schematic structural diagram of a voiceprint recognition apparatus provided in an embodiment of the present disclosure, and referring to fig. 4, the apparatus includes:
an obtaining module 401, configured to obtain first voiceprint information of first audio data;
a first similarity determining module 402, configured to determine first similarities corresponding to the first voiceprint information and second voiceprint information of multiple pieces of second audio data, where the multiple pieces of second audio data are acquired by different types of audio acquisition devices;
a second similarity determining module 403, configured to determine, based on the device type of the first voiceprint information, the device types of the second voiceprint information, and the first similarities, second similarities corresponding to the first voiceprint information and the second voiceprint information, respectively;
a voiceprint information determination module 404, configured to determine, based on a plurality of the second similarities, target voiceprint information that matches the first voiceprint information.
According to the technical solution provided by the embodiments of the present disclosure, the second similarity between the voiceprint information of the audio data to be recognized and each candidate piece of voiceprint information is determined based on their first similarity and the device type corresponding to each piece of voiceprint information, and the target voiceprint information matching the voiceprint information of the audio data to be recognized is then determined from these second similarities.
In some embodiments, the second similarity determination module 403 is configured to:
inputting the first voiceprint information, the device type of the first voiceprint information, the second voiceprint information, the device types of the second voiceprint information and the first similarity into a voiceprint model, and determining second similarities corresponding to the first voiceprint information and the second voiceprint information respectively through the voiceprint model.
In some embodiments, the voiceprint information determination module 404 is configured to:
and sorting the second similarities in order from high to low, selecting the second voiceprint information corresponding to the second similarities ranked within the top target number, and determining it as the target voiceprint information.
In some embodiments, the apparatus further comprises:
the retrieval module is used for respectively retrieving in a plurality of voiceprint information bases based on first voiceprint information of the first audio data to obtain second voiceprint information of the plurality of second audio data, and the voiceprint information bases are respectively used for storing the voiceprint information of the plurality of audio data acquired by the corresponding audio acquisition equipment;
and executing the step of determining the first similarity corresponding to the first voiceprint information and the second voiceprint information of the plurality of second audio data respectively based on the second voiceprint information of the plurality of second audio data.
In some embodiments, the retrieval module is further configured to:
and determining the equipment type of the first voiceprint information, searching in a voiceprint information base corresponding to the equipment type, and if the voiceprint information base corresponding to the equipment type does not have the voiceprint information matched with the first voiceprint information, searching in a plurality of other voiceprint information bases respectively.
In some embodiments, the apparatus further comprises:
and the user information determining module is used for acquiring the user information associated with the target voiceprint information and determining the user information as the user information matched with the first audio data.
It should be noted that: in the voiceprint recognition apparatus provided in the above embodiment, only the division of the above functional modules is exemplified when voiceprint recognition is performed, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the electronic device may be divided into different functional modules to complete all or part of the above described functions. In addition, the voiceprint recognition apparatus provided in the above embodiment and the voiceprint recognition method embodiment belong to the same concept, and specific implementation processes thereof are described in the method embodiment, and are not described herein again.
The electronic device in the embodiment of the present disclosure may be provided as a server, and fig. 5 is a schematic structural diagram of a server provided in the embodiment of the present disclosure, where the server 500 may generate a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 501 and one or more memories 502, where at least one program code is stored in the one or more memories 502, and is loaded and executed by the one or more processors 501 to implement the voiceprint recognition method executed by the server in the above-described method embodiments. Of course, the server 500 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input and output, and the server 500 may also include other components for implementing the functions of the device, which is not described herein again.
In an exemplary embodiment, a computer readable storage medium, such as a memory, including program code executable by a processor to perform the voiceprint recognition method in the above embodiments is also provided. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by hardware associated with program code, and the program may be stored in a computer readable storage medium, where the above mentioned storage medium may be a read-only memory, a magnetic or optical disk, etc.
The foregoing is considered as illustrative of the embodiments of the disclosure and is not to be construed as limiting thereof, and any modifications, equivalents, improvements and the like made within the spirit and principle of the disclosure are intended to be included within the scope of the disclosure.
Claims (10)
1. A voiceprint recognition method, the method comprising:
acquiring first voiceprint information of first audio data;
determining first similarity corresponding to the first voiceprint information and second voiceprint information of a plurality of second audio data respectively, wherein the second audio data are acquired by different types of audio acquisition equipment;
determining second similarities respectively corresponding to the first voiceprint information and the second voiceprint information based on the device type of the first voiceprint information, the device types of the second voiceprint information and the first similarities;
and determining target voiceprint information matched with the first voiceprint information based on a plurality of second similarities.
2. The method according to claim 1, wherein the determining, based on the device type of the first voiceprint information, the device types of the second voiceprint information, and the first similarities, the second similarities corresponding to the first voiceprint information and the second voiceprint information respectively comprises:
inputting the first voiceprint information, the device type of the first voiceprint information, the plurality of second voiceprint information, the device types of the plurality of second voiceprint information and the plurality of first similarities into a voiceprint model, and determining second similarities corresponding to the first voiceprint information and the plurality of second voiceprint information respectively through the voiceprint model.
3. The method of claim 1, wherein determining target voiceprint information that matches the first voiceprint information based on the plurality of second similarities comprises:
and sorting the second similarities in order from high to low, selecting the second voiceprint information corresponding to the second similarities ranked within the top target number, and determining it as the target voiceprint information.
4. The method of claim 1, wherein after obtaining the first voiceprint information for the first audio data, the method further comprises:
based on first voiceprint information of the first audio data, searching in a plurality of voiceprint information bases respectively to obtain second voiceprint information of a plurality of second audio data, wherein the voiceprint information bases are used for storing the voiceprint information of a plurality of audio data acquired by a plurality of corresponding audio acquisition devices respectively;
and executing the step of determining the first similarity corresponding to the first voiceprint information and the second voiceprint information of the plurality of second audio data respectively based on the second voiceprint information of the plurality of second audio data.
5. The method of claim 4, wherein prior to retrieving in each of the plurality of voiceprint information repositories, the method further comprises:
and determining the equipment type of the first voiceprint information, searching in a voiceprint information base corresponding to the equipment type, and if the voiceprint information base corresponding to the equipment type does not have the voiceprint information matched with the first voiceprint information, searching in a plurality of other voiceprint information bases respectively.
6. The method according to claim 1, wherein after determining the target voiceprint information matching the first voiceprint information based on the plurality of second similarities, the method further comprises:
and acquiring user information associated with the target voiceprint information, and determining the user information as the user information matched with the first audio data.
7. A voiceprint recognition apparatus, said apparatus comprising:
the acquisition module is used for acquiring first voiceprint information of the first audio data;
the first similarity determining module is used for determining first similarities corresponding to the first voiceprint information and second voiceprint information of a plurality of second audio data, and the second audio data are acquired by different types of audio acquisition equipment;
a second similarity determining module, configured to determine, based on the device type of the first voiceprint information, the device types of the second voiceprint information, and the first similarities, second similarities corresponding to the first voiceprint information and the second voiceprint information, respectively;
and the voiceprint information determining module is used for determining target voiceprint information matched with the first voiceprint information based on the plurality of second similarities.
8. The apparatus of claim 7, wherein the second similarity determination module is configured to:
inputting the first voiceprint information, the device type of the first voiceprint information, the plurality of second voiceprint information, the device types of the plurality of second voiceprint information and the plurality of first similarities into a voiceprint model, and determining second similarities corresponding to the first voiceprint information and the plurality of second voiceprint information respectively through the voiceprint model.
9. An electronic device, comprising one or more processors and one or more memories having at least one program code stored therein, the program code being loaded and executed by the one or more processors to implement the operations performed by the voiceprint recognition method of any one of claims 1 to 6.
10. A computer-readable storage medium having at least one program code stored therein, the program code being loaded and executed by a processor to perform operations performed by the voiceprint recognition method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111091629.8A CN113707182B (en) | 2021-09-17 | 2021-09-17 | Voiceprint recognition method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113707182A true CN113707182A (en) | 2021-11-26 |
CN113707182B CN113707182B (en) | 2024-06-25 |
Family
ID=78661465
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111091629.8A Active CN113707182B (en) | 2021-09-17 | 2021-09-17 | Voiceprint recognition method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113707182B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20130137332A (en) * | 2012-06-07 | 2013-12-17 | 주식회사 케이티 | Speech recognition server for determining service type based on speech informaion of device, content server for providing content to the device based on the service type, the device, and methods thereof |
WO2017113370A1 (en) * | 2015-12-31 | 2017-07-06 | 华为技术有限公司 | Voiceprint detection method and apparatus |
CN109448735A (en) * | 2018-12-21 | 2019-03-08 | 深圳创维-Rgb电子有限公司 | Video parameter method of adjustment, device and reading storage medium based on Application on Voiceprint Recognition |
WO2020020375A1 (en) * | 2018-07-27 | 2020-01-30 | 北京三快在线科技有限公司 | Voice processing method and apparatus, electronic device, and readable storage medium |
CN110808053A (en) * | 2019-10-09 | 2020-02-18 | 深圳市声扬科技有限公司 | Driver identity verification method and device and electronic equipment |
CN111261172A (en) * | 2020-01-21 | 2020-06-09 | 北京爱数智慧科技有限公司 | Voiceprint recognition method and device |
CN111291345A (en) * | 2020-02-24 | 2020-06-16 | 平安科技(深圳)有限公司 | Voiceprint data processing method and device, computer equipment and storage medium |
CN111785283A (en) * | 2020-05-18 | 2020-10-16 | 北京三快在线科技有限公司 | Voiceprint recognition model training method and device, electronic equipment and storage medium |
CN112331217A (en) * | 2020-11-02 | 2021-02-05 | 泰康保险集团股份有限公司 | Voiceprint recognition method and device, storage medium and electronic equipment |
CN112435673A (en) * | 2020-12-15 | 2021-03-02 | 北京声智科技有限公司 | Model training method and electronic terminal |
US20210075787A1 (en) * | 2018-01-22 | 2021-03-11 | Nokia Technologies Oy | Privacy-preserving voiceprint authentication apparatus and method |
CN113253964A (en) * | 2021-06-24 | 2021-08-13 | 武汉中科瑞华生态科技股份有限公司 | Data management method, device, equipment and storage medium |
- 2021-09-17 CN CN202111091629.8A patent/CN113707182B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN113707182B (en) | 2024-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107610709B (en) | Method and system for training voiceprint recognition model | |
CN110162669B (en) | Video classification processing method and device, computer equipment and storage medium | |
WO2017206661A1 (en) | Voice recognition method and system | |
CN113314119B (en) | Voice recognition intelligent household control method and device | |
CN113035202B (en) | Identity recognition method and device | |
CN106991312B (en) | Internet anti-fraud authentication method based on voiceprint recognition | |
CN109448732B (en) | Digital string voice processing method and device | |
CN111222005A (en) | Voiceprint data reordering method and device, electronic equipment and storage medium | |
CN111737515B (en) | Audio fingerprint extraction method and device, computer equipment and readable storage medium | |
CN113628637A (en) | Audio identification method, device, equipment and storage medium | |
CN114817622A (en) | Song fragment searching method and device, equipment, medium and product thereof | |
TWI725877B (en) | Electronic device and voice recognition method | |
CN113707182B (en) | Voiceprint recognition method and device, electronic equipment and storage medium | |
CN111462775B (en) | Audio similarity determination method, device, server and medium | |
CN113420178A (en) | Data processing method and equipment | |
CN111968650A (en) | Voice matching method and device, electronic equipment and storage medium | |
CN113366567B (en) | Voiceprint recognition method, singer authentication method, electronic equipment and storage medium | |
CN116741154A (en) | Data selection method and device, electronic equipment and storage medium | |
CN116312496A (en) | Audio identification method, electronic device and computer readable storage medium | |
JP7473910B2 (en) | SPEAKER RECOGNITION DEVICE, SPEAKER RECOGNITION METHOD, AND PROGRAM | |
CN103474063A (en) | Voice recognition system and method | |
CN114817621A (en) | Song semantic information indexing method and device, equipment, medium and product thereof | |
CN114840707A (en) | Song matching method and device, equipment, medium and product thereof | |
CN113593525A (en) | Method, device and storage medium for training accent classification model and accent classification | |
CN118312638B (en) | Audio retrieval method, device, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||