CN114333844A - Voiceprint recognition method, voiceprint recognition device, voiceprint recognition medium and voiceprint recognition equipment


Info

Publication number
CN114333844A
CN114333844A (application CN202111550073.4A)
Authority
CN
China
Prior art keywords
voiceprint
voice information
recognized
information
verification result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111550073.4A
Other languages
Chinese (zh)
Inventor
吴志勇
张阳
吴海滨
高骥
黄申
Current Assignee
Tencent Technology Shenzhen Co Ltd
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111550073.4A priority Critical patent/CN114333844A/en
Publication of CN114333844A publication Critical patent/CN114333844A/en
Pending legal-status Critical Current


Abstract

The application discloses a voiceprint recognition method, a voiceprint recognition device, a voiceprint recognition medium and voiceprint recognition equipment, relating to the technical field of speech recognition. The method comprises the following steps: acquiring voice information to be recognized; converting the voice information to be recognized to obtain candidate voice information; respectively inputting the voice information to be recognized and the candidate voice information into a voiceprint model and performing voiceprint verification processing, to obtain a first voiceprint verification result corresponding to the voice information to be recognized and a second voiceprint verification result corresponding to the candidate voice information; and determining a voiceprint recognition result for the voice information to be recognized according to the first voiceprint verification result and the second voiceprint verification result. The technical solution provided by the application can improve defense against adversarial attacks and improve the reliability and security of voiceprint recognition.

Description

Voiceprint recognition method, voiceprint recognition device, voiceprint recognition medium and voiceprint recognition equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a voiceprint recognition method, apparatus, medium, and device.
Background
Artificial Intelligence (AI) is a comprehensive branch of computer science that studies the design principles and implementation methods of intelligent machines, so that machines can perceive, reason, and make decisions. Artificial intelligence is an interdisciplinary field covering a wide range of areas, such as natural language processing, machine learning, and deep learning. As the technology develops, artificial intelligence will be applied in more fields and play an increasingly important role.
At present, voiceprint recognition based on deep neural networks has achieved good recognition performance and is widely applied in financial payment, remote biometric authentication, security protection, criminal investigation, judicial appraisal, and other fields. However, an unprotected voiceprint recognition system carries serious safety hazards: it may be attacked by recording playback, speech synthesis, voice conversion, adversarial examples, and the like. Existing schemes mainly perform adversarial training or introduce a new network structure for active defense; these methods require more computing resources or increase the parameter count of the model, and their defense effect still leaves considerable room for improvement.
Disclosure of Invention
In order to improve reliability and safety of voiceprint recognition, the application provides a voiceprint recognition method, a voiceprint recognition device, a voiceprint recognition medium and voiceprint recognition equipment. The technical scheme is as follows:
in a first aspect, the present application provides a voiceprint recognition method, including:
acquiring voice information to be recognized;
converting the voice information to be recognized to obtain candidate voice information;
respectively inputting the voice information to be recognized and the candidate voice information into a voiceprint model, and performing voiceprint verification processing to obtain a first voiceprint verification result corresponding to the voice information to be recognized and a second voiceprint verification result corresponding to the candidate voice information;
and determining a voiceprint recognition result for the voice information to be recognized according to the first voiceprint verification result and the second voiceprint verification result.
In a second aspect, the present application provides a voiceprint recognition apparatus, the apparatus comprising:
the acquisition module is used for acquiring the voice information to be recognized;
the conversion module is used for carrying out conversion processing on the voice information to be recognized to obtain candidate voice information;
a voiceprint verification module, configured to input the to-be-recognized voice information and the candidate voice information into a voiceprint model respectively, and perform voiceprint verification processing to obtain a first voiceprint verification result corresponding to the to-be-recognized voice information and a second voiceprint verification result corresponding to the candidate voice information;
and the recognition module is used for determining a voiceprint recognition result for the voice information to be recognized according to the first voiceprint verification result and the second voiceprint verification result.
In a third aspect, the present application provides a computer-readable storage medium, in which at least one instruction or at least one program is stored, and the at least one instruction or the at least one program is loaded and executed by a processor to implement a voiceprint recognition method according to the first aspect.
In a fourth aspect, the present application provides a computer device comprising a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement a voiceprint recognition method according to the first aspect.
In a fifth aspect, the present application provides a computer program product comprising computer instructions which, when executed by a processor, implement a voiceprint recognition method as described in the first aspect.
The voiceprint recognition method, the voiceprint recognition device, the voiceprint recognition medium and the voiceprint recognition equipment have the following technical effects:
the voice information to be recognized is converted to obtain candidate voice information, the voice information to be recognized and the candidate voice information are input to the voiceprint model together to be subjected to voiceprint verification processing, and finally a voiceprint recognition result corresponding to the voice information to be recognized is obtained according to a first voiceprint verification result corresponding to the voice information to be recognized and a second voiceprint verification result corresponding to the candidate voice information. When the voice recognition method is applied to confrontation defense and detection, the technical scheme provided by the application is improved in an input preprocessing stage, the confrontation training of an original voiceprint model is not needed, a defense module for confronting sample attack is not needed, only conversion processing of voice to be recognized and judgment processing of a plurality of voiceprint verification results are needed to be added in an original voiceprint verification stage, the whole scheme is simple and easy to implement, and the method has high universality; according to the technical scheme, effective defense and detection can be implemented on the basis of the performance of the original voiceprint model and the white box attack or the black box attack based on the countercheck sample, so that the safety and the reliability of voiceprint recognition are improved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present application or of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; other drawings can be obtained by those skilled in the art without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of a voiceprint recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a voiceprint recognition method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a process for converting speech information to be recognized according to an embodiment of the present application;
FIG. 4 is a schematic diagram of voiceprint recognition based on a voiceprint model according to an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating a process for determining a voiceprint authentication result according to an embodiment of the present application;
fig. 6 is a schematic flowchart of a decision making process according to a voiceprint verification result according to an embodiment of the present application;
fig. 7 is a schematic flowchart of countermeasure detection according to a voiceprint verification result according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of another voiceprint recognition method provided by an embodiment of the present application;
fig. 9 is a schematic diagram of a voiceprint recognition apparatus provided in an embodiment of the present application;
fig. 10 is a hardware structural diagram of an apparatus for implementing a voiceprint recognition method according to an embodiment of the present application.
Detailed Description
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of intelligent machines, so that machines can perceive, reason, and make decisions. Artificial intelligence is an interdisciplinary field spanning both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, mechatronics, and the like.
The scheme provided by the embodiments of the present application relates to artificial-intelligence speech technology. The key technologies of Speech Technology are Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of its most promising interaction modes.
The scheme provided by the embodiment of the application can be deployed at the cloud end, and further relates to cloud technology and the like.
Cloud technology (Cloud technology): a hosting technology that unifies hardware, software, network, and other resources in a wide area network or local area network to realize the computation, storage, processing, and sharing of data. It can also be understood as a general term for the network technologies, information technologies, integration technologies, management platform technologies, application technologies, and the like that are applied on the basis of a cloud-computing business model; these can form a resource pool that is used on demand, flexibly and conveniently. With the rapid development of the internet industry, background services of technical network systems, such as video websites, picture websites, and portal websites, require large amounts of computing and storage resources. In the future, each article may carry its own identification mark that must be transmitted to a background system for logical processing; data at different levels will be processed separately, and data in various industries will need strong system support, all of which require cloud computing. Cloud computing is a computing model that distributes computing tasks over a resource pool formed by large numbers of computers, enabling application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". To users, resources in the "cloud" appear infinitely expandable, available at any time, usable on demand, and paid for by use. As a basic capability provider of cloud computing, a cloud-computing resource-pool platform, called Infrastructure as a Service (IaaS) for short, is established; multiple types of virtual resources are deployed in the resource pool for external clients to use selectively.
The cloud computing resource pool mainly comprises: a computing device (which may be a virtualized machine, including an operating system), a storage device, and a network device.
The workflow of a typical voiceprint recognition system essentially involves two steps: "voiceprint reservation (registration)" and "voiceprint verification (testing)". Registration converts the user's voice into speaker characterization vectors and stores them. Voiceprint verification judges whether an unknown test voice comes from a designated speaker: the system converts the test voice into speaker characterization vectors and scores them against the reserved user voice; if the score is greater than a preset threshold, the system judges that the test voice belongs to the same speaker; otherwise, if the score is smaller than the threshold, it does not.
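The score-and-threshold comparison described above can be sketched with cosine scoring over speaker characterization vectors. This is a minimal illustration only; the function names, embedding dimensionality, and the threshold value 0.7 are assumptions, not values specified by the patent.

```python
import numpy as np

def cosine_score(enrolled: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between an enrolled speaker embedding and a test embedding."""
    return float(np.dot(enrolled, test) /
                 (np.linalg.norm(enrolled) * np.linalg.norm(test)))

def verify(enrolled: np.ndarray, test: np.ndarray, threshold: float = 0.7) -> bool:
    """Accept the test utterance as the enrolled speaker when the score exceeds the threshold."""
    return cosine_score(enrolled, test) > threshold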
Voiceprint recognition has been widely applied in financial payment, remote biometric authentication, security protection, criminal investigation, judicial appraisal, and other fields, and an unprotected voiceprint system carries serious safety hazards. After the user completes registration, the voiceprint verification step may face security problems such as recording-playback attacks, speech-synthesis attacks, voice-conversion attacks, and adversarial-sample attacks. Recording-playback, speech-synthesis, and voice-conversion attacks use secondary recordings or synthesized/converted sound; because of the insufficient frequency response of the recording equipment or limitations of the synthesis model, some frequency bands in the attacking speech data are missing or distorted, so its characteristics differ from those of a real person. By studying a large number of positive and negative samples, a computer can easily distinguish whether speech is a recording/synthesis or a real person.
Adversarial-sample attacks, especially in the white-box case, are more difficult to detect and defend against than the above three attacks. Why deep learning models are vulnerable to adversarial samples is still an open research question, and the lack of a complete theoretical framework restricts the further development of deep learning. Most current research on adversarial samples focuses on computer vision. Since speech signals are non-stationary, research on adversarial-sample attacks and defenses in voiceprint recognition is still in its infancy. Particularly for defense against adversarial-sample attacks, there are currently few well-established studies and solutions relating to voiceprints. At present, defense methods against such attacks mainly fall into the following three schemes:
1. Adversarial training: in each model training process, the model is retrained by injecting adversarial samples into the training set;
2. Preprocessing input data: the input is transformed so that it is difficult for an attacker to compute the model's gradient, thereby defending against the attack;
3. Model distillation: knowledge distillation is used to reduce the magnitude of network gradients and improve the ability to discover small-amplitude adversarial perturbations.
However, the above schemes have the following disadvantages:
1. Adversarial training requires generating adversarial samples during model training and then training the original network with the generated adversarial samples as input data. Both processes consume large amounts of computing resources and time; most of the resulting models can defend only against one specific adversarial-sample algorithm, and if an attacker modifies the attack algorithm, the model's defense capability drops sharply.
2. Most current defense methods that preprocess input data reconstruct the input with a generative neural network model, for example denoising adversarial samples with a Variational Auto-Encoder (VAE) or a Generative Adversarial Network (GAN), so that the output of the denoising model is closer to the original noiseless data. These methods introduce a new neural network, increasing the parameter count of the voiceprint recognition system; moreover, the system consumes more computing time denoising the original audio during inference, and such methods have difficulty resisting white-box attacks.
3. Model distillation and regularization may greatly impair the recognition performance and robustness of the voiceprint model, and degrade recognition of original, unattacked real samples.
4. Most existing defense strategies of voiceprint recognition systems against adversarial samples are borrowed and migrated from computer-vision research. Compared with image signals, speech signals are non-stationary, and adversarial-sample defense for voiceprint recognition is still at an initial stage; these image-domain methods are not necessarily applicable to voice data.
In order to improve the reliability and security of voiceprint recognition, the embodiments of the present application provide a voiceprint recognition method, apparatus, medium, and device. The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments herein without creative effort fall within the protection scope of the present application. Examples are illustrated in the accompanying drawings, where like reference numerals refer to the same or similar elements, or to elements having the same or similar functions, throughout.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings of this application are used to distinguish similar elements and not necessarily to describe a particular sequential or chronological order. It is to be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments described herein can be practiced in sequences other than those illustrated or described. Furthermore, the terms "comprises", "comprising", and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or device.
In order to facilitate understanding of the technical solutions and the technical effects thereof described in the embodiments of the present application, the embodiments of the present application explain related terms:
voiceprint: voiceprint is a behavior characteristic, and since the tongue, oral cavity, nasal cavity, vocal cords, lung and other organs used by each person when speaking are different in size, form and the like, the Voiceprint of each speaker is unique in consideration of the difference in factors such as age, personality, language habits and the like. The voiceprint on the data can be represented as a spectrum of sound waves carrying verbal information displayed with an electro-acoustic instrument. Modern scientific research shows that the voiceprint not only has specificity, but also has the characteristic of relative stability.
Voiceprint recognition: Voiceprint Recognition (VPR), also called Speaker Recognition (SRE), is a biometric technology that automatically recognizes the identity of a speaker from voice parameters (the "voiceprint") in the speech signal that reflect the speaker's physiological and behavioral characteristics. Depending on the service scenario and recognition target, it can be divided into voiceprint identification and voiceprint verification. Voiceprint identification determines which person in a set of target speaker models the speech to be recognized belongs to, a 1:N selection problem; voiceprint verification determines whether the speech to be recognized comes from its claimed target speaker, a 1:1 decision problem.
Adversarial-sample attack: an imperceptible perturbation is artificially added to a normal data sample to obtain a newly generated adversarial sample, causing the model to make an erroneous judgment on the new sample.
White-box attack: the attacker knows detailed information about the model, such as the data preprocessing method, model structure, and model parameters, and in some cases can obtain part or all of the training information, such as loss functions and gradient information. In extreme cases, the attacker even knows the model's defense strategy.
Black-box attack: the attacker does not know the key details of the model, can only access the input and output, and cannot access any internal operations or data.
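As a toy numeric illustration of an adversarial sample (not the patent's method; the linear "scorer", its weights, and the step size are all invented for this sketch), a gradient-sign perturbation far smaller than normal signal variation can flip an accept/reject decision:

```python
import numpy as np

# Hypothetical linear scorer: accept when w . x > 0.
w = np.array([0.6, -0.4, 0.2])   # "model parameters" known to a white-box attacker
x = np.array([0.1, 0.3, 0.2])    # legitimate input; its score is negative, so it is rejected

# The gradient of the score with respect to x is simply w for a linear model.
# The attacker nudges each dimension imperceptibly in the sign of the gradient
# (a fast-gradient-sign-style step).
epsilon = 0.05
x_adv = x + epsilon * np.sign(w)
# The score flips sign: x_adv is now accepted, although no coordinate moved
# by more than epsilon = 0.05.
```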
It is understood that, in specific implementations of the present application, when the above embodiments are applied to specific products or technologies, user permission or consent needs to be obtained for related data such as user information, and the collection, use, and processing of related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Referring to fig. 1, which is a schematic diagram of an implementation environment of a voiceprint recognition method according to an embodiment of the present application, as shown in fig. 1, the implementation environment may at least include a client 01 and a server 02.
Specifically, the client 01 may include a smart phone, a desktop computer, a tablet computer, a notebook computer, a vehicle-mounted terminal, a digital assistant, a smart wearable device, a monitoring device, a voice interaction device, and other types of devices, and may also include software running in the devices, such as a web page provided by some service providers to the user, and applications provided by the service providers to the user. Specifically, the client 01 may be configured to obtain voice information to be recognized.
Specifically, the server 02 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), and big-data and artificial-intelligence platforms. The server 02 may comprise a network communication unit, a processor, a memory, and the like. The client and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application. Specifically, the server 02 may be configured to convert the voice information to be recognized, sent by the client 01, to obtain candidate voice information; to input the voice information to be recognized and the candidate voice information into a voiceprint model together and perform voiceprint verification processing to obtain a corresponding first voiceprint verification result and a corresponding second voiceprint verification result; and finally to decide according to the multiple voiceprint verification results to obtain a voiceprint recognition result corresponding to the voice information to be recognized.
The embodiments of the present application can also be implemented in combination with cloud technology, which refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or local area network to realize the computation, storage, processing, and sharing of data; it can also be understood as a general term for the network technologies, information technologies, integration technologies, management platform technologies, application technologies, and the like applied on the basis of a cloud-computing business model. Cloud technology requires cloud computing as a support. Cloud computing is a computing model that distributes computing tasks over a resource pool formed by large numbers of computers, enabling application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Specifically, the server 02 and the database are located in the cloud, and the server 02 may be a physical machine or a virtualized machine.
The following describes the voiceprint recognition method provided by the present application. Fig. 2 is a flowchart of a voiceprint recognition method provided by an embodiment of the present application. The embodiment provides the operation steps of the method as described in the embodiment or flowchart, but more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only order of execution. In practice, a system or server product may execute the steps sequentially or in parallel (for example, in a parallel-processor or multi-threaded environment) according to the embodiments or the methods shown in the figures. Referring to fig. 2, the voiceprint recognition method provided by the embodiment of the present application may include the following steps:
s210: and acquiring the voice information to be recognized.
In this embodiment, the voice information may be audio data, which can be represented as an electrical signal in the time domain or the frequency domain, and voiceprint recognition may be performed on this audio data. Depending on the service scenario and recognition target, voiceprint recognition can be divided into voiceprint identification (Speaker Identification, SI) and voiceprint verification (Speaker Verification, SV). Voiceprint identification determines which person in a set of target speaker models the voice information to be recognized belongs to, a 1:N selection problem, where the target speaker model set is a set of speakers with reserved voiceprint characteristics; voiceprint verification determines whether the speech to be recognized comes from its claimed target speaker, a 1:1 decision problem.
S230: and carrying out conversion processing on the voice information to be recognized to obtain candidate voice information.
The method provided by the embodiment of the present application mainly improves the input preprocessing stage: one or more items of candidate voice information are obtained by converting the original voice information to be recognized, and the voice information to be recognized is then input into the voiceprint model together with the one or more items of candidate voice information for voiceprint verification processing, rather than inputting only the voice information to be recognized or only the converted candidate voice information.
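The idea of verifying both the original input and its converted copies, and then comparing the results, can be sketched as follows. The `verify_fn`/`transform_fns` interface is hypothetical, invented purely for illustration of the decision step:

```python
def recognize_with_defense(verify_fn, transform_fns, x):
    """Run voiceprint verification on the input and on its converted candidates.

    Disagreement between the first verification result (original input) and any
    second result (converted candidate) suggests an adversarial input, since
    adversarial perturbations tend to be fragile under audio conversion.
    """
    first = bool(verify_fn(x))                                 # first verification result
    seconds = [bool(verify_fn(g(x))) for g in transform_fns]   # second verification results
    is_adversarial = any(s != first for s in seconds)
    return first, is_adversarial
```

With a benign input, the candidates typically verify the same way as the original; an adversarial sample, crafted for the exact input waveform, tends to lose its effect after conversion, so the results diverge.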
In an embodiment of the present application, the voice information to be recognized is subjected to batch random preprocessing and audio conversion without changing, as far as possible, the speaker characteristics in the voice information to be recognized. The speaker characteristics may be traditional acoustic features, including Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP) coefficients, deep features, or Power-Normalized Cepstral Coefficients (PNCC); all of these are candidate acoustic features that perform well for voiceprint feature extraction. Illustratively, as shown in formula (1), in the voiceprint verification stage, a batch audio conversion function g is used to convert the voice information to be recognized x_t in batch, generating n items of candidate voice information, which form a set X with the voice information to be recognized:
X = {x_0, x_1, x_2, …, x_n} = {x_t, g_1(x_t), g_2(x_t), …, g_n(x_t)}   (1)

where x_0 is the original audio corresponding to the speech information to be recognized and is equal to x_t; the batch audio conversion function g performs batch random preprocessing and audio conversion on the original audio on the premise of changing the speaker characteristics in the speech information to be recognized as little as possible; and x_1, x_2, …, x_n represent the audio data corresponding to the n candidate voice messages.
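The batch conversion of formula (1) can be sketched as follows — a minimal illustration assuming the audio is held as a NumPy array and the conversion functions g_i are simple, speaker-preserving transforms (the two example transforms below are hypothetical):

```python
import numpy as np

def batch_convert(x_t, transforms):
    """Build the set X of formula (1): the original audio x_t followed by
    one candidate g_i(x_t) per conversion function."""
    return [x_t] + [g(x_t) for g in transforms]

# Two hypothetical speaker-preserving conversions: mild volume scaling
# and light amplitude clipping.
x = np.array([0.1, -0.2, 0.3, -0.4])
X = batch_convert(x, [lambda a: 0.8 * a,
                      lambda a: np.clip(a, -0.25, 0.25)])
```

Each element of X is then scored independently by the voiceprint model in the verification stage.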
In one possible implementation, as shown in fig. 3, the conversion process may specifically include the following:
s231: and adding noise information in the voice information to be recognized to obtain first candidate voice information.
Illustratively, in the case of no reverberation, noise within a certain signal-to-noise ratio limit range is randomly added to obtain the first candidate speech information, where the added noise may be noise that does not change the characteristics of the original audio speaker, such as real ambient noise, white gaussian noise, and the like, that is, additive noise is added.
Illustratively, a certain amount of reverberation may be introduced into the speech information to be recognized to obtain the first candidate speech information. Feasibly, the first candidate speech information with reverberation can be obtained by adding convolutive (multiplicative) noise, that is, by convolving the speech information to be recognized with the noise information (e.g., a room impulse response) in the time domain.
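As an illustration of the additive-noise case in step S231, the following sketch scales white Gaussian noise to a chosen signal-to-noise ratio before mixing it in (the 20 dB target and the 16 kHz, 1-second length are illustrative assumptions):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Additive noise: scale `noise` so that the mixture has the requested
    signal-to-noise ratio in dB, then add it to `speech`."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose `scale` so that 10*log10(p_speech / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # surrogate "speech": 1 s at 16 kHz
noise = rng.standard_normal(16000)    # white Gaussian noise
noisy = add_noise_at_snr(speech, noise, snr_db=20.0)
```

In practice the SNR would be drawn randomly within the permitted range for each candidate, and real recorded ambient noise could be substituted for the Gaussian noise.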
S232: or adjusting the volume or the speed of the voice information to be recognized to obtain second candidate voice information.
Illustratively, the audio volume corresponding to the speech information to be recognized may be adjusted by the amplifier element, or the audio speed may be adjusted by changing the audio sampling rate, so as to obtain one or more second candidate speech information.
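A minimal sketch of the speed adjustment via a changed effective sampling rate, assuming linear interpolation is acceptable for illustration (a production system would typically use a proper band-limited resampler):

```python
import numpy as np

def change_speed(audio, rate):
    """Resample by linear interpolation; playing the result back at the
    original sampling rate makes the audio `rate` times faster
    (rate > 1) or slower (rate < 1)."""
    n_out = int(round(len(audio) / rate))
    old_idx = np.arange(len(audio))
    new_idx = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(new_idx, old_idx, audio)

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz
faster = change_speed(tone, rate=1.1)                      # ~10% shorter
```

Volume adjustment is simpler still: multiply the waveform by a gain factor, as in the `0.8 * a` transform sketched earlier.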
S233: or filtering the voice information to be recognized to obtain third candidate voice information.
For example, the audio for characterizing the speech information to be recognized is filtered, and a filter that does not change the speaker characteristic, such as gaussian filter, median filter, mean filter, etc., is used to obtain one or more third candidate speech information.
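Of the filters named in step S233, the median filter can be sketched as follows (window width and the test impulse are illustrative):

```python
import numpy as np

def median_filter(audio, width=3):
    """Replace each sample with the median of an odd-width window around
    it; edges are reflection-padded. Median filtering removes impulsive
    clicks while leaving the broadband speech structure largely intact."""
    half = width // 2
    padded = np.pad(audio, half, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return np.median(windows, axis=1)

click = np.array([0.0, 0.0, 5.0, 0.0, 0.0])  # an isolated impulse
smoothed = median_filter(click)              # the outlier is suppressed
```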
S234: or, performing audio reconstruction on the voice information to be recognized to obtain fourth candidate voice information.
For example, for the reconstruction of audio signals, conventional methods are Filter-Bank-Summation (FBS) and Overlap-and-Add (OLA). In addition, the original voice information to be recognized can be subjected to noise reduction reconstruction by using an audio noise reduction model, or the original voice information to be recognized can be reconstructed by using a vocoder, an interpolation algorithm and the like.
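Of the reconstruction methods mentioned, Overlap-and-Add can be sketched as follows — a minimal synthesis loop assuming Hann-windowed frames with 50% overlap; a full FBS/OLA pipeline would pair this with a matching analysis stage:

```python
import numpy as np

def overlap_add(frames, hop):
    """Overlap-and-Add (OLA): window each frame, sum the windowed frames
    at `hop`-sample offsets, and normalise by the accumulated window
    values so regions of full overlap are reconstructed exactly."""
    win = np.hanning(frames.shape[1])
    out = np.zeros(hop * (len(frames) - 1) + frames.shape[1])
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        start = i * hop
        out[start:start + len(frame)] += frame * win
        norm[start:start + len(frame)] += win
    return out / np.maximum(norm, 1e-8)   # guard near-zero edges

# A constant signal cut into 5 overlapping frames is recovered in the interior.
rebuilt = overlap_add(np.ones((5, 320)), hop=160)
```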
In the above embodiment, one or more candidate voice messages are obtained in multiple ways by preprocessing and converting the voice information to be recognized, so that the candidate voice messages and the voice information to be recognized can be input together into the voiceprint model.
S250: and respectively inputting the voice information to be recognized and the candidate voice information into the voiceprint model, and performing voiceprint verification processing to obtain a first voiceprint verification result corresponding to the voice information to be recognized and a second voiceprint verification result corresponding to the candidate voice information.
Generally, as shown in fig. 4, voiceprint recognition mainly involves two stages: voiceprint reservation and voiceprint verification. Voiceprint verification judges whether a piece of speech to be recognized comes from a designated speaker: the speech to be recognized is converted into a speaker characterization vector and compared with the speaker characterization vector corresponding to the user's reserved speech; if the resulting score is greater than a preset threshold, the two can be judged to belong to the same speaker; otherwise, if the score is smaller than the threshold, they are judged not to belong to the same speaker.
In the embodiment of the application, voiceprint verification processing is performed on the voice information to be recognized and the one or more candidate voice messages respectively, so as to obtain a first voiceprint verification result corresponding to the voice information to be recognized and a second voiceprint verification result corresponding to each candidate voice message. When there are a plurality of candidate voice messages, there are correspondingly a plurality of second voiceprint verification results. The voiceprint verification result may be a verification score representing the similarity to the speaker characterization vector corresponding to the user's reserved speech, as shown in formula (2), where X represents the set formed by the speech information to be recognized and the candidate voice messages, s_0 represents the first voiceprint verification result, and s_1, s_2, s_3, …, s_n represent the second voiceprint verification results corresponding to the n candidate voice messages, together forming a score group S.
S = f(X) = [s_0, s_1, s_2, s_3, …, s_n]   (2)
In one embodiment of the present application, each speech information is input to the voiceprint model after feature extraction. Specifically, as shown in fig. 5, the following steps may be included:
s251: and performing feature extraction on the voice information to be recognized to obtain first feature information corresponding to the voice information to be recognized.
S252: and performing feature extraction on the candidate voice information to obtain second feature information corresponding to the candidate voice information.
Feasibly, speech can be regarded as a short-time stationary and long-time non-stationary signal: its long-time non-stationarity arises from the changing physical movement of the vocal organs, but that movement has a certain inertia, so the speech signal can be treated as stationary over short intervals, generally in the range of 10 to 30 milliseconds. In digital signal processing, time-frequency analysis is generally performed on stationary signals to extract features. Therefore, when extracting features from a speech signal, a time window of about 20 ms may be set, within which the signal is considered stationary; the window is then slid along the signal, and from each window position a feature characterizing the signal in that window is extracted, yielding a feature sequence for the speech signal. In this way, a segment of speech is converted into a frame-level feature sequence, which serves as the feature information corresponding to the voice information. Commonly used feature parameters are Filter Bank features, Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction coefficients (PLP), and the like.
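The framing step described above can be sketched as follows, assuming a 20 ms window and 10 ms hop at 16 kHz (the per-frame feature computation — MFCC, filter-bank energies, etc. — is omitted; here each frame is simply sliced out):

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=20.0, hop_ms=10.0):
    """Slice a waveform into overlapping short-time frames (~20 ms), within
    which speech is treated as stationary; each frame would then be mapped
    to a feature vector (MFCC, filter-bank energies, etc.)."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])

audio = np.zeros(16000)              # 1 s of silence at 16 kHz
frames = frame_signal(audio, 16000)  # 320-sample frames every 160 samples
```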
S253: and inputting the first characteristic information into a voiceprint model, and comparing the first characteristic information with the reserved registered voice characteristic information to obtain a first voiceprint verification result corresponding to the voice information to be recognized.
S254: and inputting the second characteristic information into the voiceprint model, and comparing the second characteristic information with the reserved registered voice characteristic information to obtain a second voiceprint verification result corresponding to the candidate voice information.
The registered voice characteristic information corresponds to registered voice information of registered users and is user voiceprint characteristic information stored in a voiceprint reservation stage.
The voiceprint model may be a Gaussian Mixture Model (GMM) or Gaussian Mixture Model-Universal Background Model (GMM-UBM) framework, or a deep-learning-based neural network model framework such as an Identity Vector (i-vector) model. Taking the identity vector model as an example, each voiceprint verification result can be obtained by calculating the cosine distance between each piece of feature information and the reserved registered voice feature information, and each voiceprint verification result is a verification score.
In another embodiment of the present application, an end-to-end neural network model based on deep learning, such as a d-vector deep neural network model, may be employed. Specifically, a feature extraction layer (hidden layer) of the deep neural network is used for outputting speaker features at a frame level, the speaker features are expressed into speaker feature vectors, and then the cosine distance between the speaker feature vectors corresponding to the voice information to be recognized and the reserved speaker feature vectors of the registered users is calculated to obtain a score.
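The cosine-distance scoring used by both the i-vector and d-vector style models can be sketched as follows (the embedding values are illustrative only):

```python
import numpy as np

def cosine_score(test_vec, enrolled_vec):
    """Verification score: cosine similarity between the test speaker
    characterization vector and the reserved (enrolled) one; higher means
    more likely the same speaker."""
    return float(np.dot(test_vec, enrolled_vec) /
                 (np.linalg.norm(test_vec) * np.linalg.norm(enrolled_vec)))

enrolled = np.array([0.3, 1.2, -0.5])   # hypothetical speaker embedding
same = cosine_score(enrolled, enrolled)
orthogonal = cosine_score(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so the preset threshold is chosen somewhere in between.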
In an embodiment of the present application, in the voiceprint reservation stage, the registration voice information of the registered user needs to be acquired, and the registration voice information is input into the voiceprint model to obtain the registration voice feature information of the registered user, where the registration voice feature information represents the voiceprint feature of the registered user. That is, in the application of voiceprint confirmation or voiceprint recognition, the voice information and voiceprint characteristics of the target speaker need to be reserved so as to be compared with the voice information to be recognized and the candidate voice information.
S270: and determining a voiceprint recognition result aiming at the voice information to be recognized according to the first voiceprint verification result and the second voiceprint verification result.
In the embodiment of the application, for a voiceprint recognition task on the voice information to be recognized, voiceprint verification processing is performed on the voice information to be recognized and the corresponding one or more candidate voice messages, so that a plurality of voiceprint verification results are obtained, including a first voiceprint verification result corresponding to the voice information to be recognized and a second voiceprint verification result corresponding to each candidate voice message. For these multiple output results, the method provided in the embodiment of the present application designs a corresponding decision policy to obtain the final voiceprint recognition result corresponding to the voice information to be recognized, where the voiceprint recognition result indicates whether the current speaker (i.e., the provider of the voice information to be recognized) is a member of the registered user set, or whether the current speaker is the specified target speaker.
In a possible implementation manner, the first voiceprint verification result and the second voiceprint verification result are verification scores, so that the judgment can be performed according to a certain average result to obtain a final voiceprint recognition result. Specifically, as shown in fig. 6, the method may include:
s2711: and calculating to obtain a first target mean value according to the first voiceprint verification result and the second voiceprint verification result, wherein the first target mean value represents the mean value of the verification scores.
S2713: and determining a voiceprint recognition result aiming at the voice information to be recognized based on a first preset threshold and a first target mean value, wherein the voiceprint recognition result represents whether an object to be recognized, which provides the voice information to be recognized, is a registered user.
Illustratively, on the basis of formula (2), a first target mean s̄ of the score group S can be calculated as shown in formula (3):

s̄ = (1/(n+1)) · Σ_{i=0}^{n} s_i   (3)

The first target mean s̄ is then compared with the first preset threshold τ_1 to judge whether the current speaker is the same speaker as a registered user, or whether the current speaker is a member of the registered user set.
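A minimal sketch of the mean-score decision of steps S2711-S2713, assuming the scores are plain floats and the threshold value is illustrative:

```python
def mean_score_decision(scores, tau_1):
    """Formula (3)-style decision: average the n+1 verification scores
    (s_0 for the speech to be recognized plus s_1..s_n for the candidates)
    and accept only if the mean exceeds the preset threshold tau_1."""
    s_bar = sum(scores) / len(scores)
    return s_bar, s_bar > tau_1

s_bar, accepted = mean_score_decision([0.9, 0.8, 0.7], tau_1=0.75)
```

Averaging over the candidates means a single inflated score on an adversarial input is diluted by the scores of the converted copies.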
In another possible implementation, for a plurality of voiceprint verification results, a final voiceprint recognition result is obtained based on a minority-compliant voting strategy, and specifically, as shown in fig. 6, the method may include:
s2721: and determining a first voiceprint identification result corresponding to the first voiceprint verification result and a second voiceprint identification result corresponding to the second voiceprint verification result according to the second preset threshold, the first voiceprint verification result and the second voiceprint verification result.
The first voiceprint verification result and the second voiceprint verification result are respectively compared with a second preset threshold to obtain a corresponding first voiceprint identification result and second voiceprint identification result, where each voiceprint identification result indicates whether the current speaker is the same speaker as a registered user, or whether the current speaker is a member of the registered user set. The second preset threshold τ_2 may or may not be equal to the first preset threshold τ_1 in the previous embodiments.
S2723: determining a final voiceprint recognition result according to the first voiceprint recognition result and the second voiceprint recognition result based on a minority obedient principle; the voiceprint recognition result represents whether the object to be recognized, which provides the voice information to be recognized, is a registered user.
For example, if the number of voiceprint recognition results indicating that the current speaker and a registered user are the same speaker is greater than the number of voiceprint recognition results indicating that the current speaker and a registered user are not the same speaker, the final voiceprint recognition result indicates that the current speaker and a registered user are the same speaker.
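The minority-obeys-majority voting of steps S2721-S2723 can be sketched as follows (treating a tie as rejection is an assumption; the text does not specify tie handling):

```python
def majority_vote(verdicts):
    """Minority obeys majority: each per-audio verdict is True (same
    speaker) or False; the more frequent verdict wins. A tie counts as
    rejection (an assumption not specified in the source)."""
    accepts = sum(verdicts)
    return accepts > len(verdicts) - accepts

decision = majority_vote([True, True, False])   # 2 accepts vs 1 reject
```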
In the above embodiment, the final voiceprint recognition result may differ from the first voiceprint recognition result corresponding to the first voiceprint verification result; if the obtained voice information to be recognized is a countermeasure sample with added perturbation, effective countermeasure defense can be implemented in this case, and false decisions can be effectively prevented.
In a possible implementation, whether the speech information to be recognized is a countermeasure sample with added perturbation can be detected through the difference between the first voiceprint verification result and the second voiceprint verification result. Specifically, as shown in fig. 7, the method may include:
s2731: and calculating to obtain a second target mean value according to the first voiceprint verification result and the second voiceprint verification result, wherein the second target mean value represents the mean value of the verification score difference value.
S2733: and determining attribute state information of the voice information to be recognized based on a third preset threshold and the second target mean value, wherein the attribute state information represents whether the voice information to be recognized is a countermeasure sample.
For example, on the basis of formula (2), a second target mean s' of the score group S can be calculated as shown in formula (4):

s' = (1/n) · Σ_{i=1}^{n} |s_0 − s_i|   (4)

where s_0 is the first voiceprint verification result corresponding to the speech information to be recognized and s_i is the second voiceprint verification result corresponding to the i-th candidate voice message. The second target mean s' is further compared with a third preset threshold τ_3: if s' is greater than the threshold τ_3, the speech information to be recognized is judged to be a countermeasure sample and no final voiceprint recognition result is returned; if s' is less than the threshold τ_3, the speech information to be recognized is not considered a countermeasure sample, and the voiceprint recognition result can be judged and output according to s_0.
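A sketch of the detection rule of formula (4), assuming the score-difference mean is taken as the mean absolute gap between s_0 and each s_i and that the threshold value is illustrative:

```python
def detect_adversarial(scores, tau_3):
    """Flag a countermeasure sample when the mean absolute gap between the
    original score s_0 and each candidate score s_i exceeds tau_3:
    adversarial perturbations tend not to survive the audio conversions,
    so candidate scores drift away from s_0."""
    s_0, candidates = scores[0], scores[1:]
    s_prime = sum(abs(s_0 - s_i) for s_i in candidates) / len(candidates)
    return s_prime, s_prime > tau_3

# A large drop from s_0 to the candidate scores suggests an attack.
s_prime, flagged = detect_adversarial([0.9, 0.2, 0.3], tau_3=0.5)
```

For benign speech the converted candidates score close to the original, so s' stays small and the input passes through to normal verification.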
In another possible implementation, for a plurality of voiceprint verification results, the final attribute state information is obtained based on a minority-compliant voting strategy to detect whether the voice information to be recognized is a countermeasure sample added with disturbance. Specifically, as shown in fig. 7, the method may include:
s2741: and determining first attribute state information corresponding to the first voiceprint verification result and second attribute state information corresponding to the second voiceprint verification result according to a fourth preset threshold, the first voiceprint verification result and the second voiceprint verification result.
The first voiceprint verification result and the second voiceprint verification result are respectively compared with a fourth preset threshold to obtain corresponding first attribute state information and second attribute state information, where each piece of attribute state information indicates whether the corresponding voice information is a countermeasure sample. The fourth preset threshold τ_4 may or may not be equal to the third preset threshold τ_3 in the preceding embodiments.
S2743: and determining final attribute state information according to the first attribute state information and the second attribute state information on the basis of a minority-obeying majority principle, wherein the attribute state information represents whether the voice information to be recognized is a countermeasure sample.
For example, if the number of the attribute state information indicating that the corresponding voice information is a countermeasure sample is greater than the number of the attribute state information indicating that the corresponding voice information is not a countermeasure sample, the final attribute state information indicates that the voice information to be recognized is a countermeasure sample.
Illustratively, when the attribute state information represents that the voice information to be recognized is not a countermeasure sample, a first voiceprint recognition result corresponding to the first voiceprint verification result is taken as a voiceprint recognition result and output.
Fig. 8 shows a flowchart of voiceprint recognition. As shown in fig. 8, in addition to performing voiceprint reservation, the voiceprint recognition method provided in the present application mainly performs batch conversion on the voice information to be recognized (that is, the test voice) to obtain a plurality of candidate voice messages, and inputs the voice information to be recognized and the candidate voice messages together into the voiceprint model for voiceprint verification processing, so as to obtain a first voiceprint verification result corresponding to the voice information to be recognized and second voiceprint verification results corresponding to the candidate voice messages, which together constitute a score group. Voting judgment can then be carried out on the score group to obtain the final voiceprint recognition result or attribute state information corresponding to the voice information to be recognized: the voiceprint recognition result indicates whether the current speaker is the same as the user of the registered voice, and the attribute state information indicates whether the test voice is a countermeasure sample.
When applied to countermeasure defense and detection, the technical scheme provided by the application improves only the input preprocessing stage: no adversarial training of the original voiceprint model is needed, and no dedicated defense module against countermeasure sample attacks is needed; only the conversion processing of the speech to be recognized and the judgment processing of the plurality of voiceprint verification results need to be added to the original voiceprint verification stage. The whole scheme is simple, easy to implement, and highly universal. On the basis of preserving the performance of the original voiceprint model, the technical scheme can implement effective defense and detection against white-box or black-box attacks based on countermeasure samples, thereby improving the safety and reliability of voiceprint recognition.
An embodiment of the present application further provides a voiceprint recognition apparatus 900, as shown in fig. 9, the apparatus 900 may include:
an obtaining module 910, configured to obtain voice information to be recognized;
a conversion module 920, configured to perform conversion processing on the voice information to be recognized to obtain candidate voice information;
a voiceprint verification module 930, configured to input the to-be-recognized voice information and the candidate voice information into a voiceprint model, and perform a voiceprint verification process to obtain a first voiceprint verification result corresponding to the to-be-recognized voice information and a second voiceprint verification result corresponding to the candidate voice information;
an identifying module 940, configured to determine a voiceprint identification result for the to-be-identified voice information according to the first voiceprint authentication result and the second voiceprint authentication result.
In one embodiment of the present application, the voiceprint verification module 930 can include:
the feature extraction unit is used for respectively extracting features of the voice information to be recognized and the candidate voice information to obtain first feature information corresponding to the voice information to be recognized and second feature information corresponding to the candidate voice information;
a feature comparison unit, configured to input the first feature information and the second feature information into the voiceprint model, respectively, compare the voiceprint model with reserved registered voice feature information, and obtain a first voiceprint verification result corresponding to the voice information to be recognized and a second voiceprint verification result corresponding to the candidate voice information; the registered voice feature information corresponds to registered voice information of a registered user.
In one embodiment of the present application, the identifying module 940 may include:
the first mean value calculating unit is used for calculating a first target mean value according to the first voiceprint verification result and the second voiceprint verification result, and the first target mean value represents a mean value of verification scores;
a first determining unit, configured to determine, based on a first preset threshold and the first target mean, the voiceprint recognition result for the to-be-recognized voice information, where the voiceprint recognition result represents whether an object to be recognized, which provides the to-be-recognized voice information, is a registered user.
In an embodiment of the present application, the identifying module 940 may further include:
a second determining unit, configured to determine, according to a second preset threshold, the first voiceprint verification result, and the second voiceprint verification result, a first voiceprint identification result corresponding to the first voiceprint verification result, and a second voiceprint identification result corresponding to the second voiceprint verification result;
a first decision unit, configured to determine a final voiceprint recognition result according to the first voiceprint recognition result and the second voiceprint recognition result based on a minority-compliant principle; and the voiceprint recognition result represents whether the object to be recognized, which provides the voice information to be recognized, is a registered user.
In an embodiment of the present application, the identifying module 940 may further include:
the second mean value calculating unit is used for calculating a second target mean value according to the first voiceprint verification result and the second voiceprint verification result, and the second target mean value represents a mean value of the verification score difference values;
a third determining unit, configured to determine attribute state information of the voice information to be recognized based on a third preset threshold and the second target mean, where the attribute state information represents whether the voice information to be recognized is a countermeasure sample.
In an embodiment of the present application, the identifying module 940 may further include:
a fourth determining unit, configured to determine, according to a fourth preset threshold, the first voiceprint verification result, and the second voiceprint verification result, first attribute state information corresponding to the first voiceprint verification result, and second attribute state information corresponding to the second voiceprint verification result;
and the second judging unit is used for determining final attribute state information according to the first attribute state information and the second attribute state information on the basis of a minority-compliant principle, wherein the attribute state information represents whether the voice information to be recognized is a countermeasure sample.
In one embodiment of the present application, the third determining unit or the fourth determining unit may include:
and the result determining subunit is configured to, when the attribute state information indicates that the to-be-recognized speech information is not a countermeasure sample, use a first voiceprint recognition result corresponding to the first voiceprint verification result as the voiceprint recognition result.
In one embodiment of the present application, the apparatus 900 may further include:
a registered voice acquiring unit configured to acquire the registered voice information of the registered user;
and the voiceprint feature reservation unit is used for inputting the registered voice information into the voiceprint model to obtain the registered voice feature information of the registered user, and the registered voice feature information represents the voiceprint feature of the registered user.
In an embodiment of the present application, the conversion module 920 may include:
the first conversion unit is used for adding noise information in the voice information to be recognized to obtain first candidate voice information;
the second conversion unit is used for adjusting the volume or the speed of the voice information to be recognized to obtain second candidate voice information;
the third conversion unit is used for filtering the voice information to be recognized to obtain third candidate voice information;
and the fourth conversion unit is used for carrying out audio reconstruction on the voice information to be recognized to obtain fourth candidate voice information.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
The embodiment of the present application provides a computer device, which includes a processor and a memory, where the memory stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by the processor to implement a voiceprint recognition method as provided in the above method embodiment.
Fig. 10 is a schematic hardware structure diagram of a device for implementing the voiceprint recognition method provided in the embodiment of the present application, and the device may participate in forming or include the apparatus or system provided in the embodiment of the present application. As shown in fig. 10, the device 10 may include one or more processors 1002 (shown as 1002a, 1002b, …, 1002n; the processors 1002 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 1004 for storing data, and a transmission device 1006 for communication functions. In addition, the device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 10 is merely illustrative and is not intended to limit the structure of the electronic device. For example, device 10 may also include more or fewer components than shown in fig. 10, or have a different configuration than shown in fig. 10.
It should be noted that the one or more processors 1002 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the device 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).
The memory 1004 can be used for storing software programs and modules of application software, such as program instructions/data storage devices corresponding to the methods described in the embodiments of the present application, and the processor 1002 executes various functional applications and data processing by running the software programs and modules stored in the memory 1004, so as to implement a voiceprint recognition method as described above. The memory 1004 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1004 may further include memory located remotely from the processor 1002, which may be connected to the device 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 1006 is used for receiving or sending data via a network. Specific examples of such networks may include wireless networks provided by the communication provider of the device 10. In one example, the transmission device 1006 includes a Network adapter (NIC) that can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 1006 can be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the device 10 (or mobile device).
The present application further provides a computer-readable storage medium, which may be disposed in a server and stores at least one instruction or at least one program for implementing the voiceprint recognition method of the method embodiments; the at least one instruction or the at least one program is loaded and executed by a processor to implement the voiceprint recognition method provided in the method embodiments.
Alternatively, in this embodiment, the storage medium may be located in at least one of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
Embodiments of the present invention also provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute a voiceprint recognition method provided in the various alternative embodiments described above.
As can be seen from the above embodiments of the voiceprint recognition method, apparatus, medium and device provided by the present application, the voice information to be recognized is converted to obtain candidate voice information, both the voice information to be recognized and the candidate voice information are input into the voiceprint model for voiceprint verification processing, and the voiceprint recognition result for the voice information to be recognized is then determined from the first voiceprint verification result (corresponding to the voice information to be recognized) and the second voiceprint verification result (corresponding to the candidate voice information).

When applied to adversarial defense and detection, the technical scheme provided by the present application improves only the input preprocessing stage: it requires neither adversarial training of the original voiceprint model nor a dedicated defense module against adversarial-example attacks. Only a conversion step for the speech to be recognized and a decision step over multiple voiceprint verification results are added to the original voiceprint verification stage, so the whole scheme is simple to implement and highly general. On the basis of preserving the performance of the original voiceprint model, it can effectively defend against and detect both white-box and black-box attacks based on adversarial examples, thereby improving the security and reliability of voiceprint recognition.
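The verification flow described above can be sketched as follows. This is an illustrative sketch, not code from the patent: `score_fn` (a generic speaker-verification scoring function), `enrolled` (the registered voiceprint), and the noise parameters are all assumptions, and Gaussian noise stands in for any of the conversions mentioned in the embodiments.

```python
import numpy as np

def convert(audio: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # One possible conversion from the embodiments: add low-amplitude
    # Gaussian noise (volume/speaking-rate changes, filtering and audio
    # reconstruction are the other options mentioned).
    return audio + rng.normal(0.0, 0.005, size=audio.shape)

def recognize(audio, score_fn, enrolled, threshold, rng=None):
    # Score the utterance and its converted copy against the enrolled
    # voiceprint, then accept only if the mean verification score clears
    # the threshold (the "first target mean value" decision).
    rng = rng if rng is not None else np.random.default_rng(0)
    s1 = score_fn(audio, enrolled)                 # first verification result
    s2 = score_fn(convert(audio, rng), enrolled)   # second verification result
    return (s1 + s2) / 2.0 >= threshold
```

A benign utterance scores similarly before and after conversion, so the mean barely changes; an adversarial perturbation is typically brittle under conversion, pulling the mean down below the threshold.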
It should be noted that the order in which the embodiments of the present application are presented is for description only and does not imply that any embodiment is preferred. Specific embodiments have been described above; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible and may be advantageous.
The embodiments in the present application are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, device and storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disk.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (13)

1. A voiceprint recognition method, the method comprising:
acquiring voice information to be recognized;
converting the voice information to be recognized to obtain candidate voice information;
respectively inputting the voice information to be recognized and the candidate voice information into a voiceprint model, and performing voiceprint verification processing to obtain a first voiceprint verification result corresponding to the voice information to be recognized and a second voiceprint verification result corresponding to the candidate voice information;
and determining a voiceprint recognition result for the voice information to be recognized according to the first voiceprint verification result and the second voiceprint verification result.
2. The method according to claim 1, wherein the inputting the voice information to be recognized and the candidate voice information into a voiceprint model respectively, and performing voiceprint verification processing to obtain a first voiceprint verification result corresponding to the voice information to be recognized and a second voiceprint verification result corresponding to the candidate voice information comprises:
respectively extracting features of the voice information to be recognized and the candidate voice information to obtain first feature information corresponding to the voice information to be recognized and second feature information corresponding to the candidate voice information;
respectively inputting the first feature information and the second feature information into the voiceprint model, and comparing each with stored registered voice feature information to obtain a first voiceprint verification result corresponding to the voice information to be recognized and a second voiceprint verification result corresponding to the candidate voice information; wherein the registered voice feature information corresponds to registered voice information of a registered user.
3. The method according to claim 1, wherein the first voiceprint verification result and the second voiceprint verification result are verification scores, and the determining a voiceprint recognition result for the voice information to be recognized according to the first voiceprint verification result and the second voiceprint verification result comprises:
calculating a first target mean value according to the first voiceprint verification result and the second voiceprint verification result, wherein the first target mean value represents a mean value of the verification scores;
and determining the voiceprint recognition result for the voice information to be recognized based on a first preset threshold and the first target mean value, wherein the voiceprint recognition result represents whether an object to be recognized, which provides the voice information to be recognized, is a registered user.
4. The method according to claim 1, wherein the determining a voiceprint recognition result for the voice information to be recognized according to the first voiceprint verification result and the second voiceprint verification result further comprises:
determining a first voiceprint identification result corresponding to the first voiceprint verification result and a second voiceprint identification result corresponding to the second voiceprint verification result according to a second preset threshold, the first voiceprint verification result and the second voiceprint verification result;
determining a final voiceprint recognition result from the first voiceprint recognition result and the second voiceprint recognition result on a majority-rule basis; wherein the voiceprint recognition result represents whether the object to be recognized, which provides the voice information to be recognized, is a registered user.
5. The method of claim 1, further comprising:
calculating a second target mean value according to the first voiceprint verification result and the second voiceprint verification result, wherein the second target mean value represents a mean value of verification score differences;
determining attribute state information of the voice information to be recognized based on a third preset threshold and the second target mean value, wherein the attribute state information represents whether the voice information to be recognized is an adversarial sample.
6. The method of claim 1, further comprising:
determining first attribute state information corresponding to the first voiceprint verification result and second attribute state information corresponding to the second voiceprint verification result according to a fourth preset threshold, the first voiceprint verification result and the second voiceprint verification result;
and determining final attribute state information from the first attribute state information and the second attribute state information on a majority-rule basis, wherein the attribute state information represents whether the voice information to be recognized is an adversarial sample.
7. The method of claim 5 or 6, further comprising:
and when the attribute state information indicates that the voice information to be recognized is not an adversarial sample, taking a first voiceprint recognition result corresponding to the first voiceprint verification result as the voiceprint recognition result.
8. The method of claim 2, wherein prior to obtaining the speech information to be recognized, the method further comprises:
acquiring the registration voice information of the registered user;
and inputting the registered voice information into the voiceprint model to obtain the registered voice characteristic information of the registered user, wherein the registered voice characteristic information represents the voiceprint characteristics of the registered user.
9. The method of claim 1, wherein the converting the voice information to be recognized to obtain candidate voice information comprises:
adding noise information to the voice information to be recognized to obtain first candidate voice information;
or adjusting the volume or the speaking rate of the voice information to be recognized to obtain second candidate voice information;
or filtering the voice information to be recognized to obtain third candidate voice information;
or performing audio reconstruction on the voice information to be recognized to obtain fourth candidate voice information.
10. A voiceprint recognition apparatus, said apparatus comprising:
the acquisition module is used for acquiring the voice information to be recognized;
the conversion module is used for carrying out conversion processing on the voice information to be recognized to obtain candidate voice information;
a voiceprint verification module, configured to input the to-be-recognized voice information and the candidate voice information into a voiceprint model respectively, and perform voiceprint verification processing to obtain a first voiceprint verification result corresponding to the to-be-recognized voice information and a second voiceprint verification result corresponding to the candidate voice information;
and the recognition module is used for determining a voiceprint recognition result aiming at the voice information to be recognized according to the first voiceprint verification result and the second voiceprint verification result.
11. A computer-readable storage medium, having stored therein at least one instruction or at least one program, which is loaded and executed by a processor to implement a voiceprint recognition method as claimed in any one of claims 1 to 9.
12. A computer device comprising a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and wherein the at least one instruction or the at least one program is loaded and executed by the processor to implement a voiceprint recognition method as claimed in any one of claims 1 to 9.
13. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by a processor, implement a voiceprint recognition method according to any one of claims 1 to 9.
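The decision rules recited in the claims above can be sketched in a few lines. This is an illustrative sketch rather than the patented implementation: the function names `majority_vote` and `looks_adversarial` are hypothetical, the scores are placeholders, and the thresholds would have to be calibrated on real verification scores.

```python
from statistics import mean

def majority_vote(scores, threshold):
    # Claim 4 style: threshold each verification score individually,
    # then let the majority of per-score decisions determine the result.
    votes = [s >= threshold for s in scores]
    return sum(votes) > len(votes) / 2

def looks_adversarial(original_score, candidate_scores, diff_threshold):
    # Claims 5-6 style: an adversarial perturbation tends to lose its
    # effect after conversion, so a large mean drop between the original
    # score and the converted candidates' scores flags the input.
    diffs = [original_score - s for s in candidate_scores]
    return mean(diffs) > diff_threshold
```

Both rules consume only the verification scores already produced by the voiceprint model, which is why the scheme needs no retraining and no separate defense module.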
CN202111550073.4A 2021-12-17 2021-12-17 Voiceprint recognition method, voiceprint recognition device, voiceprint recognition medium and voiceprint recognition equipment Pending CN114333844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111550073.4A CN114333844A (en) 2021-12-17 2021-12-17 Voiceprint recognition method, voiceprint recognition device, voiceprint recognition medium and voiceprint recognition equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111550073.4A CN114333844A (en) 2021-12-17 2021-12-17 Voiceprint recognition method, voiceprint recognition device, voiceprint recognition medium and voiceprint recognition equipment

Publications (1)

Publication Number Publication Date
CN114333844A true CN114333844A (en) 2022-04-12

Family

ID=81052156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111550073.4A Pending CN114333844A (en) 2021-12-17 2021-12-17 Voiceprint recognition method, voiceprint recognition device, voiceprint recognition medium and voiceprint recognition equipment

Country Status (1)

Country Link
CN (1) CN114333844A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116013323A (en) * 2022-12-27 2023-04-25 浙江大学 Active evidence obtaining method oriented to voice conversion


Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN112468659B (en) Quality evaluation method, device, equipment and storage medium applied to telephone customer service
CN112989108B (en) Language detection method and device based on artificial intelligence and electronic equipment
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN108877823A (en) Sound enhancement method and device
CN111667835A (en) Voice recognition method, living body detection method, model training method and device
CN115602165B (en) Digital employee intelligent system based on financial system
CN115798518B (en) Model training method, device, equipment and medium
CN113823293B (en) Speaker recognition method and system based on voice enhancement
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
Zhang et al. Imperceptible black-box waveform-level adversarial attack towards automatic speaker recognition
Xue et al. Physiological-physical feature fusion for automatic voice spoofing detection
CN111402922A (en) Audio signal classification method, device, equipment and storage medium based on small samples
CN114333844A (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition medium and voiceprint recognition equipment
CN114338623A (en) Audio processing method, device, equipment, medium and computer program product
CN111932056A (en) Customer service quality scoring method and device, computer equipment and storage medium
CN116312559A (en) Training method of cross-channel voiceprint recognition model, voiceprint recognition method and device
Mankad et al. On the performance of empirical mode decomposition-based replay spoofing detection in speaker verification systems
Chuchra et al. A deep learning approach for splicing detection in digital audios
CN111477248B (en) Audio noise detection method and device
CN116486789A (en) Speech recognition model generation method, speech recognition method, device and equipment
CN113571063A (en) Voice signal recognition method and device, electronic equipment and storage medium
CN116935889B (en) Audio category determining method and device, electronic equipment and storage medium
CN114333802B (en) Speech processing method, device, electronic equipment and computer readable storage medium
CN115223570A (en) Speaker verification method based on deep neural network, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221214

Address after: 518055 Tsinghua Park, Xili University Town, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen International Graduate School of Tsinghua University

Applicant after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Address before: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Applicant before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.
