CN112786058B - Voiceprint model training method, voiceprint model training device, voiceprint model training equipment and storage medium - Google Patents
- Publication number
- CN112786058B (application CN202110263981.9A)
- Authority
- CN
- China
- Prior art keywords
- score
- voiceprint model
- speaker
- noise
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
All under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING; G10L17/00—Speaker identification or verification techniques:
- G10L17/04—Training, enrolment or model building
- G10L17/12—Score normalisation (under G10L17/06—Decision making techniques; Pattern matching strategies)
- G10L17/14—Use of phonemic categorisation or speech recognition prior to speaker recognition or verification (under G10L17/06)
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The application discloses a voiceprint model training method, apparatus, device, and storage medium, relating to artificial intelligence fields such as speech recognition and deep learning. One embodiment of the method comprises: acquiring a training sample set, wherein the training sample set comprises audio of a plurality of sample speakers; extracting speech features from the audio of the plurality of sample speakers; inputting the speech features into a voiceprint model to obtain the score of the speaker to which the speech features belong and the scores of noise; and training the voiceprint model based on the score of the speaker to which the speech features belong and the scores of part of the noise. This embodiment provides a noise-proportion-based training mode for the voiceprint model, which reduces the computation of model training and improves training efficiency.
Description
Technical Field
Embodiments of the application relate to the field of computers, in particular to artificial intelligence fields such as speech recognition and deep learning, and more particularly to a voiceprint model training method, apparatus, device, and storage medium.
Background
In fields whose audio involves a large number of different speakers, it is generally desirable to train and optimize a voiceprint model on the stored audio and then build a large voiceprint library for voiceprint comparison and retrieval, which helps advance services in the field. With the rapid development of society and the Internet, audio from a large number of speakers accumulates in such fields. How to train a voiceprint model with good performance on such a huge data volume has become a problem to be solved in the voiceprint field.
Disclosure of Invention
Embodiments of the application provide a voiceprint model training method, apparatus, device, and storage medium.
In a first aspect, an embodiment of the present application provides a voiceprint model training method, including: acquiring a training sample set, wherein the training sample set comprises audio of a plurality of sample speakers; extracting speech features from the audio of the plurality of sample speakers; inputting the speech features into a voiceprint model to obtain the score of the speaker to which the speech features belong and the scores of noise; and training the voiceprint model based on the score of the speaker to which the speech features belong and the scores of part of the noise.
In a second aspect, an embodiment of the present application provides a voiceprint model training apparatus, including: an acquisition module configured to acquire a training sample set, wherein the training sample set includes audio of a plurality of sample speakers; an extraction module configured to extract speech features from the audio of the plurality of sample speakers; a recognition module configured to input the speech features into the voiceprint model to obtain the score of the speaker to which the speech features belong and the scores of noise; and a training module configured to train the voiceprint model based on the score of the speaker to which the speech features belong and the scores of part of the noise.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as described in any implementation of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any implementation of the first aspect.
The voiceprint model training method, apparatus, device, and storage medium provided by embodiments of the application first extract the speech features of the audio of a plurality of sample speakers in a training sample set; then input the speech features into a voiceprint model to obtain the score of the speaker to which the speech features belong and the scores of noise; and finally train the voiceprint model based on the score of the speaker to which the speech features belong and the scores of part of the noise. Training the voiceprint model with this noise-proportion-based training mode reduces the computation of model training and improves training efficiency.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a voiceprint model training method according to the present application;
FIG. 3 is a flow chart of yet another embodiment of a voiceprint model training method according to the present application;
FIG. 4 is a diagram of an application scenario in which the voiceprint model training method of an embodiment of the present application may be implemented;
FIG. 5 is a schematic structural view of one embodiment of a voiceprint model training apparatus according to the present application;
FIG. 6 is a block diagram of an electronic device for implementing a voiceprint model training method of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of a voiceprint model training method or voiceprint model training apparatus of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or transmit video frames or the like. Various client applications, such as a voice recording application, a voiceprint model training application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smartphones, tablets, laptop computers, and desktop computers. When they are software, they may be installed in the electronic devices listed above, and each may be implemented as multiple pieces of software or software modules, or as a single piece of software or a single module. No specific limitation is imposed here.
The server 105 may provide various services. For example, the server 105 may analyze and process a training sample set obtained from the terminal devices 101, 102, 103 and generate processing results (e.g., voiceprint models).
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. When server 105 is software, it may be implemented as a plurality of software or software modules (e.g., to provide distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that, the voiceprint model training method provided in the embodiment of the present application is generally executed by the server 105, and accordingly, the voiceprint model training device is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a voiceprint model training method according to the present application is shown. The voiceprint model training method comprises the following steps:
step 201, a training sample set is obtained.
In this embodiment, the execution subject of the voiceprint model training method may acquire a training sample set.
The training sample set may include audio of a plurality of sample speakers. Each sample speaker may contribute at least one audio segment, and each segment is labeled with its sample speaker. For example, a training sample set may include audio from one million sample speakers, each sample speaker contributing 5 audio segments, so the training sample set includes five million audio segments in total.
It should be noted that, in the technical solution of the present application, the acquisition, storage, and use of the personal information involved all comply with relevant laws and regulations and do not violate public order and good customs.
Step 202, extracting speech features of audio of a plurality of sample speakers.
In this embodiment, for each piece of audio of each sample speaker, the execution body may extract the corresponding voice feature.
The speech features may include, but are not limited to, time-domain and frequency-domain features of the speech signal. In the time domain, the speech signal can be represented directly by its waveform, and its time-domain characteristics can be analyzed with measures such as short-time energy and the short-time zero-crossing rate. Frequency-domain analysis examines the spectral characteristics of the signal; Fourier analysis is the most common method. Because the speech signal is a non-stationary process, a short-time Fourier transform is required for spectral analysis. Formant characteristics, pitch frequency, and harmonic frequencies can be observed in the spectrum of the speech signal.
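As a minimal illustration of the two time-domain measures mentioned above, the sketch below computes short-time energy and the short-time zero-crossing rate for a single frame of samples. The function names and the toy sine-wave frame are hypothetical, for illustration only; they are not part of the patent's implementation.

```python
import math

def short_time_energy(frame):
    """Short-time energy: sum of squared samples within one frame."""
    return sum(s * s for s in frame)

def short_time_zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

# Toy 8-sample "frame": one period of a sine wave.
frame = [math.sin(2 * math.pi * k / 8) for k in range(8)]
energy = short_time_energy(frame)
zcr = short_time_zero_crossing_rate(frame)
```

In practice these measures are computed per frame over a sliding window, giving a contour that helps distinguish voiced speech from silence or noise.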
Step 203, inputting the voice feature into the voiceprint model to obtain the score of the speaker to which the voice feature belongs and the score of the noise.
In this embodiment, the executing entity may input the speech features into the voiceprint model to obtain the score of the speaker to which the speech features belong and the scores of noise.
The voiceprint model may include a plurality of output nodes. Typically, one output node corresponds to one sample speaker, and the number of output nodes is not greater than the number of sample speakers in the training sample set. The output node corresponding to the speaker to which the input speech features belong outputs the score of that speaker; every other output node outputs a noise score.
Step 204, training a voiceprint model based on the score of the speaker to which the speech feature belongs and the score of the partial noise.
In this embodiment, the executing body may select the score of the partial noise, and train the voiceprint model based on the score of the speaker to which the voice feature belongs and the score of the partial noise.
Generally, the executing entity may update the network parameters of the voiceprint model based on the score of the speaker to which the speech features belong and the scores of the partial noise, and end training when a preset end condition is satisfied, yielding a stable voiceprint model. For example, the training end condition may include, but is not limited to, at least one of: the training time exceeds a preset duration, the number of training iterations exceeds a preset number, the score of the speaker to which the speech features belong is sufficiently high (e.g., greater than a first preset threshold), the scores of the partial noise are sufficiently low (e.g., less than a second preset threshold), and so on.
In practice, the executing entity may train the voiceprint model on the training sample set in batches. For example, for a training sample set containing audio from one million sample speakers, the audio of 10 sample speakers may be selected for each iteration. After many iterations, the voiceprint model learns the characteristics of different speakers and can be used for voiceprint recognition, i.e., speaker recognition, including speaker identification and speaker verification. The voiceprint model can be applied in fields whose audio involves a large number of speakers, including but not limited to public security, technical investigation, criminal investigation, customs, traffic management, banking, insurance, and Internet social applications. A large voiceprint library can be built with the voiceprint model for voiceprint comparison and retrieval, facilitating many services in these fields, such as monitoring of key persons in public security, case detection in technical investigation, identity verification of customs passengers, and customer identity verification in banking transactions.
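The batching scheme above can be sketched as follows. This is a hedged sketch, not the patent's implementation: the function name, the fixed seed, and the toy scale (100 speakers, batches of 10) are assumptions for illustration.

```python
import random

def speaker_batches(speaker_ids, batch_size, seed=0):
    """Shuffle the speaker IDs and yield fixed-size batches, one batch
    of speakers' audio per training iteration."""
    ids = list(speaker_ids)
    random.Random(seed).shuffle(ids)
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

# Toy scale: 100 speakers, 10 speakers of audio per iteration.
batches = list(speaker_batches(range(100), 10))
```

One epoch over the training sample set then corresponds to consuming all batches once; reshuffling with a new seed between epochs varies the batch composition.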
According to the voiceprint model training method provided by embodiments of the application, the speech features of the audio of a plurality of sample speakers in a training sample set are first extracted; the speech features are then input into a voiceprint model to obtain the score of the speaker to which the speech features belong and the scores of noise; and finally the voiceprint model is trained based on the score of the speaker to which the speech features belong and the scores of part of the noise. Training the voiceprint model with this noise-proportion-based training mode reduces the computation of model training, supports training with audio from more sample speakers, allows training to complete in a shorter time, and improves training efficiency. The method is applicable to voiceprint model training in large-scale speaker scenarios.
With further reference to fig. 3, a flow 300 of yet another embodiment of a voiceprint model training method according to the present application is shown. The voiceprint model training method comprises the following steps:
step 301, a training sample set is obtained.
In this embodiment, the specific operation of step 301 is described in detail in step 201 in the embodiment shown in fig. 2, and will not be described herein.
Step 302, transform the audio of the multiple sample speaker from the time domain to the frequency domain and extract speech features on the frequency domain.
In this embodiment, the execution body of the voiceprint model training method may transform each piece of audio of each sample speaker from the time domain to the frequency domain and extract the speech features on the frequency domain.
The speech features here are frequency-domain features of the speech signal, including but not limited to at least one of: MFCC (Mel-scale Frequency Cepstral Coefficient), PLP (Perceptual Linear Prediction), and FBank (Filter Bank). MFCC is a cepstral parameter extracted on the Mel-scale frequency axis and is widely used in automatic speech and speaker recognition; its extraction procedure typically includes pre-emphasis, framing, windowing, the short-time Fourier transform, Mel filter-bank analysis, taking the logarithm, and the discrete cosine transform. PLP is based on linear prediction: the basic idea is that a speech sample can be approximated by a linear combination of past speech samples, and a unique set of prediction coefficients can be obtained by fitting the actual samples in the minimum-mean-square-error sense.
Step 303, input the speech features into the Xvector to obtain the score output by the output node corresponding to the speaker to which the speech features belong and the scores output by the output nodes corresponding to noise.
In this embodiment, the voiceprint model may be an Xvector. The network structure of a typical Xvector comprises, in order, frame-level layers, a statistics pooling layer, segment-level layers, and a softmax activation layer. The executing entity may input the speech features into the Xvector, which performs a forward computation; at the final softmax layer, the score output by the output node corresponding to the speaker to which the speech features belong and the scores output by the output nodes corresponding to noise are calculated.
The output nodes of the Xvector correspond one-to-one to the sample speakers of the training sample set. All output nodes other than the one corresponding to the speaker to which the speech features belong are treated as output nodes corresponding to noise. In this way, the original multi-class problem is converted into a binary one: the multi-class problem asks which of the many sample speakers a training sample belongs to, but since each training sample belongs to exactly one speaker and thus corresponds to exactly one output node, all remaining output nodes can collectively be regarded as noise nodes, yielding a binary classification problem.
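The target-versus-noise split described above can be sketched as a simple partition of the model's output vector. The function name and the toy 6-node score vector are hypothetical, for illustration only.

```python
def split_scores(output_scores, speaker_index):
    """Given one score per output node, return the target speaker's
    score and the scores of all remaining ("noise") output nodes."""
    speaker_score = output_scores[speaker_index]
    noise_scores = [
        s for i, s in enumerate(output_scores) if i != speaker_index
    ]
    return speaker_score, noise_scores

# Toy model with 6 output nodes; the sample belongs to speaker 2.
scores = [0.1, -0.4, 2.3, 0.0, -1.2, 0.7]
speaker_score, noise_scores = split_scores(scores, 2)
```

At real scale the noise list has roughly as many entries as there are sample speakers, which is why the later steps subsample it rather than use it whole.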
Step 304, estimating a noise prior distribution based on the training sample set.
In this embodiment, the execution subject may estimate the noise prior distribution based on the training sample set.
The training sample set includes audio of a plurality of sample speakers, and one output node may correspond to one sample speaker. All output nodes other than the one corresponding to the speaker to which the input speech features belong are output nodes corresponding to noise. The scores output by the noise nodes are approximately uniformly distributed, so the noise prior distribution can be taken as uniform.
In step 305, scores of partial noise are selected based on the noise prior distribution.
In this embodiment, the executing entity may select the scores of partial noise based on the noise prior distribution, such that the selected noise scores satisfy that distribution. That is, the noise scores are selected at random under the constraint that the prior distribution is preserved. For example, the noise scores are sorted in descending (or ascending) order, the sorted scores are divided evenly into several intervals (e.g., 5 intervals), and a fixed number of noise scores (e.g., 50,000) is randomly selected from each interval. The selected noise scores (e.g., 250,000 in total) then also satisfy the uniform distribution.
Step 306, the score of the speaker to which the voice feature belongs and the score of the partial noise are input into the loss function, and the loss value is calculated.
In this embodiment, the executing body may input the score of the speaker to which the voice feature belongs and the score of the partial noise into the loss function, and calculate the loss value.
In general, the exponentiated score of the speaker to which the speech features belong and the exponentiated scores of the partial noise may be summed to obtain a normalization factor; a softmax calculation then yields the probability of each sample speaker, and finally a loss value is calculated according to a chosen target criterion. The loss function may be, for example, cross entropy.
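The loss computation above amounts to a softmax cross-entropy taken over the target score plus only the sampled noise scores. The sketch below shows this under assumed names; the max-subtraction is a standard numerical-stability trick and is not taken from the patent text.

```python
import math

def sampled_softmax_loss(speaker_score, sampled_noise_scores):
    """Cross-entropy over the target score and a sampled subset of
    noise scores: the sum of exponentiated scores is the normalization
    factor, and the loss is the negative log-probability of the target."""
    m = max([speaker_score] + sampled_noise_scores)  # stabilize exp()
    exp_target = math.exp(speaker_score - m)
    z = exp_target + sum(math.exp(s - m) for s in sampled_noise_scores)
    return -math.log(exp_target / z)

loss = sampled_softmax_loss(2.0, [0.5, -1.0, 0.0])
```

Raising the target score, or lowering the sampled noise scores, drives the loss toward zero, which matches the training end conditions stated earlier.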
And step 307, updating network parameters of the voiceprint model based on the loss value until the voiceprint model converges.
In this embodiment, the executing entity may update the network parameters of the voiceprint model based on the loss value until the voiceprint model converges.
Here, various implementations may be used to adjust the network parameters of the voiceprint model based on the loss value, for example Stochastic Gradient Descent (SGD), Newton's method, quasi-Newton methods, the conjugate gradient method, heuristic optimization methods, and other optimization algorithms known now or developed in the future. Typically, the loss value after an adjustment is smaller than the loss value before it, and updates continue until the loss value is sufficiently small and the voiceprint model converges.
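Of the optimizers listed, plain SGD has the simplest update rule; a one-step sketch with assumed names and toy values follows, purely for illustration.

```python
def sgd_step(params, grads, lr=0.01):
    """One stochastic-gradient-descent update: move each parameter a
    small step against its gradient of the loss."""
    return [p - lr * g for p, g in zip(params, grads)]

# Toy parameter vector and gradients of the loss with respect to it.
params = [1.0, -2.0, 0.5]
grads = [0.2, -0.4, 0.0]
updated = sgd_step(params, grads, lr=0.1)
```

In the sampled-noise setting, only the rows of the output layer touched by the target speaker and the selected noise nodes receive nonzero gradients, which is where the per-update saving comes from.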
For example, for a training sample set containing audio from one million sample speakers, the Xvector includes one million output nodes. In each training iteration, only the scores output by the output nodes corresponding to 250,000 noise classes need to be selected to compute the loss value. This reduces the cost and computation of each update, lowers latency, and lets the Xvector converge in a shorter time.
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the voiceprint model training method in this embodiment highlights the noise-score selection step and the Xvector training step. The scheme described in this embodiment therefore combines a given noise prior distribution and selects only the scores output by the output nodes corresponding to part of the noise, using them together with the score of the speaker to which the speech features belong in place of the scores output by all output nodes when computing the loss value. Because the scores output by the selected noise nodes and the scores output by all noise nodes satisfy the same noise prior distribution, an equivalent training effect can be achieved.
For ease of understanding, fig. 4 illustrates an application scenario in which the voiceprint model training method of an embodiment of the present application may be implemented. As shown in fig. 4, feature extraction is performed on the original audio to obtain speech features. The speech features are input into an Xvector network, which computes the score of the original audio for the target speaker and the scores of the original audio for the speakers in the noise set. Scores of part of the speakers in the noise set are selected based on the pre-estimated noise prior distribution and used together with the target speaker's score to compute the loss. Finally, the network is updated by back-propagation based on the loss until convergence, completing model training.
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of a voiceprint model training apparatus, where an embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 5, the voiceprint model training apparatus 500 of the present embodiment may include: an acquisition module 501, an extraction module 502, an identification module 503, and a training module 504. Wherein the obtaining module 501 is configured to obtain a training sample set, wherein the training sample set comprises audio of a plurality of sample speakers; an extraction module 502 configured to extract speech features of the audio of the plurality of sample speakers; the recognition module 503 is configured to input the voice features into the voiceprint model to obtain the score of the speaker to which the voice features belong and the score of the noise; a training module 504 configured to train a voiceprint model based on the score of the speaker to which the speech feature belongs and the score of the partial noise.
In this embodiment, in the voiceprint model training apparatus 500: the specific processes of the obtaining module 501, the extracting module 502, the identifying module 503 and the training module 504 and the technical effects thereof may refer to the relevant descriptions of the steps 201 to 204 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some alternative implementations of the present embodiment, the extraction module 502 is further configured to: transforming the audio of the plurality of sample speakers from the time domain to the frequency domain, and extracting speech features on the frequency domain, wherein the speech features include at least one of: mel-frequency cepstrum coefficient MFCC, perceptual linear prediction PLP, filter bank FBank.
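As a concrete illustration of the time-domain-to-frequency-domain transform described above, the following is a minimal numpy sketch of FBank (log-Mel filter-bank) extraction. The frame length, hop size, and filter count are illustrative defaults, not values fixed by the present application; MFCCs would additionally apply a DCT on top of these features.

```python
import numpy as np

def log_fbank(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Minimal log-Mel filter-bank (FBank) features: frame the waveform,
    take the magnitude spectrum per frame, then pool each spectrum with
    triangular Mel-scale filters."""
    # Frame the signal and apply a Hamming window to each frame.
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hamming(n_fft), axis=1))
    # Build n_mels triangular filters evenly spaced on the Mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return np.log(spec @ fb.T + 1e-8)   # shape: (num_frames, n_mels)
```

One second of 16 kHz audio with these defaults yields a (97, 40) feature matrix, which is the kind of frequency-domain representation fed to the voiceprint model.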
In some alternative implementations of the present embodiment, the voiceprint model is Xvector; and the identification module 503 is further configured to: and inputting the voice characteristics into the Xvector to obtain the score output by the output nodes corresponding to the speakers to which the voice characteristics belong and the score output by the output nodes corresponding to the noise, wherein the output nodes of the Xvector are in one-to-one correspondence with the sample speakers corresponding to the training sample set, and the output nodes except the output nodes corresponding to the speakers to which the voice characteristics belong are the output nodes corresponding to the noise.
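The one-to-one mapping between output nodes and sample speakers can be illustrated as follows. This sketch assumes the Xvector output layer reduces to a linear scoring matrix over an utterance embedding; every node other than the target speaker's node is treated as a noise node, as the implementation above describes.

```python
import numpy as np

def node_scores(embedding, output_weights, target_speaker):
    """Score every output node for one utterance.

    `output_weights` has one row per output node, corresponding one-to-one
    to the sample speakers of the training set. The target speaker's node
    gives the speaker score; all remaining nodes give the noise scores.
    """
    scores = output_weights @ embedding            # one score per node
    target_score = scores[target_speaker]
    noise_scores = np.delete(scores, target_speaker)
    return target_score, noise_scores
```

The returned pair corresponds directly to "the score output by the output node corresponding to the speaker" and "the scores output by the output nodes corresponding to noise".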
In some optional implementations of this embodiment, the voiceprint model training apparatus 500 further includes: an estimation module configured to estimate a noise prior distribution based on the training sample set; and a selection module configured to select a fraction of the partial noise based on the noise prior distribution.
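The estimation and selection modules above can be sketched as below. The patent does not fix a particular estimator, so the counting scheme here (how often each node is the strongest competing node on the training set) is one plausible choice, labeled as an assumption.

```python
import numpy as np

def estimate_noise_prior(score_matrix, labels):
    """Estimate a noise prior over speakers from per-utterance node scores:
    count how often each node is the strongest *wrong* node. This counting
    estimator is illustrative; the application only requires that some
    prior be estimated from the training sample set."""
    n_utts, n_spk = score_matrix.shape
    counts = np.zeros(n_spk)
    for i in range(n_utts):
        s = score_matrix[i].copy()
        s[labels[i]] = -np.inf          # mask the true speaker's node
        counts[int(np.argmax(s))] += 1  # strongest competing (noise) node
    return (counts + 1) / (counts + 1).sum()   # add-one smoothed prior

def select_noise_scores(noise_scores, noise_ids, prior, k, rng):
    """Select k noise scores by sampling node indices under the prior."""
    p = prior[noise_ids] / prior[noise_ids].sum()
    pick = rng.choice(len(noise_ids), size=k, replace=False, p=p)
    return noise_scores[pick]
```

The selected subset of noise scores is what the training module consumes as "the score of the partial noise".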
In some alternative implementations of the present embodiment, training module 504 is further configured to: inputting the score of the speaker to which the voice feature belongs and the score of part of noise into a loss function, and calculating to obtain a loss value; and updating network parameters of the voiceprint model based on the loss value until the voiceprint model converges.
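Since the application names only "a loss function" without fixing its form, the following illustrates one concrete choice: a cross-entropy computed over the target speaker's score and the selected partial noise scores only (a sampled-softmax-style objective).

```python
import numpy as np

def partial_softmax_loss(target_score, partial_noise_scores):
    """Cross-entropy over the target score plus a selected subset of noise
    scores. The exact loss is not specified by the application; this is
    one illustrative instantiation."""
    z = np.concatenate(([target_score], partial_noise_scores))
    z = z - z.max()                       # shift for numerical stability
    return float(-z[0] + np.log(np.exp(z).sum()))
```

The loss equals log(N+1) when all N+1 scores are equal and approaches zero as the target score dominates the selected noise scores, which is the behavior the network-parameter updates drive toward convergence.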
According to embodiments of the present application, there is also provided an electronic device, a readable storage medium and a computer program product.
Fig. 6 shows a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as the voiceprint model training method. For example, in some embodiments, the voiceprint model training method can be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by computing unit 601, one or more steps of the voiceprint model training method described above can be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the voiceprint model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present application may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.
In the technical solution of the present application, the acquisition, storage, and application of the user personal information involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
Claims (10)
1. A voiceprint model training method comprising:
acquiring a training sample set, wherein the training sample set comprises audios of a plurality of sample speakers;
extracting speech features of the audio of the plurality of sample speakers;
inputting the voice features into a voiceprint model to obtain the score of a speaker to which the voice features belong and the score of noise;
training the voiceprint model based on the score of the speaker to which the speech feature belongs and the score of the partial noise;
wherein the voiceprint model is an Xvector; and
the step of inputting the voice features into a voiceprint model to obtain the score of the speaker to which the voice features belong and the score of noise, comprising:
inputting the voice characteristics into an Xvector to obtain the score output by an output node corresponding to a speaker to which the voice characteristics belong and the score output by an output node corresponding to noise, wherein the output nodes of the Xvector are in one-to-one correspondence with sample speakers corresponding to the training sample set, and the output nodes except the output nodes corresponding to the speaker to which the voice characteristics belong are output nodes corresponding to noise;
after the voiceprint model meets a preset training ending condition, inputting the voice features into the voiceprint model, wherein a sample speaker corresponding to an output node with the output score larger than a first preset threshold value is a speaker to which the input voice features belong.
2. The method of claim 1, wherein the extracting speech features of the audio of the plurality of sample speakers comprises:
transforming the audio of the plurality of sample speakers from the time domain to the frequency domain, and extracting the speech features on the frequency domain, wherein the speech features include at least one of: mel-frequency cepstrum coefficient MFCC, perceptual linear prediction PLP, filter bank FBank.
3. The method of claim 1, wherein prior to the training the voiceprint model based on the score of the speaker to which the speech feature belongs and the score of the portion of noise, further comprising:
estimating a noise prior distribution based on the training sample set;
and selecting the fraction of the partial noise based on the noise prior distribution.
4. The method of claim 1, wherein the training the voiceprint model based on the score of the speaker to which the speech feature belongs and the score of the partial noise comprises:
inputting the score of the speaker to which the voice feature belongs and the score of the partial noise into a loss function, and calculating to obtain a loss value;
and updating network parameters of the voiceprint model based on the loss value until the voiceprint model converges.
5. A voiceprint model training apparatus comprising:
an acquisition module configured to acquire a training sample set, wherein the training sample set includes audio of a plurality of sample speakers;
an extraction module configured to extract speech features of the audio of the plurality of sample speakers;
the recognition module is configured to input the voice characteristics into a voiceprint model to obtain the score of a speaker to which the voice characteristics belong and the score of noise;
a training module configured to train the voiceprint model based on a score of a speaker to which the speech feature belongs and a score of a portion of noise;
wherein the voiceprint model is an Xvector; and
the identification module is further configured to:
inputting the voice characteristics into an Xvector to obtain the score output by an output node corresponding to a speaker to which the voice characteristics belong and the score output by an output node corresponding to noise, wherein the output nodes of the Xvector are in one-to-one correspondence with sample speakers corresponding to the training sample set, and the output nodes except the output nodes corresponding to the speaker to which the voice characteristics belong are output nodes corresponding to noise;
after the voiceprint model meets a preset training ending condition, inputting the voice features into the voiceprint model, wherein a sample speaker corresponding to an output node with the output score larger than a first preset threshold value is a speaker to which the input voice features belong.
6. The apparatus of claim 5, wherein the extraction module is further configured to:
transforming the audio of the plurality of sample speakers from the time domain to the frequency domain, and extracting the speech features on the frequency domain, wherein the speech features include at least one of: mel-frequency cepstrum coefficient MFCC, perceptual linear prediction PLP, filter bank FBank.
7. The apparatus of claim 5, wherein the apparatus further comprises:
an estimation module configured to estimate a noise prior distribution based on the training sample set;
a selection module configured to select a fraction of the partial noise based on the noise prior distribution.
8. The apparatus of claim 5, wherein the training module is further configured to:
inputting the score of the speaker to which the voice feature belongs and the score of the partial noise into a loss function, and calculating to obtain a loss value;
and updating network parameters of the voiceprint model based on the loss value until the voiceprint model converges.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110263981.9A CN112786058B (en) | 2021-03-08 | 2021-03-08 | Voiceprint model training method, voiceprint model training device, voiceprint model training equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112786058A CN112786058A (en) | 2021-05-11 |
CN112786058B true CN112786058B (en) | 2024-03-29 |
Family
ID=75762530
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110263981.9A Active CN112786058B (en) | 2021-03-08 | 2021-03-08 | Voiceprint model training method, voiceprint model training device, voiceprint model training equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112786058B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113408664B (en) * | 2021-07-20 | 2024-04-16 | 北京百度网讯科技有限公司 | Training method, classification method, device, electronic equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101923855A (en) * | 2009-06-17 | 2010-12-22 | 复旦大学 | Test-irrelevant voice print identifying system |
CN105096121A (en) * | 2015-06-25 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voiceprint authentication method and device |
CN107146601A (en) * | 2017-04-07 | 2017-09-08 | 南京邮电大学 | A kind of rear end i vector Enhancement Methods for Speaker Recognition System |
CN108648760A (en) * | 2018-04-17 | 2018-10-12 | 四川长虹电器股份有限公司 | Real-time sound-groove identification System and method for |
CN110110790A (en) * | 2019-05-08 | 2019-08-09 | 中国科学技术大学 | Using the regular method for identifying speaker of Unsupervised clustering score |
AU2019101150A4 (en) * | 2019-09-30 | 2019-10-31 | Li, Guanchen MR | Speaker Identity Recognition System Based on Deep Learning |
CN110853654A (en) * | 2019-11-17 | 2020-02-28 | 西北工业大学 | Model generation method, voiceprint recognition method and corresponding device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE602007004733D1 (en) * | 2007-10-10 | 2010-03-25 | Harman Becker Automotive Sys | speaker recognition |
Non-Patent Citations (1)
Title |
---|
区分性训练在声纹密码中的新应用 (New Application of Discriminative Training in Voiceprint Passwords); Pan Yiqian; Hu Guoping; Dai Lirong; Liu Qingfeng; Journal of Data Acquisition and Processing (No. 04); full text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111276131B (en) | Multi-class acoustic feature integration method and system based on deep neural network | |
CN107680582B (en) | Acoustic model training method, voice recognition method, device, equipment and medium | |
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
CN108346436B (en) | Voice emotion detection method and device, computer equipment and storage medium | |
CN107481717B (en) | Acoustic model training method and system | |
CN111402891B (en) | Speech recognition method, device, equipment and storage medium | |
US20160111112A1 (en) | Speaker change detection device and speaker change detection method | |
CN109378014A (en) | A kind of mobile device source discrimination and system based on convolutional neural networks | |
CN112786058B (en) | Voiceprint model training method, voiceprint model training device, voiceprint model training equipment and storage medium | |
CN112634880A (en) | Speaker identification method, device, equipment, storage medium and program product | |
WO2016152132A1 (en) | Speech processing device, speech processing system, speech processing method, and recording medium | |
CN113555005B (en) | Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium | |
CN110675865A (en) | Method and apparatus for training hybrid language recognition models | |
US20220277732A1 (en) | Method and apparatus for training speech recognition model, electronic device and storage medium | |
CN113035230B (en) | Authentication model training method and device and electronic equipment | |
CN114023336A (en) | Model training method, device, equipment and storage medium | |
JP2017134321A (en) | Signal processing method, signal processing device, and signal processing program | |
JP6728083B2 (en) | Intermediate feature amount calculation device, acoustic model learning device, speech recognition device, intermediate feature amount calculation method, acoustic model learning method, speech recognition method, program | |
CN113327596B (en) | Training method of voice recognition model, voice recognition method and device | |
CN114913859B (en) | Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium | |
CN113380233B (en) | Audio recognition method, device, training method, training device, equipment and storage medium | |
CN114678040B (en) | Voice consistency detection method, device, equipment and storage medium | |
CN110808035B (en) | Method and apparatus for training hybrid language recognition models | |
CN114005453A (en) | Model training method, voiceprint feature extraction device, and program product | |
CN113990353B (en) | Emotion recognition method, emotion recognition model training method, emotion recognition device and emotion recognition equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||