WO2020073694A1 - Voiceprint recognition method, model training method, and server - Google Patents

Voiceprint recognition method, model training method, and server

Info

Publication number
WO2020073694A1
WO2020073694A1, PCT/CN2019/093792, CN2019093792W
Authority
WO
WIPO (PCT)
Prior art keywords
loss function
information
voice information
function
speech
Prior art date
Application number
PCT/CN2019/093792
Other languages
English (en)
French (fr)
Inventor
李娜
陀得意
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to EP19870101.3A (publication EP3866163A4)
Priority to JP2020561916A (publication JP7152514B2)
Publication of WO2020073694A1
Priority to US17/085,609 (publication US11508381B2)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G06F21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence technology, and in particular, to a voiceprint recognition method, a model training method, and a server.
  • CNN: Convolutional Neural Network.
  • A system based only on the softmax loss function is prone to overfitting during training; that is, it performs well on the training set but poorly on an unseen test set.
  • the embodiments of the present application provide a voiceprint recognition method, a model training method, and a server.
  • the normalized exponential (softmax) function and the centering (center loss) function are used to jointly optimize the voiceprint recognition model, which can reduce the intra-class variation of deep features from the same speaker.
  • Supervising the voiceprint recognition model with the two functions at the same time makes the deep features more discriminative, thereby improving recognition performance.
  • the first aspect of the embodiments of the present application provides a voiceprint recognition method, including:
  • the voiceprint recognition model is obtained by training according to a first loss function and a second loss function, where the first loss function is a normalized exponential (softmax) function;
  • the second loss function is a centering (center loss) function;
  • the voiceprint recognition result is determined according to the target feature information and the registered feature information, where the registered feature information is obtained by passing the voice information of the object to be recognized through the voiceprint recognition model.
  • the second aspect of the embodiments of the present application provides a method for model training, including:
  • the set of voice information to be trained includes voice information corresponding to at least one object
  • the voiceprint recognition model is obtained by training according to the above model adjustment function.
  • a third aspect of the embodiments of the present application provides a server, including one or more processors and one or more memories storing program modules, where the program modules are executed by the processors, and the program modules include:
  • the acquisition module is used to acquire the target voice information to be recognized
  • the acquisition module is also used to acquire the target feature information of the target voice information through a voiceprint recognition model, where the voiceprint recognition model is obtained by training according to a first loss function and a second loss function, the first loss function is a normalized exponential (softmax) function, and the second loss function is a centering (center loss) function;
  • the determining module is configured to determine a voiceprint recognition result based on the target feature information acquired by the acquisition module and the registered feature information, where the registered feature information is obtained by passing the voice information of the object to be recognized through the voiceprint recognition model.
  • a fourth aspect of the embodiments of the present application provides a server, including one or more processors, and one or more memories storing program modules, where the program modules are executed by the processors, and the program modules include:
  • An obtaining module configured to obtain a set of voice information to be trained, wherein the set of voice information to be trained includes voice information corresponding to at least one object;
  • a determining module configured to determine a model adjustment function based on the speech information corresponding to each object in the to-be-trained speech information set acquired by the acquisition module, where the model adjustment function includes a first loss function and a second loss function,
  • the first loss function is a normalized exponential (softmax) function, and the second loss function is a centering (center loss) function;
  • the training module is configured to train to obtain a voiceprint recognition model according to the model adjustment function determined by the determination module.
  • a fifth aspect of the embodiments of the present application provides a server, including: a memory, a transceiver, a processor, and a bus system;
  • the memory is used to store a program;
  • the processor is used to execute the program in the memory and perform the following steps:
  • the voiceprint recognition model is obtained by training according to a first loss function and a second loss function, where the first loss function is a normalized exponential (softmax) function;
  • the second loss function is a centering (center loss) function;
  • the bus system is used to connect the memory and the processor to enable the memory and the processor to communicate.
  • a sixth aspect of the embodiments of the present application provides a server, including: a memory, a transceiver, a processor, and a bus system;
  • the memory is used to store a program;
  • the processor is used to execute the program in the memory and perform the following steps:
  • obtaining a set of voice information to be trained, where the set of voice information to be trained includes voice information corresponding to at least one object;
  • the bus system is used to connect the memory and the processor to enable the memory and the processor to communicate.
  • a seventh aspect of the embodiments of the present application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods of the above aspects.
  • a method for voiceprint recognition is provided.
  • the server obtains target voice information to be recognized, and then the server obtains target feature information of the target voice information through a voiceprint recognition model, in which the voiceprint recognition model is based on
  • the first loss function and the second loss function are obtained by training.
  • the first loss function is a normalized exponential (softmax) function;
  • the second loss function is a centering (center loss) function.
  • the server determines the voiceprint recognition result according to the target feature information and the registered feature information;
  • the registered feature information is obtained after the voice information of the object to be recognized passes through the voiceprint recognition model.
  • the normalized exponential function and the centering function are used to jointly optimize the voiceprint recognition model.
  • the normalized exponential function, used as a loss function, can effectively improve the separability between different speakers in the deep feature space.
  • the centering function can further reduce the intra-class variation among deep features from the same speaker. Supervising the voiceprint recognition model with the two types of loss functions at the same time makes the deep features more discriminative, thereby improving recognition performance.
  • FIG. 1 is a schematic structural diagram of a voiceprint recognition system in an embodiment of this application
  • FIG. 2 is a schematic diagram of an embodiment of a method of voiceprint recognition in an embodiment of the present application
  • FIG. 3 is a schematic flowchart of determining a voiceprint recognition result in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of determining a voiceprint recognition result based on cosine similarity in an embodiment of the present application
  • FIG. 5 is a schematic diagram of an embodiment of a method for model training in an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of preprocessing voice information in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an overall structure of a convolutional neural network in an embodiment of this application.
  • FIG. 8 is a partial structural schematic diagram of a convolutional neural network in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the comparison of the accuracy of the verification set applied to different networks in the embodiment of the present application.
  • FIG. 10 is a schematic diagram of an embodiment of a server in an embodiment of this application.
  • FIG. 11 is a schematic diagram of another embodiment of the server in the embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a server in an embodiment of this application.
  • the embodiments of the present application provide a voiceprint recognition method, a model training method, and a server.
  • the normalized exponential (softmax) function and the centering (center loss) function are used to jointly optimize the voiceprint recognition model, which can reduce the intra-class variation of deep features from the same speaker.
  • Supervising the voiceprint recognition model with the two functions at the same time makes the deep features more discriminative, thereby improving recognition performance.
  • speaker recognition can be divided into two types: speaker identification (Speaker Identification) and speaker verification (Speaker Verification).
  • the goal of speaker identification is to determine which speaker in a known set of registered speakers a piece of test speech belongs to; it is a one-to-many recognition problem.
  • the goal of speaker verification is to determine whether the test speech was spoken by a registered target speaker; it is a one-to-one verification problem.
  • Speaker identification is performed within the set of registered speakers and is a closed-set problem: as the number of registered speakers increases, the algorithmic complexity grows and system performance degrades. In speaker verification, each test involves only one target speaker; it is an open-set problem, and system performance is not greatly affected by the number of registered speakers.
  • speaker recognition can be divided into two categories: Text-dependent and Text-independent.
  • the former requires the enrollment and test utterances to have the same semantic content and is used in scenarios where speakers are cooperative. Because identical semantic content provides additional information to the recognition system, such systems achieve better recognition performance, are not sensitive to changes in speech duration, and can maintain high accuracy even when utterances are short. The latter does not constrain the semantic content of the speech signal; compared with the former it has fewer limiting factors and is applied more flexibly and widely. However, because the semantic content is not limited, the training and test speech may be mismatched, making recognition harder and degrading performance, and a large amount of training corpus is required to obtain good recognition performance. The performance of a text-independent speaker recognition system also drops rapidly as the test speech becomes shorter, which worsens the user experience.
  • FIG. 1 is a schematic structural diagram of the voiceprint recognition system in an embodiment of the present application. As shown in the figure, a terminal device is used to initiate a voiceprint recognition request (for example, by speaking an utterance). After receiving the voiceprint recognition request sent by the terminal device, the server can verify the speaker according to the trained voiceprint recognition model, that is, determine whether the speaker is a registered speaker, thereby generating a voiceprint recognition result.
  • the terminal device includes but is not limited to a tablet computer, a notebook computer, a palmtop computer, a mobile phone, and a personal computer (PC for short), which is not limited here.
  • the voiceprint recognition system may include but is not limited to: a terminal device and a server, where the terminal device collects target voice information of the speaker, and then sends the target voice information to the server.
  • After receiving the target voice information, the server calls a pre-trained voiceprint recognition model to perform feature extraction on the target voice information to obtain target feature information. Then, the voiceprint recognition result corresponding to the target voice information is determined according to the target feature information and the registered feature information. In this way, voiceprint recognition of the speaker's voice is completed through the interaction between the terminal device and the server, which reduces the processing load of the terminal device and improves the efficiency and accuracy of voiceprint recognition.
  • the above voiceprint recognition system may also include a separate terminal device or server, and the terminal device or server independently completes the foregoing voiceprint recognition process, which is not described in detail in this embodiment.
  • One embodiment of the method for voiceprint recognition in the embodiment of the present application includes:
  • the server obtains target voice information to be recognized
  • the speaker utters a piece of voice through the terminal device, where the piece of voice is target voice information to be recognized, and the terminal device sends the target voice information to be recognized to the server.
  • the server obtains the target feature information of the target voice information through the voiceprint recognition model, where the voiceprint recognition model is obtained by training according to the first loss function and the second loss function, the first loss function is a normalized exponential (softmax) function,
  • and the second loss function is a centering (center loss) function;
  • the server inputs the target voice information to be recognized into the voiceprint recognition model, and the voiceprint recognition model outputs the corresponding target feature information; the voiceprint recognition model is jointly trained with the first loss function, a normalized exponential function (softmax loss function), and the second loss function, a centering function (center loss function).
  • a loss function measures the difference between predicted values and true values; the softmax loss function is one such loss function.
  • the server determines the voiceprint recognition result according to the target feature information and the registered feature information, where the registered feature information is obtained after the voice information of the object to be recognized passes the voiceprint recognition model.
  • In the process of identifying the speaker, the server not only needs to extract features from the voice information to be recognized but also needs to compute a test score, and finally determines the voiceprint recognition result according to the test score.
  • the voiceprint recognition model may be a trained convolutional neural network (CNN). First, the enrollment speech and the test speech are divided into sequences of shorter speech segments; if an utterance is too short, a splicing method is used to generate a segment of appropriate duration, and the segments are input to the voiceprint recognition model.
  • An L2-Norm layer then normalizes the registered feature information and the target feature information respectively, where the L2 norm refers to the Euclidean norm of a vector.
  • PLDA: Probabilistic Linear Discriminant Analysis.
  • a method for voiceprint recognition is provided.
  • the server obtains target voice information to be recognized, and then the server obtains target feature information of the target voice information through a voiceprint recognition model, in which the voiceprint recognition model is based on
  • the first loss function and the second loss function are obtained by training.
  • the first loss function belongs to a normalized exponential function
  • the second loss function is a centering (center loss) function
  • the server determines the voiceprint recognition result according to the target feature information and the registered feature information.
  • the registered feature information is obtained after the voice information of the object to be recognized passes the voiceprint recognition model.
  • the normalized exponential function and the centering function are used to jointly optimize the voiceprint recognition model.
  • the normalized exponential function, used as a loss function, can effectively improve the separability between different speakers in the deep feature space.
  • the centering function can further reduce the intra-class variation among deep features from the same speaker. Supervising the voiceprint recognition model with the two types of loss functions at the same time makes the deep features more discriminative, thereby improving recognition performance.
  • determining the voiceprint recognition result according to the target feature information and the registered feature information may include:
  • the server calculates the cosine similarity based on the target feature information and the registered feature information;
  • if the cosine similarity reaches a first similarity threshold, the server determines that the target voice information belongs to the voice information of the object to be recognized;
  • if the cosine similarity does not reach the first similarity threshold, the server determines that the target voice information does not belong to the voice information of the object to be recognized.
  • a method for determining whether a speaker belongs to a registered speaker is provided.
  • A convenient implementation of cosine-similarity scoring is as follows: for the registered feature information, if it is obtained from the training data, the feature information belonging to the same object is grouped into one class, and the average of that class is taken as the registered feature information.
  • Then the cosine similarity between the two pieces of feature information can be calculated, and the recognition result can be determined according to the cosine similarity.
  • FIG. 4 is a schematic diagram of determining the voiceprint recognition result based on the cosine similarity in the embodiment of the present application.
  • the angle θ between vector a and vector b is first obtained, and the cosine value cos θ of the included angle θ can be used to characterize the similarity of the two vectors: the smaller the angle, the closer the cosine value is to 1, the more similar their directions, and the more similar the two vectors.
  • Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. Compared with distance metrics, cosine similarity focuses on the difference in direction between the two vectors rather than on their distance or magnitude.
  • If the cosine similarity reaches the first similarity threshold (for example, 0.8), the server determines that the target voice information belongs to the voice information of the object to be recognized; if the cosine similarity (for example, 0.7) does not reach the first similarity threshold (for example, 0.8), the server determines that the target voice information does not belong to the voice information of the object to be recognized.
  • The server may first calculate the cosine similarity based on the target feature information and the registered feature information; if the cosine similarity reaches the first similarity threshold, the server determines that the target voice information belongs to the voice information of the object to be recognized, and if it does not reach the first similarity threshold, the server determines that it does not.
  • Cosine similarity distinguishes differences in terms of direction. It is mainly used to measure the similarity and difference between users by scoring user content, and it also corrects the problem of inconsistent measurement scales among users, which helps improve the reliability of voiceprint recognition results.
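  • As an illustration only (not part of the patent text), a minimal Python sketch of this cosine-similarity decision is given below; the embedding dimension is an assumption, the threshold value 0.8 follows the example above, and all variable names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(target_feature: np.ndarray, registered_feature: np.ndarray,
           threshold: float = 0.8) -> bool:
    """Accept the test utterance as the registered speaker if the similarity reaches the threshold."""
    return cosine_similarity(target_feature, registered_feature) >= threshold

# Toy example with 512-dimensional embeddings (the dimension is an assumption).
rng = np.random.default_rng(0)
enrolled = rng.standard_normal(512)                 # registered feature information
test = enrolled + 0.1 * rng.standard_normal(512)    # a "same speaker" perturbation
print(verify(test, enrolled))                       # True for this toy example
```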
  • determining the voiceprint recognition result according to the target feature information and the registered feature information may include:
  • the server calculates the log-likelihood ratio between the target feature information and the registered feature information through a PLDA classifier;
  • if the log-likelihood ratio reaches a second similarity threshold, the server determines that the target voice information belongs to the voice information of the object to be recognized;
  • if the log-likelihood ratio does not reach the second similarity threshold, the server determines that the target voice information does not belong to the voice information of the object to be recognized.
  • another method for determining whether the speaker belongs to a registered speaker is provided.
  • A convenient implementation of PLDA scoring is as follows:
  • Assume the training speech data consists of the utterances of I speakers, each speaker having J different utterances; define the j-th utterance of the i-th speaker as x_ij. Then, according to factor analysis, the generative model of x_ij is defined as:
  • x_ij = u + F h_i + G w_ij + ε_ij;
  • This model can be seen as two parts.
  • the first two terms on the right side of the equals sign are related only to the speaker and not to a specific utterance of that speaker; this is called the signal part, and it describes the differences between speakers.
  • the last two items on the right side of the equal sign describe the difference between different voices of the same speaker, and are called the noise part.
  • F and G contain the basic factors in their respective hypothetical variable spaces. These factors can be regarded as feature vectors in their respective spaces. For example, each column of F corresponds to the feature vector of the inter-class space, and each column of G corresponds to the feature vector of the intra-class space.
  • The latent variables can be regarded as coordinates in their respective spaces; for example,
  • h_i is the representation of x_ij in the speaker space.
  • In the recognition scoring stage, the greater the likelihood that two utterances share the same speaker factor h_i, the more certain it is that the two utterances belong to the same speaker.
  • The parameters of the PLDA model are the data mean u, the space matrices F and G, and the noise covariance Σ.
  • the training process of the model is solved iteratively using the classic expectation-maximization (EM) algorithm.
  • the server may first calculate the log likelihood ratio between the target feature information and the registered feature information through the PLDA classifier, If the log likelihood ratio reaches the second similarity threshold, the server determines that the target voice information belongs to the voice information of the object to be recognized. If the log likelihood ratio does not reach the second similarity threshold, the server determines that the target voice information does not belong to the voice information of the object to be recognized.
  • PLDA is used as the channel compensation algorithm, and its channel compensation capability is better than the traditional linear discriminant analysis classifier, which is beneficial to improve the reliability of voiceprint recognition results.
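  • For concreteness, the sketch below (not part of the patent text) samples synthetic utterance features from the generative model x_ij = u + F h_i + G w_ij + ε_ij described above; all dimensions, counts, and the noise scale are illustrative assumptions, and in practice u, F, G, and Σ are learned with the EM algorithm rather than fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q_SPK, Q_CHAN = 64, 16, 8        # feature dim, speaker-space dim, channel-space dim
I, J = 5, 10                        # number of speakers, utterances per speaker

u = rng.standard_normal(D)                   # global mean
F = rng.standard_normal((D, Q_SPK))          # columns span the between-class (speaker) space
G = rng.standard_normal((D, Q_CHAN))         # columns span the within-class (channel) space
noise_std = 0.1

X = np.zeros((I, J, D))
for i in range(I):
    h_i = rng.standard_normal(Q_SPK)         # speaker factor, shared by all utterances of speaker i
    for j in range(J):
        w_ij = rng.standard_normal(Q_CHAN)   # utterance-specific factor
        eps_ij = noise_std * rng.standard_normal(D)
        X[i, j] = u + F @ h_i + G @ w_ij + eps_ij

# At verification time, two utterances are compared by the log-likelihood ratio of the
# "same h_i" hypothesis versus the "different h_i" hypothesis, then thresholded.
```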
  • One embodiment of the method of model training in the embodiment of the present application includes:
  • the server obtains a set of voice information to be trained, where the set of voice information to be trained includes voice information corresponding to at least one object;
  • the server first obtains the voice information set to be trained, and the voice information set needs to contain voice information of at least one object.
  • FIG. 6 is a schematic diagram of a process for preprocessing voice information in an embodiment of the present application; as shown in the figure, the preprocessing may optionally include the following steps:
  • In step S1, voice activity detection (VAD) is first performed on each piece of voice information in the training set. The purpose is to identify and eliminate long silence periods from the audio stream so that, without degrading quality of service, speech channel resources are saved, which helps reduce the end-to-end delay perceived by users.
  • VAD: Voice Activity Detection.
  • So-called front/back clipping occurs when the speech is reconstructed: because there is a decision threshold and a delay between the actual start of speech and its detection, the beginning and end of the speech waveform are sometimes discarded as silence, which distorts the reconstructed speech. Therefore, voice packets need to be added before and after the detected speech packets to smooth the transitions and solve this problem.
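  • The snippet below (an illustration, not the patent's algorithm) shows one simple way to realize step S1 with a frame-level energy threshold, including padding frames kept around detected speech to avoid the front/back clipping just described; the frame sizes, threshold, and padding amount are assumptions.

```python
import numpy as np

def energy_vad(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
               threshold_db: float = -35.0, pad_frames: int = 3) -> np.ndarray:
    """Return a boolean mask over frames: True where speech is assumed to be present."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    speech = energy_db > (energy_db.max() + threshold_db)   # threshold relative to the loudest frame
    padded = np.zeros_like(speech)
    for idx in np.flatnonzero(speech):                      # keep a few frames around detected speech
        padded[max(0, idx - pad_frames):idx + pad_frames + 1] = True
    return padded
```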
  • In step S2, pre-emphasis is applied. Its purpose is to boost the high-frequency part of the voice information, remove the effect of lip radiation, and increase the high-frequency resolution of the speech, flattening the signal spectrum so that the spectrum can be computed with a similar signal-to-noise ratio over the whole band from low to high frequencies.
  • The reason is that the energy of a speech signal is concentrated in the low-frequency band, and its power spectral density decreases as the frequency increases, so the signal-to-noise ratio in the high-frequency band is noticeably lower.
  • In step S3, the signal of each frame is usually multiplied by a smooth window function so that both ends of the frame decay smoothly to zero, which reduces the side-lobe intensity after the Fourier transform and yields a higher-quality spectrum.
  • For each frame, a window function is chosen whose width equals the frame length.
  • Commonly used window functions include the rectangular, Hamming, Hanning, and Gaussian windows.
  • In step S4, because the characteristics of a signal are usually hard to observe from its time-domain waveform, the signal is converted to an energy distribution in the frequency domain, where different energy distributions represent different speech characteristics. Therefore, after the Hamming window is applied, each frame undergoes a fast Fourier transform to obtain its spectral energy distribution.
  • In step S5, the output of the fast Fourier transform is passed through a bank of Mel filters to obtain the Mel spectrum, and the logarithm is then taken. This completes the preprocessing of the voice information and produces the feature vectors.
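  • Putting steps S2 to S5 together, a minimal feature-extraction sketch is shown below; the sample rate, frame sizes, and number of Mel filters are illustrative assumptions, and librosa is used only to construct the Mel filter bank (the patent does not name a specific library).

```python
import numpy as np
import librosa

def log_mel_features(signal: np.ndarray, sr: int = 16000, n_fft: int = 512,
                     frame_len: int = 400, hop: int = 160, n_mels: int = 40) -> np.ndarray:
    # S2: pre-emphasis boosts high frequencies: y[n] = x[n] - 0.97 * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # S3: split into overlapping frames and apply a Hamming window
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # S4: fast Fourier transform -> power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # S5: Mel filter bank, then logarithm
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_fft // 2 + 1)
    return np.log(power @ mel_fb.T + 1e-10)                          # (n_frames, n_mels)
```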
  • the model adjustment function includes a first loss function and a second loss function.
  • the first loss function is a normalized exponential (softmax) function.
  • the second loss function is a centering (center loss) function;
  • the server generates the first loss function and the second loss function according to the preprocessed voice information, and combines the first loss function and the second loss function to obtain the model adjustment function, which can be used to adjust the voiceprint recognition model.
  • the voiceprint recognition model is jointly trained with the first loss function, a normalized exponential function (softmax loss function), and the second loss function, a centering function (center loss function).
  • the server trains a voiceprint recognition model according to the model adjustment function.
  • the server trains the voiceprint recognition model according to the obtained model adjustment function. After receiving voice information to be recognized, the target voice information to be recognized is input to the voiceprint recognition model, and the voiceprint recognition model outputs the corresponding target feature information.
  • a method for model training is provided: the server first obtains a set of voice information to be trained, where the set includes voice information corresponding to at least one object; then the server determines a model adjustment function according to the speech information corresponding to each object in the set, where the model adjustment function includes a first loss function and a second loss function, the first loss function is a normalized exponential (softmax) function, and the second loss function is a centering (center loss) function. Finally, the voiceprint recognition model is obtained by training according to the model adjustment function. In this way, the normalized exponential function and the centering function are used to jointly optimize the voiceprint recognition model.
  • the normalized exponential function, used as a loss function, can effectively improve the separability between different speakers in the deep feature space.
  • the centering function can further reduce the intra-class variation among deep features from the same speaker.
  • Using two types of loss functions to supervise and learn the voiceprint recognition model at the same time can make the deep features more discriminative, thereby improving the recognition performance.
  • the server determining the model adjustment function according to the speech information corresponding to each object in the set of speech information to be trained may include:
  • the server determines the deep features of each piece of voice information through the CNN;
  • the server obtains the fully connected layer weight matrix according to the voice information corresponding to each object in the set of voice information to be trained;
  • the server determines the first loss function according to the deep features of each piece of voice information and the fully connected layer weight matrix.
  • the server uses a deep CNN based on the Inception-ResNet structure to generate the model adjustment function.
  • FIG. 7 is a schematic diagram of the overall structure of the convolutional neural network in an embodiment of the present application. As shown in the figure, the overall structure includes the submodules Inception-ResNet-A, Inception-ResNet-B, Inception-ResNet-C, Reduction-A, and Reduction-B. Module A1 and module A2 specifically have the structure shown in FIG. 8; please refer to FIG. 8.
  • FIG. 8 is a schematic diagram of a partial structure of the convolutional neural network in an embodiment of the present application. Considering the nature of the input voice information, an asymmetric convolution kernel is used in the first convolutional layer so that a larger convolution can be performed along the time axis.
  • each utterance is cut into fixed-length speech segments, and the resulting feature "images" are used as the network input; with the trained network, the sentence-level speaker features are obtained by calculating the average of the speaker features corresponding to the input speech segments.
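  • The following sketch (an illustration, not the patent's exact architecture) shows the two ideas just described: a first convolution with an asymmetric kernel that is larger along the time axis, and sentence-level features obtained by averaging segment-level embeddings; all kernel sizes, channel counts, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TinySpeakerNet(nn.Module):
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        # Input: (batch, 1, time, n_mels); kernel (7, 3) spans more of the time axis.
        self.conv1 = nn.Conv2d(1, 32, kernel_size=(7, 3), stride=(2, 1), padding=(3, 1))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(32, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.conv1(x))
        h = self.pool(h).flatten(1)
        return self.fc(h)                         # segment-level speaker embedding

net = TinySpeakerNet()
segments = torch.randn(10, 1, 200, 40)            # 10 fixed-length segments from one utterance
sentence_embedding = net(segments).mean(dim=0)    # average segment features -> sentence level
```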
  • the server may determine the model adjustment function according to the voice information corresponding to each object in the set of voice information to be trained by first determining the deep feature of each piece of voice information through the CNN, then obtaining the fully connected layer weight matrix according to the voice information corresponding to each object in the set, and finally determining the first loss function according to the deep features of each piece of voice information and the fully connected layer weight matrix.
  • the embodiment of the present application provides a method for model training.
  • the server determining the first loss function according to the deep features of each piece of voice information and the fully connected layer weight matrix may include:
  • the server determines the first loss function in the following manner:
  • L_s represents the first loss function;
  • x_i represents the i-th deep feature, which comes from the y_i-th object;
  • W_v represents the v-th column of the fully connected layer weight matrix;
  • b_j represents the bias term of the j-th class;
  • M represents the mini-batch size of the training set corresponding to the set of speech information to be trained;
  • N represents the number of objects (classes) corresponding to the set of speech information to be trained.
  • the input of the log function is the softmax result;
  • L_s represents the resulting softmax loss;
  • Wx + b represents the output of the fully connected layer; therefore, the input of the log represents the probability that x_i belongs to the y_i-th class.
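  • The formula image itself is not reproduced in this text; given the symbol definitions above, the standard softmax loss they describe would be written (up to averaging over the mini-batch) as:

```latex
L_s = -\sum_{i=1}^{M} \log \frac{e^{W_{y_i}^{\top} x_i + b_{y_i}}}{\sum_{j=1}^{N} e^{W_{j}^{\top} x_i + b_{j}}}
```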
  • the embodiment of the present application provides a specific way to obtain the first loss function, that is, the server determines the first loss function according to the depth feature of each voice information and the connection layer weight matrix.
  • the server determining the model adjustment function according to the speech information corresponding to each object in the set of speech information to be trained may include:
  • the server determines the depth features of each voice message through CNN
  • the server calculates the depth feature gradient according to the depth feature of each voice information
  • the server calculates the second speech mean based on the depth feature gradient and the first speech mean
  • the server determines the second loss function according to the depth feature of each voice information and the mean value of the second voice.
  • the second loss function is the center loss function.
  • the second loss function is determined as described below.
  • In the gradient descent method, only the first derivative of the loss function needs to be computed, and the computational cost is relatively small, which makes gradient descent applicable to many large-scale data sets.
  • The idea of gradient descent is to find the next iteration point by following the gradient direction at the current point.
  • an embodiment of the present application provides a way to obtain the second loss function: the server calculates a deep feature gradient according to the deep feature of each piece of speech information, calculates the second speech mean based on the deep feature gradient and the first speech mean, and finally determines the second loss function according to the deep features of each piece of speech information and the second speech mean.
  • the embodiment of the present application provides a method for model training.
  • the server calculating a deep feature gradient according to the deep feature of each piece of voice information may include:
  • the deep feature gradient is calculated as follows:
  • Δ_j represents the deep feature gradient (the update term for the j-th class center);
  • M represents the mini-batch size of the training set corresponding to the set of speech information to be trained;
  • j represents the class index,
  • where each class corresponds to one object;
  • y_i represents the y_i-th object, that is, the class to which the i-th sample belongs.
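  • The formula image is likewise not reproduced here; based on the definitions above and on the standard center loss formulation, a plausible reconstruction of the update term is (δ(·) equals 1 when its condition holds and 0 otherwise, and c_j is the current center, i.e., speech mean, of class j):

```latex
\Delta_j = \frac{\sum_{i=1}^{M} \delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{M} \delta(y_i = j)}
```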
  • Calculating the second speech mean value according to the depth feature gradient and the first speech mean value may include:
  • t denotes the training iteration (moment): c_j^t represents the first speech mean (the center of class j) at iteration t, and c_j^{t+1} represents the second speech mean corresponding to iteration t + 1, obtained by the update c_j^{t+1} = c_j^t - α·Δ_j^t;
  • α represents the learning rate parameter;
  • the value range of α is greater than or equal to 0 and less than or equal to 1;
  • the determination of the second loss function according to the depth feature of each voice information and the mean value of the second voice may include:
  • L_c represents the second loss function;
  • x_i represents the i-th deep feature, which comes from the y_i-th object;
  • c_{y_i} represents the mean (center) of the deep features from the y_i-th object.
  • The gradient of the center loss function with respect to x_i, together with the update term Δ_j defined above, is used during training to update the deep features and the class centers.
  • The sum of squared distances between the feature of each sample in a batch and the corresponding class center should be as small as possible; that is, the smaller the intra-class distance, the better. This is the center loss.
  • the embodiments of the present application provide a specific way to obtain the second loss function. In this way, the feasibility and operability of the solution are improved.
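  • As a concrete illustration (not the patent's implementation), the PyTorch sketch below computes a center loss of the form L_c = ½ Σ_i ||x_i - c_{y_i}||² and updates the class centers with the rule described above; the feature dimension, class count, and learning rate α are assumptions.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """L_c = 0.5 * sum_i ||x_i - c_{y_i}||^2, with centers updated as c_j <- c_j - alpha * delta_j."""
    def __init__(self, num_classes: int, feat_dim: int, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim), requires_grad=False)

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Gradient flows back to `features` only: dL_c/dx_i = x_i - c_{y_i}.
        return 0.5 * (features - self.centers[labels]).pow(2).sum(dim=1).sum()

    @torch.no_grad()
    def update_centers(self, features: torch.Tensor, labels: torch.Tensor) -> None:
        for j in labels.unique():
            mask = labels == j
            delta_j = (self.centers[j] - features[mask]).sum(dim=0) / (1.0 + mask.sum())
            self.centers[j] -= self.alpha * delta_j
```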
  • the server determines the model adjustment function, which may include:
  • the server determines the first loss function according to the voice information corresponding to each object in the voice information set to be trained
  • the server determines the second loss function according to the voice information corresponding to each object in the voice information set to be trained
  • the server determines the model adjustment function according to the first loss function and the second loss function.
  • the server obtains the first loss function and the second loss function
  • the first loss function and the second loss function are jointly processed to obtain a model adjustment function.
  • the first loss function here is a softmax loss function
  • the second loss function is the center loss function. If only the softmax loss function is used, a clear class boundary can be seen on both the training data set and the test data set. If the center loss function is added on top of the softmax loss function, the inter-class distance becomes larger and the intra-class distance decreases.
  • the server determines the model adjustment function according to the voice information corresponding to each object in the set of voice information to be trained. Specifically, the server may first determine the first loss function according to the voice information corresponding to each object in the set, then determine the second loss function according to the voice information corresponding to each object in the set, and finally determine the model adjustment function according to the first loss function and the second loss function. In this way, the feasibility and operability of the solution can be improved.
  • determining the model adjustment function according to the first loss function and the second loss function may include:
  • the server determines the model adjustment function as follows:
  • L t represents the model adjustment function
  • L s represents the first loss function
  • L c represents the second loss function
  • λ represents the control parameter
  • the loss function used in the embodiment of the present application is a linear combination of a first loss function (softmax loss function) and a second loss function (center loss function), the weight of the first loss function is 1, and the weight of the second loss function is ⁇ .
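  • Written out, the combination described above is:

```latex
L_t = L_s + \lambda \, L_c
```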
  • M represents the number of samples included in the mini-batch
  • N represents the number of categories.
  • a specific calculation method for the server to determine the model adjustment function according to the first loss function and the second loss function is introduced.
  • the control parameter can be used to control the ratio between the first loss function and the second loss function, which is beneficial for improving the reliability of the calculation, and it can be adjusted by the server according to different applications, thereby improving the flexibility of the solution.
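  • A minimal training-step sketch of this joint optimization is given below, reusing the CenterLoss sketch from the earlier section; the toy network, optimizer settings, and λ value are illustrative assumptions rather than the patent's configuration.

```python
import torch
import torch.nn as nn

feat_dim, num_classes, lam = 128, 1000, 0.01
net = nn.Sequential(nn.Flatten(), nn.Linear(200 * 40, feat_dim), nn.ReLU())  # stand-in for the CNN
classifier = nn.Linear(feat_dim, num_classes)                                # fully connected layer Wx + b
center_loss = CenterLoss(num_classes, feat_dim)                              # from the sketch above
softmax_loss = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(list(net.parameters()) + list(classifier.parameters()), lr=0.01)

def train_step(batch: torch.Tensor, labels: torch.Tensor) -> float:
    embeddings = net(batch)
    logits = classifier(embeddings)
    loss = softmax_loss(logits, labels) + lam * center_loss(embeddings, labels)  # L_t = L_s + lambda * L_c
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    center_loss.update_centers(embeddings.detach(), labels)
    return loss.item()
```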
  • FIG. 9 is a schematic diagram comparing verification-set accuracy across different networks in an embodiment of the present application. As shown in the figure, optimizing network training with the two loss functions together is better than using the softmax loss alone, and
  • the network structure of this embodiment can achieve the highest accuracy on the verification set while using a smaller input feature dimension and the shortest input speech duration.
  • DNN: Deep Neural Network.
  • i-vector: identity vector.
  • the voiceprint recognition method provided by the embodiment of the present application is significantly better than the existing DNN/i-vector method for short utterances, and for long utterances its performance is not much different from DNN/i-vector. Moreover, for short utterances, the speaker recognition system based on deep discriminative features does not require complicated pipeline design; therefore, the practical efficiency of the scheme is improved.
  • FIG. 10 is a schematic diagram of an embodiment of the server in the embodiment of the present application.
  • the server 30 includes one or more processors and one or more memories storing program modules, where the program modules are executed by the processors, and the program modules include:
  • the obtaining module 301 is used to obtain target voice information to be recognized
  • the obtaining module 301 is also used to obtain target feature information of the target voice information through a voiceprint recognition model, where the voiceprint recognition model is obtained by training according to a first loss function and a second loss function, the first loss function is a normalized exponential (softmax) function, and the second loss function is a centering (center loss) function;
  • the determining module 302 is configured to determine a voiceprint recognition result according to the target feature information acquired by the acquiring module 301 and the registered feature information, where the registered feature information is obtained after the voice information of the object to be recognized passes the voiceprint recognition model.
  • the obtaining module 301 obtains target voice information to be recognized, and the obtaining module 301 obtains target feature information of the target voice information through a voiceprint recognition model, where the voiceprint recognition model is obtained by training according to the first loss function and the second loss function, the first loss function is a normalized exponential function, and the second loss function is a centering function.
  • the determination module 302 determines the voiceprint recognition result according to the target feature information acquired by the acquisition module 301 and the registered feature information, where the registered feature information is obtained after the voice information of the object to be recognized passes through the voiceprint recognition model.
  • An embodiment of the present application provides a server.
  • the server obtains target voice information to be recognized, and then obtains target feature information of the target voice information through a voiceprint recognition model, where the voiceprint recognition model is obtained by training according to the first loss function and the second loss function; the first loss function is a normalized exponential (softmax) function and the second loss function is a centering (center loss) function.
  • the server determines the voiceprint recognition result according to the target feature information and the registered feature information, where the registered feature information is obtained after the voice information of the object to be recognized passes through the voiceprint recognition model.
  • the normalized exponential function and the centering function are used to jointly optimize the voiceprint recognition model.
  • the normalized exponential function, used as a loss function, can effectively improve the separability between different speakers in the deep feature space.
  • the centering function can further reduce the intra-class variation among deep features from the same speaker. Supervising the voiceprint recognition model with the two types of loss functions at the same time makes the deep features more discriminative, thereby improving recognition performance.
  • the determination module 302 is specifically used to calculate the cosine similarity based on the target feature information and the registered feature information;
  • if the cosine similarity reaches the first similarity threshold, determine that the target voice information belongs to the voice information of the object to be recognized;
  • if the cosine similarity does not reach the first similarity threshold, determine that the target voice information does not belong to the voice information of the object to be recognized.
  • the server may first calculate the cosine similarity based on the target feature information and the registered feature information; if the cosine similarity reaches the first similarity threshold, it is determined that the target voice information belongs to the voice information of the object to be recognized, and if the cosine similarity does not reach the first similarity threshold, the server determines that the target voice information does not belong to the voice information of the object to be recognized.
  • Cosine similarity distinguishes differences in terms of direction. It is mainly used to measure the similarity and difference between users by scoring user content, and it also corrects the problem of inconsistent measurement scales among users, which helps improve the reliability of voiceprint recognition results.
  • the determination module 302 is specifically used to calculate the log-likelihood ratio between the target feature information and the registered feature information through the PLDA classifier;
  • the log likelihood ratio reaches the second similarity threshold, it is determined that the target voice information belongs to the voice information of the object to be recognized;
  • the log likelihood ratio does not reach the second similarity threshold, it is determined that the target voice information does not belong to the voice information of the object to be recognized.
  • the server may first calculate the log likelihood ratio between the target feature information and the registered feature information through the PLDA classifier, If the log likelihood ratio reaches the second similarity threshold, it is determined that the target voice information belongs to the voice information of the object to be recognized. If the log likelihood ratio does not reach the second similarity threshold, the server determines that the target voice information does not belong to the voice information of the object to be recognized.
  • PLDA is used as the channel compensation algorithm, and its channel compensation capability is better than the traditional linear discriminant analysis classifier, which is beneficial to improve the reliability of voiceprint recognition results.
  • FIG. 11 is a schematic diagram of an embodiment of the server in the embodiment of the present application.
  • the server 40 includes one or more processors and one or more memories storing program modules, where the program modules are executed by the processors, and the program modules include:
  • the obtaining module 401 is used to obtain a set of voice information to be trained, wherein the set of voice information to be trained includes voice information corresponding to at least one object;
  • the determining module 402 is configured to determine a model adjustment function according to the speech information corresponding to each object in the to-be-trained speech information set acquired by the acquiring module 401, where the model adjustment function includes a first loss function and a second loss function, the first loss function is a normalized exponential (softmax) function, and the second loss function is a centering (center loss) function;
  • the training module 403 is configured to train to obtain a voiceprint recognition model according to the model adjustment function determined by the determination module 402.
  • the acquiring module 401 acquires a set of voice information to be trained, where the set includes voice information corresponding to at least one object; the determining module 402 determines the model adjustment function according to the voice information corresponding to each object in the set acquired by the acquiring module 401, where the model adjustment function includes a first loss function and a second loss function, the first loss function is a normalized exponential function, and the second loss function is a centering function; and the training module 403 trains a voiceprint recognition model according to the model adjustment function determined by the determining module 402.
  • a method for model training is provided: the server first obtains a set of voice information to be trained, where the set includes voice information corresponding to at least one object; then the server determines a model adjustment function according to the speech information corresponding to each object in the set, where the model adjustment function includes a first loss function and a second loss function, the first loss function is a normalized exponential (softmax) function, and the second loss function is a centering (center loss) function. Finally, the voiceprint recognition model is obtained by training according to the model adjustment function. In this way, the normalized exponential function and the centering function are used to jointly optimize the voiceprint recognition model.
  • the normalized exponential function, used as a loss function, can effectively improve the separability between different speakers in the deep feature space.
  • the centering function can further reduce the intra-class variation among deep features from the same speaker.
  • Using two types of loss functions to supervise and learn the voiceprint recognition model at the same time can make the deep features more discriminative, thereby improving the recognition performance.
  • the determination module 402 is specifically used to determine the depth feature of each voice information through a convolutional neural network CNN;
  • obtain the fully connected layer weight matrix according to the voice information corresponding to each object in the set of voice information to be trained;
  • the first loss function is determined according to the deep features of each piece of voice information and the fully connected layer weight matrix.
  • the server may determine the model adjustment function according to the voice information corresponding to each object in the set of voice information to be trained by first determining the deep feature of each piece of voice information through a convolutional neural network (CNN), then obtaining the fully connected layer weight matrix according to the speech information corresponding to each object in the set, and finally determining the first loss function according to the deep features of each piece of speech information and the fully connected layer weight matrix.
  • the determining module 402 is specifically used to determine the first loss function in the following manner:
  • L_s represents the first loss function;
  • x_i represents the i-th deep feature, which comes from the y_i-th object;
  • W_v represents the v-th column of the fully connected layer weight matrix;
  • b_j represents the bias term of the j-th class;
  • M represents the mini-batch size of the training set corresponding to the set of speech information to be trained;
  • N represents the number of objects (classes) corresponding to the set of speech information to be trained.
  • an embodiment of the present application provides a specific way to obtain the first loss function, that is, determine the first loss function according to the depth feature of each voice information and the connection layer weight matrix. In the above way, the feasibility and operability of the scheme are improved.
  • the determination module 402 is specifically used to determine the depth feature of each voice information through a convolutional neural network CNN;
  • the second loss function is determined according to the depth feature of each voice information and the mean value of the second voice.
  • an embodiment of the present application provides a way to obtain a second loss function, that is, the server calculates a depth feature gradient according to the depth feature of each speech information, and calculates a second speech mean based on the depth feature gradient and the first speech mean Finally, the second loss function is determined according to the depth feature of each speech information and the mean value of the second speech.
  • the determination module 402 is specifically used to calculate the deep feature gradient in the following manner:
  • Δ_j represents the deep feature gradient (the update term for the j-th class center);
  • M represents the mini-batch size of the training set corresponding to the set of speech information to be trained;
  • j represents the class index,
  • where each class corresponds to one object;
  • y_i represents the y_i-th object, that is, the class to which the i-th sample belongs.
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the moment
  • t represents the mean value of the second voice corresponding to time t + 1
  • represents the learning rate parameter
  • the value range of ⁇ is greater than or equal to 0, and less than or equal to 1;
  • L c represents the second loss function
  • x i represents the i-th depth feature from the y i- th object
  • ⁇ yi represents the mean value of the depth distinguishing features from y i .
  • the embodiments of the present application provide a specific way to obtain the second loss function. In this way, the feasibility and operability of the solution are improved.
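Below is a minimal, illustrative sketch of a centralization (center) loss and the per-class mean update described above; the framework (PyTorch), the function names, and the learning rate value are assumptions for illustration rather than the application's actual implementation.

```python
# Hedged sketch: center loss L_c and the mini-batch class-mean update,
# following the Δμ_j and μ^{t+1} expressions above.
import torch

def center_loss(x, labels, centers):
    """x: (M, D) depth features; labels: (M,); centers: (N, D) class means."""
    # L_c = 1/2 * sum_i ||x_i - mu_{y_i}||^2
    return 0.5 * (x - centers[labels]).pow(2).sum()

def update_centers(x, labels, centers, alpha=0.5):
    """One mini-batch update: mu_j <- mu_j - alpha * Δmu_j."""
    new_centers = centers.clone()
    for j in labels.unique():
        mask = labels == j
        diff = centers[j] - x[mask]                    # (mu_j - x_i) for class j
        delta = diff.sum(dim=0) / (1.0 + mask.sum())   # Δmu_j
        new_centers[j] = centers[j] - alpha * delta
    return new_centers
```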
  • the determining module 402 is specifically configured to determine the first loss function according to the speech information corresponding to each object in the speech information set to be trained;
  • determine the second loss function according to the speech information corresponding to each object in the speech information set to be trained; and
  • determine the model adjustment function according to the first loss function and the second loss function.
  • in other words, the server first determines the first loss function according to the speech information corresponding to each object in the speech information set to be trained, then determines the second loss function according to the same speech information, and finally determines the model adjustment function according to the first loss function and the second loss function.
  • the determination module 402 is specifically used to determine the model adjustment function in the following manner:
  L_t = L_s + λ·L_c
  • where L_t represents the model adjustment function, L_s represents the first loss function, L_c represents the second loss function, and λ represents the control parameter.
  • a specific calculation method for the server to determine the model adjustment function according to the first loss function and the second loss function is introduced.
  • the control parameter λ can be used to control the relative weight of the first loss function and the second loss function, which helps improve the reliability of the calculation and can be adjusted for different applications, thereby improving the flexibility of the solution.
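As an illustration of how the two losses and the control parameter could be combined in one training step, a hedged sketch follows; the optimizer, the model attributes (model.W, model.b), and the values of λ and α are assumptions, and the helpers reuse the loss sketches given earlier.

```python
# Hedged sketch: joint supervision with L_t = L_s + lambda * L_c.
import torch

def training_step(model, batch, labels, centers, optimizer, lam=0.01, alpha=0.5):
    features = model(batch)                                   # depth features x_i
    ls = softmax_speaker_loss(features, labels, model.W, model.b)
    lc = center_loss(features, labels, centers)
    lt = ls + lam * lc                                        # model adjustment function
    optimizer.zero_grad()
    lt.backward()
    optimizer.step()
    with torch.no_grad():                                     # class means updated per mini-batch
        centers = update_centers(features.detach(), labels, centers, alpha)
    return lt.item(), centers
```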
  • FIG. 12 is a schematic diagram of a server structure provided by an embodiment of the present application.
  • the server 500 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 522 (for example, one or more processors), a memory 532, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 542 or data 544.
  • the memory 532 and the storage medium 530 may be short-term storage or persistent storage.
  • the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
  • the central processor 522 may be configured to communicate with the storage medium 530 and execute a series of instruction operations in the storage medium 530 on the server 500.
  • the server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input and output interfaces 558, and / or one or more operating systems 541, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the steps performed by the server in the above embodiments may be based on the server structure shown in FIG. 12.
  • the CPU 522 is used to perform the following steps:
  • obtain target speech information to be recognized, and obtain target feature information of the target speech information through a voiceprint recognition model, where the voiceprint recognition model is trained based on a first loss function and a second loss function, the first loss function belongs to a normalized exponential function, and the second loss function belongs to a centralization function;
  • the voiceprint recognition result is determined according to the target feature information and the registered feature information, where the registered feature information is obtained by the voice information of the object to be recognized after passing through the voiceprint recognition model.
  • the CPU 522 is specifically used to perform the following steps:
  • calculate a cosine similarity according to the target feature information and the registered feature information;
  • if the cosine similarity reaches a first similarity threshold, determine that the target speech information belongs to the speech information of the object to be recognized;
  • if the cosine similarity does not reach the first similarity threshold, determine that the target speech information does not belong to the speech information of the object to be recognized.
  • the CPU 522 is specifically used to perform the following steps:
  • calculate a log likelihood ratio between the target feature information and the registered feature information through a PLDA classifier;
  • if the log likelihood ratio reaches a second similarity threshold, determine that the target speech information belongs to the speech information of the object to be recognized;
  • if the log likelihood ratio does not reach the second similarity threshold, determine that the target speech information does not belong to the speech information of the object to be recognized.
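A hedged sketch of the cosine-similarity decision described above follows; the threshold value and the function name are illustrative assumptions only.

```python
# Hedged sketch: cosine-similarity verification against the registered feature.
import numpy as np

def verify_by_cosine(target_feat, registered_feat, threshold=0.8):
    cos = float(np.dot(target_feat, registered_feat) /
                (np.linalg.norm(target_feat) * np.linalg.norm(registered_feat)))
    # reaching the first similarity threshold means the same speaker
    return cos >= threshold, cos
```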
  • the CPU 522 is used to perform the following steps:
  • the set of voice information to be trained includes voice information corresponding to at least one object
  • the model adjustment function is determined according to the speech information corresponding to each object in the speech information set to be trained, where the model adjustment function includes a first loss function and a second loss function, the first loss function belongs to a normalized exponential function, and the second loss function belongs to a centralization function;
  • the voiceprint recognition model is obtained by training according to the model adjustment function.
  • the CPU 522 is specifically used to perform the following steps:
  • determine the depth feature of each piece of speech information through a convolutional neural network (CNN);
  • obtain the connection layer weight matrix according to the speech information corresponding to each object in the speech information set to be trained;
  • determine the first loss function according to the depth feature of each piece of speech information and the connection layer weight matrix.
  • the CPU 522 is specifically used to perform the following steps:
  • the first loss function is determined in the following manner:
  L_s = -∑_{i=1}^{M} log( e^{W_{y_i}^T·x_i + b_{y_i}} / ∑_{j=1}^{N} e^{W_j^T·x_i + b_j} )
  • where L_s represents the first loss function, x_i represents the i-th depth feature from the y_i-th object, W_v represents the v-th column of the connection layer weight matrix, b_j represents the bias term of the j-th class (each class corresponds to one object), M represents the training-set grouping (mini-batch) size corresponding to the speech information set to be trained, and N represents the number of objects corresponding to the speech information set to be trained.
  • the CPU 522 is specifically used to perform the following steps:
  • determine the depth feature of each piece of speech information through a CNN; calculate a depth feature gradient according to the depth feature of each piece of speech information; calculate a second speech mean based on the depth feature gradient and the first speech mean; and determine the second loss function according to the depth feature of each piece of speech information and the second speech mean.
  • the CPU 522 is specifically used to perform the following steps:
  • the depth feature gradient is calculated in the following manner:
  Δμ_j = ∑_{i=1}^{M} δ(y_i = j)·(μ_j − x_i) / ( 1 + ∑_{i=1}^{M} δ(y_i = j) )
  • where Δμ_j represents the depth feature gradient, M represents the training-set grouping (mini-batch) size corresponding to the speech information set to be trained, j represents the class (each class corresponds to one object), δ(·) is the indicator function, and y_i represents the y_i-th object;
  • the second speech mean is calculated in the following manner:
  μ_j^{t+1} = μ_j^{t} − α·Δμ_j^{t}
  • where t represents the time step, μ_j^{t+1} represents the second speech mean corresponding to time t+1, μ_j^{t} represents the first speech mean corresponding to time t, Δμ_j^{t} represents the depth feature gradient corresponding to time t, and α represents the learning rate parameter, the value range of α being greater than or equal to 0 and less than or equal to 1;
  • the second loss function is determined in the following manner:
  L_c = (1/2)·∑_{i=1}^{M} ||x_i − μ_{y_i}||_2^2
  • where L_c represents the second loss function, x_i represents the i-th depth feature from the y_i-th object, and μ_{y_i} represents the mean of the depth discriminative features from the y_i-th object.
  • the CPU 522 is specifically used to perform the following steps:
  • determine the first loss function according to the speech information corresponding to each object in the speech information set to be trained; determine the second loss function according to the speech information corresponding to each object in the speech information set to be trained; and determine the model adjustment function according to the first loss function and the second loss function.
  • the CPU 522 is specifically used to perform the following steps:
  • the model adjustment function is determined in the following manner:
  L_t = L_s + λ·L_c
  • where L_t represents the model adjustment function, L_s represents the first loss function, L_c represents the second loss function, and λ represents the control parameter.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of units is only a division of logical functions; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each of the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or software function unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solutions of the embodiments of the present application, in essence, or the part contributing to the related technology, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • This application acquires target speech information to be recognized and obtains target feature information of the target speech information through a voiceprint recognition model.
  • the voiceprint recognition model is obtained by training according to a first loss function and a second loss function, where the first loss function belongs to a normalized exponential function and the second loss function belongs to a centralization function; the voiceprint recognition result is determined according to the target feature information and the registered feature information, where the registered feature information is obtained after the speech information of the object to be recognized passes through the voiceprint recognition model.
  • jointly optimizing the voiceprint recognition model with the normalized exponential function and the centralization function reduces the intra-class variation among depth features from the same speaker; supervising and learning the model with the two functions at the same time makes the depth features more discriminative, thereby improving recognition performance.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voiceprint recognition method is provided, including: acquiring target speech information to be recognized (101); obtaining target feature information of the target speech information through a voiceprint recognition model, the voiceprint recognition model being trained according to a first loss function and a second loss function, the first loss function belonging to a normalized exponential function and the second loss function belonging to a centralization function (102); and determining a voiceprint recognition result according to the target feature information and registered feature information, the registered feature information being obtained after the speech information of the object to be recognized passes through the voiceprint recognition model (103). A model training method and a server are also provided. Supervising and learning the voiceprint recognition model with the two functions at the same time makes the deep features more discriminative, thereby improving recognition performance.

Description

一种声纹识别的方法、模型训练的方法以及服务器
本申请要求于2018年10月10日提交中国专利局、优先权号为2018111798564、发明名称为“一种声纹识别的方法、模型训练的方法以及服务器”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及人工智能技术领域,尤其涉及一种声纹识别的方法、模型训练的方法以及服务器。
背景技术
网络信息技术的高速发展使人们能够方便地获得各种信息,随之也产生了信息安全问题。由于越来越多的涉及信息安全保密的场所需要可靠的身份认证系统,因此基于指纹、虹膜、人脸、手写签名以及语音的身份认证技术都在应用需求的推动下得到了很大的发展。语音是身份信息的重要载体,与人脸和指纹等其他生物特征相比,语音的获取成本低廉,使用简单,便于远程数据采集,且基于语音的人机交流界面更为友好,因此说话人识别技术成为重要的自动身份认证技术。近年来,说话人识别技术在智能家居领域中的身份认证、语音支付及个性化推荐中有着越来越重要的应用价值。
目前,基于卷积神经网络(Convolutional Neural Network,简称CNN)训练得到的系统可以对说话人进行识别。这类系统通常对短语音截取固定时长的音频,将该音频转换为图片后输入至CNN网络进行训练,通过预定义的softmax损失函数来调整整个网络。
然而,基于softmax损失函数的系统,在训练过程中,容易出现过拟合现象,也就是在训练集上的性能表现较好,但是对于未训练过的测试集而言,其性能表现较差。
发明内容
本申请实施例提供了一种声纹识别的方法、模型训练的方法以及服务器,利用归一化指数函数和中心化函数对声纹识别模型进行联合优化,能够 减少来自同一说话人深度特征之间的类内变化。采用两种函数同时监督和学习声纹识别模型,可使深度特征具有更好的区分性,从而提升识别性能。
有鉴于此,本申请实施例的第一方面提供了一种声纹识别的方法,包括:
获取待识别的目标语音信息;
通过声纹识别模型获取上述目标语音信息的目标特征信息,其中,上述声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,上述第一损失函数属于归一化指数函数,上述第二损失函数属于中心化函数;
根据上述目标特征信息以及注册特征信息确定声纹识别结果,其中,上述注册特征信息为待识别对象的语音信息在通过上述声纹识别模型之后得到的。
本申请实施例的第二方面提供了一种模型训练的方法,包括:
获取待训练语音信息集合,其中,上述待训练语音信息集合包括至少一个对象所对应的语音信息;
根据上述待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,上述模型调节函数包括上述第一损失函数以及第二损失函数,上述第一损失函数属于归一化指数函数,上述第二损失函数属于中心化函数;
根据上述模型调节函数训练得到声纹识别模型。
本申请实施例的第三方面提供了一种服务器,包括括一个或多个处理器,以及一个或多个存储程序模块的存储器,其中,程序模块由处理器执行,该程序模块包括:
获取模块,用于获取待识别的目标语音信息;
上述获取模块,还用于通过声纹识别模型获取上述目标语音信息的目标特征信息,其中,上述声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,上述第一损失函数属于归一化指数函数,上述第二损失函数属于中心化函数;
确定模块,用于根据上述获取模块获取的上述目标特征信息以及注册特征信息确定声纹识别结果,其中,上述注册特征信息为待识别对象的语音信息在通过上述声纹识别模型之后得到的。
本申请实施例的第四方面提供了一种服务器,包括括一个或多个处理器,以及一个或多个存储程序模块的存储器,其中,程序模块由处理器执行,该程序模块包括:
获取模块,用于获取待训练语音信息集合,其中,上述待训练语音信息集合包括至少一个对象所对应的语音信息;
确定模块,用于根据上述获取模块获取的上述待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,上述模型调节函数包括上述第一损失函数以及第二损失函数,上述第一损失函数属于归一化指数函数,上述第二损失函数属于中心化函数;
训练模块,用于根据上述确定模块确定的上述模型调节函数训练得到声纹识别模型。
本申请实施例的第五方面提供了一种服务器,包括:存储器、收发器、处理器以及总线系统;
其中,上述存储器用于存储程序;
上述处理器用于执行上述存储器中的程序,包括如下步骤:
获取待识别的目标语音信息;
通过声纹识别模型获取上述目标语音信息的目标特征信息,其中,上述声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,上述第一损失函数属于归一化指数函数,上述第二损失函数属于中心化函数;
根据上述目标特征信息以及注册特征信息确定声纹识别结果,其中,上述注册特征信息为待识别对象的语音信息在通过上述声纹识别模型之后得到的;
上述总线系统用于连接上述存储器以及上述处理器,以使上述存储器以及上述处理器进行通信。
本申请实施例的第六方面提供了一种服务器,包括:存储器、收发器、处理器以及总线系统;
其中,上述存储器用于存储程序;
上述处理器用于执行上述存储器中的程序,包括如下步骤:
获取待训练语音信息集合,其中,上述待训练语音信息集合包括至少一 个对象所对应的语音信息;
根据上述待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,上述模型调节函数包括上述第一损失函数以及第二损失函数,上述第一损失函数属于归一化指数函数,上述第二损失函数属于中心化函数;
根据上述模型调节函数训练得到声纹识别模型;
上述总线系统用于连接上述存储器以及上述处理器,以使上述存储器以及上述处理器进行通信。
本申请实施例的第七方面提供了一种计算机可读存储介质,上述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述各方面上述的方法。
从以上技术方案可以看出,本申请实施例具有以下优点:
本申请实施例中,提供了一种声纹识别的方法,首先服务器获取待识别的目标语音信息,然后服务器通过声纹识别模型获取目标语音信息的目标特征信息,其中,声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数,服务器再根据目标特征信息以及注册特征信息确定声纹识别结果,其中,注册特征信息为待识别对象的语音信息在通过声纹识别模型之后得到的。通过上述方式,利用归一化指数函数和中心化函数对声纹识别模型进行联合优化,归一化指数函数作为损失函数,能够有效提升深度特征空间中不同说话人之间的区分性,而中心化函数作为损失函数,能够可选地减少来自同一说话人深度特征之间的类内变化。采用两种损失函数同时监督和学习声纹识别模型,可以使得深度特征具有更好的区分性,从而提升识别性能。
附图说明
图1为本申请实施例中声纹识别系统的一个架构示意图;
图2为本申请实施例中声纹识别的方法一个实施例示意图;
图3为本申请实施例中确定声纹识别结果的一个流程示意图;
图4为本申请实施例中基于余弦相似度确定声纹识别结果的一个示意图;
图5为本申请实施例中模型训练的方法一个实施例示意图;
图6为本申请实施例中对语音信息进行预处理的一个流程示意图;
图7为本申请实施例中卷积神经网络的一个总体结构示意图;
图8为本申请实施例中卷积神经网络的一个部分结构示意图;
图9为本申请实施例中验证集应用于不同网络的正确率对比示意图;
图10为本申请实施例中服务器的一个实施例示意图;
图11为本申请实施例中服务器的另一个实施例示意图;
图12为本申请实施例中服务器的一个结构示意图。
具体实施方式
本申请实施例提供了一种声纹识别的方法、模型训练的方法以及服务器,利用归一化指数函数和中心化函数对声纹识别模型进行联合优化,能够减少来自同一说话人深度特征之间的类内变化。采用两种函数同时监督和学习声纹识别模型,可使深度特征具有更好的区分性,从而提升识别性能。
本申请实施例的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本申请实施例的实施例例如能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
应理解,本申请实施例可以应用于声纹识别场景,根据识别任务的不同,说话人识别可以分为说话人辨认(Speaker Identification)和说话人确认(Speaker Verification)两类。说话人辨认的目标是判断一段待测语音为已知注册说话人集合中的哪一个,是一对多的识别问题。而说话人确认的目标是判断待测语音是否为已注册的一个目标说话人所说,是一对一的确认问题。说话人辨认在已注册的说话人范围内进行,属于闭集识别问题,随着注册人 数的增加,算法复杂度变大,系统性能下降。而说话人确认的每次测试只与一个目标说话人有关,是个开集识别问题,系统性能受人数多少影响不大。
其中,根据对语音信息的要求,说话人识别又可以分为与文本相关(Text-dependent)和与文本无关(Text-independent)两类。前者要求注册和测试语音具有相同的语义,应用于说话人比较配合的场所,由于相同的语义内容可以为识别系统提供更多的补充信息,所以这种类型的系统的识别效果较好,系统性能对语音时长的变化不敏感,在时长较短时,也能保持较高的准确性。而后者则不关注语音信号中的语义内容,和前者相比,限制因素较少,应用更灵活广泛,但由于语义内容不受限制,在训练和测试阶段会出现语音类失配的现象,这种类型的系统识别难度大且性能较差,要获得较好的识别性能,需要大量的训练语料。文本无关的说话人识别系统的性能随着测试语音的时长变短而快速下降,使得用户体验较差。
为了使得识别系统能够更好地适用于不同长度的语音信息,本申请实施例提出了一种说话人识别的方法。该方法应用于图1所示的识别系统,请参阅图1,图1为本申请实施例中识别系统的一个架构示意图,如图所示,用于可以通过终端设备发起声纹识别请求(比如说一段语音),服务器接收终端设备发送的声纹识别请求之后,根据训练得到的声纹识别模型可以对说话人进行确认,即判断说话人是否为已经注册过的说话人,由此生成声纹识别结果。需要说明的是,终端设备包含但不仅限于平板电脑、笔记本电脑、掌上电脑、手机以及个人电脑(personal computer,简称PC),此处不做限定。
也就是说,在本实施例提供的声纹识别系统中,可以包括但不限于:终端设备和服务器,其中,终端设备采集说话人的目标语音信息,然后将该目标语音信息发送给服务器。服务器在接收到目标语音信息之后,调用预先训练好的声纹识别模型,来对上述获取到的目标语音信息进行特征提取,以得到目标特征信息。然后,根据该目标特征信息以及注册特征信息来确定与上述目标语音信息对应的声纹识别结果。从而实现通过终端设备与服务器的交互过程,来完成对说话人的语音的声纹识别,以减轻终端设备的处理负荷,提高声纹识别的效率和准确性。
此外,上述声纹识别系统也可以包括单独的终端设备或服务器,由终端 设备或服务器独立完成上述声纹识别过程,本实施例中对此不再赘述。
下面将从服务器的角度,对本申请实施例中声纹识别的方法进行介绍,请参阅图2,本申请实施例中声纹识别的方法一个实施例包括:
101、服务器获取待识别的目标语音信息;
本实施例中,说话人通过终端设备发出一段语音,其中,这段语音即为待识别的目标语音信息,由终端设备将待识别的目标语音信息发送至服务器。
102、服务器通过声纹识别模型获取目标语音信息的目标特征信息,其中,声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数;
本实施例中,服务器将待识别的目标语音信息输入至声纹识别模型,然后由该声纹识别模型输出对应的目标特征信息,其中,声纹识别模型是由第一损失函数——归一化指数函数(softmax损失函数),以及第二损失函数——中心化函数(center损失函数)共同训练得到。
其中,损失函数度量的是预测值与真实值之间的差异,softmax损失函数。
103、服务器根据目标特征信息以及注册特征信息确定声纹识别结果,其中,注册特征信息为待识别对象的语音信息在通过声纹识别模型之后得到的。
本实施例中,在服务器在识别说话人的过程中,不但需要提取待识别语音信息的特征,还需要计算测试得分,最后根据测试得分确定声纹识别结果。为了便于介绍,请参阅图3,图3为本申请实施例中确定声纹识别结果的一个流程示意图,如图所示,声纹识别模型可以为训练好的卷积神经网络(Convolutional Neural Network,简称CNN),首先将注册语音和测试语音分割为较小的语音片段序列,如果语音太短,就采用拼接方式生成合适时长的语音片段,将语音片段输入至声纹识别模型。然后通过统计平均层得到注册语音所对应的注册特征信息,并且通过统计平均层得到测试语音所对应的目标特征信息。这里的注册特征信息以及目标特征信息均属于句子水平的深度特征。接下来,L2-Norm层可选地对注册特征信息以及目标特征信息进行规整,其中,L2-Norm层是指欧几里德距离之和。最后采用余弦距离或者概率线性判别分析(Probabilistic Linear Discriminant Analysis,简称PLDA)分类器来计算测试得分。
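Following the scoring flow described above (segment the utterance, extract segment-level features, average them through the statistics-average layer, and L2-normalize before scoring), here is a minimal illustrative sketch; the model call and the array shapes are assumptions made for illustration only.

```python
# Hedged sketch: utterance-level embedding = mean of segment embeddings,
# then L2 normalization, as in the statistics-average and L2-Norm layers.
import numpy as np

def utterance_embedding(segments, model):
    """segments: a list of fixed-length speech segments prepared as model inputs."""
    seg_embs = np.stack([model(seg) for seg in segments])  # (n_segments, D)
    emb = seg_embs.mean(axis=0)                            # statistics-average layer
    return emb / np.linalg.norm(emb)                       # L2-Norm layer
```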
本申请实施例中,提供了一种声纹识别的方法,首先服务器获取待识别的目标语音信息,然后服务器通过声纹识别模型获取目标语音信息的目标特征信息,其中,声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数,服务器再根据目标特征信息以及注册特征信息确定声纹识别结果,其中,注册特征信息为待识别对象的语音信息在通过声纹识别模型之后得到的。通过上述方式,利用归一化指数函数和中心化函数对声纹识别模型进行联合优化,归一化指数函数作为损失函数,能够有效提升深度特征空间中不同说话人之间的区分性,而中心化函数作为损失函数,能够可选地减少来自同一说话人深度特征之间的类内变化。采用两种损失函数同时监督和学习声纹识别模型,可以使得深度特征具有更好的区分性,从而提升识别性能。
可选地,在上述图2对应的实施例的基础上,本申请实施例提供模型训练的方法第一个可选实施例中,根据目标特征信息以及注册特征信息确定声纹识别结果,可以包括:
服务器根据目标特征信息以及注册特征信息计算余弦相似度;
若余弦相似度达到第一相似度阈值,则服务器确定目标语音信息属于待识别对象的语音信息;
若余弦相似度未达到第一相似度阈值,则服务器确定目标语音信息不属于待识别对象的语音信息。
本实施例中,提供了一种判断说话人是否属于已经注册过的一个说话人的方法。可选地,利于余弦相似度进行评分的实现过程为,对于得到的注册特征信息而言,如果是训练数据得到的,将属于同一个对象的特征信息归为一类,并计算出这一类的平均值,该平均值即为注册特征信息。对于需要评分的目标特征信息而言,可以计算出两个特征信息的余弦相似度,根据余弦相似度确定识别结果。
为了便于介绍,请参阅图4,图4为本申请实施例中基于余弦相似度确定声纹识别结果的一个示意图,如图所示,先求得向量a和向量b的夹角θ,并得出夹角θ对应的余弦值cosθ,此余弦值就可以用来表征这两个向量的相似性。夹角越小,余弦值越接近于1,它们的方向更加吻合,则越相似。余弦相 似度即用向量空间中两个向量夹角的余弦值作为衡量两个个体间差异的大小。相比距离度量,余弦相似度更加注重两个向量在方向上的差异,而非距离或长度上。
如果余弦相似度(比如0.9)达到第一相似度阈值(比如0.8),则服务器确定目标语音信息属于待识别对象的语音信息。如果余弦相似度(比如0.7)未达到第一相似度阈值(比如0.8),则服务器确定目标语音信息不属于待识别对象的语音信息。
需要说明的是,在实际应用中,除了可以采用上述介绍的余弦相似度确定声纹识别结果以外,还可以采用欧几里得距离、明可夫斯基距离、曼哈顿距离、切比雪夫距离、马哈拉诺比斯距离、皮尔森相关系数或者Jaccard相似系数进行相似度检测。
其次,本申请实施例中,在服务器根据目标特征信息以及注册特征信息确定声纹识别结果的过程中,可以先根据目标特征信息以及注册特征信息计算余弦相似度,若余弦相似度达到第一相似度阈值,则服务器确定目标语音信息属于待识别对象的语音信息。若余弦相似度未达到第一相似度阈值,则服务器确定目标语音信息不属于待识别对象的语音信息。通过上述方式,余弦相似度是从方向上区分差异,主要用于采用用户对内容评分来区分用户的相似度和差异,同时修正了用户间可能存在的度量标准不统一的问题,从而有利于提升声纹识别结果的可靠性。
可选地,在上述图2对应的实施例的基础上,本申请实施例提供模型训练的方法第二个可选实施例中,根据目标特征信息以及注册特征信息确定声纹识别结果,可以包括:
服务器通过PLDA分类器计算目标特征信息与注册特征信息之间的对数似然比;
若对数似然比达到第二相似度阈值,则服务器确定目标语音信息属于待识别对象的语音信息;
若对数似然比未达到第二相似度阈值,则服务器确定目标语音信息不属于待识别对象的语音信息。
本实施例中,提供了另一种判断说话人是否属于已经注册过的一个说话 人的方法。具体地,利于PLDA分类器进行评分的实现过程为,
在声纹识别领域中,我们假设训练数据语音由I个说话人的语音组成,其中,每个说话人有J段自己不同的语音。那么,我们定义第i个说话人的第j条语音为X ij,然后,根据因子分析,定义X ij的生成模型为:
x_ij = u + F·h_i + G·w_ij + ε_ij
这个模型可以看成两个部分,等号右边前两项只跟说话人有关而跟说话人的具体某一条语音无关,称为信号部分,这描述了说话人类间的差异。等号右边后两项描述了同一说话人的不同语音之间的差异,称为噪音部分。这样,我们用了这样两个假想变量来描述一条语音的数据结构。这两个矩阵F和G包含了各自假想变量空间中的基本因子,这些因子可以看做是各自空间的特征向量。比如,F的每一列就相当于类间空间的特征向量,G的每一列相当于类内空间的特征向量。而两个向量可以看做是分别在各自空间的特征表示,比如h i就可以看做是X ij在说话人空间中的特征表示。在识别打分阶段,如果两条语音的h i特征相同的似然度越大,那么这两条语音就更确定地属于同一个说话人。
PLDA的模型参数一个有4个,分别是数据均值u,空间特征矩阵F和G,噪声协方差ε。模型的训练过程采用经典的最大期望算法迭代求解。
其次,本申请实施例中,在服务器根据目标特征信息以及注册特征信息确定声纹识别结果的过程中,可以先通过PLDA分类器计算目标特征信息与注册特征信息之间的对数似然比,若对数似然比达到第二相似度阈值,则服务器确定目标语音信息属于待识别对象的语音信息。若对数似然比未达到第二相似度阈值,则服务器确定目标语音信息不属于待识别对象的语音信息。通过上述方式,采用PLDA作为信道补偿算法,其信道补偿能力比传统的线性判别分析分类器更好,从而有利于提升声纹识别结果的可靠性。
下面将从服务器的角度,对本申请实施例中模型训练的方法进行介绍,请参阅图5,本申请实施例中模型训练的方法一个实施例包括:
201、服务器获取待训练语音信息集合,其中,待训练语音信息集合包括至少一个对象所对应的语音信息;
本实施例中,首先由服务器获取待训练的语音信息集合,在该语音信息 集合中需要包含至少一个对象的语音信息。
可选地,服务器还需要对待训练语音信息集合中的各个语音信息进行预处理,请参阅图6,图6为本申请实施例中对语音信息进行预处理的一个流程示意图,如图所示,可选地:
步骤S1中,首先需要对待训练语音信息集合中的各个语音信息进行语音活动检测(Voice Activity Detection,简称VAD),目的是从声音信号流里识别和消除长时间的静音期,以达到在不降低业务质量的情况下节省话路资源的作用,可以有利于减少用户感觉到的端到端的时延。
在进行静音检测时有两个问题需要注意:一是背景噪声问题,即如何在较大的背景噪声中检测静音;二是前后沿剪切问题。所谓前后沿剪切就是还原语音时,由于从实际讲话开始到检测到语音之间有一定的判断门限和时延,有时语音波形的开始和结束部分会作为静音被丢掉,还原的语音会出现变化,因此需要在突发语音分组前面或后面增加一个语音分组进行平滑以解决这一问题。
步骤S2中,预加重的目的是提升高频部分,对语音信息的高频部分进行加重,去除口唇辐射的影响,增加语音的高频分辨率使信号的频谱变得平坦,保持在低频到高频的整个频带中,能用同样的信噪比求频谱。原因是因为对于语音信号来说,语音的低频段能量较大,能量主要分布在低频段,语音的功率谱密度随频率的增高而下降,这样,鉴频器输出就会高频段的输出信噪比明显下降,从而导致高频传输衰弱,使高频传输困难,这对信号的质量会带来很大的影响。因此,在传输之前把信号的高频部分进行加重,然后接收端再去重,能够提高信号传输质量。
步骤S3中,每帧信号通常要与一个平滑的窗函数相乘,让帧两端平滑地衰减到零,这样可以降低傅里叶变换后旁瓣的强度,取得更高质量的频谱。对每一帧,选择一个窗函数,窗函数的宽度就是帧长。常用的窗函数有矩形窗、汉明窗、汉宁窗以及高斯窗等。
步骤S4中,由于信号在时域上的变换通常很难看出信号的特性,所以通常将它转换为频域上的能量分布来观察,不同的能量分布,就能代表不同语音的特性。所以在乘上汉明窗后,每帧还需再经过快速傅里叶变换以得到在 频谱上的能量分布。
步骤S5中,在经过快速傅里叶变换之后,将其通过一组梅尔(Maya Embedded Language,简称Mel)滤波器就得到Mel频谱,再取对数,至此完成对语音信息的预处理,从而生成特征向量。
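A hedged sketch of the front-end steps S2–S5 above (pre-emphasis, Hamming windowing, FFT, Mel filter bank, and logarithm) follows; the VAD step S1 is omitted, and librosa together with all parameter values (sampling rate, frame settings, 120 Mel bands) are assumptions for illustration, not the application's prescribed implementation.

```python
# Hedged sketch: log-Mel feature extraction corresponding to steps S2-S5.
import numpy as np
import librosa

def log_mel_features(wav, sr=16000, n_fft=400, hop=160, n_mels=120, pre_emph=0.97):
    # S2: pre-emphasis to boost the high-frequency band
    wav = np.append(wav[0], wav[1:] - pre_emph * wav[:-1])
    # S3 + S4: Hamming-windowed framing and FFT inside the Mel spectrogram
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels,
                                         window="hamming", power=2.0)
    # S5: take the logarithm of the Mel spectrum
    return np.log(mel + 1e-6)
```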
202、根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,模型调节函数包括第一损失函数以及第二损失函数,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数;
本实施例中,服务器根据经过预处理之后的语音信息生成第一损失函数和第二损失函数,联合第一损失函数和第二损失函数得到模型调节函数,利用模型调节函数可以对声纹识别模型进行调节。
其中,声纹识别模型是由第一损失函数——归一化指数函数(softmax损失函数),以及第二损失函数——中心化函数(center损失函数)共同训练得到。
203、服务器根据模型调节函数训练得到声纹识别模型。
本实施例中,服务器根据得到的模型调节函数训练和学习得到声纹识别模型。并且在收到待识别的语音识别信息之后,将将待识别的目标语音信息输入至声纹识别模型,然后由该声纹识别模型输出对应的目标特征信息。
本申请实施例中,提供了一种模型训练的方法,即服务器先获取待训练语音信息集合,其中,待训练语音信息集合包括至少一个对象所对应的语音信息,然后,服务器根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,模型调节函数包括第一损失函数以及第二损失函数,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数。最后,根据模型调节函数训练得到声纹识别模型。通过上述方式,利用归一化指数函数和中心化函数对声纹识别模型进行联合优化,归一化指数函数作为损失函数,能够有效提升深度特征空间中不同说话人之间的区分性,而中心化函数作为损失函数,能够进一步减少来自同一说话人深度特征之间的类内变化。采用两种损失函数同时监督和学习声纹识别模型,可以使得深度特征具有更好的区分性,从而提升识别性能。
可选地,在上述图5对应的实施例的基础上,本申请实施例提供模型训 练的方法第一个可选实施例中,服务器根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,可以包括:
服务器通过CNN确定每个语音信息的深度特征;
服务器根据待训练语音信息集合中每个对象所对应的语音信息获取连接层权重矩阵;
服务器根据每个语音信息的深度特征以及连接层权重矩阵确定第一损失函数。
本实施例中,服务器采用基于Inception-ResNet结构的深度CNN生成模型调节函数。为了便于理解,请参阅图7,图7为本申请实施例中卷积神经网络的一个总体结构示意图,如图所示,整个结构中包含了子模块Inception-ResNet-A、Inception-ResNet-B、Inception-ResNet-C、Reduction-A以及Reduction-B。其中,对于模块A1和模块A2而言,具体包括了如图8所示的结构,请参阅图8,图8为本申请实施例中卷积神经网络的一个部分结构示意图,考虑到输入语音信息的特点,在第一个卷积层采用了非对称卷积核,由此可以对时间轴方向做更大幅度的卷积。
基于改进的Inception-ResNet结构学习整句话的深度特征,在训练过程中,对每条语音截取固定时长的语音段以图片形式作为网络输入,结合给定训练好的网络,句子水平的说话人特征通过对输入语音段对应的说话人特征计算平均值得到。
其次,本申请实施例中,服务器根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数的方式可以为,先通过CNN确定每个语音信息的深度特征,然后根据待训练语音信息集合中每个对象所对应的语音信息获取连接层权重矩阵,最后,根据每个语音信息的深度特征以及连接层权重矩阵确定第一损失函数。通过上述方式,为方案的实现提供了可行的实现方式,从而提升了方案的实用性和可行性。
可选地,在上述图5对应的第一个实施例的基础上,本申请实施例提供模型训练的方法第二个可选实施例中,服务器根据每个语音信息的深度特征以及连接层权重矩阵确定第一损失函数,可以包括:
服务器采用如下方式确定第一损失函数:
L_s = -∑_{i=1}^{M} log( e^{W_{y_i}^T·x_i + b_{y_i}} / ∑_{j=1}^{N} e^{W_j^T·x_i + b_j} )
其中,L s表示第一损失函数,x i表示来自第y i个对象的第i个深度特征,W v表示连接层权重矩阵中的第v列,b j表示第j类的偏差项,且每一类对应一个对象,M表示待训练语音信息集合所对应的训练集分组大小,N表示待训练语音信息集合所对应的对象个数。
本实施例中,介绍了一种计算第一损失函数的具体的方式。即采用如下公式进行计算:
L_s = -∑_{i=1}^{M} log( e^{W_{y_i}^T·x_i + b_{y_i}} / ∑_{j=1}^{N} e^{W_j^T·x_i + b_j} )
其中,log函数的输入就是softmax的结果,而L s表示的是softmax损失的结果,而wx+b表示全连接层的输出,因此,log的输入就表示x i属于y i类别的概率。
再次,本申请实施例中,提供了一种获取第一损失函数的具体方式,即服务器根据每个语音信息的深度特征以及连接层权重矩阵确定第一损失函数。通过上述方式,提升了方案的可行性和可操作性。
可选地,在上述图5对应的实施例的基础上,本申请实施例提供模型训练的方法第三个可选实施例中,服务器根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,可以包括:
服务器通过CNN确定每个语音信息的深度特征;
服务器根据每个语音信息的深度特征计算深度特征梯度;
服务器根据深度特征梯度以及第一语音均值计算第二语音均值;
服务器根据每个语音信息的深度特征以及第二语音均值,确定第二损失函数。
本实施例中,将介绍第二损失函数——center损失函数的确定方式。在确定第二损失函数的过程中,需要利用小批量(mini-batch)梯度下降法,每次只拿总训练集的一小部分来训练,比如一共有5000个样本,每次拿100个样 本来计算损失,然后更新参数。50次后完成整个样本集的训练,即为一轮训练。由于每次更新用了多个样本来计算损失,就使得损失的计算和参数的更新更加具有代表性。损失的下降更加稳定,同时小批量的计算,也减少了计算资源的占用。
在梯度下降法的求解过程中,只需求解损失函数的一阶导数,计算的代价比较小,这使得梯度下降法能在很多大规模数据集上得到应用。梯度下降法的含义是通过当前点的梯度方向寻找到新的迭代点。
其次,本申请实施例中,提供了一种获取第二损失函数的方式,即服务器根据每个语音信息的深度特征计算深度特征梯度,根据深度特征梯度以及第一语音均值计算第二语音均值,最后根据每个语音信息的深度特征以及第二语音均值,确定第二损失函数。通过上述方式,能够为方案的实现提供合理的依据,从而提升方案的可行性和实用性。
可选地,在上述图5对应的第三个实施例的基础上,本申请实施例提供模型训练的方法第四个可选实施例中,服务器根据每个语音信息的深度特征计算深度特征梯度,可以包括:
采用如下方式计算深度特征梯度:
Δμ_j = ∑_{i=1}^{M} δ(y_i = j)·(μ_j − x_i) / ( 1 + ∑_{i=1}^{M} δ(y_i = j) )
其中,Δμ j表示深度特征梯度,M表示待训练语音信息集合所对应的训练集分组大小,j表示类,且每一类对应一个对象,y i表示第y i个对象;
根据深度特征梯度以及第一语音均值计算第二语音均值,可以包括:
采用如下方式计算第二语音均值:
μ_j^{t+1} = μ_j^{t} − α·Δμ_j^{t}
其中，t表示时刻，μ_j^{t+1}表示t+1时刻所对应的第二语音均值，μ_j^{t}表示t时刻所对应的第一个语音均值，Δμ_j^{t}表示t时刻所对应的深度特征梯度，α表示学习速率参数，且α的取值范围为大于或等于0，且小于或等于1；
根据每个语音信息的深度特征以及第二语音均值,确定第二损失函数,可以包括:
采用如下方式确定第二损失函数:
L_c = (1/2)·∑_{i=1}^{M} ||x_i − μ_{y_i}||_2^2
其中,L c表示第二损失函数,x i表示来自第y i个对象的第i个深度特征,μ yi表示来自y i的深度区分特征均值。
本实施例中,介绍了一种计算第二损失函数的具体的方式。即服务器采用如下公式进行计算:
L_c = (1/2)·∑_{i=1}^{M} ||x_i − μ_{y_i}||_2^2
其中,μ yi代表来自说话人y i的深度区分特征的均值。需要说明的是,各类均值是随着小批量(mini-batch)单位进行更新的。在每个训练迭代步中,小批量中所出现的说话人的深度特征用于更新相应说话人的均值。均值更新的公式如下所示:
Δμ_j = ∑_{i=1}^{M} δ(y_i = j)·(μ_j − x_i) / ( 1 + ∑_{i=1}^{M} δ(y_i = j) )
μ_j^{t+1} = μ_j^{t} − α·Δμ_j^{t}
其中,center损失函数关于x i的梯度为Δμ j。一个batch中的每个样本的特征离特征的中心的距离的平方和要越小越好,也就是类内距离要越小越好。这就是center loss。
再次,本申请实施例中,提供了一种获取第二损失函数的具体方式。通过上述方式,从而提升方案的可行性和可操作性。
可选地,在上述图5以及图5对应的第一个至第四个实施例中任一项的基础上,本申请实施例提供模型训练的方法第五个可选实施例中,服务器根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,可以包括:
服务器根据待训练语音信息集合中每个对象所对应的语音信息确定第一损失函数;
服务器根据待训练语音信息集合中每个对象所对应的语音信息确定第二损失函数;
服务器根据第一损失函数以及第二损失函数,确定模型调节函数。
本实施例中,在服务器获取第一损失函数以及第二损失函数之后,将第一损失函数与第二损失函数进行联合处理,从而得到模型调节函数。
可选地,这里的第一损失函数为softmax损失函数,第二损失函数为center损失函数。如果只采用softmax损失函数来求损失,无论是训练数据集还是测试数据集,都能看出比较清晰的类别界限。如果在softmax损失函数的基础上在加入center损失函数,那么类间距离变大了,类内距离减少了。
可选地,本申请实施例中,服务器根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,具体可以为,服务器先根据待训练语音信息集合中每个对象所对应的语音信息确定第一损失函数,然后服务器根据待训练语音信息集合中每个对象所对应的语音信息确定第二损失函数,最后根据第一损失函数以及第二损失函数,确定模型调节函数。通过上述方式,可以提升方案的可行性和可操作性。
可选地,在上述图5对应的第五个实施例的基础上,本申请实施例提供模型训练的方法第六个可选实施例中,根据第一损失函数以及第二损失函数,确定模型调节函数,可以包括:
服务器采用如下方式确定模型调节函数:
L_t = L_s + λ·L_c
其中,L t表示模型调节函数,L s表示第一损失函数,L c表示第二损失函数,λ表示控制参数。
本实施例中,介绍了一种计算模型调节函数的具体的方式。即服务器采用如下公式进行计算:
L_t = L_s + λ·L_c = -∑_{i=1}^{M} log( e^{W_{y_i}^T·x_i + b_{y_i}} / ∑_{j=1}^{N} e^{W_j^T·x_i + b_j} ) + (λ/2)·∑_{i=1}^{M} ||x_i − μ_{y_i}||_2^2
其中,本申请实施例所用损失函数为第一损失函数(softmax损失函数)和第二损失函数(center损失函数)的线性组合,第一损失函数的权重为1,第二损失函数的权重为λ。这里的M表示小批量(mini-batch)所包含的样本 数量,N表示类别数。
可选地,本申请实施例中,介绍了一种根据第一损失函数以及第二损失函数,服务器确定模型调节函数的具体计算方式。通过上述方式,采用控制参数可以控制第一损失函数和第二损失函数之间的比重,从而有利于提升计算的可靠性,并且服务器能够根据不同的应用进行调整,进而提升方案的灵活性。
为了验证本申请实施例提供的声纹识别方法的应用效果,在大数据集上进行了验证对比,该数据集包含了来自2500说话人的760220句话,每个说话人平均有300句,该数据的平均时长为2.6s。我们将数据集分成训练集、验证集、和测试集三个部分。为了便于理解,请参阅表1,表1为不同网络的配置情况,
表1
其中,对于Inception-ResNet-v1网络,为了能保证该网络的正常训练,我们采用120维的log-mel特征作为该网络的输入。请参阅图9,图9为本申请实施例中验证集应用于不同网络的正确率对比示意图,如图所示,采用两种损失函数同时优化网络训练比单独的softmax损失更好,而且本申请实施例的网络结构可以在输入特征维度更小,输入语音最小时长最短的情况下在验证集上达到最高的正确率。
下面将对本申请实施例所提供的系统与基于深度神经网络(Deep Neural Network,简称DNN)/身份认证矢量(identity vector,简称i-vector)的系统进行对比,请参阅表2,表2为本申请实施例提供的系统与DNN/i-vector系统 的性能比较示意。
表2
从表2中可以看出,本申请实施例提供的声纹识别方法在短语音情况下明显优于现有DNN/I-vector方法,在长语音情况下,与DNN/I-vector的性能的差别不大,但是对于短语音情况而言,基于深度区分特征的说话人识别系统不需要繁杂的流程设计,因此,提升了方案的应用效率。
下面对本申请实施例中的服务器进行详细描述,请参阅图10,图10为本申请实施例中服务器一个实施例示意图,服务器30包括一个或多个处理器,以及一个或多个存储程序模块的存储器,其中,程序模块由处理器执行,该程序模块包括:
获取模块301,用于获取待识别的目标语音信息;
获取模块301,还用于通过声纹识别模型获取目标语音信息的目标特征信息,其中,声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数;
确定模块302,用于根据获取模块301获取的目标特征信息以及注册特征信息确定声纹识别结果,其中,注册特征信息为待识别对象的语音信息在通过声纹识别模型之后得到的。
本实施例中,获取模块301,获取待识别的目标语音信息,获取模块301通过声纹识别模型获取目标语音信息的目标特征信息,其中,声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数,确定模块302根据获取模块301获取的目标特征信息以及注册特征信息确定声纹识别结果,其中,注册特征信息为待识别对象的语音信息在通过声纹识别模型之后得到的。
本申请实施例中,提供了一种服务器,首先服务器获取待识别的目标语音信息,然后服务器通过声纹识别模型获取目标语音信息的目标特征信息, 其中,声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数,服务器再根据目标特征信息以及注册特征信息确定声纹识别结果,其中,注册特征信息为待识别对象的语音信息在通过声纹识别模型之后得到的。通过上述方式,利用归一化指数函数和中心化函数对声纹识别模型进行联合优化,归一化指数函数作为损失函数,能够有效提升深度特征空间中不同说话人之间的区分性,而中心化函数作为损失函数,能够可选地减少来自同一说话人深度特征之间的类内变化。采用两种损失函数同时监督和学习声纹识别模型,可以使得深度特征具有更好的区分性,从而提升识别性能。
可选地,在上述图10所对应的实施例的基础上,本申请实施例提供的服务器30的另一实施例中,
确定模块302,具体用于根据目标特征信息以及注册特征信息计算余弦相似度;
若余弦相似度达到第一相似度阈值,则确定目标语音信息属于待识别对象的语音信息;
若余弦相似度未达到第一相似度阈值,则确定目标语音信息不属于待识别对象的语音信息。
其次,本申请实施例中,在服务器根据目标特征信息以及注册特征信息确定声纹识别结果的过程中,可以先根据目标特征信息以及注册特征信息计算余弦相似度,若余弦相似度达到第一相似度阈值,则确定目标语音信息属于待识别对象的语音信息。若余弦相似度未达到第一相似度阈值,则服务器确定目标语音信息不属于待识别对象的语音信息。通过上述方式,余弦相似度是从方向上区分差异,主要用于采用用户对内容评分来区分用户的相似度和差异,同时修正了用户间可能存在的度量标准不统一的问题,从而有利于提升声纹识别结果的可靠性。
可选地,在上述图10所对应的实施例的基础上,本申请实施例提供的服务器30的另一实施例中,
确定模块302,具体用于通过PLDA分类器计算目标特征信息与注册特征信息之间的对数似然比;
若对数似然比达到第二相似度阈值,则确定目标语音信息属于待识别对象的语音信息;
若对数似然比未达到第二相似度阈值,则确定目标语音信息不属于待识别对象的语音信息。
其次,本申请实施例中,在服务器根据目标特征信息以及注册特征信息确定声纹识别结果的过程中,可以先通过PLDA分类器计算目标特征信息与注册特征信息之间的对数似然比,若对数似然比达到第二相似度阈值,则确定目标语音信息属于待识别对象的语音信息。若对数似然比未达到第二相似度阈值,则服务器确定目标语音信息不属于待识别对象的语音信息。通过上述方式,采用PLDA作为信道补偿算法,其信道补偿能力比传统的线性判别分析分类器更好,从而有利于提升声纹识别结果的可靠性。
下面对本申请实施例中的服务器进行详细描述,请参阅图11,图11为本申请实施例中服务器一个实施例示意图,服务器40包括一个或多个处理器,以及一个或多个存储程序模块的存储器,其中,程序模块由处理器执行,该程序模块包括:
获取模块401,用于获取待训练语音信息集合,其中,待训练语音信息集合包括至少一个对象所对应的语音信息;
确定模块402,用于根据获取模块401获取的待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,模型调节函数包括第一损失函数以及第二损失函数,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数;
训练模块403,用于根据确定模块402确定的模型调节函数训练得到声纹识别模型。
本实施例中,获取模块401获取待训练语音信息集合,其中,待训练语音信息集合包括至少一个对象所对应的语音信息,确定模块402根据获取模块401获取的待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,模型调节函数包括第一损失函数以及第二损失函数,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数,训练模块403根据确定模块402确定的模型调节函数训练得到声纹识别模型。
本申请实施例中,提供了一种模型训练的方法,即服务器先获取待训练语音信息集合,其中,待训练语音信息集合包括至少一个对象所对应的语音信息,然后,服务器根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,模型调节函数包括第一损失函数以及第二损失函数,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数。最后,根据模型调节函数训练得到声纹识别模型。通过上述方式,利用归一化指数函数和中心化函数对声纹识别模型进行联合优化,归一化指数函数作为损失函数,能够有效提升深度特征空间中不同说话人之间的区分性,而中心化函数作为损失函数,能够进一步减少来自同一说话人深度特征之间的类内变化。采用两种损失函数同时监督和学习声纹识别模型,可以使得深度特征具有更好的区分性,从而提升识别性能。
可选地,在上述图11所对应的实施例的基础上,本申请实施例提供的服务器40的另一实施例中,
确定模块402,具体用于通过卷积神经网络CNN确定每个语音信息的深度特征;
根据待训练语音信息集合中每个对象所对应的语音信息获取连接层权重矩阵;
根据每个语音信息的深度特征以及连接层权重矩阵确定第一损失函数。
其次,本申请实施例中,服务器根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数的方式可以为,先通过卷积神经网络CNN确定每个语音信息的深度特征,然后根据待训练语音信息集合中每个对象所对应的语音信息获取连接层权重矩阵,最后,根据每个语音信息的深度特征以及连接层权重矩阵确定第一损失函数。通过上述方式,为方案的实现提供了可行的实现方式,从而提升了方案的实用性和可行性。
可选地,在上述图11所对应的实施例的基础上,本申请实施例提供的服务器40的另一实施例中,
确定模块402,具体用于采用如下方式确定第一损失函数:
L_s = -∑_{i=1}^{M} log( e^{W_{y_i}^T·x_i + b_{y_i}} / ∑_{j=1}^{N} e^{W_j^T·x_i + b_j} )
其中,L s表示第一损失函数,x i表示来自第y i个对象的第i个深度特征,W v表示连接层权重矩阵中的第v列,b j表示第j类的偏差项,且每一类对应一个对象,M表示待训练语音信息集合所对应的训练集分组大小,N表示待训练语音信息集合所对应的对象个数。
再次,本申请实施例中,提供了一种获取第一损失函数的具体方式,即根据每个语音信息的深度特征以及连接层权重矩阵确定第一损失函数。通过上述方式,提升方案的可行性和可操作性。
可选地,在上述图11所对应的实施例的基础上,本申请实施例提供的服务器40的另一实施例中,
确定模块402,具体用于通过卷积神经网络CNN确定每个语音信息的深度特征;
根据每个语音信息的深度特征计算深度特征梯度;
根据深度特征梯度以及第一语音均值计算第二语音均值;
根据每个语音信息的深度特征以及第二语音均值,确定第二损失函数。
其次,本申请实施例中,提供了一种获取第二损失函数的方式,即服务器根据每个语音信息的深度特征计算深度特征梯度,根据深度特征梯度以及第一语音均值计算第二语音均值,最后根据每个语音信息的深度特征以及第二语音均值,确定第二损失函数。通过上述方式,能够为方案的实现提供合理的依据,从而提升方案的可行性和实用性。
可选地,在上述图11所对应的实施例的基础上,本申请实施例提供的服务器40的另一实施例中,
确定模块402,具体用于采用如下方式计算深度特征梯度:
Δμ_j = ∑_{i=1}^{M} δ(y_i = j)·(μ_j − x_i) / ( 1 + ∑_{i=1}^{M} δ(y_i = j) )
其中,Δμ j表示深度特征梯度,M表示待训练语音信息集合所对应的训练集分组大小,j表示类,且每一类对应一个对象,y i表示第y i个对象;
采用如下方式计算第二语音均值:
μ_j^{t+1} = μ_j^{t} − α·Δμ_j^{t}
其中，t表示时刻，μ_j^{t+1}表示t+1时刻所对应的第二语音均值，μ_j^{t}表示t时刻所对应的第一个语音均值，Δμ_j^{t}表示t时刻所对应的深度特征梯度，α表示学习速率参数，且α的取值范围为大于或等于0，且小于或等于1；
采用如下方式确定第二损失函数:
L_c = (1/2)·∑_{i=1}^{M} ||x_i − μ_{y_i}||_2^2
其中,L c表示第二损失函数,x i表示来自第y i个对象的第i个深度特征,μ yi表示来自y i的深度区分特征均值。
再次,本申请实施例中,提供了一种获取第二损失函数的具体方式。通过上述方式,从而提升方案的可行性和可操作性。
可选地,在上述图11所对应的实施例的基础上,本申请实施例提供的服务器40的另一实施例中,
确定模块402,具体用于根据待训练语音信息集合中每个对象所对应的语音信息确定第一损失函数;
根据待训练语音信息集合中每个对象所对应的语音信息确定第二损失函数;
根据第一损失函数以及第二损失函数,确定模型调节函数。
可选地,本申请实施例中,服务器根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,具体可以为,服务器先根据待训练语音信息集合中每个对象所对应的语音信息确定第一损失函数,然后服务器根据待训练语音信息集合中每个对象所对应的语音信息确定第二损失函数,最后根据第一损失函数以及第二损失函数,确定模型调节函数。通过上述方式,可以提升方案的可行性和可操作性。
可选地,在上述图11所对应的实施例的基础上,本申请实施例提供的服务器40的另一实施例中,
确定模块402,具体用于采用如下方式确定模型调节函数:
L_t = L_s + λ·L_c
其中,L t表示模型调节函数,L s表示第一损失函数,L c表示第二损失函数,λ表示控制参数。
可选地,本申请实施例中,介绍了一种根据第一损失函数以及第二损失函数,服务器确定模型调节函数的具体计算方式。通过上述方式,采用控制参数可以控制第一损失函数和第二损失函数之间的比重,从而有利于提升计算的可靠性,并且能够根据不同的应用进行调整,进而提升方案的灵活性。
图12是本申请实施例提供的一种服务器结构示意图,该服务器500可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,简称CPU)522(例如,一个或一个以上处理器)和存储器532,一个或一个以上存储应用程序542或数据544的存储介质530(例如一个或一个以上海量存储设备)。其中,存储器532和存储介质530可以是短暂存储或持久存储。存储在存储介质530的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器522可以设置为与存储介质530通信,在服务器500上执行存储介质530中的一系列指令操作。
服务器500还可以包括一个或一个以上电源526,一个或一个以上有线或无线网络接口550,一个或一个以上输入输出接口558,和/或,一个或一个以上操作系统541,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
上述实施例中由服务器所执行的步骤可以基于该图12所示的服务器结构。
本申请实施例中,CPU 522用于执行如下步骤:
获取待识别的目标语音信息;
通过声纹识别模型获取所述目标语音信息的目标特征信息,其中,所述声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,所述第一损失函数属于归一化指数函数,所述第二损失函数属于中心化函数;
根据目标特征信息以及注册特征信息确定声纹识别结果,其中,注册特征信息为待识别对象的语音信息在通过声纹识别模型之后得到的。
可选地,本申请实施例中,CPU 522具体用于执行如下步骤:
根据所述目标特征信息以及注册特征信息计算余弦相似度;
若余弦相似度达到第一相似度阈值,则确定目标语音信息属于待识别对象的语音信息;
若余弦相似度未达到第一相似度阈值,则确定目标语音信息不属于待识别对象的语音信息。
可选地,本申请实施例中,CPU 522具体用于执行如下步骤:
通过PLDA分类器计算目标特征信息与注册特征信息之间的对数似然比;
若对数似然比达到第二相似度阈值,则确定目标语音信息属于待识别对象的语音信息;
若对数似然比未达到第二相似度阈值,则确定目标语音信息不属于待识别对象的语音信息。
本申请实施例中,CPU 522用于执行如下步骤:
获取待训练语音信息集合,其中,待训练语音信息集合包括至少一个对象所对应的语音信息;
根据待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,模型调节函数包括第一损失函数以及第二损失函数,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数;
根据模型调节函数训练得到声纹识别模型。
可选地,本申请实施例中,CPU 522具体用于执行如下步骤:
通过CNN确定每个语音信息的深度特征;
根据待训练语音信息集合中每个对象所对应的语音信息获取连接层权重矩阵;
根据每个语音信息的深度特征以及连接层权重矩阵确定第一损失函数。
可选地,本申请实施例中,CPU 522具体用于执行如下步骤:
采用如下方式确定第一损失函数:
L_s = -∑_{i=1}^{M} log( e^{W_{y_i}^T·x_i + b_{y_i}} / ∑_{j=1}^{N} e^{W_j^T·x_i + b_j} )
其中,L s表示第一损失函数,x i表示来自第y i个对象的第i个深度特 征,W v表示连接层权重矩阵中的第v列,b j表示第j类的偏差项,且每一类对应一个对象,M表示待训练语音信息集合所对应的训练集分组大小,N表示待训练语音信息集合所对应的对象个数。
可选地,本申请实施例中,CPU 522具体用于执行如下步骤:
通过CNN确定每个语音信息的深度特征;
根据每个语音信息的深度特征计算深度特征梯度;
根据深度特征梯度以及第一语音均值计算第二语音均值;
根据每个语音信息的深度特征以及第二语音均值,确定第二损失函数。
可选地,本申请实施例中,CPU 522具体用于执行如下步骤:
采用如下方式计算深度特征梯度:
Δμ_j = ∑_{i=1}^{M} δ(y_i = j)·(μ_j − x_i) / ( 1 + ∑_{i=1}^{M} δ(y_i = j) )
其中,Δμ j表示深度特征梯度,M表示待训练语音信息集合所对应的训练集分组大小,j表示类,且每一类对应一个对象,y i表示第y i个对象;
采用如下方式计算第二语音均值:
μ_j^{t+1} = μ_j^{t} − α·Δμ_j^{t}
其中，t表示时刻，μ_j^{t+1}表示t+1时刻所对应的第二语音均值，μ_j^{t}表示t时刻所对应的第一个语音均值，Δμ_j^{t}表示t时刻所对应的深度特征梯度，α表示学习速率参数，且α的取值范围为大于或等于0，且小于或等于1；
采用如下方式确定第二损失函数:
L_c = (1/2)·∑_{i=1}^{M} ||x_i − μ_{y_i}||_2^2
其中,L c表示第二损失函数,x i表示来自第y i个对象的第i个深度特征,μ yi表示来自y i的深度区分特征均值。
可选地,本申请实施例中,CPU 522具体用于执行如下步骤:
根据待训练语音信息集合中每个对象所对应的语音信息确定第一损失函数;
根据待训练语音信息集合中每个对象所对应的语音信息确定第二损失函 数;
根据第一损失函数以及第二损失函数,确定模型调节函数。
可选地,本申请实施例中,CPU 522具体用于执行如下步骤:
采用如下方式确定模型调节函数:
L_t = L_s + λ·L_c
其中,L t表示模型调节函数,L s表示第一损失函数,L c表示第二损失函数,λ表示控制参数。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请实施例所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请实施例各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对相关技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一 个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请实施例各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,简称ROM)、随机存取存储器(random access memory,简称RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上,以上实施例仅用以说明本申请实施例的技术方案,而非对其限制;尽管参照前述实施例对本申请实施例进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请实施例各实施例技术方案的精神和范围。
工业实用性
本申请采用了获取待识别的目标语音信息,通过语音识别模型获取目标语音信息的目标特征信息,语音识别模型为根据第一损失函数以及第二损失函数训练得到的,第一损失函数属于归一化指数函数,第二损失函数属于中心化函数;根据目标特征信息以及注册特征信息确定声纹识别结果,注册特征信息为待识别对象的语音信息在通过语音识别模型之后得到的。采用归一化指数函数和中心化函数对语音识别模型进行联合优化,能够减少来自同一说话人深度特征之间的类内变化,采用两种函数同时监督和学习语音识别模型,可使深度特征具有更好的区分性,从而提升识别性能。

Claims (15)

  1. 一种声纹识别的方法,包括:
    获取待识别的目标语音信息;
    通过声纹识别模型获取所述目标语音信息的目标特征信息,其中,所述声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,所述第一损失函数属于归一化指数函数,所述第二损失函数属于中心化函数;
    根据所述目标特征信息以及注册特征信息确定声纹识别结果,其中,所述注册特征信息为待识别对象的语音信息在通过所述声纹识别模型之后得到的。
  2. 根据权利要求1所述的方法,其中,所述根据所述目标特征信息以及注册特征信息确定声纹识别结果,包括:
    根据所述目标特征信息以及注册特征信息计算余弦相似度;
    若所述余弦相似度达到第一相似度阈值,则确定所述目标语音信息属于所述待识别对象的语音信息;
    若所述余弦相似度未达到所述第一相似度阈值,则确定所述目标语音信息不属于所述待识别对象的语音信息。
  3. 根据权利要求1所述的方法,其中,所述根据所述目标特征信息以及注册特征信息确定声纹识别结果,包括:
    通过PLDA分类器计算所述目标特征信息与所述注册特征信息之间的对数似然比;
    若所述对数似然比达到第二相似度阈值,则确定所述目标语音信息属于所述待识别对象的语音信息;
    若所述对数似然比未达到所述第二相似度阈值,则确定所述目标语音信息不属于所述待识别对象的语音信息。
  4. 一种模型训练的方法,其中,包括:
    获取待训练语音信息集合,其中,所述待训练语音信息集合包括至少一个对象所对应的语音信息;
    根据所述待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,所述模型调节函数包括所述第一损失函数以及第二损失函数,所述第一损失函数属于归一化指数函数,所述第二损失函数属于中心化函数;
    根据所述模型调节函数训练得到声纹识别模型。
  5. 根据权利要求4所述的方法,其中,所述根据所述待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,包括:
    通过CNN确定每个语音信息的深度特征;
    根据所述待训练语音信息集合中每个对象所对应的语音信息获取连接层权重矩阵;
    根据所述每个语音信息的深度特征以及所述连接层权重矩阵确定所述第一损失函数。
  6. 根据权利要求5所述的方法,其中,所述根据所述每个语音信息的深度特征以及所述连接层权重矩阵确定所述第一损失函数,包括:
    采用如下方式确定所述第一损失函数:
    L_s = -∑_{i=1}^{M} log( e^{W_{y_i}^T·x_i + b_{y_i}} / ∑_{j=1}^{N} e^{W_j^T·x_i + b_j} )
    其中,所述L s表示所述第一损失函数,所述x i表示来自第y i个对象的第i个深度特征,W v表示所述连接层权重矩阵中的第v列,所述b j表示第j类的偏差项,且每一类对应一个对象,所述M表示所述待训练语音信息集合所对应的训练集分组大小,所述N表示所述待训练语音信息集合所对应的对象个数。
  7. 根据权利要求4所述的方法,其中,所述根据所述待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,包括:
    通过CNN确定每个语音信息的深度特征;
    根据所述每个语音信息的深度特征计算深度特征梯度;
    根据所述深度特征梯度以及第一语音均值计算第二语音均值;
    根据所述每个语音信息的深度特征以及所述第二语音均值,确定所述第二损失函数。
  8. 根据权利要求7所述的方法,其中,所述根据所述每个语音信息的深度特征计算深度特征梯度,包括:
    采用如下方式计算所述深度特征梯度:
    Δμ_j = ∑_{i=1}^{M} δ(y_i = j)·(μ_j − x_i) / ( 1 + ∑_{i=1}^{M} δ(y_i = j) )
    其中,所述Δμ j表示所述深度特征梯度,所述M表示所述待训练语音信息集合所对应的训练集分组大小,所述j表示类,且每一类对应一个对象,所述y i表示第y i个对象;
    所述根据所述深度特征梯度以及第一语音均值计算第二语音均值,包括:
    采用如下方式计算所述第二语音均值:
    μ_j^{t+1} = μ_j^{t} − α·Δμ_j^{t}
    其中，所述t表示时刻，所述μ_j^{t+1}表示t+1时刻所对应的所述第二语音均值，所述μ_j^{t}表示t时刻所对应的所述第一个语音均值，所述Δμ_j^{t}表示t时刻所对应的所述深度特征梯度，所述α表示学习速率参数，且所述α的取值范围为大于或等于0，且小于或等于1；
    所述根据所述每个语音信息的深度特征以及所述第二语音均值,确定所述第二损失函数,包括:
    采用如下方式确定所述第二损失函数:
    L_c = (1/2)·∑_{i=1}^{M} ||x_i − μ_{y_i}||_2^2
    其中,所述L c表示所述第二损失函数,所述x i表示来自第y i个对象的第i个深度特征,所述μ yi表示来自所述y i的深度区分特征均值。
  9. 根据权利要求4至8中任一项所述的方法,其中,所述根据所述待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,包括:
    根据所述待训练语音信息集合中每个对象所对应的语音信息确定所述第一损失函数;
    根据所述待训练语音信息集合中每个对象所对应的语音信息确定所述第二损失函数;
    根据所述第一损失函数以及所述第二损失函数,确定所述模型调节函数。
  10. 根据权利要求9所述的方法,其中,所述根据所述第一损失函数以及所述第二损失函数,确定所述模型调节函数,包括:
    采用如下方式确定所述模型调节函数:
    L_t = L_s + λ·L_c
    其中,所述L t表示所述模型调节函数,所述L s表示所述第一损失函数,所述L c表示所述第二损失函数,所述λ表示控制参数。
  11. 一种服务器,其中,包括一个或多个处理器,以及一个或多个存储程序模块的存储器,其中,所述程序模块由所述处理器执行,所述程序模块包括:
    获取模块,被设置为获取待识别的目标语音信息;
    所述获取模块,还被设置为通过声纹识别模型获取所述目标语音信息的目标特征信息,其中,所述声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,所述第一损失函数属于归一化指数函数,所述第二损失函数属于中心化函数;
    确定模块,被设置为根据所述获取模块获取的所述目标特征信息以及注册特征信息确定声纹识别结果,其中,所述注册特征信息为待识别对象的语音信息在通过所述声纹识别模型之后得到的。
  12. 一种服务器,其中,包括一个或多个处理器,以及一个或多个存储程序模块的存储器,其中,所述程序模块由所述处理器执行,所述程序模块包括:
    获取模块,被设置为获取待训练语音信息集合,其中,所述待训练语音信息集合包括至少一个对象所对应的语音信息;
    确定模块,被设置为根据所述获取模块获取的所述待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,所述模型调节函数包括所述第一损失函数以及第二损失函数,所述第一损失函数属于归一化指数函数,所述第二损失函数属于中心化函数;
    训练模块,被设置为根据所述确定模块确定的所述模型调节函数训练得到声纹识别模型。
  13. 一种服务器,其中,包括:存储器、收发器、处理器以及总线系统;
    其中,所述存储器被设置为存储程序;
    所述处理器被设置为执行所述存储器中的程序,包括如下步骤:
    获取待识别的目标语音信息;
    通过声纹识别模型获取所述目标语音信息的目标特征信息,其中,所述声纹识别模型为根据第一损失函数以及第二损失函数训练得到的,所述第一损失函数属于归一化指数函数,所述第二损失函数属于中心化函数;
    根据所述目标特征信息以及注册特征信息确定声纹识别结果,其中,所述注册特征信息为待识别对象的语音信息在通过所述声纹识别模型之后得到的;
    所述总线系统被设置为连接所述存储器以及所述处理器,以使所述存储器以及所述处理器进行通信。
  14. 一种服务器,其中,包括:存储器、收发器、处理器以及总线系统;
    其中,所述存储器被设置为存储程序;
    所述处理器被设置为执行所述存储器中的程序,包括如下步骤:
    获取待训练语音信息集合,其中,所述待训练语音信息集合包括至少一个对象所对应的语音信息;
    根据所述待训练语音信息集合中每个对象所对应的语音信息确定模型调节函数,其中,所述模型调节函数包括所述第一损失函数以及第二损失函数,所述第一损失函数属于归一化指数函数,所述第二损失函数属于中心化函数;
    根据所述模型调节函数训练得到声纹识别模型;
    所述总线系统被设置为连接所述存储器以及所述处理器,以使所述存储器以及所述处理器进行通信。
  15. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1至3中任一项所述的方法,或者执行如权利要求4至10中任一项所述的方法。
PCT/CN2019/093792 2018-10-10 2019-06-28 一种声纹识别的方法、模型训练的方法以及服务器 WO2020073694A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP19870101.3A EP3866163A4 (en) 2018-10-10 2019-06-28 VOICE PRESSURE IDENTIFICATION PROCEDURES, MODEL TRAINING PROCEDURES AND SERVER
JP2020561916A JP7152514B2 (ja) 2018-10-10 2019-06-28 声紋識別方法、モデルトレーニング方法、サーバ、及びコンピュータプログラム
US17/085,609 US11508381B2 (en) 2018-10-10 2020-10-30 Voiceprint recognition method, model training method, and server

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811179856.4A CN110164452B (zh) 2018-10-10 2018-10-10 一种声纹识别的方法、模型训练的方法以及服务器
CN201811179856.4 2018-10-10

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/085,609 Continuation US11508381B2 (en) 2018-10-10 2020-10-30 Voiceprint recognition method, model training method, and server

Publications (1)

Publication Number Publication Date
WO2020073694A1 true WO2020073694A1 (zh) 2020-04-16

Family

ID=67644996

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/093792 WO2020073694A1 (zh) 2018-10-10 2019-06-28 一种声纹识别的方法、模型训练的方法以及服务器

Country Status (5)

Country Link
US (1) US11508381B2 (zh)
EP (1) EP3866163A4 (zh)
JP (1) JP7152514B2 (zh)
CN (2) CN110164452B (zh)
WO (1) WO2020073694A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345444A (zh) * 2021-05-07 2021-09-03 华中师范大学 一种说话者确认方法及系统
CN113409794A (zh) * 2021-06-30 2021-09-17 平安科技(深圳)有限公司 声纹识别模型的优化方法、装置、计算机设备及存储介质
CN113421573A (zh) * 2021-06-18 2021-09-21 马上消费金融股份有限公司 身份识别模型训练方法、身份识别方法及装置
CN113555008A (zh) * 2020-04-17 2021-10-26 阿里巴巴集团控股有限公司 一种针对模型的调参方法及装置
EP3901948A1 (en) * 2020-04-22 2021-10-27 Beijing Xiaomi Pinecone Electronics Co., Ltd. Method for training a voiceprint extraction model and method for voiceprint recognition, and device and medium thereof
CN113611314A (zh) * 2021-08-03 2021-11-05 成都理工大学 一种说话人识别方法及系统
CN117577116A (zh) * 2024-01-17 2024-02-20 清华大学 连续学习语音鉴别模型的训练方法、装置、设备及介质

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11126890B2 (en) * 2019-04-18 2021-09-21 Adobe Inc. Robust training of large-scale object detectors with a noisy dataset
US11183201B2 (en) * 2019-06-10 2021-11-23 John Alexander Angland System and method for transferring a voice from one body of recordings to other recordings
JP7290507B2 (ja) * 2019-08-06 2023-06-13 本田技研工業株式会社 情報処理装置、情報処理方法、認識モデルならびにプログラム
CN110751941B (zh) * 2019-09-18 2023-05-26 平安科技(深圳)有限公司 语音合成模型的生成方法、装置、设备及存储介质
WO2020035085A2 (en) * 2019-10-31 2020-02-20 Alipay (Hangzhou) Information Technology Co., Ltd. System and method for determining voice characteristics
CN110660399A (zh) * 2019-11-11 2020-01-07 广州国音智能科技有限公司 声纹识别的训练方法、装置、终端及计算机存储介质
CN110838295B (zh) * 2019-11-17 2021-11-23 西北工业大学 一种模型生成方法、声纹识别方法及对应装置
CN111128128B (zh) * 2019-12-26 2023-05-23 华南理工大学 一种基于互补模型评分融合的语音关键词检测方法
CN111145761B (zh) * 2019-12-27 2022-05-24 携程计算机技术(上海)有限公司 模型训练的方法、声纹确认的方法、系统、设备及介质
CN111341318B (zh) * 2020-01-22 2021-02-12 北京世纪好未来教育科技有限公司 说话者角色确定方法、装置、设备及存储介质
CN111524525B (zh) 2020-04-28 2023-06-16 平安科技(深圳)有限公司 原始语音的声纹识别方法、装置、设备及存储介质
CN111524526B (zh) * 2020-05-14 2023-11-17 中国工商银行股份有限公司 声纹识别方法及装置
CN111652315B (zh) * 2020-06-04 2023-06-02 广州虎牙科技有限公司 模型训练、对象分类方法和装置、电子设备及存储介质
CN111933147B (zh) * 2020-06-22 2023-02-14 厦门快商通科技股份有限公司 声纹识别方法、系统、移动终端及存储介质
CN112017670B (zh) * 2020-08-13 2021-11-02 北京达佳互联信息技术有限公司 一种目标账户音频的识别方法、装置、设备及介质
CN111709004B (zh) * 2020-08-19 2020-11-13 北京远鉴信息技术有限公司 一种身份认证方法、装置、电子设备及可读存储介质
CN111951791B (zh) * 2020-08-26 2024-05-17 上海依图网络科技有限公司 声纹识别模型训练方法、识别方法、电子设备及存储介质
CN112053695A (zh) * 2020-09-11 2020-12-08 北京三快在线科技有限公司 声纹识别方法、装置、电子设备及存储介质
CN112037800B (zh) * 2020-09-22 2024-07-12 平安科技(深圳)有限公司 声纹核身模型训练方法、装置、介质及电子设备
CN112259114A (zh) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 语音处理方法及装置、计算机存储介质、电子设备
CN112259106B (zh) * 2020-10-20 2024-06-11 网易(杭州)网络有限公司 声纹识别方法、装置、存储介质及计算机设备
CN112259097A (zh) * 2020-10-27 2021-01-22 深圳康佳电子科技有限公司 一种语音识别的控制方法和计算机设备
CN112071322B (zh) * 2020-10-30 2022-01-25 北京快鱼电子股份公司 一种端到端的声纹识别方法、装置、存储介质及设备
CN112767949B (zh) * 2021-01-18 2022-04-26 东南大学 一种基于二值权重卷积神经网络的声纹识别系统
CN112929501B (zh) * 2021-01-25 2024-08-27 深圳前海微众银行股份有限公司 语音通话服务方法、装置、设备、介质及计算机程序产品
CN113299295B (zh) * 2021-05-11 2022-12-30 支付宝(杭州)信息技术有限公司 声纹编码网络的训练方法及装置
CN113327618B (zh) * 2021-05-17 2024-04-19 西安讯飞超脑信息科技有限公司 声纹判别方法、装置、计算机设备和存储介质
CN113327617B (zh) * 2021-05-17 2024-04-19 西安讯飞超脑信息科技有限公司 声纹判别方法、装置、计算机设备和存储介质
CN113421575B (zh) * 2021-06-30 2024-02-06 平安科技(深圳)有限公司 声纹识别方法、装置、设备及存储介质
CN113837594A (zh) * 2021-09-18 2021-12-24 深圳壹账通智能科技有限公司 多场景下客服的质量评价方法、系统、设备及介质
CN116964667A (zh) * 2021-11-11 2023-10-27 深圳市韶音科技有限公司 语音活动检测方法、系统、语音增强方法以及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005055200A1 (en) * 2003-12-05 2005-06-16 Queensland University Of Technology Model adaptation system and method for speaker recognition
CN102270451A (zh) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 说话人识别方法及系统
CN103971690A (zh) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 一种声纹识别方法和装置
CN104157290A (zh) * 2014-08-19 2014-11-19 大连理工大学 一种基于深度学习的说话人识别方法
CN107146624A (zh) * 2017-04-01 2017-09-08 清华大学 一种说话人确认方法及装置

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5229124B2 (ja) 2009-06-12 2013-07-03 日本電気株式会社 話者照合装置、話者照合方法およびプログラム
US9258425B2 (en) * 2013-05-22 2016-02-09 Nuance Communications, Inc. Method and system for speaker verification
CN103679452A (zh) * 2013-06-20 2014-03-26 腾讯科技(深圳)有限公司 支付验证方法、装置及系统
CN103730114A (zh) * 2013-12-31 2014-04-16 上海交通大学无锡研究院 一种基于联合因子分析模型的移动设备声纹识别方法
CN104732978B (zh) * 2015-03-12 2018-05-08 上海交通大学 基于联合深度学习的文本相关的说话人识别方法
US9978374B2 (en) * 2015-09-04 2018-05-22 Google Llc Neural networks for speaker verification
CN107274905B (zh) * 2016-04-08 2019-09-27 腾讯科技(深圳)有限公司 一种声纹识别方法及系统
US20180018973A1 (en) * 2016-07-15 2018-01-18 Google Inc. Speaker verification
CN108288470B (zh) * 2017-01-10 2021-12-21 富士通株式会社 基于声纹的身份验证方法和装置
CN107680582B (zh) * 2017-07-28 2021-03-26 平安科技(深圳)有限公司 声学模型训练方法、语音识别方法、装置、设备及介质
CN107680600B (zh) * 2017-09-11 2019-03-19 平安科技(深圳)有限公司 声纹模型训练方法、语音识别方法、装置、设备及介质
CN107464568B (zh) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 基于三维卷积神经网络文本无关的说话人识别方法及系统
CN108417217B (zh) * 2018-01-11 2021-07-13 思必驰科技股份有限公司 说话人识别网络模型训练方法、说话人识别方法及系统
CN108197591A (zh) * 2018-01-22 2018-06-22 北京林业大学 一种基于多特征融合迁移学习的鸟类个体识别方法
CN108564955B (zh) * 2018-03-19 2019-09-03 平安科技(深圳)有限公司 电子装置、身份验证方法和计算机可读存储介质
CN108564954B (zh) * 2018-03-19 2020-01-10 平安科技(深圳)有限公司 深度神经网络模型、电子装置、身份验证方法和存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005055200A1 (en) * 2003-12-05 2005-06-16 Queensland University Of Technology Model adaptation system and method for speaker recognition
CN102270451A (zh) * 2011-08-18 2011-12-07 安徽科大讯飞信息科技股份有限公司 说话人识别方法及系统
CN103971690A (zh) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 一种声纹识别方法和装置
CN104157290A (zh) * 2014-08-19 2014-11-19 大连理工大学 一种基于深度学习的说话人识别方法
CN107146624A (zh) * 2017-04-01 2017-09-08 清华大学 一种说话人确认方法及装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
See also references of EP3866163A4 *
WEN, YANDONG ET AL.: "A Discriminative Feature Learning Approach for Deep Face Recognition", COMPUTER VISION-ECCV 2016: 14TH EUROPEAN CONFERENCE, vol. 9911, no. 558, 14 October 2016 (2016-10-14), pages 499 - 515, XP047355154, ISSN: 0302-9743 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113555008A (zh) * 2020-04-17 2021-10-26 阿里巴巴集团控股有限公司 一种针对模型的调参方法及装置
KR102603466B1 (ko) * 2020-04-22 2023-11-17 베이징 시아오미 파인콘 일렉트로닉스 컴퍼니 리미티드 성문 추출 모델 훈련 방법 및 성문 인식 방법, 이의 장치 및 매체, 프로그램
EP3901948A1 (en) * 2020-04-22 2021-10-27 Beijing Xiaomi Pinecone Electronics Co., Ltd. Method for training a voiceprint extraction model and method for voiceprint recognition, and device and medium thereof
KR20210131211A (ko) * 2020-04-22 2021-11-02 베이징 시아오미 파인콘 일렉트로닉스 컴퍼니 리미티드 성문 추출 모델 훈련 방법 및 성문 인식 방법, 이의 장치 및 매체, 프로그램
CN113345444B (zh) * 2021-05-07 2022-10-28 华中师范大学 一种说话者确认方法及系统
CN113345444A (zh) * 2021-05-07 2021-09-03 华中师范大学 一种说话者确认方法及系统
CN113421573A (zh) * 2021-06-18 2021-09-21 马上消费金融股份有限公司 身份识别模型训练方法、身份识别方法及装置
CN113421573B (zh) * 2021-06-18 2024-03-19 马上消费金融股份有限公司 身份识别模型训练方法、身份识别方法及装置
CN113409794A (zh) * 2021-06-30 2021-09-17 平安科技(深圳)有限公司 声纹识别模型的优化方法、装置、计算机设备及存储介质
CN113409794B (zh) * 2021-06-30 2023-05-23 平安科技(深圳)有限公司 声纹识别模型的优化方法、装置、计算机设备及存储介质
CN113611314A (zh) * 2021-08-03 2021-11-05 成都理工大学 一种说话人识别方法及系统
CN117577116A (zh) * 2024-01-17 2024-02-20 清华大学 连续学习语音鉴别模型的训练方法、装置、设备及介质
CN117577116B (zh) * 2024-01-17 2024-03-19 清华大学 连续学习语音鉴别模型的训练方法、装置、设备及介质

Also Published As

Publication number Publication date
CN110164452A (zh) 2019-08-23
JP2021527840A (ja) 2021-10-14
JP7152514B2 (ja) 2022-10-12
EP3866163A4 (en) 2021-11-24
CN110289003A (zh) 2019-09-27
EP3866163A1 (en) 2021-08-18
CN110289003B (zh) 2021-10-29
CN110164452B (zh) 2023-03-10
US11508381B2 (en) 2022-11-22
US20210050020A1 (en) 2021-02-18

Similar Documents

Publication Publication Date Title
WO2020073694A1 (zh) 一种声纹识别的方法、模型训练的方法以及服务器
US11031018B2 (en) System and method for personalized speaker verification
US9940935B2 (en) Method and device for voiceprint recognition
CN112259106B (zh) 声纹识别方法、装置、存储介质及计算机设备
EP2763134B1 (en) Method and apparatus for voice recognition
CN104167208B (zh) 一种说话人识别方法和装置
WO2020155584A1 (zh) 声纹特征的融合方法及装置,语音识别方法,系统及存储介质
CN105261367B (zh) 一种说话人识别方法
WO2014114116A1 (en) Method and system for voiceprint recognition
WO2014114049A1 (zh) 一种语音识别的方法、装置
Wang et al. A network model of speaker identification with new feature extraction methods and asymmetric BLSTM
Ramos-Castro et al. Speaker verification using speaker-and test-dependent fast score normalization
US11437044B2 (en) Information processing apparatus, control method, and program
You et al. A GMM-supervector approach to language recognition with adaptive relevance factor
Guo et al. Speaker Verification Using Short Utterances with DNN-Based Estimation of Subglottal Acoustic Features.
Perdana et al. Voice recognition system for user authentication using gaussian mixture model
Ambikairajah et al. PNCC-ivector-SRC based speaker verification
Li et al. Adaptive threshold estimation of open set voiceprint recognition based on OTSU and deep learning
WO2021257000A1 (en) Cross-modal speaker verification
Kanrar et al. Text and language independent speaker identification by GMM based i vector
Alwahed et al. ARABIC SPEECH RECOGNITION BASED ON KNN, J48, AND LVQ
Ahmad et al. Client-wise cohort set selection by combining speaker-and phoneme-specific I-vectors for speaker verification
Zhang Realization and improvement algorithm of GMM-UBM model in voiceprint recognition
JP7287442B2 (ja) 情報処理装置、制御方法、及びプログラム
Memon et al. Information theoretic expectation maximization based Gaussian mixture modeling for speaker verification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19870101

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020561916

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019870101

Country of ref document: EP

Effective date: 20210510