WO2018010683A1 - Identity vector generation method, computer device and computer-readable storage medium - Google Patents

Identity vector generation method, computer device and computer-readable storage medium

Info

Publication number
WO2018010683A1
WO2018010683A1 (PCT/CN2017/092892)
Authority
WO
WIPO (PCT)
Prior art keywords
statistic
gaussian distribution
quotient
distribution component
speaker
Prior art date
Application number
PCT/CN2017/092892
Other languages
English (en)
French (fr)
Inventor
李为
钱柄桦
金星明
李科
吴富章
吴永坚
黄飞跃
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP17827019.5A (EP3486903B1)
Publication of WO2018010683A1
Priority to US16/213,421 (US10909989B2)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Definitions

  • the present application relates to the field of computer technology, and in particular, to an identity vector generation method, a computer device, and a computer readable storage medium.
  • Speaker identification is an important means of identity recognition. A segment of speech uttered by a user is collected and, after a series of operations such as preprocessing, feature extraction, modeling, and parameter estimation, the speech is mapped to a fixed-length vector that can express the speaker's speech characteristics; this vector is called an identity vector (i-vector). The identity vector can well express the speaker identity information contained in the corresponding speech.
  • when generating the identity vector of speech data, it is currently necessary to extract its acoustic features and, based on a speaker background model in the form of a Gaussian mixture model, compute a statistic of the posterior probabilities that each acoustic feature belongs to each Gaussian distribution component of the speaker background model; an identity vector is then generated based on that statistic.
  • with the current method of generating an identity vector, however, the identity recognition performance of the identity vector decreases when the speech data has a relatively short speech length or relatively sparse speech.
  • according to various embodiments of the present application, an identity vector generation method, a computer device, and a computer readable storage medium are provided.
  • An identity vector generation method includes: acquiring voice data to be processed; extracting corresponding acoustic features from the voice data to be processed; performing statistics on the posterior probability that each acoustic feature belongs to each Gaussian distribution component in a speaker background model, to obtain a statistic; mapping the statistic to a statistic space to obtain a reference statistic, the statistic space being constructed according to statistics corresponding to voice samples exceeding a preset speech duration; determining a modified statistic according to the statistic and the reference statistic; and generating an identity vector according to the modified statistic.
  • A computer device includes a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps: acquiring voice data to be processed; extracting corresponding acoustic features from the voice data to be processed; performing statistics on the posterior probability that each acoustic feature belongs to each Gaussian distribution component in a speaker background model, to obtain a statistic; mapping the statistic to a statistic space to obtain a reference statistic, the statistic space being constructed according to statistics corresponding to voice samples exceeding a preset speech duration; determining a modified statistic according to the statistic and the reference statistic; and generating an identity vector according to the modified statistic.
  • One or more non-transitory computer readable storage mediums storing computer readable instructions are provided; when executed by one or more processors, the computer readable instructions cause the one or more processors to perform the following steps: acquiring voice data to be processed; extracting corresponding acoustic features from the voice data to be processed; performing statistics on the posterior probability that each acoustic feature belongs to each Gaussian distribution component in a speaker background model, to obtain a statistic; mapping the statistic to a statistic space to obtain a reference statistic, the statistic space being constructed according to statistics corresponding to voice samples exceeding a preset speech duration; determining a modified statistic according to the statistic and the reference statistic; and generating an identity vector according to the modified statistic.
  • FIG. 1 is an application environment diagram of a speaker recognition system in an embodiment;
  • FIG. 2A is a schematic diagram of the internal structure of a server in an embodiment;
  • FIG. 2B is a schematic diagram of the internal structure of a terminal in an embodiment;
  • FIG. 3 is a schematic flowchart of an identity vector generation method in an embodiment;
  • FIG. 4 is a schematic flowchart of an identity vector generation method in another embodiment;
  • FIG. 5 is a schematic flowchart of the steps of constructing a statistic space in an embodiment;
  • FIG. 6 is a structural block diagram of a computer device in an embodiment;
  • FIG. 7 is a structural block diagram of a statistic generation module in an embodiment;
  • FIG. 8 is a structural block diagram of a computer device in another embodiment;
  • FIG. 9 is a structural block diagram of a computer device in still another embodiment.
  • the terms "first", "second", and the like may be used to describe various elements, but these elements are not limited by these terms; these terms are only used to distinguish one element from another. For example, both the first zero-order statistic and the second zero-order statistic are zero-order statistics, but they are not the same zero-order statistic.
  • FIG. 1 is an application environment diagram of a speaker recognition system in an embodiment.
  • the system includes a terminal 110 and a server 120 connected through a network.
  • the terminal 110 can be configured to collect voice data to be verified, and generate an identity vector to be verified by using the identity vector generation method in this application, and send the identity vector to be verified to the server 120.
  • the server 120 may collect voice data of the target speaker category and generate a target speaker identity vector using the identity vector generation method in the present application.
  • the server 120 can be configured to calculate the similarity between the identity vector to be verified and the target speaker identity vector, and to perform speaker identity verification according to the similarity. The server 120 can be used to feed back the verification result to the terminal 110.
  • the server includes a processor, a non-volatile storage medium, an internal memory, and a network interface connected by a system bus.
  • the non-volatile storage medium of the server stores an operating system, a database, and computer readable instructions that, when executed by the processor, cause the processor to implement an identity vector generation method.
  • the server's processor is used to provide computing and control capabilities that support the operation of the entire server.
  • Computer readable instructions may be stored in the internal memory of the server, the computer readable instructions being executable by the processor to cause the processor to perform an identity vector generation method.
  • the server's network interface is used to communicate with the terminal.
  • the server can be implemented as a stand-alone server or as a server cluster consisting of multiple servers. A person skilled in the art can understand that the structure shown in FIG. 2A is only a block diagram of a part of the structure related to the solution of the present application and does not constitute a limitation on the server to which the solution is applied; the specific server may include more or fewer components than shown in the figure, or combine some components, or have a different component arrangement.
  • the terminal includes a processor connected through a system bus, a non-volatile storage medium, an internal memory, a network interface, and a sound collection device.
  • the non-volatile storage medium of the terminal stores an operating system and further stores computer readable instructions; when executed by the processor, the computer readable instructions can cause the processor to implement an identity vector generation method.
  • the processor is used to provide computing and control capabilities to support the operation of the entire terminal.
  • computer readable instructions may be stored in the internal memory of the terminal; when executed by the processor, the computer readable instructions can cause the processor to perform an identity vector generation method.
  • the network interface is used for network communication with the server.
  • the terminal can be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like.
  • a person skilled in the art can understand that the structure shown in FIG. 2B is only a block diagram of a part of the structure related to the solution of the present application and does not constitute a limitation on the terminal to which the solution is applied; the specific terminal may include more or fewer components than shown in the figure, or combine some components, or have a different component arrangement.
  • FIG. 3 is a schematic flow chart of an identity vector generation method in an embodiment. This embodiment is exemplified by applying the method to the server 120. Referring to FIG. 3, the method specifically includes the following steps:
  • the voice data to be processed refers to voice data that needs to be subjected to a series of processing to generate a corresponding identity vector.
  • voice data is data formed by saving the sound collected by a sound collection device after a speaker utters speech.
  • the to-be-processed voice data may include voice data to be verified and voice data of a target speaker category. The voice data to be verified is voice data whose speaker category is unknown and for which it needs to be determined whether it belongs to the target speaker category; the target speaker category is a known speaker category, composed of the voice data produced by the target speaker.
  • the server may perform preprocessing on the to-be-processed speech data, such as filtering out noise or unifying the speech format, and then extract a corresponding acoustic feature vector from the preprocessed to-be-processed speech data.
  • An acoustic feature vector refers to a vector of acoustic features that reflect acoustic properties.
  • the acoustic feature vector includes a series of acoustic features, which may be Mel-frequency cepstral coefficients (MFCC) or linear prediction cepstral coefficients (LPCC).
  • the speaker background model is a Gaussian mixture model trained on a series of speech samples and is used to represent the speaker-independent distribution of features.
  • the Gaussian mixture model is a mathematical model in which a fixed number of Gaussian distribution components are superimposed.
  • the speaker background model can be trained by the EM algorithm (Expectation-Maximization algorithm). The speaker background model can adopt a GMM-UBM (Gaussian Mixture Model-Universal Background Model).
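  • as a non-authoritative illustration of this training step, the following Python sketch fits a GMM-UBM with EM using scikit-learn; the pooled feature matrix, component count, and feature dimensionality are illustrative assumptions rather than values from the patent.

```python
# A minimal sketch of training the speaker background model (GMM-UBM) with
# the EM algorithm. The pooled background features are a random stand-in;
# n_components (C = 64) and 20-dimensional features are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

background_features = np.random.randn(10000, 20)  # pooled MFCCs from many speakers

ubm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=100)
ubm.fit(background_features)  # EM runs inside fit()

# ubm.weights_ plays the role of a_c, ubm.means_ of mu_c, and
# ubm.covariances_ of Sigma_c in formula (1) below.
```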
  • the speaker background model can be expressed by the following formula (1):

$$P(x \mid \Omega) = \sum_{c=1}^{C} a_c \, N(x \mid \mu_c, \Sigma_c) \qquad (1)$$

where x is a speech sample; C is the total number of Gaussian distribution components included in the Gaussian mixture model; c is the index of a Gaussian distribution component; N(x | μ_c, Σ_c) is the c-th Gaussian distribution component; a_c is the coefficient of the c-th Gaussian distribution component; μ_c is the mean of the c-th Gaussian distribution component; and Σ_c is the variance of the c-th Gaussian distribution component.
  • the acoustic feature vector can be expressed as {y_1, y_2, ..., y_L}. The acoustic feature vector includes L acoustic features, each of which can be represented as y_t, where t ∈ [1, L]. The posterior probability that each acoustic feature in the acoustic feature vector belongs to each Gaussian distribution component in the speaker background model can be expressed as P(c | y_t, Ω), where Ω denotes the speaker background model, and P(c | y_t, Ω) represents the posterior probability that the acoustic feature y_t belongs to the c-th Gaussian distribution component given that the speaker background model Ω and the acoustic feature y_t have been observed. The server can obtain statistics based on the posterior probability P(c | y_t, Ω).
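  • as an illustration, the component posteriors P(c | y_t, Ω) can be read off a fitted mixture model; the sketch below continues the previous one and assumes a hypothetical MFCC matrix Y for one utterance.

```python
# Posterior probabilities P(c | y_t, Omega) for each frame of one utterance,
# continuing the UBM sketch above. Y is a hypothetical (L, 20) feature matrix.
Y = np.random.randn(300, 20)

posteriors = ubm.predict_proba(Y)              # (L, C): entry (t, c) = P(c | y_t, Omega)
assert np.allclose(posteriors.sum(axis=1), 1)  # each row is a distribution over components
```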
  • the statistic is mapped to the statistic space to obtain a reference statistic.
  • the statistic space is constructed according to a statistic corresponding to the voice sample exceeding the preset speech duration.
  • the statistic space is a vector space constructed from statistics, of the same type as the statistic obtained above, that correspond to speech samples whose duration exceeds a preset speech duration, for example 30 seconds. The speech samples used to construct the statistic space may be selected, from among the speech samples used to train the speaker background model, as those exceeding the preset speech duration. After the statistic obtained above is mapped to the statistic space, a reference statistic is obtained; the reference statistic is a prior statistic determined according to the statistics corresponding to speech samples exceeding the preset speech duration.
  • the modified statistic is a statistic obtained by using the reference statistic to correct the statistic obtained above; it combines the prior statistic and the posterior statistic.
  • the identity vector can be generated by using the modified statistic and using a conventional method of generating an identity vector.
  • in the above identity vector generation method, the statistic space is constructed according to statistics corresponding to speech samples exceeding a preset speech duration. After the statistic of the posterior probabilities that the acoustic features belong to each Gaussian distribution component in the speaker background model is obtained, the statistic is mapped into the statistic space, and the obtained reference statistic is a prior statistic. The prior statistic is used to correct the statistic to obtain a modified statistic, which can compensate for the biased estimation of statistics caused by the to-be-processed speech data being too short or too sparse, thereby improving the identity recognition performance of the identity vector.
  • the identity vector generating method includes the following steps:
  • corresponding to each Gaussian distribution component in the speaker background model, the sum of the posterior probabilities of the respective acoustic features belonging to that Gaussian distribution component is computed as the corresponding first zero-order statistic.
  • specifically, corresponding to each Gaussian distribution component c in the speaker background model Ω, the posterior probabilities P(c | y_t, Ω) of each acoustic feature y_t belonging to the Gaussian distribution component c are summed, and the sum is taken as the first zero-order statistic corresponding to the Gaussian distribution component c. The first zero-order statistic N_c(u) corresponding to the Gaussian distribution component c can be calculated by the following formula (2):

$$N_c(u) = \sum_{t=1}^{L} P(c \mid y_t, \Omega) \qquad (2)$$

where u represents the speech data to be processed; N_c(u) represents the first zero-order statistic of the to-be-processed speech data u corresponding to the Gaussian distribution component c; y_t represents the t-th acoustic feature among the L acoustic features of the acoustic feature vector; and P(c | y_t, Ω) represents the posterior probability that the acoustic feature y_t belongs to the c-th Gaussian distribution component given that the speaker background model Ω and the acoustic feature y_t have been observed.
  • corresponding to each Gaussian distribution component in the speaker background model, a weighted sum of the acoustic features is computed, with the posterior probability of each acoustic feature belonging to that Gaussian distribution component as its weight, as the corresponding first first-order statistic. S406 and S408 are included in the above step S306.
  • specifically, corresponding to each Gaussian distribution component c in the speaker background model, a weighted sum of the acoustic features y_t is computed, with the posterior probability P(c | y_t, Ω) of each acoustic feature y_t belonging to the Gaussian distribution component c as the weight, and the weighted sum is taken as the first first-order statistic corresponding to the Gaussian distribution component c. The first first-order statistic F_c(u) corresponding to the Gaussian distribution component c can be calculated by the following formula (3):

$$F_c(u) = \sum_{t=1}^{L} P(c \mid y_t, \Omega)\, y_t \qquad (3)$$

where u denotes the speech data to be processed; F_c(u) denotes the first first-order statistic of the to-be-processed speech data u corresponding to the Gaussian distribution component c; y_t denotes the t-th acoustic feature among the L acoustic features of the acoustic feature vector; and P(c | y_t, Ω) represents the posterior probability that the acoustic feature y_t belongs to the c-th Gaussian distribution component given that the speaker background model Ω and the acoustic feature y_t have been observed.
  • S410: mapping the first zero-order statistic and the first first-order statistic to the statistic space, to obtain, for each Gaussian distribution component in the speaker background model, a second quotient of a reference first-order statistic and the corresponding reference zero-order statistic; the statistic space is constructed according to statistics corresponding to speech samples exceeding the preset speech duration.
  • specifically, the first zero-order statistic N_c(u) and the first first-order statistic F_c(u) are mapped to the statistic space H, to obtain, for each Gaussian distribution component c in the speaker background model, the second quotient F_c^ref(u) / N_c^ref(u) of the reference first-order statistic F_c^ref(u) and the corresponding reference zero-order statistic N_c^ref(u).
  • S412: summing, with weights, the third quotient of the first first-order statistic and the corresponding first zero-order statistic with the second quotient of the corresponding Gaussian distribution component, to obtain, for each Gaussian distribution component in the speaker background model, a fourth quotient of a modified first-order statistic and the corresponding modified zero-order statistic as the modified statistic. The modified statistic corresponding to the Gaussian distribution component c can be calculated by the following formula (4):

$$\frac{\hat F_c(u)}{\hat N_c(u)} = R_1 \, \frac{F_c(u)}{N_c(u)} + R_2 \, \frac{F_c^{ref}(u)}{N_c^{ref}(u)} \qquad (4)$$

where the fourth quotient on the left combines the modified first-order statistic and the modified zero-order statistic of the Gaussian distribution component c; F_c(u)/N_c(u) is the third quotient; F_c^ref(u)/N_c^ref(u) is the second quotient; and R1 and R2 are weights, whose sum can be constrained to 1.
  • in one embodiment, in the weighted summation, the weight of the third quotient is the first zero-order statistic of the corresponding Gaussian distribution component divided by the sum of that first zero-order statistic and a tunable parameter, and the weight of the second quotient is the tunable parameter divided by the sum of the first zero-order statistic of the corresponding Gaussian distribution component and the tunable parameter. The modified statistic corresponding to the Gaussian distribution component c can then be calculated by the following formula (5):

$$\frac{\hat F_c(u)}{\hat N_c(u)} = \frac{N_c(u)}{N_c(u)+q} \, \frac{F_c(u)}{N_c(u)} + \frac{q}{N_c(u)+q} \, \frac{F_c^{ref}(u)}{N_c^{ref}(u)} \qquad (5)$$

where q is the tunable parameter. Good results can be achieved when q takes a value from 0.4 to 1. By adjusting the tunable parameter, differential adjustment can be performed for different environments, increasing robustness.
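  • the interpolation of formula (5) can be sketched as below; the reference quotient is a placeholder here (its computation from the statistic space is sketched further down), and the value of q is an illustrative choice within the stated 0.4 to 1 range.

```python
# Formula (5): interpolate the utterance's quotient toward the prior
# (reference) quotient, trading off frame mass N_c(u) against the tunable q.
q = 0.7
ref_quotient = np.zeros_like(F)         # placeholder for F_c^ref(u) / N_c^ref(u)

third_quotient = F / N[:, None]         # F_c(u) / N_c(u)
w3 = (N / (N + q))[:, None]             # weight of the third quotient
corrected_quotient = w3 * third_quotient + (1 - w3) * ref_quotient
```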
  • ⁇ 1 , ⁇ 2 ... ⁇ C are the mean values of Gaussian distribution components of the speaker background model, respectively.
  • the identity vector can be calculated according to the following formula (9)
  • I denotes an identity matrix
  • T denotes a known Total Factor Matrix
  • t denotes transpose
  • denotes a covariance matrix in the form of a diagonal matrix, and the diagonal element of ⁇ is the covariance of each Gaussian distribution component
  • m represents the mean supervector of the speaker background model
  • the above formula (9) can be transformed, which will involve a matrix with Computational transformation with Calculation, and Obtained in this embodiment Can be used directly to calculate the identity vector without having to build a matrix with Simplify the calculation.
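  • a sketch of formula (9) under illustrative shapes follows; the total factor matrix T and the covariance Σ would come from prior training and are random stand-ins here, and the centering term and the choice of the modified zero-order statistic are this document's hedged reading of the formula, not the patent's exact implementation.

```python
# Formula (9): i-vector extraction from the modified statistics, continuing
# the sketches above. C components, D feature dims, R total-factor dims.
C, D, R = 64, 20, 100
T = np.random.randn(C * D, R)          # total factor matrix (stand-in)
Sigma_inv = np.ones(C * D)             # inverse of the diagonal covariance (stand-in)
m = ubm.means_.reshape(-1)             # mean supervector, formula (6)

N_hat = np.repeat(N, D)                # illustratively take N_hat_c(u) = N_c(u)
F_hat = (corrected_quotient * N[:, None]).reshape(-1)  # stacked F_hat(u)

lhs = np.eye(R) + T.T @ (Sigma_inv[:, None] * N_hat[:, None] * T)
rhs = T.T @ (Sigma_inv * (F_hat - N_hat * m))
w_ivector = np.linalg.solve(lhs, rhs)  # the identity vector
```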
  • in this embodiment, the first first-order statistic and the first zero-order statistic more accurately reflect the characteristics of the acoustic features, which makes it convenient to calculate an accurate modified statistic. Since the quotient of a first-order statistic and the corresponding zero-order statistic basically stays within a stable range, a linear summation can be performed directly when determining the modified statistic, reducing the amount of calculation.
  • Figure 5 is a flow diagram showing the steps of constructing a statistic space in one embodiment.
  • the step of constructing a statistic space specifically includes the following steps.
  • specifically, speech samples whose duration exceeds the preset speech duration may be selected from the speech samples used to train the speaker background model. For each speaker category s among the S speaker categories in the speech samples, the second zero-order statistic N_c^(s) and the second first-order statistic F_c^(s) corresponding to each Gaussian distribution component c are computed, with reference to formulas (2) and (3); the first quotient F_c^(s) / N_c^(s) of the second first-order statistic and the corresponding second zero-order statistic is then calculated.
  • the first quotients, for each speaker category s and each Gaussian distribution component c in the speaker background model, may be arranged in order according to speaker category and corresponding Gaussian distribution component to form a matrix representing the statistic space.
  • in this embodiment, the statistic space is established based on the first quotient of the second first-order statistic and the corresponding second zero-order statistic; since the quotient of a first-order statistic and the corresponding zero-order statistic basically stays within a stable range, the calculation of mapping the first zero-order statistic and the first first-order statistic to the statistic space is facilitated, improving calculation efficiency.
  • in one embodiment, S508 includes: subtracting the mean of the corresponding Gaussian distribution component from each calculated first quotient to obtain a corresponding difference; and arranging the obtained differences in order according to speaker category and corresponding Gaussian distribution component to form a matrix characterizing the statistic space.
  • specifically, the matrix H characterizing the statistic space can be determined according to the following formula (10):

$$H = \left[\, \left(N^{(1)}\right)^{-1} F^{(1)} - m,\ \left(N^{(2)}\right)^{-1} F^{(2)} - m,\ \ldots,\ \left(N^{(S)}\right)^{-1} F^{(S)} - m \,\right] \qquad (10)$$

where m represents the mean supervector of the speaker background model; F^(s), s ∈ [1, S], represents the second first-order statistic matrix corresponding to the s-th speaker category; and N^(s) represents the second zero-order statistics, corresponding to the Gaussian distribution components c of the speaker background model, of the s-th speaker category.
  • in this embodiment, the mean of the corresponding Gaussian distribution component is subtracted from each calculated first quotient to obtain a corresponding difference, and the obtained differences are arranged in order according to speaker category and corresponding Gaussian distribution component to form the matrix characterizing the statistic space; this roughly centers the constructed statistic space at the origin, which is convenient for calculation and improves computational efficiency.
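  • as an illustration of building H, the per-category statistics below are random stand-ins for statistics gathered from long speech samples per formulas (2) and (3); the column layout is one plausible arrangement of formula (10).

```python
# Formula (10): build the matrix H characterizing the statistic space from
# S_cat long-duration speaker categories; per-category stats are stand-ins.
S_cat = 50
columns = []
for s in range(S_cat):
    Ns = np.random.rand(C) * 500 + 100          # second zero-order stats N_c^(s)
    Fs = np.random.randn(C, D) * Ns[:, None]    # second first-order stats F_c^(s)
    first_quotient = Fs / Ns[:, None]           # F_c^(s) / N_c^(s)
    columns.append((first_quotient - ubm.means_).reshape(-1))  # subtract mu_c
H = np.stack(columns, axis=1)                   # shape (C*D, S_cat)
```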
  • in one embodiment, step S410 specifically includes: acquiring orthogonal basis vectors of the statistic space; obtaining mapping coefficients for the orthogonal basis vectors such that the two-norm distance between the product of the orthogonal basis vectors and the mapping coefficients plus the mean of the corresponding Gaussian distribution component, on the one hand, and the third quotient of the corresponding Gaussian distribution component, on the other, is minimized; and multiplying the orthogonal basis vectors by the mapping coefficients and adding the means of the corresponding Gaussian distribution components, to obtain the second quotient of the reference first-order statistic and the corresponding reference zero-order statistic for each Gaussian distribution component in the speaker background model.
  • specifically, the statistic space can be decomposed by eigenvalue decomposition to obtain a set of orthogonal basis vectors F_eigen of the statistic space. An optimization function of the following formula (12) can be defined:

$$\alpha^{*} = \arg\min_{\alpha}\,\left\| \left(F_{eigen}\,\alpha + m\right) - N(u)^{-1} F(u) \right\|_2 \qquad (12)$$

where N_c(u) represents the first zero-order statistic corresponding to the Gaussian distribution component c; F_c(u) represents the first first-order statistic corresponding to the Gaussian distribution component c; μ_c represents the mean corresponding to the Gaussian distribution component c; F_eigen represents the orthogonal basis vectors of the statistic space H; and α represents the mapping coefficients.
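  • the sketch below illustrates this mapping under stated assumptions: an orthonormal basis of the statistic space via SVD, mapping coefficients by least squares as in formula (12), and the resulting reference quotient that replaces the placeholder used in the earlier interpolation sketch.

```python
# Formula (12): project the utterance's quotient onto the statistic space.
# SVD of H gives orthonormal basis vectors spanning its column space
# (equivalent, on that span, to an eigen-decomposition of H @ H.T).
F_eigen, _, _ = np.linalg.svd(H, full_matrices=False)  # orthonormal columns

target = third_quotient.reshape(-1) - m   # center the quotient at m
alpha = F_eigen.T @ target                # least-squares mapping coefficients
ref_quotient = (F_eigen @ alpha + m).reshape(C, D)  # second quotient F^ref/N^ref
```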
  • the to-be-processed voice data includes voice data to be verified and voice data of the target speaker category; step S312 includes: generating an identity vector to be verified according to the modified statistic corresponding to the voice data to be verified; and generating a target speaker identity vector according to the modified statistic corresponding to the voice data of the target speaker category.
  • the identity vector generating method further includes: calculating a similarity between the identity vector to be verified and the target speaker identity vector; and performing speaker identity verification according to the similarity.
  • speaker identification can be applied to a variety of scenarios that require verification of an unknown user's identity. Speaker identification is divided into an offline stage and an online stage: in the offline stage, a large number of speech samples of non-target speaker categories need to be collected for training the speaker identification system, which includes an identity vector extraction module and an identity vector regularization module.
  • the online stage is divided into two phases: a registration phase and an identification phase. In the registration phase, voice data of the target speaker needs to be acquired; the voice data is preprocessed, features are extracted, and a model is trained, after which it is mapped to a fixed-length identity vector, which serves as the identity model of the target speaker. In the identification phase, a voice of unknown identity is obtained; the voice to be verified likewise undergoes preprocessing, feature extraction, and model training, and is then mapped to an identity vector to be verified.
  • the similarity between the identity vector of the target speaker category from the registration phase and the identity vector to be verified from the identification phase is then calculated in a similarity calculation module, and the similarity is compared with a previously manually set threshold. If the similarity is greater than or equal to the threshold, it can be determined that the identity corresponding to the voice to be verified matches the target speaker's identity, and the identity verification passes; if the similarity is less than the threshold, it can be determined that the identity corresponding to the voice to be verified does not match the target speaker's identity, and the identity verification fails.
  • the similarity may be a cosine similarity, a Pearson correlation coefficient, or a Euclidean distance.
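  • a sketch of the verification decision with cosine similarity follows; the i-vectors and the threshold value are illustrative stand-ins, not values from the patent.

```python
# Speaker verification by cosine similarity against a manually set threshold.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

w_target = np.random.randn(R)   # registered target speaker i-vector (stand-in)
w_verify = np.random.randn(R)   # i-vector to be verified (stand-in)
threshold = 0.5                 # illustrative threshold
accepted = cosine_similarity(w_target, w_verify) >= threshold
```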
  • even when the speech data to be processed has a short duration, the identity vector generation method of this embodiment can still generate an identity vector with high identity recognition performance, and the speaker does not need to utter overly long speech, so that speaker recognition on short speech can be widely promoted.
  • FIG. 6 is a block diagram showing the structure of a computer device 600 in one embodiment.
  • the computer device 600 can be used as a server or as a terminal.
  • the internal structure of the server may correspond to the structure as shown in FIG. 2A
  • the internal structure of the terminal may correspond to the structure as shown in FIG. 2B.
  • Each of the modules described below may be implemented in whole or in part by software, hardware, or a combination thereof.
  • the computer device 600 includes an acoustic feature extraction module 610, a statistic generation module 620, a mapping module 630, a modified statistic determination module 640, and an identity vector generation module 650.
  • the acoustic feature extraction module 610 is configured to acquire voice data to be processed, and extract corresponding acoustic features from the voice data to be processed.
  • the statistic generation module 620 is configured to obtain a statistic of the posterior probabilities that each acoustic feature belongs to each Gaussian distribution component in the speaker background model.
  • the mapping module 630 is configured to map the statistic to the statistic space to obtain the reference statistic; the statistic space is constructed according to the statistic corresponding to the voice sample exceeding the preset speech duration.
  • the modified statistic determination module 640 is configured to determine the corrected statistic according to the statistically obtained statistic and the reference statistic.
  • the identity vector generation module 650 is configured to generate an identity vector according to the modified statistic.
  • in the above computer device, the statistic space is constructed according to statistics corresponding to voice samples exceeding the preset speech duration. After the statistic of the posterior probabilities that the acoustic features belong to each Gaussian distribution component in the speaker background model is obtained, the statistic is mapped into the statistic space, and the obtained reference statistic is a prior statistic. The prior statistic is used to correct the statistic to obtain a modified statistic, which can compensate for the biased estimation of statistics caused by the to-be-processed speech data being too short or too sparse, thereby improving the identity recognition performance of the identity vector.
  • FIG. 7 is a structural block diagram of a statistic generating module 620 in an embodiment.
  • the statistic includes a first zero-order statistic and a first first-order statistic; the statistic generation module 620 includes a first zero-order statistic generation module 621 and a first first-order statistic generation module 622.
  • the first zero-order statistic generation module 621 is configured to compute, corresponding to each Gaussian distribution component in the speaker background model, the sum of the posterior probabilities of the respective acoustic features belonging to that Gaussian distribution component as the corresponding first zero-order statistic.
  • the first first-order statistic generation module 622 is configured to compute, corresponding to each Gaussian distribution component in the speaker background model, a weighted sum of the acoustic features, with the posterior probability of each acoustic feature belonging to that Gaussian distribution component as its weight, as the corresponding first first-order statistic.
  • FIG. 8 is a block diagram showing the structure of a computer device 600 in another embodiment.
  • the computer device 600 also includes a statistic statistics module 660 and a statistic space building module 670.
  • the statistic statistics module 660 is configured to acquire speech samples exceeding the preset speech duration, and to compute, according to the speaker categories in the speech samples, the second zero-order statistic and the second first-order statistic corresponding to each Gaussian distribution component in the speaker background model.
  • the statistic space construction module 670 is configured to calculate a first quotient of the second first order statistic and the corresponding second zero order statistic; construct a statistic space according to the calculated first quotient.
  • the statistic space is established based on the first quotient of the second first-order statistic and the corresponding second zero-order statistic, and the quotient of the first-order statistic and the corresponding zero-order statistic is basically kept within a stable range. It is convenient to map the first zero-order statistic and the first first-order statistic to the calculation of the statistic space, thereby improving the calculation efficiency.
  • in one embodiment, the statistic space construction module 670 is further configured to subtract the mean of the corresponding Gaussian distribution component from each calculated first quotient to obtain a corresponding difference, and to arrange the obtained differences in order according to speaker category and corresponding Gaussian distribution component to form a matrix characterizing the statistic space.
  • the calculated first quotient is subtracted from the mean value of the corresponding Gaussian distribution component to obtain a corresponding difference, so that the obtained difference is sequentially arranged according to the speaker category and the corresponding Gaussian distribution component to form a characterization statistic space.
  • the matrix makes the constructed statistic space center roughly at the origin of the statistic space, which is convenient for calculation and improves computational efficiency.
  • in one embodiment, the reference statistic includes, for each Gaussian distribution component in the speaker background model, a second quotient of a reference first-order statistic and the corresponding reference zero-order statistic; the modified statistic determination module 640 is further configured to sum, with weights, the third quotient of the first first-order statistic and the corresponding first zero-order statistic with the second quotient of the corresponding Gaussian distribution component, to obtain, for each Gaussian distribution component in the speaker background model, a fourth quotient of a modified first-order statistic and the corresponding modified zero-order statistic as the modified statistic.
  • in one embodiment, the weight of the third quotient is the first zero-order statistic of the corresponding Gaussian distribution component divided by the sum of that first zero-order statistic and the tunable parameter, and the weight of the second quotient is the tunable parameter divided by the sum of the first zero-order statistic of the corresponding Gaussian distribution component and the tunable parameter.
  • in one embodiment, the mapping module 630 is further configured to: acquire orthogonal basis vectors of the statistic space; obtain mapping coefficients for the orthogonal basis vectors such that the two-norm distance between the product of the orthogonal basis vectors and the mapping coefficients plus the mean of the corresponding Gaussian distribution component, and the third quotient of the corresponding Gaussian distribution component, is minimized; and multiply the orthogonal basis vectors by the mapping coefficients and add the means of the corresponding Gaussian distribution components, to obtain the second quotient of the reference first-order statistic and the corresponding reference zero-order statistic for each Gaussian distribution component in the speaker background model.
  • the to-be-processed voice data includes the voice data to be verified and the voice data of the target speaker category; the identity vector generation module 650 is further configured to generate the identity vector to be verified according to the modified statistic corresponding to the voice data to be verified, and to generate the target speaker identity vector according to the modified statistic corresponding to the voice data of the target speaker category.
  • FIG. 9 is a block diagram showing the structure of a computer device 600 in still another embodiment.
  • the computer device 600 in this embodiment further includes: a speaker identity verification module 680, configured to calculate a similarity between the identity vector to be verified and the target speaker identity vector; and perform speaker identity verification according to the similarity.
  • even when the speech data to be processed has a short duration, the computer device of this embodiment can still generate an identity vector with high identity recognition performance, and the speaker does not need to utter overly long speech, so that speaker recognition on short speech can be widely promoted.
  • in one embodiment, a computer device is provided, including a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of: acquiring voice data to be processed; extracting corresponding acoustic features from the voice data to be processed; obtaining a statistic of the posterior probabilities that each acoustic feature belongs to each Gaussian distribution component in the speaker background model; mapping the statistic to a statistic space to obtain a reference statistic, the statistic space being constructed according to statistics corresponding to voice samples exceeding the preset voice duration; determining a modified statistic according to the statistic and the reference statistic; and generating an identity vector according to the modified statistic.
  • in one embodiment, the statistic includes a first zero-order statistic and a first first-order statistic, and obtaining the statistic includes: corresponding to each Gaussian distribution component in the speaker background model, computing the sum of the posterior probabilities of the respective acoustic features belonging to that Gaussian distribution component as the corresponding first zero-order statistic; and, corresponding to each Gaussian distribution component in the speaker background model, computing a weighted sum of the acoustic features, with the posterior probability of each acoustic feature belonging to that Gaussian distribution component as its weight, as the corresponding first first-order statistic.
  • in one embodiment, the computer readable instructions further cause the processor to: acquire speech samples exceeding the preset speech duration; compute, according to the speaker categories in the speech samples, a second zero-order statistic and a second first-order statistic corresponding to each Gaussian distribution component in the speaker background model; calculate a first quotient of the second first-order statistic and the corresponding second zero-order statistic; and construct the statistic space according to the calculated first quotient.
  • in one embodiment, constructing the statistic space according to the calculated first quotient includes: subtracting the mean of the corresponding Gaussian distribution component from each calculated first quotient to obtain a corresponding difference; and arranging the obtained differences in order according to speaker category and corresponding Gaussian distribution component to form a matrix characterizing the statistic space.
  • in one embodiment, the weight of the third quotient is the first zero-order statistic of the corresponding Gaussian distribution component divided by the sum of that first zero-order statistic and the tunable parameter, and the weight of the second quotient is the tunable parameter divided by the sum of the first zero-order statistic of the corresponding Gaussian distribution component and the tunable parameter.
  • in one embodiment, mapping the statistic to the statistic space to obtain the reference statistic includes: acquiring orthogonal basis vectors of the statistic space; obtaining mapping coefficients for the orthogonal basis vectors such that the two-norm distance between the product of the orthogonal basis vectors and the mapping coefficients plus the mean of the corresponding Gaussian distribution component, and the third quotient of the corresponding Gaussian distribution component, is minimized; and multiplying the orthogonal basis vectors by the mapping coefficients and adding the means of the corresponding Gaussian distribution components, to obtain the second quotient of the reference first-order statistic and the corresponding reference zero-order statistic for each Gaussian distribution component in the speaker background model.
  • in one embodiment, the to-be-processed voice data includes voice data to be verified and voice data of a target speaker category, and generating the identity vector according to the modified statistic includes: generating an identity vector to be verified according to the modified statistic corresponding to the voice data to be verified; and generating a target speaker identity vector according to the modified statistic corresponding to the voice data of the target speaker category. The computer readable instructions further cause the processor to perform the steps of: calculating the similarity between the identity vector to be verified and the target speaker identity vector; and performing speaker identity verification according to the similarity.
  • in this embodiment, the statistic space is constructed according to statistics corresponding to voice samples exceeding the preset speech duration. After the statistic of the posterior probabilities that the acoustic features belong to each Gaussian distribution component in the speaker background model is obtained, the statistic is mapped into the statistic space, and the obtained reference statistic is a prior statistic. The prior statistic is used to correct the statistic to obtain a modified statistic, which can compensate for the biased estimation of statistics caused by the to-be-processed voice data being too short or too sparse, thereby improving the identity recognition performance of the identity vector.
  • in one embodiment, one or more non-volatile computer readable storage mediums storing computer readable instructions are provided; when executed by one or more processors, the computer readable instructions cause the one or more processors to perform the steps of: acquiring voice data to be processed; extracting corresponding acoustic features from the voice data to be processed; obtaining a statistic of the posterior probabilities that each acoustic feature belongs to each Gaussian distribution component in the speaker background model; mapping the statistic to a statistic space to obtain a reference statistic, the statistic space being constructed according to statistics corresponding to voice samples exceeding the preset voice duration; determining a modified statistic according to the statistic and the reference statistic; and generating an identity vector according to the modified statistic.
  • in further embodiments, the computer readable instructions stored on the one or more storage mediums further cause the one or more processors to perform the same additional steps described above for the computer device, from obtaining the first zero-order and first first-order statistics through constructing the statistic space, mapping via the orthogonal basis vectors, determining the modified statistic, generating the identity vectors, and performing speaker identity verification according to the similarity, with the same technical effects.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An identity vector generation method includes: acquiring voice data to be processed (S302, S402); extracting corresponding acoustic features from the voice data to be processed (S304, S404); performing statistics on the posterior probability that each acoustic feature belongs to each Gaussian distribution component in a speaker background model, to obtain a statistic (S306); mapping the statistic to a statistic space to obtain a reference statistic, the statistic space being constructed according to statistics corresponding to voice samples exceeding a preset speech duration (S308); determining a modified statistic according to the statistic and the reference statistic (S310); and generating an identity vector according to the modified statistic (S312). The method can compensate for the biased estimation of statistics caused by the to-be-processed voice data having too short a speech duration or sparse speech, improving the identity recognition performance of the identity vector.

Description

Identity vector generation method, computer device and computer-readable storage medium
This application claims priority to Chinese Patent Application No. 201610560366.3, entitled "Identity vector generation method and apparatus", filed with the Chinese Patent Office on July 15, 2016, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular, to an identity vector generation method, a computer device, and a computer readable storage medium.
Background
Speaker identification is an important means of identity recognition. A segment of speech uttered by a user is collected and, after a series of operations such as preprocessing, feature extraction, modeling, and parameter estimation, the speech is mapped to a fixed-length vector that can express the speaker's speech characteristics; this vector is called an identity vector (i-vector). The identity vector can well express the speaker identity information contained in the corresponding speech.
At present, when generating the identity vector of speech data, it is necessary to extract its acoustic features and, based on a speaker background model in the form of a Gaussian mixture model, compute a statistic of the posterior probabilities that each acoustic feature belongs to each Gaussian distribution component of the speaker background model; an identity vector is then generated based on that statistic.
However, with the current way of generating identity vectors, the identity recognition performance of the identity vector decreases when the speech data has a relatively short speech length or relatively sparse speech.
Summary
According to various embodiments of the present application, an identity vector generation method, a computer device, and a computer readable storage medium are provided.
An identity vector generation method includes:
acquiring voice data to be processed;
extracting corresponding acoustic features from the voice data to be processed;
performing statistics on the posterior probability that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model, to obtain a statistic;
mapping the statistic to a statistic space to obtain a reference statistic, the statistic space being constructed according to statistics corresponding to voice samples exceeding a preset speech duration;
determining a modified statistic according to the statistic and the reference statistic; and
generating an identity vector according to the modified statistic.
A computer device includes a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps:
acquiring voice data to be processed;
extracting corresponding acoustic features from the voice data to be processed;
performing statistics on the posterior probability that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model, to obtain a statistic;
mapping the statistic to a statistic space to obtain a reference statistic, the statistic space being constructed according to statistics corresponding to voice samples exceeding a preset speech duration;
determining a modified statistic according to the statistic and the reference statistic; and
generating an identity vector according to the modified statistic.
One or more non-volatile computer readable storage mediums storing computer readable instructions are provided; when executed by one or more processors, the computer readable instructions cause the one or more processors to perform the following steps:
acquiring voice data to be processed;
extracting corresponding acoustic features from the voice data to be processed;
performing statistics on the posterior probability that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model, to obtain a statistic;
mapping the statistic to a statistic space to obtain a reference statistic, the statistic space being constructed according to statistics corresponding to voice samples exceeding a preset speech duration;
determining a modified statistic according to the statistic and the reference statistic; and
generating an identity vector according to the modified statistic.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the specification, the accompanying drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of the present application, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
FIG. 1 is an application environment diagram of a speaker recognition system in an embodiment;
FIG. 2A is a schematic diagram of the internal structure of a server in an embodiment;
FIG. 2B is a schematic diagram of the internal structure of a terminal in an embodiment;
FIG. 3 is a schematic flowchart of an identity vector generation method in an embodiment;
FIG. 4 is a schematic flowchart of an identity vector generation method in another embodiment;
FIG. 5 is a schematic flowchart of the steps of constructing a statistic space in an embodiment;
FIG. 6 is a structural block diagram of a computer device in an embodiment;
FIG. 7 is a structural block diagram of a statistic generation module in an embodiment;
FIG. 8 is a structural block diagram of a computer device in another embodiment;
FIG. 9 is a structural block diagram of a computer device in still another embodiment.
Detailed Description
To make the technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely used to explain the present application and are not intended to limit the present application.
It can be understood that the terms "first", "second", and the like used in the present application may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. Both the first zero-order statistic and the second zero-order statistic are zero-order statistics, but they are not the same zero-order statistic.
FIG. 1 is an application environment diagram of a speaker recognition system in an embodiment. As shown in FIG. 1, the system includes a terminal 110 and a server 120 connected through a network. The terminal 110 can be configured to collect voice data to be verified, generate an identity vector to be verified by using the identity vector generation method in the present application, and send the identity vector to be verified to the server 120. The server 120 can collect voice data of a target speaker category and generate a target speaker identity vector by using the identity vector generation method in the present application. The server 120 can be configured to calculate the similarity between the identity vector to be verified and the target speaker identity vector, and to perform speaker identity verification according to the similarity. The server 120 can be configured to feed back the identity verification result to the terminal 110.
FIG. 2A is a schematic diagram of the internal structure of a server in an embodiment. As shown in FIG. 2A, the server includes a processor, a non-volatile storage medium, an internal memory, and a network interface connected through a system bus. The non-volatile storage medium of the server stores an operating system, a database, and computer readable instructions; when executed by the processor, the computer readable instructions can cause the processor to implement an identity vector generation method. The processor of the server is configured to provide computing and control capabilities, supporting the operation of the entire server. Computer readable instructions may be stored in the internal memory of the server; when executed by the processor, the computer readable instructions can cause the processor to perform an identity vector generation method. The network interface of the server is used for communicating with the terminal. The server can be implemented as a stand-alone server or as a server cluster consisting of multiple servers. A person skilled in the art can understand that the structure shown in FIG. 2A is only a block diagram of a part of the structure related to the solution of the present application and does not constitute a limitation on the server to which the solution of the present application is applied; the specific server may include more or fewer components than shown in the figure, or combine some components, or have a different component arrangement.
FIG. 2B is a schematic diagram of the internal structure of a terminal in an embodiment. As shown in FIG. 2B, the terminal includes a processor, a non-volatile storage medium, an internal memory, a network interface, and a sound collection device connected through a system bus. The non-volatile storage medium of the terminal stores an operating system and further stores computer readable instructions; when executed by the processor, the computer readable instructions can cause the processor to implement an identity vector generation method. The processor is configured to provide computing and control capabilities, supporting the operation of the entire terminal. Computer readable instructions may be stored in the internal memory of the terminal; when executed by the processor, the computer readable instructions can cause the processor to perform an identity vector generation method. The network interface is used for network communication with the server. The terminal can be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like. A person skilled in the art can understand that the structure shown in FIG. 2B is only a block diagram of a part of the structure related to the solution of the present application and does not constitute a limitation on the terminal to which the solution of the present application is applied; the specific terminal may include more or fewer components than shown in the figure, or combine some components, or have a different component arrangement.
FIG. 3 is a schematic flowchart of an identity vector generation method in an embodiment. This embodiment is described by taking the application of the method to the server 120 as an example. Referring to FIG. 3, the method specifically includes the following steps:
S302: acquiring voice data to be processed.
The voice data to be processed is voice data that needs to undergo a series of processing to generate a corresponding identity vector. Voice data is data formed by saving the sound collected by a sound collection device after a speaker utters speech. The to-be-processed voice data may include voice data to be verified and voice data of a target speaker category. The voice data to be verified is voice data whose speaker category is unknown and for which it needs to be determined whether it belongs to the target speaker category; the target speaker category is a known speaker category, composed of the voice data produced by the target speaker.
S304: extracting corresponding acoustic features from the voice data to be processed.
Specifically, the server may preprocess the voice data to be processed, for example by filtering out noise or unifying the speech format, and then extract a corresponding acoustic feature vector from the preprocessed voice data. The acoustic feature vector is a vector composed of acoustic features that reflect acoustic properties. The acoustic feature vector includes a series of acoustic features, which may be Mel-frequency cepstral coefficients (MFCC) or linear prediction cepstral coefficients (LPCC).
S306: performing statistics on the posterior probability that each acoustic feature belongs to each Gaussian distribution component in the speaker background model, to obtain a statistic.
The speaker background model is a Gaussian mixture model trained on a series of speech samples and is used to represent the speaker-independent distribution of features. The Gaussian mixture model is a mathematical model formed by superimposing a fixed number of Gaussian distribution components. The speaker background model can be trained by the EM algorithm (Expectation-Maximization algorithm). The speaker background model can adopt a GMM-UBM (Gaussian Mixture Model-Universal Background Model).
In one embodiment, the speaker background model can be expressed by the following formula (1):

$$P(x \mid \Omega) = \sum_{c=1}^{C} a_c \, N(x \mid \mu_c, \Sigma_c) \qquad (1)$$

where x denotes a speech sample; C is the total number of Gaussian distribution components included in the Gaussian mixture model, and c denotes the index of a Gaussian distribution component among them; N(x | μ_c, Σ_c) denotes the c-th Gaussian distribution component; a_c is the coefficient of the c-th Gaussian distribution component; μ_c is the mean of the c-th Gaussian distribution component; and Σ_c is the variance of the c-th Gaussian distribution component.
In one embodiment, the acoustic feature vector can be expressed as {y_1, y_2, ..., y_L}. The acoustic feature vector includes L acoustic features, each of which can be represented as y_t, where t ∈ [1, L]. In one embodiment, the posterior probability that each acoustic feature in the acoustic feature vector belongs to each Gaussian distribution component in the speaker background model can be expressed as P(c | y_t, Ω), where Ω denotes the speaker background model, and P(c | y_t, Ω) denotes the posterior probability that the acoustic feature y_t belongs to the c-th Gaussian distribution component given that the speaker background model Ω and the acoustic feature y_t have been observed. The server can obtain the statistic by performing statistics based on the posterior probability P(c | y_t, Ω).
S308: mapping the statistic to a statistic space to obtain a reference statistic; the statistic space is constructed according to statistics corresponding to voice samples exceeding a preset speech duration.
The statistic space is a vector space constructed from statistics, of the same type as the statistic obtained above, that correspond to speech samples whose speech duration exceeds the preset speech duration, for example 30 seconds. The speech samples used to construct the statistic space may be selected, from among the speech samples used to train the speaker background model, as those exceeding the preset speech duration. After the statistic is mapped to the statistic space, a reference statistic is obtained; the reference statistic is a prior statistic determined according to the statistics corresponding to the speech samples exceeding the preset speech duration.
S310: determining a modified statistic according to the statistic and the reference statistic.
The modified statistic is a statistic obtained by using the reference statistic to correct the statistic; it combines the prior statistic and the posterior statistic.
S312: generating an identity vector according to the modified statistic.
Specifically, after the modified statistic is obtained, the identity vector can be generated by using the modified statistic in a conventional identity vector generation manner.
In the above identity vector generation method, the statistic space is constructed according to statistics corresponding to speech samples exceeding the preset speech duration. After the statistic of the posterior probabilities that the acoustic features belong to each Gaussian distribution component in the speaker background model is obtained, the statistic is mapped into the statistic space, and the obtained reference statistic is a prior statistic. The prior statistic is used to correct the statistic to obtain a modified statistic, which can compensate for the biased estimation of statistics caused by the to-be-processed voice data having too short a speech duration or sparse speech, improving the identity recognition performance of the identity vector.
FIG. 4 is a schematic flowchart of an identity vector generation method in another embodiment. As shown in FIG. 4, the identity vector generation method includes the following steps:
S402: acquiring voice data to be processed.
S404: extracting corresponding acoustic features from the voice data to be processed.
S406: corresponding to each Gaussian distribution component in the speaker background model, computing the sum of the posterior probabilities of the respective acoustic features belonging to that Gaussian distribution component as the corresponding first zero-order statistic.
Specifically, corresponding to each Gaussian distribution component c in the speaker background model Ω, the posterior probabilities P(c | y_t, Ω) of each acoustic feature y_t belonging to the Gaussian distribution component c are summed, and the sum is taken as the first zero-order statistic corresponding to the Gaussian distribution component c.
More specifically, the first zero-order statistic N_c(u) corresponding to the Gaussian distribution component c can be calculated by the following formula (2):

$$N_c(u) = \sum_{t=1}^{L} P(c \mid y_t, \Omega) \qquad (2)$$

where u denotes the voice data to be processed; N_c(u) denotes the first zero-order statistic of the to-be-processed voice data u corresponding to the Gaussian distribution component c; y_t denotes the t-th acoustic feature among the L acoustic features of the acoustic feature vector; and P(c | y_t, Ω) denotes the posterior probability that the acoustic feature y_t belongs to the c-th Gaussian distribution component given that the speaker background model Ω and the acoustic feature y_t have been observed.
S408: corresponding to each Gaussian distribution component in the speaker background model, computing a weighted sum of the acoustic features, with the posterior probability of each acoustic feature belonging to that Gaussian distribution component as its weight, as the corresponding first first-order statistic.
S406 and S408 are included in the above step S306. Specifically, corresponding to each Gaussian distribution component c in the speaker background model, a weighted sum of the acoustic features y_t is computed, with the posterior probability P(c | y_t, Ω) of each acoustic feature y_t belonging to the Gaussian distribution component c as the weight, and the weighted sum is taken as the first first-order statistic corresponding to the Gaussian distribution component c.
More specifically, the first first-order statistic F_c(u) corresponding to the Gaussian distribution component c can be calculated by the following formula (3):

$$F_c(u) = \sum_{t=1}^{L} P(c \mid y_t, \Omega)\, y_t \qquad (3)$$

where u denotes the voice data to be processed; F_c(u) denotes the first first-order statistic of the to-be-processed voice data u corresponding to the Gaussian distribution component c; y_t denotes the t-th acoustic feature among the L acoustic features of the acoustic feature vector; and P(c | y_t, Ω) denotes the posterior probability that the acoustic feature y_t belongs to the c-th Gaussian distribution component given that the speaker background model Ω and the acoustic feature y_t have been observed.
S410: Map the first zeroth-order statistics and the first first-order statistics to the statistics space, to obtain, for each Gaussian distribution component in the speaker background model, the second quotient of the reference first-order statistic and the corresponding reference zeroth-order statistic; the statistics space is constructed from statistics corresponding to speech samples exceeding a preset speech duration.
Specifically, the first zeroth-order statistic $N_c(u)$ and the first first-order statistic $F_c(u)$ are mapped to the statistics space H, to obtain, for each Gaussian distribution component c in the speaker background model, the second quotient $F_c^{ref}(u)/N_c^{ref}(u)$ of the reference first-order statistic $F_c^{ref}(u)$ and the corresponding reference zeroth-order statistic $N_c^{ref}(u)$.
S412: Compute a weighted sum of the third quotient of the first first-order statistic and the corresponding first zeroth-order statistic, and the second quotient of the corresponding Gaussian distribution component, to obtain, for each Gaussian distribution component in the speaker background model, the fourth quotient of the corrected first-order statistic and the corresponding corrected zeroth-order statistic as the corrected statistic.
Specifically, the corrected statistic corresponding to the Gaussian distribution component c may be computed by the following formula (4):
$$\frac{\tilde{F}_c(u)}{\tilde{N}_c(u)} = R_1\,\frac{F_c^{ref}(u)}{N_c^{ref}(u)} + R_2\,\frac{F_c(u)}{N_c(u)} \qquad (4)$$
where $\tilde{F}_c(u)$ denotes the corrected first-order statistic corresponding to the Gaussian distribution component c; $\tilde{N}_c(u)$ denotes the corrected zeroth-order statistic corresponding to the Gaussian distribution component c; $R_1$ and $R_2$ are weights; $F_c^{ref}(u)/N_c^{ref}(u)$ denotes the second quotient corresponding to the Gaussian distribution component c; and $F_c(u)/N_c(u)$ denotes the third quotient corresponding to the Gaussian distribution component c. $R_1$ and $R_2$ may be constrained to sum to 1.
In one embodiment, in the weighted sum, the weight of the third quotient is the first zeroth-order statistic of the corresponding Gaussian distribution component divided by the sum of that first zeroth-order statistic and an adjustable parameter, and the weight of the second quotient is the adjustable parameter divided by the sum of the first zeroth-order statistic of the corresponding Gaussian distribution component and the adjustable parameter.
Specifically, the corrected statistic corresponding to the Gaussian distribution component c may be computed by the following formula (5):
$$\frac{\tilde{F}_c(u)}{\tilde{N}_c(u)} = \frac{N_c(u)}{N_c(u)+q}\cdot\frac{F_c(u)}{N_c(u)} + \frac{q}{N_c(u)+q}\cdot\frac{F_c^{ref}(u)}{N_c^{ref}(u)} \qquad (5)$$
where the weight of the third quotient $F_c(u)/N_c(u)$ is $N_c(u)/(N_c(u)+q)$, that is, the first zeroth-order statistic $N_c(u)$ of the corresponding Gaussian distribution component c divided by the sum of $N_c(u)$ and the adjustable parameter q; and the weight of the second quotient $F_c^{ref}(u)/N_c^{ref}(u)$ is $q/(N_c(u)+q)$, that is, the adjustable parameter q divided by the sum of $N_c(u)$ and q. Good results can be achieved when q is in the range 0.4 to 1. In this embodiment, by adjusting the adjustable parameter, differentiated adjustment can be made for different environments, increasing robustness.
S414: Generate an identity vector according to the corrected statistics.
Specifically, taking $\tilde{N}_c(u) = N_c(u)$, the corrected first-order statistic $\tilde{F}_c(u)$ can be obtained from the fourth quotient $\tilde{F}_c(u)/\tilde{N}_c(u)$.
The mean supervector m of the speaker background model is defined according to the following formula (6):
$$m = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_C \end{bmatrix} \qquad (6)$$
where $\mu_1, \mu_2, \ldots, \mu_C$ are the means of the respective Gaussian distribution components of the speaker background model.
The corrected zeroth-order statistic matrix $\tilde{N}(u)$ in diagonal form is defined according to the following formula (7):
$$\tilde{N}(u) = \begin{bmatrix} \tilde{N}_1(u)\,I & & \\ & \ddots & \\ & & \tilde{N}_C(u)\,I \end{bmatrix} \qquad (7)$$
where $\tilde{N}_1(u), \ldots, \tilde{N}_C(u)$ are the corrected zeroth-order statistics corresponding to the respective Gaussian distribution components of the speaker background model, and I is the identity matrix.
The corrected first-order statistic matrix $\tilde{F}(u)$ is defined according to the following formula (8):
$$\tilde{F}(u) = \begin{bmatrix} \tilde{F}_1(u) \\ \tilde{F}_2(u) \\ \vdots \\ \tilde{F}_C(u) \end{bmatrix} \qquad (8)$$
where $\tilde{F}_1(u), \ldots, \tilde{F}_C(u)$ are the corrected first-order statistics corresponding to the respective Gaussian distribution components of the speaker background model.
In one embodiment, the identity vector $\tilde{w}$ may be computed according to the following formula (9):
$$\tilde{w} = \left(I + T^{t}\,\Sigma^{-1}\,\tilde{N}(u)\,T\right)^{-1} T^{t}\,\Sigma^{-1}\left(\tilde{F}(u) - \tilde{N}(u)\,m\right) \qquad (9)$$
where I denotes the identity matrix; T denotes the known total factor matrix; the superscript t denotes transposition; $\Sigma$ denotes the covariance matrix in diagonal form, whose diagonal elements are the covariances of the respective Gaussian distribution components; m denotes the mean supervector of the speaker background model; $\tilde{N}(u)$ denotes the corrected zeroth-order statistic matrix; and $\tilde{F}(u)$ denotes the corrected first-order statistic matrix.
In one embodiment, formula (9) may be transformed so that the computation involving the matrices $\tilde{N}(u)$ and $\tilde{F}(u)$ becomes a computation involving $\tilde{N}_c(u)$ and the fourth quotient $\tilde{F}_c(u)/\tilde{N}_c(u)$, since $\tilde{F}_c(u) = \tilde{N}_c(u)\cdot\big(\tilde{F}_c(u)/\tilde{N}_c(u)\big)$. In this embodiment, once the fourth quotient $\tilde{F}_c(u)/\tilde{N}_c(u)$ is obtained, it can be used directly to compute the identity vector, without constructing the matrices $\tilde{N}(u)$ and $\tilde{F}(u)$, which simplifies the computation.
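A sketch of this per-component evaluation of formula (9); the per-component blocks of the total factor matrix and the diagonal covariances are assumed to be given as arrays, and all names are illustrative:

```python
# Sketch only: i-vector from corrected zeroth-order stats and fourth quotients,
# without building the large block matrices N~(u) and F~(u).
import numpy as np

def extract_ivector(N_corr, quot4, T, Sigma, means):
    """N_corr: (C,) corrected zeroth-order stats N~_c(u);
    quot4: (C, D) fourth quotients F~_c(u)/N~_c(u);
    T: (C, D, R) per-component blocks T_c of the total factor matrix;
    Sigma: (C, D) diagonals of the component covariances;
    means: (C, D) component means mu_c. Returns the R-dimensional i-vector."""
    R = T.shape[2]
    A = np.eye(R)                                   # accumulates I + sum_c ...
    b = np.zeros(R)
    for c in range(N_corr.shape[0]):
        TtS = T[c].T * (1.0 / Sigma[c])             # T_c^t Sigma_c^{-1}, shape (R, D)
        A += N_corr[c] * TtS @ T[c]                 # N~_c T_c^t Sigma_c^{-1} T_c
        b += N_corr[c] * TtS @ (quot4[c] - means[c])  # = T_c^t Sigma_c^{-1}(F~_c - N~_c mu_c)
    return np.linalg.solve(A, b)
```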
In this embodiment, using the first first-order statistics and the first zeroth-order statistics reflects the characteristics of the acoustic features more accurately, which facilitates computing accurate corrected statistics. Since the quotient of a first-order statistic and the corresponding zeroth-order statistic remains within a basically stable range, a direct linear weighted sum can be used when determining the corrected statistics, reducing the amount of computation.
FIG. 5 is a schematic flowchart of the steps of constructing the statistics space in one embodiment. Referring to FIG. 5, the steps of constructing the statistics space specifically include the following steps:
S502: Obtain speech samples exceeding a preset speech duration.
Specifically, speech samples whose speech duration exceeds the preset speech duration may be selected from the speech samples used to train the speaker background model.
S504: Collect, by speaker class in the speech samples, second zeroth-order statistics and second first-order statistics corresponding to each Gaussian distribution component in the speaker background model.
Specifically, if the obtained speech samples cover S speaker classes in total, then for the s-th speaker class, the second zeroth-order statistic $N_c^{(s)}$ and the second first-order statistic $F_c^{(s)}$ corresponding to each Gaussian distribution component c are collected with reference to formulas (2) and (3) above.
S506: Compute the first quotient of each second first-order statistic and the corresponding second zeroth-order statistic.
Specifically, for each speaker class s, the first quotient $F_c^{(s)}/N_c^{(s)}$ of the second first-order statistic $F_c^{(s)}$ and the corresponding second zeroth-order statistic $N_c^{(s)}$ is computed for each Gaussian distribution component c in the speaker background model.
S508: Construct the statistics space according to the computed first quotients.
Specifically, the first quotients, computed for each speaker class s and each Gaussian distribution component c in the speaker background model, may be arranged in order by speaker class and corresponding Gaussian distribution component to form a matrix characterizing the statistics space.
In this embodiment, the statistics space is built on the first quotients of the second first-order statistics and the corresponding second zeroth-order statistics. Since the quotient of a first-order statistic and the corresponding zeroth-order statistic remains within a basically stable range, this facilitates the computation of mapping the first zeroth-order statistics and the first first-order statistics to the statistics space and improves computational efficiency.
In one embodiment, S508 includes: subtracting the mean of the corresponding Gaussian distribution component from each computed first quotient to obtain a corresponding difference; and arranging the obtained differences in order by speaker class and corresponding Gaussian distribution component to form a matrix characterizing the statistics space.
Specifically, the matrix H characterizing the statistics space may be determined according to the following formula (10):
$$H = \left[\frac{F^{(1)}(u)}{N^{(1)}} - m,\ \frac{F^{(2)}(u)}{N^{(2)}} - m,\ \ldots,\ \frac{F^{(S)}(u)}{N^{(S)}} - m\right] \qquad (10)$$
where m denotes the mean supervector of the speaker background model, and $F^{(s)}(u)/N^{(s)}$, $s \in [1, S]$, denotes the second first-order statistic matrix corresponding to the s-th speaker class divided, component by component, by the second zeroth-order statistics $N_c^{(s)}$ of the s-th speaker class corresponding to the Gaussian distribution components c of the speaker background model. $F^{(s)}(u)/N^{(s)}$ may be expressed in the following form:
$$\frac{F^{(s)}(u)}{N^{(s)}} = \begin{bmatrix} F_1^{(s)}(u)/N_1^{(s)} \\ F_2^{(s)}(u)/N_2^{(s)} \\ \vdots \\ F_C^{(s)}(u)/N_C^{(s)} \end{bmatrix}$$
Therefore, formula (10) may be rewritten as the following formula (11):
$$H = \begin{bmatrix} F_1^{(1)}(u)/N_1^{(1)}-\mu_1 & \cdots & F_1^{(S)}(u)/N_1^{(S)}-\mu_1 \\ \vdots & \ddots & \vdots \\ F_C^{(1)}(u)/N_C^{(1)}-\mu_C & \cdots & F_C^{(S)}(u)/N_C^{(S)}-\mu_C \end{bmatrix} \qquad (11)$$
In this embodiment, the mean of the corresponding Gaussian distribution component is subtracted from each computed first quotient to obtain a corresponding difference, and the obtained differences are arranged in order by speaker class and corresponding Gaussian distribution component to form a matrix characterizing the statistics space, so that the center of the constructed statistics space lies roughly at the origin of the statistics space, facilitating computation and improving computational efficiency.
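A sketch of formulas (10) and (11); the array shapes for the per-speaker statistics are assumptions about storage layout:

```python
# Sketch only: build the matrix H spanning the statistics space from S
# long-duration speaker classes (one column per class).
import numpy as np

def build_stat_space(N2: np.ndarray, F2: np.ndarray, means: np.ndarray):
    """N2: (S, C) second zeroth-order stats; F2: (S, C, D) second first-order
    stats; means: (C, D) component means mu_c. Returns H of shape (C*D, S)."""
    S, C, D = F2.shape
    quot1 = F2 / N2[:, :, None]             # first quotient F_c^(s) / N_c^(s)
    centered = quot1 - means[None, :, :]    # subtract mu_c per component, formula (11)
    return centered.reshape(S, C * D).T     # columns indexed by speaker class
```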
In one embodiment, step S410 specifically includes: obtaining orthogonal basis vectors of the statistics space; finding mapping coefficients of the orthogonal basis vectors such that the two-norm distance between the product of the orthogonal basis vectors and the mapping coefficients plus the mean of the corresponding Gaussian distribution component, and the third quotient of the corresponding Gaussian distribution component, is minimized; and multiplying the orthogonal basis vectors by the mapping coefficients and adding the mean of the corresponding Gaussian distribution component, to obtain the second quotient of the reference first-order statistic and the corresponding reference zeroth-order statistic for each Gaussian distribution component in the speaker background model.
Specifically, a set of orthogonal basis vectors $F_{eigen}$ of the statistics space may be obtained by eigenvalue decomposition of the statistics space. An optimization function as in the following formula (12) may be defined:
$$\beta^{*} = \arg\min_{\beta}\ \sum_{c=1}^{C} N_c(u)\,\left\|\frac{F_c(u)}{N_c(u)} - \mu_c - F_{eigen}^{(c)}\,\beta\right\|_2^2 \qquad (12)$$
where $N_c(u)$ denotes the first zeroth-order statistic corresponding to the Gaussian distribution component c; $F_c(u)$ denotes the first first-order statistic corresponding to the Gaussian distribution component c; $F_c(u)/N_c(u)$ denotes the third quotient corresponding to the Gaussian distribution component c; $\mu_c$ denotes the mean of the Gaussian distribution component c; $F_{eigen}$ denotes the orthogonal basis vectors of the statistics space H, with $F_{eigen}^{(c)}$ denoting the rows corresponding to component c; and $\beta$ denotes the mapping coefficients.
Optimizing the optimization function of formula (12) yields the optimal mapping coefficients $\beta^{*}$ as in the following formula (13):
$$\beta^{*} = \left(\sum_{c=1}^{C} N_c(u)\,{F_{eigen}^{(c)}}^{t}\,F_{eigen}^{(c)}\right)^{-1}\sum_{c=1}^{C} N_c(u)\,{F_{eigen}^{(c)}}^{t}\left(\frac{F_c(u)}{N_c(u)} - \mu_c\right) \qquad (13)$$
Further, the second quotient of the reference first-order statistic and the corresponding reference zeroth-order statistic, for each Gaussian distribution component in the speaker background model, is computed according to the following formula (14):
$$\frac{F_c^{ref}(u)}{N_c^{ref}(u)} = F_{eigen}^{(c)}\,\beta^{*} + \mu_c \qquad (14)$$
In this embodiment, the first zeroth-order statistics and the first first-order statistics can be accurately mapped to the statistics space.
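A sketch of formulas (12)–(14), using an SVD of H in place of the eigenvalue decomposition and a weighted least-squares solve; the basis rank is an illustrative parameter:

```python
# Sketch only: project the third quotients onto the orthogonal basis of the
# statistics space, then map back to obtain the second quotients.
import numpy as np

def project_to_stat_space(H, N, F, means, rank=50):
    """H: (C*D, S) statistics-space matrix; N: (C,), F: (C, D) statistics of
    the utterance; means: (C, D). Returns the second quotients, shape (C, D)."""
    C, D = F.shape
    basis = np.linalg.svd(H, full_matrices=False)[0][:, :rank]  # orthonormal F_eigen
    resid = (F / N[:, None] - means).reshape(C * D)   # third quotient minus mu, stacked
    w = np.sqrt(np.repeat(N, D))                      # sqrt of per-row N_c weights
    beta, *_ = np.linalg.lstsq(basis * w[:, None], resid * w, rcond=None)  # (13)
    ref = basis @ beta + means.reshape(C * D)         # formula (14): F_eigen beta + mu
    return ref.reshape(C, D)
```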
In one embodiment, the speech data to be processed includes speech data to be verified and speech data of a target speaker class; step S312 includes: generating an identity vector to be verified according to the corrected statistics corresponding to the speech data to be verified, and generating a target speaker identity vector according to the corrected statistics corresponding to the speech data of the target speaker class. The identity vector generation method further includes: computing a similarity between the identity vector to be verified and the target speaker identity vector, and performing speaker identity verification according to the similarity.
Specifically, speaker identification can be applied to many scenarios in which the identity of an unknown user needs to be authenticated. Speaker identification is divided into two phases, off-line and on-line. In the off-line phase, a large number of speech samples of non-target speaker classes need to be collected to train the speaker identification system, which includes an identity vector extraction module and an identity vector normalization module.
The on-line phase is further divided into two stages: enrollment and recognition. In the enrollment stage, speech data of the target speaker is obtained and, after preprocessing, feature extraction, and model training, mapped to a fixed-length identity vector; this known identity vector is a model characterizing the identity of the target speaker. In the recognition stage, a segment of speech of unknown identity to be verified is obtained and, likewise after preprocessing, feature extraction, and model training, mapped to an identity vector to be verified.
The identity vector of the target speaker class and the identity vector to be verified from the recognition stage are then passed to a similarity computation module, and the computed similarity is compared with a manually preset threshold. If the similarity is greater than or equal to the threshold, it can be determined that the identity corresponding to the speech to be verified matches the identity of the target speaker, and the identity verification passes. If the similarity is less than the threshold, it can be determined that the identity corresponding to the speech to be verified does not match the identity of the target speaker, and the identity verification fails. The similarity may be a cosine similarity, a Pearson correlation coefficient, a Euclidean distance, or the like.
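A sketch of this decision step using cosine similarity; the threshold value 0.6 is illustrative and would in practice be set manually, as described above:

```python
# Sketch only: accept or reject an identity claim by cosine similarity.
import numpy as np

def verify(w_test: np.ndarray, w_target: np.ndarray, threshold: float = 0.6) -> bool:
    cos_sim = w_test @ w_target / (np.linalg.norm(w_test) * np.linalg.norm(w_target))
    return cos_sim >= threshold   # True: the identity matches the target speaker
```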
In this embodiment, even for speech data of very short duration, the identity vector generation method of this embodiment can still generate an identity vector with high identification performance, without requiring the speaker to utter long speech, which enables short-duration text-independent speaker recognition to be widely adopted.
FIG. 6 is a structural block diagram of a computer device 600 in one embodiment. The computer device 600 may be used as a server or as a terminal. The internal structure of the server may correspond to the structure shown in FIG. 2A, and the internal structure of the terminal may correspond to the structure shown in FIG. 2B. Each of the following modules may be implemented in whole or in part by software, hardware, or a combination thereof.
As shown in FIG. 6, the computer device 600 includes an acoustic feature extraction module 610, a statistics generation module 620, a mapping module 630, a corrected statistics determination module 640, and an identity vector generation module 650.
The acoustic feature extraction module 610 is configured to obtain speech data to be processed and to extract corresponding acoustic features from the speech data to be processed.
The statistics generation module 620 is configured to collect statistics on the posterior probabilities that each acoustic feature belongs to each Gaussian distribution component in the speaker background model, to obtain statistics.
The mapping module 630 is configured to map the statistics to a statistics space to obtain reference statistics; the statistics space is constructed from statistics corresponding to speech samples exceeding a preset speech duration.
The corrected statistics determination module 640 is configured to determine corrected statistics according to the collected statistics and the reference statistics.
The identity vector generation module 650 is configured to generate an identity vector according to the corrected statistics.
In the above computer device 600, the statistics space is constructed from statistics corresponding to speech samples exceeding a preset speech duration. After statistics are collected on the posterior probabilities that each acoustic feature belongs to each Gaussian distribution component in the speaker background model, the statistics are mapped into the statistics space, and the resulting reference statistics are prior statistics. The collected statistics are corrected with the prior statistics to obtain corrected statistics, which compensate for the biased estimation of statistics caused by overly short speech duration and speech sparsity in the speech data to be processed, improving the identification performance of the identity vector.
FIG. 7 is a structural block diagram of the statistics generation module 620 in one embodiment. In this embodiment, the collected statistics include first zeroth-order statistics and first first-order statistics; the statistics generation module 620 includes a first zeroth-order statistic generation module 621 and a first first-order statistic generation module 622.
The first zeroth-order statistic generation module 621 is configured to, for each Gaussian distribution component in the speaker background model, separately sum the posterior probabilities that each acoustic feature belongs to the corresponding Gaussian distribution component, as the corresponding first zeroth-order statistic.
The first first-order statistic generation module 622 is configured to, for each Gaussian distribution component in the speaker background model, compute a weighted sum of the acoustic features, each acoustic feature weighted by the posterior probability that it belongs to the corresponding Gaussian distribution component, as the corresponding first first-order statistic.
FIG. 8 is a structural block diagram of the computer device 600 in another embodiment. The computer device 600 further includes a statistics collection module 660 and a statistics space construction module 670.
The statistics collection module 660 is configured to obtain speech samples exceeding a preset speech duration, and to collect, by speaker class in the speech samples, second zeroth-order statistics and second first-order statistics corresponding to each Gaussian distribution component in the speaker background model.
The statistics space construction module 670 is configured to compute the first quotient of each second first-order statistic and the corresponding second zeroth-order statistic, and to construct the statistics space according to the computed first quotients.
In this embodiment, the statistics space is built on the first quotients of the second first-order statistics and the corresponding second zeroth-order statistics. Since the quotient of a first-order statistic and the corresponding zeroth-order statistic remains within a basically stable range, this facilitates the computation of mapping the first zeroth-order statistics and the first first-order statistics to the statistics space and improves computational efficiency.
In one embodiment, the statistics space construction module 670 is further configured to subtract the mean of the corresponding Gaussian distribution component from each computed first quotient to obtain a corresponding difference, and to arrange the obtained differences in order by speaker class and corresponding Gaussian distribution component to form a matrix characterizing the statistics space.
In this embodiment, the mean of the corresponding Gaussian distribution component is subtracted from each computed first quotient to obtain a corresponding difference, and the obtained differences are arranged in order by speaker class and corresponding Gaussian distribution component to form a matrix characterizing the statistics space, so that the center of the constructed statistics space lies roughly at the origin of the statistics space, facilitating computation and improving computational efficiency.
In one embodiment, the reference statistics include, for each Gaussian distribution component in the speaker background model, the second quotient of the reference first-order statistic and the corresponding reference zeroth-order statistic; the corrected statistics determination module 640 is further configured to compute a weighted sum of the third quotient of the first first-order statistic and the corresponding first zeroth-order statistic, and the second quotient of the corresponding Gaussian distribution component, to obtain, for each Gaussian distribution component in the speaker background model, the fourth quotient of the corrected first-order statistic and the corresponding corrected zeroth-order statistic as the corrected statistic.
In one embodiment, when the corrected statistics determination module 640 computes the weighted sum, the weight of the third quotient is the first zeroth-order statistic of the corresponding Gaussian distribution component divided by the sum of that first zeroth-order statistic and an adjustable parameter, and the weight of the second quotient is the adjustable parameter divided by the sum of the first zeroth-order statistic of the corresponding Gaussian distribution component and the adjustable parameter. In this embodiment, by adjusting the adjustable parameter, differentiated adjustment can be made for different environments, increasing robustness.
In one embodiment, the mapping module 630 is further configured to obtain orthogonal basis vectors of the statistics space; find mapping coefficients of the orthogonal basis vectors such that the two-norm distance between the product of the orthogonal basis vectors and the mapping coefficients plus the mean of the corresponding Gaussian distribution component, and the third quotient of the corresponding Gaussian distribution component, is minimized; and multiply the orthogonal basis vectors by the mapping coefficients and add the mean of the corresponding Gaussian distribution component, to obtain the second quotient of the reference first-order statistic and the corresponding reference zeroth-order statistic for each Gaussian distribution component in the speaker background model.
In one embodiment, the speech data to be processed includes speech data to be verified and speech data of a target speaker class; the identity vector generation module 650 is further configured to generate an identity vector to be verified according to the corrected statistics corresponding to the speech data to be verified, and to generate a target speaker identity vector according to the corrected statistics corresponding to the speech data of the target speaker class.
FIG. 9 is a structural block diagram of the computer device 600 in yet another embodiment. In this embodiment, the computer device 600 further includes a speaker identity verification module 680, configured to compute a similarity between the identity vector to be verified and the target speaker identity vector, and to perform speaker identity verification according to the similarity.
In this embodiment, even for speech data of very short duration, the identity vector generation method of this embodiment can still generate an identity vector with high identification performance, without requiring the speaker to utter long speech, which enables short-duration text-independent speaker recognition to be widely adopted.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores computer-readable instructions that, when executed by the processor, cause the processor to perform the following steps: obtaining speech data to be processed; extracting corresponding acoustic features from the speech data to be processed; collecting statistics on posterior probabilities that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model, to obtain statistics; mapping the statistics to a statistics space to obtain reference statistics, the statistics space being constructed from statistics corresponding to speech samples exceeding a preset speech duration; determining corrected statistics according to the collected statistics and the reference statistics; and generating an identity vector according to the corrected statistics.
In one embodiment, the collected statistics include first zeroth-order statistics and first first-order statistics; the collecting statistics on posterior probabilities that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model includes: for each Gaussian distribution component in the speaker background model, separately summing the posterior probabilities that each of the acoustic features belongs to the corresponding Gaussian distribution component, as the corresponding first zeroth-order statistic; and, for each Gaussian distribution component in the speaker background model, computing a weighted sum of the acoustic features, each acoustic feature weighted by the posterior probability that it belongs to the corresponding Gaussian distribution component, as the corresponding first first-order statistic.
In one embodiment, the computer-readable instructions further cause the processor to perform the following steps: obtaining speech samples exceeding the preset speech duration; collecting, by speaker class in the speech samples, second zeroth-order statistics and second first-order statistics corresponding to each Gaussian distribution component in the speaker background model; computing a first quotient of each second first-order statistic and the corresponding second zeroth-order statistic; and constructing the statistics space according to the computed first quotients.
In one embodiment, the constructing the statistics space according to the computed first quotients includes: subtracting the mean of the corresponding Gaussian distribution component from each computed first quotient to obtain a corresponding difference; and arranging the obtained differences in order by speaker class and corresponding Gaussian distribution component to form a matrix characterizing the statistics space.
In one embodiment, the reference statistics include, for each Gaussian distribution component in the speaker background model, a second quotient of a reference first-order statistic and a corresponding reference zeroth-order statistic; the determining corrected statistics according to the collected statistics and the reference statistics includes: computing a weighted sum of a third quotient of the first first-order statistic and the corresponding first zeroth-order statistic, and the second quotient of the corresponding Gaussian distribution component, to obtain, for each Gaussian distribution component in the speaker background model, a fourth quotient of a corrected first-order statistic and a corresponding corrected zeroth-order statistic as the corrected statistic.
In one embodiment, in the weighted sum, the weight of the third quotient is the first zeroth-order statistic of the corresponding Gaussian distribution component divided by the sum of that first zeroth-order statistic and an adjustable parameter, and the weight of the second quotient is the adjustable parameter divided by the sum of the first zeroth-order statistic of the corresponding Gaussian distribution component and the adjustable parameter.
In one embodiment, the mapping the statistics to a statistics space to obtain reference statistics includes: obtaining orthogonal basis vectors of the statistics space; finding mapping coefficients of the orthogonal basis vectors such that the two-norm distance between the product of the orthogonal basis vectors and the mapping coefficients plus the mean of the corresponding Gaussian distribution component, and the third quotient of the corresponding Gaussian distribution component, is minimized; and multiplying the orthogonal basis vectors by the mapping coefficients and adding the mean of the corresponding Gaussian distribution component, to obtain the second quotient of the reference first-order statistic and the corresponding reference zeroth-order statistic for each Gaussian distribution component in the speaker background model.
In one embodiment, the speech data to be processed includes speech data to be verified and speech data of a target speaker class; the generating an identity vector according to the corrected statistics includes: generating an identity vector to be verified according to corrected statistics corresponding to the speech data to be verified; and generating a target speaker identity vector according to corrected statistics corresponding to the speech data of the target speaker class; the computer-readable instructions further cause the processor to perform the following steps: computing a similarity between the identity vector to be verified and the target speaker identity vector; and performing speaker identity verification according to the similarity.
In the above computer device, the statistics space is constructed from statistics corresponding to speech samples exceeding a preset speech duration. After statistics are collected on the posterior probabilities that each acoustic feature belongs to each Gaussian distribution component in the speaker background model, the statistics are mapped into the statistics space, and the resulting reference statistics are prior statistics. The collected statistics are corrected with the prior statistics to obtain corrected statistics, which compensate for the biased estimation of statistics caused by overly short speech duration and speech sparsity in the speech data to be processed, improving the identification performance of the identity vector.
In one embodiment, one or more non-volatile computer-readable storage media storing computer-readable instructions are provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps: obtaining speech data to be processed; extracting corresponding acoustic features from the speech data to be processed; collecting statistics on posterior probabilities that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model, to obtain statistics; mapping the statistics to a statistics space to obtain reference statistics, the statistics space being constructed from statistics corresponding to speech samples exceeding a preset speech duration; determining corrected statistics according to the collected statistics and the reference statistics; and generating an identity vector according to the corrected statistics.
In one embodiment, the collected statistics include first zeroth-order statistics and first first-order statistics; the collecting statistics on posterior probabilities that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model includes: for each Gaussian distribution component in the speaker background model, separately summing the posterior probabilities that each of the acoustic features belongs to the corresponding Gaussian distribution component, as the corresponding first zeroth-order statistic; and, for each Gaussian distribution component in the speaker background model, computing a weighted sum of the acoustic features, each acoustic feature weighted by the posterior probability that it belongs to the corresponding Gaussian distribution component, as the corresponding first first-order statistic.
In one embodiment, the computer-readable instructions further cause the processors to perform the following steps: obtaining speech samples exceeding the preset speech duration; collecting, by speaker class in the speech samples, second zeroth-order statistics and second first-order statistics corresponding to each Gaussian distribution component in the speaker background model; computing a first quotient of each second first-order statistic and the corresponding second zeroth-order statistic; and constructing the statistics space according to the computed first quotients.
In one embodiment, the constructing the statistics space according to the computed first quotients includes: subtracting the mean of the corresponding Gaussian distribution component from each computed first quotient to obtain a corresponding difference; and arranging the obtained differences in order by speaker class and corresponding Gaussian distribution component to form a matrix characterizing the statistics space.
In one embodiment, the reference statistics include, for each Gaussian distribution component in the speaker background model, a second quotient of a reference first-order statistic and a corresponding reference zeroth-order statistic; the determining corrected statistics according to the collected statistics and the reference statistics includes: computing a weighted sum of a third quotient of the first first-order statistic and the corresponding first zeroth-order statistic, and the second quotient of the corresponding Gaussian distribution component, to obtain, for each Gaussian distribution component in the speaker background model, a fourth quotient of a corrected first-order statistic and a corresponding corrected zeroth-order statistic as the corrected statistic.
In one embodiment, in the weighted sum, the weight of the third quotient is the first zeroth-order statistic of the corresponding Gaussian distribution component divided by the sum of that first zeroth-order statistic and an adjustable parameter, and the weight of the second quotient is the adjustable parameter divided by the sum of the first zeroth-order statistic of the corresponding Gaussian distribution component and the adjustable parameter.
In one embodiment, the mapping the statistics to a statistics space to obtain reference statistics includes: obtaining orthogonal basis vectors of the statistics space; finding mapping coefficients of the orthogonal basis vectors such that the two-norm distance between the product of the orthogonal basis vectors and the mapping coefficients plus the mean of the corresponding Gaussian distribution component, and the third quotient of the corresponding Gaussian distribution component, is minimized; and multiplying the orthogonal basis vectors by the mapping coefficients and adding the mean of the corresponding Gaussian distribution component, to obtain the second quotient of the reference first-order statistic and the corresponding reference zeroth-order statistic for each Gaussian distribution component in the speaker background model.
In one embodiment, the speech data to be processed includes speech data to be verified and speech data of a target speaker class; the generating an identity vector according to the corrected statistics includes: generating an identity vector to be verified according to corrected statistics corresponding to the speech data to be verified; and generating a target speaker identity vector according to corrected statistics corresponding to the speech data of the target speaker class; the computer-readable instructions further cause the processors to perform the following steps: computing a similarity between the identity vector to be verified and the target speaker identity vector; and performing speaker identity verification according to the similarity.
In the above computer-readable storage medium, the statistics space is constructed from statistics corresponding to speech samples exceeding a preset speech duration. After statistics are collected on the posterior probabilities that each acoustic feature belongs to each Gaussian distribution component in the speaker background model, the statistics are mapped into the statistics space, and the resulting reference statistics are prior statistics. The collected statistics are corrected with the prior statistics to obtain corrected statistics, which compensate for the biased estimation of statistics caused by overly short speech duration and speech sparsity in the speech data to be processed, improving the identification performance of the identity vector.
A person of ordinary skill in the art may understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may further make several variations and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application patent shall be subject to the appended claims.

Claims (24)

  1. An identity vector generation method, comprising:
    obtaining speech data to be processed;
    extracting corresponding acoustic features from the speech data to be processed;
    collecting statistics on posterior probabilities that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model, to obtain statistics;
    mapping the statistics to a statistics space to obtain reference statistics, the statistics space being constructed from statistics corresponding to speech samples exceeding a preset speech duration;
    determining corrected statistics according to the collected statistics and the reference statistics; and
    generating an identity vector according to the corrected statistics.
  2. The method according to claim 1, wherein the collected statistics comprise first zeroth-order statistics and first first-order statistics, and the collecting statistics on posterior probabilities that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model comprises:
    for each Gaussian distribution component in the speaker background model, separately summing the posterior probabilities that each of the acoustic features belongs to the corresponding Gaussian distribution component, as the corresponding first zeroth-order statistic; and
    for each Gaussian distribution component in the speaker background model, computing a weighted sum of the acoustic features, each acoustic feature weighted by the posterior probability that it belongs to the corresponding Gaussian distribution component, as the corresponding first first-order statistic.
  3. The method according to claim 2, further comprising:
    obtaining speech samples exceeding the preset speech duration;
    collecting, by speaker class in the speech samples, second zeroth-order statistics and second first-order statistics corresponding to each Gaussian distribution component in the speaker background model;
    computing a first quotient of each second first-order statistic and the corresponding second zeroth-order statistic; and
    constructing the statistics space according to the computed first quotients.
  4. The method according to claim 3, wherein the constructing the statistics space according to the computed first quotients comprises:
    subtracting the mean of the corresponding Gaussian distribution component from each computed first quotient to obtain a corresponding difference; and
    arranging the obtained differences in order by speaker class and corresponding Gaussian distribution component to form a matrix characterizing the statistics space.
  5. The method according to claim 2, wherein the reference statistics comprise, for each Gaussian distribution component in the speaker background model, a second quotient of a reference first-order statistic and a corresponding reference zeroth-order statistic, and the determining corrected statistics according to the collected statistics and the reference statistics comprises:
    computing a weighted sum of a third quotient of the first first-order statistic and the corresponding first zeroth-order statistic, and the second quotient of the corresponding Gaussian distribution component, to obtain, for each Gaussian distribution component in the speaker background model, a fourth quotient of a corrected first-order statistic and a corresponding corrected zeroth-order statistic as the corrected statistic.
  6. The method according to claim 5, wherein, in the weighted sum, the weight of the third quotient is the first zeroth-order statistic of the corresponding Gaussian distribution component divided by the sum of that first zeroth-order statistic and an adjustable parameter, and the weight of the second quotient is the adjustable parameter divided by the sum of the first zeroth-order statistic of the corresponding Gaussian distribution component and the adjustable parameter.
  7. The method according to claim 5, wherein the mapping the statistics to a statistics space to obtain reference statistics comprises:
    obtaining orthogonal basis vectors of the statistics space;
    finding mapping coefficients of the orthogonal basis vectors such that the two-norm distance between the product of the orthogonal basis vectors and the mapping coefficients plus the mean of the corresponding Gaussian distribution component, and the third quotient of the corresponding Gaussian distribution component, is minimized; and
    multiplying the orthogonal basis vectors by the mapping coefficients and adding the mean of the corresponding Gaussian distribution component, to obtain the second quotient of the reference first-order statistic and the corresponding reference zeroth-order statistic for each Gaussian distribution component in the speaker background model.
  8. The method according to claim 1, wherein the speech data to be processed comprises speech data to be verified and speech data of a target speaker class, and the generating an identity vector according to the corrected statistics comprises:
    generating an identity vector to be verified according to corrected statistics corresponding to the speech data to be verified; and
    generating a target speaker identity vector according to corrected statistics corresponding to the speech data of the target speaker class;
    the method further comprising:
    computing a similarity between the identity vector to be verified and the target speaker identity vector; and
    performing speaker identity verification according to the similarity.
  9. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to perform the following steps:
    obtaining speech data to be processed;
    extracting corresponding acoustic features from the speech data to be processed;
    collecting statistics on posterior probabilities that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model, to obtain statistics;
    mapping the statistics to a statistics space to obtain reference statistics, the statistics space being constructed from statistics corresponding to speech samples exceeding a preset speech duration;
    determining corrected statistics according to the collected statistics and the reference statistics; and
    generating an identity vector according to the corrected statistics.
  10. The computer device according to claim 9, wherein the collected statistics comprise first zeroth-order statistics and first first-order statistics, and the collecting statistics on posterior probabilities that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model comprises:
    for each Gaussian distribution component in the speaker background model, separately summing the posterior probabilities that each of the acoustic features belongs to the corresponding Gaussian distribution component, as the corresponding first zeroth-order statistic; and
    for each Gaussian distribution component in the speaker background model, computing a weighted sum of the acoustic features, each acoustic feature weighted by the posterior probability that it belongs to the corresponding Gaussian distribution component, as the corresponding first first-order statistic.
  11. The computer device according to claim 10, wherein the computer-readable instructions further cause the processor to perform the following steps:
    obtaining speech samples exceeding the preset speech duration;
    collecting, by speaker class in the speech samples, second zeroth-order statistics and second first-order statistics corresponding to each Gaussian distribution component in the speaker background model;
    computing a first quotient of each second first-order statistic and the corresponding second zeroth-order statistic; and
    constructing the statistics space according to the computed first quotients.
  12. The computer device according to claim 11, wherein the constructing the statistics space according to the computed first quotients comprises:
    subtracting the mean of the corresponding Gaussian distribution component from each computed first quotient to obtain a corresponding difference; and
    arranging the obtained differences in order by speaker class and corresponding Gaussian distribution component to form a matrix characterizing the statistics space.
  13. The computer device according to claim 10, wherein the reference statistics comprise, for each Gaussian distribution component in the speaker background model, a second quotient of a reference first-order statistic and a corresponding reference zeroth-order statistic, and the determining corrected statistics according to the collected statistics and the reference statistics comprises:
    computing a weighted sum of a third quotient of the first first-order statistic and the corresponding first zeroth-order statistic, and the second quotient of the corresponding Gaussian distribution component, to obtain, for each Gaussian distribution component in the speaker background model, a fourth quotient of a corrected first-order statistic and a corresponding corrected zeroth-order statistic as the corrected statistic.
  14. The computer device according to claim 13, wherein, in the weighted sum, the weight of the third quotient is the first zeroth-order statistic of the corresponding Gaussian distribution component divided by the sum of that first zeroth-order statistic and an adjustable parameter, and the weight of the second quotient is the adjustable parameter divided by the sum of the first zeroth-order statistic of the corresponding Gaussian distribution component and the adjustable parameter.
  15. The computer device according to claim 13, wherein the mapping the statistics to a statistics space to obtain reference statistics comprises:
    obtaining orthogonal basis vectors of the statistics space;
    finding mapping coefficients of the orthogonal basis vectors such that the two-norm distance between the product of the orthogonal basis vectors and the mapping coefficients plus the mean of the corresponding Gaussian distribution component, and the third quotient of the corresponding Gaussian distribution component, is minimized; and
    multiplying the orthogonal basis vectors by the mapping coefficients and adding the mean of the corresponding Gaussian distribution component, to obtain the second quotient of the reference first-order statistic and the corresponding reference zeroth-order statistic for each Gaussian distribution component in the speaker background model.
  16. The computer device according to claim 9, wherein the speech data to be processed comprises speech data to be verified and speech data of a target speaker class, and the generating an identity vector according to the corrected statistics comprises:
    generating an identity vector to be verified according to corrected statistics corresponding to the speech data to be verified; and
    generating a target speaker identity vector according to corrected statistics corresponding to the speech data of the target speaker class;
    the computer-readable instructions further causing the processor to perform the following steps:
    computing a similarity between the identity vector to be verified and the target speaker identity vector; and
    performing speaker identity verification according to the similarity.
  17. One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining speech data to be processed;
    extracting corresponding acoustic features from the speech data to be processed;
    collecting statistics on posterior probabilities that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model, to obtain statistics;
    mapping the statistics to a statistics space to obtain reference statistics, the statistics space being constructed from statistics corresponding to speech samples exceeding a preset speech duration;
    determining corrected statistics according to the collected statistics and the reference statistics; and
    generating an identity vector according to the corrected statistics.
  18. The computer-readable storage medium according to claim 17, wherein the collected statistics comprise first zeroth-order statistics and first first-order statistics, and the collecting statistics on posterior probabilities that each of the acoustic features belongs to each Gaussian distribution component in a speaker background model comprises:
    for each Gaussian distribution component in the speaker background model, separately summing the posterior probabilities that each of the acoustic features belongs to the corresponding Gaussian distribution component, as the corresponding first zeroth-order statistic; and
    for each Gaussian distribution component in the speaker background model, computing a weighted sum of the acoustic features, each acoustic feature weighted by the posterior probability that it belongs to the corresponding Gaussian distribution component, as the corresponding first first-order statistic.
  19. The computer-readable storage medium according to claim 18, wherein the computer-readable instructions further cause the one or more processors to perform the following steps:
    obtaining speech samples exceeding the preset speech duration;
    collecting, by speaker class in the speech samples, second zeroth-order statistics and second first-order statistics corresponding to each Gaussian distribution component in the speaker background model;
    computing a first quotient of each second first-order statistic and the corresponding second zeroth-order statistic; and
    constructing the statistics space according to the computed first quotients.
  20. The computer-readable storage medium according to claim 19, wherein the constructing the statistics space according to the computed first quotients comprises:
    subtracting the mean of the corresponding Gaussian distribution component from each computed first quotient to obtain a corresponding difference; and
    arranging the obtained differences in order by speaker class and corresponding Gaussian distribution component to form a matrix characterizing the statistics space.
  21. The computer-readable storage medium according to claim 18, wherein the reference statistics comprise, for each Gaussian distribution component in the speaker background model, a second quotient of a reference first-order statistic and a corresponding reference zeroth-order statistic, and the determining corrected statistics according to the collected statistics and the reference statistics comprises:
    computing a weighted sum of a third quotient of the first first-order statistic and the corresponding first zeroth-order statistic, and the second quotient of the corresponding Gaussian distribution component, to obtain, for each Gaussian distribution component in the speaker background model, a fourth quotient of a corrected first-order statistic and a corresponding corrected zeroth-order statistic as the corrected statistic.
  22. The computer-readable storage medium according to claim 21, wherein, in the weighted sum, the weight of the third quotient is the first zeroth-order statistic of the corresponding Gaussian distribution component divided by the sum of that first zeroth-order statistic and an adjustable parameter, and the weight of the second quotient is the adjustable parameter divided by the sum of the first zeroth-order statistic of the corresponding Gaussian distribution component and the adjustable parameter.
  23. The computer-readable storage medium according to claim 21, wherein the mapping the statistics to a statistics space to obtain reference statistics comprises:
    obtaining orthogonal basis vectors of the statistics space;
    finding mapping coefficients of the orthogonal basis vectors such that the two-norm distance between the product of the orthogonal basis vectors and the mapping coefficients plus the mean of the corresponding Gaussian distribution component, and the third quotient of the corresponding Gaussian distribution component, is minimized; and
    multiplying the orthogonal basis vectors by the mapping coefficients and adding the mean of the corresponding Gaussian distribution component, to obtain the second quotient of the reference first-order statistic and the corresponding reference zeroth-order statistic for each Gaussian distribution component in the speaker background model.
  24. The computer-readable storage medium according to claim 17, wherein the speech data to be processed comprises speech data to be verified and speech data of a target speaker class, and the generating an identity vector according to the corrected statistics comprises:
    generating an identity vector to be verified according to corrected statistics corresponding to the speech data to be verified; and
    generating a target speaker identity vector according to corrected statistics corresponding to the speech data of the target speaker class;
    the computer-readable instructions further causing the one or more processors to perform the following steps:
    computing a similarity between the identity vector to be verified and the target speaker identity vector; and
    performing speaker identity verification according to the similarity.
