CN111161713A

CN111161713A - Voice gender identification method and device and computing equipment

Info

Publication number: CN111161713A
Application number: CN201911328136.4A
Authority: CN
Inventors: 王佳琦; 张丽娜
Original assignee: Beijing Pierbulaini Software Co ltd
Current assignee: Beijing Pierbulaini Software Co ltd
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2020-05-15

Abstract

The invention discloses a voice gender identification method, a voice gender identification device and computing equipment, wherein the method comprises the following steps: acquiring voice data to be recognized; performing feature extraction on the voice data to obtain acoustic features of the voice data; inputting the acoustic features into a general background model, and performing maximum posterior estimation processing on the output of the general background model to obtain Gaussian mixture distribution of the voice data; extracting a mean value super vector of the voice data based on the Gaussian mixture distribution; performing factor analysis on the mean value super vector to obtain dimension reduction characteristics of the voice data; and inputting the dimension reduction features into a trained gender classifier for processing, and outputting a gender estimation result of the voice data.

Description

Voice gender identification method and device and computing equipment

Technical Field

The invention relates to the field of voice processing, in particular to a voice gender identification method, a voice gender identification device and computing equipment.

Background

Voiceprint recognition (VPR), also known as Speaker Recognition (SRE), is a technique for automatically recognizing the identity of a speaker based on voice parameters ("voiceprints") in the voice signal of the speaker, which reflect the physiological and behavioral characteristics of the speaker. Speaker-human recognition is an important field of voiceprint recognition, and is a technology for recognizing the gender of a speaker based on the acoustic characteristics of the speaker.

Telephone consulting services such as 400 telephone are widely used for enterprises to build user portraits by accumulating a large amount of high-value data during the process of serving clients after the enterprises sell before and after. Through user portrait, the enterprise can carry out the targeted advertisement and put in, accomplishes accurate marketing, promotes the advertisement conversion rate. The method is characterized in that the user gender information is crucial to construction of the user portrait, a large amount of labor cost is required to be invested for marking the gender information of a call user, automatic real-time gender recognition is carried out on call voice of the user through a voiceprint recognition technology, so that the work efficiency of enterprises can be improved, and the labor marking cost is saved.

At present, the automatic recognition of the voice gender of the telephone firstly saves the call recording, and then carries out gender recognition on the saved recording by an audio signal processing or deep learning method. The method needs to store complete call records, consumes a large amount of server resources, and cannot realize real-time gender identification.

In addition, in the process of actual voice communication, the channel conditions of environmental noise, communication tools and the like are complex, so that the original voice signal is distorted, and the accuracy of voice gender identification is reduced.

Disclosure of Invention

In view of the above, the present invention is proposed to provide a speech gender recognition method, apparatus and computing device that overcome or at least partially solve the above problems.

According to one aspect of the invention, a speech gender recognition method is provided, which is executed in a computing device and comprises the following steps:

acquiring voice data to be recognized;

performing feature extraction on the voice data to obtain acoustic features of the voice data;

inputting the acoustic features into a general background model, and performing maximum posterior estimation processing on the output of the general background model to obtain Gaussian mixture distribution of the voice data;

extracting a mean value super vector of the voice data based on the Gaussian mixture distribution;

performing factor analysis on the mean value super vector to obtain dimension reduction characteristics of the voice data;

and inputting the dimension reduction features into a trained gender classifier for processing, and outputting a gender estimation result of the voice data.

Optionally, in the voice gender identification method according to the present invention, the acquiring voice data to be identified includes: and carrying out endpoint detection on the voice stream, and intercepting continuous voice with preset duration from the voice stream according to an endpoint detection result to be used as voice data to be recognized.

Optionally, in the voice gender recognition method according to the present invention, the extracting features of the voice data to obtain the acoustic features of the voice data includes: pre-emphasis, framing and windowing are carried out on the voice data; performing discrete Fourier transform on each voice frame subjected to windowing to obtain the frequency spectrum of each voice frame; extracting an FBANK feature of a Mel scale filter bank from the frequency spectrum of each voice frame, and performing discrete cosine transform on the FBANK feature to obtain a Mel cepstrum coefficient MFCC feature; the MFCC features of all speech frames are constructed as a feature sequence, and the feature sequence is taken as the acoustic feature of the speech data.

Optionally, in the speech gender identification method according to the present invention, before constructing the MFCC features of all speech frames as a feature sequence, further comprising: calculating the energy value of each voice frame; the first coefficient of the MFCC feature of each speech frame is replaced with the energy value of that speech frame.

Optionally, in the speech gender recognition method according to the present invention, the performing factor analysis on the mean value supervector to obtain the dimension reduction feature of the speech data includes: acquiring a mean value super vector m of the general background model; acquiring a total change space matrix T of the factor analysis; calculating i-vector feature w based on the following formula: m + Tw, where M is a mean value supervector of the voice data; and taking the calculated i-vector feature as a dimension reduction feature of the voice data.

Optionally, in the speech gender recognition method according to the present invention, before inputting the dimension reduction features to a trained gender classifier for processing, the method further includes: and performing channel compensation on the dimensionality reduction features through linear discriminant analysis.

Alternatively, in the voice gender identification method according to the present invention, the voice data is telephone voice data.

Optionally, the speech gender recognition method according to the present invention further comprises: and training the universal background model by utilizing the linguistic data of various channels.

Optionally, the speech gender recognition method according to the present invention further comprises: the total variation space matrix for the factorial analysis is estimated by the maximum expectation algorithm using the phone corpus.

Optionally, the speech gender recognition method according to the present invention further comprises: the gender classifier was trained as follows: acquiring a training data set, wherein each piece of training data in the training data set comprises voice data and a gender label thereof; for each piece of training data, extracting the dimensionality reduction characteristic of the training data; inputting the extracted dimension reduction features into a gender classifier to be trained; and adjusting the model parameters of the gender classifier to be trained according to the output of the gender classifier and the gender label of the language data.

Optionally, in the speech gender recognition method according to the present invention, the gender classifier employs a logistic regression classifier.

According to another aspect of the present invention, there is provided a speech gender recognition apparatus, residing in a computing device, and comprising:

the acquisition module is suitable for acquiring voice data to be recognized;

the feature extraction module is suitable for extracting features of the voice data to obtain acoustic features of the voice data;

the characteristic processing module is suitable for inputting the acoustic characteristics into a general background model and carrying out maximum posterior estimation processing on the output of the general background model to obtain Gaussian mixture distribution of the voice data;

the mean value super vector extraction module is suitable for extracting a mean value super vector of the voice data based on the Gaussian mixture distribution;

the factor analysis module is suitable for carrying out factor analysis on the mean value super vector to obtain the dimension reduction characteristics of the voice data;

and the classification module is suitable for inputting the dimension reduction characteristics into a trained gender classifier for processing and outputting a gender estimation result of the voice data.

According to yet another aspect of the invention, there is provided a computing device comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the above-described method.

According to yet another aspect of the present invention, there is provided a readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the above-described method.

The voice gender identification scheme of the invention has one or more of the following beneficial technical effects:

1) the real-time gender recognition is completed by using the short-time voice, and the problems that the whole conversation voice is required to be reserved in the traditional method, so that a large amount of server resources are occupied and the real-time performance is poor are solved.

2) Model parameters are estimated through a GMM-UBM model and an MAP adaptive algorithm, all parameters of the GMM are not required to be adjusted, only mean parameters of single Gaussian distribution are required to be estimated, model parameters are few, convergence speed is high, model training can be completed through a small amount of telephone voice data, and overfitting is avoided.

3) The method extracts the characteristics related to the speaker gender from the voice by a factor analysis method, classifies the voice gender by using a discriminant model, and solves the problem of low recognition precision caused by the interference of different channel information in the conversation process.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 shows a block diagram of a computing device 100, according to one embodiment of the invention;

FIG. 2 shows a flow diagram of a speech recognition method 200 according to one embodiment of the invention;

FIG. 3 shows a schematic of the modeling and training process of method 200;

fig. 4 is a block diagram illustrating a voice gender recognition apparatus 400 according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

FIG. 1 shows a block diagram of a computing device 100, according to one embodiment of the invention. As shown in FIG. 1, in a basic configuration 102, a computing device 100 typically includes a system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.

Depending on the desired configuration, the processor 104 may be any type of processing, including but not limited to: a microprocessor (μ P), a microcontroller (μ C), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.

Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. The application 122 is actually a plurality of program instructions that direct the processor 104 to perform corresponding operations. In some embodiments, the application 122 may be arranged to cause the processor 104 to operate with the program data 124 on an operating system.

Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.

A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, and may include any information delivery media, such as carrier waves or other transport mechanisms, in a modulated data signal. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.

Computing device 100 may be implemented as a personal computer including desktop and notebook computer configurations, as well as a server, such as a file server, database server, application server, WEB server, and the like. Of course, the computing device 100 may also be implemented as part of a small-sized portable (or mobile) electronic device. In an embodiment in accordance with the invention, the computing device 100 is configured to execute a speech gender recognition method 200 in accordance with the invention. The application 122 of the computing device 100 contains a plurality of program instructions for performing the method 200 according to the invention.

FIG. 2 illustrates a flow diagram of a speech gender recognition method 200 according to one embodiment of the present invention, the method 200 being performed in a computing device, such as the computing device 100 shown in FIG. 1.

Referring to fig. 2, the method 200 begins at step S202. In step S202, voice data to be recognized is acquired. In an embodiment of the present invention, the voice data to be recognized may be telephone voice data, such as voice data of a 400-telephone.

The Voice acquisition equipment acquires a Voice stream of a call channel of a telephone client, caches the Voice stream into a cache region in real time, performs real-time endpoint Detection (VAD) on the Voice stream, and intercepts continuous Voice with preset duration from the Voice stream according to an endpoint Detection result to serve as Voice data to be identified. Specifically, after the front end point of the voice stream is detected, if the voice length reaches a predetermined time (for example, two seconds), the buffering is stopped, and the section of voice is stored on the computing device, otherwise, the end point detection is continued until a continuous voice of a predetermined time is intercepted.

In the actual voice communication process, the channel complexity of environmental noise, communication tools and the like is high, and it is difficult to effectively eliminate the influence of channel difference only by cutting off the mute part by VAD, so that the accuracy of voice gender identification is low. In the subsequent steps, the voice data to be recognized are preprocessed, the acoustic features of the voice data are extracted, and the extracted acoustic features are input into a trained model for real-time gender recognition.

In step S204, feature extraction is performed on the voice data to obtain acoustic features of the voice data. Before feature extraction, the intercepted voice data can be preprocessed, specifically, the voice data comprises voice format conversion and sampling rate conversion. For example, the format of the voice data is converted to the wav format, and various sampling rates (8k, 16k, etc.) are converted to 8 k.

Then, extraction of acoustic features is performed. The acoustic features may be selected from FBANK, MFCC, PLP, etc., where MFCC (Mel-Frequency Cepstrum Coefficients, Mel cepstral Coefficients) is a linear transformation of the log energy spectrum based on the nonlinear Mel scale of the sound frequencies is preferred.

The specific steps of the acoustic feature extraction may include:

1) pre-emphasis is performed on the speech data. The energy of the high-frequency voice is emphasized, so that the high-frequency information of the voice signal is more prominent;

2) speech data is framed and windowed. The frame length is preferably 25ms, the frame shift is preferably 10ms, and the window function is preferably a Hamming window;

3) performing discrete Fourier transform on each windowed speech frame, and extracting frequency domain information to obtain a frequency spectrum corresponding to each speech frame;

4) and extracting FBANK characteristics. Respectively planning the frequency spectrum of each voice frame on a Mel scale through a Mel scale filter bank to obtain a Mel frequency spectrum, wherein the number of the Mel scale filters is preferably 40, and then taking logarithm of the energy value of the Mel frequency spectrum to obtain a multidimensional (for example 40-dimensional) FBANK (Mel scale filter bank) characteristic;

5) for the FBANK feature of each speech frame, discrete cosine transform is performed on the FBANK feature to obtain the MFCC feature of the speech frame, and the first 20 dimensions, for example, can also be taken as the MFCC feature.

Thus, each speech frame of speech data corresponds to a MFCC feature, and the MFCC features form a feature sequence which becomes the acoustic feature of the speech data.

In a preferred mode, an energy value of each frame of speech audio of the speech data is calculated, the energy value is a sum of squares of values of all audio sample points in the frame, and the energy value can be used for replacing a first coefficient of the MFCC features. Continuing with the example above, each feature in the sequence of features that make up an acoustic feature is 20-dimensional, including 1-dimensional energy values and 19-dimensional MFCCs.

In step S206, the acoustic features of the speech data are input to a general background model trained in advance, and Maximum A Posteriori (MAP) adaptive processing is performed on the output of the general background model to obtain a gaussian mixture distribution of the speech data. Where GMM (Gaussian Mix Model), which refers to a linear combination of multiple Gaussian distribution functions, can theoretically fit any type of distribution, and is used here to Model male and female voices respectively. UBM (Universal Background Model) is a Gaussian Mixture Model (GMM) that is relatively stable and independent of speaker characteristics, built using large amounts of speech data from different speakers. This model describes the shared nature of different speakers in the acoustic space, and is referred to as the Universal Background Model (UBM).

In this step, MAP (Maximum a posteriodic Estimation) adaptation may be performed on each gaussian component of the UBM model for the MFCC feature sequence of the speech data to obtain a GMM model corresponding to the speech data, that is, a GMM-UBM model. In the embodiment of the invention, only the mean vector mu of the GMM model needs to be updated in the self-adaption process_iI-1, 2, …, c, where c is the number of components of the GMM, i.e. the GMM is a linear combination of c gaussian distribution functions.

In step S208, a mean supervector of the speech data is extracted based on the gaussian mixture distribution. In one implementation, the mean vector μ of all Gaussian components in the GMM is used_iAnd (3) splicing according to a fixed sequence to obtain a GMM mean value super vector M:

assuming that each acoustic feature vector of the GMM is F-dimensional, M is a high-dimensional feature vector of CF × 1, which contains all information corresponding to the speech of the calling user, including speaker information and channel information.

In step S210, a factor analysis is performed on the mean value supervector M to obtain a feature of the speech data after the dimension reduction of the mean value supervector, which is referred to as a dimension reduction feature in the present invention. In one implementation, the dimension reduction feature is an i-vector feature.

In the embodiment of the present invention, the i-vector is an R × 1-dimensional vector, obeys gaussian distribution N (0,1), and includes identity information and channel information of a speaker, so that the change of environmental factors such as noise, reverberation, and coding mode can be fully covered, and the dimension is usually 400-600, and preferably 400-dimensional. The GMM mean supervector M for speech data can be expressed as follows:

M＝m+Tw

wherein M follows Gaussian distribution N (M, TT)^T) M is a UBM mean value supervector, T is a total change space matrix, the dimension of the matrix is CF multiplied by R, and w is an i-vector characteristic.

In step S212, the dimension-reduced features, such as i-vector features, are input to the trained gender classifier for processing, and the gender estimation result of the speech data is output.

According to another embodiment of the present invention, before step S212, channel compensation is further performed on the i-vector through LDA (Linear discriminant analysis), and in step S212, the i-vector feature after channel compensation is input into a trained gender classifier for processing, and a gender estimation result of the voice data is output.

LDA is a dimension reduction technology in the field of pattern recognition, and new features are made to be more distinctive by finding the direction which can distinguish various types of data most. Through LDA, the capability of distinguishing the gender of the speaker by the i-vector can be further improved, and the influence of different channel information on the identification accuracy rate is weakened.

The training process of LDA is as follows:

providing a training data set, wherein each piece of training data in the training data set comprises voice data and a gender label thereof, and extracting the i-vector of each piece of training data according to the above mode.

The solution process for LDA is then the process of maximizing the Rayleigh coefficient J

Wherein S is_bAnd S_wAn inter-class divergence matrix and an intra-class divergence matrix, respectively. S_bAnd S_wThe calculation methods of (a) are respectively as follows:

wherein S is a gender category (S-0 point male, S-1 for female),

is the average of i-vectors of all speech data corresponding to gender s in the training data set,

is the i-vector mean, n, of the total speech data_sIs the number of pieces of speech data, w, corresponding to gender s in the training data set_s,hIs the i-vector of the h-th speech corresponding to gender s. The Rayleigh coefficient reflects the direction of speech S in the direction a_bAnd S_wThe process of maximizing the Rayleigh coefficient can minimize the variance due to channel effects while maximizing the variance between speaker characteristics. Maximizing the Rayleigh coefficient can be converted into solving a projection matrix A, and the matrix is composed of eigenvectors a corresponding to the following eigenvalues (from lower arrangement)

S_ba＝λS_wa

Wherein λ is a characteristic value.

Thus, after training, the projection matrix a can be obtained. The i-vector after channel compensation of the i-vector by LDA can be expressed as

φ(w)＝A^Tw

In the formula, w is the i-vector characteristic before channel compensation, and phi (w) is the i-vector characteristic after channel compensation.

It can be seen that in the embodiment of the invention, the i-vector characteristic of the voice is extracted through the factor analysis and channel compensation technology, and then the channel compensation is carried out on the characteristic through the LDA, so that the speaker information in the voice characteristic is enhanced, the influence of the complex channel information in the telephone voice on the voice gender identification is weakened, and the gender identification precision is improved.

The process of building and training the correlation model in method 200 is described below.

FIG. 3 shows a schematic of the modeling and training process of method 200. Referring to fig. 3, the process involves training of the UBM model, computation of a spatial matrix of total variation in factorial analysis, and training of a gender classifier.

First, a UBM model independent of speaker information is trained using a large number of corpora of various channels. As mentioned above, the UBM is also a GMM model, and is a common reflection of speech characteristics of all speakers and a common reflection of channel information, and the more training data sets and the wider coverage area for fixing the UBM, the more the GMM obtained by training can approach to the real distribution. Specifically, a large number of corpora of various channels are obtained, the corpus data is processed according to the methods of step S202 and step S204, MFCC features are extracted, and the MFCC features are used to train the UBM model, and the training of the UBM model parameters may adopt EM (Expectation-maximization algorithm). After training is completed, the mean supervector of the UBM model can be obtained.

Then, a phone corpus (e.g., 400 phone corpus) is obtained, one part is used as a training data set, the other part is used as a test data set (optional), and the corpus data is processed according to the methods of step S202 and step S204 to extract MFCC features. The MFCC features of the corpus data of the training set are processed in steps S206 and S208 to obtain a mean value supervector of each piece of speech data in the training set. And performing factor analysis on the mean value supervectors of all the voices in the training set, and extracting i-vector characteristics from the mean value supervectors respectively. The i-vector is an R × 1-dimensional vector, obeys gaussian distribution N (0,1), includes identity information and channel information of a speaker, and can fully cover changes of environmental factors such as noise, reverberation, and a coding mode, and the dimension is usually 400-600, and preferably 400-dimensional. The GMM mean supervector for each piece of speech data can be expressed as follows:

M＝m+Tw

wherein M is the mean value super vector of GMM and obeys Gaussian distribution N (M, TT)^T) M is a UBM mean value supervector, T is a total change space matrix, the dimension of the matrix is CF multiplied by R, and w is an i-vector characteristic. Estimating the total change space matrix T by an EM algorithm in the training process, and finishing the estimation of the total change space matrix TAnd then, extracting corresponding i-vector characteristics from the GMM mean value super vector of each voice for the training set and the test set respectively.

And then, training a Logitics Regression model by using the i-vector characteristics of the voice data to classify the voice gender. The method comprises the following specific steps:

a) respectively marking the voice data with labels 0 and 1 according to the gender of a male and a female;

b) training a logistic Regression (Logitics Regression) model by using the i-vector characteristics of the training set, wherein the model function is as follows:

wherein, theta^T＝[θ₀θ₁…θ_n]Representing a set of parameters, the loss function is:

the parameter θ is obtained by a gradient descent method.

Thus, in the recognition stage (step S212), the parameter θ can be substituted into the model, a segment of speech x to be recognized is given, i-vector is extracted and input into the model, and if h is_θ(x)<0.5, the male is identified, if h_θ(x)>0.5 is identified as female.

Referring to fig. 4, the apparatus 400 includes:

an obtaining module 410 adapted to obtain voice data to be recognized;

a feature extraction module 420, adapted to perform feature extraction on the voice data to obtain an acoustic feature of the voice data;

the feature processing module 430 is adapted to input the acoustic features into a general background model, and perform maximum posterior estimation processing on the output of the general background model to obtain gaussian mixture distribution of the voice data;

a mean supervector extraction module 440 adapted to extract a mean supervector of the speech data based on the gaussian mixture distribution;

the factor analysis module 450 is adapted to perform factor analysis on the mean value supervector to obtain a dimension reduction feature of the voice data;

and the classification module 460 is adapted to input the dimension reduction features into a trained gender classifier for processing, and output a gender estimation result of the voice data.

The specific processing performed by the obtaining module 410, the feature extracting module 420, the feature processing module 430, the factor analyzing module 450, and the classifying module 460 may refer to the above steps S202, S204, S206, S208, S210, and S212, which are not described herein again.

In summary, the invention obtains the call voice stream of the telephone client in real time, performs real-time endpoint detection on the voice stream, intercepts the call voice with a preset time (for example, 2 seconds), and can complete real-time gender identification through the voice with the preset time without reserving the whole call recording, thereby saving a large amount of server resources and having good real-time performance.

The invention trains the UBM model through a large amount of data to extract the common characteristics of the voice and the channel, then estimates the model parameters through the MAP adaptive algorithm to obtain the GMM of each voice, does not need to adjust all parameters of the GMM, only needs to estimate the mean value parameters of each single Gaussian distribution, has less model parameters and high convergence rate, can complete model training by using a small amount of telephone voice data, avoids the occurrence of overfitting, and solves the problem of reduced recognition performance caused by insufficient training corpus and incapability of covering all pronunciation contents.

The invention enhances the characterization capability of the speaker gender information in the voice characteristics through the factor analysis technology, weakens the influence of complex telephone channel information on voice gender identification in practical application, and improves the gender identification precision through the discriminant model.

8. The method of claim 7, further comprising: and training the universal background model by utilizing the linguistic data of various channels.

9. The method of claim 7 or 8, further comprising: the total variation space matrix for the factorial analysis is estimated by the maximum expectation algorithm using the phone corpus.

10. The method of any of claims 7 to 10, further comprising training the gender classifier as follows:

acquiring a training data set, wherein each piece of training data in the training data set comprises voice data and a gender label thereof;

for each piece of training data, extracting the dimensionality reduction characteristic of the training data;

inputting the extracted dimension reduction features into a gender classifier to be trained;

and adjusting the model parameters of the gender classifier to be trained according to the output of the gender classifier and the gender label of the language data.

11. The method of claim 10, wherein the gender classifier employs a logistic regression classifier

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Claims

1. A speech gender recognition method, implemented in a computing device, comprising the steps of:

acquiring voice data to be recognized;

2. The method of claim 1, wherein the obtaining speech data to be recognized comprises:

and carrying out endpoint detection on the voice stream, and intercepting continuous voice with preset duration from the voice stream according to an endpoint detection result to be used as voice data to be recognized.

3. The method of claim 1 or 2, wherein the performing feature extraction on the voice data to obtain the acoustic features of the voice data comprises:

pre-emphasis, framing and windowing are carried out on the voice data;

performing discrete Fourier transform on each voice frame subjected to windowing to obtain the frequency spectrum of each voice frame;

extracting an FBANK feature of a Mel scale filter bank from the frequency spectrum of each voice frame, and performing discrete cosine transform on the FBANK feature to obtain a Mel cepstrum coefficient MFCC feature;

the MFCC features of all speech frames are constructed as a feature sequence, and the feature sequence is taken as the acoustic feature of the speech data.

4. The method of claim 3, wherein prior to constructing MFCC features for all speech frames as a sequence of features, further comprising:

calculating the energy value of each voice frame;

the first coefficient of the MFCC feature of each speech frame is replaced with the energy value of that speech frame.

5. The method of any of claims 1 to 4, wherein the factoring the mean supervector to obtain the reduced-dimension feature of the speech data comprises:

acquiring a mean value super vector m of the general background model;

acquiring a total change space matrix T of the factor analysis;

calculating i-vector feature w based on the following formula: m + Tw, where M is a mean value supervector of the voice data;

and taking the calculated i-vector feature as a dimension reduction feature of the voice data.

6. The method of any one of claims 1 to 5, wherein prior to inputting the dimension-reduced features into a trained gender classifier for processing, further comprising:

and performing channel compensation on the dimensionality reduction features through linear discriminant analysis.

7. The method of any one of claims 1 to 6, wherein the voice data is telephony voice data.

8. A speech gender recognition apparatus, residing in a computing device, and comprising:

the acquisition module is suitable for acquiring voice data to be recognized;

9. A computing device, comprising:

at least one processor; and

a memory storing program instructions configured for execution by the at least one processor, the program instructions comprising instructions for performing the method of any of claims 1-7.

10. A readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the method of any of claims 1-7.