CN111161713A - Voice gender identification method and device and computing equipment - Google Patents


Info

Publication number
CN111161713A
CN111161713A (application CN201911328136.4A)
Authority
CN
China
Prior art keywords
voice data
voice
feature
gender
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911328136.4A
Other languages
Chinese (zh)
Inventor
王佳琦
张丽娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Pierbulaini Software Co ltd
Original Assignee
Beijing Pierbulaini Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Pierbulaini Software Co ltd
Priority claimed from CN201911328136.4A
Publication of CN111161713A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice gender recognition method, apparatus and computing device. The method comprises the following steps: acquiring voice data to be recognized; performing feature extraction on the voice data to obtain acoustic features of the voice data; inputting the acoustic features into a universal background model and performing maximum a posteriori estimation on the output of the universal background model to obtain a Gaussian mixture distribution of the voice data; extracting a mean supervector of the voice data based on the Gaussian mixture distribution; performing factor analysis on the mean supervector to obtain the dimension reduction feature of the voice data; and inputting the dimension reduction feature into a trained gender classifier for processing, and outputting a gender estimation result of the voice data.

Description

Voice gender identification method and device and computing equipment
Technical Field
The invention relates to the field of voice processing, and in particular to a voice gender recognition method, apparatus and computing device.
Background
Voiceprint recognition (VPR), also known as Speaker Recognition (SRE), is a technique for automatically recognizing the identity of a speaker from voice parameters ("voiceprints") in the speaker's speech signal, which reflect the speaker's physiological and behavioral characteristics. Speaker gender recognition is an important branch of voiceprint recognition: a technique for recognizing the gender of a speaker based on the speaker's acoustic characteristics.
Telephone consulting services such as 400 hotlines are widely used by enterprises. While serving clients before and after sales, an enterprise accumulates a large amount of high-value data from which it can build user portraits. With user portraits, the enterprise can deliver targeted advertising, achieve precision marketing, and raise advertisement conversion rates. Gender information is crucial for constructing a user portrait, yet manually labelling the gender of call users requires a large labor cost. Performing automatic, real-time gender recognition on users' call voice through voiceprint recognition therefore improves enterprise work efficiency and saves manual labelling cost.
At present, automatic telephone voice gender recognition first saves the call recording and then performs gender recognition on the saved recording using audio signal processing or deep learning methods. This approach requires storing complete call recordings, which consumes a large amount of server resources, and it cannot achieve real-time gender recognition.
In addition, during an actual voice call, channel conditions such as environmental noise and the communication equipment are complex, so the original voice signal is distorted and the accuracy of voice gender recognition drops.
Disclosure of Invention
In view of the above, the present invention provides a voice gender recognition method, apparatus and computing device that overcome, or at least partially solve, the above problems.
According to one aspect of the invention, a speech gender recognition method is provided, which is executed in a computing device and comprises the following steps:
acquiring voice data to be recognized;
performing feature extraction on the voice data to obtain acoustic features of the voice data;
inputting the acoustic features into a general background model, and performing maximum posterior estimation processing on the output of the general background model to obtain Gaussian mixture distribution of the voice data;
extracting a mean value super vector of the voice data based on the Gaussian mixture distribution;
performing factor analysis on the mean value super vector to obtain dimension reduction characteristics of the voice data;
and inputting the dimension reduction features into a trained gender classifier for processing, and outputting a gender estimation result of the voice data.
Optionally, in the voice gender identification method according to the present invention, the acquiring voice data to be identified includes: and carrying out endpoint detection on the voice stream, and intercepting continuous voice with preset duration from the voice stream according to an endpoint detection result to be used as voice data to be recognized.
Optionally, in the voice gender recognition method according to the present invention, the extracting features of the voice data to obtain the acoustic features of the voice data includes: pre-emphasis, framing and windowing are carried out on the voice data; performing discrete Fourier transform on each voice frame subjected to windowing to obtain the frequency spectrum of each voice frame; extracting an FBANK feature of a Mel scale filter bank from the frequency spectrum of each voice frame, and performing discrete cosine transform on the FBANK feature to obtain a Mel cepstrum coefficient MFCC feature; the MFCC features of all speech frames are constructed as a feature sequence, and the feature sequence is taken as the acoustic feature of the speech data.
Optionally, in the speech gender identification method according to the present invention, before constructing the MFCC features of all speech frames as a feature sequence, further comprising: calculating the energy value of each voice frame; the first coefficient of the MFCC feature of each speech frame is replaced with the energy value of that speech frame.
Optionally, in the voice gender recognition method according to the present invention, performing factor analysis on the mean supervector to obtain the dimension reduction feature of the voice data comprises: acquiring the mean supervector m of the universal background model; acquiring the total variability space matrix T of the factor analysis; calculating the i-vector feature w based on the formula M = m + Tw, where M is the mean supervector of the voice data; and taking the calculated i-vector feature as the dimension reduction feature of the voice data.
Optionally, in the speech gender recognition method according to the present invention, before inputting the dimension reduction features to a trained gender classifier for processing, the method further includes: and performing channel compensation on the dimensionality reduction features through linear discriminant analysis.
Alternatively, in the voice gender identification method according to the present invention, the voice data is telephone voice data.
Optionally, the speech gender recognition method according to the present invention further comprises: and training the universal background model by utilizing the linguistic data of various channels.
Optionally, the speech gender recognition method according to the present invention further comprises: the total variation space matrix for the factorial analysis is estimated by the maximum expectation algorithm using the phone corpus.
Optionally, the speech gender recognition method according to the present invention further comprises: the gender classifier was trained as follows: acquiring a training data set, wherein each piece of training data in the training data set comprises voice data and a gender label thereof; for each piece of training data, extracting the dimensionality reduction characteristic of the training data; inputting the extracted dimension reduction features into a gender classifier to be trained; and adjusting the model parameters of the gender classifier to be trained according to the output of the gender classifier and the gender label of the language data.
Optionally, in the speech gender recognition method according to the present invention, the gender classifier employs a logistic regression classifier.
According to another aspect of the present invention, there is provided a speech gender recognition apparatus, residing in a computing device, and comprising:
the acquisition module is suitable for acquiring voice data to be recognized;
the feature extraction module is suitable for extracting features of the voice data to obtain acoustic features of the voice data;
the characteristic processing module is suitable for inputting the acoustic characteristics into a general background model and carrying out maximum posterior estimation processing on the output of the general background model to obtain Gaussian mixture distribution of the voice data;
the mean value super vector extraction module is suitable for extracting a mean value super vector of the voice data based on the Gaussian mixture distribution;
the factor analysis module is suitable for carrying out factor analysis on the mean value super vector to obtain the dimension reduction characteristics of the voice data;
and the classification module is suitable for inputting the dimension reduction characteristics into a trained gender classifier for processing and outputting a gender estimation result of the voice data.
According to yet another aspect of the invention, there is provided a computing device comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the above-described method.
According to yet another aspect of the present invention, there is provided a readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the above-described method.
The voice gender identification scheme of the invention has one or more of the following beneficial technical effects:
1) Real-time gender recognition is completed using a short segment of speech, solving the problems of the traditional approach, which must retain the entire call voice and therefore occupies a large amount of server resources and performs poorly in real time.
2) Model parameters are estimated through a GMM-UBM model and MAP adaptation: not all GMM parameters need to be adjusted, only the mean parameter of each single Gaussian distribution needs to be estimated, so the model has few parameters and converges quickly. Model training can be completed with a small amount of telephone voice data while avoiding overfitting.
3) Features related to speaker gender are extracted from the voice by factor analysis, and voice gender is classified with a discriminative model, which solves the problem of low recognition accuracy caused by the interference of different channel information during calls.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a block diagram of a computing device 100, according to one embodiment of the invention;
FIG. 2 shows a flow diagram of a voice gender recognition method 200 according to one embodiment of the invention;
FIG. 3 shows a schematic of the modeling and training process of method 200;
fig. 4 is a block diagram illustrating a voice gender recognition apparatus 400 according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a block diagram of a computing device 100, according to one embodiment of the invention. As shown in FIG. 1, in a basic configuration 102, a computing device 100 typically includes a system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. The application 122 is actually a plurality of program instructions that direct the processor 104 to perform corresponding operations. In some embodiments, the application 122 may be arranged to cause the processor 104 to operate with the program data 124 on an operating system.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired or dedicated-wire network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as a personal computer including desktop and notebook computer configurations, as well as a server, such as a file server, database server, application server, WEB server, and the like. Of course, the computing device 100 may also be implemented as part of a small-sized portable (or mobile) electronic device. In an embodiment in accordance with the invention, the computing device 100 is configured to execute a speech gender recognition method 200 in accordance with the invention. The application 122 of the computing device 100 contains a plurality of program instructions for performing the method 200 according to the invention.
FIG. 2 illustrates a flow diagram of a speech gender recognition method 200 according to one embodiment of the present invention, the method 200 being performed in a computing device, such as the computing device 100 shown in FIG. 1.
Referring to fig. 2, the method 200 begins at step S202. In step S202, voice data to be recognized is acquired. In an embodiment of the present invention, the voice data to be recognized may be telephone voice data, such as voice data from a 400-number call.
The voice acquisition device acquires the voice stream of the telephone caller channel and caches it in a buffer in real time, performs real-time Voice Activity Detection (VAD, i.e., endpoint detection) on the voice stream, and intercepts continuous voice of a predetermined duration from the stream according to the endpoint detection result as the voice data to be recognized. Specifically, after the front endpoint of the voice stream is detected, once the voice length reaches the predetermined duration (for example, two seconds), buffering stops and that segment of voice is stored on the computing device; otherwise, endpoint detection continues until a continuous voice segment of the predetermined duration is intercepted.
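As an illustration only, the interception logic can be sketched in Python as follows (a minimal energy-based VAD sketch; the frame size, energy threshold, and function name are assumptions for illustration, not values fixed by the invention):

```python
import numpy as np

def intercept_speech(frames, frame_ms=10, energy_thresh=1e-3, target_sec=2.0):
    """Buffer frames after the front endpoint until `target_sec` of
    continuous speech has accumulated (energy-based endpoint detection)."""
    buf, in_speech = [], False
    for frame in frames:                        # e.g. 10 ms chunks of samples
        energy = float(np.mean(frame.astype(np.float64) ** 2))
        if not in_speech and energy > energy_thresh:
            in_speech = True                    # front endpoint detected
        if in_speech:
            buf.append(frame)
            if len(buf) * frame_ms / 1000.0 >= target_sec:
                return np.concatenate(buf)      # 2 s of speech to recognize
    return None                                 # keep detecting on the stream
```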
During an actual voice call, the channel conditions (environmental noise, communication equipment, and so on) are complex, and merely cutting out the silent parts with VAD cannot effectively eliminate the influence of channel differences, so the accuracy of voice gender recognition stays low. The subsequent steps therefore preprocess the voice data to be recognized, extract its acoustic features, and input the extracted acoustic features into trained models for real-time gender recognition.
In step S204, feature extraction is performed on the voice data to obtain its acoustic features. Before feature extraction, the intercepted voice data may be preprocessed, specifically through audio format conversion and sampling-rate conversion. For example, the voice data is converted to the wav format, and various sampling rates (8 kHz, 16 kHz, etc.) are converted to 8 kHz.
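For example, with librosa (assuming the saved segment is in a container librosa can decode; the file name is illustrative), decoding and resampling take one call:

```python
import librosa

# decode the intercepted segment and resample it to 8 kHz mono
samples, sr = librosa.load('segment.wav', sr=8000, mono=True)
```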
Then the acoustic features are extracted. The acoustic features may be FBANK, MFCC, PLP, and so on; MFCC (Mel-Frequency Cepstral Coefficients), a linear transform of the log energy spectrum on the nonlinear mel scale of sound frequency, is preferred here.
The specific steps of the acoustic feature extraction may include:
1) pre-emphasis is performed on the speech data. The energy of the high-frequency voice is emphasized, so that the high-frequency information of the voice signal is more prominent;
2) speech data is framed and windowed. The frame length is preferably 25ms, the frame shift is preferably 10ms, and the window function is preferably a Hamming window;
3) performing discrete Fourier transform on each windowed speech frame, and extracting frequency domain information to obtain a frequency spectrum corresponding to each speech frame;
4) extract the FBANK features: map the spectrum of each voice frame onto the mel scale with a mel-scale filter bank (preferably 40 mel filters) to obtain the mel spectrum, then take the logarithm of the mel spectrum energy values to obtain a multidimensional (for example, 40-dimensional) FBANK (mel-scale filter bank) feature;
5) apply a discrete cosine transform to the FBANK feature of each speech frame to obtain the MFCC feature of that frame; for example, only the first 20 dimensions may be kept as the MFCC feature.
Thus, each speech frame of the voice data corresponds to one MFCC feature, and these MFCC features form a feature sequence that becomes the acoustic feature of the voice data.
In a preferred mode, the energy value of each speech frame of the voice data is calculated as the sum of squares of all audio sample values in the frame, and this energy value replaces the first coefficient of the MFCC feature. Continuing the example above, each feature in the sequence constituting the acoustic feature is 20-dimensional: a 1-dimensional energy value plus 19 MFCC dimensions.
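The whole feature pipeline of this step can be sketched in Python as follows (numpy/scipy plus librosa's mel filter bank; the constants follow the preferred values above, while the function name, FFT size, and the assumption that the signal is at least one frame long are illustrative):

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def acoustic_features(signal, sr=8000, preemph=0.97, n_fft=512,
                      n_mels=40, n_mfcc=20):
    # 1) pre-emphasis: emphasize high-frequency energy
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # 2) framing (25 ms frames, 10 ms shift) with a Hamming window
    flen, fshift = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(signal) - flen) // fshift   # assumes len >= flen
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(flen)
    # 3) discrete Fourier transform -> power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4) mel-scale filter bank + log -> 40-dimensional FBANK features
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    fbank = np.log(np.maximum(power @ mel_fb.T, 1e-10))
    # 5) DCT -> keep the first 20 coefficients as the MFCC feature,
    #    then replace the first coefficient with the per-frame energy
    mfcc = dct(fbank, type=2, axis=1, norm='ortho')[:, :n_mfcc]
    mfcc[:, 0] = np.sum(frames ** 2, axis=1)
    return mfcc   # feature sequence, shape (n_frames, 20)
```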
In step S206, the acoustic features of the voice data are input into a pre-trained universal background model, and Maximum A Posteriori (MAP) adaptation is performed on the output of the universal background model to obtain the Gaussian mixture distribution of the voice data. A GMM (Gaussian Mixture Model) is a linear combination of multiple Gaussian distribution functions that can, in theory, fit any distribution; here it is used to model male and female voices. A UBM (Universal Background Model) is a relatively stable Gaussian mixture model, independent of individual speaker characteristics, built from a large amount of speech data from different speakers. It describes the shared properties of different speakers in the acoustic space, hence the name universal background model.
In this step, MAP (Maximum A Posteriori) adaptation is performed on each Gaussian component of the UBM with the MFCC feature sequence of the voice data, yielding the GMM corresponding to the voice data, i.e., a GMM-UBM model. In the embodiment of the invention, only the mean vectors μi, i = 1, 2, …, C, of the GMM need to be updated during adaptation, where C is the number of components of the GMM, i.e., the GMM is a linear combination of C Gaussian distribution functions.
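A sketch of this mean-only MAP adaptation for a diagonal-covariance UBM is given below (the relevance factor r = 16 is a common choice in the GMM-UBM literature, not a value fixed by the invention, and the function name is an assumption):

```python
import numpy as np
from scipy.special import logsumexp

def map_adapt_means(means, variances, weights, feats, r=16.0):
    """means, variances: (C, F) diagonal UBM parameters, weights: (C,),
    feats: (T, F) MFCC sequence. Returns the MAP-adapted means (C, F)."""
    # log-likelihood of each frame under each diagonal Gaussian component
    ll = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)[None, :]
                 + (((feats[:, None, :] - means[None, :, :]) ** 2)
                    / variances[None, :, :]).sum(axis=2))
    log_post = np.log(weights)[None, :] + ll
    log_post -= logsumexp(log_post, axis=1, keepdims=True)
    gamma = np.exp(log_post)                 # responsibilities, (T, C)
    n = gamma.sum(axis=0)                    # zeroth-order statistics, (C,)
    f = gamma.T @ feats                      # first-order statistics, (C, F)
    alpha = (n / (n + r))[:, None]           # data-dependent coefficients
    # interpolate between the per-component data mean and the UBM mean
    return alpha * (f / np.maximum(n, 1e-10)[:, None]) + (1 - alpha) * means
```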
In step S208, the mean supervector of the voice data is extracted based on the Gaussian mixture distribution. In one implementation, the mean vectors μi of all Gaussian components of the GMM are concatenated in a fixed order to obtain the GMM mean supervector M:

M = [μ1; μ2; …; μC]

Assuming each acoustic feature vector of the GMM is F-dimensional, M is a high-dimensional feature vector of size CF × 1 that contains all the information of the calling user's speech, including speaker information and channel information.
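Concretely, continuing the sketches above (the variable names are assumptions), the supervector is just the adapted means flattened in component order:

```python
# ubm: a fitted diagonal-covariance GMM (see the UBM training sketch below);
# mfcc: the (T, 20) acoustic feature sequence of the utterance
adapted = map_adapt_means(ubm.means_, ubm.covariances_, ubm.weights_, mfcc)
M = adapted.reshape(-1)   # GMM mean supervector, dimension C*F
```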
In step S210, factor analysis is performed on the mean supervector M to obtain a feature of the voice data after dimension reduction of the mean supervector, referred to in the present invention as the dimension reduction feature. In one implementation, the dimension reduction feature is the i-vector feature.
In the embodiment of the present invention, the i-vector is an R × 1 vector that obeys the Gaussian distribution N(0, 1) and contains both the identity information and the channel information of the speaker, so it can fully cover variations in environmental factors such as noise, reverberation, and coding mode; its dimension is usually 400-600, preferably 400. The GMM mean supervector M of the voice data can be expressed as follows:
M = m + Tw

where M obeys the Gaussian distribution N(m, TT^T), m is the UBM mean supervector, T is the total variability space matrix of dimension CF × R, and w is the i-vector feature.
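Full i-vector extraction conditions T on the Baum-Welch statistics of the utterance; purely as a simplified illustration of the factor-analysis equation above (not the standard extractor), a point estimate of w can be obtained by ridge-regularized least squares, where the regularizer loosely stands in for the Gaussian prior on w:

```python
import numpy as np

def ivector_point_estimate(M, m, T, reg=1.0):
    """Simplified w from M = m + Tw: minimize ||M - m - Tw||^2 + reg*||w||^2."""
    lhs = T.T @ T + reg * np.eye(T.shape[1])   # (R, R)
    rhs = T.T @ (M - m)                        # (R,)
    return np.linalg.solve(lhs, rhs)           # i-vector w, shape (R,)
```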
In step S212, the dimension-reduced features, such as i-vector features, are input to the trained gender classifier for processing, and the gender estimation result of the speech data is output.
According to another embodiment of the present invention, before step S212, channel compensation is further performed on the i-vector through LDA (Linear Discriminant Analysis); in step S212, the channel-compensated i-vector feature is input into the trained gender classifier for processing, and the gender estimation result of the voice data is output.
LDA is a dimension reduction technique from the field of pattern recognition; it makes the new features more discriminative by finding the directions that best separate the classes of data. Through LDA, the ability of the i-vector to distinguish speaker gender can be further improved, and the influence of differing channel information on recognition accuracy is weakened.
The training process of LDA is as follows: provide a training data set in which each piece of training data comprises voice data and its gender label, and extract the i-vector of each piece of training data as described above.
The solution process of LDA is then the process of maximizing the Rayleigh coefficient J:

J(a) = (a^T Sb a) / (a^T Sw a)

where Sb and Sw are the between-class and within-class scatter matrices, respectively, calculated as:

Sb = Σs ns (μs − μ)(μs − μ)^T

Sw = Σs Σh (ws,h − μs)(ws,h − μs)^T

where s is the gender class (s = 0 for male, s = 1 for female), μs is the mean of the i-vectors of all voice data of gender s in the training data set, μ is the i-vector mean over all the voice data, ns is the number of pieces of voice data of gender s in the training data set, and ws,h is the i-vector of the h-th utterance of gender s. The Rayleigh coefficient reflects the ratio of the between-class scatter Sb to the within-class scatter Sw along the direction a, so maximizing it minimizes the variance caused by channel effects while maximizing the variance between speaker characteristics. Maximizing the Rayleigh coefficient can be converted into solving for a projection matrix A composed of the eigenvectors a corresponding to the largest eigenvalues of:

Sb a = λ Sw a

where λ is an eigenvalue.
Thus, after training, the projection matrix A is obtained. The i-vector after channel compensation by LDA can be expressed as:

φ(w) = A^T w

where w is the i-vector feature before channel compensation and φ(w) is the i-vector feature after channel compensation.
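The scatter computation and projection can be sketched as follows (standard two-class LDA; the small ridge added to Sw for numerical stability is an implementation assumption):

```python
import numpy as np
from scipy.linalg import eigh

def lda_matrix(ivecs, labels, n_dims=1):
    """ivecs: (N, R) training i-vectors, labels: (N,) gender labels 0/1.
    Returns the projection matrix A with n_dims columns."""
    mu = ivecs.mean(axis=0)
    Sb = np.zeros((ivecs.shape[1], ivecs.shape[1]))
    Sw = np.zeros_like(Sb)
    for s in np.unique(labels):
        ws = ivecs[labels == s]
        mu_s = ws.mean(axis=0)
        d = (mu_s - mu)[:, None]
        Sb += len(ws) * (d @ d.T)                # between-class scatter
        Sw += (ws - mu_s).T @ (ws - mu_s)        # within-class scatter
    Sw += 1e-6 * np.eye(Sw.shape[0])             # ridge for invertibility
    # generalized eigenproblem Sb a = lambda Sw a; eigh sorts ascending
    vals, vecs = eigh(Sb, Sw)
    return vecs[:, ::-1][:, :n_dims]             # largest-eigenvalue vectors

# channel compensation: phi(w) = A^T w
# A = lda_matrix(train_ivecs, train_labels); w_comp = A.T @ w
```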
It can be seen that in the embodiment of the invention, the i-vector feature of the voice is extracted through factor analysis, and channel compensation is then applied to this feature through LDA. This strengthens the speaker information in the voice feature, weakens the influence of the complex channel information in telephone voice on voice gender recognition, and improves gender recognition accuracy.
The process of building and training the correlation model in method 200 is described below.
FIG. 3 shows a schematic of the modeling and training process of method 200. Referring to fig. 3, the process involves training of the UBM model, computation of a spatial matrix of total variation in factorial analysis, and training of a gender classifier.
First, a UBM independent of speaker information is trained using a large corpus covering various channels. As mentioned above, the UBM is itself a GMM; it is a common reflection of the speech characteristics of all speakers and of the channel information, and the larger and broader the training data, the closer the trained GMM approaches the true distribution. Specifically, a large corpus covering various channels is obtained, the corpus data is processed according to steps S202 and S204, MFCC features are extracted, and these MFCC features are used to train the UBM; the UBM parameters can be trained with the EM (Expectation-Maximization) algorithm. After training is completed, the mean supervector of the UBM is obtained.
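As a sketch, the UBM can be trained with scikit-learn's EM-based GaussianMixture (the component count of 512 and the variable names are illustrative assumptions; the invention does not fix C):

```python
from sklearn.mixture import GaussianMixture

# pooled_mfcc: MFCC frames pooled over a large multi-channel corpus, (N, 20)
ubm = GaussianMixture(n_components=512, covariance_type='diag',
                      max_iter=100, random_state=0)
ubm.fit(pooled_mfcc)
m = ubm.means_.reshape(-1)   # UBM mean supervector m, dimension C*F
```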
Then, a telephone corpus (e.g., a 400-hotline corpus) is obtained; one part is used as the training data set and another part as the test data set (optional). The corpus data is processed according to steps S202 and S204 to extract MFCC features. The MFCC features of the training-set corpus are processed as in steps S206 and S208 to obtain the mean supervector of each piece of speech data in the training set. Factor analysis is performed on the mean supervectors of all training-set speech, extracting an i-vector feature from each. As described above, the i-vector is an R × 1 vector obeying N(0, 1) that contains the speaker's identity and channel information and can fully cover variations in noise, reverberation, coding mode, and other environmental factors; its dimension is usually 400-600, preferably 400. The GMM mean supervector of each piece of speech data can be expressed as follows:
M=m+Tw
where M is the GMM mean supervector and obeys the Gaussian distribution N(m, TT^T), m is the UBM mean supervector, T is the total variability space matrix of dimension CF × R, and w is the i-vector feature. The total variability space matrix T is estimated with the EM algorithm during training. After the estimation of T is completed, the corresponding i-vector feature is extracted from the GMM mean supervector of each utterance, for the training set and the test set respectively.
Next, a logistic regression model is trained with the i-vector features of the voice data to classify voice gender. The specific steps are as follows:
a) label the voice data 0 or 1 according to male or female gender, respectively;
b) train a logistic regression model with the i-vector features of the training set; the model function is:

h_θ(x) = 1 / (1 + e^(−θ^T x))

where θ^T = [θ0 θ1 … θn] represents the parameter set; the loss function is the cross-entropy:

J(θ) = −(1/N) Σk [ yk log h_θ(xk) + (1 − yk) log(1 − h_θ(xk)) ]

and the parameter θ is obtained by gradient descent.
Thus, in the recognition stage (step S212), the trained parameter θ is substituted into the model; given a segment of speech x to be recognized, its i-vector is extracted and input into the model, and the speech is recognized as male if h_θ(x) < 0.5 and as female if h_θ(x) > 0.5.
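The training and recognition stages together can be sketched with scikit-learn (the variable names are assumptions; LDA reduction to a single dimension is the two-class case described above, and the 0.5 threshold matches the decision rule just given):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# X: (N, R) training i-vectors, y: (N,) gender labels (0 = male, 1 = female)
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
clf = LogisticRegression().fit(lda.transform(X), y)

def recognize_gender(ivec):
    """ivec: (R,) i-vector extracted from a ~2 s speech segment."""
    p_female = clf.predict_proba(lda.transform(ivec.reshape(1, -1)))[0, 1]
    return 'female' if p_female > 0.5 else 'male'
```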
Fig. 4 is a block diagram illustrating a voice gender recognition apparatus 400 according to an embodiment of the present invention.
Referring to fig. 4, the apparatus 400 includes:
an obtaining module 410 adapted to obtain voice data to be recognized;
a feature extraction module 420, adapted to perform feature extraction on the voice data to obtain an acoustic feature of the voice data;
the feature processing module 430 is adapted to input the acoustic features into a general background model, and perform maximum posterior estimation processing on the output of the general background model to obtain gaussian mixture distribution of the voice data;
a mean supervector extraction module 440 adapted to extract a mean supervector of the speech data based on the gaussian mixture distribution;
the factor analysis module 450 is adapted to perform factor analysis on the mean value supervector to obtain a dimension reduction feature of the voice data;
and the classification module 460 is adapted to input the dimension reduction features into a trained gender classifier for processing, and output a gender estimation result of the voice data.
The specific processing performed by the obtaining module 410, the feature extraction module 420, the feature processing module 430, the mean supervector extraction module 440, the factor analysis module 450, and the classification module 460 is described above in steps S202, S204, S206, S208, S210, and S212, respectively, and is not repeated here.
In summary, the invention acquires the call voice stream of the telephone caller in real time, performs real-time endpoint detection on the stream, and intercepts call voice of a predetermined duration (for example, 2 seconds). Real-time gender recognition can be completed from this short segment alone, without retaining the whole call recording, which saves a large amount of server resources and gives good real-time performance.
The invention trains the UBM on a large amount of data to capture the common characteristics of speech and channels, then estimates model parameters through MAP adaptation to obtain the GMM of each utterance. Not all GMM parameters need to be adjusted: only the mean parameter of each single Gaussian distribution is estimated, so the model has few parameters and converges quickly. Model training can be completed with a small amount of telephone voice data, which avoids overfitting and solves the degradation in recognition performance caused by training corpora too small to cover all pronunciation content.
The invention strengthens the representation of speaker gender information in the voice features through factor analysis, weakens the influence of complex telephone channel information on voice gender recognition in practical applications, and improves gender recognition accuracy through a discriminative model.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Claims (10)

1. A speech gender recognition method, implemented in a computing device, comprising the steps of:
acquiring voice data to be recognized;
performing feature extraction on the voice data to obtain acoustic features of the voice data;
inputting the acoustic features into a general background model, and performing maximum posterior estimation processing on the output of the general background model to obtain Gaussian mixture distribution of the voice data;
extracting a mean value super vector of the voice data based on the Gaussian mixture distribution;
performing factor analysis on the mean value super vector to obtain dimension reduction characteristics of the voice data;
and inputting the dimension reduction features into a trained gender classifier for processing, and outputting a gender estimation result of the voice data.
2. The method of claim 1, wherein the obtaining speech data to be recognized comprises:
and carrying out endpoint detection on the voice stream, and intercepting continuous voice with preset duration from the voice stream according to an endpoint detection result to be used as voice data to be recognized.
3. The method of claim 1 or 2, wherein the performing feature extraction on the voice data to obtain the acoustic features of the voice data comprises:
pre-emphasis, framing and windowing are carried out on the voice data;
performing discrete Fourier transform on each voice frame subjected to windowing to obtain the frequency spectrum of each voice frame;
extracting an FBANK feature of a Mel scale filter bank from the frequency spectrum of each voice frame, and performing discrete cosine transform on the FBANK feature to obtain a Mel cepstrum coefficient MFCC feature;
the MFCC features of all speech frames are constructed as a feature sequence, and the feature sequence is taken as the acoustic feature of the speech data.
4. The method of claim 3, wherein prior to constructing MFCC features for all speech frames as a sequence of features, further comprising:
calculating the energy value of each voice frame;
the first coefficient of the MFCC feature of each speech frame is replaced with the energy value of that speech frame.
5. The method of any of claims 1 to 4, wherein the factoring the mean supervector to obtain the reduced-dimension feature of the speech data comprises:
acquiring a mean value super vector m of the general background model;
acquiring a total change space matrix T of the factor analysis;
calculating the i-vector feature w based on the formula M = m + Tw, where M is the mean supervector of the voice data;
and taking the calculated i-vector feature as a dimension reduction feature of the voice data.
6. The method of any one of claims 1 to 5, wherein prior to inputting the dimension-reduced features into a trained gender classifier for processing, further comprising:
and performing channel compensation on the dimensionality reduction features through linear discriminant analysis.
7. The method of any one of claims 1 to 6, wherein the voice data is telephony voice data.
8. A speech gender recognition apparatus, residing in a computing device, and comprising:
the acquisition module is suitable for acquiring voice data to be recognized;
the feature extraction module is suitable for extracting features of the voice data to obtain acoustic features of the voice data;
the characteristic processing module is suitable for inputting the acoustic characteristics into a general background model and carrying out maximum posterior estimation processing on the output of the general background model to obtain Gaussian mixture distribution of the voice data;
the mean value super vector extraction module is suitable for extracting a mean value super vector of the voice data based on the Gaussian mixture distribution;
the factor analysis module is suitable for carrying out factor analysis on the mean value super vector to obtain the dimension reduction characteristics of the voice data;
and the classification module is suitable for inputting the dimension reduction characteristics into a trained gender classifier for processing and outputting a gender estimation result of the voice data.
9. A computing device, comprising:
at least one processor; and
a memory storing program instructions configured for execution by the at least one processor, the program instructions comprising instructions for performing the method of any of claims 1-7.
10. A readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the method of any of claims 1-7.
CN201911328136.4A 2019-12-20 2019-12-20 Voice gender identification method and device and computing equipment Pending CN111161713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911328136.4A CN111161713A (en) 2019-12-20 2019-12-20 Voice gender identification method and device and computing equipment


Publications (1)

Publication Number Publication Date
CN111161713A true CN111161713A (en) 2020-05-15

Family

ID=70557556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911328136.4A Pending CN111161713A (en) 2019-12-20 2019-12-20 Voice gender identification method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN111161713A (en)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095401A (en) * 2015-07-07 2015-11-25 北京嘀嘀无限科技发展有限公司 Method and apparatus for identifying gender
CN107274905A (en) * 2016-04-08 2017-10-20 腾讯科技(深圳)有限公司 A kind of method for recognizing sound-groove and system
CN106952643A (en) * 2017-02-24 2017-07-14 华南理工大学 A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering
CN108806697A (en) * 2017-05-02 2018-11-13 申子健 Speaker's identity identifying system based on UBM and SVM
CN107357782A (en) * 2017-06-29 2017-11-17 深圳市金立通信设备有限公司 One kind identification user's property method for distinguishing and terminal
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN107623614A (en) * 2017-09-19 2018-01-23 百度在线网络技术(北京)有限公司 Method and apparatus for pushed information
CN107886943A (en) * 2017-11-21 2018-04-06 广州势必可赢网络科技有限公司 A kind of method for recognizing sound-groove and device
CN108417217A (en) * 2018-01-11 2018-08-17 苏州思必驰信息科技有限公司 Speaker Identification network model training method, method for distinguishing speek person and system
CN108520752A (en) * 2018-04-25 2018-09-11 西北工业大学 A kind of method for recognizing sound-groove and device
CN109545227A (en) * 2018-04-28 2019-03-29 华中师范大学 Speaker's gender automatic identifying method and system based on depth autoencoder network
CN110502959A (en) * 2018-05-17 2019-11-26 Oppo广东移动通信有限公司 Sexual discriminating method, apparatus, storage medium and electronic equipment
CN109065022A (en) * 2018-06-06 2018-12-21 平安科技(深圳)有限公司 I-vector vector extracting method, method for distinguishing speek person, device, equipment and medium
CN108922544A (en) * 2018-06-11 2018-11-30 平安科技(深圳)有限公司 General vector training method, voice clustering method, device, equipment and medium
CN108694954A (en) * 2018-06-13 2018-10-23 广州势必可赢网络科技有限公司 A kind of Sex, Age recognition methods, device, equipment and readable storage medium storing program for executing
CN108922559A (en) * 2018-07-06 2018-11-30 华南理工大学 Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jean-Luc Gauvain, Chin-Hui Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains", IEEE Transactions on Speech and Audio Processing *
王尔玉 (Wang Eryu), "Research on speaker recognition technology based on several voiceprint information spaces", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111568400A (en) * 2020-05-20 2020-08-25 山东大学 Human body sign information monitoring method and system
CN111568400B (en) * 2020-05-20 2024-02-09 山东大学 Human body sign information monitoring method and system
CN111816218A (en) * 2020-07-31 2020-10-23 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and storage medium
CN111816218B (en) * 2020-07-31 2024-05-28 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and storage medium
CN112420018A (en) * 2020-10-26 2021-02-26 昆明理工大学 Language identification method suitable for low signal-to-noise ratio environment
CN113270111A (en) * 2021-05-17 2021-08-17 广州国音智能科技有限公司 Height prediction method, device, equipment and medium based on audio data
CN114049881A (en) * 2021-11-23 2022-02-15 深圳依时货拉拉科技有限公司 Voice gender recognition method, device, storage medium and computer equipment

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
US9940935B2 (en) Method and device for voiceprint recognition
CN111161713A (en) Voice gender identification method and device and computing equipment
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
WO2018149077A1 (en) Voiceprint recognition method, device, storage medium, and background server
US8935167B2 (en) Exemplar-based latent perceptual modeling for automatic speech recognition
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
WO2014114116A1 (en) Method and system for voiceprint recognition
US20140236593A1 (en) Speaker recognition method through emotional model synthesis based on neighbors preserving principle
CN112562691A (en) Voiceprint recognition method and device, computer equipment and storage medium
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN108986798B (en) Processing method, device and the equipment of voice data
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
CN110459242A (en) Change of voice detection method, terminal and computer readable storage medium
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
CN110931023A (en) Gender identification method, system, mobile terminal and storage medium
CN113129867A (en) Training method of voice recognition model, voice recognition method, device and equipment
Chakroun et al. Robust features for text-independent speaker recognition with short utterances
WO2023279691A1 (en) Speech classification method and apparatus, model training method and apparatus, device, medium, and program
US10446138B2 (en) System and method for assessing audio files for transcription services
Sood et al. Speech recognition employing mfcc and dynamic time warping algorithm
Wu et al. Speaker identification based on the frame linear predictive coding spectrum technique
CN112347788A (en) Corpus processing method, apparatus and storage medium
Mini et al. Feature vector selection of fusion of MFCC and SMRT coefficients for SVM classifier based speech recognition system
Revathi et al. Comparative analysis on the use of features and models for validating language identification system

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200515)