CN109378007B - Method for realizing gender recognition based on intelligent voice conversation - Google Patents
Method for realizing gender recognition based on intelligent voice conversation
- Publication number: CN109378007B
- Application number: CN201811624157.6A
- Authority
- CN
- China
- Prior art keywords
- voice
- training set
- gender
- training
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
All classes fall under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING.
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices (under G10L17/00—Speaker identification or verification)
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue (under G10L15/00—Speech recognition)
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction (under G10L17/00)
- G10L17/04—Training, enrolment or model building (under G10L17/00)
- G10L17/22—Interactive procedures; Man-machine interfaces (under G10L17/00)
Abstract
The invention discloses a method for realizing gender recognition based on intelligent voice conversation, which fills the gap in the acquisition and application of gender information in current intelligent voice dialogue data. To address shortcomings of the prior art, such as low recognition accuracy, strict environmental requirements and low processing speed, the invention adopts a machine-learning multi-model fusion technique that combines the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms with probability-based statistical analysis, and realizes gender recognition on intelligent voice dialogue data, in particular voice data transmitted over a telephone channel. The method can effectively improve the accuracy, efficiency, robustness and application range of gender recognition. It can be applied in the field of customer relationship management to update, supplement and verify customer information, expand the ways in which enterprises acquire customer data, and reduce the cost of maintaining customers; it is highly practical and convenient to popularize and use.
Description
Technical Field
The invention relates to the field of intelligent voice recognition, in particular to a method for realizing gender recognition based on intelligent voice conversation.
Background
With the rapid development of technologies such as artificial intelligence, big data and cloud computing, and with changes in population structure, intelligent voice is increasingly applied across industries to replace or assist humans in large amounts of repetitive voice work, such as outbound calls or reception services in call centers, customer satisfaction surveys and questionnaires. An intelligent voice dialogue preserves the interactive speech of both the robot and the human, and beyond text semantics this corpus contains information in additional dimensions, such as gender, emotion and age. At present, the application of intelligent voice dialogue data basically stays at the level of text semantic understanding, while the mining of other dimensions, especially the acquisition and application of the speaker's gender information, remains blank.
The speech data of an intelligent voice dialogue has two distinct features that increase the difficulty of gender recognition: (1) the sound sources are complex: the two parties of the conversation are a robot and a human, so the recording contains both robot speech (including but not limited to TTS, NLG and pre-recorded human voice) and human speech, i.e. two or more voices; (2) environmental noise and mixed voices are unavoidable: in practical applications, especially in intelligent voice dialogue over a telephone channel, the 8K sampling rate yields low sound quality, and environmental noise or the voices of other people or objects inevitably mix into the conversation, which further increases the difficulty of gender recognition.
At present there are two common methods for voice gender recognition, with the following schemes and respective drawbacks: (1) The first relies on the difference in pitch frequency between male and female voices, using a fixed frequency as the boundary: voices below it are recognized as male and voices above it as female. This method is coarse and has a high error rate. (2) The second is voiceprint recognition based on machine-learning/deep-learning techniques. Its processing speed is low, it places strict requirements on the acoustic environment, it is unsuitable for noisy environments and overlapping speech, and different microphones and channels degrade its performance; in particular, conventional machine-learning/deep-learning methods cannot handle gender recognition of voices well under a telephone channel.
Disclosure of Invention
The invention aims to provide a method for realizing gender identification based on intelligent voice conversation, which aims to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme: a method for realizing gender identification based on intelligent voice conversation comprises the following steps:
S1: downloading the voice call of the user to a designated server;
S2: cutting the voice call into a plurality of voice segments, and extracting the voice features of each segment through OpenSmile;
S3: analyzing the voice segments through a machine learning algorithm, extracting the numerical parameters of the voice, and obtaining the gender probability of each voice segment through model fusion; performing statistical analysis on all gender probabilities to obtain a plurality of statistical characteristics, and making a final prediction of the gender of the user based on these statistical characteristics.
As a preferred technical solution of the present invention, the voice features include high-dimensional features, such as MFCC and PCM-based descriptors, of up to 6000 dimensions.
As a preferred technical solution of the present invention, the machine learning algorithm is as follows:
A1, after obtaining the high-dimensional voice features of the voice file through OpenSmile, constructing a labeled high-dimensional training set; performing feature selection on this training set with the python package of LightGBM, scoring the importance of each feature dimension by its degree of correlation with the label, selecting the N voice features with the highest importance, and reconstructing a lower-dimensional training set, where N is taken as 150-200;
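A minimal sketch of step A1's importance-based feature selection, using synthetic data in place of the OpenSmile features. sklearn's GradientBoostingClassifier stands in for LightGBM here, since both expose the same `feature_importances_` interface; the small N and variable names are illustrative only.

```python
# Sketch of step A1: rank the high-dimensional features by their importance to
# the label and keep only the top N. LightGBM (as in the patent) exposes the
# same feature_importances_ attribute as the sklearn stand-in used here.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_high = rng.normal(size=(200, 50))                # stand-in for ~6000-dim OpenSmile features
y = (X_high[:, 3] + X_high[:, 7] > 0).astype(int)  # stand-in gender labels

gbm = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_high, y)
N = 10                                             # the patent recommends N = 150-200
top_idx = np.argsort(gbm.feature_importances_)[::-1][:N]
X_low = X_high[:, top_idx]                         # reconstructed low-dimensional training set
```

Because the synthetic labels depend only on columns 3 and 7, those dimensions receive the highest importance and survive the selection.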
A2, dividing the low-dimensional training set into a training set and a testing set at a ratio between 0.7:0.3 and 0.8:0.2; then further dividing the training set into training set 1 and training set 2 at a ratio of 0.5:0.5.
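Step A2's two-stage split can be sketched with sklearn's `train_test_split`; the data here is a synthetic stand-in for the low-dimensional feature set.

```python
# Sketch of step A2: first split off a held-out test set at 0.7:0.3, then halve
# the remaining training data into training set 1 and training set 2.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(100, 20))  # stand-in low-dimensional features
y = np.tile([0, 1], 50)                              # stand-in gender labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
X_t1, X_t2, y_t1, y_t2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0, stratify=y_train)
```

With 100 samples this yields a 30-sample test set and two 35-sample training sets; stratifying keeps the label balance in each part.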
A3, adopting model fusion on training set 1: performing binary-classification training on the data of training set 1 with the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms respectively to obtain four trained models; feeding the outputs of the four algorithms into a Logistic Regression, and fusing the prediction results of the four models;
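The fusion in step A3 is a stacking arrangement: four base classifiers whose outputs feed a Logistic Regression. sklearn's `StackingClassifier` is one plausible realization of it, shown here on synthetic data standing in for training set 1.

```python
# Sketch of step A3: binary-classification training of the four base algorithms
# on training set 1, with a Logistic Regression fusing their prediction streams.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X1, y1 = make_classification(n_samples=300, n_features=20, random_state=0)  # stand-in training set 1
fusion = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # fuses the four base models
)
fusion.fit(X1, y1)
proba = fusion.predict_proba(X1)[:, 1]     # per-segment gender probability
```

The per-segment probabilities produced this way are exactly what the later statistical-analysis stage aggregates per user.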
A4, transferring the training results and model parameters of training set 1 to training set 2, saving the trained model parameters to a .m file with python's sklearn.externals.joblib package, and performing gender prediction on the voice segments in training set 2 to obtain the gender probability of each segment.
As a preferred technical scheme of the invention, the voice segments in training set 2 are grouped by user ID, and the gender prediction probabilities of the several voices under one user ID are used to construct statistical features, including the maximum probability, minimum probability, mean probability, median probability, variance of the probabilities, the number of probabilities greater than 0.5 and the proportion of probabilities greater than 0.5.
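The per-user statistics described above can be sketched as a small aggregation function; the segment probabilities passed in at the end are hypothetical.

```python
# Sketch of the per-user statistics: aggregate the per-segment gender
# probabilities of one user ID into the seven features the patent lists.
import numpy as np

def user_stats(probs):
    """Statistical features built from one user's segment-level probabilities."""
    p = np.asarray(probs, dtype=float)
    return {
        "max": p.max(),
        "min": p.min(),
        "mean": p.mean(),
        "median": float(np.median(p)),
        "var": p.var(),
        "n_gt_half": int((p > 0.5).sum()),
        "ratio_gt_half": float((p > 0.5).mean()),
    }

stats = user_stats([0.9, 0.8, 0.3, 0.7])  # hypothetical segment probabilities
```

For these four segments, three probabilities exceed 0.5, so the count is 3 and the proportion 0.75.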
Compared with the prior art:
1. The invention provides a technical scheme for gender recognition on intelligent voice dialogue data, filling the gap in the acquisition and application of gender information in such data.
2. To address shortcomings of the prior art, such as low recognition accuracy, strict environmental requirements and low processing speed, the invention adopts a machine-learning multi-model fusion technique that combines the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms with probability-based statistical analysis, realizes gender recognition on intelligent voice dialogue data, in particular voice data under a telephone channel, and can effectively improve the accuracy, efficiency, robustness and application range of the recognition results.
3. The invention can be applied in the field of customer relationship management to update, supplement and verify customer information, expand the ways in which enterprises acquire customer data, and reduce the cost of maintaining customers; it is highly practical and convenient to popularize and use.
Drawings
FIG. 1 is a schematic diagram of a speech recognition process of the present invention.
FIG. 2 is a diagram of a speech recognition model and training framework according to the present invention.
FIG. 3 is a schematic diagram of an application scenario framework of the speech gender recognition model of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-3, the present invention provides a method for implementing gender identification based on intelligent voice dialog, comprising:
S1: downloading the voice call of the user to a designated server;
S2: cutting the voice call into a plurality of voice segments, and extracting the voice features of each segment through OpenSmile;
S3: analyzing the voice segments through a machine learning algorithm, extracting the numerical parameters of the voice, and obtaining the gender probability of each voice segment through model fusion; performing statistical analysis on all gender probabilities to obtain a plurality of statistical characteristics, and making a final prediction of the gender of the user based on these statistical characteristics.
The voice features include high-dimensional features, such as MFCC and PCM-based descriptors, of up to 6000 dimensions.
The machine learning algorithm is as follows:
A1, after obtaining the high-dimensional voice features of the voice file through OpenSmile, constructing a labeled high-dimensional training set; performing feature selection on this training set with the python package of LightGBM, scoring the importance of each feature dimension by its degree of correlation with the label, selecting the N voice features with the highest importance, and reconstructing a lower-dimensional training set, where N is taken as 150-200;
A2, dividing the low-dimensional training set into a training set and a testing set at a ratio between 0.7:0.3 and 0.8:0.2; then further dividing the training set into training set 1 and training set 2 at a ratio of 0.5:0.5;
A3, adopting model fusion on training set 1: performing binary-classification training on the data of training set 1 with the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms respectively to obtain four trained models; feeding the outputs of the four algorithms into a Logistic Regression, and fusing the prediction results of the four models;
A4, transferring the training results and model parameters of training set 1 to training set 2, saving the trained model parameters to a .m file with python's sklearn.externals.joblib package, and performing gender prediction on the voice segments in training set 2 to obtain the gender probability of each segment.
The voice segments in training set 2 are grouped by user ID, yielding the gender prediction probabilities of several voices under each user ID; the statistical features of each user ID are constructed from information such as the maximum probability, minimum probability, mean probability, median probability, variance of the probabilities, the number of probabilities greater than 0.5 and the proportion of probabilities greater than 0.5.
The working principle is as follows: referring to Fig. 1 of the specification, after the voice call information of a user is obtained, the voice call is downloaded to a designated server and cut into a plurality of segments, and numerical features describing the voice (including high-dimensional features such as MFCC and PCM of up to 6000 dimensions) are extracted from the segments; the voice is then analyzed through a machine learning algorithm based on these features, so as to judge the gender of the speaker.
Referring to Fig. 2 of the specification, the technical details of the machine learning framework and the model parameter training process are as follows. 1. After the high-dimensional voice features of the voice file (wav format) are obtained through OpenSmile, a labeled voice-feature data set is constructed as the high-dimensional training set. 2. Feature selection is performed on the high-dimensional training set with the python package of LightGBM: the importance of each feature dimension is scored by its degree of correlation with the label, the N voice features with the highest importance are selected, and a lower-dimensional voice-feature data set is reconstructed; experiments show the best results with N between 150 and 200 (too small an N loses information, too large an N brings in redundant information), and in practice N is taken as 180. 3. The data is first divided into a training set, used to train the parameters, and a testing set, used to check the accuracy, at a ratio generally between 0.7:0.3 and 0.8:0.2 (0.7:0.3 in actual operation); the training set is then subdivided at a ratio of 0.5:0.5 into training set 1, used to train the classification of the voice segments, and training set 2, used to train the classification on the statistical features, which avoids overfitting. 4. Model fusion is adopted on training set 1: the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms each perform binary-classification training on the data of training set 1; the outputs of the four algorithms are fed into a Logistic Regression, and the prediction results of the four models are fused (these are common algorithms whose packaged implementations can be called directly in python, and they are not themselves an innovation of the invention). 5. The model parameters and training results of training set 1 are transferred to training set 2, and gender prediction is performed on the voice segments of training set 2 to obtain the gender probability of each segment; each of the four algorithms yields its own parameters after training, for example the parameters of the Logistic Regression and SVM algorithms are matrices, while those of the other two are tree structures and classification rules; in practice, the parameters of the trained models are saved into a .m file with python's sklearn.externals.joblib package after training. 6. Because the voice of one user ID may consist of several segments, the voice segments in training set 2 are grouped by user ID to obtain the gender prediction probabilities of the several voices under each user ID. 7.
statistical analysis is performed on the multi-segment gender probability predictions of one user ID to obtain several statistical indicators of the segment-level results, including the maximum, minimum, mean, median and variance of the probabilities, the number of probabilities greater than 0.5 and the proportion of probabilities greater than 0.5, from which the statistical features of each user are constructed; 8. the binary-classification parameters of a Logistic Regression are trained on the users' statistical features; 9. the model parameter training is complete.
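The parameter persistence mentioned in step 5 can be sketched as follows. The patent names the sklearn.externals.joblib package and a .m file; the standalone joblib package used here is its modern equivalent (sklearn.externals.joblib was removed from scikit-learn), and the model, data and path are illustrative.

```python
# Sketch of the model persistence in step 5: dump the trained parameters to a
# .m file and reload them for prediction on training set 2. The standalone
# joblib package replaces the deprecated sklearn.externals.joblib.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "gender_model.m")
joblib.dump(model, path)        # save the trained parameters
restored = joblib.load(path)    # reload for prediction on new segments
```

Reloading restores the fitted coefficients exactly, so the restored model reproduces the original predictions.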
Referring to Fig. 3 of the specification: 1. the application framework of the machine learning algorithm is similar to the training process, performing gender prediction on voice with the parameters obtained during training; 2. first, the several voice segment files of the user to be analyzed are processed through OpenSmile to obtain their high-dimensional voice features; 3. the numerical parameters of the voice are extracted according to the important features learned by LightGBM; 4. model fusion is applied to the numerical feature data of the voice to obtain the gender probability of each voice segment; 5. statistical analysis of the predicted gender probabilities of the voice segments yields the statistical features; 6. the gender of the user is finally predicted based on these statistical features.
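The final stage of the application framework (steps 5 and 6) can be sketched compactly: per-user statistical features are built from segment probabilities and fed to the final Logistic Regression. All user IDs, per-user probabilities and gender labels below are hypothetical stand-ins, not data from the patent.

```python
# Sketch of the application stage: aggregate each user's segment probabilities
# into statistical features and classify with the final Logistic Regression.
# The per-user probabilities and gender labels below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stats_vector(probs):
    # the seven per-user statistics listed in the patent
    p = np.asarray(probs, dtype=float)
    return [p.max(), p.min(), p.mean(), float(np.median(p)), p.var(),
            int((p > 0.5).sum()), float((p > 0.5).mean())]

seg_probs = {"u1": [0.9, 0.8, 0.7], "u2": [0.1, 0.2, 0.4],
             "u3": [0.6, 0.9, 0.55], "u4": [0.3, 0.2, 0.1]}
labels = {"u1": 1, "u2": 0, "u3": 1, "u4": 0}

X_stats = np.array([stats_vector(p) for p in seg_probs.values()])
y_stats = np.array([labels[u] for u in seg_probs])

final_clf = LogisticRegression().fit(X_stats, y_stats)   # step 8 of training
pred = final_clf.predict(np.array([stats_vector([0.85, 0.9, 0.75])]))
```

A new user whose segments are all confidently above 0.5 lands on the same side of the boundary as the high-probability training users.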
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (2)
1. A method for realizing gender recognition based on intelligent voice conversation is characterized by comprising the following steps:
s1: downloading the voice call of the user to a designated server;
s2: the voice call is cut into a plurality of voice segments, and voice features of the voice segments are extracted through OpenSmile respectively;
s3: analyzing the voice segments through a machine learning algorithm, extracting numerical parameters of voice, and obtaining gender probability of each voice segment through model fusion; performing statistical analysis on all gender probabilities to obtain a plurality of statistical characteristics, and performing final prediction on the gender of the user based on the statistical characteristics;
the machine learning algorithm is as follows: A1, after obtaining the high-dimensional voice features of the voice file through OpenSmile, constructing a labeled high-dimensional training set; performing feature selection on the high-dimensional training set with the python package of LightGBM, scoring the importance of each feature dimension by its degree of correlation with the label, selecting the N voice features with the highest importance, and reconstructing a lower-dimensional training set, N being taken as 150-200; A2, dividing the low-dimensional training set into a training set and a testing set at a ratio between 0.7:0.3 and 0.8:0.2; further dividing the training set into training set 1 and training set 2 at a ratio of 0.5:0.5; A3, adopting model fusion on training set 1: performing binary-classification training on the data of training set 1 with the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms respectively to obtain training results; feeding the training results of the four algorithms into a Logistic Regression to obtain four models, and fusing the prediction results of the four models; A4, transferring the training results and model parameters of training set 1 to training set 2, saving the training results and model parameters to a .m file with the sklearn.externals.joblib package, and performing gender prediction on the voice segments in training set 2 to obtain the gender probability of each segment;
the voice segments in training set 2 are grouped by user ID, and the gender prediction probabilities of the several voices under one user ID, including the maximum probability, the minimum probability, the mean probability, the median probability, the variance of the probabilities, the number of probabilities greater than 0.5 and the proportion of probabilities greater than 0.5, are used to construct the statistical features of each user ID.
2. The method of claim 1, wherein the voice features include high-dimensional features, such as MFCC and PCM-based descriptors, of up to 6000 dimensions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811624157.6A CN109378007B (en) | 2018-12-28 | 2018-12-28 | Method for realizing gender recognition based on intelligent voice conversation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811624157.6A CN109378007B (en) | 2018-12-28 | 2018-12-28 | Method for realizing gender recognition based on intelligent voice conversation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109378007A CN109378007A (en) | 2019-02-22 |
CN109378007B true CN109378007B (en) | 2022-09-13 |
Family
ID=65372048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811624157.6A Active CN109378007B (en) | 2018-12-28 | 2018-12-28 | Method for realizing gender recognition based on intelligent voice conversation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109378007B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110211569A (en) * | 2019-07-09 | 2019-09-06 | 浙江百应科技有限公司 | Real-time gender identification method based on voice map and deep learning |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5550928A (en) * | 1992-12-15 | 1996-08-27 | A.C. Nielsen Company | Audience measurement system and method |
CN102222500A (en) * | 2011-05-11 | 2011-10-19 | 北京航空航天大学 | Extracting method and modeling method for Chinese speech emotion combining emotion points |
CN102266241B (en) * | 2011-08-05 | 2013-04-17 | 上海交通大学 | Cooperative gender recognition method integrating face and fingerprint visual information |
CN103871413A (en) * | 2012-12-13 | 2014-06-18 | 上海八方视界网络科技有限公司 | Men and women speaking voice classification method based on SVM and HMM mixing model |
CN105513597B (en) * | 2015-12-30 | 2018-07-10 | 百度在线网络技术(北京)有限公司 | Voiceprint processing method and processing device |
CN106295507B (en) * | 2016-07-25 | 2019-10-18 | 华南理工大学 | A kind of gender identification method based on integrated convolutional neural networks |
- 2018-12-28: CN application CN201811624157.6A filed, granted as patent CN109378007B (status: active)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||