CN109378007B - Method for realizing gender recognition based on intelligent voice conversation - Google Patents

Method for realizing gender recognition based on intelligent voice conversation

Info

Publication number
CN109378007B
CN109378007B
Authority
CN
China
Prior art keywords
voice
training set
gender
training
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811624157.6A
Other languages
Chinese (zh)
Other versions
CN109378007A (en)
Inventor
刘鹏
林雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Baiying Technology Co Ltd
Original Assignee
Zhejiang Baiying Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Baiying Technology Co Ltd filed Critical Zhejiang Baiying Technology Co Ltd
Priority to CN201811624157.6A priority Critical patent/CN109378007B/en
Publication of CN109378007A publication Critical patent/CN109378007A/en
Application granted granted Critical
Publication of CN109378007B publication Critical patent/CN109378007B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Abstract

The invention discloses a method for realizing gender recognition based on intelligent voice conversation, filling the current gap in acquiring and applying gender information from intelligent voice dialogue data. To address shortcomings of the prior art such as low recognition accuracy, demanding environmental requirements and slow processing, the invention adopts a machine-learning multi-model fusion technique that combines Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms with probabilistic statistical analysis, realizing gender recognition on intelligent voice dialogue data, in particular voice data over a telephone channel. The method effectively improves the accuracy, efficiency, robustness and applicable range of the recognition results. It can be applied broadly in the field of customer relationship management to update, supplement and verify customer information, expanding enterprises' channels for acquiring customer data and reducing the cost of maintaining customer records; it is highly practical and convenient to popularize.

Description

Method for realizing gender recognition based on intelligent voice conversation
Technical Field
The invention relates to the field of intelligent voice recognition, in particular to a method for realizing gender recognition based on intelligent voice conversation.
Background
With the rapid development of technologies such as artificial intelligence, big data and cloud computing, and with changes in population structure, intelligent voice is increasingly applied across industries to replace or assist humans in large volumes of repetitive voice work, such as outbound calls or reception services in call centers, customer satisfaction surveys and questionnaires. An intelligent voice conversation preserves the interactive speech of both the robot and the human, and beyond textual semantics this corpus carries information in additional dimensions, such as gender, emotion and age. At present, applications of intelligent-voice-dialogue data remain essentially at the level of textual semantic understanding; the mining of the other dimensions, especially the acquisition and application of the speaker's gender information, remains blank.
The speech data of intelligent voice dialogues have two distinct characteristics that increase the difficulty of gender recognition: (1) the sound sources are complex: the two parties of the conversation are a robot and a human, so the data contain both robot speech (including but not limited to TTS, NLG and pre-recorded human voice) and human speech, involving two or more voices; (2) environmental noise and mixed sound are unavoidable: in practical applications, especially in intelligent voice conversation over a telephone channel, the 8 kHz sampling rate yields low sound quality, and environmental noise or the voices of other people and objects inevitably mix into the conversation, further increasing the difficulty of gender recognition.
At present there are two common approaches to voice gender recognition, whose schemes and respective drawbacks are as follows: (1) the first relies on the difference in pitch frequency between male and female voices, using a fixed frequency as a boundary: below it the voice is recognized as male, above it as female. This method is crude and has a high error rate. (2) The second is voiceprint recognition based on machine-learning/deep-learning techniques. It is slow, demands a quiet acoustic environment, and is unsuited to noisy conditions or overlapping speech; different microphones and channels also affect recognition performance. Under a telephone channel in particular, conventional machine-learning/deep-learning methods cannot handle voice gender recognition well.
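The fixed-boundary baseline criticized above can be stated in a few lines. This is a minimal sketch, not part of the invention; the 165 Hz boundary is an assumed illustrative value, chosen because typical adult male fundamental frequency (roughly 85-180 Hz) and female fundamental frequency (roughly 165-255 Hz) overlap around it:

```python
def naive_gender_by_pitch(f0_hz, boundary_hz=165.0):
    """Coarse prior-art baseline: a single fixed pitch boundary.

    Because male and female F0 ranges overlap near any single boundary,
    every choice of boundary_hz misclassifies part of the overlapping
    range - the high error rate the description criticizes.
    """
    return "male" if f0_hz < boundary_hz else "female"

low_pitch = naive_gender_by_pitch(120.0)   # well below the boundary
high_pitch = naive_gender_by_pitch(220.0)  # well above the boundary
```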
Disclosure of Invention
The invention aims to provide a method for realizing gender identification based on intelligent voice conversation, which aims to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme: a method for realizing gender identification based on intelligent voice conversation comprises the following steps:
s1: downloading the voice call of the user to a designated server;
s2: the voice call is cut into a plurality of voice segments, and voice features of the voice segments are extracted through OpenSmile respectively;
s3: analyzing the voice segments through a machine learning algorithm, extracting numerical parameters of voice, and obtaining gender probability of each voice segment through model fusion; and performing statistical analysis on all gender probabilities to obtain a plurality of statistical characteristics, and performing final prediction on the gender of the user based on the statistical characteristics.
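The segmentation in step S2 can be sketched in pure Python. This is a minimal illustration only: it assumes the call has already been decoded to a mono PCM sample sequence at the 8 kHz telephone rate mentioned in the background, and the 3-second segment length and the drop rule for short remainders are assumptions not specified by the invention:

```python
def split_call(samples, sample_rate=8000, segment_seconds=3.0):
    """Cut a decoded mono PCM sample sequence into fixed-length voice segments.

    Trailing segments shorter than half the target length are dropped, on
    the assumption that they carry too little voice to classify reliably.
    """
    seg_len = int(sample_rate * segment_seconds)
    segments = []
    for start in range(0, len(samples), seg_len):
        seg = samples[start:start + seg_len]
        if len(seg) >= seg_len // 2:
            segments.append(seg)
    return segments

# A 10-second call at 8 kHz yields three 3-second segments; the 1-second
# remainder is dropped as too short.
call = [0] * (8000 * 10)
segments = split_call(call)
```

Each resulting segment would then be written to a wav file and passed to OpenSmile for feature extraction.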
As a preferred technical solution of the present invention, the voice features include high-dimensional features such as MFCC and PCM, up to 6000 dimensions.
As a preferred technical solution of the present invention, the machine learning algorithm is as follows:
A1, after obtaining the high-dimensional voice features of a voice file through OpenSmile, a labeled high-dimensional data training set is constructed; feature selection is performed on this training set with the LightGBM Python package, the importance of each feature dimension is obtained from its degree of correlation with the label, the N voice features with the highest importance are selected, and a lower-dimensional training set is reconstructed, with N between 150 and 200;
A2, the low-dimensional data training set is divided into a training set and a test set at a ratio between 0.7:0.3 and 0.8:0.2; the training set is then split 0.5:0.5 into training set 1 and training set 2;
A3, model fusion is adopted on training set 1: Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms each perform binary-classification training on the data of training set 1 to obtain training results; the outputs of the four algorithms are fed into a Logistic Regression to obtain four models, whose predictions are fused;
A4, the training results and model parameters from training set 1 are carried over to training set 2, stored to a .m file via Python's sklearn.externals.joblib package, and used to perform gender prediction on the voice segments in training set 2 to obtain the gender probability of each segment.
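The top-N selection in step A1 reduces to sorting the per-dimension importances reported by LightGBM and keeping the N best columns. In this minimal sketch a synthetic importance vector stands in for LightGBM's output (a real run would take, e.g., the `feature_importances_` attribute of its scikit-learn interface); the helper names are illustrative:

```python
def select_top_features(importances, n):
    """Return the indices of the n feature dimensions with highest
    importance, sorted so column order in the reduced set is stable."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:n])

def reduce_rows(rows, keep):
    """Rebuild a lower-dimensional training set keeping only chosen columns."""
    return [[row[i] for i in keep] for row in rows]

# Synthetic stand-in: 10 dimensions whose importance rises with the index,
# so the top 4 are the last 4 columns.
importances = list(range(10))
keep = select_top_features(importances, 4)
reduced = reduce_rows([list(range(10))], keep)
```

The same logic applies unchanged when N is 150-200 and the rows are 6000-dimensional OpenSmile vectors.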
As a preferred technical scheme of the invention, the voice segments in training set 2 are grouped by user ID; from the gender prediction probabilities of the multiple voice segments under one user ID, statistical features are constructed, including the maximum, minimum, mean, median and variance of the probabilities, the number of probabilities greater than 0.5, and the proportion of probabilities greater than 0.5.
Compared with the prior art:
1. The invention provides a technical scheme for gender recognition on intelligent voice dialogue data, filling the current gap in acquiring and applying gender information from such data;
2. To address shortcomings of the prior art such as low recognition accuracy, demanding environmental requirements and slow processing, the invention adopts a machine-learning multi-model fusion technique that combines Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms with probabilistic statistical analysis, realizing gender recognition on intelligent voice dialogue data, in particular voice data over a telephone channel, and effectively improving the accuracy, efficiency, robustness and applicable range of the recognition results;
3. The invention can be applied broadly in the field of customer relationship management to update, supplement and verify customer information, expanding enterprises' channels for acquiring customer data and reducing the cost of maintaining customer records; it is highly practical and convenient to popularize.
Drawings
FIG. 1 is a schematic diagram of a speech recognition process of the present invention.
FIG. 2 is a diagram of a speech recognition model and training framework according to the present invention.
FIG. 3 is a schematic diagram of an application scenario framework of the speech gender recognition model of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-3, the present invention provides a method for implementing gender identification based on intelligent voice dialog, comprising:
s1: downloading the voice call of the user to a designated server;
s2: the voice call is cut into a plurality of voice segments, and voice features of the voice segments are extracted through OpenSmile respectively;
s3: analyzing the voice segments through a machine learning algorithm, extracting numerical parameters of the voice, and obtaining gender probability of each voice segment through model fusion; and performing statistical analysis on all gender probabilities to obtain a plurality of statistical characteristics, and performing final prediction on the gender of the user based on the statistical characteristics.
The voice features include high-dimensional features such as MFCC and PCM, up to 6000 dimensions.
The machine learning algorithm is as follows:
A1, after obtaining the high-dimensional voice features of a voice file through OpenSmile, a labeled high-dimensional data training set is constructed; feature selection is performed on this training set with the LightGBM Python package, the importance of each feature dimension is obtained from its degree of correlation with the label, the N voice features with the highest importance are selected, and a lower-dimensional training set is reconstructed, with N between 150 and 200;
A2, the low-dimensional data training set is divided into a training set and a test set at a ratio between 0.7:0.3 and 0.8:0.2; the training set is then split 0.5:0.5 into training set 1 and training set 2;
A3, model fusion is adopted on training set 1: Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms each perform binary-classification training on the data of training set 1 to obtain training results; the outputs of the four algorithms are fed into a Logistic Regression to obtain four models, whose predictions are fused;
A4, the training results and model parameters from training set 1 are carried over to training set 2, stored to a .m file via Python's sklearn.externals.joblib package, and used to perform gender prediction on the voice segments in training set 2 to obtain the gender probability of each segment.
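Steps A3-A4 describe a standard two-level stacking scheme. The following sketch is one reading of it under stated assumptions: synthetic 20-dimensional data stand in for the selected voice features, the fusing Logistic Regression is fitted on the base models' probabilities over training set 2 (so the fuser never sees the base learners' own training data), and all hyperparameters are illustrative defaults:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the LightGBM-selected voice features.
rng = np.random.RandomState(0)
X = rng.randn(400, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# First split: training vs. test (0.7:0.3); second: set 1 vs. set 2 (0.5:0.5).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X1, X2, y1, y2 = train_test_split(X_tr, y_tr, test_size=0.5, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    SVC(probability=True, random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
]
for model in base_models:
    model.fit(X1, y1)  # each base learner trains on training set 1 only

def stack(X_):
    """Column-stack the four per-sample class-1 probabilities."""
    return np.column_stack([m.predict_proba(X_)[:, 1] for m in base_models])

# Fuse the four outputs with a logistic regression trained on training set 2.
fuser = LogisticRegression().fit(stack(X2), y2)
proba = fuser.predict_proba(stack(X_te))[:, 1]  # fused gender probabilities
```

scikit-learn's `StackingClassifier` packages the same pattern; the explicit version above mirrors the A1-A4 split more directly.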
The voice segments in training set 2 are grouped by user ID, yielding the gender prediction probabilities of multiple voices under each user ID; the statistical features of each user ID are constructed from information including the maximum, minimum, mean, median and variance of these probabilities, the number of probabilities greater than 0.5, and the proportion of probabilities greater than 0.5.
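The user-level statistical features just enumerated can be computed directly from a segment-probability list with the standard library. The dictionary keys are illustrative names of my own; the invention only names the quantities themselves, and population variance is an assumed reading of "variance of the probability":

```python
import statistics

def user_statistics(probs):
    """Aggregate one user ID's per-segment gender probabilities into the
    statistical features enumerated above."""
    above = sum(1 for p in probs if p > 0.5)
    return {
        "max": max(probs),
        "min": min(probs),
        "mean": statistics.mean(probs),
        "median": statistics.median(probs),
        "variance": statistics.pvariance(probs),  # population variance
        "count_gt_half": above,
        "ratio_gt_half": above / len(probs),
    }

# Four segments of one hypothetical user; three probabilities exceed 0.5.
feats = user_statistics([0.9, 0.8, 0.6, 0.2])
```

A final Logistic Regression is then trained on these per-user feature vectors to produce the user-level gender prediction.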
The working principle is as follows: referring to FIG. 1, after the user's voice call information is obtained, the voice call is downloaded to a designated server and cut into multiple segments, and numerical features describing the voice information (high-dimensional features such as MFCC (Mel-frequency cepstral coefficients) and PCM (pulse-code modulation), up to 6000 dimensions) are extracted from the voice segments; the voice is then analyzed by a machine learning algorithm based on these features to judge the speaker's gender.
Referring to FIG. 2, the technical details of the machine learning framework and the model parameter training process are as follows:
1. After obtaining the high-dimensional voice features of a voice file (wav format) through OpenSmile, a labeled voice-feature data set is constructed as the high-dimensional data training set.
2. Feature selection is performed on the high-dimensional training set with the LightGBM Python package; the importance of each feature dimension is obtained from its degree of correlation with the label, the N most important voice features are selected, and a lower-dimensional voice-feature data set is reconstructed. Experiments showed the best effect with N between 150 and 200: too small an N loses information, while too large an N brings in redundant information; in practice N is taken as 180.
3. The data are first divided into a training set and a test set, the former for training parameters and the latter for checking accuracy; the ratio is generally between 0.7:0.3 and 0.8:0.2, and 0.7:0.3 is used in actual operation. The training set is then subdivided 0.5:0.5 into training set 1 and training set 2: training set 1 trains the classification of voice segments, while training set 2 trains the classification on the statistical features, so as to avoid overfitting.
4. Model fusion is adopted on training set 1: Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms each perform binary-classification training on the data of training set 1, and their outputs are fed into a Logistic Regression that fuses the four models' predictions. These are common algorithms whose packaged implementations can be called directly in Python and are not themselves an innovation of the invention.
5. The model parameters and training results from training set 1 are carried over to training set 2, and gender prediction is performed on the voice segments in training set 2 to obtain the gender probability of each segment. The training on training set 1 involves the four algorithms above, each of which yields its own parameters after training: for example, the parameters of the Logistic Regression and SVM algorithms are matrices, while those of the other two algorithms are tree-construction and classification rules. In practice, after training, the parameters of a trained model are saved to a .m file with Python's sklearn.externals.joblib package.
6. Because the voice of one user ID may consist of multiple segments, the voice segments in training set 2 are grouped by user ID, giving the gender prediction probabilities of the multiple voices under each user ID.
7. Statistical analysis of the multi-segment gender probability predictions of one user ID yields several statistical indexes, including the maximum, minimum, mean, median and variance of the probabilities, the number of probabilities greater than 0.5 and the proportion greater than 0.5, from which the statistical features of each user are constructed.
8. Binary-classification parameters of a Logistic Regression are trained on these user-level statistical features.
9. Model parameter training is complete.
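Step 5 above persists the trained parameters via sklearn.externals.joblib, a module since removed from scikit-learn in favor of the standalone joblib package (whose `joblib.dump`/`joblib.load` play the same role). A stdlib pickle equivalent of the save/reload round trip, with a plain dict standing in for a trained model:

```python
import os
import pickle
import tempfile

# Any trained estimator would do; a plain dict stands in for the model here.
model = {"coef": [0.4, -1.2], "intercept": 0.1}

# Persist trained parameters to a .m file, as the description does.
path = os.path.join(tempfile.mkdtemp(), "gender_model.m")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Reload them at prediction time.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

With modern scikit-learn the two `open` blocks would simply be `joblib.dump(model, path)` and `joblib.load(path)`.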
Referring to FIG. 3: 1. the application framework of the machine learning algorithm mirrors the training process, performing gender prediction on voice with the parameters obtained from the training of FIG. 2; 2. first, high-dimensional voice features are extracted with OpenSmile from the multiple voice segment files of the user to be analyzed; 3. the numerical parameters of the voice are extracted for the important features learned by LightGBM; 4. model fusion on these numerical features yields the gender probability of each voice segment; 5. statistical analysis of the predicted probabilities of all segments yields the statistical features; 6. the user's gender is finally predicted from these statistical features.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (2)

1. A method for realizing gender recognition based on intelligent voice conversation is characterized by comprising the following steps:
s1: downloading the voice call of the user to a designated server;
s2: the voice call is cut into a plurality of voice segments, and voice features of the voice segments are extracted through OpenSmile respectively;
s3: analyzing the voice segments through a machine learning algorithm, extracting numerical parameters of voice, and obtaining gender probability of each voice segment through model fusion; performing statistical analysis on all gender probabilities to obtain a plurality of statistical characteristics, and performing final prediction on the gender of the user based on the statistical characteristics;
the machine learning algorithm is as follows: a1, after obtaining the high-dimensional voice characteristics of the voice file by OpenSmile, constructing a high-dimensional data training set with labels; performing feature extraction on the high-dimensional data training set based on a python algorithm packet of LigntGBM, obtaining importance of each dimension of voice features according to the degree of correlation with a label, selecting the first N voice features with the highest importance, reconstructing a low-dimensional data training set with a lower dimension, and taking N as 150-200; a2, dividing the low-dimensional data training set into two parts of a training set and a testing set according to the proportion of (0.7: 0.3) to (0.8: 0.2); the training set is adjusted according to the following ratio of 0.5: dividing the ratio of 0.5 into a training set 1 and a training set 2; a3, performing model fusion in the training set 1, and performing two-class training on the data of the training set 1 by respectively using a Logistic Regression, an SVM, a Random Forest and a Gradient Boosting Classifer algorithm to respectively obtain training results; accessing the training results of the four algorithms to Logistic Regression to obtain four models, and performing model fusion on the prediction results of the four models; a4, transferring the training result and the model parameters of the training set 1 to a training set 2, storing the training result and the model parameters into a file m by using a skear. 
The voice segments in training set 2 are classified by user ID; under one user ID there are gender prediction probabilities for multiple voices, and the statistical features of each user ID are constructed from the maximum, minimum, mean, median and variance of these probabilities, the number of probabilities greater than 0.5 and the proportion of probabilities greater than 0.5.
2. The method of claim 1, wherein the speech features include high-dimensional features such as MFCC and PCM, up to 6000 dimensions.
CN201811624157.6A 2018-12-28 2018-12-28 Method for realizing gender recognition based on intelligent voice conversation Active CN109378007B (en)


Publications (2)

Publication Number Publication Date
CN109378007A CN109378007A (en) 2019-02-22
CN109378007B (en) 2022-09-13

Family

ID=65372048


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110211569A (en) * 2019-07-09 2019-09-06 浙江百应科技有限公司 Real-time gender identification method based on voice map and deep learning

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US5550928A (en) * 1992-12-15 1996-08-27 A.C. Nielsen Company Audience measurement system and method
CN102222500A (en) * 2011-05-11 2011-10-19 北京航空航天大学 Extracting method and modeling method for Chinese speech emotion combining emotion points
CN102266241B (en) * 2011-08-05 2013-04-17 上海交通大学 Cooperative gender recognition method integrating face and fingerprint visual information
CN103871413A (en) * 2012-12-13 2014-06-18 上海八方视界网络科技有限公司 Men and women speaking voice classification method based on SVM and HMM mixing model
CN105513597B (en) * 2015-12-30 2018-07-10 百度在线网络技术(北京)有限公司 Voiceprint processing method and processing device
CN106295507B (en) * 2016-07-25 2019-10-18 华南理工大学 A kind of gender identification method based on integrated convolutional neural networks



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant