CN109378007B - Method for realizing gender recognition based on intelligent voice conversation - Google Patents
Method for realizing gender recognition based on intelligent voice conversation
- Publication number: CN109378007B
- Application number: CN201811624157.6A
- Authority
- CN
- China
- Prior art keywords
- voice
- training set
- gender
- training
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
All classes fall under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING.
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices (under G10L17/00—Speaker identification or verification)
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue (under G10L15/00—Speech recognition)
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction (under G10L17/00)
- G10L17/04—Training, enrolment or model building (under G10L17/00)
- G10L17/22—Interactive procedures; Man-machine interfaces (under G10L17/00)
Abstract
The invention discloses a method for realizing gender recognition based on intelligent voice conversation, which fills the gap in the acquisition and application of gender information in current intelligent voice dialogue data. To address shortcomings of the prior art, such as low recognition accuracy, strict environmental requirements and low processing speed, the invention adopts a machine-learning multi-model fusion technique that combines the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms with probability-based statistical analysis, and realizes gender recognition on intelligent voice dialogue data, in particular voice data transmitted over a telephone channel. The method can effectively improve the accuracy, efficiency, robustness and application range of gender recognition. It can be applied in the field of customer relationship management to update, supplement and verify customer information, expand the ways in which enterprises acquire customer data, and reduce the cost of maintaining customers; it is highly practical and convenient to popularize and use.
Description
Technical Field
The invention relates to the field of intelligent voice recognition, in particular to a method for realizing gender recognition based on intelligent voice conversation.
Background
With the rapid development of technologies such as artificial intelligence, big data and cloud computing, and with changes in population structure, intelligent voice is increasingly applied across industries to replace or assist humans in large amounts of repetitive voice work, such as outbound calls or reception services in call centers, customer satisfaction surveys and questionnaires. An intelligent voice dialogue preserves the interactive speech of both the robot and the human, and beyond text semantics this corpus contains information in additional dimensions, such as gender, emotion and age. At present, the application of intelligent voice dialogue data basically stays at the level of text semantic understanding, while the mining of other dimensions, especially the acquisition and application of the speaker's gender information, remains blank.
The speech data of an intelligent voice dialogue has two distinct features that increase the difficulty of gender recognition: (1) the sound sources are complex: the two parties of the conversation are a robot and a human, so the recording contains both robot speech (including but not limited to TTS, NLG and pre-recorded human voice) and human speech, i.e. two or more voices; (2) environmental noise and mixed voices are unavoidable: in practical applications, especially in intelligent voice dialogue over a telephone channel, the 8K sampling rate yields low sound quality, and environmental noise or the voices of other people or objects inevitably mix into the conversation, which further increases the difficulty of gender recognition.
At present there are two common methods for voice gender recognition, with the following schemes and respective drawbacks: (1) The first relies on the difference in pitch frequency between male and female voices, using a fixed frequency as the boundary: voices below it are recognized as male and voices above it as female. This method is coarse and has a high error rate. (2) The second is voiceprint recognition based on machine-learning/deep-learning techniques. Its processing speed is low, it places strict requirements on the acoustic environment, it is unsuitable for noisy environments and overlapping speech, and different microphones and channels degrade its performance; in particular, conventional machine-learning/deep-learning methods cannot handle gender recognition of voices well under a telephone channel.
Disclosure of Invention
The invention aims to provide a method for realizing gender identification based on intelligent voice conversation, which aims to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme: a method for realizing gender identification based on intelligent voice conversation comprises the following steps:
S1: downloading the voice call of the user to a designated server;
S2: cutting the voice call into a plurality of voice segments, and extracting the voice features of each segment through OpenSmile;
S3: analyzing the voice segments through a machine learning algorithm, extracting the numerical parameters of the voice, and obtaining the gender probability of each voice segment through model fusion; performing statistical analysis on all gender probabilities to obtain a plurality of statistical characteristics, and making a final prediction of the gender of the user based on these statistical characteristics.
As a preferred technical solution of the present invention, the voice features include high-dimensional features, such as MFCC and PCM-based descriptors, of up to 6000 dimensions.
As a preferred technical solution of the present invention, the machine learning algorithm is as follows:
A1, after obtaining the high-dimensional voice features of the voice file through OpenSmile, constructing a labeled high-dimensional training set; performing feature selection on this training set with the python package of LightGBM, scoring the importance of each feature dimension by its degree of correlation with the label, selecting the N voice features with the highest importance, and reconstructing a lower-dimensional training set, where N is taken as 150-200;
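A minimal sketch of step A1's importance-based feature selection, using synthetic data in place of the OpenSmile features. sklearn's GradientBoostingClassifier stands in for LightGBM here, since both expose the same `feature_importances_` interface; the small N and variable names are illustrative only.

```python
# Sketch of step A1: rank the high-dimensional features by their importance to
# the label and keep only the top N. LightGBM (as in the patent) exposes the
# same feature_importances_ attribute as the sklearn stand-in used here.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_high = rng.normal(size=(200, 50))                # stand-in for ~6000-dim OpenSmile features
y = (X_high[:, 3] + X_high[:, 7] > 0).astype(int)  # stand-in gender labels

gbm = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_high, y)
N = 10                                             # the patent recommends N = 150-200
top_idx = np.argsort(gbm.feature_importances_)[::-1][:N]
X_low = X_high[:, top_idx]                         # reconstructed low-dimensional training set
```

Because the synthetic labels depend only on columns 3 and 7, those dimensions receive the highest importance and survive the selection.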
A2, dividing the low-dimensional training set into a training set and a testing set at a ratio between 0.7:0.3 and 0.8:0.2; then further dividing the training set into training set 1 and training set 2 at a ratio of 0.5:0.5.
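Step A2's two-stage split can be sketched with sklearn's `train_test_split`; the data here is a synthetic stand-in for the low-dimensional feature set.

```python
# Sketch of step A2: first split off a held-out test set at 0.7:0.3, then halve
# the remaining training data into training set 1 and training set 2.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(100, 20))  # stand-in low-dimensional features
y = np.tile([0, 1], 50)                              # stand-in gender labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
X_t1, X_t2, y_t1, y_t2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0, stratify=y_train)
```

With 100 samples this yields a 30-sample test set and two 35-sample training sets; stratifying keeps the label balance in each part.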
A3, adopting model fusion on training set 1: performing binary-classification training on the data of training set 1 with the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms respectively to obtain four trained models; feeding the outputs of the four algorithms into a Logistic Regression, and fusing the prediction results of the four models;
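The fusion in step A3 is a stacking arrangement: four base classifiers whose outputs feed a Logistic Regression. sklearn's `StackingClassifier` is one plausible realization of it, shown here on synthetic data standing in for training set 1.

```python
# Sketch of step A3: binary-classification training of the four base algorithms
# on training set 1, with a Logistic Regression fusing their prediction streams.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X1, y1 = make_classification(n_samples=300, n_features=20, random_state=0)  # stand-in training set 1
fusion = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # fuses the four base models
)
fusion.fit(X1, y1)
proba = fusion.predict_proba(X1)[:, 1]     # per-segment gender probability
```

The per-segment probabilities produced this way are exactly what the later statistical-analysis stage aggregates per user.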
A4, transferring the training results and model parameters of training set 1 to training set 2, saving the trained model parameters to a .m file with python's sklearn.externals.joblib package, and performing gender prediction on the voice segments in training set 2 to obtain the gender probability of each segment.
As a preferred technical scheme of the invention, the voice segments in training set 2 are grouped by user ID, and the gender prediction probabilities of the several voices under one user ID are used to construct statistical features, including the maximum probability, minimum probability, mean probability, median probability, variance of the probabilities, the number of probabilities greater than 0.5 and the proportion of probabilities greater than 0.5.
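The per-user statistics described above can be sketched as a small aggregation function; the segment probabilities passed in at the end are hypothetical.

```python
# Sketch of the per-user statistics: aggregate the per-segment gender
# probabilities of one user ID into the seven features the patent lists.
import numpy as np

def user_stats(probs):
    """Statistical features built from one user's segment-level probabilities."""
    p = np.asarray(probs, dtype=float)
    return {
        "max": p.max(),
        "min": p.min(),
        "mean": p.mean(),
        "median": float(np.median(p)),
        "var": p.var(),
        "n_gt_half": int((p > 0.5).sum()),
        "ratio_gt_half": float((p > 0.5).mean()),
    }

stats = user_stats([0.9, 0.8, 0.3, 0.7])  # hypothetical segment probabilities
```

For these four segments, three probabilities exceed 0.5, so the count is 3 and the proportion 0.75.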
Compared with the prior art:
1. The invention provides a technical scheme for gender recognition on intelligent voice dialogue data, filling the gap in the acquisition and application of gender information in such data.
2. To address shortcomings of the prior art, such as low recognition accuracy, strict environmental requirements and low processing speed, the invention adopts a machine-learning multi-model fusion technique that combines the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms with probability-based statistical analysis, realizes gender recognition on intelligent voice dialogue data, in particular voice data under a telephone channel, and can effectively improve the accuracy, efficiency, robustness and application range of the recognition results.
3. The invention can be applied in the field of customer relationship management to update, supplement and verify customer information, expand the ways in which enterprises acquire customer data, and reduce the cost of maintaining customers; it is highly practical and convenient to popularize and use.
Drawings
FIG. 1 is a schematic diagram of a speech recognition process of the present invention.
FIG. 2 is a diagram of a speech recognition model and training framework according to the present invention.
FIG. 3 is a schematic diagram of an application scenario framework of the speech gender recognition model of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-3, the present invention provides a method for implementing gender identification based on intelligent voice dialog, comprising:
S1: downloading the voice call of the user to a designated server;
S2: cutting the voice call into a plurality of voice segments, and extracting the voice features of each segment through OpenSmile;
S3: analyzing the voice segments through a machine learning algorithm, extracting the numerical parameters of the voice, and obtaining the gender probability of each voice segment through model fusion; performing statistical analysis on all gender probabilities to obtain a plurality of statistical characteristics, and making a final prediction of the gender of the user based on these statistical characteristics.
The voice features include high-dimensional features, such as MFCC and PCM-based descriptors, of up to 6000 dimensions.
The machine learning algorithm is as follows:
A1, after obtaining the high-dimensional voice features of the voice file through OpenSmile, constructing a labeled high-dimensional training set; performing feature selection on this training set with the python package of LightGBM, scoring the importance of each feature dimension by its degree of correlation with the label, selecting the N voice features with the highest importance, and reconstructing a lower-dimensional training set, where N is taken as 150-200;
A2, dividing the low-dimensional training set into a training set and a testing set at a ratio between 0.7:0.3 and 0.8:0.2; then further dividing the training set into training set 1 and training set 2 at a ratio of 0.5:0.5;
A3, adopting model fusion on training set 1: performing binary-classification training on the data of training set 1 with the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms respectively to obtain four trained models; feeding the outputs of the four algorithms into a Logistic Regression, and fusing the prediction results of the four models;
A4, transferring the training results and model parameters of training set 1 to training set 2, saving the trained model parameters to a .m file with python's sklearn.externals.joblib package, and performing gender prediction on the voice segments in training set 2 to obtain the gender probability of each segment.
The voice segments in training set 2 are grouped by user ID, yielding the gender prediction probabilities of several voices under each user ID; the statistical features of each user ID are constructed from information such as the maximum probability, minimum probability, mean probability, median probability, variance of the probabilities, the number of probabilities greater than 0.5 and the proportion of probabilities greater than 0.5.
The working principle is as follows: referring to Fig. 1 of the specification, after the voice call information of a user is obtained, the voice call is downloaded to a designated server and cut into a plurality of segments, and numerical features describing the voice (including high-dimensional features such as MFCC and PCM of up to 6000 dimensions) are extracted from the segments; the voice is then analyzed through a machine learning algorithm based on these features, so as to judge the gender of the speaker.
Referring to Fig. 2 of the specification, the technical details of the machine learning framework and the model parameter training process are as follows. 1. After the high-dimensional voice features of the voice file (wav format) are obtained through OpenSmile, a labeled voice-feature data set is constructed as the high-dimensional training set. 2. Feature selection is performed on the high-dimensional training set with the python package of LightGBM: the importance of each feature dimension is scored by its degree of correlation with the label, the N voice features with the highest importance are selected, and a lower-dimensional voice-feature data set is reconstructed; experiments show the best results with N between 150 and 200 (too small an N loses information, too large an N brings in redundant information), and in practice N is taken as 180. 3. The data is first divided into a training set, used to train the parameters, and a testing set, used to check the accuracy, at a ratio generally between 0.7:0.3 and 0.8:0.2 (0.7:0.3 in actual operation); the training set is then subdivided at a ratio of 0.5:0.5 into training set 1, used to train the classification of the voice segments, and training set 2, used to train the classification on the statistical features, which avoids overfitting. 4. Model fusion is adopted on training set 1: the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms each perform binary-classification training on the data of training set 1; the outputs of the four algorithms are fed into a Logistic Regression, and the prediction results of the four models are fused (these are common algorithms whose packaged implementations can be called directly in python, and they are not themselves an innovation of the invention). 5. The model parameters and training results of training set 1 are transferred to training set 2, and gender prediction is performed on the voice segments of training set 2 to obtain the gender probability of each segment; each of the four algorithms yields its own parameters after training, for example the parameters of the Logistic Regression and SVM algorithms are matrices, while those of the other two are tree structures and classification rules; in practice, the parameters of the trained models are saved into a .m file with python's sklearn.externals.joblib package after training. 6. Because the voice of one user ID may consist of several segments, the voice segments in training set 2 are grouped by user ID to obtain the gender prediction probabilities of the several voices under each user ID. 7.
statistical analysis is performed on the multi-segment gender probability predictions of one user ID to obtain several statistical indicators of the segment-level results, including the maximum, minimum, mean, median and variance of the probabilities, the number of probabilities greater than 0.5 and the proportion of probabilities greater than 0.5, from which the statistical features of each user are constructed; 8. the binary-classification parameters of a Logistic Regression are trained on the users' statistical features; 9. the model parameter training is complete.
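The parameter persistence mentioned in step 5 can be sketched as follows. The patent names the sklearn.externals.joblib package and a .m file; the standalone joblib package used here is its modern equivalent (sklearn.externals.joblib was removed from scikit-learn), and the model, data and path are illustrative.

```python
# Sketch of the model persistence in step 5: dump the trained parameters to a
# .m file and reload them for prediction on training set 2. The standalone
# joblib package replaces the deprecated sklearn.externals.joblib.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "gender_model.m")
joblib.dump(model, path)        # save the trained parameters
restored = joblib.load(path)    # reload for prediction on new segments
```

Reloading restores the fitted coefficients exactly, so the restored model reproduces the original predictions.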
Referring to Fig. 3 of the specification: 1. the application framework of the machine learning algorithm is similar to the training process, performing gender prediction on voice with the parameters obtained during training; 2. first, the several voice segment files of the user to be analyzed are processed through OpenSmile to obtain their high-dimensional voice features; 3. the numerical parameters of the voice are extracted according to the important features learned by LightGBM; 4. model fusion is applied to the numerical feature data of the voice to obtain the gender probability of each voice segment; 5. statistical analysis of the predicted gender probabilities of the voice segments yields the statistical features; 6. the gender of the user is finally predicted based on these statistical features.
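The final stage of the application framework (steps 5 and 6) can be sketched compactly: per-user statistical features are built from segment probabilities and fed to the final Logistic Regression. All user IDs, per-user probabilities and gender labels below are hypothetical stand-ins, not data from the patent.

```python
# Sketch of the application stage: aggregate each user's segment probabilities
# into statistical features and classify with the final Logistic Regression.
# The per-user probabilities and gender labels below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stats_vector(probs):
    # the seven per-user statistics listed in the patent
    p = np.asarray(probs, dtype=float)
    return [p.max(), p.min(), p.mean(), float(np.median(p)), p.var(),
            int((p > 0.5).sum()), float((p > 0.5).mean())]

seg_probs = {"u1": [0.9, 0.8, 0.7], "u2": [0.1, 0.2, 0.4],
             "u3": [0.6, 0.9, 0.55], "u4": [0.3, 0.2, 0.1]}
labels = {"u1": 1, "u2": 0, "u3": 1, "u4": 0}

X_stats = np.array([stats_vector(p) for p in seg_probs.values()])
y_stats = np.array([labels[u] for u in seg_probs])

final_clf = LogisticRegression().fit(X_stats, y_stats)   # step 8 of training
pred = final_clf.predict(np.array([stats_vector([0.85, 0.9, 0.75])]))
```

A new user whose segments are all confidently above 0.5 lands on the same side of the boundary as the high-probability training users.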
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (2)
1. A method for realizing gender recognition based on intelligent voice conversation is characterized by comprising the following steps:
s1: downloading the voice call of the user to a designated server;
s2: the voice call is cut into a plurality of voice segments, and voice features of the voice segments are extracted through OpenSmile respectively;
s3: analyzing the voice segments through a machine learning algorithm, extracting numerical parameters of voice, and obtaining gender probability of each voice segment through model fusion; performing statistical analysis on all gender probabilities to obtain a plurality of statistical characteristics, and performing final prediction on the gender of the user based on the statistical characteristics;
the machine learning algorithm is as follows: A1, after obtaining the high-dimensional voice features of the voice file through OpenSmile, constructing a labeled high-dimensional training set; performing feature selection on the high-dimensional training set with the python package of LightGBM, scoring the importance of each feature dimension by its degree of correlation with the label, selecting the N voice features with the highest importance, and reconstructing a lower-dimensional training set, N being taken as 150-200; A2, dividing the low-dimensional training set into a training set and a testing set at a ratio between 0.7:0.3 and 0.8:0.2; further dividing the training set into training set 1 and training set 2 at a ratio of 0.5:0.5; A3, adopting model fusion on training set 1: performing binary-classification training on the data of training set 1 with the Logistic Regression, SVM, Random Forest and Gradient Boosting Classifier algorithms respectively to obtain training results; feeding the training results of the four algorithms into a Logistic Regression to obtain four models, and fusing the prediction results of the four models; A4, transferring the training results and model parameters of training set 1 to training set 2, saving the training results and model parameters to a .m file with the sklearn.externals.joblib package, and performing gender prediction on the voice segments in training set 2 to obtain the gender probability of each segment;
the voice segments in training set 2 are grouped by user ID, and the gender prediction probabilities of the several voices under one user ID, including the maximum probability, the minimum probability, the mean probability, the median probability, the variance of the probabilities, the number of probabilities greater than 0.5 and the proportion of probabilities greater than 0.5, are used to construct the statistical features of each user ID.
2. The method of claim 1, wherein the voice features include high-dimensional features, such as MFCC and PCM-based descriptors, of up to 6000 dimensions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811624157.6A CN109378007B (en) | 2018-12-28 | 2018-12-28 | Method for realizing gender recognition based on intelligent voice conversation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811624157.6A CN109378007B (en) | 2018-12-28 | 2018-12-28 | Method for realizing gender recognition based on intelligent voice conversation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109378007A CN109378007A (en) | 2019-02-22 |
CN109378007B true CN109378007B (en) | 2022-09-13 |
Family
ID=65372048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811624157.6A Active CN109378007B (en) | 2018-12-28 | 2018-12-28 | Method for realizing gender recognition based on intelligent voice conversation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109378007B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110211569A (en) * | 2019-07-09 | 2019-09-06 | 浙江百应科技有限公司 | Real-time gender identification method based on voice map and deep learning |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5550928A (en) * | 1992-12-15 | 1996-08-27 | A.C. Nielsen Company | Audience measurement system and method |
CN102222500A (en) * | 2011-05-11 | 2011-10-19 | 北京航空航天大学 | Extracting method and modeling method for Chinese speech emotion combining emotion points |
CN102266241B (en) * | 2011-08-05 | 2013-04-17 | 上海交通大学 | Cooperative gender recognition method integrating face and fingerprint visual information |
CN103871413A (en) * | 2012-12-13 | 2014-06-18 | 上海八方视界网络科技有限公司 | Men and women speaking voice classification method based on SVM and HMM mixing model |
CN105513597B (en) * | 2015-12-30 | 2018-07-10 | 百度在线网络技术(北京)有限公司 | Voiceprint processing method and processing device |
CN106295507B (en) * | 2016-07-25 | 2019-10-18 | 华南理工大学 | A kind of gender identification method based on integrated convolutional neural networks |
- 2018-12-28: CN application CN201811624157.6A filed, granted as patent CN109378007B (status: active)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||