CN114203177A - Intelligent voice question-answering method and system based on deep learning and emotion recognition - Google Patents
- Publication number
- CN114203177A (application CN202111475872.XA)
- Authority
- CN
- China
- Prior art keywords
- emotion
- voice
- question
- audio data
- num
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Abstract
The invention discloses an intelligent voice question-answering method based on deep learning and emotion recognition, comprising the following steps: acquiring speech from a user and processing it to extract the corresponding text information; extracting characteristic parameters from the user's speech and performing emotion analysis on them to form an emotion label; and inputting the generated text information into a trained semantic representation and matching model which, combined with the obtained emotion label, is matched against a question-answer library to obtain and output the answer to the question. The invention addresses two technical problems of existing intelligent question-answering systems: low accuracy caused by the limited application of deep learning algorithms to the ambiguity and complexity of natural language, and matching deviations caused by data redundancy and the single dimensionality of the information captured from Chinese sentences.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an intelligent voice question-answering method and system based on deep learning and emotion recognition.
Background
With the advent of the internet age, data has grown exponentially, and people have begun to use search engines to find the information they want. A search engine is one form of intelligent question-answering system, presenting users with answers to their questions in list form. As people demand higher efficiency and quality from information retrieval, research on intelligent voice question-answering systems has begun to develop further.
Existing intelligent voice question-answering systems usually work in two steps: speech recognition is performed first, and the question is then answered based on the recognized speech. Speech recognition techniques are generally divided into three steps: feature extraction, deep neural network training, and decoding. Question-answering systems fall into three main categories: extractive intelligent question-answering systems, generative intelligent question-answering systems, and intelligent question-answering systems based on question-answer pairs. All of them use natural language processing technology to convert sentences into structured data a computer can recognize, analyze the question, capture the keywords, and finally match against an answer library.
However, the above intelligent question-answering systems still have certain defects. First, because of the ambiguity and complexity of natural language, deep learning algorithms are not applied extensively in natural language processing, so the accuracy of the question-answering system is low. Second, when capturing Chinese sentence information, data redundancy and single dimensionality still occur, which easily causes systematic matching bias.
Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides an intelligent voice question-answering system based on deep learning and emotion recognition, to solve two technical problems of existing intelligent question-answering systems: low accuracy of the question-answering system, caused by the limited application of deep learning algorithms to the ambiguity and complexity of natural language, and system matching deviation, easily caused by the data redundancy and single dimensionality that still occur when Chinese sentence information is captured.
In order to achieve the above object, according to an aspect of the present invention, there is provided an intelligent voice question-answering method based on deep learning and emotion recognition, comprising the steps of:
(1) acquiring the voice from the user, and processing the voice to extract the corresponding text information;
(2) extracting characteristic parameters of the user's voice, and performing emotion analysis on the voice according to the extracted characteristic parameters to form an emotion label;
(3) inputting the text information generated in step (1) into a trained semantic representation and matching model and, combined with the emotion label obtained in step (2), matching against the question-answer library to obtain and output the answer to the question.
Preferably, step (1) specifically comprises the following sub-steps:
(1-1) acquiring the voice from the user and digitally encoding it to convert it into an audio map;
(1-2) using a Hamming window to perform framing and windowing on the audio map obtained in step (1-1) in sequence, and performing a Fourier transform on the result to obtain a spectrogram;
(1-3) inputting the spectrogram obtained in step (1-2) into a trained convolutional neural network (CNN) to obtain the corresponding text information.
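The framing, Hamming-window, and Fourier-transform pipeline of steps (1-1) and (1-2) can be sketched as follows; the frame length, hop size, and FFT size are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Frame the signal, apply a Hamming window to each frame, then take
    the magnitude FFT of each windowed frame (steps (1-1)-(1-2))."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # magnitude spectrum of each windowed (zero-padded) frame
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)  # 1 s, 440 Hz tone
S = spectrogram(sig)
print(S.shape)  # (98, 257)
```

The resulting time-frequency matrix is what step (1-3) would feed to the trained CNN.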
Preferably, step (2) comprises in particular the following sub-steps:
(2-1) framing the audio map obtained in step (1-1) with a Hamming window, thereby obtaining the preprocessed audio data.
(2-2) for each frame of the audio data preprocessed in step (2-1), acquiring its time-domain features, and merging the time-domain features of all frames to obtain num initial feature parameters, each of dimension 4.
(2-3) performing dimensionality reduction and feature selection on the num initial feature parameters obtained in step (2-2) by using principal component analysis (PCA), to obtain a mapping matrix A formed from the feature-parameter subset.
(2-4) calculating the comprehensive probabilities P corresponding to the principal components of the audio data preprocessed in step (2-1) according to the mapping matrix A obtained in step (2-3), and selecting the emotion type with the maximum comprehensive probability as the emotion label of the voice.
Preferably, step (2-2) comprises the sub-steps of:
(2-2-1) for each frame of the audio data preprocessed in step (2-1), acquiring the short-time energy E_n of that frame;
(2-2-2) for each frame of the audio data preprocessed in step (2-1), obtaining the short-time average zero-crossing rate ZCR_n of that frame;
(2-2-3) for each frame of the audio data preprocessed in step (2-1), acquiring the formant characteristic parameter HZ_n of that frame;
(2-2-4) for each frame of the audio data preprocessed in step (2-1), extracting the pitch period in the time domain by using a short-time autocorrelation function, i.e., selecting d groups of signals (d ranging from 1 to 5, preferably 3) around the nth frame with a window function and performing an autocorrelation calculation to obtain the pitch frequency R_n of that frame.
Preferably, the short-time energy E_n of the nth frame of audio data in step (2-2-1) is:
E_n = Σ_{m=1}^{N} x_n(m)^2
wherein x_n(m) represents the mth sample of the nth frame (n ∈ [1, num]) of the audio map obtained in step (1-1) after framing, and N represents the number of samples contained in one frame, which depends on the specific audio format.
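A minimal numpy sketch of the short-time energy computation above, assuming the framed audio is already available as a (num, N) array; the sample frames are made-up values.

```python
import numpy as np

def short_time_energy(frames):
    """E_n = sum over m of x_n(m)^2 for each frame n (step (2-2-1)).
    `frames` is a (num, N) array of framed audio samples."""
    return np.sum(frames ** 2, axis=1)

frames = np.array([[0.0, 0.0, 0.0, 0.0],    # silence
                   [0.5, -0.5, 0.5, -0.5],  # low-amplitude frame
                   [1.0, 1.0, 1.0, 1.0]])   # high-amplitude frame
E = short_time_energy(frames)
print(E)  # [0. 1. 4.]
```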
Preferably, step (2-3) comprises the sub-steps of:
(2-3-1) constructing a feature matrix A_{4×num} = [a_1, a_2, ..., a_num] with 4 rows and num columns from the num 4-dimensional initial feature parameters obtained in step (2-2), as the input of collected feature data;
(2-3-2) calculating the mean μ and the covariance matrix COV of the feature matrix A_{4×num} obtained in step (2-3-1);
(2-3-3) calculating the eigenvalues and eigenvectors of the covariance matrix COV from step (2-3-2), pairing each eigenvector with its eigenvalue in coordinate form (λ_i, e_i), and sorting by eigenvalue to obtain the eigenvector sequence (λ_1, e_1), (λ_2, e_2), ..., (λ_num, e_num);
(2-3-4) from the eigenvector sequence (λ_1, e_1), ..., (λ_num, e_num) of step (2-3-3), selecting the first k eigenvectors (k ∈ [1, num]) as the directions of the principal components (i.e., the basis vectors of the reduced-dimension space), and constructing a mapping matrix A with 4 rows and k columns, in which the ith column represents the ith eigenvector.
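The PCA sub-steps above (covariance, eigendecomposition, top-k selection) can be sketched as follows; the random 4 × num feature matrix and choice of k are illustrative assumptions.

```python
import numpy as np

def pca_mapping(A, k):
    """Build the 4 x k mapping matrix from a 4 x num feature matrix A
    by eigendecomposition of its covariance (steps (2-3-1)-(2-3-4))."""
    cov = np.cov(A)                         # 4 x 4 covariance, rows = features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues descending
    return eigvecs[:, order[:k]]            # top-k eigenvectors as columns

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 100))  # 100 frames of 4-dimensional features
M = pca_mapping(A, k=2)
print(M.shape)  # (4, 2)
```

The columns of the returned matrix are orthonormal principal directions, matching the "basis vectors of the reduced-dimension space" described in step (2-3-4).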
Preferably, step (2-4) comprises the sub-steps of:
(2-4-1) for each emotion type, using the original data in the corpus to calculate the mean μ_jk and variance σ_jk of the kth principal component of the mapping matrix A in the jth emotion type, and performing maximum-separability processing to obtain the discrimination capability H_k of the kth principal component for the emotion types;
wherein J is the number of emotion classes in the corpus employed, L_k represents the separation of the kth principal component between the emotion classes, M_k represents the concentration of the kth principal component within the emotion classes, and C is a permutation-and-combination function; H_k reflects the discriminative power of the principal component over the emotion classes, and the larger H_k is, the stronger the ability of the extracted principal component to distinguish the emotion types;
(2-4-2) sorting the k principal components of the mapping matrix A by the discrimination capability H_k obtained in step (2-4-1), selecting the first p principal components with the largest H_k as the pivot elements for emotion recognition, and normalizing them to obtain the normalized emotion-recognition pivots; summing the projections onto the mapping matrix A over all the principal components gives the score value S_k of each emotion-recognition pivot;
(2-4-3) according to the score value S_k of each emotion-recognition pivot obtained in step (2-4-2), calculating the comprehensive probability P corresponding to each principal component in the user's speech from the score values, and selecting the emotion type with the maximum comprehensive probability as the emotion label of the user's speech.
preferably, step (3) comprises in particular the following sub-steps:
(3-1) performing word-segmentation preprocessing on the Chinese information in the corpus to obtain a word-segmented corpus;
(3-2) encoding and extracting features from the word-segmented corpus obtained in step (3-1) with a word2vec model to obtain a word matrix W of order c × v, where c is the number of words in the sentence and v is the vector dimension of each word, with the model dimension set to v = 300; the word matrix W serves as the input for training the semantic representation and matching model;
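The construction of the c × v word matrix W in step (3-2) can be illustrated as follows; a deterministic toy embedding stands in for a trained word2vec model, and the sample sentence is an assumption for the example.

```python
import numpy as np

V = 300  # vector dimension v = 300, as set in step (3-2)

def embed(word, v=V):
    # Deterministic toy embedding standing in for a word2vec lookup:
    # seeds a generator from the word's characters so the same word
    # always maps to the same v-dimensional vector.
    seed = sum(ord(ch) for ch in word)
    rng = np.random.default_rng(seed)
    return rng.normal(size=v)

sentence = ["今天", "天气", "怎么样"]        # an already word-segmented sentence
W = np.stack([embed(w) for w in sentence])  # c x v word matrix, c = 3 words
print(W.shape)  # (3, 300)
```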
(3-3) inputting the word matrix W obtained in step (3-2) into the semantic representation and matching model for training, to obtain the trained semantic representation and matching model;
(3-4) after word-segmentation preprocessing of the text information generated in step (1), encoding and extracting features with the word2vec model, and inputting the result into the semantic representation and matching model trained in step (3-3) to obtain the feature vector of the voice analysis;
(3-5) calculating the Euclidean distance between the feature vector of the voice analysis obtained in step (3-4) and the existing question vectors as the vector similarity, and, combined with the emotion label obtained in step (2), matching question answers in the question-answer library corresponding to that emotion label to obtain and output the answer to the question.
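The nearest-question lookup of step (3-5) can be sketched as below; the stored question vectors and answers are illustrative placeholders, and in the full system the search would be restricted to the question-answer library matching the recognized emotion label.

```python
import numpy as np

def match_answer(query_vec, question_vecs, answers):
    """Pick the answer whose stored question vector has the smallest
    Euclidean distance to the query's feature vector (step (3-5))."""
    dists = np.linalg.norm(question_vecs - query_vec, axis=1)
    return answers[int(np.argmin(dists))]

qs = np.array([[1.0, 0.0],   # toy 2-d question vectors
               [0.0, 1.0],
               [0.7, 0.7]])
ans = ["answer A", "answer B", "answer C"]
print(match_answer(np.array([0.1, 0.9]), qs, ans))  # answer B
```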
Preferably, the semantic representation and the matching model in step (3-3) are trained by the following steps:
(3-3-1) using the c x v-order word matrix W obtained in the step (3-2) as the input of the CNN network;
(3-3-2) updating and optimizing the weight parameters and the bias parameters of each layer in the CNN convolutional neural network by using a back propagation algorithm to obtain an updated CNN convolutional neural network;
(3-3-3) carrying out iterative training on the CNN convolutional neural network updated in the step (3-3-2) until the loss function of the CNN convolutional neural network reaches the minimum, thereby obtaining a preliminarily trained CNN convolutional neural network;
wherein the loss function L of the CNN convolutional neural network is:
where Sample represents the total number of samples in the training set, Z represents the number of classes in the training set, t_{i,z} represents the prediction result of the ith training sample of the zth class after it is input into the CNN convolutional neural network, y_{i,z} represents the corresponding true result, i ∈ [1, Sample], λ represents the degree of regularization, which is 0.007, and w_{i,z} represents the weight parameter of the ith training sample of the zth class when it is input into the CNN convolutional neural network, which changes as the network is trained.
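The loss described above (cross-entropy over Sample training samples and Z classes, plus λ-weighted regularization of the weights w) admits the following standard reading; the exact formula appears only as an image in the original filing, so this numpy sketch is a hedged reconstruction, not the patent's verbatim equation.

```python
import numpy as np

def loss(t, y, w, lam=0.007):
    """Mean cross-entropy over all samples plus L2 weight regularization:
    a plausible reading of the CNN loss L described in step (3-3-3)."""
    sample = t.shape[0]
    ce = -np.sum(y * np.log(t)) / sample  # cross-entropy term
    return ce + lam * np.sum(w ** 2)      # lambda-weighted penalty

t = np.array([[0.9, 0.1], [0.2, 0.8]])  # predicted class probabilities t_{i,z}
y = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot true results y_{i,z}
w = np.zeros((2, 2))                    # toy weights (zero -> no penalty)
print(round(loss(t, y, w), 4))  # 0.1643
```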
(3-3-4) for the CNN convolutional neural network preliminarily trained in step (3-3-3), extracting the maximum-eigenvalue matrix C after convolution of the convolution kernel vectors by adopting a 1-max pooling strategy, calculated according to the following formula:
C = max[c_1, c_2, ..., c_{n-h+1}]
wherein c_i represents the feature computed by the convolution kernel over the ith word window of the sentence, produced during the training of the CNN convolutional neural network.
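The 1-max pooling of step (3-3-4) simply keeps each convolution filter's largest response over the n-h+1 window positions; the two feature maps below are made-up numbers for illustration.

```python
import numpy as np

feature_maps = np.array([[0.2, 1.5, 0.3],   # filter 1: c_1 .. c_{n-h+1}
                         [0.9, 0.1, 0.4]])  # filter 2
C = feature_maps.max(axis=1)  # 1-max pooling: one value per filter
print(C)  # [1.5 0.9]
```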
(3-3-5) taking the maximum-eigenvalue matrix C obtained in step (3-3-4) as the training input of the BiLSTM network, performing feature extraction on it, constructing the feature vector of the voice analysis, and taking that feature vector as the output of the semantic representation and matching model, thereby obtaining the trained semantic representation and matching model.
According to another aspect of the present invention, there is provided an intelligent voice question-answering system based on deep learning and emotion recognition, comprising:
a first module, for acquiring voice from a user and processing it to extract the corresponding text information;
a second module, for extracting characteristic parameters of the user's voice and performing emotion analysis on the voice according to the extracted characteristic parameters to form an emotion label;
and a third module, for inputting the text information generated by the first module into the trained semantic representation and matching model and, combined with the emotion label obtained by the second module, matching against the question-answer library to obtain and output the answer to the question.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) In step (3), the invention integrates a convolutional neural network with a BiLSTM network to construct the semantic representation and matching model that yields the feature vector of the voice analysis, applying a more complete deep learning algorithm than existing question-answering systems.
(2) Because step (2) performs emotion analysis on the user's voice and forms an emotion label, the input dimensionality of the question-answering system is enriched, which can overcome the data redundancy and single dimensionality of existing question-answering systems.
Drawings
FIG. 1 is a schematic flow chart of an intelligent voice question-answering system based on deep learning and emotion recognition according to the present invention;
FIG. 2 is a detailed flow chart of the intelligent voice question-answering system based on deep learning and emotion recognition.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1 and 2, the present invention provides an intelligent voice question-answering method based on deep learning and emotion recognition, comprising the following steps:
(1) acquiring the voice from the user, and processing the voice to extract the corresponding text information.
Further, step (1) specifically comprises the following substeps:
(1-1) acquiring the voice from the user and digitally encoding it to convert it into an audio map;
(1-2) using a Hamming window to perform framing and windowing on the audio map obtained in step (1-1) in sequence, and performing a Fourier transform on the result to obtain a spectrogram;
(1-3) inputting the spectrogram obtained in step (1-2) into a trained convolutional neural network (CNN) to obtain the corresponding text information.
(2) extracting characteristic parameters of the user's voice, and performing emotion analysis on the voice according to the extracted characteristic parameters to form an emotion label.
Further, the step (2) specifically comprises the following sub-steps:
and (2-1) framing the audio image obtained in the step (1-1) by utilizing a Hamming window (each audio record file after framing correspondingly generates a one-dimensional frame array), thereby obtaining the preprocessed audio data.
And (2-2) for each frame of audio data in the audio data preprocessed in the step (2-1), acquiring time domain features (including an average pronunciation rate, a pitch frequency, a short-time energy change, a short-time average zero-crossing rate and a formant frequency) of the frame of audio data, and merging the time domain features corresponding to all the frames to obtain num initial feature parameters (the dimension of which is 4).
Specifically, the processing of the audio data in this step includes the following sub-steps:
(2-2-1) for each frame of the audio data preprocessed in step (2-1), acquiring the short-time energy E_n of that frame;
Specifically, let x_n(m) be the mth sample of the nth frame (n ∈ [1, num]) of the audio map obtained in step (1-1) after framing; the short-time energy E_n of the nth frame is:
E_n = Σ_{m=1}^{N} x_n(m)^2
where N represents the number of samples contained in a frame, depending on the particular audio format (e.g., AAC fixes a frame at 1024 samples, while the MP3 format uses 1152).
(2-2-2) for each frame of the audio data preprocessed in step (2-1), obtaining the short-time average zero-crossing rate ZCR_n of that frame;
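The short-time average zero-crossing rate of step (2-2-2) can be sketched with the common textbook definition, ZCR = (1/2) · mean |sgn x(m) − sgn x(m−1)|; the patent's own formula appears only as an image in the original filing, so this form is an assumption.

```python
import numpy as np

def zcr(frame):
    """Short-time average zero-crossing rate of one frame (step (2-2-2)):
    half the average absolute change of the sign sequence."""
    s = np.sign(frame)
    return 0.5 * float(np.mean(np.abs(np.diff(s))))

frame = np.array([1.0, -1.0, 1.0, -1.0])  # alternates sign every sample
print(zcr(frame))  # 1.0
```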
(2-2-3) for each frame of the audio data preprocessed in step (2-1), obtaining the formant characteristic parameter HZ_n of that frame;
(2-2-4) for each frame of the audio data preprocessed in step (2-1), extracting the pitch period in the time domain by using a short-time autocorrelation function, i.e., selecting d groups of signals (d ranging from 1 to 5, preferably 3) around the nth frame with a window function and performing an autocorrelation calculation to obtain the pitch frequency R_n of that frame.
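A minimal sketch of pitch estimation from the peak of the short-time autocorrelation, as in step (2-2-4); the search bounds fmin/fmax are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def pitch_autocorr(frame, fs, fmin=50, fmax=500):
    """Estimate pitch frequency as fs / (lag of the autocorrelation peak),
    searching only lags corresponding to plausible speech pitch."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)  # lag bounds for fmin..fmax
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

fs = 8000
t = np.arange(800) / fs                      # one 100 ms frame
f = pitch_autocorr(np.sin(2 * np.pi * 200 * t), fs)
print(round(f))  # 200
```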
(2-3) performing dimensionality reduction and feature selection in sequence on the num initial feature parameters obtained in step (2-2) by using principal component analysis (PCA), to obtain a mapping matrix A formed from the feature-parameter subset.
Specifically, the step includes the following substeps:
(2-3-1) constructing a feature matrix A_{4×num} = [a_1, a_2, ..., a_num] with 4 rows and num columns from the num 4-dimensional initial feature parameters obtained in step (2-2), as the input of collected feature data;
(2-3-2) calculating the mean μ and the covariance matrix COV of the feature matrix A_{4×num} obtained in step (2-3-1);
(2-3-3) calculating the eigenvalues and eigenvectors of the covariance matrix COV from step (2-3-2), pairing each eigenvector with its eigenvalue in coordinate form (λ_i, e_i), and sorting by eigenvalue to obtain the eigenvector sequence (λ_1, e_1), (λ_2, e_2), ..., (λ_num, e_num);
(2-3-4) from the eigenvector sequence (λ_1, e_1), ..., (λ_num, e_num) of step (2-3-3), selecting the first k eigenvectors (k ∈ [1, num]) as the directions of the principal components (i.e., the basis vectors of the reduced-dimension space), and constructing a mapping matrix A with 4 rows and k columns, in which the ith column represents the ith eigenvector.
(2-4) calculating the comprehensive probabilities P corresponding to the principal components of the audio data preprocessed in step (2-1) according to the mapping matrix A obtained in step (2-3), and selecting the emotion type with the maximum comprehensive probability as the emotion label of the voice.
The method comprises the following substeps:
(2-4-1) calculating the kth (k is the [1, num ] of the mapping matrix A by using the original data in the corpus aiming at different emotion types]) Mean value mu of individual principal components in jth emotion typejkAnd variance σjkAnd carrying out maximum separability processing to obtain the discrimination capability H of the kth principal component on the emotion typesk。
Wherein J is the number of emotion classes in the corpus employed, L_k represents the separation of the kth principal component across the emotion classes, M_k represents the concentration of the kth principal component within the emotion classes, and C is the permutation-and-combination function; H_k reflects the discriminative power of the principal component over the emotion classes, and the larger H_k is, the stronger the extracted principal component's ability to distinguish the emotion types.
(2-4-2) sorting the k principal components of the mapping matrix A by the discrimination capability H_k obtained in the step (2-4-1), selecting the first p principal components with the largest H_k as the pivot elements for emotion recognition, and normalizing them to obtain the normalized emotion-recognition pivots. The projections onto the mapping matrix A are summed over all principal components to obtain the score value S_k of each emotion-recognition pivot.
(2-4-3) according to the score value S_k of each emotion-recognition pivot obtained in the step (2-4-2), calculating the comprehensive probability P corresponding to each principal component in the user's voice, and selecting the emotion type with the largest comprehensive probability as the emotion label of the user's voice.
The integrated probability calculation formula is as follows:
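The patent gives the H_k and comprehensive-probability formulas only as images, so the numpy sketch below assumes a Fisher-style separability ratio (between-class spread over within-class spread) for H_k and a softmax over H_k-weighted class scores for P; the exact expressions are illustrative stand-ins, not the patent's own:

```python
import numpy as np

rng = np.random.default_rng(1)
J, p = 4, 3                          # J emotion classes, p retained pivots
# scores[j] holds the pivot projections of the corpus samples of class j
scores = [rng.normal(loc=j, size=(20, p)) for j in range(J)]

mu = np.array([s.mean(axis=0) for s in scores])   # mu_jk: per-class mean of pivot k
var = np.array([s.var(axis=0) for s in scores])   # sigma_jk^2: per-class variance

# Assumed separability measure: separation L_k over concentration M_k
L_k = mu.var(axis=0)                 # spread of pivot k across the classes
M_k = var.mean(axis=0)               # spread of pivot k within the classes
H_k = L_k / M_k                      # discrimination capability of pivot k

# Assumed comprehensive probability: softmax over H_k-weighted class scores
S = np.array([(s.mean(axis=0) * H_k).sum() for s in scores])
P = np.exp(S - S.max()) / np.exp(S - S.max()).sum()
label = int(np.argmax(P))            # emotion label = class with largest probability
```

The larger H_k is, the more a pivot contributes to the per-class score, matching the statement above that larger H_k means stronger discrimination of emotion types.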
(3) inputting the text information generated in the step (1) into the trained semantic representation and matching model, and matching the result, together with the emotion label obtained in the step (2), against the question-answer library to obtain and output the answer to the question.
Further, the step (3) specifically comprises the following sub-steps:
and (3-1) carrying out word segmentation preprocessing on the Chinese information in the corpus to obtain a word segmented corpus.
(3-2) encoding and feature extraction are carried out on the segmented corpus obtained in the step (3-1) using a word2vec model, so as to obtain a word matrix W of order c × v, where c is the number of words in the sentence and v is the vector dimension corresponding to each word; the dimension of the model is set to v = 300. The word matrix W is the input for training the semantic representation and matching model.
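The construction of the c × v word matrix W can be sketched as follows; a toy random embedding table stands in for a trained word2vec model, and v is reduced from 300 to 8 for brevity, so all names and values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
v = 8                                # embedding dimension (v = 300 in the patent)
vocab = ["今天", "天气", "怎么样"]     # segmented words of one sentence
embed = {w: rng.normal(size=v) for w in vocab}  # stand-in for a word2vec lookup

sentence = ["今天", "天气", "怎么样"]
W = np.stack([embed[w] for w in sentence])      # c x v word matrix, c = word count
```

In the patent's pipeline this matrix W is what gets fed to the CNN in step (3-3-1).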
And (3-3) inputting the word matrix W obtained in the step (3-2) into the semantic representation and matching model (the model analyzes the semantics and structure of question pairs and constructs the feature vector for voice analysis) and training it, so as to obtain the trained semantic representation and matching model.
Further, the semantic representation and matching model in the step (3-3) is trained by the following steps:
(3-3-1) using the c x v-order word matrix W obtained in the step (3-2) as the input of the CNN network;
(3-3-2) updating and optimizing the weight parameters and the bias parameters of each layer in the CNN convolutional neural network by using a back propagation algorithm to obtain an updated CNN convolutional neural network;
specifically, the weight parameters are initialized with random values drawn from a truncated normal distribution with a standard deviation of 0.1, and the bias parameters are initialized to 0;
(3-3-3) carrying out iterative training on the CNN convolutional neural network updated in the step (3-3-2) until the loss function of the CNN convolutional neural network reaches the minimum, thereby obtaining a preliminarily trained CNN convolutional neural network;
the loss function L of the CNN convolutional neural network is:
where Sample represents the total number of samples in the training set, Z represents the number of classes in the training set, t_{i,z} represents the prediction for the ith training sample of class z after input into the CNN convolutional neural network, y_{i,z} represents the true result corresponding to the ith training sample of class z, i ∈ [1, Sample], λ represents the degree of regularization, set to 0.007, and w_{i,z} represents the weight parameter when the ith training sample of class z is input into the CNN convolutional neural network; the weight parameters change as the CNN convolutional neural network is trained.
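The loss formula itself appears only as an image in the source; a standard reading of the variables above is cross-entropy with L2 weight regularization, sketched here in numpy on synthetic data (the patent's exact form may differ):

```python
import numpy as np

rng = np.random.default_rng(3)
Sample, Z = 6, 3
logits = rng.normal(size=(Sample, Z))
t = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # predictions t_{i,z}
y = np.eye(Z)[rng.integers(0, Z, size=Sample)]                  # one-hot truths y_{i,z}
w = rng.normal(size=(Z, 10))                                    # weight parameters
lam = 0.007                                                     # degree of regularization

cross_entropy = -(y * np.log(t)).sum() / Sample   # data term over all samples/classes
L = cross_entropy + lam * (w ** 2).sum()          # assumed total loss with L2 penalty
```

Training then iterates backpropagation updates of `w` (step 3-3-2) until `L` reaches its minimum (step 3-3-3).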
(3-3-4) for the CNN convolutional neural network preliminarily trained in the step (3-3-3), extracting the maximum feature-value matrix C after convolution of the convolution-kernel vectors using a 1-max pooling strategy, calculated according to the following formula:
C = max[c_1, c_2, ..., c_{n-h+1}]
where c_i represents the feature computed by the convolution kernel over the ith word of the sentence, produced during the training of the CNN convolutional neural network.
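The 1-max pooling of step (3-3-4) can be sketched as follows; the toy convolution of a kernel of height h over n word vectors is illustrative (random data, assumed shapes):

```python
import numpy as np

rng = np.random.default_rng(4)
n, v, h = 10, 8, 3                    # n words, v-dim embeddings, kernel height h
W = rng.normal(size=(n, v))           # word matrix of one sentence
kernel = rng.normal(size=(h, v))

# c_i: feature computed by the kernel over the window starting at word i
c = np.array([(W[i:i + h] * kernel).sum() for i in range(n - h + 1)])
C = c.max()                           # 1-max pooling: C = max[c_1, ..., c_{n-h+1}]
```

Each kernel thus contributes a single strongest feature value, and stacking these values over all kernels yields the matrix C that feeds the BiLSTM in step (3-3-5).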
And (3-3-5) using the maximum feature-value matrix C obtained in the step (3-3-4) as the training input of the BiLSTM network, performing feature extraction on it, and constructing the feature vector for voice analysis, which serves as the output of the semantic representation and matching model, thereby obtaining the trained semantic representation and matching model.
And (3-4) after word-segmentation preprocessing of the text information generated in the step (1), using the word2vec model for encoding and feature extraction, and inputting the result into the semantic representation and matching model trained in the step (3-3) to obtain the feature vector for voice analysis.
And (3-5) calculating the Euclidean distance between the voice-analysis feature vector obtained in the step (3-4) and the existing question vectors as the vector similarity, and, in combination with the emotion label obtained in the step (2), matching question answers in the question-answer library corresponding to that emotion label to obtain and output the answer to the question.
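The matching of step (3-5) can be sketched as follows; the per-emotion question bank, vectors, and answers are all made up for illustration:

```python
import numpy as np

q = np.array([0.2, 0.4, 0.1])                      # feature vector of the user's question
bank = {                                           # question vectors per emotion label
    "happy": [np.array([0.2, 0.5, 0.1]), np.array([0.9, 0.1, 0.0])],
    "sad":   [np.array([0.3, 0.3, 0.3])],
}
answers = {("happy", 0): "answer A", ("happy", 1): "answer B", ("sad", 0): "answer C"}

label = "happy"                                    # emotion label from step (2)
dists = [np.linalg.norm(q - v) for v in bank[label]]  # Euclidean distances
best = int(np.argmin(dists))                       # nearest stored question
answer = answers[(label, best)]
```

Restricting the search to `bank[label]` is what lets the same question receive different answers depending on the detected emotion.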
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. An intelligent voice question-answering method based on deep learning and emotion recognition is characterized by comprising the following steps:
(1) acquiring voice from a user, and processing the voice to extract corresponding text information;
(2) extracting characteristic parameters of the user's voice, and carrying out emotion analysis on the voice according to the extracted characteristic parameters to form an emotion label;
(3) inputting the text information generated in step (1) into a trained semantic representation and matching model, and matching the result, together with the emotion label obtained in step (2), against a question-answer library to obtain and output the answer to the question.
2. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 1, wherein step (1) specifically comprises the following substeps:
(1-1) acquiring a voice from a user and digitally encoding the voice to convert it into an audio map;
(1-2) using a Hamming window to perform framing and windowing processing on the audio image obtained in the step (1-1) in sequence, and performing Fourier transform on an obtained processing result to obtain a spectrogram;
and (1-3) inputting the spectrogram obtained in the step (1-2) into a trained Convolutional Neural Network (CNN) to obtain corresponding text information.
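The preprocessing of sub-steps (1-1) and (1-2), i.e. framing, Hamming windowing, and a per-frame Fourier transform into a spectrogram, can be sketched as follows on a synthetic tone (frame length and hop are assumed, not from the patent):

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
signal = np.sin(2 * np.pi * 440 * t)              # toy "audio map": a 440 Hz tone

frame_len, hop = 256, 128
window = np.hamming(frame_len)                    # Hamming window (step 1-2)
frames = np.stack([signal[i:i + frame_len] * window
                   for i in range(0, len(signal) - frame_len + 1, hop)])

spectrogram = np.abs(np.fft.rfft(frames, axis=1)) # per-frame Fourier transform
peak_bin = spectrogram.mean(axis=0).argmax()
peak_hz = peak_bin * fs / frame_len               # should land near 440 Hz
```

The resulting time-frequency matrix is what step (1-3) feeds to the CNN for text recognition.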
3. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 1 or 2, wherein step (2) specifically comprises the following sub-steps:
and (2-1) framing the audio image obtained in the step (1-1) by using a Hamming window, thereby obtaining the preprocessed audio data.
And (2-2) for each frame of audio data in the audio data preprocessed in the step (2-1), acquiring time domain features of the frame of audio data, and combining the time domain features corresponding to all frames to obtain num initial feature parameters, wherein the dimensionality of each initial feature parameter is 4.
And (2-3) carrying out dimensionality reduction and feature selection processing on the num initial feature parameters obtained in the step (2-2) by using a Principal Component Analysis (PCA) to obtain a mapping matrix A formed by the feature parameter subset.
And (2-4) calculating comprehensive probabilities P corresponding to the principal components of the audio data preprocessed in the step (2-1) according to the mapping matrix A obtained in the step (2-3), and selecting the emotion type corresponding to the maximum comprehensive probability from the comprehensive probabilities P as the emotion label of the voice.
4. The intelligent voice question-answering method based on deep learning and emotion recognition according to any one of claims 1 to 3, wherein step (2-2) comprises the following sub-steps:
(2-2-1) for each frame of audio data in the audio data preprocessed in the step (2-1), acquiring the short-time energy E_n of that frame;
(2-2-2) for each frame of audio data in the audio data preprocessed in the step (2-1), acquiring the short-time average zero-crossing rate ZCR_n of that frame;
(2-2-3) for each frame of audio data in the audio data preprocessed in the step (2-1), acquiring the formant characteristic parameter HZ_n of that frame;
(2-2-4) for each frame of audio data in the audio data preprocessed in the step (2-1), extracting the pitch period in the time domain using a short-time autocorrelation function, namely, using a window function to select d (the value range of d is 1 to 5, preferably 3) groups of signals around the nth frame of audio data and performing autocorrelation calculation, so as to obtain the pitch frequency R_n of that frame:
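Pitch extraction via short-time autocorrelation, as in sub-step (2-2-4), can be sketched on one synthetic voiced frame; the 60–400 Hz search band is an assumption for illustration:

```python
import numpy as np

fs = 8000
n = np.arange(400)
frame = np.sin(2 * np.pi * 200 * n / fs)          # one voiced frame, true pitch 200 Hz

ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation, lags >= 0
lo, hi = fs // 400, fs // 60                      # search pitch between 60 and 400 Hz
lag = lo + int(np.argmax(ac[lo:hi]))              # lag of the strongest periodic peak
R_n = fs / lag                                    # pitch frequency of the frame
```

The strongest autocorrelation peak away from lag 0 marks the pitch period, and its reciprocal (times the sample rate) gives R_n.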
5. The intelligent voice question-answering method based on deep learning and emotion recognition of claim 4, wherein the short-time energy E_n of the nth frame of audio data in step (2-2-1) is:
where x_n(m) represents the mth sample in the nth (n ∈ [1, num]) frame, after framing, of the audio map obtained in the step (1-1), and N represents the number of samples contained in one frame, which depends on the specific audio format.
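The energy formula of claim 5 appears only as an image; the sketch below uses the standard short-time definitions, E_n as the sum of squared samples of the frame and ZCR_n from sign changes, on a synthetic frame (both expressions are the conventional ones, assumed to match the patent's):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 256                                   # samples per frame (format-dependent)
x_n = rng.normal(size=N)                  # one frame x_n(m) of the framed audio map

E_n = np.sum(x_n ** 2)                    # short-time energy of the frame
ZCR_n = np.mean(np.abs(np.diff(np.sign(x_n))) / 2)  # short-time average zero-crossing rate
```

Together with the formant parameter HZ_n and pitch R_n, these give the 4-dimensional initial feature parameter of each frame used in step (2-2).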
6. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 3, wherein the step (2-3) comprises the following sub-steps:
(2-3-1) for the num 4-dimensional initial feature parameters obtained in the step (2-2), constructing a feature matrix A_(4×num) = [a_1, a_2, ..., a_num] of 4 rows and num columns as the input of collected feature data;
(2-3-2) for the feature matrix A_(4×num) = [a_1, a_2, ..., a_num] obtained in the step (2-3-1), calculating its mean value μ and covariance matrix COV_(4×num);
(2-3-3) for the covariance matrix COV_(4×num) in the step (2-3-2), calculating its eigenvalues and eigenvectors, pairing each eigenvalue with its corresponding eigenvector in coordinate form (λ_i, e_i), and sorting by eigenvalue magnitude to obtain the eigenpair sequence (λ_1, e_1), (λ_2, e_2), ..., (λ_num, e_num);
(2-3-4) for the eigenpair sequence (λ_1, e_1), (λ_2, e_2), ..., (λ_num, e_num) in the step (2-3-3), selecting the first k eigenvectors as the principal component directions (namely, the basis vectors of the reduced-dimension space) and constructing a mapping matrix A of 4 rows and k columns, where the ith column of the mapping matrix A represents the ith eigenvector and k ∈ [1, num].
7. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 3, wherein the step (2-4) comprises the following sub-steps:
(2-4-1) for different emotion types, calculating, from the original data in the corpus, the mean value μ_jk and variance σ_jk of the kth principal component of the mapping matrix A in the jth emotion type, and carrying out maximum-separability processing to obtain the discrimination capability H_k of the kth principal component over the emotion types;
wherein J is the number of emotion classes in the corpus employed, L_k represents the separation of the kth principal component across the emotion classes, M_k represents the concentration of the kth principal component within the emotion classes, and C is the permutation-and-combination function; H_k reflects the discriminative power of the principal component over the emotion classes, and the larger H_k is, the stronger the extracted principal component's ability to distinguish the emotion types;
(2-4-2) sorting the k principal components of the mapping matrix A by the discrimination capability H_k obtained in the step (2-4-1), selecting the first p principal components with the largest H_k as the pivot elements for emotion recognition, and normalizing them to obtain the normalized emotion-recognition pivots; the projections onto the mapping matrix A are summed over all principal components to obtain the score value S_k of each emotion-recognition pivot;
(2-4-3) according to the score value S_k of each emotion-recognition pivot obtained in the step (2-4-2), calculating the comprehensive probability P corresponding to each principal component in the user's voice, and selecting the emotion type with the largest comprehensive probability as the emotion label of the user's voice; wherein the comprehensive probability calculation formula is as follows:
8. the intelligent voice question-answering method based on deep learning and emotion recognition according to claim 1, wherein step (3) specifically comprises the following sub-steps:
(3-1) carrying out word segmentation preprocessing on the Chinese information in the corpus to obtain a word segmented corpus;
(3-2) encoding and feature extraction are carried out on the segmented corpus obtained in the step (3-1) using a word2vec model, so as to obtain a word matrix W of order c × v, where c is the number of words in the sentence and v is the vector dimension corresponding to each word, the dimension of the model being set to v = 300; the word matrix W is taken as the input for training the semantic representation and matching model;
and (3-3) inputting the word matrix W obtained in the step (3-2) into the semantic representation and matching model and training it to obtain the trained semantic representation and matching model;
and (3-4) after word-segmentation preprocessing of the text information generated in the step (1), using the word2vec model for encoding and feature extraction, and inputting the result into the semantic representation and matching model trained in the step (3-3) to obtain the feature vector for voice analysis;
and (3-5) calculating the Euclidean distance between the voice-analysis feature vector obtained in the step (3-4) and the existing question vectors as the vector similarity, and, in combination with the emotion label obtained in the step (2), matching question answers in the question-answer library corresponding to that emotion label to obtain and output the answer to the question.
9. The intelligent voice question-answering method based on deep learning and emotion recognition of claim 8, wherein the semantic representation and matching model in step (3-3) is trained by the following steps:
(3-3-1) using the c x v-order word matrix W obtained in the step (3-2) as the input of the CNN network;
(3-3-2) updating and optimizing the weight parameters and the bias parameters of each layer in the CNN convolutional neural network by using a back propagation algorithm to obtain an updated CNN convolutional neural network;
(3-3-3) carrying out iterative training on the CNN convolutional neural network updated in the step (3-3-2) until the loss function of the CNN convolutional neural network reaches the minimum, thereby obtaining a preliminarily trained CNN convolutional neural network;
wherein the loss function L of the CNN convolutional neural network is:
where Sample represents the total number of samples in the training set, Z represents the number of classes in the training set, t_{i,z} represents the prediction for the ith training sample of class z after input into the CNN convolutional neural network, y_{i,z} represents the true result corresponding to the ith training sample of class z, i ∈ [1, Sample], λ represents the degree of regularization, set to 0.007, and w_{i,z} represents the weight parameter when the ith training sample of class z is input into the CNN convolutional neural network; the weight parameters change as the CNN convolutional neural network is trained;
(3-3-4) for the CNN convolutional neural network preliminarily trained in the step (3-3-3), extracting the maximum feature-value matrix C after convolution of the convolution-kernel vectors using a 1-max pooling strategy, calculated according to the following formula:
C = max[c_1, c_2, ..., c_{n-h+1}]
where c_i represents the feature computed by the convolution kernel over the ith word of the sentence, produced during the training of the CNN convolutional neural network;
and (3-3-5) using the maximum feature-value matrix C obtained in the step (3-3-4) as the training input of the BiLSTM network, performing feature extraction on it, and constructing the feature vector for voice analysis, which serves as the output of the semantic representation and matching model, thereby obtaining the trained semantic representation and matching model.
10. An intelligent voice question-answering system based on deep learning and emotion recognition is characterized by comprising:
the first module is used for acquiring voice from a user and processing the voice to extract corresponding text information.
The second module is used for extracting the characteristic parameters of the voice of the user and carrying out emotion analysis on the voice according to the extracted characteristic parameters to form an emotion label;
and the third module is used for inputting the text information generated by the first module into the trained semantic representation and matching model, matching the emotion labels obtained by the second module with the question-answer library to obtain answers of the question and output the answers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111475872.XA CN114203177A (en) | 2021-12-06 | 2021-12-06 | Intelligent voice question-answering method and system based on deep learning and emotion recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114203177A true CN114203177A (en) | 2022-03-18 |
Family
ID=80650788
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111475872.XA Pending CN114203177A (en) | 2021-12-06 | 2021-12-06 | Intelligent voice question-answering method and system based on deep learning and emotion recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114203177A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115097946A (en) * | 2022-08-15 | 2022-09-23 | 汉华智能科技(佛山)有限公司 | Remote worship method, system and storage medium based on Internet of things |
CN115482837A (en) * | 2022-07-25 | 2022-12-16 | 科睿纳(河北)医疗科技有限公司 | Emotion classification method based on artificial intelligence |
CN116597821A (en) * | 2023-07-17 | 2023-08-15 | 深圳市国硕宏电子有限公司 | Intelligent customer service voice recognition method and system based on deep learning |
CN117992597A (en) * | 2024-04-03 | 2024-05-07 | 江苏微皓智能科技有限公司 | Information feedback method, device, computer equipment and computer storage medium |
CN117995174A (en) * | 2024-04-07 | 2024-05-07 | 广东实丰智能科技有限公司 | Learning type electric toy control method based on man-machine interaction |
CN118035431A (en) * | 2024-04-12 | 2024-05-14 | 青岛网信信息科技有限公司 | User emotion prediction method, medium and system in text customer service process |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||