CN114203177A - Intelligent voice question-answering method and system based on deep learning and emotion recognition - Google Patents
- Publication number
- CN114203177A (application CN202111475872.XA)
- Authority
- CN
- China
- Prior art keywords
- emotion
- voice
- question
- audio data
- num
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Abstract
The invention discloses an intelligent voice question-answering method based on deep learning and emotion recognition, comprising the following steps: acquiring speech from a user and processing it to extract the corresponding text information; extracting characteristic parameters from the user's speech and performing emotion analysis on them to form an emotion label; and inputting the generated text information into a trained semantic representation and matching model which, combined with the obtained emotion label, is matched against a question-answer library to obtain and output the answer to the question. The invention addresses two technical problems of existing intelligent question-answering systems: low accuracy caused by the limited application of deep learning algorithms to the ambiguity and complexity of natural language, and matching deviations caused by data redundancy and the single dimensionality of the information captured from Chinese sentences.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an intelligent voice question-answering method and system based on deep learning and emotion recognition.
Background
With the advent of the internet age, data has grown exponentially, and people have begun to use search engines to find the information they want. A search engine is one form of intelligent question-answering system, presenting users with answers to their questions in list form. As people demand higher efficiency and quality from information retrieval, research on intelligent voice question-answering systems has begun to develop further.
Existing intelligent voice question-answering systems usually work in two steps: speech recognition is performed first, and the question is then answered based on the recognized speech. Speech recognition techniques are generally divided into three steps: feature extraction, deep neural network training, and decoding. Question-answering systems fall into three main categories: extractive intelligent question-answering systems, generative intelligent question-answering systems, and intelligent question-answering systems based on question-answer pairs. All of them use natural language processing technology to convert sentences into structured data a computer can recognize, analyze the question, capture the keywords, and finally match against an answer library.
However, the above intelligent question-answering systems still have certain defects. First, because of the ambiguity and complexity of natural language, deep learning algorithms are not applied extensively in natural language processing, so the accuracy of the question-answering system is low. Second, when capturing Chinese sentence information, data redundancy and single dimensionality still occur, which easily causes systematic matching bias.
Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides an intelligent voice question-answering system based on deep learning and emotion recognition, to solve two technical problems of existing intelligent question-answering systems: low accuracy of the question-answering system, caused by the limited application of deep learning algorithms to the ambiguity and complexity of natural language, and system matching deviation, easily caused by the data redundancy and single dimensionality that still occur when Chinese sentence information is captured.
In order to achieve the above object, according to an aspect of the present invention, there is provided an intelligent voice question-answering method based on deep learning and emotion recognition, comprising the steps of:
(1) acquiring the voice from the user, and processing the voice to extract the corresponding text information;
(2) extracting characteristic parameters of the user's voice, and performing emotion analysis on the voice according to the extracted characteristic parameters to form an emotion label;
(3) inputting the text information generated in step (1) into a trained semantic representation and matching model and, combined with the emotion label obtained in step (2), matching against the question-answer library to obtain and output the answer to the question.
Preferably, step (1) specifically comprises the following sub-steps:
(1-1) acquiring the voice from the user and digitally encoding it to convert it into an audio map;
(1-2) using a Hamming window to perform framing and windowing on the audio map obtained in step (1-1) in sequence, and performing a Fourier transform on the result to obtain a spectrogram;
(1-3) inputting the spectrogram obtained in step (1-2) into a trained convolutional neural network (CNN) to obtain the corresponding text information.
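The framing, Hamming-window, and Fourier-transform pipeline of steps (1-1) and (1-2) can be sketched as follows; the frame length, hop size, and FFT size are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Frame the signal, apply a Hamming window to each frame, then take
    the magnitude FFT of each windowed frame (steps (1-1)-(1-2))."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # magnitude spectrum of each windowed (zero-padded) frame
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)  # 1 s, 440 Hz tone
S = spectrogram(sig)
print(S.shape)  # (98, 257)
```

The resulting time-frequency matrix is what step (1-3) would feed to the trained CNN.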
Preferably, step (2) comprises in particular the following sub-steps:
(2-1) framing the audio map obtained in step (1-1) with a Hamming window, thereby obtaining the preprocessed audio data.
(2-2) for each frame of the audio data preprocessed in step (2-1), acquiring its time-domain features, and merging the time-domain features of all frames to obtain num initial feature parameters, each of dimension 4.
(2-3) performing dimensionality reduction and feature selection on the num initial feature parameters obtained in step (2-2) by using principal component analysis (PCA), to obtain a mapping matrix A formed from the feature-parameter subset.
(2-4) calculating the comprehensive probabilities P corresponding to the principal components of the audio data preprocessed in step (2-1) according to the mapping matrix A obtained in step (2-3), and selecting the emotion type with the maximum comprehensive probability as the emotion label of the voice.
Preferably, step (2-2) comprises the sub-steps of:
(2-2-1) for each frame of the audio data preprocessed in step (2-1), acquiring the short-time energy E_n of that frame;
(2-2-2) for each frame of the audio data preprocessed in step (2-1), obtaining the short-time average zero-crossing rate ZCR_n of that frame;
(2-2-3) for each frame of the audio data preprocessed in step (2-1), acquiring the formant characteristic parameter HZ_n of that frame;
(2-2-4) for each frame of the audio data preprocessed in step (2-1), extracting the pitch period in the time domain by using a short-time autocorrelation function, i.e., selecting d groups of signals (d ranging from 1 to 5, preferably 3) around the nth frame with a window function and performing an autocorrelation calculation to obtain the pitch frequency R_n of that frame.
Preferably, the short-time energy E_n of the nth frame of audio data in step (2-2-1) is:
E_n = Σ_{m=1}^{N} x_n(m)^2
wherein x_n(m) represents the mth sample of the nth frame (n ∈ [1, num]) of the audio map obtained in step (1-1) after framing, and N represents the number of samples contained in one frame, which depends on the specific audio format.
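A minimal numpy sketch of the short-time energy computation above, assuming the framed audio is already available as a (num, N) array; the sample frames are made-up values.

```python
import numpy as np

def short_time_energy(frames):
    """E_n = sum over m of x_n(m)^2 for each frame n (step (2-2-1)).
    `frames` is a (num, N) array of framed audio samples."""
    return np.sum(frames ** 2, axis=1)

frames = np.array([[0.0, 0.0, 0.0, 0.0],    # silence
                   [0.5, -0.5, 0.5, -0.5],  # low-amplitude frame
                   [1.0, 1.0, 1.0, 1.0]])   # high-amplitude frame
E = short_time_energy(frames)
print(E)  # [0. 1. 4.]
```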
Preferably, step (2-3) comprises the sub-steps of:
(2-3-1) constructing a feature matrix A_{4×num} = [a_1, a_2, ..., a_num] with 4 rows and num columns from the num 4-dimensional initial feature parameters obtained in step (2-2), as the input of collected feature data;
(2-3-2) calculating the mean μ and the covariance matrix COV of the feature matrix A_{4×num} obtained in step (2-3-1);
(2-3-3) calculating the eigenvalues and eigenvectors of the covariance matrix COV from step (2-3-2), pairing each eigenvector with its eigenvalue in coordinate form (λ_i, e_i), and sorting by eigenvalue to obtain the eigenvector sequence (λ_1, e_1), (λ_2, e_2), ..., (λ_num, e_num);
(2-3-4) from the eigenvector sequence (λ_1, e_1), ..., (λ_num, e_num) of step (2-3-3), selecting the first k eigenvectors (k ∈ [1, num]) as the directions of the principal components (i.e., the basis vectors of the reduced-dimension space), and constructing a mapping matrix A with 4 rows and k columns, in which the ith column represents the ith eigenvector.
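The PCA sub-steps above (covariance, eigendecomposition, top-k selection) can be sketched as follows; the random 4 × num feature matrix and choice of k are illustrative assumptions.

```python
import numpy as np

def pca_mapping(A, k):
    """Build the 4 x k mapping matrix from a 4 x num feature matrix A
    by eigendecomposition of its covariance (steps (2-3-1)-(2-3-4))."""
    cov = np.cov(A)                         # 4 x 4 covariance, rows = features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues descending
    return eigvecs[:, order[:k]]            # top-k eigenvectors as columns

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 100))  # 100 frames of 4-dimensional features
M = pca_mapping(A, k=2)
print(M.shape)  # (4, 2)
```

The columns of the returned matrix are orthonormal principal directions, matching the "basis vectors of the reduced-dimension space" described in step (2-3-4).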
Preferably, step (2-4) comprises the sub-steps of:
(2-4-1) for each emotion type, using the original data in the corpus to calculate the mean μ_jk and variance σ_jk of the kth principal component of the mapping matrix A in the jth emotion type, and performing maximum-separability processing to obtain the discrimination capability H_k of the kth principal component for the emotion types;
wherein J is the number of emotion classes in the corpus employed, L_k represents the separation of the kth principal component between the emotion classes, M_k represents the concentration of the kth principal component within the emotion classes, and C is a permutation-and-combination function; H_k reflects the discriminative power of the principal component over the emotion classes, and the larger H_k is, the stronger the ability of the extracted principal component to distinguish the emotion types;
(2-4-2) sorting the k principal components of the mapping matrix A by the discrimination capability H_k obtained in step (2-4-1), selecting the first p principal components with the largest H_k as the pivot elements for emotion recognition, and normalizing them to obtain the normalized emotion-recognition pivots; summing the projections onto the mapping matrix A over all the principal components gives the score value S_k of each emotion-recognition pivot;
(2-4-3) according to the score value S_k of each emotion-recognition pivot obtained in step (2-4-2), calculating the comprehensive probability P corresponding to each principal component in the user's speech from the score values, and selecting the emotion type with the maximum comprehensive probability as the emotion label of the user's speech.
preferably, step (3) comprises in particular the following sub-steps:
(3-1) performing word-segmentation preprocessing on the Chinese information in the corpus to obtain a word-segmented corpus;
(3-2) encoding and extracting features from the word-segmented corpus obtained in step (3-1) with a word2vec model to obtain a word matrix W of order c × v, where c is the number of words in the sentence and v is the vector dimension of each word, with the model dimension set to v = 300; the word matrix W serves as the input for training the semantic representation and matching model;
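The construction of the c × v word matrix W in step (3-2) can be illustrated as follows; a deterministic toy embedding stands in for a trained word2vec model, and the sample sentence is an assumption for the example.

```python
import numpy as np

V = 300  # vector dimension v = 300, as set in step (3-2)

def embed(word, v=V):
    # Deterministic toy embedding standing in for a word2vec lookup:
    # seeds a generator from the word's characters so the same word
    # always maps to the same v-dimensional vector.
    seed = sum(ord(ch) for ch in word)
    rng = np.random.default_rng(seed)
    return rng.normal(size=v)

sentence = ["今天", "天气", "怎么样"]        # an already word-segmented sentence
W = np.stack([embed(w) for w in sentence])  # c x v word matrix, c = 3 words
print(W.shape)  # (3, 300)
```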
(3-3) inputting the word matrix W obtained in step (3-2) into the semantic representation and matching model for training, to obtain the trained semantic representation and matching model;
(3-4) after word-segmentation preprocessing of the text information generated in step (1), encoding and extracting features with the word2vec model, and inputting the result into the semantic representation and matching model trained in step (3-3) to obtain the feature vector of the voice analysis;
(3-5) calculating the Euclidean distance between the feature vector of the voice analysis obtained in step (3-4) and the existing question vectors as the vector similarity, and, combined with the emotion label obtained in step (2), matching question answers in the question-answer library corresponding to that emotion label to obtain and output the answer to the question.
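The nearest-question lookup of step (3-5) can be sketched as below; the stored question vectors and answers are illustrative placeholders, and in the full system the search would be restricted to the question-answer library matching the recognized emotion label.

```python
import numpy as np

def match_answer(query_vec, question_vecs, answers):
    """Pick the answer whose stored question vector has the smallest
    Euclidean distance to the query's feature vector (step (3-5))."""
    dists = np.linalg.norm(question_vecs - query_vec, axis=1)
    return answers[int(np.argmin(dists))]

qs = np.array([[1.0, 0.0],   # toy 2-d question vectors
               [0.0, 1.0],
               [0.7, 0.7]])
ans = ["answer A", "answer B", "answer C"]
print(match_answer(np.array([0.1, 0.9]), qs, ans))  # answer B
```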
Preferably, the semantic representation and the matching model in step (3-3) are trained by the following steps:
(3-3-1) using the c x v-order word matrix W obtained in the step (3-2) as the input of the CNN network;
(3-3-2) updating and optimizing the weight parameters and the bias parameters of each layer in the CNN convolutional neural network by using a back propagation algorithm to obtain an updated CNN convolutional neural network;
(3-3-3) carrying out iterative training on the CNN convolutional neural network updated in the step (3-3-2) until the loss function of the CNN convolutional neural network reaches the minimum, thereby obtaining a preliminarily trained CNN convolutional neural network;
wherein the loss function L of the CNN convolutional neural network is:
where Sample represents the total number of samples in the training set, Z represents the number of classes in the training set, t_{i,z} represents the prediction result of the ith training sample of the zth class after it is input into the CNN convolutional neural network, y_{i,z} represents the corresponding true result, i ∈ [1, Sample], λ represents the degree of regularization, which is 0.007, and w_{i,z} represents the weight parameter of the ith training sample of the zth class when it is input into the CNN convolutional neural network, which changes as the network is trained.
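The loss described above (cross-entropy over Sample training samples and Z classes, plus λ-weighted regularization of the weights w) admits the following standard reading; the exact formula appears only as an image in the original filing, so this numpy sketch is a hedged reconstruction, not the patent's verbatim equation.

```python
import numpy as np

def loss(t, y, w, lam=0.007):
    """Mean cross-entropy over all samples plus L2 weight regularization:
    a plausible reading of the CNN loss L described in step (3-3-3)."""
    sample = t.shape[0]
    ce = -np.sum(y * np.log(t)) / sample  # cross-entropy term
    return ce + lam * np.sum(w ** 2)      # lambda-weighted penalty

t = np.array([[0.9, 0.1], [0.2, 0.8]])  # predicted class probabilities t_{i,z}
y = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot true results y_{i,z}
w = np.zeros((2, 2))                    # toy weights (zero -> no penalty)
print(round(loss(t, y, w), 4))  # 0.1643
```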
(3-3-4) for the CNN convolutional neural network preliminarily trained in step (3-3-3), extracting the maximum-eigenvalue matrix C after convolution of the convolution kernel vectors by adopting a 1-max pooling strategy, calculated according to the following formula:
C = max[c_1, c_2, ..., c_{n-h+1}]
wherein c_i represents the feature computed by the convolution kernel over the ith word window of the sentence, produced during the training of the CNN convolutional neural network.
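The 1-max pooling of step (3-3-4) simply keeps each convolution filter's largest response over the n-h+1 window positions; the two feature maps below are made-up numbers for illustration.

```python
import numpy as np

feature_maps = np.array([[0.2, 1.5, 0.3],   # filter 1: c_1 .. c_{n-h+1}
                         [0.9, 0.1, 0.4]])  # filter 2
C = feature_maps.max(axis=1)  # 1-max pooling: one value per filter
print(C)  # [1.5 0.9]
```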
(3-3-5) taking the maximum-eigenvalue matrix C obtained in step (3-3-4) as the training input of the BiLSTM network, performing feature extraction on it, constructing the feature vector of the voice analysis, and taking that feature vector as the output of the semantic representation and matching model, thereby obtaining the trained semantic representation and matching model.
According to another aspect of the present invention, there is provided an intelligent voice question-answering system based on deep learning and emotion recognition, comprising:
a first module, for acquiring voice from a user and processing it to extract the corresponding text information;
a second module, for extracting characteristic parameters of the user's voice and performing emotion analysis on the voice according to the extracted characteristic parameters to form an emotion label;
and a third module, for inputting the text information generated by the first module into the trained semantic representation and matching model and, combined with the emotion label obtained by the second module, matching against the question-answer library to obtain and output the answer to the question.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) In step (3), the invention integrates a convolutional neural network with a BiLSTM network to construct the semantic representation and matching model that yields the feature vector of the voice analysis, applying a more complete deep learning algorithm than existing question-answering systems.
(2) Because step (2) performs emotion analysis on the user's voice and forms an emotion label, the input dimensionality of the question-answering system is enriched, which can overcome the data redundancy and single dimensionality of existing question-answering systems.
Drawings
FIG. 1 is a schematic flow chart of an intelligent voice question-answering system based on deep learning and emotion recognition according to the present invention;
FIG. 2 is a detailed flow chart of the intelligent voice question-answering system based on deep learning and emotion recognition.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1 and 2, the present invention provides an intelligent voice question-answering method based on deep learning and emotion recognition, comprising the following steps:
(1) acquiring the voice from the user, and processing the voice to extract the corresponding text information.
Further, step (1) specifically comprises the following substeps:
(1-1) acquiring the voice from the user and digitally encoding it to convert it into an audio map;
(1-2) using a Hamming window to perform framing and windowing on the audio map obtained in step (1-1) in sequence, and performing a Fourier transform on the result to obtain a spectrogram;
(1-3) inputting the spectrogram obtained in step (1-2) into a trained convolutional neural network (CNN) to obtain the corresponding text information.
(2) extracting characteristic parameters of the user's voice, and performing emotion analysis on the voice according to the extracted characteristic parameters to form an emotion label.
Further, the step (2) specifically comprises the following sub-steps:
and (2-1) framing the audio image obtained in the step (1-1) by utilizing a Hamming window (each audio record file after framing correspondingly generates a one-dimensional frame array), thereby obtaining the preprocessed audio data.
And (2-2) for each frame of audio data in the audio data preprocessed in the step (2-1), acquiring time domain features (including an average pronunciation rate, a pitch frequency, a short-time energy change, a short-time average zero-crossing rate and a formant frequency) of the frame of audio data, and merging the time domain features corresponding to all the frames to obtain num initial feature parameters (the dimension of which is 4).
Specifically, the processing of the audio data in this step includes the following sub-steps:
(2-2-1) for each frame of the audio data preprocessed in step (2-1), acquiring the short-time energy E_n of that frame;
Specifically, let x_n(m) be the mth sample of the nth frame (n ∈ [1, num]) of the audio map obtained in step (1-1) after framing; the short-time energy E_n of the nth frame is:
E_n = Σ_{m=1}^{N} x_n(m)^2
where N represents the number of samples contained in a frame, depending on the particular audio format (e.g., AAC fixes a frame at 1024 samples, while the MP3 format uses 1152).
(2-2-2) for each frame of the audio data preprocessed in step (2-1), obtaining the short-time average zero-crossing rate ZCR_n of that frame;
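The short-time average zero-crossing rate of step (2-2-2) can be sketched with the common textbook definition, ZCR = (1/2) · mean |sgn x(m) − sgn x(m−1)|; the patent's own formula appears only as an image in the original filing, so this form is an assumption.

```python
import numpy as np

def zcr(frame):
    """Short-time average zero-crossing rate of one frame (step (2-2-2)):
    half the average absolute change of the sign sequence."""
    s = np.sign(frame)
    return 0.5 * float(np.mean(np.abs(np.diff(s))))

frame = np.array([1.0, -1.0, 1.0, -1.0])  # alternates sign every sample
print(zcr(frame))  # 1.0
```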
(2-2-3) for each frame of the audio data preprocessed in step (2-1), obtaining the formant characteristic parameter HZ_n of that frame;
(2-2-4) for each frame of the audio data preprocessed in step (2-1), extracting the pitch period in the time domain by using a short-time autocorrelation function, i.e., selecting d groups of signals (d ranging from 1 to 5, preferably 3) around the nth frame with a window function and performing an autocorrelation calculation to obtain the pitch frequency R_n of that frame.
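A minimal sketch of pitch estimation from the peak of the short-time autocorrelation, as in step (2-2-4); the search bounds fmin/fmax are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def pitch_autocorr(frame, fs, fmin=50, fmax=500):
    """Estimate pitch frequency as fs / (lag of the autocorrelation peak),
    searching only lags corresponding to plausible speech pitch."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)  # lag bounds for fmin..fmax
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

fs = 8000
t = np.arange(800) / fs                      # one 100 ms frame
f = pitch_autocorr(np.sin(2 * np.pi * 200 * t), fs)
print(round(f))  # 200
```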
(2-3) performing dimensionality reduction and feature selection in sequence on the num initial feature parameters obtained in step (2-2) by using principal component analysis (PCA), to obtain a mapping matrix A formed from the feature-parameter subset.
Specifically, the step includes the following substeps:
(2-3-1) constructing a feature matrix A_{4×num} = [a_1, a_2, ..., a_num] with 4 rows and num columns from the num 4-dimensional initial feature parameters obtained in step (2-2), as the input of collected feature data;
(2-3-2) calculating the mean μ and the covariance matrix COV of the feature matrix A_{4×num} obtained in step (2-3-1);
(2-3-3) calculating the eigenvalues and eigenvectors of the covariance matrix COV from step (2-3-2), pairing each eigenvector with its eigenvalue in coordinate form (λ_i, e_i), and sorting by eigenvalue to obtain the eigenvector sequence (λ_1, e_1), (λ_2, e_2), ..., (λ_num, e_num);
(2-3-4) from the eigenvector sequence (λ_1, e_1), ..., (λ_num, e_num) of step (2-3-3), selecting the first k eigenvectors (k ∈ [1, num]) as the directions of the principal components (i.e., the basis vectors of the reduced-dimension space), and constructing a mapping matrix A with 4 rows and k columns, in which the ith column represents the ith eigenvector.
(2-4) calculating the comprehensive probabilities P corresponding to the principal components of the audio data preprocessed in step (2-1) according to the mapping matrix A obtained in step (2-3), and selecting the emotion type with the maximum comprehensive probability as the emotion label of the voice.
The method comprises the following substeps:
(2-4-1) calculating the kth (k is the [1, num ] of the mapping matrix A by using the original data in the corpus aiming at different emotion types]) Mean value mu of individual principal components in jth emotion typejkAnd variance σjkAnd carrying out maximum separability processing to obtain the discrimination capability H of the kth principal component on the emotion typesk。
Wherein J is the number of emotion classes in the corpus employed, L_k represents the separation of the kth principal component across the emotion classes, M_k represents the concentration of the kth principal component within the emotion classes, and C is the permutation-and-combination function; H_k reflects the discriminative power of the principal component over the emotion classes, and the larger H_k is, the stronger the extracted principal component's ability to distinguish the emotion types.
(2-4-2) sorting the k principal components of the mapping matrix A by the discrimination capability H_k obtained in the step (2-4-1), selecting the first p principal components with the largest H_k as the pivot elements for emotion recognition, and normalizing them to obtain the normalized emotion-recognition pivots. The projections onto the mapping matrix A are summed over all principal components to obtain the score value S_k of each emotion-recognition pivot.
(2-4-3) according to the score value S_k of each emotion-recognition pivot obtained in the step (2-4-2), calculating the comprehensive probability P corresponding to each principal component in the user's voice, and selecting the emotion type with the largest comprehensive probability as the emotion label of the user's voice.
The integrated probability calculation formula is as follows:
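The patent gives the H_k and comprehensive-probability formulas only as images, so the numpy sketch below assumes a Fisher-style separability ratio (between-class spread over within-class spread) for H_k and a softmax over H_k-weighted class scores for P; the exact expressions are illustrative stand-ins, not the patent's own:

```python
import numpy as np

rng = np.random.default_rng(1)
J, p = 4, 3                          # J emotion classes, p retained pivots
# scores[j] holds the pivot projections of the corpus samples of class j
scores = [rng.normal(loc=j, size=(20, p)) for j in range(J)]

mu = np.array([s.mean(axis=0) for s in scores])   # mu_jk: per-class mean of pivot k
var = np.array([s.var(axis=0) for s in scores])   # sigma_jk^2: per-class variance

# Assumed separability measure: separation L_k over concentration M_k
L_k = mu.var(axis=0)                 # spread of pivot k across the classes
M_k = var.mean(axis=0)               # spread of pivot k within the classes
H_k = L_k / M_k                      # discrimination capability of pivot k

# Assumed comprehensive probability: softmax over H_k-weighted class scores
S = np.array([(s.mean(axis=0) * H_k).sum() for s in scores])
P = np.exp(S - S.max()) / np.exp(S - S.max()).sum()
label = int(np.argmax(P))            # emotion label = class with largest probability
```

The larger H_k is, the more a pivot contributes to the per-class score, matching the statement above that larger H_k means stronger discrimination of emotion types.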
(3) inputting the text information generated in the step (1) into the trained semantic representation and matching model, and matching the result, together with the emotion label obtained in the step (2), against the question-answer library to obtain and output the answer to the question.
Further, the step (3) specifically comprises the following sub-steps:
and (3-1) carrying out word segmentation preprocessing on the Chinese information in the corpus to obtain a word segmented corpus.
(3-2) encoding and feature extraction are carried out on the segmented corpus obtained in the step (3-1) using a word2vec model, so as to obtain a word matrix W of order c × v, where c is the number of words in the sentence and v is the vector dimension corresponding to each word; the dimension of the model is set to v = 300. The word matrix W is the input for training the semantic representation and matching model.
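The construction of the c × v word matrix W can be sketched as follows; a toy random embedding table stands in for a trained word2vec model, and v is reduced from 300 to 8 for brevity, so all names and values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
v = 8                                # embedding dimension (v = 300 in the patent)
vocab = ["今天", "天气", "怎么样"]     # segmented words of one sentence
embed = {w: rng.normal(size=v) for w in vocab}  # stand-in for a word2vec lookup

sentence = ["今天", "天气", "怎么样"]
W = np.stack([embed[w] for w in sentence])      # c x v word matrix, c = word count
```

In the patent's pipeline this matrix W is what gets fed to the CNN in step (3-3-1).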
And (3-3) inputting the word matrix W obtained in the step (3-2) into the semantic representation and matching model (the model analyzes the semantics and structure of question pairs and constructs the feature vector for voice analysis) and training it, so as to obtain the trained semantic representation and matching model.
Further, the semantic representation and matching model in the step (3-3) is trained by the following steps:
(3-3-1) using the c x v-order word matrix W obtained in the step (3-2) as the input of the CNN network;
(3-3-2) updating and optimizing the weight parameters and the bias parameters of each layer in the CNN convolutional neural network by using a back propagation algorithm to obtain an updated CNN convolutional neural network;
specifically, the weight parameters are initialized with random values drawn from a truncated normal distribution with a standard deviation of 0.1, and the bias parameters are initialized to 0;
(3-3-3) carrying out iterative training on the CNN convolutional neural network updated in the step (3-3-2) until the loss function of the CNN convolutional neural network reaches the minimum, thereby obtaining a preliminarily trained CNN convolutional neural network;
the loss function L of the CNN convolutional neural network is:
where Sample represents the total number of samples in the training set, Z represents the number of classes in the training set, t_{i,z} represents the prediction for the ith training sample of class z after input into the CNN convolutional neural network, y_{i,z} represents the true result corresponding to the ith training sample of class z, i ∈ [1, Sample], λ represents the degree of regularization, set to 0.007, and w_{i,z} represents the weight parameter when the ith training sample of class z is input into the CNN convolutional neural network; the weight parameters change as the CNN convolutional neural network is trained.
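The loss formula itself appears only as an image in the source; a standard reading of the variables above is cross-entropy with L2 weight regularization, sketched here in numpy on synthetic data (the patent's exact form may differ):

```python
import numpy as np

rng = np.random.default_rng(3)
Sample, Z = 6, 3
logits = rng.normal(size=(Sample, Z))
t = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # predictions t_{i,z}
y = np.eye(Z)[rng.integers(0, Z, size=Sample)]                  # one-hot truths y_{i,z}
w = rng.normal(size=(Z, 10))                                    # weight parameters
lam = 0.007                                                     # degree of regularization

cross_entropy = -(y * np.log(t)).sum() / Sample   # data term over all samples/classes
L = cross_entropy + lam * (w ** 2).sum()          # assumed total loss with L2 penalty
```

Training then iterates backpropagation updates of `w` (step 3-3-2) until `L` reaches its minimum (step 3-3-3).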
(3-3-4) for the CNN convolutional neural network preliminarily trained in the step (3-3-3), extracting the maximum feature-value matrix C after convolution of the convolution-kernel vectors using a 1-max pooling strategy, calculated according to the following formula:
C = max[c_1, c_2, ..., c_{n-h+1}]
where c_i represents the feature computed by the convolution kernel over the ith word of the sentence, produced during the training of the CNN convolutional neural network.
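The 1-max pooling of step (3-3-4) can be sketched as follows; the toy convolution of a kernel of height h over n word vectors is illustrative (random data, assumed shapes):

```python
import numpy as np

rng = np.random.default_rng(4)
n, v, h = 10, 8, 3                    # n words, v-dim embeddings, kernel height h
W = rng.normal(size=(n, v))           # word matrix of one sentence
kernel = rng.normal(size=(h, v))

# c_i: feature computed by the kernel over the window starting at word i
c = np.array([(W[i:i + h] * kernel).sum() for i in range(n - h + 1)])
C = c.max()                           # 1-max pooling: C = max[c_1, ..., c_{n-h+1}]
```

Each kernel thus contributes a single strongest feature value, and stacking these values over all kernels yields the matrix C that feeds the BiLSTM in step (3-3-5).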
And (3-3-5) using the maximum feature-value matrix C obtained in the step (3-3-4) as the training input of the BiLSTM network, performing feature extraction on it, and constructing the feature vector for voice analysis, which serves as the output of the semantic representation and matching model, thereby obtaining the trained semantic representation and matching model.
And (3-4) after word-segmentation preprocessing of the text information generated in the step (1), using the word2vec model for encoding and feature extraction, and inputting the result into the semantic representation and matching model trained in the step (3-3) to obtain the feature vector for voice analysis.
And (3-5) calculating the Euclidean distance between the voice-analysis feature vector obtained in the step (3-4) and the existing question vectors as the vector similarity, and, in combination with the emotion label obtained in the step (2), matching question answers in the question-answer library corresponding to that emotion label to obtain and output the answer to the question.
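The matching of step (3-5) can be sketched as follows; the per-emotion question bank, vectors, and answers are all made up for illustration:

```python
import numpy as np

q = np.array([0.2, 0.4, 0.1])                      # feature vector of the user's question
bank = {                                           # question vectors per emotion label
    "happy": [np.array([0.2, 0.5, 0.1]), np.array([0.9, 0.1, 0.0])],
    "sad":   [np.array([0.3, 0.3, 0.3])],
}
answers = {("happy", 0): "answer A", ("happy", 1): "answer B", ("sad", 0): "answer C"}

label = "happy"                                    # emotion label from step (2)
dists = [np.linalg.norm(q - v) for v in bank[label]]  # Euclidean distances
best = int(np.argmin(dists))                       # nearest stored question
answer = answers[(label, best)]
```

Restricting the search to `bank[label]` is what lets the same question receive different answers depending on the detected emotion.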
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. An intelligent voice question-answering method based on deep learning and emotion recognition is characterized by comprising the following steps:
(1) acquiring voice from a user, and processing the voice to extract corresponding text information;
(2) extracting characteristic parameters of the user's voice, and carrying out emotion analysis on the voice according to the extracted characteristic parameters to form an emotion label;
(3) inputting the text information generated in step (1) into a trained semantic representation and matching model, and matching the result, together with the emotion label obtained in step (2), against a question-answer library to obtain and output the answer to the question.
2. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 1, wherein step (1) specifically comprises the following substeps:
(1-1) acquiring a voice from a user and digitally encoding the voice to convert it into an audio map;
(1-2) using a Hamming window to perform framing and windowing processing on the audio image obtained in the step (1-1) in sequence, and performing Fourier transform on an obtained processing result to obtain a spectrogram;
and (1-3) inputting the spectrogram obtained in the step (1-2) into a trained Convolutional Neural Network (CNN) to obtain corresponding text information.
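The preprocessing of sub-steps (1-1) and (1-2), i.e. framing, Hamming windowing, and a per-frame Fourier transform into a spectrogram, can be sketched as follows on a synthetic tone (frame length and hop are assumed, not from the patent):

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
signal = np.sin(2 * np.pi * 440 * t)              # toy "audio map": a 440 Hz tone

frame_len, hop = 256, 128
window = np.hamming(frame_len)                    # Hamming window (step 1-2)
frames = np.stack([signal[i:i + frame_len] * window
                   for i in range(0, len(signal) - frame_len + 1, hop)])

spectrogram = np.abs(np.fft.rfft(frames, axis=1)) # per-frame Fourier transform
peak_bin = spectrogram.mean(axis=0).argmax()
peak_hz = peak_bin * fs / frame_len               # should land near 440 Hz
```

The resulting time-frequency matrix is what step (1-3) feeds to the CNN for text recognition.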
3. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 1 or 2, wherein step (2) specifically comprises the following sub-steps:
and (2-1) framing the audio image obtained in the step (1-1) by using a Hamming window, thereby obtaining the preprocessed audio data.
And (2-2) for each frame of audio data in the audio data preprocessed in the step (2-1), acquiring time domain features of the frame of audio data, and combining the time domain features corresponding to all frames to obtain num initial feature parameters, wherein the dimensionality of each initial feature parameter is 4.
And (2-3) carrying out dimensionality reduction and feature selection processing on the num initial feature parameters obtained in the step (2-2) by using a Principal Component Analysis (PCA) to obtain a mapping matrix A formed by the feature parameter subset.
And (2-4) calculating comprehensive probabilities P corresponding to the principal components of the audio data preprocessed in the step (2-1) according to the mapping matrix A obtained in the step (2-3), and selecting the emotion type corresponding to the maximum comprehensive probability from the comprehensive probabilities P as the emotion label of the voice.
4. The intelligent voice question-answering method based on deep learning and emotion recognition according to any one of claims 1 to 3, wherein step (2-2) comprises the following sub-steps:
(2-2-1) for each frame of audio data in the audio data preprocessed in the step (2-1), acquiring the short-time energy E_n of that frame;
(2-2-2) for each frame of audio data in the audio data preprocessed in the step (2-1), acquiring the short-time average zero-crossing rate ZCR_n of that frame;
(2-2-3) for each frame of audio data in the audio data preprocessed in the step (2-1), acquiring the formant characteristic parameter HZ_n of that frame;
(2-2-4) for each frame of audio data in the audio data preprocessed in the step (2-1), extracting the pitch period in the time domain using a short-time autocorrelation function, namely, using a window function to select d (the value range of d is 1 to 5, preferably 3) groups of signals around the nth frame of audio data and performing autocorrelation calculation, so as to obtain the pitch frequency R_n of that frame:
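Pitch extraction via short-time autocorrelation, as in sub-step (2-2-4), can be sketched on one synthetic voiced frame; the 60–400 Hz search band is an assumption for illustration:

```python
import numpy as np

fs = 8000
n = np.arange(400)
frame = np.sin(2 * np.pi * 200 * n / fs)          # one voiced frame, true pitch 200 Hz

ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation, lags >= 0
lo, hi = fs // 400, fs // 60                      # search pitch between 60 and 400 Hz
lag = lo + int(np.argmax(ac[lo:hi]))              # lag of the strongest periodic peak
R_n = fs / lag                                    # pitch frequency of the frame
```

The strongest autocorrelation peak away from lag 0 marks the pitch period, and its reciprocal (times the sample rate) gives R_n.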
5. The intelligent voice question-answering method based on deep learning and emotion recognition of claim 4, wherein the short-time energy E_n of the nth frame of audio data in step (2-2-1) is:
where x_n(m) represents the mth sample in the nth (n ∈ [1, num]) frame, after framing, of the audio map obtained in the step (1-1), and N represents the number of samples contained in one frame, which depends on the specific audio format.
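The energy formula of claim 5 appears only as an image; the sketch below uses the standard short-time definitions, E_n as the sum of squared samples of the frame and ZCR_n from sign changes, on a synthetic frame (both expressions are the conventional ones, assumed to match the patent's):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 256                                   # samples per frame (format-dependent)
x_n = rng.normal(size=N)                  # one frame x_n(m) of the framed audio map

E_n = np.sum(x_n ** 2)                    # short-time energy of the frame
ZCR_n = np.mean(np.abs(np.diff(np.sign(x_n))) / 2)  # short-time average zero-crossing rate
```

Together with the formant parameter HZ_n and pitch R_n, these give the 4-dimensional initial feature parameter of each frame used in step (2-2).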
6. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 3, wherein the step (2-3) comprises the following sub-steps:
(2-3-1) for the num 4-dimensional initial feature parameters obtained in the step (2-2), constructing a feature matrix A_(4×num) = [a_1, a_2, ..., a_num] of 4 rows and num columns as the input of collected feature data;
(2-3-2) for the feature matrix A_(4×num) = [a_1, a_2, ..., a_num] obtained in the step (2-3-1), calculating its mean value μ and covariance matrix COV_(4×num);
(2-3-3) for the covariance matrix COV_(4×num) in the step (2-3-2), calculating its eigenvalues and eigenvectors, pairing each eigenvalue with its corresponding eigenvector in coordinate form (λ_i, e_i), and sorting by eigenvalue magnitude to obtain the eigenpair sequence (λ_1, e_1), (λ_2, e_2), ..., (λ_num, e_num);
(2-3-4) for the eigenpair sequence (λ_1, e_1), (λ_2, e_2), ..., (λ_num, e_num) in the step (2-3-3), selecting the first k eigenvectors as the principal component directions (namely, the basis vectors of the reduced-dimension space) and constructing a mapping matrix A of 4 rows and k columns, where the ith column of the mapping matrix A represents the ith eigenvector and k ∈ [1, num].
7. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 3, wherein the step (2-4) comprises the following sub-steps:
(2-4-1) for different emotion types, calculating, from the original data in the corpus, the mean value μ_jk and variance σ_jk of the kth principal component of the mapping matrix A in the jth emotion type, and carrying out maximum-separability processing to obtain the discrimination capability H_k of the kth principal component over the emotion types;
wherein J is the number of emotion classes in the corpus employed, L_k represents the separation of the kth principal component across the emotion classes, M_k represents the concentration of the kth principal component within the emotion classes, and C is the permutation-and-combination function; H_k reflects the discriminative power of the principal component over the emotion classes, and the larger H_k is, the stronger the extracted principal component's ability to distinguish the emotion types;
(2-4-2) sorting the k principal components of the mapping matrix A by the discrimination capability H_k obtained in the step (2-4-1), selecting the first p principal components with the largest H_k as the pivot elements for emotion recognition, and normalizing them to obtain the normalized emotion-recognition pivots; the projections onto the mapping matrix A are summed over all principal components to obtain the score value S_k of each emotion-recognition pivot;
(2-4-3) according to the score value S_k of each emotion-recognition pivot obtained in the step (2-4-2), calculating the comprehensive probability P corresponding to each principal component in the user's voice, and selecting the emotion type with the largest comprehensive probability as the emotion label of the user's voice; wherein the comprehensive probability calculation formula is as follows:
8. the intelligent voice question-answering method based on deep learning and emotion recognition according to claim 1, wherein step (3) specifically comprises the following sub-steps:
(3-1) carrying out word segmentation preprocessing on the Chinese information in the corpus to obtain a word segmented corpus;
(3-2) encoding and feature extraction are carried out on the segmented corpus obtained in the step (3-1) using a word2vec model, so as to obtain a word matrix W of order c × v, where c is the number of words in the sentence and v is the vector dimension corresponding to each word, the dimension of the model being set to v = 300; the word matrix W is taken as the input for training the semantic representation and matching model;
and (3-3) inputting the word matrix W obtained in the step (3-2) into the semantic representation and matching model and training it to obtain the trained semantic representation and matching model;
and (3-4) after word-segmentation preprocessing of the text information generated in the step (1), using the word2vec model for encoding and feature extraction, and inputting the result into the semantic representation and matching model trained in the step (3-3) to obtain the feature vector for voice analysis;
and (3-5) calculating the Euclidean distance between the voice-analysis feature vector obtained in the step (3-4) and the existing question vectors as the vector similarity, and, in combination with the emotion label obtained in the step (2), matching question answers in the question-answer library corresponding to that emotion label to obtain and output the answer to the question.
9. The intelligent voice question-answering method based on deep learning and emotion recognition of claim 8, wherein the semantic representation and matching model in step (3-3) is trained by the following steps:
(3-3-1) using the c x v-order word matrix W obtained in the step (3-2) as the input of the CNN network;
(3-3-2) updating and optimizing the weight parameters and the bias parameters of each layer in the CNN convolutional neural network by using a back propagation algorithm to obtain an updated CNN convolutional neural network;
(3-3-3) carrying out iterative training on the CNN convolutional neural network updated in the step (3-3-2) until the loss function of the CNN convolutional neural network reaches the minimum, thereby obtaining a preliminarily trained CNN convolutional neural network;
wherein the loss function L of the CNN convolutional neural network is:
where Sample represents the total number of samples in the training set, Z represents the number of classes in the training set, t_{i,z} represents the prediction for the ith training sample of class z after input into the CNN convolutional neural network, y_{i,z} represents the true result corresponding to the ith training sample of class z, i ∈ [1, Sample], λ represents the degree of regularization, set to 0.007, and w_{i,z} represents the weight parameter when the ith training sample of class z is input into the CNN convolutional neural network; the weight parameters change as the CNN convolutional neural network is trained;
(3-3-4) for the CNN convolutional neural network preliminarily trained in the step (3-3-3), extracting the maximum feature-value matrix C after convolution of the convolution-kernel vectors using a 1-max pooling strategy, calculated according to the following formula:
C = max[c_1, c_2, ..., c_{n-h+1}]
where c_i represents the feature computed by the convolution kernel over the ith word of the sentence, produced during the training of the CNN convolutional neural network;
and (3-3-5) using the maximum feature-value matrix C obtained in the step (3-3-4) as the training input of the BiLSTM network, performing feature extraction on it, and constructing the feature vector for voice analysis, which serves as the output of the semantic representation and matching model, thereby obtaining the trained semantic representation and matching model.
10. An intelligent voice question-answering system based on deep learning and emotion recognition is characterized by comprising:
the first module is used for acquiring voice from a user and processing the voice to extract corresponding text information.
The second module is used for extracting the characteristic parameters of the voice of the user and carrying out emotion analysis on the voice according to the extracted characteristic parameters to form an emotion label;
and the third module is used for inputting the text information generated by the first module into the trained semantic representation and matching model, matching the emotion labels obtained by the second module with the question-answer library to obtain answers of the question and output the answers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111475872.XA CN114203177A (en) | 2021-12-06 | 2021-12-06 | Intelligent voice question-answering method and system based on deep learning and emotion recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114203177A true CN114203177A (en) | 2022-03-18 |
Family
ID=80650788
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111475872.XA Pending CN114203177A (en) | 2021-12-06 | 2021-12-06 | Intelligent voice question-answering method and system based on deep learning and emotion recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114203177A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115097946A (en) * | 2022-08-15 | 2022-09-23 | 汉华智能科技(佛山)有限公司 | Remote worship method, system and storage medium based on Internet of things |
CN115482837A (en) * | 2022-07-25 | 2022-12-16 | 科睿纳(河北)医疗科技有限公司 | Emotion classification method based on artificial intelligence |
CN116597821A (en) * | 2023-07-17 | 2023-08-15 | 深圳市国硕宏电子有限公司 | Intelligent customer service voice recognition method and system based on deep learning |
CN117992597A (en) * | 2024-04-03 | 2024-05-07 | 江苏微皓智能科技有限公司 | Information feedback method, device, computer equipment and computer storage medium |
CN117995174A (en) * | 2024-04-07 | 2024-05-07 | 广东实丰智能科技有限公司 | Learning type electric toy control method based on man-machine interaction |
CN118035431A (en) * | 2024-04-12 | 2024-05-14 | 青岛网信信息科技有限公司 | User emotion prediction method, medium and system in text customer service process |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||