CN114203177A - Intelligent voice question-answering method and system based on deep learning and emotion recognition

Intelligent voice question-answering method and system based on deep learning and emotion recognition

Info

Publication number
CN114203177A
CN114203177A
Authority
CN
China
Prior art keywords
emotion
voice
question
audio data
num
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111475872.XA
Other languages
Chinese (zh)
Inventor
唐卓
李虹宇
曹嵘晖
纪军刚
尹旦
宋柏森
朱纯霞
赵环
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Shenzhen Zhengtong Electronics Co Ltd
Original Assignee
Hunan University
Shenzhen Zhengtong Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University and Shenzhen Zhengtong Electronics Co Ltd
Priority to CN202111475872.XA
Publication of CN114203177A
Legal status: Pending

Links

Images

Classifications

    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/63 Speech or voice analysis techniques specially adapted for comparison or discrimination for estimating an emotional state
    • G10L2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an intelligent voice question-answering method based on deep learning and emotion recognition, which comprises the following steps: acquiring speech from a user and processing it to extract the corresponding text information; extracting characteristic parameters of the user's speech and performing emotion analysis on the speech according to the extracted parameters to form an emotion label; and inputting the generated text information into a trained semantic representation and matching model and, in combination with the obtained emotion label, matching against a question-answer library to obtain and output the answer to the question. The invention addresses two technical problems of existing intelligent question-answering systems: low accuracy caused by the limited application of deep learning algorithms to the ambiguity and complexity of natural language, and matching deviations caused by data redundancy and single-dimensional input when capturing Chinese sentence information.

Description

Intelligent voice question-answering method and system based on deep learning and emotion recognition
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an intelligent voice question-answering method and system based on deep learning and emotion recognition.
Background
With the advent of the internet age, data has grown exponentially, and people have begun to use search engines to find the information they need. The search engine is one form of intelligent question-answering system: it provides users with answers to their questions in the form of a result list. As requirements on the efficiency and quality of information retrieval have risen, research on intelligent voice question-answering systems has also advanced.
An existing intelligent voice question-answering system usually works in two steps: speech recognition is performed first, and the question is then answered according to the recognized speech. Speech recognition techniques are generally divided into three steps: feature extraction, deep neural network training, and decoding. Question-answering systems fall mainly into three categories: extractive intelligent question-answering systems, generative intelligent question-answering systems, and intelligent question-answering systems based on question-answer pairs. All of them need to convert sentences into computer-recognizable structured data by natural language processing, analyze the question, capture the keywords, and finally match against an answer library.
However, the above intelligent question-answering systems still have certain defects. First, because of the ambiguity and complexity of natural language, deep learning algorithms are not applied extensively in the natural language processing stage, so the accuracy of the question-answering system is low. Second, when capturing Chinese sentence information, data redundancy and single-dimensional input still occur, which easily causes matching deviations in the system.
Disclosure of Invention
Aiming at the above defects or improvement requirements of the prior art, the present invention provides an intelligent voice question-answering method and system based on deep learning and emotion recognition. It aims to solve the technical problems of existing intelligent question-answering systems that accuracy is low because deep learning algorithms are not applied extensively to the ambiguity and complexity of natural language, and that data redundancy and single-dimensional input when capturing Chinese sentence information easily cause matching deviations in the system.
In order to achieve the above object, according to an aspect of the present invention, there is provided an intelligent voice question-answering method based on deep learning and emotion recognition, comprising the steps of:
(1) acquiring speech from a user, and processing the speech to extract the corresponding text information;
(2) extracting characteristic parameters of the user's speech, and performing emotion analysis on the speech according to the extracted characteristic parameters to form an emotion label;
(3) inputting the text information generated in step (1) into a trained semantic representation and matching model and, in combination with the emotion label obtained in step (2), matching against a question-answer library to obtain and output the answer to the question.
Preferably, step (1) specifically comprises the following sub-steps:
(1-1) acquiring speech from a user and digitally encoding it to convert it into an audio map;
(1-2) sequentially framing and windowing the audio map obtained in step (1-1) using a Hamming window, and performing a Fourier transform on the result to obtain a spectrogram;
(1-3) inputting the spectrogram obtained in step (1-2) into a trained convolutional neural network (CNN) to obtain the corresponding text information.
Preferably, step (2) specifically comprises the following sub-steps:
(2-1) framing the audio map obtained in step (1-1) using a Hamming window, thereby obtaining preprocessed audio data;
(2-2) for each frame of the audio data preprocessed in step (2-1), acquiring the time-domain features of that frame, and combining the time-domain features of all frames to obtain num initial feature parameters, where the dimension of each initial feature parameter is 4;
(2-3) performing dimensionality reduction and feature selection on the num initial feature parameters obtained in step (2-2) using principal component analysis (PCA) to obtain a mapping matrix A formed by the selected feature parameter subset;
(2-4) calculating, according to the mapping matrix A obtained in step (2-3), the composite probabilities P corresponding to the principal components of the audio data preprocessed in step (2-1), and selecting the emotion type corresponding to the largest composite probability as the emotion label of the speech.
Preferably, step (2-2) comprises the following sub-steps:
(2-2-1) for each frame of the audio data preprocessed in step (2-1), acquiring the short-time energy En of that frame;
(2-2-2) for each frame of the audio data preprocessed in step (2-1), acquiring the short-time average zero-crossing rate ZCRn of that frame:
[formula for ZCRn given as an image in the original]
(2-2-3) for each frame of the audio data preprocessed in step (2-1), acquiring the formant characteristic parameter HZn of that frame;
(2-2-4) for each frame of the audio data preprocessed in step (2-1), extracting the pitch period in the time domain using a short-time autocorrelation function, namely selecting d groups of signals around the nth frame with a window function (d ranges from 1 to 5, preferably 3) and performing an autocorrelation calculation to obtain the pitch frequency Rn of that frame:
[formula for Rn given as an image in the original]
Preferably, the short-time energy En of the nth frame of audio data in step (2-2-1) is:
En = Σ (m = 1 to N) xn(m)^2
where xn(m) represents the mth sample of the nth frame (n ∈ [1, num]) of the framed audio map obtained in step (1-1), and N represents the number of samples contained in one frame, which depends on the specific audio format.
Preferably, step (2-3) comprises the following sub-steps:
(2-3-1) arranging the num 4-dimensional initial feature parameters obtained in step (2-2) into a 4-row, num-column feature matrix A(4×num) = [a1, a2, ..., anum], which serves as the collected feature data input;
(2-3-2) calculating the mean value μ and the covariance matrix COV(4×num) of the feature matrix A(4×num) = [a1, a2, ..., anum] obtained in step (2-3-1);
(2-3-3) calculating the eigenvalues and eigenvectors of the covariance matrix COV(4×num) in step (2-3-2), pairing each eigenvector with its eigenvalue in coordinate form (λi, ei), and ordering them by eigenvalue to obtain the eigenvector sequence (λ1, e1), (λ2, e2), ..., (λnum, enum);
(2-3-4) for the eigenvector sequence (λ1, e1), (λ2, e2), ..., (λnum, enum) in step (2-3-3), selecting the first k eigenvectors as the directions of the principal components (i.e., the basis vectors of the reduced-dimension space) and constructing a mapping matrix A with 4 rows and k columns, where the ith column of the mapping matrix A is the ith eigenvector and k ∈ [1, num].
Preferably, step (2-4) comprises the following sub-steps:
(2-4-1) for the different emotion types, calculating from the original data in the corpus the mean μjk and variance σjk of the kth principal component of the mapping matrix A within the jth emotion type, and performing maximum-separability processing to obtain the discrimination capability Hk of the kth principal component for the emotion types:
[formulas for Lk, Mk and Hk given as images in the original]
where J is the number of emotion classes in the corpus employed, Lk represents the separation of the kth principal component between the emotion classes, Mk represents the concentration of the kth principal component within the emotion classes, and C is the permutation-and-combination function; Hk reflects the discrimination capability of a principal component for the emotion classes, and the larger Hk is, the stronger the ability of the extracted principal component to distinguish the emotion types;
(2-4-2) sorting the k principal components of the mapping matrix A by the discrimination capability Hk obtained in step (2-4-1), selecting the first p principal components with the largest Hk as the pivot elements for emotion recognition, and normalizing them to obtain the normalized emotion recognition pivot elements; summing the projections of the mapping matrix A over all the principal components to obtain the score value Sk of each emotion recognition pivot element;
(2-4-3) according to the score value Sk of each emotion recognition pivot element obtained in step (2-4-2), calculating the composite probabilities P corresponding to the principal components of the user's speech, and selecting the emotion type corresponding to the largest composite probability as the emotion label of the user's speech, where the composite probability is calculated as:
[formula for the composite probability P given as an image in the original]
Preferably, step (3) specifically comprises the following sub-steps:
(3-1) performing word-segmentation preprocessing on the Chinese information in the corpus to obtain a segmented corpus;
(3-2) encoding the segmented corpus obtained in step (3-1) and extracting features with a word2vec model to obtain a word matrix W of order c × v, where c is the number of words in the sentence and v is the vector dimension of each word, with the model dimension set to v = 300; the word matrix W is taken as the input for training the semantic representation and matching model;
(3-3) inputting the word matrix W obtained in step (3-2) into the semantic representation and matching model for training, so as to obtain the trained semantic representation and matching model;
(3-4) performing word-segmentation preprocessing on the text information generated in step (1), encoding it and extracting features with the word2vec model, and inputting the result into the semantic representation and matching model trained in step (3-3) to obtain the feature vector of the speech analysis;
(3-5) calculating the Euclidean distance between the feature vector of the speech analysis obtained in step (3-4) and the existing question vectors as the vector similarity and, in combination with the emotion label obtained in step (2), matching against the question answers in the question-answer library corresponding to that emotion label to obtain and output the answer to the question.
Preferably, the semantic representation and matching model in step (3-3) is trained by the following steps:
(3-3-1) using the c × v word matrix W obtained in step (3-2) as the input of the CNN network;
(3-3-2) updating and optimizing the weight parameters and bias parameters of each layer of the CNN convolutional neural network using a back-propagation algorithm to obtain an updated CNN convolutional neural network;
(3-3-3) iteratively training the CNN convolutional neural network updated in step (3-3-2) until its loss function reaches a minimum, thereby obtaining a preliminarily trained CNN convolutional neural network;
where the loss function L of the CNN convolutional neural network is:
[formula for the loss function L given as an image in the original]
where Sample represents the total number of samples in the training set, Z represents the number of classes in the training set, t(i,z) represents the prediction result of the ith training sample of the zth class after it is input into the CNN convolutional neural network, y(i,z) represents the true result corresponding to the ith training sample of the zth class, i ∈ [1, Sample], λ represents the degree of regularization and takes the value 0.007, and w(i,z) represents the weight parameter applied when the ith training sample of the zth class is input into the CNN convolutional neural network, which changes as the CNN convolutional neural network is trained;
(3-3-4) for the CNN convolutional neural network preliminarily trained in step (3-3-3), extracting the maximum eigenvalue matrix C after convolution with the convolution kernel vectors using a 1-max pooling strategy, calculated as:
C = max[c1, c2, ..., c(n-h+1)]
where ci represents the feature computed for the ith word of the sentence by the convolution kernel, produced during the training of the CNN convolutional neural network;
(3-3-5) using the maximum eigenvalue matrix C obtained in step (3-3-4) as the training input of the BiLSTM network, performing feature extraction on the maximum eigenvalue matrix C, constructing the feature vector of the speech analysis, and taking this feature vector as the output of the semantic representation and matching model to obtain the trained semantic representation and matching model.
According to another aspect of the present invention, there is provided an intelligent voice question-answering system based on deep learning and emotion recognition, comprising:
a first module, configured to acquire speech from a user and process the speech to extract the corresponding text information;
a second module, configured to extract characteristic parameters of the user's speech and perform emotion analysis on the speech according to the extracted characteristic parameters to form an emotion label; and
a third module, configured to input the text information generated by the first module into a trained semantic representation and matching model and, in combination with the emotion label obtained by the second module, match against a question-answer library to obtain and output the answer to the question.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) By adopting step (3), the invention integrates a convolutional neural network with a BiLSTM network to construct the semantic representation and matching model that yields the feature vector of the speech analysis, thereby applying a more complete deep learning algorithm than existing question-answering systems.
(2) Because step (2) performs emotion analysis on the user's speech and forms an emotion label, the input dimensions of the question-answering system are enriched, which can solve the data redundancy and single-dimension problems of existing question-answering systems.
Drawings
FIG. 1 is a schematic flow chart of an intelligent voice question-answering system based on deep learning and emotion recognition according to the present invention;
FIG. 2 is a detailed flow chart of the intelligent voice question-answering system based on deep learning and emotion recognition.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1 and 2, the present invention provides an intelligent voice question-answering method based on deep learning and emotion recognition, comprising the following steps:
(1) Acquiring speech from a user, and processing the speech to extract the corresponding text information.
Further, step (1) specifically comprises the following sub-steps:
(1-1) acquiring speech from a user and digitally encoding it to convert it into an audio map;
(1-2) sequentially framing and windowing the audio map obtained in step (1-1) using a Hamming window, and performing a Fourier transform on the result to obtain a spectrogram;
(1-3) inputting the spectrogram obtained in step (1-2) into a trained convolutional neural network (CNN) to obtain the corresponding text information.
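As an illustration of sub-step (1-2), the following minimal sketch frames a signal with a Hamming window and applies a Fourier transform per frame to obtain a magnitude spectrogram. The function name, frame length, hop size and sampling rate are assumptions for the example and are not specified by the patent.

```python
import numpy as np

def spectrogram(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Return a (num_frames, frame_len // 2 + 1) magnitude spectrogram."""
    window = np.hamming(frame_len)                      # Hamming window for framing/windowing
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))          # Fourier transform of each frame

if __name__ == "__main__":
    x = np.random.randn(16000)      # one second of 16 kHz audio (dummy data)
    S = spectrogram(x)
    print(S.shape)                  # e.g. (98, 201)
```

The resulting spectrogram would then be fed to the trained CNN of sub-step (1-3) for speech-to-text conversion.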
(2) Extracting characteristic parameters of the user's speech, and performing emotion analysis on the speech according to the extracted characteristic parameters to form an emotion label.
Further, step (2) specifically comprises the following sub-steps:
(2-1) framing the audio map obtained in step (1-1) using a Hamming window (each framed audio recording correspondingly produces a one-dimensional frame array), thereby obtaining preprocessed audio data.
(2-2) for each frame of the audio data preprocessed in step (2-1), acquiring the time-domain features of that frame (including the average pronunciation rate, pitch frequency, short-time energy change, short-time average zero-crossing rate and formant frequency), and combining the time-domain features of all frames to obtain num initial feature parameters (each of dimension 4).
Specifically, the processing of the audio data in this step includes the following sub-steps:
(2-2-1) for each frame of the audio data preprocessed in step (2-1), acquiring the short-time energy En of that frame.
Specifically, let xn(m) be the mth sample of the nth frame (n ∈ [1, num]) of the framed audio map obtained in step (1-1); the short-time energy En of the nth frame of audio data is then:
En = Σ (m = 1 to N) xn(m)^2
where N represents the number of samples contained in a frame, which depends on the particular audio format (e.g., AAC fixes a frame at 1024 samples, while the MP3 format uses 1152).
(2-2-2) for each frame of the audio data preprocessed in step (2-1), acquiring the short-time average zero-crossing rate ZCRn of that frame:
[formula for ZCRn given as an image in the original]
(2-2-3) for each frame of the audio data preprocessed in step (2-1), acquiring the formant characteristic parameter HZn of that frame.
(2-2-4) for each frame of the audio data preprocessed in step (2-1), extracting the pitch period in the time domain using a short-time autocorrelation function, namely selecting d groups of signals around the nth frame with a window function (d ranges from 1 to 5, preferably 3) and performing an autocorrelation calculation to obtain the pitch frequency Rn of that frame:
[formula for Rn given as an image in the original]
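The sketch below illustrates sub-steps (2-2-1) through (2-2-4) for a single frame: short-time energy, a zero-crossing count, a rough formant estimate and a pitch estimate taken from the peak of the short-time autocorrelation. The spectral-peak formant proxy and the pitch search range are simplifying assumptions for illustration, not methods prescribed by the patent.

```python
import numpy as np

def frame_features(frame: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return a 4-dimensional feature vector for one audio frame."""
    n = len(frame)
    energy = np.sum(frame ** 2)                            # short-time energy En
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frame))))    # zero-crossing count (ZCRn up to normalization)
    spectrum = np.abs(np.fft.rfft(frame))
    formant = np.argmax(spectrum) * sr / n                 # crude formant proxy in Hz (spectral peak)
    ac = np.correlate(frame, frame, mode="full")[n - 1:]   # short-time autocorrelation, lags 0..n-1
    lo, hi = sr // 400, sr // 50                           # search pitch between ~50 Hz and ~400 Hz
    pitch = sr / (lo + np.argmax(ac[lo:hi]))               # pitch frequency Rn from the autocorrelation peak
    return np.array([energy, zcr, formant, pitch])
```

Applying frame_features to every frame and stacking the results gives the num 4-dimensional initial feature parameters used in step (2-3).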
(2-3) performing dimensionality reduction and feature selection in sequence on the num initial feature parameters obtained in step (2-2) using principal component analysis (PCA) to obtain a mapping matrix A formed by the selected feature parameter subset.
Specifically, this step includes the following sub-steps:
(2-3-1) arranging the num 4-dimensional initial feature parameters obtained in step (2-2) into a 4-row, num-column feature matrix A(4×num) = [a1, a2, ..., anum], which serves as the collected feature data input;
(2-3-2) calculating the mean value μ and the covariance matrix COV(4×num) of the feature matrix A(4×num) = [a1, a2, ..., anum] obtained in step (2-3-1);
(2-3-3) calculating the eigenvalues and eigenvectors of the covariance matrix COV(4×num) in step (2-3-2), pairing each eigenvector with its eigenvalue in coordinate form (λi, ei), and ordering them by eigenvalue to obtain the eigenvector sequence (λ1, e1), (λ2, e2), ..., (λnum, enum);
(2-3-4) for the eigenvector sequence (λ1, e1), (λ2, e2), ..., (λnum, enum) in step (2-3-3), selecting the first k (k ∈ [1, num]) eigenvectors as the directions of the principal components (i.e., the basis vectors of the reduced-dimension space) and constructing a mapping matrix A with 4 rows and k columns, where the ith column of the mapping matrix A is the ith eigenvector.
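A minimal sketch of the PCA construction in step (2-3), under the assumption that the covariance is taken over the num frames of the 4-dimensional features (so the eigenvectors are 4-dimensional and the mapping matrix A has 4 rows and k columns); the function name and the choice of k are illustrative.

```python
import numpy as np

def pca_mapping(features: np.ndarray, k: int) -> np.ndarray:
    """features: (4, num) matrix of per-frame feature vectors; returns mapping matrix A of shape (4, k)."""
    centered = features - features.mean(axis=1, keepdims=True)   # subtract the mean (step 2-3-2)
    cov = np.cov(centered)                                       # 4 x 4 covariance over the num frames
    eigvals, eigvecs = np.linalg.eigh(cov)                       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]                        # indices of the k largest eigenvalues
    return eigvecs[:, order]                                     # columns are the principal directions

feats = np.random.randn(4, 200)      # stand-in for num = 200 initial feature parameters
A = pca_mapping(feats, k=2)
print(A.shape)                       # (4, 2)
reduced = A.T @ feats                # k x num projection onto the principal components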
(2-4) calculating, according to the mapping matrix A obtained in step (2-3), the composite probabilities P corresponding to the principal components of the audio data preprocessed in step (2-1), and selecting the emotion type corresponding to the largest composite probability as the emotion label of the speech.
This step comprises the following sub-steps:
(2-4-1) for the different emotion types, calculating from the original data in the corpus the mean μjk and variance σjk of the kth (k ∈ [1, num]) principal component of the mapping matrix A within the jth emotion type, and performing maximum-separability processing to obtain the discrimination capability Hk of the kth principal component for the emotion types:
[formulas for Lk, Mk and Hk given as images in the original]
where J is the number of emotion classes in the corpus employed, Lk represents the separation of the kth principal component between the emotion classes, Mk represents the concentration of the kth principal component within the emotion classes, and C is the permutation-and-combination function; Hk reflects the discrimination capability of a principal component for the emotion classes, and the larger Hk is, the stronger the ability of the extracted principal component to distinguish the emotion types.
(2-4-2) sorting the k principal components of the mapping matrix A by the discrimination capability Hk obtained in step (2-4-1), selecting the first p principal components with the largest Hk as the pivot elements for emotion recognition, and normalizing them to obtain the normalized emotion recognition pivot elements; summing the projections of the mapping matrix A over all the principal components to obtain the score value Sk of each emotion recognition pivot element.
(2-4-3) according to the score value Sk of each emotion recognition pivot element obtained in step (2-4-2), calculating the composite probabilities P corresponding to the principal components of the user's speech, and selecting the emotion type corresponding to the largest composite probability as the emotion label of the user's speech.
The composite probability is calculated as:
[formula for the composite probability P given as an image in the original]
(3) Inputting the text information generated in step (1) into a trained semantic representation and matching model and, in combination with the emotion label obtained in step (2), matching against a question-answer library to obtain and output the answer to the question.
Further, step (3) specifically comprises the following sub-steps:
(3-1) performing word-segmentation preprocessing on the Chinese information in the corpus to obtain a segmented corpus.
(3-2) encoding the segmented corpus obtained in step (3-1) and extracting features with a word2vec model to obtain a word matrix W of order c × v, where c is the number of words in the sentence and v is the vector dimension of each word, with the model dimension set to v = 300. The word matrix W is the input used to train the semantic representation and matching model.
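A sketch of sub-steps (3-1) and (3-2) under common tooling assumptions: jieba for Chinese word segmentation and gensim's word2vec implementation for the 300-dimensional word vectors. The toy corpus, the function name word_matrix and all hyper-parameters other than the vector dimension are placeholders.

```python
import jieba
import numpy as np
from gensim.models import Word2Vec

corpus = ["今天天气怎么样", "请问营业时间是几点"]      # toy corpus of two questions
tokenized = [jieba.lcut(s) for s in corpus]            # word-segmentation preprocessing (step 3-1)

# vector_size = 300 follows the patent's v = 300; the other hyper-parameters are assumptions
w2v = Word2Vec(sentences=tokenized, vector_size=300, window=5, min_count=1)

def word_matrix(sentence: str) -> np.ndarray:
    """Return the c x v word matrix W for one sentence (c segmented words, v = 300)."""
    words = jieba.lcut(sentence)
    return np.stack([w2v.wv[w] if w in w2v.wv else np.zeros(300) for w in words])

W = word_matrix("营业时间是几点")
print(W.shape)      # (c, 300)
```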
(3-3) inputting the word matrix W obtained in step (3-2) into the semantic representation and matching model for training (the model can analyze the semantics and structure of the question pair and construct the feature vector of the speech analysis), so as to obtain the trained semantic representation and matching model.
Further, the semantic representation and matching model in step (3-3) is trained by the following steps:
(3-3-1) using the c × v word matrix W obtained in step (3-2) as the input of the CNN network;
(3-3-2) updating and optimizing the weight parameters and bias parameters of each layer of the CNN convolutional neural network using a back-propagation algorithm to obtain an updated CNN convolutional neural network;
specifically, the weight parameters are initialized with random values drawn from a truncated normal distribution with a standard deviation of 0.1, and the bias parameters are initialized to 0;
(3-3-3) iteratively training the CNN convolutional neural network updated in step (3-3-2) until its loss function reaches a minimum, thereby obtaining a preliminarily trained CNN convolutional neural network.
The loss function L of the CNN convolutional neural network is:
[formula for the loss function L given as an image in the original]
where Sample represents the total number of samples in the training set, Z represents the number of classes in the training set, t(i,z) represents the prediction result of the ith training sample of the zth class after it is input into the CNN convolutional neural network, y(i,z) represents the true result corresponding to the ith training sample of the zth class, i ∈ [1, Sample], λ represents the degree of regularization and takes the value 0.007, and w(i,z) represents the weight parameter applied when the ith training sample of the zth class is input into the CNN convolutional neural network, which changes as the CNN convolutional neural network is trained.
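A minimal training-loop sketch for steps (3-3-2) and (3-3-3), assuming PyTorch, an SGD optimizer, a model that ends in a classification layer producing Z logits, and weight decay as the realization of the L2 regularization with the patent's λ = 0.007; the optimizer choice, learning rate and epoch count are assumptions, not part of the patent.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 10, lam: float = 0.007):
    """model is assumed to map a batch of word matrices to Z class logits."""
    criterion = nn.CrossEntropyLoss()                 # cross-entropy part of the loss
    # weight_decay adds the L2 regularization term with lambda = 0.007
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam)
    for _ in range(epochs):
        for W, y in loader:                           # word matrices and class labels
            optimizer.zero_grad()
            loss = criterion(model(W), y)
            loss.backward()                           # back-propagation (step 3-3-2)
            optimizer.step()
```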
(3-3-4) for the CNN convolutional neural network preliminarily trained in step (3-3-3), extracting the maximum eigenvalue matrix C after convolution with the convolution kernel vectors using a 1-max pooling strategy, calculated as:
C = max[c1, c2, ..., c(n-h+1)]
where ci represents the feature computed for the ith word of the sentence by the convolution kernel, produced during the training of the CNN convolutional neural network.
(3-3-5) using the maximum eigenvalue matrix C obtained in step (3-3-4) as the training input of the BiLSTM network, performing feature extraction on the maximum eigenvalue matrix C, constructing the feature vector of the speech analysis, and taking this feature vector as the output of the semantic representation and matching model to obtain the trained semantic representation and matching model.
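The following sketch shows one plausible shape of the semantic representation and matching model of steps (3-3-1) through (3-3-5): 1-D convolutions over the c × 300 word matrix, 1-max pooling of each convolved feature map, and a BiLSTM that produces the sentence feature vector. The class name, number of kernels, kernel widths and hidden size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    def __init__(self, v: int = 300, channels: int = 128, hidden: int = 128):
        super().__init__()
        # several kernel widths h; each yields one 1-max-pooled feature vector
        self.convs = nn.ModuleList(nn.Conv1d(v, channels, h) for h in (2, 3, 4))
        self.bilstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)

    def forward(self, W: torch.Tensor) -> torch.Tensor:
        x = W.transpose(1, 2)                                    # (batch, c, v) -> (batch, v, c) for Conv1d
        pooled = [conv(x).amax(dim=2) for conv in self.convs]    # 1-max pooling over each convolved sequence
        seq = torch.stack(pooled, dim=1)                         # (batch, 3, channels)
        out, _ = self.bilstm(seq)                                # BiLSTM over the pooled features
        return out[:, -1]                                        # sentence feature vector

if __name__ == "__main__":
    W = torch.randn(1, 10, 300)              # one sentence of 10 words
    print(SemanticEncoder()(W).shape)        # torch.Size([1, 256])
```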
(3-4) performing word-segmentation preprocessing on the text information generated in step (1), encoding it and extracting features with the word2vec model, and inputting the result into the semantic representation and matching model trained in step (3-3) to obtain the feature vector of the speech analysis.
(3-5) calculating the Euclidean distance between the feature vector of the speech analysis obtained in step (3-4) and the existing question vectors as the vector similarity and, in combination with the emotion label obtained in step (2), matching against the question answers in the question-answer library corresponding to that emotion label to obtain and output the answer to the question.
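A sketch of the matching in step (3-5), assuming a hypothetical in-memory question-answer library keyed by emotion label; the library structure, vector dimension, example entries and function name are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical question-answer library: each entry is (question, question feature vector, answer)
qa_library = {
    "neutral": [("营业时间是几点", rng.standard_normal(256), "每天9:00至18:00营业"),
                ("怎么退货", rng.standard_normal(256), "请联系客服办理退货")],
    "angry": [("怎么投诉", rng.standard_normal(256), "非常抱歉，已为您转接人工客服")],
}

def match_answer(query_vec: np.ndarray, emotion: str) -> str:
    """Return the answer whose question vector has the smallest Euclidean distance to query_vec."""
    entries = qa_library[emotion]
    dists = [np.linalg.norm(query_vec - vec) for _, vec, _ in entries]
    return entries[int(np.argmin(dists))][2]

print(match_answer(rng.standard_normal(256), "neutral"))
```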
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An intelligent voice question-answering method based on deep learning and emotion recognition, characterized by comprising the following steps:
(1) acquiring speech from a user, and processing the speech to extract the corresponding text information;
(2) extracting characteristic parameters of the user's speech, and performing emotion analysis on the speech according to the extracted characteristic parameters to form an emotion label;
(3) inputting the text information generated in step (1) into a trained semantic representation and matching model and, in combination with the emotion label obtained in step (2), matching against a question-answer library to obtain and output the answer to the question.
2. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 1, wherein step (1) specifically comprises the following sub-steps:
(1-1) acquiring speech from a user and digitally encoding it to convert it into an audio map;
(1-2) sequentially framing and windowing the audio map obtained in step (1-1) using a Hamming window, and performing a Fourier transform on the result to obtain a spectrogram;
(1-3) inputting the spectrogram obtained in step (1-2) into a trained convolutional neural network (CNN) to obtain the corresponding text information.
3. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 1 or 2, wherein step (2) specifically comprises the following sub-steps:
(2-1) framing the audio map obtained in step (1-1) using a Hamming window, thereby obtaining preprocessed audio data;
(2-2) for each frame of the audio data preprocessed in step (2-1), acquiring the time-domain features of that frame, and combining the time-domain features of all frames to obtain num initial feature parameters, where the dimension of each initial feature parameter is 4;
(2-3) performing dimensionality reduction and feature selection on the num initial feature parameters obtained in step (2-2) using principal component analysis (PCA) to obtain a mapping matrix A formed by the selected feature parameter subset;
(2-4) calculating, according to the mapping matrix A obtained in step (2-3), the composite probabilities P corresponding to the principal components of the audio data preprocessed in step (2-1), and selecting the emotion type corresponding to the largest composite probability as the emotion label of the speech.
4. The intelligent voice question-answering method based on deep learning and emotion recognition according to any one of claims 1 to 3, wherein step (2-2) comprises the following sub-steps:
(2-2-1) for each frame of the audio data preprocessed in step (2-1), acquiring the short-time energy En of that frame;
(2-2-2) for each frame of the audio data preprocessed in step (2-1), acquiring the short-time average zero-crossing rate ZCRn of that frame:
[formula for ZCRn given as an image in the original]
(2-2-3) for each frame of the audio data preprocessed in step (2-1), acquiring the formant characteristic parameter HZn of that frame;
(2-2-4) for each frame of the audio data preprocessed in step (2-1), extracting the pitch period in the time domain using a short-time autocorrelation function, namely selecting d groups of signals around the nth frame with a window function (d ranges from 1 to 5, preferably 3) and performing an autocorrelation calculation to obtain the pitch frequency Rn of that frame:
[formula for Rn given as an image in the original]
5. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 4, wherein the short-time energy En of the nth frame of audio data in step (2-2-1) is:
En = Σ (m = 1 to N) xn(m)^2
where xn(m) represents the mth sample of the nth frame (n ∈ [1, num]) of the framed audio map obtained in step (1-1), and N represents the number of samples contained in one frame, which depends on the specific audio format.
6. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 3, wherein step (2-3) comprises the following sub-steps:
(2-3-1) arranging the num 4-dimensional initial feature parameters obtained in step (2-2) into a 4-row, num-column feature matrix A(4×num) = [a1, a2, ..., anum], which serves as the collected feature data input;
(2-3-2) calculating the mean value μ and the covariance matrix COV(4×num) of the feature matrix A(4×num) = [a1, a2, ..., anum] obtained in step (2-3-1);
(2-3-3) calculating the eigenvalues and eigenvectors of the covariance matrix COV(4×num) in step (2-3-2), pairing each eigenvector with its eigenvalue in coordinate form (λi, ei), and ordering them by eigenvalue to obtain the eigenvector sequence (λ1, e1), (λ2, e2), ..., (λnum, enum);
(2-3-4) for the eigenvector sequence (λ1, e1), (λ2, e2), ..., (λnum, enum) in step (2-3-3), selecting the first k eigenvectors as the directions of the principal components (i.e., the basis vectors of the reduced-dimension space) and constructing a mapping matrix A with 4 rows and k columns, where the ith column of the mapping matrix A is the ith eigenvector and k ∈ [1, num].
7. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 3, wherein step (2-4) comprises the following sub-steps:
(2-4-1) for the different emotion types, calculating from the original data in the corpus the mean μjk and variance σjk of the kth principal component of the mapping matrix A within the jth emotion type, and performing maximum-separability processing to obtain the discrimination capability Hk of the kth principal component for the emotion types:
[formulas for Lk, Mk and Hk given as images in the original]
where J is the number of emotion classes in the corpus employed, Lk represents the separation of the kth principal component between the emotion classes, Mk represents the concentration of the kth principal component within the emotion classes, and C is the permutation-and-combination function; Hk reflects the discrimination capability of a principal component for the emotion classes, and the larger Hk is, the stronger the ability of the extracted principal component to distinguish the emotion types;
(2-4-2) sorting the k principal components of the mapping matrix A by the discrimination capability Hk obtained in step (2-4-1), selecting the first p principal components with the largest Hk as the pivot elements for emotion recognition, and normalizing them to obtain the normalized emotion recognition pivot elements; summing the projections of the mapping matrix A over all the principal components to obtain the score value Sk of each emotion recognition pivot element;
(2-4-3) according to the score value Sk of each emotion recognition pivot element obtained in step (2-4-2), calculating the composite probabilities P corresponding to the principal components of the user's speech, and selecting the emotion type corresponding to the largest composite probability as the emotion label of the user's speech, where the composite probability is calculated as:
[formula for the composite probability P given as an image in the original]
8. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 1, wherein step (3) specifically comprises the following sub-steps:
(3-1) performing word-segmentation preprocessing on the Chinese information in the corpus to obtain a segmented corpus;
(3-2) encoding the segmented corpus obtained in step (3-1) and extracting features with a word2vec model to obtain a word matrix W of order c × v, where c is the number of words in the sentence and v is the vector dimension of each word, with the model dimension set to v = 300, the word matrix W being taken as the input for training the semantic representation and matching model;
(3-3) inputting the word matrix W obtained in step (3-2) into the semantic representation and matching model for training, so as to obtain the trained semantic representation and matching model;
(3-4) performing word-segmentation preprocessing on the text information generated in step (1), encoding it and extracting features with the word2vec model, and inputting the result into the semantic representation and matching model trained in step (3-3) to obtain the feature vector of the speech analysis;
(3-5) calculating the Euclidean distance between the feature vector of the speech analysis obtained in step (3-4) and the existing question vectors as the vector similarity and, in combination with the emotion label obtained in step (2), matching against the question answers in the question-answer library corresponding to that emotion label to obtain and output the answer to the question.
9. The intelligent voice question-answering method based on deep learning and emotion recognition according to claim 8, wherein the semantic representation and matching model in step (3-3) is trained by the following steps:
(3-3-1) using the c × v word matrix W obtained in step (3-2) as the input of the CNN network;
(3-3-2) updating and optimizing the weight parameters and bias parameters of each layer of the CNN convolutional neural network using a back-propagation algorithm to obtain an updated CNN convolutional neural network;
(3-3-3) iteratively training the CNN convolutional neural network updated in step (3-3-2) until its loss function reaches a minimum, thereby obtaining a preliminarily trained CNN convolutional neural network;
where the loss function L of the CNN convolutional neural network is:
[formula for the loss function L given as an image in the original]
where Sample represents the total number of samples in the training set, Z represents the number of classes in the training set, t(i,z) represents the prediction result of the ith training sample of the zth class after it is input into the CNN convolutional neural network, y(i,z) represents the true result corresponding to the ith training sample of the zth class, i ∈ [1, Sample], λ represents the degree of regularization and takes the value 0.007, and w(i,z) represents the weight parameter applied when the ith training sample of the zth class is input into the CNN convolutional neural network, which changes as the CNN convolutional neural network is trained;
(3-3-4) for the CNN convolutional neural network preliminarily trained in step (3-3-3), extracting the maximum eigenvalue matrix C after convolution with the convolution kernel vectors using a 1-max pooling strategy, calculated as:
C = max[c1, c2, ..., c(n-h+1)]
where ci represents the feature computed for the ith word of the sentence by the convolution kernel, produced during the training of the CNN convolutional neural network;
(3-3-5) using the maximum eigenvalue matrix C obtained in step (3-3-4) as the training input of the BiLSTM network, performing feature extraction on the maximum eigenvalue matrix C, constructing the feature vector of the speech analysis, and taking this feature vector as the output of the semantic representation and matching model to obtain the trained semantic representation and matching model.
10. An intelligent voice question-answering system based on deep learning and emotion recognition, characterized by comprising:
a first module, configured to acquire speech from a user and process the speech to extract the corresponding text information;
a second module, configured to extract characteristic parameters of the user's speech and perform emotion analysis on the speech according to the extracted characteristic parameters to form an emotion label; and
a third module, configured to input the text information generated by the first module into a trained semantic representation and matching model and, in combination with the emotion label obtained by the second module, match against a question-answer library to obtain and output the answer to the question.
CN202111475872.XA 2021-12-06 2021-12-06 Intelligent voice question-answering method and system based on deep learning and emotion recognition Pending CN114203177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111475872.XA CN114203177A (en) 2021-12-06 2021-12-06 Intelligent voice question-answering method and system based on deep learning and emotion recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111475872.XA CN114203177A (en) 2021-12-06 2021-12-06 Intelligent voice question-answering method and system based on deep learning and emotion recognition

Publications (1)

Publication Number Publication Date
CN114203177A true CN114203177A (en) 2022-03-18

Family

ID=80650788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111475872.XA Pending CN114203177A (en) 2021-12-06 2021-12-06 Intelligent voice question-answering method and system based on deep learning and emotion recognition

Country Status (1)

Country Link
CN (1) CN114203177A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482837A (en) * 2022-07-25 2022-12-16 科睿纳(河北)医疗科技有限公司 Emotion classification method based on artificial intelligence
CN115482837B (en) * 2022-07-25 2023-04-28 科睿纳(河北)医疗科技有限公司 Emotion classification method based on artificial intelligence
CN115097946A (en) * 2022-08-15 2022-09-23 汉华智能科技(佛山)有限公司 Remote worship method, system and storage medium based on Internet of things
CN116597821A (en) * 2023-07-17 2023-08-15 深圳市国硕宏电子有限公司 Intelligent customer service voice recognition method and system based on deep learning
CN117992597A (en) * 2024-04-03 2024-05-07 江苏微皓智能科技有限公司 Information feedback method, device, computer equipment and computer storage medium
CN117992597B (en) * 2024-04-03 2024-06-07 江苏微皓智能科技有限公司 Information feedback method, device, computer equipment and computer storage medium
CN117995174A (en) * 2024-04-07 2024-05-07 广东实丰智能科技有限公司 Learning type electric toy control method based on man-machine interaction
CN118035431A (en) * 2024-04-12 2024-05-14 青岛网信信息科技有限公司 User emotion prediction method, medium and system in text customer service process

Similar Documents

Publication Publication Date Title
US10515292B2 (en) Joint acoustic and visual processing
CN114203177A (en) Intelligent voice question-answering method and system based on deep learning and emotion recognition
CN113780012A (en) Depression interview conversation generation method based on pre-training language model
CN111125333A (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
Zhang Unsupervised speech processing with applications to query-by-example spoken term detection
CN113569553A (en) Sentence similarity judgment method based on improved Adaboost algorithm
Huang et al. Speech emotion recognition using convolutional neural network with audio word-based embedding
Almekhlafi et al. A classification benchmark for Arabic alphabet phonemes with diacritics in deep neural networks
Zhang Ideological and political empowering english teaching: ideological education based on artificial intelligence in classroom emotion recognition
CN114937465A (en) Speech emotion recognition method based on self-supervision learning and computer equipment
Hazen et al. Topic modeling for spoken documents using only phonetic information
CN107562907B (en) Intelligent lawyer expert case response device
Lee Discovering linguistic structures in speech: Models and applications
Pujari et al. A survey on deep learning based lip-reading techniques
CN112749567A (en) Question-answering system based on reality information environment knowledge graph
Farooq et al. Mispronunciation detection in articulation points of Arabic letters using machine learning
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
Shi et al. Construction of english pronunciation judgment and detection model based on deep learning neural networks data stream fusion
Alishahi et al. ZR-2021VG: Zero-resource speech challenge, visually-grounded language modelling track
Gündogdu Keyword search for low resource languages
CN114461779A (en) Case writing element extraction method
Karakasidis Comparison of New Curriculum Criteria for End-to-End ASR
Zhang Research on the Application of Speech Database based on Emotional Feature Extraction in International Chinese Education and Teaching
Alashban et al. Language effect on speaker gender classification using deep learning
Al-Rami et al. A framework for pronunciation error detection and correction for non-native Arab speakers of English language

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination