WO2021051577A1 - Speech emotion recognition method, apparatus, device and storage medium - Google Patents

Speech emotion recognition method, apparatus, device and storage medium

Info

Publication number
WO2021051577A1
Authority
WO
WIPO (PCT)
Prior art keywords
emotion recognition
training
speech
emotion
recognition model
Prior art date
Application number
PCT/CN2019/117886
Other languages
English (en)
French (fr)
Inventor
占小杰
方豪
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051577A1 publication Critical patent/WO2021051577A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a voice emotion recognition method, device, equipment and storage medium.
  • Speech emotion recognition and prediction is a strongly subjective problem, and its judgment must be based on context.
  • Existing speech emotion recognition directly gives an emotion prediction for a whole speech segment, and the segment duration depends on how long the speaker talks continuously.
  • However, even within a single sentence the emotion of the speech fluctuates, so an emotion judgment based on the whole segment carries a large error.
  • To address this, the present application provides a voice emotion recognition method, device, equipment, and storage medium to solve the problem in the prior art that emotion recognition on speech segments has a large error.
  • A first aspect of this application provides a voice emotion recognition method, which includes the following steps:
  • acquiring a speech segment to be recognized;
  • preprocessing the acquired speech segment to be recognized, including: performing framing processing on the speech segment to be recognized to obtain multiple frames of speech;
  • processing the multiple frames of speech with a pre-trained emotion recognition model to obtain a plurality of emotion recognition results, each emotion recognition result corresponding to one frame of speech or a set number of frames of speech;
  • obtaining the emotion corresponding to the speech segment to be recognized according to the plurality of emotion recognition results.
  • A second aspect of the present application provides a voice emotion recognition device, including: a voice acquisition module, a preprocessing module, an emotion recognition module, and an emotion acquisition module. The voice acquisition module is used to acquire the speech segment to be recognized; the preprocessing module is used to preprocess the acquired speech segment to obtain multiple frames of speech; the emotion recognition module is used to process the preprocessed speech segment with the pre-trained emotion recognition model to obtain multiple emotion recognition results, each corresponding to one frame of speech or a set number of frames of speech; and the emotion acquisition module is used to obtain the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
  • A third aspect of the present application provides an electronic device, which includes a processor and a memory. The memory contains a voice emotion recognition program, and the voice emotion recognition method described above is realized when the program is executed by the processor.
  • A fourth aspect of the present application provides a computer non-volatile readable storage medium. The storage medium contains a voice emotion recognition program, and the voice emotion recognition method described above is realized when the program is executed by a processor.
  • This application obtains, through the emotion recognition model, the emotion corresponding to each frame or every few frames of speech in a speech segment, reducing the granularity of emotion recognition to the millisecond level. This comes closer to real-time, continuous prediction of the emotion of the speech segment and improves the accuracy of speech emotion recognition.
  • Figure 1 is a schematic flow diagram of the voice emotion recognition method provided by this application.
  • Figure 2 is a schematic diagram of the voice emotion recognition device in this application.
  • FIG. 1 is a schematic flow diagram of the voice emotion recognition method provided by this application. As shown in Figure 1, the voice emotion recognition method provided by this application includes the following steps:
  • Step S1: Obtain a speech segment to be recognized, which is a speech segment of arbitrary duration obtained when the speech recognition system performs voice endpoint detection on the speaker;
  • Step S2: Preprocess the acquired speech segment to be recognized, including: performing framing processing on the speech segment to obtain multiple frames of speech, and extracting a feature vector from each frame of speech for further processing;
  • Step S3: Process the multiple frames of speech with the pre-trained emotion recognition model to obtain multiple emotion recognition results, each result corresponding to one frame of speech or a set number of frames of speech. For example, a 2 s speech segment to be recognized is divided, with frame shifting, into 200 frames of speech, each 25 ms long; one emotion recognition result can be produced per frame of speech, giving 200 results, or one result per 5 frames of speech, giving 40 results;
  • Step S4: Obtain the emotion corresponding to the speech segment to be recognized according to the multiple output emotion recognition results.
  • By obtaining, through the emotion recognition model, the emotion corresponding to each frame or every few frames of speech, the present application reduces the granularity of emotion recognition to the millisecond level, which is closer to real-time continuous prediction of the emotion of the speech segment and improves the accuracy of speech emotion recognition.
  • It should be noted that when an emotion recognition result corresponds to a set number of frames of speech, too large a number of consecutive frames may reduce the recognition effect. A threshold on the number of consecutive frames is therefore determined for each application scenario; for example, when the set number is 8, the result tracks the emotional expression of that scenario closely and the recognition effect is best.
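  • As an illustration of the framing step only, the following sketch splits a waveform into overlapping frames; the 25 ms frame length and 15 ms frame shift follow values given later in the description, while the function name, sampling rate, and array values are assumptions made for this example and are not part of the application.

    import numpy as np

    def frame_signal(samples: np.ndarray, sample_rate: int,
                     frame_ms: float = 25.0, shift_ms: float = 15.0) -> np.ndarray:
        """Split a 1-D waveform into overlapping frames (frame length / frame shift per the description)."""
        frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
        shift_len = int(sample_rate * shift_ms / 1000)   # e.g. 240 samples at 16 kHz
        n_frames = 1 + max(0, (len(samples) - frame_len) // shift_len)
        frames = np.stack([samples[i * shift_len: i * shift_len + frame_len]
                           for i in range(n_frames)])
        return frames                                    # shape: (n_frames, frame_len)

    # Example: a 2 s segment at an assumed 16 kHz sampling rate
    segment = np.random.randn(2 * 16000).astype(np.float32)
    print(frame_signal(segment, 16000).shape)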
  • The feature vector extracted for each frame of speech includes the zero-crossing rate (ZCR), short-term energy, short-term entropy of energy, spectral centroid and spread, spectral entropy, spectral flux, spectral roll-off, Mel-frequency cepstral coefficients (MFCC), and harmonic ratio and pitch, for a total of 34-dimensional low-level descriptors (LLDs).
  • The feature matrix obtained for a speech segment therefore has dimension N*34, where N is the number of frames and 50 ≤ N ≤ 800.
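  • The following is a minimal sketch of per-frame feature extraction, assuming the open-source librosa library (the application does not name any particular toolkit); only a subset of the 34 LLDs listed above is computed, and the file name and frame parameters are placeholders.

    import librosa
    import numpy as np

    def frame_features(y: np.ndarray, sr: int,
                       frame_len: int = 400, hop: int = 240) -> np.ndarray:
        """Return an (n_frames, n_features) matrix with a subset of the LLDs named in the text."""
        zcr      = librosa.feature.zero_crossing_rate(y, frame_length=frame_len, hop_length=hop)
        energy   = librosa.feature.rms(y=y, frame_length=frame_len, hop_length=hop)
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame_len, hop_length=hop)
        spread   = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=frame_len, hop_length=hop)
        rolloff  = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=frame_len, hop_length=hop)
        mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=frame_len, hop_length=hop)
        feats = np.vstack([zcr, energy, centroid, spread, rolloff, mfcc])  # (n_features, n_frames)
        return feats.T                                                     # (n_frames, n_features)

    y, sr = librosa.load("utterance.wav", sr=16000)   # placeholder file name
    print(frame_features(y, sr).shape)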
  • Before the multi-frame speech is processed with the pre-trained emotion recognition model, the method further includes training the emotion recognition model.
  • The training step includes:
  • constructing a sample library and labeling the samples in the sample library, where each sample is a speech segment and the label is the emotion type corresponding to that segment, namely negative or non-negative. Negative emotions are denoted NEG and include anger, irritation, sadness, and the like; non-negative emotions are denoted NEU and include normal, happy, excited, and the like.
  • For example, for a complete recording file, the start and end times of the speech segments carrying negative emotion are annotated (500 ms < segment duration < 8 s), and the duration of each divided speech segment is determined according to the pause points, so that the whole recording is divided into multiple samples and the emotion type of each sample is labeled. In this way, the negative emotion segments and the positive emotion segments in a given recording are obtained and labeled separately;
  • dividing the sample library into a training set and a test set, and dividing the training set into a development set and a validation set, where the development set is used to train the model, the validation set is used to tune the model, and the test set is used to test the model's performance in the actual environment;
  • using the training samples in the training set to train the hyperparameters of the emotion recognition model, so as to obtain a hyperparameter set that meets preset conditions, and updating the emotion recognition model according to that hyperparameter set, where the hyperparameters include all of the model's weights and biases as well as fine-tuning parameters, and the preset condition is that the determined hyperparameter set makes the performance of the emotion recognition model optimal, with an emotion recognition accuracy greater than a preset accuracy;
  • using the test samples in the test set to test the updated emotion recognition model. If the generated test result passes verification, training ends; if it fails verification, the step of training the hyperparameters of the emotion recognition model with the training samples in the training set is performed again.
  • In an optional embodiment, training the hyperparameters of the emotion recognition model with the training samples in the training set to obtain a hyperparameter set that meets the preset conditions, and updating the emotion recognition model according to that hyperparameter set, includes:
  • dividing the training samples in the training set evenly into a preset number of training subsets;
  • taking the i-th training subset in turn as the validation set and the remaining training subsets as the development set, training the emotion recognition model for the preset number of rounds, and obtaining the preset number of hyperparameter sets;
  • selecting the optimal hyperparameter set from the preset number of hyperparameter sets according to a preset hyperparameter-set selection condition, where the selection condition is that the selected hyperparameter set makes the performance of the corresponding emotion recognition model optimal;
  • updating the emotion recognition model according to the optimal hyperparameter set and continuing to perform the step of dividing the training samples in the training set into the preset number of training subsets, until the optimal hyperparameter set meets the preset conditions.
  • For example, when the preset number is 5, the model is trained with 5-fold cross-validation: the training samples are divided evenly into 5 training subsets; the i-th subset (i = 1, 2, 3, 4, 5) is taken in turn as the validation set and the remaining four subsets are combined as the development set; five rounds of training are performed on the emotion recognition model to complete one iteration. Each round of training yields one hyperparameter set, so one iteration yields 5 hyperparameter sets; the optimal one among the 5 is determined according to the preset selection condition and used as the basis for the next iteration, the emotion recognition model is updated, the step of dividing the training samples evenly into 5 training subsets is performed again, and multiple iterations follow until the optimal hyperparameter set meets the preset conditions.
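  • The cross-validation loop described above could look roughly like the sketch below, assuming a scikit-learn-style model interface; build_model, candidate_params, and the scoring call are placeholders and not part of the application.

    from sklearn.model_selection import KFold
    import numpy as np

    def cross_validate_hyperparams(X, y, candidate_params, build_model, n_splits=5):
        """For each candidate hyperparameter set, run 5-fold CV and keep the best-scoring one."""
        kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        best_params, best_score = None, -np.inf
        for params in candidate_params:                    # e.g. learning rate, hidden size, ...
            fold_scores = []
            for dev_idx, val_idx in kfold.split(X):         # development set trains, validation set tunes
                model = build_model(**params)
                model.fit(X[dev_idx], y[dev_idx])
                fold_scores.append(model.score(X[val_idx], y[val_idx]))
            score = float(np.mean(fold_scores))
            if score > best_score:
                best_params, best_score = params, score
        return best_params, best_score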
  • Further, the step of training the hyperparameters of the emotion recognition model with the training samples in the training set includes:
  • initializing the hyperparameters of the emotion recognition model to generate an initialized emotion recognition model; processing the training samples in the training set with the initialized model to obtain a predicted class label for each training sample; calculating the iteration loss value from the predicted class labels and the annotated labels; and updating the hyperparameters of the initialized emotion recognition model according to the iteration loss value.
  • The iteration loss value is obtained through the cross-entropy loss function, expressed as:
  • L = -(1/N) Σ_{i=1..N} [ y^(i)·log ŷ^(i) + (1 - y^(i))·log(1 - ŷ^(i)) ]
  • where L represents the iteration loss value, y^(i) represents the annotated label of sample i, ŷ^(i) represents the predicted label of sample i, and N represents the total number of samples.
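  • A small numeric sketch of this loss for the binary negative/non-negative labels; the arrays below are made-up values for illustration only.

    import numpy as np

    def cross_entropy_loss(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
        """Binary cross-entropy over N samples, matching the iteration loss defined above."""
        y_pred = np.clip(y_pred, eps, 1.0 - eps)          # avoid log(0)
        return float(-np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)))

    y_true = np.array([1.0, 0.0, 1.0, 0.0])               # 1 = negative emotion (NEG), 0 = non-negative (NEU)
    y_pred = np.array([0.9, 0.2, 0.7, 0.1])               # model outputs
    print(cross_entropy_loss(y_true, y_pred))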
  • Testing the updated emotion recognition model with the test samples in the test set includes: testing the emotion recognition model with the test samples and obtaining the emotion recognition accuracy of the model. If the accuracy is less than or equal to the preset accuracy, the emotion recognition model fails verification; otherwise, the emotion recognition model passes verification.
  • The preset accuracy is set according to the actual application scenario of the speech segments, and different application scenarios can set different accuracy thresholds.
  • The emotion recognition model includes a recurrent memory neural network structure and an attention mechanism layer.
  • The recurrent memory neural network structure is used to process variable-length time series, so that the emotion recognition result of each frame of speech takes the preceding emotional-state features into account. The attention mechanism layer is connected to the last hidden layer of the recurrent memory neural network structure and is used to strengthen the key emotional features in the speech segment, for example, by increasing the weight of negative-emotion speech and reducing the weight of non-negative-emotion speech.
  • The recurrent neural network structure includes a bidirectional recurrent neural network, and the bidirectional recurrent neural network includes a Long Short-Term Memory (LSTM) network structure and a BiLSTM (Bi-directional Long Short-Term Memory) network structure.
  • In an optional embodiment, the LSTM network structure includes:
  • Forget gate:
  • f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
  • where f_t represents the output of the forget gate at time t, W_f represents the weight of the forget gate, σ represents the sigmoid activation function, h_{t-1} represents the hidden-layer output of the LSTM cell at time t-1, x_t represents the input data at time t, and b_f represents the bias of the forget gate;
  • Update gate:
  • i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
  • C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
  • where i_t represents the output of the update gate at time t, W_i represents the weight of the sigmoid activation function in the update gate, b_i represents the bias of the sigmoid activation function, C̃_t represents the output of the tanh activation function in the update gate, W_C represents the weight of the tanh activation function in the update gate, and b_C represents the bias of the tanh activation function in the update gate;
  • Information update:
  • C_t = f_t * C_{t-1} + i_t * C̃_t
  • where C_t represents the output state of the update gate at time t;
  • Output gate:
  • o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
  • h_t = o_t * tanh(C_t)
  • where o_t represents the output of the LSTM, W_o represents the weight of the output-gate sigmoid, b_o represents the bias of the output-gate sigmoid, and h_t represents the hidden-layer output of the LSTM cell at time t.
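  • To make the gate equations concrete, the sketch below performs one LSTM cell step in NumPy following the formulas above; the dimensions and random weights are illustrative only and do not reflect the trained model.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
        """One LSTM cell step following the forget / update / output gate equations above."""
        concat = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
        f_t = sigmoid(W_f @ concat + b_f)                 # forget gate
        i_t = sigmoid(W_i @ concat + b_i)                 # update gate
        c_tilde = np.tanh(W_C @ concat + b_C)             # tanh output of the update gate
        c_t = f_t * c_prev + i_t * c_tilde                # information update
        o_t = sigmoid(W_o @ concat + b_o)                 # output gate
        h_t = o_t * np.tanh(c_t)                          # hidden-layer output
        return h_t, c_t

    hidden, feat = 64, 34                                 # 34-dim LLD feature vector per frame (assumed hidden size)
    rng = np.random.default_rng(0)
    W = lambda: rng.standard_normal((hidden, hidden + feat)) * 0.1
    b = lambda: np.zeros(hidden)
    h, c = np.zeros(hidden), np.zeros(hidden)
    h, c = lstm_step(rng.standard_normal(feat), h, c, W(), b(), W(), b(), W(), b(), W(), b())
    print(h.shape)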
  • Through the attention mechanism layer, the application learns weight parameters over the information of the last hidden layer of the LSTM network structure and performs deep feature extraction on the hidden-layer outputs without discarding any of them, obtaining an attention-based deep feature representation of the input speech segment and improving the accuracy of speech emotion recognition.
  • In an optional embodiment, the attention mechanism layer includes an attention mechanism model and a scoring model. The attention mechanism model is used to iteratively train the weight parameters of the LSTM network structure, and the scoring model is used to adjust the weight parameters of negative emotion in the speech segment.
  • The training of the weight parameters of the LSTM network structure by the attention mechanism layer is obtained by the following formulas:
  • u_j = v_j·f(W_1·h_j)
  • α_j = softmax(u_j)
  • c_j = Σ_j α_j·h_j
  • where u_j represents the nonlinear transformation result of the output vector of the j-th node of the hidden layer of the LSTM network structure, v_j represents a hyperparameter of the attention mechanism layer, W_1 represents the weight parameter of the LSTM network structure, h_j represents the output of the j-th node of the hidden layer of the LSTM network structure, f() represents a nonlinear function, α_j represents the computed state score, where the softmax() function combined with the nonlinear function f() constitutes the score calculation model, and c_j represents the output of the attention mechanism layer corresponding to h_j.
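  • A minimal NumPy sketch of this attention scoring over the hidden-layer outputs h_j, taking tanh as the nonlinear function f(); the choice of tanh and the array shapes are assumptions made for illustration.

    import numpy as np

    def attention_pool(H, W1, v):
        """Score each hidden output h_j, softmax the scores, and return the weighted sum c."""
        U = v * np.tanh(H @ W1.T)             # u_j = v · f(W_1 h_j), with f = tanh (assumed)
        scores = U.sum(axis=1)                # one scalar score per node j
        alpha = np.exp(scores - scores.max())
        alpha = alpha / alpha.sum()           # alpha_j = softmax(u_j)
        c = (alpha[:, None] * H).sum(axis=0)  # c = sum_j alpha_j h_j
        return c, alpha

    T, d = 200, 64                            # 200 frames, 64-dim hidden outputs (illustrative)
    rng = np.random.default_rng(0)
    H = rng.standard_normal((T, d))
    c, alpha = attention_pool(H, rng.standard_normal((d, d)), rng.standard_normal(d))
    print(c.shape, alpha.shape)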
  • In an optional embodiment, obtaining the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results includes:
  • obtaining the proportion of negative emotion among the multiple emotion recognition results;
  • if the proportion of negative emotion is greater than or equal to a preset proportion threshold, determining that the emotion corresponding to the speech segment to be recognized is negative; if the proportion of negative emotion is less than the preset proportion threshold, determining that the emotion corresponding to the speech segment is non-negative.
  • For example, a speech segment to be recognized is divided, with frame shifting, into 125 frames of speech; each frame is passed through the emotion recognition model to obtain one recognition result, so the 125 frames yield 125 emotion recognition results. With the proportion threshold set to 70%, if 70% or more of the speech frames are recognized as negative emotion, the emotion type of the speech segment to be recognized is determined to be negative.
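  • A short sketch of this segment-level decision rule; the label convention (1 = negative frame) and the 70% threshold follow the example above, and the per-frame results are made up.

    def segment_emotion(frame_results, threshold=0.70):
        """frame_results: iterable of per-frame labels, 1 = negative (NEG), 0 = non-negative (NEU)."""
        frame_results = list(frame_results)
        negative_ratio = sum(frame_results) / len(frame_results)
        return "NEG" if negative_ratio >= threshold else "NEU"

    # 125 per-frame results for one segment (illustrative values)
    results = [1] * 90 + [0] * 35
    print(segment_emotion(results))   # 90/125 = 72% negative -> "NEG"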
  • the voice emotion recognition method described in this application is applied to electronic devices, which may be terminal devices such as televisions, smart phones, tablet computers, and computers.
  • the electronic device includes a processor and a memory, the memory is used to store a voice emotion recognition program, and the processor executes the voice emotion recognition program to implement the following voice emotion recognition method:
  • acquiring a speech segment to be recognized; preprocessing the acquired speech segment, including: performing framing processing on the speech segment to obtain multiple frames of speech; processing the multiple frames of speech with a pre-trained emotion recognition model to obtain multiple emotion recognition results, each corresponding to one frame of speech or a set number of frames of speech; and obtaining the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
  • the electronic device also includes a network interface, a communication bus, and the like.
  • the network interface may include a standard wired interface and a wireless interface
  • the communication bus is used to realize the connection and communication between various components.
  • The memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, or an optical disk, or a plug-in hard disk, and the like, and is not limited to these; it may be any device that stores instructions or software and any associated data files in a non-transitory manner and can provide the instructions or software program to the processor so that the processor can execute them.
  • the software program stored in the memory includes a voice emotion recognition program, and the voice emotion recognition program can be provided to the processor, so that the processor can execute the voice emotion recognition program to realize the voice emotion recognition method.
  • the processor may be a central processing unit, a microprocessor, or other data processing chips, etc., and may run a program stored in the memory, for example, the voice emotion recognition program in this application.
  • the electronic device may also include a display, and the display may also be called a display screen or a display unit.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, and the like.
  • the display is used to display the information processed in the electronic device and to display the visual work interface.
  • the electronic device may also include a user interface, and the user interface may include an input unit (such as a keyboard), a voice output device (such as a stereo, earphone), and the like.
  • FIG. 2 is a schematic diagram of a voice emotion recognition device in this application.
  • the voice emotion recognition device includes: a voice acquisition module 1, a preprocessing module 2, an emotion recognition module 3, and an emotion acquisition module 4.
  • After the voice acquisition module 1 acquires the speech segment to be recognized, the preprocessing module 2 preprocesses the acquired speech segment.
  • The preprocessing includes: framing the speech segment to be recognized to obtain multiple frames of speech. The emotion recognition module 3 then processes the preprocessed speech segment with the pre-trained emotion recognition model to obtain multiple emotion recognition results, each corresponding to one frame of speech or a set number of frames of speech.
  • Finally, the emotion acquisition module 4 obtains the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
  • Because the emotion recognition module produces an emotion recognition result for each frame or every few frames of a speech segment, emotion recognition of the speech can be reduced to the millisecond level, which is closer to real-time continuous prediction of the emotion of the speech segment and improves the accuracy of speech emotion recognition.
  • The speech emotion recognition device further includes a training module for training the emotion recognition model before the multi-frame speech is processed. The training module includes:
  • a sample library construction unit, which constructs a sample library and labels the samples in the sample library, the samples being speech segments;
  • a sample library dividing unit, which divides the sample library into a training set and a test set, and divides the training set into a development set and a validation set, where the development set is used to train the model, the validation set is used to tune the model, and the test set is used to test the model;
  • a training unit, which uses the training samples in the training set to train the hyperparameters of the emotion recognition model, so as to obtain a hyperparameter set that meets the preset conditions, and updates the emotion recognition model according to that hyperparameter set;
  • a testing unit, which uses the test samples in the test set to test the updated emotion recognition model; if the generated test result fails verification, the training unit continues to train the hyperparameters of the emotion recognition model.
  • The training unit trains the hyperparameters of the emotion recognition model as follows:
  • the training samples in the training set are divided evenly into a preset number of training subsets; the i-th training subset is taken in turn as the validation set and the remaining training subsets as the development set; the emotion recognition model is trained for the preset number of rounds to obtain the preset number of hyperparameter sets; the optimal hyperparameter set is selected from them according to a preset hyperparameter-set selection condition; the emotion recognition model is updated according to the optimal hyperparameter set, and the step of dividing the training samples into the preset number of training subsets is performed again, until the optimal hyperparameter set meets the preset conditions.
  • For example, when the preset number is 5, the model is trained with 5-fold cross-validation: the training samples are divided evenly into 5 training subsets; the i-th subset (i = 1, 2, 3, 4, 5) is taken in turn as the validation set and the remaining four subsets are combined as the development set; five rounds of training complete one iteration. Each round yields one hyperparameter set, so one iteration yields 5 hyperparameter sets; the optimal one is determined according to the preset selection condition and used as the basis for the next iteration, the emotion recognition model is updated, and the division into 5 training subsets is repeated over multiple iterations until the optimal hyperparameter set meets the preset conditions.
  • The emotion recognition model used in the emotion recognition module includes a recurrent memory neural network structure and an attention mechanism layer.
  • The recurrent memory neural network structure is used to process variable-length time series, so that the emotion recognition result of each frame of speech takes the preceding emotional-state features into account.
  • The attention mechanism layer is connected to the last hidden layer of the recurrent memory neural network structure and strengthens the key emotional features in the speech segment, for example, by increasing the weight of negative-emotion speech and reducing the weight of non-negative-emotion speech.
  • The structure of this emotion recognition model is substantially the same as the emotion recognition model used in the voice emotion recognition method above and is not repeated here.
  • The emotion acquisition module includes a ratio acquisition module and an emotion determination module. The ratio acquisition module is used to obtain the proportion of negative emotion among the multiple emotion recognition results, and the emotion determination module determines from that proportion whether the emotion corresponding to the speech segment to be recognized is negative or non-negative: if the proportion of negative emotion is greater than or equal to the preset proportion threshold, the emotion corresponding to the speech segment is determined to be negative; if it is less than the preset proportion threshold, the emotion is determined to be non-negative.
  • For example, a speech segment to be recognized is divided, with frame shifting, into 125 frames of speech; each frame is passed through the emotion recognition model to obtain one recognition result, so the 125 frames yield 125 emotion recognition results. With the proportion threshold set to 70%, if 70% or more of the frames are recognized as negative emotion, the emotion type of the speech segment to be recognized is determined to be negative.
  • In other embodiments, the voice emotion recognition program can also be divided into one or more modules, which are stored in the memory and executed by the processor to complete this application and implement the functions of the above voice emotion recognition device.
  • A module in this application refers to a series of computer program instruction segments that can complete a specific function.
  • The voice emotion recognition program can be divided into: a voice acquisition module 1, a preprocessing module 2, an emotion recognition module 3, and an emotion acquisition module 4.
  • The functions or operation steps implemented by these modules are similar to those above and are not described in detail here. For example:
  • the voice acquisition module 1 acquires a voice segment to be recognized;
  • the preprocessing module 2 preprocesses the acquired speech fragments to be recognized, including: performing frame division processing on the speech fragments to be recognized to obtain multiple frames of speech;
  • the emotion recognition module 3 processes the multi-frame speech using a pre-generated emotion recognition model to obtain multiple emotion recognition results, and each emotion recognition result corresponds to a frame or a set number of frames of speech;
  • the emotion obtaining module 4 obtains the emotion corresponding to the voice segment to be recognized according to the multiple emotion recognition results.
  • the computer non-volatile readable storage medium may be any tangible medium that contains or stores a program or instruction, the program can be executed, and the stored program instructs the relevant hardware to realize the corresponding function.
  • the computer-readable storage medium may be a computer disk, a hard disk, a random access memory, a read-only memory, and so on.
  • the present application is not limited to this, and can be any device that stores instructions or software and any related data files or data structures in a non-transitory manner and can be provided to the processor to enable the processor to execute the programs or instructions therein.
  • the computer non-volatile readable storage medium includes a voice emotion recognition program, and when the voice emotion recognition program is executed by a processor, the following voice emotion recognition method is implemented:
  • acquiring a speech segment to be recognized; preprocessing the acquired speech segment, including: performing framing processing on the speech segment to obtain multiple frames of speech; processing the multiple frames of speech with a pre-trained emotion recognition model to obtain multiple emotion recognition results, each corresponding to one frame of speech or a set number of frames of speech; and obtaining the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

A speech emotion recognition method, apparatus, device and storage medium. The method includes: acquiring a speech segment to be recognized (S1); preprocessing the acquired speech segment to be recognized (S2), including: performing framing processing on the speech segment to obtain multiple frames of speech; processing the multiple frames of speech with a pre-trained emotion recognition model to obtain multiple emotion recognition results, each emotion recognition result corresponding to one frame of speech or a set number of frames of speech (S3); and obtaining the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results (S4). By obtaining the emotion corresponding to each frame or every few frames of speech, emotion recognition of the speech is reduced to the millisecond level, which comes closer to real-time continuous prediction of the emotion of the speech segment and improves the accuracy of speech emotion recognition.

Description

Speech emotion recognition method, apparatus, device and storage medium
This application claims priority to Chinese Patent Application No. 201910875372.1, filed on September 17, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a speech emotion recognition method, apparatus, device and storage medium.
Background
With the rapid development of artificial intelligence, machine learning and network information, emotion recognition and prediction on speech is applied in many settings, for example human-computer interaction and voice calls. Speech emotion recognition and prediction is a strongly subjective problem, and its judgment must be based on context. Existing speech emotion recognition directly gives an emotion prediction for a whole speech segment, and the segment duration depends on how long the speaker talks continuously. However, even within a single sentence the emotion of the speech fluctuates, so emotion judgments based on whole speech segments carry a large error.
Summary
This application provides a speech emotion recognition method, apparatus, device and storage medium to solve the problem in the prior art that emotion recognition on speech segments has a large error.
To achieve the above purpose, a first aspect of this application provides a speech emotion recognition method, including the following steps:
acquiring a speech segment to be recognized;
preprocessing the acquired speech segment to be recognized, including: performing framing processing on the speech segment to be recognized to obtain multiple frames of speech;
processing the multiple frames of speech with a pre-trained emotion recognition model to obtain multiple emotion recognition results, each emotion recognition result corresponding to one frame of speech or a set number of frames of speech;
obtaining the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
To achieve the above purpose, a second aspect of this application provides a speech emotion recognition apparatus, including: a speech acquisition module, a preprocessing module, an emotion recognition module and an emotion acquisition module, where the speech acquisition module is used to acquire a speech segment to be recognized; the preprocessing module is used to preprocess the acquired speech segment to obtain multiple frames of speech; the emotion recognition module is used to process the preprocessed speech segment with a pre-trained emotion recognition model to obtain multiple emotion recognition results, each corresponding to one frame of speech or a set number of frames of speech; and the emotion acquisition module is used to obtain the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
To achieve the above purpose, a third aspect of this application provides an electronic device, including a processor and a memory, where the memory contains a speech emotion recognition program, and the speech emotion recognition method described above is implemented when the program is executed by the processor.
To achieve the above purpose, a fourth aspect of this application provides a computer non-volatile readable storage medium, which contains a speech emotion recognition program; when the program is executed by a processor, the speech emotion recognition method described above is implemented.
Compared with the prior art, this application has the following advantages and beneficial effects:
This application obtains, through the emotion recognition model, the emotion corresponding to each frame or every few frames of speech in a speech segment, reducing emotion recognition of the speech segment to the millisecond level, which comes closer to real-time continuous prediction of the emotion of the speech segment and improves the accuracy of speech emotion recognition.
Brief Description of the Drawings
Figure 1 is a schematic flowchart of the speech emotion recognition method provided by this application;
Figure 2 is a schematic diagram of the speech emotion recognition apparatus in this application.
The realization of the purpose, functional features and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description of the Embodiments
The embodiments described in this application are described below with reference to the drawings. Those of ordinary skill in the art will recognize that the described embodiments can be modified in various different ways, or in combinations thereof, without departing from the spirit and scope of this application. Therefore, the drawings and description are illustrative in nature, are only used to explain this application, and are not used to limit the scope of protection of the claims. In addition, in this specification the drawings are not drawn to scale, and the same reference numerals denote the same parts.
Figure 1 is a schematic flowchart of the speech emotion recognition method provided by this application. As shown in Figure 1, the speech emotion recognition method provided by this application includes the following steps:
Step S1: acquire a speech segment to be recognized, the speech segment being of arbitrary duration and obtained when the speech recognition system performs voice endpoint detection on the speaker;
Step S2: preprocess the acquired speech segment to be recognized, including: performing framing processing on the speech segment to obtain multiple frames of speech, and extracting a feature vector from each frame of speech for further processing of the speech;
Step S3: process the multiple frames of speech with the pre-trained emotion recognition model to obtain multiple emotion recognition results, each result corresponding to one frame of speech or a set number of frames of speech. For example, a 2 s speech segment to be recognized is divided, with frame shifting, into 200 frames of speech, each frame being 25 ms long; one emotion recognition result may be produced per frame of speech, giving 200 results, or one result per 5 frames of speech, giving 40 results;
Step S4: obtain the emotion corresponding to the speech segment to be recognized according to the multiple output emotion recognition results.
Because the expression of emotion in speech is carried in the tone of the words, it is intermittent, with highs and lows; the emotion of a speech segment with a pronounced tone is easy for a person to recognize and also easy for the emotion recognition model to recognize, whereas judging emotion from only a single frame of a speech segment may be inaccurate, making it difficult to detect the negative emotion in the speech to the greatest extent. This application obtains, through the emotion recognition model, the emotion corresponding to each frame or every few frames of speech, reducing emotion recognition of the speech to the millisecond level, coming closer to real-time continuous prediction of the emotion of the speech segment and improving the accuracy of speech emotion recognition.
It should be noted that when an emotion recognition result corresponds to a set number of frames of speech, too large a number of consecutive frames may in turn reduce the recognition effect, so a threshold on the number of consecutive frames is determined for each application scenario; for example, when the set number is 8, the result is closer to the emotional expression of that scenario and the recognition effect is best.
Preferably, when framing the speech segment, each frame of speech is 25 ms to 50 ms long, to match the short-term stationarity of speech; when shifting frames, each shift is 15 ms, to preserve emotional continuity between frames. For example, a 1 s speech segment yields 1000/(20-15) = 200 frames.
The feature vector extracted for each frame of speech includes 34-dimensional low-level descriptors (LLDs) such as the zero-crossing rate (ZCR), short-term energy, short-term entropy of energy, spectral centroid and spread, spectral entropy, spectral flux, spectral roll-off, Mel-frequency cepstral coefficients (MFCC), and harmonic ratio and pitch; the feature matrix obtained for a speech segment therefore has dimension N*34, where N denotes the number of frames and 50 ≤ N ≤ 800.
In an optional embodiment of this application, before the multi-frame speech is processed with the pre-trained emotion recognition model, the method further includes training the emotion recognition model. Specifically, the training steps include:
constructing a sample library and labeling the samples in the sample library, where a sample is a speech segment and its label is the emotion type corresponding to that segment, namely negative or non-negative; negative emotions are denoted NEG and include anger, irritation, sadness and the like, and non-negative emotions are denoted NEU and include normal, happy, excited and the like. For example, for a complete recording file, the start and end times of the speech segments carrying negative emotion are annotated (500 ms < segment duration < 8 s), and the duration of each divided speech segment is determined according to the pause points, so that the whole recording is divided into multiple samples and the emotion type of each sample is labeled; in this way, the negative emotion segments and the positive emotion segments in a given recording are obtained and labeled separately;
dividing the sample library into a training set and a test set, and dividing the training set into a development set and a validation set, where the development set is used to train the model, the validation set is used to tune the model, and the test set is used to test the model's performance in the actual environment;
using the training samples in the training set to train the hyperparameters of the emotion recognition model, so as to obtain a hyperparameter set that meets preset conditions, and updating the emotion recognition model according to that hyperparameter set, where the hyperparameters include all of the model's weights and biases as well as fine-tuning parameters, and the preset condition is that the determined hyperparameter set makes the performance of the emotion recognition model optimal, with an emotion recognition accuracy greater than a preset accuracy;
using the test samples in the test set to test the updated emotion recognition model; if the generated test result passes verification, training ends, and if it fails verification, the step of training the hyperparameters of the emotion recognition model with the training samples in the training set is performed again.
In an optional embodiment of this application, using the training samples in the training set to train the hyperparameters of the emotion recognition model, so as to obtain a hyperparameter set that meets the preset conditions, and updating the emotion recognition model according to that hyperparameter set, includes:
dividing the training samples in the training set evenly into a preset number of training subsets;
taking the i-th training subset in turn as the validation set and the remaining training subsets as the development set, training the emotion recognition model for the preset number of rounds, and obtaining the preset number of hyperparameter sets;
selecting the optimal hyperparameter set from the preset number of hyperparameter sets according to a preset hyperparameter-set selection condition, where the selection condition is that the selected hyperparameter set makes the performance of the corresponding emotion recognition model optimal;
updating the emotion recognition model according to the optimal hyperparameter set, and continuing to perform the step of dividing the training samples in the training set evenly into the preset number of training subsets, until the optimal hyperparameter set meets the preset conditions.
For example, when the preset number is 5, the model is trained with 5-fold cross-validation: the training samples in the training set are divided evenly into 5 training subsets; the i-th subset (i = 1, 2, 3, 4, 5) is taken in turn as the validation set, and the remaining four subsets are combined as the development set; five rounds of training are performed on the emotion recognition model to complete one iteration, where each round of training yields one hyperparameter set, so one iteration yields 5 hyperparameter sets; the optimal hyperparameter set among the 5 is determined according to the preset selection condition and used as the basis for the next iteration, the emotion recognition model is updated, the step of dividing the training samples evenly into 5 training subsets is performed again, and multiple iterations are carried out until the optimal hyperparameter set meets the preset conditions.
Further, the step of training the hyperparameters of the emotion recognition model with the training samples in the training set includes:
initializing the hyperparameters of the emotion recognition model and generating an initialized emotion recognition model;
processing the training samples in the training set with the initialized emotion recognition model to obtain the predicted class label corresponding to each training sample;
calculating the iteration loss value from the predicted class labels and the annotated labels;
updating the hyperparameters of the initialized emotion recognition model according to the iteration loss value.
The iteration loss value is obtained through the cross-entropy loss function, expressed as:
L = -(1/N) Σ_{i=1..N} [ y^(i)·log ŷ^(i) + (1 - y^(i))·log(1 - ŷ^(i)) ]
where L denotes the iteration loss value, y^(i) denotes the annotated label of sample i, ŷ^(i) denotes the predicted label of sample i, and N denotes the total number of samples.
In an optional embodiment of this application, using the test samples in the test set to test the updated emotion recognition model includes: testing the emotion recognition model with the test samples and obtaining the emotion recognition accuracy of the emotion recognition model; if the accuracy is less than or equal to the preset accuracy, the emotion recognition model fails verification; otherwise, the emotion recognition model passes verification. The preset accuracy is set according to the actual application scenario of the speech segments, and different application scenarios can set different accuracy thresholds.
In an optional embodiment of this application, the emotion recognition model includes a recurrent memory neural network structure and an attention mechanism layer. The recurrent memory neural network structure processes variable-length time series, so that the emotion recognition result of each frame of speech takes the preceding emotional-state features into account; the attention mechanism layer is connected to the last hidden layer of the recurrent memory neural network structure and strengthens the key emotional features in the speech segment, for example by increasing the weight of negative-emotion speech and reducing the weight of non-negative-emotion speech.
The recurrent neural network structure includes a bidirectional recurrent neural network, and the bidirectional recurrent neural network includes a Long Short-Term Memory (LSTM) network structure and a BiLSTM (Bi-directional Long Short-Term Memory) network structure.
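As an illustration of such a structure, the sketch below assumes the PyTorch framework (this application does not name a framework): a bidirectional LSTM over the per-frame 34-dimensional features, an attention layer over its hidden outputs, and a per-frame negative/non-negative classifier. All layer sizes and the way the context vector is shared across frames are illustrative assumptions, not the application's concrete model.

    import torch
    import torch.nn as nn

    class FrameEmotionNet(nn.Module):
        """BiLSTM over per-frame LLD features, an attention layer, and a per-frame NEG/NEU classifier."""
        def __init__(self, n_features=34, hidden=64):
            super().__init__()
            self.bilstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
            self.att_w = nn.Linear(2 * hidden, 2 * hidden, bias=False)   # plays the role of W_1
            self.att_v = nn.Parameter(torch.randn(2 * hidden))           # plays the role of v
            self.classifier = nn.Linear(4 * hidden, 2)                   # [h_j ; c] -> NEG / NEU

        def forward(self, x):                      # x: (batch, n_frames, n_features)
            h, _ = self.bilstm(x)                  # (batch, n_frames, 2*hidden)
            u = torch.tanh(self.att_w(h)) @ self.att_v         # scores u_j, (batch, n_frames)
            alpha = torch.softmax(u, dim=1).unsqueeze(-1)       # attention weights
            c = (alpha * h).sum(dim=1, keepdim=True)            # context vector
            c = c.expand_as(h)                                  # share context with every frame
            return self.classifier(torch.cat([h, c], dim=-1))   # per-frame logits

    model = FrameEmotionNet()
    logits = model(torch.randn(8, 200, 34))        # 8 segments, 200 frames each
    print(logits.shape)                            # torch.Size([8, 200, 2])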
In an optional embodiment of this application, the LSTM network structure includes:
Forget gate:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
where f_t denotes the output of the forget gate at time t, W_f denotes the weight of the forget gate, σ denotes the sigmoid activation function, h_{t-1} denotes the hidden-layer output of the LSTM cell at time t-1, x_t denotes the input data at time t, and b_f denotes the bias of the forget gate;
Update gate:
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
where i_t denotes the output of the update gate at time t, W_i denotes the weight of the sigmoid activation function in the update gate, b_i denotes the bias of the sigmoid activation function, σ denotes the sigmoid activation function, h_{t-1} denotes the hidden-layer output of the LSTM cell at time t-1, x_t denotes the input data at time t, C̃_t denotes the output of the tanh activation function in the update gate, W_C denotes the weight of the tanh activation function in the update gate, and b_C denotes the bias of the tanh activation function in the update gate;
Information update:
C_t = f_t * C_{t-1} + i_t * C̃_t
where C_t denotes the output state of the update gate at time t;
Output gate:
o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
where o_t denotes the output of the LSTM, W_o denotes the weight of the output-gate sigmoid, b_o denotes the bias of the output-gate sigmoid, σ denotes the sigmoid activation function, h_{t-1} denotes the hidden-layer output of the LSTM cell at time t-1, x_t denotes the input data at time t, and h_t denotes the hidden-layer output of the LSTM cell at time t.
Through the attention mechanism layer, this application learns weight parameters over the information of the last hidden layer of the LSTM network structure and performs deep feature extraction on the hidden-layer outputs without discarding any of them, obtaining an attention-based deep feature representation of the input speech segment and improving the accuracy of speech emotion recognition. In an optional embodiment of this application, the attention mechanism layer includes an attention mechanism model and a scoring model; the attention mechanism model is used to iteratively train the weight parameters of the LSTM network structure, and the scoring model is used to adjust the weight parameters of negative emotion in the speech segment. The training of the weight parameters of the LSTM network structure by the attention mechanism layer is obtained by the following formulas:
u_j = v_j·f(W_1·h_j)
α_j = softmax(u_j)
c_j = Σ_j α_j·h_j
where u_j denotes the nonlinear transformation result of the output vector of the j-th node of the hidden layer of the LSTM network structure, v_j denotes a hyperparameter of the attention mechanism layer, W_1 denotes the weight parameter of the LSTM network structure, h_j denotes the output of the j-th node of the hidden layer of the LSTM network structure, f() denotes a nonlinear function, α_j denotes the computed state score, where the softmax() function combined with the nonlinear function f() constitutes the score calculation model, and c_j denotes the output of the attention mechanism layer corresponding to h_j.
In an optional embodiment of this application, obtaining the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results includes:
obtaining the proportion of negative emotion among the multiple emotion recognition results;
if the proportion of negative emotion is greater than or equal to a preset proportion threshold, determining that the emotion corresponding to the speech segment to be recognized is negative; if the proportion of negative emotion is less than the preset proportion threshold, determining that the emotion corresponding to the speech segment to be recognized is non-negative.
For example, a speech segment to be recognized is divided, with frame shifting, into 125 frames of speech; each frame is passed through the emotion recognition model to obtain one recognition result, so 125 frames yield 125 emotion recognition results. For the whole speech segment, the proportion threshold is set to 70%; if 70% or more of the speech frames are recognized as negative emotion, the emotion type of the speech segment to be recognized is determined to be negative.
For a 2 s speech segment to be recognized, framing and frame shifting yield 200 frames of speech, each 25 ms long; through the emotion recognition model, one emotion recognition result is produced for every 5 frames, i.e. 40 results are obtained; if 70% of the frames are recognized as negative emotion, the emotion type of the speech segment to be recognized is determined to be negative. If the error in judging the emotion start and end points is ignored and the emotion does not fluctuate sharply, the recognition result of 5 frames can be taken directly as real-time output; it can then be considered that, for any stream of speech data, an emotion recognition result is obtained every 50 ms, which amounts to continuous emotion detection with a high emotion recognition accuracy.
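As a rough sketch of such streaming output, the code below emits one decision per group of 5 frames (about every 50 ms under the framing described above); the per-frame classifier is a placeholder, and taking a majority vote over per-frame outputs stands in for the model's grouped prediction.

    def stream_emotions(frame_vectors, classify_frame, group=5):
        """Yield one emotion label per `group` frames: 'NEG' if most frames in the group are negative."""
        buffer = []
        for vec in frame_vectors:
            buffer.append(classify_frame(vec))   # placeholder: per-frame model output, 1 = NEG, 0 = NEU
            if len(buffer) == group:
                yield "NEG" if sum(buffer) > group / 2 else "NEU"
                buffer = []

    # Example with a dummy per-frame classifier over 200 frames
    labels = list(stream_emotions(range(200), classify_frame=lambda v: 1 if v % 3 == 0 else 0))
    print(len(labels))   # 40 grouped results for 200 frames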
The speech emotion recognition method of this application is applied to an electronic device, which may be a terminal device such as a television, a smartphone, a tablet computer or a computer.
The electronic device includes a processor and a memory; the memory is used to store a speech emotion recognition program, and the processor executes the speech emotion recognition program to implement the following speech emotion recognition method:
acquiring a speech segment to be recognized; preprocessing the acquired speech segment, including: performing framing processing on the speech segment to obtain multiple frames of speech; processing the multiple frames of speech with a pre-trained emotion recognition model to obtain multiple emotion recognition results, each emotion recognition result corresponding to one frame of speech or a set number of frames of speech; and obtaining the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
The electronic device further includes a network interface, a communication bus, and the like. The network interface may include a standard wired interface and a wireless interface, and the communication bus is used to realize connection and communication between the components.
The memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk or an optical disk, or a plug-in hard disk, and is not limited to these; it may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides the instructions or software program to the processor so that the processor can execute them. In this application, the software program stored in the memory includes the speech emotion recognition program, which can be provided to the processor so that the processor can execute it to implement the speech emotion recognition method.
The processor may be a central processing unit, a microprocessor or another data processing chip, and can run the program stored in the memory, for example the speech emotion recognition program in this application.
The electronic device may further include a display, which may also be called a display screen or a display unit. In some embodiments the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, and the like. The display is used to display the information processed in the electronic device and to display a visual work interface.
The electronic device may further include a user interface, which may include an input unit (such as a keyboard), a voice output device (such as a speaker or earphones), and the like.
It should be noted that the specific implementation of the electronic device in this application is substantially the same as the specific implementation of the speech emotion recognition method above and is not repeated here.
Figure 2 is a schematic diagram of the speech emotion recognition apparatus in this application. The apparatus includes a speech acquisition module 1, a preprocessing module 2, an emotion recognition module 3 and an emotion acquisition module 4. After the speech acquisition module 1 acquires the speech segment to be recognized, the preprocessing module 2 preprocesses the acquired speech segment; the preprocessing includes framing the speech segment to be recognized to obtain multiple frames of speech. The emotion recognition module 3 then processes the preprocessed speech segment with the pre-trained emotion recognition model to obtain multiple emotion recognition results, each corresponding to one frame of speech or a set number of frames of speech; finally, the emotion acquisition module 4 obtains the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
Because the emotion recognition module recognizes, for a speech segment, an emotion recognition result corresponding to each frame or every few frames of speech, emotion recognition of the speech can be reduced to the millisecond level, which comes closer to real-time continuous prediction of the emotion of the speech segment and improves the accuracy of speech emotion recognition.
The speech emotion recognition apparatus further includes a training module for training the emotion recognition model before the multi-frame speech is processed. The training module includes:
a sample library construction unit, which constructs a sample library and labels the samples in the sample library, the samples being speech segments; a sample library dividing unit, which divides the sample library into a training set and a test set, and divides the training set into a development set and a validation set, where the development set is used to train the model, the validation set is used to tune the model, and the test set is used to test the model; a training unit, which uses the training samples in the training set to train the hyperparameters of the emotion recognition model, so as to obtain a hyperparameter set that meets preset conditions, and updates the emotion recognition model according to that hyperparameter set; and a testing unit, which uses the test samples in the test set to test the updated emotion recognition model; if the generated test result fails verification, the training unit continues to train the hyperparameters of the emotion recognition model.
The training unit trains the hyperparameters of the emotion recognition model in the following manner:
the training samples in the training set are divided evenly into a preset number of training subsets; the i-th training subset is taken in turn as the validation set and the remaining training subsets as the development set; the emotion recognition model is trained for the preset number of rounds to obtain the preset number of hyperparameter sets; the optimal hyperparameter set is selected from them according to a preset hyperparameter-set selection condition; the emotion recognition model is updated according to the optimal hyperparameter set, and the step of dividing the training samples in the training set evenly into the preset number of training subsets is performed again, until the optimal hyperparameter set meets the preset conditions.
For example, when the preset number is 5, the model is trained with 5-fold cross-validation: the training samples in the training set are divided evenly into 5 training subsets; the i-th subset (i = 1, 2, 3, 4, 5) is taken in turn as the validation set, and the remaining four subsets are combined as the development set; five rounds of training are performed on the emotion recognition model to complete one iteration, where each round of training yields one hyperparameter set, so one iteration yields 5 hyperparameter sets; the optimal hyperparameter set among the 5 is determined according to the preset selection condition and used as the basis for the next iteration, the emotion recognition model is updated, the step of dividing the training samples evenly into 5 training subsets is performed again, and multiple iterations are carried out until the optimal hyperparameter set meets the preset conditions.
It should be noted that the other training steps performed by the speech emotion recognition apparatus on the emotion recognition model are similar to the training steps in the speech emotion recognition method above and are not repeated here.
In an optional embodiment of this application, the emotion recognition model used in the emotion recognition module includes a recurrent memory neural network structure and an attention mechanism layer; the recurrent memory neural network structure processes variable-length time series, so that the emotion recognition result of each frame of speech takes the preceding emotional-state features into account; the attention mechanism layer is connected to the last hidden layer of the recurrent memory neural network structure and strengthens the key emotional features in the speech segment, for example by increasing the weight of negative-emotion speech and reducing the weight of non-negative-emotion speech.
It should be noted that the structure of the emotion recognition model is substantially the same as the emotion recognition model used in the speech emotion recognition method above and is not repeated here.
In an optional embodiment of this application, the emotion acquisition module includes a ratio acquisition module and an emotion determination module. The ratio acquisition module obtains the proportion of negative emotion among the multiple emotion recognition results, and the emotion determination module determines, from that proportion, whether the emotion corresponding to the speech segment to be recognized is negative or non-negative; specifically, if the proportion of negative emotion is greater than or equal to the preset proportion threshold, the emotion corresponding to the speech segment to be recognized is determined to be negative, and if the proportion of negative emotion is less than the preset proportion threshold, the emotion corresponding to the speech segment to be recognized is determined to be non-negative.
For example, a speech segment to be recognized is divided, with frame shifting, into 125 frames of speech; each frame is passed through the emotion recognition model to obtain one recognition result, so 125 frames yield 125 emotion recognition results; with the proportion threshold set to 70%, if 70% or more of the frames are recognized as negative emotion, the emotion type of the speech segment to be recognized is determined to be negative.
In other embodiments, the speech emotion recognition program can also be divided into one or more modules, which are stored in the memory and executed by the processor to complete this application and implement the functions of the above speech emotion recognition apparatus. A module in this application refers to a series of computer program instruction segments that can complete a specific function. The speech emotion recognition program can be divided into: a speech acquisition module 1, a preprocessing module 2, an emotion recognition module 3 and an emotion acquisition module 4. The functions or operation steps implemented by these modules are similar to those above and are not described in detail here. For example:
the speech acquisition module 1 acquires a speech segment to be recognized;
the preprocessing module 2 preprocesses the acquired speech segment, including: performing framing processing on the speech segment to obtain multiple frames of speech;
the emotion recognition module 3 processes the multiple frames of speech with a pre-generated emotion recognition model to obtain multiple emotion recognition results, each corresponding to one frame or a set number of frames of speech;
the emotion acquisition module 4 obtains the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
In one embodiment of this application, the computer non-volatile readable storage medium may be any tangible medium that contains or stores a program or instructions; the program can be executed, and the stored program instructs the relevant hardware to realize the corresponding functions. For example, the computer-readable storage medium may be a computer disk, a hard disk, a random access memory, a read-only memory, and so on. This application is not limited to these; the medium may be any device that stores instructions or software and any related data files or data structures in a non-transitory manner and can provide them to a processor so that the processor executes the programs or instructions therein. The computer non-volatile readable storage medium contains a speech emotion recognition program, and when the speech emotion recognition program is executed by a processor, the following speech emotion recognition method is implemented:
acquiring a speech segment to be recognized; preprocessing the acquired speech segment, including: performing framing processing on the speech segment to obtain multiple frames of speech; processing the multiple frames of speech with a pre-trained emotion recognition model to obtain multiple emotion recognition results, each emotion recognition result corresponding to one frame of speech or a set number of frames of speech; and obtaining the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
The specific implementation of the computer non-volatile readable storage medium of this application is substantially the same as the specific implementations of the speech emotion recognition method, apparatus and electronic device above and is not repeated here.
It should be noted that, as used herein, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, apparatus, article or method that includes that element.
The above serial numbers of the embodiments of this application are for description only and do not represent the superiority or inferiority of the embodiments, nor do they limit the patent scope of this application; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of this application. From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as a ROM/RAM, magnetic disk or optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the embodiments of this application.

Claims (20)

  1. A speech emotion recognition method applied to an electronic device, characterized by comprising the following steps:
    acquiring a speech segment to be recognized;
    preprocessing the acquired speech segment to be recognized, comprising: performing framing processing on the speech segment to be recognized to obtain multiple frames of speech;
    processing the multiple frames of speech with a pre-trained emotion recognition model to obtain multiple emotion recognition results, each emotion recognition result corresponding to one frame of speech or a set number of frames of speech;
    obtaining the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
  2. The speech emotion recognition method according to claim 1, characterized in that, before processing the multiple frames of speech with the pre-trained emotion recognition model, the method further comprises:
    constructing a sample library and labeling the samples in the sample library, the samples being speech segments;
    dividing the sample library into a training set and a test set, and dividing the training set into a development set and a validation set, the development set being used to train the model, the validation set being used to tune the model, and the test set being used to test the model;
    training the hyperparameters of the emotion recognition model with the training samples in the training set to obtain a hyperparameter set that meets preset conditions, and updating the emotion recognition model according to the hyperparameter set;
    testing the updated emotion recognition model with the test samples in the test set, and if the generated test result fails verification, performing again the step of training the hyperparameters of the emotion recognition model with the training samples in the training set.
  3. The speech emotion recognition method according to claim 2, characterized in that training the hyperparameters of the emotion recognition model with the training samples in the training set to obtain a hyperparameter set that meets the preset conditions, and updating the emotion recognition model according to the hyperparameter set, comprises:
    dividing the training samples in the training set evenly into a preset number of training subsets;
    taking the i-th training subset in turn as the validation set and the remaining training subsets as the development set, training the emotion recognition model for the preset number of rounds, and obtaining the preset number of hyperparameter sets;
    selecting the optimal hyperparameter set from the preset number of hyperparameter sets according to a preset hyperparameter-set selection condition;
    updating the emotion recognition model according to the optimal hyperparameter set, and continuing to perform the step of dividing the training samples in the training set evenly into the preset number of training subsets, until the optimal hyperparameter set meets the preset conditions.
  4. The speech emotion recognition method according to claim 2 or 3, characterized in that the step of training the hyperparameters of the emotion recognition model with the training samples in the training set comprises:
    initializing the hyperparameters of the emotion recognition model and generating an initialized emotion recognition model;
    processing the training samples in the training set with the initialized emotion recognition model to obtain a predicted class label corresponding to each training sample;
    calculating an iteration loss value from the predicted class labels and the annotated labels;
    updating the hyperparameters of the initialized emotion recognition model according to the iteration loss value,
    wherein the iteration loss value is obtained by the following formula:
    L = -(1/N) Σ_{i=1..N} [ y^(i)·log ŷ^(i) + (1 - y^(i))·log(1 - ŷ^(i)) ]
    where L denotes the iteration loss value, y^(i) denotes the annotated label of sample i, ŷ^(i) denotes the predicted class label of sample i, and N denotes the total number of samples.
  5. The speech emotion recognition method according to claim 2, characterized in that testing the updated emotion recognition model with the test samples in the test set comprises: testing the emotion recognition model with the test samples and obtaining the emotion recognition accuracy of the emotion recognition model; if the accuracy is less than or equal to a preset accuracy, the emotion recognition model fails verification; otherwise, the emotion recognition model passes verification.
  6. The speech emotion recognition method according to claim 1, characterized in that the step of obtaining the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results comprises:
    obtaining the proportion of negative emotion among the multiple emotion recognition results;
    if the proportion of negative emotion is greater than or equal to a preset proportion threshold, determining that the emotion corresponding to the speech segment to be recognized is negative; if the proportion of negative emotion is less than the preset proportion threshold, determining that the emotion corresponding to the speech segment to be recognized is non-negative.
  7. The speech emotion recognition method according to claim 1, characterized in that the emotion recognition model comprises a recurrent memory neural network structure and an attention mechanism layer, wherein the attention mechanism layer is connected to the last hidden layer of the recurrent memory neural network structure, the recurrent memory neural network structure is an LSTM network structure, and the LSTM network structure comprises:
    a forget gate:
    f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
    where f_t denotes the output of the forget gate at time t, W_f denotes the weight of the forget gate, σ denotes the sigmoid activation function, h_{t-1} denotes the hidden-layer output of the LSTM cell at time t-1, x_t denotes the input data at time t, and b_f denotes the bias of the forget gate;
    an update gate:
    i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
    C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
    where i_t denotes the output of the update gate at time t, W_i denotes the weight of the sigmoid activation function in the update gate, b_i denotes the bias of the sigmoid activation function, σ denotes the sigmoid activation function, h_{t-1} denotes the hidden-layer output of the LSTM cell at time t-1, x_t denotes the input data at time t, C̃_t denotes the output of the tanh activation function in the update gate, W_C denotes the weight of the tanh activation function in the update gate, and b_C denotes the bias of the tanh activation function in the update gate;
    an information update:
    C_t = f_t * C_{t-1} + i_t * C̃_t
    where C_t denotes the output state of the update gate at time t;
    an output gate:
    o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
    h_t = o_t * tanh(C_t)
    where o_t denotes the output of the LSTM, W_o denotes the weight of the output-gate sigmoid, b_o denotes the bias of the output-gate sigmoid, σ denotes the sigmoid activation function, h_{t-1} denotes the hidden-layer output of the LSTM cell at time t-1, x_t denotes the input data at time t, and h_t denotes the hidden-layer output of the LSTM cell at time t.
  8. The speech emotion recognition method according to claim 7, characterized in that the attention mechanism layer comprises an attention mechanism model and a scoring model, the attention mechanism model is used to iteratively train the weight parameters of the LSTM network structure, and the scoring model is used to adjust the weight parameters of negative emotion in the speech segment;
    wherein the training of the weight parameters of the LSTM network structure by the attention mechanism layer is obtained by the following formulas:
    u_j = v_j·f(W_1·h_j)
    α_j = softmax(u_j)
    c_j = Σ_j α_j·h_j
    where u_j denotes the nonlinear transformation result of the output vector of the j-th node of the hidden layer of the LSTM network structure, v_j denotes a hyperparameter of the attention mechanism layer, W_1 denotes the weight parameter of the LSTM network structure, h_j denotes the output of the j-th node of the hidden layer of the LSTM network structure, f() denotes a nonlinear function, α_j denotes the computed state score, and c_j denotes the output of the attention mechanism layer corresponding to h_j.
  9. The speech emotion recognition method according to claim 1, characterized in that, when the speech segment to be recognized is framed, each frame of speech is 25 ms to 50 ms long, and when frames are shifted, each shift is 15 ms.
  10. A speech emotion recognition apparatus, characterized by comprising: a speech acquisition module, a preprocessing module, an emotion recognition module and an emotion acquisition module, wherein the speech acquisition module is used to acquire a speech segment to be recognized; the preprocessing module is used to preprocess the acquired speech segment to be recognized to obtain multiple frames of speech; the emotion recognition module is used to process the preprocessed speech segment with a pre-trained emotion recognition model to obtain multiple emotion recognition results, each emotion recognition result corresponding to one frame of speech or a set number of frames of speech; and the emotion acquisition module is used to obtain the emotion corresponding to the speech segment to be recognized according to the multiple emotion recognition results.
  11. The speech emotion recognition apparatus according to claim 10, characterized in that the emotion recognition model comprises a recurrent memory neural network structure and an attention mechanism layer, wherein the attention mechanism layer is connected to the last hidden layer of the recurrent memory neural network structure, the recurrent memory neural network structure is an LSTM network structure, and the LSTM network structure comprises:
    a forget gate:
    f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
    where f_t denotes the output of the forget gate at time t, W_f denotes the weight of the forget gate, σ denotes the sigmoid activation function, h_{t-1} denotes the hidden-layer output of the LSTM cell at time t-1, x_t denotes the input data at time t, and b_f denotes the bias of the forget gate;
    an update gate:
    i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
    C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
    where i_t denotes the output of the update gate at time t, W_i denotes the weight of the sigmoid activation function in the update gate, b_i denotes the bias of the sigmoid activation function, σ denotes the sigmoid activation function, h_{t-1} denotes the hidden-layer output of the LSTM cell at time t-1, x_t denotes the input data at time t, C̃_t denotes the output of the tanh activation function in the update gate, W_C denotes the weight of the tanh activation function in the update gate, and b_C denotes the bias of the tanh activation function in the update gate;
    an information update:
    C_t = f_t * C_{t-1} + i_t * C̃_t
    where C_t denotes the output state of the update gate at time t;
    an output gate:
    o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
    h_t = o_t * tanh(C_t)
    where o_t denotes the output of the LSTM, W_o denotes the weight of the output-gate sigmoid, b_o denotes the bias of the output-gate sigmoid, σ denotes the sigmoid activation function, h_{t-1} denotes the hidden-layer output of the LSTM cell at time t-1, x_t denotes the input data at time t, and h_t denotes the hidden-layer output of the LSTM cell at time t.
  12. The speech emotion recognition apparatus according to claim 10, characterized in that the attention mechanism layer comprises an attention mechanism model and a scoring model, the attention mechanism model is used to iteratively train the weight parameters of the LSTM network structure, and the scoring model is used to adjust the weight parameters of negative emotion in the speech segment;
    wherein the training of the weight parameters of the LSTM network structure by the attention mechanism layer is obtained by the following formulas:
    u_j = v_j·f(W_1·h_j)
    α_j = softmax(u_j)
    c_j = Σ_j α_j·h_j
    where u_j denotes the nonlinear transformation result of the output vector of the j-th node of the hidden layer of the LSTM network structure, v_j denotes a hyperparameter of the attention mechanism layer, W_1 denotes the weight parameter of the LSTM network structure, h_j denotes the output of the j-th node of the hidden layer of the LSTM network structure, f() denotes a nonlinear function, α_j denotes the computed state score, and c_j denotes the output of the attention mechanism layer corresponding to h_j.
  13. The speech emotion recognition apparatus according to claim 10, characterized in that the emotion acquisition module comprises a ratio acquisition module and an emotion determination module, wherein the ratio acquisition module is used to obtain the proportion of negative emotion among the multiple emotion recognition results, and the emotion determination module is used to determine, according to the proportion of negative emotion, whether the emotion corresponding to the speech segment to be recognized is negative or non-negative.
  14. The speech emotion recognition apparatus according to claim 10, characterized in that the apparatus further comprises a training module for training the emotion recognition model before the multi-frame speech is processed, the training module comprising:
    a sample library construction unit, which constructs a sample library and labels the samples in the sample library, the samples being speech segments;
    a sample library dividing unit, which divides the sample library into a training set and a test set, and divides the training set into a development set and a validation set, the development set being used to train the model, the validation set being used to tune the model, and the test set being used to test the model;
    a training unit, which trains the hyperparameters of the emotion recognition model with the training samples in the training set to obtain a hyperparameter set that meets preset conditions, and updates the emotion recognition model according to the hyperparameter set;
    a testing unit, which tests the updated emotion recognition model with the test samples in the test set, and if the generated test result fails verification, the training unit continues to train the hyperparameters of the emotion recognition model.
  15. The speech emotion recognition apparatus according to claim 14, characterized in that the training unit trains the hyperparameters of the emotion recognition model in the following manner:
    dividing the training samples in the training set evenly into a preset number of training subsets;
    taking the i-th training subset in turn as the validation set and the remaining training subsets as the development set, training the emotion recognition model for the preset number of rounds, and obtaining the preset number of hyperparameter sets;
    selecting the optimal hyperparameter set from the preset number of hyperparameter sets according to a preset hyperparameter-set selection condition;
    updating the emotion recognition model according to the optimal hyperparameter set, and continuing to perform the step of dividing the training samples in the training set evenly into the preset number of training subsets, until the optimal hyperparameter set meets the preset conditions.
  16. An electronic device, characterized by comprising: a processor and a memory, the memory containing a speech emotion recognition program, wherein the speech emotion recognition method according to claim 1 is implemented when the speech emotion recognition program is executed by the processor.
  17. The electronic device according to claim 16, characterized in that, when the speech emotion recognition program is executed by the processor, a method of training the emotion recognition model before the multi-frame speech is processed is further implemented, comprising:
    constructing a sample library and labeling the samples in the sample library, the samples being speech segments;
    dividing the sample library into a training set and a test set, and dividing the training set into a development set and a validation set, the development set being used to train the model, the validation set being used to tune the model, and the test set being used to test the model;
    training the hyperparameters of the emotion recognition model with the training samples in the training set to obtain a hyperparameter set that meets preset conditions, and updating the emotion recognition model according to the hyperparameter set;
    testing the updated emotion recognition model with the test samples in the test set, and if the generated test result fails verification, performing again the step of training the hyperparameters of the emotion recognition model with the training samples in the training set.
  18. The electronic device according to claim 17, characterized in that the step of training the hyperparameters of the emotion recognition model with the training samples in the training set to obtain a hyperparameter set that meets the preset conditions, and updating the emotion recognition model according to the hyperparameter set, comprises:
    dividing the training samples in the training set evenly into a preset number of training subsets;
    taking the i-th training subset in turn as the validation set and the remaining training subsets as the development set, training the emotion recognition model for the preset number of rounds, and obtaining the preset number of hyperparameter sets;
    selecting the optimal hyperparameter set from the preset number of hyperparameter sets according to a preset hyperparameter-set selection condition;
    updating the emotion recognition model according to the optimal hyperparameter set, and continuing to perform the step of dividing the training samples in the training set evenly into the preset number of training subsets, until the optimal hyperparameter set meets the preset conditions.
  19. The electronic device according to claim 17, characterized in that the step of training the hyperparameters of the emotion recognition model with the training samples in the training set comprises:
    initializing the hyperparameters of the emotion recognition model and generating an initialized emotion recognition model;
    processing the training samples in the training set with the initialized emotion recognition model to obtain a predicted class label corresponding to each training sample;
    calculating an iteration loss value from the predicted class labels and the annotated labels;
    updating the hyperparameters of the initialized emotion recognition model according to the iteration loss value,
    wherein the iteration loss value is obtained by the following formula:
    L = -(1/N) Σ_{i=1..N} [ y^(i)·log ŷ^(i) + (1 - y^(i))·log(1 - ŷ^(i)) ]
    where L denotes the iteration loss value, y^(i) denotes the annotated label of sample i, ŷ^(i) denotes the predicted class label of sample i, and N denotes the total number of samples.
  20. A computer non-volatile readable storage medium, characterized in that the computer non-volatile readable storage medium contains a speech emotion recognition program, and the speech emotion recognition method according to any one of claims 1 to 9 is implemented when the speech emotion recognition program is executed by a processor.
PCT/CN2019/117886 2019-09-17 2019-11-13 Speech emotion recognition method, apparatus, device and storage medium WO2021051577A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910875372.1A CN110556130A (zh) 2019-09-17 2019-09-17 语音情绪识别方法、装置及存储介质
CN201910875372.1 2019-09-17

Publications (1)

Publication Number Publication Date
WO2021051577A1 true WO2021051577A1 (zh) 2021-03-25

Family

ID=68740478

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117886 WO2021051577A1 (zh) 2019-09-17 2019-11-13 语音情绪识别方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN110556130A (zh)
WO (1) WO2021051577A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113241095A (zh) * 2021-06-24 2021-08-10 中国平安人寿保险股份有限公司 Real-time call emotion recognition method and apparatus, computer device and storage medium

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210844B (zh) * 2020-02-03 2023-03-24 北京达佳互联信息技术有限公司 Method, apparatus, device and storage medium for determining a speech emotion recognition model
CN111312292A (zh) * 2020-02-18 2020-06-19 北京三快在线科技有限公司 Speech-based emotion recognition method and apparatus, electronic device and storage medium
CN111796180A (zh) * 2020-06-23 2020-10-20 广西电网有限责任公司电力科学研究院 Automatic identification method and device for mechanical faults of high-voltage switches
CN111798874A (zh) * 2020-06-24 2020-10-20 西北师范大学 Speech emotion recognition method and system
CN111951832B (zh) * 2020-08-24 2023-01-13 上海茂声智能科技有限公司 Method and device for analyzing user dialogue emotion from speech
CN112472090A (zh) * 2020-11-25 2021-03-12 广东技术师范大学 Sound-based emotion recognition method for lactating sows
CN112509561A (zh) * 2020-12-03 2021-03-16 中国联合网络通信集团有限公司 Emotion recognition method, apparatus, device and computer-readable storage medium
CN112466337A (zh) * 2020-12-15 2021-03-09 平安科技(深圳)有限公司 Audio data emotion detection method and apparatus, electronic device and storage medium
CN112634873A (zh) * 2020-12-22 2021-04-09 上海幻维数码创意科技股份有限公司 End-to-end emotion recognition method based on OpenSmile and bidirectional LSTM for Chinese speech
CN113113048B (zh) * 2021-04-09 2023-03-10 平安科技(深圳)有限公司 Speech emotion recognition method and apparatus, computer device and medium
CN113345468A (zh) * 2021-05-25 2021-09-03 平安银行股份有限公司 Voice quality inspection method, apparatus, device and storage medium
CN113571096B (zh) * 2021-07-23 2023-04-07 平安科技(深圳)有限公司 Speech emotion classification model training method and apparatus, computer device and medium
CN113889150B (zh) * 2021-10-15 2023-08-29 北京工业大学 Speech emotion recognition method and device
CN114417868B (zh) * 2022-03-15 2022-07-01 云天智能信息（深圳）有限公司 Intelligent negative emotion assessment method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160162807A1 (en) * 2014-12-04 2016-06-09 Carnegie Mellon University, A Pennsylvania Non-Profit Corporation Emotion Recognition System and Method for Modulating the Behavior of Intelligent Systems
CN108346436A (zh) * 2017-08-22 2018-07-31 腾讯科技(深圳)有限公司 Speech emotion detection method and apparatus, computer device and storage medium
CN108447470A (zh) * 2017-12-28 2018-08-24 中南大学 Emotional speech conversion method based on vocal tract and prosodic features
CN109003625A (zh) * 2018-07-27 2018-12-14 中国科学院自动化研究所 Speech emotion recognition method and system based on triplet loss
CN109036465A (zh) * 2018-06-28 2018-12-18 南京邮电大学 Speech emotion recognition method
CN109599128A (zh) * 2018-12-24 2019-04-09 北京达佳互联信息技术有限公司 Speech emotion recognition method and apparatus, electronic device and readable medium
CN110223714A (zh) * 2019-06-03 2019-09-10 杭州哲信信息技术有限公司 Speech-based emotion recognition method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10127927B2 (en) * 2014-07-28 2018-11-13 Sony Interactive Entertainment Inc. Emotional speech processing
CN108122552B (zh) * 2017-12-15 2021-10-15 上海智臻智能网络科技股份有限公司 Speech emotion recognition method and device
CN109285562B (zh) * 2018-09-28 2022-09-23 东南大学 Speech emotion recognition method based on attention mechanism


Also Published As

Publication number Publication date
CN110556130A (zh) 2019-12-10

Similar Documents

Publication Publication Date Title
WO2021051577A1 (zh) 语音情绪识别方法、装置、设备及存储介质
US20230410796A1 (en) Encoder-decoder models for sequence to sequence mapping
US11934935B2 (en) Feedforward generative neural networks
US9818409B2 (en) Context-dependent modeling of phonemes
US9396724B2 (en) Method and apparatus for building a language model
US10114809B2 (en) Method and apparatus for phonetically annotating text
US20210050033A1 (en) Utilizing bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
CN113094578B (zh) 基于深度学习的内容推荐方法、装置、设备及存储介质
CN110502610A (zh) 基于文本语义相似度的智能语音签名方法、装置及介质
KR20230040951A (ko) 음성 인식 방법, 장치 및 디바이스, 및 저장 매체
WO2014190732A1 (en) Method and apparatus for building a language model
CN111798840B (zh) 语音关键词识别方法和装置
JP7266683B2 (ja) 音声対話に基づく情報検証方法、装置、デバイス、コンピュータ記憶媒体、およびコンピュータプログラム
CN111833845A (zh) 多语种语音识别模型训练方法、装置、设备及存储介质
US11947920B2 (en) Man-machine dialogue method and system, computer device and medium
CN108345612A (zh) 一种问题处理方法和装置、一种用于问题处理的装置
CN113688955B (zh) 文本识别方法、装置、设备及介质
CN115312033A (zh) 基于人工智能的语音情感识别方法、装置、设备及介质
US20220277732A1 (en) Method and apparatus for training speech recognition model, electronic device and storage medium
CN113555005B (zh) 模型训练、置信度确定方法及装置、电子设备、存储介质
US11393447B2 (en) Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium
CN108073704B (zh) 一种liwc词表扩展方法
CN113990353B (zh) 识别情绪的方法、训练情绪识别模型的方法、装置及设备
US20230042234A1 (en) Method for training model, device, and storage medium
CN116796733A (zh) 机器人、实体识别方法、装置和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945609

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945609

Country of ref document: EP

Kind code of ref document: A1