WO2020173133A1 - Training method for emotion recognition model, emotion recognition method, apparatus, device, and storage medium - Google Patents

Training method for emotion recognition model, emotion recognition method, apparatus, device, and storage medium Download PDF

Info

Publication number
WO2020173133A1
WO2020173133A1 (PCT/CN2019/117711, CN2019117711W)
Authority
WO
WIPO (PCT)
Prior art keywords
voice information
mel
frequency
emotion recognition
layer
Prior art date
Application number
PCT/CN2019/117711
Other languages
English (en)
French (fr)
Inventor
刘博卿
贾雪丽
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020173133A1 publication Critical patent/WO2020173133A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the technical field of model training, and in particular to an emotion recognition model training method, emotion recognition method, device, computer equipment and storage medium.
  • In recent years, emotion recognition models that use voice to recognize user emotions based on machine learning have developed rapidly, but emotion recognition from voice still faces many challenges. For example, in order to produce continuous and accurate recognition of positive and negative emotions, some recognition models combine text and acoustic features; this approach requires Automatic Speech Recognition (ASR) technology to convert sound into text, which suffers from a serious latency problem. At the same time, emotion recognition models also suffer from poor generalization: when a model is applied to a new speaker, its accuracy drops.
  • This application provides an emotion recognition model training method, an emotion recognition method, an apparatus, computer equipment, and a storage medium, so as to improve the generalizability of the emotion recognition model and the accuracy of recognition.
  • In a first aspect, this application provides a method for training an emotion recognition model, the method including:
  • acquiring a user's voice information and the data labels corresponding to the voice information; constructing sample data according to the voice information and the corresponding data labels; preprocessing the voice information in the sample data according to preset processing rules to obtain corresponding spectrum vectors; extracting a preset recurrent neural network, the recurrent neural network including an attention mechanism used to strengthen partial regions of the voice information; and, based on the recurrent neural network, performing model training according to the spectrum vectors and data labels corresponding to the voice information to obtain an emotion recognition model.
  • In a second aspect, this application also provides an emotion recognition method, which includes:
  • collecting a user's voice signal; preprocessing the voice signal according to preset processing rules to obtain the spectrum vector corresponding to the voice signal; and inputting the spectrum vector into an emotion recognition model to recognize the user's emotion and obtain the user's emotion category, the emotion recognition model being a model obtained by training with the aforementioned emotion recognition model training method.
  • In a third aspect, the present application also provides an emotion recognition model training device, the device including:
  • an acquiring unit, used to acquire the user's voice information and the data labels corresponding to the voice information;
  • a sample construction unit, configured to construct sample data according to the voice information and the corresponding data labels;
  • a preprocessing unit, configured to preprocess the voice information in the sample data according to preset processing rules to obtain the corresponding spectrum vectors;
  • an extraction unit, configured to extract a preset recurrent neural network, the recurrent neural network including an attention mechanism, where the attention mechanism is used to strengthen partial regions of the voice information;
  • a model training unit, configured to perform model training, based on the recurrent neural network, according to the spectrum vectors and data labels corresponding to the voice information to obtain an emotion recognition model.
  • This application also provides an emotion recognition device, which includes:
  • a signal collection unit, used to collect the user's voice signal;
  • a signal processing unit, configured to preprocess the voice signal according to preset processing rules to obtain the spectrum vector corresponding to the voice signal;
  • an emotion recognition unit, configured to input the spectrum vector into the emotion recognition model to recognize the user's emotion and obtain the user's emotion category, the emotion recognition model being a model obtained by training with the above emotion recognition model training method.
  • In a fourth aspect, the present application also provides a computer device that includes a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, to implement the above emotion recognition model training method or the above emotion recognition method.
  • In a fifth aspect, this application also provides a computer-readable storage medium that stores a computer program which, when executed by a processor, causes the processor to implement the emotion recognition model training method described above, or the emotion recognition method described above.
  • This application discloses a training method, device, equipment, and storage medium for an emotion recognition model. After obtaining the user's voice information and the corresponding data labels, the method preprocesses the voice information according to preset processing rules to obtain the corresponding spectrum vectors, and then, based on a preset recurrent neural network, performs model training according to the spectrum vectors and data labels corresponding to the voice information to obtain an emotion recognition model, where the recurrent neural network includes an attention mechanism used to strengthen partial regions of the voice information.
  • the emotion recognition model trained by this method has the advantages of strong generalization and high recognition accuracy.
  • FIG. 1 is a schematic flowchart of a method for training an emotion recognition model provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the structure of a recurrent neural network provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of sub-steps of the training method of the emotion recognition model in FIG. 1;
  • FIG. 4 is a schematic flowchart of a method for training an emotion recognition model provided by an embodiment of the present application
  • FIG. 5 is a schematic flowchart of an emotion recognition method provided by an embodiment of the present application.
  • FIG. 6 is a schematic block diagram of a model training device provided by an embodiment of the application.
  • FIG. 7 is a schematic block diagram of another model training device provided by an embodiment of the application.
  • FIG. 8 is a schematic block diagram of an emotion recognition device provided by an embodiment of this application.
  • FIG. 9 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
  • the embodiments of the present application provide an emotion recognition model training method, emotion recognition method, device, computer equipment, and storage medium.
  • The emotion recognition model training method can be performed using a server; the emotion recognition method can be applied to a terminal or a server to identify the user's emotion type, such as happy or sad, according to the user's voice.
  • the server can be an independent server or a server cluster.
  • the terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device.
  • FIG. 1 is a schematic flowchart of an emotion recognition model training method provided by an embodiment of the present application.
  • the emotion recognition model is obtained by model training based on a preset recurrent neural network.
  • FIG. 2 is a schematic structural diagram of a preset recurrent neural network provided by an embodiment of the present application.
  • The structure of the recurrent neural network includes an input layer, a recurrent layer, an attention mechanism, a fully connected layer, and an output layer; the attention mechanism is used to establish, according to an attention equation, a mapping relationship between the output of the recurrent layer and a weight vector, so as to strengthen partial regions of the voice information and thereby improve the recognition accuracy of the model.
  • The recurrent layer includes Long Short-Term Memory (LSTM) units, and the output layer uses a Softmax output.
  • In this structure, the temporal dependencies of the input sequence fed to the input layer are modeled by a recurrent layer built from LSTM units; the attention mechanism is applied to the output of the recurrent layer at every time point in the sequence, adding more weight to certain regions of the sequence, namely the regions that matter most when identifying positive and negative emotions.
  • Compared with other recurrent neural networks (RNNs), this preset recurrent neural network can learn long-term dependencies without suffering from vanishing or exploding gradients, and can therefore achieve a better recognition effect.
  • the following describes the training method of the emotion recognition model provided by the embodiment of the present application in combination with the structure of the recurrent neural network in FIG. 2.
  • the training method of the emotion recognition model is used to train the emotion recognition model to accurately and quickly recognize the emotion type of the user.
  • the training method includes steps S101 to S105.
  • the data label is the user's emotional label, including positive emotional label, neutral emotional label and negative emotional label.
  • Of course, the voice information can also be divided into more categories, corresponding to more data labels, such as happy, sad, scared, heartbroken, or neutral; different data labels represent different emotions of the user.
  • Specifically, the user's voice information is obtained from a preset database, and each piece of voice information includes label data, that is, the data label corresponding to that voice information. Before this, the method may further include: collecting the user's voice information, marking the voice information with data labels, and storing the labeled voice information in the preset database.
  • The users can come from different groups of people, such as children, young people, middle-aged people, and the elderly; understandably, they can also come from different occupations, such as teachers, students, doctors, lawyers, and IT personnel, which enriches the diversity of the sample data.
  • In one embodiment, to improve the recognition accuracy of the model, the voice information is collected in a controlled way. That is, acquiring the user's voice information and the corresponding data labels includes: acquiring the voice information produced while the user tells stories of different emotion types, together with the data labels generated from the user's emotional scoring of that voice information.
  • Specifically, the voice information corresponding to two negative stories and two optimistic stories told by the user is collected first; before or after each story, the user scores his or her emotion according to a scoring standard. For example, scores of 0-5 indicate negative emotions and scores of 6-10 indicate positive emotions, and the corresponding data label is generated from the score; if the score is 4, the label data corresponding to that voice information is a negative emotion label.
  • Of course, the voice information collected while the user tells the two negative stories and two optimistic stories can also be scored segment by segment, and the corresponding data labels determined from the segment scores. For example, if the voice information is divided into two speech segments, the first segment with a score of 0 is labeled as negative emotion, and the second segment with a score of 10 is labeled as positive emotion.
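  • As an illustration of the labeling step above, the following is a minimal Python sketch that turns the 0-10 emotion scores described here into data labels, including the segment-level case; the function name, the optional neutral band, and the label strings are illustrative assumptions rather than part of the patent.

```python
# Minimal sketch: map a user's 0-10 emotion score to a data label,
# following the scoring standard above (0-5 negative, 6-10 positive).

def score_to_label(score: int, with_neutral: bool = False) -> str:
    """Convert an emotion score (0-10) into a data label."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if with_neutral and 4 <= score <= 6:
        return "neutral"          # optional finer-grained labeling (assumption)
    return "negative" if score <= 5 else "positive"

# Segment-level labeling, as in the example: one score per speech segment.
segment_scores = [0, 10]
segment_labels = [score_to_label(s) for s in segment_scores]
print(segment_labels)             # ['negative', 'positive']
```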
  • the sample data can be formed according to the collected user's voice information and the corresponding data tags.
  • The users are multiple users; the specific number is not limited here.
  • The sample data includes positive sample data and negative sample data.
  • The positive sample data corresponds to the voice information of positive emotions, such as optimism, happiness, and excitement; the negative sample data corresponds to the voice information of negative emotions, such as pessimism, sadness, and pain.
  • S103 Preprocess the voice information in the sample data according to a preset processing rule to obtain a corresponding spectrum vector.
  • The preset processing rule is used to convert the voice information in the sample data into frequency-domain information, for example by using fast Fourier transform rules or wavelet transform rules to transform the voice information collected in the time domain into information in the frequency domain.
  • In one embodiment, to speed up model training and improve recognition accuracy, a specific preprocessing rule is used, as shown in FIG. 3; that is, step S103 includes sub-steps S103a to S103d.
  • S103a Perform frame and window processing on the voice information in the sample data to obtain processed voice information.
  • In the framing and windowing processing, the frame length is set to 40 ms; the voice information is segmented according to this 40 ms frame length to obtain segmented voice information, and a Hamming window is then applied to each segment.
  • Hamming-window processing means multiplying the segmented speech information by a window function, for the purpose of Fourier expansion.
  • It should be noted that the frame length of the framing and windowing processing can also be set to other values, such as 50 ms, 30 ms, or another value.
  • In one embodiment, before the voice information in the sample data is framed and windowed to obtain the processed voice information, the voice information may also be pre-emphasized, specifically by multiplying it by a preset coefficient that is positively correlated with the frequency of the voice information, so as to boost the amplitude of the high frequencies.
  • The size of the preset coefficient is related to the parameters of model training, that is, it changes as the model parameters change; for example, it is associated with the weight vector a_i, increasing as the mean of a_i increases and decreasing as that mean decreases. The purpose is to better improve the recognition accuracy of the model.
  • In an optional embodiment, the preset coefficient can be set to an empirical value. Setting an empirical value helps eliminate the effects caused by the vocal cords and lips during vocalization, compensating for the high-frequency part of the voice information suppressed by the articulatory system and highlighting the high-frequency formants.
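  • As a concrete illustration of the framing, Hamming windowing, and pre-emphasis described above, the following Python/NumPy sketch turns a raw waveform into windowed 40 ms frames; the 20 ms hop length and the pre-emphasis coefficient of 0.97 are common empirical choices assumed here, since the text leaves them open.

```python
import numpy as np

def preprocess_frames(signal: np.ndarray, sample_rate: int,
                      frame_ms: float = 40.0, hop_ms: float = 20.0,
                      pre_emphasis: float = 0.97) -> np.ndarray:
    """Pre-emphasize, split into frames, and apply a Hamming window."""
    # Pre-emphasis: boost high-frequency amplitudes; 0.97 is a common
    # empirical coefficient (the patent leaves the exact value open).
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    frame_len = int(round(sample_rate * frame_ms / 1000.0))   # 40 ms frames
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    num_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)

    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * hop_len: i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames                                             # (num_frames, frame_len)

# Example with one second of synthetic audio at 16 kHz.
frames = preprocess_frames(np.random.randn(16000), 16000)
print(frames.shape)
```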
  • S103b Perform frequency domain transformation on the processed voice information to obtain a corresponding amplitude spectrum.
  • Specifically, a Fast Fourier Transform (FFT) is performed on the processed voice information to obtain the corresponding parameters; in this embodiment, the magnitude after the fast Fourier transform is used as the amplitude spectrum.
  • Of course, other parameters of the FFT output can also be used, such as the magnitude together with phase information.
  • It is also possible to apply a wavelet transform to the processed voice information and take the transformed magnitude as the amplitude spectrum.
  • The filtering of the amplitude spectrum by the Mel filter bank includes: obtaining the maximum frequency corresponding to the voice information and calculating the Mel frequency corresponding to that maximum frequency with the Mel frequency calculation formula; calculating, from the calculated Mel frequency and the number of triangular filters in the Mel filter bank, the Mel distance between the center frequencies of two adjacent triangular filters; completing the linear distribution of the multiple triangular filters according to that Mel distance; and filtering the amplitude spectrum with the linearly distributed triangular filters.
  • The Mel filter bank specifically includes 40 triangular filters linearly distributed on the Mel scale. After the obtained amplitude spectrum is filtered by these 40 linearly distributed triangular filters, a discrete cosine transform is applied to obtain the Mel frequency cepstrum coefficients.
  • The Mel frequency calculation formula is f_mel = A · log10(1 + f / 700) (formula (1)), where f_mel is the Mel frequency, f is the maximum frequency corresponding to the voice information, and A is a coefficient, specifically 2595.
  • For example, if the determined maximum frequency is 4000 Hz, the maximum Mel frequency calculated with formula (1) is 2146.1 mel.
  • Within the Mel scale, the center frequencies of the triangular filters are linearly distributed at equal intervals. From this, the Mel distance between the center frequencies of two adjacent triangular filters can be calculated with formula (2), where Δmel is the distance between the center frequencies of two adjacent triangular filters and k is the number of triangular filters.
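  • The Mel-scale computation can be illustrated as follows; the sketch reproduces formula (1) and checks the 4000 Hz to 2146.1 mel example from the text, while the division by (k + 1) used to space the 40 filter centers is one common convention and an assumption here, since formula (2) itself is not reproduced in this text.

```python
import numpy as np

A = 2595.0                     # the coefficient A in formula (1)

def hz_to_mel(f_hz: float) -> float:
    """Formula (1): f_mel = A * log10(1 + f / 700)."""
    return A * np.log10(1.0 + f_hz / 700.0)

max_mel = hz_to_mel(4000.0)
print(round(max_mel, 1))       # 2146.1 mel, matching the example in the text

k = 40                         # number of triangular filters in the Mel filter bank
# Assumption: with k filter centers spaced equally between 0 and the maximum
# Mel frequency, adjacent centers are separated by max_mel / (k + 1).
delta_mel = max_mel / (k + 1)
centers_mel = delta_mel * np.arange(1, k + 1)
print(round(delta_mel, 2), np.round(centers_mel[:3], 1))
```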
  • In sub-step S103d, zero-mean normalization is applied to the Mel frequency cepstrum coefficients to obtain the spectrum vector corresponding to the voice information; the conversion formula corresponding to zero-mean normalization is x* = (x - x̄) / σ, where x̄ is the mean of the Mel frequency cepstrum coefficients, σ is their standard deviation, x is each Mel frequency cepstrum coefficient, and x* is the normalized coefficient.
  • Zero-mean normalization (Z-score normalization) is also known as standard deviation normalization. The mean of the processed data is 0 and its standard deviation is 1.
  • Z-score standardization converts data of different magnitudes onto the same scale, measured uniformly by the calculated Z-score value, so as to ensure comparability between data.
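  • For reference, a compact end-to-end version of sub-steps S103a to S103d can be sketched with librosa followed by the Z-score normalization; librosa's MFCC routine differs in implementation details (for example in its Mel-scale variant) from the exact procedure above, so this is an illustrative approximation, not the patent's exact pipeline.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)        # stand-in for one second of collected speech

# 40 ms frames, 40-filter Mel bank up to 4000 Hz, then DCT -> 40 MFCCs per frame.
frame_len = int(0.040 * sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                            n_fft=frame_len, hop_length=frame_len // 2,
                            window="hamming", n_mels=40, fmax=4000.0)

# Zero-mean (Z-score) normalization per coefficient: x* = (x - mean) / std.
mfcc_norm = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
print(mfcc_norm.shape)                            # (40, number of frames): the spectrum vectors
```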
  • The structure of the recurrent neural network includes an input layer, a recurrent layer, an attention mechanism, a fully connected layer, and an output layer; the attention mechanism is used to establish, according to an attention equation, a mapping relationship between the output of the recurrent layer and a weight vector, so as to strengthen partial regions of the voice information.
  • The attention equation is g = Σ_{i=0..T-1} a_i · h_i, where g is the input vector of the fully connected layer, h_i is the output of the recurrent layer at each time point i, a_i is the weight corresponding to each time point i, representing how much that time point influences the fully connected layer and the output layer, and T is the total number of time points.
  • The key of the attention mechanism is to learn the function f, which establishes a mapping between the output h_i of the recurrent layer at each time point i and the weight a_i; a_i represents the influence of each time point on the subsequent layers of the network.
  • The parameters of f(h_i) are optimized during training. Its simplified form is f(h_i) = tanh(W·h_i + b) (formula (4)), a linear function followed by a tanh activation, which achieves good results while improving the training speed of the model.
  • For a given time point i, the weight a_i is computed as a_i = exp(u^T f(h_i)) / Σ_{j=0..T-1} exp(u^T f(h_j)) (formula (5)), where W is a matrix parameter of dimension S*D, S is a positive integer, b and u are vector parameters of dimension S, and D is the number of network units in the recurrent layer.
  • g is a vector that serves as the input of the fully connected layer, whose activation function is the ReLU function; a Softmax function is then applied to obtain the final output.
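  • A minimal PyTorch sketch of the attention pooling described by formulas (4) and (5), together with the weighted sum g that feeds the fully connected layer, is given below; the class name and the attention dimension S = 64 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """f(h_i) = tanh(W h_i + b); a_i = softmax_i(u . f(h_i)); g = sum_i a_i h_i."""
    def __init__(self, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)    # W (S x D) and b (S)
        self.u = nn.Parameter(torch.randn(attn_dim))   # u (S)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, hidden_dim), the recurrent-layer output at every time point.
        scores = torch.tanh(self.proj(h)) @ self.u     # u . f(h_i) for each i
        a = torch.softmax(scores, dim=1)               # weights a_i over the T time points
        return (a.unsqueeze(-1) * h).sum(dim=1)        # g, the fully connected layer's input

g = AttentionPooling(hidden_dim=128, attn_dim=64)(torch.randn(2, 50, 128))
print(g.shape)                                         # torch.Size([2, 128])
```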
  • S105 Based on the cyclic neural network, perform model training according to the frequency spectrum vector and data label corresponding to the voice information to obtain an emotion recognition model.
  • Specifically, the spectrum vectors are input into the preset recurrent neural network for model training; the attention mechanism in the improved model strengthens the main parts of the sound, and the corresponding model parameters are optimized to obtain the emotion recognition model. The model training parameters are shown in Table 1.
  • Table 1. Relevant parameters of the training network: optimization algorithm, Adam; learning rate, 0.0005; number of LSTM units, 128; number of fully connected layer neurons, 20; dropout keep probability, 0.7.
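  • Putting the pieces together, the following PyTorch sketch assembles the network of FIG. 2 with the Table 1 hyperparameters (128 LSTM units, 20 fully connected neurons, dropout keep probability 0.7, Adam with learning rate 0.0005), folding the attention computation from the previous sketch inline; the attention dimension, the 40-coefficient input size, and the two-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmotionRecognizer(nn.Module):
    """Sketch of the FIG. 2 network with the Table 1 hyperparameters:
    LSTM recurrent layer (128 units) -> attention -> fully connected layer
    (20 neurons, ReLU) -> Softmax output over the emotion classes."""
    def __init__(self, n_features: int = 40, n_classes: int = 2,
                 lstm_units: int = 128, fc_units: int = 20,
                 dropout_keep: float = 0.7, attn_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, lstm_units, batch_first=True)
        self.attn_proj = nn.Linear(lstm_units, attn_dim)     # W and b of f(h_i)
        self.attn_u = nn.Parameter(torch.randn(attn_dim))    # u
        self.dropout = nn.Dropout(p=1.0 - dropout_keep)      # keep probability 0.7
        self.fc = nn.Linear(lstm_units, fc_units)
        self.out = nn.Linear(fc_units, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(x)                                   # (batch, T, 128)
        scores = torch.tanh(self.attn_proj(h)) @ self.attn_u  # u . f(h_i) -> (batch, T)
        a = torch.softmax(scores, dim=1)                      # attention weights a_i
        g = self.dropout((a.unsqueeze(-1) * h).sum(dim=1))    # g = sum_i a_i h_i
        z = torch.relu(self.fc(g))                            # fully connected layer
        return self.out(z)                                    # logits; Softmax applied in the loss

model = EmotionRecognizer()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)   # Table 1: Adam, lr 0.0005
logits = model(torch.randn(4, 100, 40))                       # (batch, time points, spectral coefficients)
print(logits.shape)                                           # torch.Size([4, 2])
```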
  • The model training method provided by the foregoing embodiment obtains the user's voice information and corresponding data labels, preprocesses the voice information according to preset processing rules to obtain the corresponding spectrum vectors, and then, based on the preset recurrent neural network, performs model training according to the spectrum vectors and data labels corresponding to the voice information to obtain an emotion recognition model, where the recurrent neural network includes an attention mechanism used to strengthen partial regions of the voice information.
  • the emotion recognition model trained by this method has the advantages of strong generalization and high recognition accuracy.
  • FIG. 4 is a schematic flowchart of another method for training an emotion recognition model provided by an embodiment of the present application.
  • the emotion recognition model is obtained by model training based on a preset recurrent neural network, and of course, other networks can also be used for training.
  • the training method of the emotion recognition model includes steps S201 to S207.
  • S201 Acquire voice information of a user and a data tag corresponding to the voice information.
  • the data label is the user's emotional label, including positive emotional label, neutral emotional label and negative emotional label.
  • Of course, the voice information can also be divided into more categories, corresponding to more data labels, such as happy, sad, scared, heartbroken, or neutral; different data labels represent different emotions of the user.
  • S202: Construct sample data according to the voice information and the corresponding data labels, where the sample data includes at least positive sample data and negative sample data.
  • Specifically, the sample data can be formed from the collected voice information of the users and the corresponding data labels. Since users have different emotions, the sample data includes at least positive sample data and negative sample data, and may, for example, also include neutral sample data.
  • The positive sample data corresponds to the voice information of positive emotions; the negative sample data corresponds to the voice information of negative emotions.
  • S203 Determine whether the positive sample data and the negative sample data in the sample data reach a balance.
  • Specifically, whether the positive sample data and the negative sample data in the sample data are balanced is judged, and a judgment result is produced; the result is either that the positive and negative sample data are balanced, or that they are not balanced.
  • If the positive sample data and the negative sample data are unbalanced, step S204 is executed; if they are balanced, step S205 is executed.
  • S204 Process the sample data according to a preset data processing rule to balance the positive sample data and the negative sample data.
  • the sample data is processed according to a preset data processing rule to balance the positive sample data and the negative sample data.
  • The sample data can be processed in two ways to balance the positive sample data and the negative sample data (both ways are sketched in the code example after the cross-entropy discussion below). They are:
  • First, the sample data is processed by oversampling: among the constructed sample data, the amount of negative sample data is generally smaller than that of positive sample data, so the negative sample data is copied multiple times and combined with the positive sample data to form the sample data used for training.
  • Since the negative sample data is copied several times to form new sample data, the problem of sample imbalance can be alleviated for the training data.
  • Second, the sample data is handled by setting a weighted loss function: the trained model weights θ are made optimal by minimizing a standard cross-entropy function or a weighted cross-entropy function. The idea is weighting: for example, when negative samples are few and a sample is known to be negative during training, the model parameters are adjusted through the weights so as to increase the influence of the negative samples.
  • The standard cross-entropy loss function is the sum, over every observed sequence n, of the negative log of the Softmax output assigned to the true class of that sequence; here the Softmax output is computed from X, a matrix of dimension F*D in which F is the number of spectral coefficients input at each time point, and C_n is the label of the class corresponding to sequence n, whose value range is {0, 1} (or {0, 1, 2}, corresponding to negative, neutral, and positive samples respectively).
  • a weighted cross entropy function can also be used.
  • the weighted cross entropy function is similar to the standard cross entropy loss function, and the goal is to solve the problem of uneven sample data.
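  • The two balancing strategies above (oversampling the scarcer class, and weighting the cross-entropy loss) can be sketched as follows; the inverse-frequency weighting scheme and the toy data are assumptions for illustration, not the patent's prescribed settings.

```python
import torch
import torch.nn as nn

# Option 1 - oversampling: duplicate the scarcer negative samples so that the
# two classes are of comparable size (toy data; real items would be spectrum
# vectors with their labels).
positive = [("pos_clip_%d" % i, 1) for i in range(100)]
negative = [("neg_clip_%d" % i, 0) for i in range(25)]
factor = max(1, len(positive) // max(1, len(negative)))
balanced = positive + negative * factor
print(len(positive), len(negative), len(balanced))            # 100 25 200

# Option 2 - weighted cross-entropy: give the under-represented class a larger
# weight so its errors contribute more to the loss (inverse-frequency weights
# are one common scheme, assumed here).
counts = torch.tensor([len(negative), len(positive)], dtype=torch.float)
class_weights = counts.sum() / (len(counts) * counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)                                     # model outputs for one mini-batch
labels = torch.randint(0, 2, (8,))
print(class_weights, criterion(logits, labels).item())
```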
  • S205 Preprocess the voice information in the sample data according to a preset processing rule to obtain a corresponding spectrum vector.
  • the voice information in the sample data is preprocessed according to a preset processing rule to obtain a corresponding spectrum vector.
  • The preset processing rule is used to convert the voice information in the sample data into frequency-domain information, for example by using fast Fourier transform rules or wavelet transform rules to transform the voice information collected in the time domain into information in the frequency domain.
  • S206: Extract a preset recurrent neural network, the recurrent neural network including an attention mechanism used to strengthen partial regions of the voice information. The structure of the recurrent neural network includes an input layer, a recurrent layer, an attention mechanism, a fully connected layer, and an output layer; the attention mechanism is used to establish, according to the attention equation, a mapping relationship between the output of the recurrent layer and the weight vector, so as to strengthen partial regions of the voice information.
  • S207 Based on the cyclic neural network, perform model training according to the frequency spectrum vector and data label corresponding to the voice information to obtain an emotion recognition model.
  • the frequency spectrum vector is input to the preset recurrent neural network for model training, the main part of the sound is strengthened through the attention mechanism in the improved model, and the corresponding model parameters are optimized to obtain the emotion recognition model.
  • The emotion recognition model trained by this method has strong generalizability and high recognition accuracy. At the same time, because extreme emotions are often much rarer than neutral emotions, sample imbalance and the resulting overfitting are common; this method solves the sample imbalance problem well and thereby improves the accuracy of the model.
  • FIG. 5 is a schematic flowchart of an emotion recognition method provided by an embodiment of the present application.
  • the emotion recognition method can be applied to a terminal or a server to recognize the emotion of the user according to the voice of the user.
  • the emotion recognition method includes steps S301 to S303.
  • S301: Collect the user's voice signal. Specifically, the voice signal produced while chatting with the user can be collected through a recording device such as a voice recorder, a smart phone, a tablet computer, a notebook computer, or a smart wearable device such as a smart band or a smart watch.
  • S302: Preprocess the voice signal according to preset processing rules to obtain the spectrum vector corresponding to the voice signal. Specifically, this includes: performing framing and windowing on the voice information to obtain the processed voice information;
  • performing a fast Fourier transform on the processed voice information to obtain the amplitude spectrum;
  • applying the Mel filter bank to the amplitude spectrum and performing a discrete cosine transform on the output of the Mel filter bank to obtain the Mel frequency cepstrum coefficients; and normalizing each of the obtained Mel frequency cepstrum coefficients to obtain the spectrum vector corresponding to the voice information.
  • S303: Input the spectrum vector into the emotion recognition model to recognize the user's emotion and obtain the user's emotion category. The emotion recognition model is a model obtained by training with the emotion recognition model training method provided in the foregoing embodiments.
  • The input spectrum vector is analyzed by the emotion recognition model to accurately obtain the user's emotion, specifically the emotion type, such as happy, sad, or neutral.
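  • As a sketch of this inference step, the following Python snippet feeds one utterance's normalized MFCC matrix to a model and returns the predicted emotion category; the placeholder model, the label order, and the helper name are illustrative assumptions standing in for the trained emotion recognition model.

```python
import numpy as np
import torch
import torch.nn as nn

EMOTIONS = ["negative", "positive"]        # assumed label order

class PlaceholderModel(nn.Module):
    """Stand-in for the trained model: any module mapping a
    (batch, T, n_coeffs) spectrum-vector sequence to class logits."""
    def __init__(self, n_coeffs: int = 40, n_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(n_coeffs, n_classes)

    def forward(self, x):
        return self.fc(x.mean(dim=1))      # mean over time, then a linear layer

def recognize_emotion(model: nn.Module, mfcc_norm: np.ndarray) -> str:
    """mfcc_norm: (n_coeffs, T) normalized MFCC matrix for one utterance."""
    model.eval()
    with torch.no_grad():
        x = torch.from_numpy(mfcc_norm.T).float().unsqueeze(0)   # (1, T, n_coeffs)
        probs = torch.softmax(model(x), dim=-1)
        return EMOTIONS[int(probs.argmax(dim=-1))]

print(recognize_emotion(PlaceholderModel(), np.random.randn(40, 120)))
```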
  • The emotion recognition method provided by the foregoing embodiment collects the user's voice signal, preprocesses the voice signal according to preset processing rules to obtain the spectrum vector corresponding to the voice signal, and inputs the spectrum vector into the emotion recognition model to recognize the user's emotion and obtain the user's emotion category.
  • This method can quickly identify the user's emotion type and at the same time offers high recognition accuracy.
  • FIG. 6 is a schematic block diagram of a model training device provided by an embodiment of the present application.
  • the model training device may be configured in a server and used to execute the aforementioned emotion recognition model training method.
  • the model training device 400 includes: an information acquisition unit 401, a sample construction unit 402, a data processing unit 403, a network extraction unit 404, and a model training unit 405.
  • the information acquisition unit 401 is configured to acquire user voice information and data tags corresponding to the voice information.
  • the sample construction unit 402 is used to construct sample data according to the voice information and corresponding data tags.
  • the data processing unit 403 is configured to preprocess the voice information in the sample data according to preset processing rules to obtain the corresponding spectrum vector.
  • the data processing unit 403 includes:
  • the information processing subunit 4031 is used to perform frame and window processing on the voice information in the sample data to obtain processed voice information; the information transformation subunit 4032 is used to perform frequency domain transformation on the processed voice information to Obtain the corresponding amplitude spectrum; a filter transformation subunit 4033, configured to filter the amplitude spectrum through the Mel filter bank, and perform discrete cosine transform on the filtered amplitude spectrum to obtain the Mel frequency cepstrum coefficient; The normalization subunit 4034 is configured to perform normalization processing on the Mel frequency cepstrum coefficients to obtain the spectrum vector corresponding to the voice information.
  • the filter transformation subunit 4033 is specifically configured to: obtain the maximum frequency corresponding to the voice information, calculate the mel frequency corresponding to the maximum frequency by using the mel frequency calculation formula; according to the calculated mel frequency and The number of triangular filters in the mel filter bank calculates the mel distance between the center frequencies of two adjacent triangular filters; the linear distribution of multiple triangular filters is completed according to the mel distance; the linear distribution is completed according to A plurality of triangular filters performs filtering processing on the amplitude spectrum.
  • the network extraction unit 404 is configured to extract a preset cyclic neural network, the cyclic neural network including an attention mechanism, and the attention mechanism is used to enhance a part of the voice information;
  • the model training unit 405 is configured to perform model training according to the frequency spectrum vector and data label corresponding to the voice information based on the cyclic neural network to obtain an emotion recognition model.
  • FIG. 7 is a schematic block diagram of another model training device provided by an embodiment of the present application.
  • the model training device may be configured in a server and used to execute the aforementioned emotion recognition model training method.
  • the model training device 500 includes: an information acquisition unit 501, a sample construction unit 502, a balance judgment unit 503, a balance processing unit 504, a data processing unit 505, a network extraction unit 506, and a model training unit 507.
  • the information acquiring unit 501 is configured to acquire the user's voice information and the data tag corresponding to the voice information.
  • the sample construction unit 502 is configured to construct sample data according to the voice information and corresponding data tags, the sample data including positive sample data and negative sample data.
  • the balance judgment unit 503 is used to judge whether the positive sample data and the negative sample data in the sample data are in balance.
  • the balance processing unit 504 is configured to, if the positive sample data and the negative sample data are not balanced, process the sample data according to a preset data processing rule to balance the positive sample data and the negative sample data.
  • the data processing unit 505 is configured to, if the positive sample data and the negative sample data are balanced, preprocess the voice information in the sample data according to a preset processing rule to obtain a corresponding spectrum vector.
  • the network extraction unit 506 is configured to extract a preset cyclic neural network, the cyclic neural network including an attention mechanism, and the attention mechanism is used to enhance a part of the voice information;
  • the model training unit 507 is configured to perform model training according to the frequency spectrum vector and data label corresponding to the speech information based on the cyclic neural network to obtain an emotion recognition model.
  • FIG. 8 is a schematic block diagram of an emotion recognition device provided by an embodiment of the present application.
  • the emotion recognition device may be configured in a terminal or a server to execute the aforementioned emotion recognition method.
  • the emotion recognition device 600 includes: a signal collection unit 601, a signal processing unit 602 and an emotion recognition unit 603.
  • the signal collection unit 601 is used to collect the user's voice signal.
  • the signal processing unit 602 is configured to preprocess the voice signal according to preset processing rules to obtain a spectrum vector corresponding to the voice signal.
  • the emotion recognition unit 603 is configured to input the spectrum vector into the emotion recognition model to recognize the user's emotion and obtain the user's emotion category, the emotion recognition model being a model obtained by training with any of the emotion recognition model training methods described above.
  • the above-mentioned apparatus can be implemented in the form of a computer program, and the computer program can be run on the computer device as shown in FIG. 9.
  • FIG. 9 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
  • the computer equipment can be a server or a terminal.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • The computer program includes program instructions which, when executed, can cause the processor to perform any of the emotion recognition model training methods or emotion recognition methods.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, it can cause the processor to perform any of the emotion recognition model training methods or emotion recognition methods.
  • the network interface is used for network communication, such as sending assigned tasks.
  • It can be understood by those skilled in the art that the structure shown in FIG. 9 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer equipment to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the embodiments of the present application also provide a computer-readable storage medium, the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the processor executes the program instructions to implement the present application Any one of the emotion recognition model training methods or emotion recognition methods provided in the embodiments.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, such as the hard disk or memory of the computer device.
  • The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

A training method for an emotion recognition model, an apparatus, a device, and a storage medium. The method includes: constructing sample data according to a user's voice information and data labels; preprocessing the voice information in the sample data to obtain corresponding spectrum vectors; and, based on a recurrent neural network, performing model training according to the spectrum vectors and data labels corresponding to the voice information to obtain an emotion recognition model.

Description

情感识别模型的训练方法、情感识别方法、装置、设备及存储介质
本申请要求于2019年2月27日提交中国专利局、申请号为201910145605.2、发明名称为“情感识别模型的训练方法、情感识别方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中
技术领域
本申请涉及模型训练技术领域,尤其涉及一种情感识别模型的训练方法、情感识别方法、装置、计算机设备及存储介质。
背景技术
近年来,基于机器学习利用声音识别用户情感的情感识别模型得到了广泛的发展,但针对声音的情感识别还面临了很多挑战,比如为了产生持续的精确的正负情感的识别,部分识别模型采用文字和声学特征结合的方式,这种方式需要利用语音识别(Automatic Speech Recognition,ASR)技术将声音转化为文字信息,但是存在延迟性严重的问题。同时,情感识别模型还存在泛化性差的问题,当把模型应用到新的说话人时,其准确率会降低。
发明内容
本申请提供了一种情感识别模型的训练方法、情感识别方法、装置、计算机设备及存储介质,以提高情感识别模型的可泛化性,提高识别的准确率。
第一方面,本申请提供了一种情感识别模型的训练方法,所述方法包括:
获取用户的语音信息以及所述语音信息对应的数据标签;
根据所述语音信息以及对应的数据标签构建样本数据;
根据预设处理规则对所述样本数据中的语音信息进行预处理以得到对应的频谱向量;
提取预设的循环神经网络,所述循环神经网络包括注意力机制,所述注意力机制用于加强所述语音信息中的部分区域;
基于所述循环神经网络,根据所述语音信息对应的频谱向量和数据标签进行模型训练以得到情感识别模型。
第二方面,本申请还提供了一种情感识别方法,所述方法包括:
采集用户的语音信号;
根据预设处理规则对所述语音信号进行预处理以得到所述语音信号对应的频谱向量;
将所述频谱向量输入至情感识别模型对所述用户的情感进行识别,以得到所述用户的情感类别,所述情感识别模型为采用上述的情感识别模型训练方法训练得到的模型。
第三方面,本申请还提供了一种情感识别模型的训练装置,所述装置包括:
获取单元,用于获取用户的语音信息以及所述语音信息对应的数据标签;
样本构建单元,用于根据所述语音信息以及对应的数据标签构建样本数据;
预处理单元,用于根据预设处理规则对所述样本数据中的语音信息进行预处理以得到对应的频谱向量;
提取单元,用于提取预设的循环神经网络,所述循环神经网络包括注意力机制,所述注意力机制用于加强所述语音信息中的部分区域;
模型训练单元,用于基于所述循环神经网络,根据所述语音信息对应的频谱向量和数据标签进行模型训练以得到情感识别模型。
第三方面,本申请还提供了一种情感识别装置,所述装置包括:
信号采集单元,用于采集用户的语音信号;
信号处理单元,用于根据预设处理规则对所述语音信号进行预处理以得到所述语音信号对应的频谱向量;
情感识别单元,用于将所述频谱向量输入至情感识别模型对所述用户的情感进行识别,以得到所述用户的情感类别,所述情感识别模型为采用上述的情感识别模型训练方法训练得到的模型。
第四方面,本申请还提供了一种计算机设备,所述计算机设备包括存储器和处理器;所述存储器用于存储计算机程序;所述处理器,用于执行所述计算机程序并在执行所述计算机程序时实现如上述的情感识别模型的训练方法,或者所述的情感识别方法。
第五方面,本申请还提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时使所述处理器实现如上述的情感识别模型的训练方法,或者所述的情感识别方法。
本申请公开了一种情感识别模型的训练方法、装置、设备及存储介质,该方法在获取到用户的语音信息以及对应的数据标签后,根据预设处理规则对语音信息进行预处理以得到对应的频谱向量,再基于预设的循环神经网络,根据语音信息对应的频谱向量和数据标签进行模型训练以得到情感识别模型,其中,该循环神经网络包括注意力机制,所述注意力机制用于加强所述语音信息中的部分区域。该方法训练出的情感识别模型具有可泛化性强,识别的准确率高等优点。
附图说明
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请的实施例提供的一种情感识别模型的训练方法的示意流程图;
图2是本申请的实施例提供的循环神经网络的结构示意图;
图3是图1中的情感识别模型的训练方法的子步骤示意流程图;
图4是本申请的实施例提供的一种情感识别模型的训练方法的示意流程图;
图5是本申请的实施例提供的一种情感识别方法的示意流程图;
图6为本申请实施例提供的一种模型训练装置的示意性框图;
图7为本申请实施例提供的另一种模型训练装置的示意性框图;
图8为本申请实施例提供的一种情感识别装置的示意性框图;
图9为本申请一实施例提供的一种计算机设备的结构示意性框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
附图中所示的流程图仅是示例说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解、组合或部分合并,因此实际执行的顺序有可能根据实际情况改变。
本申请的实施例提供了一种情感识别模型的训练方法、情感识别方法、装置、计算机设备及存储介质。其中,情感识别模型的训练方法可使用服务器进行训练;情感识别方法可以应用于终端或服务器中,用于根据用户的声音识别出该用户的情感类型,比如高兴或悲伤等。
其中,服务器可以为独立的服务器,也可以为服务器集群。该终端可以是手机、平板电脑、笔记本电脑、台式电脑、个人数字助理和穿戴式设备等电子设备。
下面结合附图,对本申请的一些实施方式作详细说明。在不冲突的情况下,下述的实施例及实施例中的特征可以相互组合。
请参阅图1,图1是本申请的实施例提供的一种情感识别模型的训练方法的示意流程图。其中,该情感识别模型是基于预设的循环神经网络进行模型训练得到的。
如图2所示,图2是本申请的实施例提供的一种预设的循环神经网络的结构示意图。所述循环神经网络的结构包括输入层、循环层、注意力机制、全连层和输出层;所述注意力机制用于根据注意力方程建立所述循环层的输出量与权重向量之间的映射关系以实现加强所述语音信息中的部分区域,进而提高模型的识别准确度。
其中,循环层包括长短期记忆网络(Long Short-Term Memory,LSTM)单元,输出层采用的是Softmax输出。在循环神经网络的结构中,输入层对应的输入序列中时间上的依赖是用一个包括长短期记忆网络单元的循环层来建模 的;注意力机制是被应用到在序列中每一个时间点对应的循环层的输出上,为序列中的一些区域增加更多的权重,这些区域是识别正负情绪时重要的区域。相对于其他的循环神经网络(Recurrent Neural Networks,RNN)来说,该预设的循环神经网络可以用来学习长时间的依赖关系,同时还没有梯度消失或者梯度爆炸的问题,可以得到更好的识别效果。
以下将结合图2中的循环神经网络的结构,介绍本申请的实施例提供的情感识别模型的训练方法。
如图1所示,该情感识别模型的训练方法,用于训练出情感识别模型以准确快速地识别出用户的情感类型。其中该训练方法包括步骤S101至步骤S105。
S101、获取用户的语音信息以及所述语音信息对应的数据标签。
其中,数据标签为用户的情感标签,包括正情绪标签、中性情绪标签和负情绪标签等。当然,也可以将语音信息分为更多的类,进而对应更多数据标签,比如高兴、悲伤、害怕、伤心或中性等数据标签,不同数据标签代表用户的不同情绪。
具体地,从预设数据库中获取用户的语音信息,该语音信息均包括有标签数据,即所述语音信息对应的数据标签。在此之前,还包括:采集用户的语音信息并根据数据标签对所述语音信息进行标记,以及将标记有数据标签的语音信息保存在所述预设数据库中。用户可以包括不同人群中用户,比如小孩、青年、中年和老年等人群的用户等;可以理解的是,也可以是不同职业的人群,比如教师、学生、医生、律师和IT人员等,进而丰富样本数据的多样性。
在一个实施例中,为了提高模型的识别准确度,对语音信息进行设定并采集,即所述获取用户的语音信息以及所述语音信息对应的数据标签,包括:获取用户讲述不同情感类型的故事时对应的语音信息以及所述用户对所述语音信息进行情感打分生成的数据标签。
具体地,首先采集用户讲述两个消极的故事和两个乐观的故事分别对应的语音信息;并在讲每一个故事之前或讲故事之后,获取所述用户按照打分标准对其情绪进行打分对应的打分分数;打分标准比如打0-5分表示负情绪,6-10分是正情绪,并根据打分分数生成对应的数据标签;比如打分为4分,则该语音信息对应的标签数据为负情绪标签。
当然,也可以将采集的用户讲述两个消极的故事和两个乐观的故事对应的语音信息进行分段打分,并根据分段打分对应的打分分数确定相应的数据标签,比如,将语音信息分成两段语音片段,第一段语音片段的打分分数为0分,则对应的数据标签为负情绪,第二段语音片段的打分分数为10分,则对应的数据标签为正情绪。
S102、根据所述语音信息以及对应的数据标签构建样本数据。
具体地,可以根据采集用户的语音信息以及对应的数据标签构成样本数据。 用户为多个用户,具体数量在此不限定,由于用户的情感不同,因此该样本数据包括正样本数据和负样本数据,正样本数据对应正情绪的语音信息,正情绪比如为乐观、高兴和兴奋等;负样本数据对应负情绪的语音信息,负情绪比如为消极、悲伤和痛苦等相关的情绪。
S103、根据预设处理规则对所述样本数据中的语音信息进行预处理以得到对应的频谱向量。
其中,该预设处理规则为用于将所述样本数据中的语音信息转出频域中的信息,具体比如采用快速傅里叶变换规则或者小波变换规则将在时域中采集的语音信息转换成频域中的信息。
在一实施例中,为了加快模型的训练以及识别的精度,采用预处理规则,如图3所示,即步骤S103包括:子步骤S103a至子步骤S103d。
S103a、对所述样本数据中的语音信息进行分帧加窗处理以得到处理后的语音信息。
其中,分帧加窗处理具体设置帧长为40ms,按照设置的帧长40ms对语音信息进行分割处理以得到分割后的语音信息,然后再对分割后语音信息加海明窗处理,加海明窗处理是指将分割后语音信息乘以一个窗函数,目的是为了进行傅里叶展开。
需要说明的是,分帧加窗处理,具体设置帧长可以设为其他值,比如设置为50ms、30ms或其他值。
在一个实施例中,在对所述样本数据中的语音信息进行分帧加窗处理以得到处理后的语音信息之前,还可对语音信息进行预加重处理,具体是乘以一个与语音信息的频率成正相关的预设系数,以提升高频的幅值,该预设系数的大小与模型训练的参数相关联,即根据模型参数的变化而变化,比如与权重向量a i相关联,具体根据权重向量a i对应的均值增大而增大,或者根据该均值减小而减小。目的是更好地提高模型的识别精度。
在一个可选的实施例中,预设系数可以设为一个经验值,设置一个经验值可以用于消除用户发声过程中声带和嘴唇造成的效应,来补偿语音信息受到发音系统所压抑的高频部分,并且能突显高频的共振峰。
S103b、对处理后的语音信息进行频域变换以得到对应的幅度谱。
具体地,是对处理后的语音信息进行快速傅里叶变换(Fast Fourier Transform、FFT),以得到相应的参数,在本实施例中是为了得到幅值作为幅度谱,即快速傅里叶变换后的幅值。当然,也可以用FFT变换后的其他参数,比如幅值加上相位信息等。
可以理解的是,也可以对处理后的语音信息进行小波变换以得到相应的参数,并选择变换后的幅值作为幅度谱。
S103c、通过梅尔滤波器组对所述幅度谱进行滤波处理,并对滤波处理后的 幅度谱进行离散余弦变换以得到梅尔频率倒谱系数。
具体地,所述通过梅尔滤波器组对所述幅度谱进行滤波处理,包括:获取所述语音信息对应的最大频率,利用梅尔频率计算公式计算所述最大频率对应的梅尔频率;根据计算的梅尔频率以及所述梅尔滤波器组中三角滤波器的数量计算两个相邻三角滤波器的中心频率的梅尔间距;根据所述梅尔间距完成对多个三角滤波器的线性分布;根据完成线性分布的多个三角滤波器对所述幅度谱进行滤波处理。
梅尔滤波器组具体包括40个线性分布在梅尔量度的三角滤波器。将得到幅度谱通过40个线性分布在梅尔量度的三角滤波器进行滤波处理后,再进行离散余弦变换得到梅尔频率倒谱系数。
确定语音信息中对应的最大频率,根据最大频率利用梅尔频率计算公式可计算最大梅尔频率,根据最大梅尔频率以及三角滤波器的数量(40个)计算两个相邻三角滤波器的中心频率的间距;根据计算出来的间距完成对多个三角滤波器的线性分布。
其中,所述梅尔频率计算公式为:
f mel=A·lg(1+f/700)      (1)
在公式(1)中,f mel为所述梅尔频率,f为所述语音信息对应的最大频率,A为系数,具体为2595。
例如,确定的最大频率为4000Hz,利用公式(1)可以求出最大梅尔频率为2146.1mel。
由于在梅尔量度范围内,各个三角滤波器的中心频率是相等间隔的线性分布。由此,可以计算两个相邻三角滤波器的中心频率的间距为:
Figure PCTCN2019117711-appb-000002
其中,Δmel为两个相邻三角滤波器的中心频率的间距;k为三角滤波器的数量。
S103d、对所述梅尔频率倒谱系数进行归一化处理以得到所述语音信息对应的频谱向量。
具体地,采用零均值归一化对所述梅尔频率倒谱系数进行归一化处理以得到所述语音信息对应的频谱向量,所述零均值归一化对应的转化公式为:
x *=(x-x̄)/σ
其中,
x̄
为梅尔频率倒谱系数的均值;σ为梅尔频率倒谱系数的标准差;x为每个梅尔频率倒谱系数;x *为归一化后的梅尔频率倒谱系数。
采用的零-均值归一化(Z-Score标准化)，也称为标准差标准化。经过处理的数据的均值为0，标准差为1。Z-Score标准化是将不同量级的数据统一转化为同一个量级，统一用计算出的Z-Score值衡量，以保证数据之间的可比性。
S104、提取预设的循环神经网络,所述循环神经网络包括注意力机制,所述注意力机制用于加强所述语音信息中的部分区域。
其中,所述循环神经网络的结构包括输入层、循环层、注意力机制、全连层和输出层;所述注意力机制用于根据注意力方程建立所述循环层的输出量与权重向量之间的映射关系以实现加强所述语音信息中的部分区域。
所述注意力方程为:
g=Σ i=0 T-1 a ih i
其中,g为所述全连层的输入向量;h i为每一个时间点i对应的循环层的输出量;a i是每一个时间点i对应的权重向量,用来代表每一个时间点i对全连层和输出层的影响大小。
注意力机制的关键是学习到这个方程,该方程在每一个时间点i给每一个循环层的输出h i和一个权重向量a i之间建立了一个映射关系,h i表示循环层的输出,a i是用来代表每一个时间点对网络中之后的层的影响大小。
其中,f(h i)中的参数在训练过程中会被优化,其表达式具体为:
f(h i)=tanh(Wh i+b)       (4)
在公式(4)中,W和b是线性方程的参数,h i对应的是每个时间点i的LSTM层的输出,表示为h i=(h 0,...h T-1),其中T是对于一个给定的序列中时间点的总个数。在本实施例中简化的其表达式的形式,具体如公式(4)采用一个线性函数加上一个tanh的激活函数,既可以取得较好的效果,同时又可以提高模型的训练速度。
对于一个给定的时间点i,权重向量a i的公式为:
a i=exp(u Ttanh(Wh i+b))/Σ j=0 T-1exp(u Ttanh(Wh j+b))      (5)
在公式(5)中,W为一个维度S*D的矩阵参数,S为正整数,b和u为一个维度为S的向量参数,D为所述循环层中网络单元的个数。
需要说明的是,g为一个向量作为全连接层的输入,激活函数采用ReLu函数,之后全连层使用的是Softmax函数,从而得到最后的输出。
S105、基于所述循环神经网络,根据所述语音信息对应的频谱向量和数据标签进行模型训练以得到情感识别模型。
具体地,将频谱向量输入至预设的循环神经网络进行模型训练,通过改进的模型中的注意力机制对声音中的主要部分进行加强,优化相应的模型参数进而得到情感识别模型,模型训练参数如表1所示。
表1为训练网络的相关参数
参数类型 参数值
优化算法 Adam
学习率 0.0005
LSTM单元个数 128
全连接层神经元个数 20
Dropout保留的概率 0.7
上述实施例提供的模型训练方法在获取到用户的语音信息以及对应的数据标签后,根据预设处理规则对语音信息进行预处理以得到对应的频谱向量,再基于预设的循环神经网络,根据语音信息对应的频谱向量和数据标签进行模型训练以得到情感识别模型,其中,该循环神经网络包括注意力机制,所述注意力机制用于加强所述语音信息中的部分区域。该方法训练出的情感识别模型具有可泛化性强,识别的准确率高等优点。
请参阅图4,图4是本申请的实施例提供的另一种情感识别模型的训练方法的示意流程图。其中,该情感识别模型是基于预设的循环神经网络进行模型训练得到的,当然也可以采用其他网络进行训练得到。
如图4所示,该情感识别模型的训练方法,包括步骤S201至步骤S207。
S201、获取用户的语音信息以及所述语音信息对应的数据标签。
其中,数据标签为用户的情感标签,包括正情绪标签、中性情绪标签和负情绪标签等。当然,也可以将语音信息分为更多的类,进而对应更多数据标签,比如高兴、悲伤、害怕、伤心或中性等数据标签,不同数据标签代表用户的不同情绪。
S202、根据所述语音信息以及对应的数据标签构建样本数据,所述样本数据至少包括正样本数据和负样本数据。
具体地,可以根据采集用户的语音信息以及对应的数据标签构成样本数据。由于用户的情感不同,因此该样本数据至少包括正样本数据和负样本数据,比如还可包括中性样本数据。正样本数据对应正情绪的语音信息;负样本数据对应负情绪的语音信息。
S203、判断所述样本数据中的正样本数据和负样本数据是否达到平衡。
具体地,所述判断所述样本数据中的正样本数据和负样本数据是否达到平衡,并产生判断结果,该判断结果包括:正样本数据和负样本数据平衡,和正样本数据和负样本数据不平衡。
其中,若正样本数据和负样本数据不平衡,则执行步骤S204;若正样本数据和负样本数据平衡,则执行步骤S205。
S204、根据预设数据处理规则对所述样本数据进行处理以使所述正样本数据和负样本数据达到平衡。
若所述正样本数据和负样本数据不平衡,根据预设数据处理规则对所述样本数据进行处理以使所述正样本数据和负样本数据达到平衡。具体地,可通过两种方式对应样本数据进行处理以使正样本数据和负样本数据达到平衡。分别 为:
一、通过过采样的方式对样本数据进行处理:构建的样本数据中的正样本数据和负样本数据,一般是负样本数据要小于正样本数据,具体将该负样本数据复制多次并与正样本数据构成训练用的样本数据。对于训练用于的样本数据来说,由于把其中的负样本数据多复制了几遍,构成新的样本数据,进而可以解决样本不均的问题。
二、通过设置加权损失函数对样本数据进行处理:通过使一个标准的交叉熵函数或者加权的交叉熵函数最小化训练的模型权重θ最优,具体通过加权的思想,比如负样本少,在训练的时候知道是负样本,通过权重去对模型参数进行调整,以增大负样本的影响。其中,标准的交叉熵损失函数对应的表达式为:
Figure PCTCN2019117711-appb-000007
其中,
Figure PCTCN2019117711-appb-000008
是每一个观察到的序列n的Softmax的输出,其中X是维度为F*D的矩阵,其中F代表的是在每一个时间点输入的频谱系数的数量;C n是每一个观察到的序列n对应的类的标签,标签的取值范围是{0,1},当然也可以是{0,1,2},分别对应负样本,中性样本和正样本。当然,也可以采用加权的交叉熵函数,该加权的交叉熵函数与标准的交叉熵损失函数类似,目标都是解决样本数据不均匀的问题。
S205、根据预设处理规则对所述样本数据中的语音信息进行预处理以得到对应的频谱向量。
具体地,若所述正样本数据和负样本数据达到平衡,则根据预设处理规则对所述样本数据中的语音信息进行预处理以得到对应的频谱向量。其中,该预设处理规则为用于将所述样本数据中的语音信息转出频域中的信息,具体比如采用快速傅里叶变换规则或者小波变换规则将在时域中采集的语音信息转换成频域中的信息。
S206、提取预设的循环神经网络,所述循环神经网络包括注意力机制,所述注意力机制用于加强所述语音信息中的部分区域。
其中,所述循环神经网络的结构包括输入层、循环层、注意力机制、全连层和输出层;所述注意力机制用于根据注意力方程建立所述循环层的输出量与权重向量之间的映射关系以实现加强所述语音信息中的部分区域。
S207、基于所述循环神经网络,根据所述语音信息对应的频谱向量和数据标签进行模型训练以得到情感识别模型。
具体地,将频谱向量输入至预设的循环神经网络进行模型训练,通过改进的模型中的注意力机制对声音中的主要部分进行加强,优化相应的模型参数进而得到情感识别模型。
该方法训练出的情感识别模型具有可泛化性强,识别的准确率高等优点。同时因为极端的情绪经常会比中性的情绪要少见很多,因此样本不均的问题以 及导致过拟合问题,该方法可以很好解决样本不均匀问题,进而提高模型的准确度。
请参阅图5,图5是本申请的实施例提供的一种情感识别方法的示意流程图。该情感识别方法,可应用于终端或服务器中,用于根据用户的声音识别用户的情感。
如图5所示,该情感识别方法,包括步骤S301至步骤S303。
S301、采集用户的语音信号。
具体地,可通过录音设备采集与用户聊天时对应的语音信号,该录音设备比如录音笔、智能手机、平板电脑、笔记本或智能穿戴设备等,比如智能手环或智能手表等。
S302、根据预设处理规则对所述语音信号进行预处理以得到所述语音信号对应的频谱向量。
具体地,根据预设处理规则对所述语音信号进行预处理以得到所述语音信号对应的频谱向量,包括:对语音信息进行分帧加窗处理以得到处理后的语音信息;对处理后的语音信息进行快速傅里叶变换以得到幅度谱;对幅度谱增加梅尔滤波器组,并将梅尔滤波器组的输出做离散余弦变换以得到梅尔频率倒谱系数;将得到的每个梅尔频率倒谱系数进行归一化处理以得到语音信息对应的频谱向量。
S303、将所述频谱向量输入至情感识别模型对所述用户的情感进行识别,以得到所述用户的情感类别。
其中,所述情感识别模型为采用上述实施例中提供的情感识别模型训练方法训练得到的模型。通过该情感识别模型对输入的频谱向量进行分析,以准确地得到用户的情感,具体为情感类型,比如高兴、悲伤或中性等。
上述实施例提供的情感识别方法,通过采集用户的语音信号;根据预设处理规则对所述语音信号进行预处理以得到所述语音信号对应的频谱向量;将所述频谱向量输入至情感识别模型对所述用户的情感进行识别,以得到所述用户的情感类别。该方法可以快速识别到用户的情感类型,同时又具有识别准确率高等优点。
请参阅图6,图6是本申请一实施例提供的一种模型训练装置的示意性框图,该模型训练装置可以配置于服务器中,用于执行前述的情感识别模型的训练方法。
如图6所示,该模型训练装置400,包括:信息获取单元401、样本构建单元402、数据处理单元403、网络提取单元404和模型训练单元405。
信息获取单元401,用于获取用户的语音信息以及所述语音信息对应的数据标签。
样本构建单元402,用于根据所述语音信息以及对应的数据标签构建样本数 据。
数据处理单元403,用于根据预设处理规则对所述样本数据中的语音信息进行预处理以得到对应的频谱向量。
在一个实施例中,所述数据处理单元403,包括:
信息处理子单元4031,用于对所述样本数据中的语音信息进行分帧加窗处理以得到处理后的语音信息;信息变换子单元4032,用于对处理后的语音信息进行频域变换以得到对应的幅度谱;滤波变换子单元4033,用于通过梅尔滤波器组对所述幅度谱进行滤波处理,并对滤波处理后的幅度谱进行离散余弦变换以得到梅尔频率倒谱系数;归一化子单元4034,用于对所述梅尔频率倒谱系数进行归一化处理以得到所述语音信息对应的频谱向量。
在一个实施例中,滤波变换子单元4033,具体用于:获取所述语音信息对应的最大频率,利用梅尔频率计算公式计算所述最大频率对应的梅尔频率;根据计算的梅尔频率以及所述梅尔滤波器组中三角滤波器的数量计算两个相邻三角滤波器的中心频率的梅尔间距;根据所述梅尔间距完成对多个三角滤波器的线性分布;根据完成线性分布的多个三角滤波器对所述幅度谱进行滤波处理。
网络提取单元404,用于提取预设的循环神经网络,所述循环神经网络包括注意力机制,所述注意力机制用于加强所述语音信息中的部分区域;
模型训练单元405,用于基于所述循环神经网络,根据所述语音信息对应的频谱向量和数据标签进行模型训练以得到情感识别模型。
请参阅图7,图7是本申请一实施例提供的另一种模型训练装置的示意性框图,该模型训练装置可以配置于服务器中,用于执行前述的情感识别模型的训练方法。
如图7所示,该模型训练装置500,包括:信息获取单元501、样本构建单元502、平衡判断单元503、平衡处理单元504、数据处理单元505、网络提取单元506和模型训练单元507。
信息获取单元501,用于获取用户的语音信息以及所述语音信息对应的数据标签。
样本构建单元502,用于根据所述语音信息以及对应的数据标签构建样本数据,所述样本数据包括正样本数据和负样本数据。
平衡判断单元503,用于判断所述样本数据中的正样本数据和负样本数据是否达到平衡.
平衡处理单元504,用于若所述正样本数据和负样本数据不平衡,根据预设数据处理规则对所述样本数据进行处理以使所述正样本数据和负样本数据达到平衡。
数据处理单元505,用于若所述正样本数据和负样本数据平衡,根据预设处理规则对所述样本数据中的语音信息进行预处理以得到对应的频谱向量。
网络提取单元506,用于提取预设的循环神经网络,所述循环神经网络包括注意力机制,所述注意力机制用于加强所述语音信息中的部分区域;
模型训练单元507,用于基于所述循环神经网络,根据所述语音信息对应的频谱向量和数据标签进行模型训练以得到情感识别模型。
请参阅图8,图8是本申请一实施例提供的一种情感识别装置的示意性框图,该情感识别装置可以配置于终端或服务器中,用于执行前述的情感识别方法。
如图8所示,该情感识别装置600,包括:信号采集单元601、信号处理单元602和情感识别单元603。
信号采集单元601,用于采集用户的语音信号。
信号处理单元602,用于根据预设处理规则对所述语音信号进行预处理以得到所述语音信号对应的频谱向量。
情感识别单元603,用于将所述频谱向量输入至情感识别模型对所述用户的情感进行识别,以得到所述用户的情感类别,所述情感识别模型为采用上述任一项所述的情感识别模型训练方法训练得到的模型。
需要说明的是,所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的装置和各单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
上述的装置可以实现为一种计算机程序的形式,该计算机程序可以在如图9所示的计算机设备上运行。
请参阅图9,图9是本申请实施例提供的一种计算机设备的结构示意性框图。该计算机设备可以是服务器或终端。
参阅图9,该计算机设备包括通过系统总线连接的处理器、存储器和网络接口,其中,存储器可以包括非易失性存储介质和内存储器。
非易失性存储介质可存储操作系统和计算机程序。该计算机程序包括程序指令,该程序指令被执行时,可使得处理器执行任意一种情感识别模型的训练方法或情感识别方法。
处理器用于提供计算和控制能力,支撑整个计算机设备的运行。
内存储器为非易失性存储介质中的计算机程序的运行提供环境,该计算机程序被处理器执行时,可使得处理器执行任意一种情感识别模型的训练方法或情感识别方法。
该网络接口用于进行网络通信,如发送分配的任务等。本领域技术人员可以理解,图9中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
应当理解的是,处理器可以是中央处理单元(Central Processing Unit,CPU), 该处理器还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。其中,通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
本申请的实施例中还提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序中包括程序指令,所述处理器执行所述程序指令,实现本申请实施例提供的任一项情感识别模型的训练方法或情感识别方法。
其中,所述计算机可读存储介质可以是前述实施例所述的计算机设备的内部存储单元,例如所述计算机设备的硬盘或内存。所述计算机可读存储介质也可以是所述计算机设备的外部存储设备,例如所述计算机设备上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (20)

  1. 一种情感识别模型的训练方法,所述方法包括:
    获取用户的语音信息以及所述语音信息对应的数据标签;
    根据所述语音信息以及对应的数据标签构建样本数据;
    根据预设处理规则对所述样本数据中的语音信息进行预处理以得到对应的频谱向量;
    提取预设的循环神经网络,所述循环神经网络包括注意力机制,所述注意力机制用于加强所述语音信息中的部分区域;
    基于所述循环神经网络,根据所述语音信息对应的频谱向量和数据标签进行模型训练以得到情感识别模型;
    其中,所述循环神经网络的结构包括输入层、循环层、注意力机制、全连层和输出层;所述注意力机制用于根据注意力方程建立所述循环层的输出量与权重向量之间的映射关系以实现加强所述语音信息中的部分区域;
    所述注意力方程为:
    g=Σ i=0 T-1 a ih i
    其中,
    a i=exp(u Tf(h i))/Σ j=0 T-1exp(u Tf(h j))；
    f(h i)=tanh(Wh i+b);g为所述全连层的输入向量;h i为每一个时间点i对应的循环层的输出量;a i是每一个时间点i对应的权重向量,用来代表每一个时间点i对全连层和输出层的影响大小;T为时间点i的总个数;W为一个维度S*D的矩阵参数,S为正整数,b和u为一个维度为S的向量参数,D为所述循环层中网络单元的个数。
  2. 根据权利要求1所述的训练方法,其中,所述根据预设处理规则对所述样本数据中的语音信息进行预处理以得到对应的频谱向量,包括:
    对所述样本数据中的语音信息进行分帧加窗处理以得到处理后的语音信息;
    对处理后的语音信息进行频域变换以得到对应的幅度谱;
    通过梅尔滤波器组对所述幅度谱进行滤波处理,并对滤波处理后的幅度谱进行离散余弦变换以得到梅尔频率倒谱系数;
    对所述梅尔频率倒谱系数进行归一化处理以得到所述语音信息对应的频谱向量。
  3. 根据权利要求2所述的训练方法,其中,所述通过梅尔滤波器组对所述幅度谱进行滤波处理,包括:
    获取所述语音信息对应的最大频率,利用梅尔频率计算公式计算所述最大频率对应的梅尔频率;
    根据计算的梅尔频率以及所述梅尔滤波器组中三角滤波器的数量计算两个相邻三角滤波器的中心频率的梅尔间距;
    根据所述梅尔间距完成对多个三角滤波器的线性分布;
    根据完成线性分布的多个三角滤波器对所述幅度谱进行滤波处理。
  4. 根据权利要求3所述的训练方法,其中,所述梅尔频率计算公式为:
    Figure PCTCN2019117711-appb-100003
    其中,f mel为所述梅尔频率,f为所述语音信息对应的最大频率,A为系数;
    所述对所述梅尔频率倒谱系数进行归一化处理以得到所述语音信息对应的频谱向量,包括:
    采用零均值归一化对所述梅尔频率倒谱系数进行归一化处理以得到所述语音信息对应的频谱向量,所述零均值归一化对应的转化公式为:
    Figure PCTCN2019117711-appb-100004
    其中,
    Figure PCTCN2019117711-appb-100005
    为梅尔频率倒谱系数的均值;σ为梅尔频率倒谱系数的标准差;x为每个梅尔频率倒谱系数;x *为归一化后的梅尔频率倒谱系数。
  5. 根据权利要求2所述的训练方法,其中,所述对所述样本数据中的语音信息进行分帧加窗处理以得到处理后的语音信息之前,还包括:
    对语音信息进行预加重处理,所述预加重处理包括乘以一个与所述语音信息的频率成正相关的预设系数。
  6. 一种情感识别方法,所述方法包括:
    采集用户的语音信号;
    根据预设处理规则对所述语音信号进行预处理以得到所述语音信号对应的频谱向量;
    将所述频谱向量输入至情感识别模型对所述用户的情感进行识别,以得到所述用户的情感类别,所述情感识别模型为采用权利要求1至5中任一项所述的情感识别模型训练方法训练得到的模型。
  7. 一种情感识别模型的训练装置,所述装置包括:
    信息获取单元,用于获取用户的语音信息以及所述语音信息对应的数据标签;
    样本构建单元,用于根据所述语音信息以及对应的数据标签构建样本数据;
    数据处理单元,用于根据预设处理规则对所述样本数据中的语音信息进行预处理以得到对应的频谱向量;
    网络提取单元,用于提取预设的循环神经网络,所述循环神经网络包括注意力机制,所述注意力机制用于加强所述语音信息中的部分区域;
    模型训练单元,用于基于所述循环神经网络,根据所述语音信息对应的频谱向量和数据标签进行模型训练以得到情感识别模型;
    其中,所述循环神经网络的结构包括输入层、循环层、注意力机制、全连层和输出层;所述注意力机制用于根据注意力方程建立所述循环层的输出量与权重向量之间的映射关系以实现加强所述语音信息中的部分区域;
    所述注意力方程为:
    Figure PCTCN2019117711-appb-100006
    其中,
    Figure PCTCN2019117711-appb-100007
    f(h i)=tanh(Wh i+b);g为所述全连层的输入向量;h i为每一个时间点i对应的循环层的输出量;a i是每一个时间点i对应的权重向量,用来代表每一个时间点i对全连层和输出层的影响大小;T为时间点i的总个数;W为一个维度S*D的矩阵参数,S为正整数,b和u为一个维度为S的向量参数,D为所述循环层中网络单元的个数。
  8. An emotion recognition apparatus, the apparatus comprising:
    a signal collection unit, configured to collect a voice signal of a user;
    a signal processing unit, configured to preprocess the voice signal according to a preset processing rule to obtain a spectral vector corresponding to the voice signal;
    an emotion recognition unit, configured to input the spectral vector into an emotion recognition model to recognize the emotion of the user and obtain the emotion category of the user, the emotion recognition model being a model trained by the emotion recognition model training method according to any one of claims 1 to 5.
  9. A computer device, wherein the computer device comprises a memory and a processor;
    the memory is configured to store a computer program;
    the processor is configured to execute the computer program and, when executing the computer program, to implement the following steps:
    acquiring voice information of a user and a data label corresponding to the voice information;
    constructing sample data according to the voice information and the corresponding data label;
    preprocessing the voice information in the sample data according to a preset processing rule to obtain a corresponding spectral vector;
    retrieving a preset recurrent neural network, the recurrent neural network comprising an attention mechanism, the attention mechanism being used to emphasize certain regions of the voice information;
    performing, based on the recurrent neural network, model training according to the spectral vector and the data label corresponding to the voice information to obtain an emotion recognition model;
    wherein the structure of the recurrent neural network comprises an input layer, a recurrent layer, the attention mechanism, a fully connected layer and an output layer; the attention mechanism is used to establish, according to an attention equation, a mapping between the output of the recurrent layer and weight vectors so as to emphasize certain regions of the voice information;
    the attention equation being:
    g = Σ_{i=1}^{T} a_i·h_i
    where
    a_i = exp(u^T·f(h_i)) / Σ_{j=1}^{T} exp(u^T·f(h_j)), and
    f(h_i) = tanh(W·h_i + b); g is the input vector of the fully connected layer; h_i is the output of the recurrent layer at each time step i; a_i is the weight vector corresponding to each time step i, representing the magnitude of the influence of time step i on the fully connected layer and the output layer; T is the total number of time steps i; W is a matrix parameter of dimension S×D, S being a positive integer; b and u are vector parameters of dimension S; and D is the number of network units in the recurrent layer.
  10. The computer device according to claim 9, wherein the step, implemented by the processor, of preprocessing the voice information in the sample data according to a preset processing rule to obtain a corresponding spectral vector comprises:
    performing framing and windowing on the voice information in the sample data to obtain processed voice information;
    performing a frequency-domain transform on the processed voice information to obtain a corresponding magnitude spectrum;
    filtering the magnitude spectrum through a mel filter bank, and performing a discrete cosine transform on the filtered magnitude spectrum to obtain mel-frequency cepstral coefficients;
    normalizing the mel-frequency cepstral coefficients to obtain the spectral vector corresponding to the voice information.
  11. The computer device according to claim 10, wherein the step, implemented by the processor, of filtering the magnitude spectrum through a mel filter bank comprises:
    acquiring the maximum frequency corresponding to the voice information, and calculating the mel frequency corresponding to the maximum frequency using a mel frequency calculation formula;
    calculating, according to the calculated mel frequency and the number of triangular filters in the mel filter bank, the mel spacing between the centre frequencies of two adjacent triangular filters;
    completing a linear distribution of the plurality of triangular filters according to the mel spacing;
    filtering the magnitude spectrum with the linearly distributed plurality of triangular filters.
  12. The computer device according to claim 11, wherein the mel frequency calculation formula is:
    f_mel = A·log₁₀(1 + f/700)
    where f_mel is the mel frequency, f is the maximum frequency corresponding to the voice information, and A is a coefficient;
    the step, implemented by the processor, of normalizing the mel-frequency cepstral coefficients to obtain the spectral vector corresponding to the voice information comprises:
    normalizing the mel-frequency cepstral coefficients by zero-mean normalization to obtain the spectral vector corresponding to the voice information, the conversion formula of the zero-mean normalization being:
    x* = (x − x̄)/σ
    where x̄ is the mean of the mel-frequency cepstral coefficients; σ is the standard deviation of the mel-frequency cepstral coefficients; x is each mel-frequency cepstral coefficient; and x* is the normalized mel-frequency cepstral coefficient.
  13. The computer device according to claim 9, wherein before the processor implements the performing framing and windowing on the voice information in the sample data to obtain processed voice information, the processor further implements:
    performing pre-emphasis on the voice information, the pre-emphasis comprising multiplying by a preset coefficient positively correlated with the frequency of the voice information.
  14. A computer device, wherein the computer device comprises a memory and a processor;
    the memory is configured to store a computer program;
    the processor is configured to execute the computer program and, when executing the computer program, to implement the following steps:
    collecting a voice signal of a user;
    preprocessing the voice signal according to a preset processing rule to obtain a spectral vector corresponding to the voice signal;
    inputting the spectral vector into an emotion recognition model to recognize the emotion of the user and obtain the emotion category of the user, the emotion recognition model being a model trained by the emotion recognition model training method according to any one of claims 1 to 5.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the following steps:
    acquiring voice information of a user and a data label corresponding to the voice information;
    constructing sample data according to the voice information and the corresponding data label;
    preprocessing the voice information in the sample data according to a preset processing rule to obtain a corresponding spectral vector;
    retrieving a preset recurrent neural network, the recurrent neural network comprising an attention mechanism, the attention mechanism being used to emphasize certain regions of the voice information;
    performing, based on the recurrent neural network, model training according to the spectral vector and the data label corresponding to the voice information to obtain an emotion recognition model;
    wherein the structure of the recurrent neural network comprises an input layer, a recurrent layer, the attention mechanism, a fully connected layer and an output layer; the attention mechanism is used to establish, according to an attention equation, a mapping between the output of the recurrent layer and weight vectors so as to emphasize certain regions of the voice information;
    the attention equation being:
    g = Σ_{i=1}^{T} a_i·h_i
    where
    a_i = exp(u^T·f(h_i)) / Σ_{j=1}^{T} exp(u^T·f(h_j)), and
    f(h_i) = tanh(W·h_i + b); g is the input vector of the fully connected layer; h_i is the output of the recurrent layer at each time step i; a_i is the weight vector corresponding to each time step i, representing the magnitude of the influence of time step i on the fully connected layer and the output layer; T is the total number of time steps i; W is a matrix parameter of dimension S×D, S being a positive integer; b and u are vector parameters of dimension S; and D is the number of network units in the recurrent layer.
  16. The computer-readable storage medium according to claim 15, wherein the step, implemented by the processor, of preprocessing the voice information in the sample data according to a preset processing rule to obtain a corresponding spectral vector comprises:
    performing framing and windowing on the voice information in the sample data to obtain processed voice information;
    performing a frequency-domain transform on the processed voice information to obtain a corresponding magnitude spectrum;
    filtering the magnitude spectrum through a mel filter bank, and performing a discrete cosine transform on the filtered magnitude spectrum to obtain mel-frequency cepstral coefficients;
    normalizing the mel-frequency cepstral coefficients to obtain the spectral vector corresponding to the voice information.
  17. The computer-readable storage medium according to claim 16, wherein the step, implemented by the processor, of filtering the magnitude spectrum through a mel filter bank comprises:
    acquiring the maximum frequency corresponding to the voice information, and calculating the mel frequency corresponding to the maximum frequency using a mel frequency calculation formula;
    calculating, according to the calculated mel frequency and the number of triangular filters in the mel filter bank, the mel spacing between the centre frequencies of two adjacent triangular filters;
    completing a linear distribution of the plurality of triangular filters according to the mel spacing;
    filtering the magnitude spectrum with the linearly distributed plurality of triangular filters.
  18. The computer-readable storage medium according to claim 17, wherein the mel frequency calculation formula is:
    f_mel = A·log₁₀(1 + f/700)
    where f_mel is the mel frequency, f is the maximum frequency corresponding to the voice information, and A is a coefficient;
    the step, implemented by the processor, of normalizing the mel-frequency cepstral coefficients to obtain the spectral vector corresponding to the voice information comprises:
    normalizing the mel-frequency cepstral coefficients by zero-mean normalization to obtain the spectral vector corresponding to the voice information, the conversion formula of the zero-mean normalization being:
    x* = (x − x̄)/σ
    where x̄ is the mean of the mel-frequency cepstral coefficients; σ is the standard deviation of the mel-frequency cepstral coefficients; x is each mel-frequency cepstral coefficient; and x* is the normalized mel-frequency cepstral coefficient.
  19. The computer-readable storage medium according to claim 15, wherein before the processor implements the performing framing and windowing on the voice information in the sample data to obtain processed voice information, the processor further implements:
    performing pre-emphasis on the voice information, the pre-emphasis comprising multiplying by a preset coefficient positively correlated with the frequency of the voice information.
  20. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the following steps:
    collecting a voice signal of a user;
    preprocessing the voice signal according to a preset processing rule to obtain a spectral vector corresponding to the voice signal;
    inputting the spectral vector into an emotion recognition model to recognize the emotion of the user and obtain the emotion category of the user, the emotion recognition model being a model trained by the emotion recognition model training method according to any one of claims 1 to 5.
PCT/CN2019/117711 2019-02-27 2019-11-12 Training method for emotion recognition model, emotion recognition method, apparatus, device and storage medium WO2020173133A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910145605.2A CN109817246B (zh) 2019-02-27 2019-02-27 Training method for emotion recognition model, emotion recognition method, apparatus, device and storage medium
CN201910145605.2 2019-02-27

Publications (1)

Publication Number Publication Date
WO2020173133A1 true WO2020173133A1 (zh) 2020-09-03

Family

ID=66607622

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117711 WO2020173133A1 (zh) 2019-02-27 2019-11-12 Training method for emotion recognition model, emotion recognition method, apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN109817246B (zh)
WO (1) WO2020173133A1 (zh)


Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109817246B (zh) * 2019-02-27 2023-04-18 平安科技(深圳)有限公司 情感识别模型的训练方法、情感识别方法、装置、设备及存储介质
CN110223714B (zh) * 2019-06-03 2021-08-03 杭州哲信信息技术有限公司 一种基于语音的情绪识别方法
CN110288980A (zh) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 语音识别方法、模型的训练方法、装置、设备及存储介质
CN110211563B (zh) * 2019-06-19 2024-05-24 平安科技(深圳)有限公司 面向情景及情感的中文语音合成方法、装置及存储介质
CN110400579B (zh) * 2019-06-25 2022-01-11 华东理工大学 基于方向自注意力机制和双向长短时网络的语音情感识别
CN110532380B (zh) * 2019-07-12 2020-06-23 杭州电子科技大学 一种基于记忆网络的文本情感分类方法
CN110890088B (zh) * 2019-10-12 2022-07-15 中国平安财产保险股份有限公司 语音信息反馈方法、装置、计算机设备和存储介质
CN111357051B (zh) * 2019-12-24 2024-02-02 深圳市优必选科技股份有限公司 语音情感识别方法、智能装置和计算机可读存储介质
CN111179945B (zh) * 2019-12-31 2022-11-15 中国银行股份有限公司 基于声纹识别的安全门的控制方法和装置
CN111276119B (zh) * 2020-01-17 2023-08-22 平安科技(深圳)有限公司 语音生成方法、系统和计算机设备
CN111341351B (zh) * 2020-02-25 2023-05-23 厦门亿联网络技术股份有限公司 基于自注意力机制的语音活动检测方法、装置及存储介质
CN111429948B (zh) * 2020-03-27 2023-04-28 南京工业大学 一种基于注意力卷积神经网络的语音情绪识别模型及方法
CN111582382B (zh) * 2020-05-09 2023-10-31 Oppo广东移动通信有限公司 状态识别方法、装置以及电子设备
CN111832317B (zh) * 2020-07-09 2023-08-18 广州市炎华网络科技有限公司 智能信息导流方法、装置、计算机设备及可读存储介质
CN111816205B (zh) * 2020-07-09 2023-06-20 中国人民解放军战略支援部队航天工程大学 一种基于飞机音频的机型智能识别方法
CN111985231B (zh) * 2020-08-07 2023-12-26 中移(杭州)信息技术有限公司 无监督角色识别方法、装置、电子设备及存储介质
CN112331182A (zh) * 2020-10-26 2021-02-05 平安科技(深圳)有限公司 语音数据生成方法、装置、计算机设备及存储介质
CN112163571B (zh) * 2020-10-29 2024-03-05 腾讯科技(深圳)有限公司 电子设备使用者的属性识别方法、装置、设备及存储介质
CN112466324A (zh) * 2020-11-13 2021-03-09 上海听见信息科技有限公司 一种情绪分析方法、系统、设备及可读存储介质
CN112992177B (zh) * 2021-02-20 2023-10-17 平安科技(深圳)有限公司 语音风格迁移模型的训练方法、装置、设备及存储介质
CN113053361B (zh) * 2021-03-18 2023-07-04 北京金山云网络技术有限公司 语音识别方法、模型训练方法、装置、设备及介质
CN112712824B (zh) * 2021-03-26 2021-06-29 之江实验室 一种融合人群信息的语音情感识别方法和系统
CN113270111A (zh) * 2021-05-17 2021-08-17 广州国音智能科技有限公司 一种基于音频数据的身高预测方法、装置、设备和介质
CN113421594B (zh) * 2021-06-30 2023-09-22 平安科技(深圳)有限公司 语音情感识别方法、装置、设备及存储介质
CN113327631B (zh) * 2021-07-15 2023-03-21 广州虎牙科技有限公司 一种情感识别模型的训练方法、情感识别方法及装置
CN113889150B (zh) * 2021-10-15 2023-08-29 北京工业大学 语音情感识别方法及装置
CN113889149B (zh) * 2021-10-15 2023-08-29 北京工业大学 语音情感识别方法及装置
CN116916497B (zh) * 2023-09-12 2023-12-26 深圳市卡能光电科技有限公司 基于嵌套态势识别的落地柱形氛围灯光照控制方法及系统
CN117648717B (zh) * 2024-01-29 2024-05-03 知学云(北京)科技股份有限公司 用于人工智能语音陪练的隐私保护方法


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102410914B1 (ko) * 2015-07-16 2022-06-17 삼성전자주식회사 음성 인식을 위한 모델 구축 장치 및 음성 인식 장치 및 방법
CN106340309B (zh) * 2016-08-23 2019-11-12 上海索洛信息技术有限公司 一种基于深度学习的狗叫情感识别方法及装置
CN108550375A (zh) * 2018-03-14 2018-09-18 鲁东大学 一种基于语音信号的情感识别方法、装置和计算机设备
CN109285562B (zh) * 2018-09-28 2022-09-23 东南大学 基于注意力机制的语音情感识别方法
CN109243493B (zh) * 2018-10-30 2022-09-16 南京工程学院 基于改进长短时记忆网络的婴儿哭声情感识别方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766894A (zh) * 2017-11-03 2018-03-06 吉林大学 基于注意力机制和深度学习的遥感图像自然语言生成方法
CN108922515A (zh) * 2018-05-31 2018-11-30 平安科技(深圳)有限公司 语音模型训练方法、语音识别方法、装置、设备及介质
CN109062937A (zh) * 2018-06-15 2018-12-21 北京百度网讯科技有限公司 训练描述文本生成模型的方法、生成描述文本的方法及装置
CN109817246A (zh) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 情感识别模型的训练方法、情感识别方法、装置、设备及存储介质

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185423A (zh) * 2020-09-28 2021-01-05 南京工程学院 基于多头注意力机制的语音情感识别方法
CN112185423B (zh) * 2020-09-28 2023-11-21 南京工程学院 基于多头注意力机制的语音情感识别方法
CN112257658B (zh) * 2020-11-11 2023-10-10 微医云(杭州)控股有限公司 一种脑电信号的处理方法、装置、电子设备及存储介质
CN112257658A (zh) * 2020-11-11 2021-01-22 微医云(杭州)控股有限公司 一种脑电信号的处理方法、装置、电子设备及存储介质
CN112733994A (zh) * 2020-12-10 2021-04-30 中国科学院深圳先进技术研究院 机器人的自主情感生成方法、系统及应用
CN112786017A (zh) * 2020-12-25 2021-05-11 北京猿力未来科技有限公司 语速检测模型的训练方法及装置、语速检测方法及装置
CN112786017B (zh) * 2020-12-25 2024-04-09 北京猿力未来科技有限公司 语速检测模型的训练方法及装置、语速检测方法及装置
CN112948554A (zh) * 2021-02-28 2021-06-11 西北工业大学 基于强化学习和领域知识的实时多模态对话情感分析方法
CN112948554B (zh) * 2021-02-28 2024-03-08 西北工业大学 基于强化学习和领域知识的实时多模态对话情感分析方法
CN113178197A (zh) * 2021-04-27 2021-07-27 平安科技(深圳)有限公司 语音验证模型的训练方法、装置以及计算机设备
CN113178197B (zh) * 2021-04-27 2024-01-09 平安科技(深圳)有限公司 语音验证模型的训练方法、装置以及计算机设备
CN113343860A (zh) * 2021-06-10 2021-09-03 南京工业大学 一种基于视频图像和语音的双模态融合情感识别方法
CN113420556B (zh) * 2021-07-23 2023-06-20 平安科技(深圳)有限公司 基于多模态信号的情感识别方法、装置、设备及存储介质
CN113420556A (zh) * 2021-07-23 2021-09-21 平安科技(深圳)有限公司 基于多模态信号的情感识别方法、装置、设备及存储介质
CN113592001A (zh) * 2021-08-03 2021-11-02 西北工业大学 一种基于深度典型相关性分析的多模态情感识别方法
CN113592001B (zh) * 2021-08-03 2024-02-02 西北工业大学 一种基于深度典型相关性分析的多模态情感识别方法
CN113919387A (zh) * 2021-08-18 2022-01-11 东北林业大学 基于gbdt-lr模型的脑电信号情感识别
CN113837299B (zh) * 2021-09-28 2023-09-01 平安科技(深圳)有限公司 基于人工智能的网络训练方法及装置、电子设备
CN113837299A (zh) * 2021-09-28 2021-12-24 平安科技(深圳)有限公司 基于人工智能的网络训练方法及装置、电子设备
CN114548262A (zh) * 2022-02-21 2022-05-27 华中科技大学鄂州工业技术研究院 一种情感计算中多模态生理信号的特征级融合方法
CN114548262B (zh) * 2022-02-21 2024-03-22 华中科技大学鄂州工业技术研究院 一种情感计算中多模态生理信号的特征级融合方法

Also Published As

Publication number Publication date
CN109817246B (zh) 2023-04-18
CN109817246A (zh) 2019-05-28

Similar Documents

Publication Publication Date Title
WO2020173133A1 (zh) 情感识别模型的训练方法、情感识别方法、装置、设备及存储介质
WO2021208287A1 (zh) 用于情绪识别的语音端点检测方法、装置、电子设备及存储介质
WO2021000408A1 (zh) 面试评分方法、装置、设备及存储介质
CN109243491B (zh) 在频谱上对语音进行情绪识别的方法、系统及存储介质
CN112259106A (zh) 声纹识别方法、装置、存储介质及计算机设备
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
WO2021114841A1 (zh) 一种用户报告的生成方法及终端设备
WO2020034628A1 (zh) 口音识别方法、装置、计算机装置及存储介质
CN108962231B (zh) 一种语音分类方法、装置、服务器及存储介质
CN103943104A (zh) 一种语音信息识别的方法及终端设备
CN110222841A (zh) 基于间距损失函数的神经网络训练方法和装置
WO2023283823A1 (zh) 语音对抗样本检测方法、装置、设备及计算机可读存储介质
WO2023279691A1 (zh) 语音分类方法、模型训练方法及装置、设备、介质和程序
Shah et al. Speech emotion recognition based on SVM using MATLAB
Jiang et al. RETRACTED ARTICLE: Intelligent online education system based on speech recognition with specialized analysis on quality of service
CN108847251B (zh) 一种语音去重方法、装置、服务器及存储介质
CN111755029B (zh) 语音处理方法、装置、存储介质以及电子设备
Taran A nonlinear feature extraction approach for speech emotion recognition using VMD and TKEO
Singh et al. Speaker Recognition Assessment in a Continuous System for Speaker Identification
Yue English spoken stress recognition based on natural language processing and endpoint detection algorithm
Płonkowski Using bands of frequencies for vowel recognition for Polish language
CN115631748A (zh) 基于语音对话的情感识别方法、装置、电子设备及介质
Fathan et al. An Ensemble Approach for the Diagnosis of COVID-19 from Speech and Cough Sounds
CN112712792A (zh) 一种方言识别模型的训练方法、可读存储介质及终端设备
Abdulwahid et al. Arabic Speaker Identification System for Forensic Authentication Using K-NN Algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19916986

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19916986

Country of ref document: EP

Kind code of ref document: A1