CN110956981B - Speech emotion recognition method, device, equipment and storage medium - Google Patents

Speech emotion recognition method, device, equipment and storage medium

Info

Publication number
CN110956981B
Authority
CN
China
Prior art keywords
feature
preset
data
voice
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201911246544.5A
Other languages
Chinese (zh)
Other versions
CN110956981A
Inventor
孙亚新
叶青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Arts and Science
Original Assignee
Hubei University of Arts and Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Arts and Science filed Critical Hubei University of Arts and Science
Priority to CN201911246544.5A
Publication of CN110956981A
Application granted
Publication of CN110956981B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of speech signal processing and pattern recognition, and discloses a speech emotion recognition method, apparatus, device and storage medium. The method comprises the following steps: acquiring a test voice sample with a preset dimension, and segmenting the test voice sample according to a preset rule to obtain a plurality of initial voice samples; extracting signal feature data from the initial voice samples to obtain voice signal feature data to be processed; performing feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed; obtaining feature target data through a preset multi-objective optimization algorithm according to the feature statistical result to be confirmed; and inputting the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result. In this way, the speech emotion segments are turned into speech emotion data and input into the preset Softmax classification model, so that speech emotion can be recognized more accurately.

Description

Speech emotion recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech signal processing and pattern recognition technologies, and in particular, to a speech emotion recognition method, apparatus, device, and storage medium.
Background
There are many speech emotion recognition methods, but they overlook the fact that human emotional expression in speech is short-term and local. For example, the first half of a sentence may be calm while one word is angry, and the whole sentence is then labelled as angry. This leads to several problems. First, recognizing emotion from the whole sentence often dilutes the feature changes that carry the emotion. In a sentence such as "We are going to Beijing tomorrow, do you think that is feasible?", it is usually the second half that shows the large emotional difference, so mean pooling over time and the convolutional and fully connected layers applied to all features in deep learning dilute the emotional feature changes. Second, when local parts are combined into a sentence, the emotional feature changes are often neutralized. As is well known, Chinese has four tones, and the second and fourth tones have completely opposite characteristics in their variation over time; mean pooling over time, time-series attention layers and similar operations in deep learning therefore neutralize the emotional feature changes over time. Third, the positions of the emotion-carrying words in a sentence are not fixed, which causes large differences in the features of the same emotion. For example, "Is this work feasible? It is!" and "Feasible? It is!" express the same meaning, but the output features of existing convolutional neural networks for them are completely different.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a speech emotion recognition method, a speech emotion recognition device, speech emotion recognition equipment and a storage medium, and aims to solve the technical problem of how to accurately recognize speech emotion.
In order to achieve the above object, the present invention provides a speech emotion recognition method, which comprises the following steps:
acquiring a test voice sample with a preset dimension, and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples;
extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed;
performing feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed;
obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed;
and inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result.
Preferably, before the step of obtaining a test speech sample with a preset dimension and performing segmentation processing on the test speech sample through a preset rule to obtain a plurality of initial speech samples, the method further includes:
acquiring training voice samples with a preset dimension, and performing segmentation processing on the training voice samples through a preset rule to obtain a plurality of initial training voice samples;
performing feature extraction on the initial training voice sample to obtain the features of a training voice signal to be processed;
performing feature statistics on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed;
obtaining target training characteristic data through a preset multi-objective optimization algorithm according to the statistical result of the training characteristics to be confirmed;
acquiring emotion types corresponding to the target training characteristic data according to the target training characteristic data;
and establishing a preset Softmax classification model according to the emotion classes and target training characteristic data corresponding to the emotion classes.
Preferably, the step of obtaining target training feature data through a preset multi-objective optimization algorithm according to the statistical result of the training features to be confirmed includes:
carrying out emotion category division on the statistical result of the training features to be confirmed to obtain training feature data to be optimized corresponding to different emotion categories;
and obtaining target training characteristic data through a preset multi-target optimization algorithm according to the training characteristic data to be optimized.
Preferably, the step of inputting the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result includes:
inputting the characteristic target data into the preset Softmax classification model to obtain speech emotion category data;
performing data statistics on the voice emotion type data to obtain a voice emotion type data value;
and obtaining a voice emotion recognition result according to the voice emotion category data value.
Preferably, the step of obtaining the speech emotion recognition result according to the speech emotion category data value includes:
judging whether the voice emotion type data value belongs to a preset voice emotion type threshold range or not;
and if the voice emotion type data value belongs to the preset voice emotion type threshold range, acquiring a voice emotion recognition result according to the voice emotion type data value.
Preferably, after the step of determining whether the speech emotion category data value belongs to a preset speech emotion category threshold range, the method further includes:
and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the feature target data into the preset Softmax classification model to obtain voice emotion category data.
Preferably, the step of performing feature statistics on the to-be-processed speech signal feature data through a preset statistical function to obtain a to-be-confirmed feature statistical result includes:
screening the voice signal characteristic data to be processed to obtain label sample characteristic data;
and carrying out feature statistics on the tag sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed.
In addition, to achieve the above object, the present invention further provides a speech emotion recognition apparatus, including: the system comprises an acquisition module, a processing module and a control module, wherein the acquisition module is used for acquiring a test voice sample with a preset dimensionality and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples;
the extraction module is used for extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed;
the statistical module is used for carrying out feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed;
the calculation module is used for obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed;
and the determining module is used for inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result.
In addition, to achieve the above object, the present invention also provides an electronic device, including: a memory, a processor and a speech emotion recognition program stored on the memory and executable on the processor, the speech emotion recognition program being configured to implement the steps of the speech emotion recognition method as described in any of the above.
In addition, to achieve the above object, the present invention further provides a storage medium, on which a speech emotion recognition program is stored, and the speech emotion recognition program implements the steps of the speech emotion recognition method as described in any one of the above when executed by a processor.
The method comprises the steps of firstly obtaining a test voice sample with a preset dimension and segmenting it according to a preset rule to obtain a plurality of initial voice samples; then extracting signal feature data from the initial voice samples to obtain voice signal feature data to be processed, screening the voice signal feature data to be processed to obtain tag sample feature data, and performing feature statistics on the tag sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed; then obtaining feature target data through a preset multi-objective optimization algorithm according to the feature statistical result to be confirmed; and finally inputting the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result. In this way, the speech emotion segments and the emotional relationship between sentences and segments can be fully utilized and converted into speech emotion data, so that the speech emotion recognition effect is improved.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a speech emotion recognition method according to the present invention;
FIG. 3 is a flowchart illustrating a speech emotion recognition method according to a second embodiment of the present invention;
FIG. 4 is a block diagram of a speech emotion recognition apparatus according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the electronic device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and a speech emotion recognition program.
In the electronic apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 of the electronic device according to the present invention may be disposed in the electronic device, and the electronic device calls the speech emotion recognition program stored in the memory 1005 through the processor 1001 and executes the speech emotion recognition method according to the embodiment of the present invention.
An embodiment of the present invention provides a speech emotion recognition method, and referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of the speech emotion recognition method according to the present invention.
In this embodiment, the speech emotion recognition method includes the following steps:
step S10: the method comprises the steps of obtaining a test voice sample with preset dimensionality, and conducting segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples.
Before the step of obtaining a test voice sample with a preset dimension and segmenting the test voice sample according to a preset rule to obtain a plurality of initial voice samples, a training voice sample with the preset dimension is obtained and segmented according to the preset rule to obtain a plurality of initial training voice samples; feature extraction is performed on the initial training voice samples to obtain the training voice signal features to be processed; feature statistics are performed on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed; target training feature data are obtained through a preset multi-objective optimization algorithm according to the training feature statistical result to be confirmed; the emotion category corresponding to the target training feature data is obtained according to the target training feature data; and a preset Softmax classification model is established according to the emotion categories and the target training feature data corresponding to the emotion categories.
In addition, it should be understood that the preset rule is a user-defined sample division rule; that is, if the duration of the obtained test voice sample of the preset dimension is 5 s and the preset rule is set to 0.2 s, 25 initial voice samples of 0.2 s each are obtained after division according to the preset rule.
In addition, it should be noted that the preset dimension may be a time dimension, or may be a non-time dimension, and the present embodiment is not limited thereto.
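For illustration, the following is a minimal Python sketch of the segmentation rule described above, assuming the test voice sample is a mono waveform array sampled at 16 kHz; the helper name is hypothetical and any trailing partial segment is simply dropped.

```python
import numpy as np

def split_into_segments(waveform, sample_rate, segment_seconds=0.2):
    """Cut a mono waveform into fixed-length initial voice samples."""
    seg_len = int(round(segment_seconds * sample_rate))
    n_full = len(waveform) // seg_len  # trailing partial segment is dropped
    return [waveform[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

# A 5 s sample at 16 kHz with a 0.2 s preset rule yields 25 initial voice samples.
audio = np.random.randn(5 * 16000)
segments = split_into_segments(audio, 16000)
assert len(segments) == 25
```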
Step S20: and extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed.
Furthermore, it should be understood that signal feature data extraction on the initial voice sample extracts Mel Frequency Cepstral Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), Linear Predictive Cepstral Coefficients (LPCC), Zero Crossings with Peak Amplitude (ZCPA), Perceptual Linear Prediction (PLP) and Rasta-filtered Perceptual Linear Prediction (R-PLP) features.
It should be understood that the above feature extraction result of each class of features is a two-dimensional matrix in which one dimension is the time dimension. The first derivative ΔF_i and the second derivative ΔΔF_i of each class of features F_i in the time dimension are then calculated, and the original features, the first-derivative result and the second-derivative result are concatenated in the non-time dimension to form the final feature extraction result of that class; the final feature extraction results of all classes are concatenated in the non-time dimension to obtain the feature extraction result of the sample.
Further, for ease of understanding, the following example is given: suppose the MFCC features give F_MFCC ∈ R^(39×z), ΔF_MFCC ∈ R^(39×z) and ΔΔF_MFCC ∈ R^(39×z), where z is the number of frames, i.e. the size of the time dimension; the concatenation result in the non-time dimension is then [F_MFCC; ΔF_MFCC; ΔΔF_MFCC] ∈ R^(117×z). When the MFCC and LPCC features are connected, suppose the corresponding LPCC result after concatenation with its derivatives is F'_LPCC ∈ R^(36×z); connecting the two in series in the non-time dimension then gives a matrix in R^((117+36)×z) = R^(153×z).
Furthermore, it should be understood that at each speech signal feature extraction, the MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features are extracted, where the number of Mel filters for MFCC and LFPC is 40; the linear prediction orders of LPCC, PLP and R-PLP are 12, 16 and 16 respectively; and the frequency boundaries of ZCPA are: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of the feature classes for each sentence are therefore ti×39, ti×40, ti×12, ti×16, ti×16 and ti×16 respectively, where ti is the number of frames of the i-th sentence and the number after the multiplication sign is the dimension of each frame's features. To capture the change of the speech signal in the time dimension, the first and second derivatives of the above features in the time dimension are also calculated. Finally, the dimensions of the feature classes are ti×117, ti×140, ti×36, ti×48, ti×48 and ti×48 respectively. The extracted speech signal features of the i-th sample are the combination of all the above features and have dimension ti×(117+140+36+48+48+48).
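As a hedged illustration of the derivative-and-concatenation scheme, the sketch below handles only the MFCC block with librosa; the other feature families (LFPC, LPCC, ZCPA, PLP, R-PLP) are assumed to be produced by separate user code and would be stacked in the same non-time dimension. The parameter n_mfcc=39 mirrors the per-frame MFCC dimension stated above, so the block grows to 117 rows after the derivatives are appended.

```python
import numpy as np
import librosa

def mfcc_with_derivatives(segment, sample_rate, n_mfcc=39):
    """MFCC block of one segment plus its first and second time derivatives,
    concatenated along the non-time (feature) dimension."""
    f = librosa.feature.mfcc(y=segment, sr=sample_rate, n_mfcc=n_mfcc)  # shape (39, z)
    # small delta width because a 0.2 s segment contains only a few frames
    d1 = librosa.feature.delta(f, width=3, order=1)  # first derivative over time
    d2 = librosa.feature.delta(f, width=3, order=2)  # second derivative over time
    return np.concatenate([f, d1, d2], axis=0)       # shape (117, z)

# The full per-segment matrix would stack every feature family the same way:
# np.concatenate([mfcc_block, lfpc_block, lpcc_block, ...], axis=0)
```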
Step S30: and carrying out feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed.
The statistical results of the above features in the time dimension are obtained using the following statistical functions: mean, standard deviation, minimum, maximum, kurtosis and skewness.
In addition, it should be understood that the tag sample feature data are obtained by screening the statistical results obtained above, and feature statistics are performed on the tag sample feature data through the preset statistical function to obtain the feature statistical result to be confirmed. The feature statistics of the labeled samples are recorded as {x_1, x_2, ..., x_n}, where n is the number of labeled samples.
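A minimal sketch of these six time-dimension statistics, assuming the per-segment feature matrix has shape (feature_dim, z) with z frames:

```python
import numpy as np
from scipy import stats

def time_statistics(features):
    """Collapse the time dimension with mean, standard deviation, minimum,
    maximum, kurtosis and skewness, giving one fixed-length vector per segment."""
    return np.concatenate([
        features.mean(axis=1),
        features.std(axis=1),
        features.min(axis=1),
        features.max(axis=1),
        stats.kurtosis(features, axis=1),
        stats.skew(features, axis=1),
    ])
```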
Step S40: and obtaining characteristic target data through a preset multi-target optimization algorithm according to the statistical result of the characteristics to be confirmed.
In addition, it should be noted that {x_1, x_2, ..., x_n} from the above step is divided by sentence label into X_A = [x_1, x_2, ..., x_m] and X_B = [x_{m+1}, x_{m+2}, ..., x_n], where X_A contains the segments of class-A emotion and X_B contains the segments of class-B emotion. Training of the sentence-segment emotion classification method based on tendency cognitive learning comprises the following steps:
(1) For each x ∈ X_A, the angles between the X_A samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_A data around x are calculated as
b_x = [b_1, b_2, ..., b_k], b_j = Σ_{x_i ∈ X_A} 1(x_i ∈ X_j),
where b_j denotes the j-th bin, 1(x_i ∈ X_j) takes the value 1 when x_i belongs to X_j and 0 otherwise, and X_j is the subset of X_A whose samples have angles to x falling in the j-th bin.
(2) For each x ∈ X_A, the angles between the X_B samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_B data around x are calculated as
b̃_x = [b̃_1, b̃_2, ..., b̃_k], b̃_j = Σ_{x_i ∈ X_B} 1(x_i ∈ X̃_j),
where b̃_j denotes the j-th bin, 1(x_i ∈ X̃_j) takes the value 1 when x_i belongs to X̃_j and 0 otherwise, and X̃_j is the subset of X_B whose samples have angles to x falling in the j-th bin.
(3) The difference between the data distributions of the two data sets around the point x is calculated as
d_x = dist(b_x, b̃_x),
where dist(·, ·) denotes the distance between two vectors; the Euclidean distance is used here.
(4) According to the result of the previous step, the segment set inclined to the A emotion, X̂_A, the segment set inclined to the B emotion, X̂_B, and the segment set inclined to neutral emotion, X̂_C, can be obtained, where X̂_A consists of the x with d_x > T, X̂_B consists of the x with d_x < -T, and X̂_C consists of the x with -T < d_x < T; T is a threshold set by the user. Each set is clustered into a number of regions by spectral clustering, giving a region label l̂_i for each segment x_i.
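Before turning to step (5), the sketch below illustrates steps (1) to (4) under several explicit assumptions: the Parzen window is approximated by a fixed-radius neighbourhood, the angle of each neighbour is measured against a reference direction chosen here as the mean offset of the window (the text only states that angles to the center sample are binned), and the thresholding into the three tendency sets is applied to the resulting d_x values exactly as written above.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def angle_histogram(center, neighbours, reference, n_bins=8):
    """Histogram b_x of angles between (neighbour - center) and a reference
    direction; the choice of reference direction is an assumption."""
    diffs = neighbours - center
    denom = np.linalg.norm(diffs, axis=1) * np.linalg.norm(reference)
    cos = diffs @ reference / np.clip(denom, 1e-12, None)
    angles = np.arccos(np.clip(cos, -1.0, 1.0))            # angles in [0, pi]
    hist, _ = np.histogram(angles, bins=n_bins, range=(0.0, np.pi))
    return hist / max(len(neighbours), 1)

def distribution_difference(x, X_A, X_B, radius=1.0, n_bins=8):
    """d_x: difference between the X_A and X_B angle distributions around x."""
    near_A = X_A[np.linalg.norm(X_A - x, axis=1) < radius]
    near_B = X_B[np.linalg.norm(X_B - x, axis=1) < radius]
    ref = np.concatenate([near_A, near_B]).mean(axis=0) - x if len(near_A) + len(near_B) else x
    b_A = angle_histogram(x, near_A, ref, n_bins)
    b_B = angle_histogram(x, near_B, ref, n_bins)
    return np.linalg.norm(b_A - b_B)                        # Euclidean distance, as in step (3)

def cluster_regions(tendency_set, n_regions=3):
    """Spectral clustering of one tendency set into regions (step (4))."""
    return SpectralClustering(n_clusters=n_regions).fit_predict(np.asarray(tendency_set))
```

Note that with a plain Euclidean distance d_x is non-negative, so a signed variant of dist(·, ·) would be needed for the d_x < -T branch of step (4) to be populated; that variant is not specified here.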
(5) Define X̂ = [X̂_A, X̂_B, X̂_C] and L = [L_A, L_B, L_C], where L_A ∈ R^p, L_B ∈ R^q and L_C ∈ R^u, and p, q and u are respectively the numbers of samples in X̂_A, X̂_B and X̂_C; the element values in L_A, L_B and L_C are 1, 2 and 3 respectively. The feature subspace of the segments is learned using the following objective function:
J = J1(o_i, o_j) + β·J2(o_i, o_j)
where β is a balance parameter. J1(o_i, o_j) makes the intra-class distances among the three classes X̂_A, X̂_B and X̂_C small and the inter-class distances large. In its definition, o_i and o_j are the results of mapping the samples of X̂_A, X̂_B and X̂_C into the subspace, l_i and l_j are the values in L corresponding to o_i and o_j, m is a threshold that adjusts the range of the inter-class distance and is set to 1 in the invention, and G_ij is the Gaussian distance between x_i and x_j, calculated as:
G_ij = exp(−‖x_i − x_j‖² / (2σ²)), where σ is a kernel width parameter.
J2(o_i, o_j) tries to keep the relative relationships within each region unchanged while keeping regions that belong to the same class close to each other without overlapping. In its definition, l̂_i and l̂_j are the region labels of x_i and x_j, and G_{l̂_i} is the maximum of all G_ij within region l̂_i. In this way the relationship between two segments is preserved when they belong to the same region, and when two segments belong to the same class but different regions their distance is minimized with only a small weight, so that overlapping of the two regions is avoided as far as possible.
To optimize the objective function J, we define o_i = φ(W_q φ( ⋯ φ(W_2 φ(W_1 x_i + b_1) + b_2) ⋯ ) + b_q), where φ(·) is the sigmoid function, W_1, W_2, ..., W_q are the mapping matrices and b_1, b_2, ..., b_q are the offsets. By calculating ∂J/∂W and ∂J/∂b, the values of W_1, W_2, ..., W_q and b_1, b_2, ..., b_q can be obtained, where ∂J/∂W is the derivative of J with respect to W and ∂J/∂b is the derivative of J with respect to b.
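As a sketch of the optimization just described, the forward mapping o_i and a plain gradient-descent update are shown below; the concrete gradients ∂J/∂W and ∂J/∂b of J = J1 + β·J2 are assumed to be supplied by the caller, since the closed forms of J1 and J2 follow the definitions above and are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_mapping(x, weights, biases):
    """o = phi(W_q phi(... phi(W_1 x + b_1) ...) + b_q): stacked sigmoid
    mapping of a segment feature vector into the learned subspace."""
    o = x
    for W, b in zip(weights, biases):
        o = sigmoid(W @ o + b)
    return o

def gradient_step(weights, biases, grads_W, grads_b, lr=1e-2):
    """One gradient-descent update of W_1..W_q and b_1..b_q using the
    caller-supplied derivatives of J with respect to W and b."""
    weights = [W - lr * gW for W, gW in zip(weights, grads_W)]
    biases = [b - lr * gb for b, gb in zip(biases, grads_b)]
    return weights, biases
```

In practice the same gradients could also be obtained with an automatic-differentiation framework rather than derived by hand.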
Step S50: and inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result.
Further, it is understood that, with the W_1, W_2, ..., W_q and b_1, b_2, ..., b_q obtained in the above steps, the feature selection result z of {x_1, x_2, ..., x_m} is calculated.
In addition, it should be noted that W_1, W_2, ..., W_q and b_1, b_2, ..., b_q are the feature target data referred to in the present application.
Further, it should be understood that the speech emotion categories {l_1, l_2, ..., l_m} of {x_1, x_2, ..., x_m} are obtained separately using the preset Softmax classifier obtained during training, and the emotion of the sentence is then obtained by voting over {l_1, l_2, ..., l_m}.
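A minimal sketch of this segment-level classification and sentence-level vote, assuming an sklearn-style classifier object with a predict method stands in for the preset Softmax classification model:

```python
import numpy as np
from collections import Counter

def predict_sentence_emotion(segment_vectors, softmax_model):
    """Classify every segment of one sentence and vote over the segment
    labels {l_1, ..., l_m} to obtain the sentence-level emotion."""
    labels = softmax_model.predict(np.vstack(segment_vectors))
    return Counter(labels).most_common(1)[0][0]
```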
In addition, it should be noted that the feature target data is input into the preset Softmax classification model to obtain speech emotion category data, data statistics is performed on the speech emotion category data to obtain a speech emotion category data value, and a speech emotion recognition result is obtained according to the speech emotion category data value.
In addition, the step of obtaining the speech emotion recognition result according to the speech emotion category data value is to determine whether the speech emotion category data value belongs to a preset speech emotion category threshold range, and if the speech emotion category data value belongs to the preset speech emotion category threshold range, obtain the speech emotion recognition result according to the speech emotion category data value; and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the feature target data into the preset Softmax classification model to obtain voice emotion category data.
In addition, the corpus used for evaluating the emotion recognition effect of the present invention is a standard database in the speech emotion recognition field. The training process is completed first, and the recognition test is then performed using 5-fold cross-validation. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. In the speaker-dependent case the average classification accuracy is 94.65%, and apart from irritability, which is relatively easy to confuse with anger, the emotions are well separated from one another. In the speaker-independent case the average classification accuracy is 89.30%.
In this embodiment, a test voice sample with a preset dimension is obtained and segmented according to a preset rule to obtain a plurality of initial voice samples; signal feature data are then extracted from the initial voice samples to obtain voice signal feature data to be processed, which are screened to obtain tag sample feature data; feature statistics are performed on the tag sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed; the statistical result is divided by emotion category to obtain training feature data to be optimized for the different emotion categories, and target training feature data are obtained from them through a preset multi-objective optimization algorithm; finally, the feature target data are input into the preset Softmax classification model to obtain a speech emotion recognition result. In this way, the speech emotion segments and the emotional relationship between sentences and segments can be fully exploited to form tendency-bearing data, so that the human process of handling tendencies can be simulated; the unbalanced information in the data is used, the data are compared with each other, and segments with different emotions are separated under constraint conditions, which increases the sample scale and improves the sample diversity.
Referring to fig. 3, fig. 3 is a flowchart illustrating a speech emotion recognition method according to a second embodiment of the present invention.
Based on the first embodiment, before the step S10, the speech emotion recognition method in this embodiment further includes:
step S000: and acquiring training voice samples with preset dimensions, and performing segmentation processing on the test voice samples through preset rules to obtain a plurality of initial training voice samples.
Step S001: and extracting the characteristics of the initial training voice sample to obtain the characteristics of the training voice signal to be processed.
Step S002: and carrying out feature statistics on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed.
Step S003: and obtaining target training characteristic data through a preset multi-objective optimization algorithm according to the statistical result of the training characteristics to be confirmed.
Step S004: and acquiring the emotion types corresponding to the target training characteristic data according to the target training characteristic data.
Step S005: and establishing a preset Softmax classification model according to the emotion classes and target training characteristic data corresponding to the emotion classes.
In addition, it should be noted that, the step of obtaining the target training feature data through the preset multi-objective optimization algorithm according to the to-be-confirmed training feature statistical result includes performing emotion class division on the to-be-confirmed training feature statistical result to obtain the to-be-optimized training feature data corresponding to different emotion classes, and obtaining the target training feature data through the preset multi-objective optimization algorithm according to the to-be-optimized training feature data.
In addition, it should be noted that, in the above step, a preset Softmax classification model is established, and in this stage, training is performed on all speakers respectively to obtain a classifier corresponding to each speaker, and the specific process is as follows:
step (1-1) segmenting each statement;
extracting the characteristics of each segment;
step (1-3) performing feature statistics on all features;
step (1-4) training a sentence fragment emotion classification method based on tendency cognitive learning;
step (1-5) training a support vector machine for each feature subspace;
the classification result of the step (1-6) is obtained by voting the results of all the support vector machines;
in addition, in the step (1-1), the speech signal is segmented at intervals of 0.2 seconds.
In the step (1-2), extracting the speech signal features for each segment includes: MFCC (Mel Frequency Cepstral Coefficients), LFPC (Log Frequency Power Coefficients), LPCC (Linear Predictive Cepstral Coefficients), ZCPA (Zero Crossings with Peak Amplitude), PLP (Perceptual Linear Prediction) and R-PLP (Rasta Perceptual Linear Prediction). The feature extraction result of each class of features is a two-dimensional matrix in which one dimension is the time dimension. The first derivative ΔF_i and the second derivative ΔΔF_i of each class of features F_i in the time dimension are then calculated, and the original features, the first-derivative result and the second-derivative result are concatenated in the non-time dimension to form the final feature extraction result of that class; the final feature extraction results of all classes are concatenated in the non-time dimension to obtain the feature extraction result of the sample.
The feature statistics in the step (1-3) are: the statistical results of the mean, standard deviation, minimum, maximum, kurtosis and skewness of the features in the time dimension. The feature statistics are recorded as {x_1, x_2, ..., x_n}, and the corresponding labels are denoted as Y = [y_1, y_2, ..., y_n] ∈ R^n.
In the step (1-4), given the data sets X_A = [x_1, x_2, ..., x_m] and X_B = [x_{m+1}, x_{m+2}, ..., x_n], where X_A contains the segments of class-A emotion and X_B contains the segments of class-B emotion, training of the sentence-segment emotion classification method based on tendency cognitive learning comprises the following steps:
Step (1-4-1): for each x ∈ X_A, the angles between the X_A samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_A data around x are calculated as
b_x = [b_1, b_2, ..., b_k], b_j = Σ_{x_i ∈ X_A} 1(x_i ∈ X_j),
where b_j denotes the j-th bin, 1(x_i ∈ X_j) takes the value 1 when x_i belongs to X_j and 0 otherwise, and X_j is the subset of X_A whose samples have angles to x falling in the j-th bin.
Step (1-4-2): for each x ∈ X_A, the angles between the X_B samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_B data around x are calculated as
b̃_x = [b̃_1, b̃_2, ..., b̃_k], b̃_j = Σ_{x_i ∈ X_B} 1(x_i ∈ X̃_j),
where b̃_j denotes the j-th bin, 1(x_i ∈ X̃_j) takes the value 1 when x_i belongs to X̃_j and 0 otherwise, and X̃_j is the subset of X_B whose samples have angles to x falling in the j-th bin.
Step (1-4-3): the difference between the data distributions of the two data sets around the point x is calculated as
d_x = dist(b_x, b̃_x),
where dist(·, ·) denotes the distance between two vectors; a variety of distance calculation methods may be used.
Step (1-4-4): according to the calculation result of step (1-4-3), the segment set inclined to the A emotion, X̂_A, the segment set inclined to the B emotion, X̂_B, and the segment set inclined to neutral emotion, X̂_C, can be obtained, where X̂_A consists of the x with d_x > T, X̂_B consists of the x with d_x < -T, and X̂_C consists of the x with -T < d_x < T; T is a threshold set by the user. Each set is clustered into a number of regions by spectral clustering, giving a region label l̂_i for each segment x_i.
Step (1-4-5): define X̂ = [X̂_A, X̂_B, X̂_C] and L = [L_A, L_B, L_C], where L_A ∈ R^p, L_B ∈ R^q and L_C ∈ R^u, and p, q and u are respectively the numbers of samples in X̂_A, X̂_B and X̂_C; the element values in L_A, L_B and L_C are 1, 2 and 3 respectively. The feature subspace of the segments is learned using the following objective function:
J = J1(o_i, o_j) + β·J2(o_i, o_j)
where β is a balance parameter. J1(o_i, o_j) makes the intra-class distances among the three classes X̂_A, X̂_B and X̂_C small and the inter-class distances large. In its definition, o_i and o_j are the results of mapping the samples of X̂_A, X̂_B and X̂_C into the subspace, l_i and l_j are the values in L corresponding to o_i and o_j, m is a threshold that adjusts the range of the inter-class distance, and G_ij is the Gaussian distance between x_i and x_j, calculated as:
G_ij = exp(−‖x_i − x_j‖² / (2σ²)), where σ is a kernel width parameter.
J2(o_i, o_j) tries to keep the relative relationships within each region unchanged while keeping regions that belong to the same class close to each other without overlapping. In its definition, l̂_i and l̂_j are the region labels of x_i and x_j, and G_{l̂_i} is the maximum of all G_ij within region l̂_i. In this way the relationship between two segments is preserved when they belong to the same region, and when two segments belong to the same class but different regions their distance is minimized with only a small weight, so that overlapping of the two regions is avoided as far as possible.
To optimize the objective function J, define o_i = φ(W_q φ( ⋯ φ(W_2 φ(W_1 x_i + b_1) + b_2) ⋯ ) + b_q), where φ(·) is the sigmoid function, W_1, W_2, ..., W_q are the mapping matrices and b_1, b_2, ..., b_q are the offsets. By calculating ∂J/∂W and ∂J/∂b, the values of W_1, W_2, ..., W_q and b_1, b_2, ..., b_q can be obtained, where ∂J/∂W is the derivative of J with respect to W and ∂J/∂b is the derivative of J with respect to b.
Step (1-4-6): for the subspace mapping results of X̂_A, X̂_B and X̂_C obtained in step (1-4-5), train a Softmax classifier to separate emotion A, emotion B and emotion C.
Step (1-4-7): following the operation of steps (1-4-5) and (1-4-6), train a Softmax classifier capable of recognizing every emotion pair.
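A sketch of this pairwise training, using multinomial logistic regression from scikit-learn as a stand-in for the Softmax classifier; for each emotion pair the classifier separates the A-prone, B-prone and neutral-prone subspace representations produced by steps (1-4-5) and (1-4-6).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_pair_classifier(o_A, o_B, o_C):
    """Three-class softmax-style classifier separating the A-prone, B-prone
    and neutral-prone segments of one emotion pair in the learned subspace."""
    X = np.vstack([o_A, o_B, o_C])
    y = np.array([0] * len(o_A) + [1] * len(o_B) + [2] * len(o_C))
    return LogisticRegression(max_iter=1000).fit(X, y)

def train_all_pairs(subspace_data):
    """subspace_data maps an (emotion_a, emotion_b) pair to its (o_A, o_B, o_C)
    arrays; one classifier is trained per pair, as in step (1-4-7)."""
    return {pair: train_pair_classifier(*arrays) for pair, arrays in subspace_data.items()}
```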
Further, it should be understood that the following is a summary of the above:
the first step is as follows: all training sample voices were segmented at 0.2 second intervals.
The second step is as follows: the MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features are extracted from all speech segment training signals, where the number of Mel filters for MFCC and LFPC is 40; the linear prediction orders of LPCC, PLP and R-PLP are 12, 16 and 16 respectively; and the frequency boundaries of ZCPA are: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of the feature classes for each sentence are therefore ti×39, ti×40, ti×12, ti×16, ti×16 and ti×16 respectively, where ti is the number of frames of the i-th sentence and the number after the multiplication sign is the dimension of each frame's features. To capture the change of the speech signal in the time dimension, the first and second derivatives of the above features in the time dimension are also calculated. Finally, the dimensions of the feature classes are ti×117, ti×140, ti×36, ti×48, ti×48 and ti×48 respectively. The extracted speech signal features of the i-th sample are the combination of all the above features and have dimension ti×(117+140+36+48+48+48).
The third step is as follows: the following statistical functions are used to obtain the statistics of the above features in the time dimension: mean, standard deviation, minimum, maximum, kurtosis and skewness. The feature statistics of the labeled samples are recorded as {x_1, x_2, ..., x_n}, where n is the number of labeled samples.
The fourth step is as follows: {x_1, x_2, ..., x_n} from the previous step is divided by sentence label into X_A = [x_1, x_2, ..., x_m] and X_B = [x_{m+1}, x_{m+2}, ..., x_n], where X_A contains the segments of class-A emotion and X_B contains the segments of class-B emotion. Training of the sentence-segment emotion classification method based on tendency cognitive learning comprises the following steps:
(1) For each x ∈ X_A, the angles between the X_A samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_A data around x are calculated as
b_x = [b_1, b_2, ..., b_k], b_j = Σ_{x_i ∈ X_A} 1(x_i ∈ X_j),
where b_j denotes the j-th bin, 1(x_i ∈ X_j) takes the value 1 when x_i belongs to X_j and 0 otherwise, and X_j is the subset of X_A whose samples have angles to x falling in the j-th bin.
(2) For each x ∈ X_A, the angles between the X_B samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_B data around x are calculated as
b̃_x = [b̃_1, b̃_2, ..., b̃_k], b̃_j = Σ_{x_i ∈ X_B} 1(x_i ∈ X̃_j),
where b̃_j denotes the j-th bin, 1(x_i ∈ X̃_j) takes the value 1 when x_i belongs to X̃_j and 0 otherwise, and X̃_j is the subset of X_B whose samples have angles to x falling in the j-th bin.
(3) The difference between the data distributions of the two data sets around the point x is calculated as
d_x = dist(b_x, b̃_x),
where dist(·, ·) denotes the distance between two vectors; the Euclidean distance is used here.
(4) According to the calculation result of step (3), the segment set inclined to the A emotion, X̂_A, the segment set inclined to the B emotion, X̂_B, and the segment set inclined to neutral emotion, X̂_C, can be obtained, where X̂_A consists of the x with d_x > T, X̂_B consists of the x with d_x < -T, and X̂_C consists of the x with -T < d_x < T; T is a threshold set by the user. Each set is clustered into a number of regions by spectral clustering, giving a region label l̂_i for each segment x_i.
(5) Define X̂ = [X̂_A, X̂_B, X̂_C] and L = [L_A, L_B, L_C], where L_A ∈ R^p, L_B ∈ R^q and L_C ∈ R^u, and p, q and u are respectively the numbers of samples in X̂_A, X̂_B and X̂_C; the element values in L_A, L_B and L_C are 1, 2 and 3 respectively. The feature subspace of the segments is learned using the following objective function:
J = J1(o_i, o_j) + β·J2(o_i, o_j)
where β is a balance parameter. J1(o_i, o_j) makes the intra-class distances among the three classes X̂_A, X̂_B and X̂_C small and the inter-class distances large. In its definition, o_i and o_j are the results of mapping the samples of X̂_A, X̂_B and X̂_C into the subspace, l_i and l_j are the values in L corresponding to o_i and o_j, m is a threshold that adjusts the range of the inter-class distance and is set to 1 in the invention, and G_ij is the Gaussian distance between x_i and x_j, calculated as:
G_ij = exp(−‖x_i − x_j‖² / (2σ²)), where σ is a kernel width parameter.
J2(o_i, o_j) tries to keep the relative relationships within each region unchanged while keeping regions that belong to the same class close to each other without overlapping. In its definition, l̂_i and l̂_j are the region labels of x_i and x_j, and G_{l̂_i} is the maximum of all G_ij within region l̂_i. In this way the relationship between two segments is preserved when they belong to the same region, and when two segments belong to the same class but different regions their distance is minimized with only a small weight, so that overlapping of the two regions is avoided as far as possible.
To optimize the objective function J, define o_i = φ(W_q φ( ⋯ φ(W_2 φ(W_1 x_i + b_1) + b_2) ⋯ ) + b_q), where φ(·) is the sigmoid function, W_1, W_2, ..., W_q are the mapping matrices and b_1, b_2, ..., b_q are the offsets. By calculating ∂J/∂W and ∂J/∂b, the values of W_1, W_2, ..., W_q and b_1, b_2, ..., b_q can be obtained, where ∂J/∂W is the derivative of J with respect to W and ∂J/∂b is the derivative of J with respect to b.
(6) For the subspace mapping results of X̂_A, X̂_B and X̂_C obtained in step (1-4-5), train the Softmax classifier to separate emotion A, emotion B and emotion C.
(7) Following the operation of steps (1-4-5) and (1-4-6), train a Softmax classifier capable of recognizing every emotion pair.
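The steps summarized above can be strung together as in the hedged driver sketch below; every helper name is hypothetical, split_into_segments and time_statistics reuse the earlier sketches, and the per-subspace classifier of step (1-5) is a support vector machine in the text.

```python
def train_speaker_model(sentences, sentence_labels, sample_rate=16000):
    """Per-speaker training pipeline, steps (1-1) to (1-5); the vote of
    step (1-6) is applied at recognition time."""
    seg_vectors, seg_labels = [], []
    for waveform, label in zip(sentences, sentence_labels):
        for seg in split_into_segments(waveform, sample_rate, 0.2):   # step (1-1)
            feats = extract_all_features(seg, sample_rate)            # step (1-2), hypothetical helper
            seg_vectors.append(time_statistics(feats))                # step (1-3)
            seg_labels.append(label)
    subspaces = learn_tendency_subspaces(seg_vectors, seg_labels)     # step (1-4), hypothetical helper
    return [train_subspace_svm(sub) for sub in subspaces]             # step (1-5), hypothetical helper
```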
In this embodiment, training voice samples with a preset dimension are obtained and segmented according to a preset rule to obtain a plurality of initial training voice samples; feature extraction is performed on the initial training voice samples to obtain the training voice signal features to be processed; feature statistics are performed on the training voice signal features through a preset statistical function to obtain a training feature statistical result to be confirmed; target training feature data are obtained through a preset multi-objective optimization algorithm according to the training feature statistical result; the emotion categories corresponding to the target training feature data are then obtained, and the preset Softmax classification model is established from the emotion categories and the corresponding target training feature data. In this way, the model can be trained on the local segments of a sentence, which avoids the problems that different local segments of a sentence contain different emotions or that different local segments of the same emotion conflict with each other, thereby reducing the gap between the physical meaning of deep learning and the characteristics of speech emotion recognition.
In addition, an embodiment of the present invention further provides a storage medium, where a speech emotion recognition program is stored on the storage medium, and the speech emotion recognition program, when executed by a processor, implements the steps of the speech emotion recognition method described above.
Referring to FIG. 4, FIG. 4 is a block diagram illustrating a first embodiment of a speech emotion recognition apparatus according to the present invention.
As shown in fig. 4, the speech emotion recognition apparatus according to the embodiment of the present invention includes: the acquisition module 4001 is configured to acquire a test voice sample with a preset dimension, and perform segmentation processing on the test voice sample according to a preset rule to obtain a plurality of initial voice samples; an extraction module 4002, configured to perform signal feature data extraction on the initial voice sample to obtain to-be-processed voice signal feature data; the statistic module 4003 is configured to perform feature statistics on the to-be-processed speech signal feature data through a preset statistic function to obtain a feature statistical result to be confirmed; the calculating module 4004 is configured to obtain feature target data through a preset multi-objective optimization algorithm according to the feature statistical result to be confirmed; the determining module 4005 is configured to input the feature target data into a preset Softmax classification model, and obtain a speech emotion recognition result.
The obtaining module 4001 obtains a test voice sample with a preset dimension, and performs a segmentation process on the test voice sample according to a preset rule to obtain a plurality of initial voice samples.
Before the step of obtaining a test voice sample with a preset dimension and segmenting the test voice sample according to a preset rule to obtain a plurality of initial voice samples, a training voice sample with the preset dimension is obtained and segmented according to the preset rule to obtain a plurality of initial training voice samples; feature extraction is performed on the initial training voice samples to obtain the training voice signal features to be processed; feature statistics are performed on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed; target training feature data are obtained through a preset multi-objective optimization algorithm according to the training feature statistical result to be confirmed; the emotion category corresponding to the target training feature data is obtained according to the target training feature data; and a preset Softmax classification model is established according to the emotion categories and the target training feature data corresponding to the emotion categories.
In addition, it should be understood that the preset rule is a user-defined sample division rule, that is, if the duration of the obtained test voice sample of the preset dimension is 5s, and the preset rule is set to 0.2s, 25 segments of 0.2s initial voice samples are obtained after division according to the preset rule.
In addition, it should be noted that the preset dimension may be a time dimension, or may be a non-time dimension, and the present embodiment is not limited thereto.
The extraction module 4002 performs the operation of extracting signal feature data from the initial voice samples to obtain the voice signal feature data to be processed.
Furthermore, it should be understood that signal feature data extraction on the initial voice sample extracts Mel Frequency Cepstral Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), Linear Predictive Cepstral Coefficients (LPCC), Zero Crossings with Peak Amplitude (ZCPA), Perceptual Linear Prediction (PLP) and Rasta-filtered Perceptual Linear Prediction (R-PLP) features.
It should be understood that the above feature extraction result of each class of features is a two-dimensional matrix in which one dimension is the time dimension. The first derivative ΔF_i and the second derivative ΔΔF_i of each class of features F_i in the time dimension are then calculated, and the original features, the first-derivative result and the second-derivative result are concatenated in the non-time dimension to form the final feature extraction result of that class; the final feature extraction results of all classes are concatenated in the non-time dimension to obtain the feature extraction result of the sample.
Further, for ease of understanding, the following example is given: suppose the MFCC features give F_MFCC ∈ R^(39×z), ΔF_MFCC ∈ R^(39×z) and ΔΔF_MFCC ∈ R^(39×z), where z is the number of frames, i.e. the size of the time dimension; the concatenation result in the non-time dimension is then [F_MFCC; ΔF_MFCC; ΔΔF_MFCC] ∈ R^(117×z). When the MFCC and LPCC features are connected, suppose the corresponding LPCC result after concatenation with its derivatives is F'_LPCC ∈ R^(36×z); connecting the two in series in the non-time dimension then gives a matrix in R^((117+36)×z) = R^(153×z).
Furthermore, it should be understood that at each speech signal feature extraction, the MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features are extracted, where the number of Mel filters for MFCC and LFPC is 40; the linear prediction orders of LPCC, PLP and R-PLP are 12, 16 and 16 respectively; and the frequency boundaries of ZCPA are: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of the feature classes for each sentence are therefore ti×39, ti×40, ti×12, ti×16, ti×16 and ti×16 respectively, where ti is the number of frames of the i-th sentence and the number after the multiplication sign is the dimension of each frame's features. To capture the change of the speech signal in the time dimension, the first and second derivatives of the above features in the time dimension are also calculated. Finally, the dimensions of the feature classes are ti×117, ti×140, ti×36, ti×48, ti×48 and ti×48 respectively. The extracted speech signal features of the i-th sample are the combination of all the above features and have dimension ti×(117+140+36+48+48+48).
And the statistic module 4003 performs feature statistics on the feature data of the voice signal to be processed through a preset statistic function to obtain a feature statistic result to be confirmed.
The statistical results of the above features in the time dimension are obtained using the following statistical functions: mean, standard deviation, minimum, maximum, kurtosis and skewness.
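A minimal sketch of these six statistics collapsing the time dimension, using numpy and scipy; the 417-dimensional frame feature size is taken from the dimensions above and is only illustrative:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def time_statistics(features: np.ndarray) -> np.ndarray:
    """Collapse the time dimension of a (feature_dim, z) matrix into six statistics
    per feature: mean, standard deviation, minimum, maximum, kurtosis, skewness."""
    stats = [
        features.mean(axis=1),
        features.std(axis=1),
        features.min(axis=1),
        features.max(axis=1),
        kurtosis(features, axis=1),
        skew(features, axis=1),
    ]
    return np.concatenate(stats)      # shape: (6 * feature_dim,)

segment_features = np.random.randn(417, 50)   # e.g. 417-dim frame features over 50 frames
x = time_statistics(segment_features)
print(x.shape)                                # (2502,)
```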
In addition, it should be understood that the label sample feature data are obtained by screening the statistical results obtained above; feature statistics are performed on the label sample feature data through a preset statistical function to obtain the feature statistical result to be confirmed, and the feature statistical results of the label samples are recorded as {x_1, x_2, ..., x_n}, where n is the number of labeled samples.
The calculation module 4004 obtains the feature target data through a preset multi-objective optimization algorithm according to the feature statistical result to be confirmed.
In addition, it should be noted that {x_1, x_2, ..., x_n} from the above step is divided by sentence label into X_A = [x_1, x_2, ..., x_m] and X_B = [x_(m+1), x_(m+2), ..., x_n], where X_A contains the segments of emotion class A and X_B contains the segments of emotion class B. The training of the sentence-segment emotion classification method based on tendency cognitive learning comprises the following steps:
(1) For x ∈ X_A, the angles between the samples of X_A that fall inside the Parzen window centered at x and the center sample x are divided into k bins, and the distribution characteristics of the X_A data around x are then calculated as follows:

b_x = [b_1, b_2, ..., b_k]

b_j = Σ_i 1(x_i ∈ X_j)

where b_j denotes the j-th bin, 1(x_i ∈ X_j) equals 1 when x_i belongs to X_j and 0 otherwise, and X_j is the subset of X_A whose samples' angles with x fall in the j-th bin.
(2) For x ∈ X_A, the angles between the samples of X_B that fall inside the Parzen window centered at x and the center sample x are divided into k bins, and the distribution characteristics of the X_B data around x are then calculated as follows:

b'_x = [b'_1, b'_2, ..., b'_k]

b'_j = Σ_i 1(x_i ∈ X'_j)

where b'_j denotes the j-th bin, 1(x_i ∈ X'_j) equals 1 when x_i belongs to X'_j and 0 otherwise, and X'_j is the subset of X_B whose samples' angles with x fall in the j-th bin.
(3) The difference between the data distributions of the two data sets around the point x is calculated as follows:

d_x = dist(b_x, b'_x)

where dist(·,·) denotes the distance between two vectors; the Euclidean distance is used here.
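The following sketch gives one possible numerical reading of steps (1) to (3). Several details are assumptions: the Parzen window is approximated by a fixed-radius ball, the angle is measured between the offset vector (x_i − x) and x itself, the bin count k is arbitrary, and the d_x returned here is unsigned, whereas the thresholding in step (4) treats d_x as signed.

```python
import numpy as np

def angle_bins(center: np.ndarray, neighbors: np.ndarray, k: int = 8) -> np.ndarray:
    """Histogram of angles between the center sample and its window neighbors, split into k bins."""
    diffs = neighbors - center
    cos = (diffs @ center) / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(center) + 1e-12)
    angles = np.arccos(np.clip(cos, -1.0, 1.0))
    hist, _ = np.histogram(angles, bins=k, range=(0.0, np.pi))
    return hist.astype(float)

def distribution_difference(x, X_A, X_B, window_radius=3.0, k=8):
    """Distance between the angle histograms of the A-class and B-class samples
    lying inside the window around x (magnitude of d_x only)."""
    in_A = X_A[np.linalg.norm(X_A - x, axis=1) < window_radius]
    in_B = X_B[np.linalg.norm(X_B - x, axis=1) < window_radius]
    return np.linalg.norm(angle_bins(x, in_A, k) - angle_bins(x, in_B, k))
```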
(4) According to the result of the previous step, the segment set inclined to emotion A, X̃_A = {x : d_x > T}, the segment set inclined to emotion B, X̃_B = {x : d_x < −T}, and the segment set inclined to neutral emotion, X̃_C = {x : −T < d_x < T}, can be obtained, where T is a user-set threshold. For each set, a spectral clustering method is used to cluster the segments into a plurality of regions, which yields a region label r_i for each segment x_i.
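A short sketch of the region-labelling step; scikit-learn's SpectralClustering is only an assumed stand-in for "a spectral clustering method", and the number of regions per set is an arbitrary choice:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def region_labels(segments: np.ndarray, n_regions: int = 4) -> np.ndarray:
    """Cluster the segments of one tendency set into regions; returns a region label per segment."""
    clustering = SpectralClustering(n_clusters=n_regions, affinity="rbf", random_state=0)
    return clustering.fit_predict(segments)

# d_x computed per segment as above; T is the user-set threshold:
# X_tilde_A = X[d > T]; X_tilde_B = X[d < -T]; X_tilde_C = X[(d > -T) & (d < T)]
X_tilde_A = np.random.randn(60, 2502)    # placeholder tendency-A segment statistics
r_A = region_labels(X_tilde_A, n_regions=4)
```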
(5) Define X̃ = [X̃_A, X̃_B, X̃_C] and L = [L_A, L_B, L_C], where L_A ∈ R^p, L_B ∈ R^q and L_C ∈ R^u, and p, q and u are the numbers of samples in X̃_A, X̃_B and X̃_C respectively; the element values in L_A, L_B and L_C are 1, 2 and 3 respectively. The feature subspace of the segments is learned using the following objective equation:

J = J_1(o_i, o_j) + β·J_2(o_i, o_j)
β is a balance parameter. J_1(o_i, o_j) makes the intra-class distances of the three classes X̃_A, X̃_B and X̃_C small and their inter-class distances large, and is defined as follows:

J_1 = Σ_{l_i = l_j} G_ij·‖o_i − o_j‖² + Σ_{l_i ≠ l_j} G_ij·max(0, m − ‖o_i − o_j‖²)
where o_i and o_j are the results of mapping x_i and x_j into the subspace, l_i and l_j are the values in L corresponding to o_i and o_j, m is a threshold that adjusts the range of the inter-class distance and is set to 1 in the invention, and G_ij is the Gaussian distance between x_i and x_j, calculated as follows:
G_ij = exp(−‖x_i − x_j‖² / (2σ²))
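A small numerical sketch of the Gaussian affinity G_ij; the bandwidth heuristic (median pairwise distance) is an assumption, since the text does not state how σ is chosen:

```python
import numpy as np
from typing import Optional
from scipy.spatial.distance import cdist

def gaussian_affinity(X: np.ndarray, sigma: Optional[float] = None) -> np.ndarray:
    """G_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)) over all pairs of segment features."""
    sq_dists = cdist(X, X, metric="sqeuclidean")
    if sigma is None:
        sigma = np.sqrt(np.median(sq_dists[sq_dists > 0]))   # heuristic bandwidth (assumption)
    return np.exp(-sq_dists / (2 * sigma ** 2))

G = gaussian_affinity(np.random.randn(10, 2502))
print(G.shape)   # (10, 10); diagonal entries are 1.0
```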
J_2(o_i, o_j) tries to keep the relative relationships within each region unchanged and to keep regions belonging to the same class relatively close without overlapping. It is defined as follows:

J_2 = Σ_{r_i = r_j} (‖o_i − o_j‖² − G_ij)² + Σ_{r_i ≠ r_j, l_i = l_j} (G_ij / G_{l_i})·‖o_i − o_j‖²
where r_i and r_j are the region labels of x_i and x_j, and G_{l_i} is the maximum of all G_ij within class l_i. In this way the relationship between two segments is preserved when they belong to the same region, and the distance between two segments that belong to the same class but to different regions is minimized with a small weight, so that overlapping between regions is avoided as far as possible.
To optimize the objective equation J, we define o_i = φ(W_q·φ(⋯φ(W_2·φ(W_1·x_i + b_1) + b_2)⋯) + b_q), where φ(·) is the sigmoid function, W_1, W_2, …, W_q are the mapping matrices and b_1, b_2, …, b_q are the offsets. The values of W_1, W_2, …, W_q and b_1, b_2, …, b_q can be obtained by calculating ∂J/∂W and ∂J/∂b, where ∂J/∂W is the derivative of J with respect to W and ∂J/∂b is the derivative of J with respect to b.
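A hedged PyTorch sketch of the subspace mapping and its gradient-based optimization. Autograd stands in for the hand-derived ∂J/∂W and ∂J/∂b, the layer sizes and learning rate are arbitrary, and only a contrastive-style reading of J_1 is implemented, consistent with the description above but not necessarily the patent's exact formula:

```python
import torch
import torch.nn as nn

def make_mapper(dims):
    """Stack of sigmoid layers realizing o_i = phi(W_q phi(... phi(W_1 x_i + b_1) ...) + b_q)."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.Sigmoid()]
    return nn.Sequential(*layers)

def contrastive_j1(o, labels, G, m=1.0):
    """Illustrative J1: pull same-label pairs together, push different-label pairs
    at least m apart, weighted by the Gaussian affinities G (assumed form)."""
    dist2 = torch.cdist(o, o) ** 2
    same = (labels[:, None] == labels[None, :]).float()
    return (G * (same * dist2 + (1 - same) * torch.clamp(m - dist2, min=0))).sum()

x = torch.randn(64, 2502)                  # segment statistics (placeholder)
labels = torch.randint(1, 4, (64,))        # values 1, 2, 3 for the three tendency classes
G = torch.exp(-torch.cdist(x, x) ** 2 / (2 * x.var()))   # assumed Gaussian affinity
mapper = make_mapper([2502, 512, 128, 32])
opt = torch.optim.SGD(mapper.parameters(), lr=1e-3)
for _ in range(100):                       # gradient descent on J (here the J1 part only)
    opt.zero_grad()
    loss = contrastive_j1(mapper(x), labels, G)
    loss.backward()
    opt.step()
```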
The determining module 4005 inputs the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result.
Further, it should be understood that, using the W_1, W_2, …, W_q and b_1, b_2, …, b_q obtained in the above steps, the feature selection result z of {x_1, x_2, ..., x_m} is calculated.
In addition, the W_1, W_2, …, W_q and b_1, b_2, …, b_q defined above constitute the feature target data in the present application.
Further, it should be understood that the preset Softmax classifier obtained during training is used to obtain the speech emotion classifications {l_1, l_2, ..., l_m} of {x_1, x_2, ..., x_m} separately; the emotion of the sentence is then obtained by voting over {l_1, l_2, ..., l_m}.
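A minimal sketch of the segment-level classification followed by sentence-level voting; the per-segment probabilities are placeholders, and the seven emotion names are taken from the evaluation paragraph below:

```python
from collections import Counter
import numpy as np

def sentence_emotion(segment_probs: np.ndarray, emotion_names: list) -> str:
    """Majority vote over per-segment Softmax decisions to get the sentence-level emotion."""
    segment_labels = segment_probs.argmax(axis=1)              # per-segment predicted class
    winner, _ = Counter(segment_labels.tolist()).most_common(1)[0]
    return emotion_names[winner]

# Example: 25 segments, 7 emotion classes (probabilities are placeholders).
probs = np.random.dirichlet(np.ones(7), size=25)
emotions = ["anger", "fear", "irritability", "disgust", "happiness", "neutrality", "sadness"]
print(sentence_emotion(probs, emotions))
```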
In addition, it should be noted that the feature target data is input into the preset Softmax classification model to obtain speech emotion category data, data statistics is performed on the speech emotion category data to obtain a speech emotion category data value, and a speech emotion recognition result is obtained according to the speech emotion category data value.
In addition, the step of obtaining the speech emotion recognition result according to the speech emotion category data value is to judge whether the speech emotion category data value belongs to a preset speech emotion category threshold range; if it does, the speech emotion recognition result is obtained according to the speech emotion category data value; if it does not, the process returns to the step of inputting the feature target data into the preset Softmax classification model to obtain speech emotion category data.
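A trivial sketch of this threshold check; the numeric range and the return format are hypothetical:

```python
def recognize_with_threshold(category_value: float, low: float, high: float) -> dict:
    """Accept the recognition result only when the category value lies in the preset range;
    otherwise signal the caller to re-run the Softmax classification step."""
    if low <= category_value <= high:
        return {"accepted": True, "value": category_value}
    return {"accepted": False, "value": None}   # caller loops back to classification

result = recognize_with_threshold(0.87, low=0.5, high=1.0)
print(result)
```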
In addition, the corpus used for the emotion recognition evaluation of the present invention is a standard database in the speech emotion recognition field. The training process is completed first and the recognition test is then performed, using 5-fold cross validation. Seven emotions (anger, fear, irritability, disgust, happiness, neutrality and sadness) can be identified. In the speaker-dependent case the average classification accuracy is 94.65%; apart from irritability and anger, which are relatively easy to confuse with each other, the other emotions are well separated. In the speaker-independent case the average classification accuracy is 89.30%.
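A minimal sketch of the 5-fold evaluation protocol only, with random placeholder data and scikit-learn's multinomial logistic regression standing in for the Softmax classifier; it does not reproduce the reported accuracies:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Placeholder sentence-level features and 7-class emotion labels.
X = np.random.randn(500, 417)
y = np.random.randint(0, 7, size=500)

accuracies = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000)   # stand-in Softmax (multinomial logistic) classifier
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(accuracies))                    # mean accuracy over the 5 folds
```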
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the technical solution as needed, and the present invention is not limited thereto.
In this embodiment, a test voice sample with a preset dimension is obtained and segmented through a preset rule to obtain a plurality of initial voice samples; signal feature data extraction is performed on the initial voice samples to obtain voice signal feature data to be processed; the voice signal feature data to be processed are screened to obtain label sample feature data, and feature statistics are performed on the label sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed; the training feature statistical result to be confirmed is divided by emotion category to obtain training feature data to be optimized corresponding to different emotion categories; target training feature data are obtained through a preset multi-objective optimization algorithm according to the training feature data to be optimized; and finally the feature target data are input into a preset Softmax classification model to obtain a speech emotion recognition result. In this way, the emotion relations between speech segments and between sentences and segments can be fully utilized to form data with tendencies, so that the human process of handling tendencies is simulated, the unbalanced information of the data is exploited, the data are compared with each other, segments with different emotions are separated under constraint conditions, the sample scale is increased and the sample diversity is improved.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the speech emotion recognition method provided in any embodiment of the present invention, and are not described herein again.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A speech emotion recognition method, characterized in that the method comprises:
acquiring a test voice sample with a preset dimension, and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples;
extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed;
performing feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed;
obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed;
inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result;
the method for obtaining the initial voice samples comprises the following steps of obtaining a test voice sample with a preset dimensionality, carrying out segmentation processing on the test voice sample through a preset rule, and obtaining a plurality of initial voice samples, wherein before the step of obtaining the initial voice samples, the method further comprises the following steps:
acquiring training voice samples with preset dimensions, and performing segmentation processing on the training voice samples through preset rules to obtain a plurality of initial training voice samples;
performing feature extraction on the initial training voice sample to obtain the features of a training voice signal to be processed;
performing feature statistics on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed;
carrying out emotion category division on the statistical result of the training features to be confirmed to obtain training feature data to be optimized corresponding to different emotion categories;
determining a peripheral data distribution characteristic set corresponding to each training characteristic data to be optimized;
determining data distribution differences corresponding to different emotion classes according to the surrounding data distribution feature set;
obtaining emotion fragment sets corresponding to different emotion types according to the data distribution difference;
determining a feature subspace corresponding to each emotion fragment set;
and establishing a preset Softmax classification model based on the plurality of feature subspaces.
2. The method of claim 1, wherein the step of inputting the feature target data into a preset Softmax classification model to obtain the speech emotion recognition result comprises:
inputting the characteristic target data into the preset Softmax classification model to obtain speech emotion category data;
performing data statistics on the voice emotion type data to obtain a voice emotion type data value;
and obtaining a voice emotion recognition result according to the voice emotion category data value.
3. The method of claim 2, wherein the step of obtaining speech emotion recognition results according to the speech emotion classification data values comprises:
judging whether the voice emotion type data value belongs to a preset voice emotion type threshold range or not;
and if the voice emotion type data value belongs to the preset voice emotion type threshold range, acquiring a voice emotion recognition result according to the voice emotion type data value.
4. The method of claim 3, wherein the step of determining whether the speech emotion classification data value falls within a preset speech emotion classification threshold range further comprises:
and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the feature target data into the preset Softmax classification model to obtain voice emotion category data.
5. The method according to claim 1, wherein the step of performing feature statistics on the feature data of the speech signal to be processed by using a preset statistical function to obtain a feature statistical result to be confirmed comprises:
screening the voice signal characteristic data to be processed to obtain label sample characteristic data;
and carrying out feature statistics on the tag sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed.
6. An apparatus for speech emotion recognition, the apparatus comprising:
the system comprises an acquisition module, a processing module and a control module, wherein the acquisition module is used for acquiring a test voice sample with a preset dimensionality and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples;
the extraction module is used for extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed;
the statistical module is used for carrying out feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed;
the calculation module is used for obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed;
the determining module is used for inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result;
the speech emotion recognition apparatus further includes: acquiring training voice samples with preset dimensions, and performing segmentation processing on the test voice samples through preset rules to obtain a plurality of initial training voice samples; performing feature extraction on the initial training voice sample to obtain the features of a training voice signal to be processed; performing feature statistics on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed; carrying out emotion category division on the statistical result of the training features to be confirmed to obtain training feature data to be optimized corresponding to different emotion categories; determining a peripheral data distribution characteristic set corresponding to each training characteristic data to be optimized; determining data distribution differences corresponding to different emotion classes according to the surrounding data distribution feature set; obtaining emotion fragment sets corresponding to different emotion types according to the data distribution difference; determining a feature subspace corresponding to each emotion fragment set; and establishing a preset Softmax classification model based on the plurality of feature subspaces.
7. An electronic device, characterized in that the device comprises: a memory, a processor and a speech emotion recognition program stored on the memory and executable on the processor, the speech emotion recognition program being configured to implement the steps of the speech emotion recognition method as claimed in any of claims 1 to 5.
8. A storage medium having stored thereon a speech emotion recognition program, which when executed by a processor implements the steps of the speech emotion recognition method as claimed in any one of claims 1 to 5.
CN201911246544.5A 2019-12-06 2019-12-06 Speech emotion recognition method, device, equipment and storage medium Expired - Fee Related CN110956981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911246544.5A CN110956981B (en) 2019-12-06 2019-12-06 Speech emotion recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911246544.5A CN110956981B (en) 2019-12-06 2019-12-06 Speech emotion recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110956981A CN110956981A (en) 2020-04-03
CN110956981B true CN110956981B (en) 2022-04-26

Family

ID=69980269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911246544.5A Expired - Fee Related CN110956981B (en) 2019-12-06 2019-12-06 Speech emotion recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110956981B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112466324A (en) * 2020-11-13 2021-03-09 上海听见信息科技有限公司 Emotion analysis method, system, equipment and readable storage medium
CN113326678B (en) * 2021-06-24 2024-08-06 深圳前海微众银行股份有限公司 Conference summary generation method and device, terminal equipment and computer storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663413A (en) * 2012-03-09 2012-09-12 中盾信安科技(江苏)有限公司 Multi-gesture and cross-age oriented face image authentication method
CN105488456A (en) * 2015-11-23 2016-04-13 中国科学院自动化研究所 Adaptive rejection threshold adjustment subspace learning based human face detection method
CN105913025A (en) * 2016-04-12 2016-08-31 湖北工业大学 Deep learning face identification method based on multiple-characteristic fusion

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100250477A1 (en) * 2009-03-31 2010-09-30 Shekhar Yadav Systems and methods for optimizing a campaign
CA2786380C (en) * 2010-01-18 2020-07-14 Elminda Ltd. Method and system for weighted analysis of neurophysiological data
CN104008754B (en) * 2014-05-21 2017-01-18 华南理工大学 Speech emotion recognition method based on semi-supervised feature selection
CN107977651B (en) * 2017-12-21 2019-12-24 西安交通大学 Common spatial mode spatial domain feature extraction method based on quantization minimum error entropy

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663413A (en) * 2012-03-09 2012-09-12 中盾信安科技(江苏)有限公司 Multi-gesture and cross-age oriented face image authentication method
CN105488456A (en) * 2015-11-23 2016-04-13 中国科学院自动化研究所 Adaptive rejection threshold adjustment subspace learning based human face detection method
CN105913025A (en) * 2016-04-12 2016-08-31 湖北工业大学 Deep learning face identification method based on multiple-characteristic fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
The Role of Principal Angles in Subspace Classification; Jiaji Huang; IEEE Transactions on Signal Processing; 20151117; full text *
Transferable Representation Learning with Deep Adaptation Networks; Mingsheng Long; IEEE Transactions on Pattern Analysis and Machine Intelligence; 20180905; pages 3071-3085 *
Research on Methods and Technologies of Human-Machine Emotional Interaction; Wang Guojiang; China Doctoral Dissertations Full-text Database; 20080415 (No. 4); full text *
Research on Image Classification Algorithms Oriented to Local Features and Feature Representation; Zhang Xu; China Doctoral Dissertations Full-text Database; 20170215 (No. 2); full text *

Also Published As

Publication number Publication date
CN110956981A (en) 2020-04-03

Similar Documents

Publication Publication Date Title
Dai et al. Learning discriminative features from spectrograms using center loss for speech emotion recognition
Bahari et al. Speaker age estimation and gender detection based on supervised non-negative matrix factorization
Lan et al. An extreme learning machine approach for speaker recognition
CN111696557A (en) Method, device and equipment for calibrating voice recognition result and storage medium
CN108346436A (en) Speech emotional detection method, device, computer equipment and storage medium
Parthasarathy et al. Convolutional neural network techniques for speech emotion recognition
Wang et al. Discriminative neural embedding learning for short-duration text-independent speaker verification
CN111916111A (en) Intelligent voice outbound method and device with emotion, server and storage medium
JP6787770B2 (en) Language mnemonic and language dialogue system
CN107767881B (en) Method and device for acquiring satisfaction degree of voice information
CN110956981B (en) Speech emotion recognition method, device, equipment and storage medium
CN104538036A (en) Speaker recognition method based on semantic cell mixing model
Bahari Speaker age estimation using Hidden Markov Model weight supervectors
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN109377984B (en) ArcFace-based voice recognition method and device
Mande et al. EMOTION DETECTION USING AUDIO DATA SAMPLES.
Akbal et al. Development of novel automated language classification model using pyramid pattern technique with speech signals
Birla A robust unsupervised pattern discovery and clustering of speech signals
JP6996627B2 (en) Information processing equipment, control methods, and programs
Elbarougy Speech emotion recognition based on voiced emotion unit
CN111145787B (en) Voice emotion feature fusion method and system based on main and auxiliary networks
Hanilçi et al. Investigation of the effect of data duration and speaker gender on text-independent speaker recognition
CN115315746A (en) Speaker recognition method, recognition device, recognition program, gender recognition model generation method, and speaker recognition model generation method
Michalevsky et al. Speaker identification using diffusion maps

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220426