CN110619893A - Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal
- Publication number: CN110619893A (application CN201910823584.5A)
- Authority: CN (China)
- Prior art keywords: corpus, audio information, feature sequence, emotion, audio
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters
- G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique
- G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for estimating an emotional state
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- General Health & Medical Sciences (AREA)
- Child & Adolescent Psychology (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The method preprocesses the audio data of a monitored target individual through pre-emphasis, windowing, framing and similar operations, builds a deep belief network to fully mine the time-frequency features of the data, and then matches the extracted time-frequency features against predefined corpus time-frequency features in a speech emotion corpus using dynamic time warping and an ant colony algorithm, so as to determine the emotion distribution corresponding to the monitored target individual. The process is simple and fast, does not depend on empirical values, and maintains a high emotional-state recognition rate at low signal-to-noise ratios; extracting the time-frequency features of the audio with a deep belief network enables time-sequence analysis of the speech and more accurate recognition of emotional states. Meanwhile, combining dynamic time warping with an ant colony algorithm for feature matching achieves optimal matching both locally and globally and greatly increases the efficiency of emotional-state recognition.
Description
Technical Field
The application relates to the fields of psychology and information processing, in particular to a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals.
Background
Emotion, a general term for a series of subjective cognitive experiences, is a psychological and physiological state produced by the integration of multiple feelings, thoughts and behaviors. The most common emotions are happiness, anger, grief, surprise, fear and love, along with subtler ones such as jealousy, shame and embarrassment. Emotion often interacts with factors such as mood, character, temperament and purpose, and is also affected by hormones and neurotransmitters. Both positive and negative emotions motivate people to act. Although some emotion-driven behavior appears to occur without thought, consciousness is in fact one of the important links in producing emotion. It can therefore be seen that attending to an individual's emotional characteristics can play a very important role in emotional guidance and public safety.
At present, some technical solutions analyze an individual's emotional characteristics from the individual's audio information. Audio information can reflect emotional characteristics to some extent; for example, an increase in sound intensity may indicate that the individual is angry.
Existing schemes for analyzing emotional characteristics from audio information mainly use Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients and the like. These schemes mostly rely on handcrafted features, so the process is complicated and depends on experience, and the emotion recognition rate usually drops sharply at low signal-to-noise ratios. Further, conventional emotion recognition methods include hidden Markov models (HMM), Gaussian mixture models, support vector machines (SVM), artificial neural networks and k-nearest neighbors. These methods mostly consider only the global features of the speech signal and classify on that basis; they do not consider the continuous character of the speech signal, cannot analyze its time sequence, and thus fail to capture how emotion changes along the time dimension, making the analysis incomplete.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals, overcoming the prior art's shortcomings in emotion determination: a complicated process, dependence on experience, incomplete analysis of the voice signal, poor anti-interference capability and low efficiency.
(II) technical scheme
In order to achieve the above purpose, the present application is implemented by the following technical solutions:
the application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals, which comprises the following steps:
acquiring audio information of a target individual when reading a given corpus;
performing emphasis processing on the high-frequency part of the audio information with a first-order pre-emphasis digital filter, and dividing the emphasized audio information into a plurality of pieces of sub-audio information so as to smooth the waveform of the audio information; determining the endpoints of each piece of sub-audio information based on the calculated energy value of each sub-audio;
for each piece of sub-audio information, taking the audio information between two endpoints of the sub-audio information as target audio information corresponding to the sub-audio information;
extracting audio features from all target audio information by using a deep belief network comprising a plurality of restricted Boltzmann machines (RBMs) to obtain an audio feature sequence;
acquiring all corpus feature sequences in a speech emotion corpus and emotion labels corresponding to the corpus feature sequences;
for each corpus feature sequence among all the corpus feature sequences, aligning the corpus feature sequence in time with the audio feature sequence to be compared to obtain a processed corpus feature sequence; projecting the corpus feature sequence and the audio feature sequence onto the same plane, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence whose Euclidean distance to the audio feature sequence is smallest;
based on the emotion label corresponding to the corpus characteristic sequence obtained by screening, determining the emotion category of the target individual;
and sending the determined emotion category of the target individual to the client and the display terminal for display.
In one possible embodiment, the method further comprises the step of training the deep belief network:
taking the first sample audio information X as the visible vector and the first hidden layer h1 as the hidden vector of a first restricted Boltzmann machine (RBM), and training to obtain the parameters of this RBM;
fixing the parameters of the first RBM, taking h1 as the visible vector and the second hidden layer h2 as the hidden vector, and training to obtain the parameters of the second RBM, and so on until the parameters of all RBMs are obtained through training;
and fine-tuning each parameter in the deep belief network by using a back propagation algorithm to obtain the trained deep belief network.
In one possible implementation, the parameters of the RBM include a weight connecting the visible vector and the hidden vector, an offset of each node in the visible vector, and an offset of each node in the hidden vector.
In a possible implementation, the method further includes the step of forming the speech emotion corpus:
acquiring a plurality of second sample audio information and emotion labels corresponding to the second sample audio information;
processing each second sample audio information by using the trained deep belief network to obtain a corpus feature sequence corresponding to each second sample audio information;
establishing a corresponding relation between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information;
and storing the established corresponding relations into the speech emotion corpus.
In a possible implementation manner, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence, includes:
calculating, with an ant colony algorithm, the Euclidean distance between each processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence.
In a possible implementation manner, the determining the emotion classification of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by the screening includes:
and taking the emotion category corresponding to the emotion label of the corpus characteristic sequence obtained by screening as the emotion category of the target individual.
In one possible embodiment, the obtaining audio information of the target individual when reading the given corpus includes:
acquiring, with a microphone, the audio information of the target individual when reading the given corpus.
(III) advantageous effects
The application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals. The method has the following beneficial effects:
the method comprises the steps of firstly carrying out pre-emphasis, windowing, framing and other processing on the acquired audio information of a target individual, then extracting an audio characteristic sequence by using a deep belief network, and then matching the extracted audio characteristic sequence with a corpus characteristic sequence in a speech emotion corpus by using dynamic time programming and an ant colony algorithm so as to determine an emotion label corresponding to the target individual. The technical scheme has simple and direct process, does not need to depend on empirical values, has high emotion recognition rate when the signal-to-noise ratio is low, extracts the audio characteristic sequence by utilizing the deep belief network, can obtain the continuous characteristics of the voice signal, realizes the time sequence analysis of the voice, greatly reduces the characteristic extraction time, and has stronger anti-interference performance and better emotion recognition effect on performance. Meanwhile, the technical scheme combines dynamic time planning and ant colony algorithm to carry out feature matching, realizes optimal matching on local and global aspects, and enables the matching of voice signals to be more comprehensive and efficient.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart schematically illustrating a time-frequency feature extraction and artificial intelligence emotion monitoring method for a speech signal according to an embodiment of the present application;
fig. 2 schematically shows a flowchart for training a deep belief network in another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals. The method combines dynamic time warping with an ant colony algorithm for feature matching, achieving optimal matching both locally and globally and making the matching of voice signals more comprehensive and efficient.
Specifically, as shown in fig. 1, the method for extracting time-frequency features of voice signals and monitoring artificial intelligence emotion includes the following steps:
s110, obtaining audio information of the target individual when reading the given language material.
Here, the audio information of the target individual may be acquired with a microphone. Specifically, a news report or article passage can be given for the target individual to read aloud while the corresponding audio is recorded.
S120, performing emphasis processing on the high-frequency part of the audio information with a first-order pre-emphasis digital filter, and dividing the emphasized audio information into a plurality of pieces of sub-audio information to smooth the waveform of the audio information; and determining the endpoints of each piece of sub-audio information based on the calculated energy value of each sub-audio.
Here, the audio information is passed through a first-order pre-emphasis digital filter to emphasize its high-frequency part and improve the resolution of the speech. A rectangular window function may then be applied to divide the audio into a number of quasi-stationary short-time frames, smoothing the waveform; endpoint detection is performed by calculating the short-time energy of each frame, so that noise can be removed from the extracted audio information.
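The preprocessing chain above (first-order pre-emphasis, rectangular-window framing, short-time-energy endpoint detection) can be sketched as follows. The filter coefficient 0.97, the frame and hop sizes, and the energy-ratio threshold are illustrative values not specified in the patent:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97, energy_ratio=0.1):
    """Pre-emphasize, frame with a rectangular window, and flag voiced frames
    by short-time energy (a sketch of step S120; all thresholds illustrative)."""
    # First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into fixed-length frames (a rectangular window is plain slicing)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Short-time energy per frame, used for endpoint detection
    energy = np.sum(frames ** 2, axis=1)
    voiced = energy > energy_ratio * energy.max()
    return frames, energy, voiced

# Example: a burst of "speech" surrounded by near-silence
rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(1600),
                      rng.standard_normal(3200),
                      0.01 * rng.standard_normal(1600)])
frames, energy, voiced = preprocess(sig)
```

The `voiced` mask marks frames whose energy exceeds the threshold; the first and last voiced frames would serve as the endpoints of a sub-audio segment.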
And S130, regarding each piece of sub-audio information, taking the audio information between the two endpoints of the sub-audio information as the target audio information corresponding to the sub-audio information.
S140, extracting audio features from all target audio information by using a deep belief network comprising a plurality of restricted Boltzmann machines (RBMs) to obtain an audio feature sequence.
The deep belief network (DBN) here is pre-trained. A DBN is a network structure stacked layer by layer from multiple restricted Boltzmann machines (RBMs).
S150, acquiring all corpus feature sequences in the speech emotion corpus and emotion labels corresponding to the corpus feature sequences.
Before this step is executed, the method further comprises the step of forming the speech emotion corpus:
step one, obtaining a plurality of second sample audio information and emotion labels corresponding to the second sample audio information.
And step two, processing each second sample audio information by using the trained deep belief network to obtain a corpus feature sequence corresponding to each second sample audio information.
Here, the corpus feature sequence is determined using the following formula:
y = f(x; θ_DBN)
where y denotes the corpus feature sequence, x denotes the second sample audio information, and θ_DBN denotes the trained parameters of the deep belief network.
And step three, establishing a corresponding relation between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information.
And step four, storing the corpus feature sequence corresponding to each second sample audio information and the corresponding relation established by the emotion label corresponding to each second sample audio information into a speech emotion corpus.
In the above steps, the second sample audio information uses recordings from corpora published at home and abroad as input data, and the corpus feature sequence A = {a_1, a_2, ..., a_n} is obtained; the extracted corpus feature sequence, together with its corresponding emotion label P, is stored into the speech emotion corpus as a reference template.
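The resulting reference-template store can be sketched minimally as below; the `EmotionCorpus` class name, the example feature values, and the labels are hypothetical, and real templates would hold DBN-extracted feature sequences:

```python
from dataclasses import dataclass, field

@dataclass
class EmotionCorpus:
    """Stores (corpus feature sequence, emotion label) reference templates,
    mirroring steps one to four above. Names and values are illustrative."""
    templates: list = field(default_factory=list)

    def add(self, feature_seq, label):
        # Each template pairs a feature sequence A with its emotion label P
        self.templates.append((list(feature_seq), label))

    def all(self):
        return list(self.templates)

corpus = EmotionCorpus()
corpus.add([0.1, 0.5, 0.9], "happy")
corpus.add([0.9, 0.2, 0.1], "angry")
```

New templates extracted from monitored audio could later be appended the same way, which is how the corpus diversity is grown.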
S160, for each corpus feature sequence among all corpus feature sequences, aligning the corpus feature sequence in time with the audio feature sequence to be compared to obtain a processed corpus feature sequence; and projecting the corpus feature sequence and the audio feature sequence onto the same plane, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence whose Euclidean distance to the audio feature sequence is smallest.
Here, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence includes:
calculating, with an ant colony algorithm, the Euclidean distance between each processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence.
In this step, the DBN extracts the audio feature sequence B = {b_1, b_2, ..., b_n} from the target audio information. After the features to be compared are time-normalized with dynamic time warping (DTW), the audio feature sequence B_i is similarity-matched against each corpus feature sequence A_i. However, because DTW is a locally optimal algorithm, it cannot reach a global optimum and the recognition result may be inaccurate, so DTW is improved by introducing the ant colony algorithm. In the improved algorithm, the search from the initial point proceeds randomly according to state-transition probabilities; once every ant in the colony has finished searching, the pheromone is updated. From a global perspective, the speech matching algorithm introduces a cost function D_{s,e}^k(t), the average distance of the k-th ant over the whole path from the starting point s to the endpoint e, which is used to measure the difference between the two feature sequences. As the loop continues, the optimal distance for matching the audio features is obtained.
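The baseline alignment step can be illustrated with textbook dynamic time warping. This sketch shows only the classic DTW recurrence that the patent starts from, not the ant-colony improvement, and uses 1-D features with an absolute-difference local cost for brevity:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two 1-D feature sequences.
    D[i, j] holds the cheapest cumulative cost of aligning a[:i] with b[:j]."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])       # local distance
            D[i, j] = cost + min(D[i - 1, j],     # insertion
                                 D[i, j - 1],     # deletion
                                 D[i - 1, j - 1]) # match
    return D[n, m]

# Time-shifted copies of the same contour align perfectly under DTW
x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 0, 1, 2, 3, 2, 1, 0]
print(dtw_distance(x, y))  # → 0.0
```

This local-alignment property is what makes DTW suitable for time normalization; the ant-colony layer described above then searches for a globally better path than the greedy per-cell minimum.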
S170, determining the emotion category of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening; and sending the determined emotion category of the target individual to the client and the display terminal for display.
Here, the determining the emotion classification of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening includes:
and taking the emotion category corresponding to the emotion label of the corpus characteristic sequence obtained by screening as the emotion category of the target individual.
The above method may further comprise: supplementing the collected audio feature sequence into the existing speech emotion corpus as a new corpus feature sequence, so as to increase the diversity of the speech emotion corpus data.
Before step S140 is executed, the method may further include a step of training the deep belief network, as shown in fig. 2:
s210, taking the first sample audio information X and the first hidden layer h1 as a first limiting Boltzmann machine RBM, and training to obtain parameters of the RBM.
Here, the parameters of the RBM include a weight connecting the visible vector and the hidden vector, an offset of each node in the visible vector, and an offset of each node in the hidden vector.
S220, fixing the parameters of the first RBM, taking h1 as the visible vector and the second hidden layer h2 as the hidden vector, and training to obtain the parameters of the second RBM, and so on until the parameters of all RBMs are obtained through training.
And S230, fine-tuning each parameter in the deep belief network by using a back propagation algorithm to obtain the trained deep belief network.
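Steps S210 and S220 can be sketched as greedy layer-wise pretraining. This uses single-step contrastive divergence (CD-1), a common way to train RBMs that the patent does not name explicitly; all hyperparameters and shapes are illustrative, and the back-propagation fine-tuning of S230 is omitted:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, lr=0.1, epochs=50):
    """One RBM trained with CD-1. The parameters match the patent's list:
    weights W connecting visible and hidden vectors, visible offsets b,
    hidden offsets c."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W + c)                              # up pass
        h_sample = (h0 > rng.random(h0.shape)).astype(float)  # sampled hidden states
        v1 = sigmoid(h_sample @ W.T + b)                      # reconstruction
        h1 = sigmoid(v1 @ W + c)                              # second up pass
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)         # CD-1 update
        b += lr * (v0 - v1).mean(axis=0)
        c += lr * (h0 - h1).mean(axis=0)
    return W, b, c

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise stacking (S210/S220): each trained layer's hidden
    activations become the next RBM's visible data."""
    params, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden)
        params.append((W, b, c))
        layer_input = sigmoid(layer_input @ W + c)
    return params

X = rng.random((32, 20))           # stand-in for first-sample audio features
params = pretrain_dbn(X, [16, 8])  # two stacked RBMs
```

After this unsupervised pretraining, back-propagation (S230) would fine-tune all layers jointly against the emotion labels.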
In the above embodiment, a microphone collects in real time the audio information of the target individual while reading the given corpus; the audio information is preprocessed by pre-emphasis, windowing, framing, endpoint detection and the like to smooth the waveform and remove noise; after the emotional features (i.e. the audio feature sequence) in the audio information are extracted by a deep belief network (DBN) model, the extracted features are similarity-matched against the features in the speech emotion corpus based on dynamic time warping (DTW) and an ant colony algorithm, thereby recognizing the emotional state carried by the audio information. The scheme can effectively quantize and monitor the audio information in real time.
Compared with traditional MFCC features, the DBN, as a multilayer network model based on unsupervised training, is superior to the traditional approach in both the training time and the generalization ability of the corpus feature sequence; it can greatly reduce feature extraction time and yields stronger anti-interference performance and better speech emotion recognition.
In the embodiment, the similarities of different audio feature sequences are compared, and similarity matching of audio feature sequences serves as the method for recognizing emotional features. The traditional DTW algorithm minimizes the sum of weighted distances by a local optimization method; by combining the ant colony algorithm with DTW, the similarity between voice signals is measured with an average distance instead of the weighted distance, effectively uniting time normalization with distance measurement. When different audio feature sequences are recognized from continuous audio information, the recognition rate of the ant colony dynamic time warping algorithm is superior to that of plain DTW, and its superiority shows especially under complex environmental conditions.
The method for time-frequency feature extraction and artificial intelligence emotion monitoring of voice signals preprocesses the audio data of the monitored target individual through pre-emphasis, windowing, framing and similar operations, builds a deep belief network to fully mine the time-frequency features of the data, and then matches the extracted time-frequency features against predefined corpus time-frequency features in the speech emotion corpus using dynamic time warping and an ant colony algorithm, so as to determine the emotion distribution corresponding to the monitored target individual. The process is simple and fast, does not depend on empirical values, and maintains a high emotional-state recognition rate at low signal-to-noise ratios; extracting the time-frequency features of the audio with a deep belief network enables time-sequence analysis of the speech and more accurate recognition of emotional states. Meanwhile, combining dynamic time warping with an ant colony algorithm for feature matching achieves optimal matching both locally and globally and greatly increases the efficiency of emotional-state recognition.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (7)
1. A time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals is characterized by comprising the following steps:
acquiring audio information of a target individual when reading a given corpus;
performing emphasis processing on the high-frequency part of the audio information with a first-order pre-emphasis digital filter, and dividing the emphasized audio information into a plurality of pieces of sub-audio information so as to smooth the waveform of the audio information;
determining an endpoint of each sub-audio information based on the calculated energy value of each sub-audio;
for each piece of sub-audio information, taking the audio information between two endpoints of the sub-audio information as target audio information corresponding to the sub-audio information;
extracting audio features from all the target audio information by using a deep belief network comprising a plurality of restricted Boltzmann machines (RBMs) to obtain an audio feature sequence;
acquiring all corpus feature sequences in a speech emotion corpus and emotion labels corresponding to the corpus feature sequences;
for each corpus feature sequence among all the corpus feature sequences, aligning that corpus feature sequence with the audio feature sequence to be compared, according to the given corpus, to obtain a processed corpus feature sequence; projecting the processed corpus feature sequence and the audio feature sequence into the same space, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence;
determining the emotion category of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening;
and sending the determined emotion category of the target individual to a client and a display terminal for display.
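The preprocessing pipeline of claim 1 (first-order pre-emphasis, framing into pieces of sub-audio information, and energy-based endpoint detection) can be sketched as follows. The filter coefficient 0.97, frame length, hop size, and energy threshold are illustrative assumptions; the claim does not fix these values:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # first-order pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting the high-frequency part
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len=256, hop=128):
    # divide the signal into overlapping sub-audio frames
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def energy_endpoints(frames, ratio=0.1):
    # a frame counts as speech when its short-time energy exceeds a fraction
    # of the maximum frame energy; return the first and last such frame
    energy = (frames ** 2).sum(axis=1)
    active = np.flatnonzero(energy > ratio * energy.max())
    return int(active[0]), int(active[-1])

# toy signal: silence, then a tone burst, then silence
signal = np.concatenate([np.zeros(512), np.sin(np.linspace(0, 100, 1024)), np.zeros(512)])
frames = frame_signal(pre_emphasis(signal))
start, end = energy_endpoints(frames)  # frame indices bracketing the burst
```

The audio between the `start` and `end` frames would then serve as the target audio information passed to feature extraction.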
2. The method of claim 1, further comprising the step of training a deep belief network:
taking first sample audio information X as the visible vector and a first hidden layer h1 as the hidden vector of a first restricted Boltzmann machine (RBM), and training to obtain the parameters of that RBM;
fixing the parameters of the first RBM, taking h1 as the visible vector and a second hidden layer h2 as the hidden vector, and training to obtain the parameters of a second RBM, and so on until the parameters of all RBMs are obtained through training;
and fine-tuning each parameter of the deep belief network using a back-propagation algorithm to obtain the trained deep belief network.
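A minimal sketch of this greedy layer-wise pretraining, using one step of contrastive divergence (CD-1) per update. The layer sizes, learning rate, and iteration count are illustrative assumptions, and the final back-propagation fine-tuning step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RBM:
    """Restricted Boltzmann machine trained with CD-1."""
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0, 0.01, (n_visible, n_hidden))  # weights linking visible and hidden vectors
        self.b = np.zeros(n_visible)  # bias of each node in the visible vector
        self.c = np.zeros(n_hidden)   # bias of each node in the hidden vector
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def cd1_step(self, v0):
        h0 = self.hidden_probs(v0)
        h_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h_sample @ self.W.T + self.b)  # one-step reconstruction
        h1 = self.hidden_probs(v1)
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (h0 - h1).mean(axis=0)

# greedy layer-wise training: train the first RBM on X, freeze it,
# then treat its hidden activations h1 as the visible vector of the second RBM
X = rng.random((64, 20))
rbm1, rbm2 = RBM(20, 12), RBM(12, 6)
for _ in range(50):
    rbm1.cd1_step(X)
h1 = rbm1.hidden_probs(X)
for _ in range(50):
    rbm2.cd1_step(h1)
features = rbm2.hidden_probs(h1)  # top-layer activations as the feature sequence
```

In the claimed method, all RBM parameters obtained this way would then be jointly fine-tuned with back-propagation.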
3. The method of claim 2, wherein the parameters of an RBM include the weights connecting the visible vector and the hidden vector, the bias of each node in the visible vector, and the bias of each node in the hidden vector.
4. The method of claim 2, further comprising the step of forming the speech emotion corpus:
acquiring a plurality of second sample audio information and emotion labels corresponding to the second sample audio information;
processing each second sample audio information by using the trained deep belief network to obtain a corpus feature sequence corresponding to each second sample audio information;
establishing a corresponding relation between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information;
and storing, in a speech emotion corpus, the corpus feature sequence corresponding to each piece of second sample audio information together with the correspondence established with its emotion label.
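The corpus construction in claim 4 amounts to mapping each extracted feature sequence to its emotion label. In this sketch, `dummy_extract` is a hypothetical stand-in for the trained deep belief network of claim 2:

```python
import numpy as np

def build_emotion_corpus(samples, extract_features):
    # samples: iterable of (audio, emotion_label) pairs;
    # extract_features stands in for the trained deep belief network
    return [(extract_features(audio), label) for audio, label in samples]

# hypothetical stand-in extractor: two summary statistics per clip
def dummy_extract(audio):
    return np.array([audio.mean(), audio.std()])

rng = np.random.default_rng(1)
samples = [(np.ones(100), "calm"), (rng.normal(size=100), "agitated")]
corpus = build_emotion_corpus(samples, dummy_extract)
```

The resulting `(feature_sequence, label)` pairs are what the screening step of claim 5 searches over.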
5. The method according to claim 1, wherein calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence and screening the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence comprises:
calculating the Euclidean distance between each processed corpus feature sequence and the audio feature sequence using an ant colony algorithm, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence.
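The screening in claim 5 selects the stored corpus feature sequence closest to the query in Euclidean distance. The claim performs this search with an ant colony algorithm; the exhaustive argmin below is a simplified stand-in, and the three-dimensional feature vectors and labels are illustrative:

```python
import numpy as np

def screen_nearest(corpus, query):
    # corpus: list of (feature_sequence, emotion_label) pairs;
    # query: a feature sequence of the same length.
    # Return the pair with the smallest Euclidean distance to the query.
    distances = [np.linalg.norm(seq - query) for seq, _ in corpus]
    return corpus[int(np.argmin(distances))]

corpus = [
    (np.array([0.0, 0.0, 1.0]), "neutral"),
    (np.array([0.9, 0.1, 0.2]), "happy"),
    (np.array([0.1, 0.9, 0.8]), "sad"),
]
seq, label = screen_nearest(corpus, np.array([1.0, 0.0, 0.3]))
# per claim 6, the screened label becomes the target individual's emotion category
```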
6. The method according to claim 1, wherein the determining the emotion classification of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening comprises:
taking the emotion category corresponding to the emotion label of the corpus feature sequence obtained by screening as the emotion category of the target individual.
7. The method of claim 1, wherein obtaining audio information of a target individual when reading a given corpus comprises:
acquiring, with a microphone, the audio information of the target individual when reading the given corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910823584.5A CN110619893A (en) | 2019-09-02 | 2019-09-02 | Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910823584.5A CN110619893A (en) | 2019-09-02 | 2019-09-02 | Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110619893A true CN110619893A (en) | 2019-12-27 |
Family
ID=68922188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910823584.5A Pending CN110619893A (en) | 2019-09-02 | 2019-09-02 | Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110619893A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113593532A (en) * | 2021-08-31 | 2021-11-02 | 竹间智能科技(上海)有限公司 | Speech emotion recognition model training method and electronic equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106297825A (en) * | 2016-07-25 | 2017-01-04 | 华南理工大学 | A speech emotion recognition method based on an integrated deep belief network
CN106790949A (en) * | 2015-11-20 | 2017-05-31 | 北京奇虎科技有限公司 | Method and device for configuring a voice feature library for malicious calls
CN106790950A (en) * | 2015-11-20 | 2017-05-31 | 北京奇虎科技有限公司 | Malicious call recognition method and device
CN109409496A (en) * | 2018-11-14 | 2019-03-01 | 重庆邮电大学 | An LDTW sequence similarity measurement method improved with the ant colony algorithm
CN109785863A (en) * | 2019-02-28 | 2019-05-21 | 中国传媒大学 | A deep belief network speech emotion recognition method and system
CN109841229A (en) * | 2019-02-24 | 2019-06-04 | 复旦大学 | A neonate cry recognition method based on dynamic time warping
CN109903781A (en) * | 2019-04-14 | 2019-06-18 | 湖南检信智能科技有限公司 | A pattern-matching sentiment analysis method
CN110084579A (en) * | 2018-01-26 | 2019-08-02 | 百度在线网络技术(北京)有限公司 | Resource processing method, device and system
Non-Patent Citations (1)
Title |
---|
HUANG, Tao: "Research on the Application of the Ant Colony Algorithm in Speech Recognition" (蚁群算法在语音识别中的应用研究), Journal of Wuhan University of Technology (Information & Management Engineering Edition) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Shahin et al. | Emotion recognition using hybrid Gaussian mixture model and deep neural network | |
Harb et al. | Voice-based gender identification in multimedia applications | |
Tong et al. | A comparative study of robustness of deep learning approaches for VAD | |
US10535000B2 (en) | System and method for speaker change detection | |
CN105895078A (en) | Speech recognition method used for dynamically selecting speech model and device | |
CN108831506B (en) | GMM-BIC-based digital audio tamper point detection method and system | |
Praksah et al. | Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier | |
Woubie et al. | Voice-quality Features for Deep Neural Network Based Speaker Verification Systems | |
CN110619893A (en) | Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal | |
Ge et al. | Speaker change detection using features through a neural network speaker classifier | |
Zhou et al. | Speech Emotion Recognition with Discriminative Feature Learning. | |
Khoury et al. | I-Vectors for speech activity detection. | |
Biagetti et al. | Robust speaker identification in a meeting with short audio segments | |
Jamil et al. | Influences of age in emotion recognition of spontaneous speech: A case of an under-resourced language | |
Zewoudie et al. | Short-and long-term speech features for hybrid hmm-i-vector based speaker diarization system | |
Ahsan | Physical features based speech emotion recognition using predictive classification | |
Khanum et al. | Speech based gender identification using feed forward neural networks | |
Xu et al. | Voiceprint recognition of Parkinson patients based on deep learning | |
Kanrar | Robust threshold selection for environment specific voice in speaker recognition | |
Odriozola et al. | An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods | |
Lakra et al. | Automated pitch-based gender recognition using an adaptive neuro-fuzzy inference system | |
Heittola | Computational Audio Content Analysis in Everyday Environments | |
Keyvanrad et al. | Feature selection and dimension reduction for automatic gender identification | |
Prabha et al. | Advanced Gender Recognition System Using Speech Signal | |
CN113870901B (en) | SVM-KNN-based voice emotion recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20191227 |