CN110619893A - Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal


Info

Publication number
CN110619893A
Authority
CN
China
Prior art keywords
corpus
audio information
feature sequence
emotion
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910823584.5A
Other languages
Chinese (zh)
Inventor
丁帅
李莹辉
孙晓
卢亮
杨善林
尤田
余文颖
张园园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Hefei Polytechnic University
China Astronaut Research and Training Center
Original Assignee
Hefei Polytechnic University
China Astronaut Research and Training Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Polytechnic University, China Astronaut Research and Training Center filed Critical Hefei Polytechnic University
Priority to CN201910823584.5A priority Critical patent/CN110619893A/en
Publication of CN110619893A publication Critical patent/CN110619893A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The method preprocesses the audio data of a monitored target individual through pre-emphasis, windowing, framing and similar operations, builds a deep belief network to fully mine the time-frequency features of the data, and then matches the extracted time-frequency features against the predefined corpus time-frequency features in a speech emotion corpus using dynamic time warping and an ant colony algorithm, so as to determine the emotion distribution of the monitored target individual. The scheme is simple and fast, does not depend on empirical values, and maintains a high emotion recognition rate at low signal-to-noise ratios; extracting the time-frequency features of the audio with a deep belief network enables time-sequence analysis of the speech and more accurate recognition of emotional states. Moreover, combining dynamic time warping with the ant colony algorithm for feature matching achieves optimal matching at both the local and the global level and greatly increases the efficiency of emotional-state recognition.

Description

Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal
Technical Field
The application relates to the fields of psychology and information processing, and in particular to a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals.
Background
Emotion, a general term for a series of subjective cognitive experiences, is a psychological and physiological state arising from the integration of multiple sensations, thoughts and behaviors. The most common emotions are happiness, anger, grief, surprise, fear and love, alongside subtler ones such as jealousy, envy, shame and self-abasement. Emotion often interacts with factors such as mood, character, temperament and purpose, and is also affected by hormones and neurotransmitters. Both positive and negative emotions motivate people to act. Although some emotion-driven behaviors appear to occur without thought, consciousness is in practice one of the important links in generating emotion. It follows that attention to an individual's emotional characteristics can play a very important role in emotional guidance and personal safety.
At present, some technical solutions analyze the emotional characteristics of an individual from the individual's audio information. Audio information can reflect an individual's emotional characteristics to some extent; for example, an increase in sound intensity may indicate that the individual is angry.
Existing technical solutions that analyze emotional characteristics from audio information mainly rely on Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients and the like. These schemes mostly use hand-designed features, the process is cumbersome and experience-dependent, and the emotion recognition rate usually drops sharply at a low signal-to-noise ratio. Furthermore, conventional emotion recognition methods include hidden Markov models (HMM), Gaussian mixture models, support vector machines (SVM), artificial neural networks and k-nearest neighbors. These methods mostly consider the global features of a speech signal and classify and recognize the signal accordingly, but they do not consider the continuous nature of the speech signal, cannot analyze its time sequence, and therefore fail to capture the change of emotion along the time dimension, so the analysis is not comprehensive.
Disclosure of Invention
Technical problem to be solved
In view of the defects of the prior art, the application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals, which overcomes the shortcomings of existing emotion determination methods: a cumbersome process, dependence on experience, incomplete analysis of the voice signal, poor anti-interference capability and low efficiency.
(II) technical scheme
In order to achieve the above purpose, the present application is implemented by the following technical solutions:
the application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals, which comprises the following steps:
acquiring audio information of a target individual when reading a given corpus;
performing emphasis processing on the high-frequency part of the audio information by using a first-order pre-emphasis digital filter, and dividing the emphasized audio information into a plurality of pieces of sub-audio information so as to smooth the waveform of the audio information; determining the endpoints of each piece of sub-audio information based on the calculated energy value of each sub-audio;
for each piece of sub-audio information, taking the audio information between two endpoints of the sub-audio information as target audio information corresponding to the sub-audio information;
extracting audio features from all target audio information by using a deep belief network comprising a plurality of restricted Boltzmann machines (RBMs) to obtain an audio feature sequence;
acquiring all corpus feature sequences in a speech emotion corpus and emotion labels corresponding to the corpus feature sequences;
for each corpus feature sequence among all the corpus feature sequences, aligning the target feature sequence with the audio feature sequence to be compared according to the expected information to obtain a processed corpus feature sequence; projecting the corpus feature sequence and the audio feature sequence onto the same plane, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence;
based on the emotion label corresponding to the corpus characteristic sequence obtained by screening, determining the emotion category of the target individual;
and sending the determined emotion category of the target individual to the client and the display terminal for display.
In one possible embodiment, the method further comprises the step of training the deep belief network:
taking the first sample audio information X and the first hidden layer h1 as a first restricted Boltzmann machine (RBM), and training to obtain the parameters of this RBM;
fixing the parameters of the first RBM, taking h1 as the visible vector and the second hidden layer h2 as the hidden vector, and training to obtain the parameters of the second RBM, and so on until the parameters of all RBMs are obtained through training;
and fine-tuning each parameter of the deep belief network by using a back propagation algorithm to obtain the trained deep belief network.
In one possible implementation, the parameters of the RBM include the weights connecting the visible vector and the hidden vector, the bias of each node in the visible vector, and the bias of each node in the hidden vector.
In a possible implementation, the method further includes the step of forming the speech emotion corpus:
acquiring a plurality of second sample audio information and emotion labels corresponding to the second sample audio information;
processing each second sample audio information by using the trained deep belief network to obtain a corpus feature sequence corresponding to each second sample audio information;
establishing a corresponding relation between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information;
and storing the established correspondence between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information into the speech emotion corpus.
In a possible implementation manner, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence and screening the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence includes:
calculating the Euclidean distance between each processed corpus feature sequence and the audio feature sequence by using an ant colony algorithm, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence.
In a possible implementation manner, determining the emotion category of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening includes:
taking the emotion category corresponding to the emotion label of the screened corpus feature sequence as the emotion category of the target individual.
In one possible embodiment, obtaining the audio information of the target individual when reading the given corpus includes:
acquiring, by using a microphone, the audio information of the target individual while reading the given corpus.
(III) advantageous effects
The application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals. The method has the following beneficial effects:
the method comprises the steps of firstly carrying out pre-emphasis, windowing, framing and other processing on the acquired audio information of a target individual, then extracting an audio characteristic sequence by using a deep belief network, and then matching the extracted audio characteristic sequence with a corpus characteristic sequence in a speech emotion corpus by using dynamic time programming and an ant colony algorithm so as to determine an emotion label corresponding to the target individual. The technical scheme has simple and direct process, does not need to depend on empirical values, has high emotion recognition rate when the signal-to-noise ratio is low, extracts the audio characteristic sequence by utilizing the deep belief network, can obtain the continuous characteristics of the voice signal, realizes the time sequence analysis of the voice, greatly reduces the characteristic extraction time, and has stronger anti-interference performance and better emotion recognition effect on performance. Meanwhile, the technical scheme combines dynamic time planning and ant colony algorithm to carry out feature matching, realizes optimal matching on local and global aspects, and enables the matching of voice signals to be more comprehensive and efficient.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart schematically illustrating a time-frequency feature extraction and artificial intelligence emotion monitoring method for a speech signal according to an embodiment of the present application;
fig. 2 schematically shows a flowchart for training a deep belief network in another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. It is obvious that the described embodiments are only some, not all, embodiments of the present application. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present application.
The embodiment of the application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals. The method combines dynamic time warping with an ant colony algorithm for feature matching, achieving optimal matching at both the local and the global level and making the matching of voice signals more comprehensive and efficient.
Specifically, as shown in fig. 1, the method for extracting time-frequency features of voice signals and monitoring artificial intelligence emotion includes the following steps:
s110, obtaining audio information of the target individual when reading the given language material.
Here, the audio information of the target individual may be acquired by using a microphone. Specifically, news reports and article segments can be given to allow the target individual to read corresponding words and enter audio information.
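As a concrete illustration of this acquisition step, the sketch below records speech from the default microphone. It assumes the third-party sounddevice library, and the sampling rate and recording duration are illustrative values not prescribed by the application.
```python
import sounddevice as sd  # third-party library, assumed here for illustration


def record_reading(duration_s=30, fs=16000):
    """Record the target individual reading the given corpus via the default microphone."""
    audio = sd.rec(int(duration_s * fs), samplerate=fs, channels=1, dtype="float32")
    sd.wait()  # block until the recording is complete
    return audio.squeeze(), fs
```
The returned one-dimensional signal and its sampling rate can then be passed to the preprocessing step described next.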
S120, performing emphasis processing on the high-frequency part of the audio information by using a first-order pre-emphasis digital filter, and dividing the emphasized audio information into a plurality of pieces of sub-audio information to smooth the waveform of the audio information; and determining the endpoints of each piece of sub-audio information based on the calculated energy value of each sub-audio.
Here, the audio information is passed through a pre-emphasis digital filter of order 1 so that its high-frequency part is emphasized, which improves the resolution of the speech. A rectangular window function may then be applied to divide the audio information into a number of quasi-stationary short-time frames, thereby smoothing the waveform; endpoint detection is performed by calculating the short-time energy of each frame, so that the speech segments are extracted and noise is removed.
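A minimal sketch of this preprocessing chain is given below: a first-order pre-emphasis filter, rectangular-window framing and short-time-energy endpoint detection. The pre-emphasis coefficient (0.97), frame length, hop size and energy threshold are illustrative assumptions, since the application does not fix these values.
```python
import numpy as np


def preprocess(signal, fs, alpha=0.97, frame_ms=25, hop_ms=10, energy_ratio=0.1):
    """Pre-emphasis, rectangular-window framing and energy-based endpoint detection."""
    # First-order pre-emphasis digital filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Split into short, quasi-stationary frames (a rectangular window is plain slicing)
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])

    # Short-time energy per frame; frames above a relative threshold are treated as speech
    energy = np.sum(frames ** 2, axis=1)
    threshold = energy_ratio * energy.max()
    voiced = np.flatnonzero(energy > threshold)
    if voiced.size == 0:
        return frames, None

    # Endpoints: first and last frame whose short-time energy exceeds the threshold
    start, end = voiced[0], voiced[-1]
    return frames[start:end + 1], (start * hop, end * hop + frame_len)
```
The retained frames between the two detected endpoints correspond to the target audio information used in the following steps.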
S130, for each piece of sub-audio information, taking the audio information between its two endpoints as the target audio information corresponding to that sub-audio information.
S140, extracting audio features from all target audio information by using a deep belief network comprising a plurality of restricted Boltzmann machines (RBMs) to obtain an audio feature sequence.
The deep belief network (DBN) used here is pre-trained. A DBN is a network structure built by stacking several restricted Boltzmann machines (RBMs) layer by layer.
S150, acquiring all corpus feature sequences in the speech emotion corpus and emotion labels corresponding to the corpus feature sequences.
Before this step is executed, the method further comprises the step of forming the speech emotion corpus:
step one, obtaining a plurality of second sample audio information and emotion labels corresponding to the second sample audio information.
And step two, processing each second sample audio information by using the trained deep belief network to obtain a corpus feature sequence corresponding to each second sample audio information.
Here, the corpus feature sequence is determined using the following formula:
y = f(x; θ_DBN)
where y denotes the corpus feature sequence, x denotes the second sample audio information, and θ_DBN denotes the parameters of the trained deep belief network.
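Read as a forward pass, the formula corresponds to propagating the input through the stacked RBM layers. The sketch below assumes sigmoid hidden units and per-layer parameters stored as (weights, visible bias, hidden bias) tuples, as produced by the layer-wise pretraining shown later in this description; the variable names are illustrative.
```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dbn_features(x, params):
    """Compute y = f(x; theta_DBN): propagate an input feature vector (or a batch of
    frame vectors) through the stacked RBM layers of the trained deep belief network."""
    h = x
    for W, _b, c in params:      # each layer: weights W, visible bias b (unused here), hidden bias c
        h = sigmoid(h @ W + c)   # the hidden activation of one RBM is the input of the next
    return h                     # top-layer activation, used as one element of the feature sequence
```
A feature sequence can then be formed frame by frame, for example `np.stack([dbn_features(frame, theta_dbn) for frame in frames])`, where `frames` and `theta_dbn` are the hypothetical outputs of the preprocessing and pretraining sketches.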
Step three, establishing the correspondence between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information.
Step four, storing the established correspondence between the corpus feature sequence corresponding to each second sample audio information and its emotion label into the speech emotion corpus.
In the above steps, the second sample audio information uses information from corpora published at home and abroad as input data, from which the corpus feature sequence A = {a1, a2, ..., an} is obtained; the extracted corpus feature sequence and its corresponding emotion label P are stored in the speech emotion corpus as a reference template.
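A minimal sketch of how these reference templates might be organized is given below; the list-of-pairs structure, class name and method names are illustrative assumptions rather than a format prescribed by the application.
```python
import numpy as np


class SpeechEmotionCorpus:
    """Reference templates: corpus feature sequences A = {a1, ..., an} paired with emotion labels P."""

    def __init__(self):
        self.entries = []  # list of (feature_sequence, emotion_label) pairs

    def add(self, feature_sequence, emotion_label):
        # Establish the correspondence between one second-sample feature sequence and its label
        self.entries.append((np.asarray(feature_sequence), emotion_label))

    def __iter__(self):
        # Iterate over all (corpus feature sequence, emotion label) templates during matching
        return iter(self.entries)
```
During monitoring, every stored pair is matched against the extracted audio feature sequence, and the label of the best-matching template gives the recognized emotion.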
S160, for each corpus feature sequence among all the corpus feature sequences, aligning the target feature sequence with the audio feature sequence to be compared according to the expected information to obtain a processed corpus feature sequence; projecting the corpus feature sequence and the audio feature sequence onto the same plane, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence.
Here, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence and screening the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence includes:
calculating the Euclidean distance between each processed corpus feature sequence and the audio feature sequence by using an ant colony algorithm, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence.
In this step, the audio feature sequence B = {b1, b2, ..., bn} of the target audio information is extracted by the DBN. After the features to be compared are time-aligned based on dynamic time warping (DTW), each audio feature sequence B_i is similarity-matched against the corpus feature sequence A_i. However, because DTW is a locally optimal algorithm, it cannot reach the global optimum and the recognition result may be inaccurate, so DTW is improved by introducing the ant colony algorithm. In the improved algorithm, the search from the starting point proceeds randomly according to state transition probabilities, and the pheromone is updated once all ants in the colony have completed their search. In the voice matching algorithm, a cost function D_se^k(t) is introduced from a global perspective, defined as the average distance of the k-th ant over the whole path from the starting point s to the endpoint e; this average distance is used to measure the difference between the two feature sequences. As the iterations continue, the optimal distance for matching the audio features is obtained.
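The sketch below illustrates this combined scheme under simplifying assumptions: ants build monotone warping paths over the frame-to-frame Euclidean distance grid, the state-transition probability mixes pheromone and inverse distance, and the path cost is the average distance along the path (the D_se^k(t) above). The ant count, iteration count and the alpha, beta, rho constants are illustrative and not taken from the application.
```python
import numpy as np


def ant_colony_dtw(A, B, n_ants=20, n_iters=50, alpha=1.0, beta=2.0, rho=0.1, seed=0):
    """Match a corpus sequence A (n x d) against an audio sequence B (m x d):
    ants build monotone warping paths from (0, 0) to (n-1, m-1); the cost of a path
    is the average Euclidean distance along it."""
    rng = np.random.default_rng(seed)
    n, m = len(A), len(B)
    # Local Euclidean distance between every pair of frames
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # Pheromone on every grid cell, updated after each generation of ants
    tau = np.ones((n, m))
    best_path, best_cost = None, np.inf

    for _ in range(n_iters):
        for _ in range(n_ants):
            i, j, path = 0, 0, [(0, 0)]
            while (i, j) != (n - 1, m - 1):
                # Admissible DTW moves: advance in A, in B, or in both
                moves = [(p, q) for p, q in ((i + 1, j), (i, j + 1), (i + 1, j + 1))
                         if p < n and q < m]
                # State-transition probability: pheromone^alpha * (1/distance)^beta
                w = np.array([tau[p, q] ** alpha * (1.0 / (dist[p, q] + 1e-9)) ** beta
                              for p, q in moves])
                i, j = moves[rng.choice(len(moves), p=w / w.sum())]
                path.append((i, j))
            cost = np.mean([dist[p, q] for p, q in path])  # average-distance cost of this ant
            if cost < best_cost:
                best_cost, best_path = cost, path
        # Pheromone update: evaporation plus a deposit on the best path found so far
        tau *= (1.0 - rho)
        for p, q in best_path:
            tau[p, q] += 1.0 / (best_cost + 1e-9)
    return best_cost, best_path
```
Running this against every corpus feature sequence and keeping the template with the smallest returned cost corresponds to the screening step above.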
S170, determining the emotion category of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening; and sending the determined emotion category of the target individual to the client and the display terminal for display.
Here, determining the emotion category of the target individual based on the emotion label corresponding to the screened corpus feature sequence includes:
taking the emotion category corresponding to the emotion label of the screened corpus feature sequence as the emotion category of the target individual.
The above method may further include: supplementing the collected audio feature sequence into the existing speech emotion corpus as a new corpus feature sequence, so as to increase the diversity of the speech emotion corpus data.
Before step S140 is executed, the method may further include a step of training the deep belief network, as shown in fig. 2:
s210, taking the first sample audio information X and the first hidden layer h1 as a first limiting Boltzmann machine RBM, and training to obtain parameters of the RBM.
Here, the parameters of the RBM include a weight connecting the visible vector and the hidden vector, an offset of each node in the visible vector, and an offset of each node in the hidden vector.
S220, fixing parameters of the first RBM, taking h1 as a visible vector and taking a second hidden layer h2 as a hidden vector, and training to obtain parameters of the second RBM until all the parameters of the RBM are obtained through training.
And S230, fine-tuning each parameter in the deep belief network by using a back propagation algorithm to obtain the trained deep belief network.
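A compact sketch of the layer-wise pretraining of steps S210 and S220 is given below, using one-step contrastive divergence (CD-1) to train each RBM. The learning rate, epoch count and layer sizes are illustrative assumptions, and the back-propagation fine-tuning of step S230 is only indicated in a comment.
```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_rbm(V, n_hidden, lr=0.01, epochs=10, seed=0):
    """Train one restricted Boltzmann machine with one-step contrastive divergence (CD-1).
    Returns the weights W, visible biases b and hidden biases c (the RBM parameters)."""
    rng = np.random.default_rng(seed)
    n_visible = V.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)   # bias of each node in the visible vector
    c = np.zeros(n_hidden)    # bias of each node in the hidden vector
    for _ in range(epochs):
        # Positive phase: hidden probabilities and samples given the data
        ph = sigmoid(V @ W + c)
        h = (rng.random(ph.shape) < ph).astype(float)
        # Negative phase: one Gibbs step back to the visible layer and up again
        pv = sigmoid(h @ W.T + b)
        ph2 = sigmoid(pv @ W + c)
        # CD-1 gradient approximation
        W += lr * (V.T @ ph - pv.T @ ph2) / len(V)
        b += lr * (V - pv).mean(axis=0)
        c += lr * (ph - ph2).mean(axis=0)
    return W, b, c


def pretrain_dbn(X, layer_sizes):
    """Greedy layer-wise pretraining: the hidden activations of each trained RBM
    become the visible vector of the next RBM (steps S210 and S220)."""
    params, V = [], X
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(V, n_hidden)
        params.append((W, b, c))
        V = sigmoid(V @ W + c)   # fix this RBM and feed its hidden layer upward
    return params                # S230: fine-tune these parameters with backpropagation (not shown)
```
The returned per-layer parameter tuples are the ones consumed by the forward-pass sketch given earlier for y = f(x; θ_DBN).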
In the above embodiment, the microphone collects, in real time, the audio information of the target individual while reading the given corpus; the audio information is preprocessed by pre-emphasis, windowing, framing, endpoint detection and so on to smooth the waveform and remove noise; after the emotion features (i.e. the audio feature sequence) in the audio information are extracted by a deep belief network (DBN) model, the extracted features are similarity-matched against the features in the speech emotion corpus based on dynamic time warping (DTW) and the ant colony algorithm, thereby recognizing the emotional state conveyed by the audio information. This scheme can effectively quantify and monitor the audio information in real time.
Compared with traditional MFCC features, the DBN, as a multilayer network model based on unsupervised training, is superior to the traditional approach in both training time and generalization capability for the corpus feature sequence; it can greatly reduce feature extraction time and gives speech emotion recognition stronger anti-interference performance and better recognition performance.
In this embodiment, the similarities of different audio feature sequences are compared, and similarity matching of audio feature sequences is used as the method for recognizing emotional features. Whereas the traditional DTW algorithm minimizes the sum of weighted distances by a local optimization method, combining the ant colony algorithm with DTW measures the similarity between voice signals by the average distance instead of the weighted distance, effectively uniting time warping with distance measurement. When different audio feature sequences are recognized in continuous audio information, the recognition rate of the ant colony dynamic time warping algorithm is superior to that of the plain DTW algorithm, and its superiority is especially apparent in complex environments.
The method for time-frequency feature extraction and artificial intelligence emotion monitoring of voice signals provided by the application preprocesses the audio data of the monitored target individual through pre-emphasis, windowing and framing, builds a deep belief network to fully mine the time-frequency features of the data, and then matches the extracted time-frequency features against the predefined corpus time-frequency features in the speech emotion corpus using dynamic time warping and the ant colony algorithm, so as to determine the emotion distribution of the monitored target individual. The process is simple and fast, does not depend on empirical values, and keeps a high emotional-state recognition rate at a low signal-to-noise ratio; extracting the time-frequency features of the audio with the deep belief network enables time-sequence analysis of the speech and a more accurate emotional-state recognition effect. Moreover, combining dynamic time warping with the ant colony algorithm for feature matching achieves optimal matching at both the local and the global level and greatly increases the efficiency of emotional-state recognition.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (7)

1. A time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals is characterized by comprising the following steps:
acquiring audio information of a target individual when reading a given corpus;
performing emphasis processing on a high-frequency part in the audio information by using a pre-emphasis digital filter with the order of one, and dividing the audio information subjected to the emphasis processing into a plurality of pieces of sub-audio information so as to smooth the waveform of the audio information;
determining an endpoint of each sub-audio information based on the calculated energy value of each sub-audio;
for each piece of sub-audio information, taking the audio information between two endpoints of the sub-audio information as target audio information corresponding to the sub-audio information;
extracting audio features in all target audio information by using a deep belief network comprising a plurality of restricted Boltzmann machines (RBMs) to obtain an audio feature sequence;
acquiring all corpus feature sequences in a speech emotion corpus and emotion labels corresponding to the corpus feature sequences;
aiming at each corpus feature sequence in all the corpus feature sequences, aligning the target feature sequence with the audio feature sequence to be compared according to the expected information to obtain a processed corpus feature sequence; projecting the corpus feature sequence and the audio feature sequence onto the same plane, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence;
based on the emotion label corresponding to the corpus characteristic sequence obtained by screening, determining the emotion category of the target individual;
and sending the determined emotion category of the target individual to the client and the display terminal for display.
2. The method of claim 1, further comprising the step of training a deep belief network:
taking the first sample audio information X and the first hidden layer h1 as a first restricted Boltzmann machine (RBM), and training to obtain the parameters of this RBM;
fixing the parameters of the first RBM, taking h1 as the visible vector and the second hidden layer h2 as the hidden vector, and training to obtain the parameters of the second RBM, and so on until the parameters of all RBMs are obtained through training;
and fine-tuning each parameter in the deep belief network by using a back propagation algorithm to obtain the trained deep belief network.
3. The method of claim 2, wherein the parameters of the RBM include the weights connecting the visible vector and the hidden vector, the bias of each node in the visible vector, and the bias of each node in the hidden vector.
4. The method of claim 2, further comprising the step of forming the speech emotion corpus:
acquiring a plurality of second sample audio information and emotion labels corresponding to the second sample audio information;
processing each second sample audio information by using the trained deep belief network to obtain a corpus feature sequence corresponding to each second sample audio information;
establishing a corresponding relation between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information;
and storing the correspondence established between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information into the speech emotion corpus.
5. The method according to claim 1, wherein calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence and screening the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence comprises:
calculating the Euclidean distance between each processed corpus feature sequence and the audio feature sequence by using an ant colony algorithm, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence.
6. The method according to claim 1, wherein determining the emotion category of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening comprises:
taking the emotion category corresponding to the emotion label of the screened corpus feature sequence as the emotion category of the target individual.
7. The method of claim 1, wherein obtaining audio information of a target individual when reading a given corpus comprises:
and acquiring audio information of the target individual when reading the given language material by using a microphone.
CN201910823584.5A 2019-09-02 2019-09-02 Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal Pending CN110619893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910823584.5A CN110619893A (en) 2019-09-02 2019-09-02 Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910823584.5A CN110619893A (en) 2019-09-02 2019-09-02 Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal

Publications (1)

Publication Number Publication Date
CN110619893A true CN110619893A (en) 2019-12-27

Family

ID=68922188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910823584.5A Pending CN110619893A (en) 2019-09-02 2019-09-02 Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal

Country Status (1)

Country Link
CN (1) CN110619893A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106790949A (en) * 2015-11-20 2017-05-31 北京奇虎科技有限公司 The collocation method and device in the phonetic feature storehouse of malicious call
CN106790950A (en) * 2015-11-20 2017-05-31 北京奇虎科技有限公司 The recognition methods of malicious call and device
CN106297825A (en) * 2016-07-25 2017-01-04 华南理工大学 A kind of speech-emotion recognition method based on integrated degree of depth belief network
CN110084579A (en) * 2018-01-26 2019-08-02 百度在线网络技术(北京)有限公司 Method for processing resource, device and system
CN109409496A (en) * 2018-11-14 2019-03-01 重庆邮电大学 One kind being based on the improved LDTW sequence similarity amount method of ant group algorithm
CN109841229A (en) * 2019-02-24 2019-06-04 复旦大学 A kind of Neonate Cry recognition methods based on dynamic time warping
CN109785863A (en) * 2019-02-28 2019-05-21 中国传媒大学 A kind of speech-emotion recognition method and system of deepness belief network
CN109903781A (en) * 2019-04-14 2019-06-18 湖南检信智能科技有限公司 A kind of sentiment analysis method for mode matching

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUANG TAO (黄涛): "Research on the Application of the Ant Colony Algorithm in Speech Recognition", Journal of Wuhan University of Technology (Information & Management Engineering Edition) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593532A (en) * 2021-08-31 2021-11-02 竹间智能科技(上海)有限公司 Speech emotion recognition model training method and electronic equipment

Similar Documents

Publication Publication Date Title
Shahin et al. Emotion recognition using hybrid Gaussian mixture model and deep neural network
Harb et al. Voice-based gender identification in multimedia applications
Tong et al. A comparative study of robustness of deep learning approaches for VAD
US10535000B2 (en) System and method for speaker change detection
CN105895078A (en) Speech recognition method used for dynamically selecting speech model and device
CN108831506B (en) GMM-BIC-based digital audio tamper point detection method and system
Praksah et al. Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier
Woubie et al. Voice-quality Features for Deep Neural Network Based Speaker Verification Systems
CN110619893A (en) Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal
Ge et al. Speaker change detection using features through a neural network speaker classifier
Zhou et al. Speech Emotion Recognition with Discriminative Feature Learning.
Khoury et al. I-Vectors for speech activity detection.
Biagetti et al. Robust speaker identification in a meeting with short audio segments
Jamil et al. Influences of age in emotion recognition of spontaneous speech: A case of an under-resourced language
Zewoudie et al. Short-and long-term speech features for hybrid hmm-i-vector based speaker diarization system
Ahsan Physical features based speech emotion recognition using predictive classification
Khanum et al. Speech based gender identification using feed forward neural networks
Xu et al. Voiceprint recognition of Parkinson patients based on deep learning
Kanrar Robust threshold selection for environment specific voice in speaker recognition
Odriozola et al. An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods
Lakra et al. Automated pitch-based gender recognition using an adaptive neuro-fuzzy inference system
Heittola Computational Audio Content Analysis in Everyday Environments
Keyvanrad et al. Feature selection and dimension reduction for automatic gender identification
Prabha et al. Advanced Gender Recognition System Using Speech Signal
CN113870901B (en) SVM-KNN-based voice emotion recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191227