CN110619893A - Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal
- Publication number: CN110619893A (application CN201910823584.5A)
- Authority: CN (China)
- Prior art keywords: corpus, audio information, feature sequence, emotion, audio
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters
- G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique
- G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for estimating an emotional state
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- General Health & Medical Sciences (AREA)
- Child & Adolescent Psychology (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The method preprocesses the audio data of a monitored target individual through pre-emphasis, windowing, framing and similar operations, builds a deep belief network to fully mine the time-frequency features of the data, and then matches the extracted time-frequency features against predefined corpus time-frequency features in a speech emotion corpus using dynamic time warping and an ant colony algorithm, so as to determine the emotion distribution corresponding to the monitored target individual. The process is simple and fast, does not depend on empirical values, and maintains a high emotional-state recognition rate at low signal-to-noise ratios; extracting the time-frequency features of the audio with a deep belief network enables time-sequence analysis of the speech and more accurate recognition of emotional states. Meanwhile, combining dynamic time warping with an ant colony algorithm for feature matching achieves optimal matching both locally and globally and greatly increases the efficiency of emotional-state recognition.
Description
Technical Field
The application relates to the fields of psychology and information processing, in particular to a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals.
Background
Emotion, a general term for a series of subjective cognitive experiences, is a psychological and physiological state produced by the integration of multiple feelings, thoughts and behaviors. The most common emotions are happiness, anger, grief, surprise, fear and love, along with subtler ones such as jealousy, shame and embarrassment. Emotion often interacts with factors such as mood, character, temperament and purpose, and is also affected by hormones and neurotransmitters. Both positive and negative emotions motivate people to act. Although some emotion-driven behavior appears to occur without thought, consciousness is in fact one of the important links in producing emotion. It can therefore be seen that attending to an individual's emotional characteristics can play a very important role in emotional guidance and public safety.
At present, some technical solutions analyze an individual's emotional characteristics from the individual's audio information. Audio information can reflect emotional characteristics to some extent; for example, an increase in sound intensity may indicate that the individual is angry.
Existing schemes for analyzing emotional characteristics from audio information mainly use Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients and the like. These schemes mostly rely on handcrafted features, so the process is complicated and depends on experience, and the emotion recognition rate usually drops sharply at low signal-to-noise ratios. Further, conventional emotion recognition methods include hidden Markov models (HMM), Gaussian mixture models, support vector machines (SVM), artificial neural networks and k-nearest neighbors. These methods mostly consider only the global features of the speech signal and classify on that basis; they do not consider the continuous character of the speech signal, cannot analyze its time sequence, and thus fail to capture how emotion changes along the time dimension, making the analysis incomplete.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals, overcoming the prior art's shortcomings in emotion determination: a complicated process, dependence on experience, incomplete analysis of the voice signal, poor anti-interference capability and low efficiency.
(II) technical scheme
In order to achieve the above purpose, the present application is implemented by the following technical solutions:
the application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals, which comprises the following steps:
acquiring audio information of a target individual when reading a given corpus;
performing emphasis processing on the high-frequency part of the audio information with a first-order pre-emphasis digital filter, and dividing the emphasized audio information into a plurality of pieces of sub-audio information so as to smooth the waveform of the audio information; determining the endpoints of each piece of sub-audio information based on the calculated energy value of each sub-audio;
for each piece of sub-audio information, taking the audio information between two endpoints of the sub-audio information as target audio information corresponding to the sub-audio information;
extracting audio features from all target audio information by using a deep belief network comprising a plurality of restricted Boltzmann machines (RBMs) to obtain an audio feature sequence;
acquiring all corpus feature sequences in a speech emotion corpus and emotion labels corresponding to the corpus feature sequences;
for each corpus feature sequence among all the corpus feature sequences, aligning the corpus feature sequence in time with the audio feature sequence to be compared to obtain a processed corpus feature sequence; projecting the corpus feature sequence and the audio feature sequence onto the same plane, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence whose Euclidean distance to the audio feature sequence is smallest;
based on the emotion label corresponding to the corpus characteristic sequence obtained by screening, determining the emotion category of the target individual;
and sending the determined emotion category of the target individual to the client and the display terminal for display.
In one possible embodiment, the method further comprises the step of training the deep belief network:
taking the first sample audio information X as the visible vector and the first hidden layer h1 as the hidden vector of a first restricted Boltzmann machine (RBM), and training to obtain the parameters of this RBM;
fixing the parameters of the first RBM, taking h1 as the visible vector and the second hidden layer h2 as the hidden vector, and training to obtain the parameters of the second RBM, and so on until the parameters of all RBMs are obtained through training;
and fine-tuning each parameter in the deep belief network by using a back propagation algorithm to obtain the trained deep belief network.
In one possible implementation, the parameters of the RBM include a weight connecting the visible vector and the hidden vector, an offset of each node in the visible vector, and an offset of each node in the hidden vector.
In a possible implementation, the method further includes the step of forming the speech emotion corpus:
acquiring a plurality of second sample audio information and emotion labels corresponding to the second sample audio information;
processing each second sample audio information by using the trained deep belief network to obtain a corpus feature sequence corresponding to each second sample audio information;
establishing a corresponding relation between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information;
and storing the established corresponding relations into the speech emotion corpus.
In a possible implementation manner, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence, includes:
calculating, with an ant colony algorithm, the Euclidean distance between each processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence.
In a possible implementation manner, the determining the emotion classification of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by the screening includes:
and taking the emotion category corresponding to the emotion label of the corpus characteristic sequence obtained by screening as the emotion category of the target individual.
In one possible embodiment, the obtaining audio information of the target individual when reading the given corpus includes:
acquiring, with a microphone, the audio information of the target individual when reading the given corpus.
(III) advantageous effects
The application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals. The method has the following beneficial effects:
the method comprises the steps of firstly carrying out pre-emphasis, windowing, framing and other processing on the acquired audio information of a target individual, then extracting an audio characteristic sequence by using a deep belief network, and then matching the extracted audio characteristic sequence with a corpus characteristic sequence in a speech emotion corpus by using dynamic time programming and an ant colony algorithm so as to determine an emotion label corresponding to the target individual. The technical scheme has simple and direct process, does not need to depend on empirical values, has high emotion recognition rate when the signal-to-noise ratio is low, extracts the audio characteristic sequence by utilizing the deep belief network, can obtain the continuous characteristics of the voice signal, realizes the time sequence analysis of the voice, greatly reduces the characteristic extraction time, and has stronger anti-interference performance and better emotion recognition effect on performance. Meanwhile, the technical scheme combines dynamic time planning and ant colony algorithm to carry out feature matching, realizes optimal matching on local and global aspects, and enables the matching of voice signals to be more comprehensive and efficient.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart schematically illustrating a time-frequency feature extraction and artificial intelligence emotion monitoring method for a speech signal according to an embodiment of the present application;
fig. 2 schematically shows a flowchart for training a deep belief network in another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals. The method combines dynamic time warping with an ant colony algorithm for feature matching, achieving optimal matching both locally and globally and making the matching of voice signals more comprehensive and efficient.
Specifically, as shown in fig. 1, the method for extracting time-frequency features of voice signals and monitoring artificial intelligence emotion includes the following steps:
s110, obtaining audio information of the target individual when reading the given language material.
Here, the audio information of the target individual may be acquired with a microphone. Specifically, a news report or article passage can be given for the target individual to read aloud while the corresponding audio is recorded.
S120, performing emphasis processing on the high-frequency part of the audio information with a first-order pre-emphasis digital filter, and dividing the emphasized audio information into a plurality of pieces of sub-audio information to smooth the waveform of the audio information; and determining the endpoints of each piece of sub-audio information based on the calculated energy value of each sub-audio.
Here, the audio information is passed through a first-order pre-emphasis digital filter to emphasize its high-frequency part and improve the resolution of the speech. A rectangular window function may then be applied to divide the audio into a number of quasi-stationary short-time frames, smoothing the waveform; endpoint detection is performed by calculating the short-time energy of each frame, so that noise can be removed from the extracted audio information.
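The preprocessing chain above (first-order pre-emphasis, rectangular-window framing, short-time-energy endpoint detection) can be sketched as follows. The filter coefficient 0.97, the frame and hop sizes, and the energy-ratio threshold are illustrative values not specified in the patent:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97, energy_ratio=0.1):
    """Pre-emphasize, frame with a rectangular window, and flag voiced frames
    by short-time energy (a sketch of step S120; all thresholds illustrative)."""
    # First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into fixed-length frames (a rectangular window is plain slicing)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Short-time energy per frame, used for endpoint detection
    energy = np.sum(frames ** 2, axis=1)
    voiced = energy > energy_ratio * energy.max()
    return frames, energy, voiced

# Example: a burst of "speech" surrounded by near-silence
rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(1600),
                      rng.standard_normal(3200),
                      0.01 * rng.standard_normal(1600)])
frames, energy, voiced = preprocess(sig)
```

The `voiced` mask marks frames whose energy exceeds the threshold; the first and last voiced frames would serve as the endpoints of a sub-audio segment.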
And S130, regarding each piece of sub-audio information, taking the audio information between the two endpoints of the sub-audio information as the target audio information corresponding to the sub-audio information.
S140, extracting audio features from all target audio information by using a deep belief network comprising a plurality of restricted Boltzmann machines (RBMs) to obtain an audio feature sequence.
The deep belief network (DBN) here is pre-trained. A DBN is a network structure stacked layer by layer from multiple restricted Boltzmann machines (RBMs).
S150, acquiring all corpus feature sequences in the speech emotion corpus and emotion labels corresponding to the corpus feature sequences.
Before this step is executed, the method further comprises the step of forming the speech emotion corpus:
step one, obtaining a plurality of second sample audio information and emotion labels corresponding to the second sample audio information.
And step two, processing each second sample audio information by using the trained deep belief network to obtain a corpus feature sequence corresponding to each second sample audio information.
Here, the corpus feature sequence is determined using the following formula:
y = f(x; θ_DBN)
where y denotes the corpus feature sequence, x denotes the second sample audio information, and θ_DBN denotes the trained parameters of the deep belief network.
And step three, establishing a corresponding relation between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information.
And step four, storing the corpus feature sequence corresponding to each second sample audio information and the corresponding relation established by the emotion label corresponding to each second sample audio information into a speech emotion corpus.
In the above steps, the second sample audio information uses recordings from corpora published at home and abroad as input data, and the corpus feature sequence A = {a_1, a_2, ..., a_n} is obtained; the extracted corpus feature sequence, together with its corresponding emotion label P, is stored into the speech emotion corpus as a reference template.
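The resulting reference-template store can be sketched minimally as below; the `EmotionCorpus` class name, the example feature values, and the labels are hypothetical, and real templates would hold DBN-extracted feature sequences:

```python
from dataclasses import dataclass, field

@dataclass
class EmotionCorpus:
    """Stores (corpus feature sequence, emotion label) reference templates,
    mirroring steps one to four above. Names and values are illustrative."""
    templates: list = field(default_factory=list)

    def add(self, feature_seq, label):
        # Each template pairs a feature sequence A with its emotion label P
        self.templates.append((list(feature_seq), label))

    def all(self):
        return list(self.templates)

corpus = EmotionCorpus()
corpus.add([0.1, 0.5, 0.9], "happy")
corpus.add([0.9, 0.2, 0.1], "angry")
```

New templates extracted from monitored audio could later be appended the same way, which is how the corpus diversity is grown.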
S160, for each corpus feature sequence among all corpus feature sequences, aligning the corpus feature sequence in time with the audio feature sequence to be compared to obtain a processed corpus feature sequence; and projecting the corpus feature sequence and the audio feature sequence onto the same plane, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence whose Euclidean distance to the audio feature sequence is smallest.
Here, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence includes:
calculating, with an ant colony algorithm, the Euclidean distance between each processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence.
In this step, the DBN extracts the audio feature sequence B = {b_1, b_2, ..., b_n} from the target audio information. After the features to be compared are time-normalized with dynamic time warping (DTW), the audio feature sequence B_i is similarity-matched against each corpus feature sequence A_i. However, because DTW is a locally optimal algorithm, it cannot reach a global optimum and the recognition result may be inaccurate, so DTW is improved by introducing the ant colony algorithm. In the improved algorithm, the search from the initial point proceeds randomly according to state-transition probabilities; once every ant in the colony has finished searching, the pheromone is updated. From a global perspective, the speech matching algorithm introduces a cost function D_{s,e}^k(t), the average distance of the k-th ant over the whole path from the starting point s to the endpoint e, which is used to measure the difference between the two feature sequences. As the loop continues, the optimal distance for matching the audio features is obtained.
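The baseline alignment step can be illustrated with textbook dynamic time warping. This sketch shows only the classic DTW recurrence that the patent starts from, not the ant-colony improvement, and uses 1-D features with an absolute-difference local cost for brevity:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two 1-D feature sequences.
    D[i, j] holds the cheapest cumulative cost of aligning a[:i] with b[:j]."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])       # local distance
            D[i, j] = cost + min(D[i - 1, j],     # insertion
                                 D[i, j - 1],     # deletion
                                 D[i - 1, j - 1]) # match
    return D[n, m]

# Time-shifted copies of the same contour align perfectly under DTW
x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 0, 1, 2, 3, 2, 1, 0]
print(dtw_distance(x, y))  # → 0.0
```

This local-alignment property is what makes DTW suitable for time normalization; the ant-colony layer described above then searches for a globally better path than the greedy per-cell minimum.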
S170, determining the emotion category of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening; and sending the determined emotion category of the target individual to the client and the display terminal for display.
Here, the determining the emotion classification of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening includes:
and taking the emotion category corresponding to the emotion label of the corpus characteristic sequence obtained by screening as the emotion category of the target individual.
The above method may further comprise: supplementing the collected audio feature sequence into the existing speech emotion corpus as a new corpus feature sequence, so as to increase the diversity of the speech emotion corpus data.
Before step S140 is executed, the method may further include a step of training the deep belief network, as shown in fig. 2:
s210, taking the first sample audio information X and the first hidden layer h1 as a first limiting Boltzmann machine RBM, and training to obtain parameters of the RBM.
Here, the parameters of the RBM include a weight connecting the visible vector and the hidden vector, an offset of each node in the visible vector, and an offset of each node in the hidden vector.
S220, fixing the parameters of the first RBM, taking h1 as the visible vector and the second hidden layer h2 as the hidden vector, and training to obtain the parameters of the second RBM, and so on until the parameters of all RBMs are obtained through training.
And S230, fine-tuning each parameter in the deep belief network by using a back propagation algorithm to obtain the trained deep belief network.
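Steps S210 and S220 can be sketched as greedy layer-wise pretraining. This uses single-step contrastive divergence (CD-1), a common way to train RBMs that the patent does not name explicitly; all hyperparameters and shapes are illustrative, and the back-propagation fine-tuning of S230 is omitted:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, lr=0.1, epochs=50):
    """One RBM trained with CD-1. The parameters match the patent's list:
    weights W connecting visible and hidden vectors, visible offsets b,
    hidden offsets c."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W + c)                              # up pass
        h_sample = (h0 > rng.random(h0.shape)).astype(float)  # sampled hidden states
        v1 = sigmoid(h_sample @ W.T + b)                      # reconstruction
        h1 = sigmoid(v1 @ W + c)                              # second up pass
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)         # CD-1 update
        b += lr * (v0 - v1).mean(axis=0)
        c += lr * (h0 - h1).mean(axis=0)
    return W, b, c

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise stacking (S210/S220): each trained layer's hidden
    activations become the next RBM's visible data."""
    params, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden)
        params.append((W, b, c))
        layer_input = sigmoid(layer_input @ W + c)
    return params

X = rng.random((32, 20))           # stand-in for first-sample audio features
params = pretrain_dbn(X, [16, 8])  # two stacked RBMs
```

After this unsupervised pretraining, back-propagation (S230) would fine-tune all layers jointly against the emotion labels.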
In the above embodiment, a microphone collects in real time the audio information of the target individual while reading the given corpus; the audio information is preprocessed by pre-emphasis, windowing, framing, endpoint detection and the like to smooth the waveform and remove noise; after the emotional features (i.e. the audio feature sequence) in the audio information are extracted by a deep belief network (DBN) model, the extracted features are similarity-matched against the features in the speech emotion corpus based on dynamic time warping (DTW) and an ant colony algorithm, thereby recognizing the emotional state carried by the audio information. The scheme can effectively quantize and monitor the audio information in real time.
Compared with traditional MFCC features, the DBN, as a multilayer network model based on unsupervised training, is superior to the traditional approach in both the training time and the generalization ability of the corpus feature sequence; it can greatly reduce feature extraction time and yields stronger anti-interference performance and better speech emotion recognition.
In the embodiment, the similarities of different audio feature sequences are compared, and similarity matching of audio feature sequences serves as the method for recognizing emotional features. The traditional DTW algorithm minimizes the sum of weighted distances by a local optimization method; by combining the ant colony algorithm with DTW, the similarity between voice signals is measured with an average distance instead of the weighted distance, effectively uniting time normalization with distance measurement. When different audio feature sequences are recognized from continuous audio information, the recognition rate of the ant colony dynamic time warping algorithm is superior to that of plain DTW, and its superiority shows especially under complex environmental conditions.
The method for time-frequency feature extraction and artificial intelligence emotion monitoring of voice signals preprocesses the audio data of the monitored target individual through pre-emphasis, windowing, framing and similar operations, builds a deep belief network to fully mine the time-frequency features of the data, and then matches the extracted time-frequency features against predefined corpus time-frequency features in the speech emotion corpus using dynamic time warping and an ant colony algorithm, so as to determine the emotion distribution corresponding to the monitored target individual. The process is simple and fast, does not depend on empirical values, and maintains a high emotional-state recognition rate at low signal-to-noise ratios; extracting the time-frequency features of the audio with a deep belief network enables time-sequence analysis of the speech and more accurate recognition of emotional states. Meanwhile, combining dynamic time warping with an ant colony algorithm for feature matching achieves optimal matching both locally and globally and greatly increases the efficiency of emotional-state recognition.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (7)
1. A time-frequency feature extraction and artificial intelligence emotion monitoring method for voice signals is characterized by comprising the following steps:
acquiring audio information of a target individual when reading a given corpus;
performing emphasis processing on the high-frequency part of the audio information with a first-order pre-emphasis digital filter, and dividing the emphasized audio information into a plurality of pieces of sub-audio information so as to smooth the waveform of the audio information;
determining an endpoint of each sub-audio information based on the calculated energy value of each sub-audio;
for each piece of sub-audio information, taking the audio information between two endpoints of the sub-audio information as target audio information corresponding to the sub-audio information;
extracting audio features from all the target audio information by using a deep belief network comprising a plurality of restricted Boltzmann machines (RBMs) to obtain an audio feature sequence;
acquiring all corpus feature sequences in a speech emotion corpus and emotion labels corresponding to the corpus feature sequences;
for each corpus feature sequence among all the corpus feature sequences, aligning that corpus feature sequence with the audio feature sequence to be compared, according to the given corpus, to obtain a processed corpus feature sequence; projecting the processed corpus feature sequence and the audio feature sequence into the same space, calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence;
determining the emotion category of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening;
and sending the determined emotion category of the target individual to a client and a display terminal for display.
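The preprocessing pipeline of claim 1 (first-order pre-emphasis, framing into pieces of sub-audio information, and energy-based endpoint detection) can be sketched as follows. The filter coefficient 0.97, frame length, hop size, and energy threshold are illustrative assumptions; the claim does not fix these values:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # first-order pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting the high-frequency part
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len=256, hop=128):
    # divide the signal into overlapping sub-audio frames
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def energy_endpoints(frames, ratio=0.1):
    # a frame counts as speech when its short-time energy exceeds a fraction
    # of the maximum frame energy; return the first and last such frame
    energy = (frames ** 2).sum(axis=1)
    active = np.flatnonzero(energy > ratio * energy.max())
    return int(active[0]), int(active[-1])

# toy signal: silence, then a tone burst, then silence
signal = np.concatenate([np.zeros(512), np.sin(np.linspace(0, 100, 1024)), np.zeros(512)])
frames = frame_signal(pre_emphasis(signal))
start, end = energy_endpoints(frames)  # frame indices bracketing the burst
```

The audio between the `start` and `end` frames would then serve as the target audio information passed to feature extraction.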
2. The method of claim 1, further comprising the step of training a deep belief network:
taking first sample audio information X as the visible vector and a first hidden layer h1 as the hidden vector of a first restricted Boltzmann machine (RBM), and training to obtain the parameters of that RBM;
fixing the parameters of the first RBM, taking h1 as the visible vector and a second hidden layer h2 as the hidden vector, and training to obtain the parameters of a second RBM, and so on until the parameters of all RBMs are obtained through training;
and fine-tuning each parameter of the deep belief network using a back-propagation algorithm to obtain the trained deep belief network.
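A minimal sketch of this greedy layer-wise pretraining, using one step of contrastive divergence (CD-1) per update. The layer sizes, learning rate, and iteration count are illustrative assumptions, and the final back-propagation fine-tuning step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RBM:
    """Restricted Boltzmann machine trained with CD-1."""
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0, 0.01, (n_visible, n_hidden))  # weights linking visible and hidden vectors
        self.b = np.zeros(n_visible)  # bias of each node in the visible vector
        self.c = np.zeros(n_hidden)   # bias of each node in the hidden vector
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def cd1_step(self, v0):
        h0 = self.hidden_probs(v0)
        h_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h_sample @ self.W.T + self.b)  # one-step reconstruction
        h1 = self.hidden_probs(v1)
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (h0 - h1).mean(axis=0)

# greedy layer-wise training: train the first RBM on X, freeze it,
# then treat its hidden activations h1 as the visible vector of the second RBM
X = rng.random((64, 20))
rbm1, rbm2 = RBM(20, 12), RBM(12, 6)
for _ in range(50):
    rbm1.cd1_step(X)
h1 = rbm1.hidden_probs(X)
for _ in range(50):
    rbm2.cd1_step(h1)
features = rbm2.hidden_probs(h1)  # top-layer activations as the feature sequence
```

In the claimed method, all RBM parameters obtained this way would then be jointly fine-tuned with back-propagation.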
3. The method of claim 2, wherein the parameters of an RBM include the weights connecting the visible vector and the hidden vector, the bias of each node in the visible vector, and the bias of each node in the hidden vector.
4. The method of claim 2, further comprising the step of forming the speech emotion corpus:
acquiring a plurality of second sample audio information and emotion labels corresponding to the second sample audio information;
processing each second sample audio information by using the trained deep belief network to obtain a corpus feature sequence corresponding to each second sample audio information;
establishing a corresponding relation between the corpus feature sequence corresponding to each second sample audio information and the emotion label corresponding to each second sample audio information;
and storing, in a speech emotion corpus, the corpus feature sequence corresponding to each piece of second sample audio information together with the correspondence established with its emotion label.
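The corpus construction in claim 4 amounts to mapping each extracted feature sequence to its emotion label. In this sketch, `dummy_extract` is a hypothetical stand-in for the trained deep belief network of claim 2:

```python
import numpy as np

def build_emotion_corpus(samples, extract_features):
    # samples: iterable of (audio, emotion_label) pairs;
    # extract_features stands in for the trained deep belief network
    return [(extract_features(audio), label) for audio, label in samples]

# hypothetical stand-in extractor: two summary statistics per clip
def dummy_extract(audio):
    return np.array([audio.mean(), audio.std()])

rng = np.random.default_rng(1)
samples = [(np.ones(100), "calm"), (rng.normal(size=100), "agitated")]
corpus = build_emotion_corpus(samples, dummy_extract)
```

The resulting `(feature_sequence, label)` pairs are what the screening step of claim 5 searches over.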
5. The method according to claim 1, wherein calculating the Euclidean distance between the processed corpus feature sequence and the audio feature sequence and screening the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence comprises:
calculating the Euclidean distance between each processed corpus feature sequence and the audio feature sequence using an ant colony algorithm, and screening out the corpus feature sequence with the smallest Euclidean distance to the audio feature sequence.
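The screening in claim 5 selects the stored corpus feature sequence closest to the query in Euclidean distance. The claim performs this search with an ant colony algorithm; the exhaustive argmin below is a simplified stand-in, and the three-dimensional feature vectors and labels are illustrative:

```python
import numpy as np

def screen_nearest(corpus, query):
    # corpus: list of (feature_sequence, emotion_label) pairs;
    # query: a feature sequence of the same length.
    # Return the pair with the smallest Euclidean distance to the query.
    distances = [np.linalg.norm(seq - query) for seq, _ in corpus]
    return corpus[int(np.argmin(distances))]

corpus = [
    (np.array([0.0, 0.0, 1.0]), "neutral"),
    (np.array([0.9, 0.1, 0.2]), "happy"),
    (np.array([0.1, 0.9, 0.8]), "sad"),
]
seq, label = screen_nearest(corpus, np.array([1.0, 0.0, 0.3]))
# per claim 6, the screened label becomes the target individual's emotion category
```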
6. The method according to claim 1, wherein the determining the emotion classification of the target individual based on the emotion label corresponding to the corpus feature sequence obtained by screening comprises:
taking the emotion category corresponding to the emotion label of the corpus feature sequence obtained by screening as the emotion category of the target individual.
7. The method of claim 1, wherein obtaining audio information of a target individual when reading a given corpus comprises:
acquiring, with a microphone, the audio information of the target individual when reading the given corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910823584.5A CN110619893A (en) | 2019-09-02 | 2019-09-02 | Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910823584.5A CN110619893A (en) | 2019-09-02 | 2019-09-02 | Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110619893A true CN110619893A (en) | 2019-12-27 |
Family
ID=68922188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910823584.5A Pending CN110619893A (en) | 2019-09-02 | 2019-09-02 | Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110619893A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113593532A (en) * | 2021-08-31 | 2021-11-02 | 竹间智能科技(上海)有限公司 | Speech emotion recognition model training method and electronic equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106297825A (en) * | 2016-07-25 | 2017-01-04 | 华南理工大学 | A speech emotion recognition method based on an integrated deep belief network
CN106790949A (en) * | 2015-11-20 | 2017-05-31 | 北京奇虎科技有限公司 | Method and device for configuring a voice feature library for malicious calls
CN106790950A (en) * | 2015-11-20 | 2017-05-31 | 北京奇虎科技有限公司 | Malicious call recognition method and device
CN109409496A (en) * | 2018-11-14 | 2019-03-01 | 重庆邮电大学 | An LDTW sequence similarity measurement method improved with the ant colony algorithm
CN109785863A (en) * | 2019-02-28 | 2019-05-21 | 中国传媒大学 | A deep belief network speech emotion recognition method and system
CN109841229A (en) * | 2019-02-24 | 2019-06-04 | 复旦大学 | A neonate cry recognition method based on dynamic time warping
CN109903781A (en) * | 2019-04-14 | 2019-06-18 | 湖南检信智能科技有限公司 | A pattern-matching sentiment analysis method
CN110084579A (en) * | 2018-01-26 | 2019-08-02 | 百度在线网络技术(北京)有限公司 | Resource processing method, device and system
Non-Patent Citations (1)
Title |
---|
HUANG, Tao: "Research on the Application of the Ant Colony Algorithm in Speech Recognition" (蚁群算法在语音识别中的应用研究), Journal of Wuhan University of Technology (Information & Management Engineering Edition) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Shahin et al. | Emotion recognition using hybrid Gaussian mixture model and deep neural network | |
Harb et al. | Voice-based gender identification in multimedia applications | |
Tong et al. | A comparative study of robustness of deep learning approaches for VAD | |
US10535000B2 (en) | System and method for speaker change detection | |
CN105895078A (en) | Speech recognition method used for dynamically selecting speech model and device | |
CN108831506B (en) | GMM-BIC-based digital audio tamper point detection method and system | |
Praksah et al. | Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier | |
Woubie et al. | Voice-quality Features for Deep Neural Network Based Speaker Verification Systems | |
CN110619893A (en) | Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal | |
Ge et al. | Speaker change detection using features through a neural network speaker classifier | |
Zhou et al. | Speech Emotion Recognition with Discriminative Feature Learning. | |
Khoury et al. | I-Vectors for speech activity detection. | |
Biagetti et al. | Robust speaker identification in a meeting with short audio segments | |
Jamil et al. | Influences of age in emotion recognition of spontaneous speech: A case of an under-resourced language | |
Zewoudie et al. | Short-and long-term speech features for hybrid hmm-i-vector based speaker diarization system | |
Ahsan | Physical features based speech emotion recognition using predictive classification | |
Khanum et al. | Speech based gender identification using feed forward neural networks | |
Xu et al. | Voiceprint recognition of Parkinson patients based on deep learning | |
Kanrar | Robust threshold selection for environment specific voice in speaker recognition | |
Odriozola et al. | An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods | |
Lakra et al. | Automated pitch-based gender recognition using an adaptive neuro-fuzzy inference system | |
Heittola | Computational Audio Content Analysis in Everyday Environments | |
Keyvanrad et al. | Feature selection and dimension reduction for automatic gender identification | |
Prabha et al. | Advanced Gender Recognition System Using Speech Signal | |
CN113870901B (en) | SVM-KNN-based voice emotion recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20191227 |