CN110415728B - Method and device for recognizing emotion voice

Method and device for recognizing emotion voice

Info

Publication number
CN110415728B
Authority
CN
China
Prior art keywords
network
voice
spectrogram
layer
emotion
Prior art date
Legal status
Active
Application number
CN201910690493.9A
Other languages
Chinese (zh)
Other versions
CN110415728A (en)
Inventor
崔明明
房建东
Current Assignee
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN201910690493.9A
Publication of CN110415728A
Application granted
Publication of CN110415728B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state


Abstract

The application provides a method and a device for recognizing emotional speech. The method comprises: acquiring a speech signal to be recognized; preprocessing the speech signal to be recognized according to a preprocessing rule to obtain multiple frames of preprocessed speech signals; acquiring a corresponding spectrogram signal from each frame of preprocessed speech signal; and inputting the spectrogram signals into a second network model with first optimization parameters to obtain the corresponding emotional speech type. The second network model comprises a first sub-network, used for extracting emotional speech features from the spectrogram signals and reducing their dimensionality, and a third sub-network, used for accelerating algorithm convergence and classifying the emotional speech features. A CNN model extracts features from the spectrogram, and an LSTM model performs temporal modeling on the feature data, which enhances the accuracy and robustness of emotion recognition in an open environment. The method reduces the network's processing load and the algorithm complexity, making it suitable for running the algorithm on embedded devices.

Description

Method and device for recognizing emotion voice
Technical Field
The application relates to the field of artificial intelligence, in particular to a training method and a device for recognizing emotion voice, and a method and a device for recognizing emotion voice.
Background
The purpose of emotion recognition is to give computers the ability to observe, understand, recognize and express emotions the way humans do, so that they can interact with people more naturally. Human communication has always relied mainly on speaking and listening, and speech remains the main medium of human communication because, beyond the textual content being expressed, it carries much additional information, such as the speaker's language, emotion, physical condition and sex; pure text understanding cannot provide this. The latent emotional information obtained by processing speech signals has wide application potential in fields such as analyzing cognitive states in teaching, analyzing patients' emotional states, early warning of danger in public areas, and visual perception aids for the blind. Speech emotion recognition has therefore become an important research topic in recent years as a key technology for intelligent interaction and affective computing.
Research on speech emotion recognition, both domestically and abroad, has made great progress. However, traditional acoustic features, prosodic features, spectrum-based features and voice quality features describe speech emotion only from the time-domain or frequency-domain perspective and cannot reflect the time-frequency emotional characteristics of a speech signal at the same time. Moreover, when algorithms designed for static images are applied to natural scenes, they fail to exploit dynamic sequence information effectively, so their robustness is poor and their practical performance needs improvement.
The current mainstream approach takes the spectrogram as the speech emotion feature, which reflects the time-frequency emotional characteristics of the speech signal simultaneously, and then uses a deep learning algorithm to extract and classify the emotional features. Although this improves recognition accuracy, the large amount of data to process and the high algorithm complexity place heavy demands on hardware computing performance; such algorithms are mostly deployed on high-performance servers and are therefore difficult to apply in real-life natural scenes.
Disclosure of Invention
The application provides a training method for recognizing emotional speech, a training device for recognizing emotional speech, a method for recognizing emotional speech, and a device for recognizing emotional speech, which address the problem that current emotion recognition algorithms require a large amount of computation.
In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:
the application provides a training method for recognizing emotion voice, which is characterized by comprising the following steps:
sequentially acquiring a group of sample spectrogram signals, wherein the sample spectrogram signals are divided into one of N groups according to N emotion voice types, the emotion voice types of the sample spectrogram signals in the same group are the same, the emotion voice types of the sample spectrogram signals in each group are different, and N is an integer greater than 1; the sample spectrogram signal comprises a tag for marking the emotion voice type;
respectively training the first network model by using each group of sample spectrogram signals to reach a preset training termination condition, thereby obtaining a first optimization parameter of the first network model;
wherein the first network model comprises: the first sub-network is used for extracting emotional voice features from the spectrogram signals and reducing dimensions, and the second sub-network is used for accelerating algorithm convergence and classifying the emotional voice features.
Optionally, before the sequentially acquiring a group of training speech signals, the method further includes:
acquiring a voice signal to be recognized;
preprocessing the voice signal to be recognized according to a preprocessing rule to obtain a multi-frame preprocessed voice signal;
generating a corresponding spectrogram signal based on each frame of preprocessed voice signals;
and marking the label on the spectrogram signal according to N emotion voice types to obtain N groups of sample spectrogram signals.
Optionally, the preprocessing rule includes: pre-emphasis rules, windowing framing rules, and endpoint detection rules;
the preprocessing the voice signal to be recognized according to the preprocessing rule to obtain the multi-frame preprocessed voice signal comprises the following steps:
pre-emphasis processing is carried out on the voice signal to be recognized according to a pre-emphasis rule, and a first voice signal is obtained;
windowing and framing the first voice signal according to a windowing and framing rule to obtain a plurality of frames of second voice signals;
and performing endpoint detection on each frame of second voice signal according to an endpoint detection rule to obtain a plurality of frames of preprocessed voice signals.
Optionally, the endpoint detection rule is specifically a time domain endpoint detection rule.
Optionally, the first sub-network includes: a first fully connected layer, a first activation function, a first overfitting-prevention layer, a second fully connected layer, a second activation function, a second overfitting-prevention layer, a third fully connected layer, and a third activation function.
Optionally, the second sub-network includes: a layer for changing the input data dimensionality, a layer for changing the input timestamp dimensionality, a first long short-term memory (LSTM) layer, a third overfitting-prevention layer, a second LSTM layer, a third LSTM layer, an output slicing layer, a layer for changing the LSTM output dimensionality, a fourth fully connected layer, a layer for changing the input label dimensionality, a loss function layer, and a layer outputting the classification accuracy;
wherein the output slicing layer is configured to cut out the end-of-sequence vector output by the second sub-network; the end-of-sequence vector is used to compute the error feedback against the label and correct the network weights, or to predict the emotional speech type.
Optionally, the preset training termination condition at least includes one of the following conditions:
condition one: the percentage of correctly classified samples among all test samples is greater than a preset first percentage threshold;
condition two: for each group of sample spectrogram signals, the percentage of correctly classified samples among the test samples of the corresponding group is greater than a preset second percentage threshold.
The application provides a training device for recognizing emotional speech, including:
a sample acquisition unit, used for sequentially acquiring a group of sample spectrogram signals, wherein the sample spectrogram signals are divided into one of N groups according to N emotional speech types, the emotional speech types of the sample spectrogram signals in the same group are the same, the emotional speech types of the sample spectrogram signals in different groups are different, and N is an integer greater than 1; the sample spectrogram signal comprises a tag marking the emotional speech type;
the training sample unit is used for respectively training the first network model by utilizing each group of sample spectrogram signals to reach a preset training termination condition so as to obtain a first optimization parameter of the first network model;
wherein the first network model comprises: the first sub-network is used for extracting emotional voice features from the spectrogram signals and reducing dimensions, and the second sub-network is used for accelerating algorithm convergence and classifying the emotional voice features.
The application provides a method for recognizing emotional speech, which comprises the following steps:
acquiring a voice signal to be recognized;
preprocessing the voice signal to be recognized according to a preprocessing rule to obtain a multi-frame preprocessed voice signal;
acquiring the corresponding spectrogram signal based on each frame of preprocessed voice signal;
respectively inputting the spectrogram signals into a second network model with first optimization parameters to obtain corresponding emotion voice types;
wherein the second network model comprises: the first sub-network is used for extracting emotional voice features from the spectrogram signals and reducing dimensions, and the third sub-network is used for accelerating algorithm convergence and classifying the emotional voice features.
The application provides a device for recognizing emotion voice, comprising:
the voice signal to be recognized unit is used for acquiring a voice signal to be recognized;
the preprocessing unit is used for preprocessing the voice signal to be recognized according to a preprocessing rule to obtain a multi-frame preprocessed voice signal;
the acquisition speech spectrogram signal unit is used for acquiring the corresponding speech spectrogram signal based on each frame of preprocessed speech signal;
the emotion voice type acquiring unit is used for respectively inputting the spectrogram signals into a second network model with first optimization parameters and acquiring corresponding emotion voice types;
wherein the second network model comprises: the first sub-network is used for extracting emotional voice features from the spectrogram signals and reducing dimensions, and the third sub-network is used for accelerating algorithm convergence and classifying the emotional voice features.
Based on the disclosure of the above embodiments, it can be known that the embodiments of the present application have the following beneficial effects:
the application provides a method and a device for recognizing emotional speech, wherein the method comprises the following steps: acquiring a voice signal to be recognized; preprocessing the voice signal to be recognized according to a preprocessing rule to obtain a multi-frame preprocessed voice signal; acquiring the corresponding spectrogram signal based on each frame of preprocessed voice signal; respectively inputting the spectrogram signals into a second network model with first optimization parameters to obtain corresponding emotion voice types; wherein the second network model comprises: the first sub-network is used for extracting emotional voice features from the spectrogram signals and reducing dimensions, and the third sub-network is used for accelerating algorithm convergence and classifying the emotional voice features.
Emotion recognition performed in combination with the brain has three main characteristics: temporal sequencing, randomness, and real-time operation. Based on these three characteristics, and taking an intelligent nursing robot as the application background, the method constructs an embedded intelligent front-end system for speech emotion recognition under open-environment conditions.
The method adopts LSTM-based dynamic emotion recognition: the LSTM cyclically processes the collected image sequence, learning and memorizing the correlations within the sequence, and the emotion judgment combines the information of each single image with the sequence correlation information, thereby enhancing the accuracy and robustness of emotion recognition in an open environment.
The method simplifies the CNN convolutional layers to reduce algorithm complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn relations within image sequences, adds a timestamp layer (cont) to handle correlation learning for LSTM image sequences of different lengths, and adds a slice layer that cuts out the end-of-sequence vector output by the second sub-network; this vector is used to compute the error feedback against the label and correct the network weights, or to predict the emotional speech type. The network's data processing load is greatly reduced and the algorithm complexity is lowered, so that the algorithm can run on embedded devices.
The method ports the speech emotion recognition algorithm and the network model trained on a server to an embedded platform (such as the Huawei Atlas 200 DK, a system on chip with a CPU, an NPU and an ISP), thereby realizing an intelligent front-end speech emotion recognition system.
Drawings
FIG. 1 is a flowchart of a training method for recognizing emotional speech according to an embodiment of the present disclosure;
FIG. 2 is a diagram of a first network model of a training method for recognizing emotional speech according to an embodiment of the present disclosure;
fig. 3 is a block diagram of a first sub-network provided in an embodiment of the present application;
FIG. 4 is a block diagram of a training apparatus for recognizing emotion speech according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for recognizing emotional speech according to an embodiment of the present application;
FIG. 6 is a diagram of a second network model of a method for recognizing emotional speech according to an embodiment of the present application;
FIG. 7 is a block diagram of the units of an apparatus for recognizing emotion speech according to an embodiment of the present application.
Detailed Description
Specific embodiments of the present application will be described in detail below with reference to the accompanying drawings, but the present application is not limited thereto.
It will be understood that various modifications may be made to the embodiments disclosed herein. Accordingly, the foregoing description should not be construed as limiting, but merely as exemplifications of embodiments. Those skilled in the art will envision other modifications within the scope and spirit of the application.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the application and, together with a general description of the application given above and the detailed description of the embodiments given below, serve to explain the principles of the application.
These and other characteristics of the present application will become apparent from the following description of preferred forms of embodiment, given as non-limiting examples, with reference to the attached drawings.
It should also be understood that, although the present application has been described with reference to some specific examples, a person of skill in the art shall certainly be able to achieve many other equivalent forms of application, having the characteristics as set forth in the claims and hence all coming within the field of protection defined thereby.
The above and other aspects, features and advantages of the present application will become more apparent in view of the following detailed description when taken in conjunction with the accompanying drawings.
Specific embodiments of the present application are described hereinafter with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are merely examples of the application, which can be embodied in various forms. Well-known and/or repeated functions and constructions are not described in detail, to avoid obscuring the application with unnecessary or redundant detail. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present application in virtually any appropriately detailed structure.
The specification may use the phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments," which may each refer to one or more of the same or different embodiments in accordance with the application.
The first embodiment provided by the application is an embodiment of a training method for recognizing emotional speech.
The present embodiment is described in detail below with reference to fig. 1, fig. 2 and fig. 3, wherein fig. 1 is a flowchart of a training method for recognizing emotion voice provided by the present embodiment; FIG. 2 is a diagram of a first network model of a training method for recognizing emotional speech according to an embodiment of the present disclosure; fig. 3 is a block diagram of a first sub-network according to an embodiment of the present disclosure.
Emotion recognition performed in combination with the brain has three main characteristics: temporal sequencing, randomness, and real-time operation. Based on these three characteristics, and taking an intelligent nursing robot as the application background, the embodiment of the application constructs an embedded front-end system for speech emotion recognition under open-environment conditions.
Referring to fig. 1, in step S101, a set of sample spectrogram signals is sequentially obtained.
A spectrogram is a two-dimensional image that describes how the spectral content of a speech signal changes over time. In a spectrogram, the horizontal axis represents time and the vertical axis represents the frequency components of the speech signal; the intensity of each frequency component at any moment is represented by the lightness of the color. Speech signal analysis commonly uses frequency-domain and time-domain methods; a spectrogram combines the two and dynamically displays the magnitude of the different frequency components at every moment, so the amount of information it carries is far greater than the sum of what a plain time-domain or frequency-domain representation carries. Because of its great value in speech analysis, the spectrogram is often called visible speech.
The spectrogram signal is information related to the voice acquired from the spectrogram.
In order to train the network model to recognize the emotion voice types, the embodiment of the disclosure divides the acquired spectrogram signals into N groups according to N emotion voice types before training to be used as training samples. The sample spectrogram signal is divided into one of N groups according to N emotion voice types, the emotion voice types of the sample spectrogram signals in the same group are the same, the emotion voice types of the sample spectrogram signals in each group are different, and N is an integer greater than 1; the sample spectrogram signal comprises a tag marking the emotion voice type.
Step S102, the first network model is respectively trained by utilizing each group of sample spectrogram signals to reach a preset training termination condition, so that first optimization parameters of the first network model are obtained.
The first network model, comprising: the first sub-network is used for extracting emotional voice features from the spectrogram signals and reducing dimensions, and the second sub-network is used for accelerating algorithm convergence and classifying the emotional voice features.
The CNN + LSTM network model (namely, the first network model) provided by the embodiment of the application unifies a CNN network model and an LSTM network model into one framework. The design aims are as follows: the CNN is good at image processing, while the LSTM is good at temporal modeling, and a deep LSTM also has the ability to map features into a separable space. To let these strengths complement each other, the CNN and the LSTM are combined: the CNN extracts features from the spectrogram, and the LSTM performs temporal modeling on the feature data. The embodiment of the application simplifies the AlexNet network model (a stack of several CNN layers) by removing several convolutional layers, which reduces the computation of the algorithm and adapts it to running on embedded devices.
Please refer to fig. 2, in which AlexNet is the first sub-network, and cont is a timestamp layer that passes the timing information to the LSTM network model for processing.
The first sub-network comprises: a first fully connected layer (fc1), a first activation function (relu4), a first overfitting-prevention layer (drop4), a second fully connected layer (fc2), a second activation function (relu5), a second overfitting-prevention layer (drop5), a third fully connected layer (fc3), and a third activation function (relu6); for example, a simplified AlexNet network model, as shown in FIG. 3.
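As an illustration only, the following is a minimal PyTorch sketch of such a stack of fully connected, ReLU and dropout layers; the layer widths, the input feature size and the dropout rate are assumptions for illustration and are not taken from the patent.

    # Hypothetical sketch of the first sub-network (fc1-relu4-drop4-fc2-relu5-drop5-fc3-relu6).
    # Layer widths and input size are assumptions, not values from the patent.
    import torch
    import torch.nn as nn

    class FirstSubNetwork(nn.Module):
        def __init__(self, in_features=4096, hidden=1024, out_features=256, p_drop=0.5):
            super().__init__()
            self.stack = nn.Sequential(
                nn.Linear(in_features, hidden),   # fc1
                nn.ReLU(),                        # relu4
                nn.Dropout(p_drop),               # drop4
                nn.Linear(hidden, hidden),        # fc2
                nn.ReLU(),                        # relu5
                nn.Dropout(p_drop),               # drop5
                nn.Linear(hidden, out_features),  # fc3
                nn.ReLU(),                        # relu6
            )

        def forward(self, x):
            # x: (batch, in_features) flattened spectrogram features
            return self.stack(x)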
the second sub-network comprising: changing an input data dimension (reshape-data), changing an input timestamp dimension (reshape-cm), a first long and short time cyclic recursive network layer (lstm1), a third prevention overfitting layer (lstm1-drop), a second long and short time cyclic recursive network layer (lstm2), a third long and short time cyclic recursive network layer (lstm3), an output slicing layer (slice), changing a long and short time cyclic recursive network dimension (reshape-lstm), a fourth fully connected layer (fc4), changing an input label dimension (reshape-label), a loss function layer (loss), and an output classification accuracy (accuracy).
Wherein the output slicing level (slice) is configured to segment a sequence end vector of the second sub-network output; and the sequence end vector is used for calculating error feedback with the label and correcting the network weight or predicting the emotion voice type. Such as the LSTM network model.
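For illustration, a minimal PyTorch sketch of such an LSTM head is given below; it omits the reshape and timestamp (cont) handling, the feature dimension, dropout rate and number of emotion classes are assumptions, and only the hidden size of 128 is taken from the training description that follows.

    # Hedged sketch of the second sub-network (lstm1 / lstm1-drop / lstm2 / lstm3, slice, fc4).
    import torch
    import torch.nn as nn

    class SecondSubNetwork(nn.Module):
        def __init__(self, feat_dim=256, hidden=128, num_classes=7, p_drop=0.5):
            super().__init__()
            self.lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.drop1 = nn.Dropout(p_drop)            # lstm1-drop
            self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
            self.lstm3 = nn.LSTM(hidden, hidden, batch_first=True)
            self.fc4 = nn.Linear(hidden, num_classes)  # fc4

        def forward(self, x):
            # x: (batch, seq_len=10, feat_dim) sequence of per-frame CNN features
            out, _ = self.lstm1(x)
            out = self.drop1(out)
            out, _ = self.lstm2(out)
            out, _ = self.lstm3(out)
            last = out[:, -1, :]       # "slice": keep only the end-of-sequence vector
            return self.fc4(last)      # logits; loss/accuracy layers follow during training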
For training, a three-layer LSTM network is selected, the number of hidden-layer neurons is set to 128, and the time-sequence length is set to 10. Mini-batch gradient descent is used with a batch size of 10, the number of training iterations is 80000, the LSTM gradient clipping threshold is 5, the Adam optimization method is used, and the learning rate is set to 0.0005.
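A minimal sketch of such a training setup is shown below, assuming PyTorch and a hypothetical data loader; it only mirrors the stated hyperparameters (Adam, learning rate 0.0005, batch size 10, 80000 iterations, gradient clipping threshold 5) and is not the patent's actual training code.

    # Hypothetical training loop; `model` and `loader` are assumed to be supplied by the caller.
    import torch
    import torch.nn as nn

    def train(model, loader, iterations=80000, lr=0.0005, clip=5.0, device="cpu"):
        model.to(device).train()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        it = 0
        while it < iterations:
            for seqs, labels in loader:          # seqs: (10, seq_len, feat), labels: (10,)
                logits = model(seqs.to(device))
                loss = criterion(logits, labels.to(device))
                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # LSTM gradient clipping
                optimizer.step()
                it += 1
                if it >= iterations:
                    break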
The preset training termination condition at least comprises one of the following conditions:
and under the condition one, the percentage of the output correct sample number in the total test sample number is larger than a preset first percentage threshold value.
For example, the preset first percentage threshold is 90%.
Condition two: for each group of sample spectrogram signals, the percentage of correctly classified samples among the test samples of the corresponding group is greater than a preset second percentage threshold.
For example, the preset second percentage threshold is 90%.
In the embodiment of the application, 5-fold cross validation is adopted. All the pictures, consisting of 7000 sample spectrogram sequences (each sequence containing 10 spectrograms), are divided into 5 parts; each time, the i-th part is taken as the test set and the remaining parts as the training set (the accuracy is computed per spectrogram picture, i.e., every picture in every spectrogram sequence is treated as one sample, and the original label of a spectrogram sequence means that all pictures of that sequence are marked with that label), and the corresponding accuracy C_i is obtained. The final accuracy of the model on the emotional spectrogram data set is then
C = (C_1 + C_2 + C_3 + C_4 + C_5) / 5.
The recognition accuracy obtained by using this method is more reliable.
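For illustration, a short sketch of this 5-fold averaging is given below; the train_and_evaluate callback that returns the per-fold picture-level accuracy C_i is a hypothetical placeholder.

    # Illustrative 5-fold cross-validation accuracy averaging.
    import numpy as np

    def five_fold_accuracy(sequences, labels, train_and_evaluate, k=5):
        idx = np.arange(len(sequences))
        folds = np.array_split(idx, k)
        accs = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            accs.append(train_and_evaluate(train_idx, test_idx))  # C_i
        return float(np.mean(accs))  # C = (1/k) * sum(C_i)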
The embodiment of the application simplifies the CNN convolutional layers to reduce algorithm complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn relations within image sequences, adds a timestamp layer (cont) to handle correlation learning for LSTM image sequences of different lengths, and adds an output slicing layer (slice) that cuts out the end-of-sequence vector output by the second sub-network; this vector is used to compute the error feedback against the label and correct the network weights, or to predict the emotional speech type. The network's data processing load is greatly reduced and the algorithm complexity is lowered, so that the algorithm can run on embedded devices.
Before the group of training voice signals are sequentially acquired, the method further comprises the following steps:
step 100-1, acquiring a voice signal to be recognized.
And step 100-2, preprocessing the voice signal to be recognized according to a preprocessing rule to obtain a multi-frame preprocessed voice signal.
Because of the physical characteristics of the speech signal itself, the physical characteristics of the speaker's vocal organs, and the environment in which the speech is recorded, the speaker's original speech cannot be processed directly and must first go through a speech preprocessing step before subsequent processing.
The preprocessing rule comprises: pre-emphasis rules, windowing framing rules, and endpoint detection rules.
The method for preprocessing the voice signal to be recognized according to the preprocessing rule to obtain the multi-frame preprocessed voice signal comprises the following steps:
and step 100-2-1, performing pre-emphasis processing on the voice signal to be recognized according to a pre-emphasis rule to obtain a first voice signal.
Because of glottal excitation and lip-and-nose radiation, the average power spectrum of a speech signal is sharply attenuated in the high-frequency band: the higher the frequency, the smaller the corresponding spectral value. After the vocal cords vibrate to produce sound, the speech is shaped by the vocal tract and then transmitted to the listener's ears, and during this process the high-frequency part of the speech is attenuated to some extent. Compared with the low-frequency part of the speech signal, the high-frequency information is harder to acquire, yet it can represent a person's emotional information.
Pre-emphasis processes the speech signal to be recognized with a digital filter and amplifies the high-frequency information of the speech signal.
Formula of the pre-emphasis rule:
H(z) = 1 - μ·z^(-1), i.e. y(n) = x(n) - μ·x(n-1);
wherein:
x(n) is the input speech signal to be recognized;
z is the frequency-domain variable;
H(z) is the pre-emphasis filter;
μ is the pre-emphasis coefficient; μ typically takes a value in the range 0.9 to 0.97, and μ = 0.9375 in the embodiment of the application;
y(n) is the first speech signal.
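A minimal sketch of this pre-emphasis step, assuming the speech samples are held in a 1-D numpy array, is:

    # Pre-emphasis sketch implementing y(n) = x(n) - mu * x(n-1) with mu = 0.9375.
    import numpy as np

    def pre_emphasis(x, mu=0.9375):
        y = np.empty_like(x, dtype=float)
        y[0] = x[0]
        y[1:] = x[1:] - mu * x[:-1]
        return y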
And step 100-2-2, performing windowing and framing on the first voice signal according to a windowing and framing rule to obtain multiple frames of second voice signals.
A speech signal is non-stationary, so it cannot be analyzed and processed directly. However, the non-stationarity is caused by the resonances of the human vocal system, whose movements are relatively slow, so within a short time range of about 20 ms the speech signal can be regarded as a quasi-steady, i.e. short-time stationary, process. Therefore, before a speech signal is analyzed and processed, it needs short-time processing: the signal is divided into several short segments, and each short segment is called a frame of the speech signal.
To keep the framed speech signal smooth and to avoid, as much as possible, losing part of the signal's information through framing, adjacent frames partially overlap each other; that is, the frame shift is smaller than the frame length, so neighbouring frames share a small portion of samples.
Windowing and framing a speech signal can be viewed as applying a sliding window function to the signal, where the sliding step of the window function is the frame shift. If the speech signal has already been framed, windowing simply means applying a window function to each frame of the signal.
Formula of the windowing and framing rule:
S_ω(n) = s(n) * ω(n);
wherein:
s(n) is the first speech signal;
ω(n) is the window function. Commonly used window functions include the rectangular window, the Hanning window and the Hamming window; the algorithm of the embodiment of the application adopts the Hamming window as the window function for windowing and framing, because it has good smoothness and largely avoids the truncation effect. The expression of the Hamming window is:
ω(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1;
wherein N is the frame length.
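A minimal sketch of framing with a Hamming window is given below; the frame length and frame shift are typical values (about 25 ms frames with a 10 ms shift at 16 kHz) assumed for illustration, not values specified in the patent.

    # Framing and Hamming-windowing sketch; signal is a 1-D numpy array of samples.
    import numpy as np

    def frame_and_window(signal, frame_len=400, frame_shift=160):
        if len(signal) < frame_len:
            raise ValueError("signal shorter than one frame")
        window = np.hamming(frame_len)  # 0.54 - 0.46*cos(2*pi*n/(N-1))
        n_frames = 1 + (len(signal) - frame_len) // frame_shift
        frames = np.stack([
            signal[i * frame_shift: i * frame_shift + frame_len] * window
            for i in range(n_frames)
        ])
        return frames  # shape: (n_frames, frame_len)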
And step 100-2-3, performing endpoint detection on each frame of second voice signal according to an endpoint detection rule to obtain a plurality of frames of preprocessed voice signals.
Endpoint detection is an important step in speech signal preprocessing. Because recording environments are imperfect, the speech signal may contain interference such as environmental noise, long periods of silence, and useless signals at the head and tail, which would strongly affect the subsequent acoustic feature extraction and the generated spectrogram, and hence the recognition ability of the speech emotion recognition system. The purpose of endpoint detection is to detect the start and end points of effective speech in the signal, so that invalid silence and environmental noise can be removed and the negative influence of interference on the subsequent work is reduced as much as possible.
Two commonly used methods of endpoint detection are time domain endpoint detection and frequency domain endpoint detection.
The time-domain endpoint detection method usually relies on volume; it requires little computation but easily misjudges unvoiced parts.
Frequency-domain endpoint detection can be further divided into two methods: one is based on spectral variability, i.e. a comparison rule on how the spectrum of the sound changes, which is used as the judgment criterion; the other is based on the entropy of the spectrum, called spectral entropy for short, where the spectral entropy of voiced parts is generally smaller.
The endpoint detection rule in the embodiment of the application is specifically a time-domain endpoint detection rule. It mainly uses the volume, with the zero-crossing rate as an important additional detection parameter; it requires little computation and runs fast, while to a certain extent avoiding the misjudgments caused by using the volume alone for endpoint detection.
The method specifically comprises the following steps:
and step 100-2-3-1, judging whether the value of the second voice signal is larger than a preset volume threshold value.
The preset volume threshold is an empirical value. Since the subsequent comprehensive detection is combined with the zero-crossing rate, the threshold can be set relatively loosely, i.e. on the high side.
Step 100-2-3-2, if yes, the second speech signal is a voiced segment.
A voiced segment is a part of the speech where a person's voice has a relatively high level (decibels).
Step 100-2-3-3, if not, the second speech signal is a silent segment.
A silent segment refers to a segment with very low decibels: approximately no sound, environmental noise, or unvoiced sound.
Whether a low-volume part is unvoiced sound (speech with low decibels) can be distinguished by the short-time zero-crossing rate. In an indoor environment, the zero-crossing rate of unvoiced speech is generally significantly higher than that of environmental noise and silence; therefore a zero-crossing-rate threshold is preset, parts above the threshold are considered unvoiced speech, and parts below it are considered environmental noise or silence.
For convenience, the start and end time points of the voiced part detected with the preset volume threshold are called the voiced start point and the voiced end point. Moving one frame forward from the voiced start point, it is judged whether the speech signal value is larger than the preset volume threshold; if so, the frame is regarded as voiced speech and becomes the new voiced start point; if not, the part before that point is regarded as environmental noise, silence, or unvoiced speech, and whether it is unvoiced speech is judged with the preset zero-crossing-rate threshold. The procedure starting from the voiced end point is analogous and is not described again.
The short-time zero-crossing rate refers to the number of times that a waveform passes through a zero point in one frame of a speech signal.
Formula of the short-time zero-crossing rate:
Z = Σ_{t=1}^{T-1} π{ s_t · s_{t-1} < 0 };
wherein:
s_t is the value of the t-th sampling point;
T is the frame length;
π{A} equals 1 when A is true and 0 when A is false.
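A simplified per-frame sketch of this combined volume and zero-crossing-rate decision is shown below; both thresholds are empirical assumptions, not values from the patent.

    # Frame-level voiced / unvoiced / silence decision combining volume with the zero-crossing rate.
    import numpy as np

    def classify_frame(frame, vol_thresh=0.05, zcr_thresh=0.25):
        volume = np.mean(np.abs(frame))
        # short-time zero-crossing count: number of sign changes within the frame
        zcr = np.sum(frame[1:] * frame[:-1] < 0) / len(frame)
        if volume > vol_thresh:
            return "voiced"
        if zcr > zcr_thresh:
            return "unvoiced"          # low volume but high zero-crossing rate
        return "silence_or_noise"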
And step 100-3, generating the corresponding spectrogram signal based on each frame of preprocessed voice signal.
Traditional feature extraction methods use various hand-designed filter banks to extract features, which causes information loss in the frequency domain. To avoid this problem, the embodiment of the application provides the CNN + LSTM network model and feeds the speaker's spectrogram directly into the network model, so that the spectral information of the speech signal is retained to the greatest extent.
Suppose the discrete speech signal x(n), after framing, is denoted x_n(m), m = 0, 1, …, N-1, where n is the frame index, m is the index of the sampling point within the frame, and N is the frame length.
The short-time Fourier transform of the signal x(n) is:
X_n(e^(jω)) = Σ_{m=-∞}^{+∞} x(m)·ω(n-m)·e^(-jωm);
wherein ω (n) is represented as a window function;
The discrete-time Fourier transform (DTFT) of the signal x(n) is:
X(e^(jω)) = Σ_{n=-∞}^{+∞} x(n)·e^(-jωn);
the formula of the Discrete Fourier Transform (DFT) is as follows:
Figure BDA0002147725620000123
wherein, for 0 ≤ k ≤ N-1, X(n, k) is the short-time amplitude spectrum estimate of x(n). The spectral energy density function P(n, k) is:
P(n, k) = |X(n, k)|²;
taking n as the abscissa and k as the ordinate and expressing the value of P(n, k) with gray scale or color yields a two-dimensional image, which is the spectrogram. Applying the transform 10·log10(P(n, k)) yields the spectrogram in decibels.
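For illustration, a minimal sketch of computing such a decibel spectrogram from the windowed frames (reusing the frame_and_window helper sketched earlier) is:

    # Spectrogram sketch: short-time DFT of each windowed frame, P(n,k) = |X(n,k)|^2, in decibels.
    import numpy as np

    def spectrogram_db(signal, frame_len=400, frame_shift=160, eps=1e-10):
        frames = frame_and_window(signal, frame_len, frame_shift)   # (n_frames, frame_len)
        spectrum = np.fft.rfft(frames, axis=1)                      # X(n, k)
        power = np.abs(spectrum) ** 2                               # P(n, k)
        return 10.0 * np.log10(power + eps)                         # dB spectrogram, axes (n, k)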
And step 100-4, marking the label on the spectrogram signal according to N emotion voice types, and acquiring N groups of sample spectrogram signals.
Emotion recognition performed in combination with the brain has three main characteristics: temporal sequencing, randomness, and real-time operation. Based on these three characteristics, and taking an intelligent nursing robot as the application background, the method constructs an embedded intelligent front-end system for speech emotion recognition under open-environment conditions.
The method adopts LSTM-based dynamic emotion recognition: the LSTM cyclically processes the collected image sequence, learning and memorizing the correlations within the sequence, and the emotion judgment combines the information of each single image with the sequence correlation information, thereby enhancing the accuracy and robustness of emotion recognition in an open environment.
The method simplifies the CNN convolutional layers to reduce algorithm complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn relations within image sequences, adds a timestamp layer (cont) to handle correlation learning for LSTM image sequences of different lengths, and adds a slice layer that cuts out the end-of-sequence vector output by the second sub-network; this vector is used to compute the error feedback against the label and correct the network weights, or to predict the emotional speech type. The network's data processing load is greatly reduced and the algorithm complexity is lowered, so that the algorithm can run on embedded devices.
The method ports the speech emotion recognition algorithm and the network model trained on a server to an embedded platform (such as the Huawei Atlas 200 DK, a system on chip with a CPU, an NPU and an ISP), thereby realizing an intelligent front-end speech emotion recognition system.
Corresponding to the first embodiment provided by the present application, the present application also provides a second embodiment, namely a training device for recognizing emotional voices. Since the second embodiment is basically similar to the first embodiment, the description is simple, and the relevant portions should be referred to the corresponding description of the first embodiment. The device embodiments described below are merely illustrative.
FIG. 4 shows an embodiment of a training apparatus for recognizing emotion speech provided by the present application. FIG. 4 is a block diagram of a training apparatus for recognizing emotion speech according to an embodiment of the present application.
Referring to fig. 4, the present application provides a training device for recognizing emotion voices, including: a sample unit 401 is obtained, and a sample unit 402 is trained.
The acquiring sample unit 401 is configured to sequentially acquire a group of sample spectrogram signals, where the sample spectrogram signals are divided into one of N groups according to N emotion voice types, emotion voice types of the sample spectrogram signals in the same group are the same, emotion voice types of the sample spectrogram signals in each group are different, and N is an integer greater than 1; the sample spectrogram signal comprises a tag for marking the emotion voice type;
a training sample unit 402, configured to respectively train the first network model with each group of sample spectrogram signals to reach a preset training termination condition, so as to obtain a first optimization parameter of the first network model;
wherein the first network model comprises: the first sub-network is used for extracting emotional voice features from the spectrogram signals and reducing dimensions, and the second sub-network is used for accelerating algorithm convergence and classifying the emotional voice features.
Optionally, the apparatus further comprises: a first pre-processing unit;
in the pretreatment unit, comprises
The acquisition subunit is used for acquiring a voice signal to be recognized;
the first preprocessing subunit is used for preprocessing the voice signal to be recognized according to a preprocessing rule to obtain a multi-frame preprocessed voice signal;
the generating speech spectrogram signal subunit is used for generating the corresponding speech spectrogram signal based on each frame of preprocessed speech signal;
and the acquisition sample spectrogram signal subunit is used for marking the label on the spectrogram signal according to the N emotion voice types and acquiring N groups of sample spectrogram signals.
Optionally, the preprocessing rule includes: pre-emphasis rules, windowing framing rules, and endpoint detection rules;
in the preprocessing subunit, comprising:
the pre-emphasis subunit is used for performing pre-emphasis processing on the voice signal to be recognized according to a pre-emphasis rule to acquire a first voice signal;
formula of the pre-emphasis rule:
H(z) = 1 - μ·z^(-1), i.e. y(n) = x(n) - μ·x(n-1);
wherein:
x(n) is the input speech signal to be recognized;
z is the frequency-domain variable;
H(z) is the pre-emphasis filter;
μ is the pre-emphasis coefficient; μ typically takes a value in the range 0.9 to 0.97, and μ = 0.9375 in the embodiment of the application;
y(n) is the first speech signal.
The windowing framing subunit is used for carrying out windowing framing on the first voice signal according to a windowing framing rule to obtain a plurality of frames of second voice signals;
formula of the windowing and framing rule:
S_ω(n) = s(n) * ω(n);
wherein:
s(n) is the first speech signal;
ω(n) is the window function. Commonly used window functions include the rectangular window, the Hanning window and the Hamming window; the algorithm of the embodiment of the application adopts the Hamming window as the window function for windowing and framing, because it has good smoothness and largely avoids the truncation effect. The expression of the Hamming window is:
ω(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1;
wherein N is the frame length.
And the endpoint detection subunit is used for carrying out endpoint detection on each frame of second voice signal according to an endpoint detection rule to acquire a plurality of frames of preprocessed voice signals.
Optionally, the endpoint detection rule is specifically a time domain endpoint detection rule.
In the endpoint detection subunit, comprising:
and the judging subunit is used for judging whether the value of the second voice signal is greater than a preset volume threshold value.
The preset volume threshold is an empirical value. Since the subsequent comprehensive detection is combined with the zero-crossing rate, the threshold can be set relatively loosely, i.e. on the high side.
And the voiced sound stage judging subunit is used for judging that the second voice signal is a voiced sound stage if the output result of the judging subunit is 'yes'.
A voiced segment is a part of the speech where a person's voice has a relatively high level (decibels).
And a silent-segment subunit, used for determining that the second speech signal is a silent segment if not.
A silent segment refers to a segment with very low decibels: approximately no sound, environmental noise, or unvoiced sound.
Whether a low-volume part is unvoiced sound (speech with low decibels) can be distinguished by the short-time zero-crossing rate. In an indoor environment, the zero-crossing rate of unvoiced speech is generally significantly higher than that of environmental noise and silence; therefore a zero-crossing-rate threshold is preset, parts above the threshold are considered unvoiced speech, and parts below it are considered environmental noise or silence.
Optionally, the first sub-network includes: a first fully connected layer (fc1), a first activation function (relu4), a first overfitting-prevention layer (drop4), a second fully connected layer (fc2), a second activation function (relu5), a second overfitting-prevention layer (drop5), a third fully connected layer (fc3), and a third activation function (relu6).
Optionally, the second sub-network includes: a layer changing the input data dimension (reshape-data), a layer changing the input timestamp dimension (reshape-cm), a first LSTM layer (lstm1), a third overfitting-prevention layer (lstm1-drop), a second LSTM layer (lstm2), a third LSTM layer (lstm3), an output slicing layer (slice), a layer changing the LSTM output dimension (reshape-lstm), a fourth fully connected layer (fc4), a layer changing the input label dimension (reshape-label), a loss function layer (loss), and a layer outputting the classification accuracy (accuracy).
The output slicing layer (slice) is configured to cut out the end-of-sequence vector output by the second sub-network; the end-of-sequence vector is used to compute the error feedback against the label and correct the network weights, or to predict the emotional speech type. The second sub-network is, for example, an LSTM network model.
Optionally, the preset training termination condition at least includes one of the following conditions:
the method comprises the following steps that under the condition one, the percentage of the output correct sample number in the total test sample number is larger than a preset first percentage threshold;
and secondly, the percentage of the correct sample number output by each group of sample spectrogram signals to the total test sample number of the corresponding group is larger than a preset second percentage threshold.
Emotion recognition performed in combination with the brain has three main characteristics: temporal sequencing, randomness, and real-time operation. Based on these three characteristics, and taking an intelligent nursing robot as the application background, the method constructs an embedded intelligent front-end system for speech emotion recognition under open-environment conditions.
The method adopts LSTM-based dynamic emotion recognition: the LSTM cyclically processes the collected image sequence, learning and memorizing the correlations within the sequence, and the emotion judgment combines the information of each single image with the sequence correlation information, thereby enhancing the accuracy and robustness of emotion recognition in an open environment.
The method simplifies the CNN convolutional layers to reduce algorithm complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn relations within image sequences, adds a timestamp layer (cont) to handle correlation learning for LSTM image sequences of different lengths, and adds a slice layer that cuts out the end-of-sequence vector output by the second sub-network; this vector is used to compute the error feedback against the label and correct the network weights, or to predict the emotional speech type. The network's data processing load is greatly reduced and the algorithm complexity is lowered, so that the algorithm can run on embedded devices.
The method ports the speech emotion recognition algorithm and the network model trained on a server to an embedded platform (such as the Huawei Atlas 200 DK, a system on chip with a CPU, an NPU and an ISP), thereby realizing an intelligent front-end speech emotion recognition system.
The third embodiment provided by the application is an embodiment of a method for recognizing emotional speech. Since the third embodiment is related to the first embodiment, the description is relatively simple; for the relevant portions, refer to the corresponding description of the first embodiment. The embodiments described below are merely illustrative.
The present embodiment is described in detail below with reference to fig. 5 and fig. 6, wherein fig. 5 is a flowchart of a method for recognizing emotional voices according to the present embodiment; FIG. 6 is a diagram of a second network model of a method for recognizing emotional speech according to an embodiment of the present application.
Referring to fig. 5, in step S501, a speech signal to be recognized is obtained;
step S502, preprocessing the voice signal to be recognized according to a preprocessing rule to obtain a multi-frame preprocessed voice signal;
step S503, acquiring the corresponding spectrogram signal based on each frame of preprocessed voice signal;
step S504, inputting the spectrogram signals into a second network model with first optimization parameters respectively, and acquiring corresponding emotion voice types;
wherein the second network model comprises: the first sub-network is used for extracting emotional voice features from the spectrogram signals and reducing dimensions, and the third sub-network is used for accelerating algorithm convergence and classifying the emotional voice features.
Please refer to fig. 6, in which AlexNet is the first sub-network.
The first sub-network comprises: a first fully connected layer (fc1), a first activation function (relu4), a first overfitting-prevention layer (drop4), a second fully connected layer (fc2), a second activation function (relu5), a second overfitting-prevention layer (drop5), a third fully connected layer (fc3), and a third activation function (relu6).
Optionally, the third sub-network includes: a layer changing the input data dimension (reshape-data), a layer changing the input timestamp dimension (reshape-cm), a first LSTM layer (lstm1), a third overfitting-prevention layer (lstm1-drop), a second LSTM layer (lstm2), a third LSTM layer (lstm3), a layer changing the LSTM output dimension (reshape-lstm), a fourth fully connected layer (fc4), and a layer calculating the class probability (softmax-loss).
The first network model and the second network model are essentially the same in algorithmic principle; fig. 2 and fig. 6 are data flow diagrams drawn by inputting a training model file (train_val.prototxt) and a deployment model file (deployment.prototxt) into Netscope (a drawing tool provided by Caffe).
First, the deployment model file is formed by deleting certain parts of the train_val.prototxt file; owing to the nature of the two files, the training-related parts exist only inside train_val.prototxt.
The last layers of the first network model, used for training, are the loss function layer (loss) and the classification accuracy layer (accuracy), while the last layer of the second network model, used for deployment, is the layer calculating the class probability (softmax-loss). All of these are loss-type layers and none of them has any weight parameters. Inspection of the Caffe source code shows that the final layers of both network models perform softmax regression: the class-probability layer (softmax-loss) directly computes the part up to the probability, while the part after the probability is computed in the loss function layer (loss) and the classification accuracy layer (accuracy).
Since the second network model does not need training, the training-related part after the probability is deleted. By comparison with the APIs defined by TensorFlow, where training and deployment use the same network-layer API and the distinction is handled by conditional logic inside the network interface, some Caffe network layers define two sets of APIs, and different APIs are used for training and for deployment.
In addition, as can be seen from fig. 2 and fig. 6, compared with the structure of the second network model, the structure of the first network model additionally includes an output slice layer (slice) and a layer for changing the LSTM output dimension (reshape-lstm) between the third LSTM layer (lstm3) and the final fourth fully connected layer (fc4). The output slice layer (slice) is responsible for cutting out the last element of the output sequence of the third LSTM layer (lstm3), where each element in the sequence represents one image feature. The reshape-lstm layer is responsible for changing the dimension of the data output by the slice layer so that it matches the input dimension expected by the fourth fully connected layer (fc4). The reason for this arrangement is as follows: the input and the output of the second sub-network are time series of the same length, and in the output series each element is affected by the input elements that precede it in the same series. The last element, cut out by the output slice layer (slice), is therefore affected by all the input elements of the series and, compared with the other elements of the same series, it best represents the characteristics of the whole series. The first network model uses this element value together with the label to calculate the error and then the "back" part of the probability computation.
In the second network model shown in fig. 6, the output slice layer (slice) is not included: it has no network weights that need to be trained and is not itself a neural-network operation, and omitting it means that every element in the output sequence of the third LSTM layer (lstm3) is output. Of course, the output slice layer (slice) may be added so that only the last element of the output sequence of the third LSTM layer (lstm3) is output. The practical difference is that the former gives a recognition-result probability for every frame of speech collected by the microphone, while the latter gives a recognition-result probability for every time series (10 frames) of collected speech.
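A small sketch of the two deployment variants discussed above, using PyTorch tensors with assumed sizes (the 10-frame sequence, 256 hidden units and 6 emotion classes are illustrative, not taken from the patent):

```python
import torch

seq_len, hidden, num_classes = 10, 256, 6                  # assumed sizes
lstm3_out = torch.randn(1, seq_len, hidden)                # output sequence of lstm3 for one window
fc4 = torch.nn.Linear(hidden, num_classes)

# Without the slice layer (Fig. 6 as drawn): one probability vector per frame.
per_frame = torch.softmax(fc4(lstm3_out), dim=-1)          # shape (1, 10, 6)

# With the slice layer added: keep only the last element, which has seen every
# preceding frame and stands for the whole 10-frame sequence.
last = lstm3_out[:, -1, :]                                 # output slicing (slice)
per_sequence = torch.softmax(fc4(last), dim=-1)            # shape (1, 6)
```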
Emotion recognition by the human brain has three main characteristics: temporality, randomness, and real-time performance. Based on these three characteristics, and taking an intelligent nursing robot as the application background, the method constructs an embedded intelligent front-end recognition system for speech emotion under open-environment conditions.
The method adopts an LSTM-based dynamic emotion recognition approach: the LSTM cyclically takes in the image sequence, learns and memorizes the correlation information within the sequence, and emotion judgment is made by combining the information of each single image with the sequence correlation information, thereby enhancing the accuracy and robustness of emotion recognition in an open environment.
The method simplifies the CNN convolution layers to reduce algorithm complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn relations within image sequences, adds the timestamp layer cont so that the LSTM can learn correlations over image sequences of different lengths, and adds the slice layer for cutting out the sequence-end vector output by the second sub-network; this sequence-end vector is used together with the label to calculate the error feedback and correct the network weights, or to predict the emotion voice type. The data-processing load of the network is thereby greatly reduced, lowering the algorithm complexity so that the algorithm can run on embedded devices.
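The cont timestamp layer is not specified in detail in the patent; the sketch below only illustrates the common convention for such a sequence-continuation indicator (0 at the first frame of each sequence, 1 elsewhere), with a 10-frame sequence length assumed.

```python
import numpy as np

def make_cont(num_frames, seq_len=10):
    """Sequence-continuation indicator: 0 marks the first frame of each sequence
    (state reset), 1 marks a continuation frame. The exact encoding is assumed."""
    cont = np.ones(num_frames, dtype=np.int64)
    cont[::seq_len] = 0
    return cont

print(make_cont(25))   # zeros at frames 0, 10 and 20; ones elsewhere
```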
The method ports the speech emotion recognition algorithm and the network model trained on a server to an embedded platform (such as a Huawei Atlas 200 DK, a system on chip with a CPU, an NPU and an ISP), so as to realize an intelligent front-end speech emotion recognition system.
Corresponding to the third embodiment provided by the present application, the present application also provides a fourth embodiment, namely an apparatus for recognizing emotional voices. Since the fourth embodiment is basically similar to the third embodiment, its description is relatively brief, and for related details reference may be made to the corresponding description of the third embodiment. The apparatus embodiment described below is merely illustrative.
FIG. 7 shows an embodiment of an apparatus for recognizing emotional speech provided by the present application. FIG. 7 is a block diagram of the units of an apparatus for recognizing emotion speech according to an embodiment of the present application.
Referring to fig. 7, the present application provides an apparatus for recognizing emotional voices, including: the device comprises a unit 701 for acquiring a voice signal to be recognized, a preprocessing unit 702, a unit 703 for acquiring a spectrogram signal, and a unit 704 for acquiring an emotion voice type.
A to-be-recognized voice signal obtaining unit 701 configured to obtain a to-be-recognized voice signal;
a preprocessing unit 702, configured to preprocess the voice signal to be recognized according to a preprocessing rule, to obtain a multi-frame preprocessed voice signal;
an obtaining spectrogram signal unit 703, configured to obtain a corresponding spectrogram signal based on each frame of preprocessed voice signal;
an emotion voice type obtaining unit 704, configured to input the spectrogram signals into a second network model with first optimized parameters, respectively, and obtain corresponding emotion voice types;
wherein the second network model comprises: a first sub-network for extracting emotion voice features from the spectrogram signals and reducing their dimensionality, and a third sub-network for accelerating algorithm convergence and classifying the emotion voice features.
The first sub-network comprises: a first fully connected layer (fc1), a first activation function (relu4), a first overfitting-prevention layer (drop4), a second fully connected layer (fc2), a second activation function (relu5), a second overfitting-prevention layer (drop5), a third fully connected layer (fc3), and a third activation function (relu6).
Optionally, the third sub-network includes: a layer for changing the input data dimension (reshape-data), a layer for changing the input timestamp dimension (reshape-cm), a first long short-term memory (LSTM) layer (lstm1), a third overfitting-prevention layer (lstm1-drop), a second LSTM layer (lstm2), a third LSTM layer (lstm3), a layer for changing the LSTM output dimension (reshape-lstm), a fourth fully connected layer (fc4), and a layer for calculating class probabilities (softmax-loss).
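Purely as an illustration of what the preprocessing unit 702 and the spectrogram-acquisition unit 703 might do, the NumPy sketch below applies pre-emphasis, Hamming-window framing, a crude energy-based endpoint check and a per-frame log-magnitude spectrum; every numeric parameter (sampling rate, frame length, hop, threshold, FFT size) is an assumption, and the spectrogram images used by the patent may be computed differently.

```python
import numpy as np

def preprocess_and_spectrogram(signal, frame_len=400, hop=160,
                               pre_emphasis=0.97, energy_thresh=1e-4):
    """Illustrative stand-in for units 702/703: pre-emphasis, windowed framing,
    simple energy-based endpoint detection, and a log-magnitude spectrum per frame."""
    # Pre-emphasis: boost high frequencies.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # Windowed framing with a Hamming window.
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])

    # Crude endpoint detection: keep frames whose short-time energy exceeds a threshold.
    energy = (frames ** 2).mean(axis=1)
    voiced = frames[energy > energy_thresh]

    # Spectrogram signal: log magnitude spectrum of each retained frame.
    spec = np.log(np.abs(np.fft.rfft(voiced, n=512, axis=1)) + 1e-10)
    return spec                     # shape: (n_voiced_frames, 257)

spec = preprocess_and_spectrogram(np.random.randn(16000))   # one second of dummy 16 kHz audio
```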
Emotion recognition by the human brain has three main characteristics: temporality, randomness, and real-time performance. Based on these three characteristics, and taking an intelligent nursing robot as the application background, the method constructs an embedded intelligent front-end recognition system for speech emotion under open-environment conditions.
The method adopts an LSTM-based dynamic emotion recognition approach: the LSTM cyclically takes in the image sequence, learns and memorizes the correlation information within the sequence, and emotion judgment is made by combining the information of each single image with the sequence correlation information, thereby enhancing the accuracy and robustness of emotion recognition in an open environment.
The method simplifies the CNN convolution layers to reduce algorithm complexity, increases the number of LSTM layers to strengthen the algorithm's ability to learn relations within image sequences, adds the timestamp layer cont so that the LSTM can learn correlations over image sequences of different lengths, and adds the slice layer for cutting out the sequence-end vector output by the second sub-network; this sequence-end vector is used together with the label to calculate the error feedback and correct the network weights, or to predict the emotion voice type. The data-processing load of the network is thereby greatly reduced, lowering the algorithm complexity so that the algorithm can run on embedded devices.
The method ports the speech emotion recognition algorithm and the network model trained on a server to an embedded platform (such as a Huawei Atlas 200 DK, a system on chip with a CPU, an NPU and an ISP), so as to realize an intelligent front-end speech emotion recognition system.
The above embodiments are only exemplary embodiments of the present application, and are not intended to limit the present application, and the protection scope of the present application is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present application and such modifications and equivalents should also be considered to be within the scope of the present application.

Claims (7)

1. A training method for recognizing emotional speech, comprising:
sequentially acquiring a group of sample spectrogram signals, wherein the sample spectrogram signals are divided into one of N groups according to N emotion voice types, the emotion voice types of the sample spectrogram signals in the same group are the same, the emotion voice types of the sample spectrogram signals in each group are different, and N is an integer greater than 1; the sample spectrogram signal comprises a tag for marking the emotion voice type;
respectively training the first network model by using each group of sample spectrogram signals to reach a preset training termination condition, thereby obtaining a first optimization parameter of the first network model;
wherein the first network model comprises: a first sub-network for extracting emotion voice features from the spectrogram signal and reducing dimensions, and a second sub-network for accelerating algorithm convergence and classifying the emotion voice features, the second sub-network comprising: changing input data dimensionality, changing input timestamp dimensionality, a first long and short time cyclic recursion network layer, a third prevention over-fitting layer, a second long and short time cyclic recursion network layer, a third long and short time cyclic recursion network layer, an output cutting layer, changing long and short time cyclic recursion network dimensionality, a fourth full-connection layer, changing input label dimensionality, a loss function layer and outputting classification accuracy;
wherein the output cutting layer is configured to cut out a sequence-end vector output by the second sub-network; and the sequence-end vector is used for calculating error feedback with the label and correcting the network weights, or for predicting the emotion voice type.
2. The training method of claim 1, further comprising, before said sequentially acquiring a set of sample spectrogram signals:
acquiring a voice signal to be recognized;
preprocessing the voice signal to be recognized according to a preprocessing rule to obtain a multi-frame preprocessed voice signal;
generating a corresponding spectrogram signal based on each frame of preprocessed voice signals;
and marking the label on the spectrogram signal according to N emotion voice types to obtain N groups of sample spectrogram signals.
3. The training method of claim 2, wherein the preprocessing rule comprises: a pre-emphasis rule, a windowing and framing rule, and an endpoint detection rule;
the preprocessing the voice signal to be recognized according to the preprocessing rule to obtain the multi-frame preprocessed voice signal comprises the following steps:
pre-emphasis processing is carried out on the voice signal to be recognized according to a pre-emphasis rule, and a first voice signal is obtained;
windowing and framing the first voice signal according to a windowing and framing rule to obtain a plurality of frames of second voice signals;
and performing endpoint detection on each frame of second voice signal according to an endpoint detection rule to obtain a plurality of frames of preprocessed voice signals.
4. Training method according to claim 3, wherein the endpoint detection rule is in particular a temporal endpoint detection rule.
5. The training method according to claim 1, wherein the first sub-network comprises: a first fully connected layer, a first activation function, a first overfitting-prevention layer, a second fully connected layer, a second activation function, a second overfitting-prevention layer, a third fully connected layer, and a third activation function.
6. The training method according to claim 1, wherein the preset training termination condition includes at least one of the following conditions:
the percentage of the number of correctly output samples in the total number of test samples is greater than a preset percentage threshold; or
the percentage of the number of correctly output samples for each group of sample spectrogram signals in the total number of test samples of the corresponding group is greater than a preset second percentage threshold.
7. A training apparatus for recognizing emotional speech, comprising:
an acquisition sample unit, configured to sequentially acquire a group of sample spectrogram signals, wherein the sample spectrogram signals are divided into one of N groups according to N emotion voice types, the emotion voice types of the sample spectrogram signals in the same group are the same, the emotion voice types of the sample spectrogram signals in each group are different, and N is an integer greater than 1; the sample spectrogram signal comprises a tag for marking the emotion voice type;
a training sample unit, configured to respectively train the first network model by using each group of sample spectrogram signals to reach a preset training termination condition, thereby obtaining a first optimization parameter of the first network model;
wherein the first network model comprises: a first sub-network for extracting emotion voice features from the spectrogram signal and reducing dimensions, and a second sub-network for accelerating algorithm convergence and classifying the emotion voice features, the second sub-network comprising: changing input data dimensionality, changing input timestamp dimensionality, a first long and short time cyclic recursion network layer, a third prevention over-fitting layer, a second long and short time cyclic recursion network layer, a third long and short time cyclic recursion network layer, an output cutting layer, changing long and short time cyclic recursion network dimensionality, a fourth full-connection layer, changing input label dimensionality, a loss function layer and outputting classification accuracy;
wherein the output cutting layer is configured to cut out a sequence-end vector output by the second sub-network; and the sequence-end vector is used for calculating error feedback with the label and correcting the network weights, or for predicting the emotion voice type.
CN201910690493.9A 2019-07-29 2019-07-29 Method and device for recognizing emotion voice Active CN110415728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910690493.9A CN110415728B (en) 2019-07-29 2019-07-29 Method and device for recognizing emotion voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910690493.9A CN110415728B (en) 2019-07-29 2019-07-29 Method and device for recognizing emotion voice

Publications (2)

Publication Number Publication Date
CN110415728A CN110415728A (en) 2019-11-05
CN110415728B true CN110415728B (en) 2022-04-01

Family

ID=68363877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910690493.9A Active CN110415728B (en) 2019-07-29 2019-07-29 Method and device for recognizing emotion voice

Country Status (1)

Country Link
CN (1) CN110415728B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN111312292A (en) * 2020-02-18 2020-06-19 北京三快在线科技有限公司 Emotion recognition method and device based on voice, electronic equipment and storage medium
CN111326178A (en) * 2020-02-27 2020-06-23 长沙理工大学 Multi-mode speech emotion recognition system and method based on convolutional neural network
KR102246130B1 (en) * 2020-04-09 2021-04-29 와이피랩스 주식회사 Method and System Providing Service Based on User Voice
US20210358490A1 (en) * 2020-05-18 2021-11-18 Nvidia Corporation End of speech detection using one or more neural networks
CN111883178B (en) * 2020-07-17 2023-03-17 渤海大学 Double-channel voice-to-image-based emotion recognition method
CN112466324A (en) * 2020-11-13 2021-03-09 上海听见信息科技有限公司 Emotion analysis method, system, equipment and readable storage medium
CN112613481A (en) * 2021-01-04 2021-04-06 上海明略人工智能(集团)有限公司 Bearing abrasion early warning method and system based on frequency spectrum
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113628639A (en) * 2021-07-06 2021-11-09 哈尔滨理工大学 Voice emotion recognition method based on multi-head attention mechanism
CN113566948A (en) * 2021-07-09 2021-10-29 中煤科工集团沈阳研究院有限公司 Fault audio recognition and diagnosis method for robot coal pulverizer
CN113808620B (en) * 2021-08-27 2023-03-21 西藏大学 Tibetan language emotion recognition method based on CNN and LSTM

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597539A (en) * 2018-02-09 2018-09-28 桂林电子科技大学 Speech-emotion recognition method based on parameter migration and sound spectrograph
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speech Emotion Recognition using Convolutional and Recurrent Neural Networks; Lim W.; 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference; 2016-12-31; full text *
Speech emotion recognition based on parameter transfer and convolutional recurrent neural networks; Miao Yuqing et al.; Computer Engineering and Applications; 2019-05-15; pp. 135-140, 198 *

Also Published As

Publication number Publication date
CN110415728A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN110415728B (en) Method and device for recognizing emotion voice
Akbari et al. Lip2audspec: Speech reconstruction from silent lip movements video
US11961533B2 (en) Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN108922513B (en) Voice distinguishing method and device, computer equipment and storage medium
KR101269296B1 (en) Neural network classifier for separating audio sources from a monophonic audio signal
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
KR102605736B1 (en) Method and apparatus of sound event detecting robust for frequency change
Aloufi et al. Emotionless: Privacy-preserving speech analysis for voice assistants
CN108962229B (en) Single-channel and unsupervised target speaker voice extraction method
CN110047510A (en) Audio identification methods, device, computer equipment and storage medium
Kamińska et al. Recognition of human emotion from a speech signal based on Plutchik's model
CN109036470B (en) Voice distinguishing method, device, computer equipment and storage medium
WO2022100691A1 (en) Audio recognition method and device
CN111986679A (en) Speaker confirmation method, system and storage medium for responding to complex acoustic environment
KR101022519B1 (en) System and method for voice activity detection using vowel characteristic, and method for measuring sound spectral similarity used thereto
WO2019232867A1 (en) Voice discrimination method and apparatus, and computer device, and storage medium
Murugaiya et al. Probability enhanced entropy (PEE) novel feature for improved bird sound classification
Ling An acoustic model for English speech recognition based on deep learning
Krishnakumar et al. A comparison of boosted deep neural networks for voice activity detection
CN111091847A (en) Deep clustering voice separation method based on improvement
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
Gao Audio deepfake detection based on differences in human and machine generated speech
CN113327616A (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant