KR20170073113A - Method and apparatus for recognizing emotion using tone and tempo of voice signal


Info

Publication number
KR20170073113A
KR20170073113A (Application KR1020150181619A)
Authority
KR
South Korea
Prior art keywords
value
emotion
voice
interval
information
Prior art date
Application number
KR1020150181619A
Other languages
Korean (ko)
Inventor
이석필
변성우
Original Assignee
상명대학교산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 상명대학교산학협력단 filed Critical 상명대학교산학협력단
Priority to PCT/KR2015/013968 priority Critical patent/WO2017104875A1/en
Priority to KR1020150181619A priority patent/KR20170073113A/en
Publication of KR20170073113A publication Critical patent/KR20170073113A/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

According to an aspect of the present invention, an emotion recognition method using tone and tempo information comprises: receiving a voice signal of a user; detecting a voice interval by dividing the voice signal into a voice interval and a non-voice interval using an integral absolute value (IAV); extracting tone information and tempo information from the detected voice interval; and extracting emotion information from the tone information and the tempo information using two or more neural networks, wherein a first neural network distinguishes a normal emotion from a sadness emotion and a second neural network distinguishes a joy emotion from an anger emotion.

Description

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to speech signal processing, and more particularly, to a method and apparatus for recognizing a speaker's emotion using the tone and tempo information of a speech signal.

In communication, the transmission and recognition of emotion is a very important factor, which is necessary for accurate communication not only between people but also between people and animals or between people and machines.

Communication between human beings consists of various elements such as voice, gesture, and facial expression, which act individually or in combination to convey and recognize emotions.

Recently, as Internet of Things (IoT) technology has developed, communication between humans and machines has emerged as an important channel for conveying emotion. Until now, however, research has mainly focused on identifying and judging human emotion from facial expressions.

A variety of studies have used speech for communication between humans and machines. However, research has concentrated on recognizing human speech, synthesizing speech from text, or identifying and authenticating a speaker by voice; research on recognizing emotion from speech is not yet active.

Conventionally, emotion recognition from speech has relied on pitch or volume, that is, the strength of the voice signal, for example determining anger by comparing the current pitch against the pitch in a calm state.

However, pitch varies widely from person to person, making it difficult to obtain a meaningful average value, and the strength of a voice signal is strongly influenced by the state of the microphone and the distance between the speaker and the microphone, so the accuracy of such methods is low.

Also, since a voice signal contains both voice sections and non-voice sections, analyzing the entire signal lowers the accuracy of speech recognition or emotion recognition because of the non-voice sections it contains. A voice activity detection technique capable of extracting only the voice sections is therefore also necessary.

The present invention has been made in view of the technical background described above, and an object of the present invention is to provide an apparatus and method for recognizing emotion using the tone and tempo of a voice section, after distinguishing the voice sections from the non-voice sections of a voice signal.

The objects of the present invention are not limited to the above-mentioned objects, and other objects not mentioned can be clearly understood by those skilled in the art from the following description.

According to an aspect of the present invention, an emotion recognition method using tone and tempo information comprises: receiving a voice signal of a user; detecting a voice interval by dividing the voice signal into a voice interval and a non-voice interval using an integral absolute value (IAV); extracting tone information and tempo information from the detected voice interval; and extracting emotion information from the tone information and the tempo information using two or more neural networks, wherein a first neural network distinguishes a normal emotion from a sadness emotion and a second neural network distinguishes a joy emotion from an anger emotion.

According to another aspect of the present invention, an emotion recognition apparatus using tone and tempo information includes: an input unit for receiving a user's voice signal; a voice section detector for detecting a voice section by dividing the voice signal into a voice section and a non-voice section using an integral absolute value (IAV); a tone information extracting unit for extracting tone information from the detected voice section; a tempo information extracting unit for extracting tempo information from the detected voice section; and an emotion recognition unit for extracting emotion information using the tone information and the tempo information, wherein a first neural network distinguishes a normal emotion from a sadness emotion and a second neural network distinguishes a joy emotion from an anger emotion.

According to the present invention, the voice sections and non-voice sections of a voice signal can be correctly distinguished, and emotion can be recognized more effectively and accurately from the voice sections.

FIG. 1 is a flowchart of an emotion recognition method according to an embodiment of the present invention.
FIG. 2 is a flowchart of a speech interval extraction method according to an embodiment of the present invention.
FIG. 3 illustrates extracted voice segments according to an embodiment of the present invention.
FIG. 4 is a structural diagram of an emotion recognition apparatus according to another embodiment of the present invention.
FIG. 5 is a diagram showing tone characteristics of a voice signal according to emotion.
FIG. 6 is a diagram showing tempo characteristics of a voice signal according to emotion.
FIG. 7 is a structural diagram of an emotion recognition apparatus according to another embodiment of the present invention.
FIG. 8 is a structural diagram of an emotion recognition apparatus according to another embodiment of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS The advantages and features of the present invention, and the manner of achieving them, will become apparent from the embodiments described hereinafter with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art, and the invention is defined only by the scope of the claims. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. In this specification, the singular form includes the plural form unless otherwise specified. As used herein, the terms "comprises" and/or "comprising" do not exclude the presence or addition of one or more other components, steps, or operations.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. FIG. 1 shows a flowchart of an emotion recognition method according to the present invention.

First, the user's voice to be subjected to emotion recognition is input (S110).

The input voice can be acquired via a microphone or the like, from a voice message, or by extracting only the voice portion from a moving image attached to a mail.

Next, the voice intervals necessary for emotion recognition are detected from the input voice signal (S120).

Since the input voice signal contains mixed voice and non-voice sections, using the entire signal lowers the recognition rate. Therefore, only the voice sections are separated and used for emotion recognition.

The IAV (Integral Absolute Value) feature is used to separate the speech segments. The IAV reflects the energy of the signal, which is larger in voice intervals than in non-voice intervals.

FIG. 2 is a flowchart for detecting a voice interval.

First, to detect a speech interval, the integral absolute value of each frame is calculated (S210). The frame length of the speech signal depends on the sampling frequency and the number of samples; one frame at a sampling frequency of 48 kHz containing 1536 samples has a length of 32 milliseconds (ms).

That is, the IAV is obtained by summing the absolute values of the 1536 samples in one frame.
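
As an illustration only (not part of the patent text), the per-frame IAV could be computed as in the following Python sketch; the NumPy dependency, the function name frame_iav, and the default frame length are assumptions taken from the 48 kHz / 1536-sample example above.

    import numpy as np

    def frame_iav(signal, frame_len=1536):
        """Sum of absolute sample values (IAV) for each complete frame."""
        n_frames = len(signal) // frame_len
        frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
        return np.abs(frames).sum(axis=1)   # one IAV value per 32 ms frame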

Once the IAV of the input signal is obtained, the maximum value and the minimum value over the interval are calculated (S220), and a threshold value for deciding whether a frame belongs to a voice interval is calculated from them.

First, it is determined whether the minimum value exceeds 70% of the maximum value (S230). This check prevents the threshold from becoming too high when the minimum value is close to the maximum value, which would make the sections judged to be voice sections too short.

If the minimum value is equal to or greater than 70% of the maximum value, the threshold value is set to 20% of the maximum value (S240), and the voice interval is determined with it.

If the minimum value is less than 70% of the maximum value, the threshold value is set to the minimum value plus 10% of the difference between the maximum value and the minimum value (S250).

If the IAV exceeds the threshold value, it is determined that a voice interval has started (S270); if the IAV falls below the threshold value, it is determined that the voice interval has ended (S280), and the voice interval detection step (S120) is complete.

The numerical values used in the speech interval detection step (S120) are example values given for explanation; optimal values can be substituted based on experiments.
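
The following is a minimal sketch of steps S220 to S280, using the example 70%, 20%, and 10% values mentioned above; detect_voice_intervals is a hypothetical helper that operates on the per-frame IAV array and returns (start, end) frame indices of the detected voice intervals.

    import numpy as np

    def detect_voice_intervals(iav, min_ratio=0.7, cap_ratio=0.2, margin_ratio=0.1):
        """Threshold the per-frame IAV and return (start, end) frame indices."""
        max_v, min_v = float(iav.max()), float(iav.min())
        if min_v >= min_ratio * max_v:
            threshold = cap_ratio * max_v                       # S240: minimum close to maximum
        else:
            threshold = min_v + margin_ratio * (max_v - min_v)  # S250
        intervals, start = [], None
        for i, value in enumerate(iav):
            if value > threshold and start is None:
                start = i                                       # S270: voice interval starts
            elif value <= threshold and start is not None:
                intervals.append((start, i))                    # S280: voice interval ends
                start = None
        if start is not None:
            intervals.append((start, len(iav)))
        return intervals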

When the voice interval detection step (S120) is finished, the tone information of the voice interval is extracted (S130) and the tempo information of the voice interval is extracted (S140), and both are used for emotion recognition.

FIG. 4 shows an apparatus that extracts tone and tempo information and performs emotion recognition using a neural network.

A human voice signal is a quasi-periodic signal generated by the vibration of the vocal cords. The frequency corresponding to this vibration period is called the fundamental frequency, pitch, or tone.

The tone of a voice signal is an important feature widely used in voice signal processing, and there are various methods for obtaining tone information.

The autocorrelation or AMDF (Average Magnitude Difference Function) method finds the period with the greatest autocorrelation in the voice signal and takes the corresponding frequency as the fundamental frequency, that is, the tone. The human fundamental frequency usually lies between about 80 Hz and 500 Hz, so the candidate frequency is swept from 80 Hz to 500 Hz, the period with the largest autocorrelation value is found, and the frequency with the highest correlation is determined as the fundamental frequency.
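
The autocorrelation search described above might look like the following sketch; pitch_autocorr is an illustrative name, the frame is assumed to be a NumPy array sampled at 48 kHz, and the 80 Hz to 500 Hz range is the one given in the text.

    import numpy as np

    def pitch_autocorr(frame, fs=48000, f_min=80.0, f_max=500.0):
        """Fundamental frequency of one frame via the autocorrelation peak."""
        frame = frame - frame.mean()                       # remove DC offset
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lag_min = int(fs / f_max)                          # shortest period (500 Hz)
        lag_max = int(fs / f_min)                          # longest period (80 Hz)
        best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
        return fs / best_lag                               # period with largest autocorrelation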

In the method using the energy of the voice signal, the time-domain voice signal is converted into a frequency-domain signal by an FFT (Fast Fourier Transform) or the like, the energy at each frequency is measured, and the frequency with the largest energy is determined as the fundamental frequency. Besides the FFT, methods such as the DCT (Discrete Cosine Transform), the DFT (Discrete Fourier Transform), or a filter bank may be used to convert the voice signal into a frequency signal.
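
A corresponding frequency-domain sketch, again illustrative only: the spectrum is obtained with an FFT and the bin with the largest energy is taken as the fundamental frequency. Restricting the search to 80 to 500 Hz is an assumption carried over from the autocorrelation description; the text itself only says that the frequency with the largest energy is chosen.

    import numpy as np

    def pitch_fft(frame, fs=48000, f_min=80.0, f_max=500.0):
        """Fundamental frequency as the peak-energy FFT bin within the assumed F0 band."""
        energy = np.abs(np.fft.rfft(frame)) ** 2           # energy at each frequency bin
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
        band = (freqs >= f_min) & (freqs <= f_max)
        return float(freqs[band][np.argmax(energy[band])])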

As shown in FIG. 4, the tone extracted for each frame is used to compute an average value and a variance value over the entire voice interval, and these values are passed to the neural network to recognize the emotion.

The tempo of a voice signal is measured in BPM (beats per minute). In music, the number of beats per minute is constant; for the human voice, the tempo is obtained using the number of syllables, each composed of one consonant and a vowel or of a single vowel.

In the present invention, vowels and consonants are extracted by analyzing the envelope of the speech signal, and the length of the vowel is taken as the length of the syllable.

The syllable extraction result is expressed as the number of frames for one vowel. As described above, one frame is 32 ms long at 48 kHz with 1536 samples per frame, and the average syllable length extracted from one sentence is used as the tempo.
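
The patent does not spell out how the envelope is thresholded to isolate vowels. The sketch below simply treats frames whose IAV exceeds a fixed fraction of the maximum as vowel frames and averages the lengths of the resulting runs; the 50% ratio and the helper name tempo_from_envelope are therefore assumptions.

    import numpy as np

    def tempo_from_envelope(iav, vowel_ratio=0.5):
        """Average vowel-run length in frames (32 ms each) as a rough tempo value."""
        is_vowel = iav > vowel_ratio * iav.max()           # frames with a high envelope
        runs, count = [], 0
        for v in is_vowel:
            if v:
                count += 1
            elif count:
                runs.append(count)                         # one vowel (syllable) run ended
                count = 0
        if count:
            runs.append(count)
        return float(np.mean(runs)) if runs else 0.0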

An artificial intelligence algorithm is used in the step of recognizing the emotion from the extracted tone and tempo (S150). In this embodiment, a Recurrent Neural Network (RNN) algorithm is used; however, algorithms such as a Deep Neural Network (DNN), Convolutional Neural Network (CNN), Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), or Deep Q-Network (DQN) can also be used.

Like the tone information, the tempo information is obtained for each frame, and its average value and variance value are computed and passed to the neural network.

To analyze emotion with the artificial intelligence algorithm using the tone information and the tempo information obtained in this way, an initial learning process is required. Voice signals for the four emotions are input, and the optimal threshold values are set.

After learning is completed, the neural network recognizes emotion in two stages, a primary neural network and a secondary neural network. The primary neural network recognizes the normal emotion and the sadness emotion, which have relatively low tones. Tones that are not recognized in the primary neural network are passed to the secondary neural network, which recognizes the joy emotion and the anger emotion.

By dividing the recognition between primary and secondary neural networks, only the normal and sadness emotions are distinguished in the primary neural network, and only the joy and anger emotions are distinguished in the secondary neural network.

In the emotion recognition step, the average and variance of the tone and the average and variance of the tempo extracted in the previous stage are compared with the values previously learned for each emotion, and the emotion whose values differ by less than the set threshold is judged to be the expressed emotion.
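
The sketch below only illustrates the two-stage split (normal/sadness first, then joy/anger) and the comparison of extracted tone and tempo statistics against values learned for each emotion. The per-emotion numbers are invented placeholders loosely consistent with the frequency ranges in FIG. 5, and the nearest-statistics rule stands in for the trained recurrent neural networks, so none of this should be read as the patented classifier itself.

    import numpy as np

    # Hypothetical learned statistics per emotion:
    # (tone mean [Hz], tone variance, tempo mean [frames/syllable], tempo variance)
    EMOTION_STATS = {
        "normal":  np.array([130.0,  200.0,  9.0, 2.0]),
        "sadness": np.array([110.0,  150.0, 11.0, 2.5]),
        "joy":     np.array([230.0,  900.0,  6.0, 1.5]),
        "anger":   np.array([320.0, 1500.0,  5.0, 1.2]),
    }

    def recognize_emotion(features, low_tone_limit=200.0):
        """Two-stage decision: low-tone emotions first, then joy versus anger."""
        stage = ("normal", "sadness") if features[0] < low_tone_limit else ("joy", "anger")
        # Choose the emotion whose learned statistics are closest to the extracted features.
        return min(stage, key=lambda e: float(np.linalg.norm(features - EMOTION_STATS[e])))

For example, recognize_emotion(np.array([115.0, 160.0, 10.5, 2.4])) would return "sadness" under these placeholder values, since the tone mean falls in the low-tone stage and the statistics lie closest to the sadness entry.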

FIG. 5 is a graph showing the features extracted from the tones of voice signals. In the graph, the abscissa represents time and the ordinate represents the frequency of the voice signal in hertz (Hz).

The diamonds corresponding to sadness are distributed below 150 Hz, showing low-frequency characteristics, while joy lies above 200 Hz and anger above 300 Hz, showing high-frequency characteristics compared to sadness.

Therefore, by analyzing these characteristics, a neural network can recognize whether the emotional state is sadness, joy, anger, or normal.

FIG. 6 is a graph showing the features extracted from the tempo of voice signals. The vertical axis indicates the presence or absence of a voice signal: intervals in which voice is present are marked 1, and the others 0. The horizontal axis represents time in frame units.

The difference in thickness of each bar in the graph indicates the tempo speed. The thicker the bar, the faster the tempo.

FIG. 6 (a) shows sadness, (b) shows joy, and (c) shows anger. In the case of anger and joy, thick bars appear more frequently than in the case of sadness.

Emotion can be determined using the tone information and tempo information of the emotions shown in FIGS. 5 and 6, and the tone and tempo threshold values for emotion determination can be determined through experiments.

If the emotion cannot be recognized through the above steps, a method of recognizing the emotion by analyzing the person's breathing sound can also be used.

When a person is extremely sad or angry, only a breathing sound may be produced with no speech, so no speech interval is detected when the existing threshold value is used. Analyzing the breathing sound compensates for this situation, in which emotion could otherwise not be recognized.

In addition, even when a voice interval is detected, emotion recognition can be supplemented by analyzing the energy level and tempo of the breathing sound when recognition is ambiguous at the boundary between the normal/sadness and joy/anger emotions. The threshold for breathing sounds can also be set by experiment.

FIG. 7 shows an emotion recognition apparatus 700 according to the present invention.

The input unit 710 receives the user's voice through a microphone or the like, or extracts the voice portion from a file such as a voice message or video.

The voice section detector 720 receives the voice signal from the input unit 710 and distinguishes the voice section from the non-voice section.

To detect the voice interval, the voice interval and the non-voice interval are separated based on the energy level using the IAV feature as described above, and the result is passed to the tone information extracting unit 730 and the tempo information extracting unit 740.

The tone information extracting unit 730 finds the fundamental frequency of the voice and obtains the tone information from it.

The tone information can be obtained using an autocorrelation function or using the energy at each frequency of the frequency-domain signal.

The tempo information extracting unit 740 determines the tempo of the voice, that is, the length of the syllables corresponding to the vowels, in order to find the speaking pace.

When the tone information and the tempo information are found, the emotion recognition unit 750 detects the emotion corresponding to the voice signal.

The emotion recognition unit 750 can be configured as a two-stage neural network circuit: the primary neural network circuit distinguishes the relatively low-toned normal and sadness emotions, and the secondary neural network circuit distinguishes the joy and anger emotions.

The emotion recognition apparatus described above makes it possible to recognize the user's emotion more precisely, and it can be utilized in many applications.

Meanwhile, the emotion recognition method according to the embodiment of the present invention can be implemented in a computer system or recorded on a recording medium. As shown in FIG. 8, the computer system may include at least one processor 821, a memory 823, a user input device 826, a data communication bus 822, a user output device 827, and a storage 828. Each of these components communicates data via the data communication bus 822.

The computer system may further include a network interface 829 coupled to a network. The processor 821 may be a central processing unit (CPU) or a semiconductor device that processes instructions stored in the memory 823 and/or the storage 828.

The memory 823 and the storage 828 may include various forms of volatile or nonvolatile storage media. For example, the memory 823 may include a ROM 824 and a RAM 825.

Accordingly, the emotion recognition method according to the embodiment of the present invention can be implemented in a computer-executable method. When the emotion recognition method according to the embodiment of the present invention is performed in a computer device, computer-readable instructions can perform the recognition method according to the present invention.

Meanwhile, the above-described emotion recognition method according to the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording media storing data that can be decoded by a computer system, for example a ROM (Read Only Memory), a RAM (Random Access Memory), a magnetic tape, a magnetic disk, a flash memory, or an optical data storage device. The computer-readable recording medium may also be distributed over computer systems connected via a computer network and stored and executed as code readable in a distributed manner.

While the present invention has been described in detail with reference to the accompanying drawings, the invention is not limited to the above-described embodiments, and those skilled in the art will appreciate that various modifications are possible. Accordingly, the scope of protection of the present invention should not be limited to the above-described embodiments but should be determined by the following claims.

Claims (10)

1. An emotion recognition method comprising:
receiving a voice signal of a user;
detecting a voice interval by dividing the voice signal into a voice interval and a non-voice interval using an integral absolute value (IAV);
extracting tone information and tempo information from the detected voice interval; and
extracting emotion information using the tone information and the tempo information in two or more neural networks, wherein a first neural network distinguishes a normal emotion from a sadness emotion and a second neural network distinguishes a joy emotion from an anger emotion.
2. The emotion recognition method of claim 1, wherein detecting the voice interval comprises:
calculating a maximum value and a minimum value of the integral absolute value of the voice signal;
setting, if the minimum value exceeds a preset rate of the maximum value, a threshold value to the maximum value multiplied by a first rate, and otherwise setting the threshold value to the minimum value plus a second rate multiplied by the difference between the maximum value and the minimum value; and
determining an interval in which the integral absolute value exceeds the threshold value to be a voice interval, and an interval in which it is below the threshold value to be a non-voice interval.
3. The emotion recognition method of claim 1, wherein the tone information includes an average value and a variance value of a fundamental frequency of the detected voice interval, and the tempo information includes an average value and a variance value of a tempo of the detected voice interval.
4. The emotion recognition method of claim 3, wherein extracting the emotion information comprises comparing the average value and the variance value of the fundamental frequency and the average value and the variance value of the tempo with average values and variance values preset for each emotion, and judging the emotion whose difference is below a set threshold value.
5. The emotion recognition method of claim 1, wherein extracting the tone information comprises extracting a fundamental frequency using an autocorrelation function, an average magnitude difference function (AMDF), or a fast Fourier transform (FFT).
6. An emotion recognition apparatus comprising:
an input unit for receiving a voice signal of a user;
a voice interval detector for detecting a voice interval by dividing the voice signal into a voice interval and a non-voice interval using an integral absolute value (IAV);
a tone information extracting unit for extracting tone information from the detected voice interval;
a tempo information extracting unit for extracting tempo information from the detected voice interval; and
an emotion recognition unit for extracting emotion information using the tone information and the tempo information in two or more neural networks, wherein a first neural network distinguishes a normal emotion from a sadness emotion and a second neural network distinguishes a joy emotion from an anger emotion.
7. The emotion recognition apparatus of claim 6, wherein the voice interval detector calculates a maximum value and a minimum value of the integral absolute value of the voice signal; sets, if the minimum value exceeds a preset rate of the maximum value, a threshold value to the maximum value multiplied by a first rate, and otherwise sets the threshold value to the minimum value plus a second rate multiplied by the difference between the maximum value and the minimum value; and determines an interval in which the integral absolute value exceeds the threshold value to be a voice interval, and an interval in which it is below the threshold value to be a non-voice interval.
8. The emotion recognition apparatus of claim 6, wherein the tone information extracting unit extracts tone information including an average value and a variance value of the tone of the detected voice interval, and the tempo information extracting unit extracts tempo information including an average value and a variance value of the tempo of the detected voice interval.
9. The emotion recognition apparatus of claim 8, wherein the emotion recognition unit compares the average value and the variance value of the tone and the average value and the variance value of the tempo with average values and variance values of the tone and the tempo preset for each emotion, and determines the corresponding emotion.
10. The emotion recognition apparatus of claim 6, wherein the tone information extracting unit extracts a fundamental frequency using an autocorrelation function, an average magnitude difference function (AMDF), or a fast Fourier transform (FFT).
KR1020150181619A 2015-12-18 2015-12-18 Method and apparatus for recognizing emotion using tone and tempo of voice signal KR20170073113A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/KR2015/013968 WO2017104875A1 (en) 2015-12-18 2015-12-18 Emotion recognition method using voice tone and tempo information, and apparatus therefor
KR1020150181619A KR20170073113A (en) 2015-12-18 2015-12-18 Method and apparatus for recognizing emotion using tone and tempo of voice signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150181619A KR20170073113A (en) 2015-12-18 2015-12-18 Method and apparatus for recognizing emotion using tone and tempo of voice signal

Publications (1)

Publication Number Publication Date
KR20170073113A true KR20170073113A (en) 2017-06-28

Family

ID=59056830

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150181619A KR20170073113A (en) 2015-12-18 2015-12-18 Method and apparatus for recognizing emotion using tone and tempo of voice signal

Country Status (2)

Country Link
KR (1) KR20170073113A (en)
WO (1) WO2017104875A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108806667B (en) * 2018-05-29 2020-04-17 重庆大学 Synchronous recognition method of voice and emotion based on neural network
CN109147826B (en) * 2018-08-22 2022-12-27 平安科技(深圳)有限公司 Music emotion recognition method and device, computer equipment and computer storage medium
US10810382B2 (en) * 2018-10-09 2020-10-20 Disney Enterprises, Inc. Automated conversion of vocabulary and narrative tone
CN109243491B (en) * 2018-10-11 2023-06-02 平安科技(深圳)有限公司 Method, system and storage medium for emotion recognition of speech in frequency spectrum
CN111627462B (en) * 2020-05-22 2023-12-19 上海师范大学 Semantic analysis-based emotion recognition method and device
CN113327630B (en) * 2021-05-27 2023-05-09 平安科技(深圳)有限公司 Speech emotion recognition method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI221574B (en) * 2000-09-13 2004-10-01 Agi Inc Sentiment sensing method, perception generation method and device thereof and software
US8788270B2 (en) * 2009-06-16 2014-07-22 University Of Florida Research Foundation, Inc. Apparatus and method for determining an emotion state of a speaker
US9020822B2 (en) * 2012-10-19 2015-04-28 Sony Computer Entertainment Inc. Emotion recognition using auditory attention cues extracted from users voice

Also Published As

Publication number Publication date
WO2017104875A1 (en) 2017-06-22

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E90F Notification of reason for final refusal
E601 Decision to refuse application