WO2019078407A1 - Apparatus and method for estimating emotions by using surrounding environment images

Info

Publication number: WO2019078407A1
Application number: PCT/KR2017/014040
Authority: WO
Inventors: 이의철; 박민우; 황현상
Original assignee: 상명대학교산학협력단
Priority date: 2017-10-18
Filing date: 2017-12-01
Publication date: 2019-04-25
Also published as: KR20190043391A; KR102027494B1

Abstract

The present invention provides an apparatus and a method for estimating emotions, the apparatus and the method analyzing surrounding environment images obtained through a camera capable of capturing images of a frontal environment, so as to quantitatively extract spatial context information that can influence human emotion, thereby estimating an emotional state. The present invention quantitatively extracts the spatial context information that can influence human emotion through the results of an analysis of surrounding environment images, an analysis of subjective emotional reactions of people according to the images, and an analysis of biometric signals, thereby providing the type of emotion of a user actually influenced by different surrounding environment images.

Description

Apparatus and method for estimating emotion using surrounding image

The present invention relates to an apparatus and method for estimating emotion, and more particularly, to an apparatus and method for estimating an emotional state by analyzing images of the surrounding environment obtained through a camera and quantitatively extracting spatial context information that may affect human emotion.

Recently, life-logging technology, which extracts information useful in various fields from the information generated in everyday life, has continued to develop. "Life logging" combines "life" and "log" and means a record of one's life: the technology records, stores, and organizes the events that happen to an individual. People are constantly exposed to their surroundings in everyday life, and their emotions are strongly influenced by the visual information around them. A number of studies have analyzed the relationship between visual elements and human emotion, and in psychology and marketing, visual information such as color and image complexity is known to have a strong influence on human emotion. However, the effect on human emotion of the surrounding environment, and of specific elements of the image information that can be obtained through a camera mounted on a smart device, has not yet been studied extensively.

The background art of the present invention is disclosed in Korean Patent Registration No. 10-1402724 (Registered May 31, 2017).

The present invention provides an emotion estimation apparatus and method that estimate an emotional state by analyzing a surrounding environment image, obtained through a camera capable of capturing the environment in front of the user, and quantitatively extracting spatial context information that may affect human emotion.

According to an aspect of the present invention, an emotion estimation apparatus is provided. The emotion estimation apparatus according to an embodiment of the present invention includes: a control unit that receives a surrounding environment image acquired from a camera sensor for a predetermined time, a subjective emotion evaluation input through a user interface input unit, and a bio-signal generated by a biometric sensor; and an emotion inference unit that extracts a plurality of emotion elements from the surrounding environment image, extracts spatial context information through a fully connected support vector regression (SVR) network designed using at least one of the emotion elements, the subjective emotion evaluation, and the bio-signal, and infers the emotional state of the user corresponding to the spatial context information.

According to another aspect of the present invention, an emotion inference method and a computer program for executing the method are provided. The emotion inference method according to an embodiment of the present invention includes: acquiring a surrounding environment image, in the form of a still image or a moving image, through an RGB camera for a predetermined time; extracting emotion elements including at least one of a temporal complexity, a spatial complexity, a pixel component, and a sound component using feature values between pixels in the spatial coordinates of the acquired surrounding environment image; extracting spatial context information from the emotion elements using a fully connected support vector regression network designed in advance, and inferring the emotional state corresponding to the spatial context information; and generating the inferred result as an n-dimensional emotion inference map.

The present invention can estimate emotion by analyzing the correlation between spatiotemporal perception information and human emotion and quantitatively extracting spatial context information that can affect human emotion.

In addition, the present invention quantitatively extracts spatial context information that can affect human emotion from the results of analyzing surrounding environment images, people's subjective emotional responses to each image, and bio-signals, and can thereby provide the type of emotion that different surrounding environment images actually evoke in the user.

FIG. 1 is a block diagram illustrating an emotion estimation apparatus according to an embodiment of the present invention.

FIGS. 2 and 3 are flowcharts illustrating a method of estimating emotion using the emotion estimation apparatus according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating a method of extracting spatial complexity by the emotion estimation apparatus according to an embodiment of the present invention.

FIGS. 5 and 6 are diagrams illustrating a method of extracting a pixel component by the emotion estimation apparatus according to an embodiment of the present invention.

FIGS. 7 to 9 are diagrams illustrating a method of extracting a sound component by the emotion estimation apparatus according to an embodiment of the present invention.

FIGS. 10 and 11 are diagrams for explaining a method of extracting spatial context information using a fully connected support vector regression network according to an embodiment of the present invention.

FIG. 12 is a diagram showing an example in which the emotional state inferred by the emotion estimation apparatus according to an embodiment of the present invention is generated as a two-dimensional emotion inference map.

While the present invention is described in connection with certain exemplary embodiments, it is not limited to the disclosed embodiments and includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

FIG. 1 is a block diagram illustrating an emotion estimation apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the emotion estimation apparatus 100 according to an embodiment of the present invention includes a user interface input unit 110, a user interface output unit 120, a control unit 130, a storage unit 140, and an emotion inference unit 150.

The user interface input unit 110 may be connected to an input device that receives a user input through a known method, and the user interface output unit 120 may be connected to an output device that outputs an image.

The control unit 130 receives the surrounding environment image acquired from the camera sensor. The control unit 130 also receives the user's subjective emotion evaluation and the bio-signal generated by the biometric sensor at the time the surrounding environment image is acquired. Here, the subjective emotion evaluation is the emotion that the user feels when the surrounding environment image is acquired, entered through the user interface input unit 110. The control unit 130 may transmit the emotional state inferred by the emotion inference unit 150, described later, to the user interface output unit 120.

The storage unit 140 stores the surrounding environment image, the subjective emotion evaluation, and the bio-signal received by the control unit 130. The storage unit 140 also stores the fully connected SVR network designed for emotion inference in the emotion inference unit 150, described later.

The emotion inference unit 150 uses the fully connected SVR network to extract spatial context information related to emotion from the surrounding environment image, the subjective emotion evaluation, and the bio-signal received by the control unit 130, and infers the emotion corresponding to the extracted spatial context information. The emotion inference unit 150 can also generate an n-dimensional emotion inference map of the inferred emotion, and stores the inferred emotional state and the generated emotion inference map in the storage unit 140.

The emotion inference unit 150 includes a surrounding environment image input unit 151, an emotion element extraction unit 152, a spatial context information extraction unit 153, and an emotion map generation unit 154.

The surrounding environment image input unit 151 takes as input the surrounding environment image received by the control unit 130. Here, the surrounding environment image may be a still image or a moving image acquired through an RGB camera for a predetermined time. A moving image includes the ambient sound produced while the surrounding environment image is acquired.

The emotion element extraction unit 152 extracts emotion elements including at least one of temporal complexity, spatial complexity, pixel component, and sound component using feature values between pixels in the spatial coordinates of the input surrounding environment image.

The spatial context information extraction unit 153 uses the fully connected support vector regression network to extract spatial context information related to emotion from the emotion elements extracted by the emotion element extraction unit 152 and from the subjective emotion evaluation and/or the bio-signal, and infers the emotional state corresponding to the extracted spatial context information. Here, the spatial context information means emotional information that changes according to the context that may arise at the time the user acquires the surrounding environment image.

The emotion map generation unit 154 expresses the inferred emotional state as position coordinates to generate an n-dimensional emotion inference map, and stores the inferred emotional state and the generated emotion inference map in the storage unit 140.

FIGS. 2 and 3 are flowcharts illustrating a method of estimating emotion using the emotion estimation apparatus according to an embodiment of the present invention.

Referring to FIG. 2, in step S210, the emotion estimation apparatus 100 acquires a surrounding environment image from a camera sensor. Here, the surrounding environment image may be a still image or a moving image acquired through an RGB camera for a predetermined time. A moving image includes the ambient sound produced while the surrounding environment image is acquired.

In step S220, the emotion estimation apparatus 100 extracts emotion elements including at least one of temporal complexity, spatial complexity, pixel component, and sound component using feature values between pixels in the spatial coordinates of the acquired surrounding environment image.

Referring to FIG. 3, in step S221, the emotion estimation apparatus 100 converts the input surrounding environment image to grayscale, calculates a difference image between the current frame and the previous frame, and extracts the temporal complexity by dividing the number of pixels in the difference image whose value is equal to or greater than a threshold by the total number of pixels.
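As an illustration of step S221, the following minimal sketch computes the ratio described above for two frames given as nested lists of (r, g, b) tuples. The function names, the channel-averaging grayscale conversion, and the default threshold of 30 are assumptions made for this example; the source does not specify them.

```python
def to_gray(frame):
    """Convert an RGB frame (rows of (r, g, b) tuples) to grayscale.

    Channel averaging is an assumption; the source only says "gray scale".
    """
    return [[(r + g + b) / 3.0 for (r, g, b) in row] for row in frame]


def temporal_complexity(prev_frame, curr_frame, threshold=30):
    """Fraction of pixels whose grayscale change between frames is >= threshold."""
    prev_gray = to_gray(prev_frame)
    curr_gray = to_gray(curr_frame)
    total = 0
    changed = 0
    for prev_row, curr_row in zip(prev_gray, curr_gray):
        for prev_value, curr_value in zip(prev_row, curr_row):
            total += 1
            # Difference image value for this pixel, compared with the threshold.
            if abs(curr_value - prev_value) >= threshold:
                changed += 1
    return changed / total if total else 0.0


# Two 2x2 frames in which exactly one pixel changes from black to white.
black = (0, 0, 0)
white = (255, 255, 255)
prev_frame = [[black, black], [black, black]]
curr_frame = [[white, black], [black, black]]
print(temporal_complexity(prev_frame, curr_frame))  # one of four pixels changed
```

A value near 0 indicates a nearly static scene, and a value near 1 indicates that most of the frame changed.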

In step S222, the emotion estimation apparatus 100 generates a boundary component image by applying masks that detect boundary components in the horizontal and vertical directions to a single frame of the input surrounding environment image, replaces each pixel brightness value of the generated boundary component image with 0 or 255 based on a threshold, and extracts the spatial complexity by dividing the number of pixels whose brightness value is 255 by the total number of pixels. The method of extracting the spatial complexity is described in more detail with reference to FIG. 4.

In step S223, the emotion estimation apparatus 100 converts the surrounding environment image, which is an RGB color image, into the HSI model, and extracts the pixel component by detecting pixels in the color ranges that cause positive and negative emotional states through the following Equation (1).

(1)

In Equation (1), hue represents the hue value of the HSI model, histo represents the number of pixels in the image having the corresponding hue value, W represents the width of the image, and H represents the height of the image; the result is divided by the product of the height and width of the image and normalized to a value between -1 and 1. The method of extracting the pixel component is described in more detail with reference to FIGS. 5 and 6.

In step S224, when the surrounding environment image is a moving image, the emotion estimation apparatus 100 extracts the sound component by analyzing the amplitude or frequency of the sound produced while the image is acquired. The method of extracting the sound component is described in more detail with reference to FIGS. 7 to 9.

Referring back to FIG. 2, in step S230, the emotion inferencing apparatus 100 receives a subjective emotion evaluation, which is felt at the time when the user acquires the surrounding environment image, through the user interface input unit.

In step S240, the emotion inferencing apparatus 100 acquires a bio-signal generated at the time when the user acquires the surrounding environment image from the biometric sensor.

In step S250, the emotion estimation apparatus 100 uses a fully connected support vector regression network to extract spatial context information related to emotion from the extracted emotion elements and from the subjective emotion evaluation and/or the bio-signal, and infers the emotional state corresponding to the extracted spatial context information. The method of extracting the spatial context information is described in more detail with reference to FIGS. 10 and 11.

In step S260, the emotion estimation apparatus 100 expresses the inferred emotional state as position coordinates to generate an n-dimensional emotion inference map, and stores the inferred emotional state and the generated emotion inference map in the storage unit 140. The emotion inference map is described in more detail with reference to FIG. 12.

FIG. 4 is a diagram illustrating a method of extracting spatial complexity by the emotion estimation apparatus according to an embodiment of the present invention.

Referring to FIG. 4, when an image 400 is input, the emotion estimation apparatus 100 generates a boundary component image 430 by applying masks that detect boundary components in the horizontal direction 410 and the vertical direction 420. The spatial complexity can then be extracted by replacing each pixel brightness value of the generated boundary component image with 0 or 255 based on a threshold and dividing the number of pixels whose brightness value is 255 by the total number of pixels.
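The procedure of FIG. 4 can be sketched as follows for a grayscale image given as a nested list. The source does not name the boundary-detection mask, so the 3x3 Sobel kernels, the way the two responses are combined (sum of absolute values), the treatment of border pixels, and the default threshold are all assumptions made for illustration.

```python
# Sobel kernels (an assumed choice of boundary-detection mask).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to horizontal intensity changes
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to vertical intensity changes


def apply_mask(gray, mask, y, x):
    """Response of a 3x3 mask centered on pixel (y, x)."""
    return sum(
        mask[j][i] * gray[y + j - 1][x + i - 1]
        for j in range(3)
        for i in range(3)
    )


def spatial_complexity(gray, threshold=128):
    """Fraction of pixels that are 255 after thresholding the boundary image."""
    height, width = len(gray), len(gray[0])
    edge_pixels = 0
    for y in range(height):
        for x in range(width):
            if 0 < y < height - 1 and 0 < x < width - 1:
                magnitude = (abs(apply_mask(gray, SOBEL_X, y, x))
                             + abs(apply_mask(gray, SOBEL_Y, y, x)))
            else:
                magnitude = 0  # border pixels lack a full neighborhood; treated as non-edges
            binary = 255 if magnitude >= threshold else 0
            if binary == 255:
                edge_pixels += 1
    return edge_pixels / (height * width)


# 4x4 image with a vertical step edge between columns 1 and 2.
step_image = [[0, 0, 255, 255] for _ in range(4)]
flat_image = [[100] * 4 for _ in range(4)]
print(spatial_complexity(step_image), spatial_complexity(flat_image))
```

In the step image, the four interior pixels adjacent to the edge exceed the threshold, so a quarter of the pixels are counted; the flat image has none.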

FIGS. 5 and 6 are diagrams illustrating a method of extracting the pixel component by the emotion estimation apparatus according to an embodiment of the present invention.

Referring to FIG. 5, when an RGB color image 500 is input, the emotion estimation apparatus 100 may convert the image into the HSI model and represent the pixels corresponding to each hue as a histogram distribution 510. The emotion estimation apparatus 100 can detect, in the histogram distribution 510, pixels in the color ranges that cause positive and negative emotions by using a model 520 that relates colors to emotion concepts.

Referring to FIG. 6, color is regarded as having a strong influence on a person's emotional response, and the emotion estimation apparatus 100 can detect pixels in the color ranges that cause positive and negative emotions based on Plutchik's wheel of emotions 600, a commonly used classification of emotional responses. According to Plutchik's wheel of emotions 600, a person has eight basic emotions: anger, fear, sadness, disgust, surprise, anticipation, trust, and joy. These basic emotions can be expressed with different color intensities and can combine with one another to form other emotions. For example, when the basic emotions of Plutchik's wheel of emotions 600 are matched to the two-dimensional model 610 proposed by Russell, the red and blue families are associated with displeasure, and the green and purple families are associated with pleasure. In the present invention, color and human emotion are matched based on Plutchik's wheel of emotions 600, but the invention is not limited thereto.
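The hue-histogram approach of FIGS. 5 and 6 can be sketched as below. Equation (1) is not reproduced in this text, so the scoring rule here (pixels in pleasant hue ranges minus pixels in unpleasant hue ranges, divided by W x H) is only an assumed form that matches the stated normalization to a value between -1 and 1. The hue ranges are hypothetical values chosen to follow the red/blue versus green/purple grouping described above, and all function names are illustrative.

```python
import math

# Hypothetical hue ranges in degrees (illustrative only; not taken from the source).
NEGATIVE_RANGES = [(330, 360), (0, 30), (210, 270)]   # red and blue families
POSITIVE_RANGES = [(90, 150), (270, 330)]             # green and purple families


def rgb_to_hue(r, g, b):
    """Hue in degrees per the standard HSI conversion, or None for gray pixels."""
    numerator = 0.5 * ((r - g) + (r - b))
    denominator = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if denominator == 0:
        return None  # hue is undefined when r == g == b
    theta = math.degrees(math.acos(max(-1.0, min(1.0, numerator / denominator))))
    return theta if b <= g else 360.0 - theta


def hue_histogram(image):
    """histo[h] = number of pixels whose rounded hue is h degrees."""
    histo = [0] * 360
    for row in image:
        for (r, g, b) in row:
            hue = rgb_to_hue(r, g, b)
            if hue is not None:
                histo[int(round(hue)) % 360] += 1
    return histo


def in_ranges(hue, ranges):
    return any(low <= hue < high for low, high in ranges)


def pixel_component(image):
    """Assumed score in [-1, 1]: (pleasant pixels - unpleasant pixels) / (W * H)."""
    height, width = len(image), len(image[0])
    histo = hue_histogram(image)
    positive = sum(n for hue, n in enumerate(histo) if in_ranges(hue, POSITIVE_RANGES))
    negative = sum(n for hue, n in enumerate(histo) if in_ranges(hue, NEGATIVE_RANGES))
    return (positive - negative) / (width * height)


green, red, gray = (0, 255, 0), (255, 0, 0), (128, 128, 128)
sample = [[green, green], [red, gray]]
print(pixel_component(sample))
```

Gray pixels have no defined hue and are left out of the histogram, so they pull the score toward 0 rather than toward either end.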

FIGS. 7 to 9 are diagrams illustrating a method of extracting the sound component by the emotion estimation apparatus according to an embodiment of the present invention.

Referring to FIG. 7, sound is a vibration with a certain period and therefore has an amplitude and a period. In general, the larger the amplitude, the louder the sound, and the shorter the period, the higher the frequency and the higher the perceived pitch. The sounds heard in everyday life have more complicated waveforms than this.

Referring to FIG. 8, the emotion estimation apparatus 100 may analyze the sound component in terms of amplitude from the sound included in the image using Equation (2). The amplitude represents the loudness of the acquired sound. In general, the human ear is sensitive to quiet sounds and relatively less sensitive to loud sounds, so the amplitude can be converted to decibels (dB), a logarithmic unit that represents this well. To use the loudness information, the average amplitude of the N seconds of acquired data can be computed and used as a feature value of the sound component. For example, amplitudes in the Android environment are obtained in the range of -2¹⁵ to 2¹⁵ - 1.

(2)
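Equation (2) is not reproduced in this text. As a rough illustration of the amplitude feature described above, the sketch below averages the absolute values of signed 16-bit samples and expresses the result in decibels relative to full scale. The reference level of 2¹⁵, the floor of one quantization step, and the function name are assumptions, and the actual Equation (2) may differ.

```python
import math

FULL_SCALE = 2 ** 15  # magnitude bound of signed 16-bit samples (-2**15 to 2**15 - 1)


def mean_amplitude_db(samples):
    """Mean absolute amplitude of the samples, in dB relative to full scale."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_abs = sum(abs(s) for s in samples) / len(samples)
    mean_abs = max(mean_abs, 1.0)  # floor at one quantization step to avoid log10(0)
    return 20.0 * math.log10(mean_abs / FULL_SCALE)


# A signal at half of full scale is about 6 dB below the reference level.
half_scale = [16384, -16384, 16384, -16384]
print(round(mean_amplitude_db(half_scale), 2))
```

With this convention the feature is always at most 0 dB, and quieter recordings give more negative values.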

Referring to FIG. 9, the emotion estimation apparatus 100 may analyze the sound component in terms of frequency from the sound included in the image. The frequency is the number of times the cycle of a sound repeats in a given period of time and represents the pitch of the acquired sound. The emotion estimation apparatus 100 performs a discrete Fourier transform (DFT) to obtain the frequency from the acquired data. The DFT computes the content at each frequency of a waveform in which multiple frequencies are mixed, so the result shows which frequencies (Hz) make up the sound. Using this, the emotion estimation apparatus 100 transforms the N seconds of acquired data shown in FIG. 9(A) by the DFT as shown in FIG. 9(B), and the resulting frequency values (Hz) can be used as feature values. The DFT is defined by Equation (3) below.

(3)
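The equation image is not reproduced in this text. The sketch below uses the standard DFT definition, X[k] = Σ x[n]·exp(-2πi·k·n/N) for n = 0 to N-1, and takes the strongest non-DC bin as the frequency feature. Choosing the peak bin as the feature value and the function names are assumptions; the source only states that the resulting frequency values are used as features.

```python
import cmath
import math


def dft(samples):
    """Direct DFT: X[k] = sum over n of x[n] * exp(-2j * pi * k * n / N)."""
    n_samples = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n, x in enumerate(samples))
        for k in range(n_samples)
    ]


def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest non-DC bin in the lower half of the spectrum."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    spectrum = dft(samples)
    half = len(samples) // 2
    peak_bin = max(range(1, half + 1), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)


# One second of a 5 Hz sine wave sampled at 64 Hz.
sample_rate = 64
tone = [math.sin(2 * math.pi * 5 * n / sample_rate) for n in range(sample_rate)]
print(dominant_frequency(tone, sample_rate))
```

The direct sum costs O(N²) operations and is written this way only to mirror the definition; a fast Fourier transform gives the same spectrum for longer recordings.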

FIGS. 10 and 11 are diagrams for explaining a method of extracting spatial context information using a fully connected support vector regression network according to an embodiment of the present invention.

The emotion estimation apparatus 100 can predict the emotional state of the user using the fully connected support vector regression network based on the emotion elements extracted from the surrounding environment image. The emotion estimation apparatus 100 according to an embodiment of the present invention extracts nine features as emotion elements: temporal complexity (F1), spatial complexity (horizontal edge (F2) and vertical edge (F3)), pixel components (hue (F4), saturation (F5), intensity (F6), and contrast (F7)), and sound components (amplitude (F8) and frequency (F9)), and designs the fully connected support vector regression network using the extracted emotion elements. The fully connected support vector regression network connects multiple support vector regressions, and further support vector regressions can be added as the situation requires to infer the emotional state.

Referring to FIG. 10, when the emotional state is inferred using two support vector regressions, the first support vector regression (SVR #1) of the emotion estimation apparatus 100 infers spatial context information representing the unpleasant-pleasant feature, the second support vector regression (SVR #2) infers spatial context information representing the arousal-relaxation feature, and the two inference results are combined to infer a two-dimensional emotional state.

Referring to FIG. 11, when the emotional state is inferred using four support vector regressions, the emotion estimation apparatus 100 infers the emotional state from the seven emotion elements that can be obtained from a still image (F1 to F7 described above) using the first and second support vector regressions (SVR #1 and SVR #2), and from the sound components that can be obtained only from a moving image (F8 and F9 described above) using the third and fourth support vector regressions (SVR #3 and SVR #4). The two-dimensional emotional state can then be inferred by combining the first spatial context information inferred through the first and second support vector regressions with the second spatial context information inferred through the third and fourth support vector regressions. In doing so, the emotion estimation apparatus 100 can infer the emotional state by applying different weights to the first and second spatial context information, as shown in Equation (4).

(4)
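Equation (4) is not reproduced in this text. One plausible reading of the weighting step is a convex combination of the two two-dimensional estimates, sketched below. The function name, the constraint that the two weights sum to 1, and the example values are assumptions, and the actual Equation (4) may use a different form.

```python
def fuse_emotion_states(image_state, sound_state, image_weight=0.5):
    """Weighted combination of two (unpleasant-pleasant, arousal-relaxation) estimates.

    image_state: estimate from the image-based regressions (first spatial context information)
    sound_state: estimate from the sound-based regressions (second spatial context information)
    image_weight: weight of the image-based estimate; the sound weight is 1 - image_weight
    """
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be between 0 and 1")
    sound_weight = 1.0 - image_weight
    return tuple(
        image_weight * a + sound_weight * b
        for a, b in zip(image_state, sound_state)
    )


# The image suggests a pleasant, relaxed state; the sound suggests the opposite.
fused = fuse_emotion_states((0.5, -0.5), (-0.5, 0.5), image_weight=0.75)
print(fused)
```

Because the weights sum to 1, the fused point stays within the range spanned by the two inputs, which keeps it on the same map scale.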

Although the result inferred in the present invention is described as two-dimensional, the emotional state can be inferred in n dimensions according to the number of support vector regressions used.

FIG. 12 is a diagram illustrating an example in which the emotional state inferred by the emotion estimation apparatus according to an embodiment of the present invention is generated as a two-dimensional emotion inference map.

Referring to FIG. 12, when the surrounding environment image shown in FIG. 12(A) is input, the emotion estimation apparatus 100 extracts emotion elements including temporal complexity, spatial complexity (horizontal edge, vertical edge), pixel components (hue, saturation, intensity, contrast), and sound components (amplitude, frequency), and, using the spatial context information inferred through the fully connected support vector regression network, generates a two-dimensional emotion inference map that expresses the emotional state as position coordinates, as shown in FIG. 12(B). Here, the abscissa of the emotion inference map represents the unpleasant-pleasant emotion, and the ordinate represents the arousal-relaxation emotion. The emotion estimation apparatus 100 may output the generated emotion inference map through the user interface output unit.
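Expressing an inferred state as position coordinates on a two-dimensional map can be sketched as follows. The value range of -1 to 1 on each axis, the orientation (pleasant to the right, arousal at the top), the grid size, and the function names are assumptions made for this example.

```python
def quadrant(valence, arousal):
    """Label the map quadrant of a (unpleasant-pleasant, arousal-relaxation) point."""
    horizontal = "pleasant" if valence >= 0 else "unpleasant"
    vertical = "arousal" if arousal >= 0 else "relaxation"
    return f"{horizontal}/{vertical}"


def to_map_position(valence, arousal, size=101):
    """Convert a point in [-1, 1] x [-1, 1] to (row, col) on a size x size grid."""
    if not (-1.0 <= valence <= 1.0 and -1.0 <= arousal <= 1.0):
        raise ValueError("valence and arousal must lie in [-1, 1]")
    col = round((valence + 1.0) / 2.0 * (size - 1))   # abscissa: unpleasant -> pleasant
    row = round((1.0 - arousal) / 2.0 * (size - 1))   # ordinate: arousal at the top row
    return row, col


print(to_map_position(0.0, 0.0), quadrant(-0.3, 0.6))
```

A neutral state lands at the center of the grid, and the four quadrants correspond to the combinations of the two axes.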

The method for inferring emotion through the emotion estimation apparatus according to the embodiment of the present invention can be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and constructed for the present invention or may be known and available to those skilled in the computer software arts. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. The medium may also be a transmission medium, such as an optical or metal wire or a waveguide, including a carrier wave that transmits a signal designating program instructions, data structures, and the like. Examples of program instructions include machine language code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The embodiments of the present invention have been described above. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the appended claims rather than by the foregoing description, and all differences within the scope of equivalents thereof should be construed as being included in the present invention.

Claims

1. An emotion inference apparatus comprising:

a control unit that receives a surrounding environment image acquired from a camera sensor for a predetermined time, a subjective emotion evaluation input from a user interface input unit, and a bio-signal generated by a biometric sensor; and

an emotion inference unit that extracts a plurality of emotion elements from the surrounding environment image, extracts spatial context information through a fully connected SVR network designed using at least one of the emotion elements, the subjective emotion evaluation, and the bio-signal, and infers the emotional state of the user corresponding to the spatial context information.
2. The apparatus according to claim 1, wherein the emotion inference unit comprises:

a surrounding environment image input unit that receives as input a surrounding environment image, in the form of an RGB color image, acquired for a predetermined time;

an emotion element extraction unit that extracts emotion elements including at least one of temporal complexity, spatial complexity, pixel component, and sound component using feature values between pixels in the spatial coordinates of the surrounding environment image;

a spatial context information extraction unit that extracts spatial context information from at least one of the emotion elements, the subjective emotion evaluation, and the bio-signal; and

an emotion map generation unit that generates an n-dimensional emotion inference map using the extracted spatial context information.
3. The apparatus according to claim 2, wherein the emotion element extraction unit

converts the surrounding environment image to grayscale, calculates a difference image between a current frame and a previous frame, and extracts the temporal complexity by dividing the number of pixels in the difference image equal to or greater than a threshold by the total number of pixels.
4. The apparatus according to claim 2, wherein the emotion element extraction unit

generates a boundary component image by applying masks that detect boundary components in the horizontal and vertical directions to a single frame of the surrounding environment image, replaces each pixel brightness value of the boundary component image with 0 or 255 based on a threshold, and extracts the spatial complexity by dividing the number of pixels whose brightness value is 255 by the total number of pixels.
5. The apparatus according to claim 2, wherein the emotion element extraction unit

extracts the pixel component by detecting pixels in the color ranges causing positive and negative emotions through the following Equation 1,

(1)

wherein, in Equation 1, hue denotes the hue value of the HSI model, histo denotes the number of pixels in the image having the corresponding hue value, W denotes the width of the image, and H denotes the height of the image, and the result is divided by the product of the height and width of the image and normalized to a value between -1 and 1.
6. The apparatus according to claim 2, wherein the emotion element extraction unit

extracts the sound component by analyzing the amplitude or frequency of the sound produced while the surrounding environment image is acquired.
7. The apparatus according to claim 2, wherein the spatial context information extraction unit

extracts spatial context information from the n output values of n fully connected SVR networks designed using at least one of the emotion elements, the subjective emotion evaluation, and the bio-signal as input, and empirically infers the emotional state corresponding to the spatial context information.
8. The apparatus according to claim 1, wherein the emotion map generation unit

generates, as an n-dimensional emotion inference map whose number of dimensions follows the number of fully connected support vector regression networks, the inferred emotional state of the user corresponding to the spatial context information.
9. A method of inferring emotion using an emotion inference apparatus, the method comprising:

acquiring a surrounding environment image, in the form of a still image or a moving image, through an RGB camera for a predetermined time;

extracting emotion elements including at least one of temporal complexity, spatial complexity, pixel component, and sound component using feature values between pixels in the spatial coordinates of the acquired surrounding environment image;

extracting spatial context information from the emotion elements using a fully connected support vector regression network designed in advance, and inferring the emotional state corresponding to the spatial context information; and

generating the inferred result as an n-dimensional emotion inference map.
10. The method of claim 9, wherein, in extracting the emotion elements,

the temporal complexity is extracted by converting the surrounding environment image to grayscale, calculating a difference image between a current frame and a previous frame, and dividing the number of pixels in the calculated difference image equal to or greater than a threshold by the total number of pixels,

the spatial complexity is extracted by generating a boundary component image by applying masks that detect boundary components in the horizontal and vertical directions to a single frame of the surrounding environment image, replacing each pixel brightness value of the generated boundary component image with 0 or 255 based on a threshold, and dividing the number of pixels whose brightness value is 255 by the total number of pixels,

the pixel component is extracted by converting the surrounding environment image into the HSI model and detecting pixels in the color ranges causing positive and negative emotional states using the following Equation 1, in which hue is the hue value of the HSI model, histo is the number of pixels in the image having the corresponding hue value, W is the width of the image, and H is the height of the image, and the result is divided by the product of the height and width of the image and normalized to a value between -1 and 1,

(1)

and the sound component is extracted by analyzing the amplitude or frequency of the sound produced while the surrounding environment image is acquired.
11. The method of claim 9, wherein extracting the spatial context information comprises:

extracting the spatial context information by using the emotion elements as input values of n fully connected SVR networks; and

when the subjective emotion evaluation of the user input from the user interface input unit and the bio-signal generated by the biometric sensor are received, extracting the spatial context information by using the emotion elements, the subjective emotion evaluation, and the bio-signal as input values.
12. A computer program for executing the emotion inferencing method according to any one of claims 7 to 11 and recorded on a computer-readable recording medium.