CN113673567B

CN113673567B - Panorama emotion recognition method and system based on multi-angle sub-region self-adaption

Info

Publication number: CN113673567B
Application number: CN202110816786.4A
Authority: CN
Inventors: 青春美; 黄容; 徐向民
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-07-20
Filing date: 2021-07-20
Publication date: 2023-07-21
Anticipated expiration: 2041-07-20
Also published as: CN113673567A

Abstract

The invention discloses a panorama emotion recognition method and system based on multi-angle sub-region self-adaption, wherein the panorama emotion recognition method and system comprise a multi-angle rotation module, a feature extraction module, a sub-region self-adaption module, a multi-scale fusion module and an emotion classification module, wherein the multi-angle rotation module, the feature extraction module, the sub-region self-adaption module, the multi-scale fusion module and the emotion classification module are used for predicting user emotion recognition in an immersive virtual environment. A series of equidistant columnar projection panoramic images are generated by utilizing a spherical multi-angle rotation algorithm, and characteristic advantages of different layers are obtained by inputting a convolutional neural network. And guiding local features through global features, adaptively establishing relevance between current scale context features, and capturing global and local context dependencies of feature graphs of different layers. And (3) up-sampling the feature graphs of different layers, and splicing in the channel dimension to realize feature fusion and obtain the emotion classification labels of the users. The invention can correctly predict the emotion preference and distribution of the user in various scenes and promote the user experience under VR.

Description

Panorama emotion recognition method and system based on multi-angle sub-region self-adaption

Technical Field

The invention relates to the field of emotion recognition, in particular to a panorama emotion recognition method and system based on multi-angle sub-region self-adaption.

Background

Emotion is a psychological and physiological state, accompanied by cognitive and conscious processes, and the study of human emotion and cognition is a high-level stage of artificial intelligence. With the explosive development of artificial intelligence and deep learning, it is possible to build emotion models with the ability to perceive, recognize and understand human emotion. The intelligent, sensitive and friendly feedback capability of the machine to the emotion of the user is provided, so that a natural environment where people are harmonious is finally built, and the attractive prospect guides a new direction for future application of the computer.

The traditional emotion induction has modes of pictures, words, voice, video and the like, and the actual prediction effect of the corresponding emotion recognition data set is not satisfactory. The virtual reality technology achieves the aim of emotion induction through immersive vivid three-dimensional experience, and is a better emotion induction element. In recent years, deep learning has revolutionized in practice, but in emotion interaction, emotion tag data based on a virtual reality induced state is rare, and an effective emotion research method and model are lacking. The panorama is a storage form of all-dimensional and real space information on a two-dimensional plane, and can be used as an effective material for analyzing emotion of the VR immersive virtual environment.

Disclosure of Invention

In order to overcome the defects and shortcomings of the prior art, the invention provides a panorama emotion recognition system and method based on multi-angle sub-region self-adaption.

According to the invention, through the display characteristics of panoramic content in the head-mounted display and the equidistant columnar projection mode, a spherical multi-angle rotation algorithm is designed to obtain panoramic images at different angles, and the panoramic images are combined with the self-adaptive context convolutional neural network, so that the accuracy of emotion classification labels is effectively improved.

The invention adopts the following technical scheme:

a panorama emotion recognition method based on multi-angle sub-region self-adaption comprises the following steps:

a multi-angle rotation step: the conversion from a three-dimensional omnibearing stereoscopic view to a two-dimensional plane panoramic view is realized by spherical multi-angle rotation and equidistant columnar projection;

and a feature extraction step: performing feature extraction on the two-dimensional plane panorama by using a pre-training convolutional neural network model to obtain feature maps of different layers;

and a sub-region self-adaption step: inputting feature graphs of different levels, searching for the relevance between the global and the local, adaptively establishing the context features of the current scale, and capturing the context dependence between the global and the local of the feature graphs of different levels;

a multi-scale fusion step: unifying the sizes of the feature graphs of different layers through an up-sampling step, and splicing the feature graphs in the channel dimension to realize multi-scale feature fusion;

and emotion classification: and determining target emotion according to the advantages of different layers of features, and outputting a corresponding emotion label.

Further, the spherical multi-angle rotation specifically includes:

establishing a three-dimensional spherical coordinate system with the head of the user as the sphere center, and firstly projecting a 360-degree panoramic image presented by the user under the head-mounted display onto the surface of the sphere;

rotating the projection graph according to the content distribution characteristics of the panoramic graph;

the rotation comprises horizontal rotation and vertical rotation, and the horizontal rotation realizes that the edge content cut at two sides rotates to an intermediate main vision area; vertical rotation achieves rotation of the bipolar severely distorted content to near the equator.

Further, the equidistant cylindrical projection is to map warp lines to vertical lines with constant spacing, map weft lines to horizontal lines with constant spacing, and project a three-dimensional stereoscopic view equidistant cylinder to a two-dimensional panorama.

Further, the three-dimensional spherical coordinates are a right-hand coordinate system, the field angle is 90 degrees, and the binocular direct vision direction of the user is taken as a horizontal axis, so that the center coordinates of the front vision port are [0, 0]; the center coordinate of the right viewing port is [90,0,0]; the center coordinate of the rear view port is [180,0,0]; the center coordinate of the left view port is [ -90,0,0]; the center coordinate of the upper view port is [0,90,0]; the center coordinates of the lower view port are [0, -90,0]; corresponding to the six faces of the cube tangent to the sphere.

Further, the feature extraction step specifically includes:

inputting the two-dimensional panorama into a pre-trained convolutional neural network, extracting the hierarchical structures of different universal feature spaces of the visual world, and forming a feature vector set [ X ] ¹ ,X ² ,...,X ^l ]Each element in the collection represents a feature map of the current hierarchy.

Further, the sub-region self-adaptation step comprises two branches, namely a sub-region content representation branch and an emotion contribution degree representation branch;

the sub-region content representation branch obtains a sub-region content representation y through self-adaptive average pooling operation of a feature map with the input size of h multiplied by w multiplied by c ^s Wherein h, w, c, s respectively representThe height, width, channel number and preset size of the feature map;

the emotion contribution degree characterizes branches, and specifically comprises the following steps:

for feature vector set [ X ] ¹ ,X ² ,...,X ^l ]Each element in the list is subjected to global pooling to obtain a global information representation g (X ^l )；

Global information characterization g (X) using broadcast mechanism ^l ) Residual connection is realized by adding the residual connection with the input characteristic diagram element by element, and the number of channels is converted into s through convolution operation of 1x1 ² Thereby constructing a size of hw×s ² Is an adaptive emotion contribution matrix a ^s ；

Will adapt emotion contribution degree matrix a ^s Representation y of sub-region content ^s Multiplying to obtain a context feature characterization vector Z ^l The vector represents each pixel point i and each sub-area y ^s×s Is a degree of association of (a) with each other.

Further, the adaptive averaging pooling divides the input feature map into sxs subregions, resulting in a set of subregions representing Y ^s×s ＝[y ¹ ,y ² ,...,y ^s×s ]The feature map with the size of sxsxc is deformed into s ² X c sub-region content characterization y ^s 。

Further, constructing an emotion contribution degree matrix a ^s The method comprises the following specific steps: set sub-area y ^s×s The contribution degree to the emotion classification label at the point i of the feature map is a _i Then any point of the feature map corresponds to s×s emotion contribution degree vectors a _i Form a collectionDeformation to obtain emotion contribution degree matrix a ^s The size of the catalyst is hw×s ² 。

Further, the multi-scale fusion step specifically comprises the following steps: by means of up-sampling operation, such as deconvolution or interpolation operation, multi-scale feature graphs of different layers are achieved, the sizes are uniform, the feature fusion is completed by splicing in the channel dimension, and finally the multi-scale feature graph with the size of H multiplied by W multiplied by C (C) ₁ +C ₂ +...+C _l ) Is combined with the high-level semantic information representation.

A system for realizing a panorama emotion recognition method based on multi-angle sub-region self-adaption comprises:

a multi-angle rotation module: the method is used for realizing conversion from a three-dimensional panoramic view to a two-dimensional panoramic view through multi-angle rotation and equidistant columnar projection;

and the feature extraction module is used for: the method is used for extracting the characteristics of the two-dimensional panoramic image to obtain characteristic images of different levels;

a subregion self-adaption module: the method is used for correlating the areas with consistent emotion classification labels, the global features guide local features to adaptively establish the correlation of the context features of the current scale, and long-distance dependence is captured;

a multi-scale fusion module: the method is used for unifying the sizes of the feature graphs of different layers and splicing the feature graphs in the channel dimension to realize multi-scale feature fusion;

and an emotion classification module: and determining target emotion according to the advantages of different layers of features, and outputting a corresponding emotion label.

The invention has the following beneficial effects:

1. aiming at the problem of sparse emotion tag data in a virtual reality induction state, a spherical multi-angle rotation algorithm is provided to realize data enhancement. And establishing a three-dimensional spherical coordinate system for the 360-degree view in the virtual environment of the user, respectively carrying out equidistant columnar projection after rotating the sphere along different coordinate axes to obtain an expanded data sample, and effectively improving the generalization capability of the model.

2. Equidistant columnar projection projects warp and weft to a rectangular plane equidistantly, so that serious distortion and distortion of panoramic content occur at the upper pole and the lower pole. The data sample expanded by the spherical multi-angle rotation algorithm can keep rotation invariance, and rotate the edge information at two sides to a central main visual area while relieving distortion, so that the content characteristics can be better captured and extracted by the emotion model, and the model identification accuracy is improved.

3. And extracting different layers of features of the panorama by using the pre-trained convolutional neural network, and playing the complementary advantages of the bottom layer detail information and the high-level semantic information. And guiding local features through global features, adaptively establishing the relevance between different areas or objects of the feature map, and capturing long-distance dependence. Therefore, the prediction performance of the model on the panoramic image emotion induction area is effectively improved.

4. The invention fills the blank in the field of panorama emotion recognition, is beneficial to interpretation and feedback collection of user emotion in an immersive virtual environment, and is critical to development of VR application scenes such as user behavior prediction, VR scene modeling and the like.

Drawings

FIG. 1 is a flow chart of a general implementation method of the present invention.

FIG. 2 is a schematic diagram of a user wearing a display in a virtual environment.

Fig. 3 (a) and 3 (b) are respectively three-dimensional spherical coordinates and a two-dimensional plan view after projection.

FIG. 4 is a schematic diagram of the effect of the multi-angle rotation algorithm rotated 180 degrees along the x-axis.

Fig. 5 is a schematic diagram of a sub-region adaptive module of the present invention.

FIG. 6 is a schematic diagram of a model framework of the overall implementation of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.

Examples

As shown in fig. 1, a panorama emotion recognition method based on multi-angle sub-region self-adaption is used for recognizing and predicting user emotion in an immersive virtual environment, and comprises the following steps:

the multi-angle rotation module, for an interactive 360 degree view presented to a user of an immersive virtual environment, acquires a series of data expansion samples using a spherical multi-angle rotation algorithm as shown in fig. 2. And the warp yarns are mapped into vertical lines with constant intervals by utilizing equidistant columnar projection, and the weft yarns are mapped into horizontal lines with constant intervals, so that the conversion from the three-dimensional omnibearing stereoscopic view to the two-dimensional plane panoramic view is completed.

HMD in fig. 2 represents a head mounted display.

The spherical multi-angle rotation algorithm specifically comprises the following steps: and establishing a three-dimensional Cartesian coordinate system with the head of the user as the sphere center. The spheres are sequentially rotated by a certain angle along the horizontal axis, so that the object which is seriously distorted at two poles is rotated to the vicinity of the equator at multiple angles to improve distortion. And simultaneously, the sphere is sequentially rotated by a certain angle along a vertical axis, and the edge contents cut at two sides are rotated to a central main visual area.

The multi-angle rotation algorithm is used for enabling the region of the panoramic image inducing emotion to rotate to a position close to the equator of the main view according to the content distribution characteristics of the panoramic image, reducing adverse effects caused by distortion projection, and facilitating capturing of relevant features by the model.

Further, the spherical multi-angle rotation algorithm specifically comprises the following steps:

a three-dimensional spherical coordinate system with the head of the user as the origin o is constructed to conform to the right-hand coordinate system, as shown in fig. 3 (a). The spherical body is rotated by 90 degrees along the horizontal direction by using a spherical multi-angle rotation algorithm and repeated for 2 times, so that the cut edge contents at the two sides are rotated to the middle main vision area, and the method is shown in fig. 4. And then the sphere is rotated for 45 degrees along the vertical direction and repeated for 4 times, and the object with the severely distorted original two poles is rotated to the vicinity of the equator to improve the distortion. Each panorama gives results of 2x4=8 data enhancements.

The height of the panorama is H, the width of the panorama is W, the coordinates of any point on the plane are (u, v), the corresponding three-dimensional spherical coordinate points are (x, y, z), and the longitude and latitude values areThe relationship between longitude and latitude and the spherical coordinates is as follows:

the conversion formula of the same point in the three-dimensional space and the two-dimensional plane is as follows:

the warp yarns are mapped to vertical lines of constant pitch and the weft yarns are mapped to horizontal lines of constant pitch, as shown in fig. 3 (b).

In the emotion recognition field, due to the limitation of distortion of content in the panorama ERP storage format, in order to facilitate model capture of relevant features, a multi-angle algorithm needs to rotate an object or region inducing emotion to a position of a main view, which is close to the equator, so that the object or region is projected to the center of a two-dimensional plane through equidistant rectangles. However, the rotation angles required by different panoramic views are different, and personalized customization of each panoramic view is impractical by manpower. Typically, the above requirement is basically achieved by rotating the sphere horizontally by 90 degrees, repeating 2 times, and rotating the sphere by 45 degrees along the x-axis, repeating 4 times, and obtaining 2×4=8 results for each panorama.

And the feature extraction module is used for realizing feature extraction by using a convolutional neural network which is pre-trained on a large-scale image classification task. For the input image I, equation X is utilized ^l ＝f(Σk ^l ·X ^l-1 +b ^l ) Extracting the hierarchical structure of different feature spaces universal to the visual world to form a feature map vector set [ X ] ¹ ,X ² ,...,X ^l ]. Wherein k is ^l X is the convolution kernel of the first layer ^l-1 B for the l-1 layer output feature map ^l Is a bias term. Each element in the set represents the characteristic diagram of the current level, and serves as the input of the sub-area self-adaptive module to exert the complementary advantages of information of different levels.

And the sub-region self-adaption module is used for adaptively establishing the context characteristics of the current scale by searching the global and local relevance as shown in fig. 5 and capturing the global and local context dependencies of the feature graphs of different layers. The module consists of two branches of a sub-region content representation branch and an emotion contribution degree representation branch, and specifically comprises the following steps:

sub-region content characterization branch pair feature vector set [ X ] ¹ ,X ² ,...,X ^l ]Each element in the (c) is subjected to adaptive average pooling, and an adaptive average pooling function is defined as follows:

kernel_size＝(input_size+2×padding)-(output_size-1)×stride

i.e. the input size, the output size, the boundary filling and the step size of the movement determine the size of the current convolution kernel. Feature map X with size of h multiplied by w multiplied by c ^l Converted into sxsxc where h, w, c, s represent the height, width, number of channels, and preset size of the feature map, respectively. Then the adaptive averaging pooling divides the input feature map into sxs subregions, resulting in a set of subregions representing Y ^s×s ＝[y ¹ ,y ² ,...,y ^s×s ]. Transforming a feature map of size sxsxc to s ² X c sub-region content characterization y ^s 。

Emotion contribution degree characterization branch pair feature vector set [ X ] ¹ ,X ² ,...,X ^l ]Each element in the list is subjected to global average pooling to obtain a global information representation g (X ^l ). And adding the 1 multiplied by c global information representation with the input feature map pixel by using a broadcasting mechanism to realize residual connection, so as to obtain the feature map with the size of h multiplied by w multiplied by c.

Set sub-area y ^s×s The contribution degree to the emotion classification label at the point i of the feature map is a _i Converting the number of channels to s by a convolution operation of 1x1 ² Then any point of the feature map corresponds to s×s emotion contribution degree vectors a _i Form a collectionDeformation to obtain a size hw×s ² Self-adaptive emotion contribution degree matrix a ^s 。

Emotion contribution matrix a for representing emotion contribution degree to branch output ^s Sub-region content representation y branching from sub-region content representation ^s Multiplied, the function is defined as follows:

obtaining a context feature characterization vector Z ^l The vector represents each pixel point i and each sub-area y ^s×s Is the degree of association of its implicit emotion contribution vector A _i The global and local connection weights are characterized and automatically optimized with continuous iteration of the network.

Further, the dependency refers to a correlation between two or more emotion subjects. The feature extraction module can identify different areas or objects, such as emotion principals and cats, by utilizing global and local features of the panorama, but the feature extraction module is insufficient as a standard for emotion prediction. It is also necessary to adaptively establish the association between a person and a cat through a sub-region adaptation module, where the person is teasing or stroking the cat, thereby giving the correct positive emotion label.

And the multi-scale fusion module is used for realizing feature fusion of the feature graphs of different layers. The up-sampling operation is utilized to realize the dimension unification of the feature images of different layers, then the feature images with the unification dimension are spliced in the channel dimension, and finally the size H multiplied by W multiplied by (C) ₁ +C ₂ +...+C _l ) Is combined with the high-level semantic information characterization.

And the emotion classification module can realize higher emotion classification effect on the panoramic image containing the remarkable main body and the panoramic image not containing the remarkable main body. Due to the parameter redundancy of the full connection layer, the global average pooling is utilized to replace the full connection layer to play the role of a classifier. And carrying out emotion recognition on the panorama with the remarkable main body by using deep features which pay more attention to abstract semantic information. The panoramic view without significant subjects is emotion identified using shallow features that provide detail perception information about edges, fringes, and colors. And obtaining the emotion classification label with higher accuracy, wherein the overall framework of the model is shown in fig. 6.

The features extracted by the different layers of convolution operations of the feature extraction module are different, and the lower-layer convolution of conv layers_1, 2 and the like extracts visual layer features such as colors, textures, contours and the like, and the higher-layer convolution of conv layers 4,5 and the like extracts object layer features and concept layer features, namely abstract semantic information. Predicting emotion areas of different/same panoramic images needs to combine the characteristic advantages of different layers, and if the panoramic image content is a single straight white natural scene, the bottom color and texture information are key for correct classification; if the panorama content is a complex multi-object interaction scenario, then the higher-level semantic information is important. The sub-region self-adaptive module is beneficial to better capturing the emotion induction region by establishing the relevance between different regions of the feature map and the object, so that a correct emotion label is given.

In this embodiment, the feature extraction module extracts 4 layers of feature maps of conv layers_2, 3,4,5, and at the same time, the feature map of each layer is sent to the sub-region adaptive module, and establishes the relevance of different regions under different scales s=1, 2,4, n (S is set to be not limited, and the combination effect of 1,2,4 is generally the best). Because the feature images of different levels are different in size, the feature images are firstly unified in size through a multi-scale fusion module, then all the feature images are spliced in the channel dimension, the spliced total features are used as the basis of emotion classification, and finally the emotion polarity of the input panoramic image, namely positive or negative, is obtained.

The embodiments described above are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the embodiments described above, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principles of the present invention should be made in the equivalent manner, and are included in the scope of the present invention.

Claims

1. A panorama emotion recognition method based on multi-angle sub-region self-adaption is characterized by comprising the following steps:

and a feature extraction step: extracting features of the two-dimensional panorama by using a pre-training model to obtain feature maps of different levels;

a multi-scale fusion step: splicing the feature graphs of different layers on the channel dimension to realize multi-scale feature fusion;

and emotion classification: determining target emotion according to the advantages of different layers of features, and outputting a corresponding emotion label;

the spherical multi-angle rotation specifically comprises the following steps:

the rotation comprises horizontal rotation and vertical rotation, and the horizontal rotation realizes that the edge content cut at two sides rotates to an intermediate main vision area; the vertical rotation realizes the rotation of the severely distorted content of the two poles to the vicinity of the equator;

the equidistant cylindrical projection is to map warp lines into vertical lines with constant spacing, map weft lines into horizontal lines with constant spacing, and project the equidistant cylinders of the three-dimensional stereoscopic view into the two-dimensional panoramic view;

the sub-region self-adaptation step comprises two branches, namely a sub-region content representation branch and an emotion contribution degree representation branch;

the sub-region content representation branch obtains a sub-region content representation y through self-adaptive average pooling operation of a feature map with the input size of h multiplied by w multiplied by c ^s Wherein h, w, c and s respectively represent the height, width, channel number and preset size of the feature map;

2. The panorama emotion recognition method according to claim 1, wherein the three-dimensional spherical coordinates are a right-hand coordinate system, the angle of view is 90 degrees, and the center coordinates of the front view port are [0, 0] with the binocular direct view direction of the user as the horizontal axis; the center coordinate of the right viewing port is [90,0,0]; the center coordinate of the rear view port is [180,0,0]; the center coordinate of the left view port is [ -90,0,0]; the center coordinate of the upper view port is [0,90,0]; the center coordinates of the lower view port are [0, -90,0]; corresponding to the six faces of the cube tangent to the sphere.

3. The panorama emotion recognition method according to claim 1, wherein the feature extraction step specifically comprises:

4. The panorama emotion recognition method according to claim 1, wherein a size hw x s is constructed ² Is an adaptive emotion contribution matrix a ^s The method specifically comprises the following steps: set sub-area y ^s×s The contribution degree to the emotion classification label at the point i of the feature map is a _i Converting the number of channels to s by a convolution operation of 1x1 ² Then the arbitrary point of the feature map corresponds to s×s emotion contribution degreesQuantity a _i Form a collectionDeformation to obtain a size hw×s ² Self-adaptive emotion contribution degree matrix a ^s 。

5. The panorama emotion recognition method according to claim 1, wherein the adaptive averaging pooling divides the input feature map into sxs sub-regions, resulting in a set of sub-region representations Y ^s×s ＝[y ¹ ,y ² ,...,y ^s×s ]The feature map with the size of sxsxc is deformed into s ² X c sub-region content characterization y ^s 。

6. A system for implementing a panorama emotion recognition method based on multi-angle sub-region adaptation according to any one of claims 1-5, comprising:

and the feature extraction module is used for: extracting features of the two-dimensional panoramic image to obtain feature images of different levels, and capturing the overall and local context dependence of the feature images;

a subregion self-adaption module: the areas with consistent emotion classification labels are associated with each other, and the context characteristics of the current scale are built in a self-adaptive mode by searching for the association between the global and the local;

a multi-scale fusion module: splicing the feature graphs in the channel dimension, and carrying out multi-scale feature fusion;