
Panorama emotion recognition method and system based on multi-angle sub-region adaptation (CN113673567A)

Info

Publication number
CN113673567A
Authority
CN
China
Prior art keywords
emotion
feature
sub
panorama
region
Prior art date
Legal status
Granted
Application number
CN202110816786.4A
Other languages
Chinese (zh)
Other versions
CN113673567B (en)
Inventor
青春美
黄容
徐向民
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202110816786.4A
Publication of CN113673567A
Application granted
Publication of CN113673567B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/06 Topological mapping of higher dimensional structures onto lower dimensional surfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/04 Context-preserving transformations, e.g. by using an importance map
    • G06T 3/047 Fisheye or wide-angle transformations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/12 Edge-based segmentation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Analysis (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Algebra (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a panorama emotion recognition method and system based on multi-angle sub-region adaptation. A spherical multi-angle rotation algorithm generates a series of equirectangular (equidistant cylindrical) projection panoramas, which are fed into a convolutional neural network to obtain feature maps at different levels. Global features guide local features to adaptively establish the relevance of context features at the current scale and to capture the global and local context dependencies of the feature maps at each level. The feature maps of the different levels are then upsampled and concatenated along the channel dimension to realize feature fusion, from which the user's emotion classification label is obtained. The invention can correctly predict the user's emotional preference and its distribution across a variety of scenes, improving the user experience in VR.

Description

Panorama emotion recognition method and system based on multi-angle sub-region adaptation
Technical Field
The invention relates to the field of emotion recognition, in particular to a panorama emotion recognition method and system based on multi-angle sub-region adaptation.
Background
Emotion is a psychological and physiological state accompanied by cognitive and conscious processes, and research on human emotion and cognition represents an advanced stage of artificial intelligence. With the vigorous development of artificial intelligence and deep learning, it has become possible to build emotion models capable of perceiving, recognizing and understanding human emotion. Giving machines the ability to provide intelligent, sensitive and friendly feedback on a user's emotions would ultimately create a natural environment in which people coexist harmoniously with one another and with machines, a vision that points to new directions for future computer applications.
Traditional emotion induction relies on stimuli such as pictures, text, speech and video, and the actual prediction performance on the corresponding emotion recognition datasets is unsatisfactory. Virtual reality technology induces emotion through immersive, vivid, three-dimensional experiences and is therefore a better emotion-inducing medium. In recent years deep learning has proved revolutionary in practice, but for emotional interaction, emotion label data collected under virtual-reality-evoked states are scarce, and effective emotion research methods and models are lacking. A panorama stores omnidirectional, real spatial information on a two-dimensional plane and can serve as effective material for analyzing emotion in an immersive VR environment.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention provides a panorama emotion recognition method and system based on multi-angle sub-region adaptation.
Based on the display characteristics of panoramic content in a head-mounted display and the equirectangular (equidistant cylindrical) projection format, the invention designs a spherical multi-angle rotation algorithm to obtain panoramas at different angles and combines them with a context-adaptive convolutional neural network, effectively improving the accuracy of emotion classification labels.
The invention adopts the following technical scheme:
a panorama emotion recognition method based on multi-angle subregion adaptation comprises the following steps:
multi-angle rotation step: the conversion from a three-dimensional omnibearing stereoscopic view to a two-dimensional plane panorama is realized by adopting spherical multi-angle rotation and equidistant columnar projection;
a characteristic extraction step: extracting the characteristics of the two-dimensional planar panoramic image by using a pre-trained convolutional neural network model to obtain characteristic images of different levels;
sub-region adaptation step: inputting feature maps of different levels, searching for global and local relevance, adaptively establishing context features of the current scale, and capturing global and local context dependencies of the feature maps of different levels;
multi-scale fusion step: unifying the sizes of the feature maps of different levels through an up-sampling step, and splicing the feature maps on the channel dimension to realize multi-scale feature fusion;
and (3) emotion classification step: and determining the target emotion according to the advantages of the different level features, and outputting a corresponding emotion label.
Further, the spherical multi-angle rotation specifically comprises:
establishing a three-dimensional spherical coordinate system with the user's head as the sphere center, and projecting the 360-degree panorama presented to the user in a head-mounted display onto the surface of the sphere;
rotating the projection according to the content distribution of the panorama;
the rotation comprises horizontal rotation and vertical rotation: horizontal rotation brings the clipped edge content on both sides into the central main viewing area, and vertical rotation brings the severely distorted polar content to near the equator.
Further, the equirectangular projection maps meridians to vertical lines of constant spacing and parallels to horizontal lines of constant spacing, projecting the three-dimensional stereoscopic view onto the two-dimensional panorama.
Further, the three-dimensional spherical coordinate system is right-handed with a 90-degree field of view per face; taking the user's straight-ahead binocular viewing direction as the horizontal axis, the front viewport center is [0, 0, 0], the right viewport center is [90, 0, 0], the back viewport center is [180, 0, 0], the left viewport center is [-90, 0, 0], the upper viewport center is [0, 90, 0], and the lower viewport center is [0, -90, 0], corresponding to the six faces of a cube tangent to the sphere.
Further, the feature extraction step specifically comprises:
inputting the two-dimensional panorama into a pre-trained convolutional neural network and extracting the hierarchy of feature spaces general to the visual world, forming a feature vector set [X_1, X_2, ..., X_l], where each element of the set represents the feature map of the corresponding level.
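As an illustration, the following is a minimal PyTorch sketch of this step. The torchvision ResNet-50 backbone and the use of its four residual stages as the level-wise feature maps are assumptions made for illustration; the text does not fix a particular architecture.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Pre-trained backbone; its four residual stages stand in for the
# hierarchy of feature maps [X_1, X_2, ..., X_l] (here l = 4).
backbone = resnet50(weights="IMAGENET1K_V2")
extractor = create_feature_extractor(
    backbone, return_nodes={f"layer{i}": f"X{i}" for i in range(1, 5)})

panorama = torch.randn(1, 3, 512, 1024)   # one 2:1 equirectangular panorama
features = extractor(panorama)            # dict of feature maps, one per level
for name, fmap in features.items():
    print(name, tuple(fmap.shape))        # channels and spatial size per level
```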
Further, the sub-region adaptation step comprises a sub-region content characterization branch and an emotion contribution characterization branch;
the sub-region content characterization branch applies an adaptive average pooling operation to an input feature map of size h×w×c to obtain the sub-region content characterization y_s, where h, w, c and s denote the height, width, number of channels and preset sub-region size of the feature map, respectively;
the emotion contribution characterization branch specifically comprises:
performing global average pooling on each element of the feature vector set [X_1, X_2, ..., X_l] to obtain a global information characterization g(X_l) of size 1×1×c;
adding the global information characterization g(X_l) to the input feature map element by element via a broadcast mechanism to realize a residual connection, and converting the number of channels to s² by a 1×1 convolution, thereby constructing an adaptive emotion contribution matrix a_s of size hw×s²;
multiplying the adaptive emotion contribution matrix a_s with the sub-region content characterization y_s to obtain the context feature characterization vector Z_l, which represents the degree of association between each pixel i and each sub-region y_{s×s}.
Further, the adaptive average pooling divides the input feature map into s×s sub-regions, yielding a set of sub-region characterizations Y_{s×s} = [y_1, y_2, ..., y_{s×s}]; the feature map of size s×s×c is then reshaped into the sub-region content characterization y_s of size s²×c.
Further, constructing the emotion contribution matrix a_s specifically comprises: let a_i denote the contribution of sub-region y_{s×s} to the emotion classification label at point i of the feature map; then each point i of the feature map corresponds to s×s emotion contribution values a_i, forming the set A_i = [a_1, a_2, ..., a_{s×s}], which is reshaped to obtain the emotion contribution matrix a_s of size hw×s².
Further, the multi-scale fusion step specifically comprises: unifying the sizes of the multi-level feature maps with an upsampling operation such as deconvolution or interpolation, and completing feature fusion by concatenation along the channel dimension, finally obtaining a representation of size H×W×(C_1+C_2+...+C_l) in which low-level detail information is combined with high-level semantic information.
A system for realizing the panorama emotion recognition method based on multi-angle sub-region adaptation comprises:
a multi-angle rotation module: for converting the three-dimensional panoramic view into the two-dimensional panorama by multi-angle rotation and equirectangular projection;
a feature extraction module: for extracting features of the two-dimensional panorama to obtain feature maps at different levels;
a sub-region adaptation module: for correlating regions with consistent emotion classification labels; global features guide local features to adaptively establish the relevance of context features at the current scale and capture long-distance dependencies;
a multi-scale fusion module: for unifying the sizes of the feature maps at different levels and concatenating them along the channel dimension to realize multi-scale feature fusion;
an emotion classification module: for determining the target emotion from the complementary strengths of the different-level features and outputting the corresponding emotion label.
The invention has the following beneficial effects:
1. Aiming at the scarcity of emotion label data under virtual-reality-induced states, a spherical multi-angle rotation algorithm is proposed to realize data augmentation. A three-dimensional spherical coordinate system is established for the 360-degree view in the user's virtual environment; the sphere is rotated at multiple angles about different coordinate axes and then projected equirectangularly to obtain expanded data samples, effectively improving the generalization ability of the model.
2. Equirectangular projection maps meridians and parallels onto a rectangular plane at equal intervals, which severely distorts panoramic content at the upper and lower poles. The data samples expanded by the spherical multi-angle rotation algorithm preserve rotation invariance; while distortion is alleviated, edge information from both sides is rotated into the central main viewing area, so the emotion model can capture and extract content features well, improving the recognition accuracy of the model.
3. The pre-trained convolutional neural network extracts features of the panorama at different levels, exploiting the complementary advantages of low-level detail information and high-level semantic information. Global features guide local features, adaptively establishing relevance between different regions or objects of the feature map and capturing long-distance dependencies. This effectively improves the model's prediction of the emotion-inducing regions of the panorama.
4. The method fills a gap in the field of panoramic image emotion recognition, helps read the user's emotion and collect feedback in an immersive virtual environment, and is important for developing VR applications such as user behavior prediction and VR scene modeling.
Drawings
FIG. 1 is a flow chart of the overall method of practicing the present invention.
FIG. 2 is a schematic diagram of a user head-mounted display in a virtual environment.
Fig. 3(a) and 3(b) are schematic diagrams of three-dimensional spherical coordinates and a projected two-dimensional plane, respectively.
FIG. 4 is a schematic diagram illustrating the effect of a multi-angle rotation algorithm rotating 180 degrees along the x-axis.
FIG. 5 is a diagram of a sub-region adaptation module according to the present invention.
FIG. 6 is a schematic diagram of a model framework of an overall implementation of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Examples
As shown in fig. 1, a panorama emotion recognition method based on multi-angle sub-region adaptation is used for recognizing and predicting user emotion in an immersive virtual environment, and includes the following steps:
and the multi-angle rotation module is used for presenting an interactive 360-degree view of the immersive virtual environment to the user, and as shown in fig. 2, a series of data expansion samples are obtained by adopting a spherical multi-angle rotation algorithm. And the longitude lines are mapped into vertical lines with constant intervals by utilizing equidistant columnar projection, and the latitude lines are mapped into horizontal lines with constant intervals, so that the conversion from a three-dimensional omnibearing stereoscopic view to a two-dimensional plane panorama is completed.
The HMD in fig. 2 represents a head mounted display.
The spherical multi-angle rotation algorithm is as follows: a three-dimensional Cartesian coordinate system is established with the user's head as the center of the sphere. The sphere is rotated by a fixed angle about the horizontal axis in sequence, so that the severely distorted objects at the two poles rotate toward the equator from multiple angles and the distortion is improved. Meanwhile, the sphere is rotated by a fixed angle about the vertical axis in sequence, so that the clipped edge content on both sides rotates into the central main viewing area.
With the multi-angle rotation algorithm, the emotion-inducing region of the panorama is rotated toward the main view near the equator according to the content distribution of the panorama, reducing the adverse effect of distorted projection and making it easier for the model to capture the relevant features.
The rotation comprises horizontal rotation and vertical rotation: horizontal rotation brings the clipped edge content on both sides into the central main viewing area, and vertical rotation brings the severely distorted polar content to near the equator.
Further, the spherical multi-angle rotation algorithm specifically comprises the following steps:
a three-dimensional spherical coordinate system with the user's head as the origin o is constructed following the right-hand rule, as shown in fig. 3(a). Using the spherical multi-angle rotation algorithm, the sphere is rotated 90 degrees horizontally, repeated 2 times, so that the clipped edge content on both sides rotates into the central main viewing area, as shown in fig. 4. The sphere is then rotated 45 degrees vertically, repeated 4 times, so that the severely distorted objects at the two poles rotate toward the equator and distortion is improved. Each panorama thus yields 2 × 4 = 8 data-enhanced results.
Let the height of the panorama be H and the width be W. For any point (u, v) on the plane, let (x, y, z) be the corresponding point on the three-dimensional sphere, and let (λ, φ) be its longitude and latitude:

λ = 2π · (u/W - 1/2),  φ = π · (1/2 - v/H)

The relationship between the longitude/latitude and the spherical coordinates (on the unit sphere) is:

x = cos φ · cos λ,  y = sin φ,  z = cos φ · sin λ

The conversion of the same point between three-dimensional space and the two-dimensional plane is therefore:

u = W · (λ/(2π) + 1/2),  v = H · (1/2 - φ/π),  with λ = arctan2(z, x) and φ = arcsin(y).
Meridians are thus mapped to vertical lines of constant spacing and parallels to horizontal lines of constant spacing, as shown in fig. 3(b).
In the emotion recognition setting, because of the content distortion inherent in the panorama's ERP storage format, the multi-angle algorithm must rotate the emotion-inducing object or region toward the main view near the equator so that the equirectangular projection places it at the center of the two-dimensional plane, making it easier for the model to capture the relevant features. However, different panoramas need different rotation angles, and manually customizing each panorama is impractical. In general, this is achieved by rotating the sphere horizontally by 90 degrees 2 times and then by 45 degrees along the x-axis 4 times, each panorama yielding 2 × 4 = 8 results.
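For concreteness, the following numpy sketch implements the spherical multi-angle rotation by inverse mapping, using the plane-sphere conversion formulas above. The axis conventions, the nearest-neighbour resampling and the exact angle set are illustrative assumptions.

```python
import numpy as np

def rotate_erp(img: np.ndarray, yaw_deg: float = 0.0, pitch_deg: float = 0.0) -> np.ndarray:
    """Rotate an equirectangular panorama on the sphere and re-project it."""
    H, W = img.shape[:2]
    v, u = np.mgrid[0:H, 0:W]
    # Output pixel -> longitude/latitude (formulas above).
    lon = 2.0 * np.pi * (u / W - 0.5)
    lat = np.pi * (0.5 - v / H)
    # Longitude/latitude -> unit-sphere point (right-handed, y up).
    pts = np.stack([np.cos(lat) * np.cos(lon),
                    np.sin(lat),
                    np.cos(lat) * np.sin(lon)], axis=-1)
    # Horizontal (yaw, about the vertical axis) and vertical (pitch, about
    # the x-axis) rotations; row-vector multiplication applies the inverse.
    yaw, pitch = np.deg2rad(yaw_deg), np.deg2rad(pitch_deg)
    Ry = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(yaw), 0.0, np.cos(yaw)]])
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(pitch), -np.sin(pitch)],
                   [0.0, np.sin(pitch), np.cos(pitch)]])
    pts = pts @ (Rx @ Ry)
    # Sphere point -> source longitude/latitude -> source pixel.
    lon_s = np.arctan2(pts[..., 2], pts[..., 0])
    lat_s = np.arcsin(np.clip(pts[..., 1], -1.0, 1.0))
    u_s = ((lon_s / (2.0 * np.pi) + 0.5) * W).astype(int) % W
    v_s = np.clip(((0.5 - lat_s / np.pi) * H).astype(int), 0, H - 1)
    return img[v_s, u_s]

# Eight augmented views per panorama (2 horizontal x 4 vertical angles), e.g.:
# views = [rotate_erp(img, yaw, pitch)
#          for yaw in (0, 90) for pitch in (0, 45, 90, 135)]
```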
Feature extraction module: feature extraction is realized with a convolutional neural network pre-trained on a large-scale image classification task. For an input image I, the formula X_l = f(Σ k_l · X_{l-1} + b_l) extracts the hierarchy of feature spaces general to the visual world, forming the feature vector set [X_1, X_2, ..., X_l], where k_l is the convolution kernel of layer l, X_{l-1} is the feature map output by layer l-1, b_l is the bias term, and f is the activation function. Each element of the set represents the feature map of the corresponding layer and serves as input to the sub-region adaptation module, exploiting the complementary advantages of information at different layers.
Sub-region adaptation module: as shown in fig. 5, the module adaptively establishes the context features of the current scale by finding global and local relevance, capturing the global and local context dependencies of the feature maps at different levels. It consists of a sub-region content characterization branch and an emotion contribution characterization branch, as follows:
The sub-region content characterization branch applies adaptive average pooling to each element of the feature vector set [X_1, X_2, ..., X_l]. The adaptive average pooling function is defined as:
kernel_size = (input_size + 2 × padding) - (output_size - 1) × stride
that is, the input size, output size, boundary padding and stride determine the current kernel size. A feature map X_l of size h×w×c is converted to size s×s×c, where h, w, c and s denote the height, width, number of channels and preset size, respectively. Adaptive average pooling thus divides the input feature map into s×s sub-regions, yielding a set of sub-region characterizations Y_{s×s} = [y_1, y_2, ..., y_{s×s}]. The feature map of size s×s×c is reshaped into the sub-region content characterization y_s of size s²×c.
The emotion contribution characterization branch applies global average pooling to each element of the feature vector set [X_1, X_2, ..., X_l] to obtain a global information characterization g(X_l) of size 1×1×c. Using a broadcast mechanism, the 1×1×c global information characterization is added element-wise to the input feature map to realize a residual connection, giving a feature map of size h×w×c.
Let a_i denote the contribution of sub-region y_{s×s} to the emotion classification label at point i of the feature map. A 1×1 convolution converts the number of channels to s², so that each point i of the feature map corresponds to s×s emotion contribution values a_i, forming the set A_i = [a_1, a_2, ..., a_{s×s}], which is reshaped to obtain the adaptive emotion contribution matrix a_s of size hw×s².
The emotion contribution matrix a_s output by the emotion contribution characterization branch is multiplied with the sub-region content characterization y_s output by the sub-region content characterization branch:

Z_l = a_s · y_s

The resulting context feature characterization vector Z_l represents the degree of association between each pixel i and each sub-region y_{s×s}; the emotion contribution vectors A_i implicit within it characterize the global and local connection weights and are automatically optimized as the network iterates.
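The two branches can be sketched as a PyTorch module as follows. This is an illustrative reading of the description above; details such as the softmax normalisation of the contribution matrix and reshaping Z_l back to spatial layout are assumptions not spelled out in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubRegionAdaptation(nn.Module):
    """Sub-region content branch + emotion contribution branch (sketch)."""

    def __init__(self, channels: int, s: int):
        super().__init__()
        self.content_pool = nn.AdaptiveAvgPool2d(s)        # s x s sub-regions
        self.contrib_conv = nn.Conv2d(channels, s * s, 1)  # 1x1 conv -> s^2 channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Content branch: y_s of shape (b, s^2, c).
        y_s = self.content_pool(x).flatten(2).transpose(1, 2)
        # Contribution branch: global pooling, broadcast residual add,
        # 1x1 convolution, reshape to (b, hw, s^2).
        g = F.adaptive_avg_pool2d(x, 1)          # g(X_l): (b, c, 1, 1)
        a_s = self.contrib_conv(x + g)           # (b, s^2, h, w)
        a_s = a_s.flatten(2).transpose(1, 2)     # (b, hw, s^2)
        a_s = a_s.softmax(dim=-1)                # normalisation (assumption)
        # Z_l = a_s . y_s: (b, hw, c), the pixel-to-sub-region associations.
        z = torch.bmm(a_s, y_s)
        return z.transpose(1, 2).reshape(b, c, h, w)

# Example: out = SubRegionAdaptation(channels=256, s=4)(torch.randn(2, 256, 32, 64))
```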
Further, dependency here refers to an association between two or more emotional subjects. Using the global and local features of the panorama, the feature extraction module can identify different regions or objects, such as an emotional subject (a person) and a cat, but this alone is not a sufficient criterion for emotion prediction. The relevance between the person and the cat, e.g. the person petting or caring for the cat, must also be established adaptively by the sub-region adaptation module in order to assign the correct positive emotion label.
Multi-scale fusion module: realizes feature fusion of the feature maps at different levels. An upsampling operation unifies the sizes of the feature maps, which are then concatenated along the channel dimension, finally obtaining a representation of size H×W×(C_1+C_2+...+C_l) in which the low-level geometric information is combined with the high-level semantic information.
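A minimal sketch of this step, using bilinear interpolation as the upsampling operation (deconvolution would serve equally, per the description above):

```python
import torch
import torch.nn.functional as F

def fuse_multiscale(feature_maps, size):
    """Upsample each (B, C_i, h_i, w_i) map to `size` = (H, W), then
    concatenate along channels: result is (B, C_1 + ... + C_l, H, W)."""
    upsampled = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                 for f in feature_maps]
    return torch.cat(upsampled, dim=1)
```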
Emotion classification module: achieves a strong emotion classification effect on panoramas both with and without a salient subject. Because the fully connected layer is parameter-redundant, global average pooling replaces it as the "classifier". Panoramas containing a salient subject are recognized using deep features, which attend more to abstract semantic information; panoramas without a salient subject are recognized using shallow features, which provide detail-aware information about edges, textures, colors and the like. This yields emotion classification labels of higher accuracy; the overall framework of the model is shown in fig. 6.
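A sketch of such a classification head, with global average pooling standing in for the fully connected "classifier"; the single linear output layer and the binary polarity output are assumptions consistent with the embodiment described below.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Global-average-pooling classification head (sketch)."""

    def __init__(self, fused_channels: int, num_classes: int = 2):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)               # replaces the FC "classifier"
        self.fc = nn.Linear(fused_channels, num_classes)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        pooled = self.gap(fused).flatten(1)              # (B, C_total)
        return self.fc(pooled)                           # emotion polarity logits
```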
The features extracted by the different convolution levels of the feature extraction module differ: low-level convolutions such as conv layers 1-2 extract visual-level features such as color, texture and contour, while high-level convolutions such as conv layers 4-5 extract object-level and concept-level features, i.e. abstract semantic information. Predicting the emotional regions of different (or the same) panoramas requires combining the feature advantages of different levels: if the panorama content is a single plain natural scene, low-level color and texture information is the key to correct classification; if it is a complex multi-object interactive scene, high-level semantic information matters more. By establishing the relevance between different regions and objects of the feature map, the sub-region adaptation module helps capture the emotion-inducing region better and thus assign the correct emotion label.
In this embodiment, the feature extraction module extracts the 4-level feature maps of conv layers 2, 3, 4 and 5; the feature map of each level is fed to the sub-region adaptation module, which establishes the relevance of different regions at scales s = 1, 2, 4, ..., n (s is not limited in principle; in general the combination of 1, 2 and 4 works best). Because the feature maps of different levels differ in size, the multi-scale fusion module first unifies their scale and then concatenates all feature maps along the channel dimension; the concatenated overall feature serves as the basis for emotion classification, finally giving the emotion polarity of the input panorama, i.e. positive or negative.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (9)

1. A panorama emotion recognition method based on multi-angle sub-region adaptation, characterized by comprising the following steps:
a multi-angle rotation step: converting the three-dimensional omnidirectional stereoscopic view into a two-dimensional planar panorama by spherical multi-angle rotation and equirectangular projection;
a feature extraction step: extracting features of the two-dimensional panorama with a pre-trained model to obtain feature maps at different levels;
a sub-region adaptation step: taking the feature maps of different levels as input, finding global and local relevance, adaptively establishing context features at the current scale, and capturing the global and local context dependencies of the feature maps at each level;
a multi-scale fusion step: concatenating the feature maps of different levels along the channel dimension to realize multi-scale feature fusion;
an emotion classification step: determining the target emotion from the complementary strengths of the different-level features, and outputting the corresponding emotion label.
2. The panorama emotion recognition method of claim 1, wherein the spherical multi-angle rotation specifically comprises:
establishing a three-dimensional spherical coordinate system with the user's head as the sphere center, and projecting the 360-degree panorama presented to the user in a head-mounted display onto the surface of the sphere;
rotating the projection according to the content distribution of the panorama;
the rotation comprising horizontal rotation and vertical rotation, wherein horizontal rotation brings the clipped edge content on both sides into the central main viewing area, and vertical rotation brings the severely distorted polar content to near the equator.
3. The panorama emotion recognition method of claim 1, wherein the equirectangular projection maps meridians to vertical lines of constant spacing and parallels to horizontal lines of constant spacing, projecting the three-dimensional stereoscopic view onto the two-dimensional panorama.
4. The panorama emotion recognition method of claim 2, wherein the three-dimensional spherical coordinate system is right-handed with a 90-degree field of view per face; taking the user's straight-ahead binocular viewing direction as the horizontal axis, the front viewport center is [0, 0, 0], the right viewport center is [90, 0, 0], the back viewport center is [180, 0, 0], the left viewport center is [-90, 0, 0], the upper viewport center is [0, 90, 0], and the lower viewport center is [0, -90, 0], corresponding to the six faces of a cube tangent to the sphere.
5. The panorama emotion recognition method of claim 1, wherein the feature extraction step specifically comprises:
inputting the two-dimensional panorama into a pre-trained convolutional neural network and extracting the hierarchy of feature spaces general to the visual world, forming a feature vector set [X_1, X_2, ..., X_l], where each element of the set represents the feature map of the corresponding level.
6. The panorama emotion recognition method of claim 1, wherein the sub-region adaptation step comprises two branches, a sub-region content characterization branch and an emotion contribution characterization branch;
the sub-region content characterization branch applies an adaptive average pooling operation to an input feature map of size h×w×c to obtain the sub-region content characterization y_s, where h, w, c and s denote the height, width, number of channels and preset size of the feature map, respectively;
the emotion contribution characterization branch specifically comprises:
performing global average pooling on each element of the feature vector set [X_1, X_2, ..., X_l] to obtain a global information characterization g(X_l) of size 1×1×c;
adding the global information characterization g(X_l) to the input feature map element by element via a broadcast mechanism to realize a residual connection, and converting the number of channels to s² by a 1×1 convolution, thereby constructing an adaptive emotion contribution matrix a_s of size hw×s²;
multiplying the adaptive emotion contribution matrix a_s with the sub-region content characterization y_s to obtain the context feature characterization vector Z_l, which represents the degree of association between each pixel i and each sub-region y_{s×s}.
7. The panorama emotion recognition method of claim 6, wherein constructing the adaptive emotion contribution matrix a_s of size hw×s² specifically comprises: let a_i denote the contribution of sub-region y_{s×s} to the emotion classification label at point i of the feature map; a 1×1 convolution converts the number of channels to s², so that each point i of the feature map corresponds to s×s emotion contribution values a_i, forming the set A_i = [a_1, a_2, ..., a_{s×s}], which is reshaped to obtain the adaptive emotion contribution matrix a_s of size hw×s².
8. The panorama emotion recognition method of claim 6, wherein the adaptive average pooling divides the input feature map into s×s sub-regions, yielding a set of sub-region characterizations Y_{s×s} = [y_1, y_2, ..., y_{s×s}]; the feature map of size s×s×c is reshaped into the sub-region content characterization y_s of size s²×c.
9. A system for realizing the panorama emotion recognition method of any one of claims 1-8, characterized in that the system comprises:
a multi-angle rotation module: for converting the three-dimensional panoramic view into the two-dimensional panorama by multi-angle rotation and equirectangular projection;
a feature extraction module: for extracting features of the two-dimensional panorama to obtain feature maps at different levels and capturing the global and local context dependencies of the feature maps;
a sub-region adaptation module: for correlating regions with consistent emotion classification labels, adaptively establishing context features at the current scale by finding global and local relevance;
a multi-scale fusion module: for concatenating the feature maps along the channel dimension to perform multi-scale feature fusion;
an emotion classification module: for determining the target emotion from the complementary strengths of the different-level features and outputting the corresponding emotion label.
CN202110816786.4A 2021-07-20 2021-07-20 Panorama emotion recognition method and system based on multi-angle sub-region self-adaption Active CN113673567B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110816786.4A CN113673567B (en) 2021-07-20 2021-07-20 Panorama emotion recognition method and system based on multi-angle sub-region self-adaption


Publications (2)

Publication Number Publication Date
CN113673567A 2021-11-19
CN113673567B (en) 2023-07-21

Family

ID=78539860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110816786.4A Active CN113673567B (en) 2021-07-20 2021-07-20 Panorama emotion recognition method and system based on multi-angle sub-region self-adaption

Country Status (1)

Country Link
CN (1) CN113673567B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201970A (en) * 2021-11-23 2022-03-18 国家电网有限公司华东分部 Method and device for capturing power grid scheduling event detection based on semantic features

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506722A (en) * 2017-08-18 2017-12-22 中国地质大学(武汉) One kind is based on depth sparse convolution neutral net face emotion identification method
CN111832620A (en) * 2020-06-11 2020-10-27 桂林电子科技大学 Image emotion classification method based on double-attention multilayer feature fusion
CN112800875A (en) * 2021-01-14 2021-05-14 北京理工大学 Multi-mode emotion recognition method based on mixed feature fusion and decision fusion
CN112784764A (en) * 2021-01-27 2021-05-11 南京邮电大学 Expression recognition method and system based on local and global attention mechanism
CN113011504A (en) * 2021-03-23 2021-06-22 华南理工大学 Virtual reality scene emotion recognition method based on visual angle weight and feature fusion

Also Published As

Publication number Publication date
CN113673567B (en) 2023-07-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant