WO2023155460A1 - A reinforcement-learning-based emotional image description method and system - Google Patents

A reinforcement-learning-based emotional image description method and system

Info

Publication number: WO2023155460A1
Authority: WO (WIPO PCT)
Prior art keywords: image, emotional, feature, description, sentence
Application number: PCT/CN2022/126071
Other languages: English (en), French (fr)
Inventor: 卢官明, 陈晨, 卢峻禾
Original Assignee: 南京邮电大学
Application filed by 南京邮电大学
Publication of WO2023155460A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks

Definitions

  • the invention relates to the technical field of image processing and pattern recognition, in particular to an emotional image description method and system based on reinforcement learning.
  • Image information interaction will be an important part of the future "Metaverse", but visually impaired and cognitively impaired people cannot accurately obtain information from images, including image semantic information and emotional information, which seriously affects the smoothness of information interaction and the convenience of information acquisition for such groups in the future.
  • the emotional words corresponding to the image emotion category are selected from the emotional word embedding library to generate the initial emotional image description; finally, a reinforcement-learning-based fine-tuning module uses the reinforcement learning method to fine-tune the generated initial emotional image description and produce the final emotional image description, making the semantics smoother and richer.
  • existing research on image description mostly focuses on single factual image description.
  • although the syntax of the generated image description is simple and the model is highly interpretable, the simple nonlinear mapping between image features and image text descriptions offers little abstraction and limited expressive ability, so the generated description is relatively blunt and lacks emotional color, and the image content cannot be described in more detail through the generated text; at the same time, a single factual image description lacks descriptions of the objects in the image and their interactions; moreover, a factual image description lacks a description of the color and emotion conveyed by the image and cannot fully describe the color atmosphere the image displays.
  • the problem with this method is that a long short-term memory network is used to directly encode and decode the image to be described to obtain the image description, so the internal features of the image are not fully extracted and no emotion is extracted based on the image theme color features, which may affect the semantic richness of the final image description and leaves it without emotional support.
  • the technical problem to be solved by the present invention is to overcome the deficiencies of the prior art and provide an emotional image description method and system based on reinforcement learning.
  • on the one hand, extract the theme color feature prior of the training set images and input it, together with the training set images, into the image emotion recognition model to optimize the network parameters, and combine this with the multi-feature-fusion image factual description model to generate the initial emotional image description; on the other hand, fine-tune the initial emotional image description through reinforcement learning to make the sentence more fluent and full of emotional color; this further exploits the complementarity between image emotion recognition and image semantic description, and improves the accuracy and robustness of the image description while producing an emotional image description.
  • the emotional image description method based on reinforcement learning proposed by the present invention comprises the following steps:
  • Step 1 Construct an embedding library of emotional words on the basis of a large-scale corpus
  • Step 2 Build an image emotion recognition model
  • Step 3 Use the image emotion analysis data set to train the image emotion recognition model
  • Step 4 Construct an image factual description model based on attention mechanism for generating image factual description.
  • the image factual description model includes an image factual description preprocessing module, an image feature encoder and a feature-text decoder;
  • Step 5 Use the image description data set to train the image factual description model
  • Step 6 Build an emotional image description initialization module.
  • the emotional image description initialization module selects, according to the image emotion category output by the trained image emotion recognition model, the emotional words corresponding to that category from the emotional word embedding library, and embeds them into the image factual description output by the trained image factual description model to generate an initial emotional image description;
  • Step 7 Build a fine-tuning module based on reinforcement learning, which is used to fine-tune the initial emotional image description to generate a final emotional image description.
  • the fine-tuning module based on reinforcement learning includes a sentence reconstruction generator, a sentence storage unit, a sentence sampling unit, a sentence evaluation unit and a word selection evaluation unit; the sentence reconstruction generator serves as the agent in the reinforcement learning system, while the sentence storage unit, sentence sampling unit, sentence evaluation unit and word selection evaluation unit constitute the external environment; the sentence reconstruction generator continuously interacts with the external environment, obtains reward information from it, learns the mapping from environment states to actions, optimizes and adjusts its actions, fine-tunes the initial emotional image description, and generates the final emotional image description.
  • the fine-tuning module is used to fine-tune the initial emotional image description.
  • the specific method is as follows:
  • Step 701: according to the environment state at time t-1 and the reward at time t-1, the sentence reconstruction generator selects, through the word selector, words with similar semantics from the emotional word embedding library, performs the word selection action, and appends the selected word to the sentence S_{t-1} generated at time t-1 to generate the sentence S_t at time t; the sentence S_0 generated at time 0 is the sentence-generation start symbol, the environment state at time t-1 is the sentence S_{t-1} generated at time t-1, the reward R_{t-1} at time t-1 is the score of the word selected at time t-1, and t denotes the time step;
  • Step 702: the sentence storage unit stores the updated sentence S_t at time t; the sentence sampling unit rolls out the updated sentence S_t using a sampling search algorithm to generate N sentences, where N is 3, 4 or 5; the sentence evaluation unit first evaluates and scores the N sentences generated by the sentence sampling unit with the emotion discriminator, the grammar collocation discriminator and the semantic discriminator, obtaining N emotion reward scores, grammar collocation reward scores and semantic reward scores; it then takes a weighted average to obtain a comprehensive reward score and finally feeds the comprehensive reward score to the word selection evaluation unit; the word selection evaluation unit outputs the selected-word score as the reward R_t fed back from the external environment to the sentence reconstruction generator;
  • Step 703: steps 701 to 702 are iterated, and the sentence reconstruction generator continuously interacts with the external environment until the maximum sentence reconstruction reward is obtained and the final emotional image description is generated.
  • the sampling search algorithm of the sentence sampling unit adopts multinomial sampling or Monte Carlo sampling.
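  • a minimal Python sketch of this fine-tuning loop is given below; the discriminator functions, the rollout routine, the reward weights (0.4, 0.3, 0.3) and the greedy word selection standing in for the learned policy are illustrative assumptions, not the trained components described above.

```python
import random

# Hypothetical stand-ins for the emotion, grammar-collocation and semantic
# discriminators; in the described system these would be trained models.
def emotion_score(sentence):  return random.random()
def grammar_score(sentence):  return random.random()
def semantic_score(sentence): return random.random()

def rollout(sentence, vocab, n_rollouts=3, max_extra=5):
    """Sampling search: roll the partial sentence forward N times (N = 3, 4 or 5)."""
    return [sentence + random.choices(vocab, k=random.randint(1, max_extra))
            for _ in range(n_rollouts)]

def comprehensive_reward(sentences, w=(0.4, 0.3, 0.3)):
    """Weighted average of the three discriminator scores over the rollouts."""
    scores = [w[0] * emotion_score(s) + w[1] * grammar_score(s) + w[2] * semantic_score(s)
              for s in sentences]
    return sum(scores) / len(scores)

def fine_tune(initial_sentence, emotional_vocab, steps=5):
    sentence = list(initial_sentence)      # S_0 starts from the initial emotional description
    reward = 0.0
    for t in range(1, steps + 1):
        # Agent action: greedily pick the word whose rollouts earn the highest reward
        # (a stand-in for the policy learned by the sentence reconstruction generator).
        best_word, best_r = None, -1.0
        for word in emotional_vocab:
            r = comprehensive_reward(rollout(sentence + [word], emotional_vocab))
            if r > best_r:
                best_word, best_r = word, r
        sentence.append(best_word)         # S_t = S_{t-1} + selected word
        reward = best_r                    # R_t fed back by the external environment
    return sentence, reward

words, r = fine_tune(["a", "boy", "plays", "football"], ["happily", "joyfully", "eagerly"])
```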
  • in step 1, the specific method of constructing the emotional word embedding library is as follows:
  • Step 101: use the NLTK tool to obtain the nouns and verbs in the target detection and image description data sets, generate a semantic lexicon, and calculate the word vector of each semantic word;
  • Step 102: screen out emotional words from the large-scale corpus LSCC, generate an emotional lexicon, and calculate the emotional word vector of each emotional word; the emotional words corresponding to each semantic word in the semantic lexicon are divided into the 8 categories defined by IAPS: joy, rage, surprise, acceptance, hate, ecstasy, fear and grief;
  • Step 103: screen out the emotional phrases of the different emotional categories corresponding to the semantic words from the emotional lexicon, and construct the emotional word embedding library.
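  • a minimal sketch of Step 101 using NLTK is shown below; the caption strings and the POS-tag filter are illustrative, and the word vectors (e.g. from a pre-trained embedding) are not computed here.

```python
import nltk

# One-time downloads for tokenisation and POS tagging.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def extract_nouns_and_verbs(captions):
    """Collect nouns and verbs from caption strings to seed a semantic lexicon."""
    lexicon = set()
    for caption in captions:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(caption)):
            if tag.startswith("NN") or tag.startswith("VB"):
                lexicon.add(word.lower())
    return sorted(lexicon)

print(extract_nouns_and_verbs(["A boy plays football on the grass."]))
# -> e.g. ['boy', 'football', 'grass', 'plays']
```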
  • the image emotion recognition model includes an image emotion recognition preprocessing module, a face emotion feature extraction module, an image theme color feature extraction module, and an image emotion feature extraction module.
  • the image emotion recognition preprocessing module includes a face detection unit, a face image normalization processing unit and an image size normalization processing unit;
  • the human face detection unit utilizes a pre-trained human face detection network to detect human face regions in the input image, and label different human face regions;
  • the human face image normalization processing unit is used to crop, align and size normalize each detected human face area
  • the image size normalization processing unit is used to normalize the size of the input image
  • the facial emotion feature extraction module is used to extract the facial emotion features of each person in the human face image after cropping, alignment and size normalization;
  • the image theme color feature extraction module is used to extract the theme color feature of the input image
  • the image emotional feature extraction module is used to extract the emotional feature of the image after the size normalization output by the image size normalization processing unit;
  • the feature fusion layer is used to fuse the facial emotion features output by the human face emotion feature extraction module, the theme color feature output by the image theme color feature extraction module and the emotional feature output by the image emotion feature extraction module, to obtain the fused emotion feature vector;
  • the fully connected layer connects the feature fusion layer and the classification layer;
  • the classification layer is used to output the emotional category to which the image belongs.
  • in step 2, the specific method by which the image theme color feature extraction module extracts the theme color features of the input image is as follows:
  • Step 1: use the micro-element method to cut the RGB space into independent three-dimensional blocks;
  • Step 2: put the RGB pixel scatter points of the image into the cut RGB space and use the scatter values as the value of each block; if a block contains no scatter points, use the center value of the block region as its value;
  • Step 3: perform a weighted summation of the block values within the whole sliding window area using sliding-window weighting to obtain the value of a block the size of the sliding window; the size of the sliding window depends on how many kinds of image theme colors are to be selected;
  • Step 4: through steps 1 to 3, the image theme color features of the input image are finally obtained.
  • the micro-element method is used to cut the RGB space to form independent three-dimensional blocks, and the three-dimensional blocks are pixel-level cubes.
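  • the following NumPy sketch illustrates steps 1 to 3 under simplifying assumptions: the RGB cube is partitioned into bins^3 blocks, pixels are counted per block, and the centers of the most populated blocks are returned as theme colors; the sliding-window weighting described in step 3 is omitted for brevity.

```python
import numpy as np

def theme_colors(image_rgb, bins=8, top_k=5):
    """Quantise RGB space into bins^3 cubes, count the pixels that fall into each
    cube, and return the centers of the top_k most populated cubes as theme colors."""
    pixels = image_rgb.reshape(-1, 3).astype(np.float64)          # (N, 3) scatter points
    edges = np.linspace(0, 256, bins + 1)
    hist, _ = np.histogramdd(pixels, bins=(edges, edges, edges))  # cube occupancy counts
    flat = hist.ravel()
    top = np.argsort(flat)[::-1][:top_k]                          # most populated cubes
    centers = (edges[:-1] + edges[1:]) / 2.0                      # cube center per axis
    idx = np.array(np.unravel_index(top, hist.shape)).T           # (top_k, 3) bin indices
    colors = centers[idx]                                         # RGB centers of chosen cubes
    shares = flat[top] / flat.sum()                               # fraction of pixels per cube
    return colors, shares

# usage on a random "image"
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
cols, shares = theme_colors(img)
```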
  • in step 4, the specific method of constructing the image factual description model is as follows:
  • Step 4.1: the image factual description preprocessing module uses network models pre-trained on target detection and target relationship detection data sets to preprocess the input image; the specific method is as follows: 1) detect the regions where the various targets appear in the image with the pre-trained target detection algorithm; 2) detect the regions where the various target interactions appear in the image with the pre-trained target relationship detection algorithm; 3) crop, align and normalize the images in these regions to obtain the normalized image of the input image, the normalized images of the regions where the various targets are located, and the normalized images of the regions where the various target interactions are located;
  • Step 4.2: build an image feature encoder, which includes an image global feature encoding branch, a target feature encoding branch, an inter-target interaction feature encoding branch, an attention mechanism and a feature fusion layer;
  • the image global feature encoding branch includes multiple convolution modules; its input is the normalized image of the input image, and it is used to extract the global features of the image and convert them into vector form;
  • the target feature encoding branch includes multiple convolution modules; its input is the normalized images of the regions where the various targets are located, and it is used to extract the local target features and convert them into vector form;
  • the inter-target interaction feature encoding branch includes multiple convolution modules; its input is the normalized images of the regions where the various target interactions are located, and it is used to extract the features of the action interaction regions between targets and convert them into vector form;
  • the convolution module includes one or more convolutional layers and a pooling layer;
  • Step 4.3: build a feature-text decoder; its input is the image feature vector obtained by the image feature encoder, and a combination module comprising at least 2 layers of long short-term memory (LSTM) networks is used to decode the image feature vector into text.
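  • a minimal PyTorch sketch of such a feature-text decoder is given below; the dimensions (feat_dim=2048 to match the 2048-dimensional fused feature vector described later, embed_dim, hidden_dim, vocab_size) and the use of the image feature to initialize the LSTM state are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureTextDecoder(nn.Module):
    """Sketch of a feature-to-text decoder built around a 2-layer LSTM."""
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial cell state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)     # hidden state -> word scores

    def forward(self, image_feat, captions):
        # image_feat: (B, feat_dim); captions: (B, T) word indices
        h0 = self.init_h(image_feat).unsqueeze(0).repeat(2, 1, 1)  # (2, B, hidden_dim)
        c0 = self.init_c(image_feat).unsqueeze(0).repeat(2, 1, 1)
        emb = self.embed(captions)                       # (B, T, embed_dim)
        out, _ = self.lstm(emb, (h0, c0))                # (B, T, hidden_dim)
        return self.fc(out)                              # (B, T, vocab_size) word logits

# usage sketch
decoder = FeatureTextDecoder()
logits = decoder(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))  # (4, 12, 10000)
```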
  • An emotional image description system based on reinforcement learning including:
  • Emotional word embedding library: constructed on the basis of a large-scale corpus, it provides corpus support for the final generation of emotional image descriptions;
  • Image emotion recognition model: includes an image emotion recognition preprocessing module, a face emotion feature extraction module, an image theme color feature extraction module, an image emotion feature extraction module, a feature fusion layer, a fully connected layer and a classification layer;
  • the image emotion recognition preprocessing module includes a face detection unit, a face image normalization processing unit, and an image size normalization processing unit; wherein the face detection unit uses a pre-trained face detection network to detect the face regions in the input image and label the different face regions;
  • the image size normalization processing unit is used to normalize the pixel size of the input image to obtain a unified image input size;
  • the human face image normalization processing unit is used to perform cropping, alignment and size normalization on each detected human face area;
  • the human face emotion feature extraction module includes a plurality of convolution modules;
  • the image theme color feature extraction module uses the color clustering method to extract theme color features from the training set images in the image sentiment analysis data set;
  • the fully connected layer connects the feature fusion layer and the classification layer;
  • the image factual description model includes an image factual description preprocessing module, an image feature encoder and a feature-text decoder; the image factual description preprocessing module preprocesses the input image;
  • the image feature encoder includes an image global feature encoding branch, a target feature encoding branch, an interactive feature encoding branch between targets, an attention mechanism and a feature fusion layer;
  • the image global feature encoding branch is used to extract the global image features and convert them into vector form;
  • the target feature encoding branch is used to extract local target features and convert them into vector form;
  • the inter-target interaction feature encoding branch is used to extract the features of the action interaction regions between targets and convert them into vector form;
  • the attention mechanism is used to capture, relative to the global features, the target features and the inter-target interaction features that need to be focused on;
  • the feature fusion layer normalizes the above image global features, focused target features and focused inter-target interaction features, and fuses them to obtain an image feature vector;
  • Emotional image description initialization module: according to the image emotion category output by the trained image emotion recognition model, it selects the emotional word corresponding to that category from the emotional word embedding library and embeds it into the image factual description output by the trained image factual description model to generate an initial emotional image description;
  • Fine-tuning module based on reinforcement learning: uses the reinforcement learning method to adjust the generated initial emotional image description;
  • the reinforcement learning fine-tuning module includes a sentence reconstruction generator, a sentence storage unit, a sentence sampling unit, a sentence evaluation unit and a word selection evaluation unit;
  • the sentence reconstruction generator is the agent in the reinforcement learning system, and the sentence storage unit, sentence sampling unit, sentence evaluation unit and word selection evaluation unit constitute the external environment; the sentence reconstruction generator continuously interacts with the external environment, obtains reward information from it, learns the mapping from environment states to actions, optimizes and adjusts its actions, fine-tunes the initial emotional image description, and generates the final emotional image description.
  • the present invention adopts the above technical scheme and has the following technical effects:
  • by introducing a theme color feature prior into image emotion recognition, the present invention can effectively identify eight image emotions (Amusement, Anger, Awe, Contentment, Disgust, Excitement, Fear and Sadness), select the emotional words corresponding to the recognized image emotion category from the emotional word embedding library, and embed them into the factual description of the image to generate the initial emotional image description.
  • Use the reinforcement learning method to fine-tune the generated initial emotional image description, so that the generated emotional image description is more vivid and full of emotion; the details are as follows:
  • the corpus in the present invention is drawn from large-scale target detection, image description and emotion analysis data sets, and the method of classifying emotional words in the embedding space is adopted, which avoids the defect that words with similar semantics may differ greatly in emotion; at the same time, collocations between words can be learned from a large amount of labeled corpus, so better image description results can be achieved.
  • the image emotion recognition model in the present invention obtains an image theme color feature prior through the color clustering method, and uses a dual-branch network consisting of a facial emotion feature extraction module and an image emotion feature extraction module to extract features from the cropped, aligned and size-normalized face images and from the size-normalized global image output by the image size normalization processing unit; the model thus fully captures the global emotional information of the image while obtaining the facial emotion category, which promotes the information interaction between facial emotion and the overall atmosphere of the image and gives stronger representation and generalization ability.
  • the present invention preprocesses the image to obtain focused target region features, target features and target relationship features, and uses the attention mechanism to determine the focused targets and their corresponding relationships, so the features fed into the factual description model are richer and closely related to the content of the input image, making the factual description of the image more reasonable.
  • the present invention uses the image emotion recognition results to determine the image emotional words and combines them with the semantics of the generated image factual description to produce the initial emotional image description; since these features are all extracted from the image itself, the resulting initial emotional image description is highly relevant to the semantics of the original image and has stronger representation and generalization ability.
  • the present invention uses a reinforcement learning method to fine-tune the generated initial emotional image description; while maintaining the original semantics, it solves the problems of unreasonable semantic matching and inaccurate use of emotional vocabulary.
  • Fig. 1 is a flow chart of steps of the emotional image description method based on reinforcement learning of the present invention.
  • Fig. 2 is a structural diagram of the emotional image description system based on reinforcement learning of the present invention.
  • Fig. 3 is a structural diagram of an image emotion recognition model used in an embodiment of the present invention.
  • Fig. 4 is an example diagram of image theme color features in the embodiment of the present invention; wherein, (a) is the image theme color features obtained by solving, and (b) is the scatter mapping of the image in RGB space.
  • Fig. 5 is a structural diagram of an image factual description model used in an embodiment of the present invention.
  • Fig. 6 is a structural diagram of the fine-tuning module based on reinforcement learning in the present invention.
  • Figure 7 is an example of images in the Ai Challenger Caption2017 database.
  • the present invention addresses the problem that existing image description networks use an encoder to directly extract the global features of the image and a decoder to directly map them to text, so that the generated text description lacks semantics and emotional expression; the proposed reinforcement-learning-based emotional image description method and system solves the problem that the existing technology cannot accurately and vividly describe images, and provides a more vivid way of describing images that meets human emotional needs for infant education guidance and for visually impaired persons.
  • the development of an emotional image description system provides more vivid and emotional image descriptions for the visually impaired and for infant auxiliary education, and is of great significance and value in helping the visually impaired more vividly understand the content an image expresses and the emotions it conveys.
  • the embodiment of the present invention provides an emotional image description method based on reinforcement learning, which mainly includes the following steps:
  • Step 1: build an emotional word embedding library on the basis of a large-scale corpus; first, use the NLTK (Natural Language Toolkit) tool to obtain the nouns and verbs in the target detection and image description data sets and generate a semantic lexicon; then, screen emotional words out of the large-scale corpus LSCC (Large Scale Chinese Corpus) to generate an emotional lexicon; finally, compute the emotional words corresponding to each semantic word in the semantic lexicon and build the emotional word embedding library;
  • the detected action words form a set indexed by i = 1, 2, ..., n_2, where n_2 is the number of action categories that the relationship detection can detect, and the word vector of each word is calculated;
  • the optional adverbs form a set indexed by i = 1, ..., m_2, where m_2 is the number of optional adverbs.
  • the specific classification method is as follows: the sum of the distances between the emotional word embedding vectors, spliced from semantic word vectors and emotional word vectors in each group, and the eight benchmark emotional word vectors is used as the objective function, and the space classification of the emotional word embedding vectors is solved by minimizing this objective function;
  • the classification objective function takes the form J = Σ_i Σ_{x_ij ∈ c_i} ||x_ij − u_i||², where u_i is the centroid of class c_i and x_ij is the emotional word embedding vector obtained by fusing the i-th semantic word with the j-th emotional word;
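  • a NumPy sketch of the assignment implied by this objective is shown below: each fused emotional word embedding x_ij is assigned to the benchmark emotion vector (centroid u_i) that minimizes the squared Euclidean distance; the embedding dimension of 300 and the random inputs are assumptions.

```python
import numpy as np

def assign_emotion_classes(fused_vectors, benchmark_vectors):
    """Assign each fused emotional word embedding x_ij to the closest benchmark
    emotion vector (class centroid u_i), i.e. one assignment step of the objective."""
    # fused_vectors: (M, D); benchmark_vectors: (8, D)
    d2 = ((fused_vectors[:, None, :] - benchmark_vectors[None, :, :]) ** 2).sum(-1)  # (M, 8)
    labels = d2.argmin(axis=1)                             # index of the closest benchmark emotion
    objective = d2[np.arange(len(labels)), labels].sum()   # value of the minimized sum
    return labels, objective

# usage with random embeddings (D = 300 assumed)
x = np.random.randn(20, 300)
u = np.random.randn(8, 300)
labels, J = assign_emotion_classes(x, u)
```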
  • the emotional strength ordering of the emotional words corresponding to each semantic word is constructed; the ordering is determined by the emotion classification probabilities produced by the BERT text emotion recognition algorithm.
  • the constructed emotional word embedding library is shown in Table 2;
  • Step 2: build an image emotion recognition model as shown in Figure 3; the model comprises an image emotion recognition preprocessing module, a face emotion feature extraction module containing at least 2 convolution modules, an image theme color feature extraction module, an image emotion feature extraction module containing at least 2 convolution modules, a feature fusion layer, a fully connected layer and a classification layer; each convolution module includes at least one convolution layer and a pooling layer;
  • the image emotion recognition preprocessing module includes face detection, face image normalization processing and image size normalization processing; face detection uses a pre-trained face detection network to detect the face regions in the input image and labels the different face regions; face image normalization processing crops and aligns each detected face region and normalizes each processed face image; image size normalization processing normalizes the input image to obtain a unified image input size;
  • the facial emotion feature extraction module includes a plurality of convolution modules, and the input of this module is the facial expression image output by the image emotion recognition preprocessing module, which is used to extract people's facial emotion features;
  • the image theme color feature extraction module uses a color clustering method to extract theme color features from the training set images in the image sentiment analysis data set, and encodes the image theme color features into vectors through word embedding methods as prior knowledge of image emotions;
  • the image emotion feature extraction module is used to extract image emotion features and includes a plurality of convolution modules; its input is the size-normalized image output by the image size normalization processing unit in the image emotion recognition preprocessing module;
  • the convolution module includes one or more convolution layers and a pooling layer.
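  • the convolution modules d1 to d19 used throughout the embodiment all follow this pattern; a PyTorch sketch is given below, with the input channel count (3 for an RGB face image) assumed for the usage example.

```python
import torch
import torch.nn as nn

def conv_module(in_channels, out_channels, num_convs):
    """Sketch of one convolution module as described: num_convs 3x3 convolutions
    (stride 1, zero padding 1) with ReLU, followed by 2x2 max pooling with stride 2."""
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_channels if i == 0 else out_channels,
                                out_channels, kernel_size=3, stride=1, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves the spatial size
    return nn.Sequential(*layers)

# e.g. convolution module d1: 2 conv layers with 128 kernels, 56x56 -> 28x28
d1 = conv_module(in_channels=3, out_channels=128, num_convs=2)
out = d1(torch.randn(1, 3, 56, 56))    # -> torch.Size([1, 128, 28, 28])
```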
  • the feature fusion layer is used to fuse the facial emotion features output by the human face emotion feature extraction module, the theme color feature output by the image theme color feature extraction module and the emotional feature output by the image emotion feature extraction module, to obtain the fused emotion feature vector;
  • the fully connected layer is used for the fully connected feature fusion layer and the classification layer;
  • the classification layer is used to output the emotional category to which the image belongs;
  • Image emotion recognition preprocessing module: includes face detection, face image normalization processing, the image theme color feature extraction module and image size normalization processing;
  • Face detection: first use the pre-trained FaceNet network to detect the regions where faces are located in the input image of Figure 7 (image source: Ai Challenger Caption 2017; Image Id: 1059397860565088790), then crop out the regions where the faces are located, and finally number the different cropped face regions;
  • Face image normalization processing: used to normalize the different face regions obtained by the above face detection into images with a size of 56×56 pixels;
  • Image size normalization processing: used to normalize the input image into an image with a size of 224×224 pixels;
  • the image theme color feature extraction module first maps each image in the MSCOCO dataset to scatter points in RGB space and extracts the theme colors of the image through the color clustering algorithm, obtaining the result shown in Figure 4(a); the color clustering result is then converted into HSV (hue, saturation, value) format; finally, the image theme color features are encoded into a vector by a word embedding method and used as prior knowledge of the image emotion:
  • the direct clustering method based on traditional RGB scatter points neutralizes the colors of the scatter points, so that two clearly different groups of color scatter points may be clustered into a color category that differs significantly from both; the clustering method for color scatter points is therefore modified:
  • the RGB space is cut into pixel-level cubes using the microelement method; then the scatter points of the RGB space of the image are put into the cut RGB space, and the scatter values are used as the value of the cube. If there is no scatter, then use the center value of the cube area as the value of the cube;
  • the size of the sliding window depends on the number of types of image theme colors to be selected in the end; the weight of the window is used for more reasonable and smooth transition of colors, and the method of rotation and translation is used for processing.
  • the specific method is as follows: (1) if the RGB scatter values of the original image are distributed relatively uniformly within the window area, weights are assigned so that they decrease from the center of the window towards the surrounding area; (2) if the original RGB scatter values are concentrated in a corner or edge area of the window, the window with weights decreasing from its center outwards is gradually moved towards the direction where more scatter points lie, so that the center of the window is as close as possible to the area where the original RGB scatter points gather; (3) the weight values that fall outside the window because of the sliding are rotated to serve as the weights of the diagonally opposite part of the window; if there are multiple cluster points in the window area, this method is applied with several slides and the results are averaged with the weights.
  • the size of the weight depends on the degree of aggregation of the RGB scatter points in the original image, as shown in Table 3
  • the image theme color feature results (RGB form): orange (6.40%), dark green (17.8%), light gray (18.6%), olive (23.2%), brown (34.0%);
  • the extracted image theme colors are converted into HSV (hue, saturation, value) format.
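  • the RGB-to-HSV conversion can be done with the standard colorsys module, as sketched below; the example orange value (230, 126, 34) is illustrative, not taken from Table 3.

```python
import colorsys

def rgb_to_hsv_theme(colors_rgb):
    """Convert extracted theme colors from RGB (0-255) to HSV; colorsys works in
    [0, 1], so the hue is rescaled to degrees for readability."""
    out = []
    for r, g, b in colors_rgb:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        out.append((h * 360.0, s, v))
    return out

# e.g. an orange-ish theme color
print(rgb_to_hsv_theme([(230, 126, 34)]))
```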
  • the face emotion feature extraction module includes three convolution modules connected in sequence and a feature fusion layer.
  • the specific implementation is as follows:
  • Convolution module d1: includes 2 convolutional layers and 1 pooling layer; both convolutional layers use 128 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 56×56×128; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 28×28×128;
  • Convolution module d2: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 256 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 28×28×256; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 14×14×256;
  • Convolution module d3: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 512 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 14×14×512; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 7×7×512;
  • Feature fusion layer c1: its input is the facial emotion features output by the different facial emotion feature branches of the facial emotion feature extraction module, each of size 7×7×512; a global average pooling operation is applied to these two feature maps to obtain two 512-dimensional feature vectors, which are fused, finally outputting a 512-dimensional feature vector;
  • the image emotion feature extraction module includes five convolution modules connected in sequence, as follows:
  • Convolution module d4: includes 2 convolutional layers and 1 pooling layer; both convolutional layers use 64 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 224×224×64; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 112×112×64;
  • Convolution module d5: includes 2 convolutional layers and 1 pooling layer; both convolutional layers use 128 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 112×112×128; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 56×56×128;
  • Convolution module d6: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 256 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 56×56×256; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 28×28×256;
  • Convolution module d7: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 512 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 28×28×512; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 14×14×512;
  • Convolution module d8: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 512 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 14×14×512; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 7×7×512;
  • Feature fusion layer c2: its inputs are the 7×7×512 image emotion features output by the image emotion feature extraction module and the 512-dimensional facial emotion features output by feature fusion layer c1; a global average pooling operation is applied to the image emotion features to obtain 512-dimensional feature vectors, which are fused with the 512-dimensional facial emotion features, and the two fused feature vectors are finally spliced to output a 1024-dimensional feature vector; this 1024-dimensional feature vector is then fused with the 1024-dimensional image theme color feature output by the image theme color feature extraction module to obtain a new 1024-dimensional vector;
  • Fully connected layer b1: includes 256 neurons and connects the feature fusion layer to the classification layer;
  • Classification layer a1: uses a Softmax classifier with 8 neurons and outputs the emotional category to which the image belongs;
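  • a PyTorch sketch of this classification head (fused 1024-dimensional feature → fully connected layer b1 with 256 neurons → Softmax layer a1 with 8 neurons) is given below; the ReLU between b1 and a1 is an assumption.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Sketch of the classification head: fused feature vector -> b1 -> a1 (Softmax)."""
    def __init__(self, fused_dim=1024, hidden=256, num_classes=8):
        super().__init__()
        self.b1 = nn.Linear(fused_dim, hidden)   # fully connected layer b1
        self.a1 = nn.Linear(hidden, num_classes) # classification layer a1

    def forward(self, fused_feature):             # fused_feature: (B, 1024)
        x = torch.relu(self.b1(fused_feature))
        return torch.softmax(self.a1(x), dim=-1)  # probabilities over the 8 emotions

head = EmotionHead()
probs = head(torch.randn(4, 1024))                # (4, 8)
```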
  • Step 3 Use the image emotion analysis data set ArtPhoto to train the image emotion recognition model
  • the ArtPhoto image sentiment analysis data set is selected; it draws on images from three sources (IAPS, ArtPhoto and Abstract Paintings) and contains a total of 1429 image samples, each labeled with one of 8 emotional categories: Amusement, Anger, Awe, Contentment, Disgust, Excitement, Fear and Sadness; in practice, other image sentiment analysis data sets can also be used, or image sentiment analysis data can be collected to create a data set containing emotional category labels.
  • Step 4: construct an attention-based image factual description model as shown in Figure 5; the model includes an image factual description preprocessing module, an image feature encoder and a feature-text decoder;
  • the image factual description preprocessing module uses network models pre-trained on target detection and target relationship detection data sets to preprocess the input image; the specific method is as follows: 1) detect the regions where the various targets appear in the image with the pre-trained target detection algorithm, crop and align these regions, and normalize the processed images of the various target regions; 2) detect the regions where the various target interactions appear in the image with the pre-trained target relationship detection algorithm, crop and align these regions, and normalize the processed images of the various target interaction regions; 3) normalize the input image;
  • the image feature encoder includes an image global feature encoding branch, an object feature encoding branch, an interactive feature encoding branch between objects, an attention mechanism and a feature fusion layer;
  • the image global feature encoding branch includes a plurality of convolution modules; its input is the normalized image of the input image, and it is used to extract the global features of the image and convert them into vector form;
  • the target feature encoding branch includes a plurality of convolution modules; its input is the normalized images of the regions where the various targets are located, and it is used to extract local target features and convert them into vector form;
  • the inter-target interaction feature encoding branch includes a plurality of convolution modules; its input is the normalized images of the regions where the various target interactions are located, and it is used to extract the features of the action interaction regions between targets and convert them into vector form;
  • the attention mechanism is used to capture, relative to the global features, the target features and the inter-target interaction features that need to be focused on;
  • the input of the feature-text decoder is the image feature vector obtained by the image feature encoder; a combination module of at least 2 layers of long short-term memory (LSTM) networks is used to decode the image feature vector into text; the feature-text decoder refers to the decoder from image features to text.
  • the image factual description preprocessing module first preprocesses the input image; it then uses the pre-trained target detection algorithm to detect the regions where the various targets appear in the image, and the pre-trained target relationship detection algorithm to detect the regions where the various target interactions appear; finally, it crops and aligns the images of the target regions and inter-target interaction regions and normalizes them into images with a size of 56×56 pixels; the specific implementation is as follows:
  • the image feature encoder includes an image global feature encoding branch, an object feature encoding branch, an interactive feature encoding branch between objects, an attention mechanism, and a feature fusion layer;
  • the image global feature encoding branch includes a plurality of convolution modules, and the input of this branch is the normalized image of the input image, which is used to extract the global feature of the image and convert it into a vector form ;
  • Convolution module d9: includes 2 convolutional layers and 1 pooling layer; both convolutional layers use 64 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 224×224×64; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 112×112×64;
  • Convolution module d10: includes 2 convolutional layers and 1 pooling layer; both convolutional layers use 128 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 112×112×128; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 56×56×128;
  • Attention module 1: the spatial attention mechanism is used to process the 14×14×256 image global features output by convolution module d10; the specific implementation is as follows: first, two 14×14×1 feature layers are obtained through global max pooling and global average pooling; then these two feature layers are stacked, and a 1×1 convolution adjusts the number of channels of the stacked 14×14×2 feature layer to obtain a 14×14×1 feature layer; finally, a 14×14×1 global spatial attention map is output through a sigmoid, and the obtained spatial attention map is multiplied with the original input features to obtain the final image global feature map based on the spatial attention mechanism, with a size of 14×14×256;
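  • a minimal PyTorch sketch of this spatial attention step (channel-wise max and average maps, stacking, 1×1 convolution, sigmoid, elementwise reweighting) is shown below; the tensor sizes in the usage line follow the 14×14×256 example above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention described above."""
    def __init__(self):
        super().__init__()
        # stacked max-pooled and average-pooled maps (2 channels) -> 1 channel
        self.conv = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, H, W)
        max_map, _ = torch.max(x, dim=1, keepdim=True)     # (B, 1, H, W) max over channels
        avg_map = torch.mean(x, dim=1, keepdim=True)       # (B, 1, H, W) mean over channels
        attn = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))  # (B, 1, H, W)
        return x * attn                                    # reweighted feature map, same size as x

out = SpatialAttention()(torch.randn(1, 256, 14, 14))      # -> (1, 256, 14, 14)
```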
  • Convolution module d11: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 256 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 56×56×256; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 28×28×256;
  • Convolution module d12: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 512 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 28×28×512; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 14×14×512;
  • Convolution module d13: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 512 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 14×14×512; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 7×7×512;
  • the inter-target interaction feature encoding branch includes multiple convolution modules and attention modules; its input is the normalized images of the regions where the various target interactions are located, and it is used to extract the inter-target interaction features and convert them into vector form, as follows:
  • Convolution module d14: includes 2 convolutional layers and 1 pooling layer; both convolutional layers use 128 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 56×56×128; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 28×28×128;
  • Convolution module d15: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 256 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 28×28×256; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 14×14×256;
  • Attention module 2: according to the spatial-attention-based global features output by attention module 1, the unimportant inter-target interaction feature maps are removed, and the channel attention mechanism is used to output the focused inter-target interaction feature map from the 14×14×256 output of convolution module d15; the specific implementation is as follows:
  • Convolution module d16: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 512 3×3 convolution kernels with a stride of 1 and zero padding of 1 to perform convolution on the focused inter-target interaction feature map, each followed by ReLU nonlinear mapping, outputting a feature map of size 14×14×512; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 7×7×512;
  • the target feature encoding branch includes a plurality of convolution modules; its input is the normalized images of the regions where the various targets are located, and it is used to extract local target features and convert them into vector form; the specific implementation is as follows:
  • Convolution module d17: includes 2 convolutional layers and 1 pooling layer; both convolutional layers use 128 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 56×56×128; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 28×28×128;
  • Convolution module d18: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 256 3×3 convolution kernels with a stride of 1 and zero padding of 1, each followed by ReLU nonlinear mapping, outputting a feature map of size 28×28×256; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a feature map of size 14×14×256;
  • Attention module 3: according to the spatial-attention-based global features output by attention module 1, the unimportant target feature maps are removed, and the channel attention mechanism is used to output the focused target feature map from the 14×14×256 output of convolution module d18; the specific implementation is as follows:
  • first, two 1×1×256 feature layers are obtained through global average pooling; then two fully connected layers are applied to these feature layers, the first with a reduced channel count of about 150 and the second with 256 channels, finally outputting a 1×1×256 feature layer; a 1×1×256 target channel attention map is then output through a sigmoid and multiplied with the original input features to obtain the final target feature map based on the channel attention mechanism, with a size of 14×14×256;
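  • a minimal PyTorch sketch of this channel attention step is shown below; the bottleneck width of 150 follows the "about 150" mentioned above, and the ReLU between the two fully connected layers is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention described above (squeeze-and-excite style)."""
    def __init__(self, channels=256, bottleneck=150):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, bottleneck),  # first fully connected layer (channel reduction)
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, channels),  # second fully connected layer (channel restoration)
        )

    def forward(self, x):                        # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                   # global average pooling -> (B, C)
        w = torch.sigmoid(self.fc(w))            # channel attention weights in (0, 1)
        return x * w.view(x.size(0), -1, 1, 1)   # reweight each channel of the input

out = ChannelAttention()(torch.randn(1, 256, 14, 14))   # -> (1, 256, 14, 14)
```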
  • Convolution module d19: includes 3 convolutional layers and 1 pooling layer; all 3 convolutional layers use 512 3×3 convolution kernels with a stride of 1 and zero padding of 1 to perform convolution on the focused target feature map, each followed by ReLU nonlinear mapping, outputting a feature map of size 14×14×512; the pooling layer uses a 2×2 max pooling kernel with a stride of 2 to down-sample the feature map, outputting a focused target feature map of size 7×7×512;
  • the feature fusion module includes a plurality of feature fusion layers and pooling layers, and the specific implementation is as follows:
  • Feature fusion layer c3: its inputs are the focused inter-target interaction feature map output by the inter-target interaction feature encoding branch and the image global features output by the image global feature encoding branch, both of size 7×7×512; an average pooling operation is applied to the two feature maps, giving two 4×4×512 feature maps, and the pooled image global features and inter-target interaction feature map are added to obtain a fused feature map of size 4×4×512; at the same time, the pooled focused inter-target interaction feature map of size 4×4×512 is output;
  • Feature fusion layer c4: its inputs are the focused target feature map output by the target feature encoding branch and the image global features output by the image global feature encoding branch, both of size 7×7×512; an average pooling operation is applied to the two feature maps, giving two 4×4×512 feature maps, and the pooled image global features and focused target feature map are added to obtain a fused feature map of size 4×4×512; at the same time, the pooled focused target feature map of size 4×4×512 is output;
  • Upsampling layer e1: the fused feature map of size 4×4×512 output by feature fusion layer c3 is upsampled to a feature map of size 7×7×512;
  • Upsampling layer e2: the fused feature map of size 4×4×512 output by feature fusion layer c4 is upsampled to a feature map of size 7×7×512;
  • the feature fusion layer c5 first, input the 7 ⁇ 7 ⁇ 512 feature map output by the upsampling layer e1, the 7 ⁇ 7 ⁇ 512 feature map output by the convolution module d13, and the 7 ⁇ 7 ⁇ 512 feature map output by the feature fusion layer c3. Focus on the interaction feature map between targets with a size of 4 ⁇ 4 ⁇ 512 after pooling; then, stack the input two feature maps of size 7 ⁇ 7 ⁇ 512 to obtain a new global feature of size 7 ⁇ 7 ⁇ 512 Figure; Finally, the global average pooling operation is performed on the global feature map and the interactive feature map between focused targets to obtain two 512-dimensional feature vectors, and these two feature vectors are spliced to output a 1024-dimensional feature vector;
  • Feature fusion layer c6: first, it takes as input the 7 × 7 × 512 feature map output by upsampling layer e2, the 7 × 7 × 512 feature map output by convolution module d13, and the pooled 4 × 4 × 512 focused target feature map output by feature fusion layer c4; next, the two input 7 × 7 × 512 feature maps are stacked to obtain a new global feature map of size 7 × 7 × 512; finally, global average pooling is applied to the global feature map and to the focused target feature map, giving two 512-dimensional feature vectors, which are concatenated to output a 1024-dimensional feature vector;
  • Feature fusion layer c7: the two 1024-dimensional vectors output by feature fusion layers c5 and c6 are concatenated to obtain a 2048-dimensional feature vector (a code sketch of the c3-c7 data flow follows);
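The data flow through c3-c7 can be summarized by the hedged PyTorch sketch below. Function names are illustrative, and where the text says two 7 × 7 × 512 maps are "stacked" into a new 7 × 7 × 512 map, the sketch uses element-wise addition, since channel concatenation would double the channel count; treat that interpretation as an assumption.

```python
import torch
import torch.nn.functional as F

def fuse_branch(global_feat, branch_feat):
    """c3 / c4: average-pool both 7x7x512 maps to 4x4x512, add them,
    and also return the pooled branch map for use in c5 / c6."""
    g = F.adaptive_avg_pool2d(global_feat, 4)
    b = F.adaptive_avg_pool2d(branch_feat, 4)
    return g + b, b

def fuse_head(fused, d13_feat, pooled_branch):
    """e1/e2 + c5/c6: upsample the fused map back to 7x7, merge it with the d13 global map,
    then global-average-pool both streams and concatenate into a 1024-d vector."""
    up = F.interpolate(fused, size=(7, 7), mode="nearest")
    new_global = up + d13_feat                               # "stacking" read as addition (assumption)
    v1 = F.adaptive_avg_pool2d(new_global, 1).flatten(1)     # 512-d
    v2 = F.adaptive_avg_pool2d(pooled_branch, 1).flatten(1)  # 512-d
    return torch.cat([v1, v2], dim=1)                        # 1024-d

g = torch.randn(1, 512, 7, 7)     # image global feature from d13
r = torch.randn(1, 512, 7, 7)     # focused inter-target interaction feature (d16)
o = torch.randn(1, 512, 7, 7)     # focused target feature (d19)
f3, p3 = fuse_branch(g, r)
f4, p4 = fuse_branch(g, o)
image_vector = torch.cat([fuse_head(f3, g, p3), fuse_head(f4, g, p4)], dim=1)  # c7: 2048-d
```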
  • The feature-text decoder takes as input the image feature vector produced by the image feature encoder and uses a combination module containing at least 2 layers of long short-term memory (LSTM) networks to decode the image feature vector into text; the specific algorithm flow is as follows:
  • Given a feature F under the attention mechanism, the decoder output and the feature attention A(t) at generation step t are computed from the attention weights, where E(·) is the word embedding function; on this basis, the attention-based target features, inter-target interaction features, and global features are obtained, and the ε-th relation feature of the attention-based inter-target interaction features is expressed through a weighted combination, where W is the weight matrix;
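The decoding step can be sketched as follows in PyTorch. The attention is reduced to a single learned softmax gate over the fused feature vector, and all class names, argument names, and dimensions are assumptions rather than the patent's exact formulation.

```python
import torch
import torch.nn as nn

class FeatureTextDecoder(nn.Module):
    """Two-layer LSTM decoder that attends over the encoded image feature at every step."""
    def __init__(self, vocab_size: int, feat_dim: int = 2048,
                 embed_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)            # E(.)
        self.attn = nn.Linear(feat_dim + hidden, feat_dim)          # attention weights over F
        self.lstm = nn.LSTM(embed_dim + feat_dim, hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def step(self, prev_word, feats, state=None):
        b = feats.size(0)
        h = state[0][-1] if state is not None else feats.new_zeros(b, self.lstm.hidden_size)
        alpha = torch.softmax(self.attn(torch.cat([feats, h], dim=1)), dim=1)
        context = alpha * feats                                     # attended feature A(t)
        x = torch.cat([self.embed(prev_word), context], dim=1).unsqueeze(1)
        y, state = self.lstm(x, state)
        return self.out(y.squeeze(1)), state                        # logits over the vocabulary

decoder = FeatureTextDecoder(vocab_size=10000)
logits, state = decoder.step(torch.tensor([1]), torch.randn(1, 2048))
```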
  • Step 5: Use the image description dataset Ai-Challenger Caption to train the image factual description model.
  • In this embodiment, the Ai-Challenger Caption image description dataset is selected. This dataset provides a five-sentence Chinese description for each image and contains 300,000 images with 1.5 million Chinese descriptions in total: the training set contains 210,000 images, the validation set contains 30,000 images, test set A contains 30,000 images, and test set B contains 30,000 images. In practice, other image description datasets can also be used, or image description data can be collected independently to build an image description dataset with Chinese description labels.
  • Step 6: Build the emotional image description initialization module.
  • According to the image emotion category output by the trained image emotion recognition model, this module selects the corresponding emotion word from the emotion word embedding library and embeds it into the image factual description output by the trained image factual description model, generating an initial emotional image description;
  • The text emotion detector AYLIEN API is used to detect the emotion of the sentence S generated in step 4; a One-Hot vector J_T marks the position k of the emotion word together with the object feature it modifies and the corresponding inter-object interaction feature, and the dimension of J_T equals the length L of S;
  • The emotion basis vector with the highest similarity to the detected emotion word is extracted and compared with the image emotion output by the image emotion recognition model; if the two are the same, the sentence is kept by default as the initial emotional description.
  • If they differ, an emotion word corresponding to the emotion detected from the image is looked up in the emotion lexicon S-corpus according to the verb-noun association mapping and substituted into the sentence, and the replacement result is used as the initial emotional description sentence;
  • If no emotion word can be detected in the sentence generated in step 4, the emotion words corresponding to the emotion detected from the image are taken directly from the emotion lexicon S-corpus according to the verb-noun association mapping and added to the corresponding relation regions, finally generating the initial emotional image description;
  • The length of the resulting initial emotional image description sentence X is L'.
  • In this embodiment, the initial emotional image description sentence is "a peaceful woman standing in a leisurely garden with her baby in her arms" (a sketch of the initialization logic follows);
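The initialization logic of step 6 can be summarized by the Python sketch below, assuming simple stand-in data structures: a dict of detected emotion-word vectors, a dict holding the eight emotion basis vectors, and an (image emotion, word) to replacement table standing in for the S-corpus lookup. These representations and helper names are assumptions for illustration, not the patent's exact implementation.

```python
import numpy as np

EMOTIONS = ["amusement", "anger", "awe", "contentment",
            "disgust", "excitement", "fear", "sadness"]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def init_emotional_caption(factual_sentence, detected, basis, image_emotion, s_corpus):
    """detected: {emotion_word: vector} found in S by the text emotion detector;
    basis: {category: vector} for the 8 emotion basis vectors;
    s_corpus: {(image_emotion, anchor_word): emotional_word} replacement table."""
    if not detected:
        # no emotion word in the factual description: add one matching the image emotion
        word = s_corpus.get((image_emotion, None), "")
        return (word + " " + factual_sentence).strip()
    # emotion category whose basis vector is closest to the detected emotion word(s)
    closest = max(EMOTIONS,
                  key=lambda e: max(cosine(v, basis[e]) for v in detected.values()))
    if closest == image_emotion:
        return factual_sentence                 # already consistent: keep as-is
    # otherwise substitute an emotion word that matches the image emotion
    old_word = next(iter(detected))
    new_word = s_corpus.get((image_emotion, old_word), old_word)
    return factual_sentence.replace(old_word, new_word)
```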
  • Step 7: Build the reinforcement-learning-based fine-tuning module shown in Figure 6, which includes a sentence reconstruction generator, a sentence storage unit, a sentence sampling unit, a sentence evaluation unit, and a word selection evaluation unit. The sentence reconstruction generator serves as the agent (Agent) in the reinforcement learning system, while the sentence storage unit, sentence sampling unit, sentence evaluation unit, and word selection evaluation unit constitute the external environment (Environment). The sentence reconstruction generator continuously interacts with the external environment, obtains reward (Reward) information from it, and learns the mapping from environment states (State) to actions (Action) so as to optimize and adjust its actions, adjusting the initial emotional image description generated in step 6 and generating the final emotional image description. The specific steps are as follows:
  • 1) Based on the environment state (State) S_{t-1} at time t-1, i.e., the sentence S_{t-1} generated at time t-1, and the reward (Reward) R_{t-1} at time t-1, i.e., the score R_{t-1} of the word selected at time t-1, the sentence reconstruction generator selects a semantically similar word from the emotion word embedding library through the word selector, executes the word-selection "Action", and adds the selected word to the sentence S_{t-1} generated at time t-1, producing the sentence S_t at time t;
  • The sentence S_0 generated at time 0 is the sentence start symbol;
  • The word selector's "Action": on the basis of the known semantics of the target sentence, words with similar semantics are selected according to the sentence recorded at the previous moment and its evaluation result, where the degree of semantic similarity is represented by the distance between semantic word vectors; the word-selection action a_t at time t denotes taking y_t, drawn from the target vocabulary C+ = (C-corpus) ∪ (S-corpus), as the t-th word to be generated on top of the t-1 words already generated;
  • The updated sentence S_t at time t, fed back as the state (State), is the sentence newly generated at time t by adding y_t to the sentence S_{t-1} at time t-1 after action a_t is executed;
  • The selected-word score used as the reward (Reward) is the score of each word y_t with respect to the state S_t, which is calculated by the word selection evaluation unit in the external environment (a sketch of one interaction step follows);
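One agent-environment interaction step, as just defined, might be sketched as follows. The nearest-neighbour word scoring and the plain list representation of a sentence are illustrative assumptions; in the patent the selector is driven by the generator's recurrent policy and also conditions on the previous reward R_{t-1}, which this sketch omits.

```python
import numpy as np

def word_selector_action(emotion_embeddings, target_vec):
    """Action a_t: pick a word y_t from the emotion word embedding library whose vector
    is closest to the target semantics (distance between semantic word vectors)."""
    return min(emotion_embeddings,
               key=lambda w: np.linalg.norm(emotion_embeddings[w] - target_vec))

def environment_step(sentence_prev, word, word_score_fn):
    """State transition and reward: S_t = S_{t-1} + [y_t]; R_t is the selected-word score
    computed by the word selection evaluation unit (word_score_fn stands in for it)."""
    sentence_t = sentence_prev + [word]
    reward_t = word_score_fn(sentence_t, word)
    return sentence_t, reward_t
```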
  • In this embodiment, the sentence reconstruction generator acting as the agent (Agent), as shown in Figure 6, is constructed to reconstruct the input initial emotional image description sentence;
  • The network structure of the sentence reconstruction generator is a two-layer recurrent neural network combined with an attention mechanism; a deterministic policy is adopted, where P_θ(y_t | S_t) denotes the probability of producing word y_t in state S_t, L″ is the total sentence length, μ is a function that decreases the reward as the sentence length grows, the total reward during generator training is accumulated accordingly, and θ is the parameter of the generator;
  • 2) The sentence storage unit stores the updated sentence S_t at time t; the sentence sampling unit rolls out (Rolling Out) the updated sentence S_t at time t based on a sampling search algorithm, generating N sentences, where N takes a value of 3, 4, or 5;
  • The sampling search algorithm of the sentence sampling unit may adopt multinomial sampling or Monte Carlo sampling;
  • The sentence evaluation unit first evaluates and scores the N sentences generated by the sentence sampling unit using the emotion discriminator, the grammatical collocation discriminator, and the semantic discriminator, obtaining N emotional reward scores, grammatical collocation reward scores, and semantic reward scores; it then takes a weighted average to obtain comprehensive reward scores and finally feeds the comprehensive reward scores into the word selection evaluation unit;
  • The word selection evaluation unit outputs the selected-word score as the reward (Reward) R_t fed back from the external environment (Environment) to the sentence reconstruction generator;
  • In this embodiment, a Monte Carlo-based stochastic beam search is used to generate sentences, producing N_sampling complete sampled sentences; if sampling starts at time t, they are denoted Y_{1:t}. N_sampling can be set to 3-5, and N_sampling = 3 is adopted in this embodiment of the invention (a rollout sketch follows);
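A rollout of this kind can be sketched as below, assuming the generator exposes a next-word probability distribution over the vocabulary; multinomial sampling is shown, and a Monte Carlo or beam-style variant would only change how the next word is drawn.

```python
import torch

def rollout(next_word_probs, prefix_tokens, n_sentences=3, max_len=20, eos_id=2):
    """Roll the partial sentence S_t out into n_sentences complete sentences by
    repeatedly sampling the next word from the generator's distribution."""
    completions = []
    for _ in range(n_sentences):
        tokens = list(prefix_tokens)
        for _ in range(max_len):
            probs = next_word_probs(tokens)                  # shape: (vocab_size,)
            next_id = torch.multinomial(probs, num_samples=1).item()
            tokens.append(next_id)
            if next_id == eos_id:
                break
        completions.append(tokens)
    return completions

# toy usage with a uniform distribution standing in for the generator
samples = rollout(lambda toks: torch.ones(100) / 100, prefix_tokens=[5, 7], n_sentences=3)
```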
  • The sentence evaluation unit evaluates and scores the N_sampling generated complete sampled sentences to obtain the emotional reward score, semantic reward score, and grammatical collocation reward score of each sampled sentence; these are then combined by weighted averaging into a comprehensive reward score, which finally provides the reward basis for word selection evaluation in the sentence reconstruction unit;
  • The semantic discriminator D_1 is computed using the Word Mover's Distance (WMD) between the source initial emotional image description sentence of length L' and the target emotional image description sentence Y = {y_1, ..., y_{L″}} of length L″ generated by G;
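As one way to realise the WMD-based semantic score, gensim's pre-trained word vectors can be used (assuming gensim with a pyemd/POT backend and a suitable Chinese word-vector file are available); the file path and the score transform below are placeholders.

```python
from gensim.models import KeyedVectors

# hypothetical pre-trained Chinese word vectors
kv = KeyedVectors.load_word2vec_format("word_vectors.bin", binary=True)

def semantic_reward(initial_tokens, generated_tokens):
    """Smaller Word Mover's Distance means closer semantics; map it to a (0, 1] reward."""
    distance = kv.wmdistance(initial_tokens, generated_tokens)
    return 1.0 / (1.0 + distance)

score = semantic_reward(["一个", "女人", "抱着", "婴儿", "站在", "花园里"],
                        ["一个", "美丽的", "女人", "抱着", "婴儿", "站在", "花园里"])
```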
  • The emotion discriminator D_2 is trained on the Sentiment140 dataset using an adversarial neural network to identify the emotion category of the generated sentence, with a correspondingly defined adversarial loss during training;
  • Through the reward evaluation module, the emotion detection, semantic reward, and grammatical collocation reward results of a sentence can be obtained as D_1(Y), D_2(Y), and D_3(Y);
  • The grammatical collocation discriminator is a two-layer recurrent neural network pre-trained on the CCL (Centre for Chinese Linguistics) grammatical collocation corpus;
  • Since both emotion and semantics are very important, α and β can each be set to values greater than 0.5, and η can be set to a value in the range 0.2-0.5 (a sketch of the weighted combination follows);
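The weighted combination of the three discriminator scores could then look like the sketch below. The description fixes only the ranges of the weights (α, β > 0.5 and η in 0.2-0.5), so the pairing of α with the emotion score, β with the semantic score, and η with the grammatical collocation score is an assumption for illustration.

```python
def comprehensive_reward(d1_semantic, d2_emotion, d3_grammar,
                         alpha=0.6, beta=0.6, eta=0.3):
    """Weighted average of the discriminator scores D_1(Y), D_2(Y), D_3(Y)."""
    total = alpha * d2_emotion + beta * d1_semantic + eta * d3_grammar
    return total / (alpha + beta + eta)   # normalised so the reward stays in [0, 1]

print(comprehensive_reward(d1_semantic=0.8, d2_emotion=0.9, d3_grammar=0.7))
```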
  • The sentence storage unit is used to store the updated sentence, and the size of the storage unit is L″;
  • The word selection evaluation unit subtracts the sentence evaluation score f(S_{t-1}, y_{t-1}) output by the sentence evaluation unit at the previous time t-1 from the sentence evaluation score f(S_t, y_t) output at the current time t, giving the score of the word selected at the current moment: γ(S_t, y_t) = f(S_t, y_t) − f(S_{t-1}, y_{t-1});
  • The gradient update during optimization is then computed from these per-word scores and the generator policy P_θ;
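A standard REINFORCE-style update that uses the per-word score γ(S_t, y_t) as the reward weight on the log-probability of the chosen word is sketched below; treat it as one plausible realisation under that assumption, not the patent's precise update rule.

```python
import torch

def reinforce_update(optimizer, logprobs, gammas):
    """logprobs: log P_theta(y_t | S_t) tensors collected while generating the sentence;
    gammas: matching per-word scores gamma(S_t, y_t) from the word selection evaluator."""
    loss = -torch.stack([g * lp for g, lp in zip(gammas, logprobs)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```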
  • 3) Steps 1) and 2) are iterated: the sentence reconstruction generator keeps interacting with the external environment until the maximum reward for sentence reconstruction is obtained, and the final emotional image description is generated.
  • As an example of the fine-tuning process: starting from the initial emotional image description output by the emotional image description initialization module, "a peaceful woman standing in a leisurely garden with her baby in her arms", the sentence state and word selection are initialized; at time t = 1 the word selector chooses candidate words similar to the first word of the initial description from the lexicon (for example, "one" or "a single") and iteratively evaluates each candidate word.
  • The structure of the reinforcement-learning-based emotional image description system disclosed in the embodiment of the present invention is shown in Figure 2; it includes at least one computing device comprising a memory, a processor, and a computer program that is stored in the memory and can run on the processor, and when the computer program is loaded into the processor it implements the above reinforcement-learning-based emotional image description method.

Abstract

本发明公开了一种基于强化学习的情绪化图像描述方法,涉及图像处理与模式识别技术领域,在大规模语料库基础上构建情绪词嵌入库;构建图像情绪识别模型;使用图像情绪分析数据集训练图像情绪识别模型;构建图像事实性描述模型;使用图像描述数据集训练图像事实性描述模型;构建情绪化图像描述初始化模块,利用情绪词嵌入库、图像情绪识别模型输出的图像情绪类别以及图像事实性描述模型输出的图像事实性描述,生成初始的情绪化图像描述;构建基于强化学习的微调模块,对初始的情绪化图像描述进行微调,生成最终的情绪化图像描述。本发明还公开了一种基于强化学习的情绪化图像描述系统,本发明可使得各类复杂场景的图像描述更加生动,富有情感。

Description

一种基于强化学习的情绪化图像描述方法及系统 技术领域
本发明涉及图像处理与模式识别技术领域,特别是一种基于强化学习的情绪化图像描述方法及系统。
背景技术
如今信息社会中充斥着图像数据,如日常生活照、医疗图像和遥感卫星图像,人们的信息交互方式已经从传统的语音、文字转化到多模态式信息交互,图像信息交互是未来“元宇宙”的重要核心,但对于视觉障碍以及认知不足者而言无法从图像中准确获取信息,包括图像语义信息、情感信息,严重影响这类群体在未来信息交互的通畅性和信息获取的便捷性。
信息交互中,情感是重要的一环,对于交互的双方而言,都需要进行情感上的互动以获取对方情绪进而完成更好的信息交流。对于视觉障碍以及认知不足者而言,通过一定程度的事实性的图像描述可以了解图像的描述对象。由于事实性描述只是单纯叙述图像所包含的对象,缺乏对图像中情绪与色彩的表达,使得人们无法获取图片所传递的情感。因此情绪化的图像描述成为图像描述中最具挑战性的一个难题。图像的色彩通过RGB得到的像素值输入到计算机以表示,但是人们对具体的数值不具有任何感受与联想,所以直接通过像素值大小的提取无法让视觉障碍以及认知不足者直观地感受图像所包含的各类信息,并且图像中所包含的色彩信息多种多样,并且存在一定的图像意境,同时图像所包含的情绪千变万化,包括愉悦(Amusement)、狂怒(Anger)、惊奇(Awe)、接受(Contentment)、憎恨(Disgust)、狂喜(Excitement)、恐惧(Fear)、悲痛(Sadness)8类情绪。因此,根据图像情绪识别模型的情绪分析结果与生成的图像事实性描述语义,从情绪词嵌入库中选择与图像情绪类别对应的情绪词,生成初始的情绪化图像描述;最后,构建基于强化学习的微调模块,该模块使用强化学习方法对生成的初始情绪化图像描述进行微调,生成最终的情绪化图像描述,使语义更加通顺丰富。
目前,对图像的描述研究多集中于单一性的事实性图像描述。虽然生成的图像描述句法简单,且模型的可解释性强,但由于图像特征于图像文本描述间非线性映射关系简单,导致其抽象性不高,表述能力有限,并且生成的描述较为生硬,不具有情感色彩,无法通过生成的文本更细致描述图像的内容;与此同时,单一性的事实性图像描述对于图像中物体及其交互关系的描述匮乏;再者,事实性图像描述缺乏对图像色彩所传递情绪的描述,无法完全描述出图像所表现出的色彩氛围信息。
中国专利申请“一种基于深度注意力机制的图像描述生成方法”(专利申请号201711073398.1,公开号CN108052512B),构建深度长短期记忆网络模型,通过在长短期记忆网络模型的单元之间添加注意力机制函数,并利用卷积神经网络提取的训练图片特征和训练图片的描述信息对添加了注意力机制函数的长短期记忆网络进行训练,得到深度长短期记忆网络模型;图像描述生成步骤,将待生成描述的图像依次通过卷积神经网络模型和深度长短期记忆网络模型,生成与图像对应的描述。该方法存在的问题是使用长短期记忆网络模型直接对待生成描述的图像进行编码并解码处理得到图像描述,没有做到图像内部特征的充分提取,且无基于图像主题色特征的情绪提取,这可能会影响最后的图像情绪识别的语义丰富度,且无情感支撑。
发明内容
本发明所要解决的技术问题是克服现有技术的不足而提供一种基于强化学习的情绪化图像描述方法及系统,一方面提取训练集图像主题色彩特征先验,与训练集图像一并输入图像情绪识别模型,优化网络模型参数,并结合多特征信息融合的图像事实性描述模型,生成初始情绪化图像描述;另一方面通过强化学习微调初始情绪化图像描述,使得语句更加通顺,且富有情感色彩;进一步发挥图像情绪识别和图像语义描述间的互补作用,获得情绪化图像描述的同时,提升图像描述的准确性和鲁棒性。
本发明为解决上述技术问题采用以下技术方案:
根据本发明提出的一种基于强化学习的情绪化图像描述方法,包括以下步骤:
步骤一、在大规模语料库基础上构建情绪词嵌入库;
步骤二、构建图像情绪识别模型;
步骤三、使用图像情绪分析数据集训练图像情绪识别模型;
步骤四、构建一种用于生成图像事实性描述的基于注意力机制的图像事实性描述模型,图像事实性描述模型包括依次顺序连接的图像事实性描述预处理模块、图像特征编码器和特征-文本解码器;
步骤五、使用图像描述数据集训练图像事实性描述模型;
步骤六、构建情绪化图像描述初始化模块,情绪化图像描述初始化模块根据训练好的图像情绪识别模型输出的图像情绪类别,从情绪词嵌入库中选取与图像情绪类别对应的情绪词,并将之嵌入到由训练好的图像事实性描述模型输出的图像事实性描述中,生成初始的情绪化图像描述;
步骤七、构建基于强化学习的微调模块,微调模块用于对初始的情绪化图像描述进行微调,生成最终的情绪化图像描述。
作为本发明所述的一种基于强化学习的情绪化图像描述方法进一步优化方案,步骤七中,基于强化学习的微调模块包括语句重建生成器、语句存储单元、语句抽样单元、语句评估单元和选词评估单元;其中,语句重建生成器作为强化学习系统中的智能体,语句存储单元、语句抽样单元、语句评估单元和选词评估单元构成强化学习系统中的外部环境;语句重建生成器与外部环境进行不断地交互,获取外部环境的奖励信息,学习从环境状态到行为动作的映射,来优化调整行为动作,对初始的情绪化图像描述进行微调,生成最终的情绪化图像描述。
作为本发明所述的一种基于强化学习的情绪化图像描述方法进一步优化方案,步骤七中,基于强化学习的微调模块包括语句重建生成器、语句存储单元、语句抽样单元、语句评估单元和选词评估单元,微调模块用于对初始的情绪化图像描述进行微调的具体方法如下:
步骤701、语句重建生成器根据第t-1时刻的环境状态以及第t-1时刻的奖励,通过选词器从情绪词嵌入库中选择语义相近的单词,执行选词的动作,并将筛选出的单词加入第t-1时刻生成的语句S t-1中,生成第t时刻的语句S t;其中,第0时刻生成的语句S 0为语句生成起始符,第t-1时刻的环境状态即为第t-1时刻生成的语句S t-1,第t-1时刻的奖励R t-1即为第t-1时刻所选单词得分,t为时刻;
步骤702、语句存储单元存储更新后的第t时刻的语句S t;语句抽样单元基于采样搜索算法对更新后的第t时刻的语句S t进行回滚,生成N个语句,N的取值为3、4或5;语句评估单元首先对语句抽样单元生成的N个语句分别使用情绪鉴别器、语法搭配鉴别器、语义鉴别器进行评估打分,得到N个情绪奖励得分、语法搭配奖励得分、语义奖励得分,然后采取加权平均的方法得到综合奖励 得分,最后将综合奖励得分输入到选词评估单元;选词评估单元输出所选单词得分,作为外部环境向语句重建生成器反馈的奖励R t
步骤703、迭代步骤701至步骤702,语句重建生成器与外部环境不断地进行交互,直至取得语句重建的最大奖励,生成最终的情绪化图像描述。
作为本发明所述的一种基于强化学习的情绪化图像描述方法进一步优化方案,所述语句抽样单元的采样搜索算法采用多项式采样或蒙特卡洛抽样方法。
作为本发明所述的一种基于强化学习的情绪化图像描述方法进一步优化方案,步骤一中,构建情绪词嵌入库的具体方法如下:
步骤101、利用NLTK工具获取目标检测和图像描述数据集中的名词、动词,生成语义词库,并计算其中每个语义词的词向量;
步骤102、从大规模语料库LSCC中筛选出情绪词,生成情绪词库,并计算每个情绪词的情绪词向量;将语义词库中的每个语义词对应的情绪词分为IAPS定义的8个类别:愉悦、狂怒、惊奇、接受、憎恨、狂喜、恐惧、悲痛;
步骤103、从情绪词库中筛选出与语义词相对应的不同情绪类别的情绪词组,构建情绪词嵌入库。
作为本发明所述的一种基于强化学习的情绪化图像描述方法进一步优化方案,图像情绪识别模型包括图像情绪识别预处理模块、人脸情绪特征提取模块、图像主题色彩特征提取模块、图像情绪特征提取模块、特征融合层、全连接层以及分类层;所述图像情绪识别预处理模块包括人脸检测单元、人脸图像归一化处理单元和图像尺寸归一化处理单元;其中,
所述人脸检测单元,利用预先训练的人脸检测网络,检测出输入的图像中人脸区域,并对不同的人脸区域进行标号;
所述人脸图像归一化处理单元,用于对检测出的每个人脸区域进行裁剪、对齐和尺寸归一化;
所述图像尺寸归一化处理单元,用于对输入的图像进行尺寸归一化;
所述人脸情绪特征提取模块,用于提取裁剪、对齐和尺寸归一化后的人脸图像中每一个人的面部情绪特征;
所述图像主题色彩特征提取模块,用于提取输入的图像的主题色彩特征;
所述图像情绪特征提取模块,用于提取图像尺寸归一化处理单元输出的尺寸归一化后的图像的情绪特征;
所述特征融合层,用于分别对人脸情绪特征提取模块输出的面部情绪特征、图像主题色彩特征提取模块输出的主题色彩特征以及图像情绪特征提取模块输出的情绪特征进行融合,得到融合后的情绪特征向量;
所述全连接层,用于全连接特征融合层与分类层;
所述分类层,用于输出图像所属的情绪类别。
作为本发明所述的一种基于强化学习的情绪化图像描述方法进一步优化方案,步骤二中,使用图像主题色彩特征提取模块提取输入的图像的主题色彩特征的具体方法如下:
步骤1、使用微元法切割RGB空间,形成一个个独立的立体方块;
步骤2、将图像的RGB格式像素散点放入切割后的RGB空间中,将散点值作为立体方块的值,如果该立体方块中没有散点,则将该立体方块区域中心值作为该立体方块的值;
步骤3、通过滑动窗口加权的方式对整个滑动窗口区域内的立体方块的值进行加权求和得到滑动窗口大小立体方块的值,滑动窗口的大小取决于最终所要选 择的图像主题色的种类数;
步骤4、通过步骤1至3,最终得到输入图像的图像主题色彩特征。
作为本发明所述的一种基于强化学习的情绪化图像描述方法进一步优化方案,使用微元法切割RGB空间,形成一个个独立的立体方块,立体方块为像素级大小的立方块。
作为本发明所述的一种基于强化学习的情绪化图像描述方法进一步优化方案,步骤四中构建一种图像事实性描述模型的具体方法如下:
步骤4.1、图像事实性描述预处理模块,利用在目标检测及目标关系检测数据集上预先训练好的网络模型对输入的图像进行预处理;具体方法如下:1)通过预先训练目标检测算法,检测图像中所出现的各类目标所在区域;利用预先训练目标关系检测算法,检测图像中所出现的各类目标交互所在区域;2)对输入的图像、各类目标所在区域图像以及各类目标交互所在区域图像进行裁剪与对齐,并进行归一化处理,得到输入的图像归一化后的图像、各类目标所在区域图像归一化后的图像、各类目标交互所在区域图像归一化后的图像;
步骤4.2、构建图像特征编码器,其包括图像全局特征编码支路、目标特征编码支路、目标间交互特征编码支路、注意力机制和特征融合层;所述图像全局特征编码支路包括多个卷积模块,图像全局特征编码支路的输入为输入的图像归一化后的图像,用于提取图像的全局特征,并将其转化为向量形式;所述目标特征编码支路包括多个卷积模块,目标特征编码支路的输入为各类目标所在区域图像归一化后的图像,用于提取局部的目标特征,并将其转化为向量形式;所述目标间交互特征编码支路包括多个卷积模块,目标间交互特征编码支路的输入为各类目标交互所在区域图像归一化后的图像,用于提取目标间动作交互区域特征,并将其转化为向量形式;所述卷积模块,包含一个或多个卷积层以及一个池化层;所述注意力机制,用于捕捉相对于全局特征,需要重点关注的目标特征及重点关注的目标间交互特征;所述特征融合层,用于分别对上述图像全局特征、重点关注的目标特征及重点关注的目标间交互特征进行归一化处理后,通过全连接层拼接输出一个图像特征向量;所述全连接层将特征融合层的输出全连接至本层的c个输出神经元,输出一个c维的特征向量;
步骤4.3、构建特征-文本解码器,所述特征-文本解码器的输入为图像特征编码器处理得到的图像特征向量;并利用至少包含2层长短期记忆LSTM网络的组合模块将图像特征向量解码为文本。
一种基于强化学习的情绪化图像描述系统,包括:
情绪词嵌入库,在大规模语料库基础上构建情绪词嵌入库,为最终的情绪化图像描述生成提供语料库支撑;
图像情绪识别模型,所述图像情绪识别模型构建模块包括图像情绪识别预处理模块、人脸情绪特征提取模块、图像主题色彩特征提取模块、图像情绪特征提取模块、特征融合层、全连接层以及分类层;所述图像情绪识别预处理模块包括人脸检测单元、人脸图像归一化处理单元、图像尺寸归一化处理单元;其中,所述人脸检测单元,利用预先训练的人脸检测网络,检测出输入的图像中人脸区域,并对不同的人脸区域进行标号;图像尺寸归一化处理单元,用于对输入的图像的像素大小归一化处理,得到统一的图像输入尺寸;所述人脸图像归一化处理单元,用于对检测出的每个人脸区域进行裁剪、对齐和尺寸归一化;所述人脸情绪特征提取模块,包括多个卷积模块;所述图像主题色彩特征提取模块,使用颜色聚类方法对图像情绪分析数据集中的训练集图像提取主题色彩特征;所述全连接层, 用于全连接特征融合层与分类层;所述分类层,用于输出图像所属的情绪类别;最后使用图像情绪分析数据集训练图像情绪识别模型;
图像事实性描述模型,图像事实性描述模型包括图像事实性描述预处理模块、图像特征编码器和特征-文本解码器;所述图像事实性描述预处理模块,对输入的图像进行预处理;所述图像特征编码器包括图像全局特征编码支路、目标特征编码支路、目标间交互特征编码支路、注意力机制和特征融合层;所述图像全局特征编码支路,用于提取图像的全局特征,并将其转化为向量形式;所述目标特征编码支路,用于提取局部的目标特征,并将其转化为向量形式;所述目标间交互特征编码支路,用于提取目标间动作交互区域特征,并将其转化为向量形式;所述注意力机制,用于捕捉相对于全局特征,需要重点关注的目标特征及重点关注的目标间交互特征;所述特征融合层,用于分别对上述图像全局特征、重点关注的目标特征及重点关注的目标间交互特征进行归一化处理后,通过全连接层拼接输出一个图像特征向量;所述特征-文本解码器的输入为图像特征编码器处理得到的图像特征向量;并利用长短期记忆LSTM网络的组合模块将图像特征向量解码为文本;最后使用图像描述数据集训练图像事实性描述模型;
情绪化图像描述初始化模块,根据训练好的图像情绪识别模型输出的图像情绪,从情绪词嵌入库中选取与图像情绪类别对应的情绪词,并将之嵌入到由训练好的图像事实性描述模型输出的图像事实性描述中,生成初始的情绪化图像描述;
基于强化学习的微调模块,利用强化学习对生成初始的情绪化图像描述进行调整;所述强化学习的微调模块包括语句重建生成器、语句存储单元、语句抽样单元、语句评估单元和选词评估单元;其中,语句重建生成器作为强化学习系统中的智能体,语句存储单元、语句抽样单元、语句评估单元和选词评估单元构成强化学习系统中的外部环境;语句重建生成器与外部环境进行不断地交互,获取外部环境的奖励信息,学习从环境状态到行为动作的映射,来优化调整行为动作,对初始的情绪化图像描述进行微调,生成最终的情绪化图像描述。
本发明采用以上技术方案与现有技术相比,具有以下技术效果:
本发明通过将主题色彩特征先验引入到图像情绪识别之中,能有效地识别愉悦(Amusement)、狂怒(Anger)、惊奇(Awe)、接受(Contentment)、憎恨(Disgust)、狂喜(Excitement)、恐惧(Fear)、悲痛(Sadness)八种图像情绪,并从情绪词嵌入库中选取与图像情绪类别对应的情绪词,将之嵌入到图像事实性描述中,生成初始的情绪化图像描述,使用强化学习方法对生成的初始情绪化图像描述进行微调,使得生成的情绪化图像描述更加生动,富有情感;具体如下:
(1)目前已有的情绪分析方法主要将情绪分为正向情绪、负向情绪和中性情绪;本发明中在情绪类别的划分方面采用愉悦(Amusement)、狂怒(Anger)、惊奇(Awe)、接受(Contentment)、憎恨(Disgust)、狂喜(Excitement)、恐惧(Fear)、悲痛(Sadness)八种图像情绪,属于细粒度的图像情绪分析。
(2)本发明中的语料库设计来自于大型目标检测数据集、图像描述数据集以及情感分析数据集,且采用情绪词空间分类的方法,可避免语义近似但情感差距较大的缺陷,同时又能从大量已经标注的语料中学习词汇间搭配,因此可以取得较好的图像描述结果。
(3)与现有的图像情绪分析模型相比,本发明中的图像情绪识别模型通过颜色聚类方法给予图像主题色彩特征先验,同时采用基于人脸情绪特征提取模块和图像情绪特征提取模块的双支路网络模型,分别对输入图像裁剪、对齐和尺寸归一化后的人脸图像和图尺寸归一化处理单元输出的尺寸归一化后的全局图像 进行特征提取;使得图像情绪识别模型在获得人脸面部情绪类别的同时还充分捕捉了图像的全局情绪信息,促进了人脸面部情绪以及图像整体氛围的情绪间的信息交互,具有更强的表征能力和泛化能力。
(4)本发明在进行图像事实性描述的过程中,通过对图像重点关注的目标区域特征、目标特征、目标关系特征的预处理,同时使用注意力机制判断重点关注的目标及其对应关系,使得输入图像事实性描述模型的特征更加丰富且与输入图像内容紧密联系,使得图像事实性描述更加合理。
(5)本发明在生成初始情绪化描述过程中,利用图像情绪识别结果确定图像情绪词,并结合生成的图像事实性描述语义,生成初始情绪化图像描述,这些特征的获取均通过图像本身的特征提取获得,因此获得的初始情绪化图像描述与原图像语义高度相关,具有更强的表征能力和泛化能力。
(6)本发明使用强化学习方法对生成的初始情绪化图像描述进行微调,在保持原语义的同时,解决语义匹配不合理、情绪词汇应用不准确的问题。
附图说明
图1是本发明的基于强化学习的情绪化图像描述方法步骤流程图。
图2是本发明的基于强化学习的情绪化图像描述系统结构图。
图3是本发明实施例中使用的图像情绪识别模型结构图。
图4是本发明实施例中图像主题色彩特征示例图;其中,(a)为求解得到的图像主题色彩特征,(b)为图像在RGB空间的散点映射。
图5是本发明实施例中使用的图像事实性描述模型结构图。
图6是本发明的基于强化学习的微调模块结构图。
图7是Ai Challenger Caption2017数据库中的图像示例。
具体实施方式
为了使本发明的目的、技术方案和优点更加清楚,下面将结合附图及具体实施例对本发明进行详细描述。
本发明针对图像描述网络分别使用编码器直接提取图像全局特征,并使用解码器直接映射文本,生成图像描述,文本描述语义匮乏且缺乏情绪性表达这个问题,本发明的目的是提供一种基于强化学习的情绪化图像描述方法及系统,解决现有技术不能准确、生动地进行图像描述的问题,为婴幼儿教育辅导以及视觉障碍人士提供更加生动以及符合人类情感需求的图像描述系统开辟一条新的途径与方法。情绪化的图像识别系统的开发,为视觉障碍人士以及婴幼儿辅助教育提供更加生动且富有情感的图像描述,对视觉障碍人士更加生动了解图像所表达的内容及其传递的情绪具有非常重要的意义和价值。
如图1所示,本发明实施例提供的基于强化学习的情绪化图像描述方法,该方法主要包括如下步骤:
步骤一、在大规模语料库基础上构建情绪词嵌入库;首先,利用NLTK(Natural Language Toolkit)工具获取目标检测和图像描述数据集中的名词、动词,生成语义词库;接着,从大规模语料库LSCC(Large Scale ChineseCorpus)中筛选出情绪词,生成情绪词库;最后,计算语义词库的每个语义词对应的情绪词,构建情绪词嵌入库;
本实施例中,利用NLTK工具获取目标检测数据集COCO以及图像描述数据集MSCOCO、flickr30k中的名词、动词,生成语义词库C-corpus={N,V},其包括名词库N={N i|i=1,2…n 1},其中n 1表示算法可以识别的物体种类数;动 词库V={V i|i=1,2…n 2},其中n 2代表关系检测可以检测到的动作类别数。并计算其中每个单词的词向量;
在大规模语料库-NRC情绪情感语料库中筛选出情绪词,生成情绪词库S-corpus={ADJ,ADV},其包括形容词库ADJ={ADJ i|i=1,2…m 1},m 1表示可选形容词数;副词ADJ={ADJ i|i=1,…m 2},m 2表示可选副词数。并且将S-corpus分为3大类别(积极、消极、中立)的同时细分为IAPS(International Affective Picture System)定义的8个类别:愉悦(Amusement)、狂怒(Anger)、惊奇(Awe)、接受(Contentment)、憎恨(Disgust)、狂喜(Excitement)、恐惧(Fear)、悲痛(Sadness);各个类别表示为c i={c 1,c 2,…,c 8};分类具体方法如下,将每组中语义词向量与情绪词向量拼接后的情绪词嵌入向量与8种基准情绪词向量的距离之和作为目标函数,通过最小化目标函数,求解情绪词嵌入词向量的空间分类;设分类的目标函数为
Figure PCTCN2022126071-appb-000001
其中u i为c i类的质心,x ij为第i个语义词与第j个情绪词融合后的情绪词嵌入向量;
构建每个语义词对应情绪词的情绪化强弱关系,强弱关系由文本情感识别算法BERT检测出的情感分类概率决定。
最后构建的情绪词库如表1所示;
表1情绪词库样例
Figure PCTCN2022126071-appb-000002
构建的情绪词嵌入库如表2所示;
表2情绪词嵌入库样例
Figure PCTCN2022126071-appb-000003
Figure PCTCN2022126071-appb-000004
步骤二、构建一种如图3所示的图像情绪识别模型,该模型包括图像情绪识别预处理模块、至少包含2个卷积模块的人脸情绪特征提取模块、图像主题色彩特征提取模块、至少包含2个卷积模块的图像情绪特征提取模块、特征融合层、全连接层以及分类层;所述卷积模块至少包括一个卷积层和一个池化层;
所述图像情绪识别预处理模块,包括人脸检测、人脸图像归一化处理、图像尺寸归一化处理;所述人脸检测,利用预先训练的人脸检测网络,检测出输入的图像中人脸区域,并对不同的人脸区域进行标号;所述人脸图像归一化处理,对检测出的每个人脸区域进行裁剪与对齐,将处理后的每个人脸图像进行归一化;图像尺寸归一化处理,用于对输入的图像进行归一化处理;图像尺寸归一化处理,用于对输入的图像进行归一化处理,得到统一的图像输入尺寸;
所述人脸情绪特征提取模块,包括多个卷积模块,该模块的输入为图像情绪识别预处理模块输出的人脸表情图像,用于提取人的面部情绪特征;
所述图像主题色彩特征提取模块,使用颜色聚类方法对图像情绪分析数据集中的训练集图像提取主题色彩特征,通过词嵌入方法将图像主题色彩特征编码成向量,作为图像情绪的先验知识;
所述图像情绪特征提取模块用于提取图像情绪特征,包括多个卷积模块,该模块的输入为图像情绪识别预处理模块中图像尺寸归一化处理单元输出的尺寸归一化后的图像;所述卷积模块,包含一个或多个卷积层以及一个池化层。
所述卷积模块,包含一个或多个卷积层以及一个池化层;
所述特征融合层,用于分别对人脸情绪特征提取模块输出的面部情绪特征、图像主题色彩特征提取模块输出的主题色彩特征以及图像情绪特征提取模块输出的情绪特征进行融合,得到融合后的情绪特征向量;
所述全连接层,用于全连接特征融合层与分类层;所述分类层,用于输出图像所属的情绪类别;
本实施例构建的一种图像情绪识别模型,如图3所示,具体实施如下:
(1)图像情绪识别预处理模块,包括人脸检测、人脸图像归一化处理、图像主题色彩特征提取模块、图像尺寸归一化处理;
人脸检测,首先使用预先训练的FaceNet网络检测出输入的图像图7(图像来源:Ai Challenger Caption 2017;Image Id:1059397860565088790)中人脸所在区域,接着切割出人脸所在区域,最后对图像中切割出的不同人脸区域进行编号;
人脸图像归一化处理,用于对上述人脸检测获得的不同的人脸区域进行归一化处理,得到56×56像素大小的图像;
图像尺寸归一化处理,用于将输入的图像归一化处理为224×224像素大小的图像;
(2)如图4中的(b)所示,图像主题色彩特征提取模块,首先将MSCOCO数据集中图像映射为RGB空间的散点,通过颜色聚类算法对图像进行主题色提取得到如图4中的(a)所示结果;之后,将颜色聚类后的结果转化为HSV(色调、饱和度、明度)格式,;最后通过词嵌入方法将图像主题色彩特征编码成向量,作为图像情绪的先验知识;本实施例中,基于传统RGB散点直接聚类的方法会造成散点色彩被中和掉,导致两类明显区别的色彩散点被聚类到与自身有明 显区别的颜色类别中,故对色彩散点的聚类方法进行修正:
首先RGB空间使用微元法切割为像素级大小的立体方块;之后将图像的RGB空间的散点放入切割后的RGB空间中,将散点值作为立体方块的值,如果该立体方块中没有散点,则将该立体方块区域中心值作为该立体方块的值;
接着采用立体滑动窗口的方式将立体方块聚类为滑动窗口大小的立体方块,并通过滑动窗口加权的方式对整个滑动窗口区域内的立体方块值进行加权求和得到滑动窗口大小的立体方块的值,滑动窗口的大小取决于最终所要选择的图像主题色的种类数;其中窗口的权值为了更加合理平滑过渡色彩,采用旋转平移的方法处理,具体方法如下:(1)若原图RGB散点值在窗口区域分布较为均匀,则采用由窗口中心向四周依次减小的方法赋予权值;(2)若出现原图RGB散点值分布在窗口角落或边缘区域,则将由窗口中心向四周依次减小的权值窗口逐渐朝着窗口中散点较多的方向移动,尽可能使得窗口中心靠近该区域原图RGB散点聚集区域;(3)对于由于窗口中权值因为滑动超出该部分的权值,通过旋转作为窗口斜上方的权值,若窗口区域存在多个聚类点,则用该方法多次滑动并加权平均,权值大小取决于原图RGB散点聚集度,得到如表3所示图像主题色彩特征结果(RGB形式):橙色(6.40%)、深绿(17.8%)、浅灰(18.6%)、橄榄(23.2%)、茶色(34.0%);
表3图像情绪识别结果示例
Figure PCTCN2022126071-appb-000005
最终将提取图像主题色彩转化为HSV形式。图像主题色彩转化为HSV(色调、饱和度、明度)格式,首先对求得的RGB结果进行预处理,预处理公式如下:
R'=R/255;G'=G/255;B'=B/255;C max=max(R',G',B')
C min=min(R',G',B');Δ=C max-C min
其中饱和度S的转换公式如下:
Figure PCTCN2022126071-appb-000006
明度V表示为V=C max;色调H的计算公式如下:
Figure PCTCN2022126071-appb-000007
输出如表3中HSV特征:色调11.59、饱和度0.36、亮度0.63;并将所述HSV特征编码为1024维向量;
(3)人脸情绪特征提取模块包括顺序连接的三个个卷积模块以及一个特征融合层,具体实施如下:
卷积模块d1:包括2个卷积层和1个池化层,2个卷积层均选用128个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为56×56×128的特征图;池化层选用2×2的最 大池化核,以步长2对特征图进行下采样操作,输出大小为28×28×128的特征图;
卷积模块d2:包括3个卷积层和1个池化层,3个卷积层均选用256个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为28×28×256的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为14×14×256的特征图;
卷积模块d3:包括3个卷积层和1个池化层,3个卷积层均选用512个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为14×14×512的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为7×7×512的特征图;
特征融合层c1,输入为人脸情绪特征提取模块中,不同人情绪特征支路输出的人脸情绪特征,大小均为7×7×512,分别对这两个特征图进行全局平均池化操作,得到两个512维的特征向量,并进行向量融合,最后输出512维的特征向量;
(4)图像情绪特征提取模块包括顺序连接的五个卷积模块,具体如下:
卷积模块d4:包括2个卷积层和1个池化层,2个卷积层均选用64个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为224×224×64的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为112×112×64的特征图;
卷积模块d5:包括2个卷积层和1个池化层,2个卷积层均选用128个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为112×112×128的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为56×56×128的特征图;
卷积模块d6:包括3个卷积层和1个池化层,3个卷积层均选用256个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为56×56×256的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为28×28×256的特征图;
卷积模块d7:包括3个卷积层和1个池化层,3个卷积层均选用512个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为28×28×512的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为14×14×512的特征图;
卷积模块d8:包括3个卷积层和1个池化层,3个卷积层均选用512个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为14×14×512的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为7×7×512的特征图;
(5)特征融合层c2,输入为图像情绪特征提取模块输出的大小为7×7×512图像情绪特征、特征融合层c1输出的512维的人脸情绪特征,对图像情绪特征进行全局平均池化操作,得到两个512维的特征向量,并与512维的人脸情绪特征融合,最后将融合后的这两个特征向量拼接,输出一个1024维的特征向量; 将得到的1024维的特征向量与图像主题色彩特征提取模块输出的1024维度图像主题色彩特征进行融合得到新的1024维向量;
(6)全连接层b1,包含256个神经元,用于全连接特征融合层与分类层;
(7)分类层a1,采用Softmax分类器,包含8个神经元,输出图像所属的情绪类别;
步骤三、使用图像情绪分析数据集ArtPhoto训练图像情绪识别模型;
本实施例选用ArtPhoto图像情绪分析数据集。ArtPhoto图像情绪分析数据集使用了三个数据集:IAPS、ArtPhoto、Abstract Paintings中的图像,共包含1429个图像样本,每个样本对应一种表情类别,包括愉悦(Amusement)、狂怒(Anger)、惊奇(Awe)、接受(Contentment)、憎恨(Disgust)、狂喜(Excitement)、恐惧(Fear)、悲痛(Sadness)8类情绪类别。在实际中,也可以采用其他的图像情绪分析数据集,或自行采集图像情绪分析数据集,建立包含情绪类别标签的图像情绪分析数据集。
步骤四、构建一种如图5所示的基于注意力机制的图像事实性描述模型,该模型包括图像事实性描述预处理模块、图像特征编码器和特征-文本解码器;
所述图像事实性描述预处理模块,利用在目标检测及目标关系检测数据集上预先训练好的网络模型对输入的图像进行预处理;具体方法如下:1)通过预先训练目标检测算法,检测图像中所出现的各类目标所在区域,对各类目标所在区域进行裁剪与对齐,将处理后的各类目标所在区域图像进行归一化;2)利用预先训练目标关系检测算法,检测图像中所出现的各类目标交互所在区域,对各类目标交互所在区域进行裁剪与对齐,将处理后的各类目标交互所在区域图像进行归一化;3)对输入图像进行归一化处理;
所述图像特征编码器包括图像全局特征编码支路、目标特征编码支路、目标间交互特征编码支路、注意力机制和特征融合层;所述图像全局特征编码支路包括多个卷积模块,图像全局特征编码支路的输入为输入的图像归一化后的图像,用于提取图像的全局特征,并将其转化为向量形式;所述目标特征编码支路包括多个卷积模块,目标特征编码支路的输入为各类目标所在区域图像归一化后的图像,用于提取局部的目标特征,并将其转化为向量形式;所述目标间交互特征编码支路包括多个卷积模块,目标间交互特征编码支路的输入为各类目标交互所在区域图像归一化后的图像,用于提取目标间动作交互区域特征,并将其转化为向量形式;所述注意力机制,用于捕捉相对于全局特征,需要重点关注的目标特征及重点关注的目标间交互特征;所述特征融合层,用于将上述图像全局特征、重点关注的目标特征及重点关注的目标间交互特征归一化处理后,通过全连接层拼接输出一个图像特征向量;
所述特征-文本解码器输入为图像特征编码器处理得到的图像特征向量;并利用至少包含2层长短期记忆(LSTM)网络的组合模块将图像特征向量解码为文本;特征-文本解码器是指从图像特征到文本的解码器。
本实施例构建的一种基于注意力机制的图像事实性描述模型,如图5所示,具体实施如下:
(1)所述图像事实性描述预处理模块,首先对输入的图像进行预处理;接着利用预先训练的目标检测算法,检测图像中所出现的各类目标所在区域;再者,通过预先训练的目标关系检测算法,检测图像中所出现的各类目标交互所在区域;最后,对目标区域及目标间的交互关系区域的图像进行裁剪与对齐,归一化处理为56×56像素大小图像;具体实施如下:
首先,对输入的图像进行预处理,归一化处理为224×224像素图像;
接着,利用在COCO数据集上预先训练好的Faster-RCNN作为目标检测器提取图像中的目标所在区域F O,并确定图像中所包含目标的类别;
之后,利用在Open Images数据集上预先训练好的Faster-RCNN为骨干网络并采用两个全连接头+SoftNMS方式的预先训练网络,提取图像中的目标间交互所在区域F R
最后、对目标区域及目标间的交互关系区域的图像进行裁剪与对齐,归一化处理为56×56像素大小图像;
所述图像特征编码器包括图像全局特征编码支路、目标特征编码支路、目标间交互特征编码支路、注意力机制、特征融合层;
(2)所述图像全局特征编码支路,包括多个卷积模块,该支路的输入为输入的图像归一化后的图像,用于提取图像的全局特征,并将其转化为向量形式;
卷积模块d9:包括2个卷积层和1个池化层,2个卷积层均选用64个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为224×224×64的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为112×112×64的特征图;
卷积模块d10:包括2个卷积层和1个池化层,2个卷积层均选用128个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为112×112×128的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为56×56×128的特征图;
注意力模块1:利用空间注意力机制对卷积模块d10输出的14×14×256的图像全局特征进行处理,具体实施如下:
首先,通过全局最大池化和全局平均池化得到两个14×14×1的特征层;接着,将上述两个特征层进行堆叠,并将堆叠后14×14×2的特征层,利用1×1的卷积进行通道数的调整,得到14×14×1特征层;最后通过sigmoid输出14×14×1的全局空间注意力机制,将得到的全局空间注意力机制与原输入特征相乘,得到最后基于空间注意力机制的图像全局特征图,大小为14×14×256;
卷积模块d11:包括3个卷积层和1个池化层,3个卷积层均选用256个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为56×56×256的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为28×28×256的特征图;
卷积模块d12:包括3个卷积层和1个池化层,3个卷积层均选用512个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为28×28×512的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为14×14×512的特征图;
卷积模块d13:包括3个卷积层和1个池化层,3个卷积层均选用512个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为14×14×512的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为7×7×512的特征图;
(3)所述目标间交互特征编码支路,包括多个卷积模块以及注意力模块,该 支路的输入为各类目标交互所在区域图像归一化后的图像,用于提取目标间交互特征,并将其转化为向量形式,具体如下:
卷积模块d14:包括2个卷积层和1个池化层,2个卷积层均选用128个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为56×56×128的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为28×28×128的特征图;
卷积模块d15:包括3个卷积层和1个池化层,3个卷积层均选用256个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为28×28×256的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为14×14×256的特征图;
注意力模块2:根据注意力模块1输出的基于空间注意力机制的全局特征去除非重要的目标间交互特征图,利用通道注意力机制对卷积模块d15输出大小为14×14×256的重点关注目标间交互特征图,具体实施如下:
首先,通过全局平均池化得到两个1×1×256的特征层;接着,将上述特征层进行两次全连接,第一次全连接的通道数较少,约为150大小,第二次全连接通道数大小为256,最后输出1×1×256的特征层;最后通过sigmoid输出1×1×256的目标间交互关系通道注意力机制,将得到的目标间交互关系的通道注意力机制与原输入特征相乘,得到最后基于通道注意力机制的目标间交互特征图,大小为14×14×256;
卷积模块d16:包括3个卷积层和1个池化层,3个卷积层均选用512个3×3的卷积核对重点关注的目标间交互特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为14×14×512的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为7×7×512的重点关注的目标间交互特征图;
(4)所述目标特征编码支路,包括多个卷积模块,该支路的输入为各类目标所在区域图像归一化后的图像,用于提取局部的目标特征,并将其转化为向量形式,具体实施如下:
卷积模块d17:包括2个卷积层和1个池化层,2个卷积层均选用128个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为56×56×128的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为28×28×128的特征图;
卷积模块d18:包括3个卷积层和1个池化层,3个卷积层均选用256个3×3的卷积核对特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为28×28×256的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为14×14×256的特征图;
注意力模块3:根据注意力模块1输出的基于空间注意力机制的全局特征去除非重要的目标特征图,利用通道注意力机制对卷积模块d18输出大小为14×14×256的重点关注目标特征图进行处理,具体实施如下:
首先,通过全局平均池化得到两个1×1×256的特征层;接着,将上述特征层进行两次全连接,第一次全连接的通道数较少,约为150大小,第二次全连接 通道数大小为256,最后输出1×1×256的特征层;最后通过sigmoid输出1×1×256的目标通道注意力机制,将得到的目标通道注意力机制与原输入特征相乘,得到最后基于通道注意力机制的目标间特征图,大小为14×14×256;
卷积模块d19:包括3个卷积层和1个池化层,3个卷积层均选用512个3×3的卷积核对重点关注目标特征图进行卷积操作,卷积步长为1,补零加边长度为1,卷积后经过ReLU非线性映射,输出大小为14×14×512的特征图;池化层选用2×2的最大池化核,以步长2对特征图进行下采样操作,输出大小为7×7×512的重点关注目标特征图;
(5)所述特征融合模块包含多个特征融合层及池化层,具体实施如下:
所述特征融合层c3,输入为目标间交互特征编码支路输出的重点关注目标间交互特征图和图像全局特征编码支路输出的图像全局特征,大小均为7×7×512,分别对这两个特征图进行平均池化操作,大小均为4×4×512的特征图,并将池化后的图像全局特征和目标间交互特征图进行相加,得到特征融合后大小为4×4×512的特征图;同时输出池化后大小为4×4×512的重点关注目标间交互特征图;
所述特征融合层c4,输入为目标特征编码支路输出的重点关注目标特征图和图像全局特征编码支路输出的图像全局特征,大小均为7×7×512,分别对这两个特征图进行平均池化操作,大小均为4×4×512的特征图,并将池化后的图像全局特征和重点关注目标特征图进行相加,得到特征融合后大小为4×4×512的特征图;同时输出池化后大小为4×4×512的重点关注目标特征图;
所述上采样层e1,特征融合层c3输出的融合后大小为4×4×512的特征图,上采样为7×7×512大小的特征图;
所述上采样层e2,特征融合层c4输出的融合后大小为4×4×512的特征图,上采样为7×7×512大小的特征图;
所述特征融合层c5,首先,输入上采样层e1输出的7×7×512大小的特征图、卷积模块d13输出的大小为7×7×512大小的特征图、特征融合层c3输出的池化后大小为4×4×512的重点关注目标间交互特征图;接着,将输入的两个7×7×512大小的特征图进行堆叠,得到新的7×7×512大小的全局特征图;最后,分别对全局特征图及重点关注目标间交互特征图进行全局平均池化操作,得到两个512维的特征向量,将这两个特征向量拼接,输出一个1024维的特征向量;
所述特征融合层c6,首先,输入上采样层e2输出的7×7×512大小的特征图、卷积模块d13输出的大小为7×7×512大小的特征图、特征融合层c4输出的池化后大小为4×4×512的重点关注目标特征图;接着,将输入的两个7×7×512大小的特征图进行堆叠,得到新的7×7×512大小的全局特征图;最后,分别对全局特征图及重点关注目标特征图进行全局平均池化操作,得到两个512维的特征向量,将这两个特征向量拼接,输出一个1024维的特征向量;
所述特征融合层c7,将特征融合层c5及特征融合层c6输出的两个1024维向量进行拼接,得到2048维的特征向量;
所述特征-文本解码器输入为图像特征编码器处理得到的图像特征向量;并利用至少包含2层长短期记忆(LSTM)网络的组合模块将图像特征向量解码为文本;具体算法流程如下:
依据注意力机制给定特征F,解码器输出结果表示为如下公式:
Figure PCTCN2022126071-appb-000008
E(·)为词嵌入函数,
Figure PCTCN2022126071-appb-000009
为所有输出结果的总的状态,其注意力权重的计算 方式可以设为:
Figure PCTCN2022126071-appb-000010
Figure PCTCN2022126071-appb-000011
则在具体位置生成步骤t时刻的特征注意力A(t)可以表示为如下公式:
Figure PCTCN2022126071-appb-000012
在此基础上基于注意力机制的目标特征、目标间交互特征及全局特征可分别表示为
Figure PCTCN2022126071-appb-000013
则基于注意力机制的标间交互关系特征
Figure PCTCN2022126071-appb-000014
中第ε个关系特征
Figure PCTCN2022126071-appb-000015
可表示为如下公式,其中W为权重矩阵;
Figure PCTCN2022126071-appb-000016
最终输出事实性描述“一个女人抱着婴儿站在花园里”;
步骤五、使用图像描述数据集Ai-Challenger Caption训练图像事实性描述模型;
本实施例选用Ai-Challenger Caption图像描述数据集。Ai-Challenger Caption图像描述数据集对给定的每一张图片有五句话的中文描述。数据集包含30万张图片,150万句中文描述。训练集包含210,000张图像,验证集包含30,000张图像,测试集A包含30,000张图像,测试集B包含30,000张图像;在实际中,也可以采用其他的图像描述数据集,或自行采集图像描述数据集,建立中文描述标签的图像描述数据集。
步骤六、构建情绪化图像描述初始化模块,该模块根据训练好的图像情绪识别模型输出的图像情绪类别,从情绪词嵌入库中选取与图像情绪类别对应的情绪词,并将之嵌入到由训练好的图像事实性描述模型输出的图像事实性描述中,生成初始的情绪化图像描述;
本实施例中,利用文本情感检测器AYLIENAPI,对步骤四生成的语句S进行情感检测,并使用One-Hot向量J T表示情绪词所在位置k,其所修饰的对象特征及其对应对象间交互特征为
Figure PCTCN2022126071-appb-000017
J T向量维度为S的长度 L
构建8类情绪的基向量J S,同时检测情绪词与情绪的基向量的相似度,其相似度
Figure PCTCN2022126071-appb-000018
k为S中待检测的情绪词汇个数,计算公式为:
Figure PCTCN2022126071-appb-000019
利用
Figure PCTCN2022126071-appb-000020
提取相似度最接近的情绪基向量,以比较是否与图像情绪识别模型输出的图像情绪相同。
若相同,其默认为初始情绪化描述结果;若不相同,则从情绪化词库S-corpus中依据动词名词的关联映射关系寻找与图像检测情绪相对应的情绪化词汇替换,替换结果作为初始情绪描述语句;
若步骤四生成结果检测不到情绪化词汇,则直接从情绪化词库S-corpus依 据动词名词的关联映射关系寻找与图像检测情绪相对应的情绪化词汇加入到对应关系区域,最终生成初始情绪化图像描述语句X的长度为L',初始情绪化图像描述语句“一个恬淡的女人抱着婴儿站在闲逸的花园里”。
步骤七、构建如图6所示的基于强化学习的微调模块包括语句重建生成器、语句存储单元、语句抽样单元、语句评估单元、选词评估单元;其中语句重建生成器作为强化学习系统中的智能体(Agent),语句存储单元、语句抽样单元、语句评估单元和选词评估单元构成强化学习系统中的外部环境(Environment);语句重建生成器与外部环境进行不断地交互,获取外部环境的奖励(Reward)信息,学习从环境状态(State)到行为动作(Action)的映射,来优化调整行为动作,对步骤六生成的初始的情绪化图像描述进行调整,生成最终的情绪化图像描述,具体步骤如下:
1)语句重建生成器根据第t-1时刻的环境状态(State)S t-1,即第t-1时刻生成的语句S t-1,以及第t-1时刻的奖励(Reward)R t-1,即第t-1时刻所选单词得分R t-1,通过选词器从情绪词嵌入库中选择语义相近的单词,执行选词的“动作(Action)”,并将筛选出的单词加入第t-1时刻生成的语句S t-1中,生成第t时刻的语句S t;其中,第0时刻生成的语句S 0为语句生成起始符;选词器“动作(Action)”在已知目标语句语义基础上,根据记录的前一时刻语句及其评估结果选择语义相近的单词,其中语义相似程度采用语义词向量间距离表示;t时刻选词a t表示为t时刻在已生成t-1个单词基础上,将y t作为即将生成的第t个单词的操作;单词y t取自于目标词汇库C +=(C-corpus)∪(S-corpus)。作为状态(State)反馈的t时刻更新后的语句s t表示经过动作(Action)a t实施后,将y t加入t-1时刻的语句s t-1后新生成的第t时刻的语句;所述作为奖励(Reward)的所选单词得分作为状态(State)反馈的t时刻更新后的语句s t表示经过动作(Action)a t实施后,将y t加入t-1时刻的语句s t-1后新生成的第t时刻的语句;具体表述为每一个单词y t对应于状态S t的得分,其由外部环境中的选词评估单元计算所得。
本实施例中,构建如图6所示作为智能体(Agent)的语句重建生成器用于对输入的初始情绪化图像描述语句进行重构;语句重建生成器网络结构采用结合注意力机制的双层循环神经网络;采用确定性策略,P θ(y t|S t)表示状态S t下给出单词y t的概率;L”为语句总长度,μ为随着语句长度奖励减少的函数,则语句重建生成器训练过程中的总奖励
Figure PCTCN2022126071-appb-000021
可表示为:
Figure PCTCN2022126071-appb-000022
优化过程中的梯度为
Figure PCTCN2022126071-appb-000023
θ为生成器的参数;
2)语句存储单元存储更新后的第t时刻的语句S t;语句抽样单元基于采样搜索算法对更新后的第t时刻的语句S t进行回滚(Rolling Out),生成N个语句, N的取值为3、4或5;所述语句抽样单元的采样搜索算法可以采用多项式采样或蒙特卡洛抽样方法;语句评估单元首先对语句抽样单元生成的N个语句分别使用情绪鉴别器、语法搭配鉴别器、语义鉴别器进行评估打分,得到N个情绪奖励得分、语法搭配奖励得分、语义奖励得分,然后采取加权平均的方法得到综合奖励得分,最后将综合奖励得分输入到选词评估单元;选词评估单元输出所选单词得分,作为外部环境(Environment)向语句重建生成器反馈的奖励(Reward)R t
本实施例中采用基于蒙特卡洛的随机束搜索的方式进行语句生成,生成数量为N sampling的采样完整语句,若t时刻采样开始,则表示为Y 1:t:
Figure PCTCN2022126071-appb-000024
N sampling的个数可设置为3-5,本发明实施例中采用N sampling为3;所述语句评估单元,用于对已生成的N sampling个完整的抽样语句,进行评估打分,得到N个抽样生成的语句的情绪奖励得分、语义奖励得分及语法搭配奖励得分,之后才去加权平均的方法,得到综合奖励得分;最终为对语句重建单元中的选词评估提供奖励依据;语义鉴别器D 1采用词移距离WMD计算,具体公式如下:
Figure PCTCN2022126071-appb-000025
L”表示为源输入长度为L'的初始情绪化图像描述语句经过G处理后生成的目标情绪化图像描述语句Y={y 1,…,y L”}的长度。情绪鉴别器D 2利用对抗神经网络在sentiment140数据集上进行训练,以识别生成语句的情绪类别,训练过程中的损失函数可设为如下:
Figure PCTCN2022126071-appb-000026
Figure PCTCN2022126071-appb-000027
为生成器结果,
Figure PCTCN2022126071-appb-000028
为标注真值,通过奖励评估模块可获得语句的情绪检测、语义奖励结果及语法搭配奖励结果:D 1(Y)、D 2(Y)及D 3(Y);
所述语法搭配鉴别器通过语法搭配语料库CCL(Centre for Chinese Linguistics)预先训练的双层循环神经网络构成;
若已生成的t-1个目标序列词汇状态记为S t=(X,Y 1:t-1),则可得到状态S t与t时刻选择单词的行为y t的奖励计算公式:
Figure PCTCN2022126071-appb-000029
由于情绪和语义都非常重要,因此α、β均可设置为大于0.5的值,η值可设为0.2-0.5中的值。
所述语句存储单元用于存储更新后的语句,存储单元大小为L”;
所述选词评估单元,利用当前t时刻语句评估单元输出的语句评估得分f(S t,y t)减去前一时刻t-1时语句评估单元输出的语句评估得分f(S t-1,y t-1),得到当前时刻所选单词得分γ(S t,y t),具体表示为:
γ(S t,y t)=f(S t,y t)-f(S t-1,y t-1)
更新语句重建生成器G训练过程中的总奖励
Figure PCTCN2022126071-appb-000030
为:
Figure PCTCN2022126071-appb-000031
优化过程中的梯度更新为:
Figure PCTCN2022126071-appb-000032
3)迭代步骤1)至步骤2),语句重建生成器与外部环境不断地进行交互,直至取得语句重建的最大奖励,生成最终的情绪化图像描述。
本实施例构建的一种基于强化学习的微调模块,当t=1、t=2时刻基于强化学习的微调模块的运行流程如下:
语句重建生成器从t=1时刻开始,首先,依据情绪化图像描述初始化模块输出的初始的情绪化图像描述“一个恬淡的女人抱着婴儿站在闲逸的花园里”,初始化语句状态以及选词,设t=1时刻的选词器,依据初始的情绪化图像描述的第一个单词“一个”的基础上,从词库中选择与之相近的候选词,例如:“一个”、“单个”等并对每一个候选词进行迭代评估,以“一个”为例,设此时选词器选择了“一个”,选词初始设为第一个单词a 1←y 1=“一个”,设t=1时刻的语句S 1为“一个”;之后,对语句S 1“一个”,利用蒙特卡洛的随机束搜索的方式进行语句生成,生成以语句状态S 1“一个”为基础的三个完整语句:(1)一个漂亮的女子在花园里……;(2)一个无聊的女人在花田里……;(3)一个美丽的女子在花田里……;接着,利用语句评估单元从语句情绪、语句语义及语句语法搭配角度分别对上述三个语句进行评分,语句奖励f(S 1,y 1)分别为0.8、0.2、0.8;采用取平均值的方式得到f(S 1,y 1)的综合得分0.6;最后,将结果反馈给语句重建单元中的选词评估单元并记录当前语句以及语句奖励f(S 1,y 1)的综合得分;当t=2时刻,首先,提取记录的情绪化图像描述初始化模块输出的初始的情绪化图像描述“一个恬淡的女人抱着婴儿站在闲逸的花园里”以及语句S 1“一个”;之后,选词器依据f(S 1,y 1)结果以及语句S 1回滚产生的单词,从“漂亮的”、“美丽的”等与之相近的情绪词中,选择某一单词作为t=2时刻的选词器行为,假设当前选择为a 2←y 2=“美丽的”;接着,将“美丽的”与语句结合,生成更新后的语句S 2“一个美丽的”,利用蒙特卡洛的随机束搜索的方式生成以语句S 2“一个美丽的”为基础的三个完整语句:(1)一个美丽的女子在花园里……;(2)一个美丽的夫人在花田里……;(3)一个美丽的妇人在花田里……;最后,利用语句评估单元从语句情绪、语句语义及语句语法搭配角度分别对上述三个语句进行评分,语句奖励f(S 2,y 2)分别为0.9、0.9、0.9;采用取平均值的方式得到f(S 1,y 1)的综合得分0.9;将结果反馈给选词评估单元并记录当前语句状态以及语句奖励f(S 2,y 2)的综合得分0.9;得到单词y 2对应于状态S 2的得分γ(S 2,y 2)=f(S 2,y 2)-f(S 1,y 1)=0.3,则说明美丽的对于语句状态属于正向作用;同时语句重建生成器利用公式
Figure PCTCN2022126071-appb-000033
更新行为奖励;以此类推不断从情绪词嵌入库中选择单词,直至总奖励的目标函数值最大化。最终通过强化学习微调生成最终情绪化图像描述“一个美丽的女人抱着婴儿 站在繁花似锦的花园里”。
基于相同的发明构思,本发明实施例公开的基于强化学习的情绪化图像描述系统结构图如图2所示,则其包括至少一台计算设备,该计算设备包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,该计算机程序被加载至处理器时实现上述的一种基于强化学习的情绪化图像描述方法。
以上所述,仅为本发明中的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉该技术的人在本发明所揭露的技术范围内,可理解想到的变换或替换,都应涵盖在本发明的包含范围之内,因此,本发明的保护范围应该以权利要求书的保护范围为准。

Claims (10)

  1. 一种基于强化学习的情绪化图像描述方法,其特征在于,包括以下步骤:
    步骤一、在大规模语料库基础上构建情绪词嵌入库;
    步骤二、构建图像情绪识别模型;
    步骤三、使用图像情绪分析数据集训练图像情绪识别模型;
    步骤四、构建一种用于生成图像事实性描述的基于注意力机制的图像事实性描述模型,图像事实性描述模型包括依次顺序连接的图像事实性描述预处理模块、图像特征编码器和特征-文本解码器;
    步骤五、使用图像描述数据集训练图像事实性描述模型;
    步骤六、构建情绪化图像描述初始化模块,情绪化图像描述初始化模块根据训练好的图像情绪识别模型输出的图像情绪类别,从情绪词嵌入库中选取与图像情绪类别对应的情绪词,并将之嵌入到由训练好的图像事实性描述模型输出的图像事实性描述中,生成初始的情绪化图像描述;
    步骤七、构建基于强化学习的微调模块,微调模块用于对初始的情绪化图像描述进行微调,生成最终的情绪化图像描述。
  2. 根据权利要求1所述的一种基于强化学习的情绪化图像描述方法,其特征在于,步骤七中,基于强化学习的微调模块包括语句重建生成器、语句存储单元、语句抽样单元、语句评估单元和选词评估单元;其中,语句重建生成器作为强化学习系统中的智能体,语句存储单元、语句抽样单元、语句评估单元和选词评估单元构成强化学习系统中的外部环境;语句重建生成器与外部环境进行不断地交互,获取外部环境的奖励信息,学习从环境状态到行为动作的映射,来优化调整行为动作,对初始的情绪化图像描述进行微调,生成最终的情绪化图像描述。
  3. 根据权利要求1所述的一种基于强化学习的情绪化图像描述方法,其特征在于,步骤七中,基于强化学习的微调模块包括语句重建生成器、语句存储单元、语句抽样单元、语句评估单元和选词评估单元,微调模块用于对初始的情绪化图像描述进行微调的具体方法如下:
    步骤701、语句重建生成器根据第t-1时刻的环境状态以及第t-1时刻的奖励,通过选词器从情绪词嵌入库中选择语义相近的单词,执行选词的动作,并将筛选出的单词加入第t-1时刻生成的语句S t-1中,生成第t时刻的语句S t;其中,第0时刻生成的语句S 0为语句生成起始符,第t-1时刻的环境状态即为第t-1时刻生成的语句S t-1,第t-1时刻的奖励R t-1即为第t-1时刻所选单词得分,t为时刻;
    步骤702、语句存储单元存储更新后的第t时刻的语句S t;语句抽样单元基于采样搜索算法对更新后的第t时刻的语句S t进行回滚,生成N个语句,N的取值为3、4或5;语句评估单元首先对语句抽样单元生成的N个语句分别使用情绪鉴别器、语法搭配鉴别器、语义鉴别器进行评估打分,得到N个情绪奖励得分、语法搭配奖励得分、语义奖励得分,然后采取加权平均的方法得到综合奖励得分,最后将综合奖励得分输入到选词评估单元;选词评估单元输出所选单词得分,作为外部环境向语句重建生成器反馈的奖励R t
    步骤703、迭代步骤701至步骤702,语句重建生成器与外部环境不断地进行交互,直至取得语句重建的最大奖励,生成最终的情绪化图像描述。
  4. 根据权利要求3所述的一种基于强化学习的情绪化图像描述方法,其特征在于,所述语句抽样单元的采样搜索算法采用多项式采样或蒙特卡洛抽样方法。
  5. 根据权利要求1所述的一种基于强化学习的情绪化图像描述方法,其特征 在于,步骤一中,构建情绪词嵌入库的具体方法如下:
    步骤101、利用NLTK工具获取目标检测和图像描述数据集中的名词、动词,生成语义词库,并计算其中每个语义词的词向量;
    步骤102、从大规模语料库LSCC中筛选出情绪词,生成情绪词库,并计算每个情绪词的情绪词向量;将语义词库中的每个语义词对应的情绪词分为IAPS定义的8个类别:愉悦、狂怒、惊奇、接受、憎恨、狂喜、恐惧、悲痛;
    步骤103、从情绪词库中筛选出与语义词相对应的不同情绪类别的情绪词组,构建情绪词嵌入库。
  6. 根据权利要求1所述的一种基于强化学习的情绪化图像描述方法,其特征在于,图像情绪识别模型包括图像情绪识别预处理模块、人脸情绪特征提取模块、图像主题色彩特征提取模块、图像情绪特征提取模块、特征融合层、全连接层以及分类层;所述图像情绪识别预处理模块包括人脸检测单元、人脸图像归一化处理单元和图像尺寸归一化处理单元;其中,
    所述人脸检测单元,利用预先训练的人脸检测网络,检测出输入的图像中人脸区域,并对不同的人脸区域进行标号;
    所述人脸图像归一化处理单元,用于对检测出的每个人脸区域进行裁剪、对齐和尺寸归一化;
    所述图像尺寸归一化处理单元,用于对输入的图像进行尺寸归一化;
    所述人脸情绪特征提取模块,用于提取裁剪、对齐和尺寸归一化后的人脸图像中每一个人的面部情绪特征;
    所述图像主题色彩特征提取模块,用于提取输入的图像的主题色彩特征;
    所述图像情绪特征提取模块,用于提取图像尺寸归一化处理单元输出的尺寸归一化后的图像的情绪特征;
    所述特征融合层,用于分别对人脸情绪特征提取模块输出的面部情绪特征、图像主题色彩特征提取模块输出的主题色彩特征以及图像情绪特征提取模块输出的情绪特征进行融合,得到融合后的情绪特征向量;
    所述全连接层,用于全连接特征融合层与分类层;
    所述分类层,用于输出图像所属的情绪类别。
  7. 根据权利要求6所述的一种基于强化学习的情绪化图像描述方法,其特征在于,步骤二中,使用图像主题色彩特征提取模块提取输入的图像的主题色彩特征的具体方法如下:
    步骤1、使用微元法切割RGB空间,形成一个个独立的立体方块;
    步骤2、将图像的RGB格式像素散点放入切割后的RGB空间中,将散点值作为立体方块的值,如果该立体方块中没有散点,则将该立体方块区域中心值作为该立体方块的值;
    步骤3、通过滑动窗口加权的方式对整个滑动窗口区域内的立体方块的值进行加权求和得到滑动窗口大小立体方块的值,滑动窗口的大小取决于最终所要选择的图像主题色的种类数;
    步骤4、通过步骤1至3,最终得到输入图像的图像主题色彩特征。
  8. 根据权利要求7所述的一种基于强化学习的情绪化图像描述方法,其特征在于,使用微元法切割RGB空间,形成一个个独立的立体方块,立体方块为像素级大小的立方块。
  9. 根据权利要求1所述的一种基于强化学习的情绪化图像描述方法,其特征在于,步骤四中构建一种图像事实性描述模型的具体方法如下:
    步骤4.1、图像事实性描述预处理模块,利用在目标检测及目标关系检测数据集上预先训练好的网络模型对输入的图像进行预处理;具体方法如下:1)通过预先训练目标检测算法,检测图像中所出现的各类目标所在区域;利用预先训练目标关系检测算法,检测图像中所出现的各类目标交互所在区域;2)对输入的图像、各类目标所在区域图像以及各类目标交互所在区域图像进行裁剪与对齐,并进行归一化处理,得到输入的图像归一化后的图像、各类目标所在区域图像归一化后的图像、各类目标交互所在区域图像归一化后的图像;
    步骤4.2、构建图像特征编码器,其包括图像全局特征编码支路、目标特征编码支路、目标间交互特征编码支路、注意力机制和特征融合层;所述图像全局特征编码支路包括多个卷积模块,图像全局特征编码支路的输入为输入的图像归一化后的图像,用于提取图像的全局特征,并将其转化为向量形式;所述目标特征编码支路包括多个卷积模块,目标特征编码支路的输入为各类目标所在区域图像归一化后的图像,用于提取局部的目标特征,并将其转化为向量形式;所述目标间交互特征编码支路包括多个卷积模块,目标间交互特征编码支路的输入为各类目标交互所在区域图像归一化后的图像,用于提取目标间动作交互区域特征,并将其转化为向量形式;所述卷积模块,包含一个或多个卷积层以及一个池化层;所述注意力机制,用于捕捉相对于全局特征,需要重点关注的目标特征及重点关注的目标间交互特征;所述特征融合层,用于分别对上述图像全局特征、重点关注的目标特征及重点关注的目标间交互特征进行归一化处理后,通过全连接层拼接输出一个图像特征向量;所述全连接层将特征融合层的输出全连接至本层的c个输出神经元,输出一个c维的特征向量;
    步骤4.3、构建特征-文本解码器,所述特征-文本解码器的输入为图像特征编码器处理得到的图像特征向量;并利用至少包含2层长短期记忆LSTM网络的组合模块将图像特征向量解码为文本。
  10. 一种基于强化学习的情绪化图像描述系统,其特征在于,包括:
    情绪词嵌入库,在大规模语料库基础上构建情绪词嵌入库,为最终的情绪化图像描述生成提供语料库支撑;
    图像情绪识别模型,所述图像情绪识别模型构建模块包括图像情绪识别预处理模块、人脸情绪特征提取模块、图像主题色彩特征提取模块、图像情绪特征提取模块、特征融合层、全连接层以及分类层;所述图像情绪识别预处理模块包括人脸检测单元、人脸图像归一化处理单元、图像尺寸归一化处理单元;其中,所述人脸检测单元,利用预先训练的人脸检测网络,检测出输入的图像中人脸区域,并对不同的人脸区域进行标号;图像尺寸归一化处理单元,用于对输入的图像的像素大小归一化处理,得到统一的图像输入尺寸;所述人脸图像归一化处理单元,用于对检测出的每个人脸区域进行裁剪、对齐和尺寸归一化;所述人脸情绪特征提取模块,包括多个卷积模块;所述图像主题色彩特征提取模块,使用颜色聚类方法对图像情绪分析数据集中的训练集图像提取主题色彩特征;所述全连接层,用于全连接特征融合层与分类层;所述分类层,用于输出图像所属的情绪类别;最后使用图像情绪分析数据集训练图像情绪识别模型;
    图像事实性描述模型,图像事实性描述模型包括图像事实性描述预处理模块、图像特征编码器和特征-文本解码器;所述图像事实性描述预处理模块,对输入的图像进行预处理;所述图像特征编码器包括图像全局特征编码支路、目标特征编码支路、目标间交互特征编码支路、注意力机制和特征融合层;所述图像全局特征编码支路,用于提取图像的全局特征,并将其转化为向量形式;所述目标特 征编码支路,用于提取局部的目标特征,并将其转化为向量形式;所述目标间交互特征编码支路,用于提取目标间动作交互区域特征,并将其转化为向量形式;所述注意力机制,用于捕捉相对于全局特征,需要重点关注的目标特征及重点关注的目标间交互特征;所述特征融合层,用于分别对上述图像全局特征、重点关注的目标特征及重点关注的目标间交互特征进行归一化处理后,通过全连接层拼接输出一个图像特征向量;所述特征-文本解码器的输入为图像特征编码器处理得到的图像特征向量;并利用长短期记忆LSTM网络的组合模块将图像特征向量解码为文本;最后使用图像描述数据集训练图像事实性描述模型;
    情绪化图像描述初始化模块,根据训练好的图像情绪识别模型输出的图像情绪,从情绪词嵌入库中选取与图像情绪类别对应的情绪词,并将之嵌入到由训练好的图像事实性描述模型输出的图像事实性描述中,生成初始的情绪化图像描述;
    基于强化学习的微调模块,利用强化学习对生成初始的情绪化图像描述进行调整;所述强化学习的微调模块包括语句重建生成器、语句存储单元、语句抽样单元、语句评估单元和选词评估单元;其中,语句重建生成器作为强化学习系统中的智能体,语句存储单元、语句抽样单元、语句评估单元和选词评估单元构成强化学习系统中的外部环境;语句重建生成器与外部环境进行不断地交互,获取外部环境的奖励信息,学习从环境状态到行为动作的映射,来优化调整行为动作,对初始的情绪化图像描述进行微调,生成最终的情绪化图像描述。
PCT/CN2022/126071 2022-02-16 2022-10-19 一种基于强化学习的情绪化图像描述方法及系统 WO2023155460A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210151256.7A CN114639139A (zh) 2022-02-16 2022-02-16 一种基于强化学习的情绪化图像描述方法及系统
CN202210151256.7 2022-02-16

Publications (1)

Publication Number Publication Date
WO2023155460A1 true WO2023155460A1 (zh) 2023-08-24

Family

ID=81946840

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/126071 WO2023155460A1 (zh) 2022-02-16 2022-10-19 一种基于强化学习的情绪化图像描述方法及系统

Country Status (2)

Country Link
CN (1) CN114639139A (zh)
WO (1) WO2023155460A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114639139A (zh) * 2022-02-16 2022-06-17 南京邮电大学 一种基于强化学习的情绪化图像描述方法及系统
CN115174620B (zh) * 2022-07-01 2023-06-16 北京博数嘉科技有限公司 一种智能化旅游综合服务系统和方法
CN115497153B (zh) * 2022-11-16 2023-04-07 吉林大学 一种基于兴奋分析的车辆驾驶参数控制方法及系统
CN117807995B (zh) * 2024-02-29 2024-06-04 浪潮电子信息产业股份有限公司 一种情绪引导的摘要生成方法、系统、装置及介质
CN117808924B (zh) * 2024-02-29 2024-05-24 浪潮电子信息产业股份有限公司 一种图像生成方法、系统、电子设备及可读存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052512A (zh) * 2017-11-03 2018-05-18 同济大学 一种基于深度注意力机制的图像描述生成方法
US20200089938A1 (en) * 2018-09-14 2020-03-19 Adp, Llc Automatic emotion response detection
CN112417172A (zh) * 2020-11-23 2021-02-26 东北大学 一种多模态情绪知识图谱的构建及展示方法
CN113947798A (zh) * 2021-10-28 2022-01-18 平安科技(深圳)有限公司 应用程序的背景更换方法、装置、设备及存储介质
CN114639139A (zh) * 2022-02-16 2022-06-17 南京邮电大学 一种基于强化学习的情绪化图像描述方法及系统

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116611131A (zh) * 2023-07-05 2023-08-18 大家智合(北京)网络科技股份有限公司 一种包装图形自动生成方法、装置、介质及设备
CN116912629A (zh) * 2023-09-04 2023-10-20 小舟科技有限公司 基于多任务学习的通用图像文字描述生成方法及相关装置
CN116912629B (zh) * 2023-09-04 2023-12-29 小舟科技有限公司 基于多任务学习的通用图像文字描述生成方法及相关装置
CN117009925A (zh) * 2023-10-07 2023-11-07 北京华电电子商务科技有限公司 一种基于方面的多模态情感分析系统和方法
CN117009925B (zh) * 2023-10-07 2023-12-15 北京华电电子商务科技有限公司 一种基于方面的多模态情感分析系统和方法
CN117423168A (zh) * 2023-12-19 2024-01-19 湖南三湘银行股份有限公司 基于多模态特征融合的用户情绪识别方法及系统
CN117423168B (zh) * 2023-12-19 2024-04-02 湖南三湘银行股份有限公司 基于多模态特征融合的用户情绪识别方法及系统
CN117808923A (zh) * 2024-02-29 2024-04-02 浪潮电子信息产业股份有限公司 一种图像生成方法、系统、电子设备及可读存储介质
CN117808923B (zh) * 2024-02-29 2024-05-14 浪潮电子信息产业股份有限公司 一种图像生成方法、系统、电子设备及可读存储介质
CN117952185A (zh) * 2024-03-15 2024-04-30 中国科学技术大学 基于多维度数据评估的金融领域大模型训练方法及系统
CN117953108A (zh) * 2024-03-20 2024-04-30 腾讯科技(深圳)有限公司 图像生成方法、装置、电子设备和存储介质

Also Published As

Publication number Publication date
CN114639139A (zh) 2022-06-17

Similar Documents

Publication Publication Date Title
WO2023155460A1 (zh) 一种基于强化学习的情绪化图像描述方法及系统
CN110210037B (zh) 面向循证医学领域的类别检测方法
Manmadhan et al. Visual question answering: a state-of-the-art review
CN112860888B (zh) 一种基于注意力机制的双模态情感分析方法
CN108416065A (zh) 基于层级神经网络的图像-句子描述生成系统及方法
CN112201228A (zh) 一种基于人工智能的多模态语义识别服务接入方法
CN111324765A (zh) 基于深度级联跨模态相关性的细粒度草图图像检索方法
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
Zhang et al. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects
Zhang et al. Dimensionality reduction-based spoken emotion recognition
CN111967334B (zh) 一种人体意图识别方法、系统以及存储介质
Cerna et al. A multimodal LIBRAS-UFOP Brazilian sign language dataset of minimal pairs using a microsoft Kinect sensor
KR20200010672A (ko) 딥러닝을 이용한 스마트 상품 검색 방법 및 시스템
CN115860152B (zh) 一种面向人物军事知识发现的跨模态联合学习方法
Bhalekar et al. D-CNN: a new model for generating image captions with text extraction using deep learning for visually challenged individuals
Kommineni et al. Attention-based Bayesian inferential imagery captioning maker
Paul et al. A modern approach for sign language interpretation using convolutional neural network
CN117033609A (zh) 文本视觉问答方法、装置、计算机设备和存储介质
Verma et al. Intelligence Embedded Image Caption Generator using LSTM based RNN Model
Ishmam et al. From image to language: A critical analysis of visual question answering (vqa) approaches, challenges, and opportunities
Sun et al. The exploration of facial expression recognition in distance education learning system
Cheng et al. Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval
Kumar et al. A constructive deep convolutional network model for analyzing video-to-image sequences
CN116662924A (zh) 基于双通道与注意力机制的方面级多模态情感分析方法
Vahdati et al. Facial beauty prediction from facial parts using multi-task and multi-stream convolutional neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926764

Country of ref document: EP

Kind code of ref document: A1