CN113449801B

CN113449801B - Image character behavior description generation method based on multi-level image context coding and decoding

Info

Publication number: CN113449801B
Application number: CN202110776126.8A
Authority: CN
Inventors: 田锋; 南方; 经纬; 郑庆华
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-07-08
Filing date: 2021-07-08
Publication date: 2023-05-02
Anticipated expiration: 2041-07-08
Also published as: CN113449801A

Abstract

The invention discloses an image character behavior description generation method based on multi-level image context coding and decoding, and belongs to the field of image description generation. The invention uses the pre-trained target detection model to extract character and entity characteristics respectively, and uses independent encoders to encode respectively, and fuses in the decoding stage, thereby solving the problem that the traditional method can not capture enough downstream information. In the image description task, however, similar visual signals are not equivalent to the same semantic information, a phenomenon known as semantic gap between images and languages. Multiple behaviors in a multi-person scene have similar image signals, different semantic information exists between the person and the interaction behaviors of multiple objects, and the description of the person-object interaction behaviors needs to capture high-level semantic content such as interaction types, behavior intentions and the like, so that the problem is more serious. According to the invention, the image semantic information is independently modeled by using a structure based on a transducer, so that the problem of semantic gap is better solved.

Description

Image character behavior description generation method based on multi-level image context coding and decoding

Technical Field

The invention belongs to the field of image description generation, and particularly relates to an image character behavior description generation method based on multi-level image context encoding and decoding.

Background

The use of images to identify and extract behavioral information of a person is a research hotspot for computer vision. The character behavior description generation technology is a part of image description research, is important content in image understanding, can convert character semantics in images into text representation conforming to human expression habits, and has wide application prospects in the fields of safety, education, social media, cross-mode retrieval and the like.

In a multi-person multi-object scene, people are usually dense, and the variety and number of objects are large, which results in multiple types of behaviors in the same scene. The image context information is very important for image semantic understanding, especially behavior semantic understanding, and the types and the image ranges of the image context information related to different human-object interactions and human-human interaction behaviors are different. The complex image context dependencies make it difficult to generate accurate descriptive text of character behavior in a multi-person multi-object scene. The conventional deep learning method generally adopts an encoder-decoder structure to perform image semantic reasoning, adopts local features and image global features of objects to be described in the deep convolution features as image coding, learns context relations among a plurality of objects to be described through an attention mechanism, and completes semantic reasoning. The common method has the following defects in a multi-person multi-object scene, and firstly: in a multi-person and multi-object scene, a single person is usually small, more noise information exists in the image global information, and correct context information is difficult to learn by means of a attention mechanism. Second,: the degree of dependence of the generated different types of behavior description texts on the image context information is different, and the lack of screening of the context information can lead to erroneous descriptions in the description texts.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide an image character behavior description generation method based on multi-level image context coding and decoding.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme:

an image character behavior description generation method based on multi-level image context coding and decoding comprises the following steps:

1. training model

1) Acquiring images comprising characters and object objects, and marking to obtain marked images;

the marked content comprises position coordinates of a person main body, position coordinates of a person face, position coordinates of an object, object types, person behavior description text and object attribute description text;

2) Training a target detection model Det with a region proposal function by using the character behavior description data set in the annotation image until characters and objects in the image can be detected and classified, so as to obtain a pre-trained target detection model Det;

extracting target entity characteristics and multi-level context characteristics of each character to be described in the image by using a pre-trained target detection model Det;

3) In a two-way image feature fusion model Cap based on a transducer, two independent encoders EncoderE and EncoderE are utilized to encode image entity information and context information respectively to obtain an image entity information code E _encode And context information encoding C _encode ；

4) Encoding E the image entity information using a Decoder _encode And context information encoding C _encode Decoding is carried out, and a behavior description text word= { Word is output ₁ ，word ₂ ，…，word _len }；

Wherein word _i A vector representation for describing an i-th word in the text;

calculating the probability of outputting each Word corresponding to each position through a softmax function, taking the sum of cross entropy of the labeling content and the output behavior description text Word as loss, and performing iterative optimization through back propagation to obtain a trained Cap model;

2. using models

For an input image comprising one or more objects to be described, detecting the positions of a person and an object by using a pre-trained target detection model Det, extracting each target entity characteristic and multistage context characteristics of each person to be described from a convolution tensor in a RoI (RoI) Pooling mode, and encoding the local characteristics by using a trained two-way image characteristic fusion model Cap to obtain an image entity information code E _encode And context information encoding C _encode And then using a Decoder to perform decoding output so as to output descriptive text for the behavior of each object to be described.

Further, the physical characteristics in the step 2) include image characteristics of target positions corresponding to the character targets, the multiple types of object targets and the face targets;

the context features include range image features corresponding to the multi-level context regions.

Further, the multi-level context area includes a local area, a neighboring area, and an interaction area;

the local area is an expansion range of the human target area;

the vicinity is an expansion of a minimum range including a person and a plurality of objects nearest thereto;

the interactive region is an expansion of a minimum range including a neighboring region and another person from the descriptive object.

Further, the coordinates of the upper left point and the lower right point of the rectangular region of the single description object in the image are set as

Multiple related objects and position coordinates->

Wherein the position coordinates of the i-th entity +.>

Position coordinates of another person nearest +.>

The multi-level context area is as follows:

the local area is a local extension area of a single description object, and the calculation mode is as follows:

wherein P is the expanded pixel range, and after expansion, the four coordinate values are set to 0, wherein the coordinate values are smaller than 0, and the coordinate values are set to be larger than the image height/width;

the neighborhood is the smallest rectangular area comprising the object to be described and the image entity associated with it, the four-point coordinates are as follows:

wherein ：

/>

wherein: w is the image width and H is the image height.

The interaction area includes a neighboring area and another character object nearest to the character object to be described, four-point coordinates are as follows:

wherein ：

further, the decoder in step 4) includes a plurality of decoding modules, and during decoding, the decoding modules sequentially decode the entity feature codes to generate a preliminary description text, then sequentially decode the context feature codes, correct the preliminary description text, and dynamically weight the context codes output by different encoding modules by adopting a cross attention mechanism in the process of decoding the context feature codes, so as to finally obtain the behavior description text.

Further, the inputs to each decoder module are: image entity feature coding, context feature coding, and word vector group representation of descriptive text output by a previous decoder module, the descriptive text of the first decoder module being the initial text.

Further, in the decoding process, the decoding process of the ith decoder module includes the steps of:

step1: decoding the output of the previous decoder module;

wherein, when i=1,

namely, inputting an initial word vector group;

step2: decoding the image entity characteristics;

decoding the image entity feature codes through a multi-head attention mechanism, and outputting a preliminary result;

step3: decodes the context code, modifies the decoded output of Step2,

wherein ,N_C The number of modules is encoded for the context encoder,

a coding vector output by the first coding module; alpha _l Is a cross attention weight.

Further, alpha is dynamically calculated based on context coding and network layer input _l The method comprises the following steps:

wherein ,

for a learnable weight matrix, layerNorm is a layer normalization operation [ …, … ]]In order for the splicing operation to be performed,

the calculation is as follows:

step4: each decoding module performs forward computation on the decoded vector and outputs the result:

wherein ,

is a weight matrix>

The bias parameters are all learnable parameters. ReLU is a linear rectification function.

Compared with the prior art, the invention has the following beneficial effects that

The image character behavior description generation method based on multi-level image context coding and decoding, provided by the invention, captures the important research content of feature extraction, and shows the image character behavior description as an image context relation in the character behavior description problem. Because of the dense characters and complex character relationships in a multi-person scene, typical attention mechanisms cannot capture enough context information, and the lack of extraction of image context information related to character behaviors can lead to text description semantic errors. The invention uses the pre-trained target detection model to extract character and entity characteristics respectively, and uses independent encoders to encode respectively, and fuses in the decoding stage, thereby solving the problem that the traditional method can not capture enough downstream information. In the image description task, however, similar visual signals are not equivalent to the same semantic information, a phenomenon known as semantic gap between images and languages. The multiple behaviors in the multi-person scene have similar image signals, and low-level semantic information such as the colors of the person clothes, the types of articles and the like cannot effectively help to solve the behavior description; different semantic information exists between the person and the interaction behaviors of a plurality of objects, and the description of the person-object interaction behaviors needs to capture high-level semantic content such as interaction types, behavior intentions and the like, so that the problem is more serious. According to the invention, the image semantic information is independently modeled by using a structure based on a transducer, so that the problem of semantic gap is better solved.

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a flow chart of the encoding and decoding of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The invention is described in further detail below with reference to the attached drawing figures:

referring to fig. 1, fig. 1 is a flowchart of the present invention, and the present invention mainly includes:

model training part:

step1: images including people and object objects are collected and annotated.

The labeling content comprises position coordinates of a person main body, position coordinates of a person face, position coordinates of an object, object types, person behavior description text and object attribute description text;

for a single Zhang Dai label image, the label forms of the single object are as follows, and the label forms include the position, the category and other information of various objects in the label image:

<MainID，TypeID>

wherein, mainID is the serial number of the main person corresponding to the object in the figure, and TypeID is the object type.

The labeling of the character body is as follows:

<MainID，Caption1，Caption2，Caption3>

wherein, mainID is the person number, caption1, caption2, caption3 are the person behavior description text labels provided by different labels respectively.

The position information of the person and the object is determined by marking the coordinates of two points of the upper left corner and the lower right corner of the corresponding rectangular area, and the form is as follows:

wherein the coordinates of the upper left point and the lower right point respectively correspond to<X ^min ，Y ^min> and <X^max ，Y ^max >

in the process of training the detection model Det, all position frame information and category information in the data set are used according to the ratio of 6:2: 2, dividing the training set, the testing set and the verification set in proportion; in the process of training behavior description generation model, sample division is carried out by taking the character to be described as a unit, wherein a single sample comprises the position of the character in the figure, the position of a related object, the face position of the character, the character behavior description text and the like, and a training set, a test set and a verification set are also divided in a ratio of 6:2:2.

Step2: image entity features and context features are extracted.

The labeled image character behavior description data set is utilized to train the target detection model Det with the region proposal function, so that different targets such as characters, objects and the like in the image can be accurately detected and classified. Then, the pretrained object detection model Det is used to extract image object entity characteristics and multi-level context characteristics of each character to be described from the convolution tensor in a RoI Pooling mode.

The multi-level context region is used for extracting multi-level context features, and comprises a local region, a neighboring region and an interaction region. The local area is an expansion range of the human target area, the adjacent area is an expansion of a minimum range including a human and a plurality of objects nearest thereto, and the interactive area is an expansion of a minimum range including an adjacent area and another human from the description object. Let the coordinates of the upper left and lower right points of the rectangular region of the single descriptive object in the image be

Multiple related objects and position coordinates

Wherein the position coordinates of the i-th entity +.>

Position coordinates of another person nearest +.>

The multi-level context area is as follows: />

wherein P is the expanded pixel range, and the value of P is 50 pixels in the invention. After expansion, the four coordinate values are set to 0, and the coordinate values are set to be greater than the image height/width, so that the reasonability is ensured.

wherein ：

wherein: w is the image width and H is the image height.

wherein ：

image features represented by 2048-dimensional vectors of respective entities and context areas are extracted by RoI Pooling after the entity positions and the multi-level context areas are obtained.

Step3: the extracted physical features and contextual features are encoded.

Referring to fig. 2, fig. 2 is a flowchart of encoding and decoding according to the present invention, in which the encoders EncoderE and EncodereC of image entity feature encoding and context feature are identical in structure, and a single encoder includes three encoding modules of identical structure.

Taking the image entity information encoding process as an example, the encoding process of the first encoding module is as follows:

step1: self-attention coding:

wherein ,

i.e. input of the initial first module +.>

Multitead is a multi-headed attention mechanism, calculated as follows: />

MultiHead(Q，K，V)＝Concate(Head1，…，Headh)WO

Head _i ＝Attention(QW _i ^Q ，KW _i ^K ，VW _i ^V )

Where d is the dimension of a single eigenvector in the input matrix, here corresponding to F _i ^E I.e. 2048;

is a projection matrix; h=4 in this method.

Step2: forward calculation:

where FF is the forward computation layer, for the input matrix X, the computation is as follows:

FF(X)＝W ₂ ReLU(W ₁ X+b ₁ )+b ₂

in the parameter matrix

d _encode For the encoding dimension, 1024 in the method; b ₁ 、b ₂ The bias parameters are all learnable parameters. ReLU is a linear rectification function.

Step4: behavior description text generation.

Decoding the image entity information encoding and the context information encoding in the step3 by using a Decoder, and outputting a description text word= { Word ₁ ，word ₂ ，…，word _len The specific steps are:

step1: and (5) input decoding.

The output of the previous module is firstly decoded, wherein the input of the first decoding module is an initial word vector group comprising position codes

Wherein, when i=1,

i.e. the initial word vector set input.

Step2: and decoding the image entity characteristics.

Decoding the image entity feature codes through a multi-head attention mechanism, and outputting a preliminary result.

Step3: and (5) context decoding.

The context code is decoded and the decoded output of the previous step is modified.

wherein ,N_C The number of modules is encoded for the context encoder,

is the encoded vector output by the first encoding module. Alpha _l Is a cross attention weight, dynamically calculated from context coding and network layer input, the method is as follows:

wherein ,

for a learnable weight matrix, layerNorm is a layer normalization operation [ …, … ]]For splicing operation, < >>

The calculation is as follows:

step4: forward calculation.

Each decoding module performs forward computation on the decoded vector and outputs the result:

the output of the last decoding module outputs a set of predictor vectors Word after passing through the linear mapping layer,

wherein len is the set maximum length of output text, V _dict For the dictionary length, the dictionary includes all possible predicted words and ending symbols.

Calculating and outputting the probability of each position corresponding to each word through a softmax function:

and calculating the cross entropy sum of the label text Cap and the output predicted text Word as loss, and performing iterative optimization through back propagation to train the model.

Model use part:

step1: for an input single image Img, detecting a person and an object in the image by adopting a detection network Det;

step2: for each person object, a person face area and at most 5 objects having a minimum distance from the center of the person rectangular area and a distance not exceeding a larger value in the length/width of 1.5 times of the rectangle are taken as related objects, and another person nearest to the person rectangular area is taken as a nearby person, and a multi-level context area is calculated. Extracting entity characteristics and multi-level context characteristics for each person object by using a RoI Pooling method;

step3: and encoding and decoding the physical characteristics and the contextual characteristics of each character object to be described by using the trained model, and outputting a prediction result. Prediction results for individual words

And taking the word corresponding to the maximum component as output.

The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. An image character behavior description generation method based on multi-level image context coding and decoding, which is characterized by comprising the following steps:

1. training model

4) Encoding E the image entity information using a Decoder _encode And context information encoding C _encode Decoding is carried out, and a behavior description text word= { Word is output ₁ ，word ₂ ，...，word _len }；

2. using models

For an input image comprising one or more objects to be described, detecting the positions of a person and an object by using a pre-trained target detection model Det, extracting each target entity characteristic and multistage context characteristics of each person to be described from a convolution tensor in a RoI (RoI) Pooling mode, and encoding the local characteristics by using a trained two-way image characteristic fusion model Cap to obtain an image entity information code E _encode And context information encodingCode C _encode And then using a Decoder to perform decoding output so as to output descriptive text for the behavior of each object to be described.

2. The method for generating image character behavior descriptions based on multi-level image context codec as claimed in claim 1, wherein the physical features in the step 2) include image features of the positions of the objects corresponding to the character objects, the multi-class object objects, and the face objects;

3. The image character behavioral description generation method based on multi-level image context codec of claim 2, wherein the multi-level context region includes a local region, a neighboring region, and an interaction region;

the local area is an expansion range of the human target area;

4. The image character behavior description generation method based on multi-level image context codec of claim 3, wherein coordinates of upper left and lower right two points of a single description object rectangular region in the image are set as

Multiple related objects and position coordinates->

Wherein the position coordinates of the i-th entity +.>

Position coordinates of another person nearest +.>

The multi-level context area is as follows:

wherein ：

wherein: w is the image width, H is the image height;

wherein ：

。

5. the method for generating the behavior description of the image character based on the multi-level image context codec as claimed in claim 1, wherein the decoder in the step 4) includes a plurality of decoding modules, and the decoding modules sequentially decode the physical feature codes to generate the preliminary description text, sequentially decode the context feature codes to correct the preliminary description text, and dynamically weight the context codes output by the different encoding modules by using a cross attention mechanism in the process of decoding the context feature codes to finally obtain the behavior description text.

6. The method for generating an image character behavioral description based on multi-level image context codec of claim 5 wherein the input to each decoder module is: image entity feature coding, context feature coding, and word vector group representation of descriptive text output by a previous decoder module, the descriptive text of the first decoder module being the initial text.

7. The method for generating an image character behavioral description based on multi-level image context codec of claim 6 wherein, in the decoding process, the decoding process of the i-th decoder module comprises the steps of:

step1: decoding the output of the previous decoder module;