CN113449801B - Image character behavior description generation method based on multi-level image context coding and decoding - Google Patents

Image character behavior description generation method based on multi-level image context coding and decoding

Info

Publication number
CN113449801B
CN113449801B (application CN202110776126.8A)
Authority
CN
China
Prior art keywords
image
context
decoding
character
person
Prior art date
Legal status
Active
Application number
CN202110776126.8A
Other languages
Chinese (zh)
Other versions
CN113449801A (en)
Inventor
Tian Feng (田锋)
Nan Fang (南方)
Jing Wei (经纬)
Zheng Qinghua (郑庆华)
Current Assignee
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202110776126.8A
Publication of CN113449801A
Application granted
Publication of CN113449801B
Status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 — Fusion techniques
    • G06F 18/253 — Fusion techniques of extracted features
    • G06F 18/254 — Fusion techniques of classification results, e.g. of results related to same input data
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 — Road transport of goods or passengers
    • Y02T 10/10 — Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 — Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image character behavior description generation method based on multi-level image context encoding and decoding, and belongs to the field of image description generation. The method uses a pre-trained target detection model to extract character and entity features separately, encodes them with independent encoders, and fuses them in the decoding stage, thereby addressing the inability of conventional methods to capture sufficient context information. In the image description task, similar visual signals do not correspond to identical semantic information, a phenomenon known as the semantic gap between images and language. Multiple behaviors in a multi-person scene produce similar image signals, different semantic information exists among a person's interactions with multiple objects, and describing person-object interaction behavior requires capturing high-level semantic content such as interaction type and behavioral intention, which makes the problem more severe. By independently modeling the image semantic information with a Transformer-based structure, the invention better alleviates the semantic-gap problem.

Description

Image character behavior description generation method based on multi-level image context coding and decoding
Technical Field
The invention belongs to the field of image description generation, and particularly relates to an image character behavior description generation method based on multi-level image context encoding and decoding.
Background
Identifying and extracting behavioral information about people from images is a research hotspot in computer vision. Character behavior description generation is a branch of image description research and an important part of image understanding: it converts the character semantics in an image into a textual representation that conforms to human expression habits, and has broad application prospects in security, education, social media, cross-modal retrieval and other fields.
In a multi-person, multi-object scene, people are usually dense and the variety and number of objects are large, so multiple types of behavior occur in the same scene. Image context information is crucial for image semantic understanding, especially for behavior semantic understanding, and the types and image ranges of the context information relevant to different human-object and human-human interaction behaviors differ. These complex context dependencies make it difficult to generate accurate character behavior description text in multi-person, multi-object scenes. Conventional deep learning methods typically adopt an encoder-decoder structure for image semantic reasoning: they take the local deep-convolutional features of the objects to be described together with the global image features as the image encoding, learn the context relations among the objects through an attention mechanism, and complete the semantic reasoning. Such methods have two shortcomings in multi-person, multi-object scenes. First, a single person is usually small in such scenes and the global image information contains much noise, so it is difficult for an attention mechanism alone to learn the correct context information. Second, different types of behavior description text depend on the image context information to different degrees, and the lack of screening of context information can lead to erroneous statements in the description text.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an image character behavior description generation method based on multi-level image context coding and decoding.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
an image character behavior description generation method based on multi-level image context coding and decoding comprises the following steps:
1. training model
1) Acquiring images containing persons and objects, and annotating them to obtain annotated images;
the annotated content comprises the position coordinates of the person body, the position coordinates of the person's face, the position coordinates of the objects, the object categories, the person behavior description text and the object attribute description text;
2) Training a target detection model Det with a region proposal function by using the character behavior description data set in the annotation image until characters and objects in the image can be detected and classified, so as to obtain a pre-trained target detection model Det;
extracting the target entity features and the multi-level context features of each character to be described in the image by using the pre-trained target detection model Det;
3) In a two-way image feature fusion model Cap based on the Transformer, two independent encoders EncoderE and EncoderC are used to encode the image entity information and the context information respectively, obtaining the image entity information encoding E_encode and the context information encoding C_encode;
4) Decoding the image entity information encoding E_encode and the context information encoding C_encode with a Decoder, and outputting a behavior description text Word = {word_1, word_2, …, word_len};
where word_i is the vector representation of the i-th word in the description text;
calculating the probability of each word at each output position through a softmax function, taking the sum of the cross entropies between the annotated text and the output behavior description text Word as the loss, and performing iterative optimization through back propagation to obtain the trained Cap model;
2. Using the model
For an input image containing one or more objects to be described, the positions of persons and objects are detected with the pre-trained target detection model Det, the entity features of each target and the multi-level context features of each person to be described are extracted from the convolution tensor by RoI Pooling, the local features are encoded with the trained two-way image feature fusion model Cap to obtain the image entity information encoding E_encode and the context information encoding C_encode, and a Decoder is then used to decode them and output a description text for the behavior of each object to be described.
Further, the entity features in step 2) include the image features at the target positions of the person targets, the various object targets and the face targets;
the context features include the image features of the ranges corresponding to the multi-level context regions.
Further, the multi-level context regions include a local region, a neighboring region and an interaction region;
the local region is an expansion of the person target region;
the neighboring region is an expansion of the smallest range containing the person and the objects nearest to it;
the interaction region is an expansion of the smallest range containing the neighboring region and the other person nearest to the description object.
Further, let the upper-left and lower-right corner coordinates of the rectangular region of the single description object in the image be (x^main_min, y^main_min) and (x^main_max, y^main_max); let the position coordinates of the n related objects be {(x^i_min, y^i_min, x^i_max, y^i_max)}, i = 1, …, n, where (x^i_min, y^i_min, x^i_max, y^i_max) are the position coordinates of the i-th entity; and let the position coordinates of the nearest other person be (x^near_min, y^near_min, x^near_max, y^near_max).
The multi-level context regions are as follows:
the local region is a local expansion of the single description object, computed as
x^local_min = x^main_min − P,  y^local_min = y^main_min − P,
x^local_max = x^main_max + P,  y^local_max = y^main_max + P,
where P is the expansion range in pixels; after expansion, coordinate values smaller than 0 are set to 0 and coordinate values larger than the image width/height are set to the image width/height;
the neighboring region is the smallest rectangular region containing the object to be described and the image entities associated with it, with corner coordinates
x^nbr_min = max(min(x^main_min, min_i x^i_min), 0),
y^nbr_min = max(min(y^main_min, min_i y^i_min), 0),
x^nbr_max = min(max(x^main_max, max_i x^i_max), W),
y^nbr_max = min(max(y^main_max, max_i y^i_max), H),
where W is the image width and H is the image height;
the interaction region contains the neighboring region and the other person nearest to the person to be described, with corner coordinates
x^int_min = max(min(x^nbr_min, x^near_min), 0),
y^int_min = max(min(y^nbr_min, y^near_min), 0),
x^int_max = min(max(x^nbr_max, x^near_max), W),
y^int_max = min(max(y^nbr_max, y^near_max), H).
further, the decoder in step 4) includes a plurality of decoding modules, and during decoding, the decoding modules sequentially decode the entity feature codes to generate a preliminary description text, then sequentially decode the context feature codes, correct the preliminary description text, and dynamically weight the context codes output by different encoding modules by adopting a cross attention mechanism in the process of decoding the context feature codes, so as to finally obtain the behavior description text.
Further, the inputs to each decoder module are: image entity feature coding, context feature coding, and word vector group representation of descriptive text output by a previous decoder module, the descriptive text of the first decoder module being the initial text.
Further, in the decoding process, the decoding of the i-th decoder module comprises the following steps:
Step1: decode the output of the previous decoder module:
S_i = MultiHead(Word_{i−1}, Word_{i−1}, Word_{i−1})
where, when i = 1, Word_0 is the initial input word vector group;
Step2: decode the image entity features;
the image entity feature encoding is decoded through a multi-head attention mechanism, and a preliminary result is output:
D^E_i = MultiHead(S_i, E_encode, E_encode);
Step3: decode the context encoding and correct the decoded output of Step2:
D^C_i = D^E_i + Σ_{l=1}^{N_C} α_l · Cross_l
where N_C is the number of encoding modules in the context encoder, Cross_l = MultiHead(D^E_i, C^l_encode, C^l_encode) is the cross-attention over the encoding vector C^l_encode output by the l-th encoding module, and α_l is a cross-attention weight.
Further, α_l is dynamically calculated from the context encoding and the network-layer input as follows:
α_l = W_α · LayerNorm([D^E_i, Cross_l])
where W_α is a learnable weight matrix, LayerNorm is the layer normalization operation, and [·, ·] denotes the splicing operation.
Step4: each decoding module performs a forward computation on the decoded vector and outputs the result:
Word_i = FF(D^C_i) = W_2 ReLU(W_1 D^C_i + b_1) + b_2
where W_1 and W_2 are weight matrices and b_1 and b_2 are bias terms, all learnable parameters; ReLU is the linear rectification function.
Compared with the prior art, the invention has the following beneficial effects.
In the character behavior description problem, feature extraction is a key research issue, and in the image character behavior description generation method based on multi-level image context encoding and decoding it takes the form of capturing image context relations. Because people are dense and person relations are complex in a multi-person scene, typical attention mechanisms cannot capture enough context information, and failing to extract the image context information related to character behavior leads to semantic errors in the generated text. The invention uses a pre-trained target detection model to extract character and entity features separately, encodes them with independent encoders, and fuses them in the decoding stage, thereby addressing the inability of conventional methods to capture sufficient context information. In the image description task, similar visual signals do not correspond to identical semantic information, a phenomenon known as the semantic gap between images and language. Multiple behaviors in a multi-person scene produce similar image signals, and low-level semantic information such as the color of a person's clothes or the category of an object does not effectively help describe behavior; different semantic information exists among a person's interactions with multiple objects, and describing person-object interaction behavior requires capturing high-level semantic content such as interaction type and behavioral intention, which makes the problem more severe. By independently modeling the image semantic information with a Transformer-based structure, the invention better alleviates the semantic-gap problem.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a flow chart of the encoding and decoding of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solution in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without inventive effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention is described in further detail below with reference to the attached drawing figures:
referring to fig. 1, fig. 1 is a flowchart of the present invention, and the present invention mainly includes:
model training part:
step1: images including people and object objects are collected and annotated.
The labeling content comprises position coordinates of a person main body, position coordinates of a person face, position coordinates of an object, object types, person behavior description text and object attribute description text;
for a single Zhang Dai label image, the label forms of the single object are as follows, and the label forms include the position, the category and other information of various objects in the label image:
<MainID,TypeID>
wherein, mainID is the serial number of the main person corresponding to the object in the figure, and TypeID is the object type.
The labeling of the character body is as follows:
<MainID,Caption1,Caption2,Caption3>
wherein, mainID is the person number, caption1, caption2, caption3 are the person behavior description text labels provided by different labels respectively.
The position information of the person and the object is determined by marking the coordinates of two points of the upper left corner and the lower right corner of the corresponding rectangular area, and the form is as follows:
<X min ,Y min ,X max ,Y max >
wherein the coordinates of the upper left point and the lower right point respectively correspond to<X min ,Y min> and <Xmax ,Y max >
in the process of training the detection model Det, all position frame information and category information in the data set are used according to the ratio of 6:2: 2, dividing the training set, the testing set and the verification set in proportion; in the process of training behavior description generation model, sample division is carried out by taking the character to be described as a unit, wherein a single sample comprises the position of the character in the figure, the position of a related object, the face position of the character, the character behavior description text and the like, and a training set, a test set and a verification set are also divided in a ratio of 6:2:2.
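By way of illustration, a single training sample and the 6:2:2 split could be organized as in the following minimal Python sketch; the field names and toy values are illustrative assumptions, not the dataset's actual schema:

```python
import random

# One training sample for the behavior-description model (illustrative field names).
sample = {
    "main_id": 3,                                  # person number in the image
    "person_box": [120, 80, 260, 400],             # <X_min, Y_min, X_max, Y_max>
    "face_box": [150, 90, 210, 160],
    "objects": [                                   # related objects: TypeID + box
        {"type_id": 47, "box": [230, 210, 320, 300]},
        {"type_id": 12, "box": [60, 250, 140, 330]},
    ],
    "captions": [                                  # Caption1..Caption3 from different annotators
        "a man is reading a book at a desk",
        "a seated man reads a book",
        "a man looks down at an open book",
    ],
}

def split_622(samples, seed=0):
    """Split samples into training/test/validation sets in the ratio 6:2:2."""
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_test = int(0.6 * n), int(0.2 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_test],
            samples[n_train + n_test:])

train_set, test_set, val_set = split_622([sample] * 10)
```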
Step2: image entity features and context features are extracted.
The labeled image character behavior description data set is utilized to train the target detection model Det with the region proposal function, so that different targets such as characters, objects and the like in the image can be accurately detected and classified. Then, the pretrained object detection model Det is used to extract image object entity characteristics and multi-level context characteristics of each character to be described from the convolution tensor in a RoI Pooling mode.
The multi-level context regions are used to extract the multi-level context features and include a local region, a neighboring region and an interaction region. The local region is an expansion of the person target region; the neighboring region is an expansion of the smallest range containing the person and the objects nearest to it; and the interaction region is an expansion of the smallest range containing the neighboring region and the other person nearest to the description object. Let the upper-left and lower-right corner coordinates of the rectangular region of the single description object in the image be (x^main_min, y^main_min) and (x^main_max, y^main_max); let the position coordinates of the n related objects be {(x^i_min, y^i_min, x^i_max, y^i_max)}, i = 1, …, n, where (x^i_min, y^i_min, x^i_max, y^i_max) are the position coordinates of the i-th entity; and let the position coordinates of the nearest other person be (x^near_min, y^near_min, x^near_max, y^near_max).
The multi-level context regions are as follows:
The local region is a local expansion of the single description object, computed as
x^local_min = x^main_min − P,  y^local_min = y^main_min − P,
x^local_max = x^main_max + P,  y^local_max = y^main_max + P,
where P is the expansion range in pixels; in the invention P is set to 50 pixels. After expansion, the four coordinate values are clipped for validity: values smaller than 0 are set to 0 and values larger than the image width/height are set to the image width/height.
The neighboring region is the smallest rectangular region containing the object to be described and the image entities associated with it, with corner coordinates
x^nbr_min = max(min(x^main_min, min_i x^i_min), 0),
y^nbr_min = max(min(y^main_min, min_i y^i_min), 0),
x^nbr_max = min(max(x^main_max, max_i x^i_max), W),
y^nbr_max = min(max(y^main_max, max_i y^i_max), H),
where W is the image width and H is the image height.
The interaction region contains the neighboring region and the other person nearest to the person to be described, with corner coordinates
x^int_min = max(min(x^nbr_min, x^near_min), 0),
y^int_min = max(min(y^nbr_min, y^near_min), 0),
x^int_max = min(max(x^nbr_max, x^near_max), W),
y^int_max = min(max(y^nbr_max, y^near_max), H).
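As an illustration of the three regions defined above, the following Python sketch computes them from bounding boxes under the clipped min/max construction; the helper names are assumptions introduced here for illustration:

```python
def clip_box(box, W, H):
    """Clip a box (x_min, y_min, x_max, y_max) to the image extent."""
    x0, y0, x1, y1 = box
    return (max(x0, 0), max(y0, 0), min(x1, W), min(y1, H))

def union_box(*boxes):
    """Smallest rectangle containing all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def context_regions(person_box, object_boxes, nearest_person_box, W, H, P=50):
    """Return the local, neighboring and interaction regions for one person to be described."""
    x0, y0, x1, y1 = person_box
    local = clip_box((x0 - P, y0 - P, x1 + P, y1 + P), W, H)               # local region
    neighbor = clip_box(union_box(person_box, *object_boxes), W, H)        # neighboring region
    interaction = clip_box(union_box(neighbor, nearest_person_box), W, H)  # interaction region
    return local, neighbor, interaction
```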
image features represented by 2048-dimensional vectors of respective entities and context areas are extracted by RoI Pooling after the entity positions and the multi-level context areas are obtained.
Step3: the extracted physical features and contextual features are encoded.
Referring to fig. 2, fig. 2 is a flowchart of the encoding and decoding of the present invention. The encoder EncoderE for the image entity features and the encoder EncoderC for the context features have the same structure, and a single encoder comprises three encoding modules of identical structure.
Taking the image entity information encoding process as an example, the encoding process of the l-th encoding module is as follows:
Step1: self-attention encoding:
A^E_l = MultiHead(F^E_{l−1}, F^E_{l−1}, F^E_{l−1})
where F^E_{l−1} is the output of the previous encoding module, and the input of the initial (first) module is the extracted entity feature matrix, F^E_0 = F^E. MultiHead is the multi-head attention mechanism, calculated as follows:
MultiHead(Q, K, V) = Concat(Head_1, …, Head_h) W^O
Head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Attention(Q, K, V) = softmax(Q K^T / √d) V
where d is the dimension of a single feature vector in the input matrix, here the dimension of the entity features, i.e. 2048; W_i^Q, W_i^K, W_i^V and W^O are projection matrices; h = 4 in this method.
Step2: forward computation:
F^E_l = FF(A^E_l)
where FF is the forward computation layer; for an input matrix X it is computed as
FF(X) = W_2 ReLU(W_1 X + b_1) + b_2
where the parameter matrices W_1 and W_2 map between the feature dimension d and the encoding dimension d_encode (1024 in this method), b_1 and b_2 are bias terms, and all are learnable parameters. ReLU is the linear rectification function.
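A rough PyTorch sketch of such an encoder stack is given below. It is an illustration only: residual connections and normalization usual in Transformer blocks are omitted for brevity, the interpretation of d_encode as the feed-forward width is an assumption, and all module names are invented here:

```python
import torch
import torch.nn as nn

class EncoderModule(nn.Module):
    """One encoding module: multi-head self-attention followed by a feed-forward layer."""
    def __init__(self, d_model=2048, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x):                       # x: (batch, n_regions, d_model)
        a, _ = self.attn(x, x, x)               # self-attention encoding A_l
        return self.ff(a)                       # forward computation F_l = FF(A_l)

class Encoder(nn.Module):
    """Stack of three identically structured encoding modules (EncoderE / EncoderC)."""
    def __init__(self, n_modules=3, **kw):
        super().__init__()
        self.blocks = nn.ModuleList(EncoderModule(**kw) for _ in range(n_modules))

    def forward(self, x):
        outs = []
        for block in self.blocks:               # keep every module's output; the decoder
            x = block(x)                        # weights the context encodings per module
            outs.append(x)
        return outs                             # [C^1, C^2, C^3] (or the entity encodings)
```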
Step4: behavior description text generation.
The image entity information encoding and the context information encoding obtained in Step3 are decoded with a Decoder, and a description text Word = {word_1, word_2, …, word_len} is output. The specific steps for the i-th decoding module are:
Step1: input decoding.
The output of the previous decoding module is first decoded by self-attention:
S_i = MultiHead(Word_{i−1}, Word_{i−1}, Word_{i−1})
where Word_{i−1} is the word vector group output by the previous decoding module; when i = 1, the input Word_0 is the initial word vector group including the positional encoding.
Step2: image entity feature decoding.
The image entity feature encoding is decoded through a multi-head attention mechanism, and a preliminary result is output:
D^E_i = MultiHead(S_i, E_encode, E_encode)
Step3: context decoding.
The context encoding is decoded and the decoded output of the previous step is corrected:
D^C_i = D^E_i + Σ_{l=1}^{N_C} α_l · Cross_l
where N_C is the number of encoding modules in the context encoder, Cross_l = MultiHead(D^E_i, C^l_encode, C^l_encode) is the cross-attention over the encoding vector C^l_encode output by the l-th encoding module, and α_l is a cross-attention weight dynamically calculated from the context encoding and the network-layer input as follows:
α_l = W_α · LayerNorm([D^E_i, Cross_l])
where W_α is a learnable weight matrix, LayerNorm is the layer normalization operation, and [·, ·] denotes the splicing operation.
Step4: forward computation.
Each decoding module performs a forward computation on the decoded vector and outputs the result:
Word_i = FF(D^C_i)
The output of the last decoding module, after passing through a linear mapping layer, gives the group of prediction vectors Word ∈ R^{len × V_dict}, where len is the set maximum length of the output text and V_dict is the dictionary size; the dictionary contains all possible predicted words and the end symbol.
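A rough PyTorch sketch of one such decoding module follows. It is illustrative only: the sigmoid gating used to realize the dynamic weight α_l is an assumption about its exact form, the entity and context encodings are assumed already projected to the decoder width, and all names are invented here:

```python
import torch
import torch.nn as nn

class DecoderModule(nn.Module):
    """Self-attention, entity cross-attention, dynamically weighted context cross-attention, FF."""
    def __init__(self, d_model=1024, n_heads=4, n_ctx_levels=3, d_ff=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.entity_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ctx_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_ctx_levels))
        self.norm = nn.LayerNorm(2 * d_model)
        self.alpha = nn.Linear(2 * d_model, 1)     # W_alpha over the spliced features
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, words, e_enc, c_encs, causal_mask=None):
        # words: (batch, t, d_model); e_enc: entity encoding; c_encs: list of context encodings.
        s, _ = self.self_attn(words, words, words, attn_mask=causal_mask)   # Step1
        d_e, _ = self.entity_attn(s, e_enc, e_enc)                          # Step2: preliminary result
        d_c = d_e
        for attn, c in zip(self.ctx_attn, c_encs):                          # Step3: per-level correction
            cross, _ = attn(d_e, c, c)
            alpha = torch.sigmoid(self.alpha(self.norm(torch.cat([d_e, cross], dim=-1))))
            d_c = d_c + alpha * cross
        return self.ff(d_c)                                                 # Step4: forward computation
```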
The probability of each word at each output position is calculated with a softmax function:
P_j = softmax(Word_j), j = 1, …, len
where Word_j is the prediction vector at position j. The sum of the cross entropies between the annotated text Cap and the output predicted text Word is calculated as the loss, and the model is trained by iterative optimization through back propagation.
Model use part:
Step1: For a single input image Img, the persons and objects in the image are detected with the detection network Det;
Step2: For each person target, the person's face region and at most 5 objects whose distance to the center of the person's rectangular region is smallest and does not exceed 1.5 times the larger of the rectangle's width and height are taken as the related objects, the other person nearest to the person's rectangular region is taken as the nearby person, and the multi-level context regions are calculated. Entity features and multi-level context features are then extracted for each person target with the RoI Pooling method;
step3: and encoding and decoding the physical characteristics and the contextual characteristics of each character object to be described by using the trained model, and outputting a prediction result. Prediction results for individual words
are given as vectors over the dictionary; for each position, the word corresponding to the maximum component of the prediction vector is taken as the output word.
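Inference then amounts to greedy (argmax) decoding over the model's outputs; a sketch, assuming a model callable with the interfaces shown (all names are illustrative):

```python
import torch

@torch.no_grad()
def describe_person(model, e_feats, c_feats, bos_id, eos_id, max_len=20):
    """Greedy decoding: at each step take the word with the maximum predicted component."""
    words = [bos_id]
    for _ in range(max_len):
        logits = model(e_feats, c_feats, torch.tensor([words]))   # (1, t, V_dict)
        next_id = int(logits[0, -1].argmax())                     # maximum component -> word index
        if next_id == eos_id:
            break
        words.append(next_id)
    return words[1:]                                              # predicted word ids
```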
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (7)

1. An image character behavior description generation method based on multi-level image context coding and decoding, which is characterized by comprising the following steps:
1. training model
1) Acquiring images containing persons and objects, and annotating them to obtain annotated images;
the annotated content comprises the position coordinates of the person body, the position coordinates of the person's face, the position coordinates of the objects, the object categories, the person behavior description text and the object attribute description text;
2) Training a target detection model Det with a region proposal function by using the character behavior description data set in the annotation image until characters and objects in the image can be detected and classified, so as to obtain a pre-trained target detection model Det;
extracting the target entity features and the multi-level context features of each character to be described in the image by using the pre-trained target detection model Det;
3) In a two-way image feature fusion model Cap based on the Transformer, two independent encoders EncoderE and EncoderC are used to encode the image entity information and the context information respectively, obtaining the image entity information encoding E_encode and the context information encoding C_encode;
4) Decoding the image entity information encoding E_encode and the context information encoding C_encode with a Decoder, and outputting a behavior description text Word = {word_1, word_2, ..., word_len};
where word_i is the vector representation of the i-th word in the description text;
calculating the probability of each word at each output position through a softmax function, taking the sum of the cross entropies between the annotated text and the output behavior description text Word as the loss, and performing iterative optimization through back propagation to obtain the trained Cap model;
2. Using the model
For an input image containing one or more objects to be described, detecting the positions of persons and objects with the pre-trained target detection model Det, extracting the entity features of each target and the multi-level context features of each person to be described from the convolution tensor by RoI Pooling, encoding the local features with the trained two-way image feature fusion model Cap to obtain the image entity information encoding E_encode and the context information encoding C_encode, and then using a Decoder to decode them and output a description text for the behavior of each object to be described.
2. The image character behavior description generation method based on multi-level image context encoding and decoding according to claim 1, wherein the entity features in step 2) include the image features at the target positions of the person targets, the various object targets and the face targets;
the context features include the image features of the ranges corresponding to the multi-level context regions.
3. The image character behavior description generation method based on multi-level image context encoding and decoding according to claim 2, wherein the multi-level context regions include a local region, a neighboring region and an interaction region;
the local region is an expansion of the person target region;
the neighboring region is an expansion of the smallest range containing the person and the objects nearest to it;
the interaction region is an expansion of the smallest range containing the neighboring region and the other person nearest to the description object.
4. The image character behavior description generation method based on multi-level image context encoding and decoding according to claim 3, wherein the upper-left and lower-right corner coordinates of the rectangular region of the single description object in the image are set as (x^main_min, y^main_min) and (x^main_max, y^main_max); the position coordinates of the n related objects are {(x^i_min, y^i_min, x^i_max, y^i_max)}, i = 1, ..., n, where (x^i_min, y^i_min, x^i_max, y^i_max) are the position coordinates of the i-th entity; and the position coordinates of the nearest other person are (x^near_min, y^near_min, x^near_max, y^near_max);
the multi-level context regions are as follows:
the local region is a local expansion of the single description object, computed as
x^local_min = x^main_min − P,  y^local_min = y^main_min − P,
x^local_max = x^main_max + P,  y^local_max = y^main_max + P,
where P is the expansion range in pixels; after expansion, coordinate values smaller than 0 are set to 0 and coordinate values larger than the image width/height are set to the image width/height;
the neighboring region is the smallest rectangular region containing the object to be described and the image entities associated with it, with corner coordinates
x^nbr_min = max(min(x^main_min, min_i x^i_min), 0),
y^nbr_min = max(min(y^main_min, min_i y^i_min), 0),
x^nbr_max = min(max(x^main_max, max_i x^i_max), W),
y^nbr_max = min(max(y^main_max, max_i y^i_max), H),
where W is the image width and H is the image height;
the interaction region contains the neighboring region and the other person nearest to the person to be described, with corner coordinates
x^int_min = max(min(x^nbr_min, x^near_min), 0),
y^int_min = max(min(y^nbr_min, y^near_min), 0),
x^int_max = min(max(x^nbr_max, x^near_max), W),
y^int_max = min(max(y^nbr_max, y^near_max), H).
5. The image character behavior description generation method based on multi-level image context encoding and decoding according to claim 1, wherein the decoder in step 4) includes a plurality of decoding modules; during decoding, the decoding modules first decode the entity feature encoding in sequence to generate a preliminary description text, then decode the context feature encoding in sequence to correct the preliminary description text; in the process of decoding the context feature encoding, a cross-attention mechanism is adopted to dynamically weight the context encodings output by the different encoding modules, finally obtaining the behavior description text.
6. The image character behavior description generation method based on multi-level image context encoding and decoding according to claim 5, wherein the input to each decoder module is: the image entity feature encoding, the context feature encoding, and the word vector group representation of the description text output by the previous decoder module, the description text of the first decoder module being the initial text.
7. The image character behavior description generation method based on multi-level image context encoding and decoding according to claim 6, wherein, in the decoding process, the decoding of the i-th decoder module comprises the following steps:
Step1: decode the output of the previous decoder module:
S_i = MultiHead(Word_{i−1}, Word_{i−1}, Word_{i−1})
where, when i = 1, Word_0 is the initial input word vector group;
Step2: decode the image entity features;
the image entity feature encoding is decoded through a multi-head attention mechanism, and a preliminary result is output:
D^E_i = MultiHead(S_i, E_encode, E_encode);
Step3: decode the context encoding and correct the decoded output of Step2:
D^C_i = D^E_i + Σ_{l=1}^{N_C} α_l · Cross_l
where N_C is the number of encoding modules in the context encoder, Cross_l = MultiHead(D^E_i, C^l_encode, C^l_encode) is the cross-attention over the encoding vector C^l_encode output by the l-th encoding module, and α_l is a cross-attention weight;
α_l is dynamically calculated from the context encoding and the network-layer input as follows:
α_l = W_α · LayerNorm([D^E_i, Cross_l])
where W_α is a learnable weight matrix, LayerNorm is the layer normalization operation, and [·, ·] denotes the splicing operation;
Step4: each decoding module performs a forward computation on the decoded vector and outputs the result:
Word_i = FF(D^C_i) = W_2 ReLU(W_1 D^C_i + b_1) + b_2
where W_1 and W_2 are weight matrices and b_1 and b_2 are bias terms, all learnable parameters; ReLU is the linear rectification function.
CN202110776126.8A 2021-07-08 2021-07-08 Image character behavior description generation method based on multi-level image context coding and decoding Active CN113449801B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110776126.8A CN113449801B (en) 2021-07-08 2021-07-08 Image character behavior description generation method based on multi-level image context coding and decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110776126.8A CN113449801B (en) 2021-07-08 2021-07-08 Image character behavior description generation method based on multi-level image context coding and decoding

Publications (2)

Publication Number Publication Date
CN113449801A CN113449801A (en) 2021-09-28
CN113449801B true CN113449801B (en) 2023-05-02

Family

ID=77815731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110776126.8A Active CN113449801B (en) 2021-07-08 2021-07-08 Image character behavior description generation method based on multi-level image context coding and decoding

Country Status (1)

Country Link
CN (1) CN113449801B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887468B (en) * 2021-10-14 2023-06-16 西安交通大学 Single-view human-object interaction identification method of three-stage network framework
CN114663915B (en) * 2022-03-04 2024-04-05 西安交通大学 Image human-object interaction positioning method and system based on transducer model
CN115097941B (en) * 2022-07-13 2023-10-10 北京百度网讯科技有限公司 Character interaction detection method, device, equipment and storage medium
CN116612365B (en) * 2023-06-09 2024-01-23 匀熵智能科技(无锡)有限公司 Image subtitle generating method based on target detection and natural language processing

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509880A (en) * 2018-03-21 2018-09-07 南京邮电大学 A kind of video personage behavior method for recognizing semantics
CN111598041A (en) * 2020-05-25 2020-08-28 青岛联合创智科技有限公司 Image generation text method for article searching

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101930940B1 (en) * 2017-07-20 2018-12-20 에스케이텔레콤 주식회사 Apparatus and method for analyzing image
CN109711463B (en) * 2018-12-25 2023-04-07 广东顺德西安交通大学研究院 Attention-based important object detection method
WO2020244774A1 (en) * 2019-06-07 2020-12-10 Leica Microsystems Cms Gmbh A system and method for training machine-learning algorithms for processing biology-related data, a microscope and a trained machine learning algorithm
CN111126282B (en) * 2019-12-25 2023-05-12 中国矿业大学 Remote sensing image content description method based on variational self-attention reinforcement learning
US11361550B2 (en) * 2019-12-30 2022-06-14 Yahoo Assets Llc Automatic digital content captioning using spatial relationships method and apparatus
CN112084314B (en) * 2020-08-20 2023-02-21 电子科技大学 Knowledge-introducing generating type session system
CN112508048B (en) * 2020-10-22 2023-06-06 复旦大学 Image description generation method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509880A (en) * 2018-03-21 2018-09-07 南京邮电大学 A kind of video personage behavior method for recognizing semantics
CN111598041A (en) * 2020-05-25 2020-08-28 青岛联合创智科技有限公司 Image generation text method for article searching

Also Published As

Publication number Publication date
CN113449801A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
Zhang et al. Mask SSD: An effective single-stage approach to object instance segmentation
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN113792113A (en) Visual language model obtaining and task processing method, device, equipment and medium
CN111652357B (en) Method and system for solving video question-answer problem by using specific target network based on graph
Xue et al. A better way to attend: Attention with trees for video question answering
Wang et al. Stroke constrained attention network for online handwritten mathematical expression recognition
US20180365594A1 (en) Systems and methods for generative learning
CN110175330B (en) Named entity recognition method based on attention mechanism
Wang et al. Recognizing handwritten mathematical expressions as LaTex sequences using a multiscale robust neural network
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
Xue et al. Lipformer: Learning to lipread unseen speakers based on visual-landmark transformers
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
Xue et al. LCSNet: End-to-end lipreading with channel-aware feature selection
Yin et al. Spatial temporal enhanced network for continuous sign language recognition
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN114511813B (en) Video semantic description method and device
CN116561305A (en) False news detection method based on multiple modes and transformers
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN114943990A (en) Continuous sign language recognition method and device based on ResNet34 network-attention mechanism
CN114492386A (en) Combined detection method for drug name and adverse drug reaction in web text
CN111767402A (en) Limited domain event detection method based on counterstudy
Le et al. An Attention-Based Encoder–Decoder for Recognizing Japanese Historical Documents
Jiang et al. Dynamic Security Assessment Framework for Steel Casting Workshops in Smart Factory
CN113343752B (en) Gesture detection method and system based on space-time sequence diagram

Legal Events

Date Code Title Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant