CN113449801B - Image character behavior description generation method based on multi-level image context coding and decoding - Google Patents
- Publication number
- CN113449801B CN113449801B CN202110776126.8A CN202110776126A CN113449801B CN 113449801 B CN113449801 B CN 113449801B CN 202110776126 A CN202110776126 A CN 202110776126A CN 113449801 B CN113449801 B CN 113449801B
- Authority
- CN
- China
- Prior art keywords
- image
- context
- decoding
- character
- person
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an image character behavior description generation method based on multi-level image context encoding and decoding, belonging to the field of image description generation. The invention uses a pre-trained target detection model to extract character and entity features separately, encodes them with independent encoders, and fuses them in the decoding stage, thereby solving the problem that traditional methods cannot capture enough context information. In the image description task, moreover, similar visual signals are not equivalent to the same semantic information, a phenomenon known as the semantic gap between images and language. Multiple behaviors in a multi-person scene produce similar image signals; different semantic information exists among the interaction behaviors of a person with multiple objects, and describing person-object interaction requires capturing high-level semantic content such as interaction type and behavior intention, which makes the problem more severe. The invention models the image semantic information independently using a Transformer-based structure, and thus better alleviates the semantic gap problem.
Description
Technical Field
The invention belongs to the field of image description generation, and particularly relates to an image character behavior description generation method based on multi-level image context encoding and decoding.
Background
Identifying and extracting person behavior information from images is a research hotspot in computer vision. Character behavior description generation is a part of image description research and an important component of image understanding: it converts the character semantics in an image into a text representation that conforms to human expression habits, and has broad application prospects in fields such as security, education, social media, and cross-modal retrieval.
In a multi-person multi-object scene, people are usually dense and objects are numerous and varied, so multiple types of behavior occur in the same scene. Image context information is very important for image semantic understanding, especially behavior semantic understanding, and different human-object and human-human interaction behaviors depend on context information of different types and image ranges. These complex context dependencies make it difficult to generate accurate character behavior description text in a multi-person multi-object scene. Conventional deep learning methods generally adopt an encoder-decoder structure for image semantic reasoning: the local features of the objects to be described and the global image features taken from the deep convolution features serve as the image encoding, and the context relations among the objects to be described are learned through an attention mechanism to complete semantic reasoning. Such methods have the following defects in a multi-person multi-object scene. First: a single person is usually small, the global image information contains much noise, and correct context information is difficult to learn through an attention mechanism alone. Second: different types of behavior description text depend on image context information to different degrees, and a lack of screening of context information leads to erroneous descriptions in the generated text.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an image character behavior description generation method based on multi-level image context coding and decoding.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
an image character behavior description generation method based on multi-level image context coding and decoding comprises the following steps:
1. training model
1) Acquiring images comprising persons and objects, and annotating them to obtain annotated images;
the annotated content comprises position coordinates of the person body, position coordinates of the person face, position coordinates of the objects, object categories, person behavior description text, and object attribute description text;
2) Training a target detection model Det with a region proposal function on the character behavior description data set formed by the annotated images, until the characters and objects in an image can be detected and classified, to obtain the pre-trained target detection model Det;
extracting target entity characteristics and multi-level context characteristics of each character to be described in the image by using a pre-trained target detection model Det;
3) In a Transformer-based two-way image feature fusion model Cap, two independent encoders EncoderE and EncoderC encode the image entity information and the context information respectively, yielding the image entity information encoding E_encode and the context information encoding C_encode;
4) Decoding the image entity information encoding E_encode and the context information encoding C_encode with a Decoder, and outputting the behavior description text Word = {word_1, word_2, …, word_len};
where word_i is the vector representation of the i-th word in the description text;
the probability of each output word at each position is calculated through a softmax function; the sum of the cross entropies between the annotated content and the output behavior description text Word is taken as the loss, and iterative optimization through back propagation yields the trained Cap model;
2. using models
For an input image comprising one or more objects to be described, the positions of persons and objects are detected with the pre-trained target detection model Det; each target entity feature and the multi-level context features of each person to be described are extracted from the convolution tensor by RoI Pooling; the local features are encoded with the trained two-way image feature fusion model Cap to obtain the image entity information encoding E_encode and the context information encoding C_encode, which are then decoded by the Decoder to output a behavior description text for each object to be described.
Further, the entity features in step 2) comprise the image features at the target positions of the character target, the multiple object targets, and the face target;
the context features comprise the image features of the ranges corresponding to the multi-level context areas.
Further, the multi-level context area includes a local area, a neighboring area, and an interaction area;
the local area is an expansion range of the human target area;
the vicinity is an expansion of a minimum range including a person and a plurality of objects nearest thereto;
the interactive region is an expansion of a minimum range including a neighboring region and another person from the descriptive object.
Further, let the coordinates of the upper-left and lower-right points of the rectangular region of a single description object in the image be (X_min^0, Y_min^0, X_max^0, Y_max^0), let the position coordinates of the i-th related entity be (X_min^i, Y_min^i, X_max^i, Y_max^i), i = 1, …, n, and let the position coordinates of the nearest other person be (X_min^p, Y_min^p, X_max^p, Y_max^p). The multi-level context areas are as follows:
the local area is a local extension area of a single description object, and the calculation mode is as follows:
wherein P is the expanded pixel range, and after expansion, the four coordinate values are set to 0, wherein the coordinate values are smaller than 0, and the coordinate values are set to be larger than the image height/width;
the neighborhood is the smallest rectangular area comprising the object to be described and the image entity associated with it, the four-point coordinates are as follows:
wherein: w is the image width and H is the image height.
The interaction area comprises the neighboring area and the other character object nearest to the character to be described; its four-point coordinates are:
C_inter = (min(X_min^near, X_min^p), min(Y_min^near, Y_min^p), max(X_max^near, X_max^p), max(Y_max^near, Y_max^p))
further, the decoder in step 4) includes a plurality of decoding modules, and during decoding, the decoding modules sequentially decode the entity feature codes to generate a preliminary description text, then sequentially decode the context feature codes, correct the preliminary description text, and dynamically weight the context codes output by different encoding modules by adopting a cross attention mechanism in the process of decoding the context feature codes, so as to finally obtain the behavior description text.
Further, the inputs to each decoder module are: image entity feature coding, context feature coding, and word vector group representation of descriptive text output by a previous decoder module, the descriptive text of the first decoder module being the initial text.
Further, in the decoding process, the decoding process of the ith decoder module includes the steps of:
step1: decoding the output of the previous decoder module;
step2: decoding the image entity characteristics;
decoding the image entity feature codes through a multi-head attention mechanism, and outputting a preliminary result;
step3: decodes the context code, modifies the decoded output of Step2,
wherein ,NC The number of modules is encoded for the context encoder,a coding vector output by the first coding module; alpha l Is a cross attention weight.
Further, α_l is dynamically calculated based on the context encodings and the network layer input:
where W is a learnable weight matrix, LayerNorm is a layer normalization operation, and [ …, … ] denotes a concatenation (splicing) operation.
step4: each decoding module performs forward computation on the decoded vector and outputs the result:
wherein ,is a weight matrix>The bias parameters are all learnable parameters. ReLU is a linear rectification function.
Compared with the prior art, the invention has the following beneficial effects.
The image character behavior description generation method based on multi-level image context encoding and decoding provided by the invention treats the capture of image context relations as the key content of the character behavior description problem. Because characters are dense and character relations are complex in a multi-person scene, typical attention mechanisms cannot capture enough context information, and the failure to extract the image context information related to character behavior leads to semantic errors in the text description. The invention uses a pre-trained target detection model to extract character and entity features separately, encodes them with independent encoders, and fuses them in the decoding stage, thereby solving the problem that traditional methods cannot capture enough context information. In the image description task, moreover, similar visual signals are not equivalent to the same semantic information, a phenomenon known as the semantic gap between images and language. Multiple behaviors in a multi-person scene produce similar image signals, and low-level semantic information such as the colors of a person's clothes or the categories of articles cannot effectively help resolve the behavior description; different semantic information exists among the interaction behaviors of a person with multiple objects, and describing person-object interaction requires capturing high-level semantic content such as interaction type and behavior intention, which makes the problem more severe. The invention models the image semantic information independently using a Transformer-based structure, and thus better alleviates the semantic gap problem.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a flow chart of the encoding and decoding of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solution in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention is described in further detail below with reference to the attached drawing figures:
referring to fig. 1, fig. 1 is a flowchart of the present invention, and the present invention mainly includes:
model training part:
step1: images including people and object objects are collected and annotated.
The labeling content comprises position coordinates of a person main body, position coordinates of a person face, position coordinates of an object, object types, person behavior description text and object attribute description text;
for a single Zhang Dai label image, the label forms of the single object are as follows, and the label forms include the position, the category and other information of various objects in the label image:
<MainID,TypeID>
where MainID is the number of the main person in the image to which the object corresponds, and TypeID is the object category.
The labeling of the character body is as follows:
<MainID,Caption1,Caption2,Caption3>
where MainID is the person number, and Caption1, Caption2, Caption3 are person behavior description texts provided by different annotators.
The position information of persons and objects is given by the coordinates of the upper-left and lower-right corners of the corresponding rectangular area, in the form:
<X_min, Y_min, X_max, Y_max>
where the upper-left and lower-right points correspond to <X_min, Y_min> and <X_max, Y_max> respectively.
In training the detection model Det, all position-box information and category information in the data set are divided into training, test, and verification sets in a 6:2:2 ratio; in training the behavior description generation model, samples are divided per character to be described, where a single sample comprises the position of the character in the image, the positions of related objects, the position of the character's face, the character behavior description text, and the like, likewise divided into training, test, and verification sets in a 6:2:2 ratio.
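The 6:2:2 division described above can be sketched as follows; the sample structure, field names, and shuffle seed are illustrative assumptions, not the patent's actual data format:

```python
import random

def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle samples and split into training/test/verification sets by ratio."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_test = int(n * ratios[1])
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]
    return train, test, val

# Example: 10 person-description samples split 6:2:2
samples = [{"person_id": i} for i in range(10)]
train, test, val = split_dataset(samples)
```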
Step2: image entity features and context features are extracted.
The labeled image character behavior description data set is utilized to train the target detection model Det with the region proposal function, so that different targets such as characters, objects and the like in the image can be accurately detected and classified. Then, the pretrained object detection model Det is used to extract image object entity characteristics and multi-level context characteristics of each character to be described from the convolution tensor in a RoI Pooling mode.
The multi-level context areas are used to extract the multi-level context features and comprise the local area, the neighboring area, and the interaction area. The local area is an expansion of the person target area; the neighboring area is an expansion of the minimum range including the person and the several objects nearest to it; the interaction area is an expansion of the minimum range including the neighboring area and the other person nearest to the description object. Let the coordinates of the upper-left and lower-right points of the rectangular region of a single description object in the image be (X_min^0, Y_min^0, X_max^0, Y_max^0), let the position coordinates of the i-th related entity be (X_min^i, Y_min^i, X_max^i, Y_max^i), i = 1, …, n, and let the position coordinates of the nearest other person be (X_min^p, Y_min^p, X_max^p, Y_max^p). The multi-level context areas are as follows:
The local area is a local extension of the single description object, calculated as:
C_local = (max(0, X_min^0 − P), max(0, Y_min^0 − P), min(W, X_max^0 + P), min(H, Y_max^0 + P))
where P is the expanded pixel range, taken as 50 pixels in the invention. After expansion, coordinate values smaller than 0 are set to 0 and coordinate values larger than the image height/width are set to the image height/width, ensuring valid coordinates.
The neighboring area is the smallest rectangular area comprising the object to be described and the image entities associated with it; its four-point coordinates are:
C_near = (min_{0≤i≤n} X_min^i, min_{0≤i≤n} Y_min^i, max_{0≤i≤n} X_max^i, max_{0≤i≤n} Y_max^i)
wherein: w is the image width and H is the image height.
The interaction area comprises the neighboring area and the other character object nearest to the character to be described; its four-point coordinates are:
C_inter = (min(X_min^near, X_min^p), min(Y_min^near, Y_min^p), max(X_max^near, X_max^p), max(Y_max^near, Y_max^p))
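The three context areas defined above reduce to simple coordinate arithmetic. A minimal sketch follows; the function names and the (x_min, y_min, x_max, y_max) tuple convention are illustrative, not the patent's notation:

```python
def clamp_box(box, W, H):
    """Clip box coordinates to the image range [0, W] x [0, H]."""
    x0, y0, x1, y1 = box
    return (max(0, x0), max(0, y0), min(W, x1), min(H, y1))

def local_area(person_box, W, H, P=50):
    """Expand the person box by P pixels on each side (P = 50 in the invention)."""
    x0, y0, x1, y1 = person_box
    return clamp_box((x0 - P, y0 - P, x1 + P, y1 + P), W, H)

def enclosing_box(boxes):
    """Smallest rectangle containing all given boxes (the neighboring area)."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def interaction_area(neighbor_box, nearest_person_box):
    """Smallest rectangle containing the neighboring area and the nearest other person."""
    return enclosing_box([neighbor_box, nearest_person_box])

# Example: one person, two related objects, one nearby person in a 640x480 image
person = (100, 100, 160, 300)
objects = [(170, 150, 220, 200), (60, 250, 100, 290)]
near = enclosing_box([person] + objects)              # neighboring area
inter = interaction_area(near, (300, 120, 360, 310))  # interaction area
```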
image features represented by 2048-dimensional vectors of respective entities and context areas are extracted by RoI Pooling after the entity positions and the multi-level context areas are obtained.
Step3: the extracted physical features and contextual features are encoded.
Referring to fig. 2, fig. 2 is a flowchart of the encoding and decoding of the present invention. The encoders EncoderE and EncoderC for image entity feature encoding and context feature encoding are identical in structure, and a single encoder comprises three encoding modules of identical structure.
Taking the image entity information encoding process as an example, the encoding process of the first encoding module is as follows:
step1: self-attention coding:
wherein ,i.e. input of the initial first module +.>Multitead is a multi-headed attention mechanism, calculated as follows: />
MultiHead(Q, K, V) = Concat(Head_1, …, Head_h) W^O
Head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Attention(Q, K, V) = softmax(Q K^T / √d) V
where d is the dimension of a single feature vector in the input matrix, here the dimension of F_i^E, i.e. 2048; W_i^Q, W_i^K, W_i^V, and W^O are projection matrices; h = 4 in this method.
Step2: forward calculation:
where FF is the forward computation layer, for the input matrix X, the computation is as follows:
FF(X)=W 2 ReLU(W 1 X+b 1 )+b 2
where the parameter matrices W_1, W_2 map to the encoding dimension d_encode, which is 1024 in this method; b_1, b_2 are bias parameters; all are learnable. ReLU is the linear rectification function.
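One pass through an encoding module as described above (multi-head self-attention followed by the forward layer FF) can be sketched in numpy. The h = 4 heads and ReLU forward layer mirror the text, but the random initialization, weight shapes, and the reduced model dimension are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head(X, h=4, d_model=64):
    """MultiHead(X, X, X) with per-head projections and output projection W_O."""
    d_head = d_model // h
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((X.shape[1], d_head)) * 0.1
                      for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    W_O = rng.standard_normal((h * d_head, d_model)) * 0.1
    return np.concatenate(heads, axis=-1) @ W_O

def feed_forward(X, d_encode=64):
    """FF(X) = W2 ReLU(W1 X + b1) + b2, applied row-wise."""
    W1 = rng.standard_normal((X.shape[1], d_encode)) * 0.1
    W2 = rng.standard_normal((d_encode, d_encode)) * 0.1
    b1, b2 = np.zeros(d_encode), np.zeros(d_encode)
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

# One encoding-module pass over 5 entity feature vectors of dimension 64
F = rng.standard_normal((5, 64))
encoded = feed_forward(multi_head(F))
```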
Step4: behavior description text generation.
The Decoder decodes the image entity information encoding and the context information encoding from Step3 and outputs the description text Word = {word_1, word_2, …, word_len}. The specific steps are:
step1: and (5) input decoding.
The output of the previous module is decoded first; the input of the first decoding module is the initial word vector group including position encodings.
Step2: and decoding the image entity characteristics.
Decoding the image entity feature codes through a multi-head attention mechanism, and outputting a preliminary result.
Step3: and (5) context decoding.
The context code is decoded and the decoded output of the previous step is modified.
C_fused = Σ_{l=1}^{N_C} α_l · C_encode^l
where N_C is the number of context encoder modules, C_encode^l is the encoding vector output by the l-th encoding module, and α_l is a cross attention weight, dynamically calculated from the context encoding and the network layer input as follows:
where W is a learnable weight matrix, LayerNorm is a layer normalization operation, and [ …, … ] denotes a concatenation (splicing) operation.
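Dynamically weighting the N_C context encodings can be sketched as follows. The softmax over raw per-module scores stands in for the patent's exact α_l computation (given only in the original figures), so the score source is an assumption here:

```python
import numpy as np

def fuse_contexts(context_codes, scores):
    """Weight each encoder module's context code C^l by alpha_l and sum.

    context_codes: list of N_C arrays of identical shape; scores: raw per-module
    relevance scores, normalized to alpha via softmax so that sum(alpha) == 1.
    """
    scores = np.asarray(scores, dtype=float)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    fused = sum(a * c for a, c in zip(alpha, context_codes))
    return fused, alpha

# Example: N_C = 3 context encoding modules, codes of shape (4,)
codes = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
fused, alpha = fuse_contexts(codes, scores=[0.0, 0.0, 0.0])
```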
step4: forward calculation.
Each decoding module performs forward computation on the decoded vector and outputs the result:
the output of the last decoding module outputs a set of predictor vectors Word after passing through the linear mapping layer,wherein len is the set maximum length of output text, V dict For the dictionary length, the dictionary includes all possible predicted words and ending symbols.
Calculating and outputting the probability of each position corresponding to each word through a softmax function:
and calculating the cross entropy sum of the label text Cap and the output predicted text Word as loss, and performing iterative optimization through back propagation to train the model.
Model use part:
step1: for an input single image Img, detecting a person and an object in the image by adopting a detection network Det;
step2: for each person object, a person face area and at most 5 objects having a minimum distance from the center of the person rectangular area and a distance not exceeding a larger value in the length/width of 1.5 times of the rectangle are taken as related objects, and another person nearest to the person rectangular area is taken as a nearby person, and a multi-level context area is calculated. Extracting entity characteristics and multi-level context characteristics for each person object by using a RoI Pooling method;
step3: and encoding and decoding the physical characteristics and the contextual characteristics of each character object to be described by using the trained model, and outputting a prediction result. Prediction results for individual wordsAnd taking the word corresponding to the maximum component as output.
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.
Claims (7)
1. An image character behavior description generation method based on multi-level image context coding and decoding, which is characterized by comprising the following steps:
1. training model
1) Acquiring images comprising persons and objects, and annotating them to obtain annotated images;
the annotated content comprises position coordinates of the person body, position coordinates of the person face, position coordinates of the objects, object categories, person behavior description text, and object attribute description text;
2) Training a target detection model Det with a region proposal function on the character behavior description data set formed by the annotated images, until the characters and objects in an image can be detected and classified, to obtain the pre-trained target detection model Det;
extracting target entity characteristics and multi-level context characteristics of each character to be described in the image by using a pre-trained target detection model Det;
3) In a Transformer-based two-way image feature fusion model Cap, two independent encoders EncoderE and EncoderC encode the image entity information and the context information respectively, yielding the image entity information encoding E_encode and the context information encoding C_encode;
4) Decoding the image entity information encoding E_encode and the context information encoding C_encode with a Decoder, and outputting the behavior description text Word = {word_1, word_2, …, word_len};
where word_i is the vector representation of the i-th word in the description text;
the probability of each output word at each position is calculated through a softmax function; the sum of the cross entropies between the annotated content and the output behavior description text Word is taken as the loss, and iterative optimization through back propagation yields the trained Cap model;
2. using models
For an input image comprising one or more objects to be described, the positions of persons and objects are detected with the pre-trained target detection model Det; each target entity feature and the multi-level context features of each person to be described are extracted from the convolution tensor by RoI Pooling; the local features are encoded with the trained two-way image feature fusion model Cap to obtain the image entity information encoding E_encode and the context information encoding C_encode, which are then decoded by the Decoder to output a behavior description text for each object to be described.
2. The method for generating image character behavior descriptions based on multi-level image context encoding and decoding as claimed in claim 1, wherein the entity features in step 2) comprise the image features at the target positions of the character target, the multiple object targets, and the face target;
the context features comprise the image features of the ranges corresponding to the multi-level context areas.
3. The image character behavioral description generation method based on multi-level image context codec of claim 2, wherein the multi-level context region includes a local region, a neighboring region, and an interaction region;
the local area is an expansion range of the human target area;
the vicinity is an expansion of a minimum range including a person and a plurality of objects nearest thereto;
the interactive region is an expansion of a minimum range including a neighboring region and another person from the descriptive object.
4. The image character behavior description generation method based on multi-level image context encoding and decoding of claim 3, wherein the coordinates of the upper-left and lower-right points of the rectangular region of a single description object in the image are set as (X_min^0, Y_min^0, X_max^0, Y_max^0), the position coordinates of the i-th related entity as (X_min^i, Y_min^i, X_max^i, Y_max^i), i = 1, …, n, and the position coordinates of the nearest other person as (X_min^p, Y_min^p, X_max^p, Y_max^p); the multi-level context areas are as follows:
the local area is a local extension area of a single description object, and the calculation mode is as follows:
wherein P is the expanded pixel range, and after expansion, the four coordinate values are set to 0, wherein the coordinate values are smaller than 0, and the coordinate values are set to be larger than the image height/width;
the neighborhood is the smallest rectangular area comprising the object to be described and the image entity associated with it, the four-point coordinates are as follows:
wherein: w is the image width, H is the image height;
the interaction region contains the neighboring region and the other person target nearest to the person to be described; writing (x_1^N, y_1^N, x_2^N, y_2^N) for the neighboring region and (x_1^p, y_1^p, x_2^p, y_2^p) for the nearest other person, its four corner coordinates are:

x_1^I = max(0, min(x_1^N, x_1^p) - P), y_1^I = max(0, min(y_1^N, y_1^p) - P),
x_2^I = min(W, max(x_2^N, x_2^p) + P), y_2^I = min(H, max(y_2^N, y_2^p) + P).
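The three context regions of claim 4 can be sketched as box arithmetic. This is a minimal Python sketch under stated assumptions: the exact formulas are lost in this extraction, so the expand-by-P-then-clip step applied to the neighboring and interaction regions is inferred from the verbal description in claim 3 ("an expansion of a minimum range"), not confirmed by the original equations.

```python
def clip(box, W, H):
    """Clamp (x1, y1, x2, y2) to the image bounds, as required by claim 4."""
    x1, y1, x2, y2 = box
    return (max(0, x1), max(0, y1), min(W, x2), min(H, y2))

def local_region(person, P, W, H):
    """Expand the person box by P pixels on every side, then clip."""
    x1, y1, x2, y2 = person
    return clip((x1 - P, y1 - P, x2 + P, y2 + P), W, H)

def neighboring_region(person, related, P, W, H):
    """Smallest rectangle covering the person and its related entities,
    expanded by P and clipped (the expansion here is an assumption)."""
    boxes = [person] + list(related)
    x1 = min(b[0] for b in boxes); y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes); y2 = max(b[3] for b in boxes)
    return clip((x1 - P, y1 - P, x2 + P, y2 + P), W, H)

def interaction_region(neighbor, other_person, P, W, H):
    """Smallest rectangle covering the neighboring region and the nearest
    other person, expanded by P and clipped (expansion again assumed)."""
    x1 = min(neighbor[0], other_person[0]); y1 = min(neighbor[1], other_person[1])
    x2 = max(neighbor[2], other_person[2]); y2 = max(neighbor[3], other_person[3])
    return clip((x1 - P, y1 - P, x2 + P, y2 + P), W, H)

W, H, P = 640, 480, 20
person = (100, 120, 220, 400)
objs = [(230, 200, 300, 260)]
other = (400, 100, 520, 390)
loc = local_region(person, P, W, H)
nb = neighboring_region(person, objs, P, W, H)
ia = interaction_region(nb, other, P, W, H)
```

On this example the regions nest as described: local ⊂ neighboring ⊂ interaction, each widening the context around the same person.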
5. The image character behavior description generation method based on multi-level image context coding and decoding of claim 1, wherein the decoder in step 4) comprises a plurality of decoding modules; each decoding module first decodes the entity feature encodings to generate a preliminary description text, then decodes the context feature encodings to correct the preliminary description text, and, while decoding the context feature encodings, dynamically weights the context encodings output by the different encoding modules through a cross-attention mechanism, finally obtaining the behavior description text.
6. The image character behavior description generation method based on multi-level image context coding and decoding of claim 5, wherein the input of each decoder module is: the image entity feature encoding, the context feature encodings, and the word-vector group representation of the description text output by the previous decoder module; the description text input to the first decoder module is the initial text.
7. The image character behavior description generation method based on multi-level image context coding and decoding of claim 6, wherein the decoding process of the i-th decoder module comprises the following steps:
Step 1: decode the output of the previous decoder module;
Step 2: decode the image entity features;
the image entity feature encodings are decoded through a multi-head attention mechanism, and a preliminary result is output;
Step 3: decode the context encodings and correct the decoded output of Step 2:

C_fuse = Σ_{l=1}^{N_C} α_l · C_encode^l

where N_C is the number of context encoder modules, C_encode^l is the encoding vector output by the l-th encoding module, and α_l is the cross-attention weight;

α_l is computed dynamically from the context encodings and the input h of the network layer:

α_l = softmax_l( W_α · LayerNorm([h, C_encode^l]) )

where W_α is a learnable weight matrix, LayerNorm is the layer normalization operation, and [·, ·] is the concatenation (splicing) operation;
Step 4: each decoding module performs a feed-forward computation on the decoded vector and outputs the result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110776126.8A CN113449801B (en) | 2021-07-08 | 2021-07-08 | Image character behavior description generation method based on multi-level image context coding and decoding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113449801A CN113449801A (en) | 2021-09-28 |
CN113449801B true CN113449801B (en) | 2023-05-02 |
Family
ID=77815731
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110776126.8A Active CN113449801B (en) | 2021-07-08 | 2021-07-08 | Image character behavior description generation method based on multi-level image context coding and decoding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449801B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113887468B (en) * | 2021-10-14 | 2023-06-16 | 西安交通大学 | Single-view human-object interaction identification method of three-stage network framework |
CN114663915B * | 2022-03-04 | 2024-04-05 | 西安交通大学 | Image human-object interaction positioning method and system based on Transformer model |
CN115097941B (en) * | 2022-07-13 | 2023-10-10 | 北京百度网讯科技有限公司 | Character interaction detection method, device, equipment and storage medium |
CN116612365B (en) * | 2023-06-09 | 2024-01-23 | 匀熵智能科技(无锡)有限公司 | Image subtitle generating method based on target detection and natural language processing |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108509880A (en) * | 2018-03-21 | 2018-09-07 | 南京邮电大学 | A kind of video personage behavior method for recognizing semantics |
CN111598041A (en) * | 2020-05-25 | 2020-08-28 | 青岛联合创智科技有限公司 | Image generation text method for article searching |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101930940B1 (en) * | 2017-07-20 | 2018-12-20 | 에스케이텔레콤 주식회사 | Apparatus and method for analyzing image |
CN109711463B (en) * | 2018-12-25 | 2023-04-07 | 广东顺德西安交通大学研究院 | Attention-based important object detection method |
WO2020244774A1 (en) * | 2019-06-07 | 2020-12-10 | Leica Microsystems Cms Gmbh | A system and method for training machine-learning algorithms for processing biology-related data, a microscope and a trained machine learning algorithm |
CN111126282B (en) * | 2019-12-25 | 2023-05-12 | 中国矿业大学 | Remote sensing image content description method based on variational self-attention reinforcement learning |
US11361550B2 (en) * | 2019-12-30 | 2022-06-14 | Yahoo Assets Llc | Automatic digital content captioning using spatial relationships method and apparatus |
CN112084314B (en) * | 2020-08-20 | 2023-02-21 | 电子科技大学 | Knowledge-introducing generating type session system |
CN112508048B (en) * | 2020-10-22 | 2023-06-06 | 复旦大学 | Image description generation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||